Parsing
G22.2110 Programming Languages
May 24, 2012
New York University
Chanseok Oh (chanseok@cs.nyu.edu)

• Chapter 2: Scanning, Parsing

• Overview
    – Scanner, Tokenizer, Lexer, Lexical Analyzer
          IF ( A >= .30 ) THEN { …
          IF, LPAREN, IDENT(A), GTE, FPN(.30), RPAREN, THEN, …
        • Tokens, Lexemes
        • DFA, NFA, Regular expressions
        • lex, flex, JLex
    – Parser
        • DPDA, Deterministic context-free grammars
        • Yacc, Bison

• Table of Contents
    – Practical parsers (linear time)
        • LL (top-down, predictive)
        • LR (bottom-up, shift-reduce)
    – Related side-topics
        • Ambiguity, language and parser hierarchy
    – Examples: Simple Calculator Language

• A Language
    – A set of strings (of given symbols)
        • { finite, set, with, five, strings }
        • { ab, aaba, abbaba, … }
        • { 0^n 1^n }
        • { a^i b^j | i < j }
        • { void main() { int i = 0 }, … }
    – Is an input string in the language?
        • cf. Recursive, Turing-decidable languages

• Context-Free Languages (CFL)
    – Languages that can be generated by
        • CFGs
    – Languages that can be recognized by
        • PDAs
    – Not all languages are context-free.
    – CFGs are suitable for most programming languages.
        • <sentence> := <subject> <verb> <object> PERIOD
    – Deterministic CFL

• Example
    Here is our CFG:
        S := id A
        A := , id A
        A := ;
    Input:
        sum , a1 , ptr ;

• Parse Tree
        S := id A
        A := , id A
        A := ;

        S
          id (sum)
          A
            ,
            id (a1)
            A
              ,
              id (ptr)
              A
                ;

• Ambiguous Grammars
        E → E + E
          | E - E
          | E * E
          | E / E
    – Is a given grammar ambiguous? Undecidable in general.
    – No general procedure for converting to an unambiguous grammar
    – Ambiguity can be allowed to some extent in deterministic parsing, e.g., by defining precedence or associativity.

• Parsers
    – LL (Left-to-right, Left-most derivation)
        • Top-down
        • Predictive
        • Simple and easy to understand
    – LR (Left-to-right, Right-most derivation)
        • Bottom-up
        • Shift-reduce
        • Most common in production-level parsers
        • SLR (Simple)
        • LALR (Look-ahead)

• LL(k) Parser
    – LL(k) Parser
        • Uses k look-ahead symbols
        • Does not backtrack (deterministic)
    – LL(1) is the most popular kind of LL parser.
    – LL(k) Languages
        • Not all CFLs are LL(k) languages.
      (Diagram: LL(k) languages as a subset of the CFLs.)

• LL Parsing Example
        <id_list>      := id <id_list_tail>
        <id_list_tail> := , id <id_list_tail>
        <id_list_tail> := ;
    It is an LL grammar. The language is also LL.
    Input to parse:
        sum , a1 , ptr ;
    (Diagram: LL inside CFL.)

• Parse Tree
        <id_list>      := id <id_list_tail>
        <id_list_tail> := , id <id_list_tail>
        <id_list_tail> := ;

        <id_list>
          id (sum)
          <id_list_tail>
            ,
            id (a1)
            <id_list_tail>
              ,
              id (ptr)
              <id_list_tail>
                ;

• LR Parser
    – LR(k) parser
        • Uses k look-ahead symbols.
        • Usually k is 1, and the term "LR parser" often refers to this case.
    – LR(k) Languages
        • Not all CFLs are LR(k) languages.
      (Diagram: LR inside CFL.)

• Language Relationships
    (Diagram of nested language classes. Labels: Unambiguous languages, Ambiguous languages, LL(1), LR(1), LALR, SLR, LR(0), LL(0).)

• LR Parsing Example
    With the same grammar,
        id_list      → id id_list_tail
        id_list_tail → , id id_list_tail
        id_list_tail → ;
    It is also an LR grammar, and the language is LR.
    (Diagram: CFL, LR(1), LL.)
    Input to parse (as before):
        sum , a1 , ptr ;

• Parse Tree
        <id_list>      := id <id_list_tail>
        <id_list_tail> := , id <id_list_tail>
        <id_list_tail> := ;

        <id_list>
          id (sum)
          <id_list_tail>
            ,
            id (a1)
            <id_list_tail>
              ,
              id (ptr)
              <id_list_tail>
                ;

• Another LR Parsing Example
    Consider a modified grammar,
        <id_list>        := <id_list_prefix> ;
        <id_list_prefix> := <id_list_prefix> , id
        <id_list_prefix> := id
    The grammar is not LL (though the language itself is both LR and LL).
• LR Parsing
        <id_list>        := <id_list_prefix> ;
        <id_list_prefix> := <id_list_prefix> , id
        <id_list_prefix> := id

        <id_list>
          <id_list_prefix>
            <id_list_prefix>
              <id_list_prefix>
                id (sum)
              ,
              id (a1)
            ,
            id (ptr)
          ;

• Simple Calculator Language
        3+(4*1)
        total := 7
        read n
        write ( 10 - ( total + 1 ) / 3 * n )

• Simple Arithmetic Expression
        E → E + E | E - E
        E → E * E | E / E
        E → id | number | ( E )

• Simple Arithmetic Expression
        expr    → term | expr add_op term
        term    → factor | term mult_op factor
        factor  → id | number | ( expr )
        add_op  → + | -
        mult_op → * | /
    – LL language, but not an LL grammar (yet an LR one)
    – Two most common obstacles to "LL(1)-ness"
        • Left recursion (stmt_list)
        • Common prefixes
              stmt → id := expr
              stmt → id ( arg_list )

• Converting to LL Grammars
    – Left recursion removed:
          stmt_list → stmt stmt_list | ε
    – Common prefix factored out:
          stmt → id := expr
          stmt → id ( arg_list )
      becomes
          stmt           → id stmt_list_tail
          stmt_list_tail → := expr | ( arg_list )
    – Alternatively, you can employ conflict-resolution rules.

• Converted LL(1) Grammar
        expr        → term term_tail
        term_tail   → add_op term term_tail | ε
        term        → factor factor_tail
        factor_tail → mult_op factor factor_tail | ε
        factor      → ( expr ) | id | number
        add_op      → + | -
        mult_op     → * | /
    Not every CFG can be converted to an LL grammar. Why?
    (Diagram: LL inside CFL.)

• LL(1) for Simple Calculator Language
        program     → stmt_list $$
        stmt_list   → stmt stmt_list | ε
        stmt        → id := expr | read id | write expr
        expr        → term term_tail
        term_tail   → add_op term term_tail | ε
        term        → factor factor_tail
        factor_tail → mult_op factor factor_tail | ε
        factor      → ( expr ) | id | number
        add_op      → + | -
        mult_op     → * | /
    Added three more production rules (for program, stmt_list, and stmt) to the previous LL(1) grammar for expressions.
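Note: the LL(1) calculator grammar above maps almost mechanically onto a recursive-descent parser, with one procedure per nonterminal and the look-ahead token choosing the production. The Python sketch below is an illustration only and is not taken from the slides. The names `tokenize` and `LL1Parser`, the token regular expression, and the choice to return the list of predicted productions are all assumptions made for the example. It accepts only the ASCII `-` for subtraction.

```python
import re

# Assumed token shapes (not specified on the slides): numbers, identifiers,
# the keywords read/write, and the operators := + - * / ( ).
TOKEN_RE = re.compile(r"\s*(?:(\d+\.?\d*|\.\d+)|([A-Za-z_]\w*)|(:=|[-+*/()]))")


def tokenize(text):
    """Return the token kinds of text, terminated by the end marker $$."""
    kinds, pos, text = [], 0, text.rstrip()
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        number, word, op = m.groups()
        if number:
            kinds.append("number")
        elif word:
            kinds.append(word if word in ("read", "write") else "id")
        else:
            kinds.append(op)
        pos = m.end()
    return kinds + ["$$"]


class LL1Parser:
    """One method per nonterminal; the look-ahead token picks the production."""

    def __init__(self, kinds):
        self.kinds, self.pos, self.trace = kinds, 0, []

    def peek(self):
        return self.kinds[self.pos]

    def match(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected}, found {self.peek()}")
        self.pos += 1

    def program(self):
        self.trace.append("program -> stmt_list $$")
        self.stmt_list()
        self.match("$$")
        return self.trace

    def stmt_list(self):
        if self.peek() in ("id", "read", "write"):
            self.trace.append("stmt_list -> stmt stmt_list")
            self.stmt()
            self.stmt_list()
        elif self.peek() == "$$":
            self.trace.append("stmt_list -> ε")
        else:
            raise SyntaxError(f"unexpected {self.peek()} at start of statement")

    def stmt(self):
        if self.peek() == "id":
            self.trace.append("stmt -> id := expr")
            self.match("id"); self.match(":="); self.expr()
        elif self.peek() == "read":
            self.trace.append("stmt -> read id")
            self.match("read"); self.match("id")
        else:
            self.trace.append("stmt -> write expr")
            self.match("write"); self.expr()

    def expr(self):
        self.trace.append("expr -> term term_tail")
        self.term(); self.term_tail()

    def term_tail(self):
        if self.peek() in ("+", "-"):
            self.trace.append("term_tail -> add_op term term_tail")
            self.match(self.peek())  # add_op -> + | -
            self.term(); self.term_tail()
        elif self.peek() in (")", "id", "read", "write", "$$"):
            self.trace.append("term_tail -> ε")
        else:
            raise SyntaxError(f"unexpected {self.peek()} after term")

    def term(self):
        self.trace.append("term -> factor factor_tail")
        self.factor(); self.factor_tail()

    def factor_tail(self):
        if self.peek() in ("*", "/"):
            self.trace.append("factor_tail -> mult_op factor factor_tail")
            self.match(self.peek())  # mult_op -> * | /
            self.factor(); self.factor_tail()
        elif self.peek() in ("+", "-", ")", "id", "read", "write", "$$"):
            self.trace.append("factor_tail -> ε")
        else:
            raise SyntaxError(f"unexpected {self.peek()} after factor")

    def factor(self):
        if self.peek() == "(":
            self.match("("); self.expr(); self.match(")")
        elif self.peek() in ("id", "number"):
            self.match(self.peek())
        else:
            raise SyntaxError(f"unexpected {self.peek()} in expression")
```

Calling `LL1Parser(tokenize(source)).program()` on a calculator program returns the productions in the order they were predicted, which is a left-most derivation; a malformed program raises `SyntaxError` at the first token that fits no production. The parser never backtracks, because each choice is settled by the single look-ahead token.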
• LL Parsing
    – Input program
          read A
          read B
          sum := A + B
          write sum
          write sum / 2

• Predict Sets
        program     → stmt_list $$                   {id, read, write, $$}
        stmt_list   → stmt stmt_list                 {id, read, write}
                    | ε                              {$$}
        stmt        → id := expr                     {id}
                    | read id                        {read}
                    | write expr                     {write}
        expr        → term term_tail                 {(, id, number}
        term_tail   → add_op term term_tail          {+, -}
                    | ε                              {), id, read, write, $$}
        term        → factor factor_tail             {(, id, number}
        factor_tail → mult_op factor factor_tail     {*, /}
                    | ε                              {+, -, ), id, read, write, $$}
        factor      → ( expr )                       {(}
                    | id                             {id}
                    | number                         {number}
        add_op      → +                              {+}
                    | -                              {-}
        mult_op     → *                              {*}
                    | /                              {/}

• Predict Sets
        stmt → id := expr     {id}
             | read id        {read}
             | write expr     {write}
    – Notice the pairwise disjoint sets: {id}, {read}, {write}
    – You are to expand stmt.
    – Look ahead 1 token (LL(1)).

• LL(1)
        program     → stmt_list $$
        stmt_list   → stmt stmt_list | ε
        stmt        → id := expr | read id | write expr
        expr        → term term_tail
        term_tail   → add_op term term_tail | ε
        term        → factor factor_tail
        factor_tail → mult_op factor factor_tail | ε
        factor      → ( expr ) | id | number
        add_op      → + | -
        mult_op     → * | /

• Better grammar: LR(1)
        program   → stmt_list $$
        stmt_list → stmt_list stmt | stmt
        stmt      → id := expr | read id | write expr
        expr      → term | expr add_op term
        term      → factor | term mult_op factor
        factor    → id | number | ( expr )
        add_op    → + | -
        mult_op   → * | /
    – More intuitive than LL
        • However, not exactly the same language (no empty string)
    – Left recursion is advantageous.
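Note: returning to the LL(1) grammar and its PREDICT sets, the sets can be derived mechanically from FIRST and FOLLOW sets by iterating to a fixed point. The Python sketch below is one way to do that and is not taken from the slides. The dictionary encoding of the grammar and the names `first_of`, `compute_sets`, `predict_sets`, and `is_ll1` are assumptions made for the example.

```python
EPS = "ε"

# The LL(1) calculator grammar; an empty list stands for an ε production.
GRAMMAR = {
    "program":     [["stmt_list", "$$"]],
    "stmt_list":   [["stmt", "stmt_list"], []],
    "stmt":        [["id", ":=", "expr"], ["read", "id"], ["write", "expr"]],
    "expr":        [["term", "term_tail"]],
    "term_tail":   [["add_op", "term", "term_tail"], []],
    "term":        [["factor", "factor_tail"]],
    "factor_tail": [["mult_op", "factor", "factor_tail"], []],
    "factor":      [["(", "expr", ")"], ["id"], ["number"]],
    "add_op":      [["+"], ["-"]],
    "mult_op":     [["*"], ["/"]],
}


def first_of(symbols, first):
    """FIRST of a symbol string; includes EPS if the whole string can vanish."""
    out = set()
    for sym in symbols:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)
    return out


def compute_sets():
    """Grow FIRST and FOLLOW together until neither changes."""
    first = {nt: set() for nt in GRAMMAR}
    follow = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in GRAMMAR.items():
            for rhs in alternatives:
                new_first = first_of(rhs, first)
                if not new_first <= first[lhs]:
                    first[lhs] |= new_first
                    changed = True
                for i, sym in enumerate(rhs):
                    if sym not in GRAMMAR:
                        continue  # FOLLOW is only kept for nonterminals
                    rest = first_of(rhs[i + 1:], first)
                    new_follow = rest - {EPS}
                    if EPS in rest:
                        new_follow |= follow[lhs]
                    if not new_follow <= follow[sym]:
                        follow[sym] |= new_follow
                        changed = True
    return first, follow


def predict_sets():
    """Map (lhs, rhs) to its PREDICT set."""
    first, follow = compute_sets()
    table = {}
    for lhs, alternatives in GRAMMAR.items():
        for rhs in alternatives:
            predict = first_of(rhs, first)
            if EPS in predict:
                predict = (predict - {EPS}) | follow[lhs]
            table[(lhs, tuple(rhs))] = predict
    return table


def is_ll1(table):
    """True if productions with the same left-hand side have disjoint PREDICT sets."""
    seen = {}
    for (lhs, _), predict in table.items():
        if seen.setdefault(lhs, set()) & predict:
            return False
        seen[lhs] |= predict
    return True
```

For example, `predict_sets()[("term_tail", ())]` gives the PREDICT set of `term_tail → ε`, and `is_ll1(predict_sets())` checks the pairwise-disjointness condition that the stmt productions illustrate.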
• LR Parsing
    – With the same input program,
          read A
          read B
          sum := A + B
          write sum
          write sum / 2

• State Transition Diagram
    State 0 (initial state)
        program   → ● stmt_list $$
        stmt_list → ● stmt_list stmt
        stmt_list → ● stmt
        stmt      → ● id := expr
        stmt      → ● read id
        stmt      → ● write expr
      on stmt: go to State 0'
      on stmt_list: go to State 2
      on read: go to State 1
    State 0'
        stmt_list → stmt ●
      Reduce (shifting stmt_list)
    State 1
        stmt → read ● id
      on id: go to State 1'
    State 1'
        stmt → read id ●
      Reduce (shifting stmt from the viewpoint of State 0)
    State 2
        program   → stmt_list ● $$
        stmt_list → stmt_list ● stmt
        stmt      → ● id := expr
        stmt      → ● read id
        stmt      → ● write expr

• Shift/Reduce Conflicts
        expr   → … ● term
        factor → id ●

• Reduce/Reduce Conflicts
        expr   → id ●
        factor → id ●

• Resolving Conflicts
    – LR(0)
        • Any LR language has an LR(0) grammar (with $$).
        • Not practical: prohibitively large and unintuitive
    – SLR
        • SLR grammar: no shift/reduce or reduce/reduce conflicts when using FOLLOW sets
        • FOLLOW sets: also used in LL to generate PREDICT sets
    – LALR(1)
        • LALR(1) grammar (may not be SLR)
        • Same states as SLR
        • Improvement over SLR with local look-ahead
        • LALR parsers are the most common parsers in practice.
    – LR(1)
        • LR(1) grammars (may not be LALR(1) or SLR)
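Note: to make the shift-reduce idea behind the LR family concrete, the Python sketch below runs a bottom-up parse of the left-recursive id_list_prefix grammar from the "Another LR Parsing Example" slide. It is a deliberate simplification and not a generated LR parser: instead of a state table, it looks for a handle by comparing the top of the stack against each right-hand side, longest first. That shortcut, and the names `PRODUCTIONS`, `try_reduce`, and `shift_reduce`, are assumptions that happen to work for this tiny grammar.

```python
# Productions of the left-recursive grammar, longest right-hand side first
# so that "id_list_prefix , id" is preferred over the bare "id".
PRODUCTIONS = [
    ("id_list_prefix", ["id_list_prefix", ",", "id"]),
    ("id_list", ["id_list_prefix", ";"]),
    ("id_list_prefix", ["id"]),
]


def try_reduce(stack):
    """Reduce the top of the stack in place if it matches a right-hand side."""
    for lhs, rhs in PRODUCTIONS:
        if stack[-len(rhs):] == rhs:
            del stack[-len(rhs):]
            stack.append(lhs)
            return f"reduce {lhs} -> {' '.join(rhs)}"
    return None


def shift_reduce(kinds):
    """Return the sequence of shift/reduce actions for a list of token kinds."""
    stack, actions, remaining = [], [], list(kinds)
    while True:
        action = try_reduce(stack)
        if action is None:
            if not remaining:
                break  # nothing left to reduce or shift
            stack.append(remaining.pop(0))
            action = f"shift {stack[-1]}"
        actions.append(action)
    if stack != ["id_list"]:
        raise SyntaxError(f"parse ended with stack {stack}")
    return actions


# Token kinds for the slide input "sum , a1 , ptr ;".
kinds = [t if t in (",", ";") else "id" for t in "sum , a1 , ptr ;".split()]
for step in shift_reduce(kinds):
    print(step)
```

Because the grammar is left-recursive, each identifier is folded into id_list_prefix as soon as it arrives, so the stack never holds more than three symbols regardless of the list length.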