Context-Free Grammar and Parsing
Cmp Sci 4280

Context-free grammar
• What is a context-free grammar?
• A set of recursive rules used to generate patterns of strings
• We can think of a computer program as a pattern of strings

Context-free grammar
• Nearly the entire syntax of a programming language can be described by a context-free grammar
• Developed by Noam Chomsky in the 1950s
• Backus-Naur Form (BNF)
  – Nearly identical to Chomsky’s context-free grammars
  – Natural for describing syntax
• A similar method was used by Panini to describe Sanskrit syntax more than 2000 years ago
• We’ll use ‘BNF’ and ‘grammar’ interchangeably

BNF
• BNF is a metalanguage for programming languages
  – It is a language that is used to describe another language
• Uses terminals
  – Tokens in a computer program
• Uses abstractions called nonterminals
  – Nonterminals are enclosed in <> brackets or start with an uppercase letter

<assign> → <var> = <expression>
• LHS is the nonterminal being defined
• RHS consists of some mixture of:
  – Terminals (tokens)
  – References to other abstractions (nonterminals)
• Altogether the definition is called a rule or production
• A BNF grammar is a collection of productions

BNF
• Nonterminals can have two or more distinct definitions
  – These can be listed separately in different rules
  – Or they can be listed in one rule and separated by ‘|’

<expression> → <var> + <var>
<expression> → <var> - <var>

<expression> → <var> + <var> | <var> - <var>

Recursion
• A rule is recursive if its LHS appears in its RHS

<ident_list> → identifier | identifier, <ident_list>

• BNF can create lists of any length using recursion
• BNF is sufficiently powerful to describe:
  – Lists
  – The order in which constructs must appear
  – Nested structures to any depth
  – Operator precedence and associativity (implied by the structure of the rules)
• A grammar is a generative device for defining languages
• Begin with a special nonterminal of the grammar called the start symbol
• The start symbol for a computer program is often named
<program> or <start>

What is the shortest program we could generate?

A sequence of rule applications is called a derivation
What is one derivation of this language?
Each successive string is derived from the previous string by replacing one of the nonterminals with one of that nonterminal’s definitions
Here the leftmost nonterminal is replaced at each step
Referred to as a ‘leftmost derivation’

Parentheses
What leftmost derivation would produce: A = B * ( A + C )

One approach to determine whether a given program is in the language generated by a given CFG:
Not efficient in practice

Language Recognizers

CFG Notes
Derivation is useful for testing for syntax errors, but a compiler needs to recover the program's structure to generate target code
Parse trees are used for this purpose

Leftmost derivation of A = B * ( A + C )
• Every internal node is labeled with a nonterminal
• Every leaf is labeled with a terminal
Parentheses are usually omitted

Consider this grammar:
E -> E + E | E - E | E * E | E / E | ( E ) | ID
ID -> x | y | z
What is the leftmost derivation of: (x+y)/(x-y)
Build the parse tree

Consider this grammar:
E -> E + E | E - E | E * E | E / E | ( E ) | ID
ID -> x | y | z
What is the leftmost derivation of: x+y*z
Build the parse tree
Can you draw a structurally different (not isomorphic) tree?
Let x = 1, y = 2, z = 3
Do these trees produce the same results?

Ambiguity
• A grammar is ambiguous if it can generate more than one parse tree for a given sentence
  – The trees must be structurally different (not isomorphic)
The problem is the lack of precedence among the operators
What is 3 x 4 + 2 ^ 2 ?
What is 3 x (4 + 2) ^ 2 ?
Standard precedence:
1. Parentheses
2. Exponentiation
3. Multiplication / Division
4. Addition / Subtraction
PEMDAS
What about -2 ^ 2 x 3 ?
Under the standard rules, unary negation has higher precedence than multiplication/division but lower precedence than exponentiation
Not always standard precedence in this class!

How can we fix this grammar?
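Before rewriting the grammar, the ambiguity can be checked concretely. The sketch below is a minimal Python illustration, not part of the course material: the names `parse_trees`, `evaluate`, and `env` are hypothetical, and trees are represented as nested tuples. It enumerates every parse tree that E -> E + E | E - E | E * E | E / E | ID allows for a parenthesis-free token list and evaluates each one with x = 1, y = 2, z = 3.

```python
# Assumption: trees are nested tuples (op, left, right); a leaf is an identifier string.
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def parse_trees(tokens):
    """Return every parse tree the ambiguous grammar allows for the tokens.

    Only parenthesis-free input is handled (identifiers at even positions,
    operators at odd positions), which is enough to show the ambiguity.
    """
    if len(tokens) == 1:
        return [tokens[0]]
    trees = []
    # Any operator may be chosen as the root, because E -> E op E
    # places no restriction on which operator is applied last.
    for i in range(1, len(tokens), 2):
        for left in parse_trees(tokens[:i]):
            for right in parse_trees(tokens[i + 1:]):
                trees.append((tokens[i], left, right))
    return trees


def evaluate(tree, env):
    """Evaluate a tree bottom-up using the identifier values in env."""
    if isinstance(tree, str):
        return env[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, env), evaluate(right, env))


env = {"x": 1, "y": 2, "z": 3}
trees = parse_trees(["x", "+", "y", "*", "z"])
results = [evaluate(t, env) for t in trees]
for tree, value in zip(trees, results):
    print(tree, "=", value)
```

For x+y*z this finds two trees: one with + at the root, which evaluates to 7, and one with * at the root, which evaluates to 9. The grammar alone therefore does not determine the result.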
E -> E + E | E - E | E * E | E / E | ( E ) | id
id -> x | y | z

E -> E + E | E - E | T
T -> T * T | T / T | F
F -> ( E ) | id
id -> x | y | z

Use this modified grammar to parse: x-y*z
Use this modified grammar to parse: x-y+z
What is a different parse tree for this sentence?
We must consider associativity!

Operator Associativity
If two operators have the same precedence in an expression, the associativity of the operators indicates the order in which they are evaluated
What is the direction of associativity for the following in C/C++?
Assignment statements:
y = 3
x = y = 7
What are the values of x and y?
Assignment is right associative in C/C++

Operator Associativity
What is the direction of associativity for the following in C/C++?
y = 3 - 2 + 5
What is the value of y?
Addition/subtraction is left associative in C/C++
Multiplication/division is left associative in C/C++

How can we fix this grammar?
E -> E + E | E - E | T
T -> T * T | T / T | F
F -> ( E ) | id
id -> x | y | z

We can rewrite the grammar again, restricting the recursion accordingly:
E -> E + T | E - T | T
T -> T * F | T / F | F
F -> ( E ) | id
id -> x | y | z

Use this modified grammar to parse: x-y+z
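One way to try the final grammar is a small recursive-descent parser, sketched below in Python under a few assumptions. The function names are hypothetical, and trees are nested tuples. The left-recursive rules E -> E + T and T -> T * F are written as loops, a common substitution because a direct recursive call on E would never terminate; the loops still build left-leaning trees, so the language and the associativity are unchanged.

```python
import re


def tokenize(text):
    # Assumption: the alphabet is x, y, z, the four operators, and parentheses.
    # Anything else (including whitespace) is silently skipped in this sketch.
    return re.findall(r"[xyz()+\-*/]", text)


def parse(text):
    """Parse text with E -> E+T | E-T | T, T -> T*F | T/F | F, F -> (E) | id."""
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_E():
        node = parse_T()
        while peek() in ("+", "-"):
            op = advance()
            # The tree built so far becomes the LEFT child: left associativity.
            node = (op, node, parse_T())
        return node

    def parse_T():
        node = parse_F()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, parse_F())
        return node

    def parse_F():
        tok = peek()
        if tok == "(":
            advance()
            node = parse_E()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok in ("x", "y", "z"):
            return advance()
        raise SyntaxError(f"unexpected token: {tok!r}")

    tree = parse_E()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token: {tokens[pos]!r}")
    return tree


print(parse("x-y+z"))
print(parse("x-y*z"))
```

Under these assumptions, x-y+z parses as (x-y)+z and x-y*z parses as x-(y*z), so each sentence has a single tree that reflects both the precedence and the left associativity encoded in the grammar.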