SYNTAX
Nandigam

Outline
  - Programming Language Specification
  - Lexical Structure of PLs
  - Syntactic Structure of PLs
  - Context-Free Grammar / BNF
  - Parse Trees
  - Abstract Syntax Trees
  - Ambiguous Grammar
  - Associativity and Precedence
  - EBNFs and Syntax Diagrams

Programming Language Specification
  - Programming languages (PLs) require precise definitions (i.e., no ambiguity)
      - Language form (syntax)
      - Language meaning (semantics)
  - Consequently, PLs are specified using formal notation:
      - Formal syntax: tokens, grammar
      - Formal semantics: operational, denotational, axiomatic

Lexical Structure of PLs
  - Main task of the scanner: identify tokens
      - Tokens are the basic building blocks of programs
      - E.g., keywords, identifiers, numbers, punctuation marks
  - Lexeme: an instance of a token. One can think of programs as strings of lexemes rather than of characters.
  - A token of a language is a category of its lexemes (or instances)
  - Some tokens have many possible lexemes
      - E.g., keyword, identifier, number
  - In some cases, a token has only a single possible lexeme
      - E.g., equal_sign, plus_op, mult_op
  - Consider the following Java statement:
        index = 2 * count + 17 ;
  - The lexemes and tokens of this statement are:
        Lexeme    Token
        index     identifier
        =         equal_sign
        2         int_literal
        *         mult_op
        count     identifier
        +         plus_op
        17        int_literal
        ;         semicolon
  - Tokens in a programming language are described formally by regular expressions.
  - Regular expressions: descriptions of patterns of characters
  - Regular expression operations
        Basic operations
          Concatenation              item sequencing
          Choice or selection        |
          Repetition                 *
          Grouping                   ( )
        Additional operations
          One or more repetitions    +
          Range of characters        [-]
          Optional                   ?
          Any character              .
  - Regular expression examples
        (a|b)*c                  Strings that match include ababaac, aac, bbc, c, and babc
        [0-9]+                   Integer constants with one or more digits
        [0-9]+(\.[0-9]+)?        Floating-point literals
        [a-zA-Z][a-zA-Z0-9_]*    Identifiers
  - Scanner generators:
      - lex, flex
      - ANTLR (Another Tool for Language Recognition)
  - These programs can be used to generate a program (i.e., a scanner) that extracts tokens from a stream of characters.
  - Many PLs provide good support for regular expressions: Java, C#, Perl, Ruby, …
  - Support for regular expressions in Java
      - java.util.regex package
      - split() method of the String class

Syntactic Structure of PLs
  - Specifying the form of a programming language
        Tokens                             Regular expressions
        Syntax (organization of tokens)    Context-free grammars (CFGs)

Context-Free Grammar
  - Context-free grammars (CFGs) are used to describe the syntax of PLs.
      - Proposed by Noam Chomsky, a noted linguist
  - BNF (Backus-Naur Form) is a notation for describing syntax.
      - Proposed by John Backus and Peter Naur
  - CFG and BNF are nearly identical and are used interchangeably.
  - BNF is a metalanguage for programming languages.
      - A metalanguage is a language that is used to describe another language.
  - A CFG or BNF description consists of a series of rules or productions.
  - Productions are made up of:
      - Nonterminals: structures that are broken down into further structures
      - Terminals: things that cannot be broken down
      - Metasymbols
          - Symbols that are part of CFG/BNF
          - These are not actual symbols in the language being described
          - Sometimes, a metasymbol is also an actual symbol in a language
  - One of the nonterminals is designated as the start symbol.
      - The start symbol stands for the entire structure being defined.
  - CFG/BNF example (Figure 4.2, page 83)
        (1) sentence → noun-phrase verb-phrase .
        (2) noun-phrase → article noun
        (3) article → a | the
        (4) noun → girl | dog
        (5) verb-phrase → verb noun-phrase
        (6) verb → sees | pets
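One way to make the parts of the sentence grammar concrete is to write it down as data. The sketch below is only an illustration: the dictionary layout and the names GRAMMAR, START, terminals, and expand are assumptions made here, not something the slides or the textbook define. It separates terminals from nonterminals and lists every string of terminals that can be generated from the start symbol, which is feasible only because this particular grammar has no recursive rules.

```python
# Hypothetical encoding of the Figure 4.2 grammar: each nonterminal maps to a
# list of alternatives, and each alternative is a list of symbols.
GRAMMAR = {
    "sentence": [["noun-phrase", "verb-phrase", "."]],
    "noun-phrase": [["article", "noun"]],
    "article": [["a"], ["the"]],
    "noun": [["girl"], ["dog"]],
    "verb-phrase": [["verb", "noun-phrase"]],
    "verb": [["sees"], ["pets"]],
}
START = "sentence"


def terminals(grammar):
    """Symbols that appear on a right-hand side but never on a left-hand side."""
    symbols = {s for alts in grammar.values() for alt in alts for s in alt}
    return symbols - set(grammar)


def expand(symbols, grammar):
    """Yield every terminal string derivable from a list of symbols.

    The first nonterminal from the left is always the one replaced.
    Assumes the grammar is not recursive; otherwise this never terminates.
    """
    if not symbols:
        yield []
        return
    first, rest = symbols[0], symbols[1:]
    if first in grammar:
        # Nonterminal: try each alternative in its place.
        for alternative in grammar[first]:
            yield from expand(alternative + rest, grammar)
    else:
        # Terminal: keep it and expand what follows.
        for tail in expand(rest, grammar):
            yield [first] + tail


sentences = [" ".join(words) for words in expand([START], GRAMMAR)]
print(len(sentences))
print(sorted(terminals(GRAMMAR)))
```

With two choices each for the two articles, the two nouns, and the verb, the list holds 32 sentences, one of which is "the girl sees a dog .".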
  - The language of a CFG is the set of strings of terminals that can be generated from the start symbol by a derivation:
        sentence ⇒ noun-phrase verb-phrase .         (rule 1)
                 ⇒ article noun verb-phrase .        (rule 2)
                 ⇒ the noun verb-phrase .            (rule 3)
                 ⇒ the girl verb-phrase .            (rule 4)
                 ⇒ the girl verb noun-phrase .       (rule 5)
                 ⇒ the girl sees noun-phrase .       (rule 6)
                 ⇒ the girl sees article noun .      (rule 2)
                 ⇒ the girl sees a noun .            (rule 3)
                 ⇒ the girl sees a dog .             (rule 4)
  - Derivation: generating sentences of the language through a sequence of applications of rules (or productions), beginning with a special nonterminal called the start symbol.
  - Leftmost derivation: the replaced nonterminal is always the leftmost nonterminal.
  - Rightmost derivation: the replaced nonterminal is always the rightmost nonterminal.
  - A derivation may be neither leftmost nor rightmost.
  - Derivation order has no effect on the language generated by a grammar.
  - A grammar for a small language
        <program> → begin <stmt_list> end
        <stmt_list> → <stmt> | <stmt> ; <stmt_list>
        <stmt> → <var> := <expr>
        <expr> → <var> + <var> | <var> - <var> | <var>
        <var> → A | B | C
  - Derive the following program:
        begin A := B + C ; B := C end
  - Is the language defined by this grammar finite or infinite?
  - Left recursive rule: a BNF rule is left recursive if its left-hand side (LHS) appears at the beginning of its right-hand side (RHS).
  - Right recursive rule: a BNF rule is right recursive if its LHS appears at the right end of its RHS.
  - Uses of recursion in BNF:
      - to show repetition
      - to describe complex structures
  - Examples:
        number → number digit | digit
        digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
        expr → expr + expr | expr * expr | ( expr ) | number

Parse Trees
  - A parse tree is a graphical representation of the hierarchical syntactic structure of sentences.
  - It describes graphically the replacement process in a derivation.
  - A parse tree is labeled by nonterminals at interior nodes and terminals at leaves.
  - A parse tree better expresses the structure inherent in a derivation.
  [Figure: parse trees for a number with the digits 2, 3, and 4 and for the expression (2 + 3) * 4]
  - Problem 1:
        <assign> → <id> := <expr>
        <expr> → <id> + <expr> | <id> * <expr> | ( <expr> ) | <id>
        <id> → A | B | C
    Show a leftmost derivation and a parse tree for each of the following statements:
        A := A + (B * C)
        A := B + C + A
        A := A * (B + C)
        A := B * (C * (A + B))
  - Problem 2: Describe, in English, the language defined by the following grammar:
        <S> → <A> <B> <C>
        <A> → a <A> | a
        <B> → b <B> | b
        <C> → c <C> | c
  - Problem 3: Consider the following grammar:
        <S> → <A> a <B> b
        <A> → <A> b | b
        <B> → a <B> | a
    Which of the following sentences are in the language generated by this grammar?
        baab
        bbbab
        bbaaaaa
        bbaab
  - Problem 4: Consider the following grammar:
        <S> → a <S> c <B>
        <S> → <A> | b
        <A> → c <A> | c
        <B> → d | <A>
    Which of the following sentences are in the language generated by the grammar?
        abcd
        acccbd
        acccbcc
        acd
        accc

Abstract Syntax Trees
  - Parse trees are still too detailed in their structure, since every step in a derivation is expressed as nodes.
  - An abstract syntax tree (AST), or just syntax tree, condenses a parse tree to its essential structure.
  - An AST is more compact than the corresponding parse tree.
  - Language designers and translator writers are most interested in abstract syntax.
  - A programmer is most interested in concrete syntax.
  - Examples follow.
  [Figure: parse tree for a number with the digits 2, 3, and 4, and the corresponding AST]
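As a rough illustration of how a parse tree condenses into an AST, the sketch below hand-builds the parse tree of (2 + 3) * 4 under the rules expr → expr + expr | expr * expr | ( expr ) | number and number → number digit | digit, then strips it down. The nested-tuple representation and the helper names leaves and to_ast are assumptions made for this example only; real translators use their own node types.

```python
def leaves(node):
    """Return the terminals at the leaves of a parse tree, left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [leaf for child in children for leaf in leaves(child)]


def to_ast(node):
    """Condense a parse tree node (label, child, ...) into an AST."""
    label, *children = node
    if label == "number":
        # The number/digit chain collapses into a single integer leaf.
        return int("".join(leaves(node)))
    # Otherwise the node is an expr.
    if len(children) == 1:
        # expr -> number: the extra expr node carries no information.
        return to_ast(children[0])
    if children[0] == "(":
        # expr -> ( expr ): grouping is already captured by the tree shape.
        return to_ast(children[1])
    left, op, right = children
    return (op, to_ast(left), to_ast(right))


def number(digit):
    """Build the parse tree of a one-digit number."""
    return ("number", ("digit", digit))


# Parse tree for (2 + 3) * 4, written out by hand.
parse_tree = (
    "expr",
    ("expr", "(",
        ("expr", ("expr", number("2")), "+", ("expr", number("3"))),
     ")"),
    "*",
    ("expr", number("4")),
)

ast = to_ast(parse_tree)
print(ast)
```

The result is the tuple ('*', ('+', 2, 3), 4): the parentheses, the chains of expr, number, and digit nodes, and every other bookkeeping step of the derivation are gone, while the operator structure remains.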
  [Figure: parse tree for the expression (2 + 3) * 4 and the corresponding AST]

Ambiguous Grammars
  - A grammar is ambiguous if it is possible to construct two or more distinct parse trees for the same string.
  - Example:
        Grammar:     expr → expr + expr | expr * expr | ( expr ) | NUMBER
        Expression:  2 + 3 * 4
    [Figure: two parse trees for 2 + 3 * 4, showing ambiguity in operator precedence]
  - Another example:
        Grammar:     expr → expr + expr | expr - expr | ( expr ) | NUMBER
        Expression:  2 - 3 - 4
    [Figure: two parse trees for 2 - 3 - 4, showing ambiguity in operator associativity]
  - Ways to resolve ambiguities in a grammar
      - Revise the grammar (the desired approach)
      - Provide a disambiguating rule (semantic help)
  - Revising a grammar to address precedence and associativity ambiguities
      - Do not write rules that allow a parse tree to grow on both the left and right sides
          - Use left recursive rules for left-associative operators
          - Use right recursive rules for right-associative operators
      - Add new rules that establish a "precedence cascade" between rules to specify precedence
          - Make sure operators with higher precedence appear lower in the cascade of rules
  - Revised grammar
        expr → expr + term | term
        term → term * factor | factor
        factor → ( expr ) | NUMBER
  - Problem 1:
        <expr> → <expr> + <expr> | <expr> - <expr> | <expr> * <expr> | <expr> / <expr> | ( <expr> ) | NUMBER
        NUMBER = [0-9]+
    Show that this grammar is ambiguous by constructing two distinct parse trees for each of the following expressions:
        30 + 5 + 2
        30 - 5 - 2
        30 * 5 * 2
        30 / 5 / 2
        30 + 5 * 2
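Why ambiguity matters can be seen by listing every tree shape a rule of the form expr → expr op expr allows for the two earlier example expressions, 2 + 3 * 4 and 2 - 3 - 4, and evaluating each one. This is a minimal sketch under stated assumptions: the function names are made up here, the input is a parenthesis-free token list, and each tree is written as a nested tuple (op, left, right) that stands in for the corresponding parse tree.

```python
import operator

# Operators used by the two example expressions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def parse_trees(tokens):
    """Return every tree the rule expr -> expr op expr | NUMBER allows.

    tokens alternates numbers and operators, e.g. ["2", "-", "3", "-", "4"].
    """
    if len(tokens) == 1:
        return [int(tokens[0])]
    trees = []
    # Any operator may be chosen as the root, which is the source of ambiguity.
    for k in range(1, len(tokens), 2):
        for left in parse_trees(tokens[:k]):
            for right in parse_trees(tokens[k + 1:]):
                trees.append((tokens[k], left, right))
    return trees


def evaluate(tree):
    """Compute the value of a tree built by parse_trees."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))


for text in ("2 + 3 * 4", "2 - 3 - 4"):
    for tree in parse_trees(text.split()):
        print(text, tree, evaluate(tree))
```

Each expression gets two trees, and the two trees disagree on the value: 14 versus 20 for 2 + 3 * 4, and 3 versus -5 for 2 - 3 - 4. A grammar that admits both trees leaves the meaning of the expression undecided.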
  - Revised unambiguous grammar
        <expr> → <expr> + <term> | <expr> - <term> | <term>
        <term> → <term> * <factor> | <term> / <factor> | <factor>
        <factor> → ( <expr> ) | NUMBER
        NUMBER = [0-9]+
  - Problem 2: Show that the following grammar is ambiguous:
        <S> → <A>
        <A> → <A> + <A> | <id>
        <id> → a | b | c
  - Are there other alternatives to resolving ambiguities?
      - Yes, but they change the language!
      - Fully parenthesized expressions:
            expr → ( expr + expr ) | ( expr - expr ) | NUMBER
      - Prefix expressions:
            expr → + expr expr | - expr expr | NUMBER

Extended BNF
  - Adds new metasymbols (or operations) to BNF to enhance readability and writability.
  - These extensions do not enhance the descriptive power of BNF.
  - EBNF facilitates the development of parsing tools based on an approach called recursive-descent parsing.
  - New metasymbols added to EBNF:
        { }      zero or more repetitions
        [ ]      optional parts
        ( | )    multiple-choice
  - Examples:
        BNF:   <number> → <number> <digit> | <digit>
        EBNF:  <number> → <digit> {<digit>}

        BNF:   <expr> → <expr> + <term> | <term>
        EBNF:  <expr> → <term> {+ <term>}

        BNF:   <expr> → <term> ^ <expr> | <term>
        EBNF:  <expr> → <term> [^ <expr>]

        BNF:   <selection> → if <logic-expr> then <statement>
                           | if <logic-expr> then <statement> else <statement>
        EBNF:  <selection> → if <logic-expr> then <statement> [else <statement>]

        BNF:   <for-stmt> → for <var> := <expr> to <expr> do <statement>
                          | for <var> := <expr> downto <expr> do <statement>
        EBNF:  <for-stmt> → for <var> := <expr> (to | downto) <expr> do <statement>
  - More examples:
        BNF:
          <expr> → <expr> + <term> | <term>
          <term> → <term> * <power> | <term> / <power> | <term> % <power> | <power>
          <power> → <factor> ^ <power> | <factor>
          <factor> → (<expr>) | NUMBER
          NUMBER = [0-9]+
        EBNF:
          <expr> → <term> {+ <term>}
          <term> → <power> { * <power> | / <power> | % <power> }
          <power> → <factor> [^ <power>]
          <factor> → (<expr>) | NUMBER
          NUMBER = [0-9]+

Syntax Diagrams
  - A graphical representation for a grammar rule
  - An alternative to EBNF
  - Circles or ovals for terminals
  - Squares or rectangles for nonterminals
  - Terminals and nonterminals are connected with lines and arrows
  - Visually appealing but takes up space
  - Rarely seen any more: EBNF is much more compact
  [Figure: syntax diagram for if-statement: if ( expression ) statement, optionally followed by else statement]
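The link between EBNF and recursive-descent parsing can be sketched with the EBNF expression grammar given earlier (<expr>, <term>, <power>, <factor>, NUMBER = [0-9]+). In the hypothetical evaluator below there is one method per nonterminal, each { } repetition becomes a while loop, and the [ ] option becomes an if. The class and function names are invented for this example, and treating / as integer division is an assumption, since the grammar describes only form and says nothing about meaning.

```python
import re

# NUMBER = [0-9]+, plus the one-character operator and parenthesis tokens.
TOKEN_RE = re.compile(r"\s*([0-9]+|[+*/%^()])")


def tokenize(text):
    """Split the input into lexemes; reject characters outside the grammar."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expr(self):
        # <expr> -> <term> {+ <term>}
        value = self.term()
        while self.peek() == "+":
            self.advance()
            value += self.term()
        return value

    def term(self):
        # <term> -> <power> { * <power> | / <power> | % <power> }
        value = self.power()
        while self.peek() in ("*", "/", "%"):
            op = self.advance()
            right = self.power()
            if op == "*":
                value *= right
            elif op == "/":
                value //= right  # assumption: integer division
            else:
                value %= right
        return value

    def power(self):
        # <power> -> <factor> [^ <power>]; the recursion makes ^ right-associative
        base = self.factor()
        if self.peek() == "^":
            self.advance()
            return base ** self.power()
        return base

    def factor(self):
        # <factor> -> ( <expr> ) | NUMBER
        token = self.advance()
        if token == "(":
            value = self.expr()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if token is not None and token.isdigit():
            return int(token)
        raise SyntaxError(f"unexpected token {token!r}")


def evaluate(text):
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return value


print(evaluate("2 + 3 * 4"))
print(evaluate("2 ^ 3 ^ 2"))
```

The loops in expr and term accumulate from left to right, so + and * come out left-associative, while the recursive call in power groups 2 ^ 3 ^ 2 as 2 ^ (3 ^ 2). Precedence follows from the cascade: the lower a rule sits in the chain of calls, the more tightly its operator binds.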