Syntax

Introduction

Syntax is the grammar of a language. The syntax rules define what constitutes a valid program, all the way from a complete program down to the smallest expression. In the following sections we will cover several topics related to syntax:

- tokenizing, syntax parsing, and how they relate to program processing
- Backus-Naur Form (BNF) and its extensions for expressing syntax
- parse trees and abstract syntax trees
- lexemes, tokens, and tokenizing
- regular expressions and their application
- tools for tokenizing and parsing
- the recursive descent algorithm

Both compilers and interpreters must read the source code of a program and convert it into a sequence of executable instructions or declarations of what task the software is to perform. What are the tasks that a compiler or interpreter must perform to process a source program?

1. Read the program from a file or a buffer in an IDE.

2. Lexical Analysis or Tokenizing: divide the program into a sequence of words, separators, operations, and other meaningful elements (called lexemes). In the process, the tokenizer also removes white space and comments between lexemes. The result of this process is a stream of lexemes or tokens.

3. Syntactic Analysis: make sense of the stream of tokens. A parser matches the tokens against grammar rules to assemble them into larger syntactic units such as expressions, statements, procedures, modules or classes, and program units, just as humans assemble a stream of words into phrases and sentences. If an error or unrecognizable sequence of tokens is encountered, the parser should indicate the error (along with a reference to the point where it was found), then try to recover and continue processing. If no errors are found, the result of this step is usually a parse tree describing the program in some form of intermediate code.

4. Semantic Analysis: decide the meaning of the parts of the parse tree (or other result of syntactic analysis) and how to convert these into executable statements. Some interpreters perform this operation immediately upon recognizing a valid, complete instruction. The intermediate code may optionally be scrutinized to see whether its efficiency can be improved (without changing the program logic) in a process of optimization.

5. Code Generation: generate machine instructions (in the case of a compiler) or execute the instructions directly (in the case of an interpreter). For a compiler, the output is often called target code or object code.

After all that, the object file produced by a compiler is still not ready to be run. It normally needs to be combined (linked) with other compilation units and with pre-compiled code for the language's API, system calls to the operating system, etc. These are contained in libraries; a linker program combines objects and resolves external references to produce an executable program.

The output of the linker is an "executable" program that can be loaded and run. But in a strict sense, even this program may not be executable. The executable program contains position independent code and may contain references to some additional functions or data that are to be resolved at run-time. This code and data are contained in dynamic link libraries on Microsoft Windows and shared libraries on Unix/Linux.

An Example

As an illustration, let's look at a fairly useless C program. In C, executable statements must be part of a function. A minimal program consists of a single main( ) function.
Excluding the preprocessor "#include" directives, this simple program would look like this:

    /* area of a circle */
    int main( ) {
        float radius /* radius of the circle */ = 2.5;
        float PI = 3.14159;
        float area;
        area = PI*radius*radius;
        printf("The area is %f\n", area);
    }

The tokenizer would scan this program and construct the first few tokens like this:

    token     category
    int       RESERVED WORD
    main      IDENTIFIER
    (         SEPARATOR
    )         SEPARATOR
    {         SEPARATOR
    float     RESERVED WORD
    radius    IDENTIFIER
    =         OPERATOR
    2.5       NUMERIC CONSTANT
    ;         SEPARATOR

The tokenizer discards comments and white space; they are significant only as token separators. The tokenizer also makes no attempt to verify matching separators such as { and }; that is the job of the parser (syntactic analyzer).

In places where some other separator is present, white space can be omitted in most languages [1]. The above function could be written without white space, the style preferred by many beginning programming students:

    int main(){float radius=2.5,PI=3.14159,area;area=PI*
    radius*radius;printf("The area is %f\n",area);}

[1] The syntax permits white space to be omitted, but it is good programming practice to include white space, even around separators such as ( ).

Backus-Naur Form

In the 1950s, the renowned linguist Noam Chomsky devised four classes for formally defining the grammar of languages. The two simplest of these classes, context-free grammar and regular grammar, subsequently proved to be suitable for describing the syntax of computer languages.

The idea of formally expressing computer language syntax as a context-free grammar using an abstract notation is attributed to John Backus, an architect of Fortran and member of the ACM group that developed Algol. At a 1959 international conference on Algol, Backus described a formal notation for Algol's syntax (Backus, 1959) that was subsequently modified by Peter Naur for describing Algol 60 (Naur, 1960).
The original BNF soon proved to be somewhat cumbersome, requiring recursive definitions and a long list of alternatives to describe syntax. Extensions were added to simplify the expression of alternatives, repetitive clauses, and optional syntax; these are collectively known as Extended BNF (EBNF). BNF and EBNF are now almost universally used to describe syntax.

BNF Notation

BNF consists of a list of rules or productions that describe syntax. Consider a rule for an "if" statement with an optional "else" clause, as in the C language:

    if ( x > 0 ) result = y/x;

    if ( x > 0 ) result = y/x;
    else result = y;

A BNF rule to define this sort of statement as an "if_statement" would be:

    if_statement → if ( boolean_expression ) statement
                 | if ( boolean_expression ) statement else statement

The simplest tokens, which are not defined by rules, are called terminal symbols; the others are called nonterminal symbols. In a BNF definition, all nonterminal symbols must be defined by rules. In this example, if_statement, boolean_expression, and statement are nonterminals; if, else, and ( ) are terminal symbols. The vertical bar ( | ) means "or".

The collection of allowed terminal symbols must also be defined somewhere. These are often defined separately in a lexical grammar.

Variations in BNF notation for productions exist, to accommodate different written formats (e.g., absence of special characters and formatting). A common notation for plain text documents is:

    <if-statement> ::= if ( <boolean-expression> ) <statement>
                     | if ( <boolean-expression> ) <statement> else <statement>

Another variation, which avoids the troublesome right arrow character and avoids "-" and "_" in names, is:

    IfStatement :: if ( BooleanExpression ) Statement
                 | if ( BooleanExpression ) Statement else Statement

Literal values are sometimes placed in quotation marks to distinguish them. This becomes important in EBNF.
Quotation marks can also clarify when a space is required, since two nonterminal symbols separated by a space usually means concatenation, as in the example grammar below.

    Digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

In the Java Language Specification, Sun uses the following notation (but with a more complicated definition of "IfStatement" than shown here):

    IfStatement:
        if ( Expression ) Statement
        if ( Expression ) Statement else Statement

In this notation, italic font indicates nonterminals, and each indented line is an alternative production (no "or" bar). For compact listing of simple alternatives, Sun uses the phrase "one of":

    Digit: one of
        0 1 2 3 4 5 6 7 8 9

Specifying BNF rules for terminal symbols such as integers and identifiers can be tedious, as shown in the following example. Later we'll see how to use regular expressions to represent them more succinctly.

Example:

To illustrate the use of BNF, let's define rules for a simple grammar consisting of only assignment and the arithmetic operations + and -. In the next section, we will study how the productions affect operator precedence and associativity; for now we merely specify what constitutes a legal assignment. The grammar will allow assignments such as:

    x = 2 + 4 + 11.5
    y = x + 77 - 0.1
    sum = x + y

First, we need rules for the terminal symbols, the tokens in the language. These are the rules of the lexical grammar because they define the lexemes that we want the lexical analyzer (tokenizer) to return. It would be inefficient to have the lexical analyzer simply return each character as a token, putting all the work in the parser. The tokens in this grammar will be integer and floating point numbers, identifiers consisting of letters and digits, the = sign, and the basic arithmetic operations.

Numbers can be integers or floating point. An integer may have an optional "-" prefix, but no leading zero unless the value is zero (09 is not allowed).
A floating point value can be of the form "12.", "12.345", ".345", or any of these with a minus prefix. The rules for numbers are:

    NonZeroDigit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
    Digit        → 0 | NonZeroDigit
    Digits       → Digit | Digits Digit
    UnsignedInt  → 0 | NonZeroDigit | NonZeroDigit Digits
    Integer      → UnsignedInt | - UnsignedInt
    FloatingPt   → Integer . | Integer . Digits | . Digits | - . Digits
    NumericConst → Integer | FloatingPt

In these rules, a space between symbols means concatenation, not the requirement of a literal space.

Identifiers (variable names) can be any sequence of letters and digits, provided that the first character is a letter.

    Letter     → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
               | A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
    Identifier → Letter | Identifier Letter | Identifier Digit

The other lexical units are the operators and the assignment symbol. Software for generating a real tokenizer would also let us specify how white space characters should be handled, but for simplicity we'll ignore this detail.

    Operator     → + | - | * | /
    AssignmentOp → =

The syntactic grammar, which defines the valid statements, is given next. Since an expression can involve any number of arithmetic operations, a recursive rule for expressions is needed.

    Assignment → Identifier = Expression
    Expression → Expression Operator Factor | Factor
    Factor     → NumericConst | Identifier

Applying Rules To Parse Expressions

A parser uses the grammar rules to construct syntactic units from the stream of input tokens. One of the nonterminal symbols in the context-free grammar must be designated as the start symbol, which defines all valid inputs. The parser will attempt to match the entire input stream to the start symbol. In this example, Assignment is the start symbol.

A parse tree shows graphically how an input is matched against a sequence of productions.
Consider the parse tree for:

    x = y - 12 * z

    Assignment
        Identifier: x
        =
        Expression
            Expression
                Expression
                    Factor
                        Identifier: y
                Operator: -
                Factor
                    NumericConst: 12
            Operator: *
            Factor
                Identifier: z

The matching of lexical grammar rules (Identifier → Letter → 'x') has been omitted, since it would not be performed by the parser.

A parser constructs a parse tree as a data structure; each node in the tree shows a production that is matched by part of the input. But once the entire input has been successfully matched, much of this information is irrelevant for code generation. A semantic analyzer typically simplifies this tree by removing unnecessary nodes. The result is an abstract syntax tree, as shown below. Each node in the abstract syntax tree would contain information about the type of symbol contained by the node; for identifiers, the node would contain a pointer to the identifier in a symbol table.

    Assignment (abstract syntax tree)
        =
            x
            *
                -
                    y
                    12
                z

Associativity and Order

BNF rules need to be carefully written to achieve the correct order and associativity of the parsed code, and to avoid ambiguity. A parse tree is read or traversed in normal order to evaluate (or generate code for) an expression. In the above example, "y - 12" would be evaluated before "* z", so the result would be to assign x the value (y-12)*z, which is not the usual precedence of operations. This problem arises because the grammar contains no information that distinguishes among the arithmetic operators or their subexpressions.
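The left-to-right grouping produced by the rule Expression → Expression Operator Factor | Factor can be checked numerically. The following Python sketch is an illustration only and is not part of any parser described in these notes: the name eval_flat, the OPS table, and the sample values y = 20 and z = 2 are all assumptions chosen for the demonstration.

```python
import operator

# Hypothetical operator table; every operator is treated at the same level,
# exactly as the single Operator nonterminal does in the flat grammar.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_flat(tokens, env):
    """Evaluate tokens grouped as Expression -> Expression Operator Factor.

    Because the rule is left recursive and has a single operator level,
    each operator is applied to everything accumulated so far.
    """
    def factor(tok):
        # A Factor is an identifier (looked up in env) or a numeric constant.
        return env[tok] if tok in env else float(tok)

    result = factor(tokens[0])
    for i in range(1, len(tokens), 2):
        result = OPS[tokens[i]](result, factor(tokens[i + 1]))
    return result

env = {"y": 20.0, "z": 2.0}                       # assumed sample values
flat = eval_flat(["y", "-", "12", "*", "z"], env)  # grouped as (y - 12) * z
usual = env["y"] - 12 * env["z"]                   # usual precedence: y - (12 * z)
print(flat, usual)
```

With these sample values the flat grammar yields (20 - 12) * 2 = 16, while the usual rules of arithmetic give 20 - 24 = -4.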
We could fix this by adding separate definitions for a Term and a Factor, as in standard arithmetic:

    Assignment → Identifier = Expression
    Expression → Expression + Term | Expression - Term | Term
    Term       → Term * Factor | Term / Factor | Factor
    Factor     → NumericConst | Identifier

Now when the parser attempts to match "x = y - 12 * z" to the rules for Expression, the matching would occur in the following order:

    x = y - 12 * z
    Identifier = Identifier - NumericConst * Identifier
    Identifier = Factor - Factor * Factor
    Identifier = Factor - Term * Factor
    Identifier = Factor - Term
    Identifier = Term - Term
    Identifier = Expression - Term
    Identifier = Expression
    Assignment

The above example illustrates how a parser might match tokens to the productions in a bottom-up order, the strategy used by LR parsers. The tokenizer identifies x, y, 12, and z as Identifier and NumericConst. The parser then seeks to reduce the token stream by matching groups of tokens to a production and replacing them with the nonterminal name on the left side of the rule. The resulting abstract syntax tree and evaluation order of this assignment are:

    Assignment (abstract syntax tree)
        =
            x
            -
                y
                *
                    12
                    z

    x = y - (12 * z)

The productions for Expression and Term define these nonterminals recursively, with the recursive nonterminal appearing at the left end of the right-hand side of the production; this is called left recursion. The choice of left recursion or right recursion affects the results of parsing, so the choice must be made that achieves the desired result. Suppose we replace the left recursive rules (above) with right recursion:

    Expression → Term + Expression | Term - Expression | Term
    Term       → Factor * Term | Factor / Term | Factor
    Factor     → NumericConst | Identifier

This changes the associativity of the arithmetic operations. For example, consider x = 10 - 5 - 3.
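Before tracing the parse, it may help to see the two possible groupings of 10 - 5 - 3 side by side. The Python sketch below is only an illustration of the arithmetic, not a parser; the function names group_left and group_right are hypothetical and simply mirror the shape of a left-recursive and a right-recursive rule for subtraction.

```python
from functools import reduce

def group_left(nums):
    """Mirror Expression -> Expression - Term: fold from the left,
    so [a, b, c] is grouped as (a - b) - c."""
    return reduce(lambda acc, n: acc - n, nums)

def group_right(nums):
    """Mirror Expression -> Term - Expression: the remainder of the input
    is itself an Expression, so [a, b, c] is grouped as a - (b - c)."""
    if len(nums) == 1:
        return nums[0]
    return nums[0] - group_right(nums[1:])

print(group_left([10, 5, 3]))   # (10 - 5) - 3
print(group_right([10, 5, 3]))  # 10 - (5 - 3)
```

The left grouping gives 2 and the right grouping gives 8, so the choice of recursion changes the value that is computed for the same input.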
Using the right recursive rules above, and taking advantage of the opportunity to use top-down parsing (the result using bottom-up parsing would be the same), this assignment will be parsed as:

    Assignment
    Identifier = Expression
    x = Term - Expression
    x = Factor - Term - Expression
    x = NumericConst - Factor - Term
    x = 10 - NumericConst - Factor
    x = 10 - 5 - NumericConst
    x = 10 - 5 - 3

The abstract syntax tree for this is:

    Assignment (abstract syntax tree)
        =
            x
            -
                10
                -
                    5
                    3

Evaluating this tree requires evaluating nodes from the bottom up, leading to the result:

    x = 10 - ( 5 - 3 ) = 10 - 2 = 8

Using right recursion to define the rules for addition and subtraction made those operations right associative. It is left as an exercise to show that the abstract syntax tree produced by the original (left recursive) grammar rules would lead to the evaluation x = (10 - 5) - 3 = 2.

This result can be summarized as:

- Left recursion corresponds to left associativity of operators in a production.
- Right recursion corresponds to right associativity of operators in a production.

The order in which rules refer to other rules also affects the precedence of operations. In the above grammar rules, Assignment is defined in terms of Expression, Expression in terms of Term, and Term in terms of Factor. As a result, a Factor will be evaluated before the Term that includes it, and a Term will be evaluated before the Expression that includes it. This gives * and / (matched in the Term rule) higher precedence than + and - (matched in the Expression rule).

Extended BNF Notation

Extensions have been added to BNF to simplify the writing of alternatives and reduce the need for recursive definitions. EBNF includes the following notation:

    Notation   Meaning                                  Example
    (a|b|c)    any one of a, b, or c                    Operator ::= ( + | - | * | / )
    {a}        zero or more occurrences of the item     Expression ::= Term { Operator Term }
               in braces
    [a]        item in brackets is optional; it can     Term ::= [-]Number
               occur 0 or 1 time

EBNF replaces explicit recursion (a rule using its own left-hand-side symbol in its production) with repetition using the { .. } notation. Here's a comparison for a simple arithmetic grammar:

    BNF:
        expression ::= expression + term | expression - term | term
        term       ::= term * factor | term / factor | factor
        factor     ::= ( expression ) | ID | NUMBER

    EBNF:
        expression ::= term { (+|-) term }
        term       ::= factor { (*|/) factor }
        factor     ::= '(' expression ')' | ID | NUMBER

Notice that the EBNF rules don't explicitly use the left-side nonterminal in the rule definition, but there is still some implicit recursion: factor refers to expression, which in turn is defined through term and factor. In the rule for factor, the parentheses representing actual tokens are placed in quotation marks to distinguish them from the parenthesis metasymbols (EBNF notation) indicating alternatives.

BNF and EBNF are equally powerful for representing grammar rules. The choice of notation may be dictated by implementation: the parser generating programs yacc, bison, and CUP require input rules in BNF style, while EBNF is more suitable for implementing a parser using the recursive descent algorithm. EBNF can also eliminate some ambiguity in BNF rules.

Additional EBNF Notation

Several variations on EBNF notation exist. Some common constructs are:

    Notation     Meaning                                  Example
    symbol_opt   subscript "opt" in place of [...] for    attribute ::= final_opt datatype ID ;
                 an optional part
    { a }+       one or more occurrences of the item      StatementBlock ::= begin { Statement ; }+ end
                 in braces

Regular Expressions and Tokenizing

To be added: see lecture slides on lexemes, tokens, and regular expressions.

Resources

The Java Language Specification, http://java.sun.com/docs/books/jls/, makes extensive use of BNF.
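As a closing illustration of how the EBNF arithmetic grammar (expression, term, factor) maps onto the recursive descent algorithm, here is a minimal Python sketch. It is not taken from the lecture material: the names tokenize, Parser, and evaluate, the token regular expression, and the decision to evaluate values directly instead of building a tree are all assumptions made for brevity. The number pattern is also simplified and does not enforce the "no leading zero" rule of the lexical grammar given earlier.

```python
import re

# Assumed token pattern: a number, an identifier, or a one-character symbol,
# each optionally preceded by white space (which is discarded).
TOKEN_RE = re.compile(
    r"\s*(?:(\d+\.\d*|\.\d+|\d+)|([A-Za-z][A-Za-z0-9]*)|([-+*/()]))")

def tokenize(text):
    """Return a list of (kind, lexeme) pairs; kind is NUMBER, ID, or the symbol."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, ident, symbol = m.groups()
        if number is not None:
            tokens.append(("NUMBER", number))
        elif ident is not None:
            tokens.append(("ID", ident))
        else:
            tokens.append((symbol, symbol))
        pos = m.end()
    return tokens

class Parser:
    """One method per EBNF rule; each { ... } repetition becomes a while loop."""

    def __init__(self, tokens, env):
        self.tokens, self.pos, self.env = tokens, 0, env

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

    def expression(self):            # expression ::= term { (+|-) term }
        value = self.term()
        while self.peek() in ("+", "-"):
            op, _ = self.advance()
            rhs = self.term()
            value = value + rhs if op == "+" else value - rhs
        return value

    def term(self):                  # term ::= factor { (*|/) factor }
        value = self.factor()
        while self.peek() in ("*", "/"):
            op, _ = self.advance()
            rhs = self.factor()
            value = value * rhs if op == "*" else value / rhs
        return value

    def factor(self):                # factor ::= '(' expression ')' | ID | NUMBER
        kind = self.peek()
        if kind == "NUMBER":
            return float(self.advance()[1])
        if kind == "ID":
            return self.env[self.advance()[1]]
        if kind == "(":
            self.advance()
            value = self.expression()
            if self.peek() != ")":
                raise SyntaxError("expected ')'")
            self.advance()
            return value
        raise SyntaxError(f"unexpected token {kind!r}")

def evaluate(text, env=None):
    """Parse and evaluate text; env maps identifier names to values."""
    parser = Parser(tokenize(text), env or {})
    value = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError("unexpected trailing input")
    return value

print(evaluate("10 - 5 - 3"))
print(evaluate("y - 12 * z", {"y": 20.0, "z": 2.0}))
```

Because each loop folds its operands from the left, this sketch treats the operators as left associative (10 - 5 - 3 is grouped as (10 - 5) - 3), and because term is called from expression, * and / bind more tightly than + and -. A production parser would more likely build an abstract syntax tree in each method and leave evaluation to a later phase.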