Notes on Syntax and BNF

Syntax
Introduction
Syntax is the grammar of a language. The syntax rules define what constitutes a valid program, all the
way from a complete program down to the smallest expression. The following sections cover several
topics related to syntax:
 - tokenizing, syntax parsing, and how they relate to program processing
 - Backus-Naur Form (BNF) and its extensions for expressing syntax
 - parse trees and abstract syntax trees
 - lexemes, tokens, and tokenizing
 - regular expressions and their application
 - tools for tokenizing and parsing
 - the recursive descent algorithm
Both compilers and interpreters must read the source code of a program and somehow convert it
into a sequence of executable instructions or declarations of what task the software is to perform.
What are the tasks that a compiler or interpreter must perform to process a source program?
1. Read the program from a file or a buffer in an IDE.
2. Lexical Analysis or Tokenizing: divide the program into a sequence of words, separators,
   operators, and other meaningful elements (called lexemes). In the process, white space and
   comments between lexemes are removed. The result of this step is a stream of lexemes or
   tokens.
3. Syntactic Analysis: make sense of the stream of tokens. A parser matches the tokens
   against grammar rules to assemble them into larger syntactic units such as expressions,
   statements, procedures, modules or classes, and program units, just as humans assemble a
   stream of words into phrases and sentences. If an error or unrecognizable sequence of tokens
   is encountered, the parser should report the error (along with a reference to the point where
   it was found), then try to recover and continue processing. If no errors are found, the result
   of this step is usually a parse tree describing the program in some form of intermediate code.
4. Semantic Analysis: determine the meaning of the parts of the parse tree (or other result of
   syntactic analysis) and how to convert these into executable statements. Some interpreters
   perform this operation immediately upon recognizing a valid, complete instruction. The
   intermediate code may optionally be scrutinized to see whether its efficiency can be improved
   (without changing the program logic) in a process of optimization.
5. Code Generation: generate machine instructions (in the case of a compiler) or execute the
   instructions directly (in the case of an interpreter). For a compiler, the output is often called
   target code or object code.
After all that, the object file produced by a compiler is still not ready to be run. It normally needs
to be combined (linked) with other compilation units and with pre-compiled code for the
language’s API, system calls to the operating system, etc. These are contained in libraries; a
linker program combines objects and resolves external references to produce an executable
program.
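For example, with the GNU C compiler, compiling and linking can be performed as separate steps on a
hypothetical source file main.c:

    gcc -c main.c        # compile only: produces the object file main.o
    gcc main.o -o main   # link main.o with library code to produce the executable "main"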
The output of the linker is an "executable" program that can be loaded and run. But in a strict
sense, even this program may not be fully executable on its own. The executable program contains
position-independent code and may contain references to additional functions or data that are to be
resolved at run time. This code and data are contained in dynamic link libraries on Microsoft
Windows and shared libraries on Unix/Linux.
An Example
As an illustration, let's look at a fairly useless C program. In C, executable statements must be
part of a function. A minimal program consists of a single main( ) function. Excluding the
preprocessor "#include" directives, this simple program would look like this:

/* area of a circle */
int main( ) {
    float radius /* radius of the circle */ = 2.5;
    float PI = 3.14159;
    float area;
    area = PI*radius*radius;
    printf("The area is %f\n", area);
}
The tokenizer would scan this program and construct the first few tokens like this:

    token     category
    int       RESERVED WORD
    main      IDENTIFIER
    (         SEPARATOR
    )         SEPARATOR
    {         SEPARATOR
    float     RESERVED WORD
    radius    IDENTIFIER
    =         OPERATOR
    2.5       NUMERIC CONSTANT
    ;         SEPARATOR
The tokenizer discards comments and white space; they are significant only as token separators.
The tokenizer also makes no attempt to verify matching separators such as { and } -- that's the
job of the parser (syntactic analyzer). In places where some other separator is present, white
space can be omitted in most languages.1 The above function could be written without white
space, the style preferred by many beginning programming students:

int main(){float radius=2.5,PI=3.14159,area;area=PI*
radius*radius;printf("The area is %f\n",area);}

1 The syntax permits white space to be omitted, but it is good programming practice to include white
space, even around separators such as ( ).
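A hand-written tokenizer for this little language is essentially a loop that skips white space, inspects
the next character, and gathers a complete lexeme. The following C sketch illustrates the idea; it is a
minimal illustration only, with comments, string literals, multi-character operators, and most reserved
words deliberately omitted:

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /* Token categories corresponding to the table above. */
    enum Category { RESERVED_WORD, IDENTIFIER, NUMERIC_CONST, OPERATOR, SEPARATOR, END };

    /* Scan one token starting at *src; copy its lexeme and advance *src past it. */
    enum Category next_token(const char **src, char *lexeme) {
        const char *s = *src;
        int n = 0;
        while (isspace((unsigned char)*s)) s++;              /* discard white space */
        if (*s == '\0') { *src = s; lexeme[0] = '\0'; return END; }
        if (isalpha((unsigned char)*s)) {                    /* identifier or reserved word */
            while (isalnum((unsigned char)*s)) lexeme[n++] = *s++;
            lexeme[n] = '\0';
            *src = s;
            return (strcmp(lexeme, "int") == 0 || strcmp(lexeme, "float") == 0)
                   ? RESERVED_WORD : IDENTIFIER;
        }
        if (isdigit((unsigned char)*s)) {                    /* numeric constant */
            while (isdigit((unsigned char)*s) || *s == '.') lexeme[n++] = *s++;
            lexeme[n] = '\0';
            *src = s;
            return NUMERIC_CONST;
        }
        lexeme[0] = *s++;                                    /* single-character token */
        lexeme[1] = '\0';
        *src = s;
        return strchr("=+-*/", lexeme[0]) ? OPERATOR : SEPARATOR;
    }

    int main(void) {
        const char *program = "int main(){float radius=2.5;";
        char lexeme[64];
        while (next_token(&program, lexeme) != END)
            printf("%s\n", lexeme);
        return 0;
    }

Run on the fragment above, this prints the same lexemes listed in the token table: int, main, (, ), {,
float, radius, =, 2.5, and ;.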
Backus-Naur Form
In the 1950s, the renowned linguist Noam Chomsky devised four classes for formally defining
the grammar of languages. The two simplest of these classes, context-free grammars and regular
grammars, subsequently proved to be suitable for describing the syntax of computer languages.
The idea of formally expressing computer language syntax as a context-free grammar using an
abstract notation is attributed to John Backus, an architect of Fortran and member of the ACM
group that developed Algol. At a 1959 international conference on Algol, Backus described a
formal notation for Algol’s syntax (Backus, 1959) that was subsequently modified by Peter Naur
for describing Algol 60 (Naur, 1960).
The original BNF soon proved to be somewhat cumbersome, requiring recursive definitions and a
long list of alternatives to describe syntax. Extensions were added to simplify expression of
alternatives, repetitive clauses, and optional syntax, collectively known as Extended BNF
(EBNF). BNF and EBNF are now almost universally used to describe syntax.
BNF Notation
BNF consists of a list of rules or productions that describe syntax. Consider a rule for an “if”
statement with optional “else” clause, as in the C language:
if ( x > 0 ) result = y/x;
if ( x > 0 ) result = y/x; else result = y;
A BNF rule to define this sort of statement as an "if_statement" would be:
if_statement → if ( boolean_expression ) statement
| if ( boolean_expression ) statement else statement
The simplest tokens, which are not defined by rules, are called terminal symbols; the others are
called nonterminal symbols. In a BNF definition, all nonterminal symbols must be defined by
rules. In this example, if_statement, boolean_expression, and statement are nonterminals; if,
else, (, and ) are terminal symbols. The vertical bar ( | ) means "or".
The collection of allowed terminal symbols must also be defined somewhere. These are often
defined separately in a lexical grammar.
Variations in BNF notation for productions exist, to accommodate different written formats (e.g.,
absence of special characters and formatting). A common notation for plain text documents is:
<if-statement> ::= if ( <boolean-expression> ) <statement>
| if ( <boolean-expression> ) <statement> else <statement>
Another variation, which avoids the troublesome right-arrow character and avoids "-" and "_" in names:
IfStatement :: if ( BooleanExpression ) Statement
| if ( BooleanExpression ) Statement else Statement
Literal values are sometimes placed in quotation marks to distinguish them. This becomes
important in EBNF. Quotations can also clarify when a space is required, since two nonterminal
symbols separated by a space usually denotes concatenation, as in the example grammar below.

Digit → '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
In the Java Language Specification, Sun uses this notation (but with a more complicated definition
of "IfStatement" than shown here):
IfStatement:
if ( Expression ) Statement
if ( Expression ) Statement else Statement
In this notation, italic font indicates nonterminals, and each indented line is an alternative
production (there is no "or" bar). For a compact listing of simple alternatives, Sun uses the phrase "one of":
Digit: one of
0 1 2 3 4 5 6 7 8 9
Specifying BNF rules for terminal symbols such as integers and identifiers can be tedious, as
shown in the following example. Later we’ll see how to use regular expressions to represent
them more succinctly.
Example: To illustrate the use of BNF, let’s define rules for a simple grammar consisting of only
assignment and the arithmetic operations + and -. In the next section, we will study how the
productions affect operator precedence and associativity – for now we merely specify what
constitutes a legal assignment. The grammar will allow assignments such as:
x = 2 + 4 + 11.5
y = x + 77 - 0.1
sum = x + y
First, we need rules for the terminal symbols -- the tokens in the language. These are the rules for
the lexical grammar because they define the lexemes that we want the lexical analyzer
(tokenizer) to return. It would be inefficient to have the lexical analyzer simply return each
character as a token, putting all the work on the parser. The tokens in this grammar will be
integer and floating point numbers, identifiers consisting of letters and digits, the = sign, and the
basic arithmetic operators.
Numbers can be integers or floating point. An integer may have an optional "-" prefix, but no
leading zero unless the value is zero (09 is not allowed). A floating point value can be of the
form "12.", "12.345", ".345", or any of these with a minus prefix. The rules for numbers are:
NonZeroDigit → 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Digit        → 0 | NonZeroDigit
Digits       → Digit | Digits Digit
UnsignedInt  → 0 | NonZeroDigit | NonZeroDigit Digits
Integer      → UnsignedInt | - UnsignedInt
FloatingPt   → Integer . | Integer . Digits | . Digits | - . Digits
NumericConst → Integer | FloatingPt
In these rules, a space between symbols means concatenation, not the requirement of a literal space.
Identifiers (variable names) can be any sequence of letters and digits provided that the first character
is a letter.
Letter     → a|b|c|d|e|f|g|h|i|j|k|l|m|n|o|p|q|r|s|t|u|v|w|x|y|z
           | A|B|C|D|E|F|G|H|I|J|K|L|M|N|O|P|Q|R|S|T|U|V|W|X|Y|Z
Identifier → Letter | Identifier Letter | Identifier Digit
The other lexical units are the operators and assignment symbol. Software for generating a real
tokenizer would also let us specify how white space characters should be handled, but for
simplicity we'll ignore this detail.
Operator     → + | - | * | /
AssignmentOp → =
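As a preview of the regular-expression material mentioned earlier, the same lexemes can be written far
more compactly as regular expressions (the exact metacharacter syntax varies between tools, so treat
these as one possible rendering):

    Identifier    [A-Za-z][A-Za-z0-9]*
    Integer       -?(0|[1-9][0-9]*)
    FloatingPt    -?((0|[1-9][0-9]*)\.[0-9]*|\.[0-9]+)
    Operator      [-+*/]
    AssignmentOp  =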
The syntactic grammar, which defines the valid statements, is given next. Since an expression can
involve any number of arithmetic operations, a recursive rule for expressions is needed.

Assignment → Identifier = Expression
Expression → Expression Operator Factor | Factor
Factor     → NumericConst | Identifier
Applying Rules To Parse Expressions
A parser uses the grammar rules to construct syntactic units from the stream of input tokens. One
of the nonterminal symbols in the context free grammar must be designated as the start symbol,
that defines all valid inputs. The parser will attempt to match the entire input stream to the start
symbol. In this example, Assignment is the start symbol.
A parse tree shows graphically how an input is matched against a sequence of productions.
Consider the parse tree for x = y - 12 * z, shown here as an indented outline (each node's children
are indented beneath it):

    Assignment
        Identifier: x
        =
        Expression
            Expression
                Expression
                    Factor
                        Identifier: y
                Operator: -
                Factor
                    NumericConst: 12
            Operator: *
            Factor
                Identifier: z
The matching of lexical grammar rules (Identifier → Letter → 'x') has been omitted, since it
would not be performed by the parser. A parser constructs a parse tree as a data structure; each
node in the tree shows a production that is matched by part of the input. But, once the entire
input has been successfully matched, much of this information is irrelevant for code generation.
A semantic analyzer typically simplifies this tree by removing unnecessary nodes. The result is
an abstract syntax tree, as shown below. Each node in the abstract syntax tree would contain
information about the type of symbol contained by the node; for identifiers, the node would
contain a pointer to the identifier in a symbol table.
    =
        x
        *
            -
                y
                12
            z
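As a data structure, each node of such a tree can be represented by a small record. The following C
sketch shows one possible layout; the type and field names are illustrative only and are not taken from
any particular compiler:

    /* One possible node layout for an abstract syntax tree (illustrative sketch). */
    enum NodeKind { NODE_ASSIGN, NODE_OPERATOR, NODE_IDENTIFIER, NODE_NUMBER };

    struct SymbolEntry;               /* symbol-table entry; the table itself is not shown */

    struct AstNode {
        enum NodeKind kind;
        char op;                      /* '+', '-', '*', '/' for operator nodes */
        double value;                 /* literal value for NODE_NUMBER nodes */
        struct SymbolEntry *symbol;   /* symbol-table entry for NODE_IDENTIFIER nodes */
        struct AstNode *left, *right; /* children: operands, or target and expression */
    };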
Associativity and Order
BNF rules need to be carefully written to achieve the correct order and associativity of the parsed
code, and to avoid ambiguity. A parse tree is traversed from the bottom up to evaluate (or
generate code for) an expression. In the above example, "y - 12" would be evaluated before the
multiplication by z, so the result would be to assign x the value (y - 12)*z, which does not follow
the usual precedence of operations.
This problem arises because the grammar doesn't contain any information that distinguishes among
the arithmetic operators. We can fix this by adding separate definitions for a
Term and a Factor, as in standard arithmetic:
Assignment → Identifier = Expression
Expression → Expression + Term | Expression - Term | Term
Term       → Term * Factor | Term / Factor | Factor
Factor     → NumericConst | Identifier
Now when the parser attempts to match "x = y - 12 * z" to rules for expression, the matching
would occur in the following order:
    x            =   y            -   12             *   z
    Identifier   =   Identifier   -   NumericConst   *   Identifier
    Identifier   =   Factor       -   Factor         *   Factor
    Identifier   =   Factor       -   Term           *   Factor
    Identifier   =   Factor       -   Term
    Identifier   =   Term         -   Term
    Identifier   =   Expression   -   Term
    Identifier   =   Expression
    Assignment
The above example illustrates how a parser might match tokens to the productions in a bottom-up
order, the strategy used by LR parsers. The tokenizer identifies x, y, 12, and z as Identifier
and NumericConst tokens. The parser then seeks to reduce the token stream by matching groups of
tokens to a production and replacing them with the nonterminal name on the left side of the rule.
The resulting abstract syntax tree and evaluation order of this assignment are:

    =
        x
        -
            y
            *
                12
                z

    x = y - (12 * z)
The productions for Expression and Term define these nonterminals recursively, with the recursive
reference appearing at the left end of the production; this is called left recursion. The choice of left
recursion or right recursion affects the result of parsing, so the form that achieves the desired
result must be chosen. Suppose we replace the left-recursive rules (above) with right recursion:
Expression → Term + Expression | Term - Expression | Term
Term       → Factor * Term | Factor / Term | Factor
Factor     → NumericConst | Identifier
This changes the associativity of the arithmetic operations. For example, consider x = 10 - 5 - 3.
Using the right-recursive rules above, and taking advantage of the opportunity to use top-down
parsing (the result using bottom-up parsing would be the same), this assignment will be parsed
as:
    Assignment
    Identifier   =   Expression
    x            =   Term           -   Expression
    x            =   Factor         -   Term           -   Expression
    x            =   NumericConst   -   Factor         -   Term
    x            =   10             -   NumericConst   -   Factor
    x            =   10             -   5              -   NumericConst
    x            =   10             -   5              -   3
The abstract syntax tree for this is:

    =
        x
        -
            10
            -
                5
                3

Evaluating this tree requires evaluating nodes from the bottom up, leading to the result:

    x = 10 - ( 5 - 3 ) = 10 - 2 = 8

Using right recursion to define the rules for addition and subtraction made those operations right
associative. It is left as an exercise to show that the abstract syntax tree produced by the original
(left-recursive) grammar rules would lead to the evaluation x = (10 - 5) - 3 = 2.
This result can be summarized as:
Left Recursion corresponds to left associativity of operators in a production
Right Recursion corresponds to right associativity of operators in a production.
The order in which rules refer to other rules also affects the precedence of operations. In the
above grammar rules, Assignment is defined in terms of Expression, Expression in terms of Term,
and Term in terms of Factor. As a result, a Factor will be evaluated before the Term that contains it,
and a Term before the Expression that contains it. This gives * and / (which combine Factors)
higher precedence than + and - (which combine Terms).
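To make the connection between tree shape and evaluation order concrete, here is a C sketch of a
bottom-up (postorder) evaluator over the node layout sketched earlier; the symbol-table lookup is a
hypothetical stub, not a real library function:

    /* Hypothetical lookup of an identifier's current value; implementation not shown. */
    double lookup_value(const struct SymbolEntry *entry);

    /* Evaluate an expression subtree: children first, then the operator (postorder). */
    double eval(const struct AstNode *n) {
        switch (n->kind) {
        case NODE_NUMBER:     return n->value;
        case NODE_IDENTIFIER: return lookup_value(n->symbol);
        case NODE_OPERATOR: {
            double a = eval(n->left);    /* left subtree is evaluated first ...      */
            double b = eval(n->right);   /* ... then the right subtree ...           */
            switch (n->op) {             /* ... and finally the operator itself      */
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default : return a / b;
            }
        }
        default:                         /* NODE_ASSIGN: value of the right-hand side */
            return eval(n->right);
        }
    }

Applied to the tree for x = 10 - (5 - 3) above, eval computes the inner subtraction first and returns 8,
exactly as described.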
Extended BNF Notation
Extensions have been added to BNF to simplify writing of alternatives and reduce the need for
recursive definitions. EBNF includes the following notation:
    Notation    Meaning                               Example
    ( a|b|c )   any one of a, b, or c                 Operator ::= ( + | - | * | / )
    { a }       zero or more occurrences of the       Expression ::= Term { Operator Term }
                item in braces
    [ a ]       the item in brackets is optional      Term ::= [-] Number
                (it can occur 0 or 1 time)
EBNF replaces explicit recursion (a rule using its own left-hand-side symbol in its production)
with repetition using the { .. } notation. Here’s a comparison for a simple arithmetic grammar:
BNF:
    expression ::= expression + term
                 | expression - term
                 | term
    term       ::= term * factor
                 | term / factor
                 | factor
    factor     ::= ( expression )
                 | ID
                 | NUMBER

EBNF:
    expression ::= term { (+|-) term }
    term       ::= factor { (*|/) factor }
    factor     ::= '(' expression ')'
                 | ID
                 | NUMBER
Notice that the EBNF rules don't explicitly use the left-side nonterminal in the rule definition, but
there is still some implicit recursion (factor refers back to expression). In the rule for factor, the
parentheses representing actual tokens are placed in quotation marks to distinguish them from the
parenthesis metasymbols (EBNF notation) indicating a group of alternatives.
BNF and EBNF are equally powerful for representing grammar rules. The choice of notation
may be dictated by implementation: the parser-generating programs yacc, bison, and CUP
require input rules in BNF style, while EBNF is more suitable for implementing a parser using
the recursive descent algorithm. EBNF can also eliminate some ambiguity in BNF rules.
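To show why EBNF maps so directly onto recursive descent, here is a C sketch of parsing functions for
the EBNF expression grammar above. Each nonterminal becomes a function and each { ... } repetition
becomes a loop; for brevity, NUMBER is limited to a single digit and the ID alternative is omitted:

    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Recursive descent sketch for the EBNF grammar:
     *   expression ::= term { (+|-) term }
     *   term       ::= factor { (*|/) factor }
     *   factor     ::= '(' expression ')' | NUMBER      (NUMBER = one digit here)
     */
    static const char *p;                     /* next unread input character */

    static double expression(void);

    static double factor(void) {
        if (*p == '(') {                      /* factor ::= '(' expression ')' */
            p++;
            double v = expression();
            if (*p == ')') p++;
            else { fprintf(stderr, "missing )\n"); exit(1); }
            return v;
        }
        if (isdigit((unsigned char)*p))       /* factor ::= NUMBER */
            return *p++ - '0';
        fprintf(stderr, "unexpected character '%c'\n", *p);
        exit(1);
    }

    static double term(void) {                /* term ::= factor { (*|/) factor } */
        double v = factor();
        while (*p == '*' || *p == '/') {
            char op = *p++;
            double r = factor();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    static double expression(void) {          /* expression ::= term { (+|-) term } */
        double v = term();
        while (*p == '+' || *p == '-') {
            char op = *p++;
            double r = term();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    int main(void) {
        p = "2+3*(4-1)";
        printf("%g\n", expression());         /* prints 11 */
        return 0;
    }

Because each loop combines the value accumulated so far with the next operand, the operators come out
left associative, matching the behavior of the left-recursive BNF rules.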
Additional EBNF Notation
Several variations on EBNF notation exist. Some common constructs are:
    Notation    Meaning                               Example
    symbolopt   subscript "opt" in place of [...]     attribute ::= finalopt datatype ID ;
                for an optional part
    { a }+      one or more occurrences of the        StatementBlock ::= begin { Statement ; }+ end
                item in braces
Regular Expressions and Tokenizing
to be added: see lecture slides on lexemes, tokens, and regular expressions
Resources
The Java Language Specification, http://java.sun.com/docs/books/jls/, makes extensive use of
BNF.