COP4020 Programming Languages Syntax analysis

Prof. Xin Yuan
Overview




Syntax analysis overview
Grammar and context-free grammar
Grammar derivations
Parse trees
5/29/2016
COP4020 Spring 2014
Syntax analysis

Syntax analysis is done by the parser. The parser:
Detects whether the program is written following the grammar rules and reports syntax errors.
Produces a parse tree from which intermediate code can be generated.
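[Figure: the source program feeds the lexical analyzer, which returns a token to the parser on each request for a token. The parser produces a parse tree for the rest of the front end, which emits intermediate code. The diagram also includes a symbol table.]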

The syntax of a programming language is described by a context-free grammar, usually written in Backus-Naur Form (BNF).
Context-free grammars are similar to regular expressions as a way of specifying languages, but more general.
A grammar gives a precise syntactic specification of a language.
For some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically.
A compiler based on a grammatical description of a language is more easily maintained and updated.
Grammars

A grammar has four components, G = (N, T, P, S):
T is a finite set of tokens (terminal symbols).
N is a finite set of nonterminals.
P is a finite set of productions of the form $\alpha \to \beta$, where $\alpha \in (N \cup T)^* N (N \cup T)^*$ and $\beta \in (N \cup T)^*$.
S is a special nonterminal that is the designated start symbol.
Example

Grammar for expressions (T=?, N=?, P=?, S=?)

Productions:
E -> E+E
E -> E-E
E -> (E)
E -> -E
E -> num
E -> id

How does this correspond to a language?
Informally, you can expand the nonterminals using the productions until none remain: the resulting sentence (a sequence of tokens) is recognized by the grammar.
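As a rough sketch of how the four components can be read off the productions above, the snippet below stores the expression grammar as plain Python data. The names `PRODUCTIONS`, `START`, `nonterminals`, and `terminals` are illustrative assumptions, not part of the course material.

```python
# Productions of the expression grammar as (left-hand side, right-hand side) pairs.
PRODUCTIONS = [
    ("E", ["E", "+", "E"]),
    ("E", ["E", "-", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["-", "E"]),
    ("E", ["num"]),
    ("E", ["id"]),
]
START = "E"  # the designated start symbol S

# N: every symbol that appears on a left-hand side.
nonterminals = {lhs for lhs, _ in PRODUCTIONS}
# T: every right-hand-side symbol that is not a nonterminal.
terminals = {sym for _, rhs in PRODUCTIONS for sym in rhs} - nonterminals

print("N =", sorted(nonterminals))
print("T =", sorted(terminals))
```

Under this encoding, N contains only E, and T contains the tokens +, -, (, ), num, and id.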
Language recognized by a grammar

We say "aAb derives awb in one step", denoted $aAb \Rightarrow awb$, if A->w is a production and a and b are arbitrary strings of terminal or nonterminal symbols.

We say $a_1$ derives $a_m$ if $a_1 \Rightarrow a_2 \Rightarrow \dots \Rightarrow a_m$, written $a_1 \Rightarrow^{*} a_m$.

The language L(G) defined by G is the set of strings of terminals w such that $S \Rightarrow^{*} w$.
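The one-step relation can be checked mechanically. The following is a minimal sketch, assuming sentential forms are Python lists of symbols and productions are (left-hand side, right-hand side) pairs. The function names `derives_in_one_step` and `is_derivation` are hypothetical, and the expression grammar is reused only as sample data.

```python
# Sample productions (the expression grammar), as (lhs, rhs) pairs.
PRODUCTIONS = [
    ("E", ["E", "+", "E"]),
    ("E", ["E", "-", "E"]),
    ("E", ["(", "E", ")"]),
    ("E", ["-", "E"]),
    ("E", ["num"]),
    ("E", ["id"]),
]

def derives_in_one_step(before, after, productions):
    """True if `before` => `after` by rewriting exactly one nonterminal occurrence."""
    for i, symbol in enumerate(before):
        for lhs, rhs in productions:
            # Splice the right-hand side in place of the symbol at position i.
            if symbol == lhs and before[:i] + rhs + before[i + 1:] == after:
                return True
    return False

def is_derivation(forms, productions):
    """True if each form derives the next one in a single step."""
    return all(
        derives_in_one_step(a, b, productions)
        for a, b in zip(forms, forms[1:])
    )

steps = [["E"], ["E", "+", "E"], ["E", "+", "id"], ["id", "+", "id"]]
print(is_derivation(steps, PRODUCTIONS))
```

A sequence that passes `is_derivation` and starts at the start symbol shows that its last form is derivable from S in zero or more steps.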
Example
A -> aA
A -> bA
A -> a
A -> b

G = (N, T, P, S)
N=?
T=?
P=?
S=?
What is the language recognized by this grammar?
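One way to explore the question is to enumerate short sentences by brute force. The sketch below expands sentential forms breadth-first and keeps the terminal-only strings. It assumes single-character symbols, and the function name `sentences_up_to` is hypothetical. Pruning by length is safe here only because no production shrinks a form.

```python
from collections import deque
from itertools import product

# The example grammar; each right-hand side is written as a string of symbols.
PRODUCTIONS = {"A": ["aA", "bA", "a", "b"]}

def sentences_up_to(max_len, start="A"):
    """All terminal strings of length <= max_len derivable from `start`."""
    found, seen, queue = set(), {start}, deque([start])
    while queue:
        form = queue.popleft()
        # Position of the first nonterminal, or None if the form is a sentence.
        i = next((k for k, c in enumerate(form) if c in PRODUCTIONS), None)
        if i is None:
            found.add(form)
            continue
        for rhs in PRODUCTIONS[form[i]]:
            new_form = form[:i] + rhs + form[i + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return found

# Every non-empty string over {a, b} with at most three symbols.
expected = {"".join(p) for n in range(1, 4) for p in product("ab", repeat=n)}
print(sentences_up_to(3) == expected)
```

For lengths up to three, the enumeration matches the set of all non-empty strings over a and b, which suggests (but does not prove) the general answer.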


Chomsky Hierarchy (classification of grammars)

A grammar is said to be
regular if it is right-linear, where each production in P has the form $A \to w$ or $A \to wB$ (here A and B are nonterminals and w is a terminal), or left-linear
context-free if each production in P is of the form $A \to \alpha$, where $A \in N$ and $\alpha \in (N \cup T)^*$
context-sensitive if each production in P is of the form $\alpha \to \beta$ where $|\alpha| \le |\beta|$
unrestricted if each production in P is of the form $\alpha \to \beta$ where $\alpha \ne \varepsilon$

All languages recognized by regular expressions can be represented by a regular grammar.

A context-free grammar has four components, G = (N, T, P, S):
T is a finite set of tokens (terminal symbols).
N is a finite set of nonterminals.
P is a finite set of productions of the form $A \to \alpha$, where $A \in N$ and $\alpha \in (N \cup T)^*$.
S is a special nonterminal that is the designated start symbol.
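Context-free grammars are more expressive than regular expressions. Consider the language {ab, aabb, aaabbb, …}, in which n a's are followed by exactly n b's. No regular expression describes it, but the context-free grammar S -> aSb | ab does.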
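To make the comparison concrete, here is a small sketch of a recognizer for {ab, aabb, aaabbb, …}. It assumes the grammar S -> aSb | ab, which is one natural grammar for this language and is not taken from the slides. The function name `matches_S` is hypothetical.

```python
def matches_S(s):
    """Recognize strings derivable from S -> a S b | a b."""
    # Base production S -> a b.
    if s == "ab":
        return True
    # Recursive production S -> a S b: strip one 'a' and one 'b', then recurse.
    return len(s) >= 4 and s[0] == "a" and s[-1] == "b" and matches_S(s[1:-1])

for candidate in ["ab", "aabb", "aaabbb", "aab", "abab", ""]:
    print(repr(candidate), matches_S(candidate))
```

The recursion mirrors the nesting in the grammar: each level matches one a with one b, which is the counting that regular expressions cannot do.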
BNF Notation (another form of context-free grammar)

Backus-Naur Form (BNF) notation for productions:
<nonterminal> ::= sequence of (non)terminals
where
Each terminal in the grammar is a token.
A <nonterminal> defines a syntactic category.
The symbol | denotes alternative forms in a production.
The special symbol ε denotes the empty string.
Example
<Program>        ::= program <id> ( <id> <More_ids> ) ; <Block> .
<Block>          ::= <Variables> begin <Stmt> <More_Stmts> end
<More_ids>       ::= , <id> <More_ids>
                   | ε
<Variables>      ::= var <id> <More_ids> : <Type> ; <More_Variables>
                   | ε
<More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables>
                   | ε
<Stmt>           ::= <id> := <Exp>
                   | if <Exp> then <Stmt> else <Stmt>
                   | while <Exp> do <Stmt>
                   | begin <Stmt> <More_Stmts> end
<More_Stmts>     ::= ; <Stmt> <More_Stmts>
                   | ε
<Exp>            ::= <num>
                   | <id>
                   | <Exp> + <Exp>
                   | <Exp> - <Exp>
Derivations

From a grammar we can derive strings (= sequences of tokens).
Derivation is the opposite process of parsing.
Starting with the grammar's designated start symbol, in each derivation step a nonterminal is replaced by a right-hand side of a production for that nonterminal.
A sentence (in the language) is a sequence of terminals that can be derived from the start symbol.
A sentential form is a sequence of terminals and nonterminals that can be derived from the start symbol.
Example Derivation
<expression> ::= identifier
               | unsigned_integer
               | - <expression>
               | ( <expression> )
               | <expression> <operator> <expression>
<operator>   ::= + | - | * | /

Each line below is a sentential form, and each step replaces a nonterminal with one of its productions:
<expression>    (start symbol)
=> <expression> <operator> <expression>
=> <expression> <operator> identifier
=> <expression> + identifier
=> <expression> <operator> <expression> + identifier
=> <expression> <operator> identifier + identifier
=> <expression> * identifier + identifier
=> identifier * identifier + identifier
The final string is the yield.
Rightmost versus Leftmost Derivations

When the nonterminal on the far right (left) in a sentential form is replaced in each derivation step, the derivation is called rightmost (leftmost).

Rightmost derivation (the rightmost nonterminal is replaced):
<expression>
=> <expression> <operator> <expression>
=> <expression> <operator> identifier

Leftmost derivation (the leftmost nonterminal is replaced):
<expression>
=> <expression> <operator> <expression>
=> identifier <operator> <expression>
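The two derivation orders differ only in which nonterminal is chosen for replacement. The sketch below illustrates this with a single helper. The name `step` and the dictionary encoding of the grammar are assumptions made for the example.

```python
# The expression grammar from the derivation example, as lists of symbols.
GRAMMAR = {
    "<expression>": [
        ["identifier"],
        ["unsigned_integer"],
        ["-", "<expression>"],
        ["(", "<expression>", ")"],
        ["<expression>", "<operator>", "<expression>"],
    ],
    "<operator>": [["+"], ["-"], ["*"], ["/"]],
}

def step(form, rhs, leftmost):
    """Replace the leftmost (or rightmost) nonterminal in `form` with `rhs`."""
    positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
    if not positions:
        raise ValueError("no nonterminal left to replace")
    i = positions[0] if leftmost else positions[-1]
    if rhs not in GRAMMAR[form[i]]:
        raise ValueError(f"{form[i]} has no production with right-hand side {rhs}")
    return form[:i] + rhs + form[i + 1:]

start = ["<expression>"]
binary = ["<expression>", "<operator>", "<expression>"]

rightmost = step(step(start, binary, leftmost=False), ["identifier"], leftmost=False)
leftmost = step(step(start, binary, leftmost=True), ["identifier"], leftmost=True)
print(" ".join(rightmost))
print(" ".join(leftmost))
```

After two steps the rightmost derivation has replaced the last `<expression>` and the leftmost derivation the first, matching the two sequences shown above.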
A Language Generated by a Grammar

A context-free grammar is a generator of a context-free language.
The language defined by a grammar G is the set of all strings of terminals w that can be derived from the start symbol S:
$L(G) = \{ w \mid S \Rightarrow^{*} w \}$

<S> ::= a | '(' <S> ')'
L(G) = { a, (a), ((a)), (((a))), … }

<S> ::= <B> | <C>
<B> ::= <C> + <C>
<C> ::= 0 | 1
L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
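Because the second grammar has no recursion, its language is finite and can be computed directly. The following sketch does this under the assumption that the grammar is non-recursive. The function name `language` is hypothetical, and the function would not terminate on a recursive grammar such as the first one.

```python
from itertools import product

# The second example grammar, with angle brackets dropped for brevity.
GRAMMAR = {
    "S": [["B"], ["C"]],
    "B": [["C", "+", "C"]],
    "C": [["0"], ["1"]],
}

def language(symbol):
    """Set of terminal strings derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in GRAMMAR:
        return {symbol}  # a terminal derives only itself
    strings = set()
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol on the right-hand side.
        for parts in product(*(language(s) for s in rhs)):
            strings.add("".join(parts))
    return strings

print(sorted(language("S")))
```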
Parse Trees

A parse tree depicts the end result of a derivation.
The internal nodes are the nonterminals.
The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production.
The leaves are the terminals.

[Figure: parse tree for identifier * identifier + identifier. The root <expression> has the children <expression> <operator> <expression>; the left child is itself expanded to <expression> <operator> <expression>, giving identifier * identifier, the root operator is +, and the right child is identifier.]
Parse Trees
<expression>
=> <expression> <operator> <expression>
=> <expression> <operator> identifier
=> <expression> + identifier
=> <expression> <operator> <expression> + identifier
=> <expression> <operator> identifier + identifier
=> <expression> * identifier + identifier
=> identifier * identifier + identifier

[Figure: the parse tree built by this derivation. The root <expression> has the children <expression> <operator> <expression>; the left child expands to identifier * identifier, the root operator is +, and the right child is identifier.]
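A parse tree of this kind can be modelled with a small recursive data structure. The sketch below builds the tree for identifier * identifier + identifier and reads its yield off the leaves. The class `Node` and the helpers `E`, `Op`, `ident`, and `tree_yield` are illustrative names, not part of the course material.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    symbol: str                          # a nonterminal or a terminal
    children: list = field(default_factory=list)

def tree_yield(node):
    """Concatenate the leaves (terminals) from left to right."""
    if not node.children:
        return [node.symbol]
    return [tok for child in node.children for tok in tree_yield(child)]

def E(*children):
    return Node("<expression>", list(children))

def Op(symbol):
    return Node("<operator>", [Node(symbol)])

def ident():
    return E(Node("identifier"))

# Root: <expression> <operator> <expression>, where the left child holds the '*'.
tree = E(E(ident(), Op("*"), ident()), Op("+"), ident())
print(" ".join(tree_yield(tree)))
```

Internal nodes carry nonterminals and have children; leaves carry terminals and have none, so the yield is exactly the derived sentence.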
Ambiguity


There is another parse tree for the same grammar and input: the grammar is ambiguous.
This parse tree is not desired, since it appears that + has precedence over *.

[Figure: the alternative parse tree for identifier * identifier + identifier. The root <expression> has the children <expression> <operator> <expression>; the left child is identifier, the root operator is *, and the right child expands to identifier + identifier.]
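Ambiguity can be observed by counting parse trees. The sketch below counts the distinct parse trees that the expression grammar from the derivation example assigns to a token sequence, by trying every operator position as the root. The function name `count_parse_trees` is hypothetical, and the counting approach is an illustration rather than a method from the slides.

```python
from functools import lru_cache

OPERATORS = {"+", "-", "*", "/"}

def count_parse_trees(tokens):
    """Number of parse trees deriving `tokens` from <expression>."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Counts trees for the slice tokens[i:j].
        if j - i < 1:
            return 0
        if j - i == 1:
            return 1 if tokens[i] in ("identifier", "unsigned_integer") else 0
        total = 0
        if tokens[i] == "-":                          # - <expression>
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":  # ( <expression> )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                  # <expression> <operator> <expression>
            if tokens[k] in OPERATORS:
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees("identifier * identifier + identifier".split()))
```

A result greater than one for any input shows that the grammar is ambiguous. For the three-operand input above there are two trees, one rooted at + and one rooted at *.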
Ambiguous Grammars



Ambiguous grammar: more than one distinct derivation of a string results in different parse trees.
A programming language construct should have only one parse tree to avoid misinterpretation by a compiler.
For expression grammars, associativity and precedence of operators are used to disambiguate:
<expression> ::= <term> | <expression> <add_op> <term>
<term>       ::= <factor> | <term> <mult_op> <factor>
<factor>     ::= identifier | unsigned_integer | - <factor> | ( <expression> )
<add_op>     ::= + | -
<mult_op>    ::= * | /
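The layered grammar above maps naturally onto a hand-written parser with one function per nonterminal. The following is a minimal sketch, not the course's implementation. It replaces the left-recursive productions with loops (a standard substitution that keeps operators left-associative), returns nested tuples instead of a full parse tree, and treats any Python identifier or digit string as an operand. All function names are hypothetical.

```python
def parse(tokens):
    """Parse a token list into nested tuples such as ('+', ('*', 'a', 'b'), 'c')."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():            # <expression> ::= <term> { <add_op> <term> }
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    def term():                  # <term> ::= <factor> { <mult_op> <factor> }
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                # identifier | unsigned_integer | - <factor> | ( <expression> )
        tok = peek()
        if tok == "-":
            advance()
            return ("neg", factor())
        if tok == "(":
            advance()
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is not None and (tok.isidentifier() or tok.isdigit()):
            return advance()
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse("a * b + c".split()))
print(parse("a + b * c".split()))
```

Because `term` is called from `expression`, multiplication always ends up deeper in the result than addition, which is how the grammar encodes precedence.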
Ambiguous if-then-else: the "Dangling Else"

A classical example of an ambiguous grammar is the pair of productions for if-then-else:
<stmt> ::= if <expr> then <stmt>
         | if <expr> then <stmt> else <stmt>
For example, in "if e1 then if e2 then s1 else s2" the else can be attached to either if.
It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design.
Ada uses different syntax to avoid ambiguity:
<stmt> ::= if <expr> then <stmt> end if
         | if <expr> then <stmt> else <stmt> end if
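The dangling else can be demonstrated by counting parse trees for a nested conditional under the first, ambiguous pair of productions. In the sketch below, the token `e` stands in for any `<expr>` and `s` for a simple statement. Both stand-ins, the extra base case `<stmt> ::= s`, and the function name `count_stmt_parses` are assumptions added so that the example terminates.

```python
from functools import lru_cache

def count_stmt_parses(tokens):
    """Number of parse trees deriving `tokens` from <stmt> in the ambiguous grammar."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        if tokens[i:j] == ("s",):       # assumed base case: <stmt> ::= s
            return 1
        if j - i < 4 or tokens[i:i + 3] != ("if", "e", "then"):
            return 0
        # <stmt> ::= if <expr> then <stmt>
        total = count(i + 3, j)
        # <stmt> ::= if <expr> then <stmt> else <stmt>, for every candidate 'else'
        for k in range(i + 4, j - 1):
            if tokens[k] == "else":
                total += count(i + 3, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_stmt_parses("if e then if e then s else s".split()))
```

The nested conditional has two parse trees, one for each if that the else could belong to, while a conditional with a single if has exactly one.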