CSE 305
Lecture – Parsing
Feb 11, 2016
Program Statements
stmt   → assign | cond | loop | cmpd
assign → var = expr ;
expr   → aexp | bexp
cond   → if '(' expr ')' stmt [else stmt]
loop   → while '(' expr ')' stmt
cmpd   → '{' stmts '}'
stmts  → { stmt }
Ambiguity in Grammars
A grammar G is said to be ambiguous if some
sentence in L(G) has at least two parse trees.
If one grammar for a language L is ambiguous,
it does not follow that every grammar for L is
ambiguous. If every grammar for L were ambiguous,
L would be called an inherently ambiguous language;
such languages are not common in practice.
Unambiguous grammars are important for parsing.
Ambiguity Example

if (exp1)
    while (exp2)
        if (exp3)
            stmt1
        else stmt2

The preferred parse attaches the else to the nearest if,
i.e., if (exp3). The grammar, however, also permits a parse
in which the else matches the outer if (exp1), so this
sentence has two parse trees.
Read Sebesta on Ambiguity
The “dangling else” ambiguity is one of the classic
examples of an ambiguous grammar. Sebesta has
more on this example, including how to make the
grammar unambiguous.
Unambiguous grammars are desirable for parsing,
since the parse tree guides further analysis, including
code generation.
Ambiguity in Expressions

assign → var = expr
expr   → aexp | bexp
aexp   → …
bexp   → …

Ambiguous, because both aexp and bexp generate
id, (id), ((id)), etc.

We can merge the aexp and bexp grammars, as follows:

assign → var = expr
expr   → term | term op1 expr
term   → fact | fact op2 term
fact   → num | true | false | '(' expr ')'
op1    → + | - | '||'
op2    → * | / | '&&'
Need for Attributes
The merged expression grammar is unambiguous, but it
generates many incorrectly typed expressions, such as:
10 && 20
true * 101
10*20 || false - 30
…
We need to constrain the grammar through the use of
attributes and semantic clauses, to avoid over-generation.
The needed attribute here is type information for every
expression and sub-expression.
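As a sketch of how a synthesized type attribute rules out such over-generation, the code below computes a type ("int", "bool", or "error") bottom-up over a small expression tree, following the op1/op2 typing rules of the next slides. All names here (TypeAttr, Expr, Num, Bool, Bin, typeOf) are illustrative, not from the lecture.

```java
// Sketch: a synthesized "type" attribute computed bottom-up over an
// expression tree. Names are illustrative.
public class TypeAttr {
    interface Expr {}
    record Num(int v) implements Expr {}
    record Bool(boolean b) implements Expr {}
    record Bin(String op, Expr l, Expr r) implements Expr {}

    // Returns "int", "bool", or "error" -- the synthesized attribute t.
    static String typeOf(Expr e) {
        if (e instanceof Num)  return "int";
        if (e instanceof Bool) return "bool";
        Bin b = (Bin) e;
        String lt = typeOf(b.l()), rt = typeOf(b.r());
        // op1/op2 rules: arithmetic operators need "int", '&&'/'||' need "bool"
        String need = "+-*/".contains(b.op()) ? "int" : "bool";
        return (lt.equals(need) && rt.equals(need)) ? need : "error";
    }
}
```

For example, typeOf reports "int" for 10 * 20 but "error" for true * 101.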
The next two slides were not
presented in class, but are
included here for completeness,
since reference was made to
attributes and attribute
grammars (inherited and
synthesized attributes) during the
lecture.
Using Attributes and Semantic Rules
to Specify Type-Correctness

(writing X_t for nonterminal X with attribute t)

SYNTAX RULE                   SEMANTICS RULE
op1_t  → ( + | - )            t = "int"
op1_t  → '||'                 t = "bool"
op2_t  → ( * | / )            t = "int"
op2_t  → '&&'                 t = "bool"
fact_t → num                  t = "int"
fact_t → true                 t = "bool"
fact_t → false                t = "bool"
fact_t → '(' expr_t2 ')'      t = t2

Here t is a synthesized attribute.
Inherited and Synthesized Attributes

ATTRIBUTE GRAMMAR                        TYPE OF ATTRIBUTE
program     → decls_ST stmts_ST          ST is inherited by decls and stmts
type_t      → int                        t is synthesized
type_t      → real                       t is synthesized
type_t      → bool                       t is synthesized
decls_ST    → type_t idlist_ST,t ;       ST is inherited; t is synthesized by type
                                         and inherited by idlist
idlist_ST,t → id_i1 { , id_i2 }          ST and t are inherited;
                                         i1 and i2 are synthesized
Broader Context for Parsing:
Compiler Phases

A compiler translates Source Code into Target Code in two
broad stages, each with its own phases:

Analysis:  lexical → syntactic → semantic
Synthesis: intermediate code generation → optimization → code generation
Compiler Structure

1. Lexical:   translates the sequence of characters
              into a sequence of 'tokens'
2. Syntactic: translates the sequence of tokens into a
              'parse tree'; also builds the symbol table
3. Semantic:  traverses the parse tree and performs
              global checks, e.g. type-checking,
              actual-parameter correspondence
Compiler Structure (cont'd)

4. Intermediate Code Generation:
              traverses the parse tree and generates
              'abstract machine code', e.g. triples, quadruples
5. Optimization:
              performs control- and data-flow analysis;
              removes redundant operations, moves loop-invariant
              operations outside the loop
6. Code Generation:
              translates intermediate code to actual
              machine code
A Simple Example

Source Code:

// declarations
// not shown
f = 1;
i = 1;
while (i < n) {
    i = i + 1;
    f = f * i;
}
print(f);

The lexer converts the source into a table of Lexical Tokens
(a token code paired with a value for each lexeme: id, op,
int, punctuation, keyword, ...), the parser builds the Parse
Tree, and the code generator emits the Target Code:

    LD R1, #1
    ST R1, Mf
    LD R2, #1
    ST R2, Mi
    LD R3, Mn
L:  CMP R2, R3
    JF Out
    INC R2
    ST R2, Mi
    MUL R1, R2
    ST R1, Mf
    JMP L
Out: Print Mf
Java Bytecodes

public static int fact(int n) {
    // n >= 0;
    int f = 1;
    int i = 1;
    while (i < n) {
        i = i + 1;
        f = f * i;
    }
    return f;
}

cmd> javap -c Factorial
Lexical Analyzer (lex)
• Scans the input file character by character, skips over
comments and white space (except in Python where
indentation is important).
• Two main outputs: token and value
• Token is an integer code for each lexical class:
identifiers, numbers, keywords, operators, punctuation
• Value is the actual instance:
• for identifier, it is the string;
• for numbers, it is their numeric value;
• for keywords, operators and punctuation, the
token code = token value
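The token/value scheme above can be sketched as follows. The ID, INT_LIT, and ADD_OP codes follow the Token class shown later in these slides; the scanning logic itself is an illustrative assumption, not the lecture's lexer.

```java
// Minimal lexer sketch: produces (token, value) pairs as described above.
import java.util.ArrayList;
import java.util.List;

public class MiniLexer {
    public static final int ID = 14, INT_LIT = 15, ADD_OP = 3;

    // Returns a list of {token, value} pairs; for an identifier the value
    // is an index into the names list, for a number its numeric value,
    // and for an operator the token code itself (token code = token value).
    public static List<int[]> lex(String input, List<String> names) {
        List<int[]> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) { i++; }           // skip white space
            else if (Character.isDigit(c)) {                  // number -> numeric value
                int v = 0;
                while (i < input.length() && Character.isDigit(input.charAt(i)))
                    v = 10 * v + (input.charAt(i++) - '0');
                out.add(new int[]{INT_LIT, v});
            } else if (Character.isLetter(c)) {               // identifier -> its string
                int s = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                names.add(input.substring(s, i));
                out.add(new int[]{ID, names.size() - 1});
            } else if (c == '+') {                            // operator: code = value
                out.add(new int[]{ADD_OP, ADD_OP}); i++;
            } else i++;                                       // ignore anything else
        }
        return out;
    }
}
```

For example, lexing "x + 42" yields three pairs: (ID, index of "x"), (ADD_OP, ADD_OP), (INT_LIT, 42).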
Clarifying the Lexical-Syntax
Analyzer Interaction
• Although the diagram shows the lexical analyzer
feeding its output to the syntax analyzer, in practice,
the syntax analyzer calls the lexical analyzer
repeatedly.
• At each call, the lexical analyzer prepares the next
token for the syntax analyzer.
• The lexical analyzer would not need to create an
explicit ‘Lexical Token’ table, as shown in the
previous diagram, since the syntax analyzer only
needs to work with one token at a time.
Design of a Simple Parser
• We will see how to design a top-down
  parser for a simple language.
• The next few slides give the structure of
  the lexical analyzer; some of the details
  and terminology are taken from Sebesta.
• Then we will see how to design the
  parser.
Token Codes
class Token {
    public static final int SEMICOLON   = 0;
    public static final int COMMA       = 1;
    public static final int NOT_EQ      = 2;
    public static final int ADD_OP      = 3;
    public static final int SUB_OP      = 4;
    public static final int MULT_OP     = 5;
    public static final int DIV_OP      = 6;
    public static final int ASSIGN_OP   = 7;
    public static final int GREATER_OP  = 8;
    public static final int LESSER_OP   = 9;
    public static final int LEFT_PAREN  = 10;
    public static final int RIGHT_PAREN = 11;
    public static final int LEFT_BRACE  = 12;
    public static final int RIGHT_BRACE = 13;
    public static final int ID          = 14;
    public static final int INT_LIT     = 15;
    public static final int KEY_IF      = 16;
    public static final int KEY_INT     = 17;
    public static final int KEY_ELSE    = 18;
    public static final int KEY_WHILE   = 19;
    public static final int KEY_END     = 20;
}
Lexer: Lexical Analyzer

public class Lexer {
    static private Buffer buffer = new Buffer(…);
    static public int nextToken;  // code
    static public int intValue;   // value
    …
    public static int lex() {
        …
        // sets nextToken and intValue each time it is called
        …
    }
}
Parsing Strategies

There are two broad strategies for parsing:
* top-down parsing (a.k.a. recursive-descent parsing)
* bottom-up parsing

Top-down parsing is less powerful than bottom-up parsing,
but it is preferred when manually constructing a parser.
Tools such as YACC automatically construct a bottom-up parser
from a grammar (JavaCC, by contrast, generates a top-down
parser), but an automatically constructed bottom-up parser
is hard to understand.
Top-down Parsing

Grammar*:
E → E + T
E → T
T → T * F
T → F
F → id
F → ( E )

[Figure: step-by-step top-down construction of the parse
tree for a + b * c, expanding from the root E.]

Choosing the correct expansion at each step is the issue.

* This grammar is not suited for top-down parsing - will discuss later.
Bottom-up Parsing

Grammar:
E → E + T
E → T
T → T * F
T → F
F → id
F → ( E )

[Figure: step-by-step bottom-up construction of the parse
tree for a + b * c, reducing from the leaves.]

Choosing whether to 'shift' or 'reduce' and, if the latter,
choosing the correct reduction are the issues.
Deterministic Parsing

The term 'deterministic parsing' means that the parser
can, at each step, correctly decide which rule to use
without any guesswork. This requires some peeking
into (or looking ahead in) the input. For example:

stmt → assign | cond | loop | cmpd

For a top-down parser to decide which of the above four
cases applies, it needs to look at the next symbol, or
"token", in the input, which here is one of:

identifier, if, while, {
Constructing a Top-down Parser
(one void procedure per nonterminal)

Case 1: Alternation on RHS of rule, e.g.,

stmt → assign | cond | loop | cmpd

Parser code (token names as defined in the Token class):

void stmt() {
    switch (Lexer.nextToken) {
        case Token.ID:         { assign(); break; }
        case Token.KEY_IF:     { cond();   break; }
        case Token.KEY_WHILE:  { loop();   break; }
        case Token.LEFT_BRACE: { cmpd();   break; }
        default: break;
    }
}
Constructing a Top-down Parser (cont'd)

Case 2: Sequencing on RHS of a rule, e.g.,

decl → type idlist

Parser code:

void decl() {
    type();
    idlist();
}
Constructing a Top-down Parser (cont'd)

Case 3: Terminal Symbols on RHS of a rule:

factor → num | '(' expr ')'

Parser code:

void factor() {
    switch (Lexer.nextToken) {
        case Token.INT_LIT:
            int i = Lexer.intValue;
            Lexer.lex();
            break;
        case Token.LEFT_PAREN:
            Lexer.lex();
            expr();
            if (Lexer.nextToken == Token.RIGHT_PAREN)
                Lexer.lex();
            else
                syntaxerror("missing ')'");
            break;
        default: break;
    }
}
Constructing a Top-down Parser (cont'd)

Case 4: Left-Factoring the RHS of a rule:

expr → term | term + expr

Parser code:

void expr() {
    term();
    if (Lexer.nextToken == Token.ADD_OP) {
        Lexer.lex();
        expr();
    }
}
Left-recursion is not compatible
with Top-down Parsing

Problem: Left-recursive Rule

expr → term | expr + term

Problem:
We cannot decide which alternative to use,
even with lookahead.

Reason:
The recursion in 'expr' must eventually end
in 'term', so both alternatives have the
same set of leading terminal symbols.
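A standard remedy in compiler texts is to rewrite the left-recursive rule using repetition, expr → term { + term }, which a top-down parser implements as a loop rather than a recursive call. The mini recognizer below is an illustrative sketch over a cut-down grammar (numbers and '+'), not the lecture's parser.

```java
// Sketch: left recursion (expr -> expr + term) rewritten as repetition
// (expr -> term { + term }), parsed with a loop instead of recursion.
public class NoLeftRec {
    private final String in;
    private int pos = 0;

    public NoLeftRec(String in) { this.in = in.replace(" ", ""); }

    // Recognizer: true iff the whole input is a valid expr.
    public boolean parse() { return expr() && pos == in.length(); }

    private boolean expr() {              // expr -> term { + term }
        if (!term()) return false;
        while (pos < in.length() && in.charAt(pos) == '+') {
            pos++;                        // consume '+'
            if (!term()) return false;
        }
        return true;
    }

    private boolean term() {              // term -> digit { digit }
        int start = pos;
        while (pos < in.length() && Character.isDigit(in.charAt(pos))) pos++;
        return pos > start;
    }
}
```

The loop consumes any number of "+ term" continuations, so no rule ever begins by calling itself, which is exactly what top-down parsing requires.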
Recognizer vs Parser
Terminology: A “recognizer” only outputs a yes/no answer
indicating whether an input string belongs to L(G), the
language defined by a grammar G.
A “parser” builds upon the basic structure provided by the
recognizer, enhancing it with attributes and semantic
actions so as to produce additional output.
Adding Attributes to the Parser

In a top-down parser, the attribute information is
incorporated as follows:
- Inherited attributes of a grammar rule become
  input parameters of the corresponding
  procedure.
- Synthesized attributes become output
  parameters of the procedure.

NOTE: Java does not have output parameters,
hence we explain how synthesized attributes are
represented in a Java setting.
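One natural Java representation returns the synthesized attribute as the procedure's result and passes the inherited attribute (here, a symbol table) as a parameter. The sketch below uses illustrative names (AttrParse, factor, expr) and hard-codes a two-operand expression for brevity.

```java
// Sketch: inherited attribute (symbol table) passed in as a parameter,
// synthesized attribute (the type) returned as the result.
import java.util.Map;

public class AttrParse {
    // factor -> id ; inherited: symbol table, synthesized: type t = symtab[id]
    static String factor(String id, Map<String, String> symtab) {
        return symtab.getOrDefault(id, "error");
    }

    // expr -> factor op factor ; the synthesized type is computed
    // from the types of the parts and the operator's required type
    static String expr(String id1, String op, String id2,
                       Map<String, String> symtab) {
        String t1 = factor(id1, symtab), t2 = factor(id2, symtab);
        String need = (op.equals("&&") || op.equals("||")) ? "bool" : "int";
        return (t1.equals(need) && t2.equals(need)) ? need : "error";
    }
}
```

In a full parser these methods would also advance the lexer; here they only show how attribute values flow as parameters and return values.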
Object-Oriented
Top-down Recognizer
A top-down parser is basically a set of mutually-recursive
procedures, one per nonterminal of the grammar.
In the object-oriented approach, each such procedure can
be made the constructor of a class. Two benefits:
(i) Run-time object structure = parse tree.
(ii) When grammar rules are enhanced with attributes, the
synthesized attributes become fields of the class. Thus,
synthesized attributes are like “decorations” added to
the nodes of the parse tree.
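A sketch of the constructor-per-nonterminal idea on a tiny grammar (expr → term { + term }, term → digit); all class names here are illustrative, and the shared input/position fields are kept static only to keep the sketch short.

```java
// Sketch: each nonterminal becomes a class whose constructor parses that
// nonterminal; the resulting object structure is the parse tree.
import java.util.ArrayList;
import java.util.List;

public class OOParse {
    static String input;
    static int pos;

    static class Expr {                   // expr -> term { + term }
        final List<Term> terms = new ArrayList<>();
        Expr() {
            terms.add(new Term());
            while (pos < input.length() && input.charAt(pos) == '+') {
                pos++;                    // consume '+'
                terms.add(new Term());
            }
        }
    }

    static class Term {                   // term -> digit
        final char digit;
        Term() {
            if (pos >= input.length() || !Character.isDigit(input.charAt(pos)))
                throw new RuntimeException("syntax error at position " + pos);
            digit = input.charAt(pos++);  // synthesized attribute as a field
        }
    }

    static Expr parse(String s) { input = s; pos = 0; return new Expr(); }
}
```

Parsing "1+2+3" yields an Expr object containing three Term objects, so the run-time object structure is exactly the parse tree, with each Term's digit field acting as a synthesized-attribute "decoration".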
Demo of an OO
Top-down Recognizer
written in Java and run under
JIVE: Java Interactive
Visualization Environment
http://www.cse.buffalo.edu/jive