Lecture 11 - Department of Computer Science

The meaning of it all!
The role of finite automata and grammars in compiler design
Compiler-Compilers!
There are many applications of automata and grammars inside and
outside computer science, the main applications in computer science
being in the area of compiler design.
Suppose we have designed (on paper) a new programming language with nice
features. We have worked out its syntax and the way its constructs should
work. Now all we need is a compiler for this language.
It’s too complex to write a compiler from scratch!
What we could do is to make use of theoretical tools like automata and grammars
that recognize/generate strings of symbols of various kinds* and formally specify
the syntax of the new computer programming language. Such a formal
specification (plus other details) can be used by “magical programs” known as
compiler-compilers to automatically generate a compiler for the programming
language!
But for these theoretical tools, we would have to spell out the syntax of a
language in, say, plain English, which would not be precise enough; a program
would find it hard to “understand” such a description well enough to generate
a compiler on its own.
* A program can be viewed as a (very long!) string that adheres to certain rules dictated by the
programming language.
[Photo: Admiral Grace Hopper, pioneer of compiler design]
Lexical Analysis
[Diagram: the lexical analyzer takes a raw stream of characters, e.g.
for(i=0;i<=10;i++), and produces a stream of tokens:
for ( i = 0 ; i <= 10 ; i ++ )
Each token is classified: “for” is a keyword, “i” an identifier, “10” a
constant. Identifiers such as i and constants such as 10 are entered into a
symbol table.]
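To make this concrete, here is a rough C sketch of a hand-written lexical
analyzer for this very fragment. The token-kind names and the next_token( )
interface are our own inventions for illustration, not from the lecture.

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* A minimal, illustrative lexical analyzer for the fragment
   for(i=0;i<=10;i++).  Token kinds and names are assumptions. */
enum token_kind { KEYWORD, IDENTIFIER, CONSTANT, OPERATOR, END };

static const char *input = "for(i=0;i<=10;i++)";
static int pos = 0;

/* Scan the next token: copy its text into lexeme, return its kind. */
enum token_kind next_token(char *lexeme)
{
    int n = 0;
    char c = input[pos];
    if (c == '\0') return END;
    if (isalpha((unsigned char)c)) {          /* identifier or keyword */
        while (isalnum((unsigned char)input[pos]))
            lexeme[n++] = input[pos++];
        lexeme[n] = '\0';
        return strcmp(lexeme, "for") == 0 ? KEYWORD : IDENTIFIER;
    }
    if (isdigit((unsigned char)c)) {          /* numeric constant */
        while (isdigit((unsigned char)input[pos]))
            lexeme[n++] = input[pos++];
        lexeme[n] = '\0';
        return CONSTANT;
    }
    lexeme[n++] = input[pos++];               /* operator/punctuation */
    if ((lexeme[0] == '<' && input[pos] == '=') ||
        (lexeme[0] == '+' && input[pos] == '+'))
        lexeme[n++] = input[pos++];           /* two-character operator */
    lexeme[n] = '\0';
    return OPERATOR;
}

int main(void)
{
    static const char *name[] =
        { "keyword", "identifier", "constant", "operator", "end" };
    char lexeme[32];
    enum token_kind k;
    while ((k = next_token(lexeme)) != END)
        printf("%-10s %s\n", name[k], lexeme); /* one token per line */
    return 0;
}

Running it prints each token with its class: for as a keyword, i as an
identifier, 10 as a constant, and so on.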
Parsing
‘Parse’ means to relate.

[Diagram: parse tree for the statement for ( i = 0 ; i <= 10 ; i ++ ) statement.
The root FOR-statement node has children: the keyword for, “(”, an assignment
statement (id i = const 0), “;”, a comparison expression (id i <= const 10),
“;”, an increment expression (id i ++), “)”, and the body statement.]
Finite state automata as lexical analysers
Automaton for recognizing keywords

[Diagram: from the start state 0, separate branches spell out the keywords
letter by letter (W-H-I-L-E, F-O-R, I-F, E-L-S-E), and each branch reaches its
accepting state on a character “other than letter/digit”.]

Automaton for recognizing identifiers

[Diagram: from the start state 0, a “letter” transition leads to state 1;
state 1 loops on “letter, digit” and moves to the accepting state 2 on any
character other than a letter or digit.]
Converting a finite state automaton into a computer program

[Diagram: automaton for recognizing identifiers, with states A, B, C:
A --letter--> B; B loops on “letter, digit”; B --other than letter/digit--> C.]

A: Read next_char
   If next_char is a letter goto B
   else FAIL( )

B: Read next_char
   If next_char is either a letter or a digit goto B
   else goto C

FAIL( ) is a function that “puts back” the character just read and starts up
the next transition diagram.

Note: Instead of using “A” and “B” as labels for GOTO statements, one could use
them as names of individual functions/procedures that can be invoked.
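Transcribed into C, the three states become labelled blocks. The function name
recognize_identifier and the way the matched length is reported are our own
choices; only the state structure comes from the diagram.

#include <ctype.h>
#include <stdio.h>

/* The identifier-recognizing automaton (states A, B, C) as C code.
   Each labelled block plays the role of one state. */
int recognize_identifier(const char *s, int *len)
{
    int i = 0;
A:  /* state A: the first character must be a letter */
    if (isalpha((unsigned char)s[i])) { i++; goto B; }
    return 0;                     /* FAIL: not an identifier */
B:  /* state B: keep looping on letters and digits */
    if (isalnum((unsigned char)s[i])) { i++; goto B; }
    goto C;
C:  /* state C: accept; the character that ended the loop is
       "put back" simply by not consuming it */
    *len = i;
    return 1;
}

int main(void)
{
    int len;
    if (recognize_identifier("count1 = 0", &len))
        printf("identifier of length %d recognized\n", len);
    return 0;
}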
Grammars as syntax specification tools
Finite state automata are used to describe tokens. Grammars are much more
“expressive” than finite state automata and can be used to describe more
complicated syntactic structures in a program, for instance the syntax of a
FOR statement in the C language.

Grammars only describe/generate strings. We also need a process which, given an
input string (a statement in a program, say), pronounces whether or not it is
derivable from a given grammar. Such a process is known as parsing.
Types of Parsing
Grammar: S → aAcBe, A → Ab | b, B → d
Input: a b b c d e

(i) Top down: [Diagram: the parse tree grows from the root S, expanding S to
aAcBe, then A to Ab, A to b, and B to d, until the leaves spell out the input
string.]

(ii) Bottom up: [Diagram: the parse tree grows from the leaves a b b c d e,
reducing b to A, then Ab to A, then d to B, and finally aAcBe to S.]
Reducing the input string to the start symbol
We take a “chunk” of the input string and REDUCE it to (replace it with) the
symbol on the LHS of a production rule.
In other words, the parse tree is constructed by beginning at the leaves
and working up towards the root.
“Expanding” the start symbol down to the input string
We “EXPAND” the start symbol (according to the production rules of the given
grammar), and subsequently every non-terminal symbol that occurs in the
“expansion”*, till we arrive at the input string.
(* Technically, such an expansion is called a sentential form.)
Shift-Reduce: a bottom up parsing technique
Grammar: S → aAcBe, A → Ab | b, B → d
Input: a b b c d e $

[Diagram: successive stack snapshots, top of stack at the right:
$ a b          after shifting a and b
$ a A b        after reducing b to A and shifting b
$ a A c d      after reducing Ab to A and shifting c, d
$ a A c B e    after reducing d to B and shifting e
$ S            after reducing aAcBe to S]
We shift symbols from the input string (from left to right), usually onto a
stack, so that the “chunk” of symbols (matching the RHS of a production) which
is to be reduced to the corresponding LHS will eventually appear on top of the
stack. (The chunk getting reduced is referred to as the “handle”.)
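A minimal C sketch of this shift-reduce loop on the grammar above, with the
stack kept in a character array. Note that the shift/reduce decisions are
hard-wired here (the longer RHS Ab is tried before b) purely to reproduce the
trace shown; a real parser decides them from tables.

#include <stdio.h>
#include <string.h>

/* Shift-reduce sketch for the toy grammar
       S -> aAcBe,   A -> Ab | b,   B -> d
   on input "abbcde".  The stack is a string growing to the right. */
int main(void)
{
    const char *input = "abbcde";
    char stack[32] = "$";
    int ip = 0;

    while (1) {
        size_t top = strlen(stack);
        if (top >= 3 && strcmp(stack + top - 2, "Ab") == 0)
            strcpy(stack + top - 2, "A");        /* reduce A -> Ab */
        else if (stack[top - 1] == 'b')
            stack[top - 1] = 'A';                /* reduce A -> b  */
        else if (stack[top - 1] == 'd')
            stack[top - 1] = 'B';                /* reduce B -> d  */
        else if (top >= 6 && strcmp(stack + top - 5, "aAcBe") == 0)
            strcpy(stack + top - 5, "S");        /* reduce S -> aAcBe */
        else if (input[ip] != '\0') {            /* no handle: shift */
            stack[top] = input[ip++];
            stack[top + 1] = '\0';
        } else
            break;                               /* nothing left to do */
        printf("stack: %-8s  remaining input: %s$\n", stack, input + ip);
    }
    puts(strcmp(stack, "$S") == 0 ? "accepted" : "rejected");
    return 0;
}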
What is a handle?
A substring of the input string that matches the RHS of a production, and
whose replacement by the corresponding LHS would eventually lead to a
reduction to the start symbol, is called a handle.
A Rightmost Derivation (RMD)

Grammar: S → aAcBe, A → Ab | b, B → d

S ⇒ aAcBe
  ⇒ aAcde
  ⇒ aAbcde
  ⇒ abbcde
Bottom up parsing can be viewed as an “RMD in reverse direction”.

In a rightmost derivation, non-terminal symbols on the right get expanded
before those on the left. When we run this in reverse (now reducing symbols,
not expanding them), pieces of the string on the left get reduced before
those on the right.
The problem with discovering handles
Discovering the handle may not always be easy! There may be more than one
substring appearing on top of the stack that matches the RHS of a production.
Grammar: S → aAcBe, A → Ab | b, B → d
Input: a b b c d e

[Diagram: stack snapshots for an incorrect choice of handle:
$ a b        after shifting a and b
$ a A        after reducing b to A
$ a A b      after shifting b
$ a A A      after reducing b to A again (wrong: the handle was Ab)]

There’s no way aAAcde can be reduced to S. (When we make an incorrect choice
of handle, we get stuck half-way through, before we can arrive at the start
symbol.)
The problem with discovering handles
In the exercises we did, we decided for ourselves when to shift and when to
reduce symbols (using our own cleverness!). However, these decisions can (and
must) be made automatically by the parser program, in tune with the given
grammar. The well-known LR parser can do this, but it is beyond our present
scope.
Top down parsing
Formal:
Construct the parse tree (for the input) by beginning at the root and creating
the nodes of the tree in preorder. In other words, it is an attempt to find a
Leftmost Derivation for an input string.

Informal:
Instead of starting with the input string and reducing it to the start symbol
(by replacing “chunks” of it with non-terminal symbols), we begin with the
start symbol itself and ask: “How can I expand this in order to arrive
(eventually) at the input string?” We ask the same question for every
non-terminal symbol occurring in the resulting expansions. We choose an
appropriate expansion of a given non-terminal by glancing at the input string,
i.e. by taking cues from the symbol being scanned (and also the next few
symbols) in the input.
Top down parsing: an example
S  cAd
A  ab | a
Start with S. Only one
expansion is possible.
c
S
c Ad
cAd
a b
(i)
Input
d
S
Now, how to expand A? Try every
expansion one by one!
S
OK! It matches with the
first symbol in the input.
a
c
(ii)
a
d
match!
(iii)
mismatch!
(so, try
another
expansion)
cAd
a
S
cAd
a
(iv)
c
a
d
match!
(so, move
on!)
c
a
d
match!
(we’re
done!)
Top down parsing: an example
A program to do top-down parsing might use a separate procedure for every
non-terminal.
function S( )
{
    if input_symbol = ‘c’ then
    {
        ADVANCE( );
        if A( ) then
        {
            if input_symbol = ‘d’ then
            {
                ADVANCE( );
                return TRUE;
            }
        }
    }
    return FALSE;
}

function A( )
{
    isave = input_pointer;
    if input_symbol = ‘a’ then
    {
        ADVANCE( );
        if input_symbol = ‘b’ then
        {
            ADVANCE( );
            return TRUE;
        }
    }
    input_pointer = isave;
    /* Try second expansion */
    if input_symbol = ‘a’ then
    {
        ADVANCE( );
        return TRUE;
    }
    return FALSE;
}
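Fleshed out into compilable C for the same grammar S → cAd, A → ab | a, with
the input buffer, input_symbol and ADVANCE( ) made explicit. The exact form of
these supporting definitions is our assumption; the lecture's pseudocode
leaves them implicit.

#include <stdio.h>

static const char *input = "cad";      /* the string being parsed */
static int input_pointer = 0;

#define input_symbol (input[input_pointer])
#define ADVANCE()    (input_pointer++)

int A(void)
{
    int isave = input_pointer;         /* remember position: may backtrack */
    if (input_symbol == 'a') {         /* first expansion: A -> ab */
        ADVANCE();
        if (input_symbol == 'b') {
            ADVANCE();
            return 1;
        }
    }
    input_pointer = isave;             /* backtrack; second expansion: A -> a */
    if (input_symbol == 'a') {
        ADVANCE();
        return 1;
    }
    return 0;
}

int S(void)
{
    if (input_symbol == 'c') {         /* S -> cAd */
        ADVANCE();
        if (A()) {
            if (input_symbol == 'd') {
                ADVANCE();
                return 1;
            }
        }
    }
    return 0;
}

int main(void)
{
    /* valid only if S succeeds AND the whole input is consumed */
    puts(S() && input_symbol == '\0' ? "cad: valid" : "cad: invalid");
    return 0;
}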
Problems with this approach
(i) Order in which the expansions are tried

Grammar: S → cAd, A → a | ab
Input: c a b d

[Diagram: S expands to cAd; c matches, and A → a (tried first) matches a; but
then d mismatches b.]

Because A → a was tried first and “succeeded”, the parser next expects the d of
cAd, which mismatches b; hence a new expansion for S will be tried (in vain!).
So cabd will be rejected as invalid (but it is actually valid).

Remedy: Rewrite the grammar so that no two expansions of the same non-terminal
share a common “prefix”; use “left factoring” to realise this.
Problems with this approach
(ii) Left recursion

A → Aα

A production rule of this form exhibits (immediate) left recursion. More
generally, a grammar has left recursion if, at some point, A “yields” Aα,
i.e. if Aα can be derived from A in one or more steps.

Why is left recursion dangerous? Because the function A( ) (corresponding to
the non-terminal A) will be forced to invoke itself repeatedly and endlessly,
without ever consuming an input symbol.

Remedy: Eliminate left recursion from the grammar!
Eliminating (immediate) left recursion
A  Aα | β
A  β A’
AAα
Aαα
Aααα
βααα
A’  αA’ | ε
e.g. E  E + T | T
TT*F|F
F  ( E ) | id
E  TE’
E’  +TE’ | ε
T  FT’
T’  *FT’ | ε
F  ( E ) | id
Left factoring
A  αβ | αγ
A  αA’
A’  β | γ
e.g.
S  iCtS | iCtSeS | a
Cb
S  iCtSS’ | a
S’  eS | ε
Cb