Parsing
 The scanner recognizes words
 The parser recognizes syntactic units
 Parser operations:
 Check and verify syntax based on specified syntax rules
 Report errors
 Build IR
 Automation:
 The process can be automated
Parsing
 Check and verify syntax based on specified syntax rules
 Are regular expressions sufficient for describing syntax?
 Example 1: Infix expressions
 Example 2: Nested parentheses
 We use Context-Free Grammars (CFGs) to specify context-free
syntax.
 A CFG describes how a sentence of a language may be
generated.
 Example:
EvilLaugh → mwa EvilCackle
EvilCackle → ha EvilCackle
EvilCackle → ha!

Use this grammar to generate the sentence mwa ha ha ha!
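As a minimal sketch of how the grammar generates that sentence, the Python snippet below rewrites the leftmost non-terminal at each step. The `RULES` table and the `derive` helper are illustrative names, not part of the slides; each entry of `choices` selects which alternative of the current non-terminal to apply.

```python
# Grammar from the slide: each non-terminal maps to its list of alternatives.
RULES = {
    "EvilLaugh": [["mwa", "EvilCackle"]],
    "EvilCackle": [["ha", "EvilCackle"], ["ha!"]],
}


def derive(choices, start="EvilLaugh"):
    """Apply the chosen alternatives to the leftmost non-terminal.

    Returns the list of sentential forms, one per derivation step.
    """
    form = [start]
    steps = [" ".join(form)]
    for choice in choices:
        # Position of the leftmost non-terminal in the current form.
        idx = next(i for i, sym in enumerate(form) if sym in RULES)
        form[idx:idx + 1] = RULES[form[idx]][choice]
        steps.append(" ".join(form))
    return steps


for step in derive([0, 0, 0, 1]):
    print(step)
```

Choosing the recursive EvilCackle alternative twice and then the terminating one yields the sentence with three "ha" tokens.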
CFGs
 A CFG is a quadruple (N, T, R, S) where
 N is the set of non-terminal symbols
 T is the set of terminal symbols
 S ∈ N is the starting symbol
 R ⊆ N × (N ∪ T)* is a set of rules
 Example: The grammar of nested parentheses
G = (N, T, R, S) where
N = {S}
T = { (, ) }
R = { S → (S), S → SS, S → ε }
Derivations
 The language described by a CFG is the set of strings that can
be derived from the start symbol using the rules of the
grammar.
 At each step, we choose a non-terminal to replace.
S ⇒ (S) ⇒ (SS) ⇒ ((S)S) ⇒ (( )S) ⇒ (( )(S)) ⇒ (( )((S))) ⇒ (( )(( )))
The whole sequence is a derivation; each string in it is a sentential form.
This example demonstrates a leftmost derivation:
one where we always expand the leftmost non-terminal
in the sentential form.
Derivations and parse trees
 We can describe a derivation using a graphical
representation called parse tree:




the root is labeled with the start symbol, S
each internal node is labeled with a non-terminal
the children of an internal node A are the symbols on the right-hand side of
a production for A
each leaf is labeled with a terminal
 A parse tree has a unique leftmost and a unique
rightmost derivation (however, we cannot tell which
one was used by looking at the tree)
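To illustrate how a parse tree determines its leftmost derivation, here is a small Python sketch. The tree encoding (a non-terminal node is a `(symbol, children)` tuple, a terminal leaf is a plain string, and an empty child list stands for ε) and the helper names are assumptions made for this example only.

```python
def render(form):
    """Show a sentential form: node labels for subtrees, text for leaves."""
    return "".join(n[0] if isinstance(n, tuple) else n for n in form)


def leftmost_derivation(tree):
    """Expand the leftmost unexpanded non-terminal node until only leaves remain."""
    form = [tree]
    steps = [render(form)]
    while any(isinstance(n, tuple) for n in form):
        i = next(i for i, n in enumerate(form) if isinstance(n, tuple))
        form[i:i + 1] = form[i][1]  # replace the node by its children
        steps.append(render(form))
    return steps


# Parse tree for (()(())) under S -> (S) | SS | epsilon.
empty = ("S", [])
left = ("S", ["(", empty, ")"])
right = ("S", ["(", ("S", ["(", empty, ")"]), ")"])
tree = ("S", ["(", ("S", [left, right]), ")"])

print(" => ".join(leftmost_derivation(tree)))
```

Always expanding the rightmost node instead would give the rightmost derivation of the same tree.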
Derivations and parse trees
 So, how can we use the grammar described earlier to
verify the syntax of "(( )((( ))))"?


We must try to find a derivation for that string.
We can work top-down (starting at the root/start symbol) or
bottom-up (starting at the leaves).
 Careful!
 There may be more than one grammar that describes the same
language.
 Not all grammars are suitable for parsing.
Problems in parsing
 Consider S → if E then S else S | if E then S
 What is the parse tree for
if E then if E then S else S
 There are two possible parse trees! This problem is called
ambiguity
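Tree 1: S → if E then S, where the inner S → if E then S else S (the else belongs to the inner if)
Tree 2: S → if E then S else S, where the first inner S → if E then S (the else belongs to the outer if)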
 A CFG is ambiguous if one or more terminal strings have
multiple leftmost derivations from the start symbol.
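One way to observe ambiguity on a concrete string is to count its parse trees. The sketch below does this for the if/else grammar; the extra rules S → other and E → cond, the token names, and the `count_parses` helper are assumptions added so that the sentence consists only of terminals. The counting scheme relies on the grammar having no ε-rules and no unit cycles.

```python
from functools import lru_cache

# Rules as tuples so they can be used as cache keys.
GRAMMAR = {
    "S": [("if", "E", "then", "S", "else", "S"),
          ("if", "E", "then", "S"),
          ("other",)],
    "E": [("cond",)],
}


def count_parses(tokens, start="S"):
    """Count the parse trees that derive `tokens` from `start`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count_symbol(sym, i, j):
        if sym not in GRAMMAR:  # terminal: must match exactly one token
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(count_seq(rhs, i, j) for rhs in GRAMMAR[sym])

    @lru_cache(maxsize=None)
    def count_seq(rhs, i, j):
        if len(rhs) == 1:
            return count_symbol(rhs[0], i, j)
        total = 0
        # Every symbol covers at least one token, which bounds the split point.
        for k in range(i + 1, j - len(rhs) + 2):
            first = count_symbol(rhs[0], i, k)
            if first:
                total += first * count_seq(rhs[1:], k, j)
        return total

    return count_symbol(start, 0, len(tokens))


sentence = "if cond then if cond then other else other".split()
print(count_parses(sentence))
```

A count above one for any sentence shows that the grammar is ambiguous; a count of one for a few sentences proves nothing about the grammar as a whole.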
Ambiguity
 There is no general algorithm to tell whether a CFG is
ambiguous or not.
 There is no standard procedure for eliminating
ambiguity.
 Some languages are inherently ambiguous.

In those cases, any grammar we come up with will be
ambiguous.
Ambiguity
 In general, we try to eliminate ambiguity by rewriting
the grammar.
 Example:

E → E+E | E*E | id
becomes:
E → E+T | T
T → T*F | F
F → id
Ambiguity
 In general, we try to eliminate ambiguity by rewriting
the grammar.
 Example:

S → if E then S else S | if E then S | other
becomes:
S → EwithElse | EnoElse
EwithElse → if E then EwithElse else EwithElse | other
EnoElse → if E then S
| if E then EwithElse else EnoElse
Top-down parsing
 Main idea:
 Start at the root, grow towards leaves
 Pick a production and try to match input
 May need to backtrack
 Example:
 Use the expression grammar to parse x-2*y
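The idea of picking a production, matching input, and backtracking on failure can be sketched as a small recognizer. The right-recursive expression grammar below, the token names `id` and `num`, and the helper names are assumptions for illustration; the slides do not fix a particular grammar here, and a left-recursive one would make this approach loop forever.

```python
# Hypothetical non-left-recursive expression grammar.
GRAMMAR = {
    "E": [["T", "+", "E"], ["T", "-", "E"], ["T"]],
    "T": [["F", "*", "T"], ["F"]],
    "F": [["id"], ["num"]],
}


def parse(symbol, tokens, pos):
    """Yield every position at which `symbol` can end when it starts at `pos`."""
    if symbol not in GRAMMAR:  # terminal
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each production in turn
        yield from parse_seq(rhs, tokens, pos)


def parse_seq(rhs, tokens, pos):
    """Match the symbols of `rhs` one after another, backtracking on failure."""
    if not rhs:
        yield pos
        return
    for mid in parse(rhs[0], tokens, pos):
        yield from parse_seq(rhs[1:], tokens, mid)


def accepts(tokens):
    return any(end == len(tokens) for end in parse("E", tokens, 0))


# x-2*y after scanning: identifiers become `id`, numbers become `num`.
print(accepts(["id", "-", "num", "*", "id"]))
```

The generators make backtracking implicit: when a later symbol fails to match, the enclosing loop simply asks the previous symbol for its next possible end position.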
Grammar problems
 Because we try to generate a leftmost derivation by
scanning the input from left to right, grammars of the
form A → A x may cause endless recursion.
 Such grammars are called left-recursive and they
must be transformed if we want to use a top-down
parser.
Left recursion
 A grammar is left recursive if for a non-terminal A,
there is a derivation A ⇒+ A α for some string α
 There are three types of left recursion:
 direct (A → A x)
 indirect (A → B C, B → A)
 hidden (A → B A, B → ε)
Left recursion
 To eliminate direct left recursion replace
A → A α1 | A α2 | ... | A αm | β1 | β2 | ... | βn
with
A → β1 B | β2 B | ... | βn B
B → α1 B | α2 B | ... | αm B | ε
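The transformation can be sketched in a few lines of Python. Productions are represented as tuples of symbols and ε as the empty tuple; this representation and the function name are assumptions for illustration only.

```python
def eliminate_direct_left_recursion(nt, productions, new_nt):
    """Rewrite the productions of `nt` so that none starts with `nt` itself.

    `new_nt` is the name of the fresh non-terminal (B in the scheme above).
    """
    alphas = [rhs[1:] for rhs in productions if rhs[:1] == (nt,)]
    betas = [rhs for rhs in productions if rhs[:1] != (nt,)]
    if not alphas:  # no direct left recursion: nothing to do
        return {nt: list(productions)}
    return {
        nt: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [()],  # () is epsilon
    }


# E -> E + T | T
print(eliminate_direct_left_recursion("E", [("E", "+", "T"), ("T",)], "E'"))
```

For E → E + T | T this gives E → T E' and E' → + T E' | ε.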
Left recursion
 How about this:
S → E
E → E+T
E → T
T → E-T
T → id
There is direct recursion: E → E+T
There is indirect recursion: T → E-T, E → T
Algorithm for eliminating indirect recursion:
List the nonterminals in some order A1, A2, ..., An
for i = 1 to n
    for j = 1 to i-1
        if there is a production Ai → Aj γ,
            replace Aj with each of its right-hand sides
    eliminate any direct left recursion on Ai
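A possible Python rendering of this algorithm is sketched below. The dictionary representation (tuples of symbols, `()` for ε), the convention of naming the fresh non-terminal by appending `'`, and the function names are assumptions for illustration. Like the pseudocode, the sketch does not deal with ε-productions or cycles in the input grammar.

```python
def remove_direct(nt, rules):
    """Eliminate direct left recursion on `nt`, introducing nt' if needed."""
    alphas = [rhs[1:] for rhs in rules if rhs[:1] == (nt,)]
    betas = [rhs for rhs in rules if rhs[:1] != (nt,)]
    if not alphas:
        return {nt: rules}
    new = nt + "'"
    return {nt: [b + (new,) for b in betas],
            new: [a + (new,) for a in alphas] + [()]}


def remove_left_recursion(grammar, order):
    g = {nt: list(rules) for nt, rules in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for rhs in g[ai]:
                if rhs[:1] == (aj,):
                    # Ai -> Aj gamma: substitute every right-hand side of Aj.
                    expanded += [delta + rhs[1:] for delta in g[aj]]
                else:
                    expanded.append(rhs)
            g[ai] = expanded
        g.update(remove_direct(ai, g[ai]))
    return g


grammar = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("E", "-", "T"), ("F",)],
    "F": [("E", "*", "F"), ("id",)],
}
for nt, rules in remove_left_recursion(grammar, ["S", "E", "T", "F"]).items():
    print(nt, "->", " | ".join(" ".join(r) or "ε" for r in rules))
```

The result depends on the chosen ordering of the non-terminals; a different order gives a different but equivalent grammar.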
Eliminating indirect left recursion
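Ordering: S, E, T, F
Initial grammar:
S → E
E → E+T
E → T
T → E-T
T → F
F → E*F
F → id
i=S: there is no j to consider and no direct left recursion on S, so the grammar is unchanged.
i=E (eliminate direct left recursion on E):
S → E
E → TE'
E' → +TE' | ε
T → E-T
T → F
F → E*F
F → id
i=T, j=E (substitute E in T → E-T):
S → E
E → TE'
E' → +TE' | ε
T → TE'-T
T → F
F → E*F
F → id
i=T (eliminate direct left recursion on T):
S → E
E → TE'
E' → +TE' | ε
T → FT'
T' → E'-TT' | ε
F → E*F
F → id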
Eliminating indirect left recursion
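i=F, j=E (substitute E in F → E*F):
S → E
E → TE'
E' → +TE' | ε
T → FT'
T' → E'-TT' | ε
F → TE'*F
F → id
i=F, j=T (substitute T in F → TE'*F):
S → E
E → TE'
E' → +TE' | ε
T → FT'
T' → E'-TT' | ε
F → FT'E'*F
F → id
i=F (eliminate direct left recursion on F):
S → E
E → TE'
E' → +TE' | ε
T → FT'
T' → E'-TT' | ε
F → idF'
F' → T'E'*FF' | ε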
Grammar problems
 Consider S → if E then S else S | if E then S
 Which of the two productions should we use to expand nonterminal S when the next token is if?
 We can solve this problem by factoring out the common part
in these rules. This way, we are postponing the decision
about which rule to choose until we have more information
(namely, whether there is an else or not).
 This is called left factoring
Left factoring
A → αβ1 | αβ2 | ... | αβn | γ
becomes
A → αB | γ
B → β1 | β2 | ... | βn
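A single left-factoring step might look as follows in Python. The tuple representation, the helper names, and the choice to factor only the first group of alternatives that share a leading symbol are assumptions of this sketch; a full implementation would repeat the step until no two alternatives of any non-terminal share a prefix.

```python
def common_prefix(seqs):
    """Longest prefix shared by all symbol sequences in `seqs`."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor_once(nt, alternatives, new_nt):
    """Factor out the common prefix of one group of alternatives of `nt`."""
    groups = {}
    for rhs in alternatives:
        groups.setdefault(rhs[:1], []).append(rhs)
    for first, group in groups.items():
        if first and len(group) > 1:
            alpha = common_prefix(group)
            rest = [rhs for rhs in alternatives if rhs not in group]
            return {
                nt: [alpha + (new_nt,)] + rest,                # A -> alpha B | gamma
                new_nt: [rhs[len(alpha):] for rhs in group],   # B -> beta1 | ... | betan
            }
    return {nt: list(alternatives)}


rules = [
    ("if", "E", "then", "S", "else", "S"),
    ("if", "E", "then", "S"),
    ("other",),
]
print(left_factor_once("S", rules, "S'"))
```

For the if/else rules the new non-terminal gets the alternatives `else S` and ε, so the parser decides between them only after it has seen whether an `else` follows.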
Grammar problems
 A symbol X ∈ V is useless if
 there is no derivation from X to any string of terminals
(non-terminating)
 there is no derivation from S that reaches a sentential form
containing X (non-reachable)
 Reduced grammar = a grammar that does not
contain any useless symbols.
Useless symbols
 In order to remove useless symbols, apply two
algorithms:


First, remove all non-terminating symbols
Then, remove all non-reachable symbols.
 The order is important!
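 For example, suppose X is reachable from S only through a production whose right-hand side also contains a non-terminating symbol. What will happen if we apply the
algorithms in the wrong order?
 Concrete example: S → AB | a, A → a
 Here B is non-terminating. If non-reachable symbols are removed first, A survives because it is reachable through S → AB; removing non-terminating symbols afterwards deletes S → AB but leaves the now unreachable A → a behind.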
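Both passes, and the effect of their order, can be sketched in Python. The representation is an assumption made for brevity: every symbol is a single character, upper-case letters are non-terminals, and each right-hand side is a string. The function names are likewise illustrative.

```python
def remove_non_terminating(grammar):
    """Keep only non-terminals that derive some string of terminals."""
    term = set()
    changed = True
    while changed:
        changed = False
        for nt, rules in grammar.items():
            if nt not in term and any(
                all(not s.isupper() or s in term for s in rhs) for rhs in rules
            ):
                term.add(nt)
                changed = True
    return {
        nt: [rhs for rhs in rules
             if all(not s.isupper() or s in term for s in rhs)]
        for nt, rules in grammar.items() if nt in term
    }


def remove_non_reachable(grammar, start):
    """Keep only non-terminals that occur in some sentential form derived from start."""
    reach, todo = {start}, [start]
    while todo:
        for rhs in grammar.get(todo.pop(), []):
            for s in rhs:
                if s.isupper() and s not in reach:
                    reach.add(s)
                    todo.append(s)
    return {nt: rules for nt, rules in grammar.items() if nt in reach}


g = {"S": ["AB", "a"], "A": ["a"]}  # B has no productions at all
right = remove_non_reachable(remove_non_terminating(g), "S")
wrong = remove_non_terminating(remove_non_reachable(g, "S"))
print(right)  # only S -> a remains
print(wrong)  # A -> a is still present although S can no longer reach it
```

In the correct order the second pass sees the grammar after S → AB has already been deleted, so it can recognize that A is no longer reachable.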
Useless symbols
 Example
Initial grammar:
S → AB | CA
A → a
B → CB | AB
C → cB | b
D → aD | d
Algorithm 1 (terminating symbols):
A is in because of A → a
C is in because of C → b
D is in because of D → d
S is in because C, A are in and S → CA
Useless symbols
 Example continued
After algorithm 1:
S → CA
A → a
C → b
D → aD | d
Algorithm 2 (reachable symbols):
S is in because it is the start symbol
C and A are in because S is in and S → CA
Final grammar:
S → CA
A → a
C → b