Recognizers 26-Jul-16

Parsers and recognizers

Given a grammar (say, in BNF) and a string,


A recognizer will tell whether the string belongs to the language defined
by the grammar
A parser will try to build a tree corresponding to the string, according to
the rules of the grammar
Input string    Recognizer result    Parser result
2 + 3 * 4       true                 (a parse tree)
2 + 3 *         false                Error
Building a recognizer


One way of building a recognizer from a grammar is
called recursive descent
Recursive descent is pretty easy to implement, once you
figure out the basic ideas



Recursive descent is a great way to build a “quick and dirty”
recognizer or parser
Production-quality parsers use much more sophisticated and
efficient techniques
In the following slides, I’ll talk about how to do
recursive descent, and give some examples in Java
Review of BNF and EBNF


“Plain” BNF

< > indicate a nonterminal that needs to be further expanded, for example,
<variable>

Symbols not enclosed in < > are terminals; they represent themselves, for
example, if, while, (

The symbol ::= means is defined as

The symbol | means or; it separates alternatives, for example,
<addop> ::= + | -

Extended BNF

[ ] enclose an optional part of the rule


Example:
<if statement> ::= if ( <condition> ) <statement> [ else <statement> ]
{ } mean the enclosed part can be repeated zero or more times

Example:
<parameter list> ::= ( ) | ( { <parameter> , } <parameter> )
Recognizing simple alternatives, I

Consider the following BNF rule:

<add_operator> ::= + | -
That is, an add operator is a plus sign or a minus sign
To recognize an add operator, we need to get the next token,
and test whether it is one of these characters


If it is a plus or a minus, we simply return true
But what if it isn’t?


We not only need to return false, but we also need to put the token back
because it doesn’t belong to us, and some other grammar rule probably
wants it
Our tokenizer needs to be able to take back tokens


Usually, it’s enough to be able to put just one token back
More complex grammars may require the ability to put back several
tokens
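
A tokenizer that can take back its most recent token might be sketched as follows. The names Type, Token, and PushBackTokenizer, and the whitespace-separated input format, are assumptions made for illustration; they are not the tokenizer used elsewhere in these slides.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical token categories for this sketch.
enum Type { NAME, NUMBER, SYMBOL, EOF }

// A token is a type plus the text that produced it.
final class Token {
    private final Type type;
    private final String value;

    Token(Type type, String value) {
        this.type = type;
        this.value = value;
    }

    Type type() { return type; }
    String value() { return value; }
}

// Assumes the tokens in the input are separated by whitespace.
final class PushBackTokenizer {
    private final List<Token> tokens = new ArrayList<>();
    private int position = 0;            // index of the next token to hand out
    private boolean canPushBack = false; // true only right after next()

    PushBackTokenizer(String input) {
        for (String word : input.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(new Token(classify(word), word));
            }
        }
    }

    private static Type classify(String word) {
        if (word.matches("\\d+")) return Type.NUMBER;
        if (word.matches("[A-Za-z_]\\w*")) return Type.NAME;
        return Type.SYMBOL;
    }

    // Returns the next token, or an EOF token once the input is used up.
    Token next() {
        Token t = position < tokens.size()
                ? tokens.get(position)
                : new Token(Type.EOF, "");
        position++;
        canPushBack = true;
        return t;
    }

    // Takes back the token most recently returned by next(); one level only.
    void pushBack() {
        if (!canPushBack) {
            throw new IllegalStateException("Only one token can be pushed back");
        }
        position--;
        canPushBack = false;
    }
}
```

Calling next(), then pushBack(), then next() again yields the same token twice, which is what a grammar rule needs when it finds that the token belongs to some other rule.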
Recognizing simple alternatives, II


Our rule is <add_operator> ::= + | -
Our method for recognizing an <add_operator> (which we will simply call addOperator) looks like this:

public boolean addOperator() {
    Get the next token, call it t
    If t is a “+”, return true
    If t is a “-”, return true
    If t is anything else,
        put the token back
        return false
}
Java code


public boolean addOperator() {
    Token t = tokenizer.next();
    if (t.type() == Type.SYMBOL && t.value().equals("+")) {
        return true;
    }
    if (t.type() == Type.SYMBOL && t.value().equals("-")) {
        return true;
    }
    tokenizer.pushBack();  // not ours; leave it for some other rule
    return false;
}
While this code isn’t particularly long or hard to read, we are going to have a
lot of very similar methods
Helper methods

Remember the DRY principle: Don’t Repeat Yourself



If we turn each BNF production directly into Java, we will be
writing a lot of very similar code
We should write some auxiliary or “helper” methods to hide
some of the details for us
First helper method:

private boolean symbol(String expectedSymbol)


Gets the next token and tests whether it matches the expectedSymbol
  If it matches, returns true
  If it doesn’t match, puts the token back and returns false
We’ll look more closely at this method in a moment
Recognizing simple alternatives, III


Our rule is <add_operator> ::= + | -
Our pseudocode is:

public boolean addOperator() {
    Get the next token, call it t
    If t is a “+”, return true
    If t is a “-”, return true
    If t is anything else,
        put the token back
        return false
}
Thanks to our helper method, our actual Java code is:

public boolean addOperator() {
    return symbol("+") || symbol("-");
}
First implementation of symbol

Here’s what symbol does:






Gets a token
Makes sure that the token is a symbol
Compares the symbol to the desired symbol (by value)
If all the above is satisfied, returns true
Else (if not satisfied) puts the token back, and returns false
private boolean symbol(String value) {
    Token t = tokenizer.next();
    if (t.type() == Type.SYMBOL && value.equals(t.value())) {
        return true;
    }
    else {
        tokenizer.pushBack();
        return false;
    }
}
Implementing symbol


We can implement methods name, number, and maybe eol
the same way
All this code will look pretty much alike



The main difference is in checking for the type
The DRY principle suggests we should use a helper method for symbol
private boolean symbol(String expectedValue) {
    return nextTokenMatches(Type.SYMBOL, expectedValue);
}
nextTokenMatches #1

The nextTokenMatches method should:





Get a token
Compare types and values
Return true if the token is as expected
Put the token back and return false if it doesn’t match
private boolean nextTokenMatches(Type type, String value) {
    Token t = tokenizer.next();
    if (type == t.type() && value.equals(t.value())) {
        return true;
    }
    else {
        tokenizer.pushBack();
        return false;
    }
}
nextTokenMatches #2

The previous method is fine for symbols, but what if we only care
about the type?




For example, we want to get a number—any number
We need to compare only type, not value
private boolean nextTokenMatches(Type type, String value) {  // omit the value parameter
    Token t = tokenizer.next();
    if (type == t.type() && value.equals(t.value())) return true;  // omit the value test
    else tokenizer.pushBack();
    return false;
}
It’s easier to overload nextTokenMatches than to combine the two versions, and both versions are fairly short, so we are probably better off with the code duplication
addOperator reprise

public boolean addOperator() {
    return symbol("+") || symbol("-");
}

private boolean symbol(String expectedValue) {
    return nextTokenMatches(Type.SYMBOL, expectedValue);
}

private boolean nextTokenMatches(Type type, String value) {
    Token t = tokenizer.next();
    if (type == t.type() && value.equals(t.value())) return true;
    else tokenizer.pushBack();
    return false;
}
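
To try the whole chain end to end, here is a self-contained sketch. The Token, Type, and ListTokenizer classes are stand-ins assumed for this example (whitespace-separated tokens, one-token pushBack), and the class name OperatorRecognizer and the number() method are likewise illustrative. It includes both overloads of nextTokenMatches: one that compares type and value, and one that compares type only.

```java
import java.util.ArrayList;
import java.util.List;

enum Type { NAME, NUMBER, SYMBOL, EOF }

final class Token {
    private final Type type;
    private final String value;
    Token(Type type, String value) { this.type = type; this.value = value; }
    Type type() { return type; }
    String value() { return value; }
}

// Stand-in tokenizer: whitespace-separated tokens, one-token pushBack.
final class ListTokenizer {
    private final List<Token> tokens = new ArrayList<>();
    private int position = 0;

    ListTokenizer(String input) {
        for (String w : input.trim().split("\\s+")) {
            if (w.isEmpty()) continue;
            Type type = w.matches("\\d+") ? Type.NUMBER
                      : w.matches("[A-Za-z_]\\w*") ? Type.NAME
                      : Type.SYMBOL;
            tokens.add(new Token(type, w));
        }
    }

    Token next() {
        Token t = position < tokens.size() ? tokens.get(position) : new Token(Type.EOF, "");
        position++;
        return t;
    }

    void pushBack() { position--; }
}

final class OperatorRecognizer {
    private final ListTokenizer tokenizer;

    OperatorRecognizer(String input) { tokenizer = new ListTokenizer(input); }

    // <add_operator> ::= + | -
    public boolean addOperator() {
        return symbol("+") || symbol("-");
    }

    // Any number at all: only the type matters.
    public boolean number() {
        return nextTokenMatches(Type.NUMBER);
    }

    private boolean symbol(String expectedValue) {
        return nextTokenMatches(Type.SYMBOL, expectedValue);
    }

    // Matches on both type and value; puts the token back on a mismatch.
    private boolean nextTokenMatches(Type type, String value) {
        Token t = tokenizer.next();
        if (type == t.type() && value.equals(t.value())) return true;
        tokenizer.pushBack();
        return false;
    }

    // Matches on type only; puts the token back on a mismatch.
    private boolean nextTokenMatches(Type type) {
        Token t = tokenizer.next();
        if (type == t.type()) return true;
        tokenizer.pushBack();
        return false;
    }
}
```

With the input "- 7 *", addOperator() accepts the minus sign, a second addOperator() rejects the 7 and puts it back, and number() then accepts that same 7.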
Sequences, I

Suppose we want to recognize a grammar rule in which one
thing follows another, for example,



<empty_list> ::= “[” “]”
(I put quotes around these brackets to distinguish them from the EBNF
metasymbols for “optional”)
Here’s some code we might try:

public boolean emptyList() {
    return symbol("[") && symbol("]");
}
Sequences, I

Here’s the grammar rule again:
<empty_list> ::= “[” “]”
The code for this would be fairly simple...
public boolean emptyList() {
    return symbol("[") && symbol("]");
}
...except for one thing...

What happens if we get [ 5 ]?
  We recognize and accept the [
  We reject (and put back) the 5
  We cannot also put back the [, because we can only put back one thing
  Being able to put back two things isn’t enough either: what about [ 1, 2, 3 ]?
Sequences, II


The grammar rule is <empty_list> ::= “[” “]”
And the token string contains [ 5 ]

Solution #1: Write a pushBack method that can push back more than one token at a time
  This will allow you to put back both the “[” and the “5”
  You have to be very careful of the order in which you return tokens
  This is a good use for a Stack
  But you never know when to quit!
Solution #2: Call it an error
  You might be able to get away with this, depending on the grammar
  For example, for any reasonable grammar, (2 + 3 +) is clearly an error
Solution #3: Change the grammar
  Tricky, and may not be possible
Solution #4: Combine rules
  See the next slide
Implementing a fancier pushBack()

To push back more than one token, you need to either:
  Make your tokenizer keep track of the last several tokens (and have a pushBack(int n) method), or
  Expect the calling program to tell you what tokens to push back (with a pushBack(Token t) method)
I’ve had you implement your own Tokenizer
  This was so you would understand state machines
  In practice, you would probably use Java’s built-in StreamTokenizer
Extending StreamTokenizer

java.io.StreamTokenizer does almost everything you need in a tokenizer
  Its pushBack() method only “puts back” a single token
  If you need more than that, you have to extend StreamTokenizer
To push back more than one token, you need to either:
  Make your extended tokenizer keep track of the last several tokens (and have a pushBack(int n) method), or
  Expect the calling program to tell you what tokens to push back (with a pushBack(Token t) method)
Plus, you will have to override nextToken()
  Inside your nextToken() method, you can call super.nextToken() to get the next never-before-seen token
  Your nextToken() method will also have to do something about nval and sval, such as providing methods to get these values
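
One way the first option could look is sketched below. The class name ReplayTokenizer and its Saved helper are hypothetical; only StreamTokenizer, its nextToken() and pushBack() methods, and its ttype, nval, and sval fields come from java.io. Instead of adding getter methods, this sketch restores ttype, nval, and sval whenever it replays a remembered token, which is just one possible way to handle them.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical extension that remembers every token it has handed out,
// so that any number of them can be pushed back and replayed.
class ReplayTokenizer extends StreamTokenizer {

    // Snapshot of the three public fields that describe a token.
    private static final class Saved {
        final int ttype;
        final double nval;
        final String sval;
        Saved(int ttype, double nval, String sval) {
            this.ttype = ttype;
            this.nval = nval;
            this.sval = sval;
        }
    }

    private final List<Saved> history = new ArrayList<>();
    private int position = 0; // index in history of the next token to return

    ReplayTokenizer(Reader reader) {
        super(reader);
    }

    @Override
    public int nextToken() throws IOException {
        if (position == history.size()) {
            super.nextToken(); // a never-before-seen token
            history.add(new Saved(ttype, nval, sval));
        }
        Saved s = history.get(position++);
        ttype = s.ttype; // restore the fields so callers can read them as usual
        nval = s.nval;
        sval = s.sval;
        return ttype;
    }

    // Pushes back the n most recently returned tokens.
    public void pushBack(int n) {
        if (n < 0 || n > position) {
            throw new IllegalArgumentException("Cannot push back " + n + " tokens");
        }
        position -= n;
    }

    // Keeps the single-token behavior, routed through the history.
    @Override
    public void pushBack() {
        if (position > 0) position--;
    }
}
```

Because the history list never shrinks, this version trades memory for simplicity; a production tokenizer would probably bound how far back it can go.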
Sequences, III

Suppose the grammar really says
<list> ::= “[” “]” | “[” <number> “]”

Now your pseudocode should look something like this:

public boolean list() {
    if first token is “[” {
        if second token is “]” return true
        else if second token is a number {
            if third token is “]” return true
            else error
        }
        else put back first token
    }
    return false
}
Sequences, IV


Another possibility is to revise the grammar (but make sure the new grammar is equivalent to the old one!)
Old grammar:
<list> ::= “[” “]” | “[” <number> “]”
New grammar:
<list> ::= “[” <rest_of_list>
<rest_of_list> ::= “]” | <number> “]”
New pseudocode:

public boolean list() {
    if first token is “[” {
        if restOfList() return true
        else put back first token
    }
    return false
}

private boolean restOfList() {
    if first token is “]”, return true
    if first token is a number and
       second token is a “]”,
        return true
    else return false
}
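
In Java, the revised grammar might be sketched like this. The class name ListRecognizer, the whitespace-separated input, and the private next/pushBack cursor are assumptions for illustration. Note one deliberate difference from the pseudocode: rather than putting back the “[”, this sketch takes the "call it an error" route once a “[” has been seen, so a one-token pushBack is enough.

```java
// Hypothetical recognizer for:
//   <list>         ::= "[" <rest_of_list>
//   <rest_of_list> ::= "]" | <number> "]"
final class ListRecognizer {
    private final String[] words; // assumes whitespace-separated tokens
    private int position = 0;

    ListRecognizer(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    public boolean list() {
        if (!symbol("[")) return false; // not a list; the token was put back
        if (!restOfList()) error("Expected ']' or a number after '['");
        return true;
    }

    private boolean restOfList() {
        if (symbol("]")) return true;
        if (!number()) return false;
        if (!symbol("]")) error("Expected ']' after number");
        return true;
    }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    private boolean number() {
        if (next().matches("\\d+")) return true;
        pushBack();
        return false;
    }

    // Returns the next token, or "" once the input is used up.
    private String next() {
        String w = position < words.length ? words[position] : "";
        position++;
        return w;
    }

    private void pushBack() { position--; }

    private void error(String message) {
        throw new IllegalStateException(message);
    }
}
```

Here list() returns true for "[ ]" and "[ 5 ]", returns false for "5 ]" (leaving the 5 for some other rule), and throws for "[ 5 6 ]".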
Simple sequences in Java

Suppose you have this rule:

<factor> ::= ( <expression> )

A good way to do this is often to test whether the grammar rule is not met

public boolean factor() {
    if (symbol("(")) {
        if (!expression()) {
            error("Error in parenthesized expression");
        }
        if (!symbol(")")) {
            error("Unclosed parenthetical expression");
        }
        return true;
    }
    return false;
}
To do this, you need to be careful that the “(” is not the start of some other production
that can be used where a factor can be used

In other words, be sure that if you get a “(” it must start a factor


Also, error(String) must throw an Exception—why?
false vs. error

When should a method return false, and when should it report an error?
  false means that this method did not recognize its input
  Report an error if you know that something has gone wrong
    In other words, you know that no other method will recognize the input, either
public boolean ifStatement() {
    if you don’t see “if”, return false
        // could be some other kind of statement
    if you don’t see a condition, return an error
        // “if” is a keyword that must start an if statement
    ...
}
If you see if, and it isn’t followed by a condition, there is nothing else that it could be
This isn’t completely mechanical; you have to decide
Sequences and alternatives




Here’s the real grammar rule for <factor>:
<factor> ::= <name>
           | <number>
           | ( <expression> )
And here’s the actual code:
public boolean factor() {
    if (name()) return true;
    if (number()) return true;
    if (symbol("(")) {
        if (!expression()) error("Error in parenthesized expression");
        if (!symbol(")")) error("Unclosed parenthetical expression");
        return true;
    }
    return false;
}
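
To run factor() on real input, the sketch below supplies the missing pieces. Everything outside factor() itself is assumed for illustration: the class name FactorRecognizer, the whitespace-separated token cursor, the regular expressions used for names and numbers, and a stand-in expression() that accepts factors joined by “+”. It also shows one reason error(String) has to throw: if it merely printed a message and returned, control would fall through to return true.

```java
// Hypothetical harness around factor(); tokens are separated by whitespace.
final class FactorRecognizer {
    private final String[] words;
    private int position = 0;

    FactorRecognizer(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    // <factor> ::= <name> | <number> | ( <expression> )
    public boolean factor() {
        if (name()) return true;
        if (number()) return true;
        if (symbol("(")) {
            if (!expression()) error("Error in parenthesized expression");
            if (!symbol(")")) error("Unclosed parenthetical expression");
            return true;
        }
        return false;
    }

    // Assumed stand-in, not the real rule: factors joined by "+".
    public boolean expression() {
        if (!factor()) return false;
        while (symbol("+")) {
            if (!factor()) error("Error in expression after '+'");
        }
        return true;
    }

    private boolean name() { return matches("[A-Za-z_]\\w*"); }

    private boolean number() { return matches("\\d+"); }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    // Accepts the next token if it matches the pattern; otherwise puts it back.
    private boolean matches(String regex) {
        if (next().matches(regex)) return true;
        pushBack();
        return false;
    }

    // Returns the next token, or "" once the input is used up.
    private String next() {
        String w = position < words.length ? words[position] : "";
        position++;
        return w;
    }

    private void pushBack() { position--; }

    // Must throw: otherwise factor() would carry on and return true.
    private void error(String message) {
        throw new IllegalStateException(message);
    }
}
```

With this harness, "x", "42", and "( x + ( 3 + y ) )" are all recognized as factors, "+" is simply not a factor (false), and "( x" throws because the closing parenthesis is missing.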
Recursion, I

Here’s an unfortunate (but legal!) grammar rule:
<expression> ::= <term> | <expression> + <term>
Here’s some code for it:
public boolean expression() {
    if (term()) return true;
    if (!expression()) return false;
    if (!addOperator()) return false;
    if (!term()) error("Error in expression after '+' ");
    return true;
}
Do you see the problem?
Recursion, I

Here’s the rule again:
<expression> ::= <term> | <expression> + <term>
And the code:
public boolean expression() {
    if (term()) return true;
    if (!expression()) return false;
    if (!addOperator()) return false;
    if (!term()) error("Error in expression after '+' ");
    return true;
}
We aren’t recurring with a simpler case, therefore, we have an infinite recursion
Our grammar rule is left recursive (the recursive part is the leftmost thing in the definition)
Recursion, II

Here’s our unfortunate grammar rule again:
<expression> ::= <term> | <expression> + <term>
Here’s an equivalent, right recursive rule:
<expression> ::= <term> [ + <expression> ]
Here’s some (much happier!) code for it:
public boolean expression() {
    if (!term()) return false;
    if (!addOperator()) return true;
    if (!expression()) error("Error in expression after '+'");
    return true;
}
This works for the Recognizer, but will cause problems later
  We’ll cross that bridge when we come to it
Extended BNF—optional parts

Extended BNF uses brackets to indicate optional parts of rules


Example:
<if_statement> ::= if <condition> <statement> [ else <statement> ]
Pseudocode for this example:
public boolean ifStatement() {
    if you don’t see “if”, return false
    if you don’t see a condition, return an error
    if you don’t see a statement, return an error
    if you see an “else” {
        if you see a statement, return true
        else return an error
    }
    else return true;
}
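
The pseudocode translates almost line for line into Java. In the sketch below, the class name IfRecognizer, the token cursor, and especially the stand-in rules are assumptions made so the example can run: a <condition> is taken to be a single name, and a <statement> is either a nested if statement or a single name.

```java
// Hypothetical recognizer for:
//   <if_statement> ::= if <condition> <statement> [ else <statement> ]
final class IfRecognizer {
    private final String[] words; // assumes whitespace-separated tokens
    private int position = 0;

    IfRecognizer(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    public boolean ifStatement() {
        if (!keyword("if")) return false; // could be some other kind of statement
        if (!condition()) error("Expected a condition after 'if'");
        if (!statement()) error("Expected a statement after the condition");
        if (keyword("else")) { // the optional part
            if (!statement()) error("Expected a statement after 'else'");
        }
        return true;
    }

    // Assumed stand-in: a condition is just a name.
    private boolean condition() { return name(); }

    // Assumed stand-in: a statement is a nested if statement or a name.
    private boolean statement() { return ifStatement() || name(); }

    private boolean keyword(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    // A name is an identifier that is not one of the two keywords.
    private boolean name() {
        String w = next();
        if (w.matches("[A-Za-z_]\\w*") && !w.equals("if") && !w.equals("else")) {
            return true;
        }
        pushBack();
        return false;
    }

    // Returns the next token, or "" once the input is used up.
    private String next() {
        String w = position < words.length ? words[position] : "";
        position++;
        return w;
    }

    private void pushBack() { position--; }

    private void error(String message) {
        throw new IllegalStateException(message);
    }
}
```

Both "if a b" and "if a b else c" are recognized, "while a b" returns false because it does not start with if, and "if a b else" throws because the else has no statement after it.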
Extended BNF—zero or more

Extended BNF uses braces to indicate parts of a rule
that can be repeated



Example: <expression> ::= <term> { + <term> }
Pseudocode for this example:
public boolean expression() {
    if you don’t see a term, return false
    while you see a “+” {
        if you don’t see a term, return an error
    }
    return true
}
Back to parsers


A parser is like a recognizer
The difference is that, when a parser recognizes
something, it does something about it


Usually, what a parser does is build a tree
If the thing that is being parsed is a program, then

You can write another program that “walks” the tree and
executes the statements and expressions as it finds them


Such a program is called an interpreter
You can write another program that “walks” the tree and
produces code in some other language (usually assembly
language) that does the same thing

Such a program is called a compiler
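
As a small illustration of the step from recognizer to parser, the sketch below returns a tree node where a recognizer would return true, and null where it would return false. The Node and ExpressionParser classes, the rule <expression> ::= <term> { + <term> } with a <term> assumed to be a single number, and the evaluate method that “walks” the tree are all hypothetical choices for this example.

```java
// A tree node: a leaf holds a number, an inner node holds "+" and two children.
final class Node {
    final String value;
    final Node left;
    final Node right;

    Node(String value) { this(value, null, null); }

    Node(String value, Node left, Node right) {
        this.value = value;
        this.left = left;
        this.right = right;
    }

    @Override
    public String toString() {
        return left == null ? value : "(" + left + " " + value + " " + right + ")";
    }
}

final class ExpressionParser {
    private final String[] words; // assumes whitespace-separated tokens
    private int position = 0;

    ExpressionParser(String input) {
        String s = input.trim();
        words = s.isEmpty() ? new String[0] : s.split("\\s+");
    }

    // <expression> ::= <term> { + <term> }
    public Node expression() {
        Node tree = term();
        if (tree == null) return null; // not an expression
        while (symbol("+")) {
            Node right = term();
            if (right == null) error("Error in expression after '+'");
            tree = new Node("+", tree, right); // the tree grows to the left
        }
        return tree;
    }

    // Assumed stand-in: <term> ::= <number>
    private Node term() {
        String w = next();
        if (w.matches("\\d+")) return new Node(w);
        pushBack();
        return null;
    }

    private boolean symbol(String expected) {
        if (next().equals(expected)) return true;
        pushBack();
        return false;
    }

    // Returns the next token, or "" once the input is used up.
    private String next() {
        String w = position < words.length ? words[position] : "";
        position++;
        return w;
    }

    private void pushBack() { position--; }

    private void error(String message) {
        throw new IllegalStateException(message);
    }

    // A tiny interpreter: walks the tree and computes its value.
    static int evaluate(Node n) {
        if (n.left == null) return Integer.parseInt(n.value);
        return evaluate(n.left) + evaluate(n.right);
    }
}
```

Parsing "2 + 3 + 4" yields a tree that prints as ((2 + 3) + 4), and evaluate on that tree returns 9.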
Conclusions

If you start with a BNF definition of a language,
  You can write a recursive descent recognizer to tell you whether an input string “belongs to” that language (is a valid program in that language)
    Writing such a recognizer is a “cookbook” exercise: you just follow the recipe and it works (hopefully)
  You can write a recursive descent parser to create a parse tree representing the program
    The parse tree can later be used to execute the program
BNF is purely syntactic
  BNF tells you what is legal, and how things are put together
  BNF has nothing to say about what things actually mean
The End