Abstract Syntax and its Representation

Abstract Syntax and its Representation, Parsing, Unparsing
The first thing we must understand is BNF, Backus Naur Form, which is a
metalanguage – a language that describes other languages. Here is an example we will
develop throughout this paper.
1) <Aexpression> ::= ( + <Aexpression> <Aexpression> )
2)               ::= <Texpression>
3) <Texpression> ::= Symbol | Number
This BNF grammar consists of three numbered “productions”, or rules, that describe how to
generate strings that are “valid”. The set of all possible valid strings is the language
defined by the grammar. In BNF, expressions of the form <symbol> are called “non-terminal”
symbols. Other symbols in the grammar are called “terminal” symbols; here Symbol and Number
stand for any Scheme symbol and any number. Starting with the symbol <Aexpression>, we can
replace any non-terminal symbol with one of the expressions on the right of its “::=”.
Replacements continue until all non-terminal symbols have been removed.
Note: Production 2 is shorthand for <Aexpression> ::= <Texpression>.
Here’s an example. Each line represents the application of a single production. The
production number is listed to the right.
<Aexpression>
(+ <Aexpression> <Aexpression>)                      1
(+ <Aexpression> (+ <Aexpression> <Aexpression>))    1
(+ <Texpression> (+ <Aexpression> <Aexpression>))    2
(+ <Texpression> (+ <Texpression> <Aexpression>))    2
(+ <Texpression> (+ <Texpression> <Texpression>))    2
(+ <Texpression> (+ <Texpression> x))                3
(+ 9 (+ <Texpression> x))                            3
(+ 9 (+ 7 x))                                        3
Since the last line contains only terminal symbols, the expression
“(+ 9 (+ 7 x))” is a valid sentence in the language defined by the grammar. The lines
above represent a “derivation” of the final string. In general, there are many possible
derivations for a given string.
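To see that a string can have more than one derivation, the sketch below builds a leftmost derivation mechanically: it always rewrites the leftmost non-terminal first. It is only an illustration in portable Scheme. The names leftmost-steps and leftmost-derivation, and the use of the symbols <A> and <T> to stand for <Aexpression> and <Texpression>, are assumptions of this sketch, and the input is assumed to be a valid sentence already.

```scheme
;; Non-terminals are written as the symbols <A> and <T> (an assumption of this sketch).
;; (leftmost-steps target wrap) returns the sentential forms produced while
;; deriving target from a single <A>. wrap embeds a sub-form in the whole string.
;; The target is assumed to be a valid sentence; nothing is checked here.
(define (leftmost-steps target wrap)
  (if (pair? target)
      (let ((left (cadr target))
            (right (caddr target)))
        (append
         ;; production 1 rewrites <A> as (+ <A> <A>)
         (list (wrap (list '+ '<A> '<A>)))
         ;; finish the left operand while the right one is still <A>
         (leftmost-steps left (lambda (form) (wrap (list '+ form '<A>))))
         ;; then rewrite the right operand next to the finished left one
         (leftmost-steps right (lambda (form) (wrap (list '+ left form))))))
      ;; productions 2 and 3: <A> becomes <T>, then the terminal itself
      (list (wrap '<T>) (wrap target))))

;; The full derivation starts from the single non-terminal <A>.
(define (leftmost-derivation target)
  (cons '<A> (leftmost-steps target (lambda (form) form))))

(leftmost-derivation '(+ 9 (+ 7 x)))
;; => (<A>
;;     (+ <A> <A>)
;;     (+ <T> <A>)
;;     (+ 9 <A>)
;;     (+ 9 (+ <A> <A>))
;;     (+ 9 (+ <T> <A>))
;;     (+ 9 (+ 7 <A>))
;;     (+ 9 (+ 7 <T>))
;;     (+ 9 (+ 7 x)))
```

This derivation also takes eight production steps and reaches the same final string, but it applies them in a different order from the derivation listed above.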
Can you find derivations for each of the following strings?
(+ 3 2)
(+ (+ 2 3)(+ a b))
(+ 9 (+ x y))
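Before deriving a string by hand, it can help to check mechanically that it belongs to the language at all. The following is a minimal sketch of such a check in portable Scheme, reading each string as a quoted Scheme datum. The names terminal? and sentence? are hypothetical helpers introduced for this illustration.

```scheme
;; Hypothetical helper: true when datum is a Symbol or a Number,
;; i.e. something production 3 can generate.
(define (terminal? datum)
  (or (symbol? datum) (number? datum)))

;; True when datum is a sentence of the grammar, read as a Scheme datum.
(define (sentence? datum)
  (cond
    ;; productions 2 and 3: <Aexpression> -> <Texpression> -> Symbol | Number
    ((terminal? datum) #t)
    ;; production 1: ( + <Aexpression> <Aexpression> )
    ((and (list? datum)
          (= (length datum) 3)
          (eqv? (car datum) '+))
     (and (sentence? (cadr datum))
          (sentence? (caddr datum))))
    (else #f)))

(sentence? '(+ 9 (+ 7 x)))   ; => #t
(sentence? '(+ 9))           ; => #f, production 1 needs two operands
```

The two cond clauses mirror the productions directly, so a string is accepted exactly when some derivation for it exists.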
We can think of the BNF grammar as defining a concrete syntax, or external
representation. The external representation is typically converted to an internal
representation, one that is more convenient for software manipulation. Notice that
production 1 for <Aexpression> is recursive, since it describes an Aexpression in terms
of itself. Scheme is particularly effective at representing such “inductive” structures
by using the define-datatype form. Here is the Scheme definition of the inductive
structure for storing Aexpressions.
(define symnum?
  (lambda (exp)
    (or (symbol? exp) (number? exp))))

(define-datatype aexpression aexpression?
  (add-exp
    (exp1 aexpression?)
    (exp2 aexpression?))
  (t-exp
    (id symnum?)))
Since Texpressions are either Scheme symbols or numbers, we first define a predicate
called “symnum?”. This predicate returns true if its argument is a symbol or a
number. Next we define the “aexpression” structure using define-datatype. There are
two variants: 1) add-exp, which models production 1, and 2) t-exp, which models
production 3. When defining the Aexpression structure we define a variant for each
non-terminal symbol in the list of productions: add-exp for <Aexpression> and t-exp for
<Texpression>. Production 2 needs no variant of its own, since it only says that every
Texpression is also an Aexpression. By defining the aexpression datatype we have created
an internal representation for the “external” strings in the language. Let’s look at
some examples.
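To make concrete what the datatype definition provides, here is a rough hand-written stand-in in portable Scheme: two constructors that check their fields and a predicate that recognizes well-formed values. It reuses the names add-exp, t-exp and aexpression? so the examples can be tried without the EOPL library. The tagged-list layout is an assumption of this sketch and is not how define-datatype actually stores its records.

```scheme
(define (symnum? x)
  (or (symbol? x) (number? x)))

;; Each value is a tagged list: the variant name followed by its fields.
(define (aexpression? x)
  (and (list? x)
       (pair? x)
       (cond
         ;; add-exp carries two fields, both aexpressions
         ((eqv? (car x) 'add-exp)
          (and (= (length x) 3)
               (aexpression? (cadr x))
               (aexpression? (caddr x))))
         ;; t-exp carries one field, a symbol or a number
         ((eqv? (car x) 't-exp)
          (and (= (length x) 2)
               (symnum? (cadr x))))
         (else #f))))

;; Constructors refuse fields that fail the declared predicates.
(define (add-exp exp1 exp2)
  (if (and (aexpression? exp1) (aexpression? exp2))
      (list 'add-exp exp1 exp2)
      (error "add-exp: fields must satisfy aexpression?" exp1 exp2)))

(define (t-exp id)
  (if (symnum? id)
      (list 't-exp id)
      (error "t-exp: field must satisfy symnum?" id)))

(aexpression? (add-exp (t-exp 3) (t-exp 4)))   ; => #t
(aexpression? '(+ 3 4))                        ; => #f, this is external syntax
```

The last line shows the distinction the text draws: the list (+ 3 4) is a sentence of the external language, but it is not a value of the internal datatype until it has been converted.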
Example 1
Consider the Aexpression string (+ 3 4). Let’s derive that string using the
productions above.
<Aexpression>
(+ <Aexpression> <Aexpression>)    1
(+ <Texpression> <Aexpression>)    2
(+ <Texpression> <Texpression>)    2
(+ 3 <Texpression>)                3
(+ 3 4)                            3
We can derive an internal representation of the same string by mimicking the steps above
with the Scheme structure.
(aexpression)
(add-exp aexpression aexpression)
(add-exp (t-exp id) aexpression)
(add-exp (t-exp id) (t-exp id))
(add-exp (t-exp 3) (t-exp id))
(add-exp (t-exp 3) (t-exp 4))
The process of converting the external representation of a string like (+ 3 4 ) to an
internal representation like (add-exp (t-exp 3) (t-exp 4)) is called parsing. The internal
representation is easier to process with software, and can dispense with some of the
syntactic sugar that may appear in the external representation.
You can test the aexpression structure with the following code.
(define a (t-exp 1))
(define b (t-exp 2))
(define c (t-exp 3))
(define d (t-exp 4))
(define e (add-exp a b))
(define f (add-exp c d))
(define g (add-exp e f))
(aexpression? g)
The variable g references the expression below.
1) (add-exp (add-exp (t-exp 1) (t-exp 2)) (add-exp (t-exp 3) (t-exp 4)))
Notice that g refers to the internal representation of the external string below.
2) (+ (+ 1 2) (+ 3 4))
A parsing program could take 2) as input and produce 1). An unparsing program could
take 1) as input and produce 2). We look at the unparser first, since the cases form
makes it the easier of the two. Here is the code.
(define unparse-aexpression
  (lambda (ex)
    (cases aexpression ex
      (add-exp (exp1 exp2)
        (list '+ (unparse-aexpression exp1)
                 (unparse-aexpression exp2)))
      (t-exp (id) id))))
The code above uses the cases form to process each variant record.
When an expression like (add-exp (t-exp 3) (t-exp 4)) is encountered, the
output is a list which starts with the symbol +, followed by the results of
the recursive calls on exp1 and exp2. When an expression like (t-exp 3) is
encountered, the function returns 3.
The parser function is listed below. The form it takes is dependent on the external
syntax. This is more properly the subject of chapter 3. Here is the code for the parser.
(define parse-aexpression
  (lambda (datum)
    (cond
      ;; a bare symbol or number is a Texpression
      ((symnum? datum) (t-exp datum))
      ;; a list that starts with + is a sum of two sub-expressions
      ((and (pair? datum) (eqv? (car datum) '+))
       (add-exp (parse-aexpression (cadr datum))
                (parse-aexpression (caddr datum))))
      (else (eopl:error 'parse-aexpression
                        "Invalid concrete syntax ~s" datum)))))
We can test the parsing and unparsing functions with the code below.
(define a (t-exp 1))
(define b (t-exp 2))
(define c (t-exp 3))
(define d (t-exp 4))
(define e (add-exp a b))
(define f (add-exp c d))
(define g (add-exp e f))
(unparse-aexpression g)
(define h (quote (+ (+ 1 2) (+ 3 4))))
(unparse-aexpression (parse-aexpression h))
(parse-aexpression (unparse-aexpression g))
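The last two test expressions suggest that parsing and unparsing undo each other. The sketch below checks that round trip with equal? in portable Scheme, so it can be run without the EOPL library. Plain tagged lists stand in for the define-datatype records, and the short names parse and unparse are hypothetical; both are assumptions of this sketch, not the code above. This version also insists that a sum has exactly two operands.

```scheme
;; Tagged lists stand in for the define-datatype records (an assumption of this sketch).
(define (parse datum)
  (cond
    ;; a bare symbol or number becomes a t-exp
    ((or (symbol? datum) (number? datum)) (list 't-exp datum))
    ;; (+ e1 e2) becomes an add-exp of the two parsed operands
    ((and (list? datum) (= (length datum) 3) (eqv? (car datum) '+))
     (list 'add-exp (parse (cadr datum)) (parse (caddr datum))))
    (else (error "Invalid concrete syntax" datum))))

(define (unparse ex)
  (if (eqv? (car ex) 'add-exp)
      ;; rebuild the external list from the two unparsed fields
      (list '+ (unparse (cadr ex)) (unparse (caddr ex)))
      ;; a t-exp unparses to the symbol or number it holds
      (cadr ex)))

(define h '(+ (+ 1 2) (+ 3 4)))
(define g (parse h))

(equal? (unparse (parse h)) h)   ; => #t, external -> internal -> external
(equal? (parse (unparse g)) g)   ; => #t, internal -> external -> internal
```

Comparing with equal? turns the visual check into a yes/no answer, which is convenient when trying further test strings.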