Abstract Syntax and its Representation, Parsing, Unparsing

The first thing we must understand is BNF (Backus-Naur Form), which is a metalanguage: a language that describes other languages. Here is an example we will develop throughout this paper.

    1) <Aexpression> ::= ( + <Aexpression> <Aexpression> )
    2)               ::= <Texpression>
    3) <Texpression> ::= Symbol | Number

This BNF grammar consists of “productions”, or rules (numbered 1 to 3 above), that describe how to generate “valid” strings. The set of all possible valid strings is the language defined by the grammar. In BNF, expressions of the form <symbol> are called “non-terminal” symbols. Other symbols in the grammar are called “terminal” symbols. Starting with the symbol <Aexpression>, we can replace any nonterminal symbol with one of the expressions on the right of its “::=”. Replacements continue until all nonterminal symbols have been removed.

Note: Production 2 is equivalent to <Aexpression> ::= <Texpression>.

Here’s an example. Each line represents the application of a single production. The production number is listed to the right.

    <Aexpression>
    (+ <Aexpression> <Aexpression>)                         1
    (+ <Aexpression> (+ <Aexpression> <Aexpression>))       1
    (+ <Texpression> (+ <Aexpression> <Aexpression>))       2
    (+ <Texpression> (+ <Texpression> <Aexpression>))       2
    (+ <Texpression> (+ <Texpression> <Texpression>))       2
    (+ <Texpression> (+ <Texpression> x))                   3
    (+ 9 (+ <Texpression> x))                               3
    (+ 9 (+ 7 x))                                           3

Since the last line contains only terminal symbols, the expression “(+ 9 (+ 7 x))” is a valid sentence in the language defined by the grammar. The lines above represent a “derivation” of the final string. In general, there are many possible derivations for a given string, because the nonterminals can be replaced in different orders. Can you find derivations for each of the following strings?

    (+ 3 2)
    (+ (+ 2 3) (+ a b))
    (+ 9 (+ x y))

We can think of the BNF grammar as defining a concrete syntax, or external representation. The external representation is typically converted to an internal representation.
The internal representation is one that is more convenient for software manipulation. Notice that production 1 is recursive, since it describes an Aexpression in terms of itself. Scheme is particularly effective at representing such “inductive” structures with the define-datatype form. Here is the Scheme definition of the inductive structure for storing Aexpressions.

    (define symnum?
      (lambda (exp)
        (or (symbol? exp) (number? exp))))

    (define-datatype aexpression aexpression?
      (add-exp
        (exp1 aexpression?)
        (exp2 aexpression?))
      (t-exp
        (id symnum?)))

Since Texpressions are either Scheme symbols or numbers, we first define a predicate called “symnum?”. This predicate returns true if its argument is a symbol or a number. Next we define the “aexpression” structure using define-datatype. There are two variants:

    1) add-exp: this variant models production 1, and
    2) t-exp: this variant models production 3 (reached through production 2).

When defining the aexpression structure, we need one variant for each form an Aexpression can take: a sum built by production 1, or a Texpression reached through productions 2 and 3. By defining the aexpression datatype we have created an internal representation for the “external” strings in the language. Let's look at some examples.

Example 1

Consider the Aexpression string (+ 3 4). Let's derive that string using the productions above.

    <Aexpression>
    (+ <Aexpression> <Aexpression>)       1
    (+ <Texpression> <Aexpression>)       2
    (+ <Texpression> <Texpression>)       2
    (+ 3 <Texpression>)                   3
    (+ 3 4)                               3

We can derive an internal representation of the same string by mimicking the steps above with the Scheme structure.

    (aexpression)
    (add-exp aexpression aexpression)
    (add-exp (t-exp id) aexpression)
    (add-exp (t-exp id) (t-exp id))
    (add-exp (t-exp 3) (t-exp id))
    (add-exp (t-exp 3) (t-exp 4))

The process of converting the external representation of a string like (+ 3 4) to an internal representation like (add-exp (t-exp 3) (t-exp 4)) is called parsing.
The internal representation is easier to process with software, and can dispense with some of the syntactic sugar that may appear in the external representation. You can test the aexpression structure with the following code.

    (define a (t-exp 1))
    (define b (t-exp 2))
    (define c (t-exp 3))
    (define d (t-exp 4))
    (define e (add-exp a b))
    (define f (add-exp c d))
    (define g (add-exp e f))
    (aexpression? g)

The variable g references the expression below.

    1) (add-exp (add-exp (t-exp 1) (t-exp 2)) (add-exp (t-exp 3) (t-exp 4)))

Notice that g refers to the internal representation of the external string below.

    2) (+ (+ 1 2) (+ 3 4))

A parsing program could take 2) as input and produce 1). An unparsing program could take 1) as input and produce 2). We look at the unparser first, since the cases form makes it the easier of the two. Here is the code.

    (define unparse-aexpression
      (lambda (ex)
        (cases aexpression ex
          (add-exp (exp1 exp2)
            (list '+
                  (unparse-aexpression exp1)
                  (unparse-aexpression exp2)))
          (t-exp (id) id))))

The code above uses the cases form to process each variant record. When an expression like (add-exp (t-exp 3) (t-exp 4)) is encountered, the output is a list that starts with the symbol +, followed by the results of the recursive calls on exp1 and exp2. When an expression like (t-exp 3) is encountered, the function returns 3.

The parser function is listed below. The form it takes depends on the external syntax, which is more properly the subject of chapter 3. Here is the code for the parser.

    (define parse-aexpression
      (lambda (datum)
        (cond
          ((symnum? datum) (t-exp datum))
          ((and (pair? datum) (eqv? (car datum) '+))
           (add-exp (parse-aexpression (cadr datum))
                    (parse-aexpression (caddr datum))))
          (else (eopl:error 'parse-aexpression
                            "Invalid concrete syntax ~s" datum)))))

We can test the parsing and unparsing functions with the code below.
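
    (define a (t-exp 1))
    (define b (t-exp 2))
    (define c (t-exp 3))
    (define d (t-exp 4))
    (define e (add-exp a b))
    (define f (add-exp c d))
    (define g (add-exp e f))

    ; internal to external: yields the list (+ (+ 1 2) (+ 3 4))
    (unparse-aexpression g)

    (define h (quote (+ (+ 1 2) (+ 3 4))))

    ; external to internal and back: yields a list equal to h
    (unparse-aexpression (parse-aexpression h))

    ; internal to external and back: yields a structure equivalent to g
    (parse-aexpression (unparse-aexpression g))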
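
The last two test expressions suggest a round-trip property: parsing a valid external string and then unparsing the result should give back the original string. The sketch below checks that property directly. It assumes the code is run in Racket using its eopl language (hence the `#lang eopl` line), and it repeats the definitions from the text only so that the file runs on its own. The helper `round-trips?` is a hypothetical name introduced here for illustration; it is not part of the original code.

```scheme
#lang eopl
;; Assumption: Racket's eopl language, which supplies define-datatype,
;; cases, and eopl:error. Definitions are repeated from the text so
;; that this file is self-contained.

(define symnum?
  (lambda (exp)
    (or (symbol? exp) (number? exp))))

(define-datatype aexpression aexpression?
  (add-exp (exp1 aexpression?) (exp2 aexpression?))
  (t-exp (id symnum?)))

;; External (list) representation -> internal (datatype) representation.
(define parse-aexpression
  (lambda (datum)
    (cond
      ((symnum? datum) (t-exp datum))
      ((and (pair? datum) (eqv? (car datum) '+))
       (add-exp (parse-aexpression (cadr datum))
                (parse-aexpression (caddr datum))))
      (else (eopl:error 'parse-aexpression
                        "Invalid concrete syntax ~s" datum)))))

;; Internal (datatype) representation -> external (list) representation.
(define unparse-aexpression
  (lambda (ex)
    (cases aexpression ex
      (add-exp (exp1 exp2)
        (list '+
              (unparse-aexpression exp1)
              (unparse-aexpression exp2)))
      (t-exp (id) id))))

;; Hypothetical helper (not in the original): true when parsing and then
;; unparsing a datum gives back a list equal to the datum itself.
(define round-trips?
  (lambda (datum)
    (equal? (unparse-aexpression (parse-aexpression datum)) datum)))

(display (round-trips? '(+ (+ 1 2) (+ 3 4))))  ; displays #t
(newline)
(display (round-trips? '(+ 9 (+ 7 x))))        ; displays #t
(newline)
```

The comparison is made on the external lists with equal?, so the check does not depend on how a particular Scheme system compares or prints datatype records.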