CSC338 chap02

1
Study of a Simple Compiler
In this chapter we study a simple compiler and the steps
involved in building one. The chapter serves as an
introduction to the rest of the course.
2
Arithmetic expression processing
using the stack
The stack operations are:
• Push (x) : puts the value of x on the top of the
stack
• Pop () : removes and returns the value on the top of the stack.
Before using the stack for arithmetic expression
processing we have to translate the expression
from Infix form to postfix form.
3
Examples of expression translation
Infix          Postfix
1+5            15+
1+5*2          152*+
(1+5) * 2      15+2*
9-5+2          95-2+
4
Processing of expression
To process an arithmetic expression (in postfix form) using
the stack we follow these steps:
1) Read the expression from left to right.
2) When getting a number, put it on the top of the
stack (using push).
3) When getting an operation:
- Get the first number from the top of the stack (using pop);
  this is the right operand.
- Get the second number from the top of the stack (using
  pop); this is the left operand.
- Apply the operation to the second number and the first
  number, in that order.
- Put the result on the top of the stack (using push).
5
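If we process the following expression:
Infix: 1+5*2
Translation (postfix): 152*+
the stack changes as follows (top of the stack listed first):
Instructions                             Stack after
push 1                                   1
push 5                                   5 1
push 2                                   2 5 1
pop r1, pop r2, mult r2,r1, push r2      10 1
pop r1, pop r2, add r2,r1, push r2       11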
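The stack procedure described above can be sketched in C as follows. This is only a minimal illustration, not code from the course: the names eval_postfix and STACK_MAX are hypothetical, operands are assumed to be single digits, and the minus sign is assumed to be the ASCII character '-'.

```c
#include <ctype.h>

#define STACK_MAX 64   /* hypothetical limit on the stack depth */

/* Evaluate a postfix string of single-digit operands and the
   operators + - * /. Returns 0 and stores the value in *result
   on success; returns -1 on malformed input. */
int eval_postfix(const char *s, int *result)
{
    int stack[STACK_MAX];
    int top = 0;                         /* number of values on the stack */

    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s))
            continue;                    /* ignore blanks */
        if (isdigit((unsigned char)*s)) {
            if (top == STACK_MAX)
                return -1;               /* stack is full */
            stack[top++] = *s - '0';     /* push the number */
        } else {
            int r1, r2;
            if (top < 2)
                return -1;               /* not enough operands */
            r1 = stack[--top];           /* pop: right operand */
            r2 = stack[--top];           /* pop: left operand */
            switch (*s) {
            case '+': r2 = r2 + r1; break;
            case '-': r2 = r2 - r1; break;
            case '*': r2 = r2 * r1; break;
            case '/':
                if (r1 == 0)
                    return -1;           /* division by zero */
                r2 = r2 / r1;
                break;
            default:
                return -1;               /* unknown character */
            }
            stack[top++] = r2;           /* push the result */
        }
    }
    if (top != 1)
        return -1;                       /* operands left over */
    *result = stack[0];
    return 0;
}
```

Calling eval_postfix("152*+", &v) follows the same push and pop sequence as the trace and leaves 11 in v. Popping the right operand first matters for the non-commutative operators - and /.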
6
Exercise
1) Process the other expressions in the above table (page 3) using
the stack.
2) Complete the following table.
Infix              Postfix
1-5
1+5-2
9 - 3 / (1+2)
(9-3)/1+2
7
Simple compiler structure
Character stream (infix representation)
        ↓
Lexical analyzer
        ↓
Token stream
        ↓
Syntax-directed translator
        ↓
Intermediate representation (postfix representation)
8
Grammar
A grammar (context-free grammar, CFG) consists of:
1) A set of tokens (called terminal symbols)
2) A set of non-terminals
3) A set of rules (productions), each of which has
- a left part (a non-terminal)
- an arrow
- a right part (a sequence (string) of tokens and/or
  non-terminal symbols)
4) A start symbol (one of the non-terminal symbols)
9
1) Example 1:
list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
This may be written as follows:
list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
10
- Terminal symbols (tokens)
+ - 0 1 2 3 4 5 6 7 8 9
- Non-terminals
digit, list
- Start symbol
list
• String of tokens: a sequence of zero or more
tokens (terminal symbols). A string with zero
tokens is called the empty string and is written ε.
• The token strings that can be derived from the
start symbol of a grammar form the language
defined by that grammar.
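To make the idea of "the language of a grammar" concrete, here is a small sketch of a recognizer for the grammar of Example 1. The function name in_list_language is hypothetical. It relies on the observation that the strings derivable from list are exactly a digit followed by zero or more groups of a sign and a digit, and it assumes the ASCII characters '+' and '-'.

```c
#include <ctype.h>

/* Return 1 if s is a string of the language generated by
     list  -> list + digit | list - digit | digit
     digit -> 0 | 1 | ... | 9
   and 0 otherwise. */
int in_list_language(const char *s)
{
    if (!isdigit((unsigned char)*s))
        return 0;                  /* must start with a digit */
    s++;
    while (*s == '+' || *s == '-') {
        s++;                       /* consume the sign */
        if (!isdigit((unsigned char)*s))
            return 0;              /* a sign must be followed by a digit */
        s++;                       /* consume the digit */
    }
    return *s == '\0';             /* nothing may be left over */
}
```

With this sketch, in_list_language("9-5+2") returns 1, while the empty string, "9-" and "95" are rejected because none of them can be derived from list.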
11
Exercise
Example 2) Consider the following grammar:
block → begin compound_stmts end
compound_stmts → stmt_list | ε
stmt_list → stmt_list ; stmt | stmt
stmt → a | c | b
1. Determine the non-terminal symbols and the
terminal symbols of this grammar.
2. Determine the start symbol.
3. Give three token strings derived from this
grammar.
12
Parse Tree
• Shows how the start symbol of a grammar can derive
a string in the language
• A tree with the following properties:
1- the root is the start symbol
2- each internal node is a Non-terminal
3- each leaf is a Token or ε.
4- If A is the label for an interior node, and
X1, X2, …, Xn (nonterminals or tokens) are the labels of
its children, then the following production must exist:
A → X1 X2 … Xn

         A
      /  |   \
    X1   X2 … Xn
13
Example
S → S S + | S S * | a
1) Derive the following string: aa+a*
S ⇒ SS* ⇒ Sa* ⇒ SS+a* ⇒ Sa+a* ⇒ aa+a*
Productions used at each step, in order:
S → S S *
S → a
S → S S +
S → a
S → a
14
2) Draw the parse tree of the derivation:
S ⇒ SS* ⇒ Sa* ⇒ SS+a* ⇒ Sa+a* ⇒ aa+a*

              S
          /   |   \
        S     S     *
      / | \   |
     S  S  +  a
     |  |
     a  a
15
Ambiguous Grammars
• If any string has more than one parse tree, grammar is said
to be ambiguous
• Ambiguity must be avoided for compilation, since an
ambiguous string can have more than one meaning
• List of digits separated by plus or minus signs:
string → string + string | string – string |0 |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Example merges notion of digit and list into single
nonterminal string
• Same strings are derivable, but some strings have multiple
parse trees (possible meanings)
16
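Two Parse Trees: 9 – 5 + 2
One tree groups the string as (9 – 5) + 2, which has the value 6;
the other groups it as 9 – (5 + 2), which has the value 2.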
17
Precedence and Associativity
• Precedence
– Determines the order in which different operators are evaluated
when they occur in the same expression
– Operators of higher precedence are applied before operators of
lower precedence
• Associativity
– Determines the order in which operators of equal precedence are
evaluated when they occur in the same expression
– Most operators have a left-to-right associativity, but some have
right-to-left associativity
18
Precedence and Associativity
Example: Arithmetic Expression
We start with the lowest level in the grammar (highest priority):
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Then the next level up (lower priority):
factor → digit | (expr)
Then the next level up (lower priority):
term → term * factor | term / factor | factor
Then the highest level (lowest priority):
expr → expr + term | expr - term | term
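One way to see how this layered grammar encodes precedence and left associativity is to write one C function per nonterminal and let each function compute a value. The sketch below is only an illustration under stated assumptions: it evaluates instead of translating, it replaces the left-recursive productions by loops (which keeps operators left-associative), and the names eval_infix, p and ok are hypothetical.

```c
#include <ctype.h>

static const char *p;   /* next unread character of the input */
static int ok;          /* cleared when a syntax error is found */

static int expr(void);

/* factor -> digit | ( expr ) */
static int factor(void)
{
    if (isdigit((unsigned char)*p))
        return *p++ - '0';
    if (*p == '(') {
        int v;
        p++;                         /* consume '(' */
        v = expr();
        if (*p == ')')
            p++;                     /* consume ')' */
        else
            ok = 0;
        return v;
    }
    ok = 0;                          /* neither a digit nor '(' */
    return 0;
}

/* term -> term * factor | term / factor | factor */
static int term(void)
{
    int v = factor();
    while (ok && (*p == '*' || *p == '/')) {
        char op = *p++;
        int rhs = factor();
        if (op == '*')
            v = v * rhs;
        else if (rhs != 0)
            v = v / rhs;
        else
            ok = 0;                  /* division by zero */
    }
    return v;
}

/* expr -> expr + term | expr - term | term */
static int expr(void)
{
    int v = term();
    while (ok && (*p == '+' || *p == '-')) {
        char op = *p++;
        int rhs = term();
        v = (op == '+') ? v + rhs : v - rhs;
    }
    return v;
}

/* Returns 0 and stores the value in *result, or -1 on a syntax error. */
int eval_infix(const char *s, int *result)
{
    p = s;
    ok = 1;
    *result = expr();
    return (ok && *p == '\0') ? 0 : -1;
}
```

Because expr only ever calls term for its operands, and term only calls factor, a * or / is always grouped before a neighbouring + or -. For "1+5*2" the sketch yields 11, and for "9-5+2" it yields 6, the left-associative reading.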
19
Postfix Notation
• Formal rules, infix → postfix
– If E is variable or constant, E → E
– If E is expression of form E1 op E2, where op is binary
operator, E1 → E1’, and E2 → E2’, then E → E1’ E2’ op
– If E is expression of form (E1) and E1 → E1’, then E → E1’
• Parentheses are not needed!
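The three rules above can be read as a recursive procedure over an expression tree. The following sketch assumes a hypothetical struct node in which a leaf holds a single operand character and an interior node holds a binary operator; neither the type nor the function to_postfix comes from the slides. Parentheses need no node of their own, because the grouping is already captured by the shape of the tree.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical expression tree: op == 0 marks a leaf. */
struct node {
    char op;                        /* binary operator, or 0 for a leaf */
    char leaf;                      /* operand character when op == 0 */
    const struct node *left, *right;
};

/* Append the postfix form of e to out. out must already hold a
   NUL-terminated string and have room for the appended characters. */
void to_postfix(const struct node *e, char *out)
{
    size_t n;

    if (e->op == 0) {               /* rule 1: variable or constant */
        n = strlen(out);
        out[n] = e->leaf;
        out[n + 1] = '\0';
        return;
    }
    to_postfix(e->left, out);       /* rule 2: E1' ... */
    to_postfix(e->right, out);      /* ... E2' ... */
    n = strlen(out);
    out[n] = e->op;                 /* ... then op */
    out[n + 1] = '\0';
}
```

For the tree of (1+5) * 2, whose root is * with the + subtree as its left child, to_postfix appends 15+2* to the buffer.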
20
Translation Schemes
• A CFG extended with program fragments
• The fragments, called “semantic actions”, are embedded
within the right sides of productions
Example Translation Scheme
expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }
21
Equivalent Translation Scheme
expr → term rest
rest → + term { print('+') } rest
rest → - term { print('-') } rest
rest → ε
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }
22
Parsing
• Parsing is the process of determining if a string of
tokens can be generated by a grammar
23
Top-down Parsing
• Recursively apply the following steps:
– At node n with nonterminal A, select a production for A
– Construct children at n for symbols on right side of selected
production
– Find next node for which subtree needs to be constructed
• Top-down parsing uses a “lookahead” symbol
• Selecting production may involve trial-and-error and
backtracking
24
Predictive Parsing
• Recursive-descent parsing is a recursive, top-down
approach to parsing
• A procedure is associated with each nonterminal
of the grammar
• Predictive parsing
– Special case of recursive-descent parsing
– The lookahead symbol unambiguously determines the
procedure for each nonterminal
25
Procedures for Nonterminals
• Production with right side α used if lookahead is in
FIRST(α)
– FIRST(α) is the set of all tokens that can appear as the first
symbol of a string derived from α
– If lookahead symbol is not in the FIRST set of any production, can
use the production with right side ε, if there is one
– If two or more productions qualify, this method cannot be used
– If no production qualifies, an error is declared
• Nonterminals on right side of selected production are
recursively expanded
26
Left Recursion
• Left-recursive productions can cause recursive-descent
parsers to loop forever
• Example: expr → expr + term
• Can eliminate left recursion: the productions
A → A α | β
can be rewritten as
A → β R
R → α R | ε
27
Eliminating Left Recursion
Before (left-recursive):
expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }

After eliminating left recursion:
expr → term rest
rest → + term { print('+') } rest
rest → - term { print('-') } rest
rest → ε
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }
28
Infix to Postfix Code: Part 1
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <ctype.h>

int lookahead;   /* current input character */

void expr(void);
void rest(void);
void term(void);
void match(int);
void error(void);

int main(void)
{
    lookahead = getchar();
    expr();
    putchar('\n'); /* adds trailing newline character */
    return 0;
}
…
29
Infix to Postfix Code: Part 2
…
void expr(void)
{
    term();
    rest();
}

void term(void)
{
    if (isdigit(lookahead)) {
        putchar(lookahead);   /* a digit is its own translation */
        match(lookahead);
    }
    else
        error();
}
…
30
Infix to Postfix Code: Part 3
…
void rest(void)
{
    if (lookahead == '+') {
        match('+');
        term();
        putchar('+');
        rest();
    }
    else if (lookahead == '-') {
        match('-');
        term();
        putchar('-');
        rest();
    }
    /* otherwise rest → ε: do nothing */
}
…
31
Infix to Postfix Code: Part 4
…
void match(int t)
{
    if (lookahead == t)
        lookahead = getchar();
    else
        error();
}

void error(void)
{
    printf("syntax error\n"); /* print error message */
    exit(1); /* then halt */
}
32
Code Optimization 1
void rest(void)
{
REST:
    if (lookahead == '+') {
        match('+');
        term();
        putchar('+');
        goto REST;   /* replaces the tail-recursive call rest() */
    }
    else if (lookahead == '-') {
        match('-');
        term();
        putchar('-');
        goto REST;
    }
}
33
Code Optimization 2
void expr(void)
{
    term();
    while (1) {
        if (lookahead == '+') {
            match('+');
            term();
            putchar('+');
        }
        else if (lookahead == '-') {
            match('-');
            term();
            putchar('-');
        }
        else
            break;
    }
}
34
Improvements Remaining
• Want to ignore whitespace
• Allow numbers
• Allow identifiers
• Allow additional operators (multiplication and
division)
• Allow multiple expressions (separated by
semicolons)
35
Lexical Analyzer
• Eliminates whitespace (and comments)
• Reads numbers (not just single digits)
• Reads identifiers and keywords
36
Implementing the Lexical Analyzer
37
Allowable Tokens
• expected tokens: +, -, *, /, DIV, MOD, (, ), ID,
NUM, DONE
• ID represents an identifier, NUM represents a
number, DONE represents EOF
38
Tokens and Attributes
LEXEME                       TOKEN            ATTRIBUTE VALUE
white space                  ---              ---
sequence of digits           NUM              numeric value of sequence
div                          DIV              ---
mod                          MOD              ---
letter followed by letters   ID               index into symbol table
and digits
EOF                          DONE             ---
any other character          that character   NONE
39
A Simple Symbol Table
• Each record of symbol table contains a token type and a
string (lexeme or keyword)
• Symbol table has fixed size
• All lexemes in array of fixed size
• Will be able to insert and search for tokens:
– insert(s, t): creates entry with string s and token t, returns
index into symbol table
– lookup(s): searches for entry with string s, returns index if
found, 0 otherwise
• Keywords (div and mod) will be inserted into the symbol
table, so they cannot be used as identifiers
40
Updated Translation Scheme
start → list eof
list → expr ; list | ε
expr → expr + term { print('+') }
     | expr - term { print('-') }
     | term
term → term * factor { print('*') }
     | term / factor { print('/') }
     | term div factor { print('DIV') }
     | term mod factor { print('MOD') }
     | factor
factor → ( expr )
     | id { print(id.lexeme) }
     | num { print(num.value) }
41
After Eliminating Left Recursion
start → list eof
list → expr ; list | ε
expr → term moreterms
moreterms → + term { print('+') } moreterms
     | - term { print('-') } moreterms
     | ε
term → factor morefactors
morefactors → * factor { print('*') } morefactors
     | / factor { print('/') } morefactors
     | div factor { print('DIV') } morefactors
     | mod factor { print('MOD') } morefactors
     | ε
factor → ( expr )
     | id { print(id.lexeme) }
     | num { print(num.value) }
42
Final Code
• About 250 lines of C
• Pretty sloppy, otherwise would be longer
43
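********** global.h *************
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define BSIZE 128    /* buffer size */
#define NONE  -1
#define EOS   '\0'

#define NUM   256
#define DIV   257
#define MOD   258
#define ID    259
#define DONE  260

extern int tokenval;   /* value of token attribute (defined in lexer.c) */
extern int lineno;     /* current line number (defined in lexer.c) */

struct entry {         /* form of a symbol table entry */
    char *lexptr;
    int token;
};

extern struct entry symtable[];   /* symbol table (defined in symbol.c) */

/* functions shared between the source files */
int lexan(void);
int lookup(char s[]);
int insert(char s[], int tok);
void init(void);
void parse(void);
void emit(int t, int tval);
void error(char *m);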
44
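********** Init.c *************
#include "global.h"

struct entry keywords[] = {
    { "div", DIV },
    { "mod", MOD },
    { 0, 0 }
};

void init(void)   /* loads keywords into symtable */
{
    struct entry *p;
    for (p = keywords; p->token; p++)
        insert(p->lexptr, p->token);
}

[Figure: symbol table and lexemes array]
Array symtable
  lexptr               token
  points to "div"      DIV
  points to "mod"      MOD
  points to "count"    ID
  points to "i"        ID
Array lexemes
  d i v EOS m o d EOS c o u n t EOS i EOS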
45
The lexical analyzer:
- Calls the lookup function to search for a symbol in the
symbol table.
- Calls the insert function to add a symbol to the symbol
table.
- Adds 1 to the line counter when an end-of-line
character is found.
46
********** symbol.c *************
#include "global.h"

#define STRMAX 999   /* size of lexemes array */
#define SYMMAX 100   /* size of symtable */

char lexemes[STRMAX];
int lastchar = -1;               /* last used position in lexemes */
struct entry symtable[SYMMAX];
int lastentry = 0;               /* last used position in symtable */

int lookup(char s[])   /* returns position of entry for s, or 0 */
{
    int p;
    for (p = lastentry; p > 0; p = p-1)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

int insert(char s[], int tok)   /* returns position of entry for s */
{
    int len;
    len = strlen(s);
    if (lastentry + 1 >= SYMMAX)
        error("symbol table full");
    if (lastchar + len + 1 >= STRMAX)
        error("lexemes array full");
    lastentry = lastentry + 1;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar = lastchar + len + 1;
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
47
********** lexer.c *************
#include "global.h"

char lexbuf[BSIZE];
int lineno = 1;
int tokenval = NONE;

int lexan()
{
    int t;
    while(1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;   /* strip out white space */
        else if (t == '\n')
            lineno = lineno + 1;
        else if (isdigit (t)) {
            ungetc(t, stdin);
            scanf("%d", &tokenval);
            return NUM;
        }
        else if (isalpha(t)) {
            int p, b = 0;
            while (isalnum(t)) {
                lexbuf[b] = t;
                t = getchar();
                b = b + 1;
                if (b >= BSIZE)
                    error("compiler error");
            }
            lexbuf[b] = EOS;
            if (t != EOF)
                ungetc(t, stdin);
            p = lookup(lexbuf);
            if (p == 0)
                p = insert(lexbuf, ID);
            tokenval = p;
            return symtable[p].token;
        }
        else if (t == EOF)
            return DONE;
        else {
            tokenval = NONE;
            return t;
        }
    }
}
48
********** emitter.c *************
#include "global.h"
void emit(int t, int tval)   /* generates output */
{
    switch(t) {
    case '+': case '-': case '*': case '/':
        printf("%c", t);
        break;
    case DIV:
        printf(" DIV ");
        break;
    case MOD:
        printf(" MOD ");
        break;
    case NUM:
        printf("%d", tval);
        break;
    case ID:
        printf(" %s ", symtable[tval].lexptr);
        break;
    default:
        printf("token %d, tokenval %d\n", t, tval);
    }
}
49
********** parse.c *************
#include "global.h"

int lookahead;

void expr(void);
void term(void);
void factor(void);
void match(int t);

void parse()   /* parses and translates expression list */
{
    lookahead = lexan();
    while (lookahead != DONE) {
        expr(); match(';');
    }
}

void expr()
{
    int t;
    term();
    while(1)
        switch (lookahead) {
        case '+': case '-':
            t = lookahead;
            match(lookahead); term(); emit(t, NONE);
            continue;
        default:
            return;
        }
}

void term()
{
    int t;
    factor();
    while(1)
        switch (lookahead) {
        case '*': case '/': case DIV: case MOD:
            t = lookahead;
            match(lookahead); factor(); emit(t, NONE);
            continue;
        default:
            return;
        }
}
50
********** parse.c (Cont'd) **********
void factor()
{
    switch (lookahead) {
    case '(':
        match('('); expr(); match(')');
        break;
    case NUM:
        emit(NUM, tokenval);
        match(NUM);
        break;
    case ID:
        emit(ID, tokenval);
        match(ID);
        break;
    default:
        error("syntax error");
    }
}

void match(int t)
{
    if (lookahead == t)
        lookahead = lexan();
    else
        error("syntax error");
}
51
*** error.c ***
#include "global.h"
void error(char* m)
{
fprintf(stderr, "line %d: %s\n", lineno, m);
exit(1);
}
*** main.c ***
#include "global.h"
int main(void)
{
    init();
    parse();
    exit(0);   /* successful termination */
}