CSC338 chap02

1 Study of a Simple Compiler

In this chapter we study a simple compiler and the steps needed to build one. The chapter serves as an introduction to the rest of the course.

2 Arithmetic expression processing using the stack

The stack operations are:
• Push(x): puts the value x on top of the stack.
• Pop(): removes the value on top of the stack and returns it.

Before using the stack to process an arithmetic expression, we have to translate the expression from infix form to postfix form.

3 Examples of expression translation

Infix        Postfix
1+5          15+
1+5*2        152*+
(1+5)*2      15+2*
9-5+2        95-2+

4 Processing of expression

To process a postfix arithmetic expression using the stack, follow these steps:
1) Read the expression from left to right.
2) When you read a number, put it on top of the stack (using push).
3) When you read an operation:
   • Get the first number from the top of the stack (using pop); this is the right operand.
   • Get the second number from the top of the stack (using pop); this is the left operand.
   • Apply the operation to the second number and the first number, in that order.
   • Put the result on top of the stack (using push).

5 If we process the following expression

Infix: 1+5*2    Translation (postfix): 152*+

Symbol read   Operations                              Stack (top first)
1             push 1                                  1
5             push 5                                  5 1
2             push 2                                  2 5 1
*             pop r1; pop r2; mult r2,r1; push r2     10 1
+             pop r1; pop r2; add r2,r1; push r2      11

6 Exercise

1) Process the other expressions in the above table (page 3) using the stack.
2) Complete the following table.
Infix        Postfix
1-5
1+5-2
9-3/(1+2)
(9-3)/1+2

7 Simple compiler structure

Character stream (infix representation)
  → Lexical analyzer
  → Token stream
  → Syntax-directed translator
  → Intermediate representation (postfix representation)

8 Grammar

A grammar (context-free grammar, CFG) consists of:
1) A set of tokens (called terminal symbols)
2) A set of nonterminals
3) A set of rules (productions), each of which has
   • a left part (a nonterminal)
   • an arrow
   • a right part (a sequence (string) of tokens and/or nonterminal symbols)
4) A start symbol (one of the nonterminal symbols)

9 Example 1

list → list + digit
list → list - digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This may be written as follows:

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

10
- Terminal symbols (tokens): + - 0 1 2 3 4 5 6 7 8 9
- Nonterminals: digit, list
- Start symbol: list

• String of tokens: a sequence of zero or more tokens (terminal symbols). When the number of tokens is zero, the string is called the empty string and is written ε.
• All token strings that can be derived from the start symbol of a grammar form the language represented by this grammar.

11 Exercise (Example 2)

For the following grammar:
1. Determine the nonterminal symbols and the terminal symbols.
2. Determine the start symbol.
3. Give three token strings derived from this grammar.

block → begin compound_stmts end
compound_stmts → stmt_list | ε
stmt_list → stmt_list ; stmt | stmt
stmt → a | c | b

12 Parse Tree

• Shows how the start symbol of a grammar can derive a string in the language
• A tree with the following properties:
  1- The root is the start symbol.
  2- Each internal node is a nonterminal.
  3- Each leaf is a token or ε.
  4- If A is the label of an interior node and X1, X2, …, Xn (nonterminals or tokens) are the labels of its children, then the following production must exist:

     A → X1 X2 … Xn

              A
           /  |  \
         X1   X2 … Xn

13 Example

S → S S + | S S * | a

1) Derive the following string: aa+a*

S ⇒ S S *        (using S → S S *)
  ⇒ S a *        (using S → a)
  ⇒ S S + a *    (using S → S S +)
  ⇒ S a + a *    (using S → a)
  ⇒ a a + a *    (using S → a)

14
2) Draw the parse tree of the derivation S ⇒ S S * ⇒ S a * ⇒ S S + a * ⇒ S a + a * ⇒ a a + a *

              S
           /  |  \
          S   S   *
        / | \  |
       S  S  + a
       |  |
       a  a

15 Ambiguous Grammars

• If any string has more than one parse tree, the grammar is said to be ambiguous.
• Ambiguity must be avoided for compilation, since an ambiguous string can have more than one meaning.
• List of digits separated by plus or minus signs:
  string → string + string | string - string | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• This example merges the notions of digit and list into the single nonterminal string.
• The same strings are derivable, but some strings have multiple parse trees (possible meanings).

16 Two Parse Trees: 9 - 5 + 2

One parse tree groups the string as (9 - 5) + 2, the other as 9 - (5 + 2).

17 Precedence and Associativity

• Precedence
  - Determines the order in which different operators are evaluated when they occur in the same expression
  - Operators of higher precedence are applied before operators of lower precedence
• Associativity
  - Determines the order in which operators of equal precedence are evaluated when they occur in the same expression
  - Most operators have left-to-right associativity, but some have right-to-left associativity

18 Precedence and Associativity Example: Arithmetic Expression

We start with the lowest level in the grammar (highest priority):
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

Then the next higher level (lower priority):
factor → digit | ( expr )

Then the next higher level (lower priority):
term → term * factor | term / factor | factor

Then the highest level (lowest priority):
expr → expr + term | expr - term | term

19 Postfix Notation

• Formal rules, infix → postfix
  - If E is a variable or constant, E → E
  - If E is an expression of the form E1 op E2, where op is a binary operator, E1 → E1', and E2 → E2', then E → E1' E2' op
  - If E is an expression of the form (E1) and E1 → E1', then E → E1'
• Parentheses are not needed!
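Postfix strings need no parentheses because the stack procedure described earlier in the chapter fixes the order of evaluation. That procedure can be sketched in C as follows. This is only a minimal sketch, assuming single-digit operands and the four operators + - * /. The name eval_postfix and the stack capacity STACK_MAX are illustrative choices and do not come from the course material.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define STACK_MAX 64   /* assumed capacity, chosen only for illustration */

/* Evaluate a postfix string of single-digit operands and + - * / operators.
   eval_postfix is a hypothetical helper, not part of the course code. */
int eval_postfix(const char *s)
{
    int stack[STACK_MAX];
    int top = 0;                      /* number of values currently stacked */

    for (; *s != '\0'; s++) {
        if (isdigit((unsigned char)*s)) {
            assert(top < STACK_MAX);
            stack[top++] = *s - '0';  /* push the operand */
        } else if (strchr("+-*/", *s) != NULL) {
            int r1, r2;
            assert(top >= 2);
            r1 = stack[--top];        /* first pop: right operand */
            r2 = stack[--top];        /* second pop: left operand */
            switch (*s) {
            case '+': r2 = r2 + r1; break;
            case '-': r2 = r2 - r1; break;
            case '*': r2 = r2 * r1; break;
            case '/': assert(r1 != 0); r2 = r2 / r1; break;
            }
            stack[top++] = r2;        /* push the result */
        }
        /* any other character (for example a space) is skipped */
    }
    assert(top == 1);                 /* a well-formed string leaves one value */
    return stack[0];
}
```

Under these assumptions, eval_postfix("152*+") returns 11 and eval_postfix("15+2*") returns 12, so the two different infix groupings of 1, 5 and 2 are distinguished by operand and operator order alone.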
20 Translation Schemes

• A translation scheme adds to a CFG
• It includes "semantic actions" embedded within productions

Example Translation Scheme

expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }

21 Equivalent Translation Scheme

expr → term rest
rest → + term { print('+') } rest
rest → - term { print('-') } rest
rest → ε
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }

22 Parsing

• Parsing is the process of determining if a string of tokens can be generated by a grammar

23 Top-down Parsing

• Recursively apply the following steps:
  - At node n with nonterminal A, select a production for A
  - Construct children at n for the symbols on the right side of the selected production
  - Find the next node for which a subtree needs to be constructed
• Top-down parsing uses a "lookahead" symbol
• Selecting a production may involve trial-and-error and backtracking

24 Predictive Parsing

• Recursive-descent parsing is a recursive, top-down approach to parsing
• A procedure is associated with each nonterminal of the grammar
• Predictive parsing
  - Special case of recursive-descent parsing
  - The lookahead symbol unambiguously determines the production to apply for each nonterminal

25 Procedures for Nonterminals

• The production with right side α is used if the lookahead is in FIRST(α)
  - FIRST(α) is the set of all tokens that can appear as the first symbol of a string derived from α
  - If the lookahead symbol is not in the FIRST set of any production, the production with right side ε can be used (if there is one)
  - If two or more productions are possible, this method cannot be used
  - If no production is possible, an error is declared
• Nonterminals on the right side of the selected production are recursively expanded

26 Left Recursion

• Left-recursive productions can cause recursive-descent parsers to loop forever
• Example: expr → expr + term
• Left recursion can be eliminated:

  A → A α | β

  becomes

  A → β R
  R → α R | ε

27 Eliminating Left Recursion

Before:

expr → expr + term { print('+') }
expr → expr - term { print('-') }
expr → term
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }

After:

expr → term rest
rest → + term { print('+') } rest
rest → - term { print('-') } rest
rest → ε
term → 0 { print('0') }
term → 1 { print('1') }
…
term → 9 { print('9') }

28 Infix to Postfix Code: Part 1

    #include <stdio.h>
    #include <stdlib.h>   /* exit */
    #include <ctype.h>

    int lookahead;        /* current input character */

    void expr(void);
    void rest(void);
    void term(void);
    void match(int);
    void error(void);

    int main(void)
    {
        lookahead = getchar();
        expr();
        putchar('\n');    /* adds trailing newline character */
        return 0;
    }

29 Infix to Postfix Code: Part 2

    void expr(void)
    {
        term();
        rest();
    }

    void term(void)
    {
        if (isdigit(lookahead)) {
            putchar(lookahead);   /* a digit is emitted as soon as it is read */
            match(lookahead);
        }
        else
            error();
    }

30 Infix to Postfix Code: Part 3

    void rest(void)
    {
        if (lookahead == '+') {
            match('+'); term(); putchar('+'); rest();
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-'); rest();
        }
        /* otherwise rest → ε: do nothing */
    }

31 Infix to Postfix Code: Part 4

    void match(int t)
    {
        if (lookahead == t)
            lookahead = getchar();
        else
            error();
    }

    void error(void)
    {
        printf("syntax error\n");   /* print error message */
        exit(1);                    /* then halt */
    }

32 Code Optimization 1

The tail-recursive calls to rest() are replaced by jumps to the start of the procedure:

    void rest(void)
    {
    REST:
        if (lookahead == '+') {
            match('+'); term(); putchar('+'); goto REST;
        }
        else if (lookahead == '-') {
            match('-'); term(); putchar('-'); goto REST;
        }
    }

33 Code Optimization 2

The loop is then written explicitly and merged into expr():

    void expr(void)
    {
        term();
        while (1) {
            if (lookahead == '+') {
                match('+'); term(); putchar('+');
            }
            else if (lookahead == '-') {
                match('-'); term(); putchar('-');
            }
            else
                break;
        }
    }

34 Improvements Remaining

• Want to ignore whitespace
• Allow numbers
• Allow identifiers
• Allow additional operators (multiplication and division)
• Allow multiple expressions (separated by semicolons)

35 Lexical Analyzer

• Eliminates whitespace (and comments)
• Reads numbers (not just single digits)
• Reads identifiers and keywords

36 Implementing the Lexical Analyzer

[Figure]

37 Allowable Tokens

• Expected tokens: +, -, *, /, DIV, MOD, (, ), ID, NUM, DONE
• ID represents an identifier, NUM represents a number, DONE represents EOF

38 Tokens and Attributes

LEXEME                                   TOKEN            ATTRIBUTE VALUE
white space                              ---              ---
sequence of digits                       NUM              numeric value of sequence
div                                      DIV              ---
mod                                      MOD              ---
letter followed by letters and digits    ID               index into symbol table
EOF                                      DONE             ---
any other character                      that character   NONE

39 A Simple Symbol Table

• Each record of the symbol table contains a token type and a string (lexeme or keyword)
• The symbol table has a fixed size
• All lexemes are stored in an array of fixed size
• We will be able to insert and search for tokens:
  - insert(s, t): creates an entry with string s and token t, returns its index in the symbol table
  - lookup(s): searches for an entry with string s, returns its index if found, 0 otherwise
• The keywords (div and mod) are inserted into the symbol table, so they cannot be used as identifiers

40 Updated Translation Scheme

start → list eof
list → expr ; list | ε
expr → expr + term { print('+') }
     | expr - term { print('-') }
     | term
term → term * factor { print('*') }
     | term / factor { print('/') }
     | term div factor { print('DIV') }
     | term mod factor { print('MOD') }
     | factor
factor → ( expr )
       | id { print(id.lexeme) }
       | num { print(num.value) }

41 After Eliminating Left Recursion

start → list eof
list → expr ; list | ε
expr → term moreterms
moreterms → + term { print('+') } moreterms
          | - term { print('-') } moreterms
          | ε
term → factor morefactors
morefactors → * factor { print('*') } morefactors
            | / factor { print('/') } morefactors
            | div factor { print('DIV') } morefactors
            | mod factor { print('MOD') } morefactors
            | ε
factor → ( expr )
       | id { print(id.lexeme) }
       | num { print(num.value) }

42 Final Code

• About 250 lines of C
• Written loosely; a more careful version would be longer

43 ********** global.h **********

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    #define BSIZE 128     /* buffer size */
    #define NONE  -1
    #define EOS   '\0'

    #define NUM   256
    #define DIV   257
    #define MOD   258
    #define ID    259
    #define DONE  260

    struct entry {        /* form of a symbol table entry */
        char *lexptr;
        int token;
    };

    /* Defined in lexer.c and symbol.c; declared extern here so that
       every file sharing this header refers to the same objects. */
    extern int tokenval;  /* value of token attribute */
    extern int lineno;
    extern struct entry symtable[];

    /* Functions shared between files. */
    void init(void);
    int  lookup(char s[]);
    int  insert(char s[], int tok);
    int  lexan(void);
    void emit(int t, int tval);
    void parse(void);
    void error(char *m);

44 ********** init.c **********

    #include "global.h"

    struct entry keywords[] = {
        { "div", DIV },
        { "mod", MOD },
        { 0, 0 }
    };

    void init(void)       /* loads keywords into symtable */
    {
        struct entry *p;
        for (p = keywords; p->token; p++)
            insert(p->lexptr, p->token);
    }

Figure: the symbol table and the lexemes array after the keywords and the identifiers count and i have been inserted. Entry 0 is unused, because lookup returns 0 for "not found".

Array symtable
  index   lexptr       token
  0       (unused)
  1       → "div"      DIV
  2       → "mod"      MOD
  3       → "count"    ID
  4       → "i"        ID

Array lexemes
  d i v EOS m o d EOS c o u n t EOS i EOS

45 The lexical analyzer:

- Calls the lookup function to search for a symbol in the symbol table.
- Calls the insert function to add a symbol to the symbol table.
- Adds 1 to the line counter when the end-of-line character is found.

46 ********** symbol.c **********

    #include "global.h"

    #define STRMAX 999    /* size of lexemes array */
    #define SYMMAX 100    /* size of symtable */

    char lexemes[STRMAX];
    int lastchar = -1;    /* last used position in lexemes */
    struct entry symtable[SYMMAX];
    int lastentry = 0;    /* last used position in symtable */

    int lookup(char s[])  /* returns position of entry for s, 0 if not found */
    {
        int p;
        for (p = lastentry; p > 0; p = p - 1)
            if (strcmp(symtable[p].lexptr, s) == 0)
                return p;
        return 0;
    }

    int insert(char s[], int tok)   /* returns position of entry for s */
    {
        int len;
        len = (int)strlen(s);
        if (lastentry + 1 >= SYMMAX)
            error("symbol table full");
        if (lastchar + len + 1 >= STRMAX)
            error("lexemes array full");
        lastentry = lastentry + 1;
        symtable[lastentry].token = tok;
        symtable[lastentry].lexptr = &lexemes[lastchar + 1];
        lastchar = lastchar + len + 1;
        strcpy(symtable[lastentry].lexptr, s);
        return lastentry;
    }

47 ********** lexer.c **********

    #include "global.h"

    char lexbuf[BSIZE];
    int lineno = 1;
    int tokenval = NONE;

    int lexan(void)       /* lexical analyzer */
    {
        int t;
        while (1) {
            t = getchar();
            if (t == ' ' || t == '\t')
                ;                         /* strip out white space */
            else if (t == '\n')
                lineno = lineno + 1;
            else if (isdigit(t)) {        /* t is a digit */
                ungetc(t, stdin);
                scanf("%d", &tokenval);
                return NUM;
            }
            else if (isalpha(t)) {        /* t is a letter */
                int p, b = 0;
                while (isalnum(t)) {      /* t is alphanumeric */
                    lexbuf[b] = t;
                    t = getchar();
                    b = b + 1;
                    if (b >= BSIZE)
                        error("compiler error");
                }
                lexbuf[b] = EOS;
                if (t != EOF)
                    ungetc(t, stdin);
                p = lookup(lexbuf);
                if (p == 0)
                    p = insert(lexbuf, ID);
                tokenval = p;
                return symtable[p].token;
            }
            else if (t == EOF)
                return DONE;
            else {
                tokenval = NONE;
                return t;
            }
        }
    }

48 ********** emitter.c **********

    #include "global.h"

    void emit(int t, int tval)    /* generates output */
    {
        switch (t) {
        case '+': case '-': case '*': case '/':
            printf("%c", t); break;
        case DIV:
            printf(" DIV "); break;
        case MOD:
            printf(" MOD "); break;
        case NUM:
            printf("%d", tval); break;
        case ID:
            printf(" %s ", symtable[tval].lexptr); break;
        default:
            printf("token %d, tokenval %d\n", t, tval);
        }
    }

49 ********** parse.c **********

    #include "global.h"

    int lookahead;

    /* Procedures local to the parser, declared before use. */
    void expr(void);
    void term(void);
    void factor(void);
    void match(int t);

    void parse(void)      /* parses and translates expression list */
    {
        lookahead = lexan();
        while (lookahead != DONE) {
            expr(); match(';');
        }
    }

    void expr(void)
    {
        int t;
        term();
        while (1)
            switch (lookahead) {
            case '+': case '-':
                t = lookahead;
                match(lookahead); term(); emit(t, NONE);
                continue;
            default:
                return;
            }
    }

    void term(void)
    {
        int t;
        factor();
        while (1)
            switch (lookahead) {
            case '*': case '/': case DIV: case MOD:
                t = lookahead;
                match(lookahead); factor(); emit(t, NONE);
                continue;
            default:
                return;
            }
    }

50 ********** parse.c (Cont'd) **********

    void factor(void)
    {
        switch (lookahead) {
        case '(':
            match('('); expr(); match(')'); break;
        case NUM:
            emit(NUM, tokenval); match(NUM); break;
        case ID:
            emit(ID, tokenval); match(ID); break;
        default:
            error("syntax error");
        }
    }

    void match(int t)
    {
        if (lookahead == t)
            lookahead = lexan();
        else
            error("syntax error");
    }

51 ********** error.c **********

    #include "global.h"

    void error(char *m)   /* generates all error messages */
    {
        fprintf(stderr, "line %d: %s\n", lineno, m);
        exit(1);          /* unsuccessful termination */
    }

********** main.c **********

    #include "global.h"

    int main(void)
    {
        init();
        parse();
        return 0;         /* successful termination */
    }
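The multi-file listing above reads from standard input and exits on the first error, which makes it awkward to try out on fixed strings. The following is a minimal single-file sketch of the same expr / term / factor structure. It assumes single-digit operands only (no identifiers, div, or mod) and writes the postfix form into a caller-supplied buffer. The function translate and the helpers peek and emit as written here are hypothetical names introduced for this sketch; they are not the course's lexan and emit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Parser state for this sketch; all names are hypothetical. */
static const char *src;   /* next unread input character */
static char *dst;         /* next free output position */
static char *dst_end;     /* position reserved for the terminating '\0' */
static int failed;        /* set to 1 on the first error */

static void expr(void);

/* Skip blanks and return the lookahead character without consuming it. */
static int peek(void)
{
    while (*src == ' ' || *src == '\t')
        src++;
    return (unsigned char)*src;
}

/* Append one character to the output, or record an error if it is full. */
static void emit(int c)
{
    if (dst < dst_end)
        *dst++ = (char)c;
    else
        failed = 1;
}

/* Consume the lookahead if it equals t, otherwise record a syntax error. */
static void match(int t)
{
    if (t != '\0' && peek() == t)
        src++;
    else
        failed = 1;
}

/* factor -> ( expr ) | digit */
static void factor(void)
{
    int c = peek();
    if (c == '(') {
        match('('); expr(); match(')');
    } else if (isdigit(c)) {
        emit(c); match(c);
    } else {
        failed = 1;
    }
}

/* term -> factor { (* | /) factor }, operator emitted after its operands */
static void term(void)
{
    factor();
    while (!failed && (peek() == '*' || peek() == '/')) {
        int op = peek();
        match(op); factor(); emit(op);
    }
}

/* expr -> term { (+ | -) term } */
static void expr(void)
{
    term();
    while (!failed && (peek() == '+' || peek() == '-')) {
        int op = peek();
        match(op); term(); emit(op);
    }
}

/* Translate infix to postfix in out; returns 0 on success, -1 on error. */
int translate(const char *infix, char *out, size_t outsize)
{
    assert(infix != NULL && out != NULL);
    if (outsize == 0)
        return -1;
    src = infix; dst = out; dst_end = out + outsize - 1; failed = 0;
    expr();
    if (peek() != '\0')   /* leftover input is a syntax error */
        failed = 1;
    *dst = '\0';
    return failed ? -1 : 0;
}
```

With a buffer such as char buf[32], calling translate("(1+5)*2", buf, sizeof buf) fills buf with 15+2*, which matches the table of translation examples at the start of the chapter. Recording errors in a flag instead of calling exit is a design choice made here so that the sketch can be called repeatedly from a test.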
