CS 410/510 Compiler Design Theory

The following notes are the ones I use to lecture with in class. They are not to be
considered complete, nor a substitute for your own notes. You should use these notes,
together with your own, as guides for your reading and study outside of class. The
textbook will provide you with additional information on each of the subjects
introduced below.
Compiler Design Introduction
Two parts to compilation
1. Analysis: Breaks up the source program into constituent pieces and creates an
intermediate representation.
2. Synthesis: Constructs the desired target program from the intermediate
representation.
Analysis of the source program
1. Linear Analysis: The stream of characters of the source program is grouped into
tokens, which are sequences of characters having a collective meaning. (Lexical
Analyzer)
2. Hierarchical Analysis: The tokens are grouped hierarchically into nested
collections with collective meaning. (Syntax Analyzer)
3. Semantic Analysis: Checks are performed to ensure that the components of a
program fit together meaningfully. (Type Checking)
Linear Analysis of the following: sum := oldsum - value / 100

Lexeme (collection of characters)    Token (category of lexeme)
sum                                  identifier
:=                                   assignment operator
oldsum                               identifier
-                                    subtraction operator
value                                identifier
/                                    division operator
100                                  integer constant
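A minimal sketch of this linear analysis in Python, matching lexemes with regular
expressions; the token names mirror the table above, but the pattern set itself is an
assumption for illustration:

import re

# Token patterns for the example statement; names follow the table above.
TOKEN_SPEC = [
    ("integer_constant",     r"\d+"),
    ("identifier",           r"[A-Za-z_]\w*"),
    ("assignment_operator",  r":="),
    ("subtraction_operator", r"-"),
    ("division_operator",    r"/"),
    ("whitespace",           r"\s+"),   # skipped, never reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    # Yield (lexeme, token) pairs, left to right.
    for match in MASTER.finditer(source):
        if match.lastgroup != "whitespace":
            yield match.group(), match.lastgroup

for lexeme, token in tokenize("sum := oldsum - value / 100"):
    print(f"{lexeme!r:10} {token}")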
newposition := initialposition + rate * 60
Lexical Analyzer
id1 := id2 + id3 * int_lit
Syntax Analyzer

          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   int_lit
Semantic Analyzer

          :=
         /  \
      id1    +
            / \
         id2   *
              / \
           id3   int_to_real
                     |
                  int_lit
Intermediate Code Generator
temp1 := int_to_real(60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3
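The same translation can be sketched as a post-order walk of the annotated tree,
allocating a fresh temporary for each interior node. The tuple encoding of the tree
below is an assumption for illustration, not the course's data structure:

temp_count = 0

def new_temp():
    # Generate a fresh temporary name: temp1, temp2, ...
    global temp_count
    temp_count += 1
    return f"temp{temp_count}"

def gen(node, code):
    """Return the address holding node's value, appending
    three-address statements to code as needed."""
    if isinstance(node, str):          # leaf: identifier or literal
        return node
    if len(node) == 2:                 # unary node: (op, operand)
        op, arg = node
        t = new_temp()
        code.append(f"{t} := {op}({gen(arg, code)})")
        return t
    op, left, right = node             # binary node: (op, left, right)
    l = gen(left, code)
    r = gen(right, code)
    t = new_temp()
    code.append(f"{t} := {l} {op} {r}")
    return t

# The annotated tree produced by the Semantic Analyzer above.
tree = ("+", "id2", ("*", "id3", ("int_to_real", "60")))
code = []
code.append(f"id1 := {gen(tree, code)}")
print("\n".join(code))   # reproduces the four statements above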
Code Optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1
Code Generator

MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1
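A sketch of how such instructions might be produced from the optimized three-address
code, assuming a quadruple encoding and a naive two-register allocation (both are
illustrative assumptions, not the course's method):

def gen_code(quads):
    """Translate (dest, src1, op, src2) quadruples into the two-register
    MOVF/ADDF/MULF style shown above. Naive allocation, no spilling."""
    OPS = {"+": "ADDF", "*": "MULF"}
    code, loc, free = [], {}, ["R1", "R2"]
    for dest, src1, op, src2 in quads:
        r = free.pop()                          # grab a free register
        code.append(f"MOVF {loc.get(src1, src1)}, {r}")
        code.append(f"{OPS[op]} {loc.get(src2, src2)}, {r}")
        loc[dest] = r                           # dest now lives in r
        if not dest.startswith("temp"):         # program variable: store it
            code.append(f"MOVF {r}, {dest}")
    return code

# The optimized three-address code from above, as quadruples.
quads = [("temp1", "id3", "*", "#60.0"),
         ("id1",   "id2", "+", "temp1")]
print("\n".join(gen_code(quads)))   # reproduces the five instructions above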
A pass of a compiler is a reading of a file followed by processing the data from the file.
A phase of a compiler is a logical part of the compilation process.
Lexical Analysis
A token is a category into which a lexeme can be classified.
The token, or category, can be defined by a Regular Language and expressed as a
Regular Expression.
Lexical Analysis is the act of breaking down source code into a set of lexemes. Each
lexeme is found by matching sequential characters to a token.
Lexical Analysis can be performed with pattern matching through the use of Regular
Expressions. Therefore, a Lexical Analyzer can be defined and represented as a DFA.
Even though the source language is most likely defined by a context-free grammar, the
Lexical Analyzer only identifies strings and sends them to a Syntax Analyzer for
parsing. The Lexical Analyzer also removes white space and comments, and identifies
ill-formed strings or invalid characters.
Reasons to separate Lexical Analysis from Syntax Analysis:
1. Simplicity: Techniques for Lexical Analysis can be simpler than those required
for Syntax Analysis (a DFA vs. a PDA). Separation also simplifies the Syntax
Analyzer.
2. Efficiency: Separation into different modules makes it easier to perform
simplifications and optimizations unique to the different paradigms.
3. Portability: Due to input/output and character-set variations, Lexical Analyzers
are not always machine independent; isolating them keeps the rest of the compiler
portable.
Errors often detected in a Lexical Analyzer:
1. Numeric literals that are too long.
2. Identifiers that are too long (often a warning is given).
3. Ill-formed numeric literals.
4. Input characters that are not in the source language.
Input Buffering
Moving input data into local memory buffers can improve performance.

Double buffering:

Use a sentinel, e.g. "@", to mark the end of each buffer. When the sentinel is found,
switch to the other buffer and re-fill the one just finished. The sentinel means only
one check is needed per character read; a second check, made only when the sentinel is
seen, determines which buffer must be refilled. Otherwise it would be necessary to
check for the end of either buffer on every character read. Also consider what happens
if the sentinel appears as a character in the source code: it should be a character
that is invalid in the source language, or an extra check is required.
Buffering allows for easy look-ahead of characters.

< vs. <=

The look-ahead allows a less-than token to be distinguished from a less-than-or-equal
token by examining the next character in the buffer before consuming it.
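A sketch of the double-buffering scheme in Python, using '\0' as the sentinel instead
of '@'; the buffer size, class interface, and end-of-input convention are illustrative
assumptions:

import io

BUF = 8            # tiny for demonstration; a real lexer would use e.g. 4096
SENTINEL = "\0"    # assumed never to occur in valid source text

class TwinBuffer:
    """Two buffer halves, each ending in a sentinel, so the common
    case needs only one end-of-buffer test per character read."""
    def __init__(self, stream):
        self.stream = stream
        self.halves = [self._load(), None]
        self.half, self.i = 0, 0

    def _load(self):
        # Refill one half and terminate it with the sentinel.
        return self.stream.read(BUF) + SENTINEL

    def next_char(self):
        ch = self.halves[self.half][self.i]
        self.i += 1
        if ch != SENTINEL:                  # the single common-case check
            return ch
        if self.i < len(self.halves[self.half]):
            raise ValueError("sentinel appeared inside the source text")
        other = 1 - self.half               # end of half: refill the other
        self.halves[other] = self._load()
        if self.halves[other] == SENTINEL:  # nothing left in the stream
            return SENTINEL                 # report end of input
        self.half, self.i = other, 0
        return self.next_char()

tb = TwinBuffer(io.StringIO("sum := oldsum - value / 100"))
out = []
while (c := tb.next_char()) != SENTINEL:
    out.append(c)
print("".join(out))   # characters stream seamlessly across buffer boundaries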
Example of a Lexical Analyzer
Build a Lexical Analyzer which identifies the following tokens:
1. digits
2. digits E [sign] digits
3. digits.digits [ E [sign] digits ]
“sign” refers to + or -; “E” refers to an exponent (power of 10)
[ ] denotes 0 or 1 occurrence of the contents
{ } denotes 0 or more occurrences of the contents
ε (= λ) denotes the empty string
The above tokens can be accepted by the following DFA:
Approaches to constructing a Lexical Analyzer
1. Use a table representation and a lookup function (a sketch of this approach
follows the table below).
2. Write an if/else block of code for each state.
3. Use a software tool to generate a table-driven analyzer from regular expressions.
State   digit   .     E     +,-   other   token
S       q1      er    er    er    er      0
q1      q1      q2    q3    S     S       int
q2      q4      er    er    er    er      0
q3      q5      er    er    q6    er      0
q4      q4      S     q3    S     S       float
q5      q5      S     S     S     S       float
q6      q5      er    er    er    er      0

An entry of S means the lexeme ends here: the token of the current state is accepted
and the analyzer returns to the start state without consuming the character. An entry
of er is a lexical error. A token entry of 0 marks a non-accepting state.
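A sketch of approach 1, driving the DFA with the transition table above and a lookup
function; the character-classification helper and the error handling are illustrative
assumptions:

# Transition table for the numeric-literal DFA above.
COLS = {"digit": 0, ".": 1, "E": 2, "sign": 3, "other": 4}
TABLE = {
    "S":  ["q1", "er", "er", "er", "er"],
    "q1": ["q1", "q2", "q3", "S",  "S" ],
    "q2": ["q4", "er", "er", "er", "er"],
    "q3": ["q5", "er", "er", "q6", "er"],
    "q4": ["q4", "S",  "q3", "S",  "S" ],
    "q5": ["q5", "S",  "S",  "S",  "S" ],
    "q6": ["q5", "er", "er", "er", "er"],
}
TOKEN = {"q1": "int", "q4": "float", "q5": "float"}   # accepting states

def col(ch):
    # Classify a character into one of the table's columns.
    if ch.isdigit(): return COLS["digit"]
    if ch == ".":    return COLS["."]
    if ch == "E":    return COLS["E"]
    if ch in "+-":   return COLS["sign"]
    return COLS["other"]

def scan_number(text, i):
    """Scan one numeric literal starting at text[i].
    Returns (token, lexeme, next_index)."""
    state, start = "S", i
    while i < len(text):
        nxt = TABLE[state][col(text[i])]
        if nxt == "S":                  # accept; current char is pushed back
            return TOKEN[state], text[start:i], i
        if nxt == "er":
            raise ValueError(f"lexical error at index {i}: {text[start:i+1]!r}")
        state, i = nxt, i + 1
    if state in TOKEN:                  # input ended in an accepting state
        return TOKEN[state], text[start:i], i
    raise ValueError("unexpected end of input")

print(scan_number("314 ", 0))       # ('int', '314', 3)
print(scan_number("3.14E+2 ", 0))   # ('float', '3.14E+2', 7)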
The String Table
The String Table is a data structure holding unique lexemes. It can be queried for a
lexeme by its numeric code. Lexemes are generally inserted into the String Table by
the Lexical Analyzer.
The String Table can be used for:
1. Error messages
2. Memory map listings
3. Intermodule linkages
Index   String   Token
0       fptr     [ident]
1       number   [ident]
2       5.1      [f_lit]
It is not recommended to use the String Table for reserved words. Use a separate
structure to determine if an identifier is a reserved word, and then assign a token value
accordingly. If it is not a reserved word, insert it in the String Table if necessary.
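A sketch of a String Table with a separate reserved-word check, as recommended above;
the dictionary representation and the reserved-word set are illustrative assumptions:

RESERVED = {"if", "while", "return"}   # illustrative reserved-word set

class StringTable:
    """Maps each unique lexeme to a small integer index so later phases
    can compare indices instead of strings."""
    def __init__(self):
        self.index_of = {}    # lexeme -> index
        self.entries = []     # index  -> (lexeme, token)

    def lookup_or_insert(self, lexeme, token):
        # Return the existing index, or insert the lexeme exactly once.
        if lexeme in self.index_of:
            return self.index_of[lexeme]
        idx = len(self.entries)
        self.index_of[lexeme] = idx
        self.entries.append((lexeme, token))
        return idx

def classify(lexeme, table):
    # Reserved words are checked first and never enter the String Table.
    if lexeme in RESERVED:
        return ("reserved", lexeme)
    return ("[ident]", table.lookup_or_insert(lexeme, "[ident]"))

t = StringTable()
print(classify("fptr", t))    # ('[ident]', 0)
print(classify("while", t))   # ('reserved', 'while')
print(classify("fptr", t))    # ('[ident]', 0)  same index, no duplicate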
Do not confuse the String Table with the Symbol Table. The String Table holds the
unique spellings of lexemes, allowing index comparisons instead of string comparisons
during syntax analysis and code generation. The Symbol Table is a dynamic data
structure that records attributes of identifiers (such as type and scope) and is used
differently from the String Table.