Notes2-1

CS488: Compilers
4/12/03 Session 1 Notes
Midterm: Know about Lexer and Parser.
Lexer is due next Saturday (4/19/03) after class.
Notes regarding homework:
STRING_CONST cannot go beyond 1 line.
Can’t have double quote inside string. [Makes STRING_CONST matching easier]
See your lexer documentation regarding start conditions. (Ex: using BEGIN...)
Assume all numbers are INTEGERS and they are all positive.
No comments are in the Grammar.
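The STRING_CONST and INTEGER rules above can be sketched as regular expressions. This is only a minimal sketch using Python's re module, not the lexer generator used in class; the helper name match_string_const and the exact patterns are assumptions based on the rules listed above.

```python
import re

# Assumed pattern: an opening quote, any characters other than a double
# quote or a newline, then a closing quote (one line, no embedded quotes).
STRING_CONST = re.compile(r'"[^"\n]*"')

# Assumed pattern: all numbers are positive integers, so digits only.
INTEGER = re.compile(r"[0-9]+")

def match_string_const(text, pos=0):
    """Return the STRING_CONST lexeme starting at pos, or None (hypothetical helper)."""
    m = STRING_CONST.match(text, pos)
    return m.group(0) if m else None
```

A string that reaches the end of the line without a closing quote yields None here, which is where a real lexer would report an unterminated string.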
Review:
Lexer: Recognizes tokens.
Errors lexer can detect:
 Unrecognized Character.
 Invalid token.
 Unterminated strings.
 Unterminated comments.
Parser: Enforces the grammar.
Works with the Lexer.
Matches the longest substring.
Example: given input ‘bc’ and the following:
a -> bc
d -> b
f -> c
The parser will match a -> bc since input b AND c were matched
in a -> bc. (bc is the longest matching ‘substring’ for the input).
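The longest-match rule in the example above can be sketched as follows. This is a simplification that treats each right-hand side as a literal string; the function name longest_match and the RULES list are illustrative and not part of any tool used in class.

```python
def longest_match(rules, text):
    """Return (name, rhs) for the rule whose right-hand side is the
    longest prefix of text, or None if no rule applies."""
    best = None
    for name, rhs in rules:
        # Keep a candidate only if it is strictly longer than the current best.
        if text.startswith(rhs) and (best is None or len(rhs) > len(best[1])):
            best = (name, rhs)
    return best

# The three rules from the example: a -> bc, d -> b, f -> c
RULES = [("a", "bc"), ("d", "b"), ("f", "c")]
```

With input 'bc', both a -> bc and d -> b apply at the first character, and the sketch picks a -> bc because it consumes more input.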
Semantic Analyzer:
Checks for three types of errors:
 Scope of variables.
 Inheritance cycles.
 Type Checking.
Code Generator: At this point there are “no errors.” The code generator does
exactly what its name implies: it generates code for the target language.
End of Review
A compiler-compiler takes a regular expression and converts it to a DFA written in the target
language. (DFA = Deterministic Finite Automaton)
Definitions:
Alphabet or character class: Any finite set of symbols.
Example: the alphabet of binary is: Σ = {0, 1}.
String over an alphabet: A finite sequence of symbols from the alphabet. (Also known as a
word or sentence.) ε is the string of length 0.
Language: Set of strings over an alphabet.
Grammar: Set of rules. (Note: the grammar is more related to the parser than the lexer,
see review).
Note: See the chapter 3 of the Dragon Book.
Special languages:
{ε} the language with one string of length 0.
∅ the empty set (= {})
Assume x, y are strings.
concatenation: xy
exponentiation:
x^0 = ε
x^i = x^(i-1)x, i > 0
x^2 = xx
x^4 = x^3x = x^2xx = xxxx
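The recursive definition of string exponentiation can be written out directly. A minimal sketch, with Python's empty string standing in for ε; the function name power is illustrative.

```python
def power(x: str, i: int) -> str:
    """x to the power i: the empty string for i = 0, otherwise power(x, i - 1) followed by x."""
    if i < 0:
        raise ValueError("exponent must be non-negative")
    if i == 0:
        return ""          # the empty string plays the role of epsilon
    return power(x, i - 1) + x
```

For example, power("x", 4) unfolds the same way as the last line above and gives "xxxx".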

Assume L, M are languages.
concatenation: LM
exponentiation:
L^0 = {ε}
L^i = L^(i-1)L, i > 0
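For finite languages, concatenation and exponentiation can be tried out with Python sets. A minimal sketch; the names concat and lang_power are illustrative, and the empty string stands in for the string of length 0.

```python
def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def lang_power(L, i):
    """L to the power i: start from the language holding only the empty
    string and concatenate with L i times."""
    result = {""}
    for _ in range(i):
        result = concat(result, L)
    return result
```

Note that lang_power(L, 0) is the one-string language, not the empty set.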
Given:
L = {A, B, ..., Z, a, b, c, ..., z}
D = {0, 1, ..., 9}
Then,
L ∪ D        Union of two sets.
LD           Concat. (Character followed by a digit)
L^4          LLLL (Set of character strings of length 4)
L*           Kleene closure, includes L and epsilon
L(L ∪ D)*    Begin with a letter, followed by 0 or more letters or digits.
D+           Positive closure (one or more occurrences of D)
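Two of these languages map directly onto familiar token patterns. A minimal sketch using Python's re module, assuming L is the ASCII letters and D the ASCII digits; the names IDENT, DIGITS, is_identifier and is_digits are illustrative.

```python
import re

# A letter followed by zero or more letters or digits.
IDENT = re.compile(r"[A-Za-z][A-Za-z0-9]*")

# D+: one or more digits.
DIGITS = re.compile(r"[0-9]+")

def is_identifier(s):
    return IDENT.fullmatch(s) is not None

def is_digits(s):
    return DIGITS.fullmatch(s) is not None
```

Because D+ needs at least one digit, the empty string is rejected by is_digits.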
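Also take note of the following:
L ∪ M    {s | s is in L or s is in M}
LM       {st | s is in L and t is in M}
L*       L^0 ∪ L^1 ∪ L^2 ∪ ... (Union of the languages L^i, i >= 0)
L+       L^1 ∪ L^2 ∪ ... (Union of the languages L^i, i >= 1)
L?       L^0 ∪ L^1 (Union of the languages L^i, i = 0, 1)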
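The closures are infinite unions, so code can only enumerate them up to a bound. A minimal sketch for finite languages; closure_up_to and its start parameter are illustrative names, with start=0 approximating the Kleene closure and start=1 the positive closure.

```python
def concat(L, M):
    """LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def closure_up_to(L, n, start=0):
    """Union of the i-th powers of L for start <= i <= n."""
    result = set()
    power = {""}              # the zeroth power of L
    for i in range(n + 1):
        if i >= start:
            result |= power
        power = concat(power, L)
    return result
```

With n = 1 and start = 0 the result is L plus the empty string, which is the optional form L?.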
Regular Expressions:
A regular expression r denotes a language L(r)
A regular expression determines a language.
One can find a language from a given regular expression.
Regular Expression    Denotes this Language
ε (r = ε)             {ε}
a                     {a}
r|s                   L(r) ∪ L(s)
rs                    L(r)L(s)
r*                    (L(r))*
(r)                   L(r) [indicates () are optional]
Precedence (High to Low): * -> concatenation -> union
rst* ≠ (rst)*
rst* = rs(t*)
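The precedence rule can be checked with Python's re module, which gives * the same high precedence. A minimal sketch, treating r, s and t as literal characters; the helper name matches is illustrative.

```python
import re

def matches(pattern, s):
    """True if the whole string s is in the language of pattern."""
    return re.fullmatch(pattern, s) is not None

# The star binds only to t, so rst* accepts rs followed by any number of t's.
star_on_t = matches(r"rst*", "rsttt")        # True
# Repeating the whole group needs explicit parentheses.
star_on_group = matches(r"(rst)*", "rstrst")  # True
```

The two patterns disagree on both sample strings: rst* rejects "rstrst" and (rst)* rejects "rsttt".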
Given Σ = {a, b} then,
Regular Expression    Denotes...
a|b                   {a, b}
(a|b)(a|b)            {aa, ab, ba, bb}
a*                    {ε, a, aa, aaa, ...}
(a|b)*                {ε, a, b, aa, ab, ba, bb, ...}
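These languages can be listed by brute force: generate every string over the alphabet up to some length and keep the ones the expression accepts. A minimal sketch using Python's re and itertools; the function name language and the max_len bound are illustrative.

```python
import re
from itertools import product

def language(pattern, alphabet="ab", max_len=2):
    """List the strings over alphabet, shortest first, of length at most
    max_len that belong to the language of pattern."""
    found = []
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if re.fullmatch(pattern, s):
                found.append(s)
    return found
```

Here language("(a|b)*") returns the empty string first, followed by a, b, aa, ab, ba, bb, and language("a*", max_len=3) also starts with the empty string.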
(a*b*)* is the same as (a*b*)+ if the minimum-state DFAs built from the two expressions are
the same.
(a|b) = (b|a) Commutative (union is commutative).
(a|b)|c = a|(b|c) Associative
(ab)c = a(bc) Associative
a(b|c) = ab | ac Distributive (Concatenation distributes over union)
Note: ε is the identity for concatenation.
εa = a
aε = a
(a|ε)* = a*
a** = a* (IDEMPOTENT: Repeated operations have no effect.)
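A quick way to gain confidence in these identities is to compare two expressions on every string up to some length. This is a bounded check, not a proof and not the minimum-state DFA comparison mentioned above; the function name agree and the bound max_len are illustrative. Python's re rejects a** as written, so the idempotence law is tested as (a*)*.

```python
import re
from itertools import product

def agree(p, q, alphabet="abc", max_len=5):
    """True if patterns p and q accept exactly the same strings over
    alphabet among all strings of length at most max_len."""
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            s = "".join(symbols)
            if bool(re.fullmatch(p, s)) != bool(re.fullmatch(q, s)):
                return False
    return True
```

For instance, agree("a(b|c)", "ab|ac") holds, while agree("a*", "a+") fails on the empty string.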
Shortcuts:
[abc] = a|b|c
[a-z] = a|b|c|...|z
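The range shortcut is just an abbreviation for a long union, which can be expanded mechanically. A minimal sketch; expand_range is an illustrative helper, and it assumes the range follows character-code order as it does for ASCII letters.

```python
import re
import string

def expand_range(first, last):
    """Expand a range such as a-c into the explicit union a|b|c."""
    return "|".join(chr(c) for c in range(ord(first), ord(last) + 1))

# The shorthand and the expanded union accept the same single characters.
same = all(
    bool(re.fullmatch("[a-z]", ch)) == bool(re.fullmatch(expand_range("a", "z"), ch))
    for ch in string.printable
)
```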
Limitations of RE’s:
There is no regular expression for nested (balanced) curly braces.
There is no regular expression for repeated strings (for example, {ww | w is a string of a's and b's}).
Regular expressions can denote only a fixed number of repetitions or an unspecified
number of repetitions of a given construct.
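Checking nested braces needs a count that can grow without bound, which is the kind of memory a regular expression (or DFA) does not have. A minimal sketch of the counter-based check; the function name braces_balanced is illustrative.

```python
def braces_balanced(text):
    """True if every { has a matching later } and braces nest properly."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:      # a closing brace with nothing open
                return False
    return depth == 0          # nothing left open at the end
```

In a compiler this kind of check belongs to the parser, since the grammar can express nesting.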
Note: The Dragon book talks a little about lex in section 3.5.
Examples:
Section 3.6 of Dragon book:
Describe the language denoted by the following:
r = 0(0|1)*0
Answer: All strings of 0's and 1's of length at least 2 that begin and end with a 0. The
shortest such string is 00.
r = ((ε|0)1*)*
Note that r can be rewritten as: (0?1*)*
r = 0*10*10*10* denotes the language of all strings of 0's and 1's with exactly three 1’s.
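The descriptions above can be checked against the expressions by brute force over short binary strings. A minimal sketch; same_language and the length bound are illustrative, and a bounded search is evidence rather than proof. The third check assumes (0?1*)* denotes every string of 0's and 1's, which the notes do not state explicitly.

```python
import re
from itertools import product

def binary_strings(max_len):
    """Yield every string of 0's and 1's of length at most max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

def same_language(pattern, predicate, max_len=8):
    """True if pattern accepts exactly the strings for which predicate holds."""
    return all(
        bool(re.fullmatch(pattern, s)) == predicate(s)
        for s in binary_strings(max_len)
    )

begins_and_ends_with_0 = same_language(
    r"0(0|1)*0", lambda s: len(s) >= 2 and s[0] == "0" and s[-1] == "0"
)
exactly_three_ones = same_language(r"0*10*10*10*", lambda s: s.count("1") == 3)
all_binary = same_language(r"(0?1*)*", lambda s: True)
```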
Section 3.7: Given a language, give the regular expression.
L = [a-zA-Z]
Find the regular expression for strings of letters in which the vowels a, e, i, o, u appear
in order, each at least once.
L*aL*eL*iL*oL*uL*
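The expression above can be tried on real words with Python's re module. A minimal sketch, with [a-zA-Z]* standing in for L*; the names L_STAR, PATTERN and has_vowels_in_order are illustrative. Since L* can absorb any letters, including other vowels, the pattern accepts any word that contains a, e, i, o, u as a subsequence in that order.

```python
import re

L_STAR = "[a-zA-Z]*"
# Builds L*aL*eL*iL*oL*uL* by placing L* between and around the vowels.
PATTERN = L_STAR + L_STAR.join("aeiou") + L_STAR

def has_vowels_in_order(word):
    return re.fullmatch(PATTERN, word) is not None
```

For example, "facetious" is accepted, while "education" is rejected because no e follows its a.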
D = [0-9]
Find the regular expression denoting all strings of digits with no repeated digits.
The trivial way: write out all combinations.
ε|0|1|2|3|4|...|01|02|....
End of notes for 4/12/03 Saturday 1