CS488: Compilers
4/12/03 Session 1 Notes

Midterm: Know about the Lexer and the Parser.
The Lexer is due next Saturday (4/19/03) after class.

Notes regarding homework:
    A STRING_CONST cannot go beyond one line.
    A string cannot contain a double quote. [Makes STRING_CONST matching easier]
    See your lexer documentation regarding start conditions. (Ex: using BEGIN...)
    Assume all numbers are INTEGERS and that they are all positive.
    No comments are in the Grammar.

Review:

Lexer: Recognizes tokens.
    Errors the lexer can detect:
        Unrecognized character.
        Invalid token.
        Unterminated strings.
        Unterminated comments.

Parser: Enforces the grammar. Works with the Lexer. Matches the longest substring.
    Example: given input 'bc' and the following rules:
        a -> bc
        d -> b
        f -> c
    The parser will match a -> bc, since both b AND c of the input are matched by a -> bc.
    (bc is the longest matching 'substring' for the input.)

Semantic Analyzer: Checks for three types of errors:
        Scope of variables.
        Inheritance cycles.
        Type checking.

Code Generator: At this point there are "no errors." The code generator does exactly what its name implies: it generates code for the target language.

End of Review

A compiler-compiler takes a regular expression and converts it to a DFA of the target language. (DFA = Deterministic Finite Automaton)

Definitions:
    Alphabet or character class: Any finite set of symbols.
        Example: the alphabet of binary is Σ = {0, 1}.
    String over an alphabet: A finite sequence of symbols from the alphabet. (Also known as a word or sentence.)
        ε is the string of length 0.
    Language: Set of strings over an alphabet.
    Grammar: Set of rules. (Note: the grammar is more related to the parser than to the lexer; see the review.)

Note: See chapter 3 of the Dragon Book.

Special languages:
    {ε}   the language with one string of length 0.
    ∅     the empty set (= {})

Assume x, y are strings.
    concatenation:   xy
    exponentiation:  x^0 = ε
                     x^i = x^(i-1) x,  i > 0
                     x^2 = xx
                     x^4 = x^3 x = x^2 xx = xxxx

Assume L, M are languages.
    concatenation:   LM
    exponentiation:  L^0 = {ε}
                     L^i = L^(i-1) L

Given:
    L = {A, B, ..., Z, a, b, c, ..., z}
    D = {0, 1, ..., 9}
Then,
    L ∪ D        Union of two sets.
    LD           Concat. (Character followed by a digit)
    L^4          LLLL (Set of character strings of length 4)
    L*           Kleene closure (zero or more letters); includes L and ε
    L(L ∪ D)*    Begin with a letter, followed by 0 or more letters or numbers.
    D+           Positive closure (one or more occurrences of D)

Also take note of the following:
    L ∪ M = {s | s is in L or s is in M}
    LM    = {st | s is in L and t is in M}
    L*    = union of L^i for i = 0, 1, 2, ...   (Union of languages)
    L+    = union of L^i for i = 1, 2, 3, ...
    L?    = union of L^i for i = 0, 1           (that is, L ∪ {ε})

Regular Expressions:
A regular expression r denotes a language L(r). A regular expression determines a language: one can find the language from a given regular expression.

    Regular Expression (r = )    Denotes this Language
    ε                            {ε}
    a                            {a}
    r|s                          L(r) ∪ L(s)
    rs                           L(r)L(s)
    r*                           (L(r))*
    (r)                          L(r)   [indicates () are optional]

Precedence (High to Low): * -> concatenation -> Union
    rst* ≠ (rst)*
    rst* = rs(t*)

Given Σ = {a, b} then,

    Regular Expression    Denotes...
    a|b                   {a, b}
    (a|b)(a|b)            {aa, ab, ba, bb}
    a*                    {ε, a, aa, aaa, ...}
    (a|b)*                {ε, a, b, aa, ab, ba, bb, ...}

Two regular expressions denote the same language if the minimum-state DFA built from each is the same (up to the names of the states). For example, (a*b*)* is the same as (a*b*)+.

    (a|b) = (b|a)            Commutative (union is commutative).
    (a|b)|c = a|(b|c)        Associative
    (ab)c = a(bc)            Associative
    a(b|c) = ab | ac         Distributive (Concatenation distributes over union)

Note: ε is the identity for concatenation.
    εa = a
    aε = a
    (a|ε)* = a*
    a** = a*                 (IDEMPOTENT: Repeated operations have no effect.)

Shortcuts:
    [abc] = a|b|c
    [a-z] = a|b|c|...|z

Limitations of RE's:
    There is no regular expression for nested (balanced) curly braces.
    There is no regular expression for repeated strings, for example {wcw | w is a string of a's and b's}.
    Regular expressions can denote only a fixed number of repetitions or an unspecified number of repetitions of a given construct.

Note: The Dragon book talks a little about lex in section 3.5.
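As a rough illustration of the language operations and identities in these notes, here is a minimal Python sketch. It models a language as a finite Python set of strings, which is an assumption made only for this example. The names EPSILON, concat, power and closure_up_to are made up here and come from no library. Since L* is infinite, the closure is cut off at a chosen exponent.

```python
import re

EPSILON = ""  # the empty string (the string of length 0)

def concat(L, M):
    """LM = {st | s is in L and t is in M}."""
    return {s + t for s in L for t in M}

def power(L, i):
    """L^0 = {EPSILON}; L^i = L^(i-1) L."""
    result = {EPSILON}
    for _ in range(i):
        result = concat(result, L)
    return result

def closure_up_to(L, n):
    """Finite slice of L*: the union of L^0 through L^n."""
    result = set()
    for i in range(n + 1):
        result |= power(L, i)
    return result

sigma = {"a", "b"}

# (a|b)(a|b) denotes every string of length 2 over the alphabet.
print(sorted(concat(sigma, sigma)))
# ['aa', 'ab', 'ba', 'bb']

# The first few members of (a|b)*, shortest first.
print(sorted(closure_up_to(sigma, 2), key=lambda s: (len(s), s)))
# ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']

# Spot checks of two identities on single-string languages.
a, b, c = {"a"}, {"b"}, {"c"}
assert concat(a, b | c) == concat(a, b) | concat(a, c)     # a(b|c) = ab | ac
assert concat({EPSILON}, a) == a == concat(a, {EPSILON})   # empty string is the identity

# "Begin with a letter, followed by 0 or more letters or numbers",
# written with the [a-z] shortcut in Python's re syntax.
identifier = re.compile(r"[A-Za-z][A-Za-z0-9]*")
print(bool(identifier.fullmatch("x27")))     # True
print(bool(identifier.fullmatch("9lives")))  # False
```

The set model only works for small finite languages, but it is enough to try out the identities by hand before trusting them.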
Examples:

Section 3.6 of Dragon book: Describe the language denoted by the following:

    r = 0(0|1)*0
        Answer: All strings of 0's and 1's that begin and end with a 0. The shortest such string is 00.

    r = ((ε|0)1*)*
        Note that r can be rewritten as: (0?1*)*
        Since 0?1* matches both 0 and 1, r denotes all strings of 0's and 1's, including ε.

    r = 0*10*10*10*
        Denotes the language of all strings of 0's and 1's that contain exactly three 1's.

Section 3.7: Given a language, give the regular expression.

    L = [a-zA-Z]
    Find the regular expression such that the vowels are in consecutive order and appear at least once.
        L*aL*eL*iL*oL*uL*

    D = [0-9]
    Find the regular expression denoting all strings of digits with no repeated digits.
        The trivial way: write out all combinations.
        ε|0|1|2|3|4|...|01|02|....

End of notes for Saturday, 4/12/03
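The answers to the exercises above can be sanity-checked by brute force. The sketch below is only an illustration using Python's re module, not part of the course material. The helper binary_strings, the variable names, and the length bound of 8 are all assumptions chosen for this example. It compares each regular expression against a plain-Python description of its language, then counts how many alternatives the "trivial way" for the no-repeated-digit language would need.

```python
import re
from itertools import product
from math import perm  # available in Python 3.8 and later

def binary_strings(max_len):
    """Yield every string over {0, 1} of length 0 through max_len."""
    for n in range(max_len + 1):
        for symbols in product("01", repeat=n):
            yield "".join(symbols)

begins_and_ends_with_0 = re.compile(r"0(0|1)*0")
exactly_three_ones = re.compile(r"0*10*10*10*")
any_binary_string = re.compile(r"(0?1*)*")

for s in binary_strings(8):
    # 0(0|1)*0: length at least 2, first and last symbol are 0.
    expected = len(s) >= 2 and s[0] == "0" and s[-1] == "0"
    assert bool(begins_and_ends_with_0.fullmatch(s)) == expected
    # 0*10*10*10*: exactly three 1's.
    assert bool(exactly_three_ones.fullmatch(s)) == (s.count("1") == 3)
    # (0?1*)*: every binary string, including the empty one.
    assert any_binary_string.fullmatch(s) is not None

# L*aL*eL*iL*oL*uL* with L = [a-zA-Z]
vowels_in_order = re.compile(
    r"[a-zA-Z]*a[a-zA-Z]*e[a-zA-Z]*i[a-zA-Z]*o[a-zA-Z]*u[a-zA-Z]*"
)
print(bool(vowels_in_order.fullmatch("facetious")))  # True
print(bool(vowels_in_order.fullmatch("education")))  # False

# The "trivial way" lists every digit string with no repeated digit:
# for each length k from 0 to 10 there are 10!/(10-k)! such strings.
alternatives = sum(perm(10, k) for k in range(11))
print(alternatives)  # 9864101
```

A check over short strings is not a proof, but a mismatch would show up quickly. The final count shows why writing out all combinations is called the trivial way: the resulting expression is finite but has millions of alternatives.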