Scanner 2015.03.16 Front End Source code Front End IR Back End Machine code Errors The purpose of the front end is to deal with the input language Perform a membership test: code source language? Is the program well-formed (semantically) ? Build an IR version of the code for the rest of the compiler The front end deals with form (syntax) & meaning (semantics) The Front End Source code Scanner tokens IR Parser Errors Implementation Strategy Scanning Parsing Specify Syntax regular expressions context-free grammars Implement Recognizer deterministic finite automaton push-down automaton Perform Work Actions on transitions in automaton The Front End stream of characters Scanner microsyntax stream of tokens Parser syntax IR + annotations Errors Why separate the scanner and the parser? Scanner classifies words Parser constructs grammatical derivations Parsing is harder and slower Separation simplifies the implementation Scanners are simple Scanner leads to a faster, smaller parser Scanner is only pass that touches every character of the input. token is a pair <part of speech, lexeme > Scanner Generator Why study automatic scanner construction? Avoid writing scanners by hand Harness the theory from classes like COMP 481 compile time design time source code Scanner parts of speech & words tables or code specifications Scanner Generator Represent words as indices into a global table Specifications written as “regular expressions” Goals: To simplify specification & implementation of scanners To understand the underlying techniques and technologies Comp 412, Fall 2010 5 Strings and Languages Alphabet An alphabet is a finite set of symbols (characters) String A string is a finite sequence of symbols from s denotes the length of string s denotes the empty string, thus = 0 Language A language is a countable set of strings over some fixed alphabet Abstract Language Φ {ε} String Operations Concatenation (連接) The concatenation of two strings x and y is denoted by xy Identity (單位元素) The empty string is the identity under concatenation. s=s=s Exponentiation Define s0 = si = si-1s for i > 0 By Define s1 = s s2 = ss Language Operations Union L M = { s s L or s M } Concatenation L M = { xy x L and y M} Exponentiation L0 = { } Li = Li-1L Kleene closure (封閉包) L* = ∪i=0,…, Li Positive closure L+ = ∪i=1,…, Li Regular Expressions Regular Expressions A convenient means of specifying certain simple sets of strings. We use regular expressions to define structures of tokens. Tokens are built from symbols of a finite vocabulary. Regular Sets The sets of strings defined by regular expressions. Regular Expressions Basis symbols: is a regular expression denoting language L() = {} a is a regular expression denoting L(a) = {a} If r and s are regular expressions denoting languages L(r) and M(s) respectively, then rs is a regular expression denoting L(r) M(s) rs is a regular expression denoting L(r)M(s) r* is a regular expression denoting L(r)* (r) is a regular expression denoting L(r) A language defined by a regular expression is called a regular set. Operator Precedence Operator Precedence Associative * highest left concatenation Second left | lowest left Algebraic Laws for Regular Expressions Law r|s=s|r Description | is commutative r | ( s | t ) = ( r | s ) | t | is associative r(st) = (rs)t r(s|t) = rs | rt (s|t)r = sr | tr concatenation is associative concatenation distributes over | εr = rε = r ε is the identity for concatenation r* = ( r |ε)* ε is guaranteed in a closure r** = r* * is idempotent Examples of Regular Expressions Identifiers: Letter (a|b|c| … |z|A|B|C| … |Z) Digit (0|1|2| … |9) Identifier Letter ( Letter | Digit )* Numbers: shorthand for (a|b|c| … |z|A|B|C| … |Z) ((a|b|c| … |z|A|B|C| … |Z) | (0|1|2| … |9))* Integer (+|-|) (0| (1|2|3| … |9)(Digit *) ) Decimal Integer . Digit * Real ( Integer | Decimal ) E (+|-|) Digit * Complex ( Real , Real ) Numbers can get much more complicated! Using symbolic names does not imply recursion underlining indicates a letter in the input stream 13 Finite Automata Finite Automata are recognizers. FA simply say “Yes” or “No” about each possible input string. A FA can be used to recognize the tokens specified by a regular expression Use FA to design of a Lexical Analyzer Generator Two kind of the Finite Automata Nondeterministic finite automata (NFA) Deterministic finite automata (DFA) Both DFA and NFA are capable of recognizing the same languages. NFA Definitions NFA = { S, , , s0, F } A finite set of states S A set of input symbols Σ input alphabet, ε is not in Σ A transition function : S S A special start state s0 A set of final states F, F S (accepting states) Transition Graph for FA is a state is a transition is a the start state is a final state Example 0 a a 1 3 2 b c c This machine accepts abccabc, but it rejects abcab. This machine accepts (abc+)+. Transition Table The mapping of an NFA can be represented in a transition table a start a 0 1 b 2 b 3 b (0, a) = {0,1} (0, b) = {0} (1, b) = {2} (2, b) = {3} STATE a b ε 0 {0, 1} {0} - 1 - {2} - 2 - {3} - 3 - - - DFA DFA is a special case of an NFA There are no moves on input ε For each state s and input symbol a, there is exactly one edge out of s labeled a. Both DFA and NFA are capable of recognizing the same languages. S = {0,1,2,3} = {a, b} s0 = 0 F = {3} NFA vs DFA a start a 0 b 1 b 2 3 b (a | b)*abb b 0 a 1 a b 2 b 3 a a Concept