ch2

Scanner 2015.03.16 Front End Source code Front End IR Back End Machine code Errors The purpose of the front end is to deal with the input language  Perform a membership test: code  source language?  Is the program well-formed (semantically) ?  Build an IR version of the code for the rest of the compiler The front end deals with form (syntax) & meaning (semantics) The Front End Source code Scanner tokens IR Parser Errors Implementation Strategy Scanning Parsing Specify Syntax regular expressions context-free grammars Implement Recognizer deterministic finite automaton push-down automaton Perform Work Actions on transitions in automaton The Front End stream of characters Scanner microsyntax stream of tokens Parser syntax IR + annotations Errors Why separate the scanner and the parser?  Scanner classifies words  Parser constructs grammatical derivations  Parsing is harder and slower Separation simplifies the implementation  Scanners are simple  Scanner leads to a faster, smaller parser Scanner is only pass that touches every character of the input. token is a pair <part of speech, lexeme > Scanner Generator Why study automatic scanner construction?  Avoid writing scanners by hand  Harness the theory from classes like COMP 481 compile time design time source code Scanner parts of speech & words tables or code specifications Scanner Generator Represent words as indices into a global table Specifications written as “regular expressions” Goals:   To simplify specification & implementation of scanners To understand the underlying techniques and technologies Comp 412, Fall 2010 5 Strings and Languages  Alphabet  An alphabet  is a finite set of symbols (characters)  String  A string is a finite sequence of symbols from  s denotes the length of string s  denotes the empty string, thus  = 0  Language  A language is a countable set of strings over some fixed alphabet  Abstract Language Φ {ε} String Operations  Concatenation (連接)  The concatenation of two strings x and y is denoted by xy  Identity (單位元素)  The empty string is the identity under concatenation. s=s=s  Exponentiation  Define s0 =  si = si-1s for i > 0  By Define s1 = s s2 = ss Language Operations  Union L  M = { s  s  L or s  M }  Concatenation L M = { xy  x  L and y  M}  Exponentiation L0 = {  } Li = Li-1L  Kleene closure (封閉包) L* = ∪i=0,…, Li  Positive closure L+ = ∪i=1,…, Li Regular Expressions Regular Expressions A convenient means of specifying certain simple sets of strings. We use regular expressions to define structures of tokens. Tokens are built from symbols of a finite vocabulary. Regular Sets The sets of strings defined by regular expressions. Regular Expressions  Basis symbols:   is a regular expression denoting language L() = {}  a   is a regular expression denoting L(a) = {a}  If r and s are regular expressions denoting languages L(r) and M(s) respectively, then  rs is a regular expression denoting L(r)  M(s)  rs is a regular expression denoting L(r)M(s)  r* is a regular expression denoting L(r)*  (r) is a regular expression denoting L(r)  A language defined by a regular expression is called a regular set. Operator Precedence Operator Precedence Associative * highest left concatenation Second left | lowest left Algebraic Laws for Regular Expressions Law r|s=s|r Description | is commutative r | ( s | t ) = ( r | s ) | t | is associative r(st) = (rs)t r(s|t) = rs | rt (s|t)r = sr | tr concatenation is associative concatenation distributes over | εr = rε = r ε is the identity for concatenation r* = ( r |ε)* ε is guaranteed in a closure r** = r* * is idempotent Examples of Regular Expressions Identifiers: Letter  (a|b|c| … |z|A|B|C| … |Z) Digit  (0|1|2| … |9) Identifier  Letter ( Letter | Digit )* Numbers: shorthand for (a|b|c| … |z|A|B|C| … |Z) ((a|b|c| … |z|A|B|C| … |Z) | (0|1|2| … |9))* Integer  (+|-|) (0| (1|2|3| … |9)(Digit *) ) Decimal  Integer . Digit * Real  ( Integer | Decimal ) E (+|-|) Digit * Complex  ( Real , Real ) Numbers can get much more complicated! Using symbolic names does not imply recursion underlining indicates a letter in the input stream 13 Finite Automata  Finite Automata are recognizers.  FA simply say “Yes” or “No” about each possible input string.  A FA can be used to recognize the tokens specified by a regular expression  Use FA to design of a Lexical Analyzer Generator  Two kind of the Finite Automata  Nondeterministic finite automata (NFA)  Deterministic finite automata (DFA)  Both DFA and NFA are capable of recognizing the same languages. NFA Definitions NFA = { S, , , s0, F } A finite set of states S A set of input symbols Σ input alphabet, ε is not in Σ A transition function   : S    S A special start state s0 A set of final states F, F  S (accepting states) Transition Graph for FA is a state is a transition is a the start state is a final state Example 0 a a 1 3 2 b c c   This machine accepts abccabc, but it rejects abcab. This machine accepts (abc+)+. Transition Table The mapping  of an NFA can be represented in a transition table a start a 0 1 b 2 b 3 b (0, a) = {0,1} (0, b) = {0} (1, b) = {2} (2, b) = {3} STATE a b ε 0 {0, 1} {0} - 1 - {2} - 2 - {3} - 3 - - - DFA DFA is a special case of an NFA There are no moves on input ε For each state s and input symbol a, there is exactly one edge out of s labeled a. Both DFA and NFA are capable of recognizing the same languages. S = {0,1,2,3}  = {a, b} s0 = 0 F = {3} NFA vs DFA a start a 0 b 1 b 2 3 b (a | b)*abb b 0 a 1 a b 2 b 3 a a Concept

ch2

Related documents

Products

Support

ch2

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib