Discussion #1
Finite State Machines & Regular Expressions

Topics
• Compilers and Interpreters
• Lexical Analyzers
• Regular Expressions
• Finite State Machines & Finite State Transducers
• Project 1

Compilers for Programming Languages
[Diagram: inside the Compiler, Program → Lexical Analyzer → Tokens (keywords, string literals, variables, …) → Parser (syntax analysis; error messages) → Internal Data → Code Generator → Code; or Interpreter (executed directly)]

Series of 5 Projects: Datalog Interpreter
Example Input:
    Schemes:
      snap(S,N,A,P)
      csg(C,S,G)
      cn(C,N)
      ncg(N,C,G)
    Facts:
      snap('12345','C. Brown','12 Apple St.','555-1234').
      snap('22222','P. Patty','56 Grape Blvd','555-9999').
      snap('33333','Snoopy','12 Apple St.','555-1234').
      csg('CS101','12345','A').
      csg('CS101','22222','B').
      csg('CS101','33333','C').
      csg('EE200','12345','B+').
      csg('EE200','22222','B').
    Rules:
      cn(c,n) :- snap(S,n,A,P),csg(c,S,G).
      ncg(n,c,g) :- snap(S,n,A,P),csg(c,S,g).
    Queries:
      cn('CS101',Name)?
      ncg('Snoopy',Course,Grade)?
Example Output:
    cn('CS101',Name)? Yes(3)
      Name='C. Brown'
      Name='P. Patty'
      Name='Snoopy'
    ncg('Snoopy',Course,Grade)? Yes(1)
      Course='CS101', Grade='C'

Project 1: Lexical Analyzer
Example Input:
    Queries:
    IsInRoomAtDH('Snoopy',R,'M',H)
    #SchemesFactsRules
    .
    #|comment >= wow|#
Example Output:
    (QUERIES,"Queries",1)
    (COLON,":",1)
    (ID,"IsInRoomAtDH",2)
    (LEFT_PAREN,"(",2)
    (STRING,"'Snoopy'",2)
    (COMMA,",",2)
    (ID,"R",2)
    (COMMA,",",2)
    (STRING,"'M'",2)
    (COMMA,",",2)
    (ID,"H",2)
    (RIGHT_PAREN,")",2)
    (COMMENT,"#SchemesFactsRules",3)
    (PERIOD,".",4)
    (COMMENT,"#|comment >= wow|#",5)
    (EOF,"",7)
    Total Tokens = 16

The Point of CS 236
• Use mathematics to write better code.
  – in Project 1: some sample code to help get started
  – in later projects: continue this process independently
• Project 1: Use a Finite State Machine to write a Lexical Analyzer.
• Lexical analyzers can identify patterns of text to be turned into tokens.
• Regular expressions also identify patterns of text and are equivalent in pattern recognition power.
• We’ll start first with regular expressions, which more intuitively identify text patterns, and then return to finite state machines, which more directly correspond to the code we need to write to identify text patterns in our lexical analyzer.

Regular Expressions
• Pattern description for strings
• Standard patterns:
  – Concatenation: abc matches …abc… but not …abdc… or …ac…
  – Boolean or: ab|ac matches …ab… and also …ac… but not …cba… or …bc…
  – Kleene closure: ab* matches …a… and …ab… and …abb… and …
• Common shorthand patterns
  – Optional: ab?c matches …ac… and …abc… but not …abbc…; short for ac|abc
  – One or more: ab+ matches …ab… and …abb… and … but not …a…; short for abb*

Regular Expressions & Parens
• Parens group regular expressions as expected
• Examples:
  – (a|b)c matches …ac… and …bc…
  – (a|b)*c matches …c… and …ac… and …bac… and …ababababbbabbabaaaababaababbbbc… and …
  – (a|b)?c matches …c… and …ac… and …bc…

Regular Expression Extensions
• Additional shorthand and notation
  – [ABC] = A|B|C
  – [A-Za-z] = A|B|…|Z|a|b|…|z
  – [A-Za-z]{4,7} matches any 4-7 letter sequence, e.g. …McKay…
  – \ is an escape character: \* matches …*… and \, matches …,…
  – Special characters:
    – Digit: \d
    – Word boundary: \b
• Languages and language extensions/packages
  – Perl
  – Java regular-expression packages
• Regular expression testers:
  – RegExr
  – regexpal

Regular Expressions & Finite State Machines
• abc [state diagram]
• a(b|c) [state diagram]
• ab* [state diagram]
  Note the special double-circle designation of an accepting state.
• (a(b?c))+ [state diagram]

Formal Definition of a Finite State Machine & a Finite State Transducer
A deterministic finite state machine is a quintuple (Σ,S,s0,δ,F), where:
• Σ is the input alphabet (a finite, non-empty set of symbols).
• S is a finite, non-empty set of states.
• s0 is an initial state, an element of S.
• δ is the state-transition function: δ : S × Σ → S.
• F is the set of final states, a (possibly empty) subset of S.
A finite state transducer is a 6-tuple (Σ,Γ,S,s0,δ,F) as above except:
• Γ is the output alphabet (a finite, non-empty set of symbols).
• δ is the state-transition function: δ : S × Σ → S × Γ.

Project 1: Lexical Analyzer
Varieties (description and examples for each):

<String>
  Description: Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line-number counts, count all '\n's within a string. A string token’s line number is the line where the string starts.
  Examples: 'quoted string'   'this isn''t two strings'   '' (empty string)   'don''t forget about multiline strings'

<Keyword>
  Description: One of the following four character sequences: Schemes, Facts, Rules, Queries. These keywords are case sensitive.
  Example: Schemesa is a single identifier, not a keyword followed by an identifier.

<Identifier>
  Description: A letter followed by a sequence of zero or more letters or digits. No underscores.
  Legal identifiers: Identifier1   Person
  Invalid identifiers: 1stPerson   Person_Name

<Symbol>
  Description: One of the following character sequences:  :  ,  <  >  =  (  :-  .  <=  >=  !=  )  *  +  ?
  Examples: <=('a','b')   ( + ()   ::-   ???

White Space
  Description: Ignore white space; that is, do not output a token for white space, just skip over it. White space includes any encountered spaces, tabs, new lines, and carriage returns. Be sure to count the lines when skipping over white space.

<Undefined>
  Description: Any character not tokenized as a string, keyword, identifier, symbol, or white space. Any non-terminating string or non-terminating comment is undefined; in both of these cases EOF was reached before the end of the string or comment was found.
  Examples: $&^ (three individual tokens)   'any string that doesn''t end

<Comment>
  Description: A line comment starts with # and ends at newline. A block comment starts at #| and ends with |#. The comment’s line number is the line where the comment started.
  Examples: #this is a comment   #|this is a multiline comment|#

<EOF>
  Description: End of input file.

Basic FSM for Project 1
[State diagram; recoverable transitions out of the start state:
  ' → string (loops on any character except ' and <eof>; ' → string quote, from which another ' returns to string; <eof> → u_eof)
  < → "< or <=", then = → "<="
  …
  <space> | <tab> | <cr> → white space (loops on <space> | <tab> | <cr>)
  <letter> → ident. or keywd. (loops on <letter> | <digit>; special check for keywords Schemes, Facts, Rules, Queries)
  <eof> → eof
  <any other char> → undef.]

Get the Design Right
Code must directly represent a state machine:
• Σ: Set of characters (the keyboard character set)
• S: Set of states (enum)
• s0: An initial state (one of the states in the set of states)
• δ: S × Σ → S × Γ: transition function δ for each state:
  – Input: the current state and the next character
  – Output: the next state, plus a TokenType (if the current token is now complete) or null (if the current token is incomplete)
State machine loop:
• Evaluates state transitions
• Builds and emits tokens
• Dirty work: discards whitespace tokens, tracks line numbers, etc.

State.cpp: List of States
    …
    enum State {Comma, Period, SawColon, Colon_Dash, SawAQuote, ProcessingString,
                PossibleEndOfString, Start, End };
    …

Lex.cpp: State Initialization/Termination
    void Lex::generateTokens(Input* input) {
        tokens = new vector<Token*>();
        index = 0;
        state = Start;              // s0, the initial state
        while(state != End) {       // run until the final state is reached
            state = nextState();
        }
    }

Lex.cpp: State Transition Function
    …
    State Lex::nextState() {
        State result;
        char character;
        switch(state) {
            case Start:
                result = getNextState();
                break;
            case Comma:
                emit(COMMA);
                result = getNextState();
                break;
            case Period:
                emit(PERIOD);
                result = getNextState();
                break;
            case SawColon:
                character = input->getCurrentCharacter();
                if(character == '-') {
                    result = Colon_Dash;
                    input->advance();
                } else { //Every other character
                    // Build the message as a string: adding a char to a string
                    // literal is pointer arithmetic, not concatenation.
                    throw string("ERROR:: in case SawColon:, Expecting '-' but found ")
                          + character + '.';
                }
                break;
            case Colon_Dash:
                emit(COLON_DASH);
                result = getNextState();
                break;
            case SawAQuote:
                character = input->getCurrentCharacter();
                …

Lex.cpp: Get Next State for State Transition Function
    State Lex::getNextState() {
        State result;
        char currentCharacter = input->getCurrentCharacter();
        switch(currentCharacter) {
            case ','  : result = Comma; break;
            case '.'  : result = Period; break;
            case ':'  : result = SawColon; break;
            case '\'' : result = ProcessingString; break;
            case -1   : result = End; break;   // end-of-input marker; matches only where char is signed
            default:
                string error = "ERROR:: in Lex::getNextState, Expecting ";
                error += "'\'', '.', '?', '(', ')', '+', '*', '=', '!', '<', '>', ':' but found ";
                error += currentCharacter;
                error += '.';
                // Throw the string itself: error.c_str() would dangle once
                // this local string is destroyed during unwinding.
                throw error;
        }
        input->advance();
        return result;
    }

Lex.cpp: Emit for State Transition Function
    void Lex::emit(TokenType tokenType) {
        Token* token = new Token(tokenType, input->getTokensValue(),
                                 input->getCurrentTokensLineNumber());
        storeToken(token);
        input->mark();   // the next token starts at the current input position
    }

TokenType.cpp: Turns the Token Type into a String for Output
    string TokenTypeToString(TokenType tokenType){
        string result = "";
        switch(tokenType){
            case COMMA:      result = "COMMA"; break;
            case PERIOD:     result = "PERIOD"; break;
            case COLON_DASH: result = "COLON_DASH"; break;
            case STRING:     result = "STRING"; break;
            case NUL:        result = "NUL"; break;
        }
        return result;
    }
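
The design these slides describe (an enum of states, a transition chosen from the current state and the next character, and a loop that emits tokens) can be condensed into one self-contained sketch. This is not the course's Lex class: the names Token, TokenType, State, lex, emit, and take are hypothetical, only a subset of the Project 1 tokens is handled (no comments, keywords, or comparison symbols), and C++11 or later is assumed.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds: a small subset of the Project 1 token types.
enum class TokenType { COMMA, PERIOD, COLON, COLON_DASH, ID, STRING, UNDEFINED, END_OF_FILE };

struct Token {
    TokenType type;
    std::string text;
    int line;  // line on which the token starts
};

// S, the set of states. Start is s0; Done is the only final state.
enum class State { Start, SawColon, InIdent, InString, SawQuote, Done };

std::vector<Token> lex(const std::string& input) {
    std::vector<Token> tokens;
    State state = State::Start;
    std::string text;             // characters of the token being built
    int line = 1, startLine = 1;
    std::size_t i = 0;            // index of the next unread character

    // Emit the finished token and return to the start state.
    auto emit = [&](TokenType type) {
        tokens.push_back({type, text, startLine});
        text.clear();
        state = State::Start;
    };
    // Consume the current character into the token and move to `next`.
    auto take = [&](State next) {
        if (input[i] == '\n') ++line;
        text += input[i++];
        state = next;
    };

    while (state != State::Done) {
        const bool atEnd = i >= input.size();
        const unsigned char c = atEnd ? 0 : static_cast<unsigned char>(input[i]);
        switch (state) {
        case State::Start:
            startLine = line;
            if (atEnd) { emit(TokenType::END_OF_FILE); state = State::Done; }
            else if (std::isspace(c)) { if (c == '\n') ++line; ++i; }  // skip, but count lines
            else if (c == ',') { take(State::Start); emit(TokenType::COMMA); }
            else if (c == '.') { take(State::Start); emit(TokenType::PERIOD); }
            else if (c == ':') { take(State::SawColon); }
            else if (c == '\'') { take(State::InString); }
            else if (std::isalpha(c)) { take(State::InIdent); }
            else { take(State::Start); emit(TokenType::UNDEFINED); }
            break;
        case State::SawColon:  // ':' seen; a following '-' completes ':-'
            if (!atEnd && c == '-') { take(State::Start); emit(TokenType::COLON_DASH); }
            else { emit(TokenType::COLON); }
            break;
        case State::InIdent:  // a letter followed by letters or digits
            if (!atEnd && std::isalnum(c)) { take(State::InIdent); }
            else { emit(TokenType::ID); }
            break;
        case State::InString:
            if (atEnd) { emit(TokenType::UNDEFINED); }  // unterminated string
            else { take(c == '\'' ? State::SawQuote : State::InString); }
            break;
        case State::SawQuote:  // '' is an apostrophe; anything else ends the string
            if (!atEnd && c == '\'') { take(State::InString); }
            else { emit(TokenType::STRING); }
            break;
        case State::Done:
            break;
        }
    }
    // Postcondition of this sketch: the token list always ends with EOF.
    assert(!tokens.empty() && tokens.back().type == TokenType::END_OF_FILE);
    return tokens;
}
```

Calling lex("a :- 'it''s',\n.") yields ID, COLON_DASH, STRING, COMMA, PERIOD, and END_OF_FILE, with the period and EOF reported on line 2. States that finish a token emit it without consuming the lookahead character, so that character is examined again from Start; this keeps each state's transition logic local to its own case.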