Discussion #1 Finite State Machines

Discussion #1
Finite State Machines &
Regular Expressions
Topics
• Compilers and Interpreters
• Lexical Analyzers
• Regular Expressions
• Finite State Machines & Finite State Transducers
• Project 1
Compilers for Programming Languages
[Diagram omitted. Labels: Program Code, Compiler, Lexical Analyzer, Tokens (keywords, string literals, variables, …), Parser (syntax analysis), Internal Data, Code Generator, Code, or Interpreter (executed directly), Error messages.]
Series of 5 Projects: Datalog Interpreter
Example Input:
Schemes:
snap(S,N,A,P)
csg(C,S,G)
cn(C,N)
ncg(N,C,G)
Facts:
snap('12345','C. Brown','12 Apple St.','555-1234').
snap('22222','P. Patty','56 Grape Blvd','555-9999').
snap('33333','Snoopy','12 Apple St.','555-1234').
csg('CS101','12345','A').
csg('CS101','22222','B').
csg('CS101','33333','C').
csg('EE200','12345','B+').
csg('EE200','22222','B').
Rules:
cn(c,n) :- snap(S,n,A,P),csg(c,S,G).
ncg(n,c,g) :- snap(S,n,A,P),csg(c,S,g).
Queries:
cn('CS101',Name)?
ncg('Snoopy',Course,Grade)?
Example Output:
cn('CS101',Name)? Yes(3)
Name='C. Brown'
Name='P. Patty'
Name='Snoopy'
ncg('Snoopy',Course,Grade)? Yes(1)
Course='CS101', Grade='C'
Project 1: Lexical Analyzer
Example Input:
Queries:
IsInRoomAtDH('Snoopy',R,'M',H)
#SchemesFactsRules
.
#|comment >=
wow|#
Example Output:
(QUERIES,"Queries",1)
(COLON,":",1)
(ID,"IsInRoomAtDH",2)
(LEFT_PAREN,"(",2)
(STRING,"'Snoopy'",2)
(COMMA,",",2)
(ID,"R",2)
(COMMA,",",2)
(STRING,"'M'",2)
(COMMA,",",2)
(ID,"H",2)
(RIGHT_PAREN,")",2)
(COMMENT,"#SchemesFactsRules",3)
(PERIOD,".",4)
(COMMENT,"#|comment >=
wow|#",5)
(EOF,"",7)
Total Tokens = 16
The Point of CS 236
• Use mathematics to write better code.
– in Project 1: some sample code to help get started
– in later projects: continue this process independently
• Project 1: Use a Finite State Machine to write a Lexical Analyzer.
• Lexical analyzers can identify patterns of text to be turned into tokens.
• Regular expressions also identify patterns of text and are equivalent in pattern recognition power.
• We’ll start first with regular expressions, which more intuitively identify text patterns, and then return to finite state machines, which more directly correspond to the code we need to write to identify text patterns in our lexical analyzer.
Regular Expressions
• Pattern description for strings
• Standard patterns:
– Concatenation: abc matches …abc… but not …abdc… or …ac…
– Boolean or: ab|ac matches …ab… and also …ac… but not …cba… or …bc…
– Kleene closure: ab* matches …a… and …ab… and …abb… and …
• Common shorthand patterns:
– Optional: ab?c matches …ac… and …abc… but not …abbc… (short for ac|abc)
– One or more: ab+ matches …ab… and …abb… and … but not …a… (short for abb*)
Regular Expressions & Parens
• Parens group regular expressions as expected
• Examples:
– (a|b)c matches …ac… and …bc…
– (a|b)*c matches …c… and …ac… and …bac… and
…ababababbbabbabaaaababaababbbbc… and …
– (a|b)?c matches …c… and …ac… and …bc…
Regular Expression Extensions
• Additional shorthand and notation
– [ABC] = A|B|C
– [A-Za-z] = A|B|…|Z|a|b|…|z
– [A-Za-z]{4,7} matches any 4-7 letter sequence, e.g. …McKay…
– \ is an escape character: \* matches …*… and \, matches …,…
– Special characters:
– Digit: \d
– Word boundary: \b
• Languages and language extensions/packages
– Perl
– Java regular-expression packages
• Regular expression testers:
– RegExr
– regexpal
Regular Expressions &
Finite State Machines
• abc
• a(b|c)
• ab*
• (a(b?c))+
[State diagrams for the four expressions omitted; their transitions are labeled a, b, and c.]
Note the special double-circle designation of an accepting state.
Formal Definition of a Finite State
Machine & a Finite State Transducer
A deterministic finite state machine is a quintuple (Σ, S, s0, δ, F), where:
• Σ is the input alphabet (a finite, non-empty set of symbols).
• S is a finite, non-empty set of states.
• s0 is an initial state, an element of S.
• δ is the state-transition function: δ : S × Σ → S.
• F is the set of final states, a (possibly empty) subset of S.
A finite state transducer is a 6-tuple (Σ, Γ, S, s0, δ, F) as above, except:
• Γ is the output alphabet (a finite, non-empty set of symbols).
• δ is the state-transition function: δ : S × Σ → S × Γ.
Project 1: Lexical Analyzer
Token varieties, each with a description and examples:

<String>
Description: Any sequence of characters enclosed in single quotes. Two single quotes denote an apostrophe within the string. For line-number counts, count all '\n's within a string. A string token’s line number is the line where the string starts.
Examples:
'quoted string'
'this isn''t two strings'
'' (empty string)
'don''t forget
about multiline strings'

<Keyword>
Description: One of the following four character sequences: Schemes, Facts, Rules, Queries. These keywords are case sensitive.
Example: Schemesa is a single identifier, not a keyword followed by an identifier.

<Identifier>
Description: A letter followed by a sequence of zero or more letters or numbers. No underscores.
Legal identifiers: Identifier1, Person
Invalid identifiers: 1stPerson, Person_Name

<Symbol>
Description: One of the following character sequences:
:  :-  ,  .  <  <=  >  >=  =  !=  (  )  *  +  ?
Examples:
<=('a','b')
( + ()
::- ???

White Space
Description: Ignore white space; that is, do not output a token for white space, just skip over it. White space includes any encountered spaces, tabs, new lines, and carriage returns. Be sure to count the lines when skipping over white space.

<Undefined>
Description: Any character not tokenized as a string, keyword, identifier, symbol, or white space. Any non-terminating string or non-terminating comment is also undefined; in both of these cases EOF is reached before the end of the string or comment is found.
Examples:
$&^ (Three individual tokens.)
'any string that doesn''t end

<Comment>
Description: A line comment starts with # and ends at newline. A block comment starts at #| and ends with |#. The comment’s line number is the line where the comment started.
Examples:
#this is a comment
#|this is a
multiline comment|#

<EOF>
Description: End of input file.
Basic FSM for Project 1
[State diagram omitted. State labels: start, string, string quote, < or <=, <=, white space, ident. or keywd., eof, u_eof, undef. Transition labels: ‘, <character (except ‘ and <eof>)>, <, =, <space> | <tab> | <cr>, <letter>, <letter> | <digit>, <eof>, <any other char>, …. Annotation on the ident. or keywd. state: special check for keywords (Schemes, Facts, Rules, Queries).]
Get the Design Right
Code must directly represent a state machine:
• Σ: Set of characters (the keyboard character set)
• S: Set of states (enum)
• s0: An initial state (one of the states in the set of states)
• δ : S × Σ → S × Γ: Transition function δ for each state:
– Input: the current state and the next character
– Output: the next state, and a TokenType (if the current token is now complete) or null (if the current token is incomplete)
• State machine loop:
– Evaluates state transitions
– Builds and emits tokens
– Dirty work: discards whitespace tokens, tracks line numbers, etc.
State.cpp: List of States
…
enum State {
    Comma, Period, SawColon, Colon_Dash, SawAQuote, ProcessingString,
    PossibleEndOfString, Start, End
};
…
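Lex.cpp: State Initialization/Termination
void Lex::generateTokens(Input* input) {
    // Assumption: Lex keeps the input in a member named input, which
    // nextState() reads; the slide never assigns it from the parameter.
    this->input = input;
    tokens = new vector<Token*>();
    index = 0;
    state = Start;
    // Run the machine until it reaches the End state.
    while (state != End) {
        state = nextState();
    }
}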
Lex.cpp: State Transition Function
…
State Lex::nextState() {
    State result;
    char character;
    switch (state) {
        case Start:
            result = getNextState(); break;
        case Comma:
            emit(COMMA); result = getNextState(); break;
        case Period:
            emit(PERIOD); result = getNextState(); break;
        case SawColon:
            character = input->getCurrentCharacter();
            if (character == '-') {
                result = Colon_Dash;
                input->advance();
            } else { // Every other character
                // Build a std::string first: adding a char to a string
                // literal would do pointer arithmetic, not concatenation.
                throw string("ERROR:: in case SawColon:, Expecting '-' but found ")
                      + character + '.';
            }
            break;
        case Colon_Dash:
            emit(COLON_DASH); result = getNextState(); break;
        case SawAQuote:
            character = input->getCurrentCharacter();
            …
Lex.cpp: Get Next State for State Transition Function
State Lex::getNextState() {
    State result;
    char currentCharacter = input->getCurrentCharacter();
    // Pick the next state from the character that starts the next token.
    switch (currentCharacter) {
        case ',' : result = Comma; break;
        case '.' : result = Period; break;
        case ':' : result = SawColon; break;
        case '\'' : result = ProcessingString; break;
        case -1 : result = End; break; // end of input
        default:
            string error = "ERROR:: in Lex::getNextState, Expecting ";
            error += "'\'', '.', '?', '(', ')', '+', '*', '=', '!', '<', '>', ':' but found ";
            error += currentCharacter;
            error += '.';
            // Throw the string itself: error.c_str() would dangle once
            // the local string is destroyed during stack unwinding.
            throw error;
    }
    input->advance();
    return result;
}
Lex.cpp: Emit for State Transition Function
void Lex::emit(TokenType tokenType) {
    Token* token = new Token(tokenType, input->getTokensValue(),
                             input->getCurrentTokensLineNumber());
    storeToken(token);
    input->mark();
}
TokenType.cpp: Turns the Token Type into a String for Output
string TokenTypeToString(TokenType tokenType) {
    string result = "";
    switch (tokenType) {
        case COMMA:
            result = "COMMA"; break;
        case PERIOD:
            result = "PERIOD"; break;
        case COLON_DASH:
            result = "COLON_DASH"; break;
        case STRING:
            result = "STRING"; break;
        case NUL:
            result = "NUL"; break;
    }
    return result;
}