01-LexAnalRegExFSM

Lexical Analysis, Regular Expressions & Finite State Machines
Processing English
• Consider the following two sentences
• Hi, I am 22 years old. I come from Alabama.
• 22 come Alabama I, old from am. Hi years I.
• Are they both correct?
• How do you know?
• Same words, numbers and punctuation
• What did you do first?
1. Find words, numbers and punctuation
2. Then, check order (grammar rules)
Finding Words and Numbers
• How did you find words, numbers and punctuation?
• You have a definition of what each is, or looks like
• For example, what is a number? a word?
• Although you are a bit more agile, the process was:
1. Start with first character
2. If letter, assume word; if digit, assume number
3. Scan left to right, 1 character at a time, until a delimiter (space, comma, etc.)
4. Recognize word or number
5. If no more characters, done; otherwise return to 1
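The five steps above can be sketched in a few lines of Python. This is only an illustration: the function name `scan` and the labels `WORD`, `NUMBER` and `PUNCT` are assumptions, and treating any letter or digit as part of the current item is a simplification.

```python
def scan(text):
    """Split text into (kind, value) pairs, one character at a time."""
    items = []
    i = 0
    while i < len(text):                      # step 5: stop when no characters remain
        ch = text[i]                          # step 1: look at the next unread character
        if ch.isalpha() or ch.isdigit():
            # step 2: a letter starts a word, a digit starts a number
            kind = "WORD" if ch.isalpha() else "NUMBER"
            start = i
            # step 3: keep scanning until a space or punctuation mark
            while i < len(text) and text[i].isalnum():
                i += 1
            items.append((kind, text[start:i]))   # step 4: recognize the item
        elif ch.isspace():
            i += 1                            # spaces separate items but are not reported
        else:
            items.append(("PUNCT", ch))       # any other character is punctuation
            i += 1
    return items

print(scan("Hi, I am 22 years old."))
```

Checking the order of the items (the grammar rules) is a separate, later step.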
Processing Code
How do you process the following?
What are the main parts in which to break the input?
void quote() {
    print(
        "To iterate is human, to recurse divine."
        +
        " - L. Peter Deutsch"
    );
}
def addABC(x):
    s = "ABC"
    return x + s
addABC(input("String: "))
Schemes:
childOf(X,Y)
marriedTo(X,Y)
Facts:
marriedTo('Zed','Bea').
marriedTo('Jack','Jill').
childOf('Jill','Zed').
childOf('Sue','Jack').
Rules:
childOf(X,Y) :- childOf(X,Z), marriedTo(Y,Z).
marriedTo(X,Y) :- marriedTo(Y,X).
Queries:
marriedTo('Bea','Zed')?
childOf('Jill','Bea')?
Example
def addABC ( x ) :
    s = "ABC"
    return x + s
addABC ( input ( "String: " ) )
What are the Parts?
• They are called TOKENS
• Process similar to English processing
• Lexical Analysis
• Input:
A program in some language
• Output:
A list of tokens
(type, value, location)
Example Revisited
Sample Input:
def addABC(x):
    s = "ABC"
    return x + s
addABC(input("String: "))
Sample Output:
(FUNDEF,"def",1)
(ID,"addABC",1)
(LEFT_PAREN,"(",1)
(ID,"x",1)
(RIGHT_PAREN,")",1)
(COLON,":",1)
(ID,"s",2)
(ASSIGN,"=",2)
(STRING,"'ABC'",2)
(FUNRET,"return",3)
(ID,"x",3)
(OPERATOR,"+",3)
(ID,"s",3)
(ID,"addABC",4)
(LEFT_PAREN,"(",4)
…
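A token list of this shape could be produced along the following lines. This is a minimal sketch, not the course's lexer: the token type names come from the sample output, while the regular-expression patterns, the `SKIP` type and the function name `tokenize` are assumptions. Strings keep their double quotes as they appear in the input.

```python
import re

# Token types follow the sample output; the patterns themselves are assumptions.
TOKEN_SPEC = [
    ("FUNDEF", r"def\b"),
    ("FUNRET", r"return\b"),
    ("ID", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("STRING", r'"[^"\n]*"'),
    ("LEFT_PAREN", r"\("),
    ("RIGHT_PAREN", r"\)"),
    ("COLON", r":"),
    ("ASSIGN", r"="),
    ("OPERATOR", r"\+"),
    ("SKIP", r"[ \t]+"),          # whitespace is consumed but not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (type, value, line number) triples."""
    tokens = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        pos = 0
        while pos < len(line):
            match = MASTER.match(line, pos)
            if match is None:
                raise ValueError(f"unexpected character {line[pos]!r} on line {line_number}")
            if match.lastgroup != "SKIP":
                tokens.append((match.lastgroup, match.group(), line_number))
            pos = match.end()
    return tokens

program = 'def addABC(x):\n    s = "ABC"\n    return x + s\naddABC(input("String: "))'
for token in tokenize(program):
    print(token)
```

Keywords are listed before `ID` so that `def` and `return` are not reported as identifiers; the `\b` keeps a name such as `define` from being split.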
Program Compilation
Lexical Analysis is the first step of the process
[Diagram: Program Code -> Compiler (Lexical Analyzer -> Tokens -> Parser -> Internal Data -> Code Generator) -> Program Code]
• Lexical Analyzer: keywords, string literals, variables, …
• Parser: syntax analysis
• Error messages
• Or Interpreter (executed directly)
Token Specification
• Regular Expressions
• Pattern description for strings
• Concatenation: abc -> “abc”
• Boolean OR: ab|ac -> “ab”, “ac”
• Kleene closure: ab* -> “a”, “ab”, “abbb”, etc.
• Optional: ab?c -> “ac”, “abc”
• One or more: ab+ -> “ab”, “abbb”
• Group using ()
• (a|b)c -> “ac”, “bc”
• (a|b)*c -> “c”, “ac”, “bc”, “bac”, “abaaabbbabbaaaaac”, etc.
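These operators can be tried directly with Python's standard `re` module, which uses the same notation for them. The helper `matches` is a hypothetical name introduced here; `re.fullmatch` requires the whole string to match the pattern.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

print(matches(r"abc", ["abc", "ab"]))                    # concatenation
print(matches(r"ab|ac", ["ab", "ac", "a", "abc"]))       # Boolean OR
print(matches(r"ab*", ["a", "ab", "abbb", "b"]))         # Kleene closure
print(matches(r"ab?c", ["ac", "abc", "abbc"]))           # optional
print(matches(r"ab+", ["a", "ab", "abbb"]))              # one or more
print(matches(r"(a|b)*c", ["c", "ac", "bac", "ca"]))     # grouping
```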
RegEx Extensions
• Exactly n: a{3}b+ -> “aaab”, “aaabb”, …
• [A-Z] = A|B|…|Z
• [ABC] = A|B|C
• [^aA] = any character but “a” or “A”
• \ = escape character (e.g., \* -> “*”)
• Whitespace characters
• \s, \t, \n, \v
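The extensions can be checked the same way with Python's `re` module. Note that Python spells "exactly three a's" as `a{3}` and the negated character class as `[^aA]`; the helper name `full` is an assumption made for this sketch.

```python
import re

def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

print(full(r"a{3}b+", "aaabb"), full(r"a{3}b+", "aab"))   # exactly n
print(full(r"[A-Z]", "Q"), full(r"[A-Z]", "q"))           # character range
print(full(r"[ABC]", "B"), full(r"[ABC]", "D"))           # character set
print(full(r"[^aA]", "b"), full(r"[^aA]", "A"))           # negated set
print(full(r"\*", "*"))                                   # escaped metacharacter
print(all(full(r"\s", ch) for ch in " \t\n\v"))           # whitespace characters
```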
Token Recognition
• Finite State Machine
• A DFSM is a 5-tuple (Σ,S,s0,δ,F)
• Σ: finite, non-empty set of symbols (input alphabet)
• S: finite, non-empty set of states
• s0: member of S designated as start state
• δ: state-transition function δ: S x Σ -> S
• F: subset of S (final states, may be empty)
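The 5-tuple maps almost directly onto a small data structure. The sketch below encodes a machine for ab* and runs it on a few strings; the state names and the function `accepts` are assumptions. One simplification: δ is stored as a partial table, and a missing entry is treated as a move to an implicit dead state, so the function rejects.

```python
def accepts(dfsm, text):
    """Run a DFSM given as (alphabet, states, start, delta, finals) on text."""
    alphabet, states, start, delta, finals = dfsm
    state = start
    for symbol in text:
        if symbol not in alphabet or (state, symbol) not in delta:
            return False                      # implicit dead state
        state = delta[(state, symbol)]
    return state in finals                    # accept only if we end in F

# A machine for ab*: one a, followed by any number of b's.
AB_STAR = (
    {"a", "b"},                               # Σ
    {"s0", "s1"},                             # S
    "s0",                                     # s0
    {("s0", "a"): "s1", ("s1", "b"): "s1"},   # δ
    {"s1"},                                   # F
)

for text in ["a", "abbb", "", "ba", "aba"]:
    print(repr(text), accepts(AB_STAR, text))
```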
FSM & RegEx
• abc
• a(b|c)
• ab*
• (a(b?c))+
[State diagrams omitted: one machine per expression, with transitions labeled a, b and c.]
Note the special double-circle designation of a final/accepting state.
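As an illustration of how a regular expression corresponds to a machine, here is one possible transition table for (a(b?c))+. The state names q0 to q3 are assumptions; the five transitions are read off the expression, and q3 plays the role of the double-circled accepting state.

```python
# One possible machine for (a(b?c))+ ; state names are assumptions.
DELTA = {
    ("q0", "a"): "q1",   # every repetition starts with a
    ("q1", "b"): "q2",   # the optional b
    ("q1", "c"): "q3",   # c directly after a (b skipped)
    ("q2", "c"): "q3",   # c after the optional b
    ("q3", "a"): "q1",   # start another repetition
}
START, FINAL = "q0", {"q3"}

def accepts(text):
    state = START
    for symbol in text:
        state = DELTA.get((state, symbol))
        if state is None:                     # no transition defined: reject
            return False
    return state in FINAL

for text in ["ac", "abc", "acabc", "", "ab", "abbc"]:
    print(repr(text), accepts(text))
```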
Finite State Transducer
• Extended FSM:
• Γ: finite, non-empty set of symbols (output alphabet)
• δ: state-transition function δ: S x Σ -> S x Γ
• FST consumes input symbols and emits output symbols
• Lexical analyzer
• consume raw characters
• emit tokens
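A transducer differs from a plain FSM only in that each transition also yields an output. The toy sketch below emits WORD or NUMBER at the start of each run of letters or digits. All names are assumptions, input characters are first grouped into three classes to keep the table small, and `None` stands for "emit nothing", which slightly relaxes the S x Σ -> S x Γ definition.

```python
def classify(ch):
    """Group raw characters into three input classes."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (state, input class) -> (next state, output symbol or None for no output)
DELTA = {
    ("gap", "letter"): ("word", "WORD"),
    ("gap", "digit"): ("number", "NUMBER"),
    ("gap", "other"): ("gap", None),
    ("word", "letter"): ("word", None),
    ("word", "digit"): ("word", None),
    ("word", "other"): ("gap", None),
    ("number", "digit"): ("number", None),
    ("number", "letter"): ("word", "WORD"),
    ("number", "other"): ("gap", None),
}

def transduce(text):
    state, output = "gap", []
    for ch in text:
        state, emitted = DELTA[(state, classify(ch))]
        if emitted is not None:
            output.append(emitted)
    return output

print(transduce("Hi, I am 22 years old."))
```

A real lexical analyzer would emit complete tokens (type, value, location) rather than bare type names, but the consume-and-emit structure is the same.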
CS 236 Coolness Factor!
• Design our own language
• Subset of Datalog (LP-like)
• Build an interpreter for our language
• Lexical Analyzer (Project 1)
• Parser (Project 2)
• Interpreter (Projects 3 and 4)
• Optimization (Project 5)
Designing a Language
• Define the tokens
• Elements of the language, punctuation, etc.
• For example, what are they in C++?
• Recognize the tokens (lexical analysis)
• Define the grammar
• Forms of correct sentences
• For example, what are they in C++?
• Recognize the grammar (parsing)
• Interpret and execute the program
• C++ is a bit too complicated for us…
Varied World Views
fct personlist siblings(person x) {
    return x’s siblings
}
fct boolean sibling(person x, person y) {
    if y is x’s sibling return T else return F
}
fct int square(int x) {
    return x * x
}
fct boolean square(int x, int y) {
    if y == x * x return T else return F
}
fct boolean succeeds(person x) {
    if studies(x) return T else return F
}
fct boolean succeeds(person x) {
    if studies(x) return T else return F
}
Look up table or oracle
No concerns with efficiency
Logic Programming
• Assume: all functions are Boolean
• Compute using facts and rules
• Facts are the known true values of the functions
• Rules express relations among functions
• Example: studies(x), succeeds(x)
• Facts: studies(Matt), studies(Jenny)
• Rule: succeeds(x) :- studies(x)
• Closed-world Assumption
Logic Programming
• Computing is like issuing queries
• First check if it can be answered with facts
• Second check if rules can be applied
• Examples
• studies(Alex)?
• NO (neither facts nor rules to establish it)
• studies(Matt)?
• YES (there is a fact about that)
• succeeds(Jenny)?
• YES (no fact, but a rule that if Jenny studies then she succeeds and a fact that Jenny studies)
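The two-step check (facts first, then rules) fits in a few lines. This is a minimal sketch limited to one-argument predicates and rules with a single body predicate; `FACTS`, `RULES` and `query` are names assumed for the illustration, and a cyclic rule set would make this naive version recurse forever.

```python
# Facts: studies(Matt), studies(Jenny)
FACTS = {("studies", "Matt"), ("studies", "Jenny")}
# Rule: succeeds(x) :- studies(x), stored as head predicate -> body predicate
RULES = {"succeeds": "studies"}

def query(predicate, person):
    if (predicate, person) in FACTS:          # first: can a fact answer it?
        return True
    body = RULES.get(predicate)               # second: can a rule be applied?
    if body is not None:
        return query(body, person)
    return False                              # closed world: not provable means NO

print(query("studies", "Alex"))
print(query("studies", "Matt"))
print(query("succeeds", "Jenny"))
```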
Functions of Several Arguments
• Examples
• loves(x,y), parent(x,y), inclass(x,y)
• loves(x,y) :- married(x,y)
• Computing
• parent(Christophe, Samuel)?
• Yes, if there is a fact that matches
• parent(Christophe, X)?
• Yes, if there is a value of X that would cause it to match a fact – return value of X
• loves(X, Y)?
• Yes, if there are values of X and Y that would make this true, either by matching a fact or via rules (e.g., married(Christophe, Isabelle)) – return values of X and Y
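Matching a query that contains variables against facts can be sketched as follows. The two facts are assumed for the illustration (the slide only mentions them as possibilities), `Var`, `match` and `answer` are hypothetical names, and only the single rule loves(x,y) :- married(x,y) is hard-coded.

```python
class Var:
    """A query variable such as X or Y."""
    def __init__(self, name):
        self.name = name

# Facts assumed for this illustration.
FACTS = [("parent", "Christophe", "Samuel"), ("married", "Christophe", "Isabelle")]

def match(query, fact):
    """Return variable bindings if the fact matches the query, else None."""
    if len(query) != len(fact) or query[0] != fact[0]:
        return None
    bindings = {}
    for q, f in zip(query[1:], fact[1:]):
        if isinstance(q, Var):
            if bindings.setdefault(q.name, f) != f:   # same variable, different value
                return None
        elif q != f:                                  # constants must be equal
            return None
    return bindings

def answer(query):
    """Return one bindings dict per way of making the query true."""
    results = []
    for fact in FACTS:
        bindings = match(query, fact)
        if bindings is not None:
            results.append(bindings)
    if query[0] == "loves":                           # loves(x,y) :- married(x,y)
        results += answer(("married",) + tuple(query[1:]))
    return results

print(answer(("parent", "Christophe", "Samuel")))
print(answer(("parent", "Christophe", Var("X"))))
print(answer(("loves", Var("X"), Var("Y"))))
```

An empty result list means "No"; a list containing an empty dictionary means "Yes" with no variables to report.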
When We Are Done
Sample Program:
Schemes:
snap(S,N,A,P)
csg(C,S,G)
cn(C,N)
ncg(N,C,G)
Facts:
snap('12345','C. Brown','12 Apple St.','555-1234').
snap('22222','P. Patty','56 Grape Blvd','555-9999').
snap('33333','Snoopy','12 Apple St.','555-1234').
csg('CS101','12345','A').
csg('CS101','22222','B').
csg('CS101','33333','C').
csg('EE200','12345','B+').
csg('EE200','22222','B').
Rules:
cn(C,N) :- snap(S,N,A,P),csg(C,S,G).
ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).
Queries:
cn('CS101',Name)?
ncg('Snoopy',Course,Grade)?
Sample Execution:
cn('CS101',Name)? Yes(3)
  Name='C. Brown'
  Name='P. Patty'
  Name='Snoopy'
ncg('Snoopy',Course,Grade)? Yes(1)
  Course='CS101', Grade='C'
Demo…
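To see where the sample answers come from, the two rules can be evaluated by hand as a join of snap and csg on the shared variable S. This is only a sketch of that one computation in Python, not the interpreter built in the projects; the names `SNAP`, `CSG`, `cn` and `ncg` simply mirror the program.

```python
SNAP = [
    ("12345", "C. Brown", "12 Apple St.", "555-1234"),
    ("22222", "P. Patty", "56 Grape Blvd", "555-9999"),
    ("33333", "Snoopy", "12 Apple St.", "555-1234"),
]
CSG = [
    ("CS101", "12345", "A"), ("CS101", "22222", "B"), ("CS101", "33333", "C"),
    ("EE200", "12345", "B+"), ("EE200", "22222", "B"),
]

def cn():
    """cn(C,N) :- snap(S,N,A,P),csg(C,S,G).  Join on S, keep C and N."""
    return {(c, n) for (s, n, a, p) in SNAP for (c, s2, g) in CSG if s == s2}

def ncg():
    """ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).  Join on S, keep N, C and G."""
    return {(n, c, g) for (s, n, a, p) in SNAP for (c, s2, g) in CSG if s == s2}

# cn('CS101',Name)?
names = sorted(n for (c, n) in cn() if c == "CS101")
print(f"Yes({len(names)})", names)

# ncg('Snoopy',Course,Grade)?
rows = sorted((c, g) for (n, c, g) in ncg() if n == "Snoopy")
print(f"Yes({len(rows)})", rows)
```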
Project 1: Lexical Analyzer
Sample Input:
Queries:
IsInRoomAtDH('Snoopy',R,'M',H)
#SchemesFactsRules
.
Sample Output:
(QUERIES,"Queries",1)
(COLON,":",1)
(ID,"IsInRoomAtDH",2)
(LEFT_PAREN,"(",2)
(STRING,"'Snoopy'",2)
(COMMA,",",2)
(ID,"R",2)
(COMMA,",",2)
(STRING,"'M'",2)
(COMMA,",",2)
(ID,"H",2)
(RIGHT_PAREN,")",2)
(COMMENT,"#SchemesFactsRules",3)
(PERIOD,".",4)
Total Tokens = 14
Define and find the tokens
Basic FST for Project 1
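[State diagram; transitions recoverable from the figure:]
• start, on ': begin a string; each <character (except <cr> and <eof>)> continues it; a closing ' accepts a string; <cr> or <eof> leads to error
• start, on :: accept : or, if followed by -, accept :-
• …
• start, on <space> | <tab> | <cr>: white space; repeats on <space> | <tab> | <cr>
• start, on <letter>: ident. or keywd.; repeats on <letter> | <digit>
• Special check for Keywords (Schemes, Facts, Rules, Queries)
• start, on <eof>: eof
• start, on <any other char>: error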
Implementing a FST
State in Variable
state = START;
input = readChar();
while (state != ACCEPT) {
    if (state == START) {
        if (input == QUOTE) {
            input = readChar();
            state = STRING;
        } else if (input == ...) {
            ... other kinds of tokens ...
        }
    } else if (state == STRING) {
        if (input == QUOTE) {
            input = readChar();
            state = ACCEPT;
        } else {
            input = readChar();
            state = STRING;
        }
    }
}
State in Position in Code
input = readChar();
// begin in START state
if (input == QUOTE) {
    input = readChar();
    // now in STRING state
    while (input != QUOTE) {
        input = readChar();
        // stay in STRING state
    }
    input = readChar();
    // now in ACCEPT state
} else if (input == ...) {
    ... other kinds of tokens ...
}
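For comparison, here is a runnable Python version of the "state in variable" approach, restricted to the string token. It is a sketch under assumptions: the function name `scan_string` and the explicit ERROR state are additions. The ERROR state is there because neither loop above checks for the end of the input, so an unterminated string would never leave the STRING state.

```python
def scan_string(text):
    """Return the quoted string at the start of text, or None on error."""
    START, STRING, ACCEPT, ERROR = "START", "STRING", "ACCEPT", "ERROR"
    state = START
    pos = 0
    while state not in (ACCEPT, ERROR):
        ch = text[pos] if pos < len(text) else None   # None stands for <eof>
        if state == START:
            if ch == "'":
                pos += 1
                state = STRING
            else:
                state = ERROR                 # other kinds of tokens are not handled here
        elif state == STRING:
            if ch == "'":
                pos += 1
                state = ACCEPT                # closing quote ends the token
            elif ch is None or ch == "\n":
                state = ERROR                 # <cr> or <eof> inside a string
            else:
                pos += 1                      # stay in STRING
    return text[:pos] if state == ACCEPT else None

print(scan_string("'Snoopy',R,'M',H)"))
print(scan_string("'unterminated"))
```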