Lexical Analysis, Regular Expressions & Finite State Machines

Processing English
• Consider the following two sentences
  • Hi, I am 22 years old. I come from Alabama.
  • 22 come Alabama I, old from am. Hi years I.
• Are they both correct?
• How do you know?
  • Same words, numbers and punctuation
• What did you do first?
  1. Find words, numbers and punctuation
  2. Then, check order (grammar rules)

Finding Words and Numbers
• How did you find words, numbers and punctuation?
• You have a definition of what each is, or looks like
  • For example, what is a number? a word?
• Although you are a bit more agile, the process was:
  1. Start with first character
  2. If letter, assume word; if digit, assume number
  3. Scan left to right 1 character at a time, until punctuation mark (space, comma, etc.)
  4. Recognize word or number
  5. If no more characters, done; otherwise return to 1

Processing Code
How do you process the following? What are the main parts in which to break the input?

    void quote() {
      print( "To iterate is human, to recurse divine." + " - L. Peter Deutsch" );
    }

    def addABC(x):
      s = "ABC"
      return x + s
    addABC(input("String: "))

    Schemes:
      childOf(X,Y)
      marriedTo(X,Y)
    Facts:
      marriedTo('Zed','Bea').
      marriedTo('Jack','Jill').
      childOf('Jill','Zed').
      childOf('Sue','Jack').
    Rules:
      childOf(X,Y) :- childOf(X,Z), marriedTo(Y,Z).
      marriedTo(X,Y) :- marriedTo(Y,X).
    Queries:
      marriedTo('Bea','Zed')?
      childOf('Jill','Bea')?

Example

    def addABC ( x ) :
      s = "ABC"
      return x + s
    addABC ( input ( "String: " ) )

What are the Parts?
• They are called TOKENS
• Process similar to English processing
• Lexical Analysis
  • Input: A program in some language
  • Output: A list of tokens (type, value, location)

Example Revisited
Sample Input:

    def addABC(x):
      s = "ABC"
      return x + s
    addABC(input("String: "))

Sample Output:

    (FUNDEF,"def",1)
    (ID,"addABC",1)
    (LEFT_PAREN,"(",1)
    (ID,"x",1)
    (RIGHT_PAREN,")",1)
    (COLON,":",1)
    (ID,"s",2)
    (ASSIGN,"=",2)
    (STRING,"'ABC'",2)
    (FUNRET,"return",3)
    (ID,"x",3)
    (OPERATOR,"+",3)
    (ID,"s",3)
    (ID,"addABC",4)
    (LEFT_PAREN,"(",4)
    …

Program Compilation
Lexical Analysis is the first step of the process
[Figure: compilation pipeline. Program code enters the compiler, where the Lexical Analyzer produces tokens (keywords, string literals, variables, …), the Parser performs syntax analysis and reports error messages, and the Code Generator turns internal data into program code; or an interpreter executes the program directly.]

Token Specification
• Regular Expressions
  • Pattern description for strings
  • Concatenation: abc -> "abc"
  • Boolean OR: ab|ac -> "ab", "ac"
  • Kleene closure: ab* -> "a", "ab", "abbb", etc.
  • Optional: ab?c -> "ac", "abc"
  • One or more: ab+ -> "ab", "abbb", etc.
  • Group using ()
    • (a|b)c -> "ac", "bc"
    • (a|b)*c -> "c", "ac", "bc", "bac", "abaaabbbabbaaaaac", etc.

RegEx Extensions
• Exactly n: a{3}b+ -> "aaab", "aaabb", …
• [A-Z] = A|B|…|Z
• [ABC] = A|B|C
• [~aA] = any character but "a" or "A" (written [^aA] in most regex engines)
• \ = escape character (e.g., \* -> "*")
• Whitespace characters
  • \s, \t, \n, \v

Token Recognition
• Finite State Machine
• A DFSM is a 5-tuple (Σ,S,s0,δ,F)
  • Σ: finite, non-empty set of symbols (input alphabet)
  • S: finite, non-empty set of states
  • s0: member of S designated as start state
  • δ: state-transition function δ: S x Σ -> S
  • F: subset of S (final states, may be empty)

FSM & RegEx
• abc [state diagram; edge labels: a, b, c]
• a(b|c) [state diagram; edge labels: a, b, c]
• ab* [state diagram; edge labels: a, b]
Note the special double-circle designation of a final/accepting state.
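To make the 5-tuple concrete, here is a minimal Python sketch of a DFSM for ab*. The class name `DFSM`, the method `accepts`, and the state labels `q0` and `q1` are hypothetical, chosen only for illustration. One simplification is assumed: δ is stored as a partial table, and a missing entry is treated as rejection, which stands in for the explicit dead state a total function δ: S x Σ -> S would need.

```python
from dataclasses import dataclass


@dataclass
class DFSM:
    """A deterministic finite state machine (Σ, S, s0, δ, F)."""
    alphabet: frozenset   # Σ: input alphabet
    states: frozenset     # S: set of states
    start: str            # s0: start state
    delta: dict           # δ, stored as {(state, symbol): next_state}
    finals: frozenset     # F: final (accepting) states

    def accepts(self, text: str) -> bool:
        """Run the machine over text and report whether it ends in F."""
        state = self.start
        for symbol in text:
            next_state = self.delta.get((state, symbol))
            if next_state is None:
                # Assumption: a missing transition means "reject"
                # (an implicit dead state).
                return False
            state = next_state
        return state in self.finals


# Machine for ab*: one "a", then any number of "b"s.
AB_STAR = DFSM(
    alphabet=frozenset("ab"),
    states=frozenset({"q0", "q1"}),
    start="q0",
    delta={("q0", "a"): "q1", ("q1", "b"): "q1"},
    finals=frozenset({"q1"}),
)

if __name__ == "__main__":
    for candidate in ["a", "ab", "abbb", "", "b", "aab"]:
        print(repr(candidate), AB_STAR.accepts(candidate))
```

With this table, "a", "ab" and "abbb" are accepted, while the empty string, "b" and "aab" are rejected. The final state `q1` plays the role of the double-circled state in a diagram.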
• (a(b?c))+ [state diagram; edge labels: a, b, c, c, a]

Finite State Transducer
• Extended FSM:
  • Γ: finite, non-empty set of symbols (output alphabet)
  • δ: state-transition function δ: S x Σ -> S x Γ
• FST consumes input symbols and emits output symbols
• Lexical analyzer
  • consume raw characters
  • emit tokens

CS 236 Coolness Factor!
• Design our own language
  • Subset of Datalog (LP-like)
• Build an interpreter for our language
  • Lexical Analyzer (Project 1)
  • Parser (Project 2)
  • Interpreter (Projects 3 and 4)
  • Optimization (Project 5)

Designing a Language
• Define the tokens
  • Elements of the language, punctuation, etc.
  • For example, what are they in C++?
• Recognize the tokens (lexical analysis)
• Define the grammar
  • Forms of correct sentences
  • For example, what are they in C++?
• Recognize the grammar (parsing)
• Interpret and execute the program
• C++ is a bit too complicated for us…

Varied World Views

    fct personlist siblings(person x) {
      return x's siblings
    }
    fct boolean sibling(person x, person y) {
      if y is x's sibling return T else return F
    }

    fct int square(int x) {
      return x * x
    }
    fct boolean square(int x, int y) {
      if y == x * x return T else return F
    }

    fct boolean succeeds(person x) {
      if studies(x) return T else return F
    }
    fct boolean succeeds(person x) {
      if studies(x) return T else return F
    }

• Look up table or oracle
• No concerns with efficiency

Logic Programming
• Assume: all functions are Boolean
• Compute using facts and rules
  • Facts are the known true values of the functions
  • Rules express relations among functions
• Example: studies(x), succeeds(x)
  • Facts: studies(Matt), studies(Jenny)
  • Rule: succeeds(x) :- studies(x)
• Closed-world Assumption

Logic Programming
• Computing is like issuing queries
  • First check if it can be answered with facts
  • Second check if rules can be applied
• Examples
  • studies(Alex)?
    • NO (neither facts nor rules to establish it)
  • studies(Matt)?
    • YES (there is a fact about that)
  • succeeds(Jenny)?
    • YES (no fact, but a rule that if Jenny studies then she succeeds and a fact that Jenny studies)

Functions of Several Arguments
• Examples
  • loves(x,y), parent(x,y), inclass(x,y)
  • loves(x,y) :- married(x,y)
• Computing
  • parent(Christophe, Samuel)?
    • Yes, if there is a fact that matches
  • parent(Christophe, X)?
    • Yes, if there is a value of X that would cause it to match a fact; return value of X
  • loves(X, Y)?
    • Yes, if there are values of X and Y that would make this true, either by matching a fact or via rules (e.g., married(Christophe, Isabelle)); return values of X and Y

When We Are Done
Sample Program:

    Schemes:
      snap(S,N,A,P)
      csg(C,S,G)
      cn(C,N)
      ncg(N,C,G)
    Facts:
      snap('12345','C. Brown','12 Apple St.','555-1234').
      snap('22222','P. Patty','56 Grape Blvd','555-9999').
      snap('33333','Snoopy','12 Apple St.','555-1234').
      csg('CS101','12345','A').
      csg('CS101','22222','B').
      csg('CS101','33333','C').
      csg('EE200','12345','B+').
      csg('EE200','22222','B').
    Rules:
      cn(C,N) :- snap(S,N,A,P),csg(C,S,G).
      ncg(N,C,G) :- snap(S,N,A,P),csg(C,S,G).
    Queries:
      cn('CS101',Name)?
      ncg('Snoopy',Course,Grade)?

Sample Execution:

    cn('CS101',Name)? Yes(3)
      Name='C. Brown'
      Name='P. Patty'
      Name='Snoopy'
    ncg('Snoopy',Course,Grade)? Yes(1)
      Course='CS101', Grade='C'

Demo…

Project 1: Lexical Analyzer
Sample Input:

    Queries:
    IsInRoomAtDH('Snoopy',R,'M',H)
    #SchemesFactsRules
    .

Sample Output:

    (QUERIES,"Queries",1)
    (COLON,":",1)
    (ID,"IsInRoomAtDH",2)
    (LEFT_PAREN,"(",2)
    (STRING,"'Snoopy'",2)
    (COMMA,",",2)
    (ID,"R",2)
    (COMMA,",",2)
    (STRING,"'M'",2)
    (COMMA,",",2)
    (ID,"H",2)
    (RIGHT_PAREN,")",2)
    (COMMENT,"#SchemesFactsRules",3)
    (PERIOD,".",4)
    Total Tokens = 14

Define and find the tokens

Basic FST for Project 1
[State diagram. Recoverable labels: start; ' leading to string, which continues on <character (except <cr> and <eof>)>, ends on ', and goes to error on <cr> or <eof>; : leading to the : or :- tokens; <space> | <tab> | <cr> leading to white space; <letter> leading to ident. or keywd.; <eof>; <any other char>; …]
[State diagram labels, continued: eof; error; - completing :-; white space continues on <space> | <tab> | <cr>; ident. or keywd. continues on <letter> | <digit>]
Special check for Keywords (Schemes, Facts, Rules, Queries)

Implementing an FST

State in Variable

    state = START;
    input = readChar();
    while (state != ACCEPT) {
      if (state == START) {
        if (input == QUOTE) {
          input = readChar();
          state = STRING;
        } else if (input == ...) {
          ... other kinds of tokens ...
        }
      } else if (state == STRING) {
        if (input == QUOTE) {
          input = readChar();
          state = ACCEPT;
        } else {
          input = readChar();
          state = STRING;
        }
      }
    }

State in Position in Code

    input = readChar(); // begin in START state
    if (input == QUOTE) {
      input = readChar(); // now in STRING state
      while (input != QUOTE) {
        input = readChar(); // stay in STRING state
      }
      input = readChar(); // now in ACCEPT state
    } else if (input == ...) {
      ... other kinds of tokens ...
    }
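The "state in variable" approach can be sketched end to end in Python. This is one possible sketch and not the project's reference solution. The function `tokenize`, the state names, and the token names SCHEMES, FACTS, RULES, COLON_DASH, Q_MARK and UNDEFINED are assumptions; only the token names that appear in the Project 1 sample output come from the slides. It also assumes that a comment runs from # to the end of the line and that no EOF token is emitted, since the sample output counts 14 tokens without one.

```python
KEYWORDS = {"Schemes": "SCHEMES", "Facts": "FACTS",
            "Rules": "RULES", "Queries": "QUERIES"}
SINGLE = {",": "COMMA", ".": "PERIOD", "?": "Q_MARK",
          "(": "LEFT_PAREN", ")": "RIGHT_PAREN"}


def tokenize(text):
    """Return a list of (type, value, line) tuples for the input text."""
    tokens = []
    state, start, start_line, line, i = "START", 0, 1, 1, 0
    while i <= len(text):
        ch = text[i] if i < len(text) else ""   # "" stands in for <eof>
        if state == "START":
            start, start_line = i, line
            if ch == "":
                break
            elif ch == "'":
                state = "STRING"
            elif ch == "#":
                state = "COMMENT"
            elif ch == ":":
                state = "COLON"
            elif ch.isalpha():
                state = "IDENT"
            elif ch in SINGLE:
                tokens.append((SINGLE[ch], ch, line))
            elif not ch.isspace():
                tokens.append(("UNDEFINED", ch, line))
            # White space falls through and is skipped.
        elif state == "STRING":
            if ch == "'":
                tokens.append(("STRING", text[start:i + 1], start_line))
                state = "START"
            elif ch in ("", "\n"):
                # Unterminated string: <cr> or <eof> leads to error.
                tokens.append(("UNDEFINED", text[start:i], start_line))
                state = "START"
                continue                      # re-examine ch in START
        elif state == "COMMENT":
            if ch in ("", "\n"):
                tokens.append(("COMMENT", text[start:i], start_line))
                state = "START"
                continue
        elif state == "COLON":
            state = "START"
            if ch == "-":
                tokens.append(("COLON_DASH", ":-", start_line))
            else:
                tokens.append(("COLON", ":", start_line))
                continue
        elif state == "IDENT":
            if not ch.isalnum():
                lexeme = text[start:i]
                # Special check for keywords, otherwise an identifier.
                kind = KEYWORDS.get(lexeme, "ID")
                tokens.append((kind, lexeme, start_line))
                state = "START"
                continue
        if ch == "\n":
            line += 1
        i += 1
    return tokens


if __name__ == "__main__":
    sample = "Queries:\nIsInRoomAtDH('Snoopy',R,'M',H)\n#SchemesFactsRules\n.\n"
    result = tokenize(sample)
    for kind, value, line_no in result:
        print(f'({kind},"{value}",{line_no})')
    print(f"Total Tokens = {len(result)}")
```

Run on the Project 1 sample input, this sketch yields the same 14 tokens as the sample output. The `continue` statements mark the places where a token ends on a character that belongs to the next token, so that character is looked at again from the START state instead of being consumed. Note that `str.isalpha` and `str.isalnum` also accept non-ASCII letters and digits, which may be broader than the project intends.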