CS153 Problem Set 0 is due Friday at 5pm. See Lucas or me if you need help! I will be holding office hours Thursday night, 5:45–7:15, in the Science Center.

Reading:
▪ Relevant chapters on lexing & parsing in Appel
▪ OCamlLex and OCamlYacc documentation
▪ “Monadic Parsing in Haskell” by G. Hutton and E. Meijer

Parsing has two pieces conceptually:
▪ Recognizing syntactically valid phrases.
▪ Extracting semantic content from the syntax.
  E.g., what is the subject of the sentence? What is the verb phrase? Is the syntax ambiguous, and if so, which meaning do we take?
  ▪ “Fruit flies like a banana”
  ▪ “2 * 3 + 4”
  ▪ “x ^ f y”
In practice, we solve both problems at the same time.

We use grammars to specify the syntax of a language:

exp   ::= int | var | exp ‘+’ exp | exp ‘*’ exp | ‘let’ var ‘=’ exp ‘in’ exp ‘end’
int   ::= ‘-’? digit+
var   ::= alpha (alpha | digit)*
digit ::= ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha ::= [a-zA-Z]

Top-down: to see if a sentence is legal, start with the start non-terminal, and keep expanding non-terminals until you can match against the sentence.

N ::= ‘a’ | ‘(’ N ‘)’

Deriving “((a))”:
N → ‘(’ N ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ ‘(’ ‘a’ ‘)’ ‘)’ = “((a))”

Bottom-up: start with the sentence, and replace phrases with the corresponding non-terminals, repeating until you derive the start non-terminal.

‘(’ ‘(’ ‘a’ ‘)’ ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ N ‘)’ → N

For real grammars, automating this non-deterministic search is non-trivial. As we’ll see, naïve implementations must do a lot of backtracking in the search. Ideally, given a grammar, we would like an efficient, deterministic algorithm to decide whether a string matches it.
▪ There is a very general cubic-time algorithm, but only linguists use it. (In part, we don’t because recognition is only half the problem.)
▪ Certain classes of grammars have much more efficient implementations: essentially linear time with constant state (DFAs), or linear time with stack-like state (pushdown automata).
▪ Manual parsing (say, recursive descent): tedious, error-prone, and hard to maintain.
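The recursive-descent idea for the toy grammar N ::= ‘a’ | ‘(’ N ‘)’ can be sketched in a few lines. This is an illustrative Python sketch (the names `parse_n` and `accepts` are ours), not the OCaml tooling the course actually uses:

```python
# Hypothetical recursive-descent recognizer for:  N ::= 'a' | '(' N ')'
def parse_n(s, i=0):
    """Try to match non-terminal N starting at s[i]; return the index
    just past the match, or None on failure."""
    if i < len(s) and s[i] == 'a':          # production N ::= 'a'
        return i + 1
    if i < len(s) and s[i] == '(':          # production N ::= '(' N ')'
        j = parse_n(s, i + 1)
        if j is not None and j < len(s) and s[j] == ')':
            return j + 1
    return None

def accepts(s):
    """The whole string is legal iff N matches all of it."""
    return parse_n(s) == len(s)
```

Note how the single non-terminal becomes a single function; a grammar with more non-terminals would get one (possibly mutually recursive) function per non-terminal.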
But fast, with good error messages.

Parsing combinators: encode grammars as (higher-order) functions.
▪ Basically, functions that generate recursive-descent parsers.
▪ Makes it easy to write & maintain grammars.
▪ But they can do a lot of backtracking, and they require a limited form of grammar (e.g., no left recursion).

Lex and Yacc: domain-specific languages that generate very efficient, table-driven parsers for general classes of grammars.
▪ You learn about the theory in 121; here we need to know just enough to use these tools effectively in CS153.

Regular expressions are non-recursive grammars, built from:
▪ ∅ (matches no string)
▪ ε (epsilon, which matches the empty string)
▪ Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from the alphabet
▪ Concatenation (R1 R2)
▪ Alternation (R1 | R2)
▪ Kleene star (R*)

A regular expression denotes a set of strings:
[[∅]] = { }
[[ε]] = { “” }
[[‘a’]] = { “a” }
[[R1 R2]] = { s | s = s1 ^ s2 and s1 in [[R1]] and s2 in [[R2]] }
[[R1 | R2]] = { s | s in [[R1]] or s in [[R2]] }
[[R*]] = [[ε | R R*]] = { s | s = “” or s = s1 ^ s2 where s1 in [[R]] and s2 in [[R*]] }

We might recognize numbers as:
digit  ::= [0-9]
number ::= ‘-’? digit+
where
▪ [0-9] is shorthand for ‘0’ | ‘1’ | … | ‘9’
▪ ‘-’? is shorthand for (‘-’ | ε)
▪ digit+ is shorthand for (digit digit*)
So number ::= (‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’) (‘0’ | ‘1’ | … | ‘9’)*)

[Figure: an NFA with ε edges accepting the language of a b (c | d)* b.]

Formally, a non-deterministic finite automaton (NFA) has:
▪ an alphabet Σ
▪ a set V of states
▪ a distinguished start state
▪ one or more accepting states
▪ a transition relation δ : V × (Σ + ε) × V → bool

[Figure: building an NFA fragment for each regular-expression form: ε, a literal ‘a’, concatenation R1 R2, alternation R1 | R2, and Kleene star R*, glued together with ε edges.]

To convert an NFA to a DFA, naïvely:
▪ Give each state a unique ID (1, 2, 3, …).
▪ Create super-states: one super-state for each subset of all possible states in the original NFA (e.g., {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}).
▪ For each super-state (say {2,3}), for each original state s and character c: find the set of accessible states (say {1,2}), skipping over epsilons, and add an edge labeled by c from the super-state to the corresponding super-state.
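The naïve subset construction just described can be sketched directly. This is a minimal Python sketch under an assumed NFA encoding (all names are ours): `trans` maps a (state, character) pair to a set of successor states, with the character `None` standing for an ε edge:

```python
def eps_closure(states, trans):
    """All states reachable from `states` by epsilon edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):   # None marks an epsilon edge
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(start, trans, alphabet):
    """Build DFA super-states (sets of NFA states) lazily, as used in practice."""
    start_ss = eps_closure({start}, trans)
    dfa, todo = {}, [start_ss]
    while todo:
        ss = todo.pop()
        if ss in dfa:                        # already expanded this super-state
            continue
        dfa[ss] = {}
        for c in alphabet:
            step = set()
            for s in ss:                     # one character step from each member
                step |= trans.get((s, c), set())
            nxt = eps_closure(step, trans)   # then skip over epsilons
            dfa[ss][c] = nxt
            todo.append(nxt)
    return start_ss, dfa
```

Only super-states actually reachable from the start state are ever created, which is why the lazy version avoids the exponential blowup of enumerating every subset.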
In practice, super-states are created lazily.

[Figure: worked example of the subset construction, converting a 7-state NFA for a b (c | d)* b into a DFA with super-states such as {3,6}, {4,6,3}, and {5,6,3}.]

Deterministic finite automata are easy to simulate: for each state and character, there is at most one transition we can take. Usually we record the transition function as an array, indexed by states and characters. The lexer starts off with a variable s initialized to the start state; it reads a character and uses the transition table to find the next state. Look at the output of Lex!

There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression. It’s based on computing the symbolic derivative of a regular expression with respect to the characters in a string.

Think of ε as one, ∅ as zero, concatenation as multiplication, and alternation as addition:
∅ | R = R | ∅ = R
∅ R = R ∅ = ∅
ε R = R ε = R
R | R = R
ε* = ε
R** = R*
R2 | R1 = R1 | R2

The derivative of a regular expression R with respect to a character ‘a’ is:
[[deriv R ‘a’]] = { s | “a” ^ s in [[R]] }
So it’s the set of residual strings once we remove the ‘a’ from the front.
e.g., deriv (abab) ‘a’ = bab
e.g., deriv (abab | acde) ‘a’ = (bab | cde)

deriv ∅ c = ∅
deriv ε c = ∅
deriv c c = ε
deriv c c’ = ∅   (when c’ ≠ c)
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*

Here empty tests whether a regular expression accepts the empty string; think of ε as “true” and ∅ as “false”:
empty ∅ = ∅
empty ε = ε
empty c = ∅
empty (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε

Given a regular expression R and a string of characters c1, c2, …, cn.
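The deriv and empty equations above transcribe almost directly into code. A Python sketch, under an assumed tuple encoding of our own: ('zero',) for ∅, ('eps',) for ε, ('chr', c) for a literal, plus 'alt', 'cat', and 'star' nodes. The smart constructors `alt` and `cat` apply the simplification identities listed above (∅ | R = R, ∅ R = ∅, ε R = R, R | R = R):

```python
ZERO, EPS = ('zero',), ('eps',)   # the empty set and epsilon

def alt(p, q):                    # R | ZERO = R,  R | R = R
    if p == ZERO: return q
    if q == ZERO: return p
    return p if p == q else ('alt', p, q)

def cat(p, q):                    # ZERO R = ZERO,  EPS R = R
    if p == ZERO or q == ZERO: return ZERO
    if p == EPS: return q
    if q == EPS: return p
    return ('cat', p, q)

def empty(r):
    """EPS if r accepts the empty string, ZERO otherwise."""
    tag = r[0]
    if tag in ('eps', 'star'): return EPS
    if tag in ('zero', 'chr'): return ZERO
    if tag == 'alt': return alt(empty(r[1]), empty(r[2]))
    return cat(empty(r[1]), empty(r[2]))          # 'cat'

def deriv(r, c):
    """The derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('zero', 'eps'): return ZERO
    if tag == 'chr': return EPS if r[1] == c else ZERO
    if tag == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == 'cat':
        return alt(cat(deriv(r[1], c), r[2]),
                   cat(empty(r[1]), deriv(r[2], c)))
    return cat(deriv(r[1], c), r)                 # 'star'
```

For example, deriv (ab) ‘a’ computes (ε b) | (∅ …), which the smart constructors collapse to b, matching the equations.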
We can test whether the string is in [[R]] by calculating
deriv (… deriv (deriv R c1) c2 …) cn
and then testing whether the resulting regular expression accepts the empty string.

Take R = (ab | ac)* and take the string to be [‘a’; ‘b’; ‘a’; ‘c’].

deriv R ‘a’ = (deriv (ab | ac) ‘a’) R
            = ((deriv (ab) ‘a’) | (deriv (ac) ‘a’)) R
            = (b | c) R

Now we must compute:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
Since empty (b | c) = (empty b) | (empty c) = ∅ | ∅ = ∅, this simplifies to:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R
                      = (deriv b ‘b’ | deriv c ‘b’) R
                      = (ε | ∅) R
                      = ε R
                      = R

So starting with R = (ab | ac)* and calculating the derivative w.r.t. ‘a’ and then ‘b’, we are back where we started with R.

To build a DFA: given your regular expression R, associate it with an initial state. Calculate the derivative of R w.r.t. each character in the alphabet, and generate a new state for each unique resulting regular expression:
▪ Start state: (ab | ac)*
▪ deriv w.r.t. ‘a’: (b | c) ((ab | ac)*), a new state
▪ deriv w.r.t. ‘b’ or ‘c’: ∅

Then continue calculating the derivatives for each new state: from (b | c) ((ab | ac)*), the derivative w.r.t. ‘b’ or ‘c’ is (ab | ac)* again, and the derivative w.r.t. ‘a’ is ∅. You stop when you don’t generate a new state. (Important: you must do algebraic simplifications, or this process won’t stop.)
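The construction above finds only three distinct derivatives for R = (ab | ac)*: R itself, (b | c) R, and ∅. We can encode each as a DFA state in a transition table and run strings through it; a sketch, with state names ('R', 'bcR', 'DEAD') of our own choosing:

```python
# One row per derivative of (ab | ac)*:
STATES = {
    'R':    {'a': 'bcR',  'b': 'DEAD', 'c': 'DEAD'},  # (ab | ac)*
    'bcR':  {'a': 'DEAD', 'b': 'R',    'c': 'R'},     # (b | c)((ab | ac)*)
    'DEAD': {'a': 'DEAD', 'b': 'DEAD', 'c': 'DEAD'},  # the empty set
}
ACCEPT = {'R'}   # empty ((ab | ac)*) simplifies to epsilon

def matches(s):
    """Run the derivative-built DFA; this is the iterated-deriv test."""
    state = 'R'
    for ch in s:
        state = STATES[state].get(ch, 'DEAD')
    return state in ACCEPT
```

Each DFA step is one symbolic derivative, precomputed; running the string [‘a’; ‘b’; ‘a’; ‘c’] visits R, (b | c) R, R, (b | c) R, R and accepts.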
Then figure out the accepting states by seeing whether they accept the empty string, i.e., run empty on the associated regular expression and see if its simplified form is ε or ∅. Here empty ((ab | ac)*) = ε, so (ab | ac)* is an accepting state, while empty ((b | c)((ab | ac)*)) = ∅, so that state is not.

[Figure: the final DFA, with start state (ab | ac)* marked accepting, an ‘a’ edge to (b | c)((ab | ac)*), and ‘b’, ‘c’ edges from there back to (ab | ac)*.]
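The accepting-state check is just nullability. A small self-contained Python sketch (the tuple encoding, with ('zero',), ('eps',), ('chr', c), 'alt', 'cat', and 'star' nodes, is an assumption of ours) that mirrors the empty equations, returning a boolean instead of ε/∅:

```python
def nullable(r):
    """True iff the regex tuple r accepts the empty string (empty r = eps)."""
    tag = r[0]
    if tag in ('eps', 'star'):          # empty eps = eps, empty (R*) = eps
        return True
    if tag in ('zero', 'chr'):          # empty zero = zero, empty c = zero
        return False
    if tag == 'alt':                    # alternation: either side nullable
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])   # concatenation: both sides
```

Applied to the two states above: the star state (ab | ac)* is nullable (accepting), while (b | c)((ab | ac)*) is not, since a concatenation is nullable only when both halves are.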