CS153 Problem Set 0 is due Friday at 5pm. See Lucas or me if you need help! I will be holding office hours Thursday night, 5:45–7:15, in the Science Center.

Reading:
▪ Relevant chapters on lexing & parsing in Appel
▪ OCamlLex and OCamlYacc documentation
▪ “Monadic Parsing in Haskell” by G. Hutton and E. Meijer

Parsing has two pieces conceptually:
▪ Recognizing syntactically valid phrases.
▪ Extracting semantic content from the syntax.
  E.g., what is the subject of the sentence? What is the verb phrase? Is the syntax ambiguous, and if so, which meaning do we take?
  ▪ “Fruit flies like a banana”
  ▪ “2 * 3 + 4”
  ▪ “x ^ f y”
In practice, we solve both problems at the same time.

We use grammars to specify the syntax of a language:

exp   ::= int | var | exp ‘+’ exp | exp ‘*’ exp | ‘let’ var ‘=’ exp ‘in’ exp ‘end’
int   ::= ‘-’? digit+
var   ::= alpha (alpha | digit)*
digit ::= ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha ::= [a-zA-Z]

Top-down: to see if a sentence is legal, start with the start non-terminal, and keep expanding non-terminals until you can match against the sentence.

N ::= ‘a’ | ‘(’ N ‘)’

Deriving “((a))”:
N → ‘(’ N ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ ‘(’ ‘a’ ‘)’ ‘)’ = “((a))”

Bottom-up: start with the sentence, and replace phrases with the corresponding non-terminals, repeating until you derive the start non-terminal.

‘(’ ‘(’ ‘a’ ‘)’ ‘)’ → ‘(’ ‘(’ N ‘)’ ‘)’ → ‘(’ N ‘)’ → N

For real grammars, automating this non-deterministic search is non-trivial. As we’ll see, naïve implementations must do a lot of backtracking in the search. Ideally, given a grammar, we would like an efficient, deterministic algorithm to decide whether a string matches it.
▪ There is a very general cubic-time algorithm, but only linguists use it. (In part, we don’t because recognition is only half the problem.)
▪ Certain classes of grammars have much more efficient implementations: essentially linear time with constant state (DFAs), or linear time with stack-like state (pushdown automata).
▪ Manual parsing (say, recursive descent): tedious, error-prone, and hard to maintain.
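The recursive-descent idea for the toy grammar N ::= ‘a’ | ‘(’ N ‘)’ can be sketched in a few lines. This is an illustrative Python sketch (the names `parse_n` and `accepts` are ours), not the OCaml tooling the course actually uses:

```python
# Hypothetical recursive-descent recognizer for:  N ::= 'a' | '(' N ')'
def parse_n(s, i=0):
    """Try to match non-terminal N starting at s[i]; return the index
    just past the match, or None on failure."""
    if i < len(s) and s[i] == 'a':          # production N ::= 'a'
        return i + 1
    if i < len(s) and s[i] == '(':          # production N ::= '(' N ')'
        j = parse_n(s, i + 1)
        if j is not None and j < len(s) and s[j] == ')':
            return j + 1
    return None

def accepts(s):
    """The whole string is legal iff N matches all of it."""
    return parse_n(s) == len(s)
```

Note how the single non-terminal becomes a single function; a grammar with more non-terminals would get one (possibly mutually recursive) function per non-terminal.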
But fast, with good error messages.

Parsing combinators: encode grammars as (higher-order) functions.
▪ Basically, functions that generate recursive-descent parsers.
▪ Makes it easy to write & maintain grammars.
▪ But they can do a lot of backtracking, and they require a limited form of grammar (e.g., no left recursion).

Lex and Yacc: domain-specific languages that generate very efficient, table-driven parsers for general classes of grammars.
▪ You learn about the theory in 121; here we need to know just enough to use these tools effectively in CS153.

Regular expressions are non-recursive grammars, built from:
▪ ∅ (matches no string)
▪ ε (epsilon, which matches the empty string)
▪ Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from the alphabet
▪ Concatenation (R1 R2)
▪ Alternation (R1 | R2)
▪ Kleene star (R*)

A regular expression denotes a set of strings:
[[∅]] = { }
[[ε]] = { “” }
[[‘a’]] = { “a” }
[[R1 R2]] = { s | s = s1 ^ s2 and s1 in [[R1]] and s2 in [[R2]] }
[[R1 | R2]] = { s | s in [[R1]] or s in [[R2]] }
[[R*]] = [[ε | R R*]] = { s | s = “” or s = s1 ^ s2 where s1 in [[R]] and s2 in [[R*]] }

We might recognize numbers as:
digit  ::= [0-9]
number ::= ‘-’? digit+
where
▪ [0-9] is shorthand for ‘0’ | ‘1’ | … | ‘9’
▪ ‘-’? is shorthand for (‘-’ | ε)
▪ digit+ is shorthand for (digit digit*)
So number ::= (‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’) (‘0’ | ‘1’ | … | ‘9’)*)

[Figure: an NFA with ε edges accepting the language of a b (c | d)* b.]

Formally, a non-deterministic finite automaton (NFA) has:
▪ an alphabet Σ
▪ a set V of states
▪ a distinguished start state
▪ one or more accepting states
▪ a transition relation δ : V × (Σ + ε) × V → bool

[Figure: building an NFA fragment for each regular-expression form: ε, a literal ‘a’, concatenation R1 R2, alternation R1 | R2, and Kleene star R*, glued together with ε edges.]

To convert an NFA to a DFA, naïvely:
▪ Give each state a unique ID (1, 2, 3, …).
▪ Create super-states: one super-state for each subset of all possible states in the original NFA (e.g., {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}).
▪ For each super-state (say {2,3}), for each original state s and character c: find the set of accessible states (say {1,2}), skipping over epsilons, and add an edge labeled by c from the super-state to the corresponding super-state.
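The naïve subset construction just described can be sketched directly. This is a minimal Python sketch under an assumed NFA encoding (all names are ours): `trans` maps a (state, character) pair to a set of successor states, with the character `None` standing for an ε edge:

```python
def eps_closure(states, trans):
    """All states reachable from `states` by epsilon edges alone."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in trans.get((s, None), ()):   # None marks an epsilon edge
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def nfa_to_dfa(start, trans, alphabet):
    """Build DFA super-states (sets of NFA states) lazily, as used in practice."""
    start_ss = eps_closure({start}, trans)
    dfa, todo = {}, [start_ss]
    while todo:
        ss = todo.pop()
        if ss in dfa:                        # already expanded this super-state
            continue
        dfa[ss] = {}
        for c in alphabet:
            step = set()
            for s in ss:                     # one character step from each member
                step |= trans.get((s, c), set())
            nxt = eps_closure(step, trans)   # then skip over epsilons
            dfa[ss][c] = nxt
            todo.append(nxt)
    return start_ss, dfa
```

Only super-states actually reachable from the start state are ever created, which is why the lazy version avoids the exponential blowup of enumerating every subset.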
In practice, super-states are created lazily.

[Figure: worked example of the subset construction, converting a 7-state NFA for a b (c | d)* b into a DFA with super-states such as {3,6}, {4,6,3}, and {5,6,3}.]

Deterministic finite automata are easy to simulate: for each state and character, there is at most one transition we can take. Usually we record the transition function as an array, indexed by states and characters. The lexer starts off with a variable s initialized to the start state; it reads a character and uses the transition table to find the next state. Look at the output of Lex!

There’s a purely algebraic way to compute whether a string is in the set denoted by a regular expression. It’s based on computing the symbolic derivative of a regular expression with respect to the characters in a string.

Think of ε as one, ∅ as zero, concatenation as multiplication, and alternation as addition:
∅ | R = R | ∅ = R
∅ R = R ∅ = ∅
ε R = R ε = R
R | R = R
ε* = ε
R** = R*
R2 | R1 = R1 | R2

The derivative of a regular expression R with respect to a character ‘a’ is:
[[deriv R ‘a’]] = { s | “a” ^ s in [[R]] }
So it’s the set of residual strings once we remove the ‘a’ from the front.
e.g., deriv (abab) ‘a’ = bab
e.g., deriv (abab | acde) ‘a’ = (bab | cde)

deriv ∅ c = ∅
deriv ε c = ∅
deriv c c = ε
deriv c c’ = ∅   (when c’ ≠ c)
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c = (deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*

Here empty tests whether a regular expression accepts the empty string; think of ε as “true” and ∅ as “false”:
empty ∅ = ∅
empty ε = ε
empty c = ∅
empty (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε

Given a regular expression R and a string of characters c1, c2, …, cn.
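The deriv and empty equations above transcribe almost directly into code. A Python sketch, under an assumed tuple encoding of our own: ('zero',) for ∅, ('eps',) for ε, ('chr', c) for a literal, plus 'alt', 'cat', and 'star' nodes. The smart constructors `alt` and `cat` apply the simplification identities listed above (∅ | R = R, ∅ R = ∅, ε R = R, R | R = R):

```python
ZERO, EPS = ('zero',), ('eps',)   # the empty set and epsilon

def alt(p, q):                    # R | ZERO = R,  R | R = R
    if p == ZERO: return q
    if q == ZERO: return p
    return p if p == q else ('alt', p, q)

def cat(p, q):                    # ZERO R = ZERO,  EPS R = R
    if p == ZERO or q == ZERO: return ZERO
    if p == EPS: return q
    if q == EPS: return p
    return ('cat', p, q)

def empty(r):
    """EPS if r accepts the empty string, ZERO otherwise."""
    tag = r[0]
    if tag in ('eps', 'star'): return EPS
    if tag in ('zero', 'chr'): return ZERO
    if tag == 'alt': return alt(empty(r[1]), empty(r[2]))
    return cat(empty(r[1]), empty(r[2]))          # 'cat'

def deriv(r, c):
    """The derivative of r with respect to character c."""
    tag = r[0]
    if tag in ('zero', 'eps'): return ZERO
    if tag == 'chr': return EPS if r[1] == c else ZERO
    if tag == 'alt': return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == 'cat':
        return alt(cat(deriv(r[1], c), r[2]),
                   cat(empty(r[1]), deriv(r[2], c)))
    return cat(deriv(r[1], c), r)                 # 'star'
```

For example, deriv (ab) ‘a’ computes (ε b) | (∅ …), which the smart constructors collapse to b, matching the equations.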
We can test whether the string is in [[R]] by calculating
deriv (… deriv (deriv R c1) c2 …) cn
and then testing whether the resulting regular expression accepts the empty string.

Take R = (ab | ac)* and take the string to be [‘a’; ‘b’; ‘a’; ‘c’].

deriv R ‘a’ = (deriv (ab | ac) ‘a’) R
            = ((deriv (ab) ‘a’) | (deriv (ac) ‘a’)) R
            = (b | c) R

Now we must compute:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
Since empty (b | c) = (empty b) | (empty c) = ∅ | ∅ = ∅, this simplifies to:
deriv ((b | c) R) ‘b’ = (deriv (b | c) ‘b’) R
                      = (deriv b ‘b’ | deriv c ‘b’) R
                      = (ε | ∅) R
                      = ε R
                      = R

So starting with R = (ab | ac)* and calculating the derivative w.r.t. ‘a’ and then ‘b’, we are back where we started with R.

To build a DFA: given your regular expression R, associate it with an initial state. Calculate the derivative of R w.r.t. each character in the alphabet, and generate a new state for each unique resulting regular expression:
▪ Start state: (ab | ac)*
▪ deriv w.r.t. ‘a’: (b | c) ((ab | ac)*), a new state
▪ deriv w.r.t. ‘b’ or ‘c’: ∅

Then continue calculating the derivatives for each new state: from (b | c) ((ab | ac)*), the derivative w.r.t. ‘b’ or ‘c’ is (ab | ac)* again, and the derivative w.r.t. ‘a’ is ∅. You stop when you don’t generate a new state. (Important: you must do algebraic simplifications, or this process won’t stop.)
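The construction above finds only three distinct derivatives for R = (ab | ac)*: R itself, (b | c) R, and ∅. We can encode each as a DFA state in a transition table and run strings through it; a sketch, with state names ('R', 'bcR', 'DEAD') of our own choosing:

```python
# One row per derivative of (ab | ac)*:
STATES = {
    'R':    {'a': 'bcR',  'b': 'DEAD', 'c': 'DEAD'},  # (ab | ac)*
    'bcR':  {'a': 'DEAD', 'b': 'R',    'c': 'R'},     # (b | c)((ab | ac)*)
    'DEAD': {'a': 'DEAD', 'b': 'DEAD', 'c': 'DEAD'},  # the empty set
}
ACCEPT = {'R'}   # empty ((ab | ac)*) simplifies to epsilon

def matches(s):
    """Run the derivative-built DFA; this is the iterated-deriv test."""
    state = 'R'
    for ch in s:
        state = STATES[state].get(ch, 'DEAD')
    return state in ACCEPT
```

Each DFA step is one symbolic derivative, precomputed; running the string [‘a’; ‘b’; ‘a’; ‘c’] visits R, (b | c) R, R, (b | c) R, R and accepts.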
Then figure out the accepting states by seeing whether they accept the empty string, i.e., run empty on the associated regular expression and see if its simplified form is ε or ∅. Here empty ((ab | ac)*) = ε, so (ab | ac)* is an accepting state, while empty ((b | c)((ab | ac)*)) = ∅, so that state is not.

[Figure: the final DFA, with start state (ab | ac)* marked accepting, an ‘a’ edge to (b | c)((ab | ac)*), and ‘b’, ‘c’ edges from there back to (ab | ac)*.]
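The accepting-state check is just nullability. A small self-contained Python sketch (the tuple encoding, with ('zero',), ('eps',), ('chr', c), 'alt', 'cat', and 'star' nodes, is an assumption of ours) that mirrors the empty equations, returning a boolean instead of ε/∅:

```python
def nullable(r):
    """True iff the regex tuple r accepts the empty string (empty r = eps)."""
    tag = r[0]
    if tag in ('eps', 'star'):          # empty eps = eps, empty (R*) = eps
        return True
    if tag in ('zero', 'chr'):          # empty zero = zero, empty c = zero
        return False
    if tag == 'alt':                    # alternation: either side nullable
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])   # concatenation: both sides
```

Applied to the two states above: the star state (ab | ac)* is nullable (accepting), while (b | c)((ab | ac)*) is not, since a concatenation is nullable only when both halves are.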