b | c

advertisement
CS153

Problem Set 0 due Friday at 5pm.
 See Lucas or me if you need help!
 I will be having office hours from 5:45-7:15 on Thu
night in the Science Center.

Reading:
 Relevant chapters on Lexing & Parsing in Appel
 OCamlLex and OCamlYacc documentation
 “Monadic Parsing in Haskell” by G.Hutton and
E.Meijer.

Two pieces conceptually:
 Recognizing syntactically valid phrases.
 Extracting semantic content from the syntax.
▪ E.g., What is the subject of the sentence?
▪ E.g., What is the verb phrase?
▪ E.g., Is the syntax ambiguous? If so, which meaning do we
take?
▪ “Fruit flies like a banana”
▪ “2 * 3 + 4”
▪ “x ^ f y”

In practice, solve both problems at the same
time.
We use grammars to specify the syntax of a
language.
exp  int | var | exp ‘+’ exp | exp ‘*’ exp |
‘let’ var ‘=‘ exp ‘in’ exp ‘end’
int  ‘-’?digit+
var  alpha(alpha|digit)*
digit  ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | … | ‘9’
alpha  [a-zA-Z]
To see if a sentence is legal, start with the first
non-terminal), and keep expanding nonterminals until you can match against the
sentence.
N  ‘a’ | ‘(’ N ‘)’
“((a))”
N  ‘(’N ‘)’
 ‘(’‘(’N ‘)’ ‘)’
 ‘(’‘(’‘a’‘)’‘)’ = “((a))”
Start with the sentence, and replace phrases
with corresponding non-terminals, repeating
until you derive the start non-terminal.
N  ‘a’ | ‘(‘ N ‘)’
‘(’‘(‘ ‘a’‘)’‘)’ 
‘(‘ ‘(‘ N ‘)’ ‘)’ 
‘(‘ N ‘)’ N
“((a))”

For real grammars, automating this non-deterministic
search is non-trivial.
 As we’ll see, naïve implementations must do a lot of back-
tracking in the search.

Ideally, given a grammar, we would like an efficient,
deterministic algorithm to see if a string matches it.
 There is a very general cubic time algorithm.
 Only linguists use it .
 (In part, we don’t because recognition is only half the problem.)

Certain classes of grammars have much more efficient
implementations.
 Essentially linear time with constant state (DFAs).
 Or linear time with stack-like state (Pushdown Automata).

Manual parsing (say, recursive descent).
 Tedious, error prone, hard to maintain.
 But fast & good error messages.

Parsing combinators
 Encode grammars as (higher-order) functions.
▪ Basically, functions that generate recursive-descent parsers.
▪ Makes it easy to write & maintain grammars.
 But can do a lot of back-tracking, and requires a limited form of
grammar (e.g., no left-recursion.)

Lex and Yacc
 Domain-Specific-Languages that generate very efficient, table-driven
parsers for general classes of grammars.
 Learn about the theory in 121
 Need to know a bit here to understand how to effectively use these
tools.
CS153

Non-recursive grammars
  (matches no string)
 ε (epsilon – matches empty string)
 Literals (‘a’, ‘b’, ‘2’, ‘+’, etc.) drawn from alphabet
 Concatenation (R1 R2)
 Alternation (R1 | R2)
 Kleene star (R*)

A regular expression denotes a set of strings:
 [[]] = { }
 [[ε]] = { “” }
 [[‘a’]] = { “a” }
 [[R1 R2]] = { s | s = s1 ^ s2 & s1 in R1 & s2 in R2 }
 [[R1 | R2]] = { s | s in R1 or s in R2 }
 [[R*]] = [[ε + RR* ]] =
{ s | s = “” or s = s1 ^ s2 & s1 in R & s2 in R* }
We might recognize numbers as:
digit ::= [0-9]
number ::= ‘-’? digit+
[0-9] shorthand for ‘0’ | ‘1’ | … | ‘9’
‘-’? shorthand for (‘-’ | ε)
digit+ shorthand for (digit digit*)
So number ::=
(‘-’ | ε) ((‘0’ | ‘1’ | … | ‘9’)(‘0’ | ‘1’ | … | ‘9’)*)
ε
a
b
ε
c
b
accept
start
ε
d
a b (c | d)* b
ε
ε
a
b
c
ε
b
accept
start
d
ε
Formally:
• an alphabet Σ
• a set V of states
ε
• a distinguished start state
• one or more accepting states
• transition relation: δ : V * (Σ + ε) * V  bool

ε
Epsilon:

Literal ‘a’:

R1 R2

R1 | R2
a
ε
R1
R2
ε
R1
ε
R2
ε
ε

R*
ε
ε
R1
ε

Naively:
 Give each state a unique ID (1,2,3,…)
 Create super states
▪ One super-state for each subset of all possible states in the
original NDFA.
▪ (e.g., {1},{2},{3},{1,2},{1,3},{2,3},{123})
 For each super-state (say {23}):
▪ For each original state s and character c:
▪ Find the set of accessible states (say {1,2}) skipping over epsilons.
▪ Add an edge labeled by c from the super state to the corresponding
super-state.

In practice, super-states are created lazily.
ε
4
a
1
b
2
ε
c
3
6
b
ε
d
5
ε
4,6,3
c
1
a
2
b
b
7
3,6
d
5,6,3
7
ε
4
a
1
2
ε
c
b
3
b
6
ε
d
5
ε
c
4,6,3
b
c
1
a
2
b
b
7
3,6
d
b
5,6,3
d
7
ε
4
a
1
2
ε
c
b
3
6
b
7
ε
d
5
ε
c
d
4,6,3
c
1
a
2
b
b
b
7
3,6
b
d
5,6,3
c
d
c
d
4,6,3
c
1
a
2
b
b
b
7
3,6
b
d
5,6,3
c
d

Deterministic Finite State automata are easy to
simulate:
 For each state and character, there is at most one
transition we can take.


Usually record the transition function as an
array, indexed by states and characters.
Lexer starts off with a variable s initialized to the
start state:
 Reads a character, uses transition table to find next
state.
 Look at the output of Lex!


There’s a purely algebraic way to compute
whether a string is in the set denoted by a
regular expression.
It’s based on computing the symbolic
derivative of a regular expression with respect
to the characters in a string.
Think of ε as one,  as zero, concatenation as
multiplication, and alternation as addition:
 | R = R | = R
 R = R = 
εR=Rε=R
R|R=R
ε* = ε
R** = R*
R2 | R1 = R1 | R2

The derivative of a regular expression R with
respect to a character ‘a’ is:
[[deriv R a]] = { s | “a” ^ s in R }

So it’s the residual strings once we remove
the ‘a’ from the front.
 e.g., deriv (abab) ‘a’ = bab
 e.g., deriv (abab | acde) = (bab | cde)
deriv  c = 
deriv ε c = 
deriv c c = ε
deriv c c’ = 
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c =
(deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*
deriv  c = 
deriv ε c = 
deriv c c = ε
deriv c c’ = 
deriv (R1 | R2) c = (deriv R1 c) | (deriv R2 c)
deriv (R1 R2) c =
(deriv R1 c) R2 | (empty R1) (deriv R2 c)
deriv (R*) c = (deriv R c) R*
Think of ε as “true” and  as false.
empty  = 
empty ε = ε
empty c = 
emtpy (R1 | R2) = (empty R1) | (empty R2)
empty (R1 R2) = (empty R1) (empty R2)
empty (R*) = ε


Given a regular expression R and a string of
characters c1, c2, …, cn.
We can test whether the string is in [[R]] by
calculating:
 deriv (… deriv (deriv R c1) c2 …) cn

And then test whether the resulting regular
expression accepts the empty string.
Take R = (ab | ac)*
Take the string to be [‘a’ ; ‘b’; ‘a’; ‘c’]
deriv R ‘a’ =
(deriv (ab | ac) ‘a’) R
(deriv (ab) ‘a’) | (deriv (ac) ‘a’) R = (b | c) R
So deriv R ‘a’ = (b | c) R
Now we must compute:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
Now we must compute:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’)
empty (b | c) = (empty b) | (empty c) =
| =
So this simplifies to:
deriv ((b | c) R) ‘b’ =
(deriv (b | c) ‘b’) R | (empty (b | c)) (deriv R ‘b’) =
(deriv (b | c) ‘b’) R
(deriv (b | c) ‘b’) R =
(deriv b ‘b’ | deriv c ‘b’) R =
(ε | ) R =
εR=
R
So starting with R = (ab | ac)* and calculating
the derivative w.r.t. ‘a’ and then ‘b’, we are back
where we started with R.

Given your regular expression R
 associate it with an initial state S0.

Calculate the derivative of R w.r.t. each
character in the alphabet, and generate new
states for each unique resulting regular
expression.
Start State
(ab | ac)*
Start State
(ab | ac)*
deriv w.r.t. ‘a’
(b | c)((ab |ac)*)
Start State
(ab | ac)*
deriv w.r.t. ‘a’
deriv w.r.t. ‘b’
(b | c)((ab |ac)*)

Start State
(ab | ac)*
deriv w.r.t. ‘a’
deriv w.r.t. ‘c’
deriv w.r.t. ‘b’
(b | c)((ab |ac)*)


(ab | ac)*
‘a’
‘b’, ‘c’
(b | c)((ab |ac)*)


Then continue calculating the derivatives for
each new state.

You stop when you don’t generate a new
state.

(Important: must do algebraic simplifications
or this process won’t stop.)
(ab | ac)*
‘a’
‘b’, ‘c’
(b | c)((ab |ac)*)

(ab | ac)*
‘a’
‘b’, ‘c’
(b | c)((ab |ac)*)
deriv w.r.t. ‘b’
(ab | ac)*

(ab | ac)*
‘a’
‘b’, ‘c’

(b | c)((ab |ac)*)
deriv w.r.t. ‘b’
deriv w.r.t. ‘c’
(ab | ac)*
(ab | ac)*
(ab | ac)*
‘a’
‘b’, ‘c’

(b | c)((ab |ac)*)
deriv w.r.t. ‘a’
deriv w.r.t. ‘b’
deriv w.r.t. ‘c’
(ab | ac)*
(ab | ac)*

(ab | ac)*
‘a’
‘b’,’c’
(b | c)((ab |ac)*)
‘b’, ‘c’

‘a’

Then figure out the accepting states by
seeing if they accept the empty string.

i.e., run empty on the associated regular
expression and see if its simplified form is
ε or .
(ab | ac)*
‘a’
‘b’,’c’
(b | c)((ab |ac)*)
‘b’, ‘c’

‘a’
(ab | ac)*
‘a’
‘b’,’c’
(b | c)((ab |ac)*)
‘b’, ‘c’

‘a’
Download