Benjamin Lambert
10/15/09
“MLSP” Group
• Why parse? What is parsing?
• What is a language?
– Formal language theory
• Regular languages
• Context Free Languages
• Context-sensitive Languages
• Long-distance dependencies
– Too distant for an n-gram language model
• Specifically, parsing can help us find:
– Grammatical errors:
• The dog, that I saw, was fast
• *The dogs, that I saw, was fast
– Semantic errors:
• The pulp will be made into newsprint
• *The Pope will be made into newsprint
“Parsing is the process of structuring a linear
representation in accordance with a given
grammar. …
“The linear representation may be a sentence, a computer program, a knitting pattern, a sequence of geological strata, a piece of music, actions in ritual behaviors, in short any linear sequence in which the preceding elements in some way restrict the next element.”
(From “Parsing Techniques,” Grune and Jacobs, 1990)
• “A language is a ‘set’ of sentences,
• and each sentence is a ‘sequence’ of ‘symbols’…
• that is all there is: no meaning, no structure, either a sentence belongs to a language or it does not.” (Grune and Jacobs, 1990)
• A linguist would disagree
• We’ll stick with this definition for now
• Binary numbers:
– 0, 1, 10, 11, 100…
• Binary numbers with an odd number of ones:
– 1, 111, 1000, 1011, …
– * 11, 101, 1111,…
• n zeros followed by n ones
– 01, 0011, 000111, 00001111…
– *0, 1, 100, …
• Grammatically correct English:
– “The pope will be made into newsprint”, …
– *“The pope will are made into newsprint”, …
• Semantically correct English (Semantic validity determined by some world model)
– “The pulp will be made into newsprint”, ….
– *“The Pope will be made into newsprint”, …
• Formal language theory review
– Regular languages
– Context Free Languages
– Context-sensitive Languages
(All diagrams of DFAs, NFAs, and PDAs are from
the Sipser book)
• Three operations over a fixed alphabet:
– Concatenation: ab
– Union: a|b (character-class shorthand: [ab])
– Kleene Star: a*
• For example, binary numbers with an odd number of ones:
– 1, 111, 1000, 1011, …
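This language really is regular, so it can be written with just these operations. A minimal sketch using Python's `re` module; the pattern below is one of several equivalent choices, and `has_odd_ones` is a hypothetical helper name:

```python
import re

# Binary strings with an odd number of 1s:
# some 0s, one 1, then pairs of further 1s with 0s anywhere between.
ODD_ONES = re.compile(r"0*1(0*10*1)*0*")

def has_odd_ones(s: str) -> bool:
    """Recognize the language with a plain regular expression."""
    return ODD_ONES.fullmatch(s) is not None

print(has_odd_ones("1"))     # True
print(has_odd_ones("1011"))  # True
print(has_odd_ones("11"))    # False
```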
• Can be proved to define the same language class:
– DFA/NFA
– Regular expressions
• {0^n 1^n | n ≥ 0}
• Why isn’t this regular? A DFA has finitely many states, so it cannot count arbitrarily many 0s.
We need something more powerful
• …that has memory.
• Push-down automaton
– FSM with a read/write stack
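The stack is exactly the memory an FSM lacks. A toy sketch of a deterministic PDA for 0^n 1^n (taking n ≥ 1, to match the starred examples on the earlier slide); `accepts_0n1n` is a hypothetical helper, not a library function:

```python
def accepts_0n1n(s: str) -> bool:
    """PDA sketch for {0^n 1^n | n >= 1}: push a marker per 0,
    pop one per 1, require all 0s before all 1s and an empty
    stack at the end."""
    stack = []
    seen_one = False
    for ch in s:
        if ch == "0":
            if seen_one:        # a 0 after a 1: reject
                return False
            stack.append("X")   # push
        elif ch == "1":
            seen_one = True
            if not stack:       # pop from empty stack: reject
                return False
            stack.pop()
        else:
            return False        # not in the alphabet
    return seen_one and not stack

print(accepts_0n1n("0011"))  # True
print(accepts_0n1n("100"))   # False
```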
S → NP VP
NP → (Det) N
VP → V (NP)
N → pope
V → ran
Det → the
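The toy grammar above is small enough for a hand-written recursive-descent recognizer. A sketch in plain Python (the function names and greedy handling of the optional NP are illustrative choices, not part of any parsing library):

```python
def parse_sentence(words):
    """Recursive-descent recognizer for the toy grammar:
    S -> NP VP, NP -> (Det) N, VP -> V (NP)."""
    N, V, DET = {"pope"}, {"ran"}, {"the"}

    def np(i):
        # NP -> (Det) N: optionally consume a determiner, then a noun
        if i < len(words) and words[i] in DET:
            i += 1
        if i < len(words) and words[i] in N:
            return i + 1
        return None

    def vp(i):
        # VP -> V (NP): a verb, optionally followed by an NP
        if i < len(words) and words[i] in V:
            i += 1
            j = np(i)
            return j if j is not None else i
        return None

    j = np(0)                  # S -> NP VP
    if j is None:
        return False
    k = vp(j)
    return k == len(words)     # accept only if all input is consumed

print(parse_sentence(["the", "pope", "ran"]))  # True
```

A real parser would handle ambiguity and build a tree; this recognizer commits greedily, which happens to be safe for this tiny grammar.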
• Context free grammars
– Substitution rules with:
• Terminals (a, b, c, ….)
• Non-terminal variables (A, B, C, …)
• Can do everything regular grammars can:
– Concatenation by adjacency: A → ab
– Union: A → B, A → C
– Star, with recursion: A → Aa, A → ε
• Plus, recursion
• S → A
• A → 0A1
• A → ε
• Can convert any CFG to Chomsky normal form:
A → BC
A → a
A Non-deterministic PDA for {a^i b^j c^k | i, j, k ≥ 0 and i = j or i = k}
Parsing with a Non-deterministic PDA
• What’s the run-time?
• Deterministically?
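One deterministic answer for any CFG in Chomsky normal form is the CYK algorithm, which runs in O(n^3). A sketch, assuming a hand-built CNF grammar for {0^n 1^n | n ≥ 1} (S → AB | AC, C → SB, A → 0, B → 1; these nonterminal names are illustrative):

```python
def cyk_accepts(s, start="S"):
    """CYK recognizer, O(n^3): table[i][l-1] holds the nonterminals
    that derive the length-l substring of s starting at i."""
    unary = {"0": {"A"}, "1": {"B"}}                  # A -> 0, B -> 1
    binary = {("A", "B"): {"S"}, ("A", "C"): {"S"},   # S -> AB | AC
              ("S", "B"): {"C"}}                      # C -> SB
    n = len(s)
    if n == 0:
        return False
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):                        # length-1 substrings
        table[i][0] = set(unary.get(ch, ()))
    for length in range(2, n + 1):                    # substring length
        for i in range(n - length + 1):               # start position
            for split in range(1, length):            # split point
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (x, y), lhs_set in binary.items():
                    if x in left and y in right:
                        table[i][length - 1] |= lhs_set
    return start in table[0][n - 1]

print(cyk_accepts("0011"))  # True
print(cyk_accepts("011"))   # False
```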
• … we’ll come back to them…
• … first context-sensitive languages...
– Not as important to many computer scientists
– Probably important for linguists and NLP
– Human languages are “mildly context-sensitive”
{a^n b^n c^n | n ≥ 1} (the 0^n 1^n 2^n pattern, written over a, b, c to match the grammar below)
1. S → abc | aSQ
2. bQc → bbcc
3. cQ → Qc
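The grammar's derivations can be simulated by brute-force string rewriting: apply each rule at every position, breadth-first, and collect the sentential forms made only of terminals. A sketch with a length cap so the search terminates (`derive_terminal_strings` and the cap are illustrative choices):

```python
from collections import deque

# The slide's context-sensitive grammar for {a^n b^n c^n | n >= 1}:
RULES = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]

def derive_terminal_strings(max_len=9):
    """Breadth-first rewriting from S; keep forms up to max_len
    symbols and return the all-terminal strings, shortest first."""
    seen, results = {"S"}, set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if all(ch in "abc" for ch in form):
            results.add(form)      # no S or Q left: a derived sentence
            continue
        for lhs, rhs in RULES:
            start = 0
            while (i := form.find(lhs, start)) != -1:
                new = form[:i] + rhs + form[i + len(lhs):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                start = i + 1
    return sorted(results, key=len)

print(derive_terminal_strings())  # ['abc', 'aabbcc', 'aaabbbccc']
```

Because every rule is monotonic (never shortens the string), the length cap is safe: no sentence longer than the cap can appear in a capped derivation.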
Derivation from Grune book
• These need more than a stack: writable tape or RAM
– Turing machine (or a linear-bounded automaton?)
• “Monotonic context-sensitive languages” are a special (easier) class of Non-CFL
– TM decidable
– Some classes are even easier, e.g. tree-adjoining grammars (O(n^6)?)
| Language class    | Chomsky class | Written formalism          | “Machine”                              | Example                                               |
|-------------------|---------------|----------------------------|----------------------------------------|-------------------------------------------------------|
| Regular           | 3             | Regular expressions        | DFA/NFA                                | 01*                                                   |
| Context-free      | 2             | Context-free grammars      | (Nondeterministic) push-down automaton | 0^n 1^n; programming languages; (natural languages?)  |
| Context-sensitive | 1             | Context-sensitive grammars | Linear-bounded automaton               | 0^n 1^n 2^n; (natural languages, mildly)              |
| Unrestricted      | 0             | Unrestricted grammars      | Turing machine                         | (Perl?)                                               |
Dick Grune and Ceriel J.H. Jacobs, “Parsing
Techniques: A Practical Guide,” Ellis Horwood,
Chichester, England, 1990.
Michael Sipser, “Introduction to the Theory of
Computation,” Course Technology, 2005.
• Next time:
– Parsing
– Parsing Speech