Parsing & Parsing Speech

Formal Language Theory

Benjamin Lambert

10/15/09

“MLSP” Group

Outline

• Why parse? What is parsing?

• What is a language?

– Formal language theory

• Regular languages

• Context Free Languages

• Context-sensitive Languages

Why parse speech?

• Long-distance dependencies

– Too distant for n-gram language model

• Specifically, parsing can help us find:

– Grammatical errors:

• The dog, that I saw, was fast

• *The dogs, that I saw, was fast

– Semantic errors:

• The pulp will be made into newsprint

• *The Pope will be made into newsprint

What is parsing?

“Parsing is the process of structuring a linear representation in accordance with a given grammar. …

… The linear representation may be a sentence, a computer program, a knitting pattern, a sequence of geological strata, a piece of music, actions in ritual behaviors, in short any linear sequence in which the preceding elements in some way restrict the next element.”

(From “Parsing Techniques”, Grune and Jacobs, 1990)

What is a language?

• “A language is a ‘set’ of sentences,

• and each sentence is a ‘sequence’ of ‘symbols’…

• that is all there is: no meaning, no structure, either a sentence belongs to a language or it does not.” (Grune and Jacobs, 1990)

• A linguist would disagree

• We’ll stick with this definition for now

Example languages:

• Binary numbers:

– 0, 1, 10, 11, 100…

• Binary numbers with an odd number of ones:

– 1, 111, 1000, 1011, …

– * 11, 101, 1111,…

• n zeros followed by n ones

– 01, 0011, 000111, 00001111…

– *0, 1, 100, …

• Grammatically correct English:

– “The pope will be made into newsprint”, …

– *“The pope will are made into newsprint”, …

• Semantically correct English (Semantic validity determined by some world model)

– “The pulp will be made into newsprint”, ….

– *“The Pope will be made into newsprint”, …

What is a language?

• Formal language theory review

– Regular languages

– Context Free Languages

– Context-sensitive Languages

(All diagrams of DFAs, NFAs, and PDAs from the Sipser book)

Regular languages

• Three operations over a fixed alphabet:

– Concatenation: ab

– Union: [ab]

– Kleene Star: a*

• For example, binary numbers with an odd number of ones:

– 1, 111, 1000, 1011, …

• Can be proved to be exactly the class of languages described by:

– DFA/NFA

– Regular expression
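To make the equivalence concrete, here is a minimal sketch of the "odd number of ones" language written twice: once as a two-state DFA and once as a regular expression. The state names, the function names, and the particular regular expression are illustrative choices, not taken from the slides. Both versions also accept strings with leading zeros, which the slide's examples do not show.

```python
import re

# Two-state DFA over the alphabet {0, 1}.
# "even" = an even number of 1s read so far, "odd" = an odd number.
TRANSITIONS = {
    ("even", "0"): "even",
    ("even", "1"): "odd",
    ("odd", "0"): "odd",
    ("odd", "1"): "even",
}


def dfa_accepts(s: str) -> bool:
    """Run the DFA on s; accept if it ends in the 'odd' state."""
    state = "even"
    for symbol in s:
        if (state, symbol) not in TRANSITIONS:
            return False  # symbol outside the alphabet
        state = TRANSITIONS[(state, symbol)]
    return state == "odd"


# The same language as a regular expression: one 1, then 1s in pairs,
# with any number of 0s in between.
ODD_ONES = re.compile(r"0*10*(10*10*)*")


def regex_accepts(s: str) -> bool:
    """Accept if the whole string matches the regular expression."""
    return ODD_ONES.fullmatch(s) is not None
```

The DFA needs only one bit of memory (the parity of the 1s seen so far), which is why a finite number of states is enough for this language.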

DFAs

NFA and equivalent DFA
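The usual way to obtain an equivalent DFA from an NFA is the subset construction, where each DFA state is a set of NFA states. The sketch below assumes a small hypothetical NFA (strings over {0, 1} that end in "01", with no epsilon transitions); it is not the automaton from the Sipser diagram.

```python
# Hypothetical NFA: accepts binary strings ending in "01".
# Missing entries mean "no transition" (the NFA branch dies).
NFA = {
    ("q0", "0"): {"q0", "q1"},
    ("q0", "1"): {"q0"},
    ("q1", "1"): {"q2"},
}
NFA_START = "q0"
NFA_ACCEPT = {"q2"}
ALPHABET = "01"


def subset_construction(nfa, start, accept, alphabet):
    """Build the reachable part of the equivalent DFA.

    Each DFA state is a frozenset of NFA states.
    """
    start_set = frozenset([start])
    dfa = {}
    todo = [start_set]
    seen = {start_set}
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # Union of everything reachable from any NFA state in `current`.
            target = frozenset(
                t for q in current for t in nfa.get((q, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accept = {state for state in seen if state & accept}
    return dfa, start_set, dfa_accept


def dfa_run(dfa, start, accept, s):
    """Run the constructed DFA on a string over the alphabet."""
    state = start
    for symbol in s:
        state = dfa[(state, symbol)]
    return state in accept
```

In the worst case the constructed DFA has exponentially many states, but only the reachable subsets are generated here.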

Non-regular languages

• {0^n 1^n | n ≥ 0}

• Why?
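One standard answer is the pumping lemma for regular languages (as presented in Sipser): a DFA has only finitely many states, so it cannot count an unbounded number of 0s. A short sketch of the argument, where p denotes the pumping length:

```latex
\begin{align*}
&\text{Suppose } L = \{0^n 1^n \mid n \ge 0\} \text{ is regular, with pumping length } p. \\
&\text{Take } s = 0^p 1^p \in L. \text{ Any split } s = xyz \text{ with } |xy| \le p,\ |y| \ge 1
  \text{ forces } y = 0^k \text{ for some } k \ge 1. \\
&\text{Then } x y^2 z = 0^{p+k} 1^p \notin L, \text{ contradicting the pumping lemma, so } L \text{ is not regular.}
\end{align*}
```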

We need something more powerful

• …that has memory.

• Push-down automaton

– FSM with a read/write stack

PDA for: {0^n 1^n | n ≥ 0}
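The idea behind such a PDA can be sketched in a few lines: push one stack symbol per 0, pop one per 1, and accept if the stack is empty at the end. This is a simplified simulation with two hypothetical states ("push" and "pop"), not a transcription of the Sipser diagram.

```python
def pda_accepts(s: str) -> bool:
    """Simulate a PDA for {0^n 1^n | n >= 0} with an explicit stack."""
    stack = []
    state = "push"  # still reading the block of 0s
    for symbol in s:
        if state == "push" and symbol == "0":
            stack.append("0")          # remember one 0
        elif symbol == "1" and stack:
            state = "pop"              # from now on only 1s are allowed
            stack.pop()                # match this 1 against a stored 0
        else:
            return False               # 0 after a 1, unmatched 1, or bad symbol
    return not stack                   # every 0 must have been matched
```

The stack is exactly the "memory" that a finite-state machine lacks: its depth can grow with n.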

How to formalize this language?

S -> NP VP

NP -> (Det) N

VP -> V (NP)

N -> pope

V -> ran

Det -> the

• Context free grammars

– Substitution rules with:

• Terminals (a, b, c, ….)

• Non-terminal variables (A, B, C, …)

• Can do everything regular grammars can:

– Concatenation by adjacency: A -> ab

– Union: A -> B, A -> C

– Star, with recursion: A -> Aa, A -> ε

• Plus general recursion, e.g. A -> aAb (nesting that regular expressions cannot express)
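As a small illustration, the toy English grammar on this slide can be written as a Python dictionary and expanded exhaustively. The sketch assumes that the optional items (Det) and (NP) are spelled out as two alternatives each, and it only terminates because this particular grammar has no recursive rule. The names GRAMMAR and expand are illustrative.

```python
from itertools import product

# Each non-terminal maps to a list of alternatives (right-hand sides).
# Optional symbols from the slide are expanded into with/without variants.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
    "N": [["pope"]],
    "V": [["ran"]],
    "Det": [["the"]],
}


def expand(symbol):
    """Yield every word sequence derivable from `symbol`."""
    if symbol not in GRAMMAR:
        yield [symbol]  # terminal: derives only itself
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(list(expand(x)) for x in rhs)):
            yield [word for part in parts for word in part]


sentences = sorted(" ".join(words) for words in expand("S"))
```

With this grammar `sentences` holds six strings, including "the pope ran" and "pope ran the pope". Like the language definition earlier in the talk, the grammar only decides membership and says nothing about meaning.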

CFG for: {0^n 1^n | n ≥ 0}

• S -> A

• A -> 0A1

• A -> ε

• Can convert any CFG to Chomsky normal form:

A -> BC

A -> a

A Non-deterministic PDA for {a^i b^j c^k | i, j, k ≥ 0 and (i = j or i = k)}
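The nondeterminism here is a single guess made up front: match the a's against the b's, or match them against the c's. A minimal sketch that simulates the guess by simply trying both branches, each with its own stack (the function names are hypothetical):

```python
def branch(s: str, match: str, skip: str) -> bool:
    """One guess of the PDA: push every 'a', pop one 'a' per `match`
    symbol, and read `skip` symbols without touching the stack."""
    order = "abc"
    pos = 0      # index in `order` of the block currently being read
    stack = []
    for ch in s:
        if ch not in order or order.index(ch) < pos:
            return False               # bad symbol or blocks out of order
        pos = order.index(ch)
        if ch == "a":
            stack.append("a")
        elif ch == match:
            if not stack:
                return False           # more `match` symbols than a's
            stack.pop()
        # ch == skip: nothing to do
    return not stack                   # every 'a' must have been matched


def npda_accepts(s: str) -> bool:
    """Accept if either nondeterministic branch accepts."""
    return branch(s, "b", "c") or branch(s, "c", "b")
```

With only two possible guesses, trying both is cheap. In general a nondeterministic PDA may guess at every step, which is what makes the run-time question interesting.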

Parsing with a Non-deterministic PDA

• What’s the run-time?

• Deterministically?
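One well-known deterministic answer, not named on these slides, is the CYK (Cocke-Younger-Kasami) algorithm: for a grammar in Chomsky normal form it decides membership in O(n^3) time in the input length n, times a factor for the grammar size. The sketch below assumes a hand-written CNF grammar for {0^n 1^n | n ≥ 1}; the empty string is left out because it needs a special start rule in CNF.

```python
# Assumed CNF grammar for {0^n 1^n | n >= 1}:
#   S -> Z O | Z T,   T -> S O,   Z -> 0,   O -> 1
BINARY = {("Z", "O"): {"S"}, ("Z", "T"): {"S"}, ("S", "O"): {"T"}}
UNARY = {"0": {"Z"}, "1": {"O"}}


def cyk_accepts(s: str, start: str = "S") -> bool:
    """CYK recognizer: table[i][l - 1] holds the non-terminals
    that derive the substring of length l starting at position i."""
    n = len(s)
    if n == 0:
        return False
    table = [[set() for _ in range(n)] for _ in range(n)]
    # Substrings of length 1 come from the terminal rules A -> a.
    for i, ch in enumerate(s):
        table[i][0] = set(UNARY.get(ch, ()))
    # Longer substrings: try every split point and every rule A -> B C.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                for b in table[i][split - 1]:
                    for c in table[i + split][length - split - 1]:
                        table[i][length - 1] |= BINARY.get((b, c), set())
    return start in table[0][n - 1]
```

The three nested loops over length, start position, and split point are where the cubic run-time comes from.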

PDA for: {0^n 1^n 2^n | n ≥ 0}?

Detour from CFGs

• … we’ll come back to them…

• … first context-sensitive languages...

– Not as important to many computer scientists

– Probably important for linguists and NLP

– Human languages are “mildly context-sensitive”

Context-sensitive grammar

{a^n b^n c^n | n ≥ 1} (the language 0^n 1^n 2^n written with a, b, c; this grammar does not derive the empty string)

1. S -> abc | aSQ

2. bQc -> bbcc

3. cQ -> Qc
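To see the three rules at work, here is a minimal sketch that searches breadth-first for a derivation of a target string. It relies on the fact that no rule shortens the string it rewrites, so any sentential form longer than the target can be discarded. The names RULES and derive are illustrative.

```python
from collections import deque

# The grammar above as (left-hand side, right-hand side) string rewrites.
RULES = [("S", "abc"), ("S", "aSQ"), ("bQc", "bbcc"), ("cQ", "Qc")]


def derive(target: str, start: str = "S"):
    """Return a list of sentential forms from `start` to `target`,
    or None if the target cannot be derived."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        form = path[-1]
        if form == target:
            return path
        for lhs, rhs in RULES:
            # Apply the rule at every position where its left side occurs.
            idx = form.find(lhs)
            while idx != -1:
                new = form[:idx] + rhs + form[idx + len(lhs):]
                # Rules never shrink a form, so longer forms are dead ends.
                if len(new) <= len(target) and new not in seen:
                    seen.add(new)
                    queue.append(path + [new])
                idx = form.find(lhs, idx + 1)
    return None
```

For example, `derive("aabbcc")` returns `["S", "aSQ", "aabcQ", "aabQc", "aabbcc"]`: Q is moved left past the c by rule 3, then turned into the missing b and c by rule 2.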

Context-sensitive derivation

Derivation from Grune book

Non-CFG languages

• These need more than a stack: writable tape or RAM

– Turing machine (or linear-bounded automata?)

• “Monotonic context-sensitive languages” are a special (easier) class of Non-CFL

– TM decidable

– Some classes even easier, Tree-adjoining grammars (O(n^6)?)

Equivalences

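| Language class    | Written formalism          | “Machine”                            | Chomsky class | Example                                                |
|-------------------|----------------------------|--------------------------------------|---------------|--------------------------------------------------------|
| Regular           | Regular expressions        | DFA/NFA                              | 3             | 01*                                                    |
| Context-free      | Context-free grammars      | (Nondeterministic) push-down automaton | 2           | 0^n 1^n, programming languages, (natural languages?)   |
| Context-sensitive | Context-sensitive grammars | Linear-bounded automaton             | 1             | 0^n 1^n 2^n, (natural languages, mildly)               |
| Unrestricted      | (Perl?)                    | Turing machine                       | 0             |                                                        |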

References

Dick Grune and Ceriel J.H. Jacobs, “Parsing Techniques – A Practical Guide,” Ellis Horwood, Chichester, England, 1990.

Michael Sipser, “Introduction to the Theory of Computation,” Course Technology, 2005.

To be continued…

• Next time:

– Parsing

– Parsing Speech
