
Models of Grammar Learning
CS 182 Lecture
April 24, 2008
What constitutes learning a language?
 What are the sounds (Phonology)
 How to make words (Morphology)
 What do words mean (Semantics)
 How to put words together (Syntax)
 Social use of language (Pragmatics)
 Rules of conversations (Pragmatics)
2
Language Learning Problem
 Prior knowledge
  - Initial grammar G (set of ECG constructions)
  - Ontology (category relations)
  - Language comprehension model (analysis/resolution)
 Hypothesis space: new ECG grammar G’
  - Search = processes for proposing new constructions
  - Relational Mapping, Merge, Compose
3
Language Learning Problem
 Performance measure
  - Goal: Comprehension should improve with training
  - Criterion: need some objective function to guide learning…

Probability of Model given Data:

  P(M | X) = P(X | M) P(M) / P(X)
  P(M | X) ∝ P(X | M) P(M)
  log P(M | X) ∝ log P(X | M) + log P(M)

Minimum Description Length:

  −log P(M | X) ∝ −log P(X | M) − log P(M)
4
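As a side note, here is a minimal Python sketch of this objective. The grammar names and log-probability values are hypothetical; it only illustrates that maximizing log P(M|X) (up to the constant log P(X)) is the same as minimizing the description length −log P(X|M) − log P(M).

# Minimal sketch (hypothetical numbers): picking the model M that maximizes
# log P(M|X) is the same as picking the one that minimizes -log P(X|M) - log P(M).
candidates = {
    "grammar_1": {"log_likelihood": -120.0, "log_prior": -40.0},  # log P(X|M), log P(M)
    "grammar_2": {"log_likelihood": -110.0, "log_prior": -65.0},
}

def mdl_cost(log_likelihood, log_prior):
    # -log P(M|X) up to the constant log P(X)
    return -log_likelihood - log_prior

best = min(candidates, key=lambda m: mdl_cost(**candidates[m]))
print(best, mdl_cost(**candidates[best]))   # grammar_1 160.0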
Minimum Description Length
 Choose grammar G to minimize cost(G|D):
  - cost(G|D) = α • size(G) + β • complexity(D|G)
  - Approximates Bayesian learning: cost(G|D) ≈ −log posterior P(G|D)
 Size of grammar: size(G) ≈ −log prior P(G)
  - favors fewer/smaller constructions/roles; isomorphic mappings
 Complexity of data given grammar: complexity(D|G) ≈ −log likelihood P(D|G)
  - favors simpler analyses (fewer, more likely constructions)
  - based on derivation length + score of derivation
5
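A one-function sketch of this cost, assuming the learner supplies size and complexity functions (size(G) is defined on the next slide); the weights alpha and beta are free parameters:

# Sketch of the MDL-style cost above. size_fn and complexity_fn are assumed to
# be supplied by the learner (size(G) is defined on the next slide).
def mdl_grammar_cost(grammar, data, size_fn, complexity_fn, alpha=1.0, beta=1.0):
    # cost(G|D) = alpha * size(G) + beta * complexity(D|G)
    return alpha * size_fn(grammar) + beta * complexity_fn(data, grammar)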
Size Of Grammar
 Size of the grammar G is the sum of the sizes of its constructions:

  size(G) = Σ_{c ∈ G} size(c)

 Size of each construction c is:

  size(c) = n_c + m_c + Σ_{e ∈ c} length(e)

  where
  - n_c = number of constituents in c
  - m_c = number of constraints in c
  - length(e) = slot chain length of element reference e
6
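A small sketch of these size formulas over a toy construction encoding; the dict layout and the dot-separated slot chains are assumptions made for illustration, and real ECG constructions are richer than this.

# A sketch of size(G) as defined above, over a toy construction representation.
def size_of_construction(c):
    n_c = len(c["constituents"])      # number of constituents in c
    m_c = len(c["constraints"])       # number of constraints in c
    # length(e): slot-chain length of an element reference, e.g. "self.m.thrower" -> 3
    ref_lengths = sum(len(e.split(".")) for constraint in c["constraints"]
                      for e in constraint)
    return n_c + m_c + ref_lengths

def size_of_grammar(grammar):
    return sum(size_of_construction(c) for c in grammar)

# Toy example: a verb-specific "throw-ball" construction with one constraint.
throw_ball = {
    "constituents": ["throw", "ball"],
    "constraints": [("self.m.thrower", "self.m.throwee")],
}
print(size_of_grammar([throw_ball]))  # 2 + 1 + (3 + 3) = 9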
What do we know about
language development?
(focusing mainly on first language acquisition
of English-speaking, normal population)
7
Children are amazing learners
[Timeline: 0 mos, 6 mos, 12 mos, 2 yrs, 3 yrs, 4 yrs, 5 yrs]
8
Phonology: Non-native contrasts
 Werker and Tees (1984)
 Thompson: velar vs. uvular, /k’i/-/q’i/
 Hindi: retroflex vs. dental, /ṭa/-/ta/
[Bar chart: number of infants (0-20) discriminating the contrast (yes vs. no) at 6-8, 8-10, and 10-12 months]
9
Finding words: Statistical learning
 Saffran, Aslin and Newport (1996)
pretty baby
 /bidaku/, /padoti/, /golabu/
 /bidakupadotigolabubidaku/
 2 minutes of this continuous speech stream
 By 8 months infants detect the words (vs
non-words and part-words)
10
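A sketch of the statistical-learning idea behind this result, assuming the stream is already broken into syllables and using an arbitrary 0.5 threshold: word-internal transitional probabilities in this artificial language are 1.0, cross-word ones are about 1/3, so word boundaries show up as dips.

import random
from collections import Counter

# Saffran-style stream built from the three "words" on the slide.
words = ["bidaku", "padoti", "golabu"]

def to_syllables(w):
    return [w[i:i + 2] for i in range(0, len(w), 2)]   # "bidaku" -> ["bi", "da", "ku"]

random.seed(0)
stream = []
for _ in range(200):
    stream.extend(to_syllables(random.choice(words)))

# Transitional probability P(next syllable | current syllable).
pair_counts = Counter(zip(stream, stream[1:]))
unigram_counts = Counter(stream[:-1])
tp = {(a, b): count / unigram_counts[a] for (a, b), count in pair_counts.items()}

# Posit a word boundary wherever the transitional probability dips (threshold assumed).
boundaries = [i + 1 for i, pair in enumerate(zip(stream, stream[1:])) if tp[pair] < 0.5]
# Boundaries land after every third syllable, i.e. exactly at the word edges.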
Word order: agent and patient
 Hirsh-Pasek and Golinkoff (1996)
 1;4-1;7 (mostly still in the one-word stage)
 Where is CM tickling BB?
11
Early syntax
 agent + action: ‘Daddy sit’
 action + object: ‘drive car’
 agent + object: ‘Mommy sock’
 action + location: ‘sit chair’
 entity + location: ‘toy floor’
 possessor + possessed: ‘my teddy’
 entity + attribute: ‘crayon big’
 demonstrative + entity: ‘this telephone’
12
From Single Words To Complex Utterances
1;11.3
FATHER: Nomi are you climbing up the books?
NAOMI: up.
NAOMI: climbing.
NAOMI: books.

2;0.18
MOTHER: what are you doing?
NAOMI: I climbing up.
MOTHER: you’re climbing up?

4;9.3
FATHER: what’s the boy doing to the dog?
NAOMI: squeezing his neck.
NAOMI: and the dog climbed up the tree.
NAOMI: now they’re both safe.
NAOMI: but he can climb trees.
Sachs corpus (CHILDES)
13
How Can Children Be So Good At
Learning Language?
 Gold’s Theorem:
No superfinite class of languages is identifiable in the
limit from positive data only
 Principles & Parameters
Babies are born as blank slates but acquire language
quickly (with noisy input and little correction) →
Language must be innate:
Universal Grammar + parameter setting
But babies aren’t born as blank slates!
And they do not learn language in a vacuum!
14
Modifications of Gold’s Result
 (Weakly) Ordered Examples, implicit negatives
 Loosened Identification Conditions
 Complexity Measures, Best Fit
No theorems will resolve these issues
15
Modeling the acquisition
of grammar:
Theoretical assumptions
16
Language Acquisition
 Opulence of the substrate
  - Prelinguistic children already have rich sensorimotor representations and sophisticated social knowledge
  - intention inference, reference resolution
  - language-specific event conceptualizations
  (Bloom 2000, Tomasello 1995, Bowerman & Choi, Slobin, et al.)
 Children are sensitive to statistical information
  - Phonological transitional probabilities
  - Even dependencies between non-adjacent items
  (Saffran et al. 1996, Gomez 2002)
17
Language Acquisition
 Basic Scenes
  - Simple clause constructions are associated directly with scenes basic to human experience (Goldberg 1995, Slobin 1985)
 Verb Island Hypothesis
  - Children learn their earliest constructions (arguments, syntactic marking) on a verb-specific basis (Tomasello 1992)

  throw frisbee, throw ball, … → throw OBJECT
  get ball, get bottle, … → get OBJECT

  (this should be reminiscent of your model merging assignment; a sketch of this generalization step follows below)
18
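Here is a toy sketch of that generalization step, in the spirit of the model merging assignment. The (verb, argument) representation and the merge criterion (more than one distinct argument seen with a verb) are simplifying assumptions.

from collections import defaultdict

# Verb-specific constructions learned from individual usages (verb, argument).
usages = [("throw", "frisbee"), ("throw", "ball"), ("get", "ball"), ("get", "bottle")]

# Merge step (a simplification of model merging): if the same verb has been seen
# with several different arguments, collapse the argument slot into a category.
by_verb = defaultdict(set)
for verb, arg in usages:
    by_verb[verb].add(arg)

constructions = []
for verb, args in by_verb.items():
    if len(args) > 1:                                # enough variation to generalize
        constructions.append((verb, "OBJECT"))       # verb-island construction
    else:
        constructions.extend((verb, a) for a in args)

print(constructions)  # [('throw', 'OBJECT'), ('get', 'OBJECT')]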
Comprehension
is
partial.
(not just for dogs)
19
What children pick up from what they hear
what did you throw it into?
they’re throwing this in here.
they’re throwing a ball.
don’t throw it Nomi.
well you really shouldn’t throw things Nomi you know.
remember how we told you you shouldn’t throw things.
 Children use rich situational context / cues to fill in the gaps
 They also have at their disposal embodied knowledge and
statistical correlations (i.e. experience)
20
Language Learning Hypothesis
Children learn constructions
that bridge the gap between
what they know from language
and
what they know from the rest of cognition
21
Modeling the acquisition
of (early) grammar:
Comprehension-driven,
usage-based
22
Natural Language
Processing at Berkeley
Dan Klein
EECS Department
UC Berkeley
NLP: Motivation
 It’d be great if machines could
  - Read text and understand it
  - Translate languages accurately
  - Help us manage, summarize, and aggregate information
  - Use speech as a UI
  - Talk to us / listen to us
 But they can’t
  - Language is complex
  - Language is ambiguous
  - Language is highly structured
24
Machine Translation
 Syntactic MT
  - Learn grammar mappings between languages
  - Fully data-driven
25
Information Extraction
 Unsupervised Coreference Resolution
  - Take in lots of text
  - Learn what the entities are and how they corefer
  - Fully unsupervised, but gets supervised performance!
 General research goal: unsupervised learning of meaning
26
Syntactic Learning
 Grammar Induction
  - Raw text in
  - Learned grammars out
  - Big result: this can be done!
 Grammar Refinement
  - Coarse grammars in
  - Detailed grammars out
  - Gives top parsing systems
27
Syntactic Inference
 Natural language is very ambiguous
  - Grammars are huge
  - Billions of parses to consider
  - Milliseconds to do it

Example: "Influential members of the House Ways and Means Committee introduced legislation that would restrict how the new S&L bailout agency can raise capital, creating another potential obstacle to the government's sale of sick thrifts."
28
Idea: Learn PCFGs with EM
 Classic experiments on learning PCFGs with Expectation-Maximization [Lari and Young, 1990]
  - Full binary grammar over n symbols {X1, X2, …, Xn}
  - Parse uniformly/randomly at first
  - Re-estimate rule expectations off of parses
  - Repeat
 Their conclusion: it doesn’t really work.

[Diagram: a binary rule Xi → Xj Xk]
30
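A schematic sketch of that loop, with the E-step left as a parameter (the Inside-Outside computation on the next slide would supply it); lexical rules are omitted to keep the sketch short.

# Schematic EM loop for a full binary PCFG over n symbols. The estep argument,
# estep(rule_probs, corpus) -> expected rule counts, would be supplied by the
# Inside-Outside algorithm; lexical rules are omitted for brevity.
def em_pcfg(symbols, corpus, estep, iterations=10):
    n = len(symbols)
    # Start with every binary rule X_c -> X_a X_b at equal weight.
    rule_probs = {(c, a, b): 1.0 / (n * n)
                  for c in symbols for a in symbols for b in symbols}
    for _ in range(iterations):
        counts = estep(rule_probs, corpus)          # E-step: expectations from parses
        for c in symbols:                           # M-step: renormalize per parent
            total = sum(counts.get((c, a, b), 0.0) for a in symbols for b in symbols)
            if total > 0:
                for a in symbols:
                    for b in symbols:
                        rule_probs[(c, a, b)] = counts.get((c, a, b), 0.0) / total
    return rule_probs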
Re-estimation of PCFGs
 Basic quantity needed for re-estimation with EM:

  P(X_c | i, j, S) = Σ_{T : (X_c, i, j) ∈ T, yield(T) = S} P(T) / Σ_{T : yield(T) = S} P(T)

 Can calculate in cubic time with the Inside-Outside algorithm.
 Consider an initial grammar where all productions have equal weight:

  P(X_a X_b | X_c) = 1 / n²

 Then all trees have equal probability initially.
 Therefore, after one round of EM, the posterior over trees will (in the absence of random perturbation) be approximately uniform over all trees, and symmetric over symbols.
31
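A brute-force check of that claim on a tiny example; the symbol and vocabulary counts are arbitrary assumptions. With every binary rule at probability 1/n² and every lexical rule at the same weight, every labeled parse of a 4-word sentence gets exactly the same probability.

from itertools import product

n_symbols = 2
vocab_size = 3
P_BINARY = 1.0 / n_symbols ** 2    # P(X_a X_b | X_c) = 1/n^2
P_LEX = 1.0 / vocab_size           # uniform lexical rules (an assumption)

def tree_shapes(length):
    """All binary bracketings of a span of `length` words."""
    if length == 1:
        return ["w"]
    shapes = []
    for split in range(1, length):
        for left, right in product(tree_shapes(split), tree_shapes(length - split)):
            shapes.append((left, right))
    return shapes

def internal_nodes(shape):
    return 0 if shape == "w" else 1 + internal_nodes(shape[0]) + internal_nodes(shape[1])

sentence_length = 4
tree_probs = set()
for shape in tree_shapes(sentence_length):
    # Each internal node uses one binary rule; any labeling gives the same factor.
    tree_probs.add(P_BINARY ** internal_nodes(shape) * P_LEX ** sentence_length)
print(tree_probs)   # a single value: all trees (and labelings) are equally likely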
Problem: “Uniform” Posteriors
[Figures: posteriors under the "Tree Uniform" and "Split Uniform" initializations]
32
Overview: NLP at UCB
 Lots of research and resources:
  - Dan Klein: Statistical NLP / ML
  - Marti Hearst: Stat NLP / HCI
  - Jerry Feldman: Language and Mind
  - Michael Jordan: Statistical Methods / ML
  - Tom Griffiths: Statistical Learning / Psychology
  - ICSI Speech and AI groups (Morgan, Stolcke, Shriberg, Fillmore, Kay, Narayanan…)
  - Great linguistics and stats departments!
 No better place to solve the hard NLP problems!
34
Other Approaches
 Evaluation: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall; a small sketch of this metric follows below)
 Some recent work in learning constituency:
  - [Adriaans, 99] Language grammars aren’t general PCFGs
  - [Clark, 01] Mutual-information filters detect constituents, then an MDL-guided search assembles them
  - [van Zaanen, 00] Finds low edit-distance sentence pairs and extracts their differences

Unlabeled recall:
  Adriaans, 1999: 16.8
  Clark, 2001: 34.6
  van Zaanen, 2000: 35.6
35
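For reference, a sketch of the unlabeled-recall computation named in the first bullet, with trees reduced to sets of (start, end) spans; this is a simplification, since real evaluation scripts also handle unary chains, punctuation, and so on, and the example bracketings are hypothetical.

# Unlabeled recall: fraction of constituent spans in the gold tree that also
# appear in the proposed tree. Trees are represented as sets of (start, end) spans.
def unlabeled_recall(gold_spans, proposed_spans):
    if not gold_spans:
        return 1.0
    return len(gold_spans & proposed_spans) / len(gold_spans)

gold = {(0, 2), (3, 5), (0, 5)}        # hypothetical gold bracketing
proposed = {(0, 2), (2, 5), (0, 5)}    # hypothetical induced bracketing
print(unlabeled_recall(gold, proposed))  # 2/3 ≈ 0.667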