lecture21

advertisement
6.863J Natural Language Processing
Lecture 21: Language Acquisition Part 1
Robert C. Berwick
berwick@ai.mit.edu
The Menu Bar
• Administrivia:
• Project check
• The Twain test & the Gold Standard
• The Logical problem of language
acquisition: the Gold theorem results
• How can (human) languages be learned?
• The logical problem of language
acquisition
• What is the problem
• A framework
for analyzing it
6.863J/9.611J Lecture 21 Sp03
The Twain test
• Parents spend….
6.863J/9.611J Lecture 21 Sp03
The Logical problem of language
acquisition
noise
Initial
Constraints
& learning
method
Input Data
Target
language
All Possible
Human languages
6.863J/9.611J Lecture 21 Sp03
“Correct”
target
language
The problem
• From finite data, induce infinite set
• How is this possible, given limited time &
computation?
• Children are not told grammar rules
• Ans: put constraints on class of possible
grammars (or languages)
6.863J/9.611J Lecture 21 Sp03
The logical problem of language
acquisition
• Statistical MT: how many parameters?
How much data?
• “There’s no data like more data”
• Number of parameters to estimate in Stat
MT system -
6.863J/9.611J Lecture 21 Sp03
The logical problem of language
acquisition
• Input does not uniquely specify the grammar
(however you want to represent it) = Poverty of
the Stimulus (POS)
• Paradox 1: children grow up to speak language
of their caretakers
• Proposed solution: target choice of candidate
grammars is restricted set
• This is the theory of Universal Grammar (UG)
• (Paradox 2: why does language evolve?)
6.863J/9.611J Lecture 21 Sp03
The illogical problem of language
change
Langagis, whos reulis ben not writen as ben
Englisch, Frensch and many otheres, ben channgid
withynne yeeris and countrees that oon man of the
oon cuntre, and of the oon tyme, myghte not, or
schulde not kunne undirstonde a man of the othere
kuntre, and of the othere tyme; and al for this, that
the seid langagis ben not stabili and fondamentali
writen
Pecock (1454)
Book of Feith
6.863J/9.611J Lecture 21 Sp03
Information needed
• Roughly: sum of given info + new info
(data) has to pick out right language
(grammar)
• If we all spoke just 1 language – nothing
to decide – no data needed
• If just spoke 2 languages (eg, Japanese,
English), differing in just 1 bit, 1 piece of
data needed
• What about the general case?
6.863J/9.611J Lecture 21 Sp03
Can memorization suffice?
• Can a big enough table work?
• Which is largest, a <noun>-1, a <noun>2, a <noun>-3, a <noun>-4, a <noun>5, a <noun>-6, or a <noun>-7?
• Assume 100 animals
• # queries = 100 x 99 x …94 = 8 x 1013
• 2 queries/line, 275 lines, 1000 pages inch
=
• How big?
6.863J/9.611J Lecture 21 Sp03
153m
6.863J/9.611J Lecture 21 Sp03
The inductive puzzle
• Unsupervised learning
• (Very) small sample complexity
• 1—5 examples; no Wall Street Journal
subscriptions
• The Burst effect
• Order of presentation of examples doesn’t
matter
• External statistics don’t match
maturational time course
6.863J/9.611J Lecture 21 Sp03
The burst effect
two-3 words,
ages 1;1–1;11
“full” language
(some residual)
? time span: 2 weeks–2months
1;10 ride papa's neck
1;10.3 this my rock-baby
1;11.2 papa forget this
2;1.2 you watch me
open sandbox
2;1.3 papa, you like
this song?
2;4.0 I won't cry if
mama wash my hair
2;4.3 put this right here so
I see it better
6.863J/9.611J Lecture 21 Sp03
What’s the difference?
1. I see red one
2. P. want drink
3. P. open door
4. P. tickle S.
5. I go beach
6. P. forget this
7. P. said no
8. P. want out
9. You play chicken
Multiple choice:
(a) Pidgin speakers; (b) apes;
(c) feral child Genie;
(d) ordinary children
Answers
1,5,9 = pidgin
2,4,8 = apes
7= Genie
3,6= children
6.863J/9.611J Lecture 21 Sp03
Challenge: tension headaches
• Essentially error-free
• Minimal triggering
(+ examples)
• Robust under noise
• Still variable enough
?
?
6.863J/9.611J Lecture 21 Sp03
Warlpiri
Chinese
German
…
(4500 others)
The input…
Bob just went away .
Bob went away .
no he went back to school .
he went to work .
are you playing with the plate ?
where is the plate ?
you 're going to put the plate on the wall ?
let's put the plate on the table .
the car is on your leg ?
you 're putting the car on your leg ?
on your other leg .
that's a car
.
woom ? oh you mean voom . the car goes voom .
cars are on the road ?
thank you .
the cow goes moo ?
what color is the cow ?
what color is the cow ?
what color is the cow ?
6.863J/9.611J Lecture 21 Sp03
what color
A developmental puzzle
• If pure inductive learning, then based on
pattern distribution in the input
• What’s the pattern distribution in the input?
• English subjects: most English sentences overt
• French: only 7-8% of French sentences have
inflected verb followed by negation/adverb (“Jean
embrasse souvent/pas Marie”)
• Dutch: no Verb first S’s; Obj Verb Subject trigger
appears in only 2% of the cases, yet…
6.863J/9.611J Lecture 21 Sp03
Predictions from pure induction
• English obligatory subject should be
acquired early
• French verb placement should be acquired
late
• Dutch verb first shouldn’t be produced at
all – because it’s not very evident in the
input
6.863J/9.611J Lecture 21 Sp03
The empirical evidence runs completely
contrary to this…
• English: Subjects acquired late (Brown, Bellugi,
Bloom, Hyams…), but Subjects appear virtually
100% uniformly
• French: Verb placement acquired as early as it is
possible to detect (Pierce, others), but triggers
don’t occur very frequently
• Dutch: 40-50% Verb first sentences produced
by kids, but 0 % in input (Klahsen)
• So: what are we doing wrong?
6.863J/9.611J Lecture 21 Sp03
Can’t just be statistical regularities…acquisition
time course doesn’t match
% corrrect (adult) use
English Subject use
(“there” sentences)
1% in Childes
French verb
raising 7-8%
(VfinAdv/pas)
Dutch verb second
OVS 2%; 0 V1 pats
{
diffuse and sparse
regularities
6.863J/9.611J Lecture 21 Sp03
20m.
2.5y
stable inferences
and rapid time course
The language identification game
•
•
•
•
black sheep
baa baa black sheep
baa black sheep
baa baa baa baa black sheep
…
6.863J/9.611J Lecture 21 Sp03
The facts
Child: Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
[dialogue repeated five more times]
Mother:
Now listen carefully, say “Nobody likes me.”
Child:
Oh! Nobody don’t likeS me.
(McNeill, 1966)
6.863J/9.611J Lecture 21 Sp03
Brown & Hanlon, 1970
• parents correct for meaning, not form
• when present, correction was not picked
up
6.863J/9.611J Lecture 21 Sp03
The problem…
• The child makes an error.
• The adult may correct or identify the
error.
• But the child ignores these corrections.
• So, how does the child learn to stop
making the error?
6.863J/9.611J Lecture 21 Sp03
But kids do recover (well, almost)
• u-shaped curve:
•
•
•
•
went - goed - went
child must stop saying:
“goed”
“unsqueeze”
“deliver the library the book”
6.863J/9.611J Lecture 21 Sp03
Overgeneralization
went
jumped
goed
runned
falled
wented
6.863J/9.611J Lecture 21 Sp03
Positive vs. negative evidence
Positive examples
Utterance
1. Child says “went”.
2. Child says “goed”.
3. Adult says “went”.
Feedback
none
none
---
Result
none
none
positive data
Positive & Negative examples
1.
2.
3.
4.
Utterance
Child says “went”.
Child says “goed”.
Adult says “went”.
Adult says “goed”.
Feedback
good
bad
good
bad
6.863J/9.611J Lecture 21 Sp03
Result
positive data
corrective
positive data
corrective
Positive & negative examples
Child:
Father:
Child:
Father:
Child:
Father:
Child:
Father:
me want more.
ungrammatical.
want more milk.
ungrammatical.
more milk !
ungrammatical.
cries
ungrammatical
6.863J/9.611J Lecture 21 Sp03
Contrast…
Child:
Father:
Child:
Father:
Child:
Father:
Child:
Father:
me want more.
You want more? More what?
want more milk.
You want more milk?
more milk !
Sure, honey, I’ll get you some more.
cries
Now, don’t cry, daddy is getting you
some.
6.863J/9.611J Lecture 21 Sp03
Formalize this game…
•
•
•
•
Family of target languages (grammars) L
The example data
The learning algorithm, A
The notion of learnability (convergence to the
target) in the limit
• Gold’s theorem (1967): If a family of languages
contains all the finite languages and at least one
infinite language, then it is not learnable in the
limit
6.863J/9.611J Lecture 21 Sp03
Gold’s result…
• So, class of finite-state automata, class of
Kimmo systems, class of cfg’s, class of
feature-based cfgs, class of GPSGs,
transformational grammars,… NOT
learnable from positive-only evidence
• Doesn’t matter what algorithm you use –
the result is based on a mapping – not an
algorithmic limitation (Use EM, whatever
you want…)
6.863J/9.611J Lecture 21 Sp03
Framework for learning
1. Target Language Lt L is a target language
drawn from a class of possible target languages L
.
2. Example sentences si  Lt are drawn from the
target language & presented to learner.
3. Hypothesis Languages h H drawn from a class
of possible hypothesis languages that child
learners construct on the basis of exposure to the
example sentences in the environment
4. Learning algorithm A is a computable procedure
by which languages from H are selected given
the examples
6.863J/9.611J Lecture 21 Sp03
Some details
• Languages/grammars – alphabet S*
• Example sentences
• Independent of order
• Or: Assume drawn from probability distribution m
(relative frequency of various kinds of sentences) –
eg, hear shorter sentences more often
• If m  Lt , then the presentation consists of positive
examples, o.w.,
• examples in both Lt & S* - Lt (negative examples),
I.e., all of S* (“informant presentation”)
6.863J/9.611J Lecture 21 Sp03
Learning algorithms & texts
is mapping from set of all finite data streams to
hypotheses in H
• A
• Finite data stream of k examples (s1, s2 ,…, sk )
• Set of all data streams of length k ,
Dk = {(s1, s2 ,…, sk)| si  S*}= (S*)k
• Set of all finite data sequences D = k>0 Dk (enumerable), so:
A : DH
- Can consider A to flip coins if need be
If learning by enumeration: The sequence of hypotheses after each
sentence is h1, h2, …,
Hypothesis after n sentences is hn
6.863J/9.611J Lecture 21 Sp03
Criterion of success; Learnability
• Distance measure d between target grammar gt
and any hypothesized grammar h, d(gt , h)
• Learnability of L implies that this distance goes
to 0 as # of sentences n goes to infinity
(“convergence in the limit”)
• We say that a family of languages L is learnable
if each member L L is learnable
• This framework is very general – any linguistic
setting; any learning procedure (EM, gradient
descent,…)
6.863J/9.611J Lecture 21 Sp03
Generality of this setting
1.
2.
L S*
L S1* x S2* - NO different from (1) above -
(form, meaning) pairs
3. L:S*  [0,1] real number representing
grammaticality; this is generalization of (1)
4.
L is probability distribution m on S* - this is
the usual sense in statistical applications (MT)
5. L is probability distribution m on S1* x S2*
6.863J/9.611J Lecture 21 Sp03
What can we do with this?
• Two general approaches:
• Inductive inference (classical – Gold theorem)
• Probabilistic – approximate learning (VC dimension &
“PAC” learning)
• Both get same result that all interesting families
of languages are not learnable from positiveonly data!
(even under all the variations given previously):
Fsa’s, Hmm’s, CFGs,…,
• Conclusion: some a priori restrictions on class H
is required.
• This is Universal Grammar
6.863J/9.611J Lecture 21 Sp03
In short:
• Innate = ‘before data’ (data = information used
to learn the language, so, examples + algorithm
used, or even modify the acquisition algorithm)
• Result from Learning theory: Restricted search
space must exist (even if you use semantics!)
• No other way to search for ‘underlying rules’ –
even if unlimited time, resources
• Research question: what is A ? Is it domain
specific, or a general method?
6.863J/9.611J Lecture 21 Sp03
The inductive inference approach
(Gold’s theorem)
• Identification in the limit
• The Gold standard
• Extensions & implications & for natural
languages
• We must restrict the class of
grammars/languages the learner chooses
from, severely!
6.863J/9.611J Lecture 21 Sp03
ID in the limit - dfns
• Text t of language L is an infinite sequence of sentences
of L with each sentence of L occurring at least once
(“fair presentation”)
• Text tn is the first n sentences of t
• Learnability: Language L is learnable by algorithm A if
for each t of L if there exists a number m s.t. for all
n>m, A (tn )= L
• More formally, fix distance metric d, a target grammar gt
and a text t for the target language. Learning algorithm
A identifies (learns) gt in the limit if
d(A (tk), gt )  0 k 
6.863J/9.611J Lecture 21 Sp03
e-learnability & “locking sequence/data
set”
le
L
e
e
6.863J/9.611J Lecture 21 Sp03
Ball of radius
e
Locking sequence:
If (finite) sequence le
gets within e of target
& then it stays there
Relation between this &
learnability in limit
• Thm 1 (Blum & Blum, 1975, e-version) If
A identifies g in the limit, then for every
e >0, there exists a locking data set that
comes within e of the target
6.863J/9.611J Lecture 21 Sp03
Gold’s thm follows…
• Theorem (Gold, 1967). If the family L
consists of all the finite languages and at
least 1 infinite language, then it is not
learnable in the limit
• Corollary: The class of fsa’s, cfg’s, csg’s,…
are not learnable in the limit
• Proof by contradiction…
6.863J/9.611J Lecture 21 Sp03
Gold’s thm
• Suppose A is able to identify the family
L. Then it must identify the infinite
language, Linf .
• By Thm, a locking sequence exists, sinf
• Construct a finite language L sinf from this
locking sequence to get locking sequence
for L sinf - a different language from Linf
• A can’t identify L sinf , a contradiction
6.863J/9.611J Lecture 21 Sp03
Picture
One Superfinite L, all finite L’s
{ai| a> 0}
6.863J/9.611J Lecture 21 Sp03
But what about…
• We shouldn’t require exact identification!
• Response: OK, we can use e notion, or,
statistical learning theory to show that if we
require convergence with high probability, then
the same results hold (see a bit later)
• Suppose languages are finite?
• Response: naw, the Gold result is really about
information density, not infinite languages
6.863J/9.611J Lecture 21 Sp03
But what about… (more old whine in
new bottles)
• Why should you be able to learn on every
sequence?
• Response: OK, use the “Probably approximately
correct” (PAC) approach – learn target with high
probability, to within epsilon, on 1-d sequences
• Converge now not on every data sequence, but
still with probability 1
• Now d(gt ,hn) is a random variable, and you want
weak convergence of random variables
• So this is also convergence in the limit
6.863J/9.611J Lecture 21 Sp03
Stochastic extensions/Gold
complaints & positive results
• To handle statistical case – rules are stochastic –
so the ‘text’ the learner gets is stochastic (some
distribution spits it out…)
• If you know how language is generated then it
helps you learn what language is generated
• Absence of sentence from guessed L is like
negative evidence: although approximate, can
be used to reject guess (“indirect negative
evidence”)
6.863J/9.611J Lecture 21 Sp03
Results for stochastic case
• Results:
• Negative evidence: really needs all the text (enough
sampling over negative examples s.t. child can really
know it)
• If you don’t know the distribution – you lose –
estimating a density function is even harder than
approximating functions…
• If you have very strong constraints on distribution
functions to be drawn from the language family, then
you can learn fsa’s, cfg’s…
• This constraint is that the learner knows a function d,
s.t. after seeing at least d(n) examples, learner
knows what membership of each example sentence
in every sequence
6.863J/9.611J Lecture 21 Sp03
Finite language case
• Result doesn’t really depend on some subtle
property of infinite languages
• Suppose finite languages. Then Gold framework
– learner identifies language by memorization only after hearing all the examples of the
language
• No possibility of generalization; no extrapolation
– not the case for natural languages
A simple example…
6.863J/9.611J Lecture 21 Sp03
Simple finite case
•
•
•
•
Finite set of finite languages
3 sentences, s1, s2 , s3, so 8 possible languages
Suppose learner A considers all 8 languages
Learner B considers only 2 languages:
L1 = {s1 , s2 }, L2 = {s3 }
• If A receives sentence s1 then A has no
information whether s2 or s3 will be part of
target or not – only can tell this after hearing all
the sentences
• If B receives s1 then B knows that s2 will be part
of the target – extrapolation beyond experience
• Restricted space is requirement for
generalization 6.863J/9.611J Lecture 21 Sp03
How many examples needed?
• Gold (again): even if you know the # of
states in an fsa, this is NP-hard
• Restrictions on class of fsa’s make this
poly-time (Angluin; Pilato & Berwick)
• If fsa is backwards deterministic
6.863J/9.611J Lecture 21 Sp03
Example inferred fsa (NP specifiers)
6.863J/9.611J Lecture 21 Sp03
OK smarty…
• What can you do?
• Make class of a priori languages finite,
and small…
• Parameterize it
• How?
6.863J/9.611J Lecture 21 Sp03
English is function-argument
form
function
args
the stock
sold
at a bargain price
green with envy
the over-priced stock
6.863J/9.611J Lecture 21 Sp03
Other languages are the mirrorinverse: arg-function
This is like Japanese
the stock
sold
at a bargain price
green with envy
the over-priced stock
6.863J/9.611J Lecture 21 Sp03
Orders
• SVO “they eat ice-cream”
Finnish, Chinese…)
• SOV “ice-cream they eat”
Hindi, Turkish…)
• VSO “eat they ice-cream”
Arabic,…)
• VOS “eat ice-cream they”
• OSV “ice-cream they eat”
• OVS “ice-cream eat they”
6.863J/9.611J Lecture 21 Sp03
(English,
(Japanese,
(Welsh,
(Malagasy,…)
(Kabardian,…)
Parameter space of languages
Language
S-H
C-H
C
Ng
Va
Vt
SR
Ara bic
Dutch
English
French
German
Hindi
Icelandic
Irish
Italian
Japanese
Malay
Mandarin
Swedish
Tamil
6.863J/9.611J Lecture 21 Sp03
Scr
NP Op LD V2
Scr Scr Sc
Wh
Pro
Parameter space - 14
6.863J/9.611J Lecture 21 Sp03
Japanese
6.863J/9.611J Lecture 21 Sp03
Simple switches
• Just 3, so 8 possible languages
(grammars) – set 0 or 1
• Complement first/final (dual of Head 1st)
• English: Complement final (value = 1)
• Specifier first/final (determiner on right or
left, Subject on right or left)
• Verb second or not (German/not)
6.863J/9.611J Lecture 21 Sp03
Parametric space – triggering
learning algorithm (TLA)
[ 0 1 0 ] = ‘English’
spec 1st/final comp 1st/final –V2/+V2
[ 0 0 1] = ‘German’
spec 1st/final comp 1st/final –V2/+V2
6.863J/9.611J Lecture 21 Sp03
TLA
• 36 possible surface sentence combos:
• SVO, SOV, S Aux V O, S Aux O V, S V O1 O2,
etc.
• We use only simple sentences (at most,
perhaps one embedding) – psychological
fidelity
• 1-bit different example that moves us
from one setting to the next is called a
trigger
6.863J/9.611J Lecture 21 Sp03
(Greedy, not very smart)
6.863J/9.611J Lecture 21 Sp03
How to get there from here
Bangla
Spec-first,
Comp first,
–Verb 2nd
[0 0 0]
German
Subject-Verb-Object
example
sentence w
Hans kauft das Buch
Hans boi-ta kene
Hans the book buys
Spec-first,
Comp-first
+Verb 2nd
[0 0 1]
Hans kauft das Buch
Hans buys the book
•
• transitions based on example sentence
Prob(transition) based on set differences between
languages, normalized by target language |Ltarget|
examples (in our case, if t=English,36 of them)
6.863J/9.611J Lecture 21 Sp03
Learning driven by language
triggering set differences
Li
Lj
Li\Lj/|Ltarget|
6.863J/9.611J Lecture 21 Sp03
Ltarget
The Ringstrasse
11/12
4
[1 0 1]
31/36
1/12
1/12
1/2
3 [1 0 0]
1
[1 1 0]
1
1/6
sink
1/3
2 [1 1 1]
sink target: 5
5/18 5 [spec 1st, comp final, –V2]
[0 0 0]
1/6
7
1/18
[0 1 0]
2/3
6 [0 1 1]
1
1/18 1/36 1/12
8
[0 0 1]
8/9
6.863J/9.611J Lecture 21 Sp03
5/6
6.863J/9.611J Lecture 21 Sp03
TLA, 2
• Learnability depends on start state we are
at
• There are some ‘sink’ states – no escape
• Can calculate average time of
convergence in terms of # of examples
(‘maturation time’)
6.863J/9.611J Lecture 21 Sp03
TLA, 3
• 4 ‘no good’ states – not learnable
• So what can learner do?
• One solution: don’t use greedy, local
algorithm!
• Note that if you can flip more than 1
switch at a time – no sink states
6.863J/9.611J Lecture 21 Sp03
6.863J/9.611J Lecture 21 Sp03
Parameter space of languages
Language
S-H
C-H
C
Ng
Va
Vt
SR
Ara bic
Dutch
English
French
German
Hindi
Icelandic
Irish
Italian
Japanese
Malay
Mandarin
Swedish
Tamil
6.863J/9.611J Lecture 21 Sp03
Scr
NP Op LD V2
Scr Scr Sc
Wh
Pro
6.863J/9.611J Lecture 21 Sp03
# of examples to hit the target
6.863J/9.611J Lecture 21 Sp03
Learning with 24 parameters
6.863J/9.611J Lecture 21 Sp03
But…what’s a symbol?
6.863J/9.611J Lecture 21 Sp03
Stochastic extensions/Gold
complaints & positive results
• To handle statistical case – rules are stochastic –
so the ‘text’ the learner gets is stochastic (some
distribution spits it out…)
• If you know how language is generated then it
helps you learn what language is generated
• Absence of sentence from guessed L is like
negative evidence: although approximate, can
be used to reject guess (“indirect negative
evidence”)
6.863J/9.611J Lecture 21 Sp03
Results for stochastic case
• Results:
• Negative evidence: really needs all the text (enough
sampling over negative examples s.t. child can really
know it)
• If you don’t know the distribution – you lose –
estimating a density function is even harder than
approximating functions…
• If you have very strong constraints on distribution
functions to be drawn from the language family, then
you can learn fsa’s, cfg’s…
• This constraint is that the learner knows a function d,
s.t. after seeing at least d(n) examples, learner
knows what membership of each example sentence
in every sequence
6.863J/9.611J Lecture 21 Sp03
Power of Gold result
• More about information theory than
• Extensions from strings to meaning
• Negative evidence: really needs all the
text (enough sampling over negative
examples s.t. child can really know it)
6.863J/9.611J Lecture 21 Sp03
Statistical learning theory approach
• Removes most of the assumptions of the Gold
framework –
• It does not ask for convergence to exactly the
right language
• The learner receives positive and negative
examples
• The learning process has to end after a certain
number of examples
• Get bounds on the # of examples sentences
needed to converge with high probability
• Can also remove assumption of arbitrary
resources: efficient (poly time) [Valiant]
6.863J/9.611J Lecture 21 Sp03
Statistical learning theory goes
further – but same results
• Languages defined as before:
1L(s)=1 if s  L, 0 o.w.
• Examples provided by some distribution P
on set of all sentences
• Distances between languages defined as
well by the probability measure P
d(L1 – L2) = SS | 1L1(s) - 1L2(s) |P(s)
• Learnability definition: now an e,d pair
6.863J/9.611J Lecture 21 Sp03
Learnability in statistical framework
• Model:
• Examples drawn randomly
• After l data pts, learner conjectures hypothesis
hl - nb this is now a random variable, because it
depends on the randomly generated data
• Dfn: Learner’s hypothesis hl converges to the
target (1L) with probability 1, iff for every e > 0
Prob[d(hl , 1L) > e ] l 0
P is not known to the learner except through the
draws
(e, d) form6.863J/9.611J Lecture 21 Sp03
“Standard” PAC formulation
• If hl converges to the target 1L in a weak
sense, then for every e>0 there exists an
m(e,d) s.t. for all l > m(e,d)
Prob[d(hl , 1L ) > e ] < d
With high probability (> 1-d) the learner’s
hypothesis is approximately close (within e
in this norm) to the target language
m(e,d) is the sample complexity of learning
6.863J/9.611J Lecture 21 Sp03
Main result: VC dimension finite
= learnable
• Vapnik-Cheronenkis result – VC-dimension
• Set of languages is learnable iff it has
finite VC-dimension
• L a collection of languages; H
hypotheses
• Also gives you upper and lower bounds on
the # of example sentences you need to
get within e with probability
• What is VC-dimension?
6.863J/9.611J Lecture 21 Sp03
VC dimension
• Defined over a set of functions H
• Suppose there exists at least one set of
cardinality d that can be shattered and no
set of cardinality > d that can be
shattered by H
• Then VC-dimension of H is d
• If no such set exists, then VC-dimension is
infinite
6.863J/9.611J Lecture 21 Sp03
Learnability and VC dimension
• Class of all finite languages is unlearnable
in PAC framework
• Class of all fsa’s, cfg’s, …
• We can also get bound on # of examples
needed to learn, if VC-dim is finite
6.863J/9.611J Lecture 21 Sp03
•
•
•
•
•
•
•
A subset X ½ U is shattered by R if all
subsets of X can be obtaind by
intersecting X with members of R.
² That is, for any Y µ X, there is some
A 2 R such that Y = X \ A.
² Examples: U = points in the plane. R =
half-spaces.
6.863J/9.611J Lecture 21 Sp03
6.863J/9.611J Lecture 21 Sp03
•
•
•
•
•
•
•
•
•
•
•
The shatter function measures the
complexity of the set system.
² If instead of half-spaces, we used ellipses,
then (ii) and (iii) can be shattered as well.
² So, the set system of ellipses has higher
complexity than half-spaces.
VC Dimension: The VC dimension of a
set system (U;R) is the maximum size of
any set X ½ U shattered by R.
² Thus, the half-spaces system has VC
dimension 3. 6.863J/9.611J Lecture 21 Sp03
• Clearly |PC(S) |  2m
• C shatters S if |PC(S) | =2m
• VC dimension of a class C:
• The size d of the largest set S that shatters C.
• Can be infinite.
• For a finite class C
• VC-dim(C)  log |C|
6.863J/9.611J Lecture 21 Sp03
• complexity is often but not always equal to
the number of parameters (degrees of
freedom) in the model.
6.863J/9.611J Lecture 21 Sp03
• Example: linear decision boundaries
shatter (any) 3 points in 2D
6.863J/9.611J Lecture 21 Sp03
But not 4 points
6.863J/9.611J Lecture 21 Sp03
• Finite VC-dimension is necessary and
sucient for (exponentially) fast
convergence of a learning method
• This result holds for any underlying
probability distribution
6.863J/9.611J Lecture 21 Sp03
The facts
Child: Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
Mother:
No, say “Nobody likes me.”
Child:
Nobody don’t like me.
[dialogue repeated five more times]
Mother:
Now listen carefully, say “Nobody likes me.”
Child:
Oh! Nobody don’t likeS me.
(McNeill, 1966)
6.863J/9.611J Lecture 21 Sp03
Brown & Hanlon, 1970
• parents correct for meaning, not form
• when present, correction was not picked
up
6.863J/9.611J Lecture 21 Sp03
The problem…
• The child makes an error.
• The adult may correct or identify the
error.
• But the child ignores these corrections.
• So, how does the child learn to stop
making the error?
6.863J/9.611J Lecture 21 Sp03
But kids do recover (well, almost)
• u-shaped curve:
•
•
•
•
went - goed - went
child must stop saying:
“goed”
“unsqueeze”
“deliver the library the book”
6.863J/9.611J Lecture 21 Sp03
Overgeneralization
went
jumped
goed
runned
falled
wented
6.863J/9.611J Lecture 21 Sp03
Positive vs. negative evidence
Positive examples
Utterance
1. Child says “went”.
2. Child says “goed”.
3. Adult says “went”.
Feedback
none
none
---
Result
none
none
positive data
Positive & Negative examples
1.
2.
3.
4.
Utterance
Child says “went”.
Child says “goed”.
Adult says “went”.
Adult says “goed”.
Feedback
good
bad
good
bad
6.863J/9.611J Lecture 21 Sp03
Result
positive data
corrective
positive data
corrective
Positive & negative examples
Child:
Father:
Child:
Father:
Child:
Father:
Child:
Father:
me want more.
ungrammatical.
want more milk.
ungrammatical.
more milk !
ungrammatical.
cries
ungrammatical
6.863J/9.611J Lecture 21 Sp03
Contrast…
Child:
Father:
Child:
Father:
Child:
Father:
Child:
Father:
me want more.
You want more? More what?
want more milk.
You want more milk?
more milk !
Sure, honey, I’ll get you some more.
cries
Now, don’t cry, daddy is getting you
some.
6.863J/9.611J Lecture 21 Sp03
• Do TM example as learning in the limit to
show inf seq of 0’s or 1’s works
• Show that exact convergence – if just 1
slip – speak Fr after 1000s of English s’s –
get swatted.
6.863J/9.611J Lecture 21 Sp03
Download