“Ideal” learning of language and categories

Nick Chater, Department of Psychology, University of Warwick
Paul Vitányi, Centrum voor Wiskunde en Informatica, Amsterdam
OVERVIEW
I. Learning from experience: The problem
II. Learning to predict
III. Learning to identify
IV. A methodology for assessing learnability
V. Where next?
I. Learning from experience: The problem
Learning: How few assumptions will work?

Model fitting
- Assume a model M(x); optimize x
- Easy, but needs prior knowledge

No assumptions
- Learning is impossible: "no free lunch"
Can a more minimal model of learning still work?
Learning from +/- vs. + data

[Figure: the learner's guess vs. the target language/category, showing + data (inside the target), - data (outside it), the overlap with the guess, and under-general vs. over-general guesses.]
But how about learning from + data only?
- Categorization
- Language acquisition
Learning from positive data seems to raise in-principle problems

In categorization, this rules out:
- Almost all learning experiments in psychology
- Exemplar models
- Prototype models
- NNs, SVMs…

In language acquisition:
- It is standardly assumed that children need access only to positive evidence
- Sometimes viewed as ruling out learning models entirely: the "logical" problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
Must be solvable: A parallel with science
- Science only has access to positive data
- Yet science seems to be possible
- So overgeneral theories must be eliminated, somehow
  - e.g., "anything goes" seems a bad theory
  - Theories must capture regularities, not just fit the data
Absence as implicit negative evidence?
- An overgeneral grammar predicts lots of sentences that are missing from the corpus
- Their absence is a systematic clue that the theory is probably wrong
- This idea only seems convincing if it can be proved that convergence works well, statistically… So what do we need to assume?
Modest assumption: Computability constraint
- Assume that the data is generated by:
  - random factors (chance), e.g., …HHTTTHTTHTTHT…
  - computable factors, i.e., nothing uncomputable
- A computable process: "monkeys typing into a programming language", e.g., a grammar (S → NP V …) generating "…The cat sat on the mat. The dog…"
- A modest assumption!
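A minimal sketch of this generative assumption, in Python (the three-instruction toy language, its instruction names, and the halting probability are illustrative assumptions, not part of the talk): draw a random "program" and run it; the resulting data stream is typically structured and therefore compressible.

```python
import random

random.seed(1)

# Toy version of "monkeys typing into a programming language": draw a
# random program, then run it. The three-instruction language is an
# illustrative assumption.
INSTRUCTIONS = ["EMIT_0", "EMIT_1", "REPEAT_LAST"]

def random_program(max_len=6):
    """Sample a short program: each instruction chosen uniformly,
    stopping with probability 1/2 after each one."""
    prog = [random.choice(INSTRUCTIONS)]
    while len(prog) < max_len and random.random() < 0.5:
        prog.append(random.choice(INSTRUCTIONS))
    return prog

def run(program, n_outputs=16):
    """Run the program in a loop until it has emitted n_outputs symbols."""
    out, i = [], 0
    while len(out) < n_outputs:
        op = program[i % len(program)]
        if op == "EMIT_0":
            out.append(0)
        elif op == "EMIT_1":
            out.append(1)
        else:                      # REPEAT_LAST
            out.append(out[-1] if out else 0)
        i += 1
    return out

prog = random_program()
print(prog, "->", run(prog))   # the output is structured, hence compressible
```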
Learning by simplicity
- Find an explanation of the "input" that is as simple as possible
  - An 'explanation' reconstructs the input
  - Simplicity is measured in code length
- Long history in perception: Mach, Koffka, Hochberg, Attneave, Leeuwenberg, van der Helm
- Mimicry theorem with Bayesian analysis, e.g., Li & Vitányi (2000); Chater (1996); Chater & Vitányi (ms.)
  - Relation to Bayesian inference
  - Widely used in statistics and machine learning
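As a scaled-down illustration of "simplicity measured in code length", the sketch below scores a few coin-flip hypotheses by a two-part code: bits to state the hypothesis plus bits to encode the data given it. The hypothesis set and the description lengths are made-up assumptions.

```python
import math

def two_part_code_length(data, p_heads, model_bits):
    """Two-part code: bits to state the hypothesis plus bits to encode
    the data given the hypothesis (-log2 of its likelihood)."""
    n1 = sum(data)
    n0 = len(data) - n1
    if (p_heads == 0.0 and n1) or (p_heads == 1.0 and n0):
        return float("inf")        # hypothesis gives the data probability 0
    bits = model_bits
    if n1:
        bits += -n1 * math.log2(p_heads)
    if n0:
        bits += -n0 * math.log2(1 - p_heads)
    return bits

# Illustrative hypotheses: (P(heads), rough description length in bits).
hypotheses = {
    "always heads p=1.0": (1.0, 2),
    "fair coin    p=0.5": (0.5, 2),
    "biased coin  p=0.9": (0.9, 6),
}

data = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
for name, (p, m) in hypotheses.items():
    print(f"{name}: {two_part_code_length(data, p, m):6.1f} bits")
# The preferred 'explanation' is the hypothesis with the shortest total code.
```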
Consider "ideal" learning
- Given the data, what is the shortest code?
- How well does the shortest code work?
  - Prediction
  - Identification
- Ignore the question of search
  - Makes general results feasible
  - But search won't go away…!
  - Fundamental question: when is learning data-limited, and when is it search-limited?
Three kinds of induction
- Prediction: converge on correct predictions
- Identification: identify the generating category/distribution in the limit
- Learning causal mechanisms??
  - Inferring counterfactuals: the effects of intervention (cf. Pearl: from probability to causes)
II. Learning to predict
Prediction by simplicity
- Find the shortest 'program/explanation' for the current data
- Predict using that program
  - Strictly, use a 'weighted sum' of explanations, weighted by brevity…
  - Equivalent to Bayes with (roughly) a 2^(-K(x)) prior, where K(x) is the length of the shortest program generating x
Summed error has a finite bound (Solomonoff, 1978):

  ∑_{j=1}^∞ s_j ≤ (K(μ)/2) · log_e 2

where s_j is the expected (squared) prediction error on the j-th item and μ is the generating distribution.

So prediction converges [faster than 1/(n log n), for corpus size n]
- Inductive inference is possible!
- No independence or stationarity assumptions; just computability of the generating mechanism
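The "weighted sum of explanations, weighted by brevity" can be imitated, very crudely, by replacing the full Solomonoff mixture with a handful of candidate sources whose prior weights are proportional to 2^(-description length). The candidate sources and their lengths below are assumptions for illustration only.

```python
import math

# Finite stand-in for the Solomonoff mixture: candidate sources with
# illustrative description lengths; prior weight is 2^(-length).
CANDIDATES = [
    # (name, description length in bits, P(symbol = 1) under the source)
    ("mostly zeros", 3, 0.05),
    ("fair coin",    3, 0.50),
    ("mostly ones",  4, 0.95),
]

def predict_next(seq):
    """P(next symbol = 1) under the brevity-weighted mixture."""
    n1 = sum(seq)
    n0 = len(seq) - n1
    weights = [2.0 ** (-length) * (p ** n1) * ((1 - p) ** n0)
               for _, length, p in CANDIDATES]
    total = sum(weights)
    return sum(w * p for w, (_, _, p) in zip(weights, CANDIDATES)) / total

seq = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1]
for t in (2, 6, 12):
    print(t, round(predict_next(seq[:t]), 3))
# The prediction drifts toward the source that best trades off brevity and fit.
```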
Applications

Language
- A. Grammaticality judgments
- B. Language production
- C. Form-meaning mappings

Categorization
- Learning from positive examples
A: Grammaticality judgments
- We want a grammar that doesn't over- or under-generalize (much) w.r.t. the 'true' grammar, on sentences that are statistically likely to occur
- NB: no guarantees for…
  - Colorless green ideas sleep furiously (Chomsky)
  - Bulldogs bulldogs bulldogs fight fight fight (Fodor)
Converging on a grammar
- Fixing undergeneralization is easy (such grammars get 'falsified' by sentences they cannot account for)
- Overgeneralization is the hard problem
  - We need to use absence as evidence
  - But the language is infinite and any corpus is finite
  - So almost all grammatical sentences are also absent
- Hence the logical problem of language acquisition; Baker's paradox; the supposed impossibility of 'mere' learning from positive evidence
Overgeneralization Theorem
- Suppose the learner has probability ε_j of erroneously guessing an ungrammatical j-th word; then

  ∑_{j≥1} ε_j ≤ K(μ) · log_e 2

  where μ is the distribution generating the corpus.
- Intuitive explanation:
  - overgeneralization means assigning smaller probabilities than needed to grammatical sentences,
  - and hence excessive code lengths
B: Language production
- Simplicity allows 'mimicry' of any computable statistical method of generating a corpus
- For an arbitrary computable probability μ and the simplicity-based probability λ:

  λ(y|x) / μ(y|x) → 1

  (Li & Vitányi, 1997)
C: Learning form-meaning mappings
- So far we have ignored semantics
- Suppose the language input consists of form-meaning pairs (cf. Pinker)
- Assume only that the form → meaning and meaning → form mappings are computable (they don't have to be deterministic)…

A theorem
- It follows that:
  - the total errors in mapping forms to (sets of) meanings (with probabilities), and
  - the total errors in mapping meanings to (sets of) forms (with probabilities)
- …have a finite bound (and hence average errors per sentence tend to 0)
Categorization
- Sample n items from category C (assume all items are equally likely)
- Guess by choosing the D that provides the shortest code for the data

General proof method:
1. Overgeneralization: D must be the basis for a shorter code than C (or you wouldn't prefer it)
2. Undergeneralization: typical data from category C will have no code shorter than n·log|C|
1. Fighting overgeneralization
- D can't be much bigger than C, or it will have a longer code length:

  K(D) + n·log|D| ≤ K(C) + n·log|C|

- As n → ∞, the constraint is that |D|/|C| ≤ 1 + O(1/n)

2. Fighting undergeneralization
- But the guess must cover most of the correct category, or it would provide a "suspiciously" short code for the data
- Typicality:

  K(D|C) + n·log|C∩D| ≥ n·log|C|

- As n → ∞, the constraint is that |C∩D|/|C| ≥ 1 - O(1/n)

[Figure: the guess D overlapping the true category C in each case.]
Implication
- |D| converges to near |C|
  - Accuracy is bounded by O(1/n), with n samples
  - (i.i.d. assumptions)
- The actual rate depends crucially on the structure of the category
  - Language: need lots of examples (but how many?)
  - Some categories may need only a few (one?) examples (Tenenbaum, Feldman)
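This convergence can be checked in a toy simulation (the integer universe, the interval-shaped categories, and the crude stand-in for K(D) below are all illustrative assumptions): as the number of positive examples grows, the shortest-code guess D closes in on the true category C.

```python
import math, random

random.seed(0)

# True category C: an interval of integers in a universe of 0..999;
# positive examples are sampled i.i.d. and uniformly from C.
UNIVERSE = 1000
C = range(200, 300)                       # |C| = 100

def code_length(D, sample):
    """Crude two-part code: a fixed ~2*log2(UNIVERSE) bits to state the
    interval's endpoints (a stand-in for K(D)), plus n*log2|D| bits to
    code the n examples as uniform draws from D."""
    if any(x not in D for x in sample):
        return float("inf")               # D must cover the observed data
    return 2 * math.log2(UNIVERSE) + len(sample) * math.log2(len(D))

candidates = [range(lo, hi)
              for lo in range(0, UNIVERSE, 10)
              for hi in range(lo + 10, UNIVERSE + 1, 10)]

for n in (2, 5, 20, 100):
    sample = [random.choice(C) for _ in range(n)]
    best = min(candidates, key=lambda D: code_length(D, sample))
    print(f"n={n:3d}   chosen |D| = {len(best):3d}   true |C| = {len(C)}")
# Small samples give a (slightly) under-general guess; as n grows, the
# shortest-code D converges on C, as the O(1/n) bounds suggest.
```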
III. Learning to identify
Hypothesis identification
- Induction of the 'true' hypothesis, or category, or language
- In philosophy of science, typically viewed as a hard problem…
- Needs stronger assumptions than prediction
Identification in the limit: The problem
- Assume endless data
- Goal: specify an algorithm that, at each point, picks a hypothesis
- And eventually locks in on the correct hypothesis
  - though it can never announce it, as there may always be an additional low-frequency item that has yet to be encountered
- Gold, Osherson et al. have studied this extensively
- Sometimes viewed as showing that identification is not possible (but the results are really a mix of positive and negative)
- But i.i.d. sampling and computability allow a general positive result
Algorithm
- Each hypothesis has two parts:
  - a program which specifies a distribution Pr
  - a sample from Pr, coded with an average code length of H(Pr) per data point
- Pick a specific set of data (which needs to be 'long enough')
  - We won't necessarily know what counts as long enough: an extra assumption
- Specify an enumeration of programs for Pr, e.g., in order of length
- Run them, dovetailing
- Initialize with any Pr
- Flip to the Pr corresponding to the shortest program found so far that has generated the data
Dovetailing

[Figure: the dovetailing schedule over prog1…prog4, interleaving steps in the order 1, 2, 4, 7, … (prog1); 3, 5, 8, … (prog2); 6, 9, … (prog3); 10, … (prog4).]
Runs for ever…
- Run the programs in order, dovetailing, where each program gets a share of the steps proportional to 2^(-length)
- This process runs for ever (some programs loop)
- The shortest program so far that generates the data is "pocketed"…
- This will always end up at the "true" program
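A minimal sketch of the dovetailing idea, not the talk's actual procedure: "programs" are Python generators, one of which loops forever without producing output; shorter programs get more steps per round, and the shortest program so far whose output matches the observed data is pocketed. The program names and lengths are illustrative assumptions.

```python
# "Programs" are generators standing in for an enumeration of programs,
# some of which loop forever. Lengths are illustrative assumptions.

def loop_forever():
    while True:
        yield None                     # computes forever, never emits data

def emit_constant(k):
    while True:
        yield k

PROGRAMS = [                           # enumeration in order of "length"
    ("loop forever", 2, loop_forever()),
    ("all zeros",    3, emit_constant(0)),
    ("all ones",     4, emit_constant(1)),
]

def dovetail(data, rounds=50):
    """Interleave the programs, giving shorter ones more steps per round,
    and pocket the shortest one whose output matches the data so far."""
    max_len = max(length for _, length, _ in PROGRAMS)
    outputs = {name: [] for name, _, _ in PROGRAMS}
    pocket = None
    for _ in range(rounds):
        for name, length, gen in PROGRAMS:
            for _ in range(2 ** (max_len - length)):   # shorter => more steps
                value = next(gen)
                if value is not None:
                    outputs[name].append(value)
        for name, length, _ in sorted(PROGRAMS, key=lambda p: p[1]):
            out = outputs[name]
            if len(out) >= len(data) and out[:len(data)] == list(data):
                pocket = name                          # shortest match so far
                break
    return pocket

print(dovetail([1, 1, 1, 1]))   # -> 'all ones'; the looping program never blocks it
```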
Overwhelmingly likely to work...
(as n → ∞, Prob(correct identification) → 1)
- For a large enough stream of n typical data points, no alternative model does better
- Coding data generated by Pr with Pr' rather than with Pr wastes an expected n·D(Pr||Pr') bits
- D(Pr||Pr') > 0, so this swamps the initial code length, for large enough n

[Figure: initial code lengths K(Pr) and K(Pr'); by around n = 8 in the illustration, Pr wins.]
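A small numeric check of why the true model eventually wins (all the numbers are made up): both hypotheses pay the same n·H(Pr) bits for the irreducibly random detail, so the comparison comes down to the one-off code lengths versus the per-item KL waste, which grows linearly in n.

```python
import math

def kl_bits(p, q):
    """D(p||q) in bits: expected extra code length per item when data
    from p is coded with a code that is optimal for q."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# True source Pr and a rival Pr' over three symbols (illustrative numbers),
# with illustrative initial code lengths standing in for K(Pr), K(Pr').
pr, pr_rival = [0.7, 0.2, 0.1], [0.5, 0.3, 0.2]
k_pr, k_rival = 120, 90

d = kl_bits(pr, pr_rival)
print(f"per-item waste D(Pr||Pr') = {d:.3f} bits")
for n in (8, 100, 1000):
    true_total  = k_pr                  # common n*H(Pr) term cancels out
    rival_total = k_rival + n * d
    winner = "Pr" if true_total < rival_total else "Pr'"
    print(f"n={n:5d}   rival penalty = {n * d:7.1f} bits   winner: {winner}")
# For small n the rival's shorter initial code can win, but the linear
# KL penalty always overtakes it for large enough n.
```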
IV. A methodology for assessing learnability
Assessing learnability in cognition?
- Constraint c is learnable if a code which:
  1. "invests" l(c) bits to encode c
  can…
  2. recoup its investment, i.e., save more than l(c) bits in encoding the data
  → c is acquired
- Nativism? If there is not enough data, the investment can't be recouped (e.g., little/no relevant data)
- Viability of empiricism? An ample supply of data with which to recoup l(c)
- Cf. Tenenbaum, Feldman…
Language acquisition: Poverty of the stimulus, quantified
- Consider the cost of a linguistic constraint (e.g., noun-verb agreement; subjacency; phonological constraints)
- Cost is assessed by the length of its formulation (the length of the linguistic rules)
- Saving: the reduction in the cost of coding the data (perceptual, linguistic)
Easy example: learning singular-plural agreement

  With the constraint:
    John loves tennis: x bits
    They love_ tennis: y bits

  Without the constraint, *John love_ tennis and *They loves tennis are also possible, so each sentence needs an extra bit to mark the verb form:
    John loves tennis: x+1 bits
    They love_ tennis: y+1 bits

If the constraint applies to a proportion p of n sentences, the constraint saves p·n bits.
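The arithmetic of the criterion can be made explicit; in the sketch below the rule length l(c) and the proportion p are assumptions, not corpus estimates. The constraint pays for itself once the p·n bits it saves exceed the l(c) bits needed to state it.

```python
def constraint_acquired(l_c_bits, p, n):
    """MDL-style learnability check sketched above: acquire the constraint
    once the bits it saves on the data exceed the bits needed to state it."""
    saving = p * n           # 1 bit saved per sentence the constraint covers
    return saving > l_c_bits, saving

l_c = 200                    # illustrative cost of stating the agreement rule
p = 0.3                      # illustrative proportion of sentences it covers
for n in (100, 1_000, 10_000):
    acquired, saving = constraint_acquired(l_c, p, n)
    print(f"n={n:6d}   saving={saving:7.0f} bits   acquired: {acquired}")
# Break-even is at n = l(c)/p sentences; beyond that the constraint is learnable.
```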
Visual structure: ample data?
- Depth from stereo:
  - Invest: an algorithm for solving the correspondence problem
  - Recoup: almost a whole image (that's a lot!)
  - Perhaps stereo could be inferred from a single stereo image?
- Object/texture models (Yuille):
  - Investment: building the model
  - But recouped in compression, relative to a "raw" image description
  - Presumably few images are needed?
A harder linguistic case: Baker's paradox
(with Luca Onnis and Matthew Roberts)

Quasi-regular structures are ubiquitous in language, e.g., alternations:
- It is likely that John will come / It is possible that John will come
  John is likely to come / *John is possible to come
  (Baker, 1979; see also Culicover)
- Strong winds / High winds
  Strong currents / *High currents
- I love going to Italy! / I enjoy going to Italy!
  I love to go to Italy! / *I enjoy to go to Italy!
Baker's paradox (Baker, 1979)
- Selectional restrictions: "holes" in the space of possible sentences allowed by a given grammar…
- How does the learner avoid falling into the holes?
- i.e., how does the learner distinguish genuine 'holes' from the infinite number of unheard grammatical constructions?
Our abstract theory tells us something
- The theorem on grammaticality judgments shows that the paradox is solvable, in the asymptote, and with no computational restrictions
- But can this be scaled down…
  - to learning specific 'alternation' patterns
  - from the corpus the child actually hears?
Argument by information investment
- To encode an exception which appears to have probability x requires log2(1/x) bits
- But eliminating that probability mass makes every other sentence 1/(1-x) times more likely, saving roughly n·log2(1/(1-x)) bits over n sentences
- Does the saving outweigh the investment?
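Rough arithmetic for this trade-off, with made-up values for the exception's apparent probability and the corpus sizes (they are not CHILDES estimates):

```python
import math

def exception_worth_encoding(x, n):
    """Stating an exception with apparent probability x costs about
    log2(1/x) bits; renormalizing the remaining probability mass saves
    about log2(1/(1-x)) bits on each of the n other sentences."""
    invest = math.log2(1 / x)
    recoup = n * math.log2(1 / (1 - x))
    return invest, recoup

x = 1e-4                         # illustrative apparent probability of the exception
for n in (1_000, 100_000, 5_000_000):
    invest, recoup = exception_worth_encoding(x, n)
    print(f"n={n:9,d}   invest={invest:5.1f} bits   recoup={recoup:8.1f} bits"
          f"   worth it: {recoup > invest}")
# Small corpora don't justify encoding the exception; corpora of millions
# of words easily do.
```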
An example: recovery from overgeneralisations

  The rabbit hid / You hid the rabbit!
  The rabbit disappeared / *You disappeared the rabbit!

- The return on the 'investment' over 5M words from the CHILDES database is easily sufficient
- But this methodology can be applied much more widely (and aimed at fitting the time-course of U-shaped generalization, and at when overgeneralizations do or do not arise)
V. Where next?
Can we learn causal structure from observation?
- What happens if we move the left-hand stick?
- The output of perception provides a description in terms of causality:
  - Liftability
  - Breakability
  - Edibility
  - What is attached to what
  - What is resting on what
- Without this, perception is fairly useless as an input for action
Inferring causality from observation: The hard problem of induction

[Figure: a generative process producing the sensory input.]
Formal question
- Suppose a modular computer program generates a stream of data of indefinite length…
- Under what conditions can the modularity be recovered?
- How might "interventions"/experiments help?
- (Key technical idea: the Kolmogorov sufficient statistic)
Fairly uncharted territory
- If the data is generated by independent processes
- then one model of the data will involve recapitulation of those processes
- But will there be other, alternative modular programs?
- Which might be shorter?
- Hopefully not!
- A completely open field…