Class slides 3

Neural Networks
Functions

Input      Output
4, 4       8
2, 3       5
1, 9       10
6, 7       13
341, 257   598
Functions

Input   Output
rock    rock
sing    sing
alqz    alqz
dark    dark
lamb    lamb
Functions

Input   Output
00      0
10      0
01      0
11      1
Functions

Input   Output
look    looked
rake    raked
sing    sang
go      went
want    wanted
Functions

Input                        Output
John left                    1
Wallace fed Gromit           1
Fed Wallace Gromit           0
Who do you like Mary and?    0
Learning Functions
• In training, the network is shown examples of what the function generates,
  and has to figure out what the function is.
• Think of language/grammar as a very big function (or set of functions).
  The learning task is similar – the learner is presented with examples of
  what the function generates, and has to figure out what the system is.
• Main question in language acquisition: what does the learner need to
  know in order to successfully figure out what this function is?
• Questions about Neural Networks
  – How can a network represent a function?
  – How can the network discover what this function is?
AND Network

Input   Output
00      0
10      0
01      0
11      1
OR Network

Input   Output
00      0
10      1
01      1
11      1
NETWORK CONFIGURED BY TLEARN
# weights after 10000 sweeps
# WEIGHTS
# TO NODE 1
-1.9083807468    ## bias to 1
 4.3717832565    ## i1 to 1
 4.3582129478    ## i2 to 1
 0.0000000000
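These appear to be weights for the OR network just shown: the bias (about -1.9) is smaller in magnitude than either input weight (about 4.36-4.37), which is exactly the configuration described on the next slide. As a check, here is a minimal sketch (not part of the original slides, assuming tlearn's standard logistic activation):

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights read off the listing above: bias, i1 and i2, all feeding node 1.
bias, w_i1, w_i2 = -1.9083807468, 4.3717832565, 4.3582129478

for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    activation = sigmoid(bias + w_i1 * i1 + w_i2 * i2)
    print(i1, i2, round(activation, 2))
# Prints roughly 0.13, 0.92, 0.92, 1.0: either input switches the node "on",
# i.e. the OR function.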
2-layer XOR Network
• In order for the network to model the XOR function, we need activation of
  either of the inputs to turn the output node “on” – just as in the OR
  network. This was achieved easily by making the negative weight on the bias
  smaller in magnitude than the positive weight on either of the inputs.
• However, in the XOR network we also want turning both inputs on to turn the
  output node “off”. Since turning both inputs on can only increase the total
  input to the output node, and the output is switched “off” only when it
  receives less input, this effect cannot be achieved.
• The XOR function is not linearly separable, and hence it cannot be
  represented by a two-layer network. This is a classic result in the
  theory of neural networks.
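Before looking at the solution, here is a small brute-force check of this claim (my own illustration, not from the slides; it uses a hard threshold unit rather than tlearn's sigmoid, but the conclusion is the same): no setting of two input weights and a bias reproduces XOR.

from itertools import product

def step(x):                 # simple threshold unit
    return 1 if x > 0 else 0

XOR = {(0, 0): 0, (1, 0): 1, (0, 1): 1, (1, 1): 0}

# Search a grid of weights for a single unit: output = step(bias + w1*i1 + w2*i2).
grid = [x / 2 for x in range(-10, 11)]        # -5.0 to 5.0 in steps of 0.5
solutions = [
    (b, w1, w2)
    for b, w1, w2 in product(grid, repeat=3)
    if all(step(b + w1 * i1 + w2 * i2) == out for (i1, i2), out in XOR.items())
]
print(len(solutions))   # 0 -- no input-to-output solution exists without hidden units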
XOR Network

Input-to-hidden weights:
-3.0456776619    ## bias to 1
 5.5165352821    ## i1 to 1
-5.7562727928    ## i2 to 1
-3.6789164543    ## bias to 2
-6.4448370934    ## i1 to 2
 6.4957633018    ## i2 to 2

Hidden unit 1 (“on” only for input 10):
Input   Output
00      0
10      1
01      0
11      0

Hidden unit 2 (“on” only for input 01):
Input   Output
00      0
10      0
01      1
11      0

Hidden-to-output weights:
-4.4429202080    ## bias to output
 9.0652370453    ## 1 to output
 8.9045801163    ## 2 to output

Hidden units   Output
00             0
10             1
01             1
11             1

The mapping from the hidden units to the output is an OR network that never
receives a [1 1] input.
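As a check on this reconstruction, here is a short sketch (not part of the original slides, assuming the standard logistic activation) that runs the listed weights forward over all four inputs:

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights from the listing above: (bias, i1, i2) for each hidden unit,
# and (bias, hidden 1, hidden 2) for the output unit.
HIDDEN_1 = (-3.0456776619, 5.5165352821, -5.7562727928)
HIDDEN_2 = (-3.6789164543, -6.4448370934, 6.4957633018)
OUTPUT = (-4.4429202080, 9.0652370453, 8.9045801163)

def forward(i1, i2):
    h1 = sigmoid(HIDDEN_1[0] + HIDDEN_1[1] * i1 + HIDDEN_1[2] * i2)
    h2 = sigmoid(HIDDEN_2[0] + HIDDEN_2[1] * i1 + HIDDEN_2[2] * i2)
    return sigmoid(OUTPUT[0] + OUTPUT[1] * h1 + OUTPUT[2] * h2)

for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(i1, i2, round(forward(i1, i2), 2))
# Prints roughly 0.02, 0.98, 0.98, 0.02: hidden unit 1 fires only for 10 and
# hidden unit 2 only for 01, so the OR-like output unit never sees both
# hidden units on at once, and the network as a whole computes XOR.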
Learning Rate
• The learning rate, which is explained in chapter 1 (pp. 12-13), is a training
  parameter which determines how strongly the network responds to an error
  signal at each training cycle. The higher the learning rate, the bigger the
  change the network will make in response to a large error. Sometimes having a
  high learning rate will be beneficial; at other times it can be quite
  disastrous for the network. An example of sensitivity to learning rate can be
  found in the case of the XOR network discussed in chapter 4.
• Why should it be a bad thing to make big corrections in response to big
  errors? The reason is that the network is looking for the best general
  solution to mapping all of the input-output pairs, but it normally adjusts
  weights in response to an individual input-output pair. Since the network has
  no knowledge of how representative any individual input-output pair is of the
  general trend in the training set, it would be rash for the network to respond
  too strongly to any individual error signal. By making many small responses to
  the error signals, the network learns a bit more slowly, but it is protected
  against being messed up by outliers in the data.
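A minimal sketch of the point (my own illustration, with made-up weight and gradient values): the learning rate simply scales how far a weight moves in response to one error signal.

# One gradient-descent step: the learning rate scales how far a weight moves
# in response to the error gradient computed from a single training pattern.
def update_weight(weight, gradient, learning_rate):
    return weight - learning_rate * gradient

weight = 0.5
gradient = 2.0          # hypothetical error gradient from one input-output pair
for lr in (0.01, 0.1, 1.0):
    print(lr, update_weight(weight, gradient, lr))
# lr = 0.01 nudges the weight to 0.48; lr = 1.0 flings it to -1.5, letting a
# single unrepresentative pair drag the whole weight-configuration around.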
Momentum
Just as with learning rate, sometimes the learning algorithm can only
find a good solution to a problem if the momentum training parameter
is set to a specific value. What does this mean, and why should it make
a difference?
If momentum is set to a high value, then the weight changes made by
the network are very similar from one cycle to the next. If momentum
is set to a low value, then the weight changes made by the network can
be very different on adjacent cycles.
So what?
Momentum
In searching for the best available configuration to model the training data, the
network has no ‘knowledge’ of what the best solution is, or even whether there is a
particularly good solution at all. It therefore needs some efficient and reliable way
of searching the range of possible weight-configurations for the best solution.
One thing that can be done is for the network to test whether any small changes to
its current weight-configuration lead to improved performance. If so, then it can
make that change. Then it can ask the same question in its new weight-configuration, and again modify the weights if there is a small change that leads to
improvement. This is a fairly effective way for a blind search to proceed, but it has
inherent dangers – the network might come across a weight-configuration which is
better than all very similar configurations, but is not the best configuration of all.
In this situation, the network can figure out that no small changes improve
performance, and will therefore not modify its weights. It therefore ‘thinks’ that it
has reached an optimal solution, but this is an incorrect conclusion. This problem
is known as a local maximum or local minimum.
Momentum
Momentum can serve to help the network avoid local maxima, by controlling the
‘scale’ at which the search for a solution proceeds. If momentum is set high, then
changes in the weight-configuration are very similar from one cycle to the next. A
consequence of this is that early in training, when error levels are typically high,
weight changes will be consistently large. Because weight changes are forced to
be large, this can help the network avoid getting trapped in a local maximum.
A decision about the momentum value to be used for learning amounts to a
hypothesis about the nature of the problem being learned, i.e., it is a form
of innate knowledge, although not of the kind that we are accustomed to
dealing with.
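A minimal sketch of the standard momentum update used in backpropagation-style learning (my own illustration, with made-up gradient values), showing how a high momentum keeps successive weight changes similar:

# Weight change with momentum: each step blends the current error gradient
# with the previous step, so a high momentum keeps successive weight changes
# similar from one training cycle to the next.
def momentum_step(gradient, previous_step, learning_rate=0.1, momentum=0.9):
    return momentum * previous_step - learning_rate * gradient

weight, step = 0.5, 0.0
for gradient in (2.0, 1.8, -0.1, 2.1):      # hypothetical per-cycle gradients
    step = momentum_step(gradient, step)
    weight += step
    print(round(step, 3), round(weight, 3))
# Early large steps carry over into later cycles (even when one cycle's
# gradient briefly reverses), which can carry the search past shallow
# local minima/maxima instead of stopping at the first one it meets.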
The Past Tense and Beyond
Classic Developmental Story
• Initial mastery of regular and irregular past tense forms
• Overregularization appears only later (e.g. goed, comed)
• ‘U-Shaped’ developmental pattern taken as evidence for
learning of a morphological rule
V + [+past] --> stem + /d/
Rumelhart & McClelland 1986
Model learns to classify regulars and irregulars,
based on sound similarity alone.
Shows U-shaped developmental profile.
What is really at stake here?
• Abstraction
• Operations over variables
– Symbol manipulation
– Algebraic computation
• Learning based on input
– How do learners generalize beyond input?
y = 2x
What is not at stake here
• Feedback, negative evidence, etc.
Who has the most at stake here?
• Those who deny the need for rules/variables in language have the most
  to lose here
  – …if the English past tense is hard, just wait until you get to the rest
    of natural language!
• …but if they are successful, they bring with them a simple and
attractive learning theory, and mechanisms that can readily be
grounded at the neural level
• However, if the advocates of rules/variables succeed here or elsewhere,
they face the more difficult challenge at the neuroscientific level
Pinker & Ullman (2002)
1. Are regulars different?
2. Do regulars implicate operations over variables?
• Beyond Sound Similarity
• Regulars and Associative Memory
• Neuropsychological Dissociations
• Other Domains of Morphology
Beyond Sound Similarity
• Zero-derived denominals are regular
  – Soldiers ringed the city
  – *Soldiers rang the city
  – high-sticked, grandstanded, …
  – *high-stuck, *grandstood, …
• Productive in adults & children
• Shows sensitivity to morphological structure
  [[ stem N] ø V]-ed
• Provides good evidence that sound similarity is not everything
• But nothing prevents a model from using a richer similarity metric
  – morphological structure (for ringed)
  – semantic similarity (for low-lifes)
[Roadmap repeated: Beyond Sound Similarity · Regulars and Associative Memory ·
Neuropsychological Dissociations · Other Domains of Morphology]
Regulars & Associative Memory
• Regulars are productive, need
not be stored
• Irregulars are not productive,
must be stored
• But are regulars immune to
effects of associative memory?
– frequency
– over-irregularization
• Pinker & Ullman:
– regulars may be stored
– but they can also be generated
on-the-fly
– ‘race’ can determine which of
the two routes wins
– some tasks more likely to show
effects of stored regulars
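A toy sketch of the dual-route ‘race’ idea just described (my own illustration, not Pinker & Ullman’s implementation; the stored entries are only examples): the memory route wins when it has an entry, otherwise the regular rule applies.

# Toy dual-route past tense: an associative memory for stored (irregular or
# high-frequency regular) forms, plus a default rule (stem + -ed) that
# applies whenever lookup fails.
STORED = {"go": "went", "sing": "sang", "want": "wanted"}   # illustrative entries

def past_tense(stem):
    # In the 'race' picture, whichever route finishes first wins; here the
    # memory route simply takes priority whenever it has an entry.
    if stem in STORED:
        return STORED[stem]
    return stem + "ed"                      # the default rule: V + ed

print([past_tense(v) for v in ("go", "sing", "look", "blick")])
# ['went', 'sang', 'looked', 'blicked'] -- the rule generalizes to any stem,
# including novel ones, without needing a stored form.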
Child vs. Adult Impairments
• Specific Language Impairment
– Early claims that regulars show
greater impairment than
irregulars are not confirmed
• Pinker & Ullman 2002b
– ‘The best explanation is that
language-impaired people are
indeed impaired with rules, […]
but can memorize common regular
forms.’
– Regulars show consistent
frequency effects in SLI, not in
controls.
– ‘This suggests that children
growing up with a grammatical
deficit are better at compensating
for it via memorization than are
adults who acquired their deficit
later in life.’
[Roadmap repeated: Beyond Sound Similarity · Regulars and Associative Memory ·
Neuropsychological Dissociations · Other Domains of Morphology]
Neuropsychological Dissociations
• Ullman et al. 1997
  – Alzheimer’s disease patients
    • Poor memory retrieval
    • Poor irregulars
    • Good regulars
  – Parkinson’s disease patients
    • Impaired motor control, good memory
    • Good irregulars
    • Poor regulars
  – Striking correlation involving laterality of effect
• Marslen-Wilson & Tyler 1997
  – Normals
    • past tense primes stem
  – 2 Broca’s Patients
    • irregulars prime stems
    • inhibition for regulars
  – 1 patient with bilateral lesion
    • regulars prime stems
    • no priming for irregulars or semantic associates
Morphological Priming
• Lexical Decision Task
  – CAT, TAC, BIR, LGU, DOG
  – press ‘Yes’ if this is a word
• Priming
  – facilitation in decision times when related word precedes target
    (relative to unrelated control)
  – e.g., {dog, rug} - cat
• Marslen-Wilson & Tyler 1997
  – Regular: {jumped, locked} - jump
  – Irregular: {found, shows} - find
  – Semantic: {swan, hay} - goose
  – Sound: {gravy, sherry} - grave
Neuropsychological Dissociations
• Bird et al. 2003
– complain that arguments for
selective difficulty with
regulars are confounded with
the phonological complexity of
the word-endings
• Pinker & Ullman 2002
– weight of evidence still
supports dissociation; Bird et
al.’s materials contained
additional confounds
Brain Imaging Studies
• Jaeger et al. 1996, Language
  – PET study of past tense
  – Task: generate past from stem
  – Design: blocked conditions
  – Result: different areas of activation for regulars and irregulars
• Is this evidence decisive?
  – task demands very different
  – difference could show up in network
  – doesn’t implicate variables
• Münte et al. 1997
  – ERP study of violations
  – Task: sentence reading
  – Design: mixed
  – Result:
    • regulars: ~LAN
    • irregulars: ~N400
• Is this evidence decisive?
  – allows possibility of comparison with other violations

[Figure: brain activation for Regular, Irregular, and Nonce past tense forms
(Jaeger et al. 1996)]
[Roadmap repeated: Beyond Sound Similarity · Regulars and Associative Memory ·
Neuropsychological Dissociations · Other Domains of Morphology]
(Clahsen, 1999)

Low-Frequency Defaults
• German Plurals
    Singular       Plural
    die Straße     die Straßen
    die Frau       die Frauen
    der Apfel      die Äpfel
    die Mutter     die Mütter
    das Auto       die Autos
    der Park       die Parks
                   die Schmidts
• -s plural low frequency, used for loan-words, denominals, names, etc.
• Response
  – frequency is not the critical factor in a system that focuses on similarity
  – distribution in the similarity space is crucial
  – similarity space with islands of reliability
    • network can learn islands
    • or network can learn to associate a form with the space between the islands
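A toy illustration of the ‘islands of reliability’ idea (my own sketch, not Hahn & Nakisa’s model; the prototypes and radius are made up): items close to an island take that class, items that fall between the islands take the default.

# Toy similarity space: each irregular plural class is an 'island' around a
# prototype; a noun close to some island takes that class, a noun that lands
# between the islands falls through to the default -s class.
ISLANDS = {                       # made-up 2-D prototypes, for illustration only
    "-(e)n plural": (0.2, 0.8),
    "umlaut plural": (0.8, 0.8),
}
DEFAULT = "-s plural"
RADIUS = 0.25

def distance(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

def classify(point):
    nearest = min(ISLANDS, key=lambda name: distance(point, ISLANDS[name]))
    return nearest if distance(point, ISLANDS[nearest]) < RADIUS else DEFAULT

print(classify((0.25, 0.75)))     # inside the -(e)n island
print(classify((0.5, 0.2)))       # far from every island -> default -s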
[Figures: similarity space for German plurals (Hahn & Nakisa 2000)]
Arabic Broken Plural
• CvCC
    nafs         nufuus          ‘soul’
    qidh         qidaah          ‘arrow’
• CvvCv(v)C
    xaatam       xawaatim        ‘signet ring’
    jaamuus      jawaamiis       ‘buffalo’
• Sound Plural
    shuway?ir    shuway?ir-uun   ‘poet (dim.)’
    kaatib       kaatib-uun      ‘writing (participle)’
    hind         hind-aat        ‘Hind (fem. name)’
    ramadaan     ramadaan-aat    ‘Ramadan (month)’
• How far can a model generalize to novel forms?
– All novel forms that it can represent
– Only some of the novel forms that it can represent
• Velar fricative [x], e.g., Bach
– Could the Lab 2b model generate the past tense for Bach?
Hebrew Word Formation
• Roots
    lmd    learning
    dbr    talking
• Word patterns
    CiCeC       limed       ‘he learned’
    CiCeC       diber       ‘he talked’
    CaCaC       lamad       ‘he studied’
    CiCuC       limud       ‘study’
    hitCaCeC    hitlamed    ‘he taught himself’
• English phonemes absent from Hebrew
  – j (as in jeep)
  – ch (as in chair)
  – th (as in thick)
  – w (as in wide)
  ← features absent from Hebrew
• Do speakers generalize the Obligatory Contour Principle (OCP)
constraint effects?
– XXY < YXX
– jjr < rjj
• Root position vs. word position
– *jjr
– jajartem
– hijtajartem  (hiCtaCaCtem)
Ratings derived from rankings for word-triples
1 = best, 3 = worst, scores subtracted from 4
Abstraction
• Phonological categories, e.g., /ba/
  – Treating different sounds as equivalent
  – Failure to discriminate members of the same category
  – Treating minimally different words as the same
  – Efficient memory encoding
• Morphological concatenation, e.g., V + ed
  – Productivity: generalization to novel words, novel sounds
  – Frequency-insensitivity in memory encoding
  – Association with other aspects of ‘procedural memory’
Gary Marcus
Generalization
• Training Items
  – Input: 1 0 1 0    Output: 1 0 1 0
  – Input: 0 1 0 0    Output: 0 1 0 0
  – Input: 1 1 1 0    Output: 1 1 1 0
  – Input: 0 0 0 0    Output: 0 0 0 0
• Test Item
  – Input: 1 1 1 1    Output: ? ? ? ?
    • 1 1 1 1 (Humans)
    • 1 1 1 0 (Network)
• Generalization fails because learning is local
Generalization
• Training Items
  – Input: 1 0 1 0    Output: 1 0 1 0
  – Input: 0 1 0 0    Output: 0 1 0 0
  – Input: 1 1 1 0    Output: 1 1 1 0
  – Input: 0 0 0 0    Output: 0 0 0 0
• Test Item
  – Input: 1 1 1 1    Output: ? ? ? ?
    • 1 1 1 1 (Humans)
    • 1 1 1 1 (Network)
• Generalization succeeds because representations are shared
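A minimal training sketch of why a network without shared representations answers 1 1 1 0 (my own construction in the spirit of Marcus’s point, not his exact simulations): each output unit’s weights are updated only by its own targets, and the rightmost unit never sees a 1 during training.

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Identity mapping, but the rightmost bit is 0 in every training item.
TRAIN = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 1, 0], [0, 0, 0, 0]]

random.seed(0)
# Four independent sigmoid output units, each with 4 input weights + a bias.
weights = [[random.uniform(-0.5, 0.5) for _ in range(5)] for _ in range(4)]

def predict(pattern):
    return [sigmoid(w[4] + sum(wi * xi for wi, xi in zip(w, pattern)))
            for w in weights]

for _ in range(5000):                        # plain online gradient descent
    pattern = random.choice(TRAIN)
    outputs = predict(pattern)
    for unit, out in enumerate(outputs):
        target = pattern[unit]               # target = copy the input bit
        delta = (out - target) * out * (1 - out)
        for i in range(4):
            weights[unit][i] -= 0.5 * delta * pattern[i]
        weights[unit][4] -= 0.5 * delta

print([round(o, 2) for o in predict([1, 1, 1, 1])])
# The first three units have learned to copy their input bit, but the fourth
# was only ever trained to output 0, so it stays near 0 for the novel input:
# the network answers roughly [1, 1, 1, 0] where humans say 1 1 1 1.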
Now another example…
Shared Representation

[Diagrams: Copying 1 and Copying 2 network architectures]

“The key to the representation of variables is whether all inputs
in a class are represented by a single node.”
Generalization
• “In each domain in which there is generalization, it is an empirical
question whether the generalization is restricted to items that closely
resemble training items or whether the generalization can be freely
extended to all novel items within some class.”
Syntax, Semantics, & Statistics
Starting Small Simulation
• How well does the network perform?
• How does it manage to learn?