Computational Models of Natural Language

Introduction to Computational Natural
Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York
Meeting 4:
Notes:
My Web page was a little messed up. Sorry about that! Should be OK
now. www.hunter.cuny.edu/cs/Faculty/Sakas
There is a link to this course, but we will probably move to the new
blackboard system soon.
I got some email asking about the details of how ANNs work. Yes, working
out the math for a simple perceptron is fair game for a midterm question.
A good link to check out:
pris.comp.nus.edu.sg/ArtificialNeuralNetworks/perceptrons.html
And I will be happy to arrange to meet with people to go over the math
(as I will today at the beginning of class).
Now we have to talk about learning.
Training simply means the process by which
the weights of the ANN are calculated by
exposure to training data.
Supervised learning:

  Training data    Supervisor's answers
  0 0              0
  0 1              1
  1 0              1
  1 1              0

The training data is fed to the learner one datum at a time. (This is a bit
simplified. In the general case, it is possible to feed the learner batch
data, but in the models we will look at in this course, data is fed one
datum at a time.)

For each datum, the ANN makes a prediction based on the current weights
(which haven't converged yet). For example, the ANN might predict 0 for a
datum from the training file while the supervisor's file says the answer is
1. Ooops! Gotta go back and increase the weights so that the output unit
fires.
Let's look at how we might train an OR unit.
First: set the weights to values picked out of a hat,
and the bias activation to 1. Then: feed in 1,1.
What does the network predict?

[Figure: a Boolean OR unit. Bias unit a0 = 1 connects to the output with
weight w09 = -.3; input units a7 = 1 and a8 = 1 connect with weights
w79 = .5 and w89 = .1; the output unit a9 computes f(net9).]

net9 = 1(-.3) + 1(.5) + 1(.1) = .3
The prediction is fine (f(.3) = 1) so do nothing.
Now: feed in 0,1. What does the network predict?

[Figure: the same Boolean OR unit, now with input activations a7 = 0 and
a8 = 1; the bias a0 = 1 and the weights w09 = -.3, w79 = .5, w89 = .1 are
unchanged; the output unit a9 computes f(net9).]

net9 = 1(-.3) + 0(.5) + 1(.1) = -.2
Now we've got to adjust the weights. The ANN's prediction = 0 = f(-.2),
but the supervisor's answer = 1 (remember we're doing Boolean OR).
But how much to adjust? The modeler picks a value:
η = the learning rate (let's pick .1 for this example).
The weights are adjusted to minimize the error rate of the ANN.

Perceptron Learning Procedure:
new wij = old wij + η (Supervisor's answer - ANN's prediction)

So for example, if the ANN predicts 0 and the supervisor says 1:
new wij = old wij + .1 (1 - 0)
I.e. all weights are increased by .1.
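Here is a minimal Python sketch of this training loop applied to the OR unit
above. The activation and weight names mirror the diagram (a7, a8, w09, w79,
w89), the threshold activation f is assumed from the slides (the unit fires
when its net input is positive), and the update is the slide's simplified
rule, which nudges every weight by η(answer - prediction); the textbook
perceptron rule would also multiply the update by the input activation.

# Minimal sketch of the perceptron learning procedure for the Boolean OR unit.

def f(net):
    """Threshold activation: the unit fires if its net input is positive."""
    return 1 if net > 0 else 0

# Initial weights "picked out of a hat": bias weight w09, then w79, w89.
weights = [-0.3, 0.5, 0.1]
eta = 0.1  # learning rate chosen by the modeler

# Boolean OR training data with the supervisor's answers.
training_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for epoch in range(10):
    errors = 0
    for (a7, a8), answer in training_data:
        activations = [1, a7, a8]          # bias activation is always 1
        net = sum(w * a for w, a in zip(weights, activations))
        prediction = f(net)
        if prediction != answer:
            errors += 1
            # Adjust all weights by eta * (answer - prediction).
            weights = [w + eta * (answer - prediction) for w in weights]
    if errors == 0:
        break

print(epoch + 1, weights)   # converges after a few passes through the data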
For multilayer ANNs, the error adjustment is backpropagated through the
hidden layer.

[Figure: backpropagated adjustment for one unit. A unit z feeds two hidden
units, y and x, through weights w1 and w2; y and x feed the output unit
through weights w3 and w4.]

ey (approx.) = w3 (Supervisor's answer - ANN's prediction)
ex (approx.) = w4 (Supervisor's answer - ANN's prediction)
ez = w1 ey + w2 ex

Backpropagated adjustment for one unit. Of course the error is calculated
for ALL units.
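To make the arithmetic concrete, here is the same error propagation written
out in Python. The weight values for w1-w4 are made-up numbers chosen only
for illustration.

output_error = 1 - 0                     # supervisor's answer - ANN's prediction
w1, w2, w3, w4 = 0.2, -0.4, 0.5, 0.3     # hypothetical weights from the figure

e_y = w3 * output_error                  # error credited to hidden unit y
e_x = w4 * output_error                  # error credited to hidden unit x
e_z = w1 * e_y + w2 * e_x                # error credited to unit z, one layer back

# Each weight can then be adjusted using the error of the unit it feeds,
# just as the output weights were adjusted using the output error. (A full
# backpropagation implementation also multiplies in the derivative of the
# activation function; this follows the slide's simplified presentation.)
print(e_y, e_x, e_z)                     # approximately 0.5 0.3 -0.02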
In summary:
1) Multilayer ANNs are Universal Function Approximators - they can
approximate any function a modern computer can represent.
2) They learn without explicitly being told any "rules" - they simply cut up
the hypothesis space by inducing boundaries.
Importantly, they are "non-symbolic" computational devices. That is, they
simply multiply activations by weights.
So, what does all of this have to do with linguistics and language?
Some assumptions of “classical” language processing
(roughly from Elman (1995))
1) symbols and rules that operate over symbols (S, VP, IP, etc)
2) static structures of competence (e.g. parse trees)
More or less, the classical viewpoint is language as algebra
ANNs make none of these assumptions, so if an ANN can learn
language, then perhaps language as algebra is wrong.
We’re going to discuss the pros and cons of Elman’s viewpoint in
some depth next week, but for now, let’s go over his
variation of the basic, feedforward ANN that we’ve been
talking about.
Localist representation in a standard feedforward ANN
[Figure: a standard feedforward ANN with three layers. Each input node and
each output node stands for a single word (boy, dog, run, book, see, eat,
rock); between them is a layer of hidden nodes.]
Localist = each node represents a single item. If more than one output node
fires, then a group of items can be considered activated.
The basic idea is to activate a single input node (representing a word) and
see which group of output nodes (words) is activated.
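As a small illustration, here is what a localist input vector looks like in
Python for the seven-word vocabulary in the figure (the helper function is my
own, just for exposition):

# Localist ("one-hot") encoding: each input node stands for exactly one word,
# so presenting a word to the network means turning on that single node.
vocabulary = ["boy", "dog", "run", "book", "see", "eat", "rock"]

def localist_vector(word):
    # 1 on the node for `word`, 0 on every other node
    return [1 if w == word else 0 for w in vocabulary]

print(localist_vector("dog"))   # [0, 1, 0, 0, 0, 0, 0]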
Elman's Simple Recurrent Network (SRN)
[Figure: the same localist input and output layers (one node per word: boy,
dog, run, book, see, eat, rock) and a hidden layer, plus a context layer.
The hidden activations are copied into the context layer by 1-to-1 exact
copy connections; the context layer feeds back into the hidden layer through
"regular" trainable weight connections.]
1) Activate from input to output as usual (one input word at a time), but
copy the hidden activations to the context layer.
2) Repeat 1 over and over - but now activation flows from the input AND the
copy (context) layer through to the output layer.
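Below is a minimal sketch of this forward pass in Python (using numpy). The
vocabulary, the number of hidden units, and the sigmoid activation are my own
illustrative choices, not Elman's exact settings, and the weight-training
(backpropagation) step is omitted; the point is just the copy of the hidden
activations into the context layer at each step.

import numpy as np

vocab = ["boy", "dog", "run", "book", "see", "eat", "rock"]
n_in, n_hidden = len(vocab), 5

rng = np.random.default_rng(0)
W_ih = rng.normal(0, 0.1, (n_hidden, n_in))      # input   -> hidden weights
W_ch = rng.normal(0, 0.1, (n_hidden, n_hidden))  # context -> hidden weights
W_ho = rng.normal(0, 0.1, (n_in, n_hidden))      # hidden  -> output weights

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def one_hot(word):
    v = np.zeros(n_in)
    v[vocab.index(word)] = 1.0
    return v

context = np.zeros(n_hidden)                  # context layer starts out empty
for word in ["boy", "see", "dog"]:            # one input word at a time
    hidden = sigmoid(W_ih @ one_hot(word) + W_ch @ context)
    output = sigmoid(W_ho @ hidden)           # prediction of the next word
    context = hidden.copy()                   # 1-to-1 copy for the next step
    print(word, "->", vocab[int(output.argmax())])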
From Elman (1990): templates were set up and lexical items were chosen at
random from "reasonable" categories.

Categories of lexical items:
NOUN-HUM      man, woman
NOUN-ANIM     cat, mouse
NOUN-INANIM   book, rock
NOUN-AGRESS   dragon, monster
NOUN-FRAG     glass, plate
NOUN-FOOD     cookie, sandwich
VERB-INTRAN   think, sleep
VERB-TRAN     see, chase
VERB-AGPAT    move, break
VERB-PERCEPT  smell, see
VERB-DESTROY  break, smash
VERB-EAT      eat

Templates for sentence generator:
NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD
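Here is a short Python sketch of such a template-driven generator. The code
itself is my own illustration; only the categories and templates come from
the table above, and only a few templates are listed to keep the sketch
short.

import random

categories = {
    "NOUN-HUM": ["man", "woman"],        "NOUN-ANIM": ["cat", "mouse"],
    "NOUN-INANIM": ["book", "rock"],     "NOUN-AGRESS": ["dragon", "monster"],
    "NOUN-FRAG": ["glass", "plate"],     "NOUN-FOOD": ["cookie", "sandwich"],
    "VERB-INTRAN": ["think", "sleep"],   "VERB-TRAN": ["see", "chase"],
    "VERB-AGPAT": ["move", "break"],     "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],  "VERB-EAT": ["eat"],
}

templates = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"],
    ["NOUN-HUM", "VERB-INTRAN"],
    ["NOUN-AGRESS", "VERB-EAT", "NOUN-HUM"],
    # ... the remaining templates from the table above
]

def generate_sentence():
    # Pick a template, then fill each slot with a random word of that category.
    template = random.choice(templates)
    return [random.choice(categories[slot]) for slot in template]

# Concatenate 10,000 generated "sentences" into one long word stream, as in
# Elman's training file (no sentence boundaries are marked).
stream = [word for _ in range(10000) for word in generate_sentence()]
print(stream[:9])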
Training data

Resulting training and supervisor files. The files were 27,354 words long,
made up of 10,000 two- and three-word "sentences."

Training data (one word at a time):
woman smash plate cat move man break car boy move girl eat bread dog ...

Supervisor's answers (the next word in the stream):
smash plate cat move man break car boy move girl eat bread dog move ...
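The relationship between the two files is just a one-word shift: the
supervisor's answer for each input word is the next word in the stream. A
tiny Python sketch of this pairing (my own illustration):

stream = ["woman", "smash", "plate", "cat", "move", "man", "break", "car"]

def next_word_pairs(words):
    """Pair each word with the word that follows it in the stream."""
    return list(zip(words[:-1], words[1:]))

for input_word, answer in next_word_pairs(stream):
    print(input_word, "->", answer)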
Cluster (Similarity) analysis
The hidden-layer activations for each word were averaged together.

For simplicity assume only 3 hidden nodes (in fact there were 150). After
the SRN was trained, the training file was run through the network and the
activations at the hidden nodes were recorded for every word token (I made
up these numbers for the example):

boy    <.5, .3, .2>
smash  <.4, .4, .2>
plate  <.2, .3, .8>
.
.
.
dragon <.6, .1, .3>
eat    <.1, .2, .4>
boy    <.9, .9, .7>
.
.
.
boy    <.7, .6, .7>
eat    <.4, .3, .6>
cookie <.2, .3, .4>
.
.
.
Now the average was taken for every word:

boy    <.70, .60, .53>
smash  <.40, .40, .20>
plate  <.20, .30, .80>
dragon <.60, .10, .30>
eat    <.25, .25, .50>
cookie <.20, .30, .40>
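The averaging step itself is just a component-wise mean over all of a word's
recorded hidden vectors, as in this small sketch (using the made-up numbers
from the table above):

import numpy as np

recorded = {
    "boy":    [(.5, .3, .2), (.9, .9, .7), (.7, .6, .7)],
    "smash":  [(.4, .4, .2)],
    "eat":    [(.1, .2, .4), (.4, .3, .6)],
    # ... one entry per word in the training file
}

# Average every word's vectors component-wise.
averages = {word: np.mean(vectors, axis=0) for word, vectors in recorded.items()}
print(averages["boy"])   # approximately [0.70, 0.60, 0.53]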
Each of these vectors represents
a point in 3-D space.
Some points are near to each
other and form "clusters"
Hierarchical Clustering (a code sketch of the procedure follows below):
1. calculate the distance between all possible pairs of points in the space.
2. find the closest two points.
3. make them a single cluster - i.e. treat them as a single point.*
4. recalculate the distances between all pairs of points (you will have one
less point to deal with the first time you hit this step).
5. go to step 2.
* note there are many ways to treat clusters as single points. One could make
a single point in the middle, one could calculate medians, etc. For Elman's
study, I don't think it matters which he used; all would yield similar
results, although this is just a guess on my part.
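Here is a minimal Python sketch of that procedure, run on the six averaged
word vectors from the toy example above. Merged clusters are replaced by
their centroid (one of the choices mentioned in the footnote) and distances
are ordinary Euclidean distance; this is my own illustration, not Elman's
code.

import numpy as np

points = {
    "boy":    np.array([.70, .60, .53]),
    "smash":  np.array([.40, .40, .20]),
    "plate":  np.array([.20, .30, .80]),
    "dragon": np.array([.60, .10, .30]),
    "eat":    np.array([.25, .25, .50]),
    "cookie": np.array([.20, .30, .40]),
}

# Start with every word as its own cluster (keys are tuples of word names).
clusters = {(name,): vec for name, vec in points.items()}
while len(clusters) > 1:
    # Steps 1-2: find the closest pair of cluster points.
    names = list(clusters)
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    a, b = min(pairs, key=lambda p: np.linalg.norm(clusters[p[0]] - clusters[p[1]]))
    dist = np.linalg.norm(clusters[a] - clusters[b])
    print("merge", a, "+", b, "at distance", round(float(dist), 2))
    # Step 3: treat the merged pair as a single point (here: the centroid).
    clusters[a + b] = (clusters[a] + clusters[b]) / 2
    del clusters[a], clusters[b]
    # Steps 4-5: the loop recalculates distances and repeats until one cluster remains.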
[Figure: Elman's hierarchical cluster diagram (dendrogram) of the averaged
hidden-layer vectors.]
Each of these words represents a point in 150-dimensional space, averaged
from all activations generated by the network when processing that word.
Each joint (where there is a connection) represents the distance between
clusters. So, for example, the distance between animals and humans is
approx. .85 and the distance between ANIMATES and INANIMATES is approx. 1.5.