Computational Models of Natural Language

Introduction to Computational Natural
Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York
Meeting 3:
Notes:
My Web page got a little messed up. Sorry about that! Should be OK
now. www.hunter.cuny.edu/cs/Faculty/Sakas
There is a link to this course, but we will probably move to the new
Blackboard system.
I got some email asking about the details of how ANNs work. Yes.
Working out the math for a simple perceptron is fair game for a midterm
question. A good link to check out:
pris.comp.nus.edu.sg/ArtificialNeuralNetworks/perceptrons.html
And I will be happy to arrange to meet with people to go over the math
(as I will today at the beginning of class).
Artificial Neural Networks: A Brief Review
[Figure: three ANN architectures - a) fully recurrent, b) feedforward, c) multi-component.]
A Perceptron

[Figure: input activations and a bias node (a fixed activation) feed into a threshold unit. If the sum of these inputs is great enough, the unit fires; that is to say, a positive activation occurs at the output.]
How can we implement the AND function?
First we must decide on a representation:
  possible inputs: 1, 0
  possible outputs: 1, 0
We want an artificial neuron to implement this function.

Boolean AND:
  unit inputs   unit output
  1 1           1
  0 1           0
  1 0           0
  0 0           0
A first attempt, with no weights: give the bias node a fixed activation of -1, and let
  net = Σ activations arriving at the threshold node

STEP activation function:
  f(x) = 1 if x >= 0
  f(x) = 0 if x < 0

  Inputs 1, 1:  net = 1 + 1 + (-1) = 1,  f(net) = 1. Correct.
  Inputs 0, 1:  net = 0 + 1 + (-1) = 0,  f(net) = 1. Oooops - Boolean AND wants output 0 here.

This is why we need weights.
The picture gets a little more complicated when we add weights. A weight
is simply a multiplier.

Boolean AND, with weights (inputs a7 = 1, a8 = 0):

  a0 = 1 (bias)   w09 = -.92
  a7 = 1          w79 = .76
  a8 = 0          w89 = .87

  ai   = activation of node i, where a0 is the bias node (fixed at 1)
  wij  = the weight between node ai and node aj
  netk = the result of summing all of the multiplications ai(wik) that enter node ak

  a8(w89) = 0(.87) = 0.0
  net9 = 0(.87) + 1(.76) + 1(-.92) = -0.16
  net9 = -0.16, which is less than 0, so a9 = f(net9) = 0
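A minimal Python sketch of this unit, using the weights from the figure above (the step activation follows the slide; the function names are just for illustration):

def step(x):
    # STEP activation: f(x) = 1 if x >= 0, else 0
    return 1 if x >= 0 else 0

def and_unit(a7, a8):
    w09, w79, w89 = -0.92, 0.76, 0.87   # bias weight and input weights from the figure
    a0 = 1                              # bias node, fixed activation
    net9 = a0 * w09 + a7 * w79 + a8 * w89
    return step(net9)

# Reproduces the slide: inputs 1, 0 give net9 = -0.16, so the output is 0.
for a7, a8 in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(a7, a8, "->", and_unit(a7, a8))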
Now work out on your own the resulting activations for inputs 1, 1:

  a0 = 1 (bias)   w09 = -.92
  a7 = 1          w79 = .76
  a8 = 1          w89 = .87

  net9 = ???? , so a9 = ????

For those that have had some exposure to this stuff: what is the bias
node really doing?
The picture gets yet a little more complicated when we change the activation
function to
  f(net) = 1 / (1 + e^(-net))

  a7 = 1     w79 = .75       a7(w79) = 1(.75) = .75
  a8 = .3    w89 = 1.6667    a8(w89) = .3(1.6667) = .5

  net9 = Σj aj(wj9) = .3(1.6667) + 1(.75) = 1.25
  a9 = f(net9) = 1 / (1 + e^(-1.25)) = .777

  (The outgoing weight w91 = .8 then passes along .777(.8) = .6216 to the next node.)
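A quick Python check of this arithmetic (a sketch; math.exp is the standard-library exponential):

import math

def sigmoid(net):
    # Logistic activation: 1 / (1 + e^(-net))
    return 1 / (1 + math.exp(-net))

a7, w79 = 1, 0.75
a8, w89 = 0.3, 1.6667
net9 = a7 * w79 + a8 * w89     # = 1.25
a9 = sigmoid(net9)             # ~ 0.777
print(net9, round(a9, 3))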
A hypothesis space for two inputs.

[Figure: the four input points <0,0>, <0,1>, <1,0>, <1,1> plotted at the corners of the unit square.]

We can think of the perceptron as drawing a line in the space that
separates the points in the hypothesis space. All instances of the AND
function are linearly separable into "true" and "false" regions of
the space.

[Figure: the same four points, with a line separating the "true" point <1,1> from the three "false" points.]
But how do we divide the space for XOR?

Boolean XOR:
  unit inputs   unit output
  1 1           0
  0 1           1
  1 0           1
  0 0           0

[Figure: the same four points. No single straight line can separate the "true" points <0,1> and <1,0> from the "false" points <0,0> and <1,1>.]
The fact that a perceptron couldn't
represent the XOR function stopped
ANN research for years (Minsky and
Papert's work in 1969 was particularly
damaging).
Early 1980's work on associative memory (Hopfield, 1982) revived interest,
and in the mid 1980's a simple, important innovation was introduced by
Rumelhart, Hinton and Williams (1986): multilayer networks with
backpropagation.
Boolean XOR, with a hidden layer:

  unit inputs   unit output
  1 1           0
  0 1           1
  1 0           1
  0 0           0

Try to figure out the weights (one possible set is sketched below).

[Network figure from: pris.comp.nus.edu.sg/ArtificialNeuralNetworks/perceptrons.html]
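One possible answer, as a Python sketch. These particular weights are a common textbook solution rather than the ones from the linked page: the two hidden units compute OR and AND of the inputs, and the output unit fires when OR is on but AND is off.

def step(x):
    return 1 if x >= 0 else 0

def xor_net(x1, x2):
    # Hidden layer (step units with hand-picked thresholds):
    h_or  = step(x1 + x2 - 0.5)   # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # Output unit: "OR but not AND"
    return step(h_or - h_and - 0.5)

for x1, x2 in [(1, 1), (0, 1), (1, 0), (0, 0)]:
    print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0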
Now we have to talk about learning.

Training simply means the process by which the weights of the ANN are
calculated by exposure to training data.

Supervised learning:

  Training data   Supervisor's answers
  0 0             0
  0 1             1
  1 0             1
  1 1             0

Data is fed to the learner one datum at a time. (This is a bit simplified. In
the general case it is possible to feed the learner batch data, but in the
models we will look at in this course, data is fed in one datum at a time.)

For each datum from the training file, the ANN makes a prediction based on the
current weights (which haven't converged yet), and that prediction is compared
against the answer from the supervisor's file. If, say, the ANN predicts 0 but
the supervisor's answer is 1: Ooops! Gotta go back and increase the weights so
that the output unit fires.
Let’s look at how we might train an OR unit.
First: set the weights to values picked out of a hat, and the bias activation
to 1. Then: feed in 1,1. What does the network predict?

Boolean OR (inputs 1, 1):

  a0 = 1 (bias)   w09 = -.3
  a7 = 1          w79 = .5
  a8 = 1          w89 = .1

  net9 = 1(-.3) + 1(.5) + 1(.1) = .3

The prediction is fine (f(.3) = 1), so do nothing.
Now: feed in 0,1. What does the network predict?

Boolean OR (inputs 0, 1):

  a0 = 1 (bias)   w09 = -.3
  a7 = 0          w79 = .5
  a8 = 1          w89 = .1

  net9 = 1(-.3) + 0(.5) + 1(.1) = -.2

Now we’ve got to adjust the weights. The ANN’s prediction is f(-.2) = 0,
but the supervisor’s answer is 1 (remember, we’re doing Boolean OR).

But how much to adjust? The modeler picks a value:
  η = learning rate (let’s pick .1 for this example)
The weights are adjusted to minimize the error rate of the ANN.
Perceptron Learning Procedure:
  wij = old wij + η (Supervisor’s answer - ANN’s prediction)

So, for example, if the ANN predicts 0 and the supervisor says 1:
  wij = old wij + .1 (1 - 0)
I.e., all weights are increased by .1.
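A sketch of this training loop in Python, using the slide's starting weights and learning rate. (Note: the update above adds η(answer - prediction) to every weight; the standard perceptron rule also multiplies by the input activation ai, but for this small example the simpler version converges too.)

def step(x):
    return 1 if x >= 0 else 0

# OR training data: (a7, a8) -> supervisor's answer
data = [((1, 1), 1), ((0, 1), 1), ((1, 0), 1), ((0, 0), 0)]

w09, w79, w89 = -0.3, 0.5, 0.1   # weights "picked out of a hat" (from the slide)
eta = 0.1                        # learning rate

for epoch in range(10):
    for (a7, a8), answer in data:
        net9 = 1 * w09 + a7 * w79 + a8 * w89
        prediction = step(net9)
        delta = eta * (answer - prediction)   # the slide's update rule
        w09 += delta
        w79 += delta
        w89 += delta

print(w09, w79, w89)
print([step(1 * w09 + a7 * w79 + a8 * w89) for (a7, a8), _ in data])  # [1, 1, 1, 0]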
For multilayer ANNs, the error is backpropagated through the hidden layer.

[Figure: a node z feeds two hidden nodes y and x via weights w1 and w2; y and x feed the output unit via weights w3 and w4.]

  ey ≈ w3 (Supervisor’s answer - ANN’s prediction)
  ex ≈ w4 (Supervisor’s answer - ANN’s prediction)
  ez = w1 ey + w2 ex          <- the backpropagated error
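In Python, the approximation above looks like this (a toy illustration; the weight values are made up, and the w1..w4 naming follows the sketch):

# Toy values, made up for illustration
w1, w2, w3, w4 = 0.4, -0.2, 0.7, 0.3
supervisors_answer, anns_prediction = 1, 0

output_error = supervisors_answer - anns_prediction

# Hidden nodes y and x receive error in proportion to their weight into the output unit:
e_y = w3 * output_error
e_x = w4 * output_error

# A node z that feeds y and x (via w1 and w2) gets the weighted sum of their errors:
e_z = w1 * e_y + w2 * e_x
print(e_y, e_x, e_z)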
In summary:
1) Multilayer ANNs are universal function approximators: they can
approximate any function a modern computer can represent.
2) They learn without explicitly being told any “rules” - they
simply cut up the hypothesis space by inducing boundaries.
Importantly, they are "non-symbolic" computational devices.
That is, they simply multiply activations by weights.
So, what does all of this have to do with linguistics and language?
Some assumptions of “classical” language processing
(roughly from Elman (1995))
1) symbols and rules that operate over symbols (S, VP, IP, etc.)
2) static structures of competence (e.g. parse trees)
3) structure is built up
More or less, the classical viewpoint is language as algebra.
ANNs make none of these assumptions, so if an ANN can learn
language, then perhaps the language-as-algebra viewpoint is wrong.
We’re going to discuss the pros and cons of Elman’s viewpoint in
some depth next week, but for now, let’s go over his
variation of the basic, feedforward ANN that we’ve been
addressing.
Localist representation in a standard feedforward ANN

[Figure: a layer of input nodes (one node per word: boy, dog, run, book, see, eat, rock), a layer of hidden nodes, and a layer of output nodes labeled with the same words.]

Localist = each node represents a single item. If more than one output node
fires, then a group of items can be considered activated.

The basic idea is to activate a single input node (representing a word) and see
which group of output nodes (words) is activated.
Elman’s Simple Recurrent Network

[Figure: the same localist word nodes at the input and output layers (boy, dog, run, book, see, eat, rock), plus a context layer that receives a 1-to-1 exact copy of the hidden-layer activations; all other connections are "regular" trainable weight connections.]

1) Activate from input to output as usual (one input word at a time), but
copy the hidden activations to the context layer.
2) Repeat 1 over and over, but activate from the input AND context (copy)
layers to the output layer.
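A minimal sketch of a couple of SRN steps in Python (assuming numpy; the layer sizes and random weights are placeholders, not Elman's actual settings):

import numpy as np

rng = np.random.default_rng(0)

n_words, n_hidden = 7, 5                        # one input/output node per word in the toy lexicon above
W_ih = rng.normal(size=(n_hidden, n_words))     # input   -> hidden ("regular" trainable weights)
W_ch = rng.normal(size=(n_hidden, n_hidden))    # context -> hidden (trainable)
W_ho = rng.normal(size=(n_words, n_hidden))     # hidden  -> output (trainable)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def srn_step(word_index, context):
    x = np.zeros(n_words)
    x[word_index] = 1.0                         # localist input: activate a single word node
    hidden = sigmoid(W_ih @ x + W_ch @ context)
    output = sigmoid(W_ho @ hidden)
    return output, hidden.copy()                # the hidden activations are copied 1-to-1 into the context layer

context = np.zeros(n_hidden)                    # context layer starts empty
for w in [0, 3]:                                # feed a two-word "sentence", one word at a time
    output, context = srn_step(w, context)
print(output)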
From Elman (1990): templates were set up, and lexical items were chosen at random
from "reasonable" categories.
Categories of lexical items
NOUN-HUM man, woman
NOUN-ANIM cat, mouse
NOUN-INANIM book, rock
NOUN-AGRESS dragon, monster
NOUN-FRAG glass, plate
NOUN-FOOD cookie, sandwich
VERB-INTRAN think, sleep
VERB-TRAN see, chase
VERB-AGPAT move, break
VERB-PERCEPT smell, see
VERB-DESTROY break, smash
VERB-EAT eat
Templates for sentence generator
NOUN-HUM VERB-EAT NOUN-FOOD
NOUN-HUM VERB-PERCEPT NOUN-INANIM
NOUN-HUM VERB-DESTROY NOUN-FRAG
NOUN-HUM VERB-INTRAN
NOUN-HUM VERB-TRAN NOUN-HUM
NOUN-HUM VERB-AGPAT NOUN-INANIM
NOUN-HUM VERB-AGPAT
NOUN-ANIM VERB-EAT NOUN-FOOD
NOUN-ANIM VERB-TRAN NOUN-ANIM
NOUN-ANIM VERB-AGPAT NOUN-INANIM
NOUN-ANIM VERB-AGPAT
NOUN-INANIM VERB-AGPAT
NOUN-AGRESS VERB-DESTROY NOUN-FRAG
NOUN-AGRESS VERB-EAT NOUN-HUM
NOUN-AGRESS VERB-EAT NOUN-ANIM
NOUN-AGRESS VERB-EAT NOUN-FOOD
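A sketch of such a sentence generator in Python. It uses only the example words listed in the category table above and the templates as given; the details of Elman's own generator may differ.

import random

# Example words from the category table above (Elman's full lexicon has more items per category).
categories = {
    "NOUN-HUM":     ["man", "woman"],
    "NOUN-ANIM":    ["cat", "mouse"],
    "NOUN-INANIM":  ["book", "rock"],
    "NOUN-AGRESS":  ["dragon", "monster"],
    "NOUN-FRAG":    ["glass", "plate"],
    "NOUN-FOOD":    ["cookie", "sandwich"],
    "VERB-INTRAN":  ["think", "sleep"],
    "VERB-TRAN":    ["see", "chase"],
    "VERB-AGPAT":   ["move", "break"],
    "VERB-PERCEPT": ["smell", "see"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-EAT":     ["eat"],
}

templates = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-PERCEPT", "NOUN-INANIM"],
    ["NOUN-HUM", "VERB-DESTROY", "NOUN-FRAG"],
    ["NOUN-HUM", "VERB-INTRAN"],
    ["NOUN-HUM", "VERB-TRAN", "NOUN-HUM"],
    ["NOUN-HUM", "VERB-AGPAT", "NOUN-INANIM"],
    ["NOUN-HUM", "VERB-AGPAT"],
    ["NOUN-ANIM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-ANIM", "VERB-TRAN", "NOUN-ANIM"],
    ["NOUN-ANIM", "VERB-AGPAT", "NOUN-INANIM"],
    ["NOUN-ANIM", "VERB-AGPAT"],
    ["NOUN-INANIM", "VERB-AGPAT"],
    ["NOUN-AGRESS", "VERB-DESTROY", "NOUN-FRAG"],
    ["NOUN-AGRESS", "VERB-EAT", "NOUN-HUM"],
    ["NOUN-AGRESS", "VERB-EAT", "NOUN-ANIM"],
    ["NOUN-AGRESS", "VERB-EAT", "NOUN-FOOD"],
]

def generate_sentence():
    # Pick a template at random, then fill each slot with a random word from its category.
    template = random.choice(templates)
    return [random.choice(categories[cat]) for cat in template]

# 10,000 two- and three-word "sentences", concatenated into one word stream:
corpus = [word for _ in range(10000) for word in generate_sentence()]
print(corpus[:14])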
Training data

Resulting training and supervisor files. Files were 27,354 words long,
made up of 10,000 two- and three-word "sentences."

  Training file   Supervisor's answers
  woman           smash
  smash           plate
  plate           cat
  cat             move
  move            man
  man             break
  break           car
  car             boy
  boy             move
  move            girl
  girl            eat
  eat             bread
  bread           dog
  dog             move
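As the table shows, the supervisor's file is just the training word stream shifted forward by one position: the "answer" for each word is simply the next word. A tiny Python illustration (with a toy word stream like the one above):

corpus = ["woman", "smash", "plate", "cat", "move", "man", "break", "car"]

training_file = corpus
supervisor_file = corpus[1:]   # the supervisor's answer for each word is the next word
print(list(zip(training_file, supervisor_file)))
# [('woman', 'smash'), ('smash', 'plate'), ...]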