Introduction to Computational Natural Language Learning
Linguistics 79400 (Under: Topics in Natural Language Processing)
Computer Science 83000 (Under: Topics in Artificial Intelligence)
The Graduate School of the City University of New York
Fall 2001
William Gregory Sakas
Hunter College, Department of Computer Science
Graduate Center, PhD Programs in Computer Science and Linguistics
The City University of New York

Meeting 4:

Notes:
My web page was a little messed up. Sorry about that! It should be OK now: www.hunter.cuny.edu/cs/Faculty/Sakas
There is a link to this course there, but we will probably move to the new Blackboard system soon.
I got some email asking about the details of how ANNs work. Yes, working out the math for a simple perceptron is fair game for a midterm question. A good link to check out: pris.comp.nus.edu.sg/ArtificialNeuralNetworks/perceptrons.html
And I will be happy to arrange to meet with people to go over the math (as I will today at the beginning of class).

Now we have to talk about learning. Training simply means the process by which the weights of the ANN are calculated by exposure to training data.

Supervised learning:

    Training data:          00  01  10  11
    Supervisor's answers:    0   1   1   0

One datum at a time. This is a bit simplified: in the general case it is possible to feed the learner batch data, but in the models we will look at in this course, data is fed one datum at a time.

The picture is this: an input comes from the training file, the ANN makes a prediction based on the current weights (which haven't converged yet), and the prediction is checked against the supervisor's file. Here the ANN predicts 0 where the supervisor's answer is 1. Ooops! Gotta go back and increase the weights so that the output unit fires.

Let's look at how we might train an OR unit.

First: set the weights to values picked out of a hat, and the bias activation to 1.

[Figure: a Boolean OR unit a9 with output f(net9). The bias unit a0 = 1 connects to a9 with weight w09 = -.3; the input units a7 and a8 connect to a9 with weights w79 = .5 and w89 = .1.]

Then: feed in 1,1. What does the network predict? net9 = -.3 + .5(1) + .1(1) = .3, and f(.3) = 1. The prediction is fine, so do nothing.

Now: feed in 0,1. What does the network predict? net9 = -.3 + .5(0) + .1(1) = -.2, so the ANN's prediction is f(-.2) = 0. But the supervisor's answer is 1 (remember we're doing Boolean OR), so now we've got to adjust the weights.

But how much to adjust? The modeler picks a value η = the learning rate. (Let's pick .1 for this example.) The weights are adjusted to reduce the error rate of the ANN.

Perceptron Learning Procedure:

    new wij = old wij + η (Supervisor's answer - ANN's prediction)

So, for example, if the ANN predicts 0 and the supervisor says 1:

    new wij = old wij + .1 (1 - 0)

I.e., all weights are increased by .1.
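A minimal sketch in Python of the perceptron learning procedure applied to the OR unit above. The starting weights, the learning rate of .1, and the names (a0, a7, a8, w09, w79, w89, net9) come from the example; the threshold function and the number of passes are my own illustrative choices. Note that the textbook perceptron rule used here also scales each update by the corresponding input activation, whereas the simplified rule on the slide nudges every weight by the same amount; for this example either version converges.

```python
def f(net):
    """Threshold activation: the unit fires (1) when its net input is positive."""
    return 1 if net > 0 else 0

# weights for [bias a0, input a7, input a8], i.e. w09, w79, w89 "picked out of a hat"
weights = [-0.3, 0.5, 0.1]
eta = 0.1  # learning rate chosen by the modeler

# Boolean OR training data: ((a7, a8), supervisor's answer)
training_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

for epoch in range(20):  # a few passes over the data is plenty here
    for (a7, a8), target in training_data:
        activations = [1, a7, a8]  # the bias unit a0 is always 1
        net9 = sum(w * a for w, a in zip(weights, activations))
        prediction = f(net9)
        error = target - prediction  # supervisor's answer minus ANN's prediction
        # standard perceptron rule: scale the update by each input activation
        weights = [w + eta * error * a for w, a in zip(weights, activations)]

print(weights)  # the trained unit now computes OR for all four inputs
```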
For multilayer ANNs, the error adjustment is backpropagated through the hidden layer.

[Figure: backpropagated adjustment for one unit. A unit z connects to units y and x with weights w1 and w2; y and x connect onward toward the output with weights w3 and w4.]

    e_y ≈ w3 (Supervisor's answer - ANN's prediction)
    e_x ≈ w4 (Supervisor's answer - ANN's prediction)
    e_z = w1 e_y + w2 e_x

Of course the error is calculated for ALL units.

In summary:
1) Multilayer ANNs are Universal Function Approximators: they can approximate any function a modern computer can represent.
2) They learn without explicitly being told any "rules" - they simply cut up the hypothesis space by inducing boundaries. Importantly, they are "non-symbolic" computational devices. That is, they simply multiply activations by weights.

So, what does all of this have to do with linguistics and language?

Some assumptions of "classical" language processing (roughly from Elman (1995)):
1) symbols and rules that operate over symbols (S, VP, IP, etc.)
2) static structures of competence (e.g. parse trees)

More or less, the classical viewpoint is language as algebra. ANNs make none of these assumptions, so if an ANN can learn language, then perhaps language as algebra is wrong. We're going to discuss the pros and cons of Elman's viewpoint in some depth next week, but for now, let's go over his variation of the basic feedforward ANN that we've been talking about.

Localist representation in a standard feedforward ANN:

[Figure: a feedforward network with one input node and one output node per word (boy, dog, run, book, see, eat, rock), connected through a layer of hidden nodes.]

Localist = each node represents a single item. If more than one output node fires, then a group of items can be considered activated. The basic idea is to activate a single input node (representing a word) and see which group of output nodes (words) is activated.

Elman's Simple Recurrent Network (SRN):

[Figure: the same word-per-node input and output layers, plus a context layer that holds a 1-to-1 exact copy of the hidden activations; all other connections are "regular" trainable weights.]

1) Activate from input to output as usual (one input word at a time), but copy the hidden activations to the context layer.
2) Repeat step 1 over and over - but activate the output layer from the input layer AND the context (copy) layer.

From Elman (1990). Templates were set up and lexical items were chosen at random from "reasonable" categories.

Categories of lexical items:
    NOUN-HUM      man, woman
    NOUN-ANIM     cat, mouse
    NOUN-INANIM   book, rock
    NOUN-AGRESS   dragon, monster
    NOUN-FRAG     glass, plate
    NOUN-FOOD     cookie, sandwich
    VERB-INTRAN   think, sleep
    VERB-TRAN     see, chase
    VERB-AGPAT    move, break
    VERB-PERCEPT  smell, see
    VERB-DESTROY  break, smash
    VERB-EAT      eat

Templates for sentence generator:
    NOUN-HUM VERB-EAT NOUN-FOOD
    NOUN-HUM VERB-PERCEPT NOUN-INANIM
    NOUN-HUM VERB-DESTROY NOUN-FRAG
    NOUN-HUM VERB-INTRAN
    NOUN-HUM VERB-TRAN NOUN-HUM
    NOUN-HUM VERB-AGPAT NOUN-INANIM
    NOUN-HUM VERB-AGPAT
    NOUN-ANIM VERB-EAT NOUN-FOOD
    NOUN-ANIM VERB-TRAN NOUN-ANIM
    NOUN-ANIM VERB-AGPAT NOUN-INANIM
    NOUN-ANIM VERB-AGPAT
    NOUN-INANIM VERB-AGPAT
    NOUN-AGRESS VERB-DESTROY NOUN-FRAG
    NOUN-AGRESS VERB-EAT NOUN-HUM
    NOUN-AGRESS VERB-EAT NOUN-ANIM
    NOUN-AGRESS VERB-EAT NOUN-FOOD

Training data: the resulting training and supervisor files were 27,354 words long, made up of 10,000 two- and three-word "sentences." The supervisor's file is simply the training file shifted by one word, so the target for each word is the next word in the stream:

    Training file:         woman smash plate cat move man break car boy move girl eat bread dog ...
    Supervisor's answers:  smash plate cat move man break car boy move girl eat bread dog move ...
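A minimal sketch of how a training file and supervisor file of this kind could be generated. The categories and templates are a trimmed subset of the tables above, the tiny sentence count stands in for Elman's 10,000, and the function and variable names are mine, not Elman's.

```python
import random

# a few of the categories and templates from the tables above
categories = {
    "NOUN-HUM": ["man", "woman"],
    "NOUN-ANIM": ["cat", "mouse"],
    "NOUN-FOOD": ["cookie", "sandwich"],
    "NOUN-FRAG": ["glass", "plate"],
    "VERB-EAT": ["eat"],
    "VERB-DESTROY": ["break", "smash"],
    "VERB-INTRAN": ["think", "sleep"],
}
templates = [
    ["NOUN-HUM", "VERB-EAT", "NOUN-FOOD"],
    ["NOUN-HUM", "VERB-DESTROY", "NOUN-FRAG"],
    ["NOUN-HUM", "VERB-INTRAN"],
]

def generate_stream(n_sentences):
    """Concatenate sentences made by filling templates with random lexical items."""
    words = []
    for _ in range(n_sentences):
        template = random.choice(templates)
        words.extend(random.choice(categories[slot]) for slot in template)
    return words

stream = generate_stream(5)          # Elman used 10,000 sentences (27,354 words)
training_file = stream[:-1]          # each input word ...
supervisors_file = stream[1:]        # ... is paired with the next word as its target
print(list(zip(training_file, supervisors_file)))
```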
Cluster (similarity) analysis:

Hidden activations for each word were averaged together. For simplicity, assume only 3 hidden nodes (in fact there were 150). After the SRN was trained, the file was run through the network and the activations at the hidden nodes were recorded (I made up these numbers for the example):

    boy    <.5 .3 .2>
    smash  <.4 .4 .2>
    plate  <.2 .3 .8>
    . . .
    dragon <.6 .1 .3>
    eat    <.1 .2 .4>
    boy    <.9 .9 .7>
    . . .
    boy    <.7 .6 .7>
    eat    <.4 .3 .6>
    cookie <.2 .3 .4>
    . . .

Now the average was taken for every word:

    boy    <.70 .60 .53>
    smash  <.40 .40 .20>
    plate  <.20 .30 .80>
    dragon <.60 .10 .30>
    eat    <.25 .25 .50>
    cookie <.20 .30 .40>

Each of these vectors represents a point in 3-D space. Some points are near to each other and form "clusters."

Hierarchical clustering:
1. Calculate the distance between all possible pairs of points in the space.
2. Find the closest two points.
3. Make them a single cluster - i.e., treat them as a single point.*
4. Recalculate all pairs of points (you will have one less point to deal with the first time you hit this step).
5. Go to step 2.

* Note there are many ways to treat clusters as single points. One could make a single point in the middle, one could calculate medians, etc. For Elman's study, I don't think it matters which he used; all would yield similar results, although this is just a guess on my part.

[Figure: Elman's hierarchical cluster diagram of the averaged hidden-unit vectors.]

In Elman's actual analysis, each of these words represents a point in 150-dimensional space, averaged from all activations generated by the network when processing that word. In the cluster diagram, each joint (where there is a connection) represents the distance between clusters. So, for example, the distance between animals and humans is approximately .85, and the distance between ANIMATES and INANIMATES is approximately 1.5.
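A minimal sketch of the hierarchical clustering steps above, run on the six made-up 3-D average vectors. Euclidean distance and the "single point in the middle" (midpoint) rule for merged clusters are my assumptions; as noted, the notes leave the choice of how to collapse a cluster to a point open.

```python
import math

# the six averaged hidden-unit vectors from the example above
points = {
    "boy":    [0.70, 0.60, 0.53],
    "smash":  [0.40, 0.40, 0.20],
    "plate":  [0.20, 0.30, 0.80],
    "dragon": [0.60, 0.10, 0.30],
    "eat":    [0.25, 0.25, 0.50],
    "cookie": [0.20, 0.30, 0.40],
}

def distance(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# start with every word as its own cluster; keys are tuples of member words
clusters = {(name,): vec[:] for name, vec in points.items()}

while len(clusters) > 1:
    # steps 1-2: calculate all pairwise distances and find the closest two points
    names = list(clusters)
    a, b = min(
        ((x, y) for i, x in enumerate(names) for y in names[i + 1:]),
        key=lambda pair: distance(clusters[pair[0]], clusters[pair[1]]),
    )
    d = distance(clusters[a], clusters[b])
    print(f"merge {' '.join(a)} + {' '.join(b)}  (distance {d:.2f})")
    # step 3: treat the merged pair as a single point (here, their midpoint)
    midpoint = [(x + y) / 2 for x, y in zip(clusters[a], clusters[b])]
    del clusters[a], clusters[b]
    clusters[a + b] = midpoint
    # steps 4-5: loop; the next pass recomputes distances with one fewer point
```

Each printed merge corresponds to a joint in the cluster diagram, and the printed distance is the height of that joint.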