Neural Networks

Functions
  Input        Output
  4, 4         8
  2, 3         5
  1, 9         10
  6, 7         13
  341, 257     598

Functions
  Input   Output
  rock    rock
  sing    sing
  alqz    alqz
  dark    dark
  lamb    lamb

Functions
  Input   Output
  0 0     0
  1 0     0
  0 1     0
  1 1     1

Functions
  Input   Output
  look    looked
  rake    raked
  sing    sang
  go      went
  want    wanted

Functions
  Input                        Output
  John left                    1
  Wallace fed Gromit           1
  Fed Wallace Gromit           0
  Who do you like Mary and?    0

Learning Functions
• In training, the network is shown examples of what the function generates, and has to figure out what the function is.
• Think of language/grammar as a very big function (or set of functions). The learning task is similar: the learner is presented with examples of what the function generates, and has to figure out what the system is.
• Main question in language acquisition: what does the learner need to know in order to successfully figure out what this function is?
• Questions about neural networks:
  – How can a network represent a function?
  – How can the network discover what this function is?

AND Network
  Input   Output
  0 0     0
  1 0     0
  0 1     0
  1 1     1

OR Network
  Input   Output
  0 0     0
  1 0     1
  0 1     1
  1 1     1

NETWORK CONFIGURED BY TLEARN
  # weights after 10000 sweeps
  # WEIGHTS
  # TO NODE 1
  -1.9083807468    ## bias to 1
   4.3717832565    ## i1 to 1
   4.3582129478    ## i2 to 1
   0.0000000000

2-layer XOR Network
• In order for the network to model the XOR function, we need activation of either of the inputs to turn the output node “on”, just as in the OR network. This was achieved easily by making the negative weight on the bias smaller in magnitude than the positive weight on either of the inputs. However, in the XOR network we also want turning both inputs on to turn the output node “off”. Since turning both inputs on can only increase the total input to the output node, and the output node is switched “off” only when it receives less input, this effect cannot be achieved.
• The XOR function is not linearly separable, and hence it cannot be represented by a two-layer network. This is a classic result in the theory of neural networks.

XOR Network
  -4.4429202080    ## bias to output
   9.0652370453    ## 1 to output
   8.9045801163    ## 2 to output

• The mapping from the hidden units to the output is an OR network that never receives a [1 1] input.

  Hidden units to output (an OR network)
  Input   Output
  0 0     0
  1 0     1
  0 1     1
  1 1     1

  Input to hidden unit 1
  Input   Output
  0 0     0
  1 0     1
  0 1     0
  1 1     0

  Input to hidden unit 2
  Input   Output
  0 0     0
  1 0     0
  0 1     1
  1 1     0

  -3.0456776619    ## bias to 1
   5.5165352821    ## i1 to 1
  -5.7562727928    ## i2 to 1

  -3.6789164543    ## bias to 2
  -6.4448370934    ## i1 to 2
   6.4957633018    ## i2 to 2
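To make these weights concrete, here is a minimal sketch (not part of the original lab materials) that runs the four input patterns through the weights listed above, assuming the standard logistic (sigmoid) activation. It reproduces the three tables: each hidden unit comes on for exactly one input pattern, and the output unit computes OR over the hidden units, which yields the XOR pattern.

import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Weights reported above for the XOR network
HIDDEN_1 = (-3.0456776619, 5.5165352821, -5.7562727928)   # bias, i1, i2 -> hidden 1
HIDDEN_2 = (-3.6789164543, -6.4448370934, 6.4957633018)   # bias, i1, i2 -> hidden 2
OUTPUT   = (-4.4429202080, 9.0652370453, 8.9045801163)    # bias, h1, h2 -> output

def unit(weights, a, b):
    """Weighted sum of two inputs plus bias, passed through the logistic function."""
    bias, w1, w2 = weights
    return logistic(bias + w1 * a + w2 * b)

for i1, i2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    h1 = unit(HIDDEN_1, i1, i2)    # "on" only for input 1 0
    h2 = unit(HIDDEN_2, i1, i2)    # "on" only for input 0 1
    out = unit(OUTPUT, h1, h2)     # OR over the two hidden units
    print(i1, i2, "->", round(out, 2))
# Activations come out near 0, 1, 1, 0 -- the XOR pattern.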
Learning Rate
• The learning rate, which is explained in chapter 1 (pp. 12-13), is a training parameter that determines how strongly the network responds to an error signal at each training cycle. The higher the learning rate, the bigger the change the network will make in response to a large error. Sometimes having a high learning rate is beneficial; at other times it can be quite disastrous for the network. An example of sensitivity to the learning rate can be found in the case of the XOR network discussed in chapter 4.
• Why should it be a bad thing to make big corrections in response to big errors? The reason is that the network is looking for the best general solution for mapping all of the input-output pairs, but it normally adjusts its weights in response to an individual input-output pair. Since the network has no knowledge of how representative any individual input-output pair is of the general trend in the training set, it would be rash for the network to respond too strongly to any individual error signal. By making many small responses to the error signals, the network learns a bit more slowly, but it is protected against being messed up by outliers in the data.

Momentum
• Just as with the learning rate, sometimes the learning algorithm can only find a good solution to a problem if the momentum training parameter is set to a specific value. What does this mean, and why should it make a difference?
• If momentum is set to a high value, then the weight changes made by the network are very similar from one cycle to the next. If momentum is set to a low value, then the weight changes made by the network can be very different on adjacent cycles. So what?

Momentum
• In searching for the best available configuration to model the training data, the network has no ‘knowledge’ of what the best solution is, or even whether there is a particularly good solution at all. It therefore needs some efficient and reliable way of searching the range of possible weight configurations for the best solution.
• One thing that can be done is for the network to test whether any small changes to its current weight configuration lead to improved performance. If so, it can make that change. It can then ask the same question in its new weight configuration, and again modify the weights if there is a small change that leads to improvement.
• This is a fairly effective way for a blind search to proceed, but it has inherent dangers: the network might come across a weight configuration that is better than all very similar configurations, but is not the best configuration of all. In this situation, the network finds that no small changes improve performance, and therefore does not modify its weights. It ‘thinks’ that it has reached an optimal solution, but this is an incorrect conclusion. This problem is known as getting stuck in a local maximum (or, in terms of error, a local minimum).

Momentum
• Momentum can help the network avoid local maxima by controlling the ‘scale’ at which the search for a solution proceeds. If momentum is set high, then changes in the weight configuration are very similar from one cycle to the next. A consequence of this is that early in training, when error levels are typically high, weight changes will be consistently large. Because weight changes are forced to be large, this can help the network avoid getting trapped in a local maximum.
• A decision about the momentum value to be used for learning amounts to a hypothesis about the nature of the problem being learned, i.e., it is a form of innate knowledge, although not of the kind that we are accustomed to dealing with.
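The learning rate and momentum parameters both enter the weight update used in backpropagation-style learning. The sketch below shows the standard textbook form of that update; the function and parameter names (and the numbers) are illustrative, not tlearn's internals.

def weight_change(gradient, previous_change, learning_rate=0.1, momentum=0.9):
    """Scale the current error gradient by the learning rate, then add a
    momentum-weighted copy of the previous change, so that successive
    changes stay similar from one cycle to the next."""
    return -learning_rate * gradient + momentum * previous_change

# Early in training the error (and gradient) is large, so the change is large...
d1 = weight_change(gradient=-2.0, previous_change=0.0)
# ...and with high momentum the next change stays large and keeps the same
# direction even if one outlier pattern pushes the gradient the other way.
d2 = weight_change(gradient=+0.3, previous_change=d1)
print(d1, d2)   # 0.2, then -0.03 + 0.18 = 0.15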
The Past Tense and Beyond

Classic Developmental Story
• Initial mastery of regular and irregular past tense forms
• Overregularization appears only later (e.g., goed, comed)
• ‘U-shaped’ developmental pattern taken as evidence for learning of a morphological rule: V + [+past] --> stem + /d/

Rumelhart & McClelland 1986
• Model learns to classify regulars and irregulars, based on sound similarity alone.
• Shows a U-shaped developmental profile.

What is really at stake here?
• Abstraction
• Operations over variables
  – Symbol manipulation
  – Algebraic computation
• Learning based on input
  – How do learners generalize beyond input? (e.g., y = 2x)

What is not at stake here
• Feedback, negative evidence, etc.

Who has the most at stake here?
• Those who deny the need for rules/variables in language have the most to lose here … if the English past tense is hard, just wait until you get to the rest of natural language!
• … but if they are successful, they bring with them a simple and attractive learning theory, and mechanisms that can readily be grounded at the neural level.
• However, if the advocates of rules/variables succeed here or elsewhere, they face the more difficult challenge at the neuroscientific level.

Pinker & Ullman

Beyond Sound Similarity
Regulars and Associative Memory
1. Are regulars different?
2. Do regulars implicate operations over variables?
Neuropsychological Dissociations
Other Domains of Morphology
(Pinker & Ullman 2002)

Beyond Sound Similarity
• Zero-derived denominals are regular
  – Soldiers ringed the city
  – *Soldiers rang the city
  – high-sticked, grandstanded, …
  – *high-stuck, *grandstood, …
• Productive in adults & children
• Shows sensitivity to morphological structure: [[ stem N] ø V]-ed
• Provides good evidence that sound similarity is not everything
• But nothing prevents a model from using a richer similarity metric
  – morphological structure (for ringed)
  – semantic similarity (for low-lifes)

Beyond Sound Similarity
Regulars and Associative Memory
1. Are regulars different?
2. Do regulars implicate operations over variables?
Neuropsychological Dissociations
Other Domains of Morphology

Regulars & Associative Memory
• Regulars are productive, need not be stored
• Irregulars are not productive, must be stored
• But are regulars immune to effects of associative memory?
  – frequency
  – over-irregularization
• Pinker & Ullman:
  – regulars may be stored
  – but they can also be generated on the fly
  – a ‘race’ can determine which of the two routes wins
  – some tasks are more likely to show effects of stored regulars

Child vs. Adult Impairments
• Specific Language Impairment
  – Early claims that regulars show greater impairment than irregulars are not confirmed
• Pinker & Ullman 2002b
  – ‘The best explanation is that language-impaired people are indeed impaired with rules, […] but can memorize common regular forms.’
  – Regulars show consistent frequency effects in SLI, but not in controls.
  – ‘This suggests that children growing up with a grammatical deficit are better at compensating for it via memorization than are adults who acquired their deficit later in life.’

Beyond Sound Similarity
Regulars and Associative Memory
1. Are regulars different?
2. Do regulars implicate operations over variables?
Neuropsychological Dissociations
Other Domains of Morphology

Neuropsychological Dissociations
• Ullman et al. 1997
  – Alzheimer’s disease patients
    • Poor memory retrieval
    • Poor irregulars
    • Good regulars
  – Parkinson’s disease patients
    • Impaired motor control, good memory
    • Good irregulars
    • Poor regulars
  – Striking correlation involving laterality of effect
• Marslen-Wilson & Tyler 1997
  – Normals
    • past tense primes stem
  – 2 Broca’s patients
    • irregulars prime stems
    • inhibition for regulars
  – 1 patient with bilateral lesion
    • regulars prime stems
    • no priming for irregulars or semantic associates

Morphological Priming
• Lexical Decision Task
  – CAT, TAC, BIR, LGU, DOG
  – press ‘Yes’ if this is a word
• Priming
  – facilitation in decision times when a related word precedes the target (relative to an unrelated control)
  – e.g., {dog, rug} - cat
• Marslen-Wilson & Tyler 1997
  – Regular: {jumped, locked} - jump
  – Irregular: {found, shows} - find
  – Semantic: {swan, hay} - goose
  – Sound: {gravy, sherry} - grave
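As a note on how the priming measure is scored: facilitation is the difference in mean decision time between unrelated-prime and related-prime trials for the same target. The sketch below illustrates the computation with made-up decision times; these are not data from Marslen-Wilson & Tyler 1997.

from statistics import mean

def priming_effect_ms(related_rts, unrelated_rts):
    """Positive = facilitation (the related prime speeds the decision);
    negative = inhibition (as reported for the Broca's patients' regulars)."""
    return mean(unrelated_rts) - mean(related_rts)

# Hypothetical lexical-decision times (ms) for the target 'jump',
# after the related prime 'jumped' vs. the unrelated control 'locked'.
related = [541, 560, 555]
unrelated = [583, 598, 590]
print(round(priming_effect_ms(related, unrelated)))   # ~38 ms of facilitation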
Neuropsychological Dissociations
• Bird et al. 2003
  – complain that arguments for a selective difficulty with regulars are confounded with the phonological complexity of the word-endings
• Pinker & Ullman 2002
  – the weight of evidence still supports the dissociation; Bird et al.’s materials contained additional confounds

Brain Imaging Studies
• Jaeger et al. 1996, Language
  – PET study of the past tense
  – Task: generate the past from the stem
  – Design: blocked conditions
  – Result: different areas of activation for regulars and irregulars
• Is this evidence decisive?
  – task demands very different
  – the difference could show up in a network
  – doesn’t implicate variables
• Münte et al. 1997
  – ERP study of violations
  – Task: sentence reading
  – Design: mixed
  – Result:
    • regulars: ~LAN
    • irregulars: ~N400
• Is this evidence decisive?
  – allows the possibility of comparison with other violations

[Figure: PET activations for Regular, Irregular, and Nonce conditions (Jaeger et al. 1996)]

Beyond Sound Similarity
Regulars and Associative Memory
1. Are regulars different?
2. Do regulars implicate operations over variables?
Neuropsychological Dissociations
Other Domains of Morphology
(Clahsen, 1999)

Low-Frequency Defaults
• German plurals
    die Straße     die Straßen
    die Frau       die Frauen
    der Apfel      die Äpfel
    die Mutter     die Mütter
    das Auto       die Autos
    der Park       die Parks
                   die Schmidts
• The -s plural is low frequency, used for loan-words, denominals, names, etc.
• Response
  – frequency is not the critical factor in a system that focuses on similarity
  – the distribution in the similarity space is crucial
  – similarity space with islands of reliability
    • a network can learn the islands
    • or a network can learn to associate a form with the space between the islands

[Figures: similarity space; German plurals (Hahn & Nakisa 2000)]

Arabic Broken Plural
• CvCC
  – nafs --> nufuus ‘soul’
  – qidh --> qidaah ‘arrow’
• CvvCv(v)C
  – xaatam --> xawaatim ‘signet ring’
  – jaamuus --> jawaamiis ‘buffalo’
• Sound Plural
  – shuway?ir --> shuway?ir-uun ‘poet (dim.)’
  – kaatib --> kaatib-uun ‘writing (participle)’
  – hind --> hind-aat ‘Hind (fem. name)’
  – ramadaan --> ramadaan-aat ‘Ramadan (month)’
• How far can a model generalize to novel forms?
  – All novel forms that it can represent
  – Only some of the novel forms that it can represent
• Velar fricative [x], e.g., Bach
  – Could the Lab 2b model generate the past tense for Bach?

Hebrew Word Formation
• Roots
  – lmd ‘learning’
  – dbr ‘talking’
• Word patterns
  – CiCeC: limed ‘he learned’
  – CiCeC: diber ‘he talked’
  – CaCaC: lamad ‘he studied’
  – CiCuC: limud ‘study’
  – hitCaCeC: hitlamed ‘he taught himself’
• English phonemes absent from Hebrew
  – j (as in jeep)
  – ch (as in chair)
  – th (as in thick)
  – w (as in wide) <-- features absent from Hebrew
• Do speakers generalize the Obligatory Contour Principle (OCP) constraint effects?
  – XXY < YXX
  – jjr < rjj
• Root position vs. word position
  – *jjr
  – jajartem
  – hijtajartem (hiCtaCaCtem)

[Chart: ratings derived from rankings for word-triples; 1 = best, 3 = worst, scores subtracted from 4]

Abstraction
• Phonological categories, e.g., /ba/
  – Treating different sounds as equivalent
  – Failure to discriminate members of the same category
  – Treating minimally different words as the same
  – Efficient memory encoding
• Morphological concatenation, e.g., V + ed
  – Productivity: generalization to novel words, novel sounds
  – Frequency-insensitivity in memory encoding
  – Association with other aspects of ‘procedural memory’
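The sketch below is a minimal caricature of the words-and-rules idea discussed above (stored irregulars plus a default rule operating over a verb variable, with a race between the two routes). It is not Pinker & Ullman's actual model: the real proposal states the rule phonologically (stem + /d/, with /t/~/d/~/id/ allomorphy), whereas this sketch simply concatenates -ed orthographically; the stored forms listed are illustrative.

# Hypothetical stored forms: irregulars, plus a high-frequency regular that may
# also be memorized (as Pinker & Ullman allow).
STORED_PAST = {
    "go": "went",
    "sing": "sang",
    "feed": "fed",
    "walk": "walked",
}

def past_tense(stem):
    """Race between memory retrieval and the default rule V + -ed."""
    if stem in STORED_PAST:       # route 1: associative memory wins the race
        return STORED_PAST[stem]
    return stem + "ed"            # route 2: symbolic rule over the variable 'stem'

print(past_tense("sing"))         # sang: stored irregular blocks the rule
print(past_tense("want"))         # wanted: regular, generated on the fly
print(past_tense("high-stick"))   # high-sticked: denominal, rule applies freely
print(past_tense("bach"))         # bached: novel sound [x], the rule does not care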
Gary Marcus

Generalization
• Training Items
  – Input: 1 0 1 0    Output: 1 0 1 0
  – Input: 0 1 0 0    Output: 0 1 0 0
  – Input: 1 1 1 0    Output: 1 1 1 0
  – Input: 0 0 0 0    Output: 0 0 0 0
• Test Item
  – Input: 1 1 1 1    Output: ? ? ? ?

Generalization (same training items)
• Test Item
  – Input: 1 1 1 1    Output: 1 1 1 1 (Humans)
  –                   Output: 1 1 1 0 (Network)
• Generalization fails because learning is local

Generalization (same training items)
• Test Item
  – Input: 1 1 1 1    Output: 1 1 1 1 (Humans)
  –                   Output: 1 1 1 1 (Network)
• Generalization succeeds because representations are shared

Now another example…

Shared Representation
[Figure: two copying networks, Copying 1 vs. Copying 2]
• “The key to the representation of variables is whether all inputs in a class are represented by a single node.”

Generalization
• “In each domain in which there is generalization, it is an empirical question whether the generalization is restricted to items that closely resemble training items or whether the generalization can be freely extended to all novel items within some class.”

Syntax, Semantics, & Statistics

Starting Small Simulation
• How well does the network perform?
• How does it manage to learn?
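Returning to the generalization example above: the sketch below trains a one-layer network of independent logistic output units on the four training items (where the rightmost bit is always 0) and then tests it on 1 1 1 1. Because the weights from the rightmost input are never updated during training, the network answers 1 1 1 0, illustrating why local learning fails to generalize where a shared copy operation over a variable would succeed. The network size, learning rate, and training regime are illustrative choices, not those of a specific published simulation.

import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Four training items: outputs copy the inputs, but the rightmost bit is always 0.
TRAIN = [
    ([1, 0, 1, 0], [1, 0, 1, 0]),
    ([0, 1, 0, 0], [0, 1, 0, 0]),
    ([1, 1, 1, 0], [1, 1, 1, 0]),
    ([0, 0, 0, 0], [0, 0, 0, 0]),
]

N = 4
weights = [[0.0] * N for _ in range(N)]   # weights[j][i]: input i -> output j
biases = [0.0] * N
LR = 0.5

def forward(x):
    return [logistic(biases[j] + sum(weights[j][i] * x[i] for i in range(N)))
            for j in range(N)]

# Simple error-correction learning: each weight moves in proportion to the
# output unit's error and its input activation. Inputs that are 0 contribute
# nothing, so the weights from the rightmost input never change.
for _ in range(5000):
    for x, target in TRAIN:
        y = forward(x)
        for j in range(N):
            err = target[j] - y[j]
            biases[j] += LR * err
            for i in range(N):
                weights[j][i] += LR * err * x[i]

print([round(v, 2) for v in forward([1, 1, 1, 1])])
# Prints values close to [1, 1, 1, 0]: the network's answer, not the human one.
# A shared copy operation over a variable ('whatever digit fills this slot')
# generalizes freely to the novel rightmost 1:
print([bit for bit in [1, 1, 1, 1]])      # [1, 1, 1, 1]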