Connectionism

Connectionism, the alternative paradigm, considers (in a similar way to computationalism or the “symbolic paradigm”) that the brain is a large neural network wherein computations over representations take place, the computations being a mapping of an input vector to an output vector. The essential difference between these approaches concerns the nature of representations: in connectionist theory, in the most important neural networks, the representations are distributed.

The Exclusive OR (XOR) problem: the Perceptron Convergence Procedure (Rosenblatt 1962), a two-layer network with the Hebbian rule (a more powerful variation: the delta rule), cannot solve the XOR problem. (See also Perceptrons, Minsky and Papert 1969.) The solution: internal representations (hidden units). (Elman et al., pp. 60-66)

According to Clark, there are 3 generations.

I. First generation: “The input units send signals to the hidden units. Each hidden unit computes its own outcome and then sends the signal to the output units. If a neural net were to model the whole human nervous system, the input units would be analogous to the sensory neurons, the output units to the motor neurons, and the hidden units to all other neurons.” (Garson 2007)

Other networks (Elman et al., p. 51): The pattern of activation in the network is determined by the weights on the connections between nodes. “It is in the weights that knowledge is progressively built up in a network.” (Elman et al., p. 51) Each input unit receives input external to the net. Input units send their activation value to hidden units. “Each of these hidden units calculates its own activation value depending on the activation values it receives from the input units.” (Garson 2007) Each hidden unit is sensitive to complex, often subtle, regularities that connectionists call microfeatures. Each layer of hidden units can be regarded as providing a particular distributed encoding of the input pattern (that is, an encoding in terms of a pattern of microfeatures). (Bechtel & Abrahamsen 2002, p. 42 or Hinton, McClelland and Rumelhart, 1986, PDP:3, pp. 801 in Bechtel 2002, p. 51) The same phenomena take place between hidden units and output units.

Essential for neural nets is that the activation of a net is determined by its weights, which can be positive (excitatory) or negative (inhibitory). The activation value for each receiving unit: “The function sums together the contributions of all sending units, where the contribution of a unit is defined as the weight of the connection between the sending and receiving units times the sending unit's activation value.” With a_j = the activation of node j that sends output to node i, and w_ij = the weight of their connection, the single input from j is w_ij a_j. Example: a node j with output 0.5 and a connection to i with weight -2.0 contributes 0.5 × (-2.0) = -1.0. A node typically receives inputs from many nodes, so the net input to node i, the total input it receives, is net_i = Σ_j w_ij a_j. (Elman et al. 1996, pp. 51-2) (See the sketch below.)

The output, as with neurons, is not the same as the input: what a node “does” (the response function) is the node's activation value, which can be a linear or nonlinear function of the net input (the sigmoid activation function or others). (Elman et al., p. 53) Nonlinear = “the numerical value of the output is not directly proportional to the sum of the inputs.” (Clark 2001, p. 63)
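A minimal numpy sketch of the net-input and activation computation just described; the single-connection values are the ones from the worked example above, and the three-unit case is invented for illustration:

```python
import numpy as np

def sigmoid(net):
    # Nonlinear response function: the output is not directly
    # proportional to the sum of the inputs.
    return 1.0 / (1.0 + np.exp(-net))

# Single contribution from sending node j to receiving node i: w_ij * a_j.
a_j = 0.5          # activation of sending node j
w_ij = -2.0        # negative weight = inhibitory connection
print(w_ij * a_j)  # -1.0, as in the worked example above

# Net input when node i receives from many senders: net_i = sum_j w_ij * a_j.
a = np.array([0.5, 1.0, 0.2])     # activations of three sending units (invented)
w_i = np.array([-2.0, 0.7, 1.5])  # weights into receiving unit i (invented)
net_i = np.dot(w_i, a)            # total input received by node i
print(sigmoid(net_i))             # node i's activation value
```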
Rumelhart and McClelland 1982: 3 layers of nodes, for word, letter, and orthographic feature. (Example: the node “trap” receives positively weighted input from the letter nodes “t”, “r”, “a”, and “p”, and is inhibited by other word nodes. Elman et al., p. 55)

One basic principle: similarity. If a network has classified a pattern (11110000) in a certain way, then it will tend to classify a novel pattern (11110001) in a similar way. (Elman et al., p. 59) Similarity → generalizations vs. the “tyranny of similarity”. (McLeod, Rolls, Plunkett 1996)

The task for neural nets is to find the weights that correspond to a particular task. One of the most widely used training methods is the backpropagation rule. (It does not correspond to human learning processes! Other methods: self-supervised learning, unsupervised learning.) (Elman et al., pp. 66-7) Step I: the error is the difference between the activation of a given output unit (actual output) and the activation it is supposed to have (target output). Step II: adjust the weights leading into those output units → decrease the error. “The local feedback … is provided by the supervisory system that determines whether a slight increase or decrease in a given weight would improve performance (assuming the other weights remain fixed). This procedure, repeated weight by weight and layer by layer, effectively pushes the system down a slope of decreasing error.” (Clark 2001, p. 65) The algorithm: “We propagate the error information (… = error signal) backwards in the network from output units to hidden units.” (Elman et al. 1996, p. 67) “… the ability to learn may change over time - not as a function of any explicit change in the mechanism, but rather as an intrinsic consequence of learning itself. The network learns, just as children do.” (Elman et al., p. 70) Learning as gradient descent in weight space. (Elman et al., pp. 71-2)
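A minimal sketch of backpropagation as gradient descent in weight space, here learning XOR, the problem that requires hidden units; the architecture (2-4-1), learning rate, and epoch count are illustrative choices, not a reconstruction of any net cited in these notes:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # XOR inputs
T = np.array([[0], [1], [1], [0]], dtype=float)              # target outputs

W1, b1 = rng.normal(0, 1, (2, 4)), np.zeros(4)   # input -> hidden weights
W2, b2 = rng.normal(0, 1, (4, 1)), np.zeros(1)   # hidden -> output weights
lr = 2.0                                         # step size down the error slope

for _ in range(10000):
    H = sigmoid(X @ W1 + b1)          # hidden-unit activations
    Y = sigmoid(H @ W2 + b2)          # actual output
    # Step I: error signal at the output (actual minus target, scaled by slope).
    dY = (Y - T) * Y * (1 - Y)
    # Propagate the error signal backwards from output units to hidden units.
    dH = (dY @ W2.T) * H * (1 - H)
    # Step II: adjust the weights to decrease the error (gradient descent).
    W2 -= lr * H.T @ dY; b2 -= lr * dY.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)

print(np.round(Y.ravel(), 2))  # typically ≈ [0, 1, 1, 0]; results vary with the init
```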
NETtalk (Sejnowski and Rosenberg 1986, 1987). The task: written input into coding for speech (grapheme-to-phoneme conversion). (Clark 2001, p. 63) DECtalk vs. NETtalk: a classical program (explicitly programmed, rules and exceptions) vs. a net that “learned to solve the problem using a learning algorithm and … example cases…” (p. 63) During learning, the speech output “progress[ed] from initial babble to semirecognizable words and syllable structure, to … a fair simulacrum of human speech.” (p. 63) However: no semantic depth.

Superpositional storage: “Two representations are fully superposed if the resources used to represent item 1 are coextensive with those used to represent item 2.” (Clark 1997, p. 169) The storage of two items is superpositional if the network encodes item 1 as a set of weightings and then goes on to encode the information about item 2 by amending the set of original weightings in a way that preserves the functionality (some desired input-output pattern) required to represent item 1 while simultaneously exhibiting the functionality required to represent item 2. (Clark 1997, p. 170) Superposition combines two characteristics: (1) the use of distributed representations; (2) the use of a learning rule that imposes a semantic metric on the acquired representations. (Clark 1997, p. 170) “… semantically related items are represented by syntactically related (partially overlapping) patterns of activation.” (Clark 2001, p. 66) (Example: cat and panther vs. fox.) Or: “The semantic (…) similarity between representational contents is echoed as a similarity between representational vehicles.” (Clark 1997, p. 171) → Prototype extraction (category or concept) + generalization.

Against the symbolic paradigm (Chomsky, Fodor, etc.), neural nets do not work with innate rules (and representations). Such rules appear as a natural effect of training (learning), so we have learning rules that impose the semantic metric on acquired representations. (Clark 1997, p. 171) In Bechtel & Abrahamsen 2002, the whole Chapter 5, “Are rules required to process representations?”, is dedicated to the same topic.

Intrinsic context-sensitivity, or Smolensky's “subsymbolic paradigm”: physical symbol system approaches display semantic transparency (familiar words + ideas, rendered as simple inner symbols).

Fodor and Pylyshyn (1988) against connectionism (Bechtel & Abrahamsen 2002, Chapter 6): LOT with compositionality, systematicity, and productivity. (See Week 4.) “Symbolic representations have a combinatorial syntax and semantics.” (Bechtel & Abrahamsen 2002, p. 157) Dennett: “The syntactic engine mimics a semantic engine.” (Bechtel & Abrahamsen 2002, p. 157) Fodor and Pylyshyn consider connectionism to lack a combinatorial syntax and semantics; for them, connectionism is a mere implementation of the symbolic system.

(Fodor and Pylyshyn 1988) vs. connectionism's “fine-grained context sensitivity”:

A representation of an item is given by a distributed pattern of activity that contains sub-patterns appropriate to the feature-set involved… [A] network will be able to represent several instances of such an item, which may differ in respect of one or more features. …[S]uch “near neighbors” will be represented by similar internal representational structures, that is, the vehicles of the several representations (activation patterns) will be similar to each other in ways that echo the semantic similarity of the cases – that is the semantic metric (see above) in operation. (Clark 1997, p. 174)

And thus “… the contentful elements in a subsymbolic program do not directly recapitulate the concepts we use ‘to consciously conceptualize the task domain’ (Smolensky, 1988, p. 5)” and “‘the units do not have the same semantics as words of natural language’ (p. 6).” (Clark 2001, p. 67) In Clark's words, the unit-level activation differences can mirror the details of various mental functions in interactions with “real-world contexts”. The knowledge comes from the training data → “post-training analysis” (statistical analysis and systematic interference).

Smolensky (1988): a connectionist state is a pattern of activity (within an activation space), which contains constituent subpatterns. A pattern of activity cannot be decomposed into conceptual constituents as in the symbolic paradigm. The connectionist decomposition is an approximate one: a complex pattern contains constituent subpatterns that are not defined precisely and exactly, but depend on context. The constituent structure of a subpattern is strongly influenced by the inner structure included within it. (See the example of a cup with coffee in Smolensky 1988.) The conceptual constituents of mental states are vectors of activity with a special kind of constituent structure: the activation of individual units. Connectionist representations have constituents, but these constituents are functional parts of the complex representations, not parts of a concatenative scheme; the constituent relations are not instantiated in a part-whole type of relation. While the classical approach deals with a type of concatenative compositionality, connectionism stresses functional compositionality (van Gelder 1990). For van Gelder, concatenation means “linking or ordering successive constituents without altering them in any way”, and such representations “must preserve tokens of an expression's constituents (and the sequential relations among tokens)” (p. 360).
Functional compositionality: the constituents of a representation can be recovered from it through certain operations, without being concatenated within it as tokens. (van Gelder 1990, p. 360 in Bechtel & Abrahamsen 2002, p. 170 and 6.3.1) His examples are Pollack's RAAM nets, Hinton's (1990) reduced descriptions of levels in hierarchical trees, and Smolensky's (1990) tensor product representations of binding relations.
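To make the tensor-product idea concrete, here is a minimal numpy sketch of role-filler binding by outer product; the role and filler vectors are invented (orthonormal roles make unbinding exact), not taken from Smolensky's own examples:

```python
import numpy as np

# Invented role vectors (orthonormal) and filler vectors.
role_agent   = np.array([1.0, 0.0])
role_patient = np.array([0.0, 1.0])
john = np.array([0.9, 0.1, 0.3])
mary = np.array([0.2, 0.8, 0.5])

# Bind each filler to its role with an outer (tensor) product and superpose:
# the result is a single distributed pattern containing no concatenated
# token of 'john' or 'mary' anywhere inside it.
binding = np.outer(john, role_agent) + np.outer(mary, role_patient)

# Functional compositionality: the constituents are recovered by an operation
# (a matrix-vector product with the role), not by pulling out a stored part.
print(binding @ role_agent)    # recovers john
print(binding @ role_patient)  # recovers mary
```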
The difference between the classical approach and connectionism:

In the symbolic paradigm the context of a symbol is manifest around it and consists of other symbols; in the subsymbolic paradigm the context of a symbol is manifest inside it, and consists of subsymbols. (Smolensky 1988, p. 17)

About the coffee example:

The compositional structure is there, but it's there in an approximate sense. It is not equivalent to taking a context-independent representation of coffee and a context-independent representation of cup – and certainly not equivalent to taking a context-independent representation of the relationship in or with – and sticking them all together in a symbolic structure, concatenating them together to form a syntactic compositional structure like “with (cup, coffee)”. (Smolensky, 1991, p. 208) (Clark 1997, p. 175)

Such nets “do not involve computations defined over symbols. Instead, any given accurate (i.e., fully predictive) picture of the system's processing will need to be given at the numerical level of units and weights and activation-evolution equation…”, and so there are no syntactically identifiable elements that both have a symbolic interpretation and can figure in a full explanation of the totality of the system's semantic good behaviour; that is, “There is no account of the architecture in which the same elements carry both the syntax and the semantics” (Smolensky, 1991, p. 204). (Clark 1997, p. 175) And:

Mental representations and mental processes are not supported by the same formal entities – there are not “symbols” that can do both jobs. The new cognitive architecture is fundamentally two-level; formal, algorithmic specification of processing mechanisms on the one hand, and semantic interpretation on the other, must be done at two different levels of description. (Smolensky, 1991, p. 203) (Clark 1997, p. 175)

In Smolensky's words, on one level mental processes are represented by “numerical level descriptions of units, weights and activation-evolution equation”. (Clark, p. 175) At this level we cannot find the semantic interpretation. On the other level, “large scale activity of such systems allows interpretation but the patterns thus fixed on are not capable of figuring in accurate descriptions of the actual course of processing. (See Smolensky, op. cit., p. 204)” (Clark, p. 176) The semantic metric of the system imposes a similarity of content where there is a similarity of vehicle (similar patterns). Clark emphasizes that such coding systems exploit “more highly structured syntactic vehicles than words.” →
- Economical use of representational resources.
- “Free” generalization: a new input, if it resembles an old one, will yield a response rooted in that partial overlap → sensible responses to new inputs are possible.
- Graceful degradation (the ability to produce sensible responses given some systemic damage), pattern completion, damage tolerance. (Clark, pp. 66-7)

Fodor and McLaughlin 1990 and McLaughlin 1993, against Smolensky; vs. Hadley and Hayward 1997, Christiansen and Chater 1994.

II. Second generation: the temporal structure. In 1990, 1991 and 1993, Elman created recurrent neural nets that have something more than classic nets: in addition to the feedforward flow of the signal from inputs to hidden units and finally to outputs, activation is sent back from the output or hidden units to context units that re-enter the net as input. In this way, the recurrent net stands in for human short-term memory. According to Bechtel & Abrahamsen (2002), related to time, sentences in language have two features: “(1) they are processed sequentially in time; (2) they exhibit long-distance dependencies, that is, the form of one word (or larger constituent) may depend on another that is located at an indeterminate distance.” (Verbs must agree with their subjects even when a relative clause intervenes between the subject and the verb.) (p. 179) For producing such sentences the net has to incorporate such relationships without using explicit representations of linguistic structures. (Elman et al., pp. 74-5, p. 81)

Elman's simple recurrent network (SRN, 1990): 150 hidden units linked to context units that incorporate information about previous words → the net processes sentences sequentially in time, grasping the dependencies between nonadjacent words (Bechtel & Abrahamsen 2002, p. 181). The task: to predict successive words in a sentence. The input: one word (localist representation) at a time. The output of the network: a prediction of the next word. After the network's output (the predicted word), the backpropagation rule adjusts the weights; then the next word is given as input. This process is reiterated over thousands of sentences. The hidden units define a high-dimensional space (a 150-dimensional hypercube): “the network would learn to represent words which ‘behave’ in similar ways (i.e., have similar distributional properties) with vectors which are close in this internal representation space.” (Elman et al., p. 94) This space cannot be visualized; therefore, a hierarchical clustering tree of the words' hidden unit activation patterns is used. (Elman et al., p. 96 or Clark 2001, pp. 68-73) It means “capturing the hidden unit activation pattern corresponding to each word, and then measuring the distance between each pattern and every other pattern. These inter-pattern distances are nothing more than the Euclidean distance between vectors in activation space” → the hierarchical clustering tree, “placing similar patterns close and low on the tree, and more distant groups on different branches.” (pp. 94-5) The resulting clusters: VERBS and NOUNS, animates and inanimates. (Context-sensitivity: “tokens of the same type are all spatially proximal, and closer to each other than to tokens of any other type.” Elman et al., p. 97) The net “discovered” categories - verbs, nouns, animate, inanimate - “properties that were good clues to grammatical role in the training corpus used.” (Clark, p. 71)

Elman uses “cluster analysis” and “principal component analysis” (PCA) for determining what the networks learned. For NETtalk, cluster analysis was used (the network learned a set of static distributed symbols → the relations of similarity and difference between static states); for an SRN, PCA in addition shows how states “can promote or impede movement into future states” = “temporally rich information-processing detail” → dynamic representation. (Clark 2001, pp. 71-2)
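A minimal sketch of one SRN time step and of the inter-pattern distance computation behind the cluster analysis just described; the layer sizes, random (untrained) weights, and one-hot "vocabulary" are illustrative stand-ins (Elman's net had 150 hidden units and a 23-word vocabulary):

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_in, n_hid, n_out = 5, 8, 5               # toy sizes (Elman: 150 hidden units)
rng = np.random.default_rng(1)
W_ih = rng.normal(0, 0.5, (n_hid, n_in))   # input -> hidden
W_ch = rng.normal(0, 0.5, (n_hid, n_hid))  # context -> hidden
W_ho = rng.normal(0, 0.5, (n_out, n_hid))  # hidden -> output

def srn_step(word_vec, context):
    # The hidden state mixes the current input with the prior state;
    # the hidden pattern is then copied back into the context units.
    hidden = sigmoid(W_ih @ word_vec + W_ch @ context)
    output = sigmoid(W_ho @ hidden)        # prediction of the next word
    return output, hidden                  # hidden becomes the next context

context = np.zeros(n_hid)
sentence = np.eye(n_in)                    # localist codes: one unit per word
hidden_patterns = []
for word in sentence:
    prediction, context = srn_step(word, context)
    hidden_patterns.append(context)        # record the internal state per word

# Cluster-analysis flavour: Euclidean distances between hidden-unit patterns;
# small distances = similar internal representations.
h = np.array(hidden_patterns)
dist = np.linalg.norm(h[:, None, :] - h[None, :, :], axis=-1)
print(np.round(dist, 2))
```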
There is no separate stage of lexical retrieval. There are no representations of words in isolation. The representations of words (the internal states following input of a word) always reflect the input taken together with the prior state ... the representations are not propositional and their information content changes constantly over time in accord with the demands of the current task. Words serve as guideposts which help establish mental states that support (desired) behaviour. (Elman, 1991b, p. 378 in Clark 2001, p. 72)

SRN: without any knowledge of semantics, the net learns to group the encodings of animate objects together only because they were distributed similarly in the training corpus. (Bechtel & Abrahamsen, p. 182)

Strong representational change vs. weak representational change (Fodor) = the concepts are innate, not learned. For Fodor, concept-learning involves two processes: “(1) triggering of an innate representational atom and/or (2) the deployment of such atoms in generate and test style learning.” Connectionism is against this image. Its models acquire “domain knowledge” only through learning mechanisms. (See the past tense learning net, Rumelhart and McClelland 1986.) However, the initial architecture (units, layers) = a kind of “little knowledge”, but not innate symbols. In a net the weights are essential, and their content depends on training the net. Bates and Elman put the proportions at roughly 90% learned and 10% innate. (1992 in Clark, p. 183) In a net, the training environment determines “both the knowledge and the processing profile acquired by a network.” There can be a kind of functional modularity: sets of units “that are powerfully connected among themselves and relatively weakly connected to units outside the set”. (Rumelhart and McClelland 1986b, p. 141 in Clark, p. 182) Through training there are qualitative changes in a network. (See the U-curve effect, Plunkett and Marchman 1991.) Essential is the “deep interpenetration of knowledge and processing characteristics” in a network: processing involves the weights creating patterns of activation → outputs, but these weights are the knowledge stored in the network. “And new knowledge has to be stored superpositionally, that is, by amending existing weights.” (p. 184) “Text (knowledge) and process (the use and alteration of knowledge) are thus inextricably intertwined.” And “Where the classicist thinks of mind as essentially static, recombinable text, the connectionist thinks of it as a highly fluid environmentally coupled dynamic process.” (p. 184)

Neural nets can perform various tasks. “Experiments on models of this kind have demonstrated an ability to learn such skills as face recognition, reading, and the detection of simple grammatical structure.” (Garson 2007) The first important task for a net after its training was to predict the irregular past tense of verbs (PDP Group, 1986). Other nets can recognize faces or associate images with labels (Plunkett & Marchman 1991, 1993 in Elman et al., pp. 124-129). “Another influential early connectionist model was a net trained by Rumelhart and McClelland (1986) to predict the past tense of English verbs.” [Elman et al., Chapter 3, pp. 131-7] → Regular vs. irregular verbs. Classical approach: there are two mechanisms (one for regular verbs, the other for irregular verbs). vs. Connectionism: only one mechanism (a single set of connections for regular and irregular). Rumelhart and McClelland used a single-layered network and the perceptron convergence procedure → such nets are capable of learning only problems that are linearly separable, but the past tense problem is a nonlinear one. (Elman et al. 1996, p. 137)
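A minimal sketch of the linear-separability limit just mentioned: a Rosenblatt-style perceptron convergence procedure masters AND (linearly separable) but can never master XOR; the epoch count is an illustrative choice:

```python
import numpy as np

def train_perceptron(X, T, epochs=25):
    # Rosenblatt's rule on a single-layer threshold unit: nudge the weights
    # toward the target whenever the unit misclassifies an input.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = 1 if np.dot(w, x) + b > 0 else 0
            w += (t - y) * x
            b += (t - y)
    return w, b

def accuracy(w, b, X, T):
    preds = [1 if np.dot(w, x) + b > 0 else 0 for x in X]
    return np.mean([p == t for p, t in zip(preds, T)])

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
AND = np.array([0, 0, 0, 1])   # linearly separable: learnable
XOR = np.array([0, 1, 1, 0])   # not linearly separable

print(accuracy(*train_perceptron(X, AND), X, AND))  # 1.0
print(accuracy(*train_perceptron(X, XOR), X, XOR))  # always below 1.0
```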
Rumelhart and McClelland (1986, PDP:18):

Lawful behaviour and judgments may be produced by a mechanism in which there is no explicit representation of the rule. Instead, we suggest that the mechanisms that process language and make judgments of grammaticality are constructed in such a way that their performance is characterizable by rule, but that the rules themselves are not written in explicit form anywhere in the mechanism. (1986a, p. 217) (Bechtel & Abrahamsen 2002, p. 121)

We have shown that a reasonable account of the acquisition of past tense can be provided without recourse to the notion of a “rule” as anything more than a description of the language. We have shown that, for this case, there is no induction problem. The child need not figure out what the rules are, nor even that there are rules. The child need not decide whether a verb is regular or irregular. … A uniform procedure is supplied as input to the past-tense network and the resulting pattern of activation is interpreted as a phonological representation of the past form of that verb. This is the procedure whether the verb is regular or irregular, familiar or novel. (1986a, p. 267) (Bechtel & Abrahamsen 2002, p. 135)

Debates: Pinker & Prince (1988) – the model does “… a poor job of generalizing to some novel regular verbs. … Nets may be good at making associations and matching patterns, but” they face “fundamental limitations in mastering general rules such as the formation of the regular past tense.” (Garson 2007) (See 5.3 in Bechtel & Abrahamsen 2002, pp. 135 ff.) “Despite Pinker and Prince's objections, many connectionists believe that generalization of the right kind is still possible (Niklasson and van Gelder 1994).” (Garson 2007) Plunkett and Marchman (1991, 1993) – a net with hidden units → the U-shaped curve, reproducing the patterns of error observed in children. (Elman et al., pp. 137-47 or Bechtel and Abrahamsen 5.4)

Differentiation at the behavioral level need not necessarily imply differentiation at the level of mechanism. Regular and irregular verbs can behave quite differently even though represented and processed similarly in the same device. (Elman et al., p. 139)

The net was first trained on a set containing a large number of irregular verbs, and later on a set of 460 verbs containing mostly regulars. The net learned the past tenses of the 460 verbs in about 200 rounds of training, and it generalized fairly well to verbs not in the training set. It even showed a good appreciation of “regularities” to be found among the irregular verbs (‘send’ / ‘sent’, ‘build’ / ‘built’; ‘blow’ / ‘blew’, ‘fly’ / ‘flew’). During learning, as the system was exposed to the training set containing more regular verbs, it had a tendency to overregularize, i.e., to combine both irregular and regular forms (‘break’ / ‘broked’, instead of ‘break’ / ‘broke’). This was corrected with more training. It is interesting to note that children are known to exhibit the same tendency to overregularize during language learning. (Garson 2007)

III. Third generation: “dynamical connectionism” (Wheeler 1994, Port and van Gelder 1995) - neurobiologically realistic features are added to the basic units and weights. (Clark, pp. 72-3) Some philosophers: neural nets, with their distributed representations, are similar to the brain's structure, so neural nets are an implementation of the mind; others: neural nets are the mind. Neural nets have strengths (motor control, pattern recognition) and weaknesses (planning and sequential logical derivation). (Clark, p. 73)

Connectionism vs. the classical approach:
- Connectionism eliminates the homunculus. The Belousov-Zhabotinsky reaction is a classical example of emergent behaviour.
“Connectionist models are attractive because they provide a computational framework for exploring the conditions under which such emergent properties occur.” (Elman et al., p. 85)
- Innateness (Chomsky, Fodor: mental representations and rules are innate) vs. learning: 3 classes of constraints (in Rethinking Innateness): representations, architecture, timing.
- Connectionist representations: local or distributed representations. (Elman, pp. 90-2) Distributed representations computationally encode information concerning similarities and differences. (Clark, p. 66) “A distributed pattern of activity can encode ‘microstructural’ information such that variations in the overall pattern reflect variations in the content.” (Clark, p. 66)
- No modularity for neural nets at the beginning of training. (Elman, pp. 100-1) → “There is a huge difference between starting modular and becoming modular.” (p. 101)
- Rules for neural nets: they are not capable of productive and systematic behavior, but they have certain “rules”, “since networks are function approximators and functions are nothing if not rules.” (Elman, p. 102)

Against connectionism (Clark 2001):

(a) Mental causation (Ramsey, Stich, and Garon, 1991 in Clark, pp. 73-6)
- “Propositional modularity”: in common talk, “functionally discrete semantically interpretable states that play a causal role in the production of behaviour” (p. 204, their emphasis, in Clark, p. 73)
- “Propositional modularity”: individual beliefs function as the discrete causes of specific actions. (Clark, p. 74)
- There is no such “propositional modularity” in distributed connectionist processing, mainly because of the use of “superpositional” information storage, which produces “total causal holism”. (Clark, pp. 74-5)

(b) Systematicity (Fodor and Pylyshyn 1988)
- Fodor's LOT: compositionality, systematicity and productivity. The systematicity of thought is an effect of the compositionally structured inner base, which includes manipulable inner expressions meaning “John”, “loves”, “Mary” and resources for combining them. (Clark, p. 77)
- Replies: (1) the classical approach is not the only way to support systematicity; (2) this property may derive from the grammatical structure of human language. (Clark, p. 77)
- Examples of systematicity in neural nets: Smolensky's tensor product, Chalmers's net (which uses recursive auto-associative memory – RAAM).

(c) Biological reality
1. The use of artificial tasks and the choice of input and output representations - the choice of problem domain and training materials - “horizontal microworlds”: parts of human cognition (past tense, simple grammars, etc.). (Clark, pp. 79-80)
2. Small resources of units and connections vs. the brain. (Clark, p. 80)
3. The enormous differences between neural nets and the brain. (p. 81)

Back to Elman's sentence-prediction net (Garson 2007): “The sentences were formed from a simple vocabulary of 23 words using a subset of English grammar. The grammar, though simple, posed a hard test for linguistic awareness. It allowed unlimited formation of relative clauses while demanding agreement between the head noun and the verb. So for example, in the sentence ‘Any man that chases dogs that chase cats … runs.’ the singular ‘man’ must agree with the verb ‘runs’ despite the intervening plural nouns (‘dogs’, ‘cats’) which might cause the selection of ‘run’. One of the important features of Elman's model is the use of recurrent connections. The values at the hidden units are saved in a set of so-called context units, to be sent back to the input level for the next round of processing.
This looping back from hidden to input layers provides the net with a rudimentary form of memory of the sequence of words in the input sentence. Elman's nets displayed an appreciation of the grammatical structure of sentences that were not in the training set. The net's command of syntax was measured in the following way. Predicting the next word in an English sentence is, of course, an impossible task. However, these nets succeeded, at least by the following measure. At a given point in an input sentence, the output units for words that are grammatical continuations of the sentence at that point should be active and output units for all other words should be inactive. After intensive training, Elman was able to produce nets that displayed perfect performance on this measure including sentences not in the training set.” (Garson 2007) (A toy version of this measure is sketched at the end of this section.)

“Marcus (1998, 2001) argues that Elman's nets are not able to generalize this performance to sentences formed from a novel vocabulary. This, he claims, is a sign that connectionist models merely associate instances, and are unable to truly master abstract rules. On the other hand, Phillips (2002) argues that classical architectures are no better off in this respect. The purported inability of connectionist models to generalize performance in this way has become an important theme in the systematicity debate.” (Garson 2007)

Distributed representations for complex expressions like ‘John loves Mary’ can be constructed that do not contain any explicit representation of their parts (Smolensky 1991). The information about the constituents can be extracted from the representations, but neural network models do not need to explicitly extract this information themselves in order to process it correctly (Chalmers 1990). This suggests that neural network models serve as counterexamples to the idea that the language of thought is a prerequisite for human cognition. However, the matter is still a topic of lively debate (Fodor 1997). (Garson 2007)
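A toy version of the grammaticality measure described above; the vocabulary, output activations, and threshold are invented for illustration, not taken from Elman's experiments:

```python
import numpy as np

# Hypothetical output activations over a tiny vocabulary after the net has
# read "Any man that chases dogs that chase cats ..."
vocab = ["man", "dogs", "cats", "runs", "run"]
output = np.array([0.04, 0.09, 0.07, 0.86, 0.11])  # invented values

# Grammatical continuations at this point: only the singular verb.
grammatical = {"runs"}
threshold = 0.5

# The measure: output units for grammatical continuations should be active,
# and output units for all other words should be inactive.
passes = all((output[i] > threshold) == (word in grammatical)
             for i, word in enumerate(vocab))
print("passes the grammaticality measure:", passes)
```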