Ontogenetic versus Phylogenetic Learning of the Emergence of Phonetic Categories

Robert I. Damper
Image, Speech and Intelligent Systems (ISIS) Research Group
Department of Electronics and Computer Science
University of Southampton, Southampton SO17 1BJ, UK

Introduction

The scientific study of spoken language confronts two central problems: (1) How is the continuous sound pressure wave of speech decoded by a listener into the discrete percept of a linguistic message? (2) How does each individual listener acquire the particular form of sound system, morphology, grammar, etc. appropriate to the language of his/her social community? The first of these is the classical problem of categorical perception (Harnad 1987) and the second is that of language acquisition (Wexler & Culicover 1980; Pinker 1984). Both have been intensively studied, and both have given rise to intense debate and controversy. They are generally held to be rather different problems; plainly, however, each makes little or no sense by itself. Language is acquired rapidly and robustly by speaker/listeners in spite of obvious theoretical difficulties such as “poverty of the stimulus” (Chomsky 1965), non-uniqueness of the object to be learned (Gold 1967), and the vast preponderance of positive examples (Angluin 1980). These issues have influenced ideas of language acquisition profoundly and need no further rehearsal here. They are generally taken as evidence for a nativist stance, holding that much of language acquisition (the ‘universal’ component) must be genetically constrained. The developing child’s task is then to infer by exposure in early life the particular, local variant of spoken language. So what are the prospects of explaining either or both of the above faculties in terms of evolutionary emergence, the topic of this workshop?
Our starting point is the suggestion of Hurford (1992, p.292): “… that promising candidate design features for evolutionary explanation include … [I]n phonetics, the phenomena of categorical perception … and the tendency of speech to lump itself into segments.” Yet this suggestion appears far too nativist. Since learners in different speech/language communities acquire different variants of sound system, it seems incontrovertible that there must be an ontogenetic aspect. This view is supported by experimental evidence from cross-language and infant perception studies (Williams 1977; Simon & Fourcin 1978; Werker & Tees 1984) suggesting that human neonates are capable of distinguishing all phonetic contrasts in the world’s languages, but that these generalised abilities are restricted by learning through exposure to become specific; cf. recent ideas of ‘perceptual magnets’ (Kuhl 1991; Kuhl et al. 1992; Guenther & Gjaja 1996). Hence, the thesis explored in this paper is that an evolutionary explanation is warranted for only part of the process of acquiring phonetic categories. The question then is: which part? Specifically, it is argued (initially at least, before refining the argument) that phylogenetic (evolutionary) learning provides a preprocessor which eases the problem of ontogenetic (during-life) language acquisition by the individual. This preprocessor is the peripheral auditory system.

The notion of a preprocessor which evolves in Darwinian fashion and effects a recoding of the stimulus to ease the problem of learnability is consonant with the arguments of Clark and Thornton (1997). Distinguishing between tractable type-1 and difficult/intractable type-2 learning problems, they posit (but do not pursue the point) “that evolution gifts us with exactly the right set of recoding biases so as to reduce specific type-2 problems to … type-1 problems” (p.57). How might this work in the case of phonetic categorisation of speech sounds?
Modelling Categorisation of Speech Sounds

According to MacWhinney (1998): “It now appears that the ability to discriminate the sounds of language is grounded on raw perceptual abilities of the mammalian auditory system” (p.202), yet: “Even in the areas to which they have been applied, emergentist models are limited in many ways … the development of the auditory and articulatory systems is not yet sufficiently grounded in physiological and neurological facts” (p.222). In previous work over many years, however, we have studied extensively the emergence of phonetic categories in a variety of computational models, all having the general form shown in Figure 1 (Damper, Pont & Elenius 1990; Damper, Gunn & Gore, forthcoming; Damper & Harnad, forthcoming). By including a front-end simulation of the mammalian auditory system, we explicitly ground our model in physiological and neurological facts. Space precludes any detailed treatment: suffice it to say that the computational models convincingly replicate the results of human and animal experimentation, in both category labelling and discrimination.

Figure 1: Two-stage computational model for the categorisation of speech sounds. An auditory model converts the sound pressure wave input into a time-frequency representation of auditory nerve firings, followed by a ‘neural network’ trained to convert the auditory time-frequency patterns of firing into a category label.

Figure 2: Labelling results for the categorisation of synthetic CV stimuli varying in voice onset time.
(a) Human and chinchilla results from Kuhl and Miller (1978) and (b) typical results using a single-layer perceptron as the learning component of the model in Figure 1.

To illustrate this, Figure 2 compares the results of labelling experiments with (a) human and chinchilla listeners and (b) one version of the computational model depicted in Figure 1. In both cases, input stimuli were tokens from synthetic series of initial stop-consonant/vowel stimuli varying in voice onset time (VOT). There were three such series (bilabial, alveolar and velar), varying in place of articulation of the stop consonant. To mimic the operant training of the animals, the neural networks (one for each place of articulation) were trained on the extreme VOT values of 0 and 80 ms and then tested on the full range of values. (Of course, adult human listeners need no such training.) In (a), but not in (b), smooth curves have been fitted to the raw data, which are otherwise essentially identical. Just like human and animal listeners, the computational model produces plateaux at the extremes of the labelling curve, corresponding to the two (voiced/unvoiced) categories, with a sharp transition between them. Learning is robust (largely independent of details of the learning system) and rapid, requiring only a handful of training epochs. (No ‘poverty of the stimulus’ here!) Nor is supervised learning essential, a potentially important fact because feedback on correct categorisation is not obviously present in the real learner's environment. Also, the movement of the category boundary with place of articulation is correct: the model does not merely place the boundary at the midpoint of the VOT range, as we would expect a ‘dumb’ pattern classifier to do. However, the learning component of our model is nothing other than a dumb (and linear, at that) pattern classifier. This suggests that the separation according to place of articulation must be effected by the front-end auditory component.
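The endpoint-training regime just described can be sketched in a few lines of Python. This is a toy illustration, not the model itself: `toy_auditory_pattern` is a hypothetical stand-in for the auditory front end (a noisy bump of channel activity whose position tracks VOT), and a classical perceptron learning rule plays the role of the learning component. As in the experiment, the unit is trained only on the 0 and 80 ms endpoints and then labels the full continuum.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_auditory_pattern(vot_ms, n_channels=16):
    # Hypothetical stand-in for the auditory front end: a noisy bump
    # of activity over channels whose position tracks VOT. The real
    # model uses a physiological auditory-nerve simulation instead.
    t = np.linspace(0.0, 1.0, n_channels)
    centre = vot_ms / 80.0
    return np.exp(-(t - centre) ** 2 / 0.02) + rng.normal(0.0, 0.05, n_channels)

def train_perceptron(epochs=20, lr=0.1, n_channels=16):
    # Single-layer perceptron trained only on the series endpoints
    # (0 ms -> 'voiced' = 0, 80 ms -> 'unvoiced' = 1), mimicking the
    # operant training of the animal subjects.
    w, b = np.zeros(n_channels), 0.0
    for _ in range(epochs):
        for vot, target in ((0, 0), (80, 1)):
            x = toy_auditory_pattern(vot, n_channels)
            y = int(w @ x + b > 0)
            w += lr * (target - y) * x
            b += lr * (target - y)
    return w, b

w, b = train_perceptron()
# Label the full continuum with the trained unit: plateaux at the
# extremes with a transition between them, as in Figure 2(b).
labels = {vot: int(w @ toy_auditory_pattern(vot) + b > 0)
          for vot in range(0, 81, 10)}
```

Note that with this toy front end the boundary necessarily falls near the continuum midpoint; in the full model, its movement with place of articulation depends on the auditory preprocessing, which is precisely the point at issue.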
This is indeed the case: replacing the front end by a simpler Fourier spectrum analyser abolishes the boundary-movement effect. That is, the peripheral auditory system, a product of evolution, preprocesses the speech sounds into a form which promotes the learning of phonetic categories.

Discussion and Implications

The work described here falls into the ‘empiricist’ approach to the study of speech perception exemplified by Kluender (1991), Nearey (1997), Sussman et al. (1998) and others. According to this increasingly influential tradition, “speech is decodable via simple pattern recognition because talkers adhere to ‘orderly output constraints’ ” (Kluender & Lotto 1999, p.504). The latter authors warn, however: “With the suggestion that a complete model include both auditory and learning processes … falsifiability is at risk … part of the explanation of speech perception falls out of general auditory processes, and the remaining variance can be ‘mopped up’ by learning processes”. But, of course, the complete model must at all times respect current knowledge of auditory system anatomy and physiology, computational learning theory, the facts of speech perception and so on, as we have tried to do here. Kluender and Lotto (1999) also warn: “It is not very useful to hypothesize that learning plays some role. Of course it does” (p.508). So how successful have we been in teasing apart phylogenetic (evolutionary) and ontogenetic (learning) factors? Probably the main respect in which we appear to be in conflict with the available human data (reviewed above) is that these data indicate that neonates are born with pre-wired categories. These seem not to have to be learned, although they may have to be unlearned. In principle, this could be incorporated in the model by some phylogenetic setting of the initial weights/parameters of the learning system. At this stage, it is uncertain whether this is reasonable or not.
The issue should become clearer as the modelling work is extended and deepened to cover more phonetic categories and to deal with real speech.

References

Angluin, D. (1980). Inductive inference of formal languages from positive data. Information and Control, 45, 117-135.
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Clark, A. and C. Thornton (1997). Trading spaces: Computation, representation and the limits of uninformed learning. Behavioral and Brain Sciences, 20, 57-66.
Damper, R.I., S.R. Gunn, and M.O. Gore (forthcoming). Extracting phonetic knowledge from learning systems: Perceptrons, support vector machines and linear discriminants. Applied Intelligence, in press.
Damper, R.I. and S.R. Harnad (forthcoming). Neural network models of categorical perception. Perception and Psychophysics, in press.
Damper, R.I., M.J. Pont, and K. Elenius (1990). Representation of initial stop consonants in a computational model of the dorsal cochlear nucleus. Technical Report STL-QPSR 4/90, Speech Transmission Laboratory Quarterly Progress and Status Report, Royal Institute of Technology (KTH), Stockholm.
Gold, E.M. (1967). Language identification in the limit. Information and Control, 10, 447-474.
Guenther, F.H. and M.N. Gjaja (1996). The perceptual magnet effect as an emergent property of neural map formation. Journal of the Acoustical Society of America, 100, 1111-1121.
Harnad, S. (Ed.) (1987). Categorical Perception: The Groundwork of Cognition. Cambridge, UK: Cambridge University Press.
Hurford, J.R. (1992). An approach to the phylogeny of the language faculty. In J.A. Hawkins and M. Gell-Mann (Eds.), The Evolution of Human Languages, pp. 273-303. Reading, MA: Addison-Wesley.
Kluender, K.R. (1991). Effects of first formant onset properties on voicing judgements resulting from processes not specific to humans. Journal of the Acoustical Society of America, 90, 83-96.
Kluender, K.R. and A.J. Lotto (1999).
Virtues and perils of an empiricist approach to speech perception. Journal of the Acoustical Society of America, 105, 503-511.
Kuhl, P.K. (1991). Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech categories, monkeys do not. Perception and Psychophysics, 50, 93-107.
Kuhl, P.K. and J.D. Miller (1978). Speech perception by the chinchilla: Identification functions for synthetic VOT stimuli. Journal of the Acoustical Society of America, 63, 905-917.
Kuhl, P.K., K.A. Williams, F. Lacerda, K.N. Stevens, and B. Lindblom (1992). Linguistic experience alters phonetic perception in infants by 6 months of age. Science, 255, 606-608.
MacWhinney, B. (1998). Models of the emergence of language. Annual Review of Psychology, 49, 199-227.
Nearey, T.M. (1997). Speech perception as pattern recognition. Journal of the Acoustical Society of America, 101, 3241-3254.
Pinker, S. (1984). Language Learnability and Language Development. Cambridge, MA: Harvard University Press.
Simon, C. and A.J. Fourcin (1978). A cross-language study of speech-pattern learning. Journal of the Acoustical Society of America, 63, 925-935.
Sussman, H.M., D. Fruchter, J. Hilbert, and J. Sirosh (1998). Linear correlates in the speech signal: The orderly output constraint. Behavioral and Brain Sciences, 21, 241-299.
Werker, J.F. and R.C. Tees (1984). Cross-language speech perception: Evidence for perceptual reorganization during the first year of life. Infant Behavior and Development, 7, 49-63.
Wexler, K. and P. Culicover (1980). Formal Principles of Language Acquisition. Cambridge, MA: MIT Press.
Williams, L. (1977). The perception of stop consonant voicing by Spanish-English bilinguals. Perception and Psychophysics, 21, 289-297.