
Ontogenetic versus Phylogenetic Learning of the Emergence of Phonetic
Categories
Robert I. Damper
Image, Speech and Intelligent Systems (ISIS) Research Group
Department of Electronics and Computer Science
University of Southampton, Southampton SO17 1BJ, UK.
Introduction
The scientific study of spoken language confronts two central problems: (1) How is the
continuous sound pressure wave of speech decoded by a listener into the discrete percept of a
linguistic message? (2) How does each individual listener acquire the particular form of sound
system, morphology, grammar, etc. appropriate to the language of his/her social community?
The first of these is the classical problem of categorical perception (Harnad 1987) and the
second is that of language acquisition (Wexler & Culicover 1980; Pinker 1984). Both have been
intensively studied, and each has given rise to vigorous debate and controversy. They are generally held to
be rather different problems; plainly, however, neither makes much sense without the other.
Language is acquired rapidly and robustly by speaker/listeners in spite of obvious theoretical
difficulties such as “poverty of the stimulus” (Chomsky 1965), non-uniqueness of the object to
be learned (Gold 1967), and the vast preponderance of positive examples (Angluin 1980).
These issues have influenced ideas of language acquisition profoundly and need no further
rehearsal here. They are generally taken as evidence for a nativist stance, holding that much of
language acquisition (the ‘universal’ component) must be genetically constrained. The
developing child’s task is then to infer by exposure in early life the particular, local variant of
spoken language.
So what are the prospects of explaining either or both of the above faculties in terms of
evolutionary emergence—the topic of this workshop? Our starting point is the suggestion of
Hurford (1992, p.292): “… that promising candidate design features for evolutionary
explanation include … [I]n phonetics, the phenomena of categorical perception … and the
tendency of speech to lump itself into segments.” Yet this suggestion appears far too nativist.
Since learners in different speech/language communities acquire different variants of sound
system, it seems incontrovertible that there must be an ontogenetic aspect. This latter view is
supported by experimental evidence from cross-language and infant perception studies (Williams
1977; Simon & Fourcin 1978; Werker & Tees 1984) suggesting that human neonates are
capable of distinguishing all phonetic contrasts in the world’s languages, but these generalised
abilities are restricted by learning through exposure to become specific, cf. recent ideas of
‘perceptual magnets’ (Kuhl 1991; Kuhl et al. 1992; Guenther & Gjaja 1996). Hence, the thesis
explored in this paper is that an evolutionary explanation is only warranted for part of the
process of acquiring phonetic categories. The question then is: which part? Specifically, it is
argued—initially at least, before refining the argument—that phylogenetic (evolutionary)
learning provides a pre-processor which eases the problem of ontogenetic (during life) language
acquisition by the individual. This preprocessor is the peripheral auditory system.
The notion of a preprocessor which evolves in Darwinian fashion and effects a recoding of the
stimulus to ease the problem of learnability is consonant with the arguments of Clark and
Thornton (1997). Distinguishing between tractable type-1 and difficult/intractable type-2
learning problems, they posit (but do not pursue the point) “that evolution gifts us with exactly
the right set of recoding biases so as to reduce specific type-2 problems to … type-1
problems” (p.57). How might this work in the case of phonetic categorisation of speech sounds?
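Clark and Thornton's point can be made concrete with a toy sketch (the example and threshold are illustrative, not taken from their paper): a relational 'type-2' problem, whose label marks whether two inputs are equal, is not captured by any single threshold on the raw features, yet a recoding to the single feature |a - b| reduces it to a trivially learnable 'type-1' problem.

```python
# Toy 'type-2' relational problem: label is 1 exactly when the two
# raw inputs are equal. No threshold on a or b alone solves this.
data = [(a, b, int(a == b)) for a in range(8) for b in range(8)]

# The hypothesised evolutionary 'recoding bias': replace the raw
# pair (a, b) by the single derived feature |a - b|.
recoded = [(abs(a - b), y) for a, b, y in data]

# After recoding, one fixed threshold classifies every case -- the
# problem has become 'type-1'.
accuracy = sum(int(f < 0.5) == y for f, y in recoded) / len(recoded)
print(accuracy)  # 1.0
```

On this view, the peripheral auditory system would play the role of the |a - b| recoding: a fixed, inherited transformation that leaves only a type-1 residue for the individual learner.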
Modelling Categorisation of Speech Sounds
According to MacWhinney (1998): “It now appears that the ability to discriminate the sounds of
language is grounded on raw perceptual abilities of the mammalian auditory system” (p.202)
yet: “Even in the areas to which they have been applied, emergentist models are limited in many
ways … the development of the auditory and articulatory systems is not yet sufficiently
grounded in physiological and neurological facts” (p.222). In previous work over many years,
however, we have studied extensively the emergence of phonetic categories in a variety of
computational models, but all having the general form shown in Figure 1 (Damper, Pont &
Elenius 1990; Damper, Gunn & Gore, forthcoming; Damper & Harnad, forthcoming). By
inclusion of a front-end simulation of the mammalian auditory system, we explicitly ground our
model in physiological and neurological facts. Space precludes any detailed treatment: suffice
to say that the computational models convincingly replicate the results of human and animal
experimentation, in both category labelling and discrimination.
Figure 1: Two-stage computational model for the categorisation of speech sounds. An auditory model
converts the sound pressure wave input into a time-frequency representation of auditory nerve firings, followed
by a ‘neural network’ trained to convert the auditory time-frequency patterns of firing into a category label.
Figure 2: Labelling results for the categorisation of synthetic CV stimuli varying in voice onset time. (a) Human
and chinchilla results from Kuhl and Miller (1978) and (b) typical results using a single-layer perceptron as the
learning component of the model in Figure 1.
To illustrate this, Figure 2 compares the results of labelling experiments with (a) human and
chinchilla listeners and (b) one version of the computational model depicted in Figure 1. In both
cases, input stimuli were tokens from synthetic series of initial-stop-consonant/vowel stimuli
varying in voice onset time (VOT). There were three such series (bilabial, alveolar and velar),
varying in place of articulation of the stop consonant. To mimic the operant training of the
animals, the neural networks (one for each place of articulation) were trained on the extreme
VOT values of 0 and 80 ms and then tested on the full range of values. (Of course, adult human
listeners need no such training.) In (a), but not in (b), smooth curves have been fitted to the raw
data; the two sets of results are otherwise essentially identical. Just like human and animal listeners, the
computational model produces plateaux, corresponding to the two (voiced/unvoiced) categories,
at the extremes of the labelling curve, with a sharp transition between them. Learning is robust
(largely independent of details of the learning system) and rapid, requiring only a handful of
training epochs. (No ‘poverty of the stimulus’ here!) Nor is supervised learning essential—a
potentially important fact because feedback on correct categorisation is not obviously present in
the real learner's environment. Also, the movement of the category boundary with place of
articulation is correct: the model does not merely place the boundary at the midpoint of the
VOT range, as we would expect a ‘dumb’ pattern classifier to do.
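The endpoint-training protocol just described can be sketched as follows (a minimal illustration, not the paper's model: the ten-dimensional 'auditory' patterns for the two endpoint stimuli are random stand-ins, and intermediate VOT stimuli are linearly interpolated between them):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in 'auditory' patterns for the two endpoint stimuli
# (VOT = 0 ms, voiced; VOT = 80 ms, voiceless).
voiced, voiceless = rng.standard_normal(10), rng.standard_normal(10)
train = [(voiced, 0), (voiceless, 1)]

# Classic perceptron training on the endpoints only; for two
# separable patterns this converges in a handful of epochs.
w, b = np.zeros(10), 0.0
for _ in range(100):
    mistakes = 0
    for x, y in train:
        err = y - int(x @ w + b > 0)
        if err:
            w, b = w + err * x, b + err
            mistakes += 1
    if mistakes == 0:
        break

# Test on the full continuum: interpolated stimuli every 10 ms.
vots = np.arange(0, 81, 10)
labels = [int((((80 - v) * voiced + v * voiceless) / 80) @ w + b > 0)
          for v in vots]
print(labels)  # plateaux at the extremes, one sharp transition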
However, the learning component of our model is nothing other than a dumb (linear, at that)
pattern classifier. This suggests that the separation according to place of articulation must be
effected by the front-end auditory component. This is indeed the case: replacing the front end
by a simpler Fourier spectrum analyser abolishes the boundary-movement effect. That is, the
peripheral auditory system—a product of evolution—preprocesses the speech sounds into a form
which promotes the learning of phonetic categories.
Discussion and Implications
The work described here falls into the ‘empiricist’ approach to the study of speech perception
exemplified by Kluender (1991), Nearey (1997), Sussman et al. (1998) and others. According to
this increasingly influential tradition, “speech is decodable via simple pattern recognition
because talkers adhere to ‘orderly output constraints’ ” (Kluender & Lotto 1999, p.504). The
latter authors warn, however: “With the suggestion that a complete model include both auditory
and learning processes … falsifiability is at risk … part of the explanation of speech perception
falls out of general auditory processes, and the remaining variance can be ‘mopped up’ by
learning processes”. But, of course, the complete model must all the time respect current
knowledge of auditory system anatomy and physiology, computational learning theory, the facts
of speech perception and so on, as we have tried to do here.
Kluender and Lotto (1999) also warn: “It is not very useful to hypothesize that learning plays
some role. Of course it does” (p.508). So how successful have we been in teasing apart
phylogenetic (evolutionary) and ontogenetic (learning) factors? Probably the main aspect in
which we appear to be in conflict with available human data (reviewed above) is that these data
indicate that neonates are born with pre-wired categories. These seem not to have to be learned,
although they may have to be unlearned. In principle, this could be incorporated in the model by
some phylogenetic setting of the initial weights/parameters of the learning system. At this stage,
it is uncertain whether this is reasonable or not. The issue should become clearer as the
modelling work is extended and deepened to cover more phonetic categories and to deal with
real speech.
References
Angluin, D. (1980). Inductive inferences of formal languages from positive data. Information and Control, 45,
117-145.
Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Clark, A. and C. Thornton (1997). Trading spaces: Computation, representation and the limits of uninformed
learning. Behavioral and Brain Sciences, 20, 57-66.
Damper, R.I., S.R. Gunn, and M.O. Gore (forthcoming). Extracting phonetic knowledge from learning systems:
Perceptrons, support vector machines and linear discriminants. Applied Intelligence, in press.
Damper, R.I. and S.R. Harnad (forthcoming). Neural network models of categorical perception. Perception and
Psychophysics, in press.
Damper, R.I., M.J. Pont, and K. Elenius (1990). Representation of initial stop consonants in a computational model
of the dorsal cochlear nucleus. Technical Report STL-QPSR 4/90, Speech Transmission Laboratory
Quarterly Progress and Status Report, Royal Institute of Technology (KTH), Stockholm.
Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.
Guenther, F. H. and M. N. Gjaja (1996). The perceptual magnet effect as an emergent property of neural map
formation. Journal of the Acoustical Society of America, 100, 1111-1121.
Harnad, S. (Ed.) (1987). Categorical Perception: the Groundwork of Cognition. Cambridge, UK: Cambridge
University Press.
Hurford, J. R. (1992). An approach to the phylogeny of the language faculty. In J. A. Hawkins and M. Gell-Mann
(Eds.), The Evolution of Human Languages, pp.273-303. Reading, MA: Addison-Wesley.
Kluender, K.R. (1991). Effects of first formant onset properties on voicing judgements resulting from processes not
specific to humans. Journal of the Acoustical Society of America, 90, 83-96.
Kluender, K.R. and A.J. Lotto (1999). Virtues and perils of an empiricist approach to speech perception. Journal
of the Acoustical Society of America, 105, 503-511.
Kuhl, P.K. (1991). Human adults and human infants show a ‘perceptual magnet effect’ for the prototypes of speech
categories, monkeys do not. Perception and Psychophysics, 50, 93-107.
Kuhl, P.K. and J.D. Miller (1978). Speech perception by the chinchilla: Identification functions for synthetic VOT
stimuli. Journal of the Acoustical Society of America, 63, 905-917.
Kuhl, P.K., K.A. Williams, F. Lacerda, K.N. Stevens, and B. Lindblom (1992). Linguistic experience alters
phonetic perception in infants by 6 months of age. Science, 255, 606-608.
MacWhinney, B. (1998). Models of the emergence of language. Annual Review of Psychology, 49, 199-227.
Nearey, T.M. (1997). Speech perception as pattern recognition. Journal of the Acoustical Society of America, 101,
3241-3254.
Pinker, S. (1984). Language Learnability and Language Development. Cambridge, MA: Harvard University Press.
Simon, C. and A.J. Fourcin (1978). A cross-language study of speech-pattern learning. Journal of the Acoustical
Society of America, 63, 925-935.
Sussman, H.M., D. Fruchter, J. Hilbert, and J. Sirosh (1998). Linear correlates in the speech signal: The orderly
output constraint. Behavioral and Brain Sciences, 21, 241-299.
Werker, J. F. and R. C. Tees (1984). Cross-language speech perception: Evidence for perceptual reorganization
during the first year of life. Infant Behavior and Development, 7, 49-63.
Wexler, K. and P. Culicover (1980). Formal Principles of Language Acquisition. Cambridge, MA: MIT Press.
Williams, L. (1977). The perception of stop consonant voicing by Spanish-English bilinguals. Perception and
Psychophysics, 21, 289-297.