Evolution of Language: From Animal Communication to Universal Grammar

February 9, 2001
3rd Annual Lecture

Martin Nowak
Professor of Biology and of Mathematics
Director, Program for Evolutionary Dynamics
Harvard University
It’s a great pleasure for me to be here. My background is, as Scott Weinstein said, in
theoretical biology, so I apologize to all those who understand much more about language
and psychology than I do. I think everybody in the audience, actually, will understand
more. But what I would like to bring into the subject is evolutionary biology, so I would
like to link thinking about language to models of evolutionary biology.
Evolutionary biology is actually a fairly mathematical discipline, because the basic ingredients of evolution, mutation and selection, are very well described by mathematical models. This has a long-standing tradition, starting with the work of Fisher, J.B.S. Haldane, and Sewall Wright in the 1930s, which linked Darwin's ideas of evolution to Mendel's observations of genetics.
So I would like to talk about the evolution of language today, and in particular about mathematical models of language evolution. And before I continue, I would like to tell you a story about a mathematical biologist. There's a man with a flock of sheep, and
another man comes by and says, “If I can guess the correct number of sheep in your
flock, can I have one?” And the shepherd says, “Sure, try.” And the man looks at the
sheep and says “Eighty-three.” And the shepherd is completely amazed, and the man
picks up a sheep and starts to walk away, and the shepherd says, "Hang on. If I guess your profession, can I have my sheep back?" And the man says, "Sure, try." "You must be a
mathematical biologist.” “How did you know?” “Because you picked up my dog.”
So, you will also have realized by now that my eight years at Oxford did not give me an
Oxford accent, but maybe it’s some comfort to you that my students always say I speak
exactly like Arnold Schwarzenegger.
So I've worked with a number of people on the evolution of language, and I would like to mention all of them: Natalia Komarova, David Krakauer, Joshua Plotkin (at Princeton right now), Peter [name inaudible] (a mathematician who moved from Princeton to Harvard recently), Andrea [name inaudible], and Partha Niyogi (a computational linguist at the University of Chicago). David Krakauer, in fact, was the person who one evening came to my house in Oxford and said, "Let's work on the evolution of language." And I said, "Okay."
Why would one want to work on the evolution of language? One reason is that the Linguistic Society of Paris officially banned any work on language evolution at its meeting in 1866. This was only a few years after Darwin published "On the Origin of Species." By the way, Darwin made a number of interesting comments about language evolution. For him, it was totally clear that human language emerged gradually from animal communication. He also compared languages to species, in the sense that languages compete with each other and some languages go extinct, and once a language is extinct it never reappears, like a species. Another reason I would like to work on language evolution is that Chomsky suggested that language might not have arisen by Darwinian evolution, which is a surprising statement for an evolutionary biologist.
One could also want to work on language evolution because of the view that language came as a by-product of a big brain, which Steve Pinker compares to the idea that all the parts of a jetliner are lying in some backyard and then a hurricane sweeps through and randomly assembles the jetliner.
I would like to convince you that language is a very complex trait, and it is extremely
unlikely that such a complex trait can arise as a random by-product of some other process
or as one gigantic mutation or something like that. I would like to say I work on
language evolution because language is the most interesting thing to evolve in the last
several hundred million years. Maybe ever since the evolution of the nervous system,
actually. Why is it the most interesting thing? In my opinion, because it is really the last
of a series of major events that changed the rules of evolution itself. So if you ask, among all the things that were created by evolution, what actually affected the rules of evolution? First, you would name the origin of life, because without it there wouldn't be anything. That was maybe four billion years ago, and who knows whether on this earth or somewhere else. By 3.5 billion years ago there were prokaryotes, bacterial cells already, so that is actually a very short time if you really believe it started four billion years ago. It then took two billion years to go from bacterial cells to higher cells. Approximately 600 million years ago there were multicellular organisms.
And sometime, who knows when, complicated language. Language changes the rules of evolution because it creates a new mode of evolution. It is no longer the case that information transfer is limited to genetic information transfer, as it was for most of the history of life on earth. Instead, we have the ability to use language for unlimited cultural evolution. Certainly animals have cultural evolution too, but language allows us to bring it to a qualitatively new state: we can transmit information to other individuals and to the next generation on a non-genetic, purely linguistic carrier.
Now I would like to give a very personal list of what I think is remarkable about language; this should again serve to make the point that language is indeed a complicated trait, and that a complicated trait can only arise gradually and by natural selection. Language gives us unlimited expressibility. In the beautiful words of von Humboldt, it "makes infinite use of finite means." Worldwide, there are approximately
6000 languages; many, by the way, are threatened by extinction. If you look at all these
natural languages, there is no simple language. So all of these naturally occurring
languages have unlimited expressibility. The only exception, maybe, is pidgin. As you know, the very interesting observation is that a pidgin, which is a very impoverished language, arises whenever you bring together two groups that have no language in common. Nevertheless, within one generation, children who receive their input from this pidgin turn it into a more-or-less full-blown language, which is called a creole. This is a very interesting process that underlines our innate ability to interpret linguistic input as referring to a full-blown language. The picture is complicated by the fact that most of these children also receive input from another full-blown language, and I think this is something that is more and more appreciated at the moment. That objection may not hold, however, for the development of sign language recently reported in Nicaragua, as I understand it. Children who had only very impoverished sign languages in their own families were brought together for the first time, and because they used signing as their primary means of communication, it developed very rapidly into a complicated language.
What is remarkable about language? All of you know about 60,000 words, and maybe many more. This is the average for a 17-year-old American high school student. If you take this figure, 60,000 words, then you learned about one new word per hour for sixteen years (see the rough calculation below). Our language instinct is associated with an enormous memory capacity. It would be unimaginable to memorize 60,000 telephone numbers together with the people they refer to, but in some sense, in our lexicon, we have an arbitrary meaning memorized with each word. And of course it is an extremely hard question, one that people here have worked on, what it means for a child to learn the meaning of a word. A three-year-old child gets more than 90% of grammatical rules right. This is something that I read in the books; I do not observe it with my own children. Apparently, if you count all the mistakes they could make, 90% of those mistakes they don't make. So very early on, they have some instinct for grammatical language.
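As a rough back-of-the-envelope check of that one-word-per-hour figure, assuming (my assumption, not the speaker's) about ten waking hours a day:

```latex
\frac{60{,}000\ \text{words}}{16\ \text{yr}\times 365\ \text{d/yr}\times 10\ \text{h/d}}
\;=\;\frac{60{,}000\ \text{words}}{58{,}400\ \text{h}}
\;\approx\; 1\ \text{word per hour}.
```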
Speech production is the most complicated mechanical motion we perform. As you are all aware, in order to produce the sounds of speech, various organs have to be coordinated with each other, movements have to be regulated to within fractions of millimeters, and the timing has to be right to within tenths, maybe hundredths, of a second. Likewise, speech comprehension occurs with impressive speed. With artificially accelerated speech you can recognize up to fifty phonemes per second, which is well above the temporal resolution of ordinary auditory input, about twenty discrete sounds per second. The reason is that several phonemes are packed into each moment of the speech sound. So we can make the argument that language is a complicated trait and that many different parts of our anatomy are geared to deal with it in an almost perfect way. Talking is totally effortless: we can speak without thinking, and usually we do. Again, I am fascinated by the comparison with how hard it is to multiply two numbers; that requires enormous concentration. At the same time, the computations involved in producing and interpreting language are arguably much more complicated. I have made all these points just to say that language is a complex trait that can only arise gradually by natural selection.
When did language evolve? This is something people ask over and over again. The question has to be refined a bit, but first let us take some facts. The first fact is that all humans have complicated language. The other fact is that the most recent common ancestor of all humans lived about 100,000 years ago. This is actually not a well-confirmed figure at the moment, because it was calculated for the male lines of ancestry, taking the Y-chromosome. So this is Adam; Adam lived 100,000 years ago. If you take the female lines of ancestry, using mitochondrial DNA, which we get only from our mothers, you find that Eve lived about 200,000 years ago. So it is not totally in agreement. You can actually get this disagreement if you assume that there was a higher male mortality. But sometime around 100,000 years ago, all of Africa was populated by anatomically modern people, and by about 60,000 years ago they had spread from East Africa to the whole world.
Since everybody now has complicated language, and language must have a biological foundation, as I wanted to convince you here, we must assume that these people had language. So the most recent date you could argue for is between 60,000 and 100,000 years ago. If you would like to say 60,000 years, then you would have to assume that the people who spread from East Africa to the world 60,000 years ago also replaced the people who had already populated Africa 100,000 years ago. The other timing constraint is that chimpanzees do not have language's unlimited expressibility, and humans and chimps separated about five million years ago. We all know that there are experiments on training chimpanzees, and Nim Chimpsky was one of the individuals taught to use signs. They are certainly very clever, but at the same time, nobody would argue that they have unlimited expressibility in their natural communication system.
At the same time, we have to be aware that monkeys have brain areas that are very similar to the human language centers, but they do not seem to use them for vocalization. The homologues of Wernicke's area and Broca's area, as I understand it, are not used by these monkeys for vocalization. But again, experts here in the audience may know much more about this than I do. What I want to say is that it is clear that language uses cognitive abilities that evolved long ago. So language is not just a great moment of evolution's blind watchmaker some 100,000 to 200,000 years ago, but the consequence of playing with animal cognition for 500 million years. The question is therefore not when did language evolve, but when in evolutionary time can we find which aspect of language.
Evolution always plays the trick of taking structures that evolved for other purposes, re-wiring them, and using them for unexpected new purposes. The same must have happened for the human language faculty. Another question that is asked very, very often is why only humans have language. Again, more precisely one should ask: why did only one family in the animal kingdom evolve communication with unlimited expressibility? Partly, this is what my research is about, in the sense that I would like to show you what the steps in language evolution are, what the complication is in making each step, and where animal communication may not have taken the next step. But partly this question remains unanswered, because in one sense it is almost a historical question, and science does not really allow you to make statements about one-time events.
So the goal of this research program is to formulate an evolutionary theory of language, and to study how natural selection guides the gradual emergence of the basic design features of human language, such as arbitrary signs, syntactic signals, and grammar. And again, evolution is a mathematical theory, and therefore I would like to formulate everything I say in terms of mathematical concepts, at the very least to define whatever we are talking about.
So first, let us try to imagine how evolution leads to arbitrary signs. Let us look at a simple signaling system, maybe something that is used by animals. There is a matrix which links each referent of the world, whatever can be referred to, to a signal. And this design is still present in current human language, where one can see it as the lexical matrix. In current human language, the lexical matrix would link word meaning with word form. A lot of entries in this matrix are zero, which means this word meaning is not linked to this word form, but some of the entries are linked. You can think of this as a binary matrix, where you are either linked or not linked, or you can have gradations in the matrix, where a meaning is more strongly associated with one form than with another.
In order to define speaking and hearing, one has to derive from this lexical matrix two other matrices, which are stochastic matrices of probabilities. So there is a mathematical transformation that generates out of this lexical matrix a matrix that encodes how a speaker uses language: basically, if I want to express a certain referent, what is the probability that I use a certain signal for it? And the other way around: if I receive a certain signal, what is the probability that I associate a certain referent with it? These two matrices cannot be identical, but they can be similar, and both are derived from the lexical matrix. The way communication then works is that there is a speaker who wants to communicate referent i and uses a certain signal j; the speaker matrix tells him or her which signal to use. The hearer receives signal j, and the hearer matrix tells the hearer which referent to associate with it. That is the basic mathematical ingredient of our simple language model.
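As a minimal numerical sketch of this transformation (the matrix values and the row/column normalization convention are my own illustration, not taken from the talk):

```python
import numpy as np

# Lexical matrix A: rows are referents, columns are signals.
# A[i, j] > 0 means referent i is associated with signal j.
A = np.array([
    [2.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
    [1.0, 0.0, 2.0],
])

# Speaker matrix P: given referent i, the probability of producing
# signal j. Each row of A is normalized to sum to one.
P = A / A.sum(axis=1, keepdims=True)

# Hearer matrix Q: given signal j, the probability of inferring
# referent i. Each column of A is normalized to sum to one.
Q = A / A.sum(axis=0, keepdims=True)

print(P)  # rows sum to 1: how this individual speaks
print(Q)  # columns sum to 1: how this individual hears
```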
And then we use evolutionary game theory. Game theory was developed by John von Neumann and Oskar Morgenstern in the 1940s to describe human and economic behavior, but John Maynard Smith applied it to animal behavior and linked it to evolutionary processes. We assume that whenever a speaker and a hearer communicate correctly about a certain referent, they both get a point. So we make the very simple assumption that communication is advantageous to both of us. I could make it more complicated, and at some stage I will have to, by assuming that I give you information that helps only you and not me, or that I tell you to do something that is bad for you, so that I manipulate you and it helps me but not you. We can extend the model in these directions, but for the moment I stay with the model that assumes communication helps both of us. So we assume a cooperative model.
Then I can write down a payoff function, which is essentially the probability that I use a certain signal to convey a certain meaning and you associate exactly the same meaning with that signal, averaged over all the signals and all the referents, and then averaged over the case where I speak to you and the case where you speak to me. This is how you can derive the payoff function. Then I use evolutionary dynamics in conjunction with this payoff function.
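Written out, the payoff function described here should take roughly the following form (a reconstruction from the verbal description, with p_ij the probability that a speaker uses signal j for referent i, q_ji the probability that a hearer associates referent i with signal j, and primes marking the second individual):

```latex
F \;=\; \frac{1}{2}\sum_{i}\sum_{j}\left( p_{ij}\, q'_{ji} \;+\; p'_{ij}\, q_{ji} \right).
```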
So 'evolutionary dynamics': what does that mean? There is a group of individuals, and everyone starts with a random lexical matrix, so there are no shared associations to start off with. Everyone talks to everyone else, with the same probability that any one person talks to any other. For each successful communication there is a payoff of one point, as indicated in the function before. Individuals produce children in proportion to their payoff. This is very important, by the way: language ability has to have consequences for biological fitness. If it does not, then natural selection has no ability to shape our language instinct. So in some sense language must have an impact on, say, survival probability, and if survival probability increases, we may also have more children. Finally, I assume that children learn the lexical matrix of their parents. At first this is a convenience, because it is very hard to formulate the mathematical theory where you can learn from everybody, but we are working on that at the moment. So one could say that I learn my language predominantly from my parents, and I also have other input, but in the mathematical model we first assume that you get it just from your parents.
If you could learn language randomly from anyone in the population, that would actually destroy the fitness effect, because there would be no reward for speaking well. If I learn language from lots of people, I should preferably learn from those who communicate well. So fitness can be rewarded in two different ways: either directly biologically, by learning from the parents, or by preferentially learning from those who communicate well, based on reputation. If you run this on a computer (it is not just a computer simulation; we can also write down equations and analyze them in detail), you could say that there are some individuals in your population at time zero with random matrix entries, which overall do not make sense. Then you run these evolutionary dynamics for some generations, and you will find that certain associations begin to emerge; if you run for long enough, you may find that everybody uses the same signal to convey the same meaning. And it is arbitrary: whether the first signal is associated with the first referent or with any other referent is just something that happens. Whether or not the population arrives at this final stage of a coherent lexicon depends on how well children learn from their parents. This is exactly what I want to analyze, and a simulation along these lines is sketched below.
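Here is a compact sketch of such a simulation, under my own illustrative assumptions (population size, noise model, and payoff normalization are not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
N_POP, N_REF, N_SIG, GENERATIONS = 30, 5, 5, 200
NOISE = 0.05  # imperfect learning: children copy a parent with some error

def speaker_hearer(A):
    P = A / A.sum(axis=1, keepdims=True)  # speaking: rows sum to 1
    Q = A / A.sum(axis=0, keepdims=True)  # hearing: columns sum to 1
    return P, Q

def payoff(A1, A2):
    P1, Q1 = speaker_hearer(A1)
    P2, Q2 = speaker_hearer(A2)
    # average of "1 speaks, 2 hears" and "2 speaks, 1 hears"
    return 0.5 * (np.sum(P1 * Q2) + np.sum(P2 * Q1))

# Time zero: random lexical matrices, no shared conventions.
pop = [rng.random((N_REF, N_SIG)) for _ in range(N_POP)]
for generation in range(GENERATIONS):
    fitness = np.array([sum(payoff(a, b) for b in pop if b is not a)
                        for a in pop])
    # Children are produced in proportion to payoff; each child learns
    # (copies, with noise) the lexical matrix of its parent.
    parents = rng.choice(N_POP, size=N_POP, p=fitness / fitness.sum())
    pop = [np.clip(pop[k] + NOISE * rng.standard_normal((N_REF, N_SIG)),
                   1e-6, None) for k in parents]

print("average pairwise payoff:", fitness.mean() / (N_POP - 1))
```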
So you could say children make certain errors during language learning, and one simple kind of error is that they do not learn all lexical items. There is a certain lexical item, and maybe it is so infrequent that children do not pick it up, and it is lost. Then we can calculate, given the probability u that children fail to learn a specific association, which somehow measures the accuracy of the cognitive ability to learn lexical items, the total number of items that can be maintained in a population. We can show that the maximum lexicon size is given by 1/(2u). So if you give me the accuracy of acquiring words, that gives the total number of words. We can also calculate the distribution of words known by people: some people know a lot of words, others know few. It is a Gaussian distribution, and the average number of words known by an individual is the total number of words divided by Euler's number e.
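To make this concrete with illustrative numbers of my own (not from the talk):

```latex
u = 10^{-4}
\;\Rightarrow\;
n_{\max} = \frac{1}{2u} = 5000
\quad\text{and}\quad
\bar{n} = \frac{n_{\max}}{e} \approx \frac{5000}{2.718} \approx 1840 \ \text{words per individual}.
```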
So this is a specific mathematical model for this one type of error. Of course, this is not the most general type of error, because what you would really like to do is allow children to learn incorrect associations. This gets at the problem of how to learn the meaning of a word, because children might have certain guesses, get them wrong in the beginning, and refine them later. We are doing the mathematical analysis of this just now, and we find in preliminary results that what you really need, if you have this kind of mistake, is a mechanism to sort out conflicts; otherwise the communicative ability of the population will diverge. For example, if you have ambiguities, you want to resolve those ambiguities, at least to some degree. As we know, lexicons have ambiguities that we cannot get rid of, and that is also what we find here, but you want to reduce them to increase communicative ability.
The other thing that is very important is that there must be something like a restricted expectation, in the sense that the child must know which are the potential referents. If there is a word, what could it label? The child cannot possibly be totally open-minded about what it could refer to. So we can calculate, in terms of the mathematical model, for a given probability of mistaking one referent for another, the total number of referents that the child can handle.
So I think that in the same way that some people like to talk about Universal Grammar, one could make an argument for a Universal Lexicon. There has to be some innate component to solving the task of learning the meaning of a word.
The next extension of our theory was to assume that there is an error matrix, essentially a noisy channel, that links speaker and hearer. For example, I say 'lion' and you understand 'banana.' That is an unlikely mistake, but other mistakes are more likely. This is also important for animal communication, because signals may not be arbitrarily well resolved from each other. We have a setup here that is totally equivalent, or at least superficially equivalent, to Shannon's information theory. There is again the matrix that determines speaking, which is like Shannon's encoding matrix. Then we have an error matrix, the probability of mistaking one signal for another, which is Shannon's noisy channel. And then we have our hearer matrix, which is the decoding.
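Putting the three matrices together, the chance that a referent survives the whole chain should look like this (my reconstruction of the setup just described):

```latex
% p_{ij}: speaker encodes referent i as signal j
% e_{jk}: noisy channel, signal j is perceived as signal k
% q_{ki}: hearer decodes perceived signal k as referent i
\Pr(\text{referent } i \text{ is understood})
\;=\; \sum_{j}\sum_{k} p_{ij}\, e_{jk}\, q_{ki}.
```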
If we analyze this now in a population setting with fitness, which is different from information theory, then we find that there is an error limit. This is essentially the observation that there is a maximum fitness, which is achieved by using a small number of signals to describe the most relevant objects in the world around you. Adding more signals to the system reduces the overall fitness of the individual. In such a setup, therefore, natural selection will prefer a limited repertoire. This is interesting because I think we will all agree that animal communication in natural situations is based on limited repertoires, perhaps because increasing the number of signals does not really increase communicative ability, given the chance of mistaking signals for one another.
How did human language deal with this error limit? The obvious answer is that the error limit can be overcome by using signals that consist of sequences of basic units, namely phonemes. So human language works with sequences of phonemes. Mathematically, we can show, and this is similar to a theorem proved by Shannon in information theory, that the maximum attainable fitness of such a communication system increases exponentially with the code length, essentially the word length. Of course, you do not want arbitrarily long code words, because the rate of communication goes down. So if you want to optimize both, you end up with words of intermediate length. What I want to suggest here is that human language overcame the error limit by sequencing phonemes, and therefore has the ability to generate large numbers of words that can be fairly well resolved from each other.
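A hedged sketch of why sequencing helps (my own illustration of the combinatorics, not a statement of the theorem itself):

```latex
% With an alphabet of a phonemes and words of length \ell, the number of
% possible words grows exponentially,
n \;=\; a^{\ell},
% while for a small per-phoneme error \epsilon the chance that a word is
% misheard grows only about linearly in \ell:
\Pr(\text{word misheard}) \;=\; 1-(1-\epsilon)^{\ell} \;\approx\; \ell\,\epsilon .
```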
A fascinating question for me is: do animals have words? That really depends on what you would call a word, because there are several aspects to a word, some of which you definitely find in animal communication. If you say a word is an arbitrary sign, then I think you would say that animals also have arbitrary signs. If words are sequences of phonemes, then you could say that birdsong, in some sense, also makes use of sequenced sounds; we do not really know how important that is for the meaning, because the meaning of birdsong seems to be fairly fixed. Then you could say that in human language the meaning of a word also depends on context, and here it is much less clear whether animals have the same ability in their natural communication systems. Again, I remind you of the situation where a child in the one-word stage says something like 'kitty' and can use it to denote many different meanings: 'there is the kitty,' or 'where is the kitty?' Still, it is just a one-word utterance. I am not sure there is an example of animal communication where they can do that. I wonder whether the ability of children to do this is basically a consequence of the fact that they have listened to their parents' syntactic structures and have seen that this one word can occur in different syntactic contexts, and therefore, without actually using syntactic structure themselves, they make use of the context dependence.
So again, for me a most fascinating question is whether animals have words, and I would like to know what experts like John Smith or Dorothy Cheney would say about this. In animal communication, bees, most remarkably, can communicate to each other the location and the amount of food; it is like a three-dimensional analog system. And then there is the famous example of animal communication in vervet monkeys, worked out by Dorothy Cheney: they have a handful of signals, and maybe the most interesting ones are 'eagle,' 'snake,' and 'leopard' (there are more of these), and as you know they can use the signals to induce a certain behavior. For example, 'leopard,' and the monkeys jump up into a tree. I understand that young monkeys have to learn these signals and get them wrong initially, and I also understand there is an anecdote that these signals can sometimes be used for deceptive purposes: one monkey has a banana, another one shouts 'leopard!', the first monkey drops the banana and jumps into the bush, and the other one has the banana.
So, as you know, human language has an interesting design feature which is referred to as duality of patterning. It means that human language makes use of combinatorics on two different levels: sequences of phonemes form words (I talked about this already), and then there is another level where combinatorics comes into play, which is that sequences of words form sentences. In the next part of my talk I would like to discuss the evolution of syntactic communication.
Let me define syntactic communication in the most trivial way: syntactic communication is whenever signals consist of components that have their own meaning. Non-syntactic communication is the opposite: you use signals that cannot be decomposed into parts with their own meanings. I ask: under what condition does natural selection see the advantage of syntactic communication? Consider a very simple scenario: say you have two nouns, lion and monkey, and two verbs, running and sleeping, and you want to refer to the events 'lion running,' 'monkey running,' 'lion sleeping,' and 'monkey sleeping.' A non-syntactic communication system would have four arbitrary signals for these events. A syntactic communication system would have signals for the components of the events, that is, for lion and monkey and for running and sleeping.
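The combinatorial stakes in this toy example are easy to state (my own arithmetic, illustrating why syntax only pays off for large message sets):

```latex
% With n nouns and m verbs, a non-syntactic system needs one signal per
% event, while a syntactic system needs one word per component:
\text{non-syntactic: } n \times m \ \text{signals},
\qquad
\text{syntactic: } n + m \ \text{words}.
% For n = m = 2 the two are equal (4 vs. 4); the advantage appears only
% as the number of relevant messages grows.
```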
Now I build a population-dynamics evolutionary model and ask when natural selection sees the advantage of syntactic communication. Obviously, syntactic communication allows a larger repertoire and allows one to formulate new messages that have not been learned beforehand. Our sentences are, of course, syntactic signals, and our words are non-syntactic signals, if you take words as listemes: all words have to be learned, but sentences can be new messages. Syntax also allows us to use a signal in different contexts. So there are clear advantages to syntactic communication. The interesting observation, however, was that in our evolutionary models natural selection can only favor syntactic communication if the number of relevant messages exceeds a certain threshold, that is, if individual components can be used in many different messages. There is a certain mathematical condition one can write down, and if the social interactions of the group do not fulfill this condition, then natural selection prefers non-syntactic over syntactic communication. So only as the social complexity of interactions increased and more and more messages became relevant could natural selection see why it would be better to use syntactic communication. Otherwise, you are better off sticking with the non-syntactic communication system, which obviously does not give you unlimited expressibility. We call this the "syntax threshold" in evolutionary theory.
Now, in the final part of my talk, I would like to discuss with you the evolution of grammar. This is work together with Natalia Komarova, an applied mathematician at Princeton, and Partha Niyogi, a computational linguist at Chicago. First, let us think: what is grammar? I think Chomsky called grammar the computational system of language. It is essentially the part that allows us to make infinite use of finite means. There is an architecture, proposed by the linguist Ray Jackendoff, describing what grammar is in the human language capacity. You have phonological rules, and these phonological rules are linked to hearing and speaking. There is an interface that links the phonological rules to syntactic rules, and an interface that links the syntactic rules to conceptual or semantic rules, which then determine perception and action. Grammar is essentially an overarching rule system that includes all of these. So, if you like, grammar is a rule system that generates a mapping between hearing and speaking, which is our signal formulation, and perception and action, very much as we had before in the lexical matrix.
So in the most general way, you can think of grammar as a rule system that generates such a matrix, linking referents to signals. Mathematically, the conventional representation of grammar is a rule system that generates a subset of strings, a set of sentences, of integers if you like. You could imagine the set of all possible sentences and describe everything in a binary language, because, as you know, computers encode everything that way. So you can enumerate all possible sentences. There are infinitely many, but we can enumerate them. Some of them make sense, others do not make sense at all. And then we can say that a grammar is a rule system that tells you which of the sentences make sense and which do not.
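As a toy illustration of this view (the enumeration and the particular rule are mine, purely for concreteness), a grammar is just a decision rule over enumerated strings:

```python
from itertools import product

def grammar(sentence: str) -> bool:
    """An arbitrary toy rule: a binary 'sentence' is grammatical
    iff it contains an even number of 1s."""
    return sentence.count("1") % 2 == 0

# Enumerate all binary sentences up to length 3 and classify them.
for length in range(1, 4):
    for bits in product("01", repeat=length):
        s = "".join(bits)
        print(s, "grammatical" if grammar(s) else "not grammatical")
```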
And now, what is the process of grammar acquisition? The observation is that children reliably acquire the grammar of their native language by hearing grammatical sentences, and the information available to the child does not uniquely determine the underlying grammatical rules. This is called the poverty of the stimulus. It is not actually a matter of debate, but a mathematical fact. It was first observed by Chomsky and then made rigorous by Gold, who formulated it as a theorem. Imagine I have a rule system that can generate certain integers, and I give you examples of my integers, because, as I said, sentences can always be enumerated. After a number of examples (you can choose how many you would like to hear; I will give you as many as you wish), I ask you what my rule is, or I ask you to give me some examples back. We can show mathematically that there is no way you could solve this problem if you had no preformed expectation about what my rule could be, or some preformed expectation about the most complicated rule I could possibly consider. This is a fundamental part of learning theory. People like Scott Weinstein have worked on this and written books about exactly this mathematical foundation of learning theory. From that perspective, the necessity of innate components is undebatable.
So what Chomsky said with respect to language acquisition is that children could not guess the correct grammar if they had no preformed expectation, and this innate expectation is Universal Grammar. In this context, grammar acquisition works in the following way: there is an environmental input, consisting of sample sentences for example, and there is a learning procedure by which the child evaluates this environmental input. The learning procedure tells the child to choose one of the candidate grammars available in the search space, which is the total set of hypotheses that can occur to the child during language learning. This search space could be finite or infinite; if it is infinite, then the child must have a prior distribution over which candidate grammars are more likely than others.
What I would like to propose is that Universal Grammar, what Chomsky called Universal Grammar, consists of exactly the rule system that generates the search space, together with the mechanism for evaluating the input. In this context, Universal Grammar is equivalent to the mechanism of language acquisition. If I then say Universal Grammar has to be innate, I am not saying anything controversial, because we all agree that a mechanism for learning language has to be there, since you do learn language, and the mechanism by which you learn cannot itself be learned; it has to be in place beforehand. What, then, in reality goes into this Universal Grammar? This is one of the most fascinating questions of cognitive science. It seems clear that it is not only linguistic structure that must be part of Universal Grammar, but also very general cognitive abilities like theory of mind. I think you could not learn the meaning of a word if you did not have a model of the mind of the person who teaches you.
We will now proceed with a finite search space, because this is what we are able to formulate so far. We can extend these mathematical models to an infinite search space with a prior distribution, and I do not think this would greatly change the quality of the results I am presenting. So now I link this concept of language acquisition and Universal Grammar to the population dynamics of grammar acquisition. I ask: what conditions does Universal Grammar have to satisfy for a population to evolve and maintain coherent grammatical communication? This is the real evolutionary question. We would like to know which cognitive abilities must be in place such that a population of individuals has a chance to come up with, and maintain, a grammatical communication system.
In order to do this, I have to introduce a few technical points. Imagine there is a grammar i and a grammar j. These are rule systems that generate sets of sentences, and the sets may have a certain overlap. I define a number aij: the probability that a speaker of grammar i says a sentence that is compatible with grammar j. This tells you the pairwise relationship between grammars. It will be very important for the task of learning which grammar is actually being spoken by a certain person, as you will appreciate very soon.
So I have introduced another a matrix, and I apologize for always using the letter 'a' for matrices. This is very different from the matrix I had before: this a matrix specifies the pairwise relationships between the candidate grammars in the search space. It does not really tell you the full configuration of the search space; it only considers the pairwise relationships. Now, this a matrix will be a consequence of your favorite theory of grammar acquisition, for example principles and parameters, or optimality theory. You do not really know what this matrix looks like. Therefore we use a trick that physicists use whenever they come across such a problem. There was a time when they wanted to describe how the components of an atomic nucleus interact with each other, and the interactions are very complicated, so what they did was assume they are random. Let us start with a random matrix and see whether we can make some headway that way. So we assume that this matrix is a random matrix, and we ask how the system behaves if it is a random matrix with certain properties.
Again, as before with learning the lexicon, you must assume that communication has some consequence for fitness. The way we do it with grammar is actually very natural, because we now have these interactions between two grammars. If one speaker uses grammar i and another uses grammar j, then the payoff, which is the chance of mutual understanding, is just the average of the case where i says something and j understands it and the case where j says something and i understands it. So we have a link between this a matrix and evolutionary payoff: communicative success.
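In symbols, the payoff between a speaker of grammar G_i and a speaker of grammar G_j should be (reconstructed from the definition just given):

```latex
F(G_i, G_j) \;=\; \frac{a_{ij} + a_{ji}}{2}.
```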
Now, as before, we write down the population dynamics of grammar acquisition. We assume that everybody talks with everybody else with the same probability. Payoff translates into fitness, reproduction is proportional to fitness, and children receive input from their parents and develop their own grammar. Everybody has the same Universal Grammar; that is what we assume in the beginning. So we have a fairly autonomous mathematical model where everybody has a particular mechanism of language acquisition, and we ask: is this mechanism of language acquisition sufficiently powerful that the population will converge on a coherent communication system?
The mathematical equation that we looked at has the following form (written out below). It is a system of differential equations where xi is the frequency of all those who use grammar i, and the sum over the xi is normalized to one, so the system does not consider changes in population size. The fitness of all those who use grammar i is a function of what everybody else in the population uses: it is a sum over all j, where xj is the frequency of those who use grammar j, weighted by the payoff that somebody who uses grammar i gets from somebody who uses grammar j. This fitness determines the rate of change of the frequency of each grammar, and qij is the probability that a learner will acquire grammar j from a teacher of grammar i. This is where your learning mechanism comes into play. Perfect learning would be qii = 1, where you never make a mistake; but you never learn perfectly, so you may end up with a grammar slightly different from the grammar of your parents, for example.
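Assembled from this description, the equation should read (my reconstruction; n is the number of candidate grammars):

```latex
\dot{x}_j \;=\; \sum_{i=1}^{n} f_i\, q_{ij}\, x_i \;-\; \phi\, x_j,
\qquad
f_i \;=\; \sum_{k=1}^{n} x_k\,\frac{a_{ik}+a_{ki}}{2},
\qquad
\phi \;=\; \sum_{i=1}^{n} f_i\, x_i .
```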
We have this term Φ here, which is the average fitness of the population; it is the term that makes sure the overall population size stays constant, but it has an interesting interpretation in our model: it is the grammatical coherence. It is precisely the probability that if one person says a random sentence, another person will understand that sentence. For us, as theoretical evolutionary biologists counting sheep, it is interesting to see the relation this language equation has to other equations in evolutionary biology. On one hand you have the quasispecies equation, which was designed by Manfred Eigen for the chemical evolution of the origin of life, and on the other hand you have the so-called replicator equation, which is the fundamental equation for all of frequency-dependent selection and evolutionary game theory. Our equation lies on a continuum between these two equations, generalizing both.
Now, what we see for the language equation is that there are two kinds of equilibrium solutions. In one, everybody uses a different grammar; this is a kind of "after the Tower of Babel" situation, where nobody understands anybody else. In the other, one grammar predominates. Which of these solutions is stable depends on Universal Grammar, on the mechanism individuals use to learn the language. That is what you want to calculate.
We eventually use a random matrix, but first we make it even simpler: we start with a very symmetric matrix, where we assume that all grammars are equally good and all grammars have the same distance from each other. The chance that a sentence said by a person who uses grammar 1 is compatible with grammar 1 is one, so the diagonal of the matrix is 1, and the pairwise overlap between any two different grammars is a constant. We do this because we can solve it analytically; later we will extend it. If you have this matrix for the pairwise relationships among grammars, then the learning mechanism will generate a Q matrix that has the same structure, where q is the probability of learning the grammar correctly and p is the probability of learning a particular incorrect grammar.
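In symbols, the symmetric case should look like this (my reconstruction; n is the number of candidate grammars, and the rows of Q summing to one fixes p):

```latex
a_{ii} = 1, \qquad a_{ij} = a \ \ (i \ne j), \quad 0 < a < 1;
\qquad
q_{ii} = q, \qquad q_{ij} = p = \frac{1-q}{\,n-1\,} \ \ (i \ne j).
```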
There are n+1 equilibria in this equation: n asymmetric solutions, and one symmetric solution, which is no coherent communication. We can then write down equations that tell us when these solutions come into existence and when they are stable. I should not go into all the details, but let me show you a bifurcation diagram, which has the following form. There is the accuracy of language acquisition q, the probability that children learn exactly the grammar of their parents. If this accuracy is below a certain threshold, the only solution is the one where every grammar is equally likely to be adopted, so there is no coherent communication. But if the accuracy rises above a certain threshold, then the asymmetric solutions come into existence and also become stable (the yellow branch is the stable branch of the solution here); they correspond to the situation where the population has settled on one predominant grammar. Which of the n grammars in the search space it is, is arbitrary. So there is symmetry breaking in the system: you go from a uniform distribution over all candidate grammars to one grammar dominating. The uniform solution loses stability here, so q > q1 is a necessary condition for grammatical coherence and q > q2 is a sufficient condition. The coherence threshold can therefore be formulated in the following way: if q > q1, then Universal Grammar can produce coherent communication in the population, and you can calculate this q1 for this simple example.
More generally, we can say something about the maximum complexity of the search space. The learning accuracy is a declining function of the size of the search space, and qn > q1 is an implicit condition for the maximum complexity of the search space that is compatible with coherent communication in the population. That is a condition Universal Grammar has to fulfill for there to be coherent communication.
I have not yet talked about an actual learning mechanism, but now I would like to do so. I would like to discuss two learning mechanisms which are arguably boundaries on the learning mechanisms actually used by the child. One is the simplest learning mechanism you can imagine, and the other is a sophisticated one; our cognitive abilities are better than the first, the so-called memoryless learner, but not sufficient to perform as well as the second, the batch learner, which I will describe in a moment. The memoryless learner works in the following way: you start with a randomly chosen candidate grammar in your search space; you stay with the current grammar as long as you receive input that is compatible with it; you change to a different grammar when a sentence is not compatible; and you stop after a certain number of sentences.
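Here is a short sketch of that procedure (the toy grammars and the compatibility test are my own stand-ins, not from the talk):

```python
import random

def memoryless_learner(grammars, is_compatible, sentences):
    """Start from a random candidate grammar; whenever an input sentence
    is incompatible with the current guess, jump to a different random
    candidate; stop when the input runs out."""
    current = random.choice(grammars)
    for sentence in sentences:
        if not is_compatible(current, sentence):
            current = random.choice([g for g in grammars if g is not current])
    return current

# Toy usage: a grammar is just the set of sentences it generates.
g1, g2 = {"ab", "ba"}, {"ab", "aa"}
learned = memoryless_learner([g1, g2], lambda g, s: s in g, ["ab", "ba", "ba"])
print(learned)  # ends at g1, the grammar compatible with all the input
```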
If you have this learning mechanism, then you can calculate the coherence ratio that Universal Grammar has to satisfy, which is that the number of sentences the child receives has to be greater than a constant times the size of the search space. If this condition holds, then the mechanism of language acquisition can induce coherent communication in a population. And if we now draw the bifurcation diagram for a random a matrix, it looks much more complicated than what I showed before. There is again something corresponding to the uniform solution, where all grammars are more-or-less equally likely, and then the individual grammars have different branches that reflect when they can take over the population; for a given language-learning accuracy, a subset of all the possible grammars in the search space can come to dominate. We now know that the equation I showed here applies also to this situation. This is a consequence of a result in random matrix theory proven only very recently by a mathematician at Temple University in Philadelphia.
The other learning mechanism I would like to discuss is the batch learner. A batch learner is a learner with infinite memory: it memorizes all the sentences it receives, and afterwards it chooses the grammar that is most compatible with all this input. Our cognitive capabilities are certainly not as good as that. The coherence ratio of the batch learner is that the number of input sentences has to exceed a constant times the logarithm of the size of the search space. Whatever mechanism we actually use is most likely between these two boundaries.
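Side by side, the two coherence ratios as stated (with N the number of input sentences, n the size of the search space, and C, C' constants):

```latex
\text{memoryless learner:}\quad N \;>\; C\,n
\qquad\qquad
\text{batch learner:}\quad N \;>\; C'\,\ln n .
```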
We could extend this model in several ways; partly we are trying to do this, and partly we hope that others will. We can formulate it in a spatial context, and then you can ask where, in different locations, different grammars become predominant. What we have described is a homogeneous, well-mixed population in which one grammar will dominate if the accuracy of the language acquisition mechanism is high enough. But in different regions of space, different grammars will take over, and then one has to ask whether, in this setup, the boundaries between languages would be stable. One can also formulate these dynamics in small populations, with a stochastic rather than a deterministic equation; deterministic equations always describe large populations, but one would really like to treat small populations with a stochastic equation, and we have started to analyze this. You can also generalize the model so that different grammars have different performances. What I have shown you so far assumed that each grammar was equally good, also for the random matrix. But you can assume that certain grammars express certain things and not others, and then you can think of the population searching for better and better adapted grammars in a large search space of possible grammars. Then you would have a cultural evolution trying to find a grammar with better and better expressibility, or less and less ambiguity. In that sense you can describe the cultural evolution of grammar, and also compare the ideas of this model to [name inaudible]'s work on language change from historical linguistics. So this is one way to make connections to observations about language change.
Then there is something we have already done: the evolution of Universal Grammar in a biological sense. In the model so far, I assumed that everybody has the same Universal Grammar, and asked whether it gives rise to coherence or not. But what you would like to allow is variation in the mechanism of language acquisition, variation in Universal Grammar, so not very universal then. Suppose people differ in their genetic ability to deal with Universal Grammar, and mutation can alter it; does natural selection change Universal Grammar? To study this, you look at equations of the following type: there are two different Universal Grammars in the population, and you ask which one is favored by natural selection. What we find so far is that natural selection acting on Universal Grammars leads to a limited period of language acquisition. On one hand, you want to receive a large input, because then you learn accurately; but then you also spend a very long time learning. You can analyze this situation precisely, where everybody has the same learning mechanism and the same search space, and the only difference is how much input is considered. If you do this, you find that natural selection prefers an intermediate amount of input, which is one idea for why there could be a limited language acquisition period in children: it is the evolutionary optimum. We also find, in some sense (though this is very hard to make rigorous, at least at the moment), that natural selection leads to search spaces of intermediate size. You could ask: if I want to maximize the accuracy of language learning, why not just be born with whatever language we are speaking, since that would be the most accurate arrangement? Obviously, it may simply be impossible. But then you would say, all right, natural selection should at least reduce the size of the search space as much as possible.
The complication, however, is this: imagine there are two kinds of people, one kind with a very limited search space and one kind with a larger search space. Now somebody makes a cultural invention about language, say, invents the concept of a subordinate clause, and the larger search space allows you to learn this while the smaller search space does not. So you have natural selection for maintaining flexibility. I would like to make this more precise mathematically; it is hand-waving at the moment. But in other areas we know very well that evolution pays a big price for remaining flexible, so that could also be a reason why we remain flexible for language acquisition. The price that we pay is essentially the inaccuracy, or the difficulty, of learning language.
The very last thing I would like to mention is that, similar to the syntax threshold I described, there is also something like a grammar threshold. Imagine you have a population of speakers for whom certain sentence types are relevant to their performance, and imagine there is a finite number of relevant sentence types. Then one strategy to learn the language is not to search for rules at all but just to learn it by heart: memorize the sentence types. Of course, with our current understanding of language this sounds stupid, because we know we do not do this. But I remind you that we have an enormous memory capacity associated with our language instinct, and this is precisely what we do for words. And, arguably, the distinction between grammar and lexical items is not so clear; certain grammatical rules we memorize as lexical items.
So we can then ask: how does the competition between so-called list-makers and rule-finders come out? Rule-finders are those described in the last part of the talk: they have a search space and look for underlying rules. List-makers just take note of all the structures they receive and memorize them. You can calculate that rule-finders out-compete list-makers only if the number of relevant sentence types is above a certain threshold, which depends on a constant, which in turn depends on the size of the search space induced by Universal Grammar. It also depends, of course, on the learning mechanism; for fairness, here I used the best possible learning mechanism for the rule-finders.
In summary, I have talked about arbitrary signs, the components of the lexical matrix of human language, and then about how natural selection can shape the two aspects of the duality of patterning in human language: word formation, where words are sequences of phonemes, and syntactic signals, where our sentences consist of components that have their own meaning. In the final part of my talk I presented some ideas on how to begin thinking about the evolution and natural selection of grammars and of Universal Grammar. Thank you very much.