Why Syntax is Impossible
Mike Dowman
Syntax
Languages have tens of thousands
of words
Some combinations of words make
valid sentences
Others don’t
No one understands the grammar
of any language
Syntax is Complicated!
I saw Bill with Mary yesterday.
You saw WHO with Mary yesterday?!
Who did you see with Mary yesterday?
I saw Bill and Mary yesterday.
You saw WHO and Mary yesterday?!
Who did you see and Mary yesterday?
Generative Grammar
An explicit formal system that defines the set of valid sentences in a language
And maybe also explains what each one means
Generative grammar is the core research topic in linguistics
Includes strongly nativist theories and theories proposing that languages are primarily learned
Grammar Writing
Linguists take a selection of
possible sentences
And obtain grammaticality
judgments for those
sentences
Then they produce a
grammar that accounts for
all the data
Grammar Coverage
Linguists’ grammars only
work for selected sentences
They can’t explain most
naturally occurring
sentences
The more data we consider
the more surprising quirks
of syntax that emerge
Children’s Language Acquisition
Kids observe a limited number of example sentences
But quickly internalize a system that correctly characterizes the whole language
[Diagram: E-language → LAD → I-language]
How can kids do syntax
when linguists can’t?
Innate component of
language (provided by
genes)
Learned component of
language (provided by
language data)
Linguists have to infer both
Children need only infer the learned component
Information Theory
Both components of language
must contain some amount of
information
Data available to children must provide at least as much information as is in the learned component
This puts a limit on the
complexity of the learned
component of language
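As a rough illustration of this limit (a sketch with my own numbers, not figures from these slides, following the one-bit-per-judgment idea that appears on a later slide): if each example sentence can give the child at most about one bit of information about the learned component, a learned component of n bits needs at least n example sentences.

```python
# Back-of-the-envelope bound: assumes each example sentence carries at most
# `bits_per_sentence` bits of information about the learned component.
def min_sentences_needed(learned_bits, bits_per_sentence=1.0):
    """Minimum number of example sentences needed to fix the learned component."""
    return learned_bits / bits_per_sentence

print(min_sentences_needed(1_000))      # 1,000-bit learned component: at least 1,000 sentences
print(min_sentences_needed(1_000_000))  # 1,000,000-bit learned component: at least 1,000,000 sentences
```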
Linguists’ Task
Linguists need to have at least
as much information as is in the
learned and innate components
together
Can use data from multiple
languages to try to characterize
innate components
And can use positive and
negative data
Correspondence to
Linguistic Theories
Small learned component =
parameter setting
Large learned component =
learned languages
Small innate component =
general learning mechanism
Large innate component =
universal grammar
Size of Each Component
(rows: size of the learned component; columns: size of the innate component;
learn = difficulty for the child, ling = difficulty for the linguist)

                   Innate small        Innate large        Innate huge
Learned small      learn = easy        learn = easy        learn = easy
                   ling = easy         ling = hard         ling = impossible
Learned large      learn = hard        learn = hard        learn = hard
                   ling = hard         ling = hard         ling = impossible
Learned huge       learn = impossible  learn = impossible  learn = impossible
                   ling = impossible   ling = impossible   ling = impossible
Which component is
large?
As we haven’t yet managed
to produce a generative
grammar, at least one of
innate or learned
components must be large
Children learn relatively
easily, so the learned
component can’t be too big
How big could the innate
component be?
Genome contains 3 billion base pairs, at 2 bits each = 6 billion bits
Cell metabolism adds more
information
Each base pair can be
modified
Huge amount of
information!
What could be in a huge
innate component?
Not word forms - these vary from language to language
Grammaticality patterns
Rules of syntax would be
hugely complex
Impossibility of Syntax
Grammaticality judgments on
average can provide no more
than one bit of information each
If syntax is hugely complex,
there will be many grammars
that are compatible with any
given body of data
But all but one of these
grammars would fail when
tested on enough new data
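A rough way to quantify this (an illustrative calculation of my own; the component sizes echo the simulation slides later in the deck): if the grammar has k bits that the data must determine and each judgment supplies at most one bit, then after N judgments roughly 2^(k−N) grammars remain compatible with the data.

```python
# Approximate count of grammars still compatible with N one-bit judgments,
# when the grammar to be identified contains k unknown bits.
def log2_compatible_grammars(k_bits, n_judgments):
    """Return log2 of the approximate number of compatible grammars."""
    return max(k_bits - n_judgments, 0)

print(log2_compatible_grammars(1_000_000, 20_000))  # 980000 -> ~2**980000 grammars still fit
print(log2_compatible_grammars(1_000, 20_000))      # 0 -> the data can single out one grammar
```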
A Concrete Example
A multi-agent model
Each agent has:
innate component
learned component
Both are bit strings of fixed
length
Sentences are 100 bit strings
Deciding on the Grammaticality of a Sentence 1
Treat the sentence as a binary number
Find:
b_i = s mod n_i
b_l = s mod n_l
where b_i and b_l are indices to a bit in the innate and learned components respectively, n_i and n_l are the numbers of bits in the innate and learned components, and s is the sentence treated as a binary number
Deciding on the Grammaticality of a Sentence 2
A pseudo-random function maps from the two selected bits plus the sentence to a Boolean grammaticality judgment
It’s therefore typically necessary to know every bit of the sentence and both the innate and learned bits to predict the grammaticality of the sentence
Every bit counts
Usually about half of sentences are grammatical, half ungrammatical
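A minimal Python sketch of this decision procedure. The slides leave the pseudo-random function unspecified; here a bit derived from the sentence via SHA-256 is XORed with the two selected bits. That combination is my own assumption, chosen so that flipping the selected learned bit flips the judgment, which is what the learning rule on a later slide relies on.

```python
import hashlib

def grammatical(sentence_bits, innate, learned):
    """Boolean grammaticality judgment for one agent.

    sentence_bits, innate and learned are lists (or tuples) of 0/1 ints.
    """
    # Treat the sentence as a binary number.
    s = int("".join(map(str, sentence_bits)), 2)
    # b_i = s mod n_i and b_l = s mod n_l pick one innate and one learned bit.
    b_i = innate[s % len(innate)]
    b_l = learned[s % len(learned)]
    # Pseudo-random bit derived from the whole sentence (stand-in function).
    h = hashlib.sha256(str(s).encode()).digest()[0] & 1
    # Combine the sentence-derived bit with the two selected bits:
    # roughly half of all sentences come out grammatical.
    return (h ^ b_i ^ b_l) == 1
```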
4 Kinds of Agent
Teacher:   Innate: 10101000  Learned: 10010101
Related:   Innate: 10101000  Learned: 11110001
Unrelated: Innate: 10110101  Learned: 00111000
Linguist:  Innate: 00110100  Learned: 10001100
(The related agent shares the teacher’s innate component; the unrelated agent and the linguist do not.)
Learning by Related,
Unrelated
Observe a sentence from
the teacher
Work out if it is grammatical
according to current
I-language
If not, invert the relevant
bit of the learned
component
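A sketch of this update rule, reusing grammatical() from the sketch above. The assumption that the teacher only presents sentences it itself judges grammatical is mine; the slide only says that the learner observes sentences from the teacher.

```python
import random

def learn_from_teacher(learner_innate, learner_learned,
                       teacher_innate, teacher_learned,
                       n_examples=20_000, sentence_bits=100):
    """Expose a learner (the related or unrelated agent) to the teacher's output."""
    seen = 0
    while seen < n_examples:
        sentence = [random.randint(0, 1) for _ in range(sentence_bits)]
        # Assumed: only sentences the teacher judges grammatical are observed.
        if not grammatical(sentence, teacher_innate, teacher_learned):
            continue
        seen += 1
        # If the learner's current I-language disagrees, invert the bit of the
        # learned component that this sentence indexes.
        if not grammatical(sentence, learner_innate, learner_learned):
            s = int("".join(map(str, sentence)), 2)
            learner_learned[s % len(learner_learned)] ^= 1
    return learner_learned
```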
Grammar Inference by
Linguists
Choose random sentences
Ask the teacher if they are
grammatical
Store all sentences and
grammaticality judgments
Search for a setting of innate
and learned components that
assigns the correct
grammaticality rating to every
sentence
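A sketch of the linguist’s procedure, again reusing grammatical(). Exhaustive search is only feasible for toy component sizes (6 bits each below, my choice, with the teacher assumed to have components of the same sizes), but it already shows that more than one grammar can account for all of the collected data.

```python
import itertools
import random

def infer_grammar(teacher_innate, teacher_learned,
                  n_queries=50, n_i=6, n_l=6, sentence_bits=100):
    """Collect judgments from the teacher, then return every (innate, learned)
    setting that assigns the correct judgment to all stored sentences."""
    # Choose random sentences and ask the teacher whether they are grammatical.
    data = []
    for _ in range(n_queries):
        sentence = [random.randint(0, 1) for _ in range(sentence_bits)]
        data.append((sentence, grammatical(sentence, teacher_innate, teacher_learned)))
    # Exhaustive search over every candidate pair of components.
    consistent = []
    for innate in itertools.product((0, 1), repeat=n_i):
        for learned in itertools.product((0, 1), repeat=n_l):
            if all(grammatical(sent, innate, learned) == judgment
                   for sent, judgment in data):
                consistent.append((innate, learned))
    return consistent

# Toy teacher whose component sizes match the search:
teacher_i = [random.randint(0, 1) for _ in range(6)]
teacher_l = [random.randint(0, 1) for _ in range(6)]
print(len(infer_grammar(teacher_i, teacher_l)))  # typically well over 1
```

With the XOR combination assumed earlier and equal component sizes, flipping an innate bit and the learned bit at the same index leaves every judgment unchanged, so the search always returns many grammars that fit the data equally well.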
1,000 Bit Innate and Learned Components
[Chart: related, unrelated and linguist agents; vertical axis from 0.6 to 1; horizontal axis: Number of Example Sentences, 0 to 20,000]
1,000 Bit Innate Component, 1,000,000 Bit Learned Component
[Chart: related, unrelated and linguist agents; vertical axis from 0.6 to 1; horizontal axis: Number of Example Sentences, 0 to 20,000]
1,000,000 Bit Innate Component, 1,000 Bit Learned Component
[Chart: related, unrelated and linguist agents; vertical axis from 0.6 to 1; horizontal axis: Number of Example Sentences, 0 to 20,000]
Implications of
Impossible Syntax
A linguist can write a
grammar that will
adequately characterize any
body of data
But it will fail when tested
on new data
Partial grammars are not a
stepping stone to complete
generative grammars
A Universal Law of
Generative Grammar
Generative grammar is impossible if:
H(learned component) + H(innate component) > H(language data)
unless we can use information from another source (genetic, neuroscientific, psycholinguistic)
Why do Syntax?
Studying generative
grammar may tell us
something about the human
mind
It won’t help us build
natural language processing
systems
Is studying rare and
obscure constructions the
best way to do syntax?
Conclusion
The idea that we can
characterize a language by
considering enough
linguistic data is a
hypothesis
It’s very unlikely that it’s
possible to write a complete
generative grammar