How do linguists study grammar? Lori Levin 11-721: Grammars and Lexicons

August 29, 2007
Outline
• Views of language:
– Prescriptive
– Artistic
– Descriptive
• Claims about knowledge of a language:
– Unconscious
– Complex
– Systematic
– Can be studied scientifically
• A research tool: grammaticality judgments
– What is grammaticality?
– Problems with grammaticality
– Rationalism vs empiricism
– Why should language technologists care about grammaticality?
Prescriptive and Descriptive
Linguistics
• Natural phenomena cannot be legislated,
just described.
– You can’t declare the value of π to be 3.
– Sag, Wasow, and Bender, page 1
• Social phenomena can be legislated.
• Language use can be legislated as a social
phenomenon, but it can also be studied as
a natural phenomenon.
Prescriptive view of language
• Rules about how language should be used
– Don’t say Me and him went to the movies.
– It doesn’t make sense because you can’t say Me
went to the movies.
• Focus on isolated phenomena that are thought
to be corruptions of the language.
– Everybody should do their homework.
• Some people speak correctly and others don’t.
• Rules are something that you are aware of.
Artistic View of Language
• Language can be used creatively to make
literature and poetry.
• Some people are better at it than others.
• Language is not systematic and rule
governed.
Descriptive view of language
• Study language as a natural phenomenon
– People say Me and him went to the movies.
– That’s interesting because they don’t say Me went to
the movies.
• Focus on all aspects of language, even very
normal sentences.
• Every native speaker of a language speaks
equally well.
– Unless there is an injury or an illness that affects
certain parts of the brain or speech producing organs.
• Language consists of systematic knowledge that
the speakers are not aware of.
Knowledge of Language
• “Every normal speaker of any natural
language has acquired an immensely rich
and systematic body of unconscious
knowledge, which can be investigated by
consulting speakers’ intuitive judgments.”
(Claims 1 and 2)
• “Languages are objects of considerable
complexity, which can be studied
scientifically. That is, we can formulate
hypotheses about linguistic structure and test
them against the facts of particular
languages.”
(Claims 3 and 4)
Sag et al., page 2
Chomsky, 1957 on testable
hypotheses
The search for rigorous formulation in linguistics has a much more
serious motivation than mere concern for logical niceties or the
desire to purify well-established methods of linguistic analysis.
Precisely constructed models for linguistic structure can play an
important role, both negative and positive, in the process of
discovery itself. By pushing a precise but inadequate formulation to
an unacceptable conclusion, we can often expose the exact source
of the inadequacy and, consequently, gain a deeper understanding
of the linguistic data. More positively a formalized theory may
automatically provide solutions for many problems other than those
for which it was explicitly designed. Obscure and intuition-bound
notions can neither lead to absurd conclusions nor provide new and
correct ones, and hence they fail to be useful in two important
respects.
(Noam Chomsky has been the most influential linguist in many parts of
the world since 1957. You may also have heard his name
associated with politics.)
“Immensely rich and systematic
body of unconscious knowledge”
• They saw Pat and Chris.
• They saw Pat with Chris.
• Who did they see Pat with?
• *Who did they see Pat and?
– Has anyone ever had to tell you not to say
this?
Testable hypotheses about
linguistic knowledge
• *We like us.
• We like ourselves.
• She likes her. (She ≠ her)
• She likes herself.
• Nobody likes us.
• *Leslie likes ourselves.
• *Ourselves like us.
• *Ourselves like ourselves.
Testable hypotheses
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:
Counter-examples
• We think that Leslie likes us.
• *We think that Leslie likes ourselves.
• *We think that ourselves like Leslie.
New Hypothesis
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:
• (This is an English rule. Many languages
do not follow it.)
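The cycle of hypothesis, counter-example, and revised hypothesis can be sketched in code. The slides leave the hypotheses themselves blank, so the two rules below are illustrative stand-ins chosen for this sketch, not the wording from the lecture: `hypothesis_1` ties the reflexive to the subject of the whole sentence, and `hypothesis_2` ties it to the subject of the clause that contains the pronoun. The `Judgment` fields are hand annotations, also an assumption of the sketch. Only object pronouns are covered; starred subject reflexives such as *Ourselves like us would need a further condition.

```python
from typing import Callable, List, NamedTuple

class Judgment(NamedTuple):
    sentence: str
    reflexive: bool      # is the object pronoun reflexive (-self / -selves)?
    refers_to: str       # intended referent of the object pronoun
    main_subject: str    # subject of the whole sentence
    local_subject: str   # subject of the clause containing the pronoun
    grammatical: bool    # judgment from the slides (False = starred)

# Sentences and stars are taken from the slides; the annotations are by hand.
DATA = [
    Judgment("We like us.", False, "we", "we", "we", False),
    Judgment("We like ourselves.", True, "we", "we", "we", True),
    Judgment("She likes her.", False, "someone else", "she", "she", True),
    Judgment("She likes herself.", True, "she", "she", "she", True),
    Judgment("Nobody likes us.", False, "we", "nobody", "nobody", True),
    Judgment("Leslie likes ourselves.", True, "we", "Leslie", "Leslie", False),
    Judgment("We think that Leslie likes us.", False, "we", "we", "Leslie", True),
    Judgment("We think that Leslie likes ourselves.", True, "we", "we", "Leslie", False),
]

def hypothesis_1(j: Judgment) -> bool:
    """Assumed rule: reflexive if and only if the pronoun refers to the sentence subject."""
    return j.reflexive == (j.refers_to == j.main_subject)

def hypothesis_2(j: Judgment) -> bool:
    """Assumed rule: reflexive if and only if the pronoun refers to its own clause's subject."""
    return j.reflexive == (j.refers_to == j.local_subject)

def counterexamples(hypothesis: Callable[[Judgment], bool]) -> List[str]:
    """Sentences where the hypothesis's prediction disagrees with the judgment."""
    return [j.sentence for j in DATA if hypothesis(j) != j.grammatical]

print(counterexamples(hypothesis_1))  # the two "We think that Leslie ..." sentences
print(counterexamples(hypothesis_2))  # []
```

A precise rule makes its own failures visible: the first rule is refuted by exactly the embedded-clause sentences, which is what motivates the revision.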
Support for the new hypothesis
• We think that she voted for her. (she ≠ her)
• We think that she voted for herself.
• *We think that herself voted for her.
• *We think that herself voted for herself.
Counter-examples
• Our friends like us.
• *Our friends like ourselves.
• Those pictures of us offended us.
• *Those pictures of us offended ourselves.
New Hypothesis
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:
Counter-examples
• Vote for us.
• *Vote for ourselves.
• *Vote for you.
• Vote for yourselves.
Counter-examples
• We appealed to them to vote for themselves.
• We appealed to them to vote for them.
– Them ≠ them
• We appealed to them to vote for us.
• *We appealed to them to vote for ourselves.
• *We appeared to them to vote for themselves.
• We appeared to them to vote for them.
– Them = them
• *We appeared to them to vote for us.
• We appeared to them to vote for ourselves.
“The theoretical machinery required for a viable grammatical
analysis could be quite abstract.” Sag et al., page 6
Grammaticality Judgments as a
scientific tool for collecting data
• What is grammaticality?
• What are some problems in using it as a
tool for collecting data?
• Grammaticality vs corpus analysis
One more claim:
• It is also possible to make testable
hypotheses about how languages differ
and what they have in common.
Investigate hypotheses by consulting
native speakers’ intuitions
• Many linguists (probably a majority)
assume that people can distinguish strings
of words that are sentences of their
language from strings of words that are
not sentences of their language.
– So imagine that you are a machine or a
classifier that takes a sentence as input, and
returns “accept” or “reject” as output.
Native speakers as automata that
accept and reject strings of words.
• The student read a book.
• Student the a read book.
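The accept/reject picture can be made concrete with a toy recognizer. This is only a sketch: the lexicon `LEXICON`, the single sentence pattern `PATTERN`, and the function name `judge` are assumptions made for illustration, and a real speaker's knowledge is far richer than one pattern.

```python
# Toy lexicon and sentence pattern. Both are illustrative assumptions,
# not a real grammar of English.
LEXICON = {
    "the": "Det", "a": "Det",
    "student": "N", "book": "N",
    "read": "V",
}
PATTERN = ["Det", "N", "V", "Det", "N"]  # e.g. "The student read a book."

def judge(sentence: str) -> str:
    """Return "accept" if the word categories match PATTERN, else "reject"."""
    words = sentence.lower().rstrip(".").split()
    categories = [LEXICON.get(word) for word in words]  # None for unknown words
    return "accept" if categories == PATTERN else "reject"

print(judge("The student read a book."))  # accept
print(judge("Student the a read book."))  # reject
```

Both inputs contain the same five words; only the order of their categories separates the accepted string from the rejected one.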
Grammaticality
• A string of words that you recognize as a sentence in
your native language is grammatical.
• A string of words that you do not recognize as a
sentence in your native language is ungrammatical.
• When you decide whether a sentence is grammatical
or ungrammatical, this is called giving a grammaticality
judgment.
• Ungrammatical sentences are preceded by an asterisk
or star (*). Sometimes they are called starred
sentences.
• If native speakers can’t decide whether the sentence is
grammatical or ungrammatical, it is preceded by a
combination of stars and question marks.
Grammaticality: Descriptive
• When you give a grammaticality judgment,
you are not supposed to judge whether the
sentence is the most elegant or
appropriate --- just whether it is a sentence
of your language or not.
• You may have a stylistic preference for
one of these, but they are all grammatical.
– These are things you never want to hear.
– These are things you want never to hear.
– These are things you want to never hear.
Grammatical ≠ meaningful
– It is unlikely that Pat will succeed.
– It is improbable that Pat will succeed.
– Pat is unlikely to succeed.
– *Pat is improbable to succeed.
• This could be meaningful, but most people consider it to be
ungrammatical.
– They saw Pat with Chris.
– They saw Pat and Chris.
– Who did they see Pat with?
– *Who did they see Pat and?
• Again, this could be meaningful, but it is ungrammatical.
Syntactically well-formed vs
semantically well-formed
• Colorless green ideas sleep furiously.
– Syntactically well-formed
– Chomsky, 1957
• Colorless sleep green furiously ideas.
– Not syntactically well-formed
Grammaticality:
Where to draw the line?
• Sentences that are understandable, but
sound like mistakes are probably not
grammatical.
– *These are things that I don’t know anyone
who says.
Where to draw the line?
• Sentences of bad poetry are not grammatical.
• Strange word order is used to make lines rhyme:
– Fame to our alma mater
– Thousands of voices ring
– Telling of love we bear her
– * [To her] [we] [laurels] [bring].
• From my high school song. Don’t ask how I could remember something like
that.
– * [indirect-object] [subject] [direct-object] [verb]
– [We] [ [bring] [laurels] [to her]].
– [subject] [ [verb] [direct-object] [indirect-object]]
Grammaticality
• More bad poetry; not grammatical:
– Shout on high the ringing praises, loyal strong
and true
– *[Bring] [we] [to our alma mater] [trust and
honor due].
– * [verb] [subject] [indirect-object] [direct-object]
– [We] [ [bring] [trust and honor (that are) due]
[to our alma mater]].
– [subject] [ [verb] [direct-object] [indirect-object]]
Where to draw the line?
• However, many types of sentences that are
found in writing, or are restricted to special
contexts are considered to be grammatical and
even have names:
– Locative Inversion: In this village live many people.
– Topicalization: Sam, I like.
– Heavy NP Shift: I presented to the students many
examples of strange and unusual constructions.
(indirect object comes before direct object because
the direct object is too long)
• These are grammatical.
Grammaticality
• Grammatical:
– In this village live many people.
– I presented to the students many examples of strange
and unusual constructions.
– Sam, I like.
• Not grammatical:
– *To her we laurels bring.
– *Bring we to our alma mater trust and honor due.
– *These are things that I don’t know anyone who says.
– *Who did they see Pat and?
– *We told them to vote for ourselves.
Problems with Grammaticality
• Dialect differences:
– The car needs washed.
• (The car needs to be washed.)
– We go to the movies a lot anymore.
• (We go to the movies a lot these days.)
– I gave it her.
• (I gave it to her.)
– It were me what told her.
• (It was me that told her.)
– Mine is bigger than what yours is.
• (Mine is bigger than yours is.)
– Ain’t no chicken can’t get into no coop.
• (No chicken can get into a coop.)
• (There isn’t a chicken that can get into a coop.)
Problems with grammaticality
• Changes over time:
– (From Kroeger, Chapter 1)
– [With two things] hath [God] [men’s soul]
endowed.
• Normal word order in English before 1100 AD
– I know not what course others may take,…
• Patrick Henry, 1775
Grammaticality: Discrete or
Continuous?
• Manning (2003) Probabilistic Syntax
– *We regard Kim to be an acceptable candidate.
• Consulting native speakers’ judgments.
– Conservatives argue that the Bible regards
homosexuality to be a sin.
• Attested example.
– *Kim turned out doing all the work.
• Consulting native speakers’ judgments.
– But it turned out having a greater impact than any of
us dreamed.
• Attested example.
• Better to ask, “How likely?” than to ask,
“Possible or not?”
Philosophy Lesson:
Rationalism and Empiricism
• Rationalism: the source of knowledge is
reason
• Empiricism: the source of knowledge is
data
Rationalist view of linguistic data
• Language is something in people’s minds – a set
of rules and principles that allows them to make
grammaticality judgments and produce and
understand sentences that they have never
heard before
– i-language or internal language
• We study i-language by asking people to give
grammaticality judgments.
• A corpus (a collection of texts or speech) is
e-language, or external language. It is not the
object of study.
Empiricist view of linguistic data
• Corpora are the objects of study.
• We study language by examining patterns
in corpora (collections of texts or speech).
Why do we need the philosophy lesson?
• In the second half of the 20th century, linguistics was heavily
dominated by rationalism.
• Computational linguistics was also initially dominated by rationalism.
• Rationalism vs empiricism was heavily debated in computational
linguistics in the 1990s.
– Rationalism: people writing grammar rules for a parser
– Empiricism: statistical, corpus-based models
• In current Language Technologies Research, rationalism and
empiricism are often combined.
– Combination: A person choosing linguistic features as input to a machine
learning algorithm, which then learns from the distribution of the features
in a corpus.
– Combination: Syntax-based statistical machine translation.
• Empiricism is gaining ground in linguistics (Manning 2003)
• Linguistics textbooks are still mainly rationalist.
– Empiricism is mentioned in only one footnote in Chapter 1 of the Sag et al.
book.
– But a few years earlier, it would not have been mentioned at all!
Strong points of rationalism
• Infinite, creative capacity: People can produce
and understand sentences that have never been
uttered before. They are not repeating
memorized patterns, but applying productive
rules.
• Leads people to wonder about things that don’t
exist in a corpus: *Who did you see Pat and?
• Probability is not grammaticality: grammatical
sentences may have very low probability.
• Probability reflects facts about the world, but
grammaticality is independent of context.
– Clyde is an African elephant.
– Clyde is a pink elephant.
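The point that probability is not grammaticality can be seen in a minimal sketch of an unsmoothed bigram model. The three-sentence `CORPUS` is invented for illustration, and the helper names are assumptions of the sketch. Under these assumptions a grammatical sentence the model has never seen gets the same probability, zero, as a scrambled string.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
CORPUS = [
    "the student read a book",
    "the student read the book",
    "a student read a book",
]

def bigrams(words):
    """Pair each word with its predecessor, with sentence boundary markers."""
    return list(zip(["<s>"] + words, words + ["</s>"]))

bigram_counts = Counter()
context_counts = Counter()
for line in CORPUS:
    for prev, word in bigrams(line.split()):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1

def probability(sentence: str) -> float:
    """Unsmoothed maximum-likelihood bigram probability of a sentence."""
    p = 1.0
    for prev, word in bigrams(sentence.split()):
        if context_counts[prev] == 0:
            return 0.0  # unseen context: no estimate without smoothing
        p *= bigram_counts[(prev, word)] / context_counts[prev]
    return p

print(probability("the student read a book"))                # greater than 0
print(probability("colorless green ideas sleep furiously"))  # 0.0
print(probability("student the a read book"))                # 0.0
```

Smoothing would give both unseen strings a small nonzero probability, but the estimate would still reflect what happens to be in the corpus rather than whether the string is a sentence of the language.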
Strong points of empiricism
• Frequency of occurrence in a corpus is
easier to measure reliably than a
grammaticality judgment.
• Many ungrammatical sentences turn out to
be acceptable in the right context.
– Identifying the right context turns out to be an
interesting question that does not arise in the
rationalist approach.
• Bresnan et al., 2005, 2007
– I gave her the book.
– I gave the book to her.
Grammaticality in language
technologies
• Real input (especially spoken input) is not
always well-formed, so you should not
build a program that accepts only
grammatical sentences.
• Can we do away with grammar in
language technologies?
Grammaticality in Language
Technologies
• You cannot extract the meaning of a sentence without
processing the grammar:
– Sue interviewed Sam.
– Sam interviewed Sue.
• LT output has to be comprehensible, and therefore,
mostly grammatical:
– Synthesized speech
– An automatically produced translation
– An automatically produced summary
• Error detection programs for computer-assisted
language instruction or for word processing must
distinguish grammatical from ungrammatical sentences.
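The Sue and Sam pair shows why word order has to be processed. In the sketch below, a bag-of-words representation cannot tell the two sentences apart, while even a crude reading of grammatical roles can. The function `roles` assumes a bare three-word Subject-Verb-Object sentence; that is an assumption made for the sketch, not a general parser.

```python
from collections import Counter

def bag_of_words(sentence: str) -> Counter:
    """Word counts with order discarded."""
    return Counter(sentence.lower().rstrip(".").split())

def roles(sentence: str) -> dict:
    """Read roles off word order, assuming a bare Subject-Verb-Object sentence."""
    subject, verb, obj = sentence.rstrip(".").split()
    return {"subject": subject, "verb": verb, "object": obj}

a = "Sue interviewed Sam."
b = "Sam interviewed Sue."

print(bag_of_words(a) == bag_of_words(b))  # True: same words, order lost
print(roles(a) == roles(b))                # False: who interviewed whom differs
```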
In favor of grammaticality
• Probability is not grammaticality:
grammatical sentences may have very low
probability.
• Probability reflects facts about the world,
but grammaticality is independent of
context.
– Clyde is an African elephant.
– Clyde is a pink elephant.