How do linguists study grammar?
Lori Levin
11-721: Grammars and Lexicons
August 29, 2007

Outline
• Views of language:
  – Prescriptive
  – Artistic
  – Descriptive
• Claims about knowledge of a language:
  – Unconscious
  – Complex
  – Systematic
  – Can be studied scientifically
• A research tool: grammaticality judgments
  – What is grammaticality?
  – Problems with grammaticality
  – Rationalism vs empiricism
  – Why should language technologists care about grammaticality?

Prescriptive and Descriptive Linguistics
• Natural phenomena cannot be legislated, just described.
  – You can’t declare the value of π to be 3.
  – Sag, Wasow, and Bender, page 1
• Social phenomena can be legislated.
• Language use can be legislated as a social phenomenon, but it can also be studied as a natural phenomenon.

Prescriptive view of language
• Rules about how language should be used
  – Don’t say “Me and him went to the movies.”
  – It doesn’t make sense because you can’t say “Me went to the movies.”
• Focus on isolated phenomena that are thought to be corruptions of the language.
  – “Everybody should do their homework.”
• Some people speak correctly and others don’t.
• Rules are something that you are aware of.

Artistic View of Language
• Language can be used creatively to make literature and poetry.
• Some people are better at it than others.
• Language is not systematic and rule governed.

Descriptive view of language
• Study language as a natural phenomenon
  – People say “Me and him went to the movies.”
  – That’s interesting because they don’t say “Me went to the movies.”
• Focus on all aspects of language, even very normal sentences.
• Every native speaker of a language speaks equally well.
  – Unless there is an injury or an illness that affects certain parts of the brain or speech producing organs.
• Language consists of systematic knowledge that the speakers are not aware of.

Knowledge of Language
• “Every normal speaker of any natural language has acquired an immensely rich and systematic body of unconscious knowledge, which can be investigated by consulting speakers’ intuitive judgments.”
• “Languages are objects of considerable complexity, which can be studied scientifically. That is, we can formulate hypotheses about linguistic structure and test them against the facts of particular languages.”
  – Sag et al., page 2
• (Slide callouts mark four claims in these passages: Claim 1, Claim 2, Claim 3, Claim 4.)

Chomsky, 1957 on testable hypotheses
The search for rigorous formulation in linguistics has a much more serious motivation than mere concern for logical niceties or the desire to purify well-established methods of linguistic analysis. Precisely constructed models for linguistic structure can play an important role, both negative and positive, in the process of discovery itself. By pushing a precise but inadequate formulation to an unacceptable conclusion, we can often expose the exact source of the inadequacy and, consequently, gain a deeper understanding of the linguistic data. More positively, a formalized theory may automatically provide solutions for many problems other than those for which it was explicitly designed. Obscure and intuition-bound notions can neither lead to absurd conclusions nor provide new and correct ones, and hence they fail to be useful in two important respects.
(Noam Chomsky has been the most influential linguist in many parts of the world since 1957. You may have also heard his name associated with politics.)

“Immensely rich and systematic body of unconscious knowledge”
• They saw Pat and Chris.
• They saw Pat with Chris.
• Who did they see Pat with?
• *Who did they see Pat and?
  – Has anyone ever had to tell you not to say this?

Testable hypotheses about linguistic knowledge
• *We like us.
• We like ourselves.
• She likes her. (She ≠ her)
• She likes herself.
• Nobody likes us.
• *Leslie likes ourselves.
• *Ourselves like us.
• *Ourselves like ourselves.

Testable hypotheses
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:

Counter-examples
• We think that Leslie likes us.
• *We think that Leslie likes ourselves.
• *We think that ourselves like Leslie.

New Hypothesis
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:
• (This is an English rule. Many languages do not follow it.)

Support for the new hypothesis
• We think that she voted for her. (she ≠ her)
• We think that she voted for herself.
• *We think that herself voted for her.
• *We think that herself voted for herself.

Counter-examples
• Our friends like us.
• *Our friends like ourselves.
• Those pictures of us offended us.
• *Those pictures of us offended ourselves.

New Hypothesis
• Use a reflexive pronoun only when:
• Use a regular pronoun only when:

Counter-examples
• Vote for us.
• *Vote for ourselves.
• *Vote for you.
• Vote for yourselves.

Counter-examples
• We appealed to them to vote for themselves.
• We appealed to them to vote for them.
  – Them ≠ them
• We appealed to them to vote for us.
• *We appealed to them to vote for ourselves.
• *We appeared to them to vote for themselves.
• We appeared to them to vote for them.
  – Them = them
• *We appeared to them to vote for us.
• We appeared to them to vote for ourselves.
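
The cycle of proposing a hypothesis and then looking for counter-examples in the slides above can be sketched in code. This is only an illustration under stated assumptions: the slides leave the hypotheses blank, so `first_hypothesis` below is one candidate rule chosen for the sketch (a reflexive is used if and only if an earlier expression in the sentence has the same referent), and the `has_antecedent` values are hand annotations, not something computed from the text. The names `Judgment`, `first_hypothesis`, and `counterexamples` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Judgment:
    sentence: str
    reflexive: bool       # is the pronoun under test a reflexive form?
    has_antecedent: bool  # hand annotation: does an earlier expression in
                          # the sentence refer to the same individual(s)?
    grammatical: bool     # judgment taken from the slides (False = starred)


def first_hypothesis(j: Judgment) -> bool:
    """Candidate rule (an assumption for this sketch): a reflexive is
    predicted to be grammatical exactly when an antecedent precedes it,
    and a regular pronoun exactly when no antecedent precedes it."""
    return j.reflexive == j.has_antecedent


def counterexamples(hypothesis: Callable[[Judgment], bool],
                    data: List[Judgment]) -> List[str]:
    """Return the sentences whose judgment disagrees with the prediction."""
    return [j.sentence for j in data if hypothesis(j) != j.grammatical]


# Single-clause examples from the slides.
INITIAL_DATA = [
    Judgment("We like us.", False, True, False),
    Judgment("We like ourselves.", True, True, True),
    Judgment("Nobody likes us.", False, False, True),
    Judgment("Leslie likes ourselves.", True, False, False),
]

# Two-clause examples from the first counter-example slide.
MORE_DATA = [
    Judgment("We think that Leslie likes us.", False, True, True),
    Judgment("We think that Leslie likes ourselves.", True, True, False),
]

print(counterexamples(first_hypothesis, INITIAL_DATA))
print(counterexamples(first_hypothesis, INITIAL_DATA + MORE_DATA))
```

Under these annotations the candidate rule fits the single-clause data and fails on both two-clause sentences, which is the kind of result that motivates revising the hypothesis.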

“The theoretical machinery required for a viable grammatical analysis could be quite abstract.”
  – Sag et al., page 6

Grammaticality Judgments as a scientific tool for collecting data
• What is grammaticality?
• What are some problems in using it as a tool for collecting data?
• Grammaticality vs corpus analysis

One more claim:
• It is also possible to make testable hypotheses about how languages differ and what they have in common.

Investigate hypotheses by consulting native speakers’ intuitions
• Many linguists (probably a majority) assume that people can distinguish strings of words that are sentences of their language from strings of words that are not sentences of their language.
  – So imagine that you are a machine or a classifier that takes a sentence as input, and returns “accept” or “reject” as output.

Native speakers as automata that accept and reject strings of words.
• The student read a book.
• Student the a read book.

Grammaticality
• A string of words that you recognize as a sentence in your native language is grammatical.
• A string of words that you do not recognize as a sentence in your native language is ungrammatical.
• When you decide whether a sentence is grammatical or ungrammatical, this is called giving a grammaticality judgment.
• Ungrammatical sentences are preceded by an asterisk or star (*). Sometimes they are called starred sentences.
• If native speakers can’t decide whether the sentence is grammatical or ungrammatical, it is preceded by a combination of stars and question marks.

Grammaticality: Descriptive
• When you give a grammaticality judgment, you are not supposed to judge whether the sentence is the most elegant or appropriate, just whether it is a sentence of your language or not.
• You may have a stylistic preference for one of these, but they are all grammatical.
  – These are things you never want to hear.
  – These are things you want never to hear.
  – These are things you want to never hear.

Grammatical ≠ meaningful
  – It is unlikely that Pat will succeed.
  – It is improbable that Pat will succeed.
  – Pat is unlikely to succeed.
  – *Pat is improbable to succeed.
• This could be meaningful, but most people consider it to be ungrammatical.
  – They saw Pat with Chris.
  – They saw Pat and Chris.
  – Who did they see Pat with?
  – *Who did they see Pat and?
• Again, this could be meaningful, but it is ungrammatical.

Syntactically well-formed vs semantically well-formed
• Colorless green ideas sleep furiously.
  – Syntactically well-formed
  – Chomsky, 1957
• Colorless sleep green furiously ideas.
  – Not syntactically well-formed

Grammaticality: Where to draw the line?
• Sentences that are understandable, but sound like mistakes are probably not grammatical.
  – *These are things that I don’t know anyone who says.

Where to draw the line?
• Sentences of bad poetry are not grammatical.
• Strange word order in order to make lines rhyme.
  – Fame to our alma mater
  – Thousands of voices ring
  – Telling of love we bear her
  – * [To her] [we] [laurels] [bring].
• From my high school song.
  Don’t ask how I could remember something like that.
  – * [indirect-object] [subject] [direct-object] [verb]
  – [We] [ [bring] [laurels] [to her]].
  – [subject] [ [verb] [direct-object] [indirect-object]]

Grammaticality
• More bad poetry; not grammatical:
  – Shout on high the ringing praises, loyal strong and true
  – *[Bring] [we] [to our alma mater] [trust and honor due].
  – * [verb] [subject] [indirect-object] [direct-object]
  – [We] [ [bring] [trust and honor (that are) due] [to our alma mater]].
  – [subject] [ [verb] [direct-object] [indirect-object]]

Where to draw the line?
• However, many types of sentences that are found in writing, or are restricted to special contexts, are considered to be grammatical and even have names:
  – Locative Inversion: In this village live many people.
  – Topicalization: Sam, I like.
  – Heavy NP Shift: I presented to the students many examples of strange and unusual constructions. (The indirect object comes before the direct object because the direct object is too long.)
• These are grammatical.

Grammaticality
• Grammatical:
  – In this village live many people.
  – I presented to the students many examples of strange and unusual constructions.
  – Sam, I like.
• Not grammatical:
  – *To her we laurels bring.
  – *Bring we to our alma mater trust and honor due.
  – *These are things that I don’t know anyone who says.
  – *Who did they see Pat and?
  – *We told them to vote for ourselves.

Problems with Grammaticality
• Dialect differences:
  – The car needs washed.
    • (The car needs to be washed.)
  – We go to the movies a lot anymore.
    • (We go to the movies a lot these days.)
  – I gave it her.
    • (I gave it to her.)
  – It were me what told her.
    • (It was me that told her.)
  – Mine is bigger than what yours is.
    • (Mine is bigger than yours is.)
  – Ain’t no chicken can’t get into no coop.
    • (No chicken can get into a coop.)
    • (There isn’t a chicken that can get into a coop.)
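
The picture of a native speaker as a machine that returns “accept” or “reject”, combined with the bracketed constituent orders from the school-song examples above, can be sketched as a toy acceptor. This is a deliberately crude illustration, not a model of English: the set `LICENSED_ORDERS` and the function `judge` are hypothetical names, only two orders are listed, and the condition that Heavy NP Shift needs a long direct object is not modeled.

```python
from typing import List, Tuple

# Constituent orders this toy acceptor treats as licensed (an assumption
# for illustration; real English allows many more patterns, and the
# shifted order is really restricted to long direct objects).
LICENSED_ORDERS = {
    ("subject", "verb", "direct-object", "indirect-object"),
    ("subject", "verb", "indirect-object", "direct-object"),  # Heavy NP Shift
}


def judge(constituents: List[Tuple[str, str]]) -> str:
    """Return 'accept' or 'reject' for a list of (label, words) pairs.

    Only the order of the labels is inspected; the words are ignored.
    """
    order = tuple(label for label, _ in constituents)
    return "accept" if order in LICENSED_ORDERS else "reject"


plain = [("subject", "We"), ("verb", "bring"),
         ("direct-object", "laurels"), ("indirect-object", "to her")]

song_line = [("indirect-object", "To her"), ("subject", "we"),
             ("direct-object", "laurels"), ("verb", "bring")]

heavy_shift = [("subject", "I"), ("verb", "presented"),
               ("indirect-object", "to the students"),
               ("direct-object",
                "many examples of strange and unusual constructions")]

for example in (plain, song_line, heavy_shift):
    print(" ".join(words for _, words in example), "->", judge(example))
```

A lookup table like this gives a strictly discrete answer for every input, which is the assumption that the dialect, language-change, and gradience problems discussed in these slides call into question.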

Problems with grammaticality
• Changes over time:
  – (From Kroeger, Chapter 1)
  – [With two things] hath [God] [men’s soul] endowed.
    • Normal word order in English before 1100 AD
  – I know not what course others may take,…
    • Patrick Henry, 1775

Grammaticality: Discrete or Continuous?
• Manning (2003) Probabilistic Syntax
  – *We regard Kim to be an acceptable candidate.
    • Consulting native speakers’ judgments.
  – Conservatives argue that the Bible regards homosexuality to be a sin.
    • Attested example.
  – *Kim turned out doing all the work.
    • Consulting native speakers’ judgments.
  – But it turned out having a greater impact than any of us dreamed.
    • Attested example.
• Better to ask, “How likely?” than to ask, “Possible or not?”

Philosophy Lesson: Rationalism and Empiricism
• Rationalism: the source of knowledge is reason
• Empiricism: the source of knowledge is data

Rationalist view of linguistic data
• Language is something in people’s minds
  – a set of rules and principles that allows them to make grammaticality judgments and produce and understand sentences that they have never heard before
  – i-language or internal language
• We study i-language by asking people to give grammaticality judgments.
• A corpus (a collection of texts or speech) is e-language, or external language. It is not the object of study.

Empiricist view of linguistic data
• Corpora are the objects of study.
• We study language by examining patterns in corpora (collections of texts or speech).

Why do we need the philosophy lesson?
• In the second half of the 20th century, linguistics was heavily dominated by rationalism.
• Computational linguistics was also initially dominated by rationalism.
• Rationalism/empiricism was heavily debated in computational linguistics in the 1990s.
  – Rationalism: people writing grammar rules for a parser
  – Empiricism: statistical, corpus-based models
• In current Language Technologies Research, rationalism and empiricism are often combined.
  – Combination: A person choosing linguistic features as input to a machine learning algorithm, which then learns from the distribution of the features in a corpus.
  – Combination: Syntax-based statistical machine translation.
• Empiricism is gaining ground in linguistics (Manning 2003)
• Linguistics textbooks are still mainly rationalist.
  – Empiricism is mentioned only in one footnote in Chapter 1 of the Sag et al. book.
  – But a few years earlier, it would not have been mentioned at all!

Strong points of rationalism
• Infinite, creative capacity: People can produce and understand sentences that have never been uttered before. They are not repeating memorized patterns, but applying productive rules.
• Leads people to wonder about things that don’t exist in a corpus: *Who did you see Pat and?
• Probability is not grammaticality: grammatical sentences may have very low probability.
• Probability reflects facts about the world, but grammaticality is independent of context.
  – Clyde is an African elephant.
  – Clyde is a pink elephant.

Strong points of empiricism
• Frequency of occurrence in a corpus is easier to measure reliably than a grammaticality judgment.
• Many ungrammatical sentences turn out to be acceptable in the right context.
  – Identifying the right context turns out to be an interesting question that does not arise in the rationalist approach.
• Bresnan et al., 2005, 2007
  – I gave her the book.
  – I gave the book to her.

Grammaticality in language technologies
• Real input (especially spoken input) is not always well-formed, so you should not build a program that accepts only grammatical sentences.
• Can we do away with grammar in language technologies?

Grammaticality in Language Technologies
• You cannot extract the meaning of a sentence without processing the grammar:
  – Sue interviewed Sam.
  – Sam interviewed Sue.
• LT output has to be comprehensible and therefore mostly grammatical:
  – Synthesized speech
  – An automatically produced translation
  – An automatically produced summary
• Error detection programs for computer-assisted language instruction or for word processing must distinguish grammatical from ungrammatical sentences.

In favor of grammaticality
• Probability is not grammaticality: grammatical sentences may have very low probability.
• Probability reflects facts about the world, but grammaticality is independent of context.
  – Clyde is an African elephant.
  – Clyde is a pink elephant.
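
The point that probability is not grammaticality can be made concrete with a small sketch. Everything here is an assumption for illustration: `CORPUS` is an invented four-sentence toy corpus (not real data), and `sentence_probability` is a hypothetical helper that scores a sentence with an add-one smoothed bigram model, which is only one of many ways to assign probabilities.

```python
from collections import Counter
from typing import List, Tuple

# Invented toy corpus (an assumption for illustration, not real data).
CORPUS = [
    "clyde is an african elephant",
    "jumbo is an african elephant",
    "clyde is an elephant",
    "clyde is a pink elephant",
]


def bigrams(tokens: List[str]) -> List[Tuple[str, str]]:
    """Pair each token with its successor, padding with boundary markers."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


bigram_counts: Counter = Counter()
context_counts: Counter = Counter()
vocab = {"</s>"}  # the end marker is a possible next token too

for line in CORPUS:
    tokens = line.split()
    vocab.update(tokens)
    for prev, cur in bigrams(tokens):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1


def sentence_probability(sentence: str) -> float:
    """Add-one smoothed bigram probability of a whitespace-tokenized sentence."""
    p = 1.0
    for prev, cur in bigrams(sentence.split()):
        p *= (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))
    return p


for s in ("clyde is an african elephant",
          "clyde is a pink elephant",
          "elephant pink a is clyde"):
    print(f"{s!r}: {sentence_probability(s):.3e}")
```

With this toy corpus, both Clyde sentences are grammatical, yet the pink-elephant sentence scores lower simply because pink elephants are mentioned less often. The scrambled string still receives a nonzero score under smoothing, so the model by itself draws no line between grammatical and ungrammatical strings.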