English Syntax

Read J & M Chapter 9.

Two Kinds of Issues

• Linguistic – what are the facts about language? The rules of syntax (grammar)
• Algorithmic – what are effective computational procedures for dealing with those facts? Building parsers

What is Syntax?

Try 1: the rules for stringing words together to form sentences.

The boys hit the ball. vs. Ball boys hit the the.
I gave Sue a ride to the store vs. I gave Sue ride to store.
I saw the book that Mary had written. vs. I saw the book what Mary had written.

But if that’s all it were, we wouldn’t have to do much for understanding assuming legal input.

What is Syntax?

Try 2: The rules for forming constituents that correspond to meaningful entities.

Example: The cat with the furry tail purred.

Why Do We Care about Syntax?

Morphology → POS Tagging → Syntax → Semantics → Discourse Integration

Generation goes backwards. For this reason, we generally want declarative representations of the facts.

Sometimes We Need it Even if We Don’t Go All the Way

Question answering:
Lawyers whose clients committed fraud
vs. Lawyers who committed fraud
vs. Clients whose lawyers committed fraud

Finding Constituents in Sentences

A constituent is a word or group of words that functions as a unit. How can we discern constituents?

• Semantically: The cat with the furry tail purred.
• What can be chopped out and replaced by a single word?
Agnes purred.
* Agnes tail purred.

Finding Constituents in Sentences, cont’d

• Preposed and postposed constructions:
Early next year I’d like to go to Paris.
I’d like to go to Paris early next year.
I’d like early next year to go to Paris.
* Early I’d like to go to Paris next year.
* I’d like early to go to Paris next year.
* The early next year old man would like to go to Paris.

How Many Kinds of Constituents are There?

Although there may be an infinite number of possible constituent tokens, there’s quite a small number of constituent types, e.g., NP, PP, VP.

On what basis can we group tokens into types?
Occurrence in similar contexts.

How Many Kinds of Constituents are There, cont’d

The cat with the furry tail purred.
Every dog wore a collar.
Most of the children in the room brought a dog with a furry tail and a collar.
The furry tail brought a room.
Every room purred.
A dog with a furry tail and a collar purred.
Mary saw most of the children in the room.

NPs occur as subjects, objects of verbs, and objects of prepositions.

Single Word Constituents

Single word constituents are exactly the parts of speech that we have already considered.

How many of these single word constituent types are there? Look at sizes of tagsets.

Lots of design decisions:
Sue bought the big white house.
* Sue bought the white big house.
Are big and white the same POS?

Simple Constituent Types Don’t Capture Everything

* The cat with a furry tail purred a collar.
Mary imagined a cat with a furry tail.
Mary decided to go.
* Mary decided a cat with a furry tail.
Mary decided a cat with a furry tail would be her next pet.
Mary gave Lucy the food.
* Mary decided Lucy the food.

Subcategorization

Frame | Verb | Example
Ø | eat, sleep, … | I want to eat
NP | prefer, find, leave, … | Find [NP the flight from Pittsburgh to Boston]
NP NP | show, give, … | Show [NP me] [NP airlines with flights from Pittsburgh]
PPfrom PPto | fly, travel, … | I would like to fly [PP from Boston] [PP to Philadelphia]
NP PPwith | help, load, … | Can you help [NP me] [PP with a flight]
VPto | prefer, want, need, … | I would prefer [VPto to go by United airlines]
VPbrst | can, would, might, … | I can [VPbrst go from Boston]
S | mean | Does this mean [S AA has a hub in Boston]?

The Role of the Lexicon in Parsing

• Serves as the starting point for POS tagging.
• Provides additional information such as subcategorization:
• For verbs
• For adjectives:
I’m angry with Mary.
I’m mad at Mary.
I’m angry at Mary.
* I’m mad with Mary.
• For nouns:
Jane has a passion for old movies.
Jane has an interest in old movies.
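As a rough illustration of how a lexicon could supply subcategorization information, the sketch below stores a few of the verb frames listed above and checks whether a verb licenses a given sequence of complements. The names SUBCAT and licenses are hypothetical, the frame labels are plain strings copied from the slides, and the entry for "decide" is only inferred from the grammaticality judgments in the examples above.

```python
# Hypothetical toy lexicon: each verb maps to the set of complement
# frames it accepts, using the frame labels from the slides.
SUBCAT = {
    "eat": {()},
    "sleep": {()},
    "find": {("NP",)},
    "give": {("NP", "NP")},
    "show": {("NP", "NP")},
    "fly": {("PPfrom", "PPto")},
    "help": {("NP", "PPwith")},
    "prefer": {("NP",), ("VPto",)},
    "mean": {("S",)},
}
# Assumption for illustration: "decide" takes a to-VP or a sentence,
# as suggested by the starred and unstarred examples in the slides.
SUBCAT["decide"] = {("VPto",), ("S",)}


def licenses(verb, complements):
    """Return True if the verb's lexical entry allows this complement sequence."""
    return tuple(complements) in SUBCAT.get(verb, set())


print(licenses("give", ["NP", "NP"]))    # Mary gave Lucy the food.
print(licenses("decide", ["NP", "NP"]))  # * Mary decided Lucy the food.
print(licenses("decide", ["S"]))         # Mary decided a cat ... would be her next pet.
```

Keeping the frames in a table like this, rather than in parser code, is one way to keep the facts declarative, so that the same entries could in principle be consulted for generation as well as parsing.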
One Other Barrier to a Small Number of Kinds of Constituents

Agreement

Number agreement:
The boys want to go to the game(s).
* The boy want to go to the game(s).

Case agreement:
I want to give it to him.
* Me want to give it to he.

In English it’s just pronouns, but not so in many other languages.

The Solution – Augmenting the Constituent Types

To solve these and other problems, one strategy is to augment constituent types with other sorts of information:

V         +pl  +[NP NP]
VP/NP/NP  +pl            Show
VP/NP     +pl            Show me
VP        +pl            Show me the book.

Specifying a Language

• The set of sentences in English is large (maybe even infinite).
• We want a concise (i.e., much shorter than a list of sentences) definition of it.
• We have a finite (in fact quite small) set of constituent types (NP, VP, etc.) from which to build our description.

So we appeal to recursion and write grammar rules such as:

S → NP VP
VP → V NP
NP → NP PP
NP → NP S (The boy who went to the store won the game.)
PP → prep NP

A Context-Free Grammar for English

If we ignore:
• subcategorization
• agreement
• gapping

Then we can build a context-free grammar for English that does a pretty good job of:
• generating all and only the acceptable sentences, and of
• building reasonable parse trees for those sentences.

We’ll look at whether English is formally context free later.

Context-Free Grammars

A context-free grammar (CFG) is a 4-tuple:
1. A set of non-terminal symbols $N$
2. A set of terminals $\Sigma$ (disjoint from $N$)
3. A set of productions $P$, each of the form $A \rightarrow \alpha$, where $A$ is a non-terminal and $\alpha$ is a string of symbols from the infinite set of strings $(\Sigma \cup N)^*$
4. A designated start symbol $S$

In our grammar of English:
• $\Sigma$ is the set of POS, and
• $N$ is the set of remaining constituent types, e.g., NP, VP, PP

Derivations Using CFGs

The standard formal definition: the language $L_G$ generated by a grammar $G$ is the set of strings composed of terminal symbols which can be derived from the designated start symbol $S$.
$L_G = \{ w \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w \}$

But we won’t generally want our grammar to have to go all the way down to words. We want to let the lexicon do that. That’s why we let $\Sigma$ be the set of POS. So the grammar may generate strings such as:

N V Det N

Derivations Using CFGs

So we will use the following definition:

$L_G = \{ s \mid w \in \Sigma^* \text{ and } S \Rightarrow^* w \text{ and } s \text{ can be derived from } w \text{ by substituting words for POS as licensed by the lexicon} \}$

Note that this doesn’t change the formal picture. We could instead augment our grammar with tens of thousands of rules of the form:

N → phlogiston

This is a system design decision.

Context-Free Grammars and Parse Trees

S → NP VP
NP → Name
NP → Det N
VP → V NP

Parse tree for "John ate the pizza", in bracketed form:

(S (NP (NAME John)) (VP (V ate) (NP (ART the) (N pizza))))

Long Distance Dependencies

Who did she say she saw ____ coming down the hill?
She did say she saw who coming down the hill.

The boy she saw coming down the road was crying.
The boy she saw _____ coming down the road was crying.

Long Distance Dependencies – A Linguistic Solution

Transformational Grammar (Chomsky, 1965):
• A context free grammar generates base forms
• A transformational component moves constituents around and may delete them from the surface form.

But how can we run these rules backwards?

This approach went out of fashion at least 20 years ago.

Long Distance Dependencies – Computational Solutions

• Augmented Transition Networks: Allow arbitrary actions on the arcs. These permit insertions and movements of constituents. But any procedural solution won’t be reversible for generation.
• Unification systems: Declarative patterns for assigning constituents to fill subcategorization slots.

Spoken Language Syntax

Speech is collected in utterances rather than in text.

Spoken language is looser than written language, with more pauses, ‘nonverbal events’, and disfluencies such as er, uh, um.
[Figure: sample spoken language utterances from users interacting with ATIS]

Spoken Language Syntax

The repair often has the same structure as the constituent immediately before the interruption point.
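Returning to the context-free grammar and parse tree slides: the sketch below shows one possible way to run the four-rule grammar (S → NP VP, NP → Name, NP → Det N, VP → V NP) over a sentence, with the lexicon mapping words to POS as the slides propose. It is a naive top-down backtracking parser, not the method of any particular system; GRAMMAR, LEXICON, parse, show and parse_sentence are hypothetical names, and the approach would loop forever on left-recursive rules such as NP → NP PP.

```python
# Grammar from the slides: nonterminals rewrite to sequences of
# nonterminals and POS tags (the terminals of the grammar).
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Name"], ["Det", "N"]],
    "VP": [["V", "NP"]],
}

# Hypothetical toy lexicon mapping each word to a single POS tag.
LEXICON = {"John": "Name", "ate": "V", "the": "Det", "pizza": "N"}


def parse(symbol, words, start):
    """Yield (tree, next_position) for each way symbol can cover words from start."""
    if symbol not in GRAMMAR:  # symbol is a POS tag: consult the lexicon
        if start < len(words) and LEXICON.get(words[start]) == symbol:
            yield (symbol, words[start]), start + 1
        return
    for rhs in GRAMMAR[symbol]:
        # Expand the right-hand side left to right, keeping every partial match.
        partial = [([], start)]
        for child in rhs:
            partial = [
                (kids + [tree], end)
                for kids, pos in partial
                for tree, end in parse(child, words, pos)
            ]
        for kids, end in partial:
            yield (symbol, *kids), end


def show(tree):
    """Render a tree in bracketed form."""
    label, *rest = tree
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"({label} {rest[0]})"
    return f"({label} " + " ".join(show(t) for t in rest) + ")"


def parse_sentence(sentence):
    """Return bracketed trees for every complete parse of the sentence."""
    words = sentence.split()
    return [show(t) for t, end in parse("S", words, 0) if end == len(words)]


print(parse_sentence("John ate the pizza"))
```

Because the rules and the lexicon are plain data, the words could be swapped without touching the grammar, which is the system design decision the slides describe.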