Parsing with Context-Free Grammars for ASR Julia Hirschberg CS 4706 Slides with contributions from Owen Rambow, Kathy McKeown, Dan Jurafsky and James Martin What is Syntax? • Structure of language • How words are arranged together and related to one another • Goal of syntactic analysis: relate surface form (what someone says or writes) to underlying structure, to support semantic analysis (what the utterance or text means) • Syntactic representation: typically a tree structure Structure in Strings • A set of words, or, a lexicon: the a small nice big very boy girl sees likes • Some `good’ (grammatical) sentences: – the boy likes a girl – the small girl likes the big girl – a very small nice boy sees a very nice boy • Some bad (ungrammatical) sentences: – *the boy the girl – *small boy likes nice girl • Can we find a way of distinguishing between the two kinds of sequences? • Can we identify similarities among grammatical subsequences? One Version of Constituent Structure • Lexicon: the a small nice big very boy girl sees likes • Grammatical sentences: – (the) boy (likes a girl) – (the small) girl (likes the big girl) – (a very small nice) boy (sees a very nice boy) • Ungrammatical sentences: – *(the) boy (the girl) – *(small) boy (likes the nice girl) Another Constituency Hypothesis • Lexicon: the a small nice big very boy girl sees likes • Grammatical sentences: – (the boy) likes (a girl) – (the small girl) likes (the big girl) – (a very small nice boy) sees (a very nice boy) • Ungrammatical sentences: – *(the boy) (the girl) – *(small boy) likes (the nice girl) • Better: fewer types of constituents (blue and red are of same type) Even More Structures • Lexicon: the a small nice big very boy girl sees likes • Grammatical sentences: – ((the) boy) likes ((a) girl) – ((the) (small) girl) likes ((the) (big) girl) – ((a) ((very) small) (nice) boy) sees ((a) ((very) nice) girl) • Ungrammatical sentences: – *((the) boy) ((the) girl) – *((small) boy) likes ((the) (nice) girl) From Substrings to Trees • (((the) boy) likes ((a) girl)) boy the likes a girl How do we Label the Nodes? • ( ((the) boy) likes ((a) girl) ) • Choose constituents so each one has one non-bracketed word: the head • Group words by distribution of constituents they head (POS) – Noun (N), verb (V), adjective (Adj), adverb (Adv), determiner (Det) • Category of constituent: XP, where X is POS – NP, S, AdjP, AdvP, DetP Types of Nodes • (((the/Det) boy/N) likes/V ((a/Det) girl/N)) nonterminal symbols = constituents S NP DetP the boy likes NP DetP Phrase-structure tree girl a terminal symbols = words Determining Part-of-Speech A blue seat/a child seat: noun or adjective? – Syntax: • a blue seat a child seat • a very blue seat *a very child seat • this seat is blue *this seat is child – Morphology: • bluer *childer – blue and child are not the same POS – blue is Adj, child is Noun Determining Part-of-Speech – Preposition or particle? • • • • A B A B he threw out the garbage he threw the garbage out the door he threw the garbage out *he threw the garbage the door out – The two out are not same POS • A is particle, B is Preposition Constituency • Some Noun phrases (NPs) • A red dog on a blue tree • A blue dog on a red tree • Some big dogs and some little dogs • A dog •I • Big dogs, little dogs, red dogs, blue dogs, yellow dogs, green dogs, black dogs, and white dogs • How do we know these form a constituent? NP Constituency • NPs can all appear before a verb: – Some big dogs and some little dogs are going around in cars… – Big dogs, little dogs, red dogs, blue dogs, yellow dogs, green dogs, black dogs, and white dogs are all at a dog party! – I do not • But individual words can’t always appear before verbs: – *little are going… – *blue are… – *and are • Must be able to state generalizations like: – Noun phrases occur before verbs PP Constituency • Preposing and postposing: – Under a tree is a yellow dog. – A yellow dog is under a tree. • But not: – *Under, is a yellow dog a tree. – *Under a is a yellow dog tree. • Prepositional phrases notable for ambiguity in attachment – I saw a man on a hill with a telescope. Context-Free Grammars • Defined in formal language theory – Terminals: e.g. cat – Non-terminal symbols: e.g. NP, VP – Start symbol: e.g. S – Rewriting rules: e.g. S NP VP • Start with start symbol, rewrite using rules, done when only terminals left A Fragment of English S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the Input: the cat is on the mat Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the S S Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the NP VP S NP VP Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the DetP N VP S NP DetP VP N Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the the cat VP S NP VP DetP N the cat Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the the cat V PP S NP VP DetP N the cat V PP Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the the cat is Prep NP S NP VP DetP N V the cat is PP Prep NP Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the the cat is on Det N S NP VP DetP N V the cat is PP Prep NP on DetP N Derivations in a CFG S NP VP VP V PP NP DetP N N cat | mat V is PP Prep NP Prep on DetP the the cat is on the mat S NP VP DetP N V the cat is PP Prep NP on DetP the N mat A More Complicated Fragment of English • • • • • • • • • S NP VP S VP VP V PP VP V NP VP V NP DetP NP NP N NP NP N PP Prep NP Mary likes the cat bowl. The cat ate the tasty food. Hello. Nice talking to you. • N cat | mat | food | bowl | Mary • V is | likes | sits • Prep on | in | under • DetP the | a Pocket Sphinx Grammar Format • Variables go in angle brackets, e.g. <city> • Terminals must appear in your pronunciation dictionary (case sensitive) • X Y is concatenation -- e.g., I WANT • (X | Y) means X or Y -- e.g., (WANT|NEED) • Square brackets mean optional -- e.g., [ON] FRIDAY • * means that the expansion may be spoken zero or more times -- e.g. <digit>* • + means one or more times -- e.g. <digit>+ Example • <city> = BOSTON | NEWYORK | WASHINGTON | BALTIMORE; • <time> = MORNING | EVENING; • <day> = FRIDAY | MONDAY; • public <query> = (((WHAT TRAINS LEAVE) | (WHAT TIME CAN I TRAVEL) | (IS THERE A TRAIN)) (FROM|TO) <city> [(FROM|TO) <city>] ON <day> [<time>]); Hello. No. I want to go on Tuesday. When does the train leave? Next Class • Language modeling for large vocabulary applications: Ngrams