LING/C SC 581: Advanced Computational Linguistics
Lecture Notes, Jan 29th

Today's Topics
• Britney Spears edit distance
• More on n-gram models
• Tregex class exercise
• Tregex homework

Britney Spears edit distance
• Data: many ways to overfit the homework:
  – negative edit distance for doubling the 't'
  – 'a' for 'e' lowers cost
  – etc.
• There is no simple and general formula.

More on n-gram models

Colorless green ideas
• Examples:
  – (1) colorless green ideas sleep furiously
  – (2) furiously sleep ideas green colorless
• Chomsky (1957):
  – "... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
• Idea:
  – (1) is syntactically valid
  – (2) is word salad
• One piece of supporting evidence:
  – (1) is pronounced with normal intonation
  – (2) is pronounced like a list of words

Colorless green ideas
• Sentences:
  – (1) colorless green ideas sleep furiously
  – (2) furiously sleep ideas green colorless
• Statistical experiment (Pereira 2002): a bigram language model, P(wi | wi-1)

Part-of-Speech (POS) Tag Sequence
• Chomsky's example:
  – colorless green ideas sleep furiously
  – JJ JJ NNS VBP RB (POS tags)
• Similar but grammatical example (LSLT pg. 146):
  – revolutionary new ideas appear infrequently
  – JJ JJ NNS VBP RB

Penn Treebank (PTB) Corpus: How is it used?
• One million words of 1989 Wall Street Journal (WSJ) material
• Nearly 50,000 sentences, in 25 sections (0–24)
• Standard practice:
  – Training: sections 2–21 (39,832 sentences, roughly 90%)
  – Evaluation: section 23 (2,416 sentences)

Stanford Parser
• A probabilistic phrase-structure (PS) parser trained on the Penn Treebank.

Penn Treebank (PTB): word frequencies
• Chomsky's words:
  – colorless: 0
  – green: NNP 33, JJ 19, NN 5
  – ideas: NNS 32, NNP 1
  – sleep: VB 5, NN 4, VBP 2
  – furiously: RB 2
• Pereira's words:
  – revolutionary: JJ 6, NNP 2, NN 2
  – new: JJ 1795, NNP 1459, NNPS 2, NN 1
  – ideas: NNS 32
  – appear: VB 55, VBP 41
  – infrequently: 0

Stanford Parser: structure of NPs
• colorless green ideas / revolutionary new ideas

  Phrase            Frequency
  [NP JJ JJ NNS]    1073
  [NP NNP JJ NNS]   61

An experiment
• Examples:
  – (1) colorless green ideas sleep furiously
  – (2) furiously sleep ideas green colorless
• Question: is (1) the most likely permutation of these particular five words?

Parsing data
• Parse all 5! (= 120) permutations of "colorless green ideas sleep furiously ."
• After training on sections 02–21 (approx. 40,000 sentences), the winning sentence was:
  ① furiously ideas sleep colorless green .
  – sleep selects for an ADJP object with 2 heads
  – the adverb (RB) furiously modifies the noun
• The next two highest-scoring permutations were:
  ② Furiously green ideas sleep colorless .   (sleep takes an NP object)
  ③ Green ideas sleep furiously colorless .   (sleep takes an ADJP object)
• (Pereira 2002) compared Chomsky's original minimal pair:
  – #23: colorless green ideas sleep furiously
  – #36: furiously sleep ideas green colorless
  – i.e., ranked 23rd and 36th respectively out of 120

But …
• The graph (next slide) shows how arbitrary these rankings are when the model is trained on randomly chosen sections covering 14K–31K sentences.
• Example: #36 (furiously sleep ideas green colorless) outranks #23 (colorless green ideas sleep furiously), and the top 3, over much of the training space.
• Example: Chomsky's original sentence #23 outranks both the top 3 and #36 only briefly, at one data point.

[Figure: Sentence Rank vs. Amount of Training Data — three plots of rank (y-axis, 0–120) against amount of training data (x-axis, sections 1–21): (i) the best three sentences #1–#3; (ii) #23 vs. #36; (iii) all five together.]

Break

Tregex class exercise
• Other things:
  1. nonstandard associativity rule without round brackets ( … )
  2. @ basic category
  3. Boolean operators: ! (negation), ? (optional), & (and), | (or), [ … ]
  4. segment operator: :
  5. naming a node: =name
  6. groups inside a regex for nodes: /regex/#1%name
• Look at TREEBANK_3/docs/prsguid1.pdf, page 63.
• Search for NP trace relative clauses as defined below:

Tregex homework
• One PDF file.
• Due by email before next class.
• Section 16: Clefts (pp. 268–270)
• Examples from the guide:
  – (Subject) It is the clumsy child who sustains the worst injuries
  – (Direct object) It is this conclusion that we challenge
  – others…
• Do these exist in the PTB WSJ corpus?
• Search for them and categorize by type, e.g. subject/object, etc.
• Supply frequency counts.
  – (Make sure you don't include it-extraposition examples, section 17.)
• Supply your regex search strings.
• How about wh-clefts (16.2)? (More difficult.)
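Returning to the permutation experiment above: the basic idea of ranking all 5! orderings of Chomsky's five words can be sketched with a small language model. This is only an illustrative toy, not Pereira's setup: the three-sentence training "corpus" below is invented, and an add-one-smoothed bigram model stands in for the probabilistic parser trained on PTB sections 2–21.

```python
from itertools import permutations
from collections import Counter
from math import log

# Toy stand-in for treebank training data; these sentences are invented
# for illustration (the real experiment trained on PTB sections 2-21).
corpus = [
    "revolutionary new ideas appear infrequently",
    "new ideas appear slowly",
    "green ideas sleep furiously",
]

BOS, EOS = "<s>", "</s>"
unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    toks = [BOS] + sent.split() + [EOS]
    unigrams.update(toks)
    bigrams.update(zip(toks, toks[1:]))

words = "colorless green ideas sleep furiously".split()
V = len(set(unigrams) | set(words))  # vocabulary size for add-one smoothing

def score(seq):
    """Add-one-smoothed bigram log-probability of a word sequence."""
    toks = [BOS] + list(seq) + [EOS]
    return sum(log((bigrams[a, b] + 1) / (unigrams[a] + V))
               for a, b in zip(toks, toks[1:]))

# Rank all 120 permutations by model score, best first.
ranked = sorted(permutations(words), key=score, reverse=True)
print(len(ranked))          # 120
print(" ".join(ranked[0]))  # top-ranked permutation under this toy model
```

As the slides note, the resulting ranking is highly sensitive to the training data: swapping even one sentence in the toy corpus can reorder the 120 permutations, mirroring the arbitrariness seen when retraining on different treebank sections.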