Speech and Language Processing: Where have we been and where are we going?
Kenneth Ward Church
church@microsoft.com
Brno 2004

Where have we been? How To Cook A Demo
(After Dinner Talk at TMI-1992 & Invited Talk at TMI-2002)
• Great fun!
• Effective demos
• Message for the After Dinner Talk:
– Theater, theater, theater
– Production quality matters
– Entertainment >> evaluation
– Strategic vision >> technical correctness
• Success/Catastrophe: message for the After Breakfast Talk:
– Warning: demos can be too effective
– It is dangerous to raise unrealistic expectations

Let's go to the video tape! (Lesson: manage expectations)
• Lots of predictions
– Entertaining in retrospect
– Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc.
1. Machine Translation (1950s) video
– Classic example of a demo embarrassment in retrospect
2. Translating telephone (late 1980s) video
– Pierre Isabelle pulled a similar demo because it was so effective
– The limitations of the technology were hard to explain to the public
• Though well understood by the research community
3. Apple (~1990) video
– Still having trouble setting appropriate expectations
– Factoid: the day of this demo, speech recognition was deployed at scale in the AT&T network, with significant lasting impact but little media attention
4. Andy Rooney (~1990): reset expectations video

Charles Wayne's Challenge: Demonstrate Consistent Progress Over Time (Managing Expectations)
• Controversial in the 1980s
– But not in the 1990s, though there was grumbling
• Benefits:
1. Agreement on what to do
2. Limits endless discussion
3. Helps sell the field
• Manage expectations
• Fund raising
• Risks (similar to the benefits):
1. All our eggs are in one basket (lack of diversity)
2. Not enough discussion
• Hard to change course
3. Methodology burden

Hockey Stick Business Case
[chart: $ v. t, for 2003 (last year), 2004 (this year), 2005 (next year)]

Moore's Law: Ideal Answer
Where have we been and where are we going?
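The 10x-per-decade "time constant" quoted on the error-rate slide invites a simple extrapolation. Here is a minimal sketch of that arithmetic; the starting error rate and years are invented for illustration, not NIST data:

```python
# Toy extrapolation of benchmark error rates under a "Moore's Law for
# speech": 10x error reduction per decade. Starting point is invented;
# real NIST benchmark histories are messier than this.
def extrapolate(error0, year0, year):
    """Error rate after (year - year0) years at 10x improvement per decade."""
    return error0 * 10 ** (-(year - year0) / 10)

e = extrapolate(40.0, 1994, 2004)   # hypothetical 40% WER in 1994
print(round(e, 2))  # → 4.0 (one decade later)
```

The talk's point, of course, is that this kind of extrapolation is defensible for a periodic, continuous trend, but hard to justify for milestone-style progress.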
Error Rate (Borrowed Slide: Audrey Le, NIST)
• Moore's Law time constant: 10x improvement per decade
[chart: error rate v. date, spanning 15 years]

Milestones in Speech and Multimodal Technology Research (Borrowed Slide)
• 1962: Small vocabulary, acoustic phonetics-based; isolated words; filter-bank analysis, time normalization, dynamic programming
• 1967-1972: Medium vocabulary, template-based; isolated words, connected digits, continuous speech; pattern recognition, LPC analysis, clustering algorithms, level building
• 1977-1982: Large vocabulary, statistical-based; connected words, continuous speech; hidden Markov models, stochastic language modeling
• 1987-1992: Large vocabulary; syntax, semantics; continuous speech, speech understanding; stochastic language understanding, finite-state machines, statistical learning
• 1997-2002: Very large vocabulary; semantics, multimodal dialog, TTS; spoken dialog, multiple modalities; concatenative synthesis, machine learning, mixed-initiative dialog
• Consistent improvement over time, but unlike Moore's Law, hard to extrapolate (predict the future)

Speech-Related Technologies: Where will the field go in 10 years? (Niels Ole Bernsen, ed.)
• 2003: Useful speech recognition-based language tutor
• 2003: Useful portable spoken sentence translation systems
• 2003: First pro-active spoken dialogue with situation awareness
• 2004: Satisfactory spoken car navigation systems
• 2005: Small-vocabulary (> 1000 words) spoken conversational systems
• 2006: Multiple-purpose personal assistants (spoken dialog, animated characters)
• 2006: Task-oriented spoken translation systems for the web
• 2006: Useful speech summarization systems in top languages
• 2008: Useful meeting summarization systems
• 2010: Medium-size vocabulary conversational systems

Where have we been and where are we going?
[chart: managing expectations (extrapolation/prediction not applicable) v. consistent progress over time (extrapolation/prediction applicable); $ v. t, 2002-2004]

Outline
1. We're making consistent progress, or  ← We are here
2. We're running around in circles, or
3. We're going off a cliff…

It has been claimed that recent progress was made possible by Empiricism.
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
– Dominating a broad set of fields
• Ranging from psychology (Behaviorism)
• To electrical engineering (Information Theory)
– Psycholinguistics: word frequency norms (correlated with reaction time, errors)
• Word association norms (priming): bread and butter, doctor/nurse
– Linguistics/psycholinguistics: focus on distribution (a correlate of meaning)
• Firth: "You shall know a word by the company it keeps"
• Collocations: strong tea v. powerful computers
• 1970s: Rationalism was at its peak
– With Chomsky's criticism of ngrams in Syntactic Structures (1957)
– And Minsky and Papert's criticism of neural networks in Perceptrons (1969)
• 1990s: Revival of Empiricism
– Availability of massive amounts of data (a popular argument, even before the web)
• "More data is better data"
• Quantity >> Quality (balance)
– Pragmatic focus:
• What can we do with all this data?
• Better to do something than nothing at all
– Empirical methods (and a focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
• Annotations: periodic signals are continuous, and support extrapolation/prediction. Progress? Consistent progress? Extrapolation/prediction: applicable?

Has the pendulum swung too far? (Speech → Language)
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility that the pendulum might swing the other way
[margin note: Plays well at Machine Translation conferences]
• We ought to be preparing students with a broad education, including:
– Statistics and Machine Learning
– As well as Linguistic Theory
• History repeats itself (Mark Twain): a bad idea then is still a bad idea now
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
• Grandparents and grandchildren have a natural alliance

Rationalism v. Empiricism
• Well-known advocates: Chomsky, Minsky v. Shannon, Skinner, Firth, Harris
• Model: competence model v. noisy channel model
• Contexts of interest: phrase structure v. n-grams
• Goals: all and only v. minimize prediction error (entropy); explanatory v. descriptive; theoretical v. applied
• Linguistic generalizations: agreement & wh-movement v. collocations & word associations
• Parsing strategies: principle-based, CKY (chart), ATNs, unification v. Forward-Backward (HMMs), Inside-Outside (PCFGs)
• Applications: understanding (who did what to whom) v. recognition (noisy channel applications)

Revival of Empiricism: A Personal
Perspective
• As a student at MIT, I was solidly opposed to empiricism
– But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-sound rules (speech synthesis): letter-to-sound rules + dictionary
– Names (~1985): letter stats → etymology → pronunciation (video)
• Part of speech tagging (1988)
• Word associations (Hanks)
– Corpus-based lexicography: empirical, but not statistical
– Case-based reasoning: the best inference is table lookup
• Collocations (lexicography): strong tea v. powerful computers
• Word associations: bread and butter, doctor/nurse
– Contribution: adding stats
• Mutual information → collocations & word associations
• Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)
• Good-Turing smoothing (Gale): statistics
– Estimate the probability of something you haven't seen (whales)
• Aligning parallel corpora: inspired by Machine Translation (MT)
• Word sense disambiguation (river bank v. money bank)
– Bilingual → monolingual (Yarowsky)
• Even if IBM's stat-based approach fails for Machine Translation → lasting benefit (tools, linguistic resources, academic contributions to machine learning)
[margin note: Played well at TMI-2002]

Shannon's Noisy Channel Model (Speech → Language)
• I → Noisy Channel → O
• I′ ≈ ARGMAX_I Pr(I|O) = ARGMAX_I Pr(I) Pr(O|I)
• Language model Pr(I): application independent, e.g., a trigram language model
– Example: "We need to resolve all of the important issues" (word, rank, more likely alternatives):
• We (9): The, This, One, Two, A, Three, Please, In
• need (7): are, will, the, would, also, do
• to (1)
• resolve (85): have, know, do, …
• all (9), of (2), the (1)
• important (657): document, question, first, …
• issues (14): thing, point, to
• Channel model Pr(O|I): application dependent (application: input/output)
– Speech recognition: writer / rider
– OCR (optical character recognition): all / a1l
– Spelling correction: government / goverment

Using (Abusing) Shannon's Noisy Channel Model (Speech → Language)
• Speech: Words → Noisy Channel → Acoustics
• OCR: Words → Noisy Channel → Optics
• Spelling correction: Words → Noisy Channel → Typos
• Part of speech tagging (POS):
POS → Noisy Channel → Words
• Machine Translation: "Made in America"
– English → Noisy Channel → French

Recent work: The Chance of Two Noriegas is Closer to p/2 than p²: Implications for Language Modeling, Information Retrieval and Gzip
• Standard independence models (binomial, multinomial, Poisson):
– The chance of the 1st Noriega is p; the chance of the 2nd is also p
• Repetition is very common
– Ngrams/words (and their variant forms) appear in bursts
– Noriega appears several times in a doc, or not at all
• Adaptation & contagious probability distributions
• Discourse structure (e.g., text cohesion, given/new):
– The 1st Noriega in a document is marked (more surprising)
– The 2nd is unmarked (less surprising)
• Empirically, we find the first Noriega is surprising (p ≈ 6/1000)
– But the chance of two is not surprising (closer to p/2 than p²)
• Finding a rare word like Noriega is like lightning
– We might not expect lightning to strike twice in a doc
– But it happens all the time, especially for good keywords
• Documents ≠ random bags of words

Three Applications & Independence Assumptions: No Quantity Discounts
• Compression: Huffman coding
– |encoding(s)| = ceil(−log₂ Pr(s))
– Two Noriegas consume twice as much space as one: |encoding(s s)| = |encoding(s)| + |encoding(s)|
– No quantity discount
– Independence is the worst case: any dependencies → less entropy H (space)
• Information retrieval
– Score(query, doc) = Σ_{term ∈ doc} tf(term, doc) · idf(term)
• idf(term): inverse document frequency, −log₂ Pr(term) = −log₂ (df(term)/D)
• tf(term, doc): number of instances of term in doc (log tf smoothing)
– Two Noriegas are twice as surprising as one (2·idf v. idf)
– No quantity discount: any dependencies → less surprise
• Speech recognition, OCR, spelling correction
– I → Noisy Channel → O
– Pr(I) Pr(O|I)
– Pr(I) = Pr(w₁, w₂, …, wₙ) ≈ Π_k Pr(w_k | w_{k−2}, w_{k−1})

Interestingness Metrics: Deviations from Independence
• Poisson (and other independence assumptions): not bad for meaningless random strings
• Deviations from Poisson are clues for hidden variables: meaning, content, genre, topic, author, etc.
• Analogous to mutual information (Hanks): Pr(doctor…nurse) >> Pr(doctor) Pr(nurse)

Poisson Mixtures: More Poissons → Better Fit
(Interpretation: each Poisson is conditional on hidden variables: meaning, content, genre, topic, author, etc.)

Adaptation: Three Approaches
1. Cache-based adaptation: Pr(w | …) = λ · Pr_local(w | …) + (1 − λ) · Pr_global(w | …)
2. Parametric models: Poisson, two Poisson, mixtures (negative binomial), e.g., Pr(k ≥ 2 | k ≥ 1) = (1 − Pr(1) − Pr(0)) / (1 − Pr(0))
3. Non-parametric:
– Pr(+adapt1) ≡ Pr(test | hist)
– Pr(+adapt2) ≡ Pr(k ≥ 2 | k ≥ 1)

Positive & Negative Adaptation
• Adaptation: how do probabilities change as we read a doc?
• Intuition: if a word w has been seen recently
1. +adapt: the probability of w (and its friends) goes way up
2.
−adapt: the probability of many other words goes down a little
• Pr(+adapt) >> Pr(prior) > Pr(−adapt)

Adaptation: Method 1
• Split each document into two equal pieces:
– hist: 1st half of the doc
– test: 2nd half of the doc
• Task: given hist, predict test
• Compute a contingency table for each word, e.g., documents containing "hostages" in 1990 AP News: a = 638, b = 505, c = 557, d = 76,787
• Notation (test v. hist):
– a: w in both hist and test; b: w in hist only; c: w in test only; d: w in neither
– D = a + b + c + d (the library); df = a + b + c (doc freq)
• Prior: Pr(w ∈ test) = (a + c) / D
• +adapt: Pr(w ∈ test | w ∈ hist) = a / (a + b)
• −adapt: Pr(w ∈ test | w ∉ hist) = c / (c + d)
• Pr(+adapt) >> Pr(prior) > Pr(−adapt), by source (+adapt, prior, −adapt):
– AP 1987: 0.56, 0.014, 0.0069
– AP 1990: 0.56, 0.015, 0.0072
– AP 1991: 0.59, 0.013, 0.0057
– AP 1993: 0.39, 0.004, 0.0030

Priming, Neighborhoods and Query Expansion
• Priming: doctor/nurse
– Doctor in hist → Pr(nurse in test) ↑
• Find docs near hist (in the IR sense)
– Neighborhood ≡ the set of words in docs near hist (query expansion)
• Partition the vocabulary into three sets:
1. Hist: word in hist
2. Near: word in neighborhood − hist
3. Other: none of the above
• Prior: Pr(w ∈ test) = (a + e + g) / D
• Hist (+adapt): Pr(w ∈ test | w ∈ hist) = a / (a + b)
• Near: Pr(w ∈ test | w ∈ near) = e / (e + f)
• Other: Pr(w ∈ test | w ∈ other) = g / (g + h)

Adaptation: Hist >> Near >> Prior
• The magnitude is huge
– p/2 >> p²
– Two Noriegas are not much more surprising than one
– Huge quantity discounts
• Shape: given/new
– 1st mention: marked; surprising (low prob); depends on freq
– 2nd: unmarked; less surprising; independent of freq
– Priming: "a little bit" marked

Adaptation is Lexical
• Lexical: adaptation is
– Stronger for good keywords (Kennedy)
– Than for random strings, function words (except), etc.
• Content ≠ low frequency
• By source and word (+adapt, prior, −adapt):
– AP90, Kennedy: 0.27, 0.012, 0.0091
– AP91, Kennedy: 0.40, 0.015, 0.0084
– AP93, Kennedy: 0.32, 0.014, 0.0094
– AP90, except: 0.049, 0.016, 0.016
– AP91, except: 0.048, 0.014, 0.014
– AP93, except: 0.048, 0.012, 0.012

Adaptation: Method 2
• Pr(+adapt2) = Pr(k ≥ 2 | k ≥ 1) ≈ df₂ / df₁
• df_k(w) ≡ the number of documents that mention word w at least k times
• df₁(w) ≡ the standard definition of document frequency (df)

Pr(+adapt1) ≈ Pr(+adapt2)
• Within factors of 2-3 (as opposed to 10-1000)
[chart: 3rd mention; priming]

Adaptation helps more than it hurts
• Examples of big winners (boilerplate; hist is a great clue):
– Lists of major cities and their temperatures
– Lists of major currencies and their prices
– Lists of commodities and their prices
– Lists of senators and how they voted
• Examples of big losers (hist is misleading):
– Summary articles
– Articles that were garbled in transmission

Recent Work (with Kyoji Umemura)
• Applications: Japanese morphology (text → words)
– Standard methods: dictionary-based
– Challenge: OOV (out of vocabulary)
– Good keywords (OOV) adapt more than meaningless fragments
• Poisson model: not bad for meaningless random strings
• Adaptation (deviations from Poisson): great clues for hidden variables
– OOV, good keywords, technical terminology, meaning, content, genre, author, etc.
– Extend dictionary method to also look for substrings that adapt a lot • Practical procedure for counting dfk(s) for all substrings s in a large corpus (trigrams million grams) – Suffix array: standard method for computing freq and loc for all s – Yamamoto & Church (2001): count df for all ngrams in large corpus • df (and many other ngram stats) for million-grams • Although there are too many ngrams to work with (n2) – They can be grouped into a manageable number of equiv classes (n) – Where all substrings in a class share the same stats – Umemura (submitted): generalize method for dfk • Adaptation for million-grams Brno 2004 45 struct stackframe { int start, SIL, cdfk } *stack; int kth_neighbor(int suffix, int k) { int i, result = suffix; for(i=0; i < k && result >= 0; i++) result = neighbors[result]; return result; } int find(int suffix) { int low = 0; int high = sp; while(low + 1 < high) { int mid = (low + high) / 2; if(stack[mid].start <= suffix) low = mid; else high = mid; } if(stack[high].start <= suffix) return high; if(stack[low].start <= suffix) return low; fatal("can't get here"); } The Solution (dfk for all ngrams) for(w=0; w<N; w++) { if(LCP[w]> stack[sp].SIL) { sp++; stack[sp].start = w; stack[sp].SIL = LCP[w]; stack[sp].cdfk = 0; } int prev = kth_neighbor(w, K-1); if(prev >= 0) stack[find(prev)].cdfk++; while(LCP[w] < stack[sp].SIL) { putw(stack[sp].cdfk, out); /* report */ if(LCP[w] <= stack[sp-1].SIL) { stack[sp-1].cdfk += stack[sp].cdfk; sp--; } Brno 2004else stack[sp].SIL = LCP[w]; }} 46 App: Word Breaking & Term Extraction Challenge: Establish value beyond standard corpus freq and trigrams • No spaces • English – in Japanese and Chinese – Kim Dae Jung before Presidency • English has spaces – But… • Phrases: white house • NER (named entity recog) • Japanese – 大統領になる以前の 金大中 • Chinese –未上任前的金大中 • Word Breaking – Dictionary-based (ChaSen) • Dynamic Programming • Fewest edges (dictionary entries) that cover input • Challenges for Dictionary – 
Out-of-Vocabulary (OOV)
– Technical terminology
– Proper nouns

Using Adaptation to Distinguish Terms from Random Fragments
• Adaptation: Pr(k ≥ 2 | k ≥ 1) ≈ df₂ / df₁
• Null hypothesis
– Poisson: df₂ / df₁ ≈ df₁ / D
– Not bad for random fragments
• OOV words (and substrings thereof)
– Adapt too much for the null hypothesis (Poisson)
– If an OOV word is mentioned once in a document, it will probably be mentioned again
– Not true for random fragments

Using Adaptation to Reject the Null Hypothesis (word boundary)
• Japanese / English gloss:
– フジモリ: Fujimori
– 大統領: President
– が: <function word>

English Example
[chart: adaptation v. doc-freq baseline]

Adaptation Conclusions
1. Large magnitude (p/2 >> p²); big quantity discounts
2. Distinctive shape
• The 1st mention depends on freq; the 2nd does not
• Priming: between the 1st mention and the 2nd
3. Lexical:
– Independence assumptions aren't bad for meaningless random strings, function words, common first names, etc.
– More adaptation for content words (good keywords, OOV)

Outline
1. We're making consistent progress, or
2. We're running around in circles, or  ← We are here
• Don't worry, be happy
3. We're going off a cliff…

The rising tide of data will lift all boats!
TREC Question Answering & Google: What is the highest point on Earth?

The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data: Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets (Google Sets examples)
• England, France, Germany, Italy, Ireland, Spain, Scotland, Belgium, Canada, Austria, Australia
• Japan, China, India, Indonesia, Malaysia, Korea, Taiwan, Thailand, Singapore, Australia, Bangladesh
• Cat, Dog, Horse, Fish, Bird, Rabbit, Cattle, Rat, Livestock, Mouse, Human
• cat, more, ls, rm, mv, cd, cp, mkdir, man, tail, pwd

Rising Tide of Data Lifts All Boats
• More data → better results
– TREC Question Answering
• Remarkable performance: Google and not much else
– Norvig (ACL-02)
– AskMSR (SIGIR-02)
– Lexical acquisition
• Google Sets
– Hanks and I tried similar things, but with tiny corpora, which we called large

Outline
• We're making consistent progress, or
• We're running around in circles, or
– Don't worry; be happy
• We're going off a cliff…
• According to unnamed sources: Speech Winter, Language Winter; Dot Boom → Dot Bust

What is the answer to all questions?
• 6 years

When will we see the last non-statistical paper? 2010?
[chart: % statistical papers per ACL meeting, 1985-2005; sources: Bob Moore, Fred Jelinek]

Covering all the Bases
It is hard to make predictions (especially about the future)
• When will we see the last non-statistical paper? 2010?
• Revival of rationalism: 2010?
– 1950s: Empiricism (Information Theory, Behaviorism)
– 1970s: Rationalism (AI, Cognitive Psychology)
– 1990s: Empiricism (Data Mining, Statistical NLP, Speech)
– 2010s: Rationalism (TBD)

Sample of 20 Survey Questions (Strong Emphasis on Applications)
• When will…
– More than 50% of new PCs have dictation on them, either at purchase or shortly after.
– Most telephone Interactive Voice Response (IVR) systems accept speech input.
– Automatic airline reservation by voice over the telephone is the norm.
– TV closed-captioning (subtitling) is automatic and pervasive.
– Telephones are answered by an intelligent answering machine that converses with the calling party to determine the nature and priority of the call.
– Public proceedings (e.g., courts, public inquiries, parliament, etc.) are transcribed automatically.
• Two surveys of ASRU attendees: 1997 & 2003

2003 Responses ≈ 1997 Responses + 6 Years
(6 years of hard work → no progress)

Wrong Apps?
• Old priorities
– The dictation app dates back to the days of dictation machines
– Speech recognition has not displaced typing
• Speech recognition has improved, but typing skills have improved even more
• My son will learn typing in 1st grade
• Secretaries rarely take dictation
• Dictation machines are history
– My son may never see one
– Museums have slide rules and steam trains, but dictation machines?
• New priorities
– Increase demand for space >> data entry
• New killer apps
– Search >> dictation: speech Google!
– Data mining

Speech Data Mining & Call Centers: An Intelligence Bonanza
• Some companies are collecting information with technology designed to monitor incoming calls for service quality.
• Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers.
• But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director.
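The call-center mining idea (label calls by a subsequent outcome, then look for phrases whose presence shifts the outcome probability) can be sketched in a few lines. All transcripts, phrases, and outcomes below are invented for illustration; a real system would extract features from recognized speech:

```python
# Hypothetical labeled transcripts: (utterance, outcome).
calls = [
    ("i'm not interested thanks", "no_sale"),
    ("i just want to tell you about our offer", "no_sale"),
    ("yes i'd like to book that", "sale"),
    ("i'm not interested", "no_sale"),
    ("that sounds good let's do it", "sale"),
    ("i just want to tell you one thing", "no_sale"),
]

def phrase_stats(calls, phrase, outcome="no_sale"):
    """P(outcome | phrase in call) v. the base rate P(outcome)."""
    with_phrase = [o for text, o in calls if phrase in text]
    base = sum(1 for _, o in calls if o == outcome) / len(calls)
    cond = (sum(1 for o in with_phrase if o == outcome) / len(with_phrase)
            if with_phrase else 0.0)
    return cond, base

cond, base = phrase_stats(calls, "not interested")
print(cond, base)  # "not interested" predicts no_sale: 1.0 v. base rate ~0.67
```

When the conditional probability towers over the base rate like this, the pattern is "inter-ocular": it hits you between the eyes without any formal test.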
Speech Data Mining
• Label calls as success or failure based on some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used to predict outcomes
• Hypotheses:
– Customer: "I'm not interested" → no sale
– Agent: "I just want to tell you…" → no sale
• Inter-ocular effect (hits you between the eyes): you don't need a statistician to know which way the wind is blowing

Great Challenge: Annotating Data (Borrowed Slide: Jelinek, LREC; great strategy → success)
• Produce annotated data with minimal supervision
– Self-organizing "magic"?
• Active learning
– Identify reliable labels
– Identify the best candidates for annotation
• Co-training
• Bootstrap (project) resources from one application to another

Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf

Roadmaps: Structure of a Strategy (not the union of what we are all doing)
• Goals (1-3 of them)
– Example: replace the keyboard with a microphone
– An exciting (memorable) sound bite
– A broad grand challenge that we can work toward but never solve
• Metrics (about 3)
– Examples: WER (word error rate); time to perform a task
– Easy to measure
• Milestones (about a dozen)
– There should be no question whether one has been accomplished
– Example: reduce WER on task x by y% by time t
– Mostly for next year (Q1-4), plus some for years 2, 5, 10 & 20
• Accomplishments (about a dozen) v. activities
– Accomplishments are good; activity is not a substitute for accomplishments
– Milestones look forward whereas accomplishments look backward
– Serendipity is good!
• Broad applicability & illustrative
– Don't cover everything; highlight stuff that applies to multiple groups
– Forward-looking / exciting
• Awareness: the 1-slide version; if successful, you get maybe 3 more slides
• Size of container: quantity is not a good thing; small is beautiful

Grand Challenges & Infrastructure

Goals:
1.
The multilingual companion
2. Life log
Grand Challenges
• Goal: produce NLP apps that improve the way people communicate with one another
• Goal: reduce barriers to entry (€)
– Apps & techniques, resources, evaluation

Summary: What Worked and What Didn't? What's the right answer?
• Data
– Stay on message: it is the data, stupid!
– If you have a lot of data, then you don't need a lot of methodology
– Rising tide of data lifts all boats
• Methodology
– Empiricism means different things to different people:
1. Machine Learning (self-organizing methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-based Lexicography
– Lots of papers on 1; the EMNLP-2004 theme (error analysis) → 2; Senseval grew out of 3
– Substance: recommended if…
– Magic: recommended if… lonely
– Promise: recommended if… (short term ≠ long term)
• There'll be a quiz at the end of the decade…
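The closing claim that "if you have a lot of data, then you don't need a lot of methodology" can be illustrated with a toy table-lookup experiment: the fraction of test tokens whose type was already seen in training grows with training-set size. The vocabulary, Zipf-like weights, and sample sizes below are all invented:

```python
import random

# Toy "more data is better data" demo: table-lookup coverage of a test
# sample rises as the training sample grows. Distribution is Zipf-like.
random.seed(0)
vocab = list(range(10_000))
weights = [1.0 / (r + 1) for r in vocab]   # rank r gets weight 1/(r+1)

def sample(n):
    return random.choices(vocab, weights=weights, k=n)

test = sample(5_000)

def coverage(train_size):
    """Fraction of test tokens whose type appears in a training sample."""
    seen = set(sample(train_size))
    return sum(1 for w in test if w in seen) / len(test)

small, large = coverage(1_000), coverage(100_000)
print(small, large)  # coverage rises with more training data
```

The best inference really is table lookup, once the table is big enough; what the table misses is exactly where methodology (smoothing, generalization) still earns its keep.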