Speech and Language Processing: Where have we been and where are we going?
Kenneth Ward Church
AT&T Labs-Research
church@att.com
www.research.att.com/~kwc

Where have we been? How To Cook A Demo
(After Dinner Talk at TMI-1992 & Invited Talk at TMI-2002)
• Great fun!
• Effective demos (Message for After Dinner Talk)
  – Theater, theater, theater
  – Production quality matters
  – Entertainment >> evaluation
  – Strategic vision >> technical correctness
• Success/Catastrophe (Message for After Breakfast Talk)
  – Warning: demos can be too effective
  – Dangerous to raise unrealistic expectations

Eurospeech 2003

Let’s go to the video tape! (Lesson: manage expectations)
• Lots of predictions
  – Entertaining in retrospect
  – Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc.
1. Machine Translation (1950s) video
  – Classic example of a demo embarrassment in retrospect
2. Translating telephone (late 1980s) video
  – Pierre Isabelle pulled a similar demo because it was so effective
  – The limitations of the technology were hard to explain to the public
    • Though well understood by the research community
3. Apple (~1990) video
  – Still having trouble setting appropriate expectations
  – Factoid: on the day of this demo, speech recognition was deployed at scale in the AT&T network, with significant lasting impact but little media coverage
4. Andy Rooney (~1990): reset expectations video

Outline: Where have we been and where are we going?
1. Consistent progress over decades
  • Moore’s Law, Speech Coding, Error Rate
  • Managing Expectations
2. History repeats itself
  • Empiricism: 1950s
  • Rationalism: 1970s
  • Empiricism: 1990s
  • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
  • Petabytes: $2,000,000 → $2,000
  • Can demand keep up with supply?
  • If not → Tech meltdown
  • New priorities: Search >> Compression & Dictation

Charles Wayne’s Challenge: Demonstrate Consistent Progress Over Time
(Managing Expectations)
• Controversial in 1980s
  – But not in 1990s
  – Though, grumbling
• Benefits
  1. Agreement on what to do
  2. Limits endless discussion
  3. Helps sell the field
    • Manage expectations
    • Fund raising
• Risks (similar to benefits)
  1. All our eggs are in one basket (lack of diversity)
  2. Not enough discussion
    • Hard to change course
  3. Methodology Burden

Hockey Stick Business Case
[Figure: $ versus t, with 2002 (Last Year), 2003 (This Year), 2004 (Next Year)]

Where have we been and where are we going?
Moore’s Law: Ideal Answer
[Figure: Rate of Progress in Speech & Language (and everything); Hyper-Inflation and Normal Inflation slopes; Physics & Investment]
Why different slopes?
1. Progress limited by physics
  – Disk seek: doubles in 10 years (normal inflation)
  – Disk capacity: doubles in 1 year (hyper-inflation)
2. Progress limited by investment
  – Case history: PCs improved faster than supercomputers (Cray)
    • PCs: larger market → more R&D
  – Irony: “Dis-economy of Scale” (Danny Hillis, Thinking Machines)
    • Computing is better (cheaper & faster) on smaller machines
      – PCs >> big iron
      – LAN routers >> 5ESS (big phone switch)
  – Economies of scale depend on size of market, not size of machine
    • Market: PC >> big iron (Economist View)
    • Machine: PC << big iron (CS View)

Evolution of Speech Coder Performance (Borrowed Slide: Rich Cox)
[Figure: quality (Bad, Poor, Fair, Good, Excellent) versus Bit Rate (kb/s), with 1980, 1990 and 2000 profiles; labeled regions: ITU Recommendations, Cellular Standards, North American TDMA, Secure Telephony]

Speech Coding (Telephony)
• More complicated than Moore’s Law
  – Many Dimensions: Bit Rate, Quality, Complexity and Delay
  – Quality ceiling (imposed by telephone standards)
    • Easy to reach the ceiling at high bit rates (≥ 8 kb/s)
    • More room for progress at low bit rates (≤ 8 kb/s)
• Moore’s Law Time Constant
  – Bit rates halve every decade (≤ 8 kb/s)
  – Relatively slow by Moore’s Law standards (not hyper-inflation)
    • Performance doubles every decade
    • Like disk seek or money in the bank (normal inflation)
  – Limited more by physics than investment
• Potential compression opportunity
  – At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?)
  – Entropy: 50 bits per sec (??) (Roger Moore)
• Speech (2 kb/s) >> text (2 bits/char): 10-1000 times more bits
  – Speech coding will not close this gap for the foreseeable future

Where have we been and where are we going?
1. Consistent progress over decades
  • Moore’s Law
  • Speech Coding
  • Reducing Speech Recognition Error Rates
2. History repeats itself
  • Empiricism: 1950s
  • Rationalism: 1970s
  • Empiricism: 1990s
  • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
  • Petabytes: $2,000,000 → $2,000
  • Can demand keep up with supply?
  • If not → Tech meltdown
  • New priorities: Search >> Compression & Dictation

Error Rate (Borrowed Slide: Audrey Le, NIST)
[Figure: Error Rate versus Date (15 years)]
Moore’s Law Time Constant:
• 10x improvement per decade
• Limited by R&D Investment
• (Not Physics)

Milestones in Speech and Multimodal Technology Research (Borrowed Slide)
[Figure: timeline by Year, 1962 to 2002 in five-year steps]
• Small Vocabulary, Acoustic Phonetics-based
  – Isolated Words
  – Filter-bank analysis; Time-normalization; Dynamic programming
• Medium Vocabulary, Template-based
  – Isolated Words; Connected Digits; Continuous Speech
  – Pattern recognition; LPC analysis; Clustering algorithms; Level building
• Large Vocabulary, Statistical-based
  – Connected Words; Continuous Speech
  – Hidden Markov models; Stochastic Language modeling
• Large Vocabulary; Syntax, Semantics
  – Continuous Speech; Speech Understanding
  – Stochastic language understanding; Finite-state machines; Statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS
  – Spoken dialog; Multiple modalities
  – Concatenative synthesis; Machine learning; Mixed-initiative dialog
Consistent improvement over time, but unlike Moore’s Law, hard to extrapolate (predict future)

Speech-Related Technologies
Where will the field go in 10 years?
Niels Ole Bernsen (ed)
2003 Useful speech recognition-based language tutor
2003 Useful portable spoken sentence translation systems
2003 First pro-active spoken dialogue with situation awareness
2004 Satisfactory spoken car navigation systems
2005 Small-vocabulary (> 1000 words) spoken conversational systems
2006 Multiple-purpose personal assistants (spoken dialog, animated characters)
2006 Task-oriented spoken translation systems for the web
2006 Useful speech summarization systems in top languages
2008 Useful meeting summarization systems
2010 Medium-size vocabulary conversational systems

Where have we been and where are we going?
Consistent Progress over Time (Manage Expectations)
[Figure: two cases, one labeled “Extrapolation/Prediction is Applicable” and one labeled “Extrapolation/Prediction is Not Applicable”; labels: Physics, Investment; hockey-stick axes $ versus t (2002, 2003, 2004)]

Where have we been and where are we going?
1. Consistent progress over decades
  • Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
  • Empiricism: 1950s
  • Rationalism: 1970s
  • Empiricism: 1990s
  • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
  • Petabytes: $2,000,000 → $2,000
  • Can demand keep up with supply?
  • If not → Tech meltdown
  • New priorities: Search >> Compression & Dictation

It has been claimed that recent progress was made possible by Empiricism
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
  – Dominating a broad set of fields
    • Ranging from psychology (Behaviorism)
    • To electrical engineering (Information Theory)
  – Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
    • Word association norms (priming): bread and butter, doctor / nurse
  – Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
    • Firth: “You shall know a word by the company it keeps”
    • Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
  – with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
  – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
  – Availability of massive amounts of data (popular arg, even before the web)
    • “More data is better data”
    • Quantity >> Quality (balance)
  – Pragmatic focus:
    • What can we do with all this data?
    • Better to do something than nothing at all
  – Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
• Periodic signals are continuous
  – Support extrapolation/prediction
• Progress? Consistent progress?
• Extrapolation/Prediction: Applicable?

Has the pendulum swung too far? (Speech → Language)
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility
    • that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself: Mark Twain; bad idea then and still a bad idea now
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)
(Plays well at Machine Translation conferences)

Rationalism versus Empiricism
                            Rationalism                            Empiricism
Well-known advocates        Chomsky, Minsky                        Shannon, Skinner, Firth, Harris
Model                       Competence Model                       Noisy Channel Model
Contexts of Interest        Phrase-Structure                       N-Grams
Goals                       All and Only                           Minimize Prediction Error (Entropy)
                            Explanatory                            Descriptive
                            Theoretical                            Applied
Linguistic Generalizations  Agreement & Wh-movement                Collocations & Word Associations
Parsing Strategies          Principle-Based, CKY (Chart),          Forward-Backward (HMMs),
                            ATNs, Unification                      Inside-outside (PCFGs)
Applications                Understanding: Who did what to whom    Recognition: Noisy Channel Applications

Where have we been and where are we going?
1. Consistent progress over decades
  • Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
  • Empiricism: 1950s
  • Rationalism: 1970s
  • Empiricism: 1990s
  • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
  • Petabytes: $2,000,000 → $2,000
  • Can demand keep up with supply?
  • If not → Tech meltdown
  • New priorities: Search >> Compression & Dictation

Meeting Demand for Petabytes
Bet: Speech >> Text (because we aren’t going to solve all “speech” problems)
• Moore’s Law → More and More Supply
  – Disks, Memory, Network Bandwidth, everything…
  – Petabytes are coming: $2,000,000 (today) → $2,000 (in 10 years)
• Can demand keep up?
  – If not, revenues will collapse → tech meltdown (Discontinuity)
  – Much worse than the Dot-Bomb…
• Ans1: no problem
  – Demand has always kept up
  – Pundits have never been able to explain why
    • Thomas J. Watson (1943): I think there is a world market for maybe five computers (www.wikipedia.org/wiki/Thomas+J.+Watson)
  – But if you build it, they will come
• Ans2: big problem (prices for PCs & Networks are collapsing)
  – Demand is everything
  – Anyone (even a dot-com) can build a network,
  – But the challenge is to sell it
  – Need a killer app (more minutes on the network)

How much is a Petabyte? (10^15 bytes)
• Question from execs:
  – How do I explain to a lay audience
    • How much is a petabyte
    • And why everyone will buy lots of them
• Wrong answer:
  – 10^6 is a million (a floppy disk/email msg)
  – 10^9 is a billion (a billion here, a billion there…)
  – 10^12 is a trillion (the US debt)
  – 10^15 is a zillion (an unimaginably large #)

How much is a Petabyte? Some more wrong answers
• Goal: create demand for a petabyte/lifetime
  – ≈ 10^15 bytes/100 years ≈ 18 megabytes/minute
  – Text: 18,000 pages/min
  – Speech: 317 telephone channels for 100 years per capita
• Text won’t do it
  – Speech probably won’t either, but it is closer
  – DVD video will (1.8 gigabytes/hour = 1.6 petabytes/lifetime), but
    • Too much opportunity for compression
    • Not enough demand for Picture Phone (privacy concerns)
• Bank on speech recognition not working too well
  – Can’t afford big improvements in compression:
    • Speech rates → Text rates (Fortunately, that won’t happen)

New Research Challenges
• New Priorities
  – Increase demand for space >> Data entry
• New Killer Apps
  – Search >> Dictation
    • Speech Google!
  – Data mining
• Old Priorities
  – Dictation application dates back to days of dictation machines
  – Speech recognition has not displaced typing
    • Speech recognition has improved
    • But typing skills have improved even more
  – My son will learn typing in 1st grade
  – Secretaries rarely take dictation
  – Dictation machines are history
    • My son may never see one
    • Museums have slide rules and steam trains
      – But dictation machines?
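
The petabyte-per-lifetime figures on the "Some more wrong answers" slide (≈ 18 megabytes/minute, 317 telephone channels, 1.6 petabytes of DVD video) can be checked with a few lines of arithmetic. The sketch below is an illustration and not part of the talk. It assumes a 365.25-day year, binary megabytes (2^20 bytes), and a telephone channel of 8 kb/s (1,000 bytes per second); the slides do not state these conventions, and the function names are hypothetical.

```python
# Back-of-the-envelope check of the "petabyte per lifetime" numbers.
# Assumptions (not stated on the slides): 365.25-day years, binary
# megabytes (2**20 bytes), and an 8 kb/s telephone channel.

PETABYTE = 10**15                      # bytes
LIFETIME_YEARS = 100
SECONDS_PER_YEAR = 365.25 * 24 * 3600
LIFETIME_SECONDS = LIFETIME_YEARS * SECONDS_PER_YEAR


def bytes_per_second(total_bytes: float = PETABYTE) -> float:
    """Average rate needed to accumulate total_bytes over one lifetime."""
    return total_bytes / LIFETIME_SECONDS


def megabytes_per_minute(total_bytes: float = PETABYTE) -> float:
    """Same rate, expressed in binary megabytes per minute."""
    return bytes_per_second(total_bytes) * 60 / 2**20


def telephone_channels(channel_bits_per_second: float = 8000) -> float:
    """Always-on channels that together add up to a petabyte per lifetime."""
    channel_bytes_per_second = channel_bits_per_second / 8
    return bytes_per_second() / channel_bytes_per_second


def dvd_petabytes_per_lifetime(gigabytes_per_hour: float = 1.8) -> float:
    """Petabytes produced by recording DVD-quality video around the clock."""
    hours = LIFETIME_SECONDS / 3600
    return gigabytes_per_hour * 1e9 * hours / PETABYTE


if __name__ == "__main__":
    print(f"{megabytes_per_minute():.1f} MB/minute")
    print(f"{telephone_channels():.0f} telephone channels")
    print(f"{dvd_petabytes_per_lifetime():.2f} PB of DVD video per lifetime")
```

Under these assumptions the results round to 18 MB/minute, 317 channels and 1.6 PB, which agrees with the slide. With decimal megabytes the first figure would be closer to 19 MB/minute.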
Eurospeech 2003 43 Data Mining & Call Centers: An Intelligence Bonanza • Some companies are collecting information with technology designed to monitor incoming calls for service quality. • Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers. • But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director. Eurospeech 2003 44 Borrowed Slide In Search of PetaByte Databases Jim Gray Tony Hey Borrowed Slide Personal 100 GB today The Personal Petabyte (someday) • It’s coming (2M$ today…2K$ in 10 years) • Today the pack rats have ~ 10-100GB – 1-10 GB in text (eMail, PDF, PPT, OCR…) – 10GB – 50GB tiff, mpeg, jpeg,… – Some have 1TB (voice + video). Text won’t do it; Speech won’t either • Video can drive it to 1PB. • Online PB affordable in 10 years. • Get ready: tools to capture, manage, organize, search, display will be big app. Eurospeech 2003 46 300 TB (cooked) Hotmail / Yahoo Borrowed Slide • Clone front ends ~10,000@hotmail. • Application servers – – – – Per Capita Demand: Tiny ~100 @ hotmail Get mail box Get/put mail Disk bound • ~30,000 disks • ~ 20 admins Cost of storage: People Eurospeech 2003 47 AOL (msn) (1PB?) • • • • • • Borrowed Slide Per Capita Demand: Tiny 10 B transactions per day (10% of that) Huge storage Huge traffic Lots of eye candy DB used for security/accounting. GUESS AOL is a petabyte – (40M x 10MB = 400 x 1012) Eurospeech 2003 48 Google 1.5PB as of last spring • 8,000 no-name PCs Borrowed Slide 2001 Per Capita Demand: Tiny – Each 1/3U, 2 x 80 GB disk, 2 cpu 256MB ram • • • • • 1.4 PB online. 2 TB ram online 8 TeraOps Slice-price is 1K$ so 8M$. 15 admins (!) (== 1/100TB). 
Cost of storage: People Eurospeech 2003 49 Digital Immortality: Gordon Bell & Jim Gray (2000) Estimated Lifetime Storage Requirements Data-types Per day Per Lifetime email, papers, text 0.5 MB 15 GB photos 2 MB 150 GB speech 40 MB 1.2 TB music 60 MB 5.0 TB video-lite (200 Kb/s) 1 GB 100 TB DVD video (4.3 Mb/s = 1.8 GB/hour) 20 GB 1 PB Eurospeech 2003 50 Future of Tech Industry Depends On… • Supply running into a (physical) limit – Moore’s Law breaking down – And little progress on compression • Demand keeping up Not Likely Not Optimistic – If we build it, they will come… • Bell & Gray underestimating demand by a lot – Everyone wanting lots and lots of speech – Everyone wanting lots of video – A miracle (the fat lady might sing…) Not Likely – Big progress on searching speech & video Best Bet! Eurospeech 2003 51 Bait and Switch Strategy www.elsnet.org • Bait: public Internet – Large, sexy, available, rich hypertext structure • Switch: as large as the web is – There are larger & more valuable private repositories • Private Intranets & telephone networks – Exclusivity Value • No one cares about data that everyone can have • Just as Groucho Marx doesn’t want to be in a club that… • Strategy: Use the public Intranet to develop, test and socialize new ways to extract value from large linguistic repositories – Value to society: Port solutions to private repositories Eurospeech 2003 52 Bait and Switch Strategy www.elsnet.org • Bait: public Internet – Large, sexy, available, rich hypertext structure • Switch: as large as the web is – There are larger & more valuable private repositories • Private Intranets & telephone networks – Exclusivity Value • No one cares about data that everyone can have • Just as Groucho Marx doesn’t want to be in a club that… • Strategy: Use the public Intranet to develop, test and socialize new ways to extract value from large linguistic repositories – Value to society: Port solutions to private repositories Eurospeech 2003 53 Bait and Switch 
Strategy www.elsnet.org
• Bait: public Internet
  – Large, sexy, available, rich hypertext structure
• Switch: as large as the web is
  – There are larger & more valuable private repositories
    • Private Intranets & telephone networks
  – Exclusivity → Value
    • No one cares about data that everyone can have
    • Just as Groucho Marx doesn’t want to be in a club that…
• Strategy: Use the public Internet to develop, test and socialize new ways to extract value from large linguistic repositories
  – Value to society: Port solutions to private repositories

Switch: How Large is Large?
• Web → Renewed Excitement
  – Large, rich hypertext structure & publicly available
  – Ngram freqs: Google = 1000 × BNC
    • Google: 100 Billion Words
    • British National Corpus (BNC): 100 Million Words
  – 1 TB (ngram freqs) or 1 PB (Gray)?
• It is often said that the web is the largest repository but…
  – Changes to copyright laws could unlock vast resources: www.lexisnexis.com
• Private Intranets and telephone networks >> Public Web
  – American Telephone Network (FCC): 1 line/person
    • Usage: 1 hour/day/line
    • Assume 1 sec ≈ 1 word → 10 Google collections/day
  – Currently, Intranets (data) ≈ telephones (voice)
    • But data is growing faster than voice
  – AT&T networks: 1 PB/day
    • Worldwide networks: tens of PB/day
• A lot of speech, but not PB per capita

Privacy Concerns: Private Data is Private (Exclusivity → Value) •
Data on private intranets cannot be distributed
  – And most telephone conversations cannot even be recorded
    • let alone distributed
• But attitudes are changing
  – It used to be considered rude to have an answering machine
  – Now it is considered rude not to have one
• Between answering machines and call centers, perhaps 10% of telephone traffic can be recorded (≈ 1 PB/day)
  – Customer expectation: call centers can retrieve recordings of previous calls based on content
• New capabilities → new public policy
  – Video recording:
    • Expected in banks (ATMs)
    • Prohibited in rest rooms (except children’s YMCA locker room)

In the past, recording all this data would have been prohibitively expensive
• Thanks to Moore’s Law
  – Storage costs have been falling faster than transport
  – And will continue to do so for some time
• Even at current prices, transport >> storage
  – Transport: Long-distance telephone calls: 5 cents per minute of speech
  – Storage: Disk space: ½ cent per minute of speech
• If I am willing to pay for a call
  – I might as well keep the speech online forever
• Similar comments hold for data (web pages)
  – If I am willing to pay to fetch a web page
    • I might as well cache it for a long time
    • Why flush a page if there is any chance that it might be requested again?
  – Web caches → crawlers
    • Go find the pages that I might ask for and keep them forever
• Storage is cheap (compared to transport)

Bait: Use Web to Establish Excitement: More data is better data
• Shocking at TMI-1992 (Bob Mercer)
• but less so a decade later (Eric Brill)
  – Many researchers are finding that performance improves with corpus size, over full range of sizes that are available.
  – EMNLP-2002 Best paper (& CL): Using the Web to Overcome Data Sparseness, Keller et al
  – For many tasks: Google is displacing BNC just as PCs displaced Crays
    • Larger market share → More $$ for R&D → Better Moore’s Law Time Constant
• Larger corpora (100B Google) >> Smaller corpora (100M BNC)
  – Language modelling
  – Predicting psycholinguistic judgements
• Collecting more data is better than tricks for not collecting data
  – Smoothing, balance, etc.
  – Tricks have limited power:
    • Collecting x data with tricks ≈ collecting 10x data without tricks
  – Still find papers on “tiny” corpora
  – Wish list: more papers measuring power of various tricks
    • Was balancing BNC (British National Corpus) worth the effort?
    • Should a corpus be balanced? (Oxford Debate, 1991)
• My spin: The rising tide of data will lift all boats!
  1. TREC Question Answering
  2. Collocations: http://labs1.google.com/sets

The rising tide of data will lift all boats!
TREC Question Answering & Google: What is the highest point on Earth?

The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data: Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets
(first two rows: seed terms; remaining rows: terms returned)
Cat        cat    England    Japan
Dog        more   France     China
Horse      ls     Germany    India
Fish       rm     Italy      Indonesia
Bird       mv     Ireland    Malaysia
Rabbit     cd     Spain      Korea
Cattle     cp     Scotland   Taiwan
Rat        mkdir  Belgium    Thailand
Livestock  man    Canada     Singapore
Mouse      tail   Austria    Australia
Human      pwd    Australia  Bangladesh

Rising Tide of Data Lifts all Boats
Bait: use public web to create & socialize new ideas
• More data → better results
  – TREC Question Answering
    • Remarkable performance: Google and not much else
      – Norvig (ACL-02)
      – AskMSR (SIGIR-02)
  – Lexical Acquisition
    • Google Sets
      – We tried similar things
        » but with tiny corpora
        » which we called large
Switch: port these ideas to private repositories

Recommendations: Bait and Switch Strategy
• Strategy: Use the public Internet to develop, test and socialize new ways to extract value from large linguistic repositories
  – Value to society: Port solutions to private repositories
• Research papers:
  – Bait: Keep up the good work!
  – There is already considerable interest in evaluation of new ideas on corpora (public repositories)
  – Switch: There will be more interest in
    • How well methods port to new corpora
    • How well performance scales with size
  – Hopefully corpus size helps
• But of course, all the data in the world
  – Will not solve all the world’s problems
  – Need to understand when more data will help
    • And when it is better to do something else
  – Revival of Rationalism (Linguistics)

More Recommendations: Bait and Switch Strategy
• Infrastructure
  – Bait: In addition to traditional public repositories (large)
    • Web data, data collection efforts such as LDC
  – Switch: We ought to think more about private repositories (even larger)
    • Most of us do not keep voice mail for long
  – But I have been using Scanmail to copy my voice mail to email
  – And like many, I keep email online for a long time
• Private repositories would be much larger if
  – It was more convenient to capture private data
  – and there was obvious value in doing so.
• Currently, tools for public repositories (e.g., Google)
  – are better than comparable tools for private data (e.g., searching email)
• Better search tools (email, speech & video) → Larger private repositories
• New priorities (consume space) → new killer apps
  – Search (consumes space) >> Dictation (data entry) & Compression

Summary: Where have we been and where are we going?
(More realistic expectations)
• 1970s: Hot debate: knowledge v. data intensive methods
  – People think about what they can afford to think about
  – Data was expensive
    • Only the richest industrial labs could play
    • Beyond the reach of most universities
    • Victor Zue dreams of having an hour of speech online (with annotations)
• 1990s: Revival of Empiricism: More data is better data!
(Demonstrate consistent progress over time)
  – Everyone can afford to play (but still expensive)
  – Linguistic Data Consortium (LDC) → Web
  – Evaluation, evaluation, evaluation demonstrates consistent progress over time, but not as convincingly as Moore’s Law
  – Data intensive: method of choice
    • Pendulum swings (too) far
    • Is this progress, or is the pendulum about to swing back the other way?
(Oscillations)
• 2010s: Petabytes everywhere (be careful what you ask for)
  – Big problem: Supply >> Demand → tech meltdown (??)
  – No problem: Demand has always kept up → new killer apps
(Discontinuities)
    • Search (consumes space) >> dictation (data entry) & compression
    • Video >> Speech >> Text
(Don’t see how to consume PB per capita)

Where have we been and where are we going?
1. Consistent progress over decades
   • Moore’s Law, Speech Coding, Error Rate
   • Time constant limited by: physics and/or R&D investment
2. History repeats itself:
   • Mark Twain; bad idea then and still a bad idea now
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities:
   • Fundamental changes that invalidate fundamental assumptions
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: data entry → create demand for petabytes
     – New Killer Apps: Search (creates demand) >> Compression & Dictation

Backup

Speech Language
Shannon’s Noisy Channel Model
• I → Noisy Channel → O
• I′ ≈ ARGMAX_I Pr(I|O) = ARGMAX_I Pr(I) Pr(O|I)
  – Language Model: Pr(I)
  – Channel Model: Pr(O|I)

Language Model (Application Independent)
Word       Rank  More likely alternatives
We         9     The This One Two A Three Please In
need       7     are will the would also do
to         1
resolve    85    have know do…
all        9     The This One Two A Three Please In
of         2     The This One Two A Three Please In
the        1
important  657   document question first…
issues     14    thing point to

Channel Model
Application                          Input       Output
Speech Recognition                   writer      rider
OCR (Optical Character Recognition)  all         a1l
Spelling Correction                  government  goverment

Speech Language
Using (Abusing) Shannon’s Noisy Channel Model: Part of Speech Tagging and Machine Translation
• Speech
  – Words → Noisy Channel → Acoustics
• OCR
  – Words → Noisy Channel → Optics
• Spelling Correction
  – Words → Noisy Channel → Typos
• Part of Speech Tagging (POS):
  – POS → Noisy Channel → Words
• Machine Translation: “Made in America”
  – English → Noisy Channel → French

I am going to try to avoid making predictions like these because…
• Too falsifiable
• Appearance of conflicts of interest
  – Sound like you are trying to raise money for your favorite stuff
• Committees do what committees do
  – Union of all (represented) positions = no position
  – Advocate what the members are currently working on
    • Rarely establish new strategic direction
• Boring (too obviously correct)

Predictions: Where are we going?
Change the subject (engage in meta discussion)
• Set unrealistic expectations (plenty of examples)
  – Sound like you are trying to raise money for your favorite stuff
    • And that you have lost touch with reality
• Come up short (fewer examples)
• Sound like you’re over the hill (old fogies session at Coling)
  – Kids these days don’t get it
  – Everyone should still be working on
    • what we thought was important when we were kids
  – Dress up old-style thinking (empiricism/rationalism)
    • with current fashion (web)
Meta discussion: consistent progress, history repeating itself, discontinuities
Come up with a new angle: bounds
• Lower bound: we will solve such and such (x)
  – Extrapolations based on Moore’s Law
• Upper bound: we won’t solve x (soon/ever)
  – e.g., pass Turing Test, compress speech down to text rates
  – And you can bank on it → good apps based on assumption x can’t be done

Breaking Through Automation Barriers (Borrowed Slide, Illustrative)
[Figure: Complexity of User Interaction v. Complexity of Services; labels: Traditional IVR, Word Spotting, Advanced ASR, Natural Language Dialog Agents]

Borrowed Slide Past, Present, Future….
1990+: Directory Assistance, VRCP
• Constrained speech
• Minimal data collection
• Manual design
  – Keyword spotting, Handcrafted grammars, No dialogue
1995+: Airline reservation, Banking
• Constrained speech
• Moderate data collection
• Some automation
  – Medium size ASR, Handcrafted Grammars, System Initiative
2000+: Call centers, E-commerce
• Spontaneous speech
• Extensive data collection
• Semi-automation
  – Large size ASR, Limited NLU, Mixed-initiative
2005+: MATCH: Multimodal Access To City Help; Multimodal, Multilingual Help Desks, E-commerce
• Spontaneous speech/pen
• Fully automated systems
  – Unlimited ASR, Deeper NLU, Adaptive systems

Example of Upper Bound: Reverse Turing Test (Kochanski et al., ICSLP-2002)
• Assume: won’t pass Turing Test (any time soon)
• Assumptions you can bank on
  – Liberace: cry all the way to the bank
• Good apps for crummy (limited) technology
  – “Good Applications for Crummy Machine Translation”
    • Church & Hovy (1993)
• Reverse Turing Test
  – Owner of web site wants to grant access to people but not to spiders
  – Task: distinguish friend from foe, man from beast
  – Solution: assume there is a class of problems (AI-complete) that any person can do and no machine can.
• Currently deployed Reverse Turing Applications
  – Assume OCR is AI-complete
  – User is given a degraded image and asked to enter text into a form
  – Easy for people but challenging for machines
• Problem: OCR is not challenging enough for machines
• Proposal: Speech recognition with noise is more challenging
  – We can bank on not solving the cocktail party effect any time soon

Where have we been and where are we going?
1. Consistent progress over decades
   • Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
   • Empiricism: 1950s
   • Rationalism: 1970s
   • Empiricism: 1990s
   • Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
   • Petabytes: $2,000,000 → $2,000
   • Can demand keep up with supply?
   • If not → Tech meltdown
   • New priorities: Search >> Compression & Dictation

Statistical MT: IBM Models 1-5
• E → Noisy Channel → F
• E′ = ARGMAX_E Pr(E) Pr(F|E)
• Language Model, Pr(E):
  – Trigram model (borrowed from speech recog)
• Channel Model, Pr(F|E):
  – Based on aligned parallel corpora
  – Models 1-5: alignment
• Mercer & Church (Computational Linguistics, 1993)
  – Statistical MT may fail for reasons advanced by Chomsky
  – Regardless of its ultimate success or failure,
  – There is a growing community of researchers in corpus-based linguistics who believe it will produce valuable lexical resources
    • Bilingual concordances
    • Translation tools
    • Training & testing material for word sense disambig (senseval)

Word Sense Disambiguation
• Knowledge Acquisition Bottleneck
  – Bar-Hillel (1960)
  – Expert systems don’t scale
  – Sense-tagged text: expensive
  – Parallel text!
• Translation = sense-tagged text
  – Sentence (judicial sense) → peine
  – Sentence (syntactic sense) → phrase
• Yarowsky: bilingual → monolingual
• One sense per discourse
• Machine Learning: early example of co-training (EM alg)

TMI-02 Keynote (similar subject)
The organizers asked me…
• What's changed since TMI-92 (if anything)?
  – TMI-92: great excitement over the use of aligned parallel corpora to help human translators (translation tools)
  – Also, much controversy over IBM Models 1-5
    • Have IBM Models 1-5 failed to solve all the world’s problems?
• So what's happened (if anything) since 1992?
  – Empiricism has come of age
    • Textbooks: Charniak, Jelinek, Manning & Schütze, Jurafsky & Martin
    • Textbooks → courses in many universities around the world
  – What used to be considered radical is now accepted practice
    • Evaluation is practically required for publication
  – Mercer’s fighting words: More data is better data!
    • Aren’t as shocking when Brill makes the case a decade later
  – The new field of Machine Learning has absorbed many good (and formerly controversial) ideas including
    • IBM Models 1-5
    • Yarowsky's Word Sense Disambiguation
      – Grew out of Machine Translation,
      – But is now widely cited in Machine Learning as an early example of co-training

What has happened to the IBM Approach to Machine Translation?
• Support for human translators
  – Terminology: translators don’t need help with the easy vocabulary and the easy grammar
  – Translation Memory: translators are often asked to translate the same material again and again (e.g., revisions of manuals)
  – Alignment
• Fully automatic
  – CLIR: cross-language information retrieval
  – Translating web pages
• Academic fields
  – Machine Learning: most important contribution
  – Corpus-based Lexicography: spreading into lots of other fields

Revival of Empiricism: A Personal Perspective
• As a student at MIT, I was solidly opposed to empiricism
  – But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-Sound Rules (speech synthesis)
  – Names: Letter stats → Etymology → Pronunciation (video)
  – NetTalk: Neural Nets (video)
    • Demo: great theater → unrealistic expectations
    • Self-organizing systems v. empiricism
    • Machine Learning v. Corpus-based Linguistics
    • I did it, I did it, I did it, but…
• Part of Speech Tagging (1988)
• Word Associations (Hanks)
  – Mutual info → collocations & word associations
    • Collocations: Strong tea v. powerful computers
    • Word Associations: bread and butter, doctor/nurse
• Good-Turing Smoothing (Gale)
• Aligning Parallel Corpora (inspired by MT)
• Word Sense Disambiguation
  – Bilingual → Monolingual
• Even if IBM’s approach fails for MT → lasting benefit (tools, linguistic resources, academic contributions to machine learning)

Speech Coding (Telephony)
• More complicated than Moore’s Law
  – Many Dimensions: Bit Rate, Quality, Complexity and Delay
  – Quality ceiling (imposed by telephone standards)
    • Easy to reach the ceiling at high bit rates (≥ 8 kb/s)
    • More room for progress at low bit rates (≤ 8 kb/s)
• Moore’s Law Time Constant
  – Bit rates halve every decade (≤ 8 kb/s)
  – Relatively slow by Moore’s Law standards (not hyper-inflation)
    • Performance doubles every decade
    • Like disk seek or money in the bank (normal inflation)
  – Limited more by physics than investment
• Potential compression opportunity
  – At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?)
    • Speech (2 kb/s) >> text (2 bits/char): 100-1000 times more bits
  – Speech coding will not close this gap for foreseeable future
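
The speech v. text gap on the Speech Coding slide can be checked with a few lines of arithmetic. The sketch below is a back-of-envelope illustration, not part of the talk. It takes 2 bits/char from the slide and the talk's earlier rule of thumb that 1 sec ≈ 1 word; the figure of 6 characters per word (about five letters plus a space) is an assumption added here, and the function names are hypothetical.

```python
import math

BITS_PER_CHAR = 2.0      # from the slide: text costs about 2 bits/char
WORDS_PER_SECOND = 1.0   # the talk's rule of thumb: 1 sec ≈ 1 word
CHARS_PER_WORD = 6.0     # assumption: roughly 5 letters plus a space


def text_bit_rate() -> float:
    """Bits per second needed to carry speech as (compressed) text."""
    return WORDS_PER_SECOND * CHARS_PER_WORD * BITS_PER_CHAR


def speech_to_text_ratio(speech_bps: float) -> float:
    """How many times more bits a speech coder uses than text would."""
    return speech_bps / text_bit_rate()


def decades_to_close_gap(speech_bps: float) -> float:
    """Decades needed to reach text rates if bit rates halve once per decade."""
    return math.log2(speech_to_text_ratio(speech_bps))


if __name__ == "__main__":
    for kbps in (8, 2, 1):
        bps = kbps * 1000
        print(f"{kbps} kb/s: {speech_to_text_ratio(bps):.0f}x text rate, "
              f"{decades_to_close_gap(bps):.1f} decades of halving")
```

Under these assumptions text needs about 12 bits/s, so speech at 2 kb/s carries roughly 170 times more bits, and halving the bit rate once per decade would take more than seven decades to reach text rates. A different characters-per-word guess shifts the numbers but not the order of magnitude.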