Empiricism from TMI-1992 to AMTA-2002 to AMTA-2012 Have IBM Models 1-5 failed to solve all the world’s problems? Kenneth W. Church AT&T Labs-Research church@att.com The organizers asked me… • What's changed since TMI-92 (if anything)? – TMI-92: great excitement over the use of aligned parallel corpora to help human translators (translation tools) – Also, much controversy over IBM Models 1-5 • So what's happened since 1992? – Empiricism has come of age • Textbooks: Charniak, Jelinek, Manning & Schultze, Jurafsky & Martin • Textbooks courses in many universities around the world – What used to be considered radical is now accepted practice • Evaluation is practically required for publication – Mercer’s fighting words: More data is better data! • Aren’t as shocking when Brill makes the case a decade later – The new field of Machine Learning has absorbed many good (and formally controversial) ideas including • IBM Models 1-5 • Yarowsky's Word Sense Disambiguation – Grew out of Machine Translation, – But is now widely cited in Machine Learning as an early example of co-training October 10, 2002 AMTA 2 Has the pendulum swung too far? • What happened since TMI-1992 (if anything)? – 1980-2000: Revival of Empirical Methods • Have empirical methods become too popular? – Has too much happened since TMI-1992? • I worry that the pendulum has swung so far that – we are no longer training students for the possibility • that the pendulum might swing the other way • We ought to be preparing students with a broad education including: – Statistics and Machine Learning – as well as Linguistic Theory October 10, 2002 AMTA 3 Empiricism: Academia Commercial Practice • Empiricism has not only come of age in academic venues (e.g., conferences, textbooks) – but also in commercial venues • Translation tools (e.g., alignment): – Academia commercial practice (Trados) • Good Applications for Crummy MT – Even better apps: • CLIR (cross-language information retrieval) • MT in web search engines (Systran & AltaVista) October 10, 2002 AMTA 4 So, what do I expect to happen over the next decade? • Scale, stupid: – There is a lot of excitement about the web • Not only large and growing and sexy • But also contains a rich structure of hypertext links – I will propose a bait and switch strategy • Bait: public Internet • Switch: the real target is something larger and more valuable – but more elusive • Good Apps for Crummy NLP: – Spend more time on: • what we can do with what we have • and not spend all our resources on the core technology – There is a lot to a killer app: • Great technology helps, but there is a lot more • Similar arguments apply beyond MT to much of NLP and speech October 10, 2002 AMTA 5 Overview Historical rational reconstruction emphasizing empiricism & business Before TMI-1992: How to Cook a Demo • TMI-1992 Debate: Rationalism v. Empiricism • Hybrid/Tools: – – – • What happened to the IBM-Approach to MT? – – – • Kay’s Workstation Good Apps for Crummy MT Trados Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • • Future (AMTA-2012): Bait and Switch Strategy – – • Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Tools, lexicography apps, word sense, machine learning Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 6 How To Cook A Demo (TMI-1992) • Great fun! • Effective demos – – – – Theater, theater, theater Production quality matters Entertainment >> evaluation Strategic vision >> technical correctness • Maturity: Many fields have come of age since 1950s – Computer Science, Artificial Intelligence, Machine Learning, Natural Language, Machine Translation, Empiricism • Success/Catastrophe – Warning: demos can be too effective – Dangerous to raise unrealistic expectations • Seeds of empiricism – Empirical methods: speech language October 10, 2002 AMTA 7 Let’s go to the video tape! (Lesson: manage expectations) • Lots of predictions – – Entertaining in retrospect Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc. 1. Machine Translation (1950s) video – Classic example of a demo embarrassment in retrospect 2. Translating telephone (late 1980s) video – – Pierre Isabelle pulled a similar demo because it was so effective The limitations of the technology were hard to explain to public • • But well understood by research community We aren’t asking what happened to translating telephones (if anything) 3. Apple (~1990) video – – – Still having trouble setting appropriate expectations Strategy: Point & Click Speech recognition What happened to that (if anything)… Has it moved to Microsoft? 4. Andy Rooney (~1990): reset expectations video October 10, 2002 AMTA 8 TMI-1992 Debate: Rationalism v. Empiricism • Self-organizing systems (IBM) – Statistics do it all (no human intuition) • Stone Soup (Wilks) – Statistics don’t do nothing (all human intuition) • Hybrid/Tools (Kay’s Workbench) – Proper Place of Men and Machines in MT • Use people for what they are good at – Easy vocabulary and easy grammar • Use machines for that they are good at – Technical terminology, translation memories, re-use of previously translated texts – Good Application of Crummy MT – Trados – Pragmatism: low hanging fruit • Supply: do what we can do (with or without stats) • Demand: do what is worth doing October 10, 2002 AMTA 9 Stone Soup Debate (mid-1990s) • IBM-style MT is obnoxious – Agreed • It has all been done before – Agreed • Stone soup: they’ve been adding intuition to their stats – Agreed • It doesn’t work (Systran is better) – Systran is also better than Pangloss • It isn’t about empiricism, evaluation, etc. – Martin Kay’s advice about debating • Natural Ceiling – Chomsky used this argument against Shannon – In the part of speech case, the ceiling was broken with stats • Lack of data (lots of Canadian Hansards, but not much else) – We don’t hear this argument so much any more… • The Future: Hybrid Approaches – Agreed October 10, 2002 AMTA 10 Bottom line Hybrid/Tools • Yorick will get the last word • But from his abstract, it looks like he’s going to tell us that I was right all along • And just in case he doesn’t… – Let me say it now: I told you so! October 10, 2002 AMTA 11 Overview Historical rational reconstruction emphasizing empiricism & business • Before TMI-1992: How to Cook a Demo • TMI-1992 Debate: Rationalism v. Empiricism Hybrid/Tools: – – – • What happened to the IBM-Approach to MT? – – – • Kay’s Workstation Good Apps for Crummy MT Trados Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • • Future (AMTA-2012): Bait and Switch Strategy – – • Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Tools, lexicography apps, word sense, machine learning Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 12 Kay’s Workstation (1980) Proper Place of Men and Machines in MT • The translator’s [workstation] will not run before it can walk. – It will be called on only for that for which its masters have learned to trust it. – It will not require constant infusions of new ad hoc devices that only expensive vendors can supply. • It is a framework that will gracefully accommodate the future contributions – that linguistics and computer science are able to make. • One day it will be built because – its very modesty assures its success. • It is to be hoped that it will be built with taste by people – who understand languages and computers well enough to know how little it is that they know. October 10, 2002 AMTA 13 CWARC: Canadian Workplace Automation Research Center (1989) • A PC • Network access to the Termium terminology database CD-ROM • WorkPerfect • CompareRite (a diff tool) • TextSearch (a concordance tool) • Mercury/Termex (terminology) • Procomm (remote access to data banks via telephone modem) • Seconde Memoire (French verb inflections) • Software Bridge (tool for converting word processing files from one commercial format into another) October 10, 2002 AMTA 14 Good Apps for Crummy MT • • • • It should set reasonable expectations It should make sense economically It should be attractive to the intended users It should exploit the strengths of the machine – and not compete with the strengths of the human • It should be clear to the users what the system can and cannot do, and • It should encourage the field to move forward toward a sensible long-term goal. October 10, 2002 AMTA 15 Evaluation: MultiLingual 13:6 A Trade Magazine for Translators SelfOrganizing Stone Soup Hybrid/ Tools Page Description of Article 18 Reviews: Reverso Pro 5, Reverso Expert Products offer useful new features √ 21 T-Remote Memory √ 24 Content Management: Systems for Managing Content Transformations √ 31 Comparing Tools Used in Software Localization √ 37 Working With Machine Translation √* 42 Integrating Translation Tools With Document Creation (CMU → Déjà Vu) √ 49 A Look At Two Web Translation Portals 53 Going Global With Lingo Systems October 10, 2002 AMTA √ √* 16 MultiLingual 13:6 Very Positive on Tools • T-Remote Memory (p. 21) – Combination of translation memory, workflow & distributed work centers (work at home) • Moore’s Law: large revenues better technology – “Hold on to your seats, translators and agencies, our industry is about to change again… Forecast shakeup: extreme.” • Comparing Tools Used in Software Localization (p. 31): A Consumer Reports-like Review – Presupposition: tools are ready for wide-spread use • “Finally, I am told that there are people out there for whom price does matter.” – ATA Conference (~1990): the translator and the MT salesman October 10, 2002 AMTA 17 MultiLingual 13:6 is mostly positive on technology, but… • Working With Machine Translation (p. 37) – Although the formatting of the source text is extremely simple, the machine translation output requires a lot of painstaking postediting • Grim reality: mark-up is more valuable than translation – The verdict is unanimous – The translators, who have gained a lot of experience of EU topics, prefer to work without Systran • A Look at Two Web Translation Portals (p. 49) – At first I thought it was a machine translation site..., but soon discovered that it was not one of those “free translation” sites. October 10, 2002 AMTA ≈ $0.25/word 18 Machine Translation + Post-Editing: Long History of Mixed Results • Positive: – Magusson-Murray (1985, p. 180): Although you can expect to at least double your translator’s output, the real cost-saving in MT likes in complete electronic transfer of information and the integration into a fully electronic publishing system. – Lawson (1984, p. 6): Substantial rises in translations output, by as much as 75 per cent in one case, are being reported by users of the Logos machine translation (MT) system after only a few months. – Tschira (1985): For one type of text (data description manuals), we observed an increase in throughput of 30 per cent. October 10, 2002 AMTA 19 Surprisingly, automation (MT + Post-editing) can be more expensive than manual baseline • Negative: – Macklovitch (1991, p. 3): The HT production chain was significantly faster than the MT production chain. – Kay (1980): Proper Place of Men and Machines in MT – ALPAC (1966, p. 19): The postedited translation took slightly longer to do and was more expensive than conventional human translation… Dr. J. C. R. Licklider of IBM and Dr. Paul Garvin of Bunker-Ramo said they would not advise their companies to establish such a service. • Credibility gap: – Why so little consistency? 200%? 75%? 30%? 0%? – Why haven’t these products done better in the marketplace? – The tools argument (terminology and translation memories) works better with translators than post-editing – Translators may be biased • but they have considerable expertise (and influence) • Automation will be easier if they believe in it October 10, 2002 AMTA 20 Overview Historical rational reconstruction emphasizing empiricism & business • • • Before TMI-1992: How to Cook a Demo TMI-1992 Debate: Rationalism v. Empiricism Hybrid/Tools: – – – Kay’s Workstation Good Apps for Crummy MT Trados What happened to the IBM-Approach to MT? – – – • Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • • Future (AMTA-2012): Bait and Switch Strategy – – • Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Tools, lexicography apps, word sense, machine learning Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 21 What has happened to the IBMApproach to Machine Translation? • Support for human translators (MultiLingual 13:6) 1. 2. 3. • Fully automatic – – • CLIR: cross-language information retrieval Translating web pages Academic fields – – October 10, 2002 Terminology: translators don’t need help with the easy vocabulary and the easy grammar Translation Memory: translators are often asked to translate the same material again and again (e.g., revisions of manuals) Alignment Machine Learning: most important contribution Corpus-based Lexicography: spreading into lots of other fields including politics (Nunberg) AMTA 22 Use of Political Labels in Major Newspapers Geoffrey Nunberg Commentary broadcast on "Fresh Air," March 19, 2002 Total instances in newspapers database Liberal Pct within 7 words of relevant label Total instances in "liberal" papers 4.8% Pct. within 7 words of label in "liberal" papers 3.78% Paul Wellstone 2939 10.90% 578 8.48% Barney Frank 8501 4.70% 1439 3.89% Tom Harkin 10,147 3.70% 1784 2.02% Ted Kennedy 17,197 3.00% 2444 2.74% Barbara Boxer 8977 2.00% 3093 1.78% Conservative 3.6% 2.89% Jesse Helms 19,874 9.10% 4718 6.02% Tom DeLay 6351 3.60% 1859 2.90% John Ashcroft 10,187 2.10% 1157 3.03% Dick Armey 9222 2.10% 1460 1.44% Trent Lott 18,048 1.40% 4976 1.05% October 10, 2002 AMTA 23 Surprisingly, liberals are more likely to be labeled as such than conservatives • In fact, I [Nunberg] did find a big disparity in the way the press labels liberals and conservatives, – but not in the direction that Goldberg claims. • On the contrary: the average liberal legislator has a 30% greater likelihoods of being identified with a partisan label than the average conservative does. • The press describes – Barney Frank as a liberal 2.5 times as frequently – as it describes Dick Armey as a conservative. • It gives Barbara Boxer a partisan label – almost twice as often as it gives one to Trent Lott. • And while it isn't surprising that the press applies the label conservative to Jesse Helms more often than to any other Republican in the group, October 10, 2002 – it describes Paul Wellstone as a liberal – 20% more frequently than that. AMTA 24 1990s Revival of Empiricism • Empiricism was at its peak in the 1950s – Dominating a broad set of fields • Ranging from psychology (behaviorism) • To electrical engineering (information theory) • At the time, it was common practice in linguistics to classify words not only by meaning but also by collocations (word associations) – Firth: “You shall know a word by the company it keeps” – Collocations: Strong tea v. powerful computers – Word Associations: bread and butter, doctor/nurse • Regrettably, interest in empiricism faded – with Chomsky’s criticism of ngrams in Syntactic Structures (1957) – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969). • Availability of massive amounts of data (even before the web) – “More data is better data” – Quantity >> Quality (balance) • Pragmatic focus: – What can we do with all this data? – Better to do something than nothing at all • October 10, 2002 Empirical methods (and focus on evaluation): Speech Language AMTA 25 Shannon’s: Noisy Channel Model Language Model Channel Model • I Noisy Channel O • I΄ ≈ ARGMAXI Pr(I|O) = ARGMAXI Pr(I) Pr(O|I) Language Model Word Rank More likely alternatives We 9 The This One Two A Three Please In need 7 are will the would also do to 1 resolve 85 have know do… all 9 The This One Two A Three Please In of 2 The This One Two A Three Please In the important issues October 10, 2002 Channel Model Application Input Output Speech Recognition writer rider OCR (Optical Character Recognition) all a1l Spelling Correction government goverment 1 657 14 document question first… thing point to AMTA 26 Using (Abusing) Shannon’s Noisy Channel Model: Part of Speech Tagging and Machine Translation • Speech – Words Noisy Channel Acoustics • OCR – Words Noisy Channel Optics • Spelling Correction – Intended Text Noisy Channel Typos • Part of Speech Tagging (POS): – POS Noisy Channel Words • Machine Translation: – English Noisy Channel French October 10, 2002 AMTA 27 Statistical MT • E Noisy Channel F • E΄ = ARGMAXE Pr(E) Pr(F|E) • Language Model, Pr(E): – Trigram model (borrowed from speech recog) • Channel Model, Pr(F|E): – Based on aligned parallel corpora – Models 1-5: alignment • Mercer & Church (Computational Linguistics, 1993) – Statistical MT may fail for reasons advanced by Chomsky – Regardless of its ultimate success or failure, – There is a growing community of researchers in corpus-based linguistics who believe it will produce valuable lexical resources • Bilingual concordances • Translation tools • Training & testing material for word sense disambig (senseval) October 10, 2002 AMTA 28 Word Sense Disambiguation • Knowledge Acquisition Bottleneck – – – – Bar-Hillel (1960) Expert systems don’t scale Sense-tagged text: expensive Parallel text! • Translation = sense-tagged text – Sentence (judicial sense) peine – Sentence (syntactic sense) phrase • Yarowsky: bilingual monolingual • One sense per discourse • Machine Learning: early example of co-training October 10, 2002 AMTA 29 Rationalism Empiricism Well-known Chomsky, Minsky advocates Shannon, Skinner, Firth, Harris Model Competence Model Noisy Channel Model Contexts of Phrase-Structure Interest N-Grams All and Only Minimize Prediction Error (Entropy) Goals Explanatory Descriptive Theoretical Applied Linguistic Agreement & Wh-movement Generalizations Principle-Based, CKY Parsing Strategies (Chart), ATNs, Unification Collocations & Word Associations Forward-Backward (HMMs), Inside-outside (PCFGs) Understanding Recognition Who did what to whom Noisy Channel Applications Applications October 10, 2002 AMTA 30 Revival of Empiricism: A Personal Perspective • At MIT, I was solidly opposed to empiricism – But that changed soon after moving to AT&T Bell Labs (1983) • Letter-to-Sound Rules (speech synthesis) – Names: Letter stats Etymology Pronunciation video – NetTalk: Neural Nets video • • • • • • Demo: great theater unrealistic expectations Self-organizing systems v. empiricism Machine Learning v. Corpus-based Linguistics I did it, I did it, I did it, but… Part of Speech Tagging (1988) Word Associations (Hanks) – Mutual info collocations & word associations • Collocations: Strong tea v. powerful computers • Word Associations: bread and butter, doctor/nurse • • • Good-Turing Smoothing (Gale) Aligning Parallel Corpora (inspired by MT) Word Sense Disambiguation – Bilingual Monolingual • October 10, 2002 Even if IBM’s approach fails for MT lasting benefit (tools, linguistic resources, academic contributions to machine learning) AMTA 31 Overview Historical rational reconstruction emphasizing empiricism & business • • • Before TMI-1992: How to Cook a Demo TMI-1992 Debate: Rationalism v. Empiricism Hybrid/Tools: – – – • What happened to the IBM-Approach to MT? – – – • Kay’s Workstation Good Apps for Crummy MT Trados Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • Tools, lexicography apps, word sense, machine learning Future (AMTA-2012): Bait and Switch Strategy – – • Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 32 Strategy is Important www.elsnet.org • Our field is doing better and better! – It used to be hard to prepare a talk for a CL audience • because there was almost nothing that you could assume everyone knew. – The field will really have arrived when a course in speech and language processing is a normal part of every undergraduate and graduate Computer Science, Electronic Engineering, and Linguistics programme • and we're a long way from that • • But things are improving… Now have several textbooks: Manning & Schutze, Jurafsky and Martin – A quick search of the web: textbooks courses (around the world) • There were, of course, many other obstacles – that limited the size of the field. • It used to be hard to join in on the fun – because only a few large industrial labs could afford to collect data • Thanks to data collection efforts such as LDC and ELSNET and the web – Data is no longer the problem it used to be – Of course, you can never have too much of a good thing… • October 10, 2002 Tools also used to be a problem…. AMTA 33 Learn from Theoretical Computer Science • There is always, though, more – we could do to promote our field • Learn from Theoretical Computer Science – Theory has paid more attention to teaching • than we have – They have also worked hard on strategy • www.research.att.com/~dsj/nsflist.html • The theory community regularly exchange lists of open problems along with difficulty ratings – Students know before they solve a problem whether it is worth a conference paper or a superstar award October 10, 2002 AMTA 34 Strategy: not urgent, but important • Many orgs (.edu, .com, .gov) work hard on strategy • Plenty of examples on the web: – – – – www.nsf.gov/pubs/2001/nsf0104/strategy.htm www.darpa.mil/body/mission.html medg.lcs.mit.edu/doyle/publications/sdcr96.pdf www.gridforum.org/L_About/about.htm • Hard to say why strategy is important – But I have noticed, at least within AT&T, that groups that work hard on strategy have grown and prospered over the years – Strategy is never as urgent as the next conference paper deadline, but it is probably more important October 10, 2002 AMTA 35 Strategy documents have impact (even if it appears that they are being ignored) • Organizations may or may not follow their own recommendations • The discussion that produces the strategy document is extremely valuable, nevertheless, – perhaps more so that anything that happens after doc is finalized • Strategy panels offer a forum for people to meet – and look at the field from a broader perspective • In addition, the theory community has observed that even after the people involved in the original discussion have long since forgotten the outcome – Recommendations continue to live on – and broaden the best and most aggressive students for years October 10, 2002 AMTA 36 Strategy Discussions in Our Field • There are a few discussions of strategy within our field: – – – – http://www.elsnet.org/about.html http://www.ldc.upenn.edu/ldc/about/ldc_intro.html http://www-nlpir.nist.gov/projects/duc/papers/ LREC workshops (to order proceedings, see www.lrec-conf.org) • LDC link developed a decade ago – Largely responsible for the success of LDC – If more groups in our field put the same kind of energy into strategy, • There would be more success stories like the LDC • A delightful "near miss" is Martin Kay's reflections on ICCL and COLING – Establishes direction for the format of Coling conferences in a “classic” Martin-style • Proposal: convince Martin to write a doc in the same delightful style – Establish direction for the field rather than atmosphere for Coling October 10, 2002 AMTA 37 Bait and Switch Strategy www.elsnet.org • Bait: public Internet – Large, sexy, available, rich hypertext structure • Switch: as large as the web is – There are larger & more valuable private repositories • Private Intranets & telephone networks – Exclusivity Value • No one cares about data that everyone can have • Just as Groucho Marx doesn’t want to be in a club that… • Strategy: Use the public Intranet to develop, test and socialize new ways to extract value from large linguistic repositories – Value to society: Apply solutions to private repositories October 10, 2002 AMTA 38 Call Centers: An Intelligence Bonanza October 10, 2002 • Some companies are collecting information with technology designed to monitor incoming calls for service quality. • Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers. • But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director. AMTA 39 Bait: Use Web to establish: More data is better data • • Shocking at TMI-92 (Mercer), but less so a decade later (Brill) EMNLP-02 best paper: Using the Web to Overcome Data Sparseness – Larger corpora (Google) >> smaller corpora (British National Corpus) for predicting psycholinguistic judgements. – Suggested in the conclusions that web counts better than standard smoothing techniques (back-off) for language modelling – Really exciting! Performance on a broad range of computational linguistics tasks will improve as we collect more and more data – The rising tide of data will lift all boats! • Brill (AskMSR Question Answering): – One can do remarkably well in TREC question answering competitions by using a search engine like Google and very little else • • Norvig (ACL-02 invited talk): ditto Google is also very good at finding collocations/associations – http://labs1.google.com/sets • Cat & dog animals • Cat & more Unix commands! – We used to try to do similar things a decade ago, but the results where not as good, probably because we were working with relatively tiny corpora in the subbillion-word range – Unix commands and many other subjects (esp taboo subjects) are overrepresented on web • Quantity v. quality/corpus size v. balance – Is collecting more data better than smoothing? October 10, 2002 AMTA 40 Question Answering & Google October 10, 2002 AMTA 41 http://labs1.google.com/sets Cat more England Japan Dog Horse Fish Bird cat ls rm mv France Germany Italy Ireland China India Indonesia Malaysia Rabbit Cattle Rat cd cp mkdir Spain Scotland Belgium Korea Taiwan Thailand Livestock Mouse Human man tail pwd Canada Austria Australia Singapore Australia Bangladesh October 10, 2002 AMTA 42 How Large is Large? • Web Renewed Excitement – Large, rich hypertext structure & publicly available – Google = 1000 * BNC • Google: 100 Billion Words • British National Corpus (BNC): 100 Million Words • • • It is often said that the web is the largest repository but… Changes to copyright laws could unlock vast quantities of data: www.lexisnexis.com Private Intranets and telephone networks >> Public Web – FCC (trends.html): 200 million telephones in USA (1 line/person) • Usage: 1 hour/day/line • Assume 1 sec ≈ 1 word 10 Google collections/day – Currently, Intranets (data) ≈ telephones (voice) • But data is growing faster than voice • Admittedly, much of the data on Intranets cannot be distributed – And much of the speech on the telephone networks cannot be recorded – But attitudes are changing • It used to be considered rude to have a telephone answering machine • Now it is considered rude not to have one – Between answering machines and call centers, perhaps 10% can be recorded October 10, 2002 AMTA 43 In the past, recording all this data would have been prohibitively expensive • Thanks to Moore’s Law – Storage costs have been falling faster than transport – And will continue to do so for some time • Even at current prices, transport >> storage – Long-distance telephone calls: $0.05/min – Disk space: $0.005/min • If I am willing to pay for a call – I might as well keep the speech online for a long time • Similar comments hold for data (web pages) – If I am willing to pay to fetch a web page • I might as well cache it for a long time • Why flush a page if there is any chance that it might be requested again? – Web caches crawlers • Go find the pages that I might ask for and keep them forever • Storage is cheap (compared to transport) October 10, 2002 AMTA 44 Recommendations Bait and Switch Strategy • Papers: – Keep up the good work! – There is considerable interest in eval on corpora – There will be more interest in how well methods port to new corpora – More interest in how performance scales with size – Hopefully corpus size helps • but of course, all the data in the world will not solve all the world’s problems – Need to understand when more data will help • And when it is better to do something else – revival of linguistics October 10, 2002 AMTA 45 More Bait and Switch Recommendations Investments in infrastructure • In addition to traditional data collection efforts focused on publicly available linguistic repositories – We out to think about private repositories, as well. • Potential: Huge impact on size of private repositories – By making it more convenient to capture private data, and – By demonstrating that there is value in doing so. • For example, most of us do not keep voice mail for long – though I have been using Scanmail to copy voice mail to email – and like many people, I keep a lot of email online for a long time • Unfortunately, tools for searching email and other private repositories are not as good as the tools for searching public repositories (Google) October 10, 2002 AMTA 46 Overview Historical rational reconstruction emphasizing empiricism & business • • • Before TMI-1992: How to Cook a Demo TMI-1992 Debate: Rationalism v. Empiricism Hybrid/Tools: – – – • What happened to the IBM-Approach to MT? – – – • Kay’s Workstation Good Apps for Crummy MT Trados Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • • Tools, lexicography apps, word sense, machine learning Future (AMTA-2012): Bait and Switch Strategy – – Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 47 Market Opportunities for Translation • CACM (with Rau): Commercial Opportunities for NLP – Do we count garage outfits funded by grants? – Fortune-500 perspective: min of $100 million – Identified two application areas • Word Processing (Microsoft) • Information Retrieval (Lexis-Nexis/Web) – Did not identify translation • How large is translation market? Huge estimates: – $10+ Billion (Eurolang) – Comparable to AT&T’s revenues for consumer services – Telcos are a major employer (unlike translation) • Estimates of Market Size – English-only (ASCII): English • Pairs of English Speakers – Monolingual (ISO/Unicode) • Pairs of speakers who share a language Monolingual – Multi-lingual (translation) • Pairs of speakers October 10, 2002 Multi-lingual AMTA 48 Surprisingly Little Demand for Multi-lingual Applications: Translation & Interpretation • AT&T Language Line Lesson: – Surprisingly, monolingual market >> multi-lingual market • • • • Lots of demand for telephone service where both parties speak the same language We thought there would be even more demand for a translating telephone Because there are more pairs of people who don’t share a common language than do But people don’t talk (much) to people they don’t know – AT&T Language Line Service: • Speech to speech interpretation over the phone (low tech except conf calling/work-at-home) • Plus a traditional writing to writing translation service – Interpretation market: focused on emergencies: police, hospital (too small for AT&T) – Translation market: focus on technical manuals (also, too small for AT&T) – Surprise: interpretation market ≠ translation market; demand for language pair • Interpretation (speech to speech): depends on number of domestic speakers • Translation (writing to writing): depends on world-wide GNP • • Putnam: Bowling Alone Bridging and Bonding Employment opportunities for translators – – – – October 10, 2002 Not Good Markup is more valuable than translation Desktop publishing is a better business Business case: Adobe >> Trados AMTA English Monolingual Multi-lingual 49 Summary Historical rational reconstruction emphasizing empiricism & business • • • Before TMI-1992: How to Cook a Demo TMI-1992 Debate: Rationalism v. Empiricism Hybrid/Tools: – – – • What happened to the IBM-Approach to MT? – – – • Kay’s Workstation Good Apps for Crummy MT Trados Support for human translators (Translation Tools: Trados) Fully automatic apps (CLIR) Academia (Machine Learning) Revival of Empiricism: A Personal Perspective – – The IBM-approach to MT was always controversial But there are lots of less controversial spin-offs: • • Future (AMTA-2012): Bait and Switch Strategy – – • Bait: Use Public Internet to develop and test and socialize new ways of extracting value Switch: Apply learnings to larger and more valuable private linguistic repositories Market sizing: translation business is too small for fortune-500 companies – – October 10, 2002 Tools, lexicography apps, word sense, machine learning Major lasting contribution of IBM-Approach: academic (machine learning) MT Strategy: work on MT because it is fun, but apply learnings elsewhere to pay the bills AMTA 50