Speech and Language Processing:
Where have we been
and where are we going?
Kenneth Ward Church
AT&T Labs-Research
church@att.com
www.research.att.com/~kwc
Where have we been?
How To Cook A Demo
(After Dinner Talk at TMI-1992 & Invited Talk at TMI-2002)
• Great fun!
• Effective demos (message for the After Dinner Talk)
– Theater, theater, theater
– Production quality matters
– Entertainment >> evaluation
– Strategic vision >> technical correctness
• Success/Catastrophe (message for the After Breakfast Talk)
– Warning: demos can be too effective
– Dangerous to raise unrealistic expectations
Eurospeech 2003
Let’s go to the video tape!
(Lesson: manage expectations)
• Lots of predictions
– Entertaining in retrospect
– Nevertheless, many of these people went on to very successful careers: president of MIT, Microsoft exec, etc.
1. Machine Translation (1950s) video
– Classic example of a demo → embarrassment in retrospect
2. Translating telephone (late 1980s) video
– Pierre Isabelle pulled a similar demo because it was so effective
– The limitations of the technology were hard to explain to the public
• Though well understood by the research community
3. Apple (~1990) video
– Still having trouble setting appropriate expectations
– Factoid: the day of this demo, speech recognition deployed at scale in AT&T network – with significant lasting impact – but little media
4. Andy Rooney (~1990): reset expectations video
Outline: Where have we been
and where are we going?
(Managing Expectations)
1. Consistent progress over decades
• Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
• Empiricism: 1950s
• Rationalism: 1970s
• Empiricism: 1990s
• Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
• Petabytes: $2,000,000 → $2,000
• Can demand keep up with supply?
• If not → Tech meltdown
• New priorities: Search >> Compression & Dictation
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
(Managing Expectations)
• Controversial in 1980s
– But not in 1990s
– Though, grumbling
Benefits
1. Agreement on what to do
2. Limits endless discussion
3. Helps sell the field
• Manage expectations
• Fund raising
Risks (similar to benefits)
1. All our eggs are in one basket (lack of diversity)
2. Not enough discussion
• Hard to change course
3. Methodology → Burden
Hockey Stick Business Case
[Figure: $ versus time t, with 2002 (last year), 2003 (this year) and 2004 (next year) on the time axis]
Where have we been and where are we going?
Moore’s Law: Ideal Answer
[Figure: hyper-inflation versus normal inflation; Physics & Investment → Rate of Progress in Speech & Language (and everything)]
Why different slopes?
1. Progress limited by physics
– Disk seek: 10 years (normal inflation)
– Disk capacity: 1 year (hyper-inflation)
2. Progress limited by investment
– Case history: PCs improved faster than supercomputers (Cray)
• PCs: larger market → more R&D
– Irony: “Dis-economy of Scale”
– Danny Hillis (Thinking Machines)
• Computing is better (cheaper & faster) on smaller machines
– PCs >> big iron
– LAN routers >> 5ESS (big phone switch)
– Economies of scale depend on size of market, not size of machine
• Market: PC >> big iron (Economist View)
• Machine: PC << big iron (CS View)
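The gap between the two slopes can be made concrete with a little compound-growth arithmetic. The sketch below assumes the "10 years" and "1 year" figures on the slide are doubling times; the function name is illustrative only.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Improvement factor after `years` when performance doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)


# Assumption: the slide's figures are doubling times.
disk_seek = growth_factor(10, 10)      # normal inflation
disk_capacity = growth_factor(10, 1)   # hyper-inflation

print(f"disk seek over a decade:     {disk_seek:.0f}x")
print(f"disk capacity over a decade: {disk_capacity:.0f}x")
```

Under that reading, a decade buys a factor of 2 in seek time but a factor of about 1000 in capacity, which is the same order as the $2,000,000 → $2,000 petabyte projection used later in the talk.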
Evolution of Speech Coder Performance
(Borrowed Slide: Rich Cox)
[Figure: speech quality (Bad, Poor, Fair, Good, Excellent) versus Bit Rate (kb/s), with 1980, 1990 and 2000 profiles; labelled points: ITU Recommendations, Cellular Standards, North American TDMA, Secure Telephony]
Speech Coding (Telephony)
• More complicated than Moore’s Law
– Many Dimensions: Bit Rate, Quality, Complexity and Delay
– Quality ceiling (imposed by telephone standards)
• Easy to reach the ceiling at high bit rates (≥ 8 kb/s)
• More room for progress at low bit rates (≤ 8 kb/s)
• Moore’s Law Time Constant
– Bit rates halve every decade (≤ 8 kb/s)
– Relatively slow by Moore’s Law standards (not hyper-inflation)
• Performance doubles every decade
• Like disk seek or money in the bank (normal inflation)
– Limited more by physics than investment
• Potential compression opportunity
– At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?) → 50 bits per sec (??)
– Entropy: 50 bits per sec (Roger Moore)
• Speech (2 kb/s) >> text (2 bits/char): 10-1000 times more bits
– Speech coding will not close this gap for foreseeable future
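A rough check of the speech-versus-text gap, as a sketch rather than a measurement. The 2 kb/s and 2 bits/char figures come from the slide; the speaking rate of one word per second is borrowed from a later slide, and six characters per word (including the space) is an assumption made here for illustration.

```python
def speech_to_text_ratio(speech_bits_per_sec: float,
                         words_per_sec: float = 1.0,    # assumption: about 1 word per second
                         chars_per_word: float = 6.0,   # assumption: 5 letters plus a space
                         bits_per_char: float = 2.0) -> float:
    """How many times more bits coded speech needs than the equivalent text."""
    text_bits_per_sec = words_per_sec * chars_per_word * bits_per_char
    return speech_bits_per_sec / text_bits_per_sec


for rate in (8000, 2000, 1000, 50):
    print(f"{rate:>5} bits/s speech is about {speech_to_text_ratio(rate):.0f}x text")
```

With these assumptions, 2 kb/s speech costs roughly 170 times the bits of its transcript, inside the slide's 10-1000 range; only the speculative 50 bits per sec figure would bring the two within a small factor of each other.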
Where have we been
and where are we going?
1. Consistent progress over decades
• Moore’s Law
• Speech Coding
• Reducing Speech Recognition Error Rates
2. History repeats itself
• Empiricism: 1950s
• Rationalism: 1970s
• Empiricism: 1990s
• Rationalism: 2010s (?)
3. Discontinuities: Fundamental changes that invalidate fundamental assumptions
• Petabytes: $2,000,000 → $2,000
• Can demand keep up with supply?
• If not → Tech meltdown
• New priorities: Search >> Compression & Dictation
Error Rate
(Borrowed Slide: Audrey Le, NIST)
[Figure: Error Rate versus Date (15 years)]
Moore’s Law Time Constant:
• 10x improvement per decade
• Limited by R&D Investment
• (Not Physics)
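To compare this time constant with the doubling times quoted earlier, the 10x-per-decade figure can be converted to a yearly rate and a halving time. This is a small sketch that assumes steady exponential improvement; the function names are illustrative.

```python
import math


def annual_factor(factor_per_decade: float) -> float:
    """Per-year improvement factor implied by a per-decade factor."""
    return factor_per_decade ** (1 / 10)


def halving_time_years(factor_per_decade: float) -> float:
    """Years needed to cut the error rate in half at that pace."""
    return 10 * math.log(2) / math.log(factor_per_decade)


yearly = annual_factor(10)
print(f"relative error reduction per year: {1 - 1 / yearly:.1%}")
print(f"error rate halves every {halving_time_years(10):.1f} years")
```

On that assumption the error rate falls by about a fifth each year and halves roughly every three years, faster than speech coding (a decade) but slower than disk capacity (a year).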
Milestones in Speech and Multimodal Technology Research
(Borrowed Slide)
[Figure: timeline by Year, 1962 to 2002 in five-year steps, with five stages]
• Small Vocabulary, Acoustic Phonetics-based: Isolated Words; Filter-bank analysis; Time-normalization; Dynamic programming
• Medium Vocabulary, Template-based: Isolated Words; Connected Digits; Continuous Speech; Pattern recognition; LPC analysis; Clustering algorithms; Level building
• Large Vocabulary, Statistical-based: Connected Words; Continuous Speech; Hidden Markov models; Stochastic Language modeling
• Large Vocabulary; Syntax, Semantics: Continuous Speech; Speech Understanding; Stochastic language understanding; Finite-state machines; Statistical learning
• Very Large Vocabulary; Semantics, Multimodal Dialog, TTS: Spoken dialog; Multiple modalities; Concatenative synthesis; Machine learning; Mixed-initiative dialog
Consistent improvement over time, but unlike Moore’s Law, hard to extrapolate (predict future)
Speech-Related Technologies
Where will the field go in 10 years?
Niels Ole Bernsen (ed)
2003 Useful speech recognition-based language tutor
2003 Useful portable spoken sentence translation systems
2003 First pro-active spoken dialogue with situation awareness
2004 Satisfactory spoken car navigation systems
2005 Small-vocabulary (> 1000 words) spoken conversational systems
2006 Multiple-purpose personal assistants (spoken dialog, animated characters)
2006 Task-oriented spoken translation systems for the web
2006 Useful speech summarization systems in top languages
2008 Useful meeting summarization systems
2010 Medium-size vocabulary conversational systems
Where have we been and where are we going?
Consistent Progress over Time (Manage Expectations)
[Figure: regions labelled Physics, Physics and Investment, and Investment, marked with where Extrapolation/Prediction is Applicable and where it is Not Applicable; the Investment region shows the hockey-stick plot of $ versus t for 2002, 2003, 2004]
It has been claimed that
Recent progress made possible by Empiricism
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
– Dominating a broad set of fields
• Ranging from psychology (Behaviorism)
• To electrical engineering (Information Theory)
– Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
• Word association norms (priming): bread and butter, doctor / nurse
– Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
• Firth: “You shall know a word by the company it keeps”
• Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
– with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
– and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
– Availability of massive amounts of data (popular arg, even before the web)
• “More data is better data”
• Quantity >> Quality (balance)
– Pragmatic focus:
• What can we do with all this data?
• Better to do something than nothing at all
– Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
• Periodic signals are continuous
• Support extrapolation/prediction
• Progress? Consistent progress?
• Extrapolation/Prediction: Applicable?
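The "strong tea v. powerful computers" contrast is the kind of pattern a simple association score picks up from co-occurrence counts. The sketch below uses pointwise mutual information, which is one common choice and is not named on the slide; the adjective-noun counts are made up for illustration.

```python
import math
from collections import Counter


def pmi_table(pairs):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) over observed (x, y) pairs."""
    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(x for x, _ in pairs)
    right_counts = Counter(y for _, y in pairs)
    return {
        (x, y): math.log2((c / n) / ((left_counts[x] / n) * (right_counts[y] / n)))
        for (x, y), c in pair_counts.items()
    }


# Hypothetical toy data, not drawn from any real corpus.
pairs = ([("strong", "tea")] * 8 + [("powerful", "computers")] * 8
         + [("strong", "computers")] + [("powerful", "tea")])

for pair, score in sorted(pmi_table(pairs).items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 2))
```

Pairs that occur together more often than their individual frequencies predict get a positive score, and the odd-sounding pairs get a negative one, which is Firth's "company it keeps" turned into a number.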
Speech → Language
Has the pendulum swung too far?
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
– Has too much happened since TMI-1992?
(Plays well at Machine Translation conferences)
• I worry that the pendulum has swung so far that
– We are no longer training students for the possibility
• that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
– Statistics and Machine Learning
– as well as Linguistic Theory
• History repeats itself:
– 1950s: empiricism
– 1970s: rationalism (empiricist methodology became too burdensome)
– 1990s: empiricism
– 2010s: rationalism (empiricist methodology is burdensome, again)
(Mark Twain; bad idea then and still a bad idea now)
Aspect | Rationalism | Empiricism
Well-known advocates | Chomsky, Minsky | Shannon, Skinner, Firth, Harris
Model | Competence Model | Noisy Channel Model
Contexts of Interest | Phrase-Structure | N-Grams
Goals | All and Only; Explanatory; Theoretical | Minimize Prediction Error (Entropy); Descriptive; Applied
Linguistic Generalizations | Agreement & Wh-movement | Collocations & Word Associations
Parsing Strategies | Principle-Based, CKY (Chart), ATNs, Unification | Forward-Backward (HMMs), Inside-outside (PCFGs)
Applications | Understanding: Who did what to whom | Recognition: Noisy Channel Applications
Meeting Demand for Petabytes
Bet: Speech >> Text
(because we aren’t going to solve all “speech” problems)
• Moore’s Law → More and More Supply
– Disks, Memory, Network Bandwidth, everything…
– Petabytes are coming: $2,000,000 (today) → $2,000 (in 10 years)
• Can demand keep up?
– If not, revenues will collapse → tech meltdown (Discontinuity)
– Much worse than the Dot-Bomb…
• Ans1: no problem
– Demand has always kept up
– Pundits have never been able to explain why
• Thomas J. Watson (1943): I think there is a world market for maybe five computers www.wikipedia.org/wiki/Thomas+J.+Watson
– But if you build it, they will come
• Ans2: big problem (prices for PCs & Networks are collapsing)
– Demand is everything
– Anyone (even a dot-com) can build a network,
– But the challenge is to sell it
– Need a killer app (more minutes on the network)
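The petabyte price projection implies a particular pace of decline. As a minimal sketch, assuming the price falls at a steady exponential rate between the two figures on the slide, the implied halving time is:

```python
import math


def price_halving_time_years(price_now: float, price_later: float, years: float) -> float:
    """Years for the price to halve, assuming steady exponential decline."""
    return years * math.log(2) / math.log(price_now / price_later)


t = price_halving_time_years(2_000_000, 2_000, 10)
print(f"price halves about every {t:.2f} years")
```

A 1000x drop in a decade works out to halving about once a year, which matches the one-year figure given for disk capacity earlier in the talk.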
How much is a Petabyte?
(10^15 bytes)
• Question from execs:
– How do I explain to a lay audience
• How much is a petabyte
• And why everyone will buy lots of them
• Wrong answer:
– 10^6 is a million (a floppy disk/email msg)
– 10^9 is a billion (a billion here, a billion there…)
– 10^12 is a trillion (the US debt)
– 10^15 is a zillion (= ∞, an unimaginably large #)
How much is a Petabyte?
Some more wrong answers
• Goal: create demand for a petabyte/lifetime
– ≈ 10^15 bytes/100 years ≈ 18 megabytes/minute
– Text: 18,000 pages/min
– Speech: 317 telephone channels for 100 years per capita
• Text won’t do it
– Speech probably won’t either, but it is closer
– DVD video will (1.8 gigabytes/hour = 1.6 petabytes/lifetime), but
• Too much opportunity for compression
• Not enough demand for Picture Phone (privacy concerns)
• Bank on speech recognition not working too well
– Can’t afford big improvements in compression:
• Speech rates → Text rates (Fortunately, that won’t happen)
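The per-lifetime figures above follow from simple unit conversion. The sketch below redoes the arithmetic under stated assumptions: a 100-year lifetime of 365.25-day years, about 1 KB of text per page, and an 8 kb/s telephone channel (the last two are assumptions chosen here, not figures given on the slide).

```python
PETABYTE = 10**15                        # bytes
LIFETIME_YEARS = 100
SECONDS_PER_YEAR = 365.25 * 24 * 3600

bytes_per_sec = PETABYTE / (LIFETIME_YEARS * SECONDS_PER_YEAR)

megabytes_per_min = bytes_per_sec * 60 / 1e6
pages_per_min = bytes_per_sec * 60 / 1_000      # assumption: about 1 KB of text per page
channels = bytes_per_sec / (8_000 / 8)          # assumption: 8 kb/s per telephone channel
dvd_petabytes = 1.8e9 * 24 * 365.25 * LIFETIME_YEARS / PETABYTE  # 1.8 GB/hour, around the clock

print(f"{megabytes_per_min:.1f} MB/min, {pages_per_min:,.0f} pages/min")
print(f"{channels:.0f} telephone channels, DVD video: {dvd_petabytes:.1f} PB/lifetime")
```

With these assumptions the rate comes out near 19 MB/minute, close to the slide's rounded ≈ 18, and the channel count and DVD total reproduce the slide's 317 channels and 1.6 petabytes.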
New Research Challenges
• Old Priorities
– Dictation application dates back to days of dictation machines
– Speech recognition has not displaced typing
• Speech recognition has improved
• But typing skills have improved even more
– My son will learn typing in 1st grade
– Secretaries rarely take dictation
– Dictation machines are history
• My son may never see one
• Museums have slide rules and steam trains
– But dictation machines?
• New Priorities
– Increase demand for space >> Data entry
• New Killer Apps
– Search >> Dictation
• Speech Google!
– Data mining
Data Mining & Call Centers: An Intelligence Bonanza
• Some companies are collecting information with technology designed to monitor incoming calls for service quality.
• Last summer, Continental Airlines Inc. installed software from Witness Systems Inc. to monitor the 5,200 agents in its four reservation centers.
• But the Houston airline quickly realized that the system, which records customer phone calls and information on the responding agent's computer screen, also was an intelligence bonanza, says André Harris, reservations training and quality-assurance director.
In Search of PetaByte Databases
(Borrowed Slide: Jim Gray, Tony Hey)
The Personal Petabyte (someday)
(Borrowed Slide)
Personal 100 GB today
• It’s coming (2M$ today…2K$ in 10 years)
• Today the pack rats have ~ 10-100GB
– 1-10 GB in text (eMail, PDF, PPT, OCR…)
– 10GB – 50GB tiff, mpeg, jpeg,…
– Some have 1TB (voice + video).
• Video can drive it to 1PB.
• Online PB affordable in 10 years.
• Get ready: tools to capture, manage, organize, search, display will be big app.
(Text won’t do it; Speech won’t either)
Hotmail / Yahoo: 300 TB (cooked)
(Borrowed Slide)
• Clone front ends ~10,000 @ hotmail.
• Application servers
– ~100 @ hotmail
– Get mail box
– Get/put mail
– Disk bound
• ~30,000 disks
• ~ 20 admins
Per Capita Demand: Tiny
Cost of storage: People
AOL (msn) (1PB?)
(Borrowed Slide)
• 10 B transactions per day (10% of that)
• Huge storage
• Huge traffic
• Lots of eye candy
• DB used for security/accounting.
• GUESS AOL is a petabyte
– (40M x 10MB = 400 x 10^12)
Per Capita Demand: Tiny
Google: 1.5PB as of last spring (2001)
(Borrowed Slide)
• 8,000 no-name PCs
– Each 1/3U, 2 x 80 GB disk, 2 cpu 256MB ram
• 1.4 PB online.
• 2 TB ram online
• 8 TeraOps
• Slice-price is 1K$ so 8M$.
• 15 admins (!) (== 1/100TB).
Per Capita Demand: Tiny
Cost of storage: People
Digital Immortality:
Gordon Bell & Jim Gray (2000)
Estimated Lifetime Storage Requirements
Data-types | Per day | Per Lifetime
email, papers, text | 0.5 MB | 15 GB
photos | 2 MB | 150 GB
speech | 40 MB | 1.2 TB
music | 60 MB | 5.0 TB
video-lite (200 Kb/s) | 1 GB | 100 TB
DVD video (4.3 Mb/s = 1.8 GB/hour) | 20 GB | 1 PB
Future of Tech Industry Depends On…
• Supply running into a (physical) limit (Not Likely)
– Moore’s Law breaking down
– And little progress on compression
• Demand keeping up (Not Optimistic)
– If we build it, they will come…
• Bell & Gray underestimating demand by a lot
– Everyone wanting lots and lots of speech
– Everyone wanting lots of video
– A miracle (the fat lady might sing…) (Not Likely)
– Big progress on searching speech & video (Best Bet!)
Bait and Switch Strategy
www.elsnet.org
• Bait: public Internet
– Large, sexy, available, rich hypertext structure
• Switch: as large as the web is
– There are larger & more valuable private repositories
• Private Intranets & telephone networks
– Exclusivity → Value
• No one cares about data that everyone can have
• Just as Groucho Marx doesn’t want to be in a club that…
• Strategy: Use the public Internet to develop, test
and socialize new ways to extract value from
large linguistic repositories
– Value to society: Port solutions to private repositories
Switch: How Large is Large?
• Web → Renewed Excitement
– Large, rich hypertext structure & publicly available
– Ngram freqs → Google = 1000 * BNC
• Google: 100 Billion Words
• British National Corpus (BNC): 100 Million Words
– Size: 1 TB (ngram freqs) or 1 PB (Gray)?
• It is often said that the web is the largest repository but…
– Changes to copyright laws could unlock vast resources:
www.lexisnexis.com
• Private Intranets and telephone networks >> Public Web
– American Telephone Network (FCC): 1 line/person
• Usage: 1 hour/day/line
• Assume 1 sec ≈ 1 word → 10 Google collections/day
– Currently, Intranets (data) ≈ telephones (voice)
• But data is growing faster than voice
– AT&T networks: 1 PB/day
• Worldwide networks: tens of PB/day
A lot of speech, but not PB per capita
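The "10 Google collections/day" figure can be checked with a back-of-envelope calculation. The sketch below uses the slide's numbers (1 line/person, 1 hour/day/line, 1 sec ≈ 1 word, Google ≈ 100 billion words). The population value is an assumption that does not appear in the talk.

```python
# Back-of-envelope check of the "10 Google collections/day" estimate.
US_POPULATION = 290_000_000   # assumption: approximate US population in 2003
LINES_PER_PERSON = 1          # from the slide: 1 line/person
HOURS_PER_LINE_PER_DAY = 1    # from the slide: 1 hour/day/line
WORDS_PER_SECOND = 1          # from the slide: 1 sec ≈ 1 word
GOOGLE_WORDS = 100e9          # from the slide: Google ≈ 100 billion words


def telephone_words_per_day(population=US_POPULATION):
    """Words spoken per day over the telephone network under the slide's assumptions."""
    lines = population * LINES_PER_PERSON
    seconds_in_use = HOURS_PER_LINE_PER_DAY * 3600
    return lines * seconds_in_use * WORDS_PER_SECOND


words = telephone_words_per_day()
collections = words / GOOGLE_WORDS
print(f"{words:.2e} words/day = {collections:.1f} Google collections/day")
```

With these inputs the result is roughly a trillion words per day, which is about ten collections of 100 billion words.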
Privacy Concerns: Private Data is Private
(Exclusivity → Value)
• Data on private intranets cannot be distributed
– And most telephone conversations cannot even be recorded
• let alone distributed
• But attitudes are changing
– It used to be considered rude to have an answering machine
– Now it is considered rude not to have one
• Between answering machines and call centers, perhaps
10% of telephone traffic can be recorded (≈ 1 PB/day)
– Customer expectation: call centers can retrieve recordings of
previous calls based on content
• New capabilities → new public policy
– Video recording:
• Expected in banks (ATMs)
• Prohibited in rest rooms (except children’s YMCA locker room)
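The "10% of telephone traffic ≈ 1 PB/day" estimate can be reproduced in the same spirit. The sketch assumes uncompressed 64 kb/s telephone PCM (8,000 bytes per second) and a population of about 290 million. Neither number is stated on the slide, so treat both as assumptions.

```python
# Rough check of "10% of telephone traffic can be recorded (≈ 1 PB/day)".
LINES = 290_000_000            # assumption: 1 line/person, approximate US population
SECONDS_IN_USE_PER_DAY = 3600  # from the talk: 1 hour/day/line
BYTES_PER_SECOND = 8_000       # assumption: 64 kb/s telephone PCM
RECORDABLE_FRACTION = 0.10     # from the slide: perhaps 10% can be recorded
PETABYTE = 1e15


def recordable_pb_per_day():
    """Petabytes per day of recordable speech under the assumptions above."""
    total_bytes = LINES * SECONDS_IN_USE_PER_DAY * BYTES_PER_SECOND
    return RECORDABLE_FRACTION * total_bytes / PETABYTE


print(f"{recordable_pb_per_day():.2f} PB/day recordable")
```

The answer lands a little under 1 PB/day, which agrees with the slide to within the precision of the inputs.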
In the past, recording all this data would
have been prohibitively expensive
• Thanks to Moore’s Law
– Storage costs have been falling faster than transport
– And will continue to do so for some time
• Even at current prices, transport >> storage
– Transport: Long-distance telephone calls: 5 cents per minute of speech
– Storage: Disk space: ½ cent per minute of speech
• If I am willing to pay for a call
– I might as well keep the speech online forever
• Similar comments hold for data (web pages)
– If I am willing to pay to fetch a web page
• I might as well cache it for a long time
• Why flush a page if there is any chance that it might be requested again?
– Web caches → crawlers
• Go find the pages that I might ask for and keep them forever
• Storage is cheap (compared to transport)
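The transport-versus-storage argument reduces to one ratio. The sketch below uses only the two prices quoted on the slide (5 cents and ½ cent per minute of speech). The function name and the 30-minute call are illustrative.

```python
# Prices quoted on the slide, in cents per minute of speech.
TRANSPORT_CENTS_PER_MIN = 5.0   # long-distance telephone call
STORAGE_CENTS_PER_MIN = 0.5     # disk space


def call_costs(minutes):
    """Return (transport cost, storage cost, storage overhead as a fraction of transport)."""
    transport = minutes * TRANSPORT_CENTS_PER_MIN
    storage = minutes * STORAGE_CENTS_PER_MIN
    return transport, storage, storage / transport


transport, storage, overhead = call_costs(30)  # a hypothetical 30-minute call
print(f"transport {transport:.0f}c, storage {storage:.0f}c, overhead {overhead:.0%}")
```

At these prices, keeping the speech online adds about 10% to the cost of making the call in the first place, which is the sense in which "I might as well keep the speech online forever".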
Bait: Use Web to Establish Excitement:
More data is better data
• Shocking at TMI-1992 (Bob Mercer)
but less so a decade later (Eric Brill)
– Many researchers are finding that performance improves with corpus
size, over full range of sizes that are available.
– EMNLP-2002 Best paper (& CL): Using the Web to Overcome
Data Sparseness, Keller et al
• Larger corpora (100B Google) >> Smaller corpora (100M BNC)
– For many tasks: Google is displacing BNC just as PCs displaced Crays
(Larger market share → More $$ for R&D → Better Moore’s Law Time Constant)
– Language modelling
– Predicting psycholinguistic judgements
• Collecting more data is better than tricks for not collecting data
– Smoothing, balance, etc.
– Tricks have limited power:
• Collecting x data with tricks ≈ collecting 10x data without tricks
– Still find papers on “tiny” corpora
– Wish list: more papers measuring power of various tricks
• Was balancing BNC (British National Corpus) worth the effort?
• Should a corpus be balanced? (Oxford Debate, 1991)
• My spin: The rising tide of data will lift all boats!
1. TREC Question Answering
2. Collocations: http://labs1.google.com/sets
The rising tide of data will lift all boats!
TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets
Cat         cat      England    Japan
Dog         more     France     China
Horse       ls       Germany    India
Fish        rm       Italy      Indonesia
Bird        mv       Ireland    Malaysia
Rabbit      cd       Spain      Korea
Cattle      cp       Scotland   Taiwan
Rat         mkdir    Belgium    Thailand
Livestock   man      Canada     Singapore
Mouse       tail     Austria    Australia
Human       pwd      Australia  Bangladesh
Rising Tide of Data Lifts all Boats
Bait: use public web to create & socialize new ideas
• More data → better results
– TREC Question Answering
• Remarkable performance: Google
and not much else
– Norvig (ACL-02)
– AskMSR (SIGIR-02)
– Lexical Acquisition
• Google Sets
– We tried similar things
» but with tiny corpora
» which we called large
Switch: port these ideas to private repositories
Recommendations
Bait and Switch Strategy
• Strategy: Use the public Internet to develop, test and socialize new
ways to extract value from large linguistic repositories
– Value to society: Port solutions to private repositories
• Research papers:
– Bait: Keep up the good work!
– There is already considerable interest in evaluation of new ideas
on corpora (public repositories)
– Switch: There will be more interest in
• How well methods port to new corpora
• How well performance scales with size
– Hopefully corpus size helps
• But of course, all the data in the world
– Will not solve all the world’s problems
– Need to understand when more data will help
• And when it is better to do something else
– Revival of Rationalism (Linguistics)
More Recommendations
Bait and Switch Strategy
• Infrastructure
– Bait: In addition to traditional public repositories (large)
• Web data, data collection efforts such as LDC
– Switch: We ought to think more about private repositories (even larger)
• Most of us do not keep voice mail for long
– But I have been using Scanmail to copy my voice mail to email
– And like many, I keep email online for a long time
• Private repositories would be much larger if
– It was more convenient to capture private data
– and there was obvious value in doing so.
• Currently, tools for public repositories (e.g., Google)
– are better than comparable tools for private data (e.g., searching email)
• Better search tools (email, speech & video) → Larger private
repositories
• New priorities (consume space) → new killer apps
– Search (consumes space) >> Dictation (data entry) & Compression
Summary:
Where have we been and where are we going?
• 1970s: Hot debate: knowledge v. data intensive methods
– People think about what they can afford to think about
– Data was expensive
• Only the richest industrial labs could play
• Beyond the reach of most universities
• Victor Zue dreams of having an hour of speech online (with annotations)
• 1990s: Revival of Empiricism: More data is better data!
– Everyone can afford to play (but still expensive)
– Linguistic Data Consortium (LDC) → Web
– Evaluation, evaluation, evaluation → demonstrates consistent
progress over time, but not as convincingly as Moore’s Law
– Data intensive: method of choice
• Pendulum swings (too) far
• Is this progress, or is the pendulum about to swing back the other way?
• 2010s: Petabytes everywhere (be careful what you ask for)
– Big problem: Supply >> Demand → tech meltdown (??)
– No problem: Demand has always kept up → new killer apps
• Search (consumes space) >> dictation (data entry) & compression
• Video >> Speech >> Text
Callouts: More realistic expectations; Demonstrate consistent progress over time; Oscillations; Discontinuities; Don’t see how to consume PB per capita
Where have we been
and where are we going?
1. Consistent progress over decades
• Moore’s Law, Speech Coding, Error Rate
• Time constant limited by: physics and/or R&D investment
2. History repeats itself:
• Mark Twain; bad idea then and still a bad idea now
• Empiricism: 1950s
• Rationalism: 1970s
• Empiricism: 1990s
• Rationalism: 2010s (?)
3. Discontinuities:
• Fundamental changes that invalidate fundamental assumptions
• Petabytes: $2,000,000 → $2,000
• Can demand keep up with supply?
• If not → Tech meltdown
• New priorities: data entry → create demand for petabytes
– New Killer Apps: Search (creates demand) >> Compression & Dictation
Backup
Speech → Language
Shannon’s: Noisy Channel Model
• I → Noisy Channel → O
• I′ ≈ ARGMAX_I Pr(I|O) = ARGMAX_I Pr(I) Pr(O|I)
– Language Model: Pr(I)
– Channel Model: Pr(O|I)
Language Model (Application Independent)
Word        Rank   More likely alternatives
We          9      The This One Two A Three Please In
need        7      are will the would also do
to          1
resolve     85     have know do…
all         9      The This One Two A Three Please In
of          2      The This One Two A Three Please In
the         1
important   657    document question first…
issues      14     thing point to
Channel Model
Application                           Input        Output
Speech Recognition                    writer       rider
OCR (Optical Character Recognition)   all          a1l
Spelling Correction                   government   goverment
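The noisy channel decision rule above can be illustrated with a toy spelling corrector. This is a minimal sketch and not the system behind the slide: the candidate words and every probability in the two tables are invented for illustration. Only the rule itself, ARGMAX_I Pr(I) Pr(O|I), comes from the slide.

```python
import math

# Hypothetical language model Pr(I): prior probability of each intended word.
PRIOR = {"government": 1e-4, "governor": 5e-5, "garment": 1e-5}

# Hypothetical channel model Pr(O|I), keyed by (observed, intended).
CHANNEL = {
    ("goverment", "government"): 1e-3,
    ("goverment", "governor"): 1e-7,
    ("goverment", "garment"): 1e-6,
}


def decode(observed):
    """Return the intended word I that maximizes Pr(I) * Pr(O|I), computed in log space."""
    scores = {
        intended: math.log(PRIOR[intended]) + math.log(CHANNEL[(observed, intended)])
        for intended in PRIOR
        if (observed, intended) in CHANNEL
    }
    return max(scores, key=scores.get)


print(decode("goverment"))
```

The same two-table structure carries over to speech recognition and OCR: the language model stays the same and only the channel model changes with the application.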
Speech → Language
Using (Abusing) Shannon’s Noisy Channel Model:
Part of Speech Tagging and Machine Translation
• Speech
– Words → Noisy Channel → Acoustics
• OCR
– Words → Noisy Channel → Optics
• Spelling Correction
– Words → Noisy Channel → Typos
• Part of Speech Tagging (POS):
– POS → Noisy Channel → Words
• Machine Translation: “Made in America”
– English → Noisy Channel → French
I am going to try to avoid making
predictions like these because…
• Too falsifiable
• Appearance of conflicts of interest
– Sound like you are trying to raise money for your
favorite stuff
• Committees do what committees do
– Union of all (represented) positions = no position
– Advocate what the members are currently working on
• Rarely establish new strategic direction
• Boring (too obviously correct)
Predictions: Where are we going?
Change the subject (engage in meta discussion)
• Set unrealistic expectations (plenty of examples)
– Sound like you are trying to raise money for your favorite stuff
• And that you have lost touch with reality
• Come up short (fewer examples)
• Sound like you’re over the hill (old fogies session at Coling)
– Kids these days don’t get it
– Everyone should still be working on
• what we thought was important when we were kids
– Dress up old-style thinking (empiricism/rationalism)
• with current fashion (web)
→ Meta discussion: consistent progress, history repeating itself, discontinuities
→ Come up with a new angle: bounds
• Lower bound: we will solve such and such (x)
– Extrapolations based on Moore’s Law
• Upper bound: we won’t solve x (soon/ever)
– e.g., pass Turing Test, compress speech down to text rates
– And you can bank on it → good apps based on assumption x can’t be done
Breaking Through Automation Barriers
(Borrowed slide, illustrative)
[Figure: Complexity of User Interaction v. Complexity of Services. Labels, top to bottom: Natural Language Dialog Agents; Advanced ASR; Word Spotting; Traditional IVR]
Past, Present, Future….
(Borrowed slide)
1990+: Directory Assistance, VRCP
• Constrained speech
• Minimal data collection
• Manual design
– Keyword spotting, Handcrafted grammars, No dialogue
1995+: Airline reservation, Banking
• Constrained speech
• Moderate data collection
• Some automation
– Medium size ASR, Handcrafted Grammars, System Initiative
2000+: Call centers, E-commerce
• Spontaneous speech
• Extensive data collection
• Semi-automation
– Large size ASR, Limited NLU, Mixed-initiative
2005+: Multimodal, Multilingual Help Desks, E-commerce
• Spontaneous speech/pen
• Fully automated systems
– Unlimited ASR, Deeper NLU, Adaptive systems
– MATCH: Multimodal Access To City Help
Example of Upper Bound:
Reverse Turing Test
(Kochanski et al., ICSLP-2002)
• Assume: won’t pass Turing Test (any time soon)
• Assumptions you can bank on
– Liberace: cry all the way to the bank
• Good apps for crummy (limited) technology
– “Good Applications for Crummy Machine Translation”
• Church & Hovy (1993)
• Reverse Turing Test
– Owner of web site wants to grant access to people but not to spiders
– Task: distinguish friend from foe, man from beast
– Solution: assume there is a class of problems (AI-complete) that any
person can solve and no machine can.
• Currently deployed Reverse Turing Applications
– Assume OCR is AI-complete
– User is given a degraded image and asked to enter text into a form
– Easy for people but challenging for machines
• Problem: OCR is not challenging enough for machines
• Proposal: Speech recognition with noise is more challenging
– We can bank on not solving the cocktail party effect any time soon
Where have we been
and where are we going?
1. Consistent progress over decades
• Moore’s Law, Speech Coding, Error Rate
2. History repeats itself
• Empiricism: 1950s
• Rationalism: 1970s
• Empiricism: 1990s
• Rationalism: 2010s (?)
→ Discontinuities: Fundamental changes that invalidate
fundamental assumptions
• Petabytes: $2,000,000 → $2,000
• Can demand keep up with supply?
• If not → Tech meltdown
• New priorities: Search >> Compression & Dictation
Statistical MT:
IBM Models 1-5
• E → Noisy Channel → F
• E′ = ARGMAX_E Pr(E) Pr(F|E)
• Language Model, Pr(E):
– Trigram model (borrowed from speech recog)
• Channel Model, Pr(F|E):
– Based on aligned parallel corpora
– Models 1-5: alignment
• Mercer & Church (Computational Linguistics, 1993)
– Statistical MT may fail for reasons advanced by Chomsky
– Regardless of its ultimate success or failure,
– There is a growing community of researchers in corpus-based
linguistics who believe it will produce valuable lexical resources
• Bilingual concordances
• Translation tools
• Training & testing material for word sense disambig (senseval)
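The channel model Pr(F|E) is estimated from aligned parallel text. A minimal sketch of the simplest case, IBM Model 1 word-translation probabilities trained with EM, is shown below. It is a simplified illustration under stated assumptions: it omits the NULL word, the three sentence pairs are a made-up toy corpus, and the function name is hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)] ≈ Pr(f | e) with EM, in the style of IBM Model 1 (no NULL word)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e
        for es, fs in pairs:
            for f in fs:
                # E-step: distribute each French word over the English words in the pair.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts per English word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


# Toy parallel corpus (hypothetical): (English tokens, French tokens).
corpus = [
    ("the house".split(), "la maison".split()),
    ("the book".split(), "le livre".split()),
    ("a book".split(), "un livre".split()),
]
t = train_model1(corpus)
print(f"Pr(livre | book) = {t[('livre', 'book')]:.2f}")
```

Even on three sentence pairs, EM concentrates probability on "livre" as the translation of "book", because "livre" is the only French word that appears in every pair containing "book".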
Word Sense Disambiguation
• Knowledge Acquisition Bottleneck
– Bar-Hillel (1960)
– Expert systems don’t scale
– Sense-tagged text: expensive
– Parallel text!
• Translation = sense-tagged text
– Sentence (judicial sense) → peine
– Sentence (syntactic sense) → phrase
• Yarowsky: bilingual → monolingual
• One sense per discourse
• Machine Learning: early example of co-training (EM alg)
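The idea that a translation acts as a free sense tag can be sketched in a few lines. The two-way mapping (peine = judicial, phrase = syntactic) is from the slide. The English contexts and their alignments are hypothetical examples invented for illustration.

```python
# Hypothetical word-aligned examples: (English context, French translation of "sentence").
ALIGNED_EXAMPLES = [
    ("the judge imposed a long sentence", "peine"),
    ("the sentence has no verb", "phrase"),
    ("he served his sentence", "peine"),
]

# From the slide: the French translation identifies the English sense.
SENSE_OF_TRANSLATION = {"peine": "judicial", "phrase": "syntactic"}


def sense_tag(examples):
    """Label each English context with the sense implied by its translation."""
    return [(context, SENSE_OF_TRANSLATION[french]) for context, french in examples]


tagged = sense_tag(ALIGNED_EXAMPLES)
for context, sense in tagged:
    print(f"{sense:10s} {context}")
```

The output is sense-tagged English text obtained without any manual annotation, which is what makes parallel text attractive as training and testing material.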
TMI-02 Keynote (similar subject)
The organizers asked me…
• What's changed since TMI-92 (if anything)?
– TMI-92: great excitement over the use of aligned parallel corpora to help
human translators (translation tools)
– Also, much controversy over IBM Models 1-5
• Have IBM Models 1-5 failed to solve all the world’s problems?
• So what's happened (if anything) since 1992?
– Empiricism has come of age
• Textbooks: Charniak, Jelinek, Manning & Schütze, Jurafsky & Martin
• Textbooks → courses in many universities around the world
– What used to be considered radical is now accepted practice
• Evaluation is practically required for publication
– Mercer’s fighting words: More data is better data!
• Aren’t as shocking when Brill makes the case a decade later
– The new field of Machine Learning has absorbed many good (and
formerly controversial) ideas including
• IBM Models 1-5
• Yarowsky's Word Sense Disambiguation
– Grew out of Machine Translation,
– But is now widely cited in Machine Learning as an early example of co-training
What has happened to the IBM Approach to Machine Translation?
• Support for human translators
– Terminology: translators don’t need help with the easy
vocabulary and the easy grammar
– Translation Memory: translators are often asked to translate
the same material again and again (e.g., revisions of manuals)
– Alignment
• Fully automatic
– CLIR: cross-language information retrieval
– Translating web pages
• Academic fields
– Machine Learning: most important contribution
– Corpus-based Lexicography: spreading into lots of other fields
Revival of Empiricism:
A Personal Perspective
• As a student at MIT, I was solidly opposed to empiricism
– But that changed soon after moving to AT&T Bell Labs (1983)
• Letter-to-Sound Rules (speech synthesis)
– Names: Letter stats → Etymology → Pronunciation video
– NetTalk: Neural Nets video
• Demo: great theater → unrealistic expectations
• Self-organizing systems v. empiricism
• Machine Learning v. Corpus-based Linguistics
• I did it, I did it, I did it, but…
• Part of Speech Tagging (1988)
• Word Associations (Hanks)
– Mutual info → collocations & word associations
• Collocations: Strong tea v. powerful computers
• Word Associations: bread and butter, doctor/nurse
• Good-Turing Smoothing (Gale)
• Aligning Parallel Corpora (inspired by MT)
• Word Sense Disambiguation
– Bilingual → Monolingual
• Even if IBM’s approach fails for MT → lasting benefit (tools, linguistic
resources, academic contributions to machine learning)
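The "mutual info → collocations" step can be sketched as pointwise mutual information over word pairs, PMI(x, y) = log2( Pr(x, y) / (Pr(x) Pr(y)) ). The version below is a simplification: it counts only adjacent pairs rather than co-occurrence within a window, uses unsmoothed relative frequencies, and runs on a made-up token sequence.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(x, y): PMI in bits} for adjacent word pairs, from raw relative frequencies."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (x, y), c_xy in bigrams.items():
        p_xy = c_xy / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy text; real estimates need a large corpus.
tokens = "strong tea and powerful computers but strong tea again".split()
scores = bigram_pmi(tokens)
print(f"PMI(strong, tea) = {scores[('strong', 'tea')]:.2f} bits")
```

On a corpus this small every attested pair scores high, which is one reason the measure is only informative with a lot of data.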
Speech Coding (Telephony)
• More complicated than Moore’s Law
– Many Dimensions: Bit Rate, Quality, Complexity and Delay
– Quality ceiling (imposed by telephone standards)
• Easy to reach the ceiling at high bit rates (≥ 8 kb/s)
• More room for progress at low bit rates (≤ 8 kb/s)
• Moore’s Law Time Constant
– Bit rates halve every decade (≤ 8 kb/s)
– Relatively slow by Moore’s Law standards (not hyper-inflation)
• Performance doubles every decade
• Like disk seek or money in the bank (normal inflation)
– Limited more by physics than investment
• Potential compression opportunity
– At most 10x: 8 kb/s → 2 kb/s → 1 kb/s (?)
• Speech (2 kb/s) >> text (2 bits/char): 100-1000 times more bits
– Speech coding will not close this gap for foreseeable future
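The size of the speech-to-text gap can be worked out from the slide's figures. The sketch below takes 2 bits/char from this slide and 1 word per second from the earlier telephone estimate. The average of 6 characters per word (including the space) is an assumption, and the "decades to close" figure simply extrapolates the halving-per-decade rate quoted above.

```python
import math

WORDS_PER_SECOND = 1   # from the talk: 1 sec ≈ 1 word
CHARS_PER_WORD = 6     # assumption: average word length including the space
BITS_PER_CHAR = 2      # from the slide: text at 2 bits/char
TEXT_BPS = WORDS_PER_SECOND * CHARS_PER_WORD * BITS_PER_CHAR  # 12 bits per second


def speech_to_text_ratio(speech_bps):
    """How many times more bits speech needs than text at the given coding rate."""
    return speech_bps / TEXT_BPS


def decades_to_close_gap(speech_bps):
    """Decades needed if bit rates keep halving once per decade (extrapolation only)."""
    return math.log2(speech_to_text_ratio(speech_bps))


for rate in (8000, 2000, 1000):
    print(f"{rate} b/s: {speech_to_text_ratio(rate):.0f}x text, "
          f"{decades_to_close_gap(rate):.1f} decades to close")
```

Under these assumptions the ratio falls in the range of roughly 80x to 700x, consistent with the slide's "100-1000 times more bits", and even at 2 kb/s the gap corresponds to about seven more halvings.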