It is the best of times
(and the worst of times)
Kenneth Church
Microsoft
church@microsoft.com
Responsibility; Attribute
Dangerous Positions to Others
Interesting &
Controversial
Wow!
(What a difference a decade makes)
• Empiricism has come of age
• 1993: Workshop on Very Large Corpora (WVLC)
  – Intended to be a 1-time event
  – But so successful that it evolved into a series of EMNLP conferences
  – Radical Fringe → Mainstream
  – Success/Catastrophe
    • EMNLP-2004 received so many submissions that the program committee had to be expanded at the last minute
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting year (1985 to 2005); labels: Lonely, Preaching to Choir, Bob Moore, Fred Jelinek]
EMNLP-2004 & Senseval-2004
July 25, 2004
The Structure of Scientific Revolutions (1962) – Kuhn (p.10)
• Paradigms
  – Examples from Physics
    • Aristotle’s Physica
    • Ptolemy’s Almagest
    • Newton’s Principia and Opticks
    • Franklin’s Electricity
    • Lavoisier’s Chemistry
    • Lyell’s Geology
• Two characteristics:
  1. Sufficiently unprecedented to attract an enduring group of adherents from competing modes of scientific activity
  2. Simultaneously, sufficiently open-ended to leave all sorts of problems for the redefined group of practitioners to resolve
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
  – Immediately after ACL notifications
    • ACL was rejecting good papers for bad reasons
  – Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
  – Chair (Energetic, Promising, Source of new ideas)
  – Co-chair (Established, Knows how it has been done)
• Avoid incremental papers
  – Reviewers prefer boring papers over radical ones
  – Reviewers do what reviewers do; chairs → correction
• Inclusiveness: Diversity → Growth (Sales)
  – Thankless chores → Marketing carrots
  – 1/3 promising, 1/3 stability, 1/3 outreach
  – Hold conferences in Europe, Asia & America
Callouts: Innovation; Short term ≠ Long term; Checks & Balances
What Worked and What Didn’t?
• Stay on msg: It is data, stupid!
  – WVLC (Very Large) >> EMNLP (Empirical Methods)
  – If you have a lot of data, then you don’t need a lot of methodology
    [Diagram labels: Methodology, Data]
• Empiricism means diff things to diff people
  1. Machine Learning (Self-organizing Methods)
  2. Exploratory Data Analysis (EDA)
  3. Corpus-Based Lexicography
• Lots of papers on 1
  – EMNLP-2004 theme (error analysis) → 2
  – Senseval grew out of 3
Callout: Kucera & Francis gave great invited talk (but they couldn’t follow submitted talks)
Word Sense Disambiguation (WSD) History
• Bar-Hillel (1960):
  – Abandoned Machine Translation (MT)
  – Couldn’t see how to make progress on WSD (pen)
  – Can’t translate without disambiguating
    • bank (money) → banque
    • bank (river) → banc
• 1990s
  – Parallel text ≈ Labeled corpus for supervised training and testing
  – Isn’t it great the translators have WSD labeled all this data for us!
• Yarowsky:
  – Parallel corpus → encyclopedia + thesaurus
  – Bilingual ≠ Monolingual
    • interest
    • wear
  – ML: Co-training
    • Supervised → Unsupervised
• Lexicography: Hector
  – Joint collaboration: Oxford University Press & DEC
  – flagging → flogging
• Senseval
A Road Rarely Taken:
Tukey’s Exploratory Data Analysis (EDA)
• Linear Regression
  – Standard practice:
    • Plug data into off-the-shelf package
    • Publish (if “significant”)
  – Better:
    • Check for outliers
    • Bowed residuals
      – Evidence of a positive or negative derivative
    • Deviations from assumptions (normality)
      – Fanout
• Slocum’s Thesis (1981)
  – “Proof” that CKY takes linear time
  – “Standard texts (e.g., Aho)… consider … worst case… This assumption clearly fails to apply to natural language… Our experiments have shown that average-case time performance… is approximately linear” (p. 102)
[Figure: two plots of Time (0 to 50000) against Sentence Length (0 to 30); annotation: No Result]
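The “bowed residuals” check can be sketched in a few lines. This is only an illustration under stated assumptions: the timings are synthetic (a cubic curve standing in for worst-case chart parsing, not Slocum’s measurements), and the helper names `fit_line`, `residuals` and `bow` are hypothetical. The idea is to fit a straight line and then compare the residuals at the ends of the range with those in the middle; a large gap suggests the linear fit is hiding curvature.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residuals(xs, ys):
    """Observed value minus the fitted line, one residual per point."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def bow(xs, ys):
    """Mean residual in the outer thirds minus mean residual in the middle third.

    Near zero for truly linear data; clearly positive or negative when the
    residuals are bowed (the data curve away from the fitted line).
    """
    ordered = [r for _, r in sorted(zip(xs, residuals(xs, ys)))]
    third = len(ordered) // 3
    outer = ordered[:third] + ordered[-third:]
    middle = ordered[third:-third]
    return sum(outer) / len(outer) - sum(middle) / len(middle)


# Synthetic data only: sentence lengths 1..30 with made-up timings.
lengths = list(range(1, 31))
cubic_times = [n ** 3 for n in lengths]        # curved, like a worst-case parser
linear_times = [5 * n + 2 for n in lengths]    # genuinely linear

print("bow for cubic timings:", bow(lengths, cubic_times))
print("bow for linear timings:", bow(lengths, linear_times))
```

A line fitted to the cubic timings can still look “significant”; it is the residuals, not the fit statistic, that show the curvature.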
Many Machine Learning (ML) Techniques (SVMs,
Perceptrons) are Similar to (Logistic) Regression;
Rarely see EDA (Robust Statistical) Methods in ML
The Elements of Statistical Learning
– Hastie, Tibshirani, Friedman
(2001), p 380
Historical Context
• 1950s:
  – Rigorous methodology
    • Information theory
    • Behaviorism
• 1970s:
  – Let it all hang out
    • Artificial Intelligence
    • Cognitive Psychology
• 1990s:
  – Revival of empiricism
• Kuhn Crisis
  – ALPAC report
  – Whither Speech Recognition?
    • Unfulfilled unrealistic expectations (video)
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting year (1985 to 2005); labels: Bob Moore, Fred Jelinek; annotations: Empiricists feel lonely, Rationalists feel lonely, Kuhn Crisis]
Borrowed Slide: Jelinek (LREC)
“Whither Speech Recognition?”
Also, ALPAC (chair)
& Bell Labs exec
Pierce, JASA 1969
…ASR is attractive to money. The attraction is perhaps
similar to the attraction of schemes for turning water
into gasoline, extracting gold from the sea, or going
to the moon.
Most recognizers behave not like scientists, but like
mad inventors or untrustworthy engineers.
…performance will continue to be very limited unless
the recognizing device understands what is being
said with something of the facility of a native speaker
(that is, better than a foreigner fluent in the language)
Any application of the foregoing discussion to work in
the general area of pattern recognition is left as an
exercise for the reader.
ALPAC (1966): the (in)famous report
John Hutchins
• The best known event in the history of MT is …
– Automatic Language Processing Advisory Committee (ALPAC)
• Its effect was to bring to an end the substantial funding
of MT research in US for some twenty years.
– More significantly was the clear message to the general public
and the rest of the scientific community that MT was hopeless.
– For years afterwards, an interest in MT was something to keep
quiet about; it was almost shameful.
– To this day, the 'failure' of MT is still repeated by many as an
indisputable fact.
• The impact of ALPAC is undeniable
– While the fame or notoriety of ALPAC is familiar,
– What the report actually said is now becoming less familiar and
often forgotten or misunderstood…
ALPAC Recommendations
The committee recommends expenditures in two distinct areas

Theory
• Computational linguistics as part of linguistics
  – Studies of parsing, generation… including experiments in translation…
  – Linguistics should be supported as science, and should not be judged by any immediate or foreseeable contribution to practical translation

Practice
Improvement of translation:
1. practical methods for evaluation of translations;
2. means for speeding up the human translation process;
3. evaluation of quality and cost of various sources of translations;
4. investigation of the utilization of translations, to guard against production of translations that are never read;
5. study of delays in the over-all translation process, and means for eliminating them, both in journals and in individual items;
6. evaluation of the relative speed and cost of various sorts of machine-aided translation;
7. adaptation of existing mechanized editing and production processes in translation;
8. the over-all translation process; and
9. production of adequate reference works for the translator, including the adaptation of glossaries that now exist primarily for automatic dictionary lookup in machine translation
Best of Times
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Where have we been and where are we going?
Moore’s Law: Ideal Answer
Moores: Bob ≠ Gordon ≠ Roger
Error Rate
Borrowed Slide
Audrey Le (NIST)
Moore’s Law Time Constant:
• 10x improvement per decade
Date (15 years)
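The time constant on this slide is easy to turn into arithmetic. The sketch below assumes the improvement compounds at a constant rate, which is an assumption of the illustration rather than something the chart guarantees; the function names are hypothetical.

```python
def improvement_factor(years, per_decade=10.0):
    """Cumulative improvement after `years`, assuming a constant compounding
    rate of `per_decade`-fold per ten years."""
    return per_decade ** (years / 10.0)


def projected_error_rate(initial_rate, years, per_decade=10.0):
    """Error rate after `years` if it shrinks by `per_decade`-fold per decade."""
    return initial_rate / improvement_factor(years, per_decade)


# 10x per decade is roughly 1.26x per year, i.e. about a 21% relative
# reduction in error rate each year.
print("per-year factor:", round(improvement_factor(1), 3))
# Over the 15-year span shown on the chart, the same rate compounds to ~31.6x.
print("15-year factor:", round(improvement_factor(15), 1))
```

Under that assumption a task that starts at a 50% error rate would be near 1.6% after 15 years (`projected_error_rate(0.5, 15)`).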
Charles Wayne’s Challenge:
Demonstrate Consistent Progress Over Time
(Managing Expectations)
• Controversial in 1980s
  – But not in 1990s
  – Though,  grumbling
• Benefits
  1. Agreement on what to do
  2. Limits endless discussion
  3. Helps sell the field
    • Manage expectations
    • Fund raising
• Risks (similar to benefits)
  1. All our eggs are in one basket (lack of diversity)
  2. Not enough discussion
    • Hard to change course
  3. Methodology → Burden
Hockey Stick Business Case
[Figure: $ against time t; 2003 (Last Year), 2004 (This Year), 2005 (Next Year)]
Where have we been and where are we going?
Manage
Consistent Progress over Time
Expectations
Extrapolation/Prediction
is Not Applicable
$
Extrapolation/Prediction
is Applicable
2003
2004
2005
t
When will we see the last non-statistical paper? 2010?
[Figure: % Statistical Papers (0% to 100%) by ACL Meeting year (1985 to 2005); labels: Bob Moore, Fred Jelinek]
Top Ten Metrics of Success
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Good Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
Callouts: Search (near 1); Speech (near 2); “Senseval wants to be here” (near 4 to 6); “We are here” (near 7 to 8)
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Best of Times
(Not!)
Been there;
Done that
It has been claimed that
Recent progress made possible by Empiricism
Progress (or Oscillating Fads)?
• 1950s: Empiricism was at its peak
  – Dominating a broad set of fields
    • Ranging from psychology (Behaviorism)
    • To electrical engineering (Information Theory)
  – Psycholinguistics: Word frequency norms (correlated with reaction time, errors)
    • Word association norms (priming): bread and butter, doctor / nurse
  – Linguistics/psycholinguistics: focus on distribution (correlate of meaning)
    • Firth: “You shall know a word by the company it keeps”
    • Collocations: Strong tea v. powerful computers
• 1970s: Rationalism was at its peak
  – with Chomsky’s criticism of ngrams in Syntactic Structures (1957)
  – and Minsky and Papert’s criticism of neural networks in Perceptrons (1969).
• 1990s: Revival of Empiricism
  – Availability of massive amounts of data (popular arg, even before the web)
    • “More data is better data”
    • Quantity >> Quality (balance)
  – Pragmatic focus:
    • What can we do with all this data?
    • Better to do something than nothing at all
  – Empirical methods (and focus on evaluation): Speech → Language
• 2010s: Revival of Rationalism (?)
Progress (or Oscillating Fads)?
• Periodic signals are continuous
• Support extrapolation/prediction
• Progress? Consistent progress?
• Extrapolation/Prediction: Applicable?
Has the pendulum swung too far?
(Speech → Language)
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself: Mark Twain; bad idea then and still a bad idea now
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)
Has the pendulum swung too far?
(Speech → Language)
• What happened between TMI-1992 and TMI-2002 (if anything)?
• Have empirical methods become too popular?
  – Has too much happened since TMI-1992?
• I worry that the pendulum has swung so far that
  – We are no longer training students for the possibility that the pendulum might swing the other way
• We ought to be preparing students with a broad education including:
  – Statistics and Machine Learning
  – as well as Linguistic Theory
• History repeats itself:
  – 1950s: empiricism
  – 1970s: rationalism (empiricist methodology became too burdensome)
  – 1990s: empiricism
  – 2010s: rationalism (empiricist methodology is burdensome, again)
Callouts: Plays well at Machine Translation conferences; Grandparents and grandchildren have a natural alliance…
 | Rationalism | Empiricism
Well-known advocates | Chomsky, Minsky | Shannon, Skinner, Firth, Harris
Model | Competence Model | Noisy Channel Model
Contexts of Interest | Phrase-Structure | N-Grams
Goals | All and Only | Minimize Prediction Error (Entropy)
 | Explanatory | Descriptive
 | Theoretical | Applied
Linguistic Generalizations | Agreement & Wh-movement | Collocations & Word Associations
Parsing Strategies | Principle-Based, CKY (Chart), ATNs, Unification | Forward-Backward (HMMs), Inside-outside (PCFGs)
Applications | Understanding: Who did what to whom | Recognition: Noisy Channel Applications
Covering all the Bases
It is hard to make predictions (especially about the future)
• When will we see the last
non-statistical paper?
– 2010?
• Revival of rationalism:
– 2010?
The answer to any question: 6 years!
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
Rising tide of data
lifts all boats
No matter what
happens, it’s goin’
be great!
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• 1985: “There is no data like more data”
– Fighting words uttered by radical fringe elements (Mercer at
Arden House)
• 1993 Workshop on Very Large Corpora
– Perfect timing: Just before the web
– Couldn’t help but succeed
– Fate
• 1995: The Web changes everything
• All you need is data (magic sauce)
  – No linguistics
  – No artificial intelligence (representation)
  – No machine learning
  – No statistics
  – No error analysis
“It never pays to think until you’ve
run out of data” – Eric Brill
Banko & Brill: Mitigating the Paucity-of-Data Problem (HLT 2001)
Moore’s Law Constant: Data Collection Rates → Improvement Rates
No consistently best learner
Quoted out of context
More data is better data!
Fire everybody and spend the money on data
Borrowed Slide: Jelinek (LREC)
Benefit of Data
LIMSI: Lamel (2002) – Broadcast News
WER
hours
Supervised:
transcripts
Lightly supervised: closed captions
The rising tide of data will lift all boats!
TREC Question Answering & Google:
What is the highest point on Earth?
The rising tide of data will lift all boats!
Acquiring Lexical Resources from Data:
Dictionaries, Ontologies, WordNets, Language Models, etc.
http://labs1.google.com/sets
England | Japan | Cat | cat
France | China | Dog | more
Germany | India | Horse | ls
Italy | Indonesia | Fish | rm
Ireland | Malaysia | Bird | mv
Spain | Korea | Rabbit | cd
Scotland | Taiwan | Cattle | cp
Belgium | Thailand | Rat | mkdir
Canada | Singapore | Livestock | man
Austria | Australia | Mouse | tail
Australia | Bangladesh | Human | pwd
Rising Tide of Data Lifts All Boats
If you have a lot of data, then you don’t need a lot of methodology
• More data → better results
– TREC Question Answering
• Remarkable performance: Google
and not much else
– Norvig (ACL-02)
– AskMSR (SIGIR-02)
– Lexical Acquisition
• Google Sets
– We tried similar things
» but with tiny corpora
» which we called large
Applications
• What good is word sense disambiguation (WSD)?
  – Information Retrieval (IR)
    • Salton: Tried hard to find ways to use NLP to help IR
      – but failed to find much (if anything)
    • Croft: WSD doesn’t help because IR is already using those methods
    • Sanderson (next two slides)
  – Machine Translation (MT)
    • Original motivation for much of the work on WSD
    • But IR arguments may apply just as well to MT
• What good is POS tagging? Parsing? NLP? Speech?
• Commercial Applications of Natural Language Processing, CACM 1995
  – $100M opportunity (worthy of government/industry’s attention)
    1. Search (Lexis-Nexis)
    2. Word Processing (Microsoft)
• Warning: premature commercialization is risky
Callouts: 5 Ian Andersons; Don’t worry; Be happy; ALPAC
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
• Could WSD help IR? Not much?
• Answer: no
  – Introducing ambiguity by pseudo-words doesn’t hurt (much)
• Short queries matter most, but hardest for WSD
[Figure: F against Query Length (Words)]
Callout: 5 Ian Andersons
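The pseudo-word trick can be sketched as follows. This is a minimal illustration, not Sanderson’s actual procedure: the grouping size, the random pairing and the function names are all assumptions. Unrelated words are conflated into one artificial token, so the text becomes more ambiguous in a controlled way while the original word remains known as the “true sense”.

```python
import random


def make_pseudowords(vocabulary, group_size=2, seed=0):
    """Map each word to a pseudo-word that conflates `group_size` random words.

    Words left over when the vocabulary size is not a multiple of
    `group_size` are simply left unmapped.
    """
    rng = random.Random(seed)
    words = sorted(vocabulary)
    rng.shuffle(words)
    mapping = {}
    usable = len(words) - len(words) % group_size
    for start in range(0, usable, group_size):
        group = words[start:start + group_size]
        pseudo = "/".join(group)
        for word in group:
            mapping[word] = pseudo
    return mapping


def add_ambiguity(tokens, mapping):
    """Replace every mapped token with its pseudo-word; keep the rest as-is."""
    return [mapping.get(token, token) for token in tokens]


mapping = make_pseudowords({"bank", "river", "money", "loan"})
print(add_ambiguity(["the", "bank", "approved", "the", "loan"], mapping))
```

Running retrieval on the original and the conflated text, at several group sizes, gives the kind of with/without-ambiguity comparison the slide summarizes.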
Sanderson (SIGIR-94)
http://dis.shef.ac.uk/mark/cv/publications/papers/my_papers/SIGIR94.pdf
Soft WSD?
• Resolving ambiguity badly is worse than not resolving at all
  – 75% accurate WSD degrades performance
  – 90% accurate WSD: breakeven point
[Figure: F against Query Length (Words)]
An example of Error Analysis/Representation
Some Promising Suggestions
(Generate lots of conference papers, but may not support the field)
• Two Languages are Better than One
  – For many classic hard NLP problems
    • Word Sense Disambiguation (WSD)
    • PP-attachment
    • Conjunction
    • Predicate-argument relationships
    • Japanese and Chinese Word breaking
  – Parallel corpora → plenty of annotated (labeled) testing and training data
  – Don’t need unsupervised magic (data >> magic)
• Demonstrate that NLP is good for something
  – Statistical methods (IR & WSD) focus on bags of nouns,
    • Ignoring verbs, adjectives, predicates, intensifiers, etc.
  – Hypothesis: Ignored because perceptrons can’t model XOR
  – Task: classify “comments” into “good,” “bad” and “neutral”
    • Lots of terms associated with just one category
    • Some associated with two
      – Depending on argument
    • Good & Bad, but not neutral: Mickey Mouse, Rinky Dink
      – Bad: Mickey Mouse(us)
      – Good: Mickey Mouse(them)
  – Current IR/WSD methods don’t capture predicate-argument relationships
Callout: Senseval++
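The XOR hypothesis can be checked with a toy experiment. The sketch below is an illustration under stated assumptions: a textbook single-layer perceptron with integer weights, trained on two tiny truth tables. Reading the “Mickey Mouse(us)” vs. “Mickey Mouse(them)” example as XOR-like (the label flips with the argument) is an interpretation of the slide, not something it spells out.

```python
def train_perceptron(data, epochs=100):
    """Classic perceptron rule on two binary inputs plus a bias term."""
    w1 = w2 = b = 0
    for _ in range(epochs):
        for (x1, x2), label in data:
            predicted = 1 if w1 * x1 + w2 * x2 + b > 0 else 0
            error = label - predicted
            w1 += error * x1
            w2 += error * x2
            b += error
    return w1, w2, b


def count_correct(weights, data):
    """Number of examples the linear threshold unit classifies correctly."""
    w1, w2, b = weights
    return sum(
        (1 if w1 * x1 + w2 * x2 + b > 0 else 0) == label
        for (x1, x2), label in data
    )


# AND is linearly separable; XOR is not.
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

print("AND correct:", count_correct(train_perceptron(AND), AND), "of 4")
print("XOR correct:", count_correct(train_perceptron(XOR), XOR), "of 4")
```

No single linear threshold can get all four XOR cases right, so a bag-of-terms model of this kind cannot represent a term whose polarity depends on its argument unless the term and argument are combined into one feature.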
Supervision >> Magic > Baseline
http://www.sle.sharp.co.uk/senseval2/Results/all_graphs.xls
[Figure: Senseval-2 English Lexical Sample (fine-grained scoring); precision and recall per system on a 0% to 90% scale, with systems grouped as Supervised, Unsupervised, and Baseline. Annotations: Supervision, Magic, Baseline, Bragging Rights]
Breakdown by
Systems & Words
• Spelling correction task
– Golding & Schabes (1996)
• Some methods work
better on some words
– and other methods work
better on other words
• Should break down
Senseval results by both
systems and words
• Discover opportunities for
hybrids across systems
• Error analysis
– POS distinctions (easy)
– Local context (trigrams)
– Larger contexts (IR)
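A breakdown by both systems and words can be sketched as a small table lookup. The scores below are invented purely for illustration (they are not Senseval results), and the system names, words and function names are hypothetical. The point is that a per-word view exposes where systems differ and gives an upper bound for a hybrid that picks the best system for each word.

```python
def best_system_per_word(scores):
    """scores: {system: {word: accuracy}} -> {word: best system for that word}."""
    words = set()
    for per_word in scores.values():
        words.update(per_word)
    return {
        word: max(scores, key=lambda system: scores[system].get(word, 0.0))
        for word in sorted(words)
    }


def system_average(scores, system):
    """Plain average accuracy of one system over the words it was scored on."""
    per_word = scores[system]
    return sum(per_word.values()) / len(per_word)


def oracle_hybrid_average(scores):
    """Average accuracy if each word were handled by its best system."""
    best = best_system_per_word(scores)
    return sum(scores[s].get(w, 0.0) for w, s in best.items()) / len(best)


# Made-up accuracies for two hypothetical systems on two words.
toy_scores = {
    "system_A": {"art": 0.80, "bar": 0.50},
    "system_B": {"art": 0.60, "bar": 0.70},
}

print(best_system_per_word(toy_scores))
print("oracle hybrid:", oracle_hybrid_average(toy_scores))
```

In this toy case both systems average 0.65, but the per-word oracle reaches 0.75, which is the kind of hybrid opportunity a single overall ranking would hide.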
Goals of Shared Evaluations
• Marketing & Sales
  – Scores going up and up → Funding goes up and up
  – Rising tide lifts all boats
• Shared learnings
  – Compare and contrast
  – What works and what doesn’t?
  – Error analysis
• Benchmarking:
  – How hard are various problems?
  – What makes problems easier or harder?
  – Rate of progress?
• Not bragging rights:
  – Mirror, mirror on the wall, who’s the smartest of them all…
[Figure: repeat of the Senseval-2 English Lexical Sample chart (fine-grained scoring); precision and recall per system, with systems grouped as Supervised, Unsupervised, and Baseline]
Outline
• We’re making consistent progress, or
• We’re running around in circles, or
– Don’t worry; be happy
• We’re going off a cliff…
According to unnamed sources:
Speech Winter → Language Winter
Dot Boom → Dot Bust
Kuhn Crisis
• Senseval feels the need to demonstrate applications of their stuff
(and maybe there aren’t any)
• Complacency (don’t worry; be happy)
– Too little dissent: students aren’t rebelling against their teachers
– I get uncomfortable when
• There is so much agreement on what to do and so much optimism
• And so few worries and so little dissent/controversy.
• Mindless Metrics
– Whatever you measure, you get…
– Scores go up and up and up, but are we really doing better?
• According to the scores, parsing is doing well without words,
• But you can’t solve classic problems (PPs) without words!
• Burdensome Methodology → Exclusiveness
  – Can’t play (in speech) unless you work in a big lab
• Following Speech off a Cliff
  – Empirical methods: Speech → Language
  – Speech Winter → Language Winter (Dot Boom → Dot Bust)
  – What goes up, (usually) comes down…
Callouts: Campbell (ACL-04): Rules >> ML; Early Warning Signs for Future; Been great, but…
Sample of 20 Survey Questions
(Strong Emphasis on Applications)
• When will
– More than 50% of new PCs have dictation on them, either at
purchase or shortly after.
– Most telephone Interactive Voice Response (IVR) systems
accept speech input.
– Automatic airline reservation by voice over the telephone is the
norm.
– TV closed-captioning (subtitling) is automatic and pervasive.
– Telephones are answered by an intelligent answering machine
that converses with the calling party to determine the nature and
priority of the call.
– Public proceedings (e.g., courts, public inquiries, parliament,
etc.) are transcribed automatically.
• Two surveys of ASRU attendees: 1997 & 2003
2003 Responses ≈ 1997 Responses + 6 Years
(6 years of hard work → No progress)
Top Ten Metrics of Success
(Risky to Promise Apps and Fail to Deliver)
1. Value Creation (Reality)
2. Stock Prices (Belief)
3. Startup Companies Raise Venture Capital (Excitement)
4. Prototype Applications (Plausibility)
5. Grand-Students (Survive the Test of Time)
6. Students Get Jobs
7. Students Finish PhD Theses
8. Citations
9. Conference Registrations
10. Publications (Quantity)
Callouts: Search (near 1); Speech (near 2); “Senseval wants to be here” (near 4 to 6); “We are here” (near 7 to 8)
Wrong Apps?
• New Priorities
  – Increase demand for space >> Data entry
• New Killer Apps
  – Search >> Dictation
    • Speech Google!
  – Data mining
• Old Priorities
  – Dictation app dates back to days of dictation machines
  – Speech recognition has not displaced typing
    • Speech recognition has improved
    • But typing skills have improved even more
  – My son will learn typing in 1st grade
  – Secretaries rarely take dictation
  – Dictation machines are history
    • My son may never see one
    • Museums have slide rules and steam trains
      – But dictation machines?
Speech Data Mining
& Call Centers:
An Intelligence Bonanza
• Some companies are collecting
information with technology
designed to monitor incoming calls
for service quality.
• Last summer, Continental Airlines
Inc. installed software from
Witness Systems Inc. to monitor
the 5,200 agents in its four
reservation centers.
• But the Houston airline quickly
realized that the system, which
records customer phone calls and
information on the responding
agent's computer screen, also was
an intelligence bonanza, says
André Harris, reservations training
and quality-assurance director.
Speech Data Mining
• Label calls as success or failure based on
some subsequent outcome (sale/no sale)
• Extract features from speech
• Find patterns of features that can be used
to predict outcomes
• Hypotheses:
– Customer: “I’m not interested” → no sale
– Agent: “I just want to tell you…” → no sale
Inter-ocular effect (hits you between the eyes);
Don’t need a statistician to know which way the wind is blowing
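The label, extract, and predict recipe can be sketched on text transcripts. Everything in the example is an assumption made for illustration: the calls are invented, the features are plain substring matches on transcripts rather than features extracted from audio, and the function name is hypothetical.

```python
from collections import Counter


def no_sale_rate_by_phrase(calls, phrases):
    """calls: list of (transcript, sale) pairs, where sale is True or False.

    Returns, for each phrase seen at least once, the fraction of calls
    containing it that ended without a sale.
    """
    seen = Counter()
    no_sale = Counter()
    for transcript, sale in calls:
        text = transcript.lower()
        for phrase in phrases:
            if phrase in text:
                seen[phrase] += 1
                if not sale:
                    no_sale[phrase] += 1
    return {p: no_sale[p] / seen[p] for p in phrases if seen[p]}


# Invented transcripts, labeled by a subsequent outcome (sale / no sale).
toy_calls = [
    ("Agent: I just want to tell you about our fares. Customer: I'm not interested.", False),
    ("Customer: I'd like to book a flight to Houston.", True),
    ("Agent: I just want to tell you one more thing.", False),
    ("Customer: I'm not interested, thanks.", False),
    ("Customer: Can I book two seats?", True),
]
phrases = ["i'm not interested", "i just want to tell you", "book"]

rates = no_sale_rate_by_phrase(toy_calls, phrases)
for phrase, rate in rates.items():
    print(f"{phrase!r}: no-sale rate {rate:.2f}")
```

With effects this large, a table of rates is enough to see the pattern; that is the inter-ocular effect the slide describes.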
Ways for Conferences to Fail
• Incrementalism/Burdensome Methodology (Lesson from 1950s)
– We do research for fun and profit – Arno Penzias
– Fun and/or Profit >> By-the-Book Correctness
• Arrogance, Mindless Metrics, etc.
• Control
– Too much control
• Excessive Exclusiveness (mutual admiration society/old-boy network)
• Change (serendipity) is essential: New and Different → Fun and Excitement
• Growth and prosperity depend on new talent (students) & new topics
• Can’t afford to keep doing what we already know how to do
– Too little control
• Stay on msg: It’s data, stupid! (Our msg ≠ ACL’s)
• Set Inappropriate Expectations
– Promise too little (rarely a problem, especially with thesis proposals)
• Senseval feels the need to become more applied
– Promise too much: Promise Applications and Fail to Deliver
– Success/Catastrophe (rarely a problem, except for March of Dimes)
• What if we actually achieved all our goals?
Ways for Conferences to Succeed
• I wish I knew…
• Fate (can’t fail)
– Rising Tide of Data Lifts All Boats
• Luck/timing: WVLC-93 was just before Web
• Sales & Marketing
– Evaluation, Evaluation, Evaluation
• Strategic Vision
– In retrospect, 1993 WVLC worked wonderfully
– Distinguished us from mainstream
– Offered excitement and hope for future
• Especially appealing to students (growth opportunity)
Borrowed Slide: Jelinek (LREC)
Great Strategy → Success
Great Challenge: Annotating Data
• Produce annotated data with minimal
supervision
Self-organizing “Magic” ≠ Error Analysis
• Active learning
– Identify reliable labels
– Identify best candidates for annotation
• Co-training
• Bootstrap (project) resources from one
application to another
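One common way to read the two active-learning sub-bullets is uncertainty sampling. The sketch below assumes some current model already assigns each unlabeled item a probability of being positive. The probabilities, the thresholds, and the helper name `split_by_confidence` are all illustrative assumptions, not Jelinek's method.

```python
def split_by_confidence(probs, reliable_margin=0.4, n_to_annotate=2):
    """Split unlabeled items into reliable automatic labels and annotation candidates.

    probs           -- dict mapping item -> P(label is positive) from a current model
    reliable_margin -- how far from 0.5 a probability must be to trust the label
    n_to_annotate   -- how many of the least certain items to send to a human
    """
    # Distance from 0.5 serves as a simple confidence score.
    confidence = {item: abs(p - 0.5) for item, p in probs.items()}

    # "Identify reliable labels": keep the model's label where it is confident.
    reliable = {
        item: probs[item] >= 0.5
        for item, c in confidence.items()
        if c >= reliable_margin
    }

    # "Identify best candidates for annotation": least confident items first.
    uncertain = sorted(
        (item for item in probs if item not in reliable),
        key=lambda item: confidence[item],
    )
    return reliable, uncertain[:n_to_annotate]


# Invented model outputs for five unlabeled items.
probs = {"a": 0.97, "b": 0.52, "c": 0.05, "d": 0.40, "e": 0.70}
reliable, to_annotate = split_by_confidence(probs)
print(reliable)
print(to_annotate)
```

The design choice is to spend human effort only where the model is least sure and to accept the model's labels where it is most sure; both thresholds would need tuning on real data.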
Grand Challenges
ftp://ftp.cordis.lu/pub/ist/docs/istag040319-draftnotesofthemeeting.pdf
Roadmaps: Structure of a Strategy
(not the union of what we are all doing)
• Goals
– Example: Replace keyboard with microphone
– Exciting (memorable) sound bite
– Broad grand challenge that we can work toward but never solve
• Metrics
– Examples:
• WER: word error rate
• Time to perform task
– Easy to measure
• Milestones
– Should be no question if it has been accomplished
– Example: reduce WER on task x by y% by time t
• Accomplishments v. Activities
– Accomplishments are good
– Activity is not a substitute for accomplishments
– Milestones look forward whereas accomplishments look backward
• Serendipity is good!
• Small is beautiful
– Quantity is not a good thing
– Awareness
– 1-slide version
• if successful, you get maybe 3 more slides
• Size of container
– Goal: 1-3
– Metrics: 3
– Milestones: a dozen
• Mostly for next year: Q1-4
• Plus some for years 2, 5, 10 & 20
– Accomplishments: a dozen
• Broad applicability & illustrative
– Don’t cover everything
– Highlight stuff that
• Applies to multiple groups
• Forward-Looking / Exciting
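To make the WER metric and the "reduce WER on task x by y%" milestone concrete, here is a minimal sketch using the standard definition: word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The function names and example sentences are illustrative, and reading "y%" as a relative reduction is an assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of words in the reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(wer_before, wer_after):
    """Fraction by which WER dropped, relative to where it started."""
    return (wer_before - wer_after) / wer_before


wer = word_error_rate(
    "replace the keyboard with a microphone",
    "replace keyboard with the microphone",
)
print(wer)
print(relative_reduction(0.20, 0.15))
```

A milestone phrased this way is easy to check: compute WER on the agreed task at time t and compare the relative reduction against y.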
Grand Challenges
Goals:
1. The multilingual companion
2. Life log
Goal: Produce NLP apps that improve the way people communicate with one another
Goal: Reduce barriers to entry
(Diagram labels: €€€, Apps & Techniques, Resources, Evaluation)
Summary: What Worked and What Didn’t?
• Data
– Stay on msg: It is the data, stupid!
• WVLC (Very Large) >> EMNLP (Empirical Methods)
• If you have a lot of data,
– Then you don’t need a lot of methodology
• Rising Tide of Data Lifts All Boats
• Methodology
– Empiricism means different things to different people
1. Machine Learning (Self-organizing Methods)
2. Exploratory Data Analysis (EDA)
3. Corpus-Based Lexicography
– Lots of papers on 1
• EMNLP-2004 theme (error analysis) → 2
• Senseval grew out of 3
Short term ≠ Long term
What’s the right answer?
Substance: Recommended if…
Magic: Recommended if…
Promise: Recommended if…
There’ll be a quiz at the end of the decade…
Lonely
Backup
Speech → Language
• Been great so far,
– But too much of a good thing…
• Take the good
Fire
• Fuel
– Infrastructure: Shared datasets and lexical resources
• WordNet, LDC, the Web
– Organizers
• Walker & Zampolli
– Funding
• DARPA (Charles Wayne), EU…
• Sparks
– Exciting Applications (The Web)
– Grand Challenges
– Leaders: Jelinek, Mercer, Miller, Kucera & Francis,
Leech, Sinclair, Tukey, Liberman…
• Hi Ken,
• Rada probably has more to add, but obviously we would
like to hear something about WSD or word senses. We
are currently trying to move Senseval to include
application-specific evaluations (eg within MT or IR, or in
specialized domains) and to more general semantic
analysis of text (eg frames or subcats). Something to
inspire people in this direction would be great.
• Phil.
Organizational Innovations
(Radical → Mainstream)
• Late Submission Deadline
– Immediately after ACL notifications
• ACL was rejecting good papers for bad reasons
– Short review cycles → Freshness
• Invest in the Future: Encourage Innovation
– Chair (Energetic, Promising, Source of new ideas)
– Co-chair (Established, Knows how it has been done)
• Inclusiveness:
– Thankless Chores → Marketing Carrots (Maximize # of reviewers)
– Balance program committee, reviewers (and hopefully submissions,
acceptances and registrations):
• 1/3 stability, 1/3 promising, 1/3 outreach
• Diversity: experience, gender, geography, topic
– Hold conferences in Europe, Asia & America
• Huge potential market in Asia: 4 out of 5 jumbo jets
– Maintain 20-25% acceptance rate → Parallel Sessions & Posters
• Avoid incremental papers
– Average grades (low grade dominates) → Advocate + Second
(Side labels: Innovation; Checks & Balances)