NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter
(Univ of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Dr. Ehud Reiter, Computing Science, University of Aberdeen
Contents
General Comments
Geneval proposal

Good points of Shared Task
Compare different approaches
Encourage people to interact more
Reduce NLG “barriers to entry”
Better understanding of evaluation

Bad Points

May narrow focus of community
» IR ignored web search because of TREC?

May encourage incremental research
instead of new ideas
My opinion
Let's give it a try
But I suspect one-off exercises are better than a series

» Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time
Practical Issues

Domain/task?
» Need something which several (6?) groups are interested in

Evaluation technique
» Avoid techniques that are biased
– E.g., some automatic metrics may favour statistical systems
Geneval

Proposal to evaluate NLG evaluation
» Core idea is to evaluate in many ways a
set of systems with similar input/output
functionality, and see how well different
evaluation techniques correlate
» Anja Belz and Ehud Reiter
» Hope to submit to EPSRC (roughly similar
to NSF in US) soon
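The core idea above (score the same systems with several evaluation techniques, then check how well the techniques agree) can be sketched roughly as follows. This is only an illustration, not the Geneval procedure: the technique names and all scores are made up, and Pearson correlation is just one plausible choice of agreement measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def correlate_techniques(scores):
    """Correlate every pair of evaluation techniques.

    `scores` maps a technique name to a list of per-system scores,
    with the systems listed in the same order for every technique.
    """
    names = sorted(scores)
    return {
        (a, b): pearson(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


if __name__ == "__main__":
    # Hypothetical scores for four systems; these numbers are invented
    # purely for illustration and are not results from any study.
    made_up_scores = {
        "human_rating": [3.1, 4.2, 2.5, 3.8],
        "bleu_like": [0.41, 0.55, 0.30, 0.47],
        "task_success": [0.70, 0.85, 0.60, 0.65],
    }
    for pair, r in correlate_techniques(made_up_scores).items():
        print(pair, round(r, 3))
```

A rank correlation such as Spearman's could be swapped in if only the ordering of systems matters.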
NLG Evaluation

Many types
» Task-based, human ratings, BLEU-like
metrics, etc

Little consensus on best technique
» Ie, most appropriate for a context

Poorly understood
Some open questions

How well do different types correlate?
» Eg, does BLEU predict human ratings?

Are there biases?
» Eg, are statistical NLG systems over/under
rated by some techniques?

What is best design?
» Number of subjects, subject expertise,
number (quality) of reference texts, etc
Belz and Reiter (2006)

Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics

Found OK (not wonderful) correlation, but also some biases

Geneval: do this on a much larger scale
» More domains, more systems, more evaluation
techniques (including new ones), etc
Geneval: Possible Domains

Weather forecasts (not wind statements)
» Use SumTime corpus

Referring expressions
» Use Prodigy-Grec or Tuna corpus

Medical summaries
» Use Babytalk corpus

Statistical summaries
» Use Atlas corpus
Geneval: Evaluation techniques

Human task-based
» Eg, referential success

Human ratings
» Likert vs preference; expert vs non-expert

Automatic metrics based on ref texts
» BLEU, ROUGE, METEOR, etc

Automatic metrics without ref texts
» MT T and X scores, length
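To make "automatic metrics based on ref texts" concrete, here is a minimal sketch of clipped n-gram precision, the building block behind BLEU-like metrics. It is deliberately simplified and is not full BLEU: it uses a single n-gram order, whitespace tokenisation, and no brevity penalty. The function names and the example texts are assumptions made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, references, n=1):
    """Fraction of candidate n-grams that also occur in a reference text.

    Each candidate n-gram is credited at most as many times as it
    appears in the single reference that contains it most often.
    """
    cand_counts = Counter(ngrams(candidate.split(), n))
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count seen in any one reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref.split(), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    matched = sum(
        min(count, max_ref_counts[gram]) for gram, count in cand_counts.items()
    )
    return matched / sum(cand_counts.values())


if __name__ == "__main__":
    # Invented example texts, not taken from any corpus.
    system_output = "wind increasing to 20 knots by evening"
    reference_texts = [
        "wind rising to 20 knots by evening",
        "winds increasing to 20 knots later",
    ]
    print(clipped_precision(system_output, reference_texts, n=1))
    print(clipped_precision(system_output, reference_texts, n=2))
```

Because the score depends only on overlap with the reference texts, the number and quality of those references directly affects the result.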
Geneval: new techniques

Would also like to explore and develop
new evaluation techniques
» Post-edit based human evaluations?
» Automatic metrics which look at semantic
features?
» Open to suggestions for other ideas!
Would like systems contributed

Study would be better if other people
would contribute systems
» We supply data sets and corpora, and
carry out evaluations
» So you can focus 100% on your great new
algorithmic ideas!
Geneval from STEC perspective

Sort of like STEC???
» If people contribute systems based on our data
sets and corpora
» But results will be anonymised
– only developer of system X knows how well X did
» One-off exercises, not repeated
» Multiple evaluation techniques

Hope data sets will reduce barriers to entry
Geneval

Please let Anja or me know if
» You have general comments, and/or
» You have a suggestion for an additional
evaluation technique
» You might be interested in contributing a
system