NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (Univ of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Dr. Ehud Reiter, Computing Science, University of Aberdeen

Contents
General Comments
Geneval proposal

Good points of Shared Task
Compare different approaches
Encourage people to interact more
Reduce NLG “barriers to entry”
Better understanding of evaluation

Bad Points
May narrow focus of community
  » IR ignored web search because of TREC?
May encourage incremental research instead of new ideas

My opinion
Let's give it a try
But I suspect one-off exercises are better than a series
  » Many people think MUC, DUC, etc were very useful initially but became less scientifically exciting over time

Practical Issues
Domain/task?
  » Need something which several (6?) groups are interested in
Evaluation technique
  » Avoid techniques that are biased
    – Eg, some automatic metrics may favour statistical systems

Geneval
Proposal to evaluate NLG evaluation
  » Core idea is to evaluate in many ways a set of systems with similar input/output functionality, and see how well different evaluation techniques correlate
  » Anja Belz and Ehud Reiter
  » Hope to submit to EPSRC (roughly similar to NSF in US) soon

NLG Evaluation
Many types
  » Task-based, human ratings, BLEU-like metrics, etc
Little consensus on best technique
  » Ie, most appropriate for a context
Poorly understood

Some open questions
How well do different types correlate?
  » Eg, does BLEU predict human ratings?
Are there biases?
  » Eg, are statistical NLG systems over/under rated by some techniques?
What is best design?
  » Number of subjects, subject expertise, number (quality) of reference texts, etc

Belz and Reiter (2006)
Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics
Found OK (not wonderful) correlation, but also some biases
Geneval: do this on a much larger scale
  » More domains, more systems, more evaluation techniques (including new ones), etc

Geneval: Possible Domains
Weather forecasts (not wind statements)
  » Use SumTime corpus
Referring expressions
  » Use Prodigy-Grec or Tuna corpus
Medical summaries
  » Use Babytalk corpus
Statistical summaries
  » Use Atlas corpus

Geneval: Evaluation techniques
Human task-based
  » Eg, referential success
Human ratings
  » Likert vs preference; expert vs non-expert
Automatic metrics based on reference texts
  » BLEU, ROUGE, METEOR, etc
Automatic metrics without reference texts
  » MT T and X scores, length

Geneval: new techniques
Would also like to explore and develop new evaluation techniques
  » Post-edit based human evaluations?
  » Automatic metrics which look at semantic features?
  » Open to suggestions for other ideas!

Would like systems contributed
Study would be better if other people would contribute systems
  » We supply data sets and corpora, and carry out evaluations
  » So you can focus 100% on your great new algorithmic ideas!

Geneval from STEC perspective
Sort of like STEC???
  » If people contribute systems based on our data sets and corpora
  » But results will be anonymised: only the developer of system X knows how well X did
  » One-off exercises, not repeated
  » Multiple evaluation techniques
Hope data sets will reduce barriers to entry

Geneval
Please let Anja or me know if
  » You have general comments, and/or
  » You have a suggestion for an additional evaluation technique
  » You might be interested in contributing a system
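
Illustration: checking how well two evaluation techniques correlate
The core Geneval idea (score the same set of systems in several ways, then see how well the techniques agree) could be sketched roughly as below. This is only a minimal sketch, not the analysis used in Belz and Reiter (2006): the system names and all scores are made up for illustration, and the choice of Pearson and Spearman coefficients is an assumption about what "correlate" might mean here.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical system-level scores (NOT real results).
human_rating = {"sysA": 4.1, "sysB": 3.2, "sysC": 3.8, "sysD": 2.5}
metric_score = {"sysA": 0.52, "sysB": 0.47, "sysC": 0.41, "sysD": 0.30}

systems = sorted(human_rating)
h = [human_rating[s] for s in systems]
m = [metric_score[s] for s in systems]
print(f"Pearson: {pearson(h, m):.2f}  Spearman: {spearman(h, m):.2f}")
```

With only a handful of systems, coefficients like these are very noisy, which is one reason for wanting more systems, more domains, and more evaluation techniques.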