Reiter

NLG Shared Tasks: Let's try it and see what happens
Ehud Reiter (Univ of Aberdeen)
http://www.csd.abdn.ac.uk/~ereiter
Dr. Ehud Reiter, Computing Science, University of Aberdeen

Contents
• General comments
• Geneval proposal

Good points of Shared Task
• Compare different approaches
• Encourage people to interact more
• Reduce NLG "barriers to entry"
• Better understanding of evaluation

Bad points
• May narrow focus of community
  » IR ignored web search because of TREC?
• May encourage incremental research instead of new ideas

My opinion
• Let's give it a try
• But I suspect one-off exercises are better than a series
  » Many people think MUC, DUC, etc. were very useful initially but became less scientifically exciting over time

Practical issues
• Domain/task?
  » Need something which several (6?) groups are interested in
• Evaluation technique
  » Avoid techniques that are biased
    - E.g., some automatic metrics may favour statistical systems

Geneval
• Proposal to evaluate NLG evaluation
  » Core idea is to evaluate a set of systems with similar input/output functionality in many ways, and see how well the different evaluation techniques correlate
  » Anja Belz and Ehud Reiter
  » Hope to submit to EPSRC (roughly similar to NSF in US) soon

NLG evaluation
• Many types
  » Task-based, human ratings, BLEU-like metrics, etc.
• Little consensus on best technique
  » I.e., most appropriate for a context
• Poorly understood

Some open questions
• How well do different types correlate?
  » E.g., does BLEU predict human ratings?
• Are there biases?
  » E.g., are statistical NLG systems over/under-rated by some techniques?
• What is the best design?
  » Number of subjects, subject expertise, number (quality) of reference texts, etc.

Belz and Reiter (2006)
• Evaluated several systems for generating wind statements in weather forecasts, using both human judgements and BLEU-like metrics
• Found OK (not wonderful) correlation, but also some biases
• Geneval: do this on a much larger scale
  » More domains, more systems, more evaluation techniques (including new ones), etc.

Geneval: possible domains
• Weather forecasts (not wind statements)
  » Use SumTime corpus
• Referring expressions
  » Use Prodigy-Grec or Tuna corpus
• Medical summaries
  » Use Babytalk corpus
• Statistical summaries
  » Use Atlas corpus

Geneval: evaluation techniques
• Human task-based
  » E.g., referential success
• Human ratings
  » Likert vs preference; expert vs non-expert
• Automatic metrics based on reference texts
  » BLEU, ROUGE, METEOR, etc.
• Automatic metrics without reference texts
  » MT T and X scores, length

Geneval: new techniques
• Would also like to explore and develop new evaluation techniques
  » Post-edit based human evaluations?
  » Automatic metrics which look at semantic features?
  » Open to suggestions for other ideas!

Would like systems contributed
• Study would be better if other people would contribute systems
  » We supply data sets and corpora, and carry out evaluations
  » So you can focus 100% on your great new algorithmic ideas!

Geneval from STEC perspective
• Sort of like STEC???
  » If people contribute systems based on our data sets and corpora
  » But results will be anonymised: only the developer of system X knows how well X did
  » One-off exercises, not repeated
  » Multiple evaluation techniques
• Hope data sets will reduce barriers to entry

Geneval
• Please let Anja or me know if
  » You have general comments, and/or
  » You have a suggestion for an additional evaluation technique
  » You might be interested in contributing a system
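
Sketch: correlating evaluation techniques
The core Geneval idea (score the same set of systems with several evaluation techniques, then see how well the techniques agree) could be sketched roughly as below. This is only an illustration under stated assumptions, not the procedure used in Belz and Reiter (2006) or planned for Geneval: the technique names, the choice of Spearman rank correlation, and all numbers are invented placeholders, not results from any study.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


def correlation_table(scores):
    """Spearman correlation for every pair of evaluation techniques.

    `scores` maps a technique name to one score per system, with the
    systems listed in the same order for every technique.
    """
    names = list(scores)
    return {
        (a, b): spearman(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Invented placeholder scores for five hypothetical systems.
# These are NOT results from any real evaluation.
placeholder_scores = {
    "human_rating": [3.1, 4.2, 2.5, 3.8, 4.0],
    "bleu_like": [0.30, 0.41, 0.35, 0.38, 0.44],
    "task_success": [0.55, 0.70, 0.40, 0.72, 0.68],
}

for pair, rho in correlation_table(placeholder_scores).items():
    print(pair, round(rho, 2))
```

A rank correlation is used here because the techniques score on different scales (ratings, metric values, success rates); that is one possible design choice among several.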
