Module U1: Speech in the Interface
4: User-centered Design and Evaluation
Jacques Terken
SAI User-System Interaction

Contents
– Methodological issues: design
– Evaluation methodology

The design process
– Requirements
– Specifications of prototype
– Evaluation 1: Wizard-of-Oz experiments ("bionic wizards")
– Redesign and implementation: V1
– Evaluation 2: objective and subjective measurements (laboratory tests)
– Redesign and implementation: V2
– Evaluation 3: lab tests, field tests

Requirements
Sources of requirements:
– you yourself
– potential end users
– the customer
– the manufacturer
Checklist:
– consistency
– feasibility (with respect to performance and price)

Interface design
The success of a design depends on consideration of:
– task demands
– knowledge, needs and expectations of the user population
– capabilities of the technology

Task demands
Exploit structure in the task to make the interaction more transparent
– e.g. the form-filling metaphor

User expectations
– Users may bring advance knowledge of the domain
– Users may bring too high expectations of the communicative capabilities of the system, especially if the quality of the output speech is high; this will lead to user utterances that the system cannot handle
– Instruction is of limited value
– An interactive tutorial is more useful (Kamm et al., ICSLP 1998); it can also include training on how to speak to the system
– Edutainment approach (Weevers, 2004)

Capabilities of technology
– Awareness of ASR and NLP limitations
– Necessary modelling of domain knowledge through an ontology
– Understanding of needs with respect
to cooperative communication: rationality, inferencing
– Understanding of needs with respect to conversational dynamics, including mechanisms for graceful recovery from errors

Specifications: check UI design principles
Shneiderman (1986):
– continuous representation of objects and actions of interest (transparency)
– rapid, incremental, reversible operations with immediately visible impact
– physical actions or labelled button presses, not complex (natural-language) syntax

Application to speech interfaces
Kamm & Walker (1997):
– continuous representation may be impossible or undesirable as such in speech interfaces
  • open question vs. pause
  • options (zooming): a subset of the vocabulary with a consistent meaning throughout ("help me out", "cancel")
– immediate impact:
  agent: Anny here, what can I do for you?
  user: Call Lyn Walker.
  agent: Calling Lyn Walker.
– incrementality:
  user: I want to go from Boston to San Francisco.
  agent: San Francisco has two airports: …
– reversibility: "cancel"
NB Discussion topic – Shneiderman heuristic 7: locus of control vs. mixed-control dialogue

Contents
– Methodological issues: design
– Evaluation methodology

Aim of evaluation
Diagnostic test / formative evaluation:
– to inform the design team
– to ensure that the system meets the expectations and requirements of end users
– to improve the design where possible
Benchmarking / summative evaluation:
– to inform the manufacturer about the quality of the system relative to that of competitors or previous releases
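The Shneiderman principles and the Kamm & Walker adaptations discussed earlier (immediate feedback, incrementality, a globally available "cancel") can be sketched as a minimal slot-filling dialogue loop. This is an illustrative sketch, not part of the lecture; the slot names and prompts are assumptions.

```python
# Minimal slot-filling dialogue loop illustrating immediate feedback,
# incrementality, and a globally available "cancel" command.
# Slot names and agent prompts are illustrative assumptions.

SLOTS = ["origin", "destination", "date"]

def dialogue(user_turns):
    """Consume an iterable of user utterances; return the filled form
    and the agent's side of the transcript."""
    form = {slot: None for slot in SLOTS}
    transcript = []
    for utterance in user_turns:
        utterance = utterance.strip().lower()
        if utterance == "cancel":
            # reversibility: undo the most recently filled slot
            for slot in reversed(SLOTS):
                if form[slot] is not None:
                    form[slot] = None
                    transcript.append(f"agent: cancelled {slot}")
                    break
            continue
        # incrementality: fill the first empty slot with the new value
        for slot in SLOTS:
            if form[slot] is None:
                form[slot] = utterance
                # immediate feedback: echo what was understood
                transcript.append(f"agent: {slot} is {utterance}")
                break
        if all(form.values()):
            transcript.append("agent: booking trip")
            break
    return form, transcript
```

For example, `dialogue(["boston", "san francisco", "cancel", "oakland", "monday"])` fills the origin and destination, undoes the destination on "cancel", refills it, and closes the form.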
Benchmarking
– requires an accepted, standardised test
– there is no accepted solution for benchmarking complete spoken dialogue systems
– stand-alone tests of separate components serve both diagnostic and benchmarking purposes (glass-box approach)

Glass box / black box
Black box: system evaluation (e.g. "how will it perform in an application?")
Glass box: performance of individual modules (for both benchmarking and diagnostic purposes)
– with perfect input from previous modules
– or with real input (always imperfect!)
– evaluation methods: statistical, performance-based (objective/subjective)
Problem of componentiality:
– the relation between the performance of individual components and the performance of the whole system

Anchoring: choosing the right contrast condition
In the absence of validated standards, a reference condition is needed to evaluate the performance of the test system(s)
– speech output: natural speech is often used as the reference; this leads to compression effects for experimental systems when the evaluation is conducted by means of rating scales
– anchoring is preferably done in the context of objective evaluation and with preference judgements

Evaluation tools/frameworks
– Hone and Graham: SASSI, a questionnaire tuned towards the evaluation of speech interfaces
– Walker et al.: PARADISE, establishing connections between objective and subjective measures
– PROMISE: an extension of PARADISE to multimodal interfaces
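A standard glass-box measure for the recognition component is word error rate: the word-level edit distance between a reference transcription and the recogniser's hypothesis, divided by the reference length. A minimal sketch (not from the lecture):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with Levenshtein edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

For instance, recognising "call lyn walker" as "call in walker" is one substitution out of three reference words, so WER = 1/3.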
SASSI
Subjective Assessment of Speech System Interfaces
http://people.brunel.ac.uk/~csstksh/sassi.html
Likert-type questions
Factors:
– response accuracy
– likeability
– cognitive demand
– annoyance
– habitability (match between mental model and actual system)
– speed

Examples of questions (n = 36)
– The system is accurate
– The system is unreliable
– The interaction with the system is unpredictable
– The system is pleasant
– The system is friendly
– I was able to recover easily from errors
– I enjoyed using the system
– It is clear how to speak to the system
– The interaction with the system is frustrating
– The system is too inflexible
– I sometimes wondered if I was using the right word
– I always knew what to say to the system
– It is easy to lose track of where you are in an interaction with the system
– The interaction with the system is fast
– The system responds too slowly

PARADISE
User satisfaction (subjective) is brought in connection with task success and costs (objective measures)
– Users perform scenario-based tasks
– Measure task success for the scenarios, correcting for chance on the basis of attribute-value matrices denoting the number of possible options (measure: kappa; κ = 1 if all scenarios were completed successfully)
– Obtain objective measures of costs:
  • efficiency measures (number of utterances, dialogue time, …)
  • qualitative measures (repair ratio, inappropriate-utterance ratio, …)
– Normalize the task success and cost measures across subjects by taking z-scores
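The chance-corrected task success and the z-score normalization above can be sketched as follows. The confusion matrix is the attribute-value matrix the lecture refers to: cell (i, j) counts how often scenario key i was understood by the system as value j. This is an illustrative sketch using only the standard library.

```python
from statistics import mean, stdev

def kappa(confusion):
    """Chance-corrected task success from an attribute-value confusion
    matrix: kappa = (P(A) - P(E)) / (1 - P(E)).
    kappa = 1 when every scenario key was conveyed correctly."""
    total = sum(sum(row) for row in confusion)
    p_agree = sum(confusion[i][i] for i in range(len(confusion))) / total
    # expected chance agreement from the row and column marginals
    p_chance = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(len(confusion))
    )
    return (p_agree - p_chance) / (1 - p_chance)

def z_scores(values):
    """Normalize a list of per-subject measures to z-scores."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]
```

With a diagonal matrix such as `[[5, 0], [0, 5]]` (all values correct), `kappa` returns 1; with `[[4, 1], [1, 4]]` the raw 80% agreement is corrected down to 0.6.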
– Measure user satisfaction (mean opinion scores across one or more scales)
– Estimate the performance function:
  performance = α · z(κ) − Σi wi · z(costi)
– Compute the values of α and the wi by multiple linear regression
– wi indicates the relative weight of the individual cost component costi
– The wi give information about the primary cost factors, i.e. which factors have most influence on (the lack of) usability of the system

Case study: performance = .40 · z(κ) − .78 · z(cost2), where cost2 is the number of repetitions
– Once the weights have been established and validated, user satisfaction can be predicted from objective data
– The typical finding is that user satisfaction as measured by the questionnaire is primarily determined by the quality of the speech recognition (which is not very informative)
Concerns:
– "conservative" scoring on semantic scales
– not all cost functions may be linear

PROMISE
Evaluation of multimodal interfaces
The basic idea is the same as for PARADISE, but with differences in the way task success is calculated and the correlations are computed

Where to evaluate: laboratory tests
– The use of scenarios gives some degree of experimental control
– Objective and subjective measurements aimed at identifying problem sources and testing potential solutions
– Interviews
BUT: scenarios implicitly specify the domain
AND: subjects may be co-operative or overly non-cooperative (exploring the limits of the system)
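The PARADISE performance function above is fitted by ordinary least squares: regress the per-dialogue satisfaction scores on z(κ) and the negated z-scored costs, and read off α and the wi. A minimal sketch with NumPy (the data values below are made up for illustration):

```python
import numpy as np

def fit_paradise(z_kappa, z_costs, satisfaction):
    """Fit performance = alpha * z(kappa) - sum_i w_i * z(cost_i)
    to user-satisfaction scores by multiple linear regression.
    z_costs is a list of per-dialogue cost vectors; returns (alpha, w)."""
    # Design matrix: [z_kappa, -z_cost_1, ..., -z_cost_n], so the fitted
    # coefficients are alpha and the (positive) cost weights w_i.
    columns = [np.asarray(z_kappa)] + [-np.asarray(c) for c in z_costs]
    X = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(satisfaction), rcond=None)
    return coeffs[0], coeffs[1:]
```

On noise-free synthetic data generated with the case-study weights (α = .40, w = .78), the regression recovers them exactly; on real questionnaire data the fit is of course approximate.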
Where to evaluate: field tests
– Advantage: gives information about the performance of the system with actual end users with self-defined, real goals in realistic situations
– Mainly diagnostic (how does the system perform in realistic conditions?)
BUT: no information about the reasons for particular actions in the dialogue

Additional considerations
Evaluate also in terms of the suitability of the system given the technological and cost constraints imposed by the application:
– CPU consumption, real-time performance
– bandwidth, memory consumption
– cost

Project: Wizard-of-Oz
– The usual assumption is that subjects are made to believe that they are interacting with a real system
– Most suited when the system to be developed is very complex, or when the performance of individual modules strongly affects overall performance
– Full vs. bionic wizard

WOZ: general set-up
[Diagram: the subject interacts through the user interface; the wizard, supported by an assistant, scenarios, a wizard interface and simulation tools, simulates the system; the whole interaction is logged for data collection]
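Central to the WOZ set-up above is logging every turn for later analysis. A minimal sketch of such a session logger; the line-delimited JSON format and all names are assumptions, not part of the lecture.

```python
import json
import time

def log_turn(logfile, role, text):
    """Append one timestamped turn (subject or wizard) to the session log
    as a line of JSON, so the transcript can be replayed and analysed."""
    record = {"t": time.time(), "role": role, "text": text}
    logfile.write(json.dumps(record) + "\n")

def woz_session(turns, path):
    """Record a sequence of (role, text) turns, e.g. the subject's
    utterances and the wizard's simulated system responses."""
    with open(path, "w") as logfile:
        for role, text in turns:
            log_turn(logfile, role, text)
```

After a session, each log line can be parsed back with `json.loads` to compute, for instance, turn counts or dialogue duration for the cost measures discussed earlier.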