Speech interfaces: Human factors and evaluation

Module U1:
Speech in the Interface
4: User-centered Design and Evaluation
Jacques Terken
Contents
– Methodological issues: design
– Evaluation methodology
The design process
Requirements
→ Specifications of prototype
→ Evaluation 1: Wizard-of-Oz experiments ("bionic wizards")
→ Redesign and implementation: V1
→ Evaluation 2: Objective and subjective measurements (laboratory tests)
→ Redesign and implementation: V2
→ Evaluation 3: Lab tests, field tests
Requirements
Sources of requirements:
– you yourself
– potential end users
– customer
– manufacturer
Checklist:
– consistency
– feasibility (with respect to performance and price)
Interface design
Success of the design depends on consideration of:
– task demands
– knowledge, needs and expectations of the user population
– capabilities of the technology
Task demands
Exploit structure in the task to make the interaction more transparent
– e.g. the form-filling metaphor (a sketch follows below)
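To make the form-filling idea concrete, here is a minimal sketch (not from the original slides; the slot names and prompts are invented) of a travel-booking dialogue in which the task structure, a fixed set of slots, drives the interaction:

# Minimal form-filling sketch (illustrative; slot names and prompts invented).
# The fixed set of slots gives the task its structure: the system prompts
# only for slots that are still empty, so the interaction stays transparent.
PROMPTS = {
    "origin": "Where do you want to leave from?",
    "destination": "Where do you want to go?",
    "date": "On what day do you want to travel?",
}

def next_empty_slot(form):
    """Return the first unfilled slot name, or None when the form is complete."""
    for slot in PROMPTS:
        if slot not in form:
            return slot
    return None

form = {}
while (slot := next_empty_slot(form)) is not None:
    print("SYSTEM:", PROMPTS[slot])
    # In a real system the answer would come from the speech recognizer;
    # here we simply read typed input.
    form[slot] = input("USER: ")
print("Completed form:", form)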
User expectations
– Users may bring advance knowledge of the domain
– Users may bring too high expectations of the communicative capabilities of the system, especially if the quality of the output speech is high; this will lead to user utterances that the system can't handle
– Instruction is of limited value; an interactive tutorial is more useful (Kamm et al., ICSLP 1998)
– A tutorial can also include training on how to speak to the system
– Edutainment approach (Weevers, 2004)
Capabilities of technology
– Awareness of ASR and NLP limitations
– Necessary modelling of domain knowledge through an ontology
– Understanding of needs with respect to cooperative communication: rationality; inferencing
– Understanding of needs with respect to conversational dynamics, including mechanisms for graceful recovery from errors
Specifications:
check UI design principles (Shneiderman, 1986)
– continuous representation of objects and actions of interest (transparency)
– rapid, incremental, reversible operations with immediately visible impact
– physical actions or labelled button presses, not complex syntax
Application to speech interfaces
Kamm & Walker (1997)
– continuous representation:
  • may be impossible or undesirable as such in speech interfaces
  • open question – pause – options (zooming)
  • subset of vocabulary with consistent meaning throughout ("help me out", "cancel")
– immediate impact:
  AGENT: Annie here, what can I do for you?
  USER: call Lyn Walker
  AGENT: calling Lyn Walker
– incrementality:
  USER: I want to go from Boston to San Francisco
  AGENT: San Francisco has two airports: …
– reversibility:
  • "cancel"
NB Discussion topic:
– Shneiderman heuristic 7: locus of control vs mixed-control dialogue
Contents
– Methodological issues: design
– Evaluation methodology
Aim of evaluation
– Diagnostic test / formative evaluation:
  • to inform the design team
  • to ensure that the system meets the expectations and requirements of end users
  • to improve the design where possible
– Benchmarking / summative evaluation:
  • to inform the manufacturer about the quality of the system relative to that of competitors or previous releases
Benchmarking
– Requires an accepted, standardised test
– No accepted solution exists for benchmarking of complete spoken dialogue systems
– Stand-alone tests of separate components serve both diagnostic and benchmarking purposes (glass-box approach)
Glass box / black box
– Black box: system evaluation (e.g. "how will it perform in an application?")
– Glass box: performance of individual modules (both for benchmarking and diagnostic purposes)
  • with perfect input from previous modules
  • or with real input (always imperfect!)
  • evaluation methods: statistical, performance-based (objective/subjective)
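As an example of such a component-level test (the slides do not give one): the standard stand-alone metric for a speech recognition module is the word error rate, computed by Levenshtein alignment of the recognized word string against a reference transcription. A minimal sketch in Python:

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference
    words, computed with dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call lyn walker", "call ian walker"))  # 1/3, one substitution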
Problem of componentiality:
– the relation between the performance of individual components and the performance of the whole system
Anchoring: choosing the right contrast condition
– In the absence of validated standards, a reference condition is needed to evaluate the performance of the test system(s)
– Speech output: natural speech is often used as the reference
– This will lead to compression effects for the experimental systems when the evaluation is conducted by means of rating scales (their ratings are squeezed together at the bottom of the scale relative to the natural-speech anchor)
– Anchoring is preferably done in the context of objective evaluation and with preference judgements
Evaluation tools/frameworks
– Hone & Graham: SASSI, a questionnaire tuned towards the evaluation of speech interfaces
– Walker et al.: PARADISE, establishing connections between objective and subjective measures
– PROMISE: an extension of PARADISE to multimodal interfaces
SASSI
– Subjective Assessment of Speech System Interfaces
– http://people.brunel.ac.uk/~csstksh/sassi.html and pdf
– Likert-type questions
– Factors:
  • Response accuracy
  • Likeability
  • Cognitive demand
  • Annoyance
  • Habitability (match between the user's mental model and the actual system)
  • Speed
Examples of questions (n = 36)
– The system is accurate
– The system is unreliable
– The interaction with the system is unpredictable
– The system is pleasant
– The system is friendly
– I was able to recover easily from errors
– I enjoyed using the system
– It is clear how to speak to the system
– The interaction with the system is frustrating
– The system is too inflexible
– I sometimes wondered if I was using the right word
– I always knew what to say to the system
– It is easy to lose track of where you are in an interaction with the system
– The interaction with the system is fast
– The system responds too slowly
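For illustration only: a sketch of how responses to such items might be aggregated into the factor scores listed on the previous slide, assuming 7-point agreement ratings. The item-to-factor assignment below is invented for the example; the actual mapping and the reverse-keying of negatively worded items are specified in Hone & Graham's paper.

# Sketch: turn Likert responses (1-7) into per-factor mean scores.
# Factor assignments and reverse-keyed flags below are illustrative only.
FACTORS = {
    "response accuracy": [("The system is accurate", False),
                          ("The system is unreliable", True)],
    "speed": [("The interaction with the system is fast", False),
              ("The system responds too slowly", True)],
}

def factor_scores(responses, scale_max=7):
    """Average the ratings per factor, flipping reverse-keyed items."""
    scores = {}
    for factor, items in FACTORS.items():
        values = [scale_max + 1 - responses[item] if reverse else responses[item]
                  for item, reverse in items]
        scores[factor] = sum(values) / len(values)
    return scores

example = {
    "The system is accurate": 6,
    "The system is unreliable": 2,        # reverse-keyed, counts as 6
    "The interaction with the system is fast": 5,
    "The system responds too slowly": 3,  # reverse-keyed, counts as 5
}
print(factor_scores(example))  # {'response accuracy': 6.0, 'speed': 5.0}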
PARADISE
– User satisfaction (subjective) is brought into connection with task success and costs (objective measures); pdf
– Users perform scenario-based tasks
– Measure task success for the scenarios, correcting for chance on the basis of attribute-value matrices denoting the number of possible options (measure: kappa; κ = 1 if all scenarios were completed successfully)
– Obtain objective measures of costs:
  • efficiency measures (number of utterances, dialogue time, …)
  • qualitative measures (repair ratio, inappropriate-utterance ratio, …)
– Normalize the task success and cost measures across subjects by taking z-scores
– Measure user satisfaction (mean opinion scores across one or more scales)
– Estimate the performance function
  performance = α · z(κ) − Σᵢ wᵢ · z(costᵢ)
  where z(·) is the z-score normalization from the previous step
– Compute the values of α and the wᵢ by multiple linear regression, with user satisfaction as the dependent variable
– wᵢ indicates the relative weight of the individual cost component costᵢ
– The wᵢ thus show which factors are the primary cost factors, i.e. which have the most influence on (the lack of) usability of the system
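A note on the task-success measure used above: the slides mention kappa without spelling it out; the standard chance-corrected definition (general background, not from the slides) is κ = (P(A) − P(E)) / (1 − P(E)), where P(A) is the observed proportion of correctly filled attribute values and P(E) the proportion expected by chance given the number of possible options, so that κ = 1 means perfect task success and κ = 0 chance-level performance.

The fitting step can be sketched as follows; the data values are invented, and a real study would use scenario logs and questionnaire scores:

import numpy as np

# Invented example data: one row per dialogue.
kappa = np.array([0.9, 0.7, 0.95, 0.5, 0.8])             # task success
costs = np.array([[12.0, 1.0], [20.0, 4.0], [10.0, 0.0],
                  [25.0, 6.0], [15.0, 2.0]])             # utterances, repairs
satisfaction = np.array([4.5, 3.0, 4.8, 2.1, 3.9])       # mean opinion scores

def z(x):
    """z-score normalization, applied to kappa and to each cost measure."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

# Regress (centred) satisfaction on [z(kappa), z(cost_1), z(cost_2)]:
# the fitted coefficients are alpha and the cost weights w_i.
X = np.column_stack([z(kappa), z(costs)])
coef, *_ = np.linalg.lstsq(X, satisfaction - satisfaction.mean(), rcond=None)
alpha, w = coef[0], coef[1:]
print(f"alpha = {alpha:.2f}, cost weights w_i = {w.round(2)}")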
– Case study:
  performance = .40 · z(κ) − .78 · z(cost₂)
  where cost₂ is the number of repetitions
– Once the weights have been established and validated, user satisfaction can be predicted from objective data (a worked example follows below)
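For illustration, with invented values: a dialogue with z(κ) = 0.5 and z(cost₂) = 1.2 would be scored performance = .40 · 0.5 − .78 · 1.2 ≈ −0.74, i.e. its predicted satisfaction lies well below the average for the user population.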
– The typical finding is that user satisfaction as measured by the questionnaire is primarily determined by the quality of the speech recognition (which is not very informative)
– Concerns:
  • "conservative" scoring on semantic scales
  • not all cost functions may be linear
PROMISE
– Evaluation of multimodal interfaces
– References: pdf1 and pdf2
– The basic idea is the same as for PARADISE, but with differences in the way task success is calculated and the correlations are computed
Where to evaluate: Laboratory tests
– Use of scenarios gives some degree of experimental control
– Objective and subjective measurements aimed at identifying problem sources and testing potential solutions
– Interviews
– BUT: scenarios implicitly specify the domain
– AND: subjects may be co-operative or overly non-cooperative (exploring the limits of the system)
Where: Field tests
– Advantage: gives information about the performance of the system with actual end users pursuing self-defined, real goals in realistic situations
– Mainly diagnostic (how does the system perform under realistic conditions?)
– BUT: no information about the reasons for particular actions in the dialogue
Additional considerations
Evaluation also in terms of the suitability of the system given the technological and cost constraints imposed by the application:
– CPU consumption, real-time performance
– bandwidth, memory consumption
– cost
Project
Wizard-of-Oz
– The usual assumption is that subjects are made to believe that they are interacting with a real system
– Most suited when the system to be developed is very complex, or when the performance of individual modules strongly affects overall performance
– Full vs bionic wizard (a bionic wizard is supported by working system components and simulates only the parts that are still missing)
WOZ: General set-up
[Diagram: the subject, following scenarios, interacts through the user interface; the wizard, supported by an assistant and simulation tools, responds through the wizard interface; data collection (logging) captures the interaction.]
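As a concrete illustration of the data-collection component (invented for this text, not part of the original set-up description): a minimal logger that records time-stamped subject utterances and wizard responses for later analysis.

import json
import time

class WozLogger:
    """Append time-stamped dialogue events to a JSON-lines log file."""

    def __init__(self, path):
        self.path = path

    def log(self, role, text):
        event = {"time": time.time(), "role": role, "text": text}
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(event) + "\n")

# Console stand-in for the wizard interface: the wizard types the system's
# responses; a real set-up would route them through speech synthesis.
if __name__ == "__main__":
    logger = WozLogger("woz_session.jsonl")
    while True:
        utterance = input("SUBJECT (empty line to stop): ")
        if not utterance:
            break
        logger.log("subject", utterance)
        logger.log("wizard", input("WIZARD: "))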