Rating Scales: What the Research Says

Joe Dumas, UX Consultant (joe.dumas99@gmail.com)
Tom Tullis, Fidelity Investments (tom.tullis@fmr.com)

Mini_UPA, 2009
The Scope of the Session

- Discussion of the literature on rating scales in usability methods, primarily usability testing
- Brief review of recommendations from the older literature
- Focus on recent studies
- Recommendations for practitioners
Table of Contents

- Types of rating scales
- Guidelines from past studies
- How to evaluate a rating scale
- Guidelines from recent studies
- Additional advantages of rating scales
Types of Rating Scales

Formats:

- One question format
- Before-after format
- Multiple question format
One Question Formats

The original Likert scale format (named for Rensis Likert):

I think that I would like to use this system frequently:
___ Strongly Disagree
___ Disagree
___ Neither agree nor disagree
___ Agree
___ Strongly Agree
One Question Formats

Likert-like scales:

Characters on the screen are:
Hard to read  1  2  3  4  5  6  7  8  9  Easy to read
One Question Formats

One more Likert-like scale (used in SUMI):

I would recommend this software to my colleagues:
__ Agree
__ Undecided
__ Disagree
One Question Formats

Subjective Mental Effort Scale (SMEQ)

[Figure: the SMEQ scale]
One Question Formats

- Semantic differential
- Magnitude estimation: use any positive number
Before-After Ratings

Before the task:
How easy or difficult do you expect this task to be?
Very easy  1  2  3  4  5  6  7  Very difficult

After the task:
How easy or difficult was the task to do?
Very easy  1  2  3  4  5  6  7  Very difficult
Multiple Question Formats (Selected List)

- System Usability Scale (SUS): 10 ratings (see the scoring sketch after the next slide)
- *Questionnaire for User-Interface Satisfaction (QUIS): 71 ratings (long form), 26 ratings (short form)
- *Software Usability Measurement Inventory (SUMI): 50 ratings
- After Scenario Questionnaire (ASQ): three ratings

* Requires a license
More Multiple Question Formats

- Post-Study System Usability Questionnaire (PSSUQ): 19 ratings. The electronic version is called the Computer System Usability Questionnaire (CSUQ)
- *Website Analysis and MeasureMent Inventory (WAMMI): 20 ratings of website usability

* Requires a license
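As a concrete illustration of how SUS yields a single 0-100 score, here is a minimal Python sketch of the standard scoring rule (odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5). The responses below are invented.

```python
def sus_score(responses):
    """Compute a 0-100 SUS score from ten 1-5 item responses.

    Standard SUS scoring: odd-numbered (positively worded) items
    contribute (response - 1); even-numbered (negatively worded)
    items contribute (5 - response); the sum is scaled by 2.5.
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    total = sum((r - 1) if i % 2 == 1 else (5 - r)
                for i, r in enumerate(responses, start=1))
    return total * 2.5

# Invented responses from one participant (items 1-10, each 1-5).
print(sus_score([4, 2, 5, 1, 4, 2, 4, 1, 5, 2]))  # 85.0
```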
Guidelines from Past Studies

Guidelines

- Have 5-9 levels in a rating scale
  - You gain no additional information by having more than 10 levels
- Include a neutral point in the middle of the scale
  - Otherwise you lose information by forcing some participants to take sides
  - People from some Asian cultures are more likely to choose the midpoint
Guidelines

- Use positive integers as numbers
  - 1-7 instead of -3 to +3 (participants are less likely to go below 0 than they are to use 1-3)
  - Or don't show numbers at all
- Use word labels for at least the end points
  - It is hard to create labels for every point beyond 5 levels
  - Having labels on the end points only also makes the data more "interval-like"
Guidelines

- Most word labels produce a bipolar scale
  - In a 1 to 7 scale from easy to difficult, what is increasing with the numbers? Is ease the absence of difficulty?
  - This may be one reason why participants are reluctant to move to the difficult end: it is a different concept than lack of ease
  - One solution: a scale from "not at all easy" to "very easy"
Evaluating a Rating Scale

Statistical Criteria

- Is it valid? Does it measure what it's supposed to measure?
  - For example, does it correlate with other usability measures? (see the sketch after this list)
- Is it sensitive?
  - Can it discriminate between tasks or products with small samples?
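As a minimal sketch of that kind of validity check, here is how invented per-task ease ratings could be correlated with task times (NumPy assumed); a valid ease measure should correlate negatively with time:

```python
import numpy as np

# Invented per-task means: ease ratings (1-7, higher = easier)
# and completion times in seconds.
ease = np.array([6.2, 5.8, 3.1, 4.4, 2.7])
times = np.array([35, 48, 140, 95, 180])

# If the ease rating is valid, it should correlate negatively
# with task time (easier tasks finish faster).
r = np.corrcoef(ease, times)[0, 1]
print(f"r = {r:.2f}")  # strongly negative for this invented data
```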
Practical Criteria

- Is it easy for the participant to understand and use?
  - Do they get what it means?
- Is it easy for the tester to present (online or paper) and score?
  - Do you need a widget to present it?
  - Can scoring be done automatically?
Guidelines from Recent Studies

Post-Task Ratings

- The simpler the better
  - Tedesco and Tullis found this format the most sensitive:

    Overall this task was:
    Very Easy ... Very Difficult

  - Sauro and Dumas found SMEQ just as sensitive as the Likert format
More on Post-Task Ratings

- They provide diagnostic information about usability issues with tasks
- They correlate moderately well with other measures, especially time, and their correlations are higher than for post-test ratings
More on Post-Task Ratings

Even post-task ratings may be inflated (Teague et al., 2001). Ease ratings made while working on a task were significantly lower than ratings made right after the task, and ratings were highest when given only after the task:

Condition                    Mean ease rating
Concurrent, during the task  4.44
Concurrent, after the task   4.78
Post-task only               5.60
Post-Test Ratings

- Home-grown questionnaires perform more poorly than standardized ones
- Tullis and Stetson and others have found SUS the most sensitive; many testers are using it
- Some of the standardized questionnaires have industry norms to compare against (SUMI and WAMMI)
  - But no one knows what the database of norms contains
More on Post-Test Ratings

- Among all the measures used in testing, post-test ratings have the lowest correlations with the other measures (Sauro and Lewis)
  - Why? They tap into factors that don't affect the other measures, such as demand characteristics, the need to please, the need to appear competent, lack of understanding of what an "overall" rating means, etc.
Examine the Distribution

[Figure: histogram of individual ratings, with peaks at both ends of the scale]

Notice how the average alone would miss how bimodal the distribution is. Some participants find the product very hard to use. Why?
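A minimal sketch of tabulating the full distribution instead of reporting only the mean; the ratings are invented:

```python
from collections import Counter
from statistics import mean

# Invented post-task ease ratings (1-7). The mean looks middling,
# but the counts reveal two clusters at the extremes.
ratings = [7, 6, 7, 1, 2, 7, 1, 6, 2, 7, 1, 6]
counts = Counter(ratings)

print(f"mean = {mean(ratings):.2f}")  # ~4.42, hides the split
for value in range(1, 8):
    print(f"{value}: {'#' * counts[value]}")
```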
Low Sensitivity with Small Samples

- Three recent studies have all shown that post-task and post-test ratings do not discriminate well with sample sizes below about 10-12
- For the sample sizes typical of formative laboratory tests, ratings are not reliable
- Ratings can still be used as an opportunity to get participants to talk about why they chose a value
The Value of Confidence Intervals

Actual data from an online study comparing the NASA and Wikipedia sites for finding information on the Apollo space program.

[Figure: mean ratings of "Ease of Finding Information" (1-7, higher = better) for the NASA and Wikipedia sites at sample sizes of 5, 10, 20, 30, and 50; error bars represent a 90% confidence interval]
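A minimal sketch of computing such a 90% confidence interval for a mean rating with a t-distribution (SciPy assumed; the ratings are invented):

```python
import numpy as np
from scipy import stats

def ci90(ratings):
    """Return (mean, half-width) of a two-sided 90% t-based CI."""
    a = np.asarray(ratings, dtype=float)
    sem = stats.sem(a)                          # standard error of the mean
    half = sem * stats.t.ppf(0.95, len(a) - 1)  # 0.95 for two-sided 90%
    return a.mean(), half

# Invented 1-7 "ease of finding information" ratings for one site.
m, h = ci90([6, 5, 7, 4, 6, 5, 6, 7, 5, 6])
print(f"{m:.2f} ± {h:.2f}")
```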
Little Known Advantages of Rating Scales

Ratings Can Help Prioritize Work

[Figure: "Average Expectation and Experience Ratings by Task": scatter plot of average expectation rating (x-axis) against average experience rating (y-axis), both on a 1-7 scale where 1 = difficult and 7 = easy. The quadrants are labeled "Promote It" (expected hard, experienced easy), "Don't Touch It" (expected easy, experienced easy), "Fix It Fast" (expected easy, experienced hard), and "Big Opportunity" (expected hard, experienced hard)]
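A minimal sketch of sorting tasks into these four quadrants, assuming per-task average expectation and experience ratings on the same 1-7 scale and using the scale midpoint (4) as the cut; the task data are invented:

```python
# Invented per-task averages: (expectation, experience), both 1-7
# where 1 = difficult and 7 = easy.
tasks = {
    "Task 1": (6.1, 6.3),
    "Task 2": (6.0, 2.5),
    "Task 3": (2.8, 6.0),
    "Task 4": (2.4, 2.2),
}

MID = 4  # midpoint of the 1-7 scale

def quadrant(expected, experienced):
    if expected >= MID and experienced >= MID:
        return "Don't Touch It"   # expected easy, was easy
    if expected >= MID:
        return "Fix It Fast"      # expected easy, turned out hard
    if experienced >= MID:
        return "Promote It"       # expected hard, turned out easy
    return "Big Opportunity"      # expected hard, was hard

for name, (exp, act) in tasks.items():
    print(f"{name}: {quadrant(exp, act)}")
```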
Ratings Can Help Identify "Disconnects"

[Figure: per-task percent correct (accuracy) plotted with mean task ease ratings (1-5, higher = better) for Tasks 1-4]

This "disconnect" between the accuracy and task ease ratings is worrisome: it indicates users didn't realize they were screwing up on Task 2!
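A minimal sketch of flagging such disconnects automatically, given per-task accuracy and mean ease ratings; the data and thresholds are invented:

```python
# Invented per-task results: (proportion correct, mean ease rating 1-5).
tasks = {
    "Task 1": (0.90, 4.4),
    "Task 2": (0.55, 4.8),  # low accuracy, high ease: a disconnect
    "Task 3": (0.85, 4.3),
    "Task 4": (0.95, 4.6),
}

# Flag tasks users rate easy but mostly fail; thresholds are invented.
for name, (accuracy, ease) in tasks.items():
    if accuracy < 0.70 and ease >= 4.0:
        print(f"{name}: disconnect (accuracy {accuracy:.0%}, ease {ease})")
```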
Ratings Can Help You Make Comparisons

[Figure: frequency distribution of average SUS scores for 129 conditions from 50 studies, binned <=40, 41-50, 51-60, 61-70, 71-80, 81-90, and 91-100]

You can be very pleased if you get an average SUS score of 83 (the 94th percentile of this distribution). But you should be worried if you get an average SUS score of 48 (the 12th percentile).
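A minimal sketch of placing a new score in such a distribution, assuming a set of average SUS scores from comparable studies (the benchmark scores are invented; SciPy assumed):

```python
from scipy import stats

# Invented average SUS scores from earlier studies/conditions,
# standing in for the 129-condition distribution on the slide.
benchmarks = [38, 45, 52, 55, 58, 61, 63, 66, 68, 70,
              71, 73, 75, 77, 79, 81, 84, 88, 92]

for score in (83, 48):
    pct = stats.percentileofscore(benchmarks, score)
    print(f"SUS {score}: {pct:.0f}th percentile of this benchmark set")
```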
In Closing…

These slides, a bibliography of readings, and associated examples can be downloaded from:
http://www.measuringUX.com/

Feel free to contact us with questions!