How Many Participants Should I Test?

HCI460: Week 8 Lecture
October 28, 2009
Outline
 Midterm Review
 How Many Participants Should I Test?
– Review
– Exercises
 Stats
– Review of material covered last week
– New material
 Project 3
– Next Steps
– Feedback on the Test Plans
2
Midterm Review
3
Midterm Review
Overall
 Mean / average: 8.55
 Median: 8.75
 Mode: 10 (most frequent score)
[Histogram: distribution of midterm scores, N = 44; x-axis: Midterm Score (0–10 in 0.5-point bins), y-axis: Number of Students (0–12)]
4
Midterm Review
Q1: Heuristic vs. Expert Evaluation
 Question: What is the main difference between a heuristic
evaluation and an expert evaluation?
 Answer:
– Heuristic evaluation uses a specific set of guidelines or
heuristics.
– Expert evaluation relies on the evaluator’s expertise (including
internalized guidelines) and experience.
• No need to explicitly match issues to specific heuristics.
• More flexibility.
5
Midterm Review
Q2: Research-Based Guidelines (RBGs)
 Question: What is unique about the research-based guidelines on
usability.gov relative to heuristics and other guidelines? What are
the unique advantages of using the research-based guidelines?
 Answer:
– This is a very comprehensive list of very specific guidelines
(over 200). Other guideline sets are much smaller and the
guidelines are more general.
– RBGs were created by a group of experts (not an individual).
– RBGs are specific to the web.
– Unlike other heuristics and guidelines, RBGs have two ratings:
• Relative importance to the success of a site
– Helps prioritize issues.
• Strength of research evidence that supports the guideline
– Research citations lend credibility to the guidelines.
6
Midterm Review
Q3: Positive Findings
 Question: Why should positive findings be presented in usability
reports?
 Answer:
– To let stakeholders know what they should not change and
which current practices they should try to emulate.
– To make the report sound more objective and make
stakeholders more receptive to the findings in the report.
• Humans are more open to criticism if it is balanced with
praise.
7
Midterm Review
Q4: Think-Aloud vs. Retrospective TA
 Question: What is the main difference between the think-aloud
protocol (TA) and the retrospective think-aloud protocol (RTA)?
When should you use each of these methods and why?
RTA ≠ post-task interview
 Answer:
– TA involves having the participant state what they are thinking
while they are completing a task.
• Great for formative studies; helps understand participant
actions as they happen.
– RTA is used after the task has been completed in silence. The
participant walks through the task one more time (or watches a
video of themselves performing the task) and explains their
thoughts and actions.
• Good when time on task and other quantitative behavioral
measures need to be collected in addition to qualitative data.
• Good for participants who may not be able to do TA.
8
Midterm Review
Q5: Time on Task in Formative UTs
 Question: What are the main concerns associated with using time
on task in a formative study with 5 participants?
 Answer:
– Formative studies often involve think-aloud protocol.
• Time on task will be longer because thinking aloud takes
more time and changes the workflow.
– Sample size is too small for the time on task to generalize to the
population or show significant differences between conditions.
9
Midterm Review
Q6: Human Error
 Question: Why is the term “human error” no longer used in the
medical field?
 Answer:
– “Human error” places the blame on the human when in fact
errors usually result from problems with the design.
– A more neutral term “use error” is used instead.
10
Midterm Review
Q7: Side-by-Side Moderation
 Question: When would you opt for side-by-side moderation in place
of moderation from another room with audio communication?
 Answer:
– Side-by-side moderation is better when:
• Building rapport with participant is important (e.g., in
formative think-aloud studies)
• Moderator has to simulate interaction (e.g., paper prototype)
• The tested object / interaction may be difficult to see via
camera feed or through the one-way mirror
Moderating from another room with audio communication ≠ remote study
11
How Many Participants Should I Test?
Review from Last Week
12
How Many Participants Should I Test?
Overview
How many participants?
 Formative test?
– Nielsen’s 5 or more?
– Sauro’s calculator?
 Summative test?
– Need a directional answer? (descriptive stats)
– Need a definitive answer? (inferential stats)
• Precision testing: Does the score generalize to the population?
• Hypothesis testing: Is the score for A different than the score for B?
 Also consider: How complex is the system? How many tasks? How many
distinct user groups? Do these groups use the same areas in the product?
 “Smell check:” How many participants will make stakeholders comfortable?
13
How Many Participants Should I Test?
Sample Size Calculator for Formative Studies
 Jeff Sauro’s Sample Size Calculator for Discovering Problems in a
User Interface: http://www.measuringusability.com/problem_discovery.php
14
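Problem-discovery calculators like Sauro's are commonly described by the cumulative binomial model: the chance of seeing a problem at least once in n sessions, when it affects a proportion p of users, is 1 - (1 - p)^n. A minimal Python sketch (p = 0.31 is Nielsen's classic average-visibility figure):

```python
# Cumulative problem-discovery model: probability that a problem affecting
# a proportion p of users is seen at least once in n test sessions.

def discovery_rate(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

def sessions_needed(p: float, target: float) -> int:
    """Smallest n whose discovery rate reaches the target (e.g., 0.85)."""
    n = 1
    while discovery_rate(p, n) < target:
        n += 1
    return n

# Nielsen's classic figures: with p = 0.31, five users see ~84% of problems.
print(round(discovery_rate(0.31, 5), 2))  # 0.84
print(sessions_needed(0.31, 0.85))        # 6
```

This is why the curve of problems found flattens quickly: each added participant mostly re-finds problems already seen.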
How Many Participants Should I Test?
Sample Size for Precision Testing
 We need sufficient sample size to be
able to generalize the results to
the population.
 Sample size for precision testing
depends on:
– Confidence level (usually 95% or
99%)
– Desired level of precision
• Acceptable sampling error
(+/- 5%)
– Size of population to which we
want to generalize the results
 Free online sample size calculator
from Creative Research Systems:
http://www.surveysystem.com/sscalc.htm
15
How Many Participants Should I Test?
Sample Size for Precision Testing
 Confidence level: 95%

Population of:   +/- 3%   +/- 5%   +/- 7%
100M             1067     384      196
1M               1066     384      196
100,000          1056     383      196
10,000           964      370      192
1,000            516      278      164
100              92       80       66
 When generalizing a score to the population, a large sample size is
needed.
 However, “the more, the better” does not hold.
– Getting 2,000 participants is a waste.
16
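The table above can be reproduced with the standard sample-size formula for estimating a proportion, assuming the worst-case p = 0.5 and applying a finite-population correction. A sketch of the calculation behind calculators like the one linked earlier:

```python
# Sample size for estimating a proportion at a given confidence level and
# margin of error, with a finite-population correction (worst-case p = 0.5).

def sample_size(margin: float, population: float, z: float = 1.96) -> int:
    """z = 1.96 for 95% confidence; margin = 0.05 for +/- 5%."""
    n0 = (z ** 2) * 0.25 / margin ** 2      # infinite-population sample size
    n = n0 / (1 + (n0 - 1) / population)    # finite-population correction
    return round(n)

print(sample_size(0.05, 1e8))   # 384
print(sample_size(0.03, 1e8))   # 1067
print(sample_size(0.05, 1000))  # 278
print(sample_size(0.05, 100))   # 80
```

Note how the correction only matters for small populations: at +/- 5%, a population of 100 million needs 384 people, but a population of 100 needs 80.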
How Many Participants Should I Test?
Sample Size for Hypothesis Testing
 Hypothesis testing: comparing means
– E.g., accuracy of typing on Device A is significantly better than it
is on Device B.
– Inferential statistics
 Necessary sample size is derived from a calculation of power.
– Under assumed criteria, the study will have a good chance of
detecting a significant difference if the difference indeed exists.
 Sample size depends on:
– Assumed confidence level (e.g., 95%, 99%)
– Acceptable sampling error (e.g., +/- 5%)
– Expected effect size
– Power
– Statistical test (e.g., t-test,
correlation, ANOVA)
17
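A rough version of the power calculation can be sketched with the normal approximation below; Cohen's table (next slide) uses the exact noncentral t distribution, so its entries come out a participant or two higher for medium and large effects.

```python
from math import ceil
from statistics import NormalDist

# Approximate per-group sample size for a two-sample t-test:
# n = 2 * ((z_alpha/2 + z_power) / d)^2, where d is Cohen's effect size.
# A sketch only; exact power analysis uses the noncentral t distribution.

def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    z_a = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = .05
    z_b = NormalDist().inv_cdf(power)          # 0.84 for 80% power
    return ceil(2 * ((z_a + z_b) / d) ** 2)

print(n_per_group(0.5))  # 63 (Cohen's table lists 64 for a medium effect)
print(n_per_group(0.2))  # 393 (matches Cohen's entry for a small effect)
```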
How Many Participants Should I Test?
Hypothesis Testing: Sample Size Table*
*Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112 (1), http://www.math.unm.edu/~schrader/biostat/bio2/Spr06/cohen.pdf.
18
How Many Participants Should I Test?
Reality
 Usability tests do not typically require statistical significance.
 Objectives dictate type of study and reasonable sample sizes
necessary.
 Sample size used is influenced by many factors—not all of them
statistically driven.
 Power analysis provides an estimate of sample size necessary to
detect a difference, if it does indeed exist
 Risk of not performing power analysis?
– Too few → Low power → Inability to detect difference
– Too many → Waste (and possibly find differences that are not
real)
 What if you find significance even with a small sample size?
– It is probably really there (at a certain p level)
19
How Many Participants Should I Test?
Exercises
20
How Many Participants Should I Test?
Exercise 1: Background Information
 Package inserts for chemicals
used in hospital labs were
shortened and standardized
to reduce cost.
– E.g., many chemicals x
many languages = high
translation cost
 New inserts:
– ½ size of old inserts
(booklet, not “map”)
– More concise (charts, bullet
points)
– Less redundant
[Images: old and new inserts]
 Users: Lab techs in hospitals
21
How Many Participants Should I Test?
Exercise 1: The Question
 Client question:
– Will the new inserts negatively
impact user performance?
 How many participants do you need
for the study and why?
 Exercise:
– Discuss in groups and prepare
questions for the client.
– Q & A with the client
– Come up with the answer and be
prepared to explain it.
22
How Many Participants Should I Test?
Exercise 1: Possible Method
 Each participant was asked to complete 30 search tasks
– 2 insert versions x 3 chemicals x 5 search tasks
– Sample question: “How long may the _____ be stored at the
refrigeration temperature?” (for up to 7 days)
 Task instructions were printed on a card and placed in front of the
participant.
 The tasks were timed.
– Participants had a maximum of 1 minute to complete each task.
– Those who exceeded the time limit were asked to move on to the
next task.
 To indicate that they were finished, participants had to:
– Say the answer to the question out loud
– Point to the answer in the insert
23
How Many Participants Should I Test?
Exercise 1: Possible Method
24
How Many Participants Should I Test?
Exercise 1: Sample Size
[Sample-size decision flowchart repeated from the Overview]
25
How Many Participants Should I Test?
Exercise 1: Sample Size
 32 lab techs:
– 17 in the US
– 15 in Germany
26
How Many Participants Should I Test?
Exercise 1: Results
 Short inserts performed significantly better than the long inserts
in terms of:
                            New     Old     p
Success rate                62%     50%     p = .0005
Time on task                31 s    35 s    p < .05
Ease of use (1-7)           5.6     5.4     p < .01
Overall satisfaction (1-7)  5.9     4.1     p < .0001
27
How Many Participants Should I Test?
Exercise 2
 Client question:
– Does our website work for
our users?
 How many participants do you
need for the study and why?
 Exercise:
– Discuss in groups and
prepare questions for the
client.
– Q & A with the client
– Come up with the answer
and be prepared to explain
it.
28
Stats: What you needed to hear…
29
Stats
Planted the Seed Last Week
 Stats are:
– Not learnable in an hour
– More than just p-values
– Powerful
– Dangerous
– Time consuming
 But, you are the Expert
 Need to know:
– Foundation
– Rationale
– Useful application
30
Stats
Foundation
 Just as with any user experience research endeavor
– Think hard first (Test Plan)
– Anticipate outcomes to keep you on track with objectives and not
get pulled to tangents
– Then begin research...
 Definition of statistics:
– A set of procedures for describing measurements and for
making inferences about what is generally true
 Statistics and experimental design go hand in hand
– Objective → Method → Measures → Outcomes → Objectives
31
Stats
Measures
 Ratio scales (Interval + Absolute Zero)
– Measure that has:
• Comparable intervals (inches, feet, etc.)
• An absolute zero (not meaningful to non-statisticians)
– Differences are comparable
• Device A is 72 inches in length while Device B is 36 inches
– One is twice as long as the other
• Performance on Device A was 30 sec while on Device B it was
60 sec
– Users were twice as fast completing the task on Device A
as on Device B
 Interval scales do not have an absolute zero
– The difference between 40°F and 50°F equals the difference
between 90°F and 100°F
 Take away: You get powerful statistics using ratio/interval measures
32
Stats
What Does Power Mean Again?
 Statistical Power
– “The power to find a significant difference, if it does indeed exist”
– Too little power → Miss significance when it is really there
– Too much power → Might find significance when it is NOT there
 You can get MORE Power by:
– Adding more participants (but the impact is non-linear)
– Having a greater “effect size,” which is the anticipated difference
– Picking within-subjects designs over between-subjects designs
– Using ratio/interval measures
– Changing alpha
 Practical Power
– Sample size costs money
– If you find significance, then it is probably really true!
33
Stats
Other Measures (non-ratio/non-interval)
 Likert scales
 Rank data
 Count data
 Each of these measures uses a different statistical test
 Power is different (reduced)
– Consider Likert Data
[Likert scale 1–5 with ratings: A at 1, B at 2, C at 5]
 Could say:
– A=1, B=2, C=5
– C came in 1st, B came in 2nd and A came in 3rd
 Precision influences power and the less precise your measure, the
less power you have to detect differences
34
Stats
Between-Groups Designs
 Between-groups study splits sample into two or more groups
 Each group only interacts with one device
 What causes variability?
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Unreliable
• We are human, so if you test the same participant on
different days, you might get a different time!
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
35
Stats
What About Within-Groups Designs?
 Within-Groups study has participants interact with all devices
 What causes variability?
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Unreliable
• We are human, so if you test the same participant on
different days, you might get a different time!
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
• No longer applies
 Thus, fewer sources of variability mean greater statistical power
36
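The within-groups advantage can be seen in a tiny simulation (all distributions and effect sizes below are invented for illustration):

```python
import random
import statistics

# Sketch of why within-groups designs gain power: individual differences
# (a baseline speed per participant) dwarf the small device effect, but
# they cancel out in paired, within-subject comparisons.
random.seed(42)

def task_time(baseline: float, device_effect: float) -> float:
    return baseline + device_effect + random.gauss(0, 2)  # measurement noise

baselines = [random.gauss(60, 15) for _ in range(200)]    # individual differences

# Within-groups: each person tries both devices; the baseline cancels out.
paired_diffs = [task_time(b, 5) - task_time(b, 0) for b in baselines]

# Between-groups: each person tries one device; baselines stay in the scores.
group_a = [task_time(b, 5) for b in baselines[:100]]

print(round(statistics.stdev(paired_diffs), 1))  # small: measurement noise only
print(round(statistics.stdev(group_a), 1))       # large: noise + individual differences
```

The same 5-second device effect is easy to detect against the paired spread and nearly invisible against the between-groups spread.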
Stats
More Common Statistical Tests
 You are actually well aware of statistics – Descriptive statistics!
 Measures of central tendency
– Mean
– Median
– Mode
 Definitions?
– Mean = ?
• Average
– Median = ?
• The exact point that divides the distribution into two parts
such that an equal number fall above and below that point
– Mode = ?
• Most frequently occurring score
37
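The three measures, computed with Python's statistics module on a small made-up score list:

```python
import statistics

# Measures of central tendency on a hypothetical set of scores.
scores = [2, 3, 3, 4, 4, 4, 5, 7]

print(statistics.mean(scores))    # 4.0  (average)
print(statistics.median(scores))  # 4.0  (half the scores fall on each side)
print(statistics.mode(scores))    # 4    (most frequently occurring score)
```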
Stats
When In Doubt, Plot
 Take a set of scores and plot their frequency
[Histogram: scores 1–5 on the x-axis, frequency on the y-axis, forming a
bell shape]
– Normal distribution
– Randomly sampled
– Mean = Median = Mode
38
Stats
Skewed Distributions
 Positive skew → Tail on the right
 Negative skew → Tail on the left
 Impact to measures of central tendency?
– Mode
– Median
– Mean
 “Central tendency”
39
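The impact of skew shows up even in a tiny hypothetical dataset: with a long right tail, the mean is pulled toward the tail while the median and mode stay put, so mode < median < mean.

```python
import statistics

# Hypothetical, positively skewed time-on-task data (seconds):
# one slow participant drags the mean upward.
times = [20, 22, 22, 25, 27, 30, 95]

print(statistics.mode(times))    # 22
print(statistics.median(times))  # 25
print(statistics.mean(times))    # ~34.4, pulled up by the 95 s outlier
```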
Stats
Got Central Tendency, Now Variability
 We must first understand variability
– We tend to think of a mean as “it” or “everything”
 Consider a time on task as a metric
– Measurement error
• The tool or procedure can be an imprecise device
– Starting and stopping the stop watch
– Individual differences
• Participants are different, so some get different scores than
others on the same task. Since we are testing for differences
between A and B, this can be a problem
– Unreliable
• We are human, so if you test the same participant on
different days, you might get a different time!
40
Stats
Got Central Tendency, Now Variability
 We must first understand variability
– We tend to think of a mean as “it” or “everything”
– Class scored 80
– School scored 76
– Many scores went into the mean score
– Variability can be quantified
– [draw]
41
Stats
Variability is Standardized (~Normalized)
 Your score is in the 50th percentile
– Ethan and Madeline are the smartest kids in class
 On standardized test scores, you saw your score—how did they get a percentile?
– Distribution is normal
– Numerically, the distribution can be described by only:
• Mean and standard deviation
[Normal curve with ±1, ±2, ±3 standard deviation bands]
42
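The score-to-percentile conversion the slide describes is just the normal CDF once the mean and standard deviation are known. A sketch with a hypothetical test (mean 500, std dev 100):

```python
from statistics import NormalDist

# A normal distribution is fully described by its mean and standard
# deviation, so any raw score converts to a percentile via the CDF.
exam = NormalDist(mu=500, sigma=100)

print(round(exam.cdf(500) * 100))  # 50th percentile: the mean itself
print(round(exam.cdf(600) * 100))  # 84th: one std dev above the mean
print(round(exam.cdf(700) * 100))  # 98th: two std devs above the mean
```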
Stats
Empirical Rule
 Empirical rule = 68 / 95 / 99.7
– 68%: Mean +/- 1 std dev
– 95%: Mean +/- 2 std dev
– 99.7%: Mean +/- 3 std dev
[Normal curve with ±1, ±2, ±3 standard deviation bands]
43
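The 68/95/99.7 figures can be checked directly against the normal CDF:

```python
from statistics import NormalDist

# Empirical rule: probability mass within k standard deviations of the mean.
nd = NormalDist()

for k in (1, 2, 3):
    within = nd.cdf(k) - nd.cdf(-k)
    print(k, round(within * 100, 1))  # 1 -> 68.3, 2 -> 95.4, 3 -> 99.7
```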
Stats
Clear on Normal Curves?
 Represent a single dataset on a single measure for a single sample
 Once data are normalized, you can describe the dataset simply by
– Mean and standard deviation
• 60% success rate with a std dev of 10%
[Normal curve centered at 60%, with ticks at 40%, 50%, 60%, 70%]
44
Stats
Randomly Sampled
 Think of data as coming from a population
45
Stats
Things Can Happen By Chance Alone
 These really are two samples of 10 drawn from a population that may
have these characteristics
46
Stats
Exercise
 Pass out Yes / No cards
 Procedure
– Will give you a task
– I will count (because it is hard to install a timer on DePaul PCs)
– Make a Yes or No decision
– Record count
– Hold up the Yes / No
 Practice
47
Stats
Exercise
 Is the man wearing a red shirt?
 Decide Yes or No
 Note time
 Hold up card
 Ready?
48
[Timer slides counting 1–9 sec]
Stats
Task 1
 Hospital setting. Physician is concerned about the patient’s MCH
value, as it is now higher than normal. On January 27, 2001, was
the patient’s MCH value higher than normal?
 Decide Yes or No
 Note time
 Hold up card
 Ready?
58
[Timer slides counting 1–20 sec]
Stats
Task 1: Recap
 Collect
– Time
– Success/Fail
 Who forgot the task?
– Controls and impact to data collection…?
79
Stats
Task 2
 Hospital setting. Assessment of different prompts for an interaction
with an interface. The patient has declined further treatment. The
physician asked you to go into the system and cancel all of the
orders. So, you select the orders and press cancel.
 Decide Yes or No
 Note time
 Hold up card
 Ready?
80
[Timer slides counting 1–10 sec]
Stats
Task 3
 Hospital setting. Assessment of different prompts for an interaction
with an interface. The patient has declined further treatment. The
physician asked you to go into the system and cancel all of the
orders. So, you select the orders and press cancel.
 Decide Yes or No
 Note time
 Hold up card
 Ready?
91
[Timer slides counting 1–10 sec]
Stats
Task 3: Recap
What just happened?
[Diagram: prompt wording classified as affirmative vs. negative and
destructive vs. non-destructive]

Statement Type      Use When
True Affirmative    Fast, easy, low-cost-to-user outcome, confirmation only
False Affirmative   Need user to think about the response; high cost to user
True Negative       Should almost never use
False Negative      Never use, unless trying to deceive the user
Opt-out response example: [ ] Do not send me the newsletter
102
Stats
Data Collection
 Time
 Blue = Correct (success)
 Red = Incorrect (fail)
103
Stats
Tasting Soup
 You don’t need to drink half a pot to see if the soup is done / tasty
 As long as the soup is well mixed, you should be able to just take
one or two tastes, where “well mixed” means:
– Randomly sampled
– Normally distributed
– Variances are equal
– Samples are independent
[Chart: two success-rate scores plotted on a 40–100% scale]
 Can you say these are different?
– Missing confidence intervals?
104
Stats
Confidence Intervals Matter
 A 95% confidence interval says that the dot falls within that range
95% of the time
 With confidence intervals attached, can you say these are different?
[Chart: two success-rate scores with confidence intervals on a 40–100% scale]
 The width of these intervals is affected by:
– Confidence interval itself
– Variability
– Sample size
105
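A sketch of how such intervals are computed for a success rate (simple Wald interval on hypothetical counts; small usability samples usually call for the adjusted-Wald variant instead):

```python
from math import sqrt

# Wald confidence interval for a success rate: p +/- z * sqrt(p(1-p)/n).
# Hypothetical data: 12/20 successes on A vs. 15/20 on B.

def wald_ci(successes: int, n: int, z: float = 1.96):
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return (p - half, p + half)

a = wald_ci(12, 20)   # 60% success
b = wald_ci(15, 20)   # 75% success

print([round(x, 2) for x in a])  # [0.39, 0.81]
print([round(x, 2) for x in b])  # [0.56, 0.94]
print(a[1] > b[0])               # True: intervals overlap, difference unclear
```

With n = 20 per group the intervals overlap heavily, which is exactly the "dirty world" case on the next slide where inferential statistics are needed.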
Stats
What If Variability Were Better Controlled?
 What if we could be more precise or reduce variability, as in a
within-groups design?
– The confidence level still stays the same at 95%
– But, the variability is reduced
 With confidence intervals attached, can you say these are different?
[Chart: two success-rate scores with narrow, non-overlapping confidence intervals]
 Yes! And in fact, you do not need stats
106
Stats
Usually, Inferential Stats are Needed
 The world is usually dirty
– There is typically overlap
 Statistics can determine if these are significantly different
[Chart: two success-rate scores with overlapping confidence intervals]
107
Stats
What If There Is No Known Population?
 To properly measure quality control standards, Guinness hired
statistician William Sealy Gosset in 1899 to assist
108
Stats
Student’s T-distribution Mimics Population
 Population is unknown, so it is estimated from the sample itself
– Notice that only relatively small sample sizes are necessary for the
t-distribution to resemble a normal distribution
109
Stats
Hypotheses Testing
 All examples are from a single sample—now compare two samples
 Research question (i.e., state hypothesis)
– Is there a difference between A and B?
 Test the opposite (i.e., null hypothesis)
– There is no difference between A and B
– Scientific method is to disprove the null hypothesis (Ho)
• Because we cannot prove the hypothesis, we disprove others
one-by-one
110
Stats
How Can We Be Wrong?
 Type I: Saying there is a difference when one does not exist
– Like convicting an innocent person
– False positive
– Controlled by the alpha level
• Established a priori (you set this before the study)
 Type II: Saying there is not a difference when there is one!
– Like letting the guilty go free
– False negative
 There are volumes written on Type I / II errors
 So, let’s focus on inferring that the New Device is better than the Old
Device
111
Stats
Sample Time Data
 Remember that with time data, there is typically a long tail
 Perform a log transform to normalize the data
– The geometric mean is recommended
– Plug in the log-transformed data instead of raw times
112
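The log-transform step can be sketched as follows (the times below are hypothetical):

```python
import math
import statistics

# Time-on-task data typically has a long right tail; analyzing log(time)
# and reporting the geometric mean reduces the pull of slow outliers.
times = [20, 22, 25, 27, 30, 95]  # hypothetical times in seconds

log_times = [math.log(t) for t in times]
geo_mean = math.exp(statistics.mean(log_times))  # exp of mean log = geometric mean

print(statistics.mean(times))   # 36.5: arithmetic mean, dragged up by 95 s
print(round(geo_mean, 1))       # ~30.8: closer to the typical time
```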
Stats
Hypothesis
 Null Hypothesis (Ho): There is no difference in time for old vs. new
 Alternative Hypothesis (Ha): There is a difference
 Hypothesis Test
– Set alpha (p < .05)
– Ho: Mean (old) = Mean (new)
– Ha: Mean (old) ≠ Mean (new) → Two-tailed test
– H1: Mean (old) < Mean (new) → One-tailed test
 Run statistical test
– Two-sample t-test
– Result: p = .0106
 If p-value from t-test < .05, the difference is significant
113
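The whole recipe can be sketched as a from-scratch pooled two-sample t-test with a numerically integrated p-value. This is illustrative only, with hypothetical data; in practice you would use a statistics package (e.g., scipy.stats.ttest_ind).

```python
import math
import statistics

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t: float, df: int, step: float = 1e-4) -> float:
    """P(|T| > |t|): 1 minus the trapezoid-rule integral of the density."""
    t = abs(t)
    n = int(t / step)
    area = sum(t_pdf(i * step, df) for i in range(1, n)) * step
    area += (t_pdf(0, df) + t_pdf(t, df)) * step / 2
    return 1 - 2 * area

def t_test(a, b):
    """Pooled two-sample t-test: returns (t statistic, two-sided p-value)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, two_sided_p(t, na + nb - 2)

old = [5, 6, 7, 8, 9]   # hypothetical log-times
new = [4, 5, 6, 7, 8]
t, p = t_test(old, new)
print(round(t, 3), round(p, 3))  # t = 1.0, p well above .05: do not reject Ho
```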
Stats
Stats
 p = .0106 → Reject Ho
 The current design was significantly faster than the new design on
Task XXX (p < .05)
114
Stats
Same for Success Data… Fisher’s Exact Test
 Null Hypothesis (Ho): There is no difference in success for old vs.
new
 Alternative Hypothesis (Ha): There is a difference
 Hypothesis Test
– Set alpha (p < .05)
– Ho: Mean (old) = Mean (new)
– Ha: Mean (old) ≠ Mean (new) → Two-tailed test
– H1: Mean (old) < Mean (new) → One-tailed test
 Run statistical test
 If the p-value from the test < .05, the difference is significant
115
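For success/failure counts, the usual small-sample choice is Fisher's exact test, which can be computed directly from the hypergeometric distribution. A from-scratch sketch with hypothetical counts:

```python
from math import comb

# Fisher's exact test for a 2x2 success/failure table.
# Two-sided p: sum the probabilities of all tables (with the same margins)
# that are no more likely than the observed one.

def fisher_exact(a: int, b: int, c: int, d: int) -> float:
    """Table: [[a, b], [c, d]] = [[old_pass, old_fail], [new_pass, new_fail]]."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def prob(k: int) -> float:  # hypergeometric probability of cell (0,0) = k
        return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs + 1e-12)

# Hypothetical: 3/4 successes on old vs. 1/4 on new.
print(round(fisher_exact(3, 1, 1, 3), 3))  # 0.486: not significant at .05
```

With samples this small, even a 75% vs. 25% success split is not significant, which echoes the power discussion earlier in the lecture.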
Project 3
116
Project 3
Next Steps
 Wait for feedback on your test plan.
 Create a moderator’s guide based on the test plan.
– Must be very explicit, so that all moderators are consistent.
– Consider all contingencies.
 Run a pilot session or two with the entire group.
 Revise the guide.
 Collect data.
 Analyze data.
117
Project 3
General Feedback on Test Plans
 Objective:
– Why would someone want to pay for this study?
 Think of the project:
– Quantitatively – Which is better, A or B?
– Qualitatively – Why is A (or B) better (or worse)?
 Be careful when selecting participants:
– Ideally, they should be equally familiar or unfamiliar with both A
and B.
 Make sure all moderators have identical stimuli.
– E.g., if you are using printouts, make sure they are all in color
and printed in the same format etc.
118
Project 3
General Feedback on Test Plans
 If you are measuring time on task:
– Decide exactly when to start and stop the stopwatch.
• When the participant starts saying the answer out loud?
• When they point to the answer?
• What if they point and then change their mind?
 If you are measuring the number of errors, define what an error is
prior to the study
– [Website] If someone clicks on the wrong link? What if they
continue going down that path – is that still one error or more
than one?
– [Text entry] If someone types “a” instead of “s”? What about
missing letters? What about extra letters?
119
Project 3
General Feedback on Test Plans
 If you have a between-groups design:
– Make sure that the two groups are matched well in terms of
demographics, experience with the tested products etc.
 If you have a within-groups design:
– Counterbalance the presentation order of the two stimuli: half of
the participants should get A first and then B and the other half B
first and then A.
120