Evaluation (cont.): Empirical Studies
CS352
Announcements
• Where we are in PRICPE:
– Predispositions: Did this in Project Proposal.
– RI: Research was studying users. Hopefully led
to Insights.
– CP: Concept and initial (very low-fi) Prototypes
due Fri 7/16 at midnight.
Evaluate throughout, repeat iteratively!!
Evaluation
• Analytical – based on your head
• Empirical – based on data
• Advantages/disadvantages of empirical
– More expensive (time, $) to do.
+ Greater accuracy in what users REALLY do.
+ You’ll get more surprises.
+ Greater credibility with your bosses.
Empirical Work with Humans
• What do you want from it?
– List of problems:
• Usability study. (e.g., 5 users in a lab).
– List of behaviors/strategies/needs:
• Think-aloud study or field observation.
– Comparison with a boolean outcome (e.g., A > B):
• Statistical study.
• Note:
– Impossible to “prove” no problems exist.
• Can only find problems.
The Usability Study
• Returns a list of UI problems.
• Metrics:
– Time to complete task
– Errors made
– Difficulty to use (via questionnaire)
– Emotional response, e.g., stressed out,
discouraged, fun, enjoyable…
• Pick a task and user profile.
• Users do the task with your prototype
– in here: paper OR CogTool
Examples
• A Xerox Palo Alto Research Center (PARC) employee wrote that
PARC used extensive usability testing in creating the Xerox Star,
introduced in 1981.[2] Only about 25,000 were sold, leading many to
consider the Xerox Star a commercial failure.
• The book Inside Intuit says (page 22, of 1984), "... in the first instance
of the usability testing that later became standard industry practice,
LeFevre recruited people off the streets... and timed their Kwik-Chek
(Quicken) usage with a stopwatch. After every test... programmers
worked to improve the program."[1] Scott Cook, Intuit co-founder,
said, "... we did usability testing in 1984, five years before anyone
else... there's a very big difference between doing it and having
marketing people doing it as part of their... design... a very big
difference between doing it and having it be the core of what
engineers focus on."[3]
Usability Study: How
• How many: 5 users find 60-80% of problems.
• How:
– Be organized!! Have everything ready, ...
– Test it first.
– Users do task (one user at a time).
• Data IS the result (no stats needed).
Think-Aloud
• Usually most helpful with a working
prototype or system,
– but might be able to get use out of it for early
prototypes.
• Returns a list of
behaviors/strategies/impressions/thoughts.
• Pick a task and user profile.
• Users do the task with your prototype.
Think-Aloud: How
• How many: 5-10 users is usual.
– Data analysis is time-consuming.
– In here: 1-2 max!
• How:
– Be organized!! Have everything ready, ...
– Test it first.
– Users do task (one user at a time).
• Analyze data for patterns, surprises, etc.
– No stats: not enough subjects for this.
Think-Aloud: How (cont.)
• Think-aloud training (in class)
• Sample think-aloud results:
– From VL/HCC'03 (Prabhakararao et al.)
Announcement
• Quiz Friday on Evaluating
• Concepts and Prototypes due Mon at
11:59pm
• Midterm on Tuesday
Statistical Studies
• We will not do these in this class, but
– you need to know some basics.
• Goal: answer a binary question.
– eg: does system X help users create
animations?
– eg: are people better debuggers using X than
using Y?
• Advantage: your audience believes it.
• Disadvantage: you might not find out
enough about “why or why not”.
Hypothesis Testing
• Need to be specific, and
provable/refutable.
– e.g.: “users will debug better using system X
than in system Y”
– (Strictly speaking we use the “null”
hypothesis, which says there won’t be a
difference, but it’s a fine point...)
– Pick a significance value (rule of thumb: 0.05).
• If you get a p-value <=0.05, this says you’ve
shown a significant difference, but there’s a 5%
chance that the difference is a fluke.
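The decision rule above can be sketched with a small simulation. This is a minimal permutation test in Python (a nonparametric alternative to a t-test); the debugging times for systems X and Y are invented for illustration, not from any study:

```python
import random

def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for a difference in means by
    randomly relabeling the pooled observations many times."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / (len(pooled) - n_a)
        # Count relabelings at least as extreme as what we observed.
        if abs(mean_a - mean_b) >= observed:
            extreme += 1
    return extreme / n_permutations

# Hypothetical debugging times (minutes) under systems X and Y.
times_x = [12.1, 10.4, 11.8, 9.9, 13.0, 10.7, 11.2, 12.5]
times_y = [14.3, 13.1, 15.0, 12.8, 14.9, 13.6, 15.4, 12.9]
p = permutation_test(times_x, times_y)
print(f"p = {p:.4f}")  # compare against the chosen significance level, 0.05
```

If the labels X/Y rarely matter when shuffled at random, the observed difference is unlikely to be a fluke, which is exactly what a small p-value expresses.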
Lucid and testable hypothesis
• State a lucid, testable hypothesis
– this is a precise problem statement
• Example 1:
There is no difference in the number of
cavities in children and teenagers using
Crest and No-teeth toothpaste when
brushing daily over a one-month period.
Lucid and testable hypothesis
• Example 2:
There is no difference in user performance (time
and error rate) when selecting a single item from a
pop-up or a pull-down menu of 4 items, regardless
of the subject’s previous expertise in using a
mouse or using the different menu types.
[Figure: the same four-item menus shown two ways: a pull-down menu bar (File, Edit, View, Insert, with items New, Open, Close, Save) and a pop-up menu containing the same items.]
Design the Experiment
• Identify independent variables
(“treatments”) we’ll manipulate:
– eg: which system they use, X vs. Y?
• Identify outputs (dependent variables) for
the hypotheses:
– eg: more bugs fixed?
– eg: fewer minutes to fix same number of
bugs?
– eg.: less time to check out?
Independent variables
• Hypothesis includes the independent
variables that are to be altered
– the things you manipulate independent of a subject’s
behavior
– determines a modification to the conditions the
subjects undergo
– may arise from subjects being classified into
different groups
Independent variables
• In the toothpaste experiment:
– toothpaste type: uses Crest or No-teeth toothpaste
– age: <= 11 years or > 11 years
• In the menu experiment:
– menu type: pop-up or pull-down
– menu length: 3, 6, 9, 12, 15
– subject type: expert or novice
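Crossing the independent variables gives the full set of experimental conditions. A quick sketch using the menu experiment's variables (the level values come from the slide above; the code itself is just illustrative):

```python
from itertools import product

# Independent variables and their levels, from the menu experiment.
menu_types = ["pop-up", "pull-down"]
menu_lengths = [3, 6, 9, 12, 15]
subject_types = ["expert", "novice"]

# Every combination of levels is one experimental condition.
conditions = list(product(menu_types, menu_lengths, subject_types))
print(len(conditions))  # 2 menu types x 5 lengths x 2 subject types = 20
```

Twenty conditions from just three variables shows why experiments keep the number of independent variables (and levels) small.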
Dependent variables
• Hypothesis includes the dependent
variables that will be measured
• variables dependent on the subject’s behavior /
reaction to the independent variable
• the specific things you set out to quantitatively
measure / observe
Dependent variables
• In the menu experiment:
– time to select an item
– selection errors made
– time to learn to use it to proficiency
• In the toothpaste experiment:
– number of cavities
– frequency of brushing
– preference
Design the experiment (cont.)
• Decide on within vs. between subject.
– “Within”: 1 group experiences all treatments.
• In random order.
• “Within” is best, if possible. (Why?)
– “Between”: different group for each treatment.
Between-Groups Design
• Wilma and Betty use
one interface
• Dino and Fred use the
other
Within-Groups Design
Everyone uses both interfaces
Between-Groups vs. Within-Groups
• Within groups design
– Pros:
• Is more powerful statistically (can compare the
same person across different conditions, thus
isolating effects of individual differences)
• Requires fewer participants than between-groups
– Cons:
• Learning effects (can be reduced by randomizing
the order of treatments)
• Fatigue effects
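The "random order" countermeasure against learning effects can be sketched in a few lines. This assumes simple per-participant shuffling (a Latin square would balance order more systematically); names and treatment labels echo the slides:

```python
import random

def assign_orders(participants, treatments, seed=42):
    """Give each participant an independently shuffled treatment order,
    so learning effects don't systematically favor one treatment."""
    rng = random.Random(seed)
    orders = {}
    for person in participants:
        order = list(treatments)
        rng.shuffle(order)
        orders[person] = order
    return orders

# Within-groups: everyone experiences both interfaces, in varied order.
orders = assign_orders(["Wilma", "Betty", "Dino", "Fred"],
                       ["interface A", "interface B"])
for person, order in orders.items():
    print(person, "->", order)
```

Every participant still sees every treatment; only the order varies, which is what makes the within-groups comparison fair.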
Design the experiment (cont.)
• How many subjects?
– Rule of thumb: 30/treatment.
– More subjects → more statistical power →
more likely to get p<=0.05 if there really is a
difference.
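The subjects-to-power relationship can be demonstrated by simulation. This sketch assumes normally distributed measurements with a known spread and uses a simple two-sample z-test; the effect size (0.8 standard deviations) and group sizes are chosen only to illustrate the trend:

```python
import math
import random

def z_test_p(group_a, group_b, sigma):
    """Two-sided two-sample z-test p-value, assuming known std dev sigma."""
    n_a, n_b = len(group_a), len(group_b)
    diff = sum(group_a) / n_a - sum(group_b) / n_b
    se = sigma * math.sqrt(1 / n_a + 1 / n_b)
    z = abs(diff) / se
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - normal CDF of |z|)

def estimated_power(n, true_diff, sigma=1.0, trials=2000, seed=1):
    """Fraction of simulated experiments (with a real difference of
    true_diff) that reach p <= 0.05 with n subjects per treatment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, sigma) for _ in range(n)]
        b = [rng.gauss(true_diff, sigma) for _ in range(n)]
        if z_test_p(a, b, sigma) <= 0.05:
            hits += 1
    return hits / trials

power_5 = estimated_power(n=5, true_diff=0.8)
power_30 = estimated_power(n=30, true_diff=0.8)
print(f"n=5  per treatment: power ~ {power_5:.2f}")
print(f"n=30 per treatment: power ~ {power_30:.2f}")
```

With the same real difference present, 30 subjects per treatment detect it far more often than 5 do, which is the rationale behind the 30-per-treatment rule of thumb.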
StratCell [Grigoreanu et al. ‘10]
• A spreadsheet
system with
debugging aid
• Target users:
spreadsheet users
• Control - 11, Treatment - 5
• 65 males, 67 females; 60+
users each treatment group
Design the experiment (cont.)
• Design the task they will do.
– Since you usually run a lot of these at once
and you’re comparing them, you need to be
careful with length.
• Long enough to get over learning curve.
• Big enough to be convincing.
• Small enough to be do-able in the amount of time
subjects have to spend with you.
– Vary the order if multiple tasks.
Design the experiment (cont.)
• Develop the tutorial.
– Practice it like crazy! (Must work the same for
everyone!)
– Example (see mashup study tutorial)
• Plan the data to gather.
– Log files?
– Questionnaires before/after?
– Saved result files at end?
Design the experiment (cont.)
• Water in the beer:
– Sources of uncontrolled variation spoil your
statistical power.
• Sources:
– Too much variation in subject background.
– Not a good enough tutorial.
– Task not a good match for what you wanted
to find out.
– Etc.
• Result: no significant difference.
Finally, Analyze the Data
• Choose an appropriate statistical test.
– There are entire courses on this, e.g., ST516
Methods of Analysis
• Run it
– using stats software packages, e.g., R, SPSS,
Excel.
• Hope for p<=0.05 (5% chance of being wrong)
• Summary:
– Statistical studies are a lot of work, too much
to do in this class!
– Right choice for answering X>Y questions.
Statistical vs practical
significance
• When n (sample size) is large, even a trivial
difference may show up as a statistically
significant result
– eg menu choice:
mean selection time of menu a is 3.00 seconds;
menu b is 3.05 seconds
• Statistical significance does not imply that the
difference is important!
– a matter of interpretation
– statistical significance often abused and used
to misinform
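The menu example above can be worked through numerically. This sketch reuses a simple two-sample z-test on summary statistics; the assumed spread (sigma = 0.5 s) and group sizes are illustrative assumptions, not from the slides:

```python
import math

def z_test_p(mean_a, mean_b, sigma, n):
    """Two-sided p-value for a difference of two group means,
    n subjects per group, assuming a known std dev sigma."""
    se = sigma * math.sqrt(2 / n)
    z = abs(mean_a - mean_b) / se
    return math.erfc(z / math.sqrt(2))  # 2 * (1 - normal CDF of |z|)

# The lecture's menu example: 3.00 s vs 3.05 s mean selection time.
p_small = z_test_p(3.00, 3.05, sigma=0.5, n=20)
p_large = z_test_p(3.00, 3.05, sigma=0.5, n=10_000)
print(f"n=20 per group:     p = {p_small:.3f}")
print(f"n=10,000 per group: p = {p_large:.2e}")
```

The same 50 ms difference is nowhere near significant with 20 users per group but wildly "significant" with 10,000, even though 50 ms is unlikely to matter to anyone. Statistical significance measures detectability, not importance.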
The rest of the course
• Continue to work on your project… (prototype in
CogTool, eval plan, final prototype)
• Introduction to various projects/research may
include but not limited to:
– Foundations and strategies (e.g., surprise-explain-reward, interruption, information foraging)
– Gender issues in software environments
– Studies of designers
– Usability engineering for programmers
– Designing for special populations, e.g., seniors,
amnesia, …
• Extra credit opportunities
– Presenting papers from the above areas