Lecture 19 Evaluation techniques 2

advertisement
Lecture 19
chapter 9
evaluation techniques
(part 2)
1
Evaluating Implementations
Requires an artefact:
a simulation, a prototype, or a
full implementation
2
Experimental evaluation
• controlled evaluation of specific aspects of
interactive behaviour
• evaluator chooses hypothesis to be tested
• a number of experimental conditions are
considered which differ only in the value of
some controlled variable.
• changes in behavioural measure are attributed
to different conditions
3
Experimental factors
• Subjects (i.e., the users, aka ‘participants’)
– who – representative, sufficient sample
• not the programmer’s friend, boss, etc.
• huge variability in performance of individuals
• Variables
– things to modify and measure
• Hypothesis
– what you’d like to show
• Experimental design
– how you are going to show it
– Includes ‘Protocol’ – what the subjects do
4
Variables
• independent variable (IV)
characteristic changed to produce different
conditions
e.g. interface style, number of menu items
• dependent variable (DV)
characteristics measured in the experiment
e.g. time taken, number of errors.
5
Hypothesis
• prediction of outcome
– framed in terms of IV and DV
e.g. “error rate will increase as font size decreases”
• null hypothesis:
– states no difference between conditions
– aim is to disprove this
e.g. null hyp. = “no change with font size”
6
Experimental design
• “within groups” design (also called “repeated
measures”)
– each subject performs experiment under each
condition
– transfer of learning possible (practice makes
performance better; or alternatively fatigue or
boredom makes it worse)
– less costly and less likely to suffer from user
variation (each user is compared to themselves)
• between groups design
–
–
–
each subject performs under only one condition
no transfer of learning
more users required
7
Within v. Between
• Consider a test on the difference of beer v.
vodka martinis on reaction time
– Null hypothesis – no difference in increase in
reaction time between the two beverages
• Design 1:
– 30 people try beer; 30 other people try vodka – D.V.
is change in reaction time pre- v. post drinking
• Not bad – be sure to randomize who goes into beer
group v. vodka group
• But ‘power’ of the experiment will be reduced due to
the great variability of individuals in reaction to alcohol
8
Within v. Between (contd.)
• Design 2:
– All 60 people first try beer, then immediately try
vodka
• Problem of carryover effect
• Better Design:
– All 60 try beer, then a week later try vodka
• Now each individual is compared with themselves
• Still possible problem of ordering effect (e.g., they
might get a little better at the reaction time test)
• Best Design:
– 30 try beer, then a week later vodka; 30 try vodka
and then a week later beer
9
Analysis of data
• Before you start to do any statistics:
– look at data (e.g. average=5.25 – but 4.9 without outlier)
– save original data
• Choice of statistical technique depends on
– type of data
– information required
• Type of data
– discrete
• finite number of values
• may be ordered, or
unordered (e.g., colors)
– continuous
• any value
14
12
10
8
6
4
2
10
0
1
2
3
4
5
6
7
8
9 10 11 12 13 14 15 16 17 18 19 20
Analysis - types of test
• parametric
– assume normal distribution
– robust
– powerful
• non-parametric
– do not assume normal distribution
– less powerful
– more reliable
• contingency table
– classify data by discrete attributes
– count number of data items in each group
11
Analysis of data (cont.)
• What information is required?
– is there a difference?
– how big is the difference?
– how accurate is the estimate?
• Parametric and non-parametric tests
mainly address first of these
12
ANOVA – analysis of variance
• Quite easy to test whether there’s a
significant difference between groups in
Excel
– Need to invoke
Tools/Add-ins/Analysis Toolpack to enable
– Then just apply Tools/Data Analysis/ANOVA:
Single Factor to the data
13
ANOVA from Excel
Say we have three columns of
numbers representing the time to
complete a task for 5, 5 and 7
users using three variations of an
Anova: Single Factor
interface
SUMMARY
Groups
Group 1
Group 2
Group 3
Count
than expected chance variation)
Sum
5
5
7
ANOVA
Source of Variation
SS
Between Groups
76.28908
Within Groups
101.8286
Total
If P-value < 0.05 then
we usually say the result
is ‘significant’ (result is more
178.1176
df
Average Variance
82
16.4
9.3
67
13.4
10.3
79 11.28571 3.904762
MS
F
P-value
F crit
2 38.14454 5.244339 0.019959 3.738892
14 7.273469
16
14
When is a difference a
difference?
• In the world of parametric stats, we
look for a statistic to be large enough to
be ‘significant’
– On the Gaussian (‘normal’) curve a
|Z|=1.96 leaves 95% of the area of the
curve behind so is a
common ‘critical
value’ for
claiming
significance
15
Parametric assumptions
• Parametric statistics assume that
some mathematically elegant
assumptions hold true for the data
– E.g., ANOVA (and standard
‘regression’) assume, among other
things, normally distributed random
error
– Trivia: The mathematical form of
the probability density function
for the normal distribution is
remarkably formidable
• Centres on mean, m, and is
flattened by standard deviation, s
Galton machine simulates normal
distribution (aka ‘bell curve’)
Exponential distribution models
time between events happening
16
with a constant average rate
So, what to measure?
• Usually one (or several) of these things:
– Speed / efficiency
• How many (or whatever) per unit time can the user
process with this interface?
– Accuracy (errors)
– Learnability (time to acquire the ability to do
something in particular with the interface)
• And retention (how well they can do it some particular
time later)
– Satisfaction
• Subjective assessment – how does the user feel about
the interface
17
Subjective assessment
• Very important
– If the user seriously doesn’t like it, probably there’s
something really wrong with the design (just maybe
they can’t articulate what)
• Quantifiable
– Yes/No, or better, Likert scale (ratings) through
interview or questionnaire
• Need to ask the right questions
– E.g., don’t have leading questions (BAD: “Is this the
absolute worst system you have used ever?”)
• And need to ask the questions well (so user
reliably expresses what they mean)
– E.g., don’t trip them up with double negatives
18
Likert Scale
• Can be from 4 to 7 “points”
• Usually about agreement to a phrase
– E.g., “I found the search function easy to use”
– Strongly Agree, Agree, Neutral (optional), Disagree,
Strongly Disagree
• May also be about importance
– E.g., “A site search function is…”
– “not very important” to “extremely important”
• Or a general assessment
– E.g., “The performance of the search function was…”
– Poor, Fair, Satisfactory, Good, Excellent
• Great to ask open-ended questions, too
– E.g., “What was the best aspect of the search function?”
– But it’s the Likert scale data that you can quantify
19
Dimensions and validity
• When designing a questionnaire…
– Have in mind a few underlying issues that you are
trying to assess
– Ask a few different questions that are coordinated
around each issue
– Ask different ways – vary whether positive or
negative favours the issue in question
– Ideally, verify the questionnaire by having people
role-play particularly happy or angry, and middle-ofthe-road users
• And see if they answer the questions the way you’d
expect!
20
Questionnaire administration
• Face-to-face or telephone interviews
– Esp. efficient (and unbiased) to have third party give
the telephone survey
• Mail-out (including email) questionnaire
• Web-based questionnaire (maybe email out
URL)
• Even a questionnaire is an experiment on
humans from the point of view of research
ethics
– Easier to achieve an ethical questionnaire
administration if it’s truly anonymous
21
Response rates and bias
• Low response rate is a problem
– Below 50% response rate, one wonders whether
respondents were exceptional (happiest, angriest or
a mix; but not the “normal” folks)
– Better if you have some authority to motivate a
response
• But back to issues of ethics – e.g., not truly anonymous
if you know who to nag about non-response; and the
“pressure” may be unfair
• Similar bias problems when you use
volunteers for any experiment
– Are these volunteers representative of your “normal”
users?
22
Experimental studies on groups
More difficult than single-user experiments
Problems with:
–
–
–
–
subject groups
choice of task
data gathering
Analysis
• Unfortunately (in terms of experimental
requirements) a lot of things that are interesting
in the real world, involve computers mediating
group behaviour
23
Subject groups
larger number of subjects
 more expensive
longer time to `settle down’
… even more variation!
difficult to timetable
so … often only three or four groups
24
Groups (contd.): Data gathering
several video cameras
+ direct logging of application
problems:
– synchronisation
– sheer volume!
one solution:
– record from each perspective
25
Groups (contd.): Analysis
N.B. vast variation between groups
solutions:
– ‘within groups’ experiments (each group works under
various conditions)
– micro-analysis (e.g., gaps in speech)
– anecdotal and qualitative analysis
controlled experiments may `waste' resources!
– Experiments dominated by dynamics of group formation
– Field studies are apt to be more realistic
26
About statistics
• It’s an amazingly complex field
– A lot of hidden complexities in running experiments and
saying that the observed differences really make a
difference
• ‘threats to validity’ – are those things that make it possible
that your experimental conclusion is in error
– Threats to internal validity: like carryover effects, or lack of
randomization
– Threats to external validity: like that your whole population of
subjects were unusual in some way, or the task was not
representative of real use of the tool
– When the outcomes are serious (e.g., medical trials)
professional statisticians are always used in design of the
experiment as well as analysis and reporting of the
findings
– Plenty of texts and courses on stats available (the
Wikipedia is pretty good on this topics, too – e.g., for
ANOVA)
27
Download