Lecture 19: Chapter 9, Evaluation Techniques (part 2)

Evaluating Implementations
• Requires an artefact: a simulation, a prototype, or a full implementation

Experimental evaluation
• controlled evaluation of specific aspects of interactive behaviour
• evaluator chooses a hypothesis to be tested
• a number of experimental conditions are considered which differ only in the value of some controlled variable
• changes in the behavioural measure are attributed to the different conditions

Experimental factors
• Subjects (i.e., the users, aka ‘participants’)
  – who: a representative, sufficient sample
    • not the programmer’s friend, boss, etc.
    • huge variability in the performance of individuals
• Variables – things to modify and measure
• Hypothesis – what you’d like to show
• Experimental design – how you are going to show it
  – includes the ‘protocol’ – what the subjects do

Variables
• independent variable (IV): the characteristic changed to produce the different conditions, e.g. interface style, number of menu items
• dependent variable (DV): the characteristic measured in the experiment, e.g. time taken, number of errors

Hypothesis
• prediction of outcome, framed in terms of the IV and DV, e.g. “error rate will increase as font size decreases”
• null hypothesis: states that there is no difference between conditions; the aim is to disprove this, e.g. null hyp. = “no change with font size”

Experimental design
• “within groups” design (also called “repeated measures”)
  – each subject performs the experiment under each condition
  – transfer of learning is possible (practice makes performance better; alternatively, fatigue or boredom makes it worse)
  – less costly and less likely to suffer from user variation (each user is compared with themselves)
• “between groups” design
  – each subject performs under only one condition
  – no transfer of learning
  – more users required

Within v. Between
• Consider a test of the effect of beer v. vodka martinis on reaction time
  – Null hypothesis: no difference in the increase in reaction time between the two beverages
• Design 1:
  – 30 people try beer; 30 other people try vodka
  – DV is the change in reaction time pre- v. post-drinking
  – Not bad: be sure to randomize who goes into the beer group v. the vodka group
  – But the ‘power’ of the experiment will be reduced by the great variability between individuals in their reaction to alcohol

Within v. Between (contd.)
• Design 2:
  – All 60 people first try beer, then immediately try vodka
  – Problem of carryover effect
• Better design:
  – All 60 try beer, then a week later try vodka
  – Now each individual is compared with themselves
  – Still a possible ordering effect (e.g., they might get a little better at the reaction-time test)
• Best design:
  – 30 try beer, then a week later vodka; 30 try vodka, then a week later beer

Analysis of data
• Before you start to do any statistics:
  – look at the data (e.g. the average is 5.25, but 4.9 without an outlier)
  – save the original data
• Choice of statistical technique depends on
  – the type of data
  – the information required
• Type of data
  – discrete: a finite number of values; may be ordered or unordered (e.g., colours)
  – continuous: any value
[Chart: example data values]

Analysis – types of test
• parametric
  – assume a normal distribution
  – robust
  – powerful
• non-parametric
  – do not assume a normal distribution
  – less powerful
  – more reliable
• contingency table
  – classify data by discrete attributes
  – count the number of data items in each group
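To make the parametric v. non-parametric distinction concrete, here is a minimal sketch in Python; the completion-time data, condition names, and the choice of scipy are illustrative assumptions, not part of the lecture. It compares two interface conditions with a parametric independent-samples t-test and a non-parametric Mann-Whitney U test.

```python
# Minimal sketch: parametric v. non-parametric comparison of two conditions.
# The completion-time data below are invented purely for illustration.
from scipy import stats

# Task completion times (seconds) for two interface conditions (between groups).
menu_a = [12.1, 14.3, 11.8, 13.5, 15.0, 12.9]
menu_b = [15.2, 16.8, 14.9, 17.1, 15.5, 16.0]

# Parametric: independent-samples t-test (assumes roughly normal data).
t_stat, t_p = stats.ttest_ind(menu_a, menu_b)

# Non-parametric: Mann-Whitney U test (no normality assumption, less powerful).
u_stat, u_p = stats.mannwhitneyu(menu_a, menu_b, alternative="two-sided")

print(f"t-test:        t = {t_stat:.2f}, p = {t_p:.4f}")
print(f"Mann-Whitney:  U = {u_stat:.1f}, p = {u_p:.4f}")
```

Both tests address the first question on the next slide (is there a difference?): if p < 0.05 the null hypothesis of no difference is conventionally rejected.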
Analysis of data (cont.)
• What information is required?
  – is there a difference?
  – how big is the difference?
  – how accurate is the estimate?
• Parametric and non-parametric tests mainly address the first of these

ANOVA – analysis of variance
• Quite easy to test whether there is a significant difference between groups in Excel
  – Need to invoke Tools/Add-ins/Analysis ToolPak to enable it
  – Then just apply Tools/Data Analysis/ANOVA: Single Factor to the data

ANOVA from Excel
Say we have three columns of numbers representing the time to complete a task for 5, 5 and 7 users using three variations of an interface. ANOVA: Single Factor produces the output below (a Python equivalent is sketched after the parametric-assumptions slide).

SUMMARY
Groups     Count   Sum   Average    Variance
Group 1      5      82   16.4       9.3
Group 2      5      67   13.4       10.3
Group 3      7      79   11.28571   3.904762

ANOVA
Source of Variation   SS          df   MS         F          P-value    F crit
Between Groups        76.28908     2   38.14454   5.244339   0.019959   3.738892
Within Groups         101.8286    14   7.273469
Total                 178.1176    16

If the P-value < 0.05 then we usually say the result is ‘significant’ (the result is more than expected chance variation).

When is a difference a difference?
• In the world of parametric stats, we look for a statistic to be large enough to be ‘significant’
  – On the Gaussian (‘normal’) curve, |Z| = 1.96 leaves 95% of the area of the curve behind, so it is a common ‘critical value’ for claiming significance

Parametric assumptions
• Parametric statistics assume that some mathematically elegant assumptions hold true for the data
  – E.g., ANOVA (and standard ‘regression’) assumes, among other things, normally distributed random error
  – Trivia: the mathematical form of the probability density function for the normal distribution is remarkably formidable: f(x) = (1 / (s·√(2π))) · exp(−(x − m)² / (2s²))
    • It centres on the mean, m, and is flattened by the standard deviation, s
• A Galton machine simulates the normal distribution (aka the ‘bell curve’)
• The exponential distribution models the time between events happening at a constant average rate
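The single-factor ANOVA shown in the Excel slide above can also be run in a few lines of Python. This is a minimal sketch using scipy.stats.f_oneway; the task times are invented for illustration and are not the data behind the Excel output.

```python
# Minimal sketch: single-factor (one-way) ANOVA, analogous to Excel's
# "ANOVA: Single Factor". The task times below are invented for illustration.
from scipy import stats

interface_1 = [18, 15, 17, 14, 18]          # times for 5 users
interface_2 = [13, 12, 15, 14, 13]          # times for 5 users
interface_3 = [11, 12, 10, 13, 11, 12, 10]  # times for 7 users

f_stat, p_value = stats.f_oneway(interface_1, interface_2, interface_3)

print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Significant: at least one interface differs from the others.")
else:
    print("No significant difference detected between the interfaces.")
```

As with the Excel output, a P-value below 0.05 is conventionally read as a significant difference somewhere among the groups; ANOVA alone does not say which pair of groups differs.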
So, what to measure?
• Usually one (or several) of these things:
  – Speed / efficiency
    • How many (tasks, or whatever) per unit time can the user process with this interface?
  – Accuracy (errors)
  – Learnability (the time to acquire the ability to do something in particular with the interface)
    • And retention (how well they can still do it some particular time later)
  – Satisfaction
    • Subjective assessment: how does the user feel about the interface?

Subjective assessment
• Very important
  – If the user seriously doesn’t like it, there is probably something really wrong with the design (even if they can’t articulate what)
• Quantifiable
  – Yes/No or, better, a Likert scale (ratings), gathered through an interview or a questionnaire
• Need to ask the right questions
  – E.g., don’t use leading questions (BAD: “Is this the absolute worst system you have ever used?”)
• And need to ask the questions well (so the user reliably expresses what they mean)
  – E.g., don’t trip them up with double negatives

Likert Scale
• Can have from 4 to 7 “points”
• Usually about agreement with a phrase
  – E.g., “I found the search function easy to use”
  – Strongly Agree, Agree, Neutral (optional), Disagree, Strongly Disagree
• May also be about importance
  – E.g., “A site search function is…”, from “not very important” to “extremely important”
• Or a general assessment
  – E.g., “The performance of the search function was…”: Poor, Fair, Satisfactory, Good, Excellent
• Great to ask open-ended questions, too
  – E.g., “What was the best aspect of the search function?”
  – But it’s the Likert-scale data that you can quantify (see the sketch after the response-rates slide below)

Dimensions and validity
• When designing a questionnaire…
  – Have in mind a few underlying issues that you are trying to assess
  – Ask a few different questions that are coordinated around each issue
  – Ask in different ways: vary whether a positive or a negative answer favours the issue in question
  – Ideally, verify the questionnaire by having people role-play particularly happy, particularly angry, and middle-of-the-road users
    • And see if they answer the questions the way you’d expect!

Questionnaire administration
• Face-to-face or telephone interviews
  – Especially efficient (and unbiased) to have a third party conduct the telephone survey
• Mail-out (including email) questionnaire
• Web-based questionnaire (maybe email out the URL)
• Even a questionnaire is an experiment on humans from the point of view of research ethics
  – Easier to achieve ethical questionnaire administration if it is truly anonymous

Response rates and bias
• A low response rate is a problem
  – Below a 50% response rate, one wonders whether the respondents were exceptional (the happiest, the angriest, or a mix, but not the “normal” folks)
  – Better if you have some authority to motivate a response
    • But back to issues of ethics: e.g., it is not truly anonymous if you know whom to nag about non-response, and the “pressure” may be unfair
• Similar bias problems arise when you use volunteers for any experiment
  – Are these volunteers representative of your “normal” users?
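As the Likert-scale slide above noted, it is the closed-ended ratings that you can quantify. The following is a minimal sketch of that idea; the question text, responses, and 5-point coding are invented for illustration. It treats the scale as ordinal and reports the full distribution and the median rather than only a mean.

```python
# Minimal sketch: quantifying Likert-scale responses for one questionnaire item.
# The question and the responses are invented for illustration.
from collections import Counter
from statistics import median

SCALE = {"Strongly Disagree": 1, "Disagree": 2, "Neutral": 3,
         "Agree": 4, "Strongly Agree": 5}

# Responses to "I found the search function easy to use"
responses = ["Agree", "Strongly Agree", "Neutral", "Agree",
             "Disagree", "Agree", "Strongly Agree", "Agree"]

scores = [SCALE[r] for r in responses]

print("Counts:", Counter(responses))                     # full distribution
print("Median:", median(scores))                         # ordinal-friendly summary
print("Mean:  ", round(sum(scores) / len(scores), 2))    # common, but treats the scale as interval
```

Reporting the median (or the whole distribution) is the safer summary, because Likert points are ordered categories rather than evenly spaced numbers.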
Experimental studies on groups
• More difficult than single-user experiments
• Problems with:
  – subject groups
  – choice of task
  – data gathering
  – analysis
• Unfortunately (in terms of experimental requirements), a lot of the things that are interesting in the real world involve computers mediating group behaviour

Subject groups
• larger number of subjects
• more expensive
• longer time to ‘settle down’
• … even more variation!
• difficult to timetable
• so … often only three or four groups

Groups (contd.): Data gathering
• several video cameras + direct logging of the application
• problems:
  – synchronisation
  – sheer volume!
• one solution: record from each perspective

Groups (contd.): Analysis
• N.B. vast variation between groups
• solutions:
  – ‘within groups’ experiments (each group works under various conditions)
  – micro-analysis (e.g., gaps in speech)
  – anecdotal and qualitative analysis
• controlled experiments may ‘waste’ resources!
  – Experiments can be dominated by the dynamics of group formation
  – Field studies are apt to be more realistic

About statistics
• It’s an amazingly complex field
  – There are a lot of hidden complexities in running experiments and in saying that the observed differences really make a difference
• ‘Threats to validity’ are the things that make it possible that your experimental conclusion is in error
  – Threats to internal validity: e.g. carryover effects, or lack of randomization
  – Threats to external validity: e.g. your whole population of subjects was unusual in some way, or the task was not representative of real use of the tool
• When the outcomes are serious (e.g., medical trials), professional statisticians are always used in the design of the experiment as well as in the analysis and reporting of the findings
• Plenty of texts and courses on stats are available (Wikipedia is pretty good on these topics, too, e.g. for ANOVA)
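One of the internal-validity threats just mentioned, lack of randomization, is cheap to guard against. Here is a minimal sketch that randomly assigns participants to the two counterbalanced orders used in the “best design” of the beer v. vodka example; the participant IDs, condition names, and seed are invented for illustration.

```python
# Minimal sketch: random assignment to counterbalanced condition orders,
# guarding against the ordering/carryover threats to internal validity.
# Participant IDs and condition names are invented for illustration.
import random

participants = [f"P{i:02d}" for i in range(1, 61)]   # 60 hypothetical participants
orders = [("beer", "vodka"), ("vodka", "beer")]      # the two counterbalanced orders

random.seed(42)               # fixed seed so the assignment can be reproduced
random.shuffle(participants)  # random order of participants

assignment = {}
for i, pid in enumerate(participants):
    assignment[pid] = orders[i % 2]   # alternate orders: 30 participants per order

for pid in sorted(assignment)[:3]:    # show a few example assignments
    first, second = assignment[pid]
    print(f"{pid}: {first} first, then {second} a week later")
```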