Evaluation (cont.): Empirical Studies

CS352 Announcements
• Where we are in PRICPE:
  – Predispositions: Did this in Project Proposal.
  – RI: Research was studying users. Hopefully led to Insights.
  – CP: Concept and initial (very low-fi) Prototypes due Fri 7/16 at midnight.
  – Evaluate throughout, repeat iteratively!!

Evaluation
• Analytical – based on your head
• Empirical – based on data
• Advantages/disadvantages of empirical:
  – More expensive (time, $) to do.
  + Greater accuracy in what users REALLY do.
  + You'll get more surprises.
  + Greater credibility with your bosses.

Empirical Work with Humans
• What do you want from it?
  – A list of problems:
    • Usability study (e.g., 5 users in a lab).
  – A list of behaviors/strategies/needs:
    • Think-aloud study or field observation.
  – A comparison with a boolean outcome (e.g., A > B):
    • Statistical study.
• Note:
  – Impossible to "prove" no problems exist.
    • Can only find problems.

The Usability Study
• Returns a list of UI problems.
• Metrics:
  – Time to complete task.
  – Errors made.
  – Difficulty to use (via questionnaire).
  – Emotional response, e.g., stressed out, discouraged, fun, enjoyable…
• Pick a task and user profile.
• Users do the task with your prototype.
  – In here: paper OR CogTool.

Examples
• A Xerox Palo Alto Research Center (PARC) employee wrote that PARC used extensive usability testing in creating the Xerox Star, introduced in 1981.[2] Only about 25,000 were sold, leading many to consider the Xerox Star a commercial failure.
• The book Inside Intuit (page 22) says of 1984: "... in the first instance of the Usability Testing that later became standard industry practice, LeFevre recruited people off the streets... and timed their Kwik-Chek (Quicken) usage with a stopwatch. After every test... programmers worked to improve the program."[1] Scott Cook, Intuit co-founder, said, "... we did usability testing in 1984, five years before anyone else... there's a very big difference between doing it and having marketing people doing it as part of their... design... a very big difference between doing it and having it be the core of what engineers focus on."[3]

Usability Study: How
• How many: 5 users find 60-80% of the problems.
• How:
  – Be organized!! Have everything ready, ...
  – Test it first.
  – Users do the task (one user at a time).
• Data IS the result (no stats needed).
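The "5 users find 60-80%" rule of thumb falls out of a simple problem-discovery model often credited to Nielsen and Landauer: if each user independently uncovers some fixed fraction λ of the problems, then n users are expected to uncover 1 − (1 − λ)^n of them. Below is a minimal sketch in Python; the default λ of 0.31 is the commonly cited average, but real values vary a lot by product and task (λ between roughly 0.17 and 0.28 reproduces the 60-80% figure for 5 users).

```python
def fraction_found(n_users: int, discovery_rate: float = 0.31) -> float:
    """Expected fraction of usability problems found by n_users,
    assuming each user independently finds a fixed fraction
    (discovery_rate) of all the problems that exist."""
    return 1 - (1 - discovery_rate) ** n_users

for n in (1, 3, 5, 10):
    print(f"{n:2d} users -> {fraction_found(n):.0%} of problems")
```

Note the diminishing returns: going from 5 to 10 users roughly doubles the cost but adds comparatively few newly discovered problems, which is why several short rounds with ~5 users each tend to beat one big study.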
Think-Aloud
• Usually most helpful with a working prototype or system,
  – but you might be able to get some use out of it for early prototypes.
• Returns a list of behaviors/strategies/impressions/thoughts...
• Pick a task and user profile.
• Users do the task with your prototype.

Think-Aloud: How
• How many: 5-10 users is usual.
  – Data analysis is time-consuming.
  – In here: 1-2 max!
• How:
  – Be organized!! Have everything ready, ...
  – Test it first.
  – Users do the task (one user at a time).
• Analyze the data for patterns, surprises, etc.
  – No stats: not enough subjects for that.

Think-Aloud: How (cont.)
• Think-aloud training (in class).
• Sample think-aloud results:
  – From VL/HCC'03 (Prabhakararao et al.).

Announcement
• Quiz Friday on Evaluating.
• Concepts and Prototypes due Mon at 11:59pm.
• Midterm on Tuesday.

Statistical Studies
• We will not do these in this class, but
  – you need to know some basics.
• Goal: answer a binary question.
  – e.g.: does system X help users create animations?
  – e.g.: are people better debuggers using X than using Y?
• Advantage: your audience believes it.
• Disadvantage: you might not find out enough about "why or why not".

Hypothesis Testing
• Hypotheses need to be specific, and provable/refutable.
  – e.g.: "users will debug better using system X than using system Y"
  – (Strictly speaking we test the "null" hypothesis, which says there won't be a difference, but that's a fine point...)
  – Pick a significance value (rule of thumb: 0.05).
    • If you get a p-value <= 0.05, you've shown a significant difference, but there's up to a 5% chance that the difference is a fluke.

Lucid and testable hypothesis
• State a lucid, testable hypothesis:
  – This is a precise problem statement.
• Example 1: "There is no difference in the number of cavities in children and teenagers using Crest and No-teeth toothpaste when brushing daily over a one-month period."

Lucid and testable hypothesis
• Example 2: "There is no difference in user performance (time and error rate) when selecting a single item from a pop-up or a pull-down menu of 4 items, regardless of the subject's previous expertise in using a mouse or using the different menu types."
[Figure: a pull-down menu vs. a pop-up menu, each with a File/Edit/View/Insert menu bar and the items New, Open, Close, Save]

Design the Experiment
• Identify independent variables ("treatments") we'll manipulate:
  – e.g.: which system they use, X vs. Y?
• Identify outputs (dependent variables) for the hypotheses:
  – e.g.: more bugs fixed?
  – e.g.: fewer minutes to fix the same number of bugs?
  – e.g.: less time to check out?

Independent variables
• The hypothesis includes the independent variables that are to be altered:
  – the things you manipulate independent of a subject's behavior;
  – they determine a modification to the conditions the subjects undergo;
  – they may arise from subjects being classified into different groups.

Independent variables
• In the toothpaste experiment:
  – toothpaste type: uses Crest or No-teeth toothpaste
  – age: <= 11 years or > 11 years
• In the menu experiment:
  – menu type: pop-up or pull-down
  – menu length: 3, 6, 9, 12, 15
  – subject type: expert or novice

Dependent variables
• The hypothesis includes the dependent variables that will be measured:
  – variables dependent on the subject's behavior / reaction to the independent variable;
  – the specific things you set out to quantitatively measure / observe.

Dependent variables
• In the menu experiment:
  – time to select an item
  – selection errors made
  – time to learn to use it to proficiency
• In the toothpaste experiment:
  – number of cavities
  – frequency of brushing
  – preference

Design the experiment (cont.)
• Decide on within- vs. between-subject design.
  – "Within": 1 group experiences all treatments.
    • In random order (see the sketch below).
    • "Within" is best, if possible. (Why?)
  – "Between": a different group for each treatment.

Between-Groups Design
• Wilma and Betty use one interface.
• Dino and Fred use the other.

Within-Groups Design
• Everyone uses both interfaces.
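Here is what "in random order" can look like in practice: a minimal counterbalancing sketch in Python, reusing the participant names from the slides above with hypothetical condition labels. Each participant gets every condition, in an independently shuffled order, so learning and fatigue effects are spread across conditions instead of piling up on whichever one always comes last.

```python
import random

def assign_orders(participants, conditions, seed=42):
    """Give each participant all conditions in a shuffled order
    (simple randomized counterbalancing for a within-groups design)."""
    rng = random.Random(seed)  # fixed seed: the assignment is reproducible
    return {p: rng.sample(conditions, k=len(conditions)) for p in participants}

orders = assign_orders(["Wilma", "Betty", "Dino", "Fred"],
                       ["interface A", "interface B"])
for person, order in orders.items():
    print(f"{person}: {' then '.join(order)}")
```

With only two conditions you can balance exactly instead of randomly: give half the participants A-then-B and the other half B-then-A (a 2x2 Latin square). Shuffling is the general-purpose version.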
Between-Groups vs. Within-Groups
• Within-groups design:
  – Pros:
    • More powerful statistically (you can compare the same person across conditions, isolating the effects of individual differences).
    • Requires fewer participants than between-groups.
  – Cons:
    • Learning effects (can be reduced by randomizing the order of treatments).
    • Fatigue effects.

Design the experiment (cont.)
• How many subjects?
  – Rule of thumb: 30 per treatment.
  – More subjects → more statistical power → more likely to get p <= 0.05 if there really is a difference.

StratCell [Grigoreanu et al. '10]
• A spreadsheet system with a debugging aid.
• Target users: spreadsheet users.
• Control: 11, Treatment: 5.
• 65 males, 67 females; 60+ users in each treatment group.

Design the experiment (cont.)
• Design the task they will do.
  – Since you usually run a lot of these at once and you're comparing them, you need to be careful with length:
    • Long enough to get over the learning curve.
    • Big enough to be convincing.
    • Small enough to be do-able in the amount of time subjects have to spend with you.
  – Vary the order if there are multiple tasks.

Design the experiment (cont.)
• Develop the tutorial.
  – Practice it like crazy! (It must work the same for everyone!)
  – Example: see the mashup study tutorial.
• Plan the data to gather.
  – Log files?
  – Questionnaires before/after?
  – Saved result files at the end?

Designing the experiment (cont.)
• Water in the beer:
  – Sources of uncontrolled variation spoil your statistical power.
• Sources:
  – Too much variation in subject background.
  – Not a good enough tutorial.
  – A task that isn't a good match for what you wanted to find out.
  – Etc.
• Result: no significant difference.

Finally, Analyze the Data
• Choose an appropriate statistical test.
  – There are entire courses on this, e.g., ST516 Method of Analysis.
• Run it using stats software packages, e.g., R, SPSS, Excel.
• Hope for p <= 0.05 (5% chance of being wrong).
• Summary:
  – Statistical studies are a lot of work; too much to do in this class!
  – They are the right choice for answering X > Y questions.

Statistical vs practical significance
• When n (the sample size) is large, even a trivial difference may show up as a statistically significant result.
  – e.g., menu choice: mean selection time of menu A is 3.00 seconds; menu B is 3.05 seconds.
• Statistical significance does not imply that the difference is important!
  – It's a matter of interpretation (see the sketch at the end of these notes).
  – Statistical significance is often abused and used to misinform.

The rest of the course
• Continue to work on your project… (prototype in CogTool, eval plan, final prototype).
• Introduction to various projects/research, including but not limited to:
  – Foundations and strategies (e.g., surprise-explain-reward, interruption, information foraging).
  – Gender issues in software environments.
  – Studies of designers.
  – Usability engineering for programmers.
  – Designing for special populations, e.g., seniors, amnesia, …
• Extra credit opportunities:
  – Presenting papers from the above areas.
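To tie the analysis and interpretation slides together, here is a minimal end-to-end sketch, assuming a between-groups study with made-up per-subject scores (say, bugs fixed) and using Python with scipy as the stats package; R, SPSS, or Excel would do equally well. It reports both the p-value (statistical significance) and Cohen's d (an effect size, one standard way to judge practical significance).

```python
from statistics import mean, stdev
from scipy import stats

# Hypothetical per-subject scores, one list per treatment group.
x = [7, 9, 6, 8, 10, 7, 9, 8, 6, 9]   # group using system X
y = [6, 8, 5, 7, 9, 6, 8, 7, 5, 8]    # group using system Y

t, p = stats.ttest_ind(x, y)  # independent-samples t-test (between-groups)
# A within-groups design would use stats.ttest_rel(x, y) instead.

# Cohen's d: difference of means in units of the pooled standard deviation.
pooled_sd = (((len(x) - 1) * stdev(x) ** 2 + (len(y) - 1) * stdev(y) ** 2)
             / (len(x) + len(y) - 2)) ** 0.5
d = (mean(x) - mean(y)) / pooled_sd

print(f"t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```

A p <= 0.05 paired with a tiny d (rule of thumb: 0.2 is small, 0.5 medium, 0.8 large) is exactly the 3.00-vs-3.05-seconds situation above: a real difference, but probably not one worth acting on.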