HCI460: Week 8 Lecture
October 28, 2009

Outline
– Midterm Review
– How Many Participants Should I Test?
  • Review
  • Exercises
– Stats
  • Review of material covered last week
  • New material
– Project 3
  • Next steps
  • Feedback on the test plans

Midterm Review

Midterm Review: Overall
– Mean / average: 8.55
– Median: 8.75
– Mode: 10 (most frequent score)
[Histogram: Midterm Score (0–10 in 0.5-point bins) vs. Number of Students; N = 44]

Midterm Review Q1: Heuristic vs. Expert Evaluation
Question: What is the main difference between a heuristic evaluation and an expert evaluation?
Answer:
– A heuristic evaluation uses a specific set of guidelines or heuristics.
– An expert evaluation relies on the evaluator’s expertise (including internalized guidelines) and experience.
  • No need to explicitly match issues to specific heuristics.
  • More flexibility.

Midterm Review Q2: Research-Based Guidelines (RBGs)
Question: What is unique about the research-based guidelines on usability.gov relative to heuristics and other guidelines? What are the unique advantages of using the research-based guidelines?
Answer:
– It is a very comprehensive list of very specific guidelines (over 200). Other guideline sets are much smaller, and their guidelines are more general.
– RBGs were created by a group of experts (not an individual).
– RBGs are specific to the web.
– Unlike other heuristics and guidelines, RBGs carry two ratings:
  • Relative importance to the success of a site – helps prioritize issues.
  • Strength of the research evidence that supports the guideline – research citations lend credibility to the guidelines.

Midterm Review Q3: Positive Findings
Question: Why should positive findings be presented in usability reports?
Answer:
– To let stakeholders know what they should not change and which current practices they should try to emulate.
– To make the report sound more objective and make stakeholders more receptive to the findings in the report.
  • Humans are more open to criticism when it is balanced with praise.

Midterm Review Q4: Think-Aloud vs. Retrospective TA
Question: What is the main difference between the think-aloud protocol (TA) and the retrospective think-aloud protocol (RTA)? When should you use each of these methods and why? (Note: RTA ≠ post-task interview.)
Answer:
– TA involves having participants state what they are thinking while they are completing a task.
  • Great for formative studies; helps understand participant actions as they happen.
– RTA is used after the task has been completed in silence. The participant walks through the task one more time (or watches a video of himself/herself performing the task) and explains his/her thoughts and actions.
  • Good when time on task and other quantitative behavioral measures need to be collected in addition to qualitative data.
  • Good for participants who may not be able to do TA.

Midterm Review Q5: Time on Task in Formative UTs
Question: What are the main concerns associated with using time on task in a formative study with 5 participants?
Answer:
– Formative studies often involve the think-aloud protocol.
  • Time on task will be longer because thinking aloud takes more time and changes the workflow.
– The sample size is too small for time on task to generalize to the population or to show significant differences between conditions.

Midterm Review Q6: Human Error
Question: Why is the term “human error” no longer used in the medical field?
Answer:
– “Human error” places the blame on the human when in fact errors usually result from problems with the design.
– The more neutral term “use error” is used instead.

Midterm Review Q7: Side-by-Side Moderation
Question: When would you opt for side-by-side moderation in place of moderation from another room with audio communication? (Note: moderating from another room with audio communication ≠ remote study.)
Answer: Side-by-side moderation is better when:
– Building rapport with the participant is important (e.g., in formative think-aloud studies).
– The moderator has to simulate interaction (e.g., with a paper prototype).
– The tested object or interaction may be difficult to see via a camera feed or through the one-way mirror.

How Many Participants Should I Test? (Review from Last Week)

How Many Participants Should I Test? Overview
How many participants? It depends on the type of test:
– Formative test?
  • Nielsen’s 5 or more? Sauro’s calculator?
– Summative test?
  • Need a directional answer? → descriptive stats.
  • Need a definitive answer? → inferential stats:
    – Precision testing: does the score generalize to the population?
    – Hypothesis testing: is the score for A different from the score for B?
– In either case, consider: How complex is the system? How many tasks? How many distinct user groups? Do these groups use the same areas of the product?
– “Smell check”: How many participants will make stakeholders comfortable?

Sample Size Calculator for Formative Studies
Jeff Sauro’s Sample Size Calculator for Discovering Problems in a User Interface: http://www.measuringusability.com/problem_discovery.php (a sketch of the underlying discovery formula appears at the end of this review).

Sample Size for Precision Testing
We need a sufficient sample size to be able to generalize the results to the population. Sample size for precision testing depends on:
– Confidence level (usually 95% or 99%)
– Desired level of precision, i.e., the acceptable sampling error (e.g., +/- 5%)
– Size of the population to which we want to generalize the results
Free online sample size calculator from Creative Research Systems: http://www.surveysystem.com/sscalc.htm

Sample Size for Precision Testing (at a 95% confidence level):

Population of:   +/- 3%   +/- 5%   +/- 7%
100M              1067      384      196
1M                1066      384      196
100,000           1056      383      196
10,000             964      370      192
1,000              516      278      164
100                 92       80       66

When generalizing a score to the population, a high sample size is needed. However, “the more the better” is not true: beyond the required sample size, additional participants add almost no precision.
– Getting 2,000 participants is a waste. (A sketch of this calculation also appears at the end of this review.)

Sample Size for Hypothesis Testing
Hypothesis testing: comparing means.
– E.g., accuracy of typing on Device A is significantly better than it is on Device B.
– Inferential statistics.
The necessary sample size is derived from a calculation of power:
– Under the assumed criteria, the study will have a good chance of detecting a significant difference if the difference indeed exists.
Sample size depends on:
– Assumed confidence level (e.g., 95%, 99%)
– Acceptable sampling error (e.g., +/- 5%)
– Expected effect size
– Power
– Statistical test (e.g., t-test, correlation, ANOVA)

Hypothesis Testing: Sample Size Table
Cohen, J. (1992). A Power Primer. Psychological Bulletin, 112(1). http://www.math.unm.edu/~schrader/biostat/bio2/Spr06/cohen.pdf
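To make the discovery model behind Sauro’s calculator concrete, here is a minimal sketch of the standard problem-discovery formula 1 - (1 - p)^n. The language (Python) and the p = 0.31 detection rate are assumptions; 0.31 is Nielsen’s often-cited average per-participant detection rate, not a value from these slides.

```python
# Probability that a usability problem is observed at least once in a study,
# given the problem's per-participant detection rate p and sample size n.
# p = 0.31 (Nielsen's average) is an illustrative assumption.

def discovery_rate(p: float, n: int) -> float:
    """Chance that a problem affecting proportion p of users is seen at least once."""
    return 1 - (1 - p) ** n

for n in (3, 5, 8, 10):
    print(f"n = {n:2d}: {discovery_rate(0.31, n):.0%} of p = 0.31 problems found")
```

With n = 5 the formula yields roughly 84%, which is where the familiar “five participants find most problems” rule of thumb comes from.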
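The precision-testing table above can be reproduced with the textbook formula n0 = z² · p(1-p) / e² plus a finite population correction. This sketch assumes the worst-case proportion p = 0.5 and z = 1.96 for 95% confidence, as the Creative Research Systems calculator does; the function name is just for illustration.

```python
import math

def sample_size(margin: float, population: float = math.inf,
                confidence_z: float = 1.96, p: float = 0.5) -> int:
    """Sample size for estimating a proportion within +/- margin.

    Uses n0 = z^2 * p(1-p) / e^2 with a finite population correction;
    p = 0.5 is the worst (most conservative) case.
    """
    n0 = confidence_z ** 2 * p * (1 - p) / margin ** 2
    if math.isinf(population):
        return round(n0)
    return round(n0 / (1 + (n0 - 1) / population))

# Reproduces entries from the table above:
print(sample_size(0.05, 1_000_000))  # 384
print(sample_size(0.03, 10_000))     # 964
print(sample_size(0.05, 100))        # 80
```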
How Many Participants Should I Test? Reality
– Usability tests do not typically require statistical significance.
– Objectives dictate the type of study and the reasonable sample sizes necessary.
– The sample size used is influenced by many factors, not all of them statistically driven.
– Power analysis provides an estimate of the sample size necessary to detect a difference, if it does indeed exist (a worked power calculation appears at the end of this section).
– Risks of not performing a power analysis:
  • Too few participants → low power → inability to detect a difference.
  • Too many participants → waste (and possibly statistically significant differences too small to matter in practice).
– What if you find significance even with a small sample size?
  • It is probably really there (at a certain p level).

How Many Participants Should I Test? Exercises

Exercise 1: Background Information
Package inserts for chemicals used in hospital labs were shortened and standardized to reduce cost.
– E.g., many chemicals x many languages = high translation cost.
New inserts:
– ½ the size of the old inserts (a booklet, not a “map”)
– More concise (charts, bullet points)
– Less redundant
[Photos: old insert vs. new insert]
Users: lab techs in hospitals.

Exercise 1: The Question
Client question:
– Will the new inserts negatively impact user performance?
How many participants do you need for the study and why?
Exercise:
– Discuss in groups and prepare questions for the client.
– Q & A with the client.
– Come up with the answer and be prepared to explain it.

Exercise 1: Possible Method
Each participant was asked to complete 30 search tasks:
– 2 insert versions x 3 chemicals x 5 search tasks.
– Sample question: “How long may the _____ be stored at the refrigeration temperature?” (Answer: for up to 7 days.)
Task instructions were printed on a card and placed in front of the participant. The tasks were timed:
– Participants had a maximum of 1 minute to complete each task.
– Those who exceeded the time limit were asked to move on to the next task.
To indicate that they were finished, participants had to:
– Say the answer to the question out loud.
– Point to the answer in the insert.

Exercise 1: Sample Size
(Revisit the decision tree from the overview: formative vs. summative, descriptive vs. inferential, precision vs. hypothesis testing, plus the stakeholder “smell check.”)
32 lab techs:
– 17 in the US
– 15 in Germany

Exercise 1: Results
The new (short) inserts performed significantly better than the old (long) inserts:

Measure                       New     Old     p
Success rate                  62%     50%     p = .0005
Time on task                  31 s    35 s    p < .05
Ease of use (1–7)             5.6     5.4     p < .01
Overall satisfaction (1–7)    5.9     4.1     p < .0001

Exercise 2
Client question:
– Does our website work for our users?
How many participants do you need for the study and why?
Exercise:
– Discuss in groups and prepare questions for the client.
– Q & A with the client.
– Come up with the answer and be prepared to explain it.
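Before moving on to the stats material, here is the worked power calculation promised above: a minimal sketch using statsmodels (an assumed dependency; the lecture points to Cohen’s table rather than any particular tool). For a “medium” effect of d = 0.5 at alpha = .05 and 80% power it reproduces the 64-per-group figure in Cohen’s (1992) table.

```python
# Sample size for an independent-samples t-test, solved from power criteria.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
n = analysis.solve_power(effect_size=0.5,        # Cohen's d: "medium" effect
                         alpha=0.05,             # Type I error rate
                         power=0.80,             # chance of detecting the effect
                         alternative="two-sided")
print(f"{n:.0f} participants per group")         # ~64, matching Cohen's table
```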
Stats: What You Needed to Hear…

Stats: Planted the Seed Last Week
Stats are:
– Not learnable in an hour
– More than just p-values
– Powerful
– Dangerous
– Time consuming
But you are the expert. You need to know:
– The foundation
– The rationale
– Useful application

Stats: Foundation
Just as with any user experience research endeavor:
– Think hard first (test plan).
– Anticipate outcomes to keep yourself on track with the objectives and avoid getting pulled onto tangents.
– Then begin the research...
Definition of statistics:
– A set of procedures for describing measurements and for making inferences about what is generally true.
Statistics and experimental design go hand in hand:
– Objective → Method → Measures → Outcomes → Objectives.

Stats: Measures
Ratio scales (interval + absolute zero):
– A measure that has:
  • Comparable intervals (inches, feet, etc.)
  • An absolute zero (not meaningful to non-statisticians)
– Differences are comparable:
  • Device A is 72 inches in length while Device B is 36 inches – one is twice as long as the other.
  • Performance on Device A was 30 sec while on Device B it was 60 sec – users were twice as fast completing the task on Device A as on Device B.
Interval scales do not have an absolute zero:
– The difference between 40°F and 50°F equals the difference between 90°F and 100°F, but ratios are not meaningful (80°F is not “twice as hot” as 40°F).
Take-away: you get powerful statistics using ratio/interval measures.

Stats: What Does Power Mean Again?
Statistical power:
– “The power to find a significant difference, if it does indeed exist.”
– Too little power → you miss significance when it is really there.
– Too much power → you may find statistically significant differences that are too small to matter.
You can get more power by:
– Adding more participants (but the impact is non-linear)
– Having a greater “effect size,” which is the anticipated difference
– Picking within-subjects designs over between-subjects designs
– Using ratio/interval measures
– Changing alpha
Practical power:
– Sample size costs money.
– If you find significance, then it is probably really true!

Stats: Other Measures (non-ratio / non-interval)
– Likert scales
– Rank data
– Count data
Each of these measures uses a different statistical test, and power is different (reduced).
– Consider Likert data on a 1–5 scale, with A at 1, B at 2, and C at 5. You could say:
  • A = 1, B = 2, C = 5; or
  • C came in 1st, B came in 2nd, and A came in 3rd.
Precision influences power: the less precise your measure, the less power you have to detect differences.

Stats: Between-Groups Designs
A between-groups study splits the sample into two or more groups; each group interacts with only one device.
What causes variability?
– Measurement error
  • The tool or procedure can be imprecise (e.g., starting and stopping the stopwatch).
– Unreliability
  • We are human, so if you test the same participant on different days, you might get a different time!
– Individual differences
  • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem.

Stats: What About Within-Groups Designs?
A within-groups study has every participant interact with all devices.
What causes variability?
– Measurement error and unreliability, exactly as above.
– Individual differences no longer apply: each participant serves as his/her own control.
Thus, fewer sources of variability result in greater statistical power (the simulation sketch below illustrates this).
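The variability argument can be made concrete with a small simulation; all numbers, names, and the scenario are hypothetical. The same simulated scores are analyzed two ways: as if they came from two independent groups, and as paired scores from the same participants. Pairing removes the individual-differences component of the variability, so the paired test typically detects the built-in device effect far more easily.

```python
# Simulation sketch: why within-groups (paired) designs gain power.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 12
baseline = rng.normal(60, 15, n)                # individual differences (seconds)
device_a = baseline + rng.normal(0, 5, n)       # noise = measurement error etc.
device_b = baseline + 8 + rng.normal(0, 5, n)   # Device B is 8 s slower on average

# Between-groups view: treat the two score sets as independent groups.
print("independent t-test p:", stats.ttest_ind(device_a, device_b).pvalue)
# Within-groups view: same participants, so pair the scores.
print("paired t-test p:     ", stats.ttest_rel(device_a, device_b).pvalue)
```

The paired p-value comes out much smaller because subtracting each participant’s own baseline cancels the 15-second spread in individual speed, leaving only the device effect and the 5-second noise.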
Stats: More Common Statistical Tests
You are actually well aware of statistics – descriptive statistics!
Measures of central tendency:
– Mean = the average.
– Median = the exact point that divides the distribution into two parts such that an equal number of scores fall above and below that point.
– Mode = the most frequently occurring score.

Stats: When In Doubt, Plot
[Histogram: frequency of each score (1–5); the plotted sample is normally distributed and randomly sampled, so mean = median = mode.]

Stats: Skewed Distributions
– Positive skew → tail on the right.
– Negative skew → tail on the left.
Impact on the measures of “central tendency”: the mode stays at the peak, the median shifts toward the tail, and the mean is pulled furthest into the tail.

Stats: Got Central Tendency, Now Variability
We must first understand variability – we tend to think of a mean as “it” or “everything.”
Consider time on task as a metric. Sources of variability:
– Measurement error
  • The tool or procedure can be imprecise (e.g., starting and stopping the stopwatch).
– Individual differences
  • Participants are different, so some get different scores than others on the same task. Since we are testing for differences between A and B, this can be a problem.
– Unreliability
  • We are human, so if you test the same participant on different days, you might get a different time!
Another example:
– The class scored 80; the school scored 76.
– Many scores went into each mean score.
– Variability can be quantified. [Drawn on the board.]

Stats: Variability Is Standardized (~Normalized)
“Your score is in the 50th percentile.” “Ethan and Madeline are the smartest kids in class.” With SAT scores, you saw your score – how did they get a percentile?
– The distribution is normal.
– Numerically, the distribution can be described by only the mean and the standard deviation.
[Figure: normal curve marked at 1, 2, and 3 standard deviations on either side of the mean.]

Stats: Empirical Rule
The empirical rule = 68/95/99.7:
– 68% of scores fall within the mean +/- 1 std dev.
– 95% fall within the mean +/- 2 std dev.
– 99.7% fall within the mean +/- 3 std dev.

Stats: Clear on Normal Curves?
Normal curves represent a single dataset, on a single measure, for a single sample. Once data are normalized, you can describe the dataset simply by the mean and standard deviation.
– E.g., a 60% success rate with a std dev of 10%. (See the descriptive-statistics sketch below.)
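To tie the central-tendency and empirical-rule material together, here is a standard-library sketch on a small, hypothetical sample of task times; with only ten data points the 68/95/99.7 proportions are rough, as they only emerge cleanly from larger, normally distributed samples.

```python
# Descriptive statistics with Python's standard library (hypothetical data).
from statistics import mean, median, mode, stdev

times = [28, 31, 31, 33, 34, 35, 35, 35, 38, 41]  # hypothetical seconds
m, s = mean(times), stdev(times)
print(f"mean={m:.1f}  median={median(times)}  mode={mode(times)}  sd={s:.1f}")

# Empirical-rule check: share of scores within 1, 2, 3 standard deviations.
for k in (1, 2, 3):
    inside = sum(abs(t - m) <= k * s for t in times) / len(times)
    print(f"within {k} sd: {inside:.0%}")  # ~68 / 95 / 99.7% for large normal samples
```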
Stats: Randomly Sampled
Think of the data as coming from a population.

Stats: Things Can Happen by Chance Alone
Are these really two samples of 10 drawn from a population that may have these characteristics?

Stats: Exercise
Pass out the Yes / No cards.
Procedure:
– You will be given a task.
– I will count the seconds (because it is hard to install a timer on the DePaul PCs).
– Make a Yes or No decision.
– Record the count (your time).
– Hold up the Yes / No card.

Practice
Is the man wearing a red shirt? Decide Yes or No, note your time, hold up the card. Ready?
[Countdown slides: 1–9 sec]

Task 1
Hospital setting. A physician is concerned about the patient’s MCH values, as the value is now higher than normal. On January 27, 2001, was the patient’s MCH value higher than normal? Decide Yes or No, note your time, hold up the card. Ready?
[Countdown slides: 1–20 sec]

Task 1: Recap
Collect:
– Time
– Success / fail
Who forgot the task?
– Controls and their impact on data collection...?

Task 2
Hospital setting. Assessment of different prompts for an interaction with an interface. The patient has declined further treatment. The physician asked you to go into the system and cancel all of the orders. So, you select the orders and press Cancel. Decide Yes or No, note your time, hold up the card. Ready?
[Countdown slides: 1–10 sec]

Task 3
Same hospital scenario and instructions as Task 2, assessed with a different prompt. Decide Yes or No, note your time, hold up the card. Ready?
[Countdown slides: 1–10 sec]

Task 3: Recap – What Just Happened?
Prompt statement types and when to use them:
– True affirmative: use for fast, easy, low-cost-to-user outcomes; confirmation.
– False affirmative: use when you need the user to think about the response; high cost to the user.
– True negative: should almost never be used.
– False negative: never use, unless you are trying to deceive the user (e.g., the opt-out pattern “[ ] Do not send me the newsletter”).

Stats: Data Collection
[Chart: task times per participant; blue = correct (success), red = incorrect (fail).]

Stats: Tasting Soup
You don’t need to drink half a pot to see if the soup is done / tasty. As long as the soup is “well mixed,” you should be able to take just one or two tastes. “Well mixed” means the data are:
– Randomly sampled
– Normally distributed
– Of equal variances
– Independent samples
[Chart: two success rates plotted on a 40–100% scale.]
Can you say these are different?
– Not without confidence intervals.

Stats: Confidence Intervals Matter
A 95% confidence interval means that intervals constructed this way capture the true score 95% of the time.
[Chart: the same success rates with overlapping 95% confidence intervals.]
With confidence intervals attached, can you say these are different?
The width of these intervals is affected by:
– The chosen confidence level
– Variability
– Sample size

Stats: What If Variability Were Better Controlled?
What if we could be more precise or reduce variability, as in a within-groups design?
– The confidence level still stays the same at 95%.
– But the variability is reduced, so the intervals are narrower.
[Chart: the same success rates with non-overlapping 95% confidence intervals.]
With confidence intervals attached, can you say these are different? Yes! And in fact, you do not need stats.

Stats: Usually, Inferential Stats Are Needed
The world is usually dirty – there is typically overlap.
[Chart: success rates with partially overlapping confidence intervals.]
Statistics can determine whether these are significantly different. (A sketch of the interval calculation follows.)
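For readers who want the error bars themselves, here is a sketch of the simple normal-approximation (Wald) interval for a success rate. The counts are hypothetical, and for small usability samples an adjusted interval (e.g., Wilson or adjusted Wald) would be a better choice; this sketch only illustrates how sample size drives interval width.

```python
# 95% confidence interval for a success rate (Wald / normal approximation).
import math

def wald_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    p = successes / n
    half = z * math.sqrt(p * (1 - p) / n)   # half-width shrinks as n grows
    return p - half, p + half

for successes, n in [(12, 20), (48, 80)]:   # same 60% rate, different n
    lo, hi = wald_ci(successes, n)
    print(f"{successes}/{n}: 60% success, 95% CI {lo:.0%} to {hi:.0%}")
```

With n = 20 the interval spans roughly 39–81%; with n = 80 it tightens to roughly 49–71%, which is exactly the narrowing the slides describe.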
Stats: What If There Is No Known Population?
To properly measure quality control standards, Guinness hired a statistician, William Sealy Gosset, in 1899.

Stats: Student’s T-Distribution Mimics the Population
The population is unknown, so it is estimated from the sample itself.
– Notice that relatively small sample sizes are enough for the t-distribution to resemble a normal distribution.

Stats: Hypothesis Testing
All the examples so far used a single sample; now we compare two samples.
State the research question (i.e., the hypothesis):
– Is there a difference between A and B?
Test the opposite (i.e., the null hypothesis):
– There is no difference between A and B.
– The scientific method is to disprove the null hypothesis (Ho).
  • Because we cannot prove the hypothesis, we disprove the alternatives one by one.

Stats: How Can We Be Wrong?
Type I error: saying there is a difference when one does not exist.
– Like convicting an innocent person.
– A false positive.
– Controlled by the alpha level, the p-value threshold you will accept.
  • Established a priori (you set this before the study).
Type II error: saying there is not a difference when there is one!
– Like letting the bad guy go free.
There are volumes written on Type I / II errors. So, let’s focus on inferring whether the new device is better than the old device.

Stats: Sample Time Data
Remember that with time data, there is typically a long tail. Perform a log transform to normalize the data:
– The geometric mean is recommended for reporting (it is the back-transformed mean of the logs).
– Plug the log data into the test instead of the raw times.

Stats: Hypothesis
Null hypothesis (Ho): there is no difference in time for old vs. new.
Alternative hypothesis (Ha): there is a difference.
Hypothesis test:
– Set alpha (p < .05).
– Ho: Mean(old) = Mean(new).
– Ha: Mean(old) ≠ Mean(new) → two-tailed test.
– H1: Mean(old) < Mean(new) → one-tailed test.
Run the statistical test:
– Two-sample t-test.
– Result: p = .0106.
If the p-value from the t-test is < .05, the difference is significant.

Stats: Result
p = .0106 → reject Ho. The current design was significantly faster than the new design on Task XXX (p < .05).

Stats: Same for Success Data – Fisher’s Exact Test
Null hypothesis (Ho): there is no difference in success for old vs. new.
Alternative hypothesis (Ha): there is a difference.
Hypothesis test:
– Set alpha (p < .05).
– Ho: Mean(old) = Mean(new).
– Ha: Mean(old) ≠ Mean(new) → two-tailed test.
– H1: Mean(old) < Mean(new) → one-tailed test.
Run the statistical test. If the resulting p-value is < .05, the difference is significant.
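Here is a sketch of both tests on hypothetical data: a two-sample t-test on log-transformed times, with geometric means for reporting, and Fisher’s exact test on success/failure counts. The times, the 2x2 counts, and the use of scipy are all assumptions for illustration, not the study data from the slides.

```python
# Two-sample t-test on log times, plus Fisher's exact test on success counts.
import numpy as np
from scipy import stats

old = np.array([22, 25, 28, 30, 34, 41, 55, 70], float)    # hypothetical seconds
new = np.array([30, 36, 39, 45, 52, 60, 75, 120], float)

log_old, log_new = np.log(old), np.log(new)
print("geometric means:", np.exp(log_old.mean()), np.exp(log_new.mean()))
t_stat, p = stats.ttest_ind(log_old, log_new)
print("t-test on log times: p =", p)        # reject Ho if p < .05

table = [[14, 2],   # old design: successes, failures (hypothetical counts)
         [8, 8]]    # new design: successes, failures
odds, p = stats.fisher_exact(table)
print("Fisher's exact test: p =", p)        # same decision rule: p < .05
```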
Project 3

Project 3: Next Steps
– Wait for feedback on your test plan.
– Create a moderator’s guide based on the test plan.
  • It must be very explicit, so that all moderators are consistent.
  • Consider all contingencies.
– Run a pilot session or two with the entire group.
– Revise the guide.
– Collect data.
– Analyze data.

Project 3: General Feedback on Test Plans
Objective:
– Why would someone want to pay for this study?
Think of the project:
– Quantitatively – which is better, A or B?
– Qualitatively – why is A (or B) better (or worse)?
Be careful when selecting participants:
– Ideally, they should be equally familiar or unfamiliar with both A and B.
Make sure all moderators have identical stimuli.
– E.g., if you are using printouts, make sure they are all in color and printed in the same format, etc.

Project 3: General Feedback on Test Plans
If you are measuring time on task:
– Decide exactly when to start and stop the stopwatch.
  • When the participant starts saying the answer out loud?
  • When they point to the answer?
  • What if they point and then change their mind?
If you are measuring the number of errors, define what an error is prior to the study:
– [Website] If someone clicks on the wrong link, and then continues going down that path, is that still one error or more than one?
– [Text entry] If someone types “a” instead of “s”? What about missing letters? What about extra letters?

Project 3: General Feedback on Test Plans
If you have a between-groups design:
– Make sure that the two groups are matched well in terms of demographics, experience with the tested products, etc.
If you have a within-groups design:
– Counterbalance the presentation order of the two stimuli: half of the participants should get A first and then B, and the other half B first and then A (see the sketch below).
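A trivial sketch of the counterbalancing scheme described above: alternating AB / BA assignment by participant number. The helper function is hypothetical, not part of any project template.

```python
# Counterbalance presentation order: odd-numbered participants see A first,
# even-numbered participants see B first, so order effects cancel out.
orders = [("A", "B"), ("B", "A")]

def presentation_order(participant_number: int) -> tuple[str, str]:
    """Participant 1 gets A then B, participant 2 gets B then A, and so on."""
    return orders[(participant_number - 1) % len(orders)]

for p in range(1, 7):
    print(f"P{p}: {' -> '.join(presentation_order(p))}")
```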