Evaluation, cont'd

Two main types of evaluation
- Formative evaluation is done at different stages of development to check that the product meets users' needs.
- Summative evaluation assesses the quality of a finished product.
- Our focus is on formative evaluation.

What to evaluate
Iterative design & evaluation is a continuous process that examines:
- early ideas for the conceptual model
- early prototypes of the new system
- later, more complete prototypes
Designers need to check that they understand users' requirements.

Tog says ...
"Iterative design, with its repeating cycle of design and testing, is the only validated methodology in existence that will consistently produce successful results. If you don't have user-testing as an integral part of your design process you are going to throw buckets of money down the drain."

When to evaluate
- Throughout design: from the first descriptions, sketches, etc. of users' needs through to the final product.
- Design proceeds through iterative cycles of 'design-test-redesign'.
- Evaluation is a key ingredient of a successful design.

Another example - development of "HutchWorld"
- Many informal meetings with patients, carers & medical staff early in design.
- An early prototype was informally tested on site. The designers learned a lot: the language of designers & users was different, and asynchronous communication was also needed.
- Redesigned to produce the portal version.

Usability testing of HutchWorld
- User tasks investigated: how users' identity was represented, communication, information searching, entertainment.
- User satisfaction questionnaire.
- Triangulation to get different perspectives.

Findings from the usability test
- The back button didn't always work.
- Users didn't pay attention to navigation buttons.
- Users expected all objects in the 3-D view to be clickable.
- Users did not realize that there could be others in the 3-D world with whom to chat.
- Users tried to chat to the participant list.

Key points
- Evaluation & design are closely integrated in user-centered design.
- Some of the same techniques are used in evaluation & requirements, but they are used differently (e.g., interviews & questionnaires).
- Triangulation involves using a combination of techniques to gain different perspectives.
- Dealing with constraints is an important skill for evaluators to develop.

A case in point ...
"The Butterfly Ballot: Anatomy of a Disaster". See http://www.asktog.com/columns/042ButterflyBallot.html

An evaluation framework - the aims
- Explain key evaluation concepts & terms.
- Describe the evaluation paradigms & techniques used in interaction design.
- Discuss the conceptual, practical and ethical issues that must be considered when planning evaluations.
- Introduce the DECIDE framework.

Evaluation paradigm
Any kind of evaluation is guided explicitly or implicitly by a set of beliefs, which are often underpinned by theory. These beliefs and the methods associated with them are known as an 'evaluation paradigm'.

User studies
User studies involve looking at how people behave in their natural environments, or in the laboratory, both with old technologies and with new ones.

Four evaluation paradigms
- 'quick and dirty'
- usability testing
- field studies
- predictive evaluation

Quick and dirty
'Quick & dirty' evaluation describes the common practice in which designers informally get feedback from users or consultants to confirm that their ideas are in line with users' needs and are liked. Quick & dirty evaluations can be done at any time. The emphasis is on fast input to the design process rather than carefully documented findings.
Usability testing
- Usability testing involves recording typical users' performance on typical tasks in controlled settings. Field observations may also be used.
- As the users perform these tasks they are watched & recorded on video, and their key presses are logged.
- This data is used to calculate performance times, identify errors & help explain why the users did what they did.
- User satisfaction questionnaires & interviews are used to elicit users' opinions.

Field studies
- Field studies are done in natural settings.
- The aim is to understand what users do naturally and how technology impacts them.
- In product design, field studies can be used to: identify opportunities for new technology; determine design requirements; decide how best to introduce new technology; evaluate technology in use.

Predictive evaluation
- Experts apply their knowledge of typical users, often guided by heuristics, to predict usability problems.
- Another approach involves theoretically based models.
- A key feature of predictive evaluation is that users need not be present.
- Relatively quick & inexpensive.

Overview of techniques
- observing users
- asking users their opinions
- asking experts their opinions
- testing users' performance
- modeling users' task performance

DECIDE: a framework to guide evaluation
- Determine the goals the evaluation addresses.
- Explore the specific questions to be answered.
- Choose the evaluation paradigm and techniques to answer the questions.
- Identify the practical issues.
- Decide how to deal with the ethical issues.
- Evaluate, interpret and present the data.

Determine the goals
- What are the high-level goals of the evaluation? Who wants it and why?
- The goals influence the paradigm for the study.
- Some examples of goals: identify the best metaphor on which to base the design; check that the final interface is consistent; investigate how technology affects working practices; improve the usability of an existing product.

Explore the questions
All evaluations need goals & questions to guide them so time is not wasted on ill-defined studies. For example, the goal of finding out why many customers prefer to purchase paper airline tickets rather than e-tickets can be broken down into sub-questions:
- What are customers' attitudes to these new tickets?
- Are they concerned about security?
- Is the interface for obtaining them poor?
What questions might you ask about the design of a cell phone?

Choose the evaluation paradigm & techniques
The evaluation paradigm strongly influences the techniques used and how data is analyzed and presented. E.g., field studies do not involve testing or modeling.

Identify practical issues
For example, how to:
- select users
- stay on budget
- stay on schedule
- find evaluators
- select equipment

Decide on ethical issues
Develop an informed consent form. Participants have a right to:
- know the goals of the study
- know what will happen to the findings
- privacy of personal information
- not be quoted without their agreement
- leave when they wish
- be treated politely

Evaluate, interpret & present data
How data is analyzed & presented depends on the paradigm and techniques used. The following also need to be considered:
- Reliability: can the study be replicated?
- Validity: is it measuring what you thought?
- Biases: is the process creating biases?
- Scope: can the findings be generalized?
- Ecological validity: is the environment of the study influencing it? (e.g., the Hawthorne effect)
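To make the framework concrete, here is a minimal sketch (Python) of how a DECIDE evaluation plan could be captured as a simple record, one field per step. The field names and the example study are hypothetical illustrations, not part of the DECIDE framework itself.

```python
from dataclasses import dataclass

@dataclass
class EvaluationPlan:
    """One field per DECIDE step (hypothetical structure, for illustration only)."""
    goals: list              # Determine the goals
    questions: list          # Explore the questions
    paradigm: str            # Choose the evaluation paradigm ...
    techniques: list         # ... and techniques
    practical_issues: list   # Identify the practical issues
    ethical_issues: list     # Decide how to deal with the ethical issues
    analysis_plan: str       # Evaluate, interpret and present the data

# Invented example based on the e-ticket study mentioned above.
plan = EvaluationPlan(
    goals=["Find out why customers prefer paper tickets to e-tickets"],
    questions=["Attitudes to e-tickets?", "Security concerns?", "Is the interface poor?"],
    paradigm="usability testing",
    techniques=["user testing", "satisfaction questionnaire", "interviews"],
    practical_issues=["recruit 8 users", "stay on budget and schedule"],
    ethical_issues=["informed consent form", "anonymize quotes"],
    analysis_plan="task times, error counts, questionnaire summary",
)

for step, value in vars(plan).items():
    print(f"{step}: {value}")
```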
Pilot studies
- A small trial run of the main study; the aim is to make sure your plan is viable.
- Pilot studies check that you can conduct the procedure and that interview scripts, questionnaires, experiments, etc. work appropriately.
- It's worth doing several to iron out problems before doing the main study.
- Ask colleagues if you can't spare real users.

Key points
- An evaluation paradigm is an approach that is influenced by particular theories and philosophies.
- Five categories of techniques were identified: observing users, asking users, asking experts, user testing, modeling users.
- The DECIDE framework has six parts: Determine the overall goals; Explore the questions that satisfy the goals; Choose the paradigm and techniques; Identify the practical issues; Decide on the ethical issues; Evaluate ways to analyze & present data.

Observing users - the aims
- Discuss the benefits & challenges of different types of observation.
- Describe how to observe as an onlooker, a participant, & an ethnographer.
- Discuss how to collect, analyze & present observational data.
- Examine think-aloud, diary studies & logging.
- Provide you with experience in doing observation and critiquing observation studies.

What and when to observe
- Goals & questions determine the paradigms and techniques used.
- Observation is valuable at any time during design; quick & dirty observations are useful early in design.
- Observation can be done in the field (i.e., field studies) and in controlled environments (i.e., usability studies).
- Observers can be: outsiders looking on; participants, i.e., participant observers; ethnographers.

Frameworks to guide observation
A simple framework:
- The person. Who?
- The place. Where?
- The thing. What?
The Goetz and LeCompte (1984) framework:
- Who is present? What is their role?
- What is happening?
- When does the activity occur?
- Where is it happening?
- Why is it happening?
- How is the activity organized?
The Robinson (1993) framework:
- Space. What is the physical space like?
- Actors. Who is involved?
- Activities. What are they doing?
- Objects. What objects are present?
- Acts. What are individuals doing?
- Events. What kind of event is it?
- Goals. What do they want to accomplish?
- Feelings. What is the mood of the group and of individuals?

You need to consider
- goals & questions
- which framework & techniques to use
- how to collect data
- which equipment to use
- how to gain acceptance
- how to handle sensitive issues
- whether and how to involve informants
- how to analyze the data
- whether to triangulate

Observing as an outsider
- As in usability testing; more objective than participant observation.
- In a usability lab the equipment is in place and recording is continuous.
- Analysis & observation are almost simultaneous.
- Care is needed to avoid drowning in data.
- Analysis can be coarse or fine grained.
- Video clips can be powerful for telling the story.

Participant observation & ethnography
- There is debate about the differences; participant observation is a key component of ethnography.
- Must get the co-operation of the people observed; informants are useful.
- Data analysis is continuous.
- An interpretivist technique: questions get refined as understanding grows.
- Reports usually contain examples.

Data collection techniques
- notes & still camera
- audio & still camera
- video
- tracking users: diaries, interaction logging

Data analysis
- Qualitative data is interpreted & used to tell the 'story' about what was observed, or categorized using techniques such as content analysis.
- Quantitative data is collected from interaction & video logs, presented as values, tables, charts and graphs, and treated statistically.
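As a minimal sketch of the quantitative side, the snippet below (Python, with an invented log format) shows how a simple interaction log of timestamped events could be reduced to a task time and an error count; real logging tools will differ.

```python
from datetime import datetime

# Invented log format: one "ISO-timestamp<TAB>event" line per logged event.
log = """2004-03-01T10:00:00\ttask_start
2004-03-01T10:00:12\tclick:save_as
2004-03-01T10:00:15\terror:wrong_menu
2004-03-01T10:00:41\ttask_end"""

events = []
for line in log.splitlines():
    ts, event = line.split("\t")
    events.append((datetime.fromisoformat(ts), event))

# Task time = first to last timestamp; errors = events tagged as errors.
task_time = (events[-1][0] - events[0][0]).total_seconds()
errors = sum(1 for _, e in events if e.startswith("error:"))

print(f"task time: {task_time:.0f} s, errors: {errors}")  # task time: 41 s, errors: 1
```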
Interpretive data analysis
- Look for key events that drive the group's activity.
- Look for patterns of behavior.
- Test data sources against each other - triangulate.
- Report findings in a convincing and honest way: produce 'rich' or 'thick' descriptions; include quotes, pictures, and anecdotes.
- Software tools can be useful, e.g., NUDIST, Ethnograph (URLs will be provided).

Looking for patterns
- critical incident analysis
- content analysis
- discourse analysis
- quantitative analysis, i.e., statistics

Key points
- Observe from outside or as a participant.
- Analyzing video and data logs can be time-consuming.
- In participant observation, collections of comments, incidents, and artifacts are made.
- Ethnography is a philosophy with a set of techniques that include participant observation and interviews. Ethnographers immerse themselves in the culture that they study.

Asking users & experts - the aims
- Discuss the role of interviews & questionnaires in evaluation.
- Teach basic questionnaire design.
- Describe how to do interviews, heuristic evaluation & walkthroughs.
- Describe how to collect, analyze & present data.
- Discuss strengths & limitations of these techniques.

Interviews
- Unstructured interviews are not directed by a script. Rich but not replicable.
- Structured interviews are tightly scripted, often like a questionnaire. Replicable but may lack richness.
- Semi-structured interviews are guided by a script, but interesting issues can be explored in more depth. They can provide a good balance between richness and replicability.

Basics of interviewing
- Remember the DECIDE framework: goals and questions guide all interviews.
- Two types of questions: 'closed questions' have a predetermined answer format, e.g., 'yes' or 'no'; 'open questions' do not have a predetermined format.
- Closed questions are quicker and easier to analyze.

Things to avoid when preparing interview questions
- long questions
- compound sentences - split them into two
- jargon & language that the interviewee may not understand
- leading questions that make assumptions, e.g., "Why do you like ...?"
- unconscious biases, e.g., gender stereotypes

Components of an interview
- Introduction: introduce yourself, explain the goals of the interview, reassure about the ethical issues, ask to record, present an informed consent form.
- Warm-up: make the first questions easy & non-threatening.
- Main body: present questions in a logical order.
- Cool-off period: include a few easy questions to defuse tension at the end.
- Closure: thank the interviewee and signal the end, e.g., switch the recorder off.

The interview process
- Use the DECIDE framework for guidance.
- Dress in a similar way to participants.
- Check recording equipment in advance.
- Devise a system for coding names of participants to preserve confidentiality.
- Be pleasant.
- Ask participants to complete an informed consent form.

Probes and prompts
- Probes are devices for getting more information, e.g., "Would you like to add anything?"
- Prompts are devices to help the interviewee, e.g., help with remembering a name.
- Remember that probing and prompting should not create bias; too much can encourage participants to try to guess the answer.
Group interviews
- Also known as 'focus groups'; typically 3-10 participants.
- Provide a diverse range of opinions.
- Need to be managed to ensure that everyone contributes, that the discussion isn't dominated by one person, and that the agenda of topics is covered.

Analyzing interview data
- Depends on the type of interview.
- Structured interviews can be analyzed like questionnaires.
- Unstructured interviews generate data like that from participant observation.
- It is best to analyze unstructured interviews as soon as possible to identify topics and themes in the data.

Questionnaires
- Questions can be closed or open; closed questions are easiest to analyze and may be analyzed by computer.
- Can be administered to large populations; paper, email & the web are used for dissemination.
- An advantage of electronic questionnaires is that the data goes straight into a database & is easy to analyze.
- Sampling can be a problem when the size of a population is unknown, as is common online.

Questionnaire style
Style varies according to the goal, so use the DECIDE framework for guidance. Questionnaire formats can include:
- 'yes'/'no' checkboxes
- checkboxes that offer many options
- Likert rating scales
- semantic scales
- open-ended responses
Likert scales have a range of points; 3-, 5-, 7- & 9-point scales are common, and there is debate about which is best.

Developing a questionnaire
- Provide a clear statement of purpose & guarantee participants anonymity.
- Plan the questions; if developing a web-based questionnaire, design it off-line first.
- Decide whether phrases will all be positive, all negative or mixed.
- Pilot test the questions: are they clear, and is there sufficient space for responses?
- Decide how the data will be analyzed & consult a statistician if necessary (a small analysis sketch follows).
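As a minimal sketch of the kind of analysis worth planning for, the snippet below (Python, with invented responses) summarizes a single 5-point Likert item using the simple statistics discussed later (mean, median, mode, standard deviation) plus the percentage who agree. It assumes a numeric coding of 1 = strongly disagree through 5 = strongly agree.

```python
import statistics

# Invented responses to one 5-point Likert item (1 = strongly disagree ... 5 = strongly agree)
responses = [4, 5, 3, 4, 2, 5, 4, 4, 3, 5, 1, 4]

n = len(responses)
print("n      =", n)
print("mean   =", round(statistics.mean(responses), 2))
print("median =", statistics.median(responses))
print("mode   =", statistics.mode(responses))
print("stdev  =", round(statistics.stdev(responses), 2))

# Percentage agreeing (4 or 5); always report the population size alongside percentages.
agree = sum(1 for r in responses if r >= 4)
print(f"agree: {agree}/{n} = {100 * agree / n:.0f}%")
```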
Encouraging a good response
- Make sure the purpose of the study is clear and promise anonymity.
- Ensure the questionnaire is well designed.
- Offer a short version for those who do not have time to complete a long questionnaire.
- If mailed, include a stamped addressed envelope.
- Follow up with emails, phone calls or letters, and provide an incentive.
- A 40% response rate is high; 20% is often acceptable.

Advantages of online questionnaires
- Responses are usually received quickly.
- No copying and postage costs.
- Data can be collected in a database for analysis, and the time required for data analysis is reduced.
- Errors can be corrected easily.

Problems with online questionnaires
- Sampling is problematic if the population size is unknown.
- It is hard to prevent individuals from responding more than once.
- Individuals have also been known to change the questions in email questionnaires.

Questionnaire data analysis & presentation
- Present results clearly; tables may help.
- Simple statistics can say a lot, e.g., mean, median, mode, standard deviation.
- Percentages are useful, but give the population size.
- Bar graphs show categorical data well.
- More advanced statistics can be used if needed.
- Well-known questionnaire forms include SUMI, MUMMS and QUIS (see the Perlman site).

Asking experts
- Experts use their knowledge of users & technology to review software usability.
- Expert critiques ('crits') can be formal or informal reports.
- Heuristic evaluation is a review guided by a set of heuristics.
- Walkthroughs involve stepping through a pre-planned scenario noting potential problems.

Heuristic evaluation
- Developed by Jakob Nielsen in the early 1990s.
- Based on heuristics distilled from an empirical analysis of 249 usability problems.
- These heuristics have been revised for current technology, e.g., HOMERUN for the web; heuristics are still needed for mobile devices, wearables, virtual worlds, etc.
- Design guidelines form a basis for developing heuristics.

Nielsen's heuristics
- Visibility of system status
- Match between system and the real world
- User control and freedom
- Consistency and standards
- Help users recognize, diagnose and recover from errors
- Error prevention
- Recognition rather than recall
- Flexibility and efficiency of use
- Aesthetic and minimalist design
- Help and documentation

Discount evaluation
Heuristic evaluation is referred to as discount evaluation when 5 evaluators are used. Empirical evidence suggests that on average 5 evaluators identify 75-80% of usability problems.

3 stages for doing heuristic evaluation
- A briefing session to tell the experts what to do.
- An evaluation period of 1-2 hours in which each expert works separately, taking one pass to get a feel for the product and a second pass to focus on specific features.
- A debriefing session in which the experts work together to prioritize the problems.

Advantages and problems
- Few ethical & practical issues to consider.
- It can be difficult & expensive to find experts; the best experts have knowledge of the application domain & users.
- Biggest problems: important problems may get missed, and many trivial problems are often identified.
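The debriefing stage above is essentially an exercise in merging and prioritizing the experts' separate findings. Below is a minimal sketch (Python) of that aggregation step; the problem reports, severity scores and field names are invented for illustration.

```python
from collections import defaultdict

# Invented findings: (evaluator, problem, heuristic violated, severity 1-4)
findings = [
    ("E1", "No feedback after Save", "Visibility of system status", 3),
    ("E2", "No feedback after Save", "Visibility of system status", 4),
    ("E3", "Jargon in error dialog", "Match between system and the real world", 2),
    ("E1", "No undo for delete", "User control and freedom", 4),
    ("E3", "No undo for delete", "User control and freedom", 3),
]

# Merge duplicate reports of the same problem across evaluators.
merged = defaultdict(lambda: {"evaluators": set(), "severities": []})
for evaluator, problem, heuristic, severity in findings:
    entry = merged[(problem, heuristic)]
    entry["evaluators"].add(evaluator)
    entry["severities"].append(severity)

# Prioritize by mean severity, then by how many evaluators found the problem.
ranked = sorted(
    merged.items(),
    key=lambda kv: (sum(kv[1]["severities"]) / len(kv[1]["severities"]),
                    len(kv[1]["evaluators"])),
    reverse=True,
)
for (problem, heuristic), info in ranked:
    mean_sev = sum(info["severities"]) / len(info["severities"])
    print(f"{problem} [{heuristic}] found by {len(info['evaluators'])} "
          f"evaluator(s), mean severity {mean_sev:.1f}")
```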
Cognitive walkthroughs
- Focus on ease of learning.
- The designer presents an aspect of the design & usage scenarios.
- One or more experts walk through the design prototype with the scenario.
- The experts are told the assumptions about the user population, context of use and task details, and are guided by 3 questions.

The 3 questions
- Will the correct action be sufficiently evident to the user?
- Will the user notice that the correct action is available?
- Will the user associate and interpret the response from the action correctly?
As the experts work through the scenario they note problems.

Pluralistic walkthrough
- A variation on the cognitive walkthrough theme, performed by a carefully managed team.
- The panel of experts begins by working separately; then there is a managed discussion that leads to agreed decisions.
- The approach lends itself well to participatory design.

Key points
- Structured, unstructured and semi-structured interviews, focus groups & questionnaires.
- Closed questions are easiest to analyze & can be replicated; open questions are richer.
- Check boxes, Likert & semantic scales.
- Expert evaluation: heuristic evaluation & walkthroughs. Relatively inexpensive because no users are needed.
- Heuristic evaluation is relatively easy to learn, but may miss key problems & identify false ones.

A project for you ...
- Activeworlds.com: use a questionnaire to test reactions with friends (see http://www.acm.org/~perlman/question.html and http://www.ifsm.umbc.edu/djenni1/osg/). Develop heuristics to evaluate usability and sociability aspects.
- http://www.id-book.com/catherb/ provides heuristics and a template so that you can evaluate different kinds of systems. More information about this is provided in the interactivities section of the id-book.com website.
- Go to The Pew Internet & American Life Survey, www.pewinternet.org/ (or to another survey of your choice). Critique one of the recent online surveys and a recent survey report.

Interpretive evaluation
- Contextual inquiry, cooperative and participative evaluation, ethnography.
- Rather than emphasizing statements of goals, objective tests and research reports, interpretive evaluation emphasizes the usefulness of findings to the people concerned.
- Good for feasibility studies, design feedback and post-implementation review.

Contextual inquiry
Users and researchers participate to identify and understand usability problems within the normal working environment of the user. Differences from other methods include:
- work context: larger tasks
- time context: longer times
- motivational context: more user control
- social context: social support is included that is normally lacking in experiments

Why use contextual inquiry?
- Usability issues are located that go undetected in laboratory testing, e.g., line counting in word processing, or unpacking and setting up equipment.
- Issues are identified by users or jointly by user and evaluator.

Contextual interview: topics of interest
- structure and language used in the work
- individual and group actions and intentions
- the culture affecting the work
- explicit and implicit aspects of the work

Cooperative evaluation
- A technique to improve a user interface specification by detecting possible usability problems in an early prototype or partial simulation.
- Low cost, and little training is needed; think-aloud protocols are collected during the evaluation.
- Typical users are recruited and representative tasks selected; the user verbalizes problems while the evaluator makes notes; debriefing sessions are held; the findings are summarized and reported back to the design team.

Participative evaluation
- More open than cooperative evaluation and subject to greater control by users.
- Cooperative prototyping, facilitated by focus groups: designers work with users to prepare prototypes; stable prototypes are provided and users evaluate them; there is a tight feedback loop with the designers.

Ethnography
- Standard practice in anthropology; researchers strive to immerse themselves in the situation they want to learn about.
- Goal: understand the 'real' work situation.
- Typically applies video: videos are viewed, reviewed, logged, analyzed; collections are made, often placed in databases, retrieved, visualized ...
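As a minimal sketch of the logging-and-retrieval step mentioned above, the snippet below (Python, with invented codes and clips) shows one way timestamped video-log entries could be coded and then retrieved by activity code; real ethnographic analysis tools (such as the packages mentioned earlier) are far richer.

```python
# Invented video-log entries: (time into tape in seconds, actor, activity code, note)
video_log = [
    (120, "nurse",   "handover",   "reads whiteboard aloud"),
    (415, "patient", "chat",       "asks about visiting hours"),
    (630, "nurse",   "handover",   "updates whiteboard"),
    (901, "carer",   "navigation", "cannot find the back button"),
]

def clips_with_code(log, code):
    """Return all logged clips tagged with a given activity code."""
    return [entry for entry in log if entry[2] == code]

for t, actor, code, note in clips_with_code(video_log, "handover"):
    print(f"{t // 60:02d}:{t % 60:02d}  {actor}: {note}")
```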
Predictive evaluation
- Predicts aspects of usage rather than observing and measuring them.
- Doesn't involve users, so it is cheaper.

Predictive evaluation methods
- Inspection methods: standards inspections, consistency inspections, heuristic evaluation, "discount" usability evaluation, walkthroughs.
- Modelling: the keystroke-level model.

Standards inspections
Standards experts inspect the interface for compliance with specified standards; relatively little task knowledge is required.

Consistency inspections
Teams of designers inspect a set of interfaces for a family of products, usually with one designer from each project.

Usage simulations
Also known as "expert review" or "expert simulation". Experts simulate the behavior of less-experienced users and try to anticipate usability problems. More efficient than user trials; gives prescriptive feedback.

Heuristic evaluation
A usage simulation in which the system is evaluated against a list of "heuristics" (see the sample heuristics below). Two passes are made: one per screen, and one for the flow from screen to screen. Study: 5 evaluators found 75% of problems.

Sample heuristics
- Use simple and natural dialogue
- Speak the user's language
- Minimize user memory load
- Be consistent
- Provide feedback
- Provide clearly marked exits
- Provide shortcuts
- Provide good error messages
- Prevent errors

Discount usability engineering
- Phase 1: usability testing + scenario construction (1-3 users).
- Phase 2: scenarios refined + heuristic evaluation.
- "Discount" features: small scenarios and paper mockups; informal think-aloud (no psychologists); scenarios + think-aloud + heuristic evaluation; a small number of heuristics (see the previous slide); 2-3 testers are sufficient.

Walkthroughs
Goal: detect problems early on so they can be removed. Construct carefully designed tasks from a system specification or screen mockup, walk through the activities required, predict how users would likely behave, and determine the problems they will encounter (see the checklist for cognitive walkthrough).

Modeling: the keystroke-level model
Goal: calculate task performance times for experienced users. Requires a specification of system functionality and a task analysis that breaks each task down into its components.

Keystroke-level modeling
Time to execute is the sum of:
- Tk - keystroking (0.35 sec)
- Tp - pointing (1.10)
- Td - drawing (problem-dependent)
- Tm - mental preparation (1.35)
- Th - homing (0.4)
- Tr - system response (1.2)

KLM: example
Save a file with a new name in a word processor that uses a mouse and pull-down menus:
(1) initial homing (Th)
(2) move cursor to the file menu at the top of the screen (Tp + Tm)
(3) select 'Save As' in the file menu: click on the file menu, move down the file menu, click on 'Save As' (Tm + Tk + Tp + Tk)
(4) the word processor prompts for a new file name, and the user types the filename (Tr + Tm + Tk per filename character + Tk)
A worked version of this calculation is sketched below.
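Here is a minimal sketch (Python) that totals the operator times listed above for the 'Save As' example. The 8-character filename and the exact operator sequence per step are assumptions made only to complete the arithmetic.

```python
# KLM operator times in seconds, as given above.
T = {"k": 0.35, "p": 1.10, "m": 1.35, "h": 0.40, "r": 1.20}

filename_len = 8  # assumed filename length

steps = [
    ("initial homing",              [T["h"]]),
    ("move cursor to file menu",    [T["p"], T["m"]]),
    ("select 'Save As' in menu",    [T["m"], T["k"], T["p"], T["k"]]),
    ("type new filename + confirm", [T["r"], T["m"]] + [T["k"]] * filename_len + [T["k"]]),
]

total = 0.0
for name, ops in steps:
    t = sum(ops)
    total += t
    print(f"{name}: {t:.2f} s")

print(f"predicted task time: {total:.2f} s")  # about 11.7 s under these assumptions
```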
Experiments and benchmarking
- traditional experiments
- usability engineering

Traditional experiments
Typically narrowly defined, evaluating particular aspects such as menu depth vs. context, icon design, or tickers vs. fade-boxes vs. replace-boxes. Usually not practical to include in the design process.

Example: Star workstation, text selection
- Goal: evaluate methods for selecting text using 1-3 mouse buttons.
- Operations: Point (between characters; the target of a move, copy, or insert); Select text (character, word, sentence, paragraph, document); Extend the selection to include more text.

Selection schemes
Schemes A-F each assigned the operations (Point; select character, word, sentence, paragraph or document; draw-through; Adjust) to mouse buttons 1-3 in a different combination; scheme G was introduced later, in test 2.

Methodology
- Between-subjects paradigm: six groups, 4 subjects per group; in each group, 2 experienced with a mouse and 2 not.
- Each subject is first trained in the use of the mouse and in editing techniques in the Star word-processing system, then taught the assigned scheme.
- Each subject performs 10 text-editing tasks, 6 times each.

Results: selection time
Scheme A: 12.25 s; Scheme B: 15.19 s; Scheme C: 13.41 s; Scheme D: 13.44 s; Scheme E: 12.85 s; Scheme F: 9.89 s (p < 0.001).

Results: selection errors
- On average, 1 selection error per four tasks.
- 65% of errors were draw-through errors, the same across all selection schemes.
- 20% of errors were "too many clicks"; schemes with less clicking did better.
- 15% of errors were "clicked wrong mouse button"; schemes with fewer buttons did better.

Selection scheme: test 2
- The results of test 1 led to the conclusion to avoid draw-throughs, three buttons and multiple clicking.
- Scheme G was introduced; it avoids draw-through and uses only 2 buttons.
- A new test was run, but the test groups were 3:1 experienced with a mouse to not.

Results of test 2
- Mean selection time: 7.96 s for scheme G; the frequency of "too many clicks" stayed about the same.
- Conclusion: scheme G is acceptable. The selection time is shorter, and the advantage of quick selection balances the moderate error rate of multi-clicking.

Experimental design - concerns
- What to change? What to keep constant? What to measure?
- A hypothesis, stated in a way that can be tested.
- Statistical tests: which ones, and why? (A minimal analysis sketch follows the designs below.)

Variables
- Independent variable: the one the experimenter manipulates (input).
- Dependent variable: affected by the independent variable (output).
- Experimental effect: changes in the dependent variable caused by changes in the independent variable.
- Confounded: when the dependent variable changes because of other variables (task order, learning, fatigue, etc.).

Selecting subjects - avoiding bias
- Age bias: cover the target age range.
- Gender bias: equal numbers of males and females.
- Experience bias: similar levels of experience with computers, etc.

Experimental designs
- Independent subject design: a single group of subjects is allocated randomly to each of the experimental conditions.
- Matched subject design: subjects are matched in pairs, and pairs are allocated randomly to each of the experimental conditions.
- Repeated measures design: all subjects appear in all experimental conditions (concerns: order of tasks, learning effects).
- Single subject design: in-depth experiments on just one subject.
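Following up the "which statistical tests" question, here is a minimal sketch (Python with SciPy) of how a between-subjects comparison like the Star study could be analyzed with a one-way ANOVA and a follow-up pairwise test. The per-subject selection times are invented, loosely echoing the ordering reported above (F fastest, B slowest).

```python
from scipy import stats

# Invented mean selection times (seconds) for 4 subjects per group.
scheme_a = [12.0, 12.6, 11.9, 12.5]
scheme_b = [15.4, 14.8, 15.6, 15.0]
scheme_f = [9.7, 10.1, 9.6, 10.2]

# One-way ANOVA across the independent groups (between-subjects design).
f_stat, p_value = stats.f_oneway(scheme_a, scheme_b, scheme_f)
print(f"ANOVA: F = {f_stat:.1f}, p = {p_value:.4f}")

# Follow-up pairwise comparison, e.g. scheme F vs. scheme A (unpaired t-test).
t_stat, p_pair = stats.ttest_ind(scheme_f, scheme_a)
print(f"F vs. A: t = {t_stat:.1f}, p = {p_pair:.4f}")
```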
Critical review of experimental procedure
- User preparation: adequate instructions and training?
- Impact of variables: how do changes in the independent variables affect users?
- Structure of the tasks: were tasks complex enough, and did users know the aim?
- Time taken: fatigue or boredom?

Critical review of experimental results
- Size of effect: statistically significant? Practically significant?
- Alternative interpretations: other possible causes for the results found?
- Consistency between dependent variables: task completion and error scores versus user preferences and learning scores.
- Generalization of results: to other tasks, users, working environments?

Usability engineering
- Usability of a product is specified quantitatively, and in advance.
- As the product is built, it can be demonstrated that it does or does not reach the required levels of usability.
- Define usability goals through metrics.
- Set planned levels of usability that need to be achieved.
- Analyze the impact of various design solutions.
- Incorporate user-derived feedback in product design.
- Iterate through the design-evaluate-design loop until the planned levels are achieved.

Metrics
Metrics include:
- time to complete a particular task
- number of errors
- attitude ratings by users

Metrics - example, conferencing system

Attribute | Measuring concept | Measuring method | Worst case | Planned level | Best case | Now level
Initial use | Conferencing task | successful interactions / 30 min | 1-2 | 3-4 | 8-10 | ?
Infrequent use | Tasks after 12 weeks' disuse | % of errors | equal to product Z | 50% better | 0 errors | ?
Learning rate | Task | 1st half vs. 2nd half score | two halves equal | second half better | 'much' better | ?
Preference over product Z | Questionnaire score | ratio of scores | same as Z | - | none prefer Z | ?
Preference over product A | Questionnaire score | ratio of scores | same as Q | - | none prefer Q | ?
Error recovery | Critical incident analysis | % incidents accounted for | 10% | 50% | 100% | ?
Initial evaluation | Attitude questionnaire | semantic differential score | 0 (neutral) | 1 (somewhat positive) | 2 (highly positive) | ?
Casual evaluation | Attitude questionnaire | semantic differential score | 0 (neutral) | 1 (somewhat positive) | 2 (highly positive) | ?

Benchmark tasks
- Carefully constructed standard tests used to monitor users' performance in usability testing.
- Typically use multiple videos and keyboard logging.
- Controlled testing: a specified set of users, well-specified tasks, a controlled environment.
- Tasks are longer than in scientific experiments, shorter than "real life".

Making tradeoffs
- Impact analysis is used to establish priorities among usability attributes. It is a listing of attributes and proposed design decisions, with the % impact of each.
- Usability engineering is reported to produce a measurable improvement in usability of about 30%.
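To illustrate how planned levels can be checked as measurements come in, here is a minimal sketch (Python) using two rows from the example table above, taking the lower bound of each range as the target; the measured "now" values are invented.

```python
# Each metric: (worst case, planned level, best case) on a numeric scale, higher = better.
# Lower bounds of the ranges from two rows of the example table above.
metrics = {
    "Initial use (successful interactions / 30 min)": (1, 3, 8),
    "Initial evaluation (semantic differential score)": (0, 1, 2),
}

# Invented "now level" measurements from the latest evaluation round.
now = {
    "Initial use (successful interactions / 30 min)": 2,
    "Initial evaluation (semantic differential score)": 1,
}

for attribute, (worst, planned, best) in metrics.items():
    measured = now[attribute]
    if measured >= planned:
        verdict = "meets planned level"
    elif measured > worst:
        verdict = "above worst case but below planned level - iterate"
    else:
        verdict = "at or below worst case - redesign"
    print(f"{attribute}: now {measured} "
          f"(worst {worst}, planned {planned}, best {best}) -> {verdict}")
```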