User-centered System Evaluation

Reference
• Diane Kelly (2009). Methods for Evaluating Interactive Information Retrieval Systems with Users. Foundations and Trends in Information Retrieval, 3(1-2), 1-224. DOI: 10.1561/1500000012

INTRODUCTION

Interactive Information Retrieval (IIR)
• Traditional IR evaluations abstract users out of the evaluation process
• IIR focuses on users’ behaviors and experiences
  – Physical, cognitive and affective
  – Interactions between users and systems
  – Interactions between users and information

Different evaluation questions
• Classic IR evaluation (non-user-centric): does this system retrieve relevant documents?
• IIR evaluation (user-centric): can people use this system to retrieve relevant documents?
• Therefore, IIR is viewed as a sub-area of HCI

Relevance Feedback
• The same information need can lead to different queries, different search results, and different relevance feedback
• Dealing with users is difficult because the causes and consequences of interactions cannot be observed easily (they are in the user’s head)
• What can be observed: issuing a query, saving a document, providing relevance feedback
• Based on these observations, we must infer what cannot be observed directly

Difficulties
• Each individual user has a different cognitive composition and behavioral disposition
• Some aspects of interaction are not easily observable or measurable
  – Motivation
  – How much the user knows about the topic
  – Expectations

IIR
• Using users to evaluate IR
• Different approaches
  – Using users to evaluate the research results of a system (users are treated as black boxes)
  – Search log analysis (queries, search results and clickthrough behavior)
  – TREC Interactive Track evaluation model (evaluating a system or interface)
  – General information search behavior in electronic environments (observing and documenting users’ natural search behaviors and interactions)

APPROACHES

Research goals
• Setting up a clear research goal:
  – Exploration: used when little is known about the subject; the focus is on learning about the subject rather than making predictions; research questions are open-ended and hypotheses are uncommon
  – Description: documenting and describing a subject (e.g., query log or query behavior analysis) to provide benchmark descriptions and classifications; results can be used to inform other studies
  – Explanation: examining the relationship between two or more variables with the goal of prediction and explanation; establishing causality

Approaches
• Evaluations vs. experiments
  – Evaluation: to assess the goodness of a system, interface or interaction technique
  – Experiment: to understand behavior (similar to experiments in psychology or education); compares at least two things
• Lab and naturalistic studies
  – Lab (more control) vs. naturalistic (less control)
• Longitudinal studies
  – Take place over an extended period of time, with measurements taken at fixed intervals

Approaches
• Case studies
  – The intensive study of a small number of cases
  – A case may be a user, a system or an organization
  – It usually takes place in a naturalistic setting and involves some longitudinal elements
  – The goal is not to generalize but to gain an in-depth view of a particular case
• Wizard of Oz studies and simulations
  – Testing “non-real” or simulated systems
  – Used for proof-of-concept
  – Provide an indication of what might happen in ideal circumstances
  – Wizard of Oz studies are simulations
  – Simulated users can represent different actions or steps a real user might take while interacting with an IR system

RESEARCH BASICS

Problems and Questions
• Identify and describe problems
  – Provide a roadmap for research
• Examples of research questions
  – Exploratory:
    • How do people re-find information on the Web?
  – Descriptive:
    • What Web browser functionalities are currently being used during Web-based information-seeking tasks?
  – Explanatory:
    • What are the differences between written and spoken queries in terms of their retrieval characteristics and performance outcomes?
    • What is the relationship between query box size and query length? What is the relationship between query length and performance?

Theory
• A theory is a system of logical principles that attempts to explain relations among natural, observable phenomena
• Theory is abstract and general, and can generate more specific hypotheses

Hypotheses
• Hypotheses state expected relationships between two variables
• Alternative hypotheses vs. null hypotheses
  – Specific relationship vs.
    no relationship
• Hypotheses can be directional or nondirectional

Variables and measurement
• Variables represent concepts
• To analyze concepts
  – Conceptualization
    • To define concepts: provide a temporary definition, divide into dimensions
  – Operationalization
    • How to measure the concept
• Direct and indirect observables
  – Directly observed:
    • # of queries entered, the amount of time spent searching
  – Indirectly observed:
    • User satisfaction

Variables
• Independent: the causes
  – E.g., examining differences in how males and females use an experimental and a baseline IIR system
  – Sex is the independent variable
• Dependent: the effects
  – E.g., satisfaction with or performance of the systems
• Confounding variables
  – Affect the independent or dependent variable, but have not been controlled by the researcher
  – E.g., males may be more familiar with these systems than females

Measurement
• Range of variation
  – Preciseness of the measure
  – E.g., categories of usage frequency of a system
• Exhaustiveness
  – Complete list of choices
• Exclusiveness
  – How to differentiate partially relevant vs. somewhat relevant (in your relevance rubric)
• Equivalence
  – Find items that are of the same type and at the same level of specificity
    • Different scales: I know details = very familiar, I know nothing = very unfamiliar
• Appropriateness
  – Example of a mismatch: the question “How likely are you to recommend this system to others?” paired with a five-point scale from strongly agree to strongly disagree, which does not match the question

Level of Measurement
• Two basic levels of measurement: discrete vs. continuous
  – Discrete measures: categorical responses
    • Nominal: no order
      – E.g., interface type, sex, task type
    • Ordinal: ordered
      – Rank order (from most relevant to least relevant) or Likert-type order (five-point scale with 1 = not relevant, 5 = relevant)
      – Relative measure
        » One subject’s 2 may not represent the same thing internally as another subject’s 2
        » We cannot say that a document rated 4 is twice as relevant as a document rated 2, since the scale contains no true zero

Level of Measurement
• Two basic levels of measurement: discrete vs. continuous
  – Continuous measures: interval vs. ratio
    • Interval: differences between consecutive points are equal, but there is no true zero
      – E.g., Fahrenheit temperature scale, IQ test scores
      – Zero does not mean no heat or no intelligence
      – The difference between 50 and 80 is the same as the difference between 90 and 120
    • Ratio: the highest level of measurement (e.g., the number of occurrences)
      – There is a true zero
      – E.g., time, number of pages viewed (zero is meaningful)

EXPERIMENTAL DESIGN
• The basic experimental design in IIR evaluation examines the relationship between two or more systems or interfaces (independent variable) and some set of outcome measures (dependent variables)

IIR design
• The general goal of IIR evaluation is to determine whether a particular system helps subjects find relevant documents
• Developing a valid baseline in IIR evaluation involves identifying and blending the status quo and the experimental system
• Random assignment can be used to increase the likelihood that subject characteristics are evenly distributed across groups

Factorial Designs
• Good for studying the impact of more than one stimulus or variable

Rotation and counterbalancing
• The primary purpose of rotation and counterbalancing is to control for order effects and to increase the chance that results can be attributed to the experimental treatments and conditions
• Rotating variables:
  – Latin square design
  – Graeco-Latin square design

Rotation and counterbalancing
[Table: a basic design with no rotation. Numbers in cells represent different topics.]
Cons:
  1. Order effects
  2. Some topics are easier than others, and some systems may do better with some topics than others
  3.
Fatigue can impact the results

Latin Square rotation
[Table: basic Latin square rotation of topics]
Problems:
  – Interaction among topics
  – The order effects of interfaces still exist
[Table: basic Latin square rotation of topics and randomization of columns]

Graeco-Latin Square Design
• Addresses the remaining problem above: the order effects of interfaces
• A Graeco-Latin square is a combination of two or more Latin squares
[Figure: Graeco-Latin square design]

Study mode
• Batch mode
  – Multiple subjects complete the study at the same location and time
• Single mode
  – Subjects complete the study alone, with only the researcher present
• The choice of mode is determined by the purpose of the study
  – Single mode: if each subject has to be interviewed, or some interactive communication is needed between subject and researcher
  – Batch mode: self-contained and efficient (but subjects can influence each other)

Protocols
• A protocol is a step-by-step account of what will happen in a study
• A protocol helps maintain the integrity of the study and ensures that subjects experience the study in similar ways

Tutorials
• Provide some instruction on how to use a new IIR system
  – Printed materials
  – Verbal instructions
  – Video tutorial
• Try to avoid bias in the tutorial
  – Such as focusing on one particular feature

Pilot testing
• To estimate time
• To identify problems with instruments, instructions, and protocols
• To get detailed feedback from test subjects

SAMPLING

Sampling
• It is not possible to include all elements from a population in a study
• The population in IIR evaluation is assumed to be all people who engage in online information search
• Sample size: the larger the better
• Two approaches to sampling: probability sampling and non-probability sampling

Probability Sampling
• Selecting a sample from a population that maintains the same variation and diversity that exists within the population
• Representative sample:
  – If 60% of a population is male and 40% is female, a representative sample contains roughly the same ratio of males and females
  – Increases the generalizability of the results
  – Assumes that all elements in the population have an equal chance of being selected

Probability sampling
• Simple random sampling
  – Pick elements at random
• Systematic sampling
  – Pick every kth element, where k = population size / sample size
• Stratified sampling
  – Subdivide the population into more refined groups according to specific strata
  – Select a sample that is proportionate to the population in each stratum

Non-probability sampling
• Used when the elements of a population are not all known or not all available
• It limits the ability to generalize
• Researchers should be cautious when generalizing from their data and be aware of the sampling limitations in their research

Non-probability sampling
• Three major types of non-probability sampling:
  – Convenience sampling: relying on available elements the researcher can access, such as undergraduate students or people located close to the researcher
  – Purposive or judgmental sampling: a researcher selects subjects or other elements that have particular characteristics, expertise or perspectives
  – Quota sampling: similar to stratified sampling, but the subjects for the strata are chosen on a first-come, first-served basis
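The probability sampling schemes listed above (simple random, systematic with k = population size / sample size, and proportionate stratified sampling) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the `stratum_of` callback and the fixed seeds are assumptions made for the example, not something the slides prescribe.

```python
import random
from collections import defaultdict


def simple_random_sample(population, n, seed=0):
    """Pick n elements at random; every element has an equal chance."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def systematic_sample(population, n, seed=0):
    """Pick every kth element, with k = population size / sample size."""
    k = len(population) // n          # sampling interval (assumes n <= len(population))
    rng = random.Random(seed)
    start = rng.randrange(k)          # random starting point inside the first interval
    return [population[start + i * k] for i in range(n)]


def stratified_sample(population, n, stratum_of, seed=0):
    """Sample each stratum in proportion to its share of the population.

    stratum_of is a hypothetical callback that maps an element to its stratum.
    Rounding can make the total differ slightly from n for uneven strata.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for element in population:
        strata[stratum_of(element)].append(element)
    sample = []
    for members in strata.values():
        quota = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, quota))
    return sample


# Example: a population that is 60% male and 40% female, as in the slide.
people = [{"id": i, "sex": "M" if i < 60 else "F"} for i in range(100)]
chosen = stratified_sample(people, 10, lambda p: p["sex"])
```

With this population and a sample size of 10, the stratified sample contains 6 males and 4 females, mirroring the 60/40 ratio of the population.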

Subject Recruitment
• Many ways to recruit subjects
  – Sending solicitations to mailing lists
  – Inviting
  – Using referral services
  – Crowdsourcing
  – Mechanical Turk
  – Web advertising
  – Mass mailings
  – Virtual posting in online locations
  – Pros and cons: using lab mates or own research group members as study subjects

COLLECTIONS

Collections for testing
• Identification of a set of documents for subjects to search, a set of tasks or topics which directs this searching, and the ground truth about the relevance of the searched objects to the topics
• A test collection: corpus, topics, and relevance judgments

TREC collections
• TREC Interactive and HARD tracks
  – Newswire, blog, legal
  – Artificial topics
  – Relevance assessment generalization problem

Web corpora
• The major drawback is that it is impossible to replicate the study, since the Web is constantly changing
• The same queries issued at different times can get completely different results

Natural corpora
• Corpora assembled over time by study participants
  – Pros: meaningful to subjects, controllable
  – Cons: lack of replicability and equivalence across participants

Tasks and topics
• Most information needs can be characterized in terms of tasks and topics
  – Information need = task = topic
• Information needs
  – People do not know their information needs
  – People have difficulty articulating their information needs
  – Or have difficulty expressing them in a vocabulary suited to a system

Generating information needs
• It is not clear at what level of specificity a task or topic should be defined
  – A task, such as writing a research proposal, can be broken down into a series of subtasks
• Query logs can be used to develop information needs

DATA COLLECTION TECHNIQUES

Data collection techniques
• Corpora, tasks, topics, and relevance assessments are the major instruments used to evaluate IIR systems
• Other instruments, such as questionnaires and screen capture software, allow researchers to collect data

Think-Aloud
• Subjects articulate their thinking and decision-making during the IIR evaluation process
• Tools: microphone, recording software
• It is unnatural, as most people do not articulate their thoughts as they complete tasks

Stimulated Recall
• The researcher records the computer screen as the subject completes a search task. The recording is then played back to the subject, who is asked to articulate their thinking and decision-making
• Tool: screen recording software

Spontaneous and prompted self-report
• Elicit feedback from subjects periodically while they search
• Goal: get more refined feedback about the search, rather than summative feedback at the end of the search

Observation
• The researcher is seated near subjects and observes them while they conduct IIR activities
• Tools: video camera, screen capture software
• Time-consuming and labor-intensive
• Prone to selective attention and researcher bias

Logging
• Analyzing transaction logs
• Client-side logging provides a more robust and comprehensive log of the user’s interactions
• But a client-side logger is very hard to build

Questionnaire
• Consists of
  – Closed questions, where a specific response set is provided (e.g., a five-point scale): quantitative analysis
  – Open questions: qualitative analysis
• Closed questions: Likert-type scale (e.g., five to seven points: strongly agree, agree, neutral, disagree, strongly disagree)
• Open questions: content analysis
• Different modes: electronic, pen-and-paper, interview

Interview
• Few IIR evaluations consist solely of interviews, but interviews are a common component of many study protocols
• Subjects respond to open-ended questions better in interviews than in the other two modes (electronic or pen-and-paper)
• Interview types: structured, semi-structured or open

MEASURES

Four basic measures
• Four basic classes of measures
  – Contextual (age, sex, search experience, personality type)
  – Interaction (# of queries issued, # of documents viewed, query length); can be extracted from log data
  – Performance (# of relevant documents saved, mean average precision, discounted cumulated gain); can be computed from log data
  – Usability: subjects’ attitudes and feelings about the system and their interactions

Contextual
• Individual differences: their impact on the study results
• Information needs: domain expertise is measured using credentials
• Persistence of information needs
• Immediacy of information need
• Information-seeking stage

Interaction
• Measures:
  – # of queries, # of search results viewed, # of documents viewed, # of documents saved, query length
• The implicit definition of interaction is tied to feedback

Performance
• Directly applying TREC measures to IIR evaluation assumes that relevance is binary, static, uni-dimensional and generalizable
• It is questionable whether TREC-based performance metrics are meaningful to end users
  – A measure that evaluates systems based on the retrieval of 1000 documents is unlikely to be meaningful to users, since most users will not look through 1000 documents
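As a rough illustration of the performance measures named above, the sketch below computes precision and recall at a small cutoff, average precision for one topic (mean average precision is its mean over topics), and discounted cumulated gain. It assumes binary relevance for the first three measures and graded gains for DCG. The function names are hypothetical, and the DCG discount shown (gain divided by log2(rank + 1)) is one common formulation rather than necessarily the exact one used in the referenced work.

```python
import math


def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def recall_at_k(ranking, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / len(relevant)


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def dcg(gains):
    """Discounted cumulated gain over graded relevance gains listed in rank order."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


# Hypothetical ranked list for one topic; "d9" is relevant but was never retrieved.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d9"}
p5 = precision_at_k(ranking, relevant, 5)
ap = average_precision(ranking, relevant)
```

A cutoff such as k = 5 or 10 is closer to what a user actually inspects than a 1000-document evaluation depth, which is the concern raised in the last bullet above.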

Traditional IR performance measures

Interactive recall and precision

Measures that accommodate multilevel relevance and rank

Time-based measures
• A variety of time-based measures
  – The length of time subjects spend in different states or modes
  – The amount of time it takes a subject to save the first relevant article
  – The number of relevant documents saved during a fixed period of time
  – The number of actions or steps taken to complete a task

Cost and utility measures
• Some search services are not free
• These measures have always been an important part of the evaluation of library and information services

Evaluative feedback from subjects
• Usability
  – Effectiveness, efficiency and satisfaction are the key dimensions of usability
  – Effectiveness: precision, recall
  – Efficiency: the time it takes a subject to complete a task
  – Satisfaction: satisfaction with each experimental feature of the system; subjects’ perceptions of outcomes and interactions

Available instruments for measuring usability
• Questionnaire for User Interface Satisfaction (QUIS): http://lap.umd.edu/quis/
  – 10-point scale for software, screen, terminology, system, etc.
• The USE questionnaire
  – Usefulness, ease of use, ease of learning, satisfaction (7-point scale)
• Software Usability Measurement Inventory (SUMI): http://sumi.ucc.ie/whatis.html
  – Agree, do not know and disagree for 50 items

DATA ANALYSIS

Qualitative data analyses
• The goal of most qualitative data analyses conducted in IIR is to reduce the qualitative responses to a set of categories or themes that can be used to characterize and summarize responses
• Content analysis: starts with a well-defined and structured classification scheme, including categories and classification rules
• Open coding: the categories are usually developed inductively during the analysis process, as the researcher analyzes the data
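The interaction and time-based measures listed earlier (number of queries, query length, time to the first relevant save, relevant documents saved within a fixed period) are typically derived from a timestamped interaction log. The sketch below assumes a hypothetical log format, an `Event` record with a time offset in seconds, an action name and a value; real loggers will differ, so treat the field names and action labels as placeholders.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float      # seconds since the task started (assumed log field)
    action: str   # "query", "view" or "save" (assumed action labels)
    value: str    # query text or document id


def interaction_measures(log):
    """Counts of queries, viewed and saved documents, and mean query length in words."""
    queries = [e.value for e in log if e.action == "query"]
    mean_len = sum(len(q.split()) for q in queries) / len(queries) if queries else 0.0
    return {
        "num_queries": len(queries),
        "mean_query_length": mean_len,
        "num_viewed": sum(1 for e in log if e.action == "view"),
        "num_saved": sum(1 for e in log if e.action == "save"),
    }


def time_to_first_relevant_save(log, relevant):
    """Time of the earliest save of a relevant document, or None if there is none."""
    for e in sorted(log, key=lambda e: e.t):
        if e.action == "save" and e.value in relevant:
            return e.t
    return None


def relevant_saved_within(log, relevant, limit):
    """Number of distinct relevant documents saved in the first `limit` seconds."""
    return len({e.value for e in log
                if e.action == "save" and e.value in relevant and e.t <= limit})


# Hypothetical session log for one subject and one task.
log = [
    Event(0, "query", "jaguar habitat"),
    Event(12, "view", "d1"),
    Event(30, "save", "d1"),
    Event(45, "query", "jaguar rainforest range"),
    Event(70, "save", "d2"),
    Event(200, "save", "d3"),
]
relevant = {"d2", "d3"}
first_hit = time_to_first_relevant_save(log, relevant)
```

Measures computed this way are ratio-level (they have a true zero), so they can feed directly into the quantitative analysis of the study.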

Quantitative data analysis

VALIDITY AND RELIABILITY

Validity
• Internal validity: quality of what happens during the study
  – Whether the selected instrument yields poor or inaccurate data
• External validity: to what extent the results from a study can be generalized to the real world
• Lab studies are generally less valid, but more reliable, than naturalistic studies
• Using instruments with established reliability