User-centered System Evaluation
Reference
• Diane Kelly (2009). Methods for Evaluating
Interactive Information Retrieval Systems with
Users. Foundations and Trends in Information
Retrieval, 3(1-2), 1-224. DOI:
10.1561/1500000012
INTRODUCTION
Interactive Information Retrieval (IIR)
• Traditional IR evaluations abstract users out of
the evaluation process
• IIR focuses on user’s behaviors and
experiences,
– physical, cognitive and affective
– Interactions between users and systems
– Interactions between users and information
Different evaluation questions
• Classic IR evaluation (non-user centric): does
this system retrieve relevant document?
• IIR evaluation (user-centric): can people use
this system to retrieve relevant documents.
• Therefore: IIR is viewed as a sub-area of HCI
Relevance Feedback
• Same information needs  different queries
 different search results  different
relevance feedback.
• Dealing with users is difficult as causes and
consequences of interactions cannot be
observed easily (it is in user’s head)
• The available observation: query, save a
document, provide relevance feedback.
• Based on these observation, we must infer
Difficulties
• Each individual user has a different cognitive
composition and behavioral disposition
• Some interactions are not easily observable
nor measurable
– Motivation,
– How much to know the topic
– expectations
IIR
• Using users to evaluate IR
• Different approaches
– Using users to evaluate the research results of a
system (users are treated as black boxes)
– Search log analysis (queries, search results and clickthrough behavior)
– TREC Interactive Track evaluation model (evaluating a
system or interface)
– General information search behavior in electronic
environments (observing and documenting user’s
natural search behaviors and interactions)
APPROACHES
Research goals
• Setting up a clear research goal:
– Exploration: when the subject is less known, focusing
on learning the subject, rather than make prediction,
open-end research questions or hypotheses are
uncommon.
– Description: documenting and describing a subject
(query log or query behavior analysis), to provide
benchmark description and classification, results can
be used to inform other studies
– Explanation: examine the relationship between two or
more variables with the goal of prediction and
explanation, establish causality,
Approaches
• Evaluations vs. Experiments
– Evaluation: to assess the goodness of a system,
interface or interaction technique.
– Experiments: to understand behavior, (similar as
experiments in psychology or education), compare at
least two things.
• Lab and naturalistic studies
– Lab (more controls) vs. naturalistic (less controls)
• Longitudinal studies
– Taking place over an extended period of time and
measurements are taken at fixed intervals.
Approaches
• Case studies
– The intensive study of a small number of cases
– A case maybe a user, a system or an organization.
– It usually takes place in naturalistic settings and involve some
longitudinal elements.
– Not for generalizing rather than gaining an in-depth view of a
particular case.
• Wizard of Oz studies and simulations
–
–
–
–
–
Testing “non-real” or simulated system
Used for proof-of-concept
Provide an indication of what might happen in ideal circumstances
Wizard of Oz studies are simulations
Simulated users can represent different actions or steps a real user
might take while interacting with an IR system
RESEARCH BASICS
Problems and Questions
• Identify and describe problems
– Provide roadmap for research
• Example of research questions
– Exploratory:
• How do people re-find information on the Web?
– Descriptive:
• What Web browser functionalities are currently being used during
web-based information-seeking tasks
– Explanatory:
• What are the differences between written and spoken queries in
terms of their retrieval characteristics and performance
outcomes?
• What is the relationship between query box size and query length?
What is the relationship between query length and performance?
Theory
• A theory is a system of logical principles that
attempts to explain relations among natural,
observable phenomena.
• Theory is abstract, general, can generate more
specific hypotheses
Hypotheses
• Hypotheses state expected relationships
between two variables
• Alternative hypotheses vs. null hypotheses
– Specific relationship vs. no relationship
• Hypotheses can be directional or nondirectional
Variables and measurement
• Variables represent concept
• To analyze concepts
– Conceptualization
• To define concepts: provide temporary definition, divide into
dimensions
– Operationalization
• How to measure the concept:
• Direct and indirect observables
– Directly observed:
• # of queries entered, the amount time spent searching
– Indirectly observed:
• User satisfaction
Variables
• Independent: the causes
– examining differences in how males and females use
an experimental and baseline IIR system
– Sex is independent variables
• Dependent: the effects
– E.g., Satisfaction or performance of the systems.
• Confounding variables
– Affect the independent or dependent variable, but
have not been controlled by the researcher.
– E.g., maybe males are more familiar with these
systems than females.
Measurement
• Range of variation
– Preciseness of the measure
– E.g., category of usage frequency of a system
• Exhaustiveness
– Complete list of choices
• Exclusiveness
– How to differentiate partially relevant vs. somewhat relevant (in your
relevance rubric)
• Equivalence
– Find items that are of the same type and at the same level of
specificity
• Different scales: I know details=very familiar, I know nothing=very unfamiliar
• Appropriateness
– How likely are you to recommend this system to others? Scale: a fivepoint scale with strongly agree and strongly disagree – which does not
match the question
Level of Measurement
• Two basic levels of measurement: discrete vs.
continuous
– Discrete measures: categorical responses
• Nominal: no order
– E.g., interface type, sex, task-type
• Ordinal: ordered
– Rank-order (from most relevant to least relevant) or Likert-type
order (five-point scale with 1=not relevant, 5=relevant)
– Relative measure
» one subject’s 2 may not represent the same thing internally
as another subject’s 2.
» we could not say that a document rated 4 was twice as
relevant as a document rated 2 since the scale contains no
true zero
Level of Measurement
• Two basic levels of measurement: discrete vs.
continuous
– Continuous measure: interval vs. ratio
• Different between consecutive points are equal, but there is
no true zero for interval scales
– Fahrenheit temperature scale, IQ test scores
– Zero does not mean no heat or no intelligence
– The differences between 50 vs. 80 and 90 vs. 120 are same
• Ratio: the highest level of measurement: the number of
occurrences.
– There is a true zero
– E.g. time, number of pages viewed (zero is meaningful)
EXPERIMENTAL DESIGN
• The basic experimental design in IIR
evaluation examines the relationship between
two or more systems or interfaces
(independent variable) on some set of
outcome measures (dependent variables)
IIR design
• General goal of IIR is to determine if a
particular system helps subjects find relevant
documents
• Developing a valid baseline in IIR evaluation
involves identifying and blending the status
quo and the experimental system.
• Random assignment can be used to increase
the characteristics being evenly distributed
across groups
Factorial Designs
• Good for studying the impact of more than
one stimulus or variable
Rotation and counterbalancing
• The primary purpose of rotation and
counterbalancing is to control for order effects
and to increase the change that results can be
attributed to the experimental treatments and
conditions.
• Rotating variables:
– Latin square design
– Graeco-Latin square design
Rotation and counterbalancing
A basic design with no rotation. Numbers in cells represent different topics
Cons:
1. Order effects
2. Some topics are easier than others, some systems may do better with some
topics than others.
3. Fatigue can impact the results
Latin Square rotation
Basic Latin Square
rotation of topics
Problems:
-Interaction among topics
- the order effects of
interfaces still exist
Basic Latin Square rotation
of topics and
randomization of columns
Graeco-Latin Square Design
• To solve the problem of orders of interfaces
existing above.
• Graeco-Latin Square is a combination of two
or more Latin squares.
Graeco-Latin Square Design
Study mode
• Batch-mode
– Multiple subjects complete the study at the same location
and time
• Single-mode
– Subjects complete the study alone, with only the
researcher present.
• The choice of mode is determined by the purpose of
the study.
– Single-mode: if each subject has to be interviewed, or
some interactive communication needed between subject
and researcher
– Batch-mode: self-contained, efficient (but subject can
influence each other)
Protocols
• A protocol is a step by step account of what
will happen in a study.
• Protocol helps maintain the integrity of the
study and ensure that subjects experience the
study in similar ways.
Tutorials
• Provide some instruction on how to use a new
IIR system
– Printed materials
– Verbal instructions
– Video tutorial
• Try to avoid bias in the tutorial
– Such as specially focusing on one special feature.
Pilot testing
• To estimate time
• To identify problems with instruments,
instructions, and protocols
• To get detailed feedback from test subjects
SAMPLING
Sampling
• It is not possible to include all elements from a
population in a study
• The population in IIR evaluation is assumed to
be all people who engage in online
information search.
• The size of sample: the more the better
• Two approaches to sampling: probability
sampling and non-probability sampling
Probability Sampling
• Selecting a sample from a population that
maintains the same variation and diversity that
exists within the population.
• Representative sample:
– In a population: 60% are males and 40% are females,
then your representative sample would also contain
roughly the same ratio of males and females.
– Increase the generalizability of the results
– Assumes that all elements in the population have an
equal chance of being selected.
Probability sampling
• Simple random sampling
– Randomly pick up an element
• Systematic sampling
– Pick up every kth element, where k=population
size/sample size
• Stratified sampling
– Subdivide the population into more refined groups
according to specific strata
– Select a sample that is proportionate to the
population in each strata.
Non-probability sampling
• Used when all of the elements in a population
is unknown, or not available.
• It limits its ability to generalize
• Researchers should be cautious when
generalizing their data and be aware of the
sampling limitations in their research.
Non-probability sampling
• Three major types of non-probability sampling:
– Convenience: relying on available elements the
researcher can access: undergraduate students,
people is located closer to the researcher.
– Purposive or judgmental sampling: a researcher
selects subjects or other elements that have particular
characteristics, expertise or perspectives
– Quota sampling: similar as stratified sampling, but the
subjects for the strata are based on a first-come-firstserved policy.
Subject Recruitment
• Many ways to recruit subjects
– Send solicitations to mailinglists
– Inviting
– Using referral services
– Crowdsourcing
– Mechanical Turk
– Web advertising
– Mass mailings
– Virtual posting in online locations
– Pros and Cons: using lab mates, or own research group members as
study subjects
COLLECTIONS
Collections for testing
• Identification of a set of documents for
subjects to search, a set of tasks or topics
which directs this searching, and the ground
truth about the relevance of the searched
objects to the topics -
• A test collection: corpus, topics, and relevance
judgments
TREC collections
• TREC Interactive and HARD tracks
– Newswire, blog, legal
– Artificial topics
– Relevance assessment generalization problem
Web corpora
• The major drawback is that it is impossible to
replicate the study since the Web is constantly
changing.
• The same queries issued at different time can
get completely different results
Natural corpora
• Corpora assembled over time by study
participants
– Pros: meaningful to subjects, controllable
– Cons: lack of replicability and equivalence across
participants,
Tasks and topics
• Most information needs can be characterized
in terms of tasks and topics
– Information need = task = topic
• Information needs
– People do not know their information needs
– People have difficulties to articulate their
information needs
– Or using a vocabulary proper for a system
Generating information needs
• It is not clear at what level of specificity a task
or topic should be defined
– Task can be broken down into a series of subtasks, such as writing a research proposal
• Working on the query logs to develop
information needs
DATA COLLECTION TECHNIQUES
Data collection techniques
• Corpora, tasks, topics, and relevance
assessments are major instruments to
evaluate IIR systems
• Other instruments: questionnaires, screen
capture software allow researchers to collect
data.
Think-Aloud
• Subjects articulate their thinking and decisionmaking during the evaluation process of IIR.
• Microphone, recording software,
• It is unnatural as most people do not
articulate their thoughts as they complete
tasks.
Stimulated Recall
• Researcher records the screen of the
computer as the subject completes a
searching task. Then, the recording is played
back to the subject and ask the subject to
articulate thinking and decision-making.
• Tool: screen recording software
Spontaneous and prompted selfreport
• Elicit feedback from subjects periodically while
they search.
• Goal: get more refined feedback about the
search, rather than summative feedback at
the end of the search
observation
• Researcher is seated near subjects and
observes them when they conduct IIR
activities
• Tool: video camera, screen capture software
• Time consuming, and labor intensive
• Prone to selective attention and researcher
bias.
logging
• Analyzing transaction logs.
• Client-side logging provides a more robust and
comprehensive log of the user’s interactions.
• But is very hard to build a client-side logger
Questionnaire
• Consist of
– closed questions where a specific response set is
provides (e.g. a five-point scale)  quantitative
analysis
– open questions  qualitative analysis
• Closed questions: Likert-type scale (e.g. five to
seven point: strongly agree, agree, neutral,
disagree, strongly disagree)
• Open questions: content analysis
• Different modes: electronic, pen-and-paper,
interview
Interview
• Few IIR evaluation consist solely of interviews,
but interviews are a common component of
many study protocols.
• Subjects response to open-ended questions in
interview better than in other two modes
(electronic, or pen-and-paper)
• Interview: structured, semi-structured or open
MEASURES
Four basic measures
• Four basic classes of measures
– Contextual (age, sex, search experience, personalitytype),
– Interaction (# of queries issued, # of documents
viewed, query length), can be extracted from log data
– Performance (# of relevant documents saved, mean
average precision, discounted cumulated gain), can be
computed from log data
– Usability: subject attitudes and feelings about the
system and their interactions
contextual
• Individual differences: their impact on the
study results
• Information needs: domain expertise is
measured using credentials
• Persistence of information needs
• Immediacy of information need
• Information-seeking stage
Interaction
• Measures:
– # of queries, # of search results viewed, # of
documents viewed, # of documents saved, query
length
• The implicit definition of interaction is tied to
feedback
Performance
• When directly apply TREC measures to IIR
evaluation, assume: relevance is binary, static,
uni-dimensional and generalizable
• Whether the TREC-based performance metrics
is meaningful to end users
– A measure that evaluates systems based on the
retrieval of 1000 documents is unlikely to be
meaningful to users since most users will not look
through 1000 documents.
Traditional IR performance measures
Interactive recall and precision
Measures that accommodate multilevel relevance and rank
Time-based measures
• A variety of time-based measures
– The length of time subjects spend in different
states or modes
– The amount of time it takes a subject to save the
first relevant articles
– The number of relevant documents saved during a
fixed period of time
– The number of actions or steps taken to complete
a task
Cost and utility measures
• Some search services are not free
• Have always been an important part of the
evaluation of library and information services
Evaluative feedback from subjects
• Usability
– Effectiveness, efficiency and satisfaction as key
dimensions of usability
– Effectiveness: precision, recall
– Efficiency: the time it takes a subject to complete
a task.
– Satisfaction: be satisfied for each different
experimental feature of the system, subject
perceptions of outcomes and interactions
Available instruments for measuring
usability
• Questionnaire for User Interface Satisfaction
(QUIS): http://lap.umd.edu/quis/
– 10-point scale for software, screen, terminology,
system, etc.
• The USE questionnaire
– Usefulness, ease of use, ease of learning, satisfaction
(7-point scale)
• Software Usability Measurement Inventory
(SUMI): http://sumi.ucc.ie/whatis.html
– Agree, do not know and disagree for 50 items
DATA ANALYSIS
Qualitative data analyses
• The goal of most qualitative data analyses that
are constructed in IIR is to reduce the qualitative
responses into a set of categories or themes that
can be used to characterize and summarize
responses.
• Content analysis: it starts with a well-defined and
structured classification scheme, including
categories and classification rules.
• Open coding: the categories are usually
developed inductively during the analysis process
as the researcher analyzes the data.
Quantitative data analysis
VALIDITY AND RELIABILITY
validity
• Internal validity: quality of what happens during
the study
– Whether the selected instrument yields poor or
inaccurate data
• External validity: to what extent the results from
a study can be generalized to the real world.
• Lab studies are generally less valid, but more
reliable than naturalistic studies
• Using instruments with established reliability