EVALUATION
in searching
IR systems
Digital libraries
Reference sources
Web sources
© Tefko Saracevic
Definition of
evaluation
Dictionary:
1. assessment of value
the act of considering or examining something in
order to judge its value, quality,
importance, extent, or condition
In searching:
assessment of search results on the basis of given
criteria as related to users and use
criteria may be specified by users or derived from
professional practice, other sources or
standards
Results are judged & with them
the whole process, including
searcher & searching
Importance of
evaluation
Integral part of searching
always there - wanted or not
- no matter what, users will in some way or
another evaluate what they obtained
could be informal or formal
Growing problem for all
information explosion makes finding
“good” stuff very difficult
Formal evaluation part of
professional job & skills
requires knowledge of evaluation
criteria, measures, methods
more & more prized
Place of evaluation
[Diagram: User → Inf. need → Search → Results → Evaluation]
General application
Evaluation (as discussed here) is
applicable to results from a
variety of information systems:
information retrieval (IR) systems,
e.g. Dialog, LexisNexis …
sources included in digital libraries,
e.g. Rutgers
reference services, e.g. in libraries or
commercial ones on the web
web sources e.g. as found on many
domain sites
Many approaches, criteria,
measures, methods are similar &
can be adapted for a specific
source or information system
Broad context
Evaluating the role that an
information system plays as
related to:
SOCIETY - community,
culture, discipline ...
INSTITUTION - university,
organization, company ...
INDIVIDUALS - users &
potential users (nonusers)
These roles lead to broad but hard
questions about what
CONTEXT to choose for
evaluation
Questions asked in
different contexts
Social:
how well does an information
system support social demands &
roles?
- hardest to evaluate
Institutional:
how well does it support
institutional/organizational mission
& objectives?
- tied to objectives of institution
- also hard to evaluate
Individual:
how well does it support inf. needs
& activities of people?
- most evaluations in this context
Approaches to
evaluation
Many approaches exist
quantitative, qualitative …
effectiveness, efficiency ...
each has strong & weak points
Systems approach prevalent
Effectiveness: How well does a
system perform that for which it was
designed?
Evaluation related to objective(s)
Requires choices:
- Which objective or function to evaluate?
Approaches …
(cont.)
Economics approach:
Efficiency: at what cost?
Effort & time are also costs
Cost-effectiveness: cost for a given
level of effectiveness
Ethnographic approach
practices, effects within an
organization, community
learning & using practices &
comparisons
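The cost-effectiveness idea above, cost for a given level of effectiveness, can be sketched as cost per relevant item retrieved. This is a minimal sketch; all figures and names below are hypothetical, not taken from the slides.

```python
# Cost-effectiveness sketch: cost for a given level of effectiveness,
# here expressed as cost per relevant item retrieved.
# All figures are hypothetical, for illustration only.

def cost_per_relevant(total_cost, relevant_retrieved):
    """Total search cost (e.g. vendor charges + searcher time, in $)
    divided by the number of relevant items the search produced."""
    if relevant_retrieved == 0:
        return float("inf")  # no effectiveness at any cost
    return total_cost / relevant_retrieved

connect_charges = 30.00  # hypothetical vendor connect-time charges
searcher_time = 25.00    # hypothetical value of the searcher's effort & time
total = connect_charges + searcher_time

print(cost_per_relevant(total, 11))  # 55.00 / 11 = 5.0 per relevant item
```

Comparing this figure across two searches (or two systems) gives a simple cost-effectiveness comparison at the same level of effectiveness.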
Prevalent approach
Systems approach used in many
different ways & for many purposes – in
evaluation of:
inputs to system & contents
operations of a system
use of a system
outputs from a system
Also, in evaluation of search
outputs for given user(s) and
use
applied on the individual level
- derived from assessments by users or
their surrogates, e.g. searchers
this is what searchers do most often
this is what you will apply in your
projects
Five basic
requirements for
system evaluation
Once a context is selected, you need to
specify ALL five:
1. Construct
o A system, process, source
a given IR system, web site, digital library ...
what are you going to evaluate?
2. Criteria
o to reflect objective(s) of searching
e.g. relevance, utility, satisfaction, accuracy,
completeness, time, costs …
on what basis will you make judgments?
3. Measure(s)
o to reflect criteria in some quantity or
quality
precision, recall, various Likert scales, $$$ ...
how are you going to express judgment?
Requirements …
(cont.)
4. Measuring instrument
o recording by users or user surrogates
(e.g. you) on the measure
expressing whether relevant or not, marking a
scale, indicating cost
people are the instruments – who will it be?
5. Methodology
o procedures for collecting & analyzing
data
how are you going to get all this done?
How will you assemble what is to be
evaluated (construct)? Choose which
criteria? Determine which measures will
reflect the criteria? Establish who will
judge & how the judgment will be
done? How will you analyze results?
Verify validity & reliability?
Requirements …
(cont.)
Ironclad rule:
No evaluation can proceed unless
ALL five of these are specified!
Sometimes the specification of
some is informal & implied,
but they are always there!
1. Constructs
In IR research: most work is done on
test collections & test questions
Text Retrieval Conference - TREC
- evaluation of algorithms, interactions
- reported in research literature
In practice: on use & user level:
mostly done on operational
collections & systems, web sites
e.g. Dialog, LexisNexis, various files
- evaluation & comparison of various contents,
procedures, commands
- user proficiencies, characteristics
- evaluation of interactions
- reported in professional literature
2. Criteria
In IR: Relevance is the basic & most
used criterion
related to the problem at hand
On the user & use level: many other criteria
utility, satisfaction, success, time, value,
impact, ...
Web sources
those + quality, usability, penetration,
accessibility ...
Digital libraries, web sites
those + usability
2. Criteria - relevance
Relevance as criterion
strengths:
- intuitively understood, people know what
it means
- universally applied in information systems
weaknesses:
- not static - changes dynamically, thus hard
to pin down
- tied to cognitive structure & situation of a
user – possible disagreements
Relevance as area of study
- basic notion in information science
- many studies done about various aspects of
relevance
A number of relevance types exist
indicating different relations
- need to specify which ones
2. Criteria - usability
Increasingly used for web sites
& digital libraries
General definition (ISO)
“extent to which a product can be used
by specified users to achieve specified
goals with effectiveness, efficiency,
and satisfaction in a specified context
of use”
Number of criteria
enhancing user performance
ease of operations
serving the intended purpose
learnability – how easy is it to learn & memorize?
lostness – how often do users get lost in using it?
satisfaction
and quite a few more
3. Measures
In IR: Precision & recall are
preferred (treated in Module 4)
based on relevance
relevance can be judged in two or more categories
- e.g. relevant–not relevant;
relevant–partially relevant–not relevant
Problem with recall
how to find everything relevant in a file?
- e.g. estimate; compare broad & narrow searches,
or pool the union of many outputs for comparison
On use & user level
Likert scales - semantic differentials
- e.g. satisfaction on a scale of 1 to x
(1 = not satisfied, x = satisfied)
observational measures
- e.g. overlap, consistency
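The precision & recall measures above, including the pooling workaround for recall, can be sketched in a few lines of Python. All document IDs and judgments below are made-up illustration data.

```python
# Precision & recall based on binary relevance judgments.
# Since true recall requires knowing every relevant item in the file,
# the "relevant" set here stands in for a pool, e.g. the union of the
# judged outputs of several broad & narrow searches.
# All document IDs are hypothetical.

def precision_recall(retrieved, relevant):
    """retrieved: list of doc IDs one search returned;
    relevant: set of doc IDs judged relevant (the pool)."""
    hits = [doc for doc in retrieved if doc in relevant]
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4", "d5"]  # one search's output
relevant = {"d1", "d3", "d6", "d7"}         # pooled relevant set

p, r = precision_recall(retrieved, relevant)
print(p, r)  # 2/5 = 0.4 precision, 2/4 = 0.5 recall
```

Precision expresses how much of the output was worth getting; recall, how much of the worthwhile material the search found.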
4. Instruments
People used as instruments
they judge relevance, scale ...
But people who?
users, surrogates, analysts, domain
experts, librarians ...
How do relevance, utility ...
judges affect results?
who knows?
Reliability of judgments:
about 50 - 60% for experts
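The 50-60% reliability figure refers to how often judges agree with one another. A simple percent-agreement computation can be sketched as follows; the judgments are hypothetical, chosen only to illustrate the calculation.

```python
# Percent agreement between two relevance judges over the same documents.
# The judgments below are hypothetical illustration data.

def percent_agreement(judge_a, judge_b):
    """judge_a, judge_b: dicts mapping doc ID -> judgment label."""
    shared = judge_a.keys() & judge_b.keys()  # docs both judges saw
    agree = sum(1 for doc in shared if judge_a[doc] == judge_b[doc])
    return agree / len(shared)

a = {"d1": "rel", "d2": "rel", "d3": "not", "d4": "rel", "d5": "not"}
b = {"d1": "rel", "d2": "not", "d3": "not", "d4": "not", "d5": "not"}

print(percent_agreement(a, b))  # 3 of 5 judgments agree: 0.6
```

An agreement of 0.6 falls right in the 50-60% range reported for experts, which is why evaluation methodology must state who judges and verify reliability.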
5. Methods
Includes design, procedures
for observations,
experiments, analysis of
results
Challenges:
Validity? Reliability? Reality?
- Collection - selection? size?
- Request - generation?
- Searching - conduct?
- Results - obtaining? judging? feedback?
- Analysis - conduct? tools?
- Interpretation - warranted?
generalizable?
Evaluation of web
sources
Web is value neutral
it has everything from diamonds to trash
Thus evaluation becomes
imperative
and a primary obligation & skill of
professional searchers – you
continues & expands on evaluation
standards & skills in the library tradition
A number of criteria are used
most derived from traditional criteria, but
modified for the web, others added
can be found on many library sites
- librarians provide the public & colleagues
with web evaluation tools and guidelines as
part of their services
Criteria for evaluation
of web & Dlib sources
What? Content
What subject(s), topic(s) covered?
Level? Depth? Exhaustivity?
Specificity? Organization?
Timeliness of content? Up-to-date?
Revisions?
Accuracy?
Why? Intention
Purpose? Scope? Viewpoint?
For? Users, use
Intended audience?
What need satisfied?
Use intended or possible?
How appropriate?
criteria ...
Who done it? Authority
Author(s), institution, company, publisher,
creator:
- What authority? Reputation? Credibility?
Trustworthiness? Refereeing?
- Persistence? Will it be around?
- Is it transparent who is behind it?
How? Treatment
Content treatment:
- Readability? Style? Organization? Clarity?
Physical treatment:
- Format? Layout? Legibility? Visualization?
Usability
Where? Access
How available? Accessible? Restrictions?
Links persistence, stability?
criteria ...
How? Functionality
Searching, navigation, browsing?
Feedback? Links?
Output: Organization? Features?
Variations? Control?
How much? Effort, economics
Time & effort in learning it?
Time & effort in using it?
Price? Total costs? Cost-benefits?
In comparison to? Wider world
Other similar sources?
- where & how may similar or better results
be obtained?
- how do they compare?
Main criteria for web site evaluation
Intention: purpose, scope, viewpoint
Content: coverage, accuracy, timeliness …
Authority: reputation, credibility, “About us”
Users, use: audience, need, appropriateness …
Functionality: navigation, features, output
Quality: …
Treatment: content, layout, visualization …
Access: availability, persistence, links
Effort: in using it, in learning it, time, cost …
Evaluation:
To what end?
To assess & then improve
performance – MAIN POINT
to change searches & search results for
the better
To understand what went on
what went right, what wrong, what
works, what doesn't & then change
To communicate with user
explain & get feedback
To gather data for best practices
conversely: eliminate or reduce bad ones
To keep your job
even more: to advance
To get satisfaction from job well
done
Conclusions
Evaluation is a complex task
but also an essential part of being an
information professional
Traditional approaches &
criteria still apply
but new ones are added or adapted to
suit new sources & new methods of
access & use
Evaluation skills are in growing
demand, particularly because the
web is value neutral
Great professional skill to sell!
Evaluation perspectives (Rockwell)
[image slides]

Possible rewards*
* but don’t bet on it!