EVALUATION in searching
- IR systems
- Digital libraries
- Reference sources
- Web sources
© Tefko Saracevic
Definition of evaluation
Dictionary:
1. assessment of value: the act of considering or examining something in order to judge its value, quality, importance, extent, or condition
In searching:
assessment of search results on the basis of given criteria as related to users and use; criteria may be specified by users or derived from professional practice, other sources, or standards
Results are judged and, with them, the whole process, including the searcher & the searching.
Importance of evaluation
- Integral part of searching
  - always there - wanted or not
    - no matter what, users will in some way or another evaluate what they obtained
  - could be informal or formal
- Growing problem for all
  - the information explosion makes finding “good” stuff very difficult
- Formal evaluation is part of the professional job & skills
  - requires knowledge of evaluation criteria, measures, methods
  - more & more prized
Place of evaluation
[Diagram relating the user, the information need, the search, the results, and evaluation]
General application
- Evaluation (as discussed here) is applicable to results from a variety of information systems:
  - information retrieval (IR) systems, e.g. Dialog, LexisNexis …
  - sources included in digital libraries, e.g. Rutgers
  - reference services, e.g. in libraries or commercial services on the web
  - web sources, e.g. as found on many domain sites
- Many approaches, criteria, measures, & methods are similar & can be adapted for a specific source or information system
Broad context
Evaluating the role that an information system plays as related to:
- SOCIETY - community, culture, discipline ...
- INSTITUTION - university, organization, company ...
- INDIVIDUALS - users & potential users (nonusers)
These roles lead to broad but hard questions as to what CONTEXT to choose for evaluation.
Questions asked in different contexts
- Social:
  - how well does an information system support social demands & roles?
    - hardest to evaluate
- Institutional:
  - how well does it support the institutional/organizational mission & objectives?
    - tied to the objectives of the institution
    - also hard to evaluate
- Individual:
  - how well does it support the inf. needs & activities of people?
    - most evaluations are done in this context
Approaches to evaluation
- Many approaches exist
  - quantitative, qualitative …
  - effectiveness, efficiency ...
  - each has strong & weak points
- Systems approach prevalent
  - Effectiveness: how well does a system perform that for which it was designed?
  - Evaluation is related to objective(s)
  - Requires choices:
    - Which objective or function to evaluate?
Approaches … (cont.)
- Economics approach:
  - Efficiency: at what costs?
  - Effort & time are also costs
  - Cost-effectiveness: cost for a given level of effectiveness (a small worked example follows)
- Ethnographic approach:
  - practices & effects within an organization, community
  - learning & using practices & comparisons
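One common cost-effectiveness figure is the cost per relevant item retrieved. A minimal sketch in Python; the figures and variable names are illustrative assumptions, not from the slides:

    # Cost per relevant document = total search cost / relevant documents retrieved
    search_cost = 25.00        # e.g. connect time, fees, searcher time, in $
    relevant_retrieved = 8     # relevant documents obtained by the search
    print(search_cost / relevant_retrieved)   # 3.125 ($ per relevant document)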
Prevalent approach
- The system approach is used in many different ways & for many purposes – in evaluation of:
  - inputs to a system & its contents
  - operations of a system
  - use of a system
  - outputs from a system
- Also, in evaluation of search outputs for given user(s) and use
  - applied on the individual level
    - derived from assessments by users or their surrogates, e.g. searchers
  - this is what searchers do most often
  - this is what you will apply in your projects
Five basic requirements for system evaluation
Once a context is selected, you need to specify ALL five:
1. Construct
   - a system, process, or source
     - a given IR system, web site, digital library ...
     - what are you going to evaluate?
2. Criteria
   - to reflect the objective(s) of searching
     - e.g. relevance, utility, satisfaction, accuracy, completeness, time, costs …
     - on what basis will you make judgments?
3. Measure(s)
   - to reflect criteria in some quantity or quality
     - precision, recall, various Likert scales, $$$ ...
     - how are you going to express the judgment?
Requirements … (cont.)
4. Measuring instrument
   - recording by users or user surrogates (e.g. you) on the measure
     - expressing whether relevant or not, marking a scale, indicating cost
     - people are the instruments – who will it be?
5. Methodology
   - procedures for collecting & analyzing data
     - how are you going to get all this done? Assemble the stuff to evaluate (construct)? Choose what criteria? Determine what measures to use to reflect the criteria? Establish who will judge and how the judgment will be done? How will you analyze results? Verify validity and reliability?
Requirements … (cont.)
- Ironclad rule: no evaluation can proceed unless ALL five of these are specified!
- Sometimes the specification of some of them is informal & implied, but they are always there! (A sketch putting all five together follows.)
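To make the five requirements concrete, here is a minimal sketch of an evaluation specification in Python; the field names and sample values are my illustrative assumptions, not part of the slides:

    # Writing down all five requirements before an evaluation starts.
    from dataclasses import dataclass

    @dataclass
    class EvaluationSpec:
        construct: str      # 1. what is evaluated
        criteria: list      # 2. basis for judgments
        measures: list      # 3. how judgments are expressed
        instrument: str     # 4. who judges
        methodology: str    # 5. how data are collected & analyzed

    spec = EvaluationSpec(
        construct="a given IR system, e.g. one Dialog file",
        criteria=["relevance", "satisfaction"],
        measures=["precision", "recall", "5-point Likert scale"],
        instrument="end users (or searchers as their surrogates)",
        methodology="judge each retrieved item, then compute the measures",
    )
    print(spec)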
1. Constructs
- In IR research: most evaluation is done on test collections & test questions
  - Text REtrieval Conference - TREC
    - evaluation of algorithms, interactions
    - reported in the research literature
- In practice, on the use & user level: mostly done on operational collections & systems, web sites
  - e.g. Dialog, LexisNexis, various files
    - evaluation & comparison of various contents, procedures, commands
    - user proficiencies, characteristics
    - evaluation of interactions
    - reported in the professional literature
2. Criteria
- In IR: relevance is the basic & most used criterion
  - related to the problem at hand
- On the user & use level: many others
  - utility, satisfaction, success, time, value, impact, ...
- Web sources:
  - those + quality, usability, penetration, accessibility ...
- Digital libraries, web sites:
  - those + usability
2. Criteria - relevance
- Relevance as a criterion
  - strengths:
    - intuitively understood; people know what it means
    - universally applied in information systems
  - weaknesses:
    - not static - it changes dynamically, thus hard to pin down
    - tied to the cognitive structure & situation of a user – possible disagreements
- Relevance as an area of study
  - a basic notion in information science
  - many studies done on various aspects of relevance
- A number of relevance types exist
  - they indicate different relations
    - it has to be specified which ones are meant
2. Criteria - usability
- Increasingly used for web sites & digital libraries
- General definition (ISO):
  “extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use”
- A number of criteria (a small scoring sketch follows):
  - enhancing user performance
  - ease of operations
  - serving the intended purpose
  - learnability – how easy is it to learn & memorize?
  - lostness – how often do users get lost in using it?
  - satisfaction
  - and quite a few more
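The ISO definition names three measurable facets: effectiveness, efficiency, and satisfaction. A minimal sketch of summarizing them from usability-test sessions; the data and variable names are illustrative assumptions:

    # One row per test session: (task completed?, seconds on task, satisfaction 1-5)
    sessions = [
        (True, 40, 4),
        (False, 95, 2),
        (True, 60, 5),
    ]
    effectiveness = sum(ok for ok, _, _ in sessions) / len(sessions)  # task success rate
    efficiency = sum(t for _, t, _ in sessions) / len(sessions)       # mean time on task
    satisfaction = sum(s for _, _, s in sessions) / len(sessions)     # mean Likert score
    print(effectiveness, efficiency, satisfaction)  # 0.67, 65.0, 3.67 (rounded)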
3. Measures
- In IR: precision & recall are the preferred measures (treated in Module 4; a computation sketch follows)
  - based on relevance
  - judgments can have two or more levels
    - e.g. relevant–not relevant; relevant–partially relevant–not relevant
- Problem with recall:
  - how to find everything that is relevant in a file?
    - e.g. estimate it; search broadly & narrowly, or take the union of many outputs and then compare
- On the use & user level:
  - Likert scales - semantic differentials
    - e.g. satisfaction on a scale of 1 to x (1 = not satisfied, x = satisfied)
  - observational measures
    - e.g. overlap, consistency
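A minimal sketch of the two measures; the function name is mine, and the full set of relevant documents is assumed to be known (in practice it must be estimated, as noted above):

    # precision = relevant retrieved / all retrieved
    # recall    = relevant retrieved / all relevant in the collection
    def precision_recall(retrieved, relevant):
        hits = retrieved & relevant    # relevant documents actually retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # 10 documents retrieved; 6 of them are among the 12 known relevant ones
    p, r = precision_recall(set(range(10)), set(range(4, 16)))
    print(p, r)  # 0.6 0.5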
4. Instruments
- People are used as instruments
  - they judge relevance, mark scales ...
- But which people?
  - users, surrogates, analysts, domain experts, librarians ...
- How do relevance, utility ... judges affect results?
  - who knows?
- Reliability of judgments (a simple agreement sketch follows):
  - about 50 - 60% for experts
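One simple way to express that reliability is percent agreement between two judges; the sketch and its sample judgments are illustrative assumptions:

    # Percent agreement between two relevance judges over the same documents.
    def percent_agreement(judge_a, judge_b):
        shared = judge_a.keys() & judge_b.keys()   # documents judged by both
        agree = sum(1 for d in shared if judge_a[d] == judge_b[d])
        return agree / len(shared) if shared else 0.0

    a = {1: "rel", 2: "rel", 3: "not", 4: "rel"}
    b = {1: "rel", 2: "not", 3: "not", 4: "not"}
    print(percent_agreement(a, b))  # 0.5, in line with the 50-60% figure above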
5. Methods
- Includes the design & procedures for observations, experiments, and analysis of results
- Challenges: Validity? Reliability? Reality?
  - Collection - selection? size?
  - Request - generation?
  - Searching - conduct?
  - Results - obtaining? judging? feedback?
  - Analysis - conduct? tools?
  - Interpretation - warranted? generalizable?
Evaluation of web sources
- The web is value neutral
  - it has everything from diamonds to trash
- Thus evaluation becomes imperative
  - and a primary obligation & skill of professional searchers – you
  - it continues & expands the evaluation standards & skills of the library tradition
- A number of criteria are used
  - most are derived from traditional criteria but modified for the web; others are added
  - they can be found on many library sites
    - librarians provide the public and colleagues with web evaluation tools and guidelines as part of their services
Criteria for evaluation of web & Dlib sources
- What? Content
  - What subject(s), topic(s) are covered?
  - Level? Depth? Exhaustivity? Specificity? Organization?
  - Timeliness of content? Up-to-date? Revisions?
  - Accuracy?
- Why? Intention
  - Purpose? Scope? Viewpoint?
- For? Users, use
  - Intended audience?
  - What need is satisfied?
  - Use intended or possible?
  - How appropriate?
criteria ...
- Who done it? Authority
  - Author(s), institution, company, publisher, creator:
    - What authority? Reputation? Credibility? Trustworthiness? Refereeing?
    - Persistence? Will it be around?
    - Is it transparent who did it?
- How? Treatment
  - Content treatment:
    - Readability? Style? Organization? Clarity?
  - Physical treatment:
    - Format? Layout? Legibility? Visualization?
  - Usability
- Where? Access
  - How available? Accessible? Restrictions?
  - Link persistence, stability?
criteria ...
- How? Functionality
  - Searching, navigation, browsing?
  - Feedback? Links?
  - Output: Organization? Features? Variations? Control?
- How much? Effort, economics
  - Time & effort in learning it?
  - Time & effort in using it?
  - Price? Total costs? Cost-benefits?
- In comparison to? Wider world
  - Other similar sources?
    - Where & how may similar or better results be obtained?
    - How do they compare?
Main criteria for web site evaluation
- Intention: purpose, scope, viewpoint
- Content: coverage, accuracy, timeliness, …
- Authority: reputation, credibility, “About us”
- Users, use: audience, need, appropriateness, …
- Functionality: navigation, features, output
- Quality: …
- Treatment: content, layout, visualization, …
- Access: availability, persistence, links
- Effort: in using it, in learning it, time, cost, …
Evaluation: to what end?
- To assess & then improve performance – MAIN POINT
  - to change searches & search results for the better
- To understand what went on
  - what went right, what went wrong, what works, what doesn't & then change
- To communicate with the user
  - explain & get feedback
- To gather data for best practices
  - conversely: to eliminate or reduce bad ones
- To keep your job
  - even more: to advance
- To get satisfaction from a job well done
Conclusions
- Evaluation is a complex task
  - but also an essential part of being an information professional
- Traditional approaches & criteria still apply
  - but new ones are added or adapted to fit new sources & new methods of access & use
- Evaluation skills are in growing demand, particularly because the web is value neutral
- A great professional skill to sell!
Evaluation perspectives (Rockwell)
[three image-only slides]
Possible rewards*
* but don’t bet on it!