EVALUATION in searching
IR systems - Digital libraries - Reference sources - Web sources
© Tefko Saracevic

Definition of evaluation
- Dictionary: assessment of value
  - "the act of considering or examining something in order to judge its value, quality, importance, extent, or condition"
- In searching: assessment of search results on the basis of given criteria, as related to users and use
  - criteria may be specified by users or derived from professional practice, other sources, or standards
- Results are judged, and with them the whole process, including the searcher and the searching

Importance of evaluation
- Integral part of searching
  - always there, wanted or not: no matter what, the user will in some way or other evaluate what was obtained
  - could be informal or formal
- Growing problem for all
  - the information explosion makes finding "good" stuff very difficult
- Formal evaluation is part of the professional job & skills
  - requires knowledge of evaluation criteria, measures, methods
  - more & more prized

Place of evaluation
[Diagram: user with an information need -> search -> results; evaluation is applied to the results and, with them, to the whole process]

General application
- Evaluation (as discussed here) is applicable to results from a variety of information systems:
  - information retrieval (IR) systems, e.g. Dialog, LexisNexis ...
  - sources included in digital libraries, e.g. Rutgers
  - reference services, e.g. in libraries or commercial, on the web
  - web sources, e.g. as found on many domain sites
- Many approaches, criteria, measures, and methods are similar & can be adapted for a specific source or information system

Broad context
- Evaluating the role that an information system plays as related to:
  - SOCIETY: community, culture, discipline ...
  - INSTITUTION: university, organization, company ...
  - INDIVIDUALS: users & potential users (nonusers)
- Roles lead to broad but hard questions as to what CONTEXT to choose for evaluation

Questions asked in different contexts
- Social: how well does an information system support social demands & roles?
  - hardest to evaluate
- Institutional: how well does it support the institutional/organizational mission & objectives?
  - tied to the objectives of the institution
  - also hard to evaluate
- Individual: how well does it support the information needs & activities of people?
  - most evaluations are done in this context

Approaches to evaluation
- Many approaches exist
  - quantitative, qualitative ...; effectiveness, efficiency ...
  - each has strong & weak points
- Systems approach prevalent
  - Effectiveness: how well does a system perform that for which it was designed?
  - evaluation is related to objective(s)
  - requires choices: which objective, which function, to evaluate?

Approaches ... (cont.)
- Economics approach
  - Efficiency: at what costs? Effort and time are also costs
  - Cost-effectiveness: cost for a given level of effectiveness
- Ethnographic approach
  - practices & effects within an organization, community
  - learning & using practices, & comparisons

Prevalent approach
- The system approach is used in many different ways & for many purposes - in evaluation of:
  - inputs to a system & its contents
  - operations of a system
  - use of a system
  - outputs from a system
- Also used in evaluation of search outputs for given user(s) and use
  - applied on the individual level
  - derived from assessments by users or their surrogates, e.g. searchers
  - this is what searchers do most often
  - this is what you will apply in your projects

Five basic requirements for system evaluation
- Once a context is selected, you need to specify ALL five:
- 1. Construct
  - a system, process, or source: a given IR system, web site, digital library ...
  - what are you going to evaluate?
- 2. Criteria
  - to reflect the objective(s) of searching, e.g.
    relevance, utility, satisfaction, accuracy, completeness, time, costs ...
  - on the basis of what will you make judgments?
- 3. Measure(s)
  - to reflect the criteria in some quantity or quality: precision, recall, various Likert scales, $$$ ...
  - how are you going to express the judgment?

Requirements ... (cont.)
- 4. Measuring instrument
  - recording by users or user surrogates (e.g. you) on the measure: expressing whether relevant or not, marking a scale, indicating cost
  - people are the instruments: who will it be?
- 5. Methodology
  - procedures for collecting & analyzing data: how are you going to get all this done?
  - assemble the stuff to evaluate (construct)? choose what criteria? determine what measures to use to reflect the criteria? establish who will judge and how the judgment will be done? how will you analyze the results? verify validity and reliability?

Requirements ... (cont.)
- Ironclad rule: no evaluation can proceed unless ALL five of these are specified!
- Sometimes the specification of some is informal & implied, but they are always there!

1. Constructs
- In IR research: mostly done on test collections & test questions
  - Text REtrieval Conference (TREC): evaluation of algorithms, interactions
  - reported in the research literature
- In practice, on the use & user level: mostly done on operational collections & systems, web sites
  - e.g. Dialog, LexisNexis, various files
  - evaluation & comparison of various contents, procedures, commands
  - user proficiencies, characteristics
  - evaluation of interactions
  - reported in the professional literature

2. Criteria
- In IR: relevance
  - the basic & most used criterion
  - related to the problem at hand
- On the user & use level: many others
  - utility, satisfaction, success, time, value, impact ...
- Web sources
  - those, plus quality, usability, penetration, accessibility ...
- Digital libraries, web sites
  - those, plus usability

2. Criteria - relevance
- Relevance as a criterion
  - strengths:
    - intuitively understood; people know what it means
    - universally applied in information systems
  - weaknesses:
    - not static; it changes dynamically and is thus hard to pin down
    - tied to the cognitive structure & situation of a user, so disagreements are possible
- Relevance as an area of study
  - a basic notion in information science
  - many studies have been done on various aspects of relevance
- A number of relevance types exist
  - they indicate different relations
  - it has to be specified which ones are meant

2. Criteria - usability
- Increasingly used for web sites & digital libraries
- General definition (ISO): "extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use"
- A number of criteria
  - enhancing user performance
  - ease of operations
  - serving the intended purpose
  - learnability: how easy is it to learn, to memorize?
  - lostness: how often did users get lost in using it?
  - satisfaction, and quite a few more

3. Measures
- In IR: precision & recall preferred (treated in Module 4)
  - based on relevance
  - could use two or more dimensions, e.g. relevant / not relevant; relevant / partially relevant / not relevant
- Problem with recall: how to find everything that's relevant in a file?
  - e.g. estimate it; use broad & narrow searching, or the union of many outputs, then compare
- On the use & user level
  - Likert scales, semantic differentials: e.g. satisfaction on a scale of 1 to x (1 = not satisfied, x = satisfied)
  - observational measures, e.g. overlap, consistency

4. Instruments
- People are used as instruments: they judge relevance, mark a scale ...
- But which people? Users, surrogates, analysts, domain experts, librarians ...
- How do relevance, utility ... judges affect results? Who knows?
- Reliability of judgments: about 50-60% agreement for experts

5. Methods
- Include design & procedures for observations, experiments, analysis of results
- Challenges: Validity? Reliability? Reality?
  - Collection: selection? size?
  - Request: generation?
  - Searching: conduct?
  - Results: obtaining? judging? feedback?
  - Analysis: conduct? tools?
  - Interpretation: warranted? generalizable?

Evaluation of web sources
- The web is value neutral: it has everything from diamonds to trash
- Thus evaluation becomes imperative
  - a primary obligation & skill of professional searchers - you
  - it continues & expands on evaluation standards & skills in the library tradition
- A number of criteria are used
  - most are derived from traditional criteria, modified for the web; others have been added
  - they can be found on many library sites: librarians provide the public and colleagues with web evaluation tools and guidelines as part of their services

Criteria for evaluation of web & digital library sources
- What? Content
  - What subject(s), topic(s) are covered? Level? Depth? Exhaustivity? Specificity? Organization?
  - Timeliness of content? Up to date? Revisions?
  - Accuracy?
- Why? Intention
  - Purpose? Scope? Viewpoint?
- For? Users, use
  - Intended audience? What need is satisfied? Use intended or possible? How appropriate?

Criteria ... (cont.)
- Who done it? Authority
  - Author(s), institution, company, publisher, creator: what authority? Reputation? Credibility? Trustworthiness? Refereeing?
  - Persistence: will it be around?
  - Is it transparent who did it?
- How? Treatment
  - Content treatment: readability? style? organization? clarity?
  - Physical treatment: format? layout? legibility? visualization? usability?
- Where? Access
  - How available? Accessible? Restrictions?
  - Link persistence, stability?

Criteria ... (cont.)
- How? Functionality
  - Searching, navigation, browsing? Feedback? Links?
  - Output: organization? features? variations? control?
- How much? Effort, economics
  - Time & effort in learning it? Time & effort in using it?
  - Price? Total costs? Cost-benefits? In comparison to what?
- Wider world
  - Other similar sources? Where & how may similar or better results be obtained?
  - How do they compare?
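Two quantities from the slides above can be made concrete in a few lines of code: the set-based precision and recall from "3. Measures", and the judge-to-judge overlap behind the 50-60% reliability figure in "4. Instruments". This is a minimal sketch in Python; the function names and the toy document IDs and judgments are invented for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over document IDs.

    precision = |retrieved & relevant| / |retrieved|
    recall    = |retrieved & relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


def judge_agreement(judge_a, judge_b):
    """Fraction of documents on which two judges give the same relevance label."""
    assert len(judge_a) == len(judge_b)
    same = sum(a == b for a, b in zip(judge_a, judge_b))
    return same / len(judge_a)


# Toy example: 10 documents retrieved, 4 of them among the 8 judged relevant
p, r = precision_recall(range(1, 11), [2, 4, 6, 8, 20, 21, 22, 23])
# p = 0.4 (precision), r = 0.5 (recall)

# Two hypothetical judges labeling the same 10 documents (1 = relevant)
a = [1, 1, 0, 1, 0, 1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1, 1, 0, 0, 1, 1]
# judge_agreement(a, b) -> 0.6, in line with the 50-60% figure noted above
```

Note the recall caveat from the slides: the full set of relevant documents in a real file is rarely known, so in practice `relevant` is itself an estimate, e.g. the union of many broad searches.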
Main criteria for web site evaluation
- Intention: purpose, scope, viewpoint
- Content: coverage, accuracy, timeliness ...
- Authority: reputation, credibility, "About us"
- Users, use: audience, need, appropriateness ...
- Functionality: navigation, features, output
- Quality: ...
- Treatment: content, layout, visualization ...
- Access: availability, persistence, links
- Effort: in using it, in learning it, time, cost ...

Evaluation: to what end?
- To assess & then improve performance - the MAIN POINT
  - to change searches & search results for the better
- To understand what went on
  - what went right, what went wrong, what works, what doesn't - & then change
- To communicate with the user
  - explain & get feedback
- To gather data for best practices
  - and, conversely, to eliminate or reduce bad ones
- To keep your job - even more: to advance
- To get satisfaction from a job well done

Conclusions
- Evaluation is a complex task
  - but also an essential part of being an information professional
- Traditional approaches & criteria still apply
  - but new ones are added or adapted to suit new sources & new methods of access & use
- Evaluation skills are in growing demand
  - particularly because the web is value neutral
- A great professional skill to sell!

Evaluation perspectives
[Slides 28-30: illustrations only (Rockwell)]

Possible rewards*
* but don't bet on it!