Text Mining
Dr Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn@cs.ucr.edu

Text Mining/Information Retrieval
• Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.
• This assumption underlies the field of Information Retrieval.

[Figure: IR pipeline with the stages Information need, text input, Parse, Pre-process, Query, Collections, Index, Rank, Evaluate. Callouts: "How is the query constructed?" and "How is the text processed?"]

Terminology
• Token: A natural language word, e.g. “Swim”, “Simpson”, “92513”
• Document: Usually a web page, but more generally any file.

Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  • H.P. Luhn at IBM (1958)
  • Probabilistic models at Rand (Maron & Kuhns) (1960)
  • Boolean system development at Lockheed (‘60s)
  • Vector Space Model (Salton at Cornell 1965)
  • Statistical Weighting methods and theoretical advances (‘70s)
  • Refinements and Advances in application (‘80s)
  • User Interfaces, Large-scale testing and application (‘90s)

Relevance
• In what ways can a document be relevant to a query?
  – Answer precise question precisely.
    • Who is Homer’s Boss? Montgomery Burns.
  – Partially answer question.
    • Where does Homer work? Power Plant.
  – Suggest a source for more information.
    • What is Bart’s middle name? Look in Issue 234 of Fanzine
  – Give background information.
  – Remind the user of other knowledge.
  – Others ...

[Figure: IR pipeline, repeated]
The section that follows is about Content Analysis (transforming raw text into a computationally more manageable form)

Document Processing Steps
[Figure from Baeza-Yates & Ribeiro-Neto]

Stemming and Morphological Analysis
• Goal: “normalize” similar words
• Morphology (“form” of words)
  – Inflectional Morphology
    • E.g.,
inflect verb endings and noun number
    • Never change grammatical class
      – dog, dogs
      – Bike, Biking
      – Swim, Swimmer, Swimming
What about… build, building;

Examples of Stemming (using Porter's algorithm)
Porter's algorithm is available in Java, C, Lisp, Perl, Python etc. from http://www.tartarus.org/~martin/PorterStemmer/

  Original Words    Stemmed Words
  …                 …
  consign           consign
  consigned         consign
  consigning        consign
  consignment       consign
  consist           consist
  consisted         consist
  consistency       consist
  consistent        consist
  consistently      consist
  consisting        consist
  consists          consist
  …

Errors Generated by Porter Stemmer (Krovetz 93)

  Too Aggressive          Too Timid
  organization/organ      european/europe
  policy/police           cylinder/cylindrical
  execute/executive       create/creation
  arm/army                search/searcher

Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution

Government documents, 157734 tokens, 32259 unique

  Most frequent       Next most frequent    Occurring once
  8164 the            969 on                1 ABC
  4771 of             915 FT                1 ABFT
  4005 to             883 Mr                1 ABOUT
  2834 a              860 was               1 ACFT
  2827 and            855 be                1 ACI
  2802 in             849 Pounds            1 ACQUI
  1592 The            798 TEXT              1 ACQUISITIONS
  1370 for            798 PUB               1 ACSIS
  1326 is             798 PROFILE           1 ADFT
  1324 s              798 PAGE              1 ADVISERS
  1194 that           798 HEADLINE          1 AE
  973 by              798 DOCNO

Plotting Word Frequency by Rank
• Main idea: count
  – How many times tokens occur in the text
    • Over all texts in the collection
• Now rank these according to how often they occur. This is called the rank.
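The count-then-rank idea above can be sketched in a few lines of Python. This is a minimal illustration rather than the deck's own code: the function name `rank_tokens`, the regular-expression tokenizer, and the two sample texts (borrowed from the inverted-index example later in these slides) are all assumptions.

```python
from collections import Counter
import re

def rank_tokens(texts):
    """Count token occurrences over all texts and rank them by frequency."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase runs of letters and digits.
        counts.update(re.findall(r"[a-z0-9]+", text.lower()))
    # most_common() orders tokens by descending count; rank 1 is the most frequent.
    return [(rank, token, freq)
            for rank, (token, freq) in enumerate(counts.most_common(), start=1)]

collection = [
    "Now is the time for all good men to come to the aid of their country",
    "It was a dark and stormy night in the country manor. The time was past midnight",
]
ranked = rank_tokens(collection)
for rank, token, freq in ranked[:5]:
    print(rank, token, freq)
```

On a collection this small the long tail is already visible: most of the 25 distinct tokens occur exactly once.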
The Corresponding Zipf Curve

  Rank  Freq  Term
  1     37    system
  2     32    knowledg
  3     24    base
  4     20    problem
  5     18    abstract
  6     15    model
  7     15    languag
  8     15    implem
  9     13    reason
  10    13    inform
  11    11    expert
  12    11    analysi
  13    10    rule
  14    10    program
  15    10    oper
  16    10    evalu
  17    10    comput
  18    10    case
  19    9     gener
  20    9     form

Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently

Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
  – Rank = order of words’ frequency of occurrence
  $f = C \cdot \frac{1}{r}, \qquad C \approx N/10$
• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times
  – …

Zipf Distribution (linear and log scale)
[Figure: Zipf curve on linear and log scales. Illustration by Jakob Nielsen]

What Kinds of Data Exhibit a Zipf Distribution?
• Words in a text collection
  – Virtually any language usage
• Library book checkout patterns
• Incoming Web Page Requests
• Outgoing Web Page Requests
• Document Size on Web
• City Sizes
• …

Consequences of Zipf
• There are always a few very frequent tokens that are not good discriminators.
  – Called “stop words” in IR
    • English examples: to, from, on, and, the, ...
• There are always a large number of tokens that occur once and can mess up algorithms.
• Medium frequency words are the most descriptive

Word Frequency vs. Resolving Power (from van Rijsbergen 79)
[Figure: resolving power plotted against word frequency rank]
The most frequent words are not the most descriptive.

Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of each happening individually equals the probability of both happening together.
  $P(x)\,P(y) = P(x, y)$

Statistical Independence and Dependence
• What are examples of things that are statistically independent?
• What are examples of things that are statistically dependent?
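One way to explore these two questions is to check the definition $P(x)P(y) = P(x, y)$ exactly on a tiny sample space. The sketch below assumes two fair coin flips as the sample space; that choice, and the helper names `prob` and `independent`, are illustrative and not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Assumed toy sample space: all outcomes of two fair coin flips.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event under a uniform distribution over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def independent(x, y):
    """True when P(x) * P(y) equals P(x, y)."""
    joint = prob(lambda o: x(o) and y(o))
    return prob(x) * prob(y) == joint

def first_heads(o):
    return o[0] == "H"

def second_heads(o):
    return o[1] == "H"

def both_heads(o):
    return o == ("H", "H")

print(independent(first_heads, second_heads))  # True: the flips do not influence each other
print(independent(first_heads, both_heads))    # False: "both heads" implies "first heads"
```

Using `Fraction` keeps the comparison exact, so the equality test is not disturbed by floating-point rounding.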
Lexical Associations
• Subjects write first word that comes to mind
  – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text Corpora yield similar associations
• One measure: Mutual Information (Church and Hanks 89)
  $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
• If word occurrences were independent, the numerator and denominator would be equal (if measured across a large collection)

Statistical Independence
• Compute for a window of words
  $P(x)\,P(y) = P(x, y)$ if independent.
  $P(x) = f(x) / N$
  We'll approximate $P(x, y)$ as follows:
  $P(x, y) \approx \frac{1}{N} \sum_{i=1}^{N - |w|} w_i(x, y)$
  where
  $|w|$ = length of window w (say 5)
  $w_i$ = words within window starting at position i
  $w(x, y)$ = number of times x and y co-occur in w
  $N$ = number of words in collection
[Figure: windows w1, w11, w21 sliding over a sequence of words]

Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

  I(x,y)  f(x,y)  f(x)   x         f(y)   y
  11.3    12      111    Honorary  621    Doctor
  11.3    8       1105   Doctors   44     Dentists
  10.7    30      1105   Doctors   241    Nurses
  9.4     8       1105   Doctors   154    Treating
  9.0     6       275    Examined  621    Doctor
  8.9     11      1105   Doctors   317    Treat
  8.7     25      621    Doctor    1407   Bills

Un-Interesting Associations with “Doctor” (AP Corpus, N=15 million, Church & Hanks 89)

  I(x,y)  f(x,y)  f(x)    x       f(y)   y
  0.96    6       621     doctor  73785  with
  0.95    41      284690  a       1105   doctors
  0.93    12      84716   is      1105   doctors

These associations were likely to happen because the non-doctor words shown here are very common and therefore likely to co-occur with any noun.

Associations Are Important Because…
• We may be able to discover phrases that should be treated as a single word, e.g. “data mining”.
• We may be able to automatically discover synonyms, e.g.
“Bike” and “Bicycle”

Content Analysis Summary
• Content Analysis: transforming raw text into more computationally useful forms
• Words in text collections exhibit interesting statistical properties
  – Word frequencies have a Zipf distribution
  – Word co-occurrences exhibit dependencies
• Text documents are transformed to vectors
  – Pre-processing includes tokenization, stemming, collocations/phrases

[Figure: IR pipeline, repeated. Callout: "How is the index constructed?"]
The section that follows is about Index Construction

Inverted Index
• This is the primary data structure for text indexes
• Main Idea:
  – Invert documents into a big index
• Basic steps:
  – Make a “dictionary” of all the tokens in the collection
  – For each token, list all the docs it occurs in.
  – Do a few things to reduce redundancy in the data structure

How Are Inverted Files Created
• Documents are parsed to extract tokens. These are saved with the Document ID.

  Doc 1: Now is the time for all good men to come to the aid of their country
  Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

  (Term, Doc #) pairs in order of occurrence:
  Doc # 1: now, is, the, time, for, all, good, men, to, come, to, the, aid, of, their, country
  Doc # 2: it, was, a, dark, and, stormy, night, in, the, country, manor, the, time, was, past, midnight

How Inverted Files are Created
• After all documents have been parsed the inverted file is sorted alphabetically.
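The parse and sort steps, together with the basic step of listing the documents each token occurs in, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the letters-only tokenizer, and the choice of a dictionary of `{doc_id: frequency}` postings are not from the slides.

```python
from collections import Counter, defaultdict
import re

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}

def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters.
    return re.findall(r"[a-z]+", text.lower())

def build_inverted_index(docs):
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = [(term, doc_id) for doc_id, text in docs.items() for term in tokenize(text)]
    # Step 2: sort the pairs alphabetically by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated pairs into a within-document frequency.
    postings = defaultdict(dict)
    for (term, doc_id), freq in Counter(pairs).items():
        postings[term][doc_id] = freq
    return pairs, dict(postings)

pairs, postings = build_inverted_index(docs)
print(pairs[:4])            # first few sorted (term, doc_id) pairs
print(postings["the"])      # documents containing "the", with frequencies
print(postings["country"])  # documents containing "country"
```

Keeping the postings as a mapping from document ID to frequency makes it cheap to ask which documents contain a term, which is the operation Boolean queries rely on.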
  The (Term, Doc #) pairs from the parsing step, sorted by term:
  a 2, aid 1, all 1, and 2, come 1, country 1, country 2, dark 2, for 1, good 1, in 2, is 1, it 2, manor 2, men 1, midnight 2, night 2, now 1, of 1, past 2, stormy 2, the 1, the 1, the 2, the 2, their 1, time 1, time 2, to 1, to 1, was 2, was 2

How Inverted Files are Created
• Multiple term entries for a single document are merged.
• Within-document term frequency information is compiled.

  Term      Doc #  Freq
  a         2      1
  aid       1      1
  all       1      1
  and       2      1
  come      1      1
  country   1      1
  country   2      1
  dark      2      1
  for       1      1
  good      1      1
  in        2      1
  is        1      1
  it        2      1
  manor     2      1
  men       1      1
  midnight  2      1
  night     2      1
  now       1      1
  of        1      1
  past      2      1
  stormy    2      1
  the       1      2
  the       2      2
  their     1      1
  time      1      1
  time      2      1
  to        1      2
  was       2      2

How Inverted Files are Created
• Then the file can be split into
  – A Dictionary file and
  – A Postings file

  Dictionary
  Term      N docs  Tot Freq
  a         1       1
  aid       1       1
  all       1       1
  and       1       1
  come      1       1
  country   2       2
  dark      1       1
  for       1       1
  good      1       1
  in        1       1
  is        1       1
  it        1       1
  manor     1       1
  men       1       1
  midnight  1       1
  night     1       1
  now       1       1
  of        1       1
  past      1       1
  stormy    1       1
  the       2       4
  their     1       1
  time      2       2
  to        1       2
  was       1       2

  Postings
  The (Doc #, Freq) columns of the merged table above, in the same order, one entry per (term, document) pair.

Inverted Indexes
• Permit fast search for individual terms
• For each term, you get a list consisting of:
  – document ID
  – frequency of term in
doc (optional)
  – position of term in doc (optional)
• These lists can be used to solve Boolean queries:
  • country -> d1, d2
  • manor -> d2
  • country AND manor -> d2
• Also used for statistical ranking algorithms

How Inverted Files are Used
[Table: the dictionary and postings files built above. Relevant entries: dark (N docs 1, Tot Freq 1; posting Doc # 2, Freq 1); time (N docs 2, Tot Freq 2; postings Doc # 1, Freq 1 and Doc # 2, Freq 1)]
Query on “time” AND “dark”
• 2 docs with “time” in dictionary -> IDs 1 and 2 from posting file
• 1 doc with “dark” in dictionary -> ID 2 from posting file
• Therefore, only doc 2 satisfies the query.

[Figure: IR pipeline, repeated. Callout: "How is the index constructed?"]
The section that follows is about Querying (and ranking)

Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
  • words
  • normalized (stemmed) words
  • phrases
– connectors
  • AND
  • OR
  • NOT
  • NEAR (Pseudo Boolean)

Boolean Queries
[Figure: word/document grid for the words Cat, Dog, Collar, Leash]
• Cat
• Cat OR Dog
• Cat AND Dog
• (Cat AND Dog)
• (Cat AND Dog) OR Collar
• (Cat AND Dog) OR (Collar AND Leash)
• (Cat OR Dog) AND (Collar OR Leash)

Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – Each of the following combinations works:
    [Table: combinations of Cat, Dog, Collar, Leash that satisfy the query, i.e. at least one of Cat or Dog together with at least one of Collar or Leash]

Boolean Queries
• (Cat OR Dog) AND (Collar OR Leash)
  – None of the following combinations work:
    [Table: combinations that fail because one side of the AND is empty: Cat only; Dog only; Cat and Dog; Collar only; Leash only; Collar and Leash]

Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure: Venn diagram of the four concepts Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P)]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

Ordering of Retrieved
Documents
• Pure Boolean has no ordering
• In practice:
  – order chronologically
  – order by total number of “hits” on query terms
    • What if one term has more hits than others?
    • Is it better to have one of each term or many of one term?

Boolean Model
• Advantages
  – simple queries are easy to understand
  – relatively easy to implement
• Disadvantages
  – difficult to specify what is wanted
  – too much returned, or too little
  – ordering not well determined
• Dominant language in commercial Information Retrieval systems until the WWW

Since the Boolean model is limited, let's consider a generalization…

Vector Model
• Documents are represented as “bags of words”
• Represented as vectors when used computationally
  – A vector is like an array of floating point numbers
  – Has direction and magnitude
  – Each vector holds a place for every term in the collection
  – Therefore, most vectors are sparse
• Smithers secretly loves Monty Burns
• Monty Burns secretly loves Smithers
Both map to… [Burns, loves, Monty, secretly, Smithers]

Document Vectors
One location for each word
[Table: document vectors for document ids A to I over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; most cells are empty]

We Can Plot the Vectors
[Figure: documents plotted on the axes Star and Diet: a doc about movie stars, a doc about astronomy, and a doc about mammal behavior]

Documents in 3D Vector Space
[Figure: documents D1 to D11 plotted in the space of terms t1, t2, t3. Illustration from Jurafsky & Martin]

Vector Space Model

  docs  Homer  Marge  Bart
  D1    *             *
  D2    *
  D3           *      *
  D4    *
  D5    *      *      *
  D6    *      *
  D7           *
  D8           *
  D9                  *
  D10          *      *
  D11   *             *
  Q            *

Note that the query is projected into the same vector space as the documents. The query here is for “Marge”. We can use a vector similarity model to determine the best match to our query (details in a few slides). But what weights should we use for the terms?
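The bag-of-words idea, and the projection of a query into the same space, can be illustrated with a short Python sketch. The helper name `bag_of_words`, the whitespace tokenizer, the raw-count weights, and the one-word query are assumptions made for this example only.

```python
def bag_of_words(text, vocabulary):
    """Map a text to a vector with one location per vocabulary term (raw counts)."""
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocabulary]

s1 = "Smithers secretly loves Monty Burns"
s2 = "Monty Burns secretly loves Smithers"

# One location for each word in the (tiny) collection, in alphabetical order.
vocabulary = sorted(set(s1.lower().split()) | set(s2.lower().split()))
v1 = bag_of_words(s1, vocabulary)
v2 = bag_of_words(s2, vocabulary)

print(vocabulary)  # ['burns', 'loves', 'monty', 'secretly', 'smithers']
print(v1 == v2)    # True: word order is lost, so both sentences map to the same vector

# The query is projected into the same vector space as the documents.
query = bag_of_words("Monty", vocabulary)
score = sum(q * d for q, d in zip(query, v1))  # simple dot-product match
print(query, score)
```

Because both sentences collapse to the same vector, the model cannot tell who loves whom; that loss of word order is the price of the representation.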
Assigning Weights to Terms
• Binary Weights
• Raw term frequency
• tf x idf
  – Recall the Zipf distribution
  – Want to weight terms highly if they are
    • frequent in relevant documents … BUT
    • infrequent in the collection as a whole

Binary Weights
• Only the presence (1) or absence (0) of a term is included in the vector

  docs  t1  t2  t3
  D1    1   0   1
  D2    1   0   0
  D3    0   1   1
  D4    1   0   0
  D5    1   1   1
  D6    1   1   0
  D7    0   1   0
  D8    0   1   0
  D9    0   0   1
  D10   0   1   1
  D11   1   0   1

We have already seen and discussed this model.

Raw Term Weights
• The frequency of occurrence for the term in each document is included in the vector
Counts can be normalized by document lengths.

  docs  t1  t2  t3
  D1    2   0   3
  D2    1   0   0
  D3    0   4   7
  D4    3   0   0
  D5    1   6   3
  D6    3   5   0
  D7    0   8   0
  D8    0   10  0
  D9    0   0   1
  D10   0   3   5
  D11   4   0   1

This model is open to exploitation by websites that repeat a popular term (the slide's example is “sex”) many times over.

tf * idf Weights
• tf * idf measure:
  – term frequency (tf)
  – inverse document frequency (idf) -- a way to deal with the problems of the Zipf distribution
• Goal: assign a tf * idf weight to each term in each document

tf * idf
  $w_{ik} = tf_{ik} \cdot \log(N / n_k)$
  where
  $T_k$ = term k in document $D_i$
  $tf_{ik}$ = frequency of term $T_k$ in document $D_i$
  $idf_k$ = inverse document frequency of term $T_k$ in C
  $N$ = total number of documents in the collection C
  $n_k$ = the number of documents in C that contain $T_k$
  $idf_k = \log\left(\frac{N}{n_k}\right)$

Inverse Document Frequency
• IDF provides high values for rare words and low values for common words
  $idf_k = \log\left(\frac{N}{n_k}\right)$
For a collection of 10000 documents (base-10 logarithm):
  $\log\left(\frac{10000}{10000}\right) = 0$
  $\log\left(\frac{10000}{5000}\right) = 0.301$
  $\log\left(\frac{10000}{20}\right) = 2.699$
  $\log\left(\frac{10000}{1}\right) = 4$

Similarity Measures
  Simple matching (coordination level match): $|Q \cap D|$
  Dice’s Coefficient: $\frac{2\,|Q \cap D|}{|Q| + |D|}$
  Jaccard’s Coefficient: $\frac{|Q \cap D|}{|Q \cup D|}$
  Cosine Coefficient: $\frac{|Q \cap D|}{|Q|^{1/2}\,|D|^{1/2}}$
  Overlap Coefficient: $\frac{|Q \cap D|}{\min(|Q|, |D|)}$

Cosine
  $D_1 = (0.8, 0.3)$
  $D_2 = (0.2, 0.7)$
  $Q = (0.4, 0.8)$
  $\cos \alpha_1 \approx 0.73$ (angle between Q and $D_1$)
  $\cos \alpha_2 \approx 0.98$ (angle between Q and $D_2$)
[Figure: Q, $D_1$ and $D_2$ plotted as vectors on axes running from 0 to 1.0, with $\alpha_1$ and $\alpha_2$ the angles between Q and each document]

Problems with Vector Space
• There is no real theoretical
basis for the assumption of a term space
  – it is more for visualization than having any real basis
  – most similarity measures work about the same regardless of model
• Terms are not really orthogonal dimensions
  – Terms are not independent of all other terms

Probabilistic Models
• Rigorous formal model attempts to predict the probability that a given document will be relevant to a given query
• Ranks retrieved documents according to this probability of relevance (Probability Ranking Principle)
• Rely on accurate estimates of probabilities

Relevance Feedback
• Main Idea:
  – Modify existing query based on relevance judgements
    • Query Expansion: extract terms from relevant documents and add them to the query
    • Term Re-weighting: and/or re-weight the terms already in the query
  – Two main approaches:
    • Automatic (pseudo-relevance feedback)
    • Users select relevant documents
  – Users/system select terms from an automatically generated list

Definition: Relevance Feedback is the reformulation of a search query in response to feedback provided by the user for the results of previous versions of the query.
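One concrete form of this reformulation is the Rocchio update covered in this section: move the query toward the average relevant document and away from the average non-relevant one. The sketch below assumes plain Python lists as term vectors; the function name `rocchio`, the sample feedback vectors, and the decision to skip an empty feedback set are illustrative choices. The defaults 0.75 and 0.25 are the settings the slides report from some studies.

```python
def rocchio(q0, relevant, non_relevant, beta=0.75, gamma=0.25):
    """Return Q1 = Q0 + beta * mean(relevant) - gamma * mean(non_relevant)."""
    q1 = list(q0)
    for docs, weight in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue  # assumed behaviour: no feedback of this kind, no adjustment
        for k in range(len(q0)):
            centroid_k = sum(doc[k] for doc in docs) / len(docs)
            q1[k] += weight * centroid_k
    return q1

# Term order: [Jordan, Bank, Bull, River]; all feedback vectors are hypothetical.
q0 = [1.0, 1.0, 1.0, 1.0]
relevant = [[1, 0, 1, 1], [1, 0, 1, 0]]  # documents the user marked relevant
non_relevant = [[0, 1, 0, 0]]            # a document the user marked non-relevant

q1 = rocchio(q0, relevant, non_relevant)
print(q1)  # weights rise for Jordan, Bull and River, and fall for Bank
```

Terms that appear only in relevant documents gain weight, while a term seen only in non-relevant documents loses weight, which is the re-weighting behaviour described above.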
Suppose you are interested in bovine agriculture on the banks of the river Jordan…

  Term Vector   [Jordan, Bank, Bull, River]
  Term Weights  [  1   ,  1  ,  1  ,   1  ]

  Search -> Display Results -> Gather Feedback -> Update Weights

  Term Vector   [Jordan, Bank, Bull, River]
  Term Weights  [ 1.1  , 0.1 , 1.3 ,  1.2 ]

Rocchio Method
  $Q_1 = Q_0 + \beta \sum_{i=1}^{n_1} \frac{R_i}{n_1} - \gamma \sum_{i=1}^{n_2} \frac{S_i}{n_2}$
  where
  $Q_0$ = the vector for the initial query
  $R_i$ = the vector for the relevant document i
  $S_i$ = the vector for the non-relevant document i
  $n_1$ = the number of relevant documents chosen
  $n_2$ = the number of non-relevant documents chosen
  $\beta$ and $\gamma$ tune the importance of relevant and non-relevant terms (in some studies best to set $\beta$ to 0.75 and $\gamma$ to 0.25)

Rocchio Illustration
Although we usually work in vector space for text, it is easier to visualize Euclidean space
[Figure: three panels: Original Query, Term Re-weighting, Query Expansion]
Note that both the location of the center, and the shape of the query have changed

Rocchio Method
• Rocchio automatically
  – re-weights terms
  – adds in new terms (from relevant docs)
    • have to be careful when using negative terms
• Rocchio is not a machine learning algorithm
• Most methods perform similarly
  – results heavily dependent on test collection
• Machine learning methods are proving to work better than standard IR approaches like Rocchio

Using Relevance Feedback
• Known to improve results
• People don’t seem to like giving feedback!

Relevance Feedback for Time Series
[Figure: the original query and the weight vector. Initially, all weights are the same.]
Note: In this example we are using a piecewise linear approximation of the data. We will learn more about this representation later.
The initial query is executed, and the five best matches are shown (in the dendrogram).
One by one the 5 best matching sequences will appear, and the user will rank them from very bad (-3) to very good (+3).
Based on the user feedback, both the shape and the weight vector of the query are changed. The new query can be executed.
The hope is that the query shape and weights will converge to the optimal query.

Two papers consider relevance feedback for time series.
• Query Expansion: L. Wu, C. Faloutsos, K. Sycara, T. Payne: FALCON: Feedback Adaptive Loop for Content-Based Retrieval. VLDB 2000: 297-306
• Term Re-weighting: Keogh, E. & Pazzani, M. Relevance feedback retrieval of time series data. In Proceedings of SIGIR 99

Document Space has High Dimensionality
• What happens beyond 2 or 3 dimensions?
• Similarity still has to do with how many tokens are shared in common.
• More terms -> harder to understand which subsets of words are shared among similar documents.
• One approach to handling high dimensionality: Clustering

Text Clustering
• Finds overall similarities among groups of documents.
• Finds overall similarities among groups of tokens.
• Picks out some themes, ignores others.

Scatter/Gather (Hearst & Pedersen 95)
• Cluster sets of documents into general “themes”, like a table of contents (using K-means)
• Display the contents of the clusters by showing topical terms and typical titles
• User chooses subsets of the clusters and re-clusters the documents within
• Resulting new groups have different “themes”

S/G Example: query on “star”
Encyclopedia text
[Figure: clusters with their sizes: 8 symbols; 68 film, tv (p); 97 astrophysics; 67 astronomy (p); 10 flora/fauna; 14 sports; 47 film, tv; 7 music; 12 stellar phenomena; 49 galaxies, stars; 29 constellations; 7 miscellaneous]
Clustering and re-clustering is entirely automated

Ego Surfing!
http://vivisimo.com/

[Figure: IR pipeline, repeated. Callout: "How is the index constructed?"]
The section that follows is about Evaluation

Evaluation
• Why Evaluate?
• What to Evaluate?
• How to Evaluate?

Why Evaluate?
• Determine if the system is desirable
• Make comparative assessments
• Others?

What to Evaluate?
• How much of the information need is satisfied.
• How much was learned about a topic.
• Incidental learning:
  – How much was learned about the collection.
  – How much was learned about other topics.
• How inviting the system is.

What to Evaluate?
What can be measured that reflects users’ ability to use system? (Cleverdon 66)
  – Coverage of Information
  – Form of Presentation
  – Effort required/Ease of Use
  – Time and Space Efficiency
  – Recall (effectiveness)
    • proportion of relevant material actually retrieved
  – Precision (effectiveness)
    • proportion of retrieved material actually relevant

Relevant vs. Retrieved
[Figure: Venn diagram of the Retrieved and Relevant sets within All docs]

Precision vs. Recall
  $\text{Precision} = \frac{|\text{RelRetrieved}|}{|\text{Retrieved}|}$
  $\text{Recall} = \frac{|\text{RelRetrieved}|}{|\text{Rel in Collection}|}$
[Figure: Venn diagram of the Retrieved and Relevant sets within All docs]

Why Precision and Recall?
Intuition: Get as much good stuff while at the same time getting as little junk as possible.

Retrieved vs. Relevant Documents
[Figure: Venn diagram] Very high precision, very low recall

Retrieved vs. Relevant Documents
[Figure: Venn diagram] Very low precision, very low recall (0 in fact)

Retrieved vs. Relevant Documents
[Figure: Venn diagram] High recall, but low precision

Retrieved vs. Relevant Documents
[Figure: Venn diagram] High precision, high recall (at last!)
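The two definitions reduce to set arithmetic, and a small sketch can reproduce the scenarios pictured above. The function name, the document IDs, and the convention of returning 0.0 for an empty denominator are assumptions made for this illustration.

```python
def precision_recall(retrieved, relevant):
    """Precision = |RelRetrieved| / |Retrieved|; Recall = |RelRetrieved| / |Rel in Collection|."""
    retrieved, relevant = set(retrieved), set(relevant)
    rel_retrieved = retrieved & relevant
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = len(rel_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(rel_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}        # hypothetical IDs of the relevant documents
all_docs = set(range(1, 11))   # hypothetical collection of ten documents

print(precision_recall({1}, relevant))           # (1.0, 0.25): very high precision, very low recall
print(precision_recall({7, 8}, relevant))        # (0.0, 0.0): nothing relevant retrieved
print(precision_recall(all_docs, relevant))      # (0.4, 1.0): high recall, but low precision
print(precision_recall({1, 2, 3, 4}, relevant))  # (1.0, 1.0): high precision, high recall
```

Retrieving the whole collection always gives perfect recall, which is why recall is only informative alongside precision.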
Precision/Recall Curves
• There is a tradeoff between Precision and Recall
• So measure Precision at different levels of Recall
• Note: this is an AVERAGE over MANY queries
[Figure: precision (y-axis) plotted against recall (x-axis)]

Precision/Recall Curves
• Difficult to determine which of these two hypothetical results is better:
[Figure: two hypothetical precision/recall curves]

Precision/Recall Curves
Recall under various retrieval assumptions
[Figure: recall (0.0 to 1.0) against proportion of documents retrieved (0.0 to 1.0) for 1000 documents, 100 relevant. Curves: Perfect, Tangent Parabolic Recall, Parabolic Recall, random, Perverse]

Precision under various assumptions
[Figure: precision (0.0 to 1.0) against proportion of documents retrieved (0.0 to 1.0) for 1000 documents, 100 relevant. Curves: Perfect, Tangent Parabolic Recall, Parabolic Recall, random, Perverse]

Document Cutoff Levels
• Another way to evaluate:
  – Fix the number of documents retrieved at several levels:
    • top 5
    • top 10
    • top 20
    • top 50
    • top 100
    • top 500
  – Measure precision at each of these levels
  – Take (weighted) average over results
• This is a way to focus on how well the system ranks the first k documents.

Problems with Precision/Recall
• Can’t know true recall value
  – except in small collections
• Precision/Recall are related
  – A combined measure sometimes more appropriate
• Assumes batch mode
  – Interactive IR is important and has different criteria for successful searches
• Assumes a strict rank ordering matters.

Relation to Contingency Table

                        Doc is Relevant   Doc is NOT relevant
  Doc is retrieved            a                   b
  Doc is NOT retrieved        c                   d

• Accuracy: (a+d) / (a+b+c+d)
• Precision: a/(a+b)
• Recall: a/(a+c)
• Why don’t we use Accuracy for IR?
  – (Assuming a large collection)
  – Most docs aren’t relevant
  – Most docs aren’t retrieved
  – Inflates the accuracy value

The E-Measure
Combine Precision and Recall into one number (van Rijsbergen 79)
  $E = 1 - \frac{1 + b^2}{\frac{b^2}{R} + \frac{1}{P}}$
  P = precision
  R = recall
  b = measure of relative importance of P or R
For example, b = 0.5 means user is twice as interested in precision as recall

How to Evaluate?
Test Collections

Test Collections
• Cranfield 2
  – 1400 Documents, 221 Queries
  – 200 Documents, 42 Queries
• INSPEC – 542 Documents, 97 Queries
• UKCIS – > 10000 Documents, multiple sets, 193 Queries
• ADI – 82 Documents, 35 Queries
• CACM – 3204 Documents, 50 Queries
• CISI – 1460 Documents, 35 Queries
• MEDLARS (Salton) – 273 Documents, 18 Queries

TREC
• Text REtrieval Conference/Competition
  – Run by NIST (National Institute of Standards & Technology)
  – 2002 (November) will be 11th year
• Collection: >6 Gigabytes (5 CD-ROMs), >1.5 Million Docs
  – Newswire & full text news (AP, WSJ, Ziff, FT)
  – Government documents (Federal Register, Congressional Record)
  – Radio Transcripts (FBIS)
  – Web “subsets”

TREC (cont.)
• Queries + Relevance Judgments
  – Queries devised and judged by “Information Specialists”
  – Relevance judgments done only for those documents retrieved, not the entire collection!
• Competition
  – Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66)
  – Results judged on precision and recall, going up to a recall level of 1000 documents

TREC
• Benefits:
  – made research systems scale to large collections (pre-WWW)
  – allows for somewhat controlled comparisons
• Drawbacks:
  – emphasis on high recall, which may be unrealistic for what most users want
  – very long queries, also unrealistic
  – comparisons still difficult to make, because systems are quite different on many dimensions
  – focus on batch ranking rather than interaction
  – no focus on the WWW

TREC is changing
• Emphasis on specialized “tracks”
  – Interactive track
  – Natural Language Processing (NLP) track
  – Multilingual tracks (Chinese, Spanish)
  – Filtering track
  – High-Precision
  – High-Performance
• http://trec.nist.gov/

What to Evaluate?
• Effectiveness
  – Difficult to measure
  – Recall and Precision are one way
  – What might be others?
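One candidate already introduced in these slides is the E-measure, which folds precision and recall into a single number. The sketch below follows the formula given earlier, $E = 1 - (1 + b^2) / (b^2/R + 1/P)$; the function name, the sample values, and the convention of returning 1.0 (the worst score) when precision or recall is zero are assumptions.

```python
def e_measure(precision, recall, b=1.0):
    """van Rijsbergen's E = 1 - (1 + b^2) / (b^2 / R + 1 / P); lower is better."""
    if precision == 0 or recall == 0:
        return 1.0  # assumed convention: worst possible score when P or R is zero
    return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)

# b = 1 weights precision and recall equally.
print(e_measure(0.5, 0.5, b=1.0))  # 0.5

# b = 0.5 emphasises precision, so the high-precision run scores better (lower E).
print(e_measure(0.8, 0.4, b=0.5))  # high precision, low recall
print(e_measure(0.4, 0.8, b=0.5))  # low precision, high recall
```

A single number like this makes systems easier to rank, at the cost of hiding which of the two quantities is responsible for a given score.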