CS 430: Information Discovery
Lecture 26: Automated Information Retrieval

Course Administration

Information Discovery
People have many reasons to look for information:
• Known item: Where will I find the wording of the US Copyright Act?
• Facts: What is the capital of Barbados?
• Introduction or overview: How do diesel engines work?
• Related information: Is there a review of this article?
• Comprehensive search: What is known of the effects of global warming on hurricanes?

Types of Information Discovery
[Diagram: types of information discovery, organized by media type (text vs. image, video, audio, etc.), by method (searching, browsing, linking), and by human effort (manually created catalogs and indexes (metadata), statistical methods with no human effort, user-in-the-loop); related topics, including natural language processing, are covered in CS 474 and CS 502]

Automated information discovery
Creating catalog records manually is labor intensive and hence expensive. The aim of automatic indexing is to build indexes and retrieve information without human intervention. The aim of automated information discovery is for users to discover information without using skilled human effort to build indexes.

Resources for automated information discovery
Computer power:
• brute force computing
• ranking methods
• automatic generation of metadata
The intelligence of the user:
• browsing
• relevance feedback
• information visualization

Brute force computing
Few people really understand Moore's Law:
• computing power doubles every 18 months
• it increases 100 times in 10 years
• it increases 10,000 times in 20 years
Simple algorithms + immense computing power may outperform human intelligence.

Problems with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all.
• Encourages short queries
• Requires precise choice of index terms (professional indexing)
• Requires precise formulation of queries (professional training)

Relevance and Ranking
Classical methods assume that a document is either relevant to a query or not relevant. Often a user will consider a document to be partially relevant.
Ranking methods measure the degree of similarity between a query and a document.
[Diagram: requests and documents mapped into a common space; the question is how similar a document is to a request]

Contrast with (old-fashioned) Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all.
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents.
• Encourages long queries (to have as many dimensions as possible)
• Benefits from large numbers of index terms
• Permits queries with many terms, not all of which need match the document

SMART System
An experimental system for automatic information retrieval:
• automatic indexing to assign terms to documents and queries
• identify documents to be retrieved by calculating similarities between documents and queries
• collect related documents into common subject classes
• procedures for producing an improved search query based on information obtained from earlier searches
Gerald Salton and colleagues: Harvard 1964-1968, Cornell 1968-1988

The index term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: two document vectors d1 and d2 in a three-dimensional term space with axes t1, t2, t3]

Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.
The term assignment array T is defined as follows:
• if term j does not occur in document i, t_ij = 0
• if term j occurs in document i, t_ij > 0 (the value of t_ij is called the weight of term j in document i)
The similarity between d_i and d_j is defined as:

    cos(d_i, d_j) = ( Σ_{k=1..n} t_ik * t_jk ) / ( |d_i| |d_j| )
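As a concrete illustration, here is a minimal Python sketch of the cosine computation above. The function name, the three-term vocabulary, and the weight values are invented for the example and are not taken from the lecture.

```python
import math

def cosine(d_i, d_j):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(t_ik * t_jk for t_ik, t_jk in zip(d_i, d_j))
    norm_i = math.sqrt(sum(t * t for t in d_i))
    norm_j = math.sqrt(sum(t * t for t in d_j))
    if norm_i == 0 or norm_j == 0:
        return 0.0          # a document with no terms is similar to nothing
    return dot / (norm_i * norm_j)

# Illustrative weights for terms t1, t2, t3 in two documents.
d1 = [2.0, 0.0, 1.0]
d2 = [1.0, 1.0, 0.0]
print(cosine(d1, d2))       # a value between 0 and 1
```

Because every weight t_ij is non-negative, the similarity always falls between 0 (no terms in common) and 1 (same direction), which is what makes it usable directly as a ranking score.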
Term weighting
Zipf's Law: if the words w in a collection are ranked by frequency, with rank r(w) and frequency f(w), they roughly fit the relation r(w) * f(w) = c.
This suggests that some terms are more effective than others in retrieval. In particular, relative frequency is a useful measure: it identifies terms that occur with substantial frequency in some documents but with relatively low overall collection frequency. Term weights are functions that are used to quantify these concepts.

Term Frequency
Concept: a term that appears many times within a document is likely to be more important than a term that appears only once.

Inverse Document Frequency
Concept: a term that occurs in a few documents is likely to be a better discriminator than a term that appears in most or all documents.

Ranking -- Practical Experience
1. The basic method is the inner (dot) product with no weighting.
2. The cosine measure (dividing by the product of the vector lengths) normalizes for vectors of different lengths.
3. Term weighting using the frequency of a term within a document usually improves ranking.
4. Term weighting using an inverse function of a term's frequency across the entire collection (e.g., IDF) improves ranking.
5. Weightings for document structure improve ranking.
6. Relevance weightings after an initial retrieval improve ranking.
The effectiveness of these methods depends on the characteristics of the collection. In general, there are few improvements beyond simple weighting schemes.
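To make term frequency and inverse document frequency concrete, here is a minimal tf.idf sketch in Python. The toy corpus and function name are invented, and the weighting shown (raw count times log(N/df)) is only one of several common variants.

```python
import math
from collections import Counter

# Toy corpus: each document is a list of already-tokenized terms.
docs = [
    ["warming", "hurricane", "climate", "warming"],
    ["diesel", "engine", "fuel"],
    ["hurricane", "forecast", "model"],
]

n_docs = len(docs)
# Document frequency: the number of documents that contain each term.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    """Weight each term by (frequency in the document) * log(N / document frequency)."""
    tf = Counter(doc)
    return {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

for doc in docs:
    print(tf_idf(doc))
```

Note how "warming", which occurs twice in one document and nowhere else, receives a high weight, while a term appearing in every document would receive a weight of zero.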
PageRank Algorithm (Google)
Concept: the rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.

Google PageRank Model
A user:
1. Starts at a random page on the web.
2a. With probability p, selects any random page and jumps to it.
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats steps 2a and 2b a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.
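A minimal sketch of this random-surfer model in Python; the link graph, the jump probability p = 0.15, and the number of steps are illustrative assumptions, not values given in the lecture.

```python
import random
from collections import Counter

# Illustrative link graph: page -> pages it links to.
links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],   # no page links to D, so it is reached only by random jumps
}
pages = list(links)
p = 0.15          # probability of jumping to a random page

def simulate(steps=100_000):
    visits = Counter()
    current = random.choice(pages)                   # 1. start at a random page
    for _ in range(steps):
        if random.random() < p or not links[current]:
            current = random.choice(pages)           # 2a. jump to any random page
        else:
            current = random.choice(links[current])  # 2b. follow a random hyperlink
        visits[current] += 1
    # Rank pages by the relative frequency with which they were visited.
    return {page: count / steps for page, count in visits.most_common()}

print(simulate())
```

In practice the same ranking is computed by iterating the corresponding system of linear equations rather than by simulation, but the simulation follows the model on the slide directly.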
Compare TF.IDF to PageRank
With tf.idf, documents are ranked depending on how well they match a specific query.
With PageRank, pages are ranked in order of importance, with no reference to a specific query.

Latent Semantic Indexing
Objective: replace indexes that use sets of index terms with indexes that use concepts.
Approach: map the index term vector space into a lower-dimensional space, using singular value decomposition.

Use of Concept Space: Term Suggestion

Non-Textual Materials
Content                  Attributes
maps                     lat. and long., content
photograph               subject, date and place
bird songs and images    field mark, bird song
software                 task, algorithm
data set                 survey characteristics
video                    subject, date, etc.

Direct Searching of Content
Sometimes it is possible to match a query against the content of a digital object. The effectiveness varies from field to field.
Examples:
• Images -- crude characteristics of color, texture, shape, etc.
• Music -- optical recognition of the score
• Bird song -- spectral analysis of sounds
• Fingerprints

Image Retrieval: Blobworld

Automated generation of metadata
• Vector methods are for textual material only.
• Metadata is needed for non-textual materials. (Vector methods can be applied to textual metadata.)
• Automated extraction of metadata is still weak because of the semantic knowledge needed.

Surrogates for non-textual materials
A textual catalog record about a non-textual item (e.g., a photograph) serves as a surrogate. Text-based methods of information retrieval can search the surrogate for the photograph.

Library of Congress catalog record
CREATED/PUBLISHED: [between 1925 and 1930?]
SUMMARY: U.S. President Calvin Coolidge sits at a desk and signs a photograph, probably in Denver, Colorado. A group of unidentified men look on.
NOTES: Title supplied by cataloger. Source: Morey Engle.
SUBJECTS: Coolidge, Calvin,--1872-1933. Presidents--United States--1920-1930. Autographing--Colorado--Denver--1920-1930. Denver (Colo.)--1920-1930. Photographic prints.
MEDIUM: 1 photoprint ; 21 x 26 cm. (8 x 10 in.)

Photographs: Cataloguing Difficulties
Automatic:
• Image recognition methods are very primitive.
Manual:
• Photographic collections can be very large.
• Many photographs may show the same subject.
• Photographs have little or no internal metadata (no title page).
• The subject of a photograph may not be known. (Who are the people in a picture? Where is the location?)

Automatic record for George W. Bush home page
DC-dot applied to http://www.georgewbush.com/
<link rel="schema.DC" href="http://purl.org/dc">
<meta name="DC.Subject" content="George W. Bush; Bush; George Bush; President; republican; 2000 election; election; presidential election; George; B2K; Bush for President; Junior; Texas; Governor; taxes; technology; education; agriculture; health care; environment; society; social security; medicare; income tax; foreign policy; defense; government">
<meta name="DC.Description" content="George W. Bush is running for President of the United States to keep the country prosperous.">
(continued on next slide)

Automatic record for George W. Bush home page (continued)
DC-dot applied to http://www.georgewbush.com/
<meta name="DC.Publisher" content="Concentric Network Corporation">
<meta name="DC.Date" scheme="W3CDTF" content="2001-01-12">
<meta name="DC.Type" scheme="DCMIType" content="Text">
<meta name="DC.Format" content="text/html">
<meta name="DC.Format" content="12223 bytes">
<meta name="DC.Identifier" content="http://www.georgewbush.com/">

Informedia: the need for metadata
A video sequence is awkward for information discovery:
• Textual methods of information retrieval cannot be applied.
• Browsing requires the user to view the sequence. Fast skimming is difficult.
• Computing requirements are demanding (MPEG-1 requires 1.2 Mbits/sec).
Surrogates are required.

Multi-Modal Information Discovery
The multi-modal approach to information retrieval uses computer programs to analyze video materials for clues (e.g., changes of scene):
• methods from artificial intelligence, e.g., speech recognition, natural language processing, image recognition
• analysis of the video track, sound track, closed captioning if present, and any other information
Each mode gives imperfect information. Therefore use many approaches and combine the evidence.

Informedia Library Creation
[Diagram: video is separated into image, audio, and text streams; speech recognition, image extraction, and natural language interpretation feed a segmentation step that produces segments with derived metadata]

Harnessing the intelligence of the user
• Relevance feedback
• Support for browsing
• Information visualization

The Human in the Loop
[Diagram: the user searches an index, which returns hits, and browses the repository, which returns objects]

Informedia: Information Discovery
[Diagram: the user queries via natural language and browses via multimedia surrogates; the system returns requested segments and metadata drawn from segments with derived metadata]

MIRA
Evaluation Frameworks for Interactive Multimedia Information Retrieval Applications
• Information retrieval techniques are beginning to be used in complex goal- and task-oriented systems whose main objectives are not just the retrieval of information.
• New original research in IR is being blocked or hampered by the lack of a broader framework for evaluation.
European study, 1996-99

MIRA Aims
• Bring the user back into the evaluation process.
• Understand the changing nature of IR tasks and their evaluation.
• 'Evaluate' traditional evaluation methodologies.
• Consider how evaluation can be prescriptive of IR design.
• Move towards a balanced approach (system versus user).
• Understand how interaction affects evaluation.
• Support the move from static to dynamic evaluation.
• Understand how new media affect evaluation.
• Make evaluation methods more practical for smaller groups.
• Spawn new projects to develop new evaluation frameworks.

Feedback in the Vector Space Model
Document vectors as points on a surface:
• Normalize all document vectors to be of length 1.
• Then the ends of the vectors all lie on a surface with unit radius.
• For similar documents, we can represent parts of this surface as a flat region.
• Similar documents are represented as points that are close together on this surface.

Relevance feedback (concept)
[Figure: hits from the original search plotted in document space; x marks documents identified as non-relevant, o marks documents identified as relevant; the reformulated query is shifted from the original query toward the relevant documents]
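A minimal sketch of how such a query reformulation can be computed, in the style of the classical Rocchio method used in vector-space systems such as SMART; the weights alpha, beta, gamma and the example vectors are illustrative assumptions, not values from the lecture.

```python
def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query vector toward relevant documents and away from non-relevant ones."""
    def centroid(vectors):
        if not vectors:
            return [0.0] * len(query)
        return [sum(values) / len(vectors) for values in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    # alpha keeps the original query, beta pulls toward relevant documents,
    # gamma pushes away from non-relevant ones (illustrative default weights).
    return [alpha * q + beta * r - gamma * n for q, r, n in zip(query, rel, non)]

# Illustrative term-weight vectors over a three-term vocabulary.
original_query = [1.0, 0.0, 0.5]
relevant_docs = [[0.9, 0.1, 0.8], [0.7, 0.0, 0.9]]
non_relevant_docs = [[0.1, 0.9, 0.0]]
print(reformulate(original_query, relevant_docs, non_relevant_docs))
```

The reformulated vector is then used as the query for the next search, which is the loop sketched in the figure above.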
Document clustering (concept)
[Figure: documents plotted as points in information space, grouped into clusters]
Document clusters are a form of automatic classification. A document may be in several clusters.

Browsing in Information Space
[Figure: a browsing path through documents in information space, beginning at a starting point]
Effectiveness depends on (a) the starting point, (b) effective feedback, and (c) convenience.

User Interface Concepts
Users need a variety of ways to search and browse, depending on the task being carried out and their preferred style of working:
• Visual icons: one-line headlines, film strip views, video skims, transcript following of the audio track
• Collages
• Semantic zooming
• Results set
• Named faces
• Skimming

Alexandria User Interface

Information Visualization: Tilebars
The figure represents a set of hits from a text search. Each large rectangle represents a document or section of text. Each row represents a search term or subquery. The density of each small square indicates the frequency with which a term appears in a section of a document. (Hearst, 1995)

Information Visualization: Dendrogram
[Figure: a dendrogram clustering the items alpha, delta, golf, bravo, echo, charlie, foxtrot; the vertical axis shows the level (1-6) at which clusters merge]

Information Visualization: Self-Organizing Maps (SOM)

Google has proved ...
For a very wide range of users, entirely automated selection, indexing, and ranking, combined with searching by untrained users and online browsing, is a very effective form of information discovery.

Searching
Changing users, changing user interfaces

From                        To
Trained user or librarian   Untrained user
Controlled vocabulary       Natural language
Fielded searching           Unfielded text
Manually created records    Full text
Boolean algorithms          Ranking methods
Stateful protocols          Stateless protocols

Information Discovery: 1991 and 2001

                    1991         2001
Content             print        online
Computing           expensive    inexpensive
Choice of content   selective    comprehensive
Index creation      human        automatic
Frequency           one time     monthly
Vocabulary          controlled   not controlled
Query               Boolean      ranked retrieval
Users               trained      untrained