Search
Information retrieval

Information and scientific progress
Let's go back in time with this video: "The American Engineer."
Information retrieval, and information science more broadly, emerge in a post-WW2 atmosphere where scientific and technical advance is key to social progress. Information serves this goal.

The basic IR task model
The goal of information retrieval is often described something like this: given a user information need (typically expressed as a search phrase or query), provide relevant documents from the collection.
Note that key term, relevance. We'll come back to it.

A lovely diagram of the IR process
Diagram from Jaime Arguello, UNC-Chapel Hill.

Background: indexing
Indexing is the process of representing the subject of documents. An index can comprise terms extracted from the document itself or terms from another source.
An index language specifies the concepts used for indexing. An index language includes terms to represent concepts and a syntax for assigning (and perhaps combining) terms.
A thesaurus is a type of index language in which terms are controlled and relationships between terms are established.

Background: indexing
Precoordinate terms combine concepts at the point of indexing: foraging for wild mushrooms is a precoordinate term.
Postcoordinate terms are combined only at the time of search: foraging and wild mushrooms are postcoordinate terms. (The searcher would input "foraging for wild mushrooms.")

Background: indexing
A classification (in this context) is a subject scheme that expresses complex concepts by means of a notation; a document is placed in the single class that best expresses its subject.
Most classifications are enumerative: individual classes are defined in advance. Some classifications have synthetic elements (ways to customize classes for particular geographic locations, for example).
A faceted classification is a synthetic scheme in which various elements of the subject are combined at the time of indexing (via a syntax) to create classes.

Background: indexing
Exhaustivity is an indexing principle that describes how much of a document's content is indexed. (The "main" subject? The first three "important" subjects? Everything mentioned in the document?)
Specificity is an indexing principle that describes the level of abstraction at which a document's content is indexed. (Is the subject information retrieval, or information retrieval evaluation methods, or precision and recall measures?)

Background: indexing
The Cranfield experiments test the relative effectiveness of three types of indexing languages:
• Single words extracted from the text itself.
• Concepts (phrases) extracted from the text itself.
• Terms drawn from a controlled vocabulary.
Multiple variations of these three were also created (single words plus synonyms, controlled terms plus broader and narrower concepts).

Cranfield experiments
Although the Cranfield tests did not involve computers (!), they illustrate the basis from which information retrieval evaluations are still conducted, both in the conceptualization of the retrieval task and in the methods employed to measure it.
Also, the results were SHOCKING.

Again: social context of IR beginnings
Cleverdon: librarian at a college of aeronautics. Interested in new indexing technologies. Wants to test them systematically.
The unspoken retrieval context: facilitate the efficient progress of technical specialists working on specific problems.

Cranfield protocol
1. Assemble a test collection of 1400 research papers, primarily in aerodynamics.
2. Index each document with the test languages.
3. Develop test queries.
4. Determine the relevance of each document to each query, using a 1-4 scale.
5. Search the collection using a specified query, index language, relevance level, and search rule (number of matching terms, for example).
6. Determine which of the retrieved documents are relevant to the query.

Cranfield protocol
Where do the test queries come from? Cleverdon and colleagues asked the authors of scientific papers to state, in the form of a question, "the problem to which their paper was addressed." (The example given is "small deflection theory of simple supported cylinders.")
In using this to approximate an information seeker, Cranfield focuses on specialists who have an extremely good idea of what they are looking for.

Cranfield protocol
Where do the relevance judgments come from? Students do the first pass, and the original document authors confirm.
Again, this may be an excellent approximation of "the real thing," but it is a certain sort of approximation.

Cranfield protocol
Search results were then described according to the precision and recall measures.
Precision: the percentage of retrieved documents that are relevant.
Recall: the percentage of all relevant documents in the collection that are retrieved.

Precision and recall
Say I have a collection of 10 documents. For a query X, documents 2, 7, and 8 are relevant. My search for query X retrieves documents 7 and 8 but not 2; it also retrieves documents 4 and 5.
Precision for this search is 2 relevant documents out of 4 retrieved, or 50 percent.
Recall for this search is 2 relevant documents out of 3 possible relevant documents, or 67 percent.
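To make the arithmetic concrete, here is a minimal Python sketch of the same calculation, using the hypothetical document IDs from the example above:

```python
# Minimal sketch: precision and recall for a single query.
# Document IDs follow the worked example (hypothetical 10-document collection).

relevant = {2, 7, 8}        # documents judged relevant to query X
retrieved = {4, 5, 7, 8}    # documents the system returned for query X

true_positives = relevant & retrieved   # relevant documents that were retrieved

precision = len(true_positives) / len(retrieved)   # 2 / 4 = 0.50
recall = len(true_positives) / len(relevant)       # 2 / 3 ~= 0.67

print(f"Precision: {precision:.2f}")   # 0.50
print(f"Recall:    {recall:.2f}")      # 0.67
```

Real evaluations aggregate these figures over many queries (and, for ranked output, over cutoff points in the ranking), but the underlying calculation is this simple ratio.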
Cranfield results
They do some math to normalize the results to a single number. And the best-performing index language is...
Single terms extracted from the document itself, with some word forms (e.g., alloy + alloys) added. OMG!

Cranfield legacy
The Cranfield methodology is still used in evaluating IR systems: TREC (the Text REtrieval Conference), sponsored by NIST, in which researchers use a shared collection and set of queries to compare their systems' effectiveness, follows the same protocol, including human assessors.
The Cranfield measures and means of conceptualizing retrieval also persist.

2009 TREC Web track query
Query: appraisals
Description: How are home values appraised? I want to know how home appraisals are done.

Cranfield legacy
TREC assessors. Photo from NIST.
Wait, what were those people assessing? Relevance. Er, what is that, exactly? Exactly.
Saracevic points out that IR (and, to some extent, information science as a whole) rests upon an uncertain concept. Is this problematic?

Relevance in information science
Saracevic noted a pattern in operational definitions of relevance as used by information scientists testing retrieval systems:
Relevance is the A of a B existing between a C and a D as determined by an E.

Relevance
Saracevic describes seven different views of relevance, which emphasize different elements of the concept:
1. Subject knowledge view. What is the relationship between a question and a subject?
2. Subject literature view. What is the relationship between a question and the academic literature on a subject?
3. Logical view. What is the nature of the inference between a question and conclusions from a subject literature?
4. System view. What is the relationship between a file's contents and a question, a user, and a subject?
5. Destination's view. What is the human judgment of the relation between a document and a question?
6. Pertinence view. What is the relationship between a seeker's knowledge and a subject?
7. Pragmatic view. What is the relation between a user's problem and the information provided?

Back to representation and comparison
Diagram from Jaime Arguello, UNC-Chapel Hill.

Document representations
In the Cranfield tests, documents were represented with a small set of index terms. If terms from the document itself are the best index terms anyway (maybe!), and we can use ALL the text of the document, why not? This is full-text indexing.
To speed processing, the text of documents is extracted and rearranged (tokenized) to form an inverted file.

Inverted index file
All the terms from all the documents are put in a table (T = term, D = document). For a Boolean search (in a moment!) these tables just indicate presence or absence (1 or 0). Other models weight terms differently to rank retrieved documents.
Table from Joseph Tennis, UW.

Other text processing
Retrieval systems may remove common words (stopwords) from the inverted file, although this is not as common as it once was. Retrieval systems may also stem words to store just their roots.

Retrieval models
Most retrieval systems are variations of the following models:
• Boolean.
• Vector space.
• Probabilistic.
• Latent semantic indexing.

Boolean model
This is the oldest and simplest model; it puts most of the work on the searcher. Instead of searching for the mere presence of index terms in a document (like the Cranfield tests), use Boolean operators (AND, OR, NOT) to describe the query more precisely.
(("traumatic brain injury" OR TBI OR tbi OR "traumatic brain injuries") AND (headache OR headaches)) NOT concussion

Ranked results
Other retrieval models use various statistical properties of texts (primarily) to produce ranked lists of results. A key element in calculating these rankings is the frequency of significant terms. This measure relies on a property of language known as Zipf's law.

Zipf's law
Zipf's law describes a distribution in which the ith most frequent object appears 1/i^θ times as often as the most frequent object (with θ close to 1 for natural language). Er, what?
Zipf's law applies to language, for any corpus, including a single document. So the most frequent word accounts for (say) N percent of the word occurrences in the document; the next most frequent word accounts for N/2 percent; the next most frequent word accounts for N/3 percent; and so on. Zipf distributions appear for other phenomena as well.

Zipf's law
A Zipf distribution at 300 data points. Graph from Jakob Nielsen, useit.com.

Implications of Zipf's law
Zipf's law holds true across languages, across types of text (written and spoken), across complexity of topics, document genres, and so on. This statistical property implies that the most important words (for retrieval purposes) in a document are those that are most frequent in the document and least frequent in the collection.

Tf-idf
This relation is exploited by many retrieval models as term frequency times inverse document frequency (tf-idf); a small sketch follows below. You can refine the calculation of tf-idf (e.g., by taking account of document length), but this is the basic idea.
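To tie the last few slides together, here is a small Python sketch using a made-up three-document collection: it tokenizes each text, builds an inverted index, and computes a tf-idf weight for a term in a document. The tokenization and the exact weighting formula (raw term frequency times the log of inverse document frequency) are deliberate simplifications; real systems vary both.

```python
import math
from collections import defaultdict

# Toy collection of hypothetical documents (IDs are arbitrary).
docs = {
    1: "foraging for wild mushrooms",
    2: "wild mushrooms and where to find them",
    3: "information retrieval evaluation methods",
}

# Build an inverted index: term -> {doc_id: term frequency in that document}.
index = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.lower().split():               # crude tokenization
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

N = len(docs)

def tf_idf(term, doc_id):
    """tf-idf weight of a term in a document: tf * log(N / df)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    tf = postings[doc_id]          # how often the term occurs in this document
    df = len(postings)             # how many documents contain the term
    return tf * math.log(N / df)   # frequent here, rare overall -> high weight

print(dict(index["wild"]))               # {1: 1, 2: 1}: postings for "wild"
print(round(tf_idf("foraging", 1), 3))   # ~1.099: occurs in only one document
print(round(tf_idf("wild", 1), 3))       # ~0.405: occurs in two of three documents
```

A Boolean search needs only the presence or absence recorded in the postings; ranked models use weights like these to order results.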
Vector space model
The vector space model (originated by Salton) compares a query and a document as the correlation between two n-dimensional term vectors; the cosine of the angle between the vectors quantifies the correlation (see the sketch at the end of this section). Index terms in queries and documents are weighted based on tf-idf (and other properties).

Probabilistic model
The probabilistic model (introduced by Robertson and Sparck Jones) recursively refines an answer set based on guessing at the "ideal" answer. Initial probabilistic models did not use weights, but later versions do.

Latent semantic indexing
Latent semantic indexing (LSI) is a retrieval model that attempts to align documents with concepts, on the observation that terms that co-occur probably indicate something about a shared conceptual space. Documents and queries are mapped to a "concept space" and then compared. LSI aims to return documents based on a conceptual match to a query, as opposed to a term match.

Lots of other stuff...
Individual systems may incorporate many different sorts of information to adjust the rankings produced by one of these basic models: using text structure (titles, etc.) to refine weights, using your location or previous Web history to adjust rankings, promoting recent content, and so on.

What's Google?
The original Google innovation was a ranking enhancement outside of the primary retrieval model: they treated the links to a page as a measure of information quality. This "PageRank" score was used to adjust the initial results of a retrieval system. Google is not magic.
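To make the vector space comparison concrete, here is a minimal sketch of cosine similarity between sparse term-weight vectors, assuming the query and documents are already represented as dictionaries of (hypothetical) tf-idf weights like those sketched earlier:

```python
import math

def cosine(vec_a, vec_b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf weighted vectors.
query = {"wild": 0.4, "mushrooms": 0.4}
doc_1 = {"foraging": 1.1, "wild": 0.4, "mushrooms": 0.4}
doc_3 = {"information": 1.1, "retrieval": 1.1}

print(cosine(query, doc_1))   # > 0: shares weighted terms with the query
print(cosine(query, doc_3))   # 0.0: no terms in common
```

Ranking is then just sorting documents by their cosine score against the query.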
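And since PageRank comes up: the score itself is an iterative computation over the link graph. A minimal sketch, using a tiny made-up graph and the commonly cited damping factor of 0.85; this illustrates the general idea rather than Google's actual production system.

```python
# Minimal PageRank sketch: power iteration over a tiny hypothetical link graph.
links = {
    "a": ["b", "c"],   # page a links to b and c
    "b": ["c"],
    "c": ["a"],
}

damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}    # start with uniform ranks

for _ in range(50):                             # iterate until ranks stabilize
    new_rank = {}
    for p in pages:
        # Rank flowing into p from every page that links to it.
        incoming = sum(rank[q] / len(links[q]) for q in pages if p in links[q])
        new_rank[p] = (1 - damping) / len(pages) + damping * incoming
    rank = new_rank

print(rank)   # pages with more (and better-ranked) in-links score higher
```

Scores like these are then combined with the output of the underlying retrieval model to adjust the final ranking.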