Lecture: search

Search
Information retrieval
Information and scientific progress
Let’s go back in time with this video:
The American Engineer
Information retrieval—and information
science—emerge in a post-WW2 atmosphere
where scientific and technical advance is key to
social progress. Information serves this goal.
The basic IR task model
The goal of information retrieval is often
described something like this:
Given a user information need (typically
expressed as a search phrase or query), provide
relevant documents from the collection.
Note that key term relevance. We’ll come back
to it.
A lovely diagram of the IR process
Diagram from Jaime Arguello, UNC-Chapel Hill
Background: indexing
Indexing is the process of representing the subject of
documents. An index can comprise terms extracted from
the document itself or terms from another source.
An index language specifies the concepts used for
indexing. An index language includes terms to represent
concepts and a syntax for assigning (and perhaps
combining, etc) terms.
A thesaurus is a type of index language in which terms
are controlled and relationships between terms
established.
Background: indexing
Precoordinate terms combine concepts at the
point of indexing: foraging for wild mushrooms
is a precoordinate term.
Postcoordinate terms are combined only at the
time of search: foraging and wild mushrooms
are postcoordinate terms. (The searcher combines
them at search time, e.g., foraging AND wild
mushrooms.)
Background: indexing
A classification (in this context) is a subject scheme that
expresses complex concepts by means of a notation; a
document is placed in the single class that best expresses
its subject. Most classifications are enumerative;
individual classes are defined in advance. Some
classifications have synthetic elements (ways to customize
classes for particular geographic locations, for example).
A faceted classification is a synthetic scheme in which
various elements of the subject are combined at the time
of indexing (via a syntax) to create classes.
Background: indexing
Exhaustivity is an indexing principle that describes
how much of a document’s content is indexed. (The
“main” subject? The first three “important”
subjects? Everything mentioned in the document?)
Specificity is an indexing principle that describes the
level of abstraction at which a document’s content is
indexed. (Is the subject information retrieval or
information retrieval evaluation methods or
precision and recall measures?)
Background: indexing
The Cranfield experiments test the relative
effectiveness of three types of indexing languages:
• Single words extracted from the text itself.
• Concepts (phrases) extracted from the text itself.
• Terms drawn from a controlled vocabulary.
Multiple variations of these three were also created
(single words plus synonyms, controlled terms plus
broader and narrower concepts).
Cranfield experiments
Although the Cranfield tests did not involve
computers (!), they illustrate the basis from
which information retrieval evaluations are still
conducted, both in the conceptualization of the
retrieval task and in the methods employed to
measure it.
Also, the results were SHOCKING.
Again: social context of IR beginnings
Cleverdon: librarian at a college of aeronautics.
Interested in new indexing technologies. Wants
to systematically test them.
The unspoken retrieval context: Facilitate the
efficient progress of technical specialists
working on specific problems.
Cranfield protocol
1. Assemble a test collection of 1400 research
papers, primarily in aerodynamics.
2. Index each document with test languages.
3. Develop test queries.
4. Determine relevance of each document to each
query, using a 1-4 scale.
5. Search collection using a specified query, index
language, relevance level, and search rule
(number of matching terms, for example).
6. Determine which of the retrieved documents are
relevant to the query.
Cranfield protocol
Where do the test queries come from?
Cleverdon and colleagues asked the authors of scientific
papers to state in the form of a question “the problem to
which their paper was addressed.”
(The example given is “small deflection theory of simply
supported cylinders.”)
In using this to approximate an information seeker,
Cranfield focuses on specialists who have an extremely
good idea of what they are looking for.
Cranfield protocol
Where do the relevance judgments come from?
Students do the first pass, and the original
document authors confirm.
Again, this may be an excellent approximation of
“the real thing,” but it is a certain sort of
approximation.
Cranfield protocol
Search results were then described according to
the precision and recall measures.
Precision: Percentage of documents retrieved
that are relevant.
Recall: Percentage of relevant documents
retrieved out of all possible relevant documents.
Precision and recall
Say I have a collection of 10 documents. For a
query X, documents 2, 7, and 8 are relevant.
My search for query X retrieves documents 7 and 8
but not 2; it also retrieves documents 4 and 5.
Precision for this search is 2 relevant documents out
of 4 retrieved, or 50 percent.
Recall for this search is 2 relevant documents out of
3 possible relevant, or 67 percent.
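A minimal sketch of that arithmetic in Python, using the
document IDs from the toy example above:

    # Precision and recall for the toy example above.
    relevant = {2, 7, 8}          # documents judged relevant to query X
    retrieved = {7, 8, 4, 5}      # documents the search actually returned

    hits = relevant & retrieved   # relevant documents that were retrieved
    precision = len(hits) / len(retrieved)   # 2 / 4 = 0.50
    recall = len(hits) / len(relevant)       # 2 / 3 = 0.67

    print(f"precision = {precision:.0%}, recall = {recall:.0%}")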
Cranfield results
They do some math to normalize the results to a
single number. And, the best performing index
language is...
Single terms extracted from the document itself
with some word forms (e.g., alloy + alloys)
added.
OMG!
Cranfield legacy
The Cranfield methodology is still used in
evaluating IR systems: TREC (the Text REtrieval
Conference), sponsored by NIST, in which
researchers use a shared collection and set of
queries to compare their systems’ effectiveness,
follows the same protocol...including human
relevance assessors.
The Cranfield measures and means of
conceptualizing retrieval also persist.
2009 TREC Web track query
Query: appraisals
Description: How are home values appraised? I
want to know how home appraisals are done.
Cranfield legacy
TREC assessors
Photo from NIST
Wait, what were those people assessing?
Relevance. Er, what is that, exactly?
Exactly.
Saracevic points out that IR—and, to some
extent, information science as a whole—rests
upon an uncertain concept. Is this problematic?
Relevance in information science
Saracevic noted a pattern in operational
definitions of relevance as used by information
scientists testing retrieval systems:
Relevance is the A of a B existing between a C
and a D as determined by an E.
Relevance
Saracevic describes seven different views of
relevance, which emphasize different elements
of the concept:
1. Subject knowledge view. What is the
relationship between a question and a
subject?
2. Subject literature view. What is the
relationship between a question and the
academic literature on a subject?
Relevance
Saracevic describes seven different views of
relevance, which emphasize different elements
of the concept:
3. Logical view. What is the nature of the
inference between a question and conclusions
from a subject literature?
4. System view. What is the relationship between
a file’s contents and a question, a user, and a
subject?
Relevance
Saracevic describes seven different views of
relevance, which emphasize different elements
of the concept:
5. Destination’s view. What is the human
judgment of the relation between a document
and a question?
6. Pertinence view. What is the relationship
between a seeker’s knowledge and a subject?
Relevance
Saracevic describes seven different views of
relevance, which emphasize different elements
of the concept:
7. Pragmatic view. What is the relation between
a user’s problem and the information
provided?
Back to representation and comparison
Diagram from Jaime Arguello, UNC-Chapel Hill
Document representations
In the Cranfield tests, documents were represented
with a small set of index terms.
If terms from the document itself are the best index
terms anyway (maybe!), why not use ALL the text of
the document? This is full-text indexing.
To speed processing, the text of documents is
split into terms (tokenized) and rearranged to
form an inverted file.
Inverted index file
All the terms from all the documents are put in a table (T =
term, D = document). For a Boolean search (in a moment!)
these tables just indicate presence and absence (1 or 0). Other
models weight terms differently to rank retrieved documents.
Table from Joseph Tennis, UW
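A minimal sketch of how such an inverted file can be built in
Python; the two toy documents are invented for illustration:

    from collections import defaultdict

    # Presence/absence inverted index: term -> set of document IDs.
    docs = {
        1: "foraging for wild mushrooms",
        2: "wild mushrooms of the pacific northwest",
    }

    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # naive whitespace tokenization
            inverted[term].add(doc_id)

    print(inverted["mushrooms"])   # {1, 2}
    print(inverted["foraging"])    # {1}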
Other text processing
Retrieval systems may remove common words
(stopwords) from the inverted file, although this
is not as common as it once was.
Retrieval systems may also stem words to store
just their roots.
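A rough sketch of both steps; the stopword list and the
suffix-stripping rules below are simplified stand-ins for real
components (such as the Porter stemmer):

    # Toy normalization: drop stopwords, then crudely strip common suffixes.
    STOPWORDS = {"the", "of", "for", "a", "an", "and", "or"}

    def normalize(tokens):
        stems = []
        for t in tokens:
            if t in STOPWORDS:
                continue
            if t.endswith("ing") and len(t) > 5:
                t = t[:-3]          # foraging -> forag
            elif t.endswith("s") and len(t) > 3:
                t = t[:-1]          # mushrooms -> mushroom
            stems.append(t)
        return stems

    print(normalize("foraging for wild mushrooms".lower().split()))
    # ['forag', 'wild', 'mushroom']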
Retrieval models
Most retrieval systems are variations of the
following models:
• Boolean.
• Vector space.
• Probabilistic.
• Latent semantic indexing.
Boolean model
This is the oldest and simplest model; it puts
most of the work on the searcher. Instead of
searching for mere presence of index terms in a
document (like the Cranfield tests), the searcher
uses Boolean operators (AND, OR, NOT) to
describe the query more precisely.
( (“traumatic brain injury” OR TBI OR tbi OR
“traumatic brain injuries”) AND (headache OR
headaches) ) NOT concussion
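With a presence/absence inverted file, these operators map
directly onto set operations. A sketch using a simplified
version of the query above, with invented posting sets:

    # Boolean retrieval as set algebra over posting lists.
    postings = {
        "traumatic brain injury": {1, 3},
        "tbi": {2},
        "headache": {1, 2, 4},
        "concussion": {2, 5},
    }

    # ( ("traumatic brain injury" OR tbi) AND headache ) NOT concussion
    result = ((postings["traumatic brain injury"] | postings["tbi"])
              & postings["headache"]) - postings["concussion"]
    print(result)   # {1}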
Ranked results
Other retrieval models focus on using various
statistical properties of texts (primarily) to
produce ranked lists of results.
A key element in calculating these rankings is
the frequency of significant terms. This measure
relies on a property of language known as Zipf’s
law.
Zipf’s law
Zipf’s law describes a distribution in which the i-th most
frequent object appears about 1/i^θ times as often as the
most frequent object.
Er, what?
Zipf’s law applies to language, for any corpus, including a
single document. With θ = 1, the most frequent word (say)
takes up N percent of the document; the next most frequent
word takes up N/2 percent of the document; the next most
frequent word takes up N/3 percent, and so on.
Zipf distributions appear for other phenomena as well.
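A small sketch that makes the rank-frequency pattern concrete:
with θ = 1, the word at rank i gets the rank-1 share divided
by i. The 6 percent figure is just an assumed starting point.

    # Expected word shares under Zipf's law with theta = 1.
    theta = 1.0
    top_share = 0.06        # suppose the most frequent word is 6% of the text
    for rank in range(1, 6):
        share = top_share / (rank ** theta)
        print(f"rank {rank}: {share:.2%} of the text")
    # rank 1: 6.00%, rank 2: 3.00%, rank 3: 2.00%, rank 4: 1.50%, rank 5: 1.20%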
Zipf’s law
A Zipf distribution at 300 data points.
Graph from Jakob Nielsen, useit.com
Implications of Zipf’s law
Zipf’s law holds true across languages, across
types of text (written and spoken, etc.), across
complexity of topics, document genres, etc.
This statistical property implies that the most
important words (for retrieval purposes) in a
document are those that are most frequent in the
document and least frequent in the collection.
Tf/idf
This relation is exploited by many retrieval
models:
term frequency / inverse document frequency (tf-idf)
You can refine your calculation of tf-idf (e.g., by
taking account of document length), but this is
the basic idea.
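A minimal sketch of the basic idea, using one common
formulation (tf multiplied by log(N/df)); the tiny collection
is invented:

    import math
    from collections import Counter

    docs = [
        "wild mushrooms wild foraging",
        "mushroom identification guide",
        "foraging safety guide",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for terms in tokenized:
        df.update(set(terms))

    def tf_idf(term, doc_terms):
        tf = doc_terms.count(term)        # frequent in this document -> larger
        idf = math.log(N / df[term])      # rare in the collection -> larger
        return tf * idf

    print(tf_idf("wild", tokenized[0]))   # high: 2 occurrences, in 1 of 3 docs
    print(tf_idf("guide", tokenized[1]))  # low: 1 occurrence, in 2 of 3 docs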
Vector space model
The vector space model (originated by Salton)
compares a query and a document as the
correlation between two n-dimensional vectors.
The cosine of the angle between the vectors is
used to quantify the correlation.
Index terms in queries and documents are
weighted based on tf-idf (and other properties).
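A sketch of the comparison step: the query and documents are
vectors over the same term dimensions, scored by cosine. The
weights here are invented.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))

    # Dimensions: [foraging, mushrooms, safety] (tf-idf weights, invented).
    query = [1.0, 1.0, 0.0]
    doc_a = [2.2, 1.1, 0.0]
    doc_b = [0.0, 0.4, 1.3]

    print(cosine(query, doc_a))   # higher: shares both query terms
    print(cosine(query, doc_b))   # lower: shares only one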
Probabilistic model
The probabilistic model (introduced by
Robertson and Sparck Jones) iteratively refines
an answer set based on an initial guess at the
“ideal” answer set.
Initial probabilistic models did not use weights,
but later versions do.
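One concrete instance of this idea is the Robertson/Sparck
Jones term weight, sketched here with invented counts: a term
counts for more as it turns out to be concentrated in the
documents judged (or guessed) relevant, and the weights are
recomputed as that guess improves.

    import math

    def rsj_weight(N, n, R, r):
        # N: docs in collection, n: docs containing the term,
        # R: docs judged relevant so far, r: relevant docs containing the term.
        # The 0.5 smoothing avoids division by zero.
        return math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))

    # First pass: no relevance information yet (an idf-like weight).
    print(rsj_weight(N=1400, n=50, R=0, r=0))
    # After feedback: the term appears in 8 of the 10 docs judged relevant.
    print(rsj_weight(N=1400, n=50, R=10, r=8))   # weight goes up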
Latent semantic indexing
Latent semantic indexing is a retrieval model
that attempts to align documents with concepts,
on the observation that terms that co-occur
probably indicate something about a shared
conceptual space. Documents and queries are
mapped to a “concept space” and then compared.
LSI aims to return documents based on a
conceptual match to a query, as opposed to a
term match.
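A rough sketch of the mechanics using numpy: factor the
term-document matrix with an SVD, keep the top k singular
values as the “concept space,” and compare a query to
documents there. The tiny matrix is invented.

    import numpy as np

    # Rows = terms, columns = documents (toy counts).
    #                d1  d2  d3
    A = np.array([[2., 0., 1.],     # "mushroom"
                  [1., 0., 1.],     # "foraging"
                  [0., 3., 0.],     # "aerodynamics"
                  [0., 2., 0.]])    # "cylinder"

    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                           # keep the top-k concepts
    Uk = U[:, :k]

    doc_concepts = Uk.T @ A         # each column: a document in concept space
    q = np.array([1., 1., 0., 0.])  # query: "mushroom foraging"
    q_concepts = Uk.T @ q           # query in the same concept space

    # Cosine similarity between the query and each document in concept space.
    sims = (doc_concepts.T @ q_concepts) / (
        np.linalg.norm(doc_concepts, axis=0) * np.linalg.norm(q_concepts))
    print(sims)   # d1 and d3 score well above d2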
Lots of other stuff...
Individual systems may incorporate many
different sorts of information to adjust the
rankings produced by one of these basic models:
using text structure to refine weights (titles, etc),
using your location or previous Web history to
adjust rankings, promoting recent content, etc.
What’s Google?
The original Google innovation was a ranking
enhancement outside of the primary retrieval
model.
They looked at the links to a page as a measure
of information quality. This “PageRank” was
used to adjust initial results of a retrieval system.
Google is not magic.
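A hedged sketch of the core PageRank computation (power
iteration over a tiny, invented link graph); production
ranking adds far more than this.

    # PageRank by power iteration on a toy web graph.
    links = {
        "a": ["b", "c"],   # page a links to b and c
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],        # d links out, but nothing links to d
    }
    pages = list(links)
    damping = 0.85
    rank = {p: 1.0 / len(pages) for p in pages}

    for _ in range(50):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))
    # "c" ranks highest: it collects links from a, b, and d.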