CSE 535
Information Retrieval
Chapter 1: Introduction to IR
Motivation
 IR: representation, storage, organization of, and access to unstructured data
 Focus is on the user information need
 User information need, e.g.:
  When did the Buffalo Bills last win the Super Bowl?
  Find all docs containing information on cricket players who are: (i) temperamental, (ii) popular in their countries, and (iii) play in international test series.
 Emphasis is on the retrieval of information (not data)
Motivation
 Data retrieval
  which docs contain a set of keywords?
  well-defined semantics
  a single erroneous object implies failure!
 Information retrieval
  information about a subject or topic
  deals with unstructured text
  semantics is frequently loose
  small errors are tolerated
 IR system:
  interprets the contents of information items
  generates a ranking which reflects relevance
  notion of relevance is most important
Basic Concepts
 The User Task: retrieval from a database vs. browsing
 Retrieval
  information or data
  purposeful
  "needle in a haystack" problem
 Browsing
  glancing around
  e.g., Formula 1 racing: cars, Le Mans, France, tourism
 Filtering (push rather than pull)
Query
 Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?
 Could grep all of Shakespeare’s plays for Brutus and Caesar, then strip out lines containing Calpurnia?
  Slow (for large corpora)
  NOT Calpurnia is non-trivial
  Other operations (e.g., find the phrase Romans and countrymen) not feasible
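
A minimal Python sketch of that naive scan (the plays mapping and function name are illustrative assumptions, not from the slides):

# Naive "grep"-style scan: re-reads every play's full text per query.
def naive_query(plays, must, must_not):
    # plays: {title: full_text}. Returns titles that contain every
    # word in `must` and none of the words in `must_not`.
    hits = []
    for title, text in plays.items():
        if all(w in text for w in must) and not any(w in text for w in must_not):
            hits.append(title)
    return hits

# e.g., naive_query(plays, ["Brutus", "Caesar"], ["Calpurnia"])
# Every query costs O(total text size), which is what makes grep slow here.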
Term-document incidence
             Antony and  Julius  The      Hamlet  Othello  Macbeth
             Cleopatra   Caesar  Tempest
  Antony     1           1       0        0       0        1
  Brutus     1           1       0        1       0        0
  Caesar     1           1       0        1       1        1
  Calpurnia  0           1       0        0       0        0
  Cleopatra  1           0       0        0       0        0
  mercy      1           0       1        1       1        1
  worser     1           0       1        1       1        0

  (1 if play contains word, 0 otherwise)
Incidence vectors
 So we have a 0/1 vector for each term.
 To answer the query: take the vectors for Brutus, Caesar, and Calpurnia (complemented), then bitwise AND them.
 110100 AND 110111 AND 101111 = 100100.
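
A minimal Python sketch of answering the query with incidence vectors (the data mirrors the table above; names are illustrative):

# Boolean retrieval over term-document incidence vectors.
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]
incidence = {
    "Brutus":    [1, 1, 0, 1, 0, 0],   # 110100
    "Caesar":    [1, 1, 0, 1, 1, 1],   # 110111
    "Calpurnia": [0, 1, 0, 0, 0, 0],   # complemented: 101111
}

def answer(and_terms, not_terms):
    # Start from all ones, AND in each term's vector; NOT terms are
    # complemented bit by bit before ANDing.
    result = [1] * len(plays)
    for t in and_terms:
        result = [a & b for a, b in zip(result, incidence[t])]
    for t in not_terms:
        result = [a & (1 - b) for a, b in zip(result, incidence[t])]
    return [p for p, bit in zip(plays, result) if bit]

print(answer(["Brutus", "Caesar"], ["Calpurnia"]))
# -> ['Antony and Cleopatra', 'Hamlet']   (the 100100 above)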
Answers to query
 Antony and Cleopatra, Act III, Scene ii
  Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
  When Antony found Julius Caesar dead,
  He cried almost to roaring; and he wept
  When at Philippi he found Brutus slain.
 Hamlet, Act III, Scene ii
  Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Bigger document collections
 Consider N = 1 million documents, each with about 1K terms.
 Avg 6 bytes/term incl. spaces/punctuation
  → 6GB of data in the documents.
 Say there are M = 500K distinct terms among these.
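
A quick back-of-the-envelope check of that 6GB figure in Python (variable names are just for illustration):

N_docs = 1_000_000       # N = 1 million documents
terms_per_doc = 1_000    # ~1K terms each
bytes_per_term = 6       # avg, incl. spaces/punctuation
print(N_docs * terms_per_doc * bytes_per_term / 1e9, "GB")  # -> 6.0 GB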
Can’t build the matrix
 500K x 1M matrix has half a trillion 0’s and 1’s.
 But it has no more than one billion 1’s. Why? 1M docs with ~1K terms each give at most 10^9 term occurrences.
  matrix is extremely sparse: >99% zeros
 What’s a better representation?
  We only record the 1 positions.
 Inverted Index
Ad-Hoc Retrieval
 Most standard IR task
 System provides documents from the collection that are relevant to an arbitrary user information need
  Information need: topic that the user wants to know about
  Query: user’s abstraction of the information need
  Relevance: a document is relevant if the user perceives it as valuable w.r.t. his information need
Issues to be Addressed by IR
 How to improve quality of retrieval (see the sketch after this list)
  Precision: what fraction of the returned results are relevant to the information need?
  Recall: what fraction of the relevant documents in the collection are returned by the system?
  Understanding user information need
 Faster indexes and smaller query response times
 Better understanding of user behaviour
  interactive retrieval
  visualization techniques
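
A minimal sketch of both measures in Python (the doc IDs and set names are illustrative assumptions):

# Precision and recall for a single query, on toy sets.
retrieved = {"d1", "d2", "d3", "d4"}    # what the system returned
relevant  = {"d2", "d4", "d7"}          # what the user deems relevant

hits = retrieved & relevant             # relevant results actually returned
precision = len(hits) / len(retrieved)  # 2/4 = 0.50
recall    = len(hits) / len(relevant)   # 2/3 ≈ 0.67
print(f"precision={precision:.2f}, recall={recall:.2f}")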
Inverted index
 For each term T: store a list of all documents that contain T.
 Do we use an array or a list for this?

   Brutus    -> 2 4 8 16 32 64 128
   Calpurnia -> 1 2 3 5 8 13 21 34
   Caesar    -> 13 16

 What happens if the word Caesar is added to document 14?
Inverted index
 Linked lists generally preferred to arrays
  Dynamic space allocation
  Insertion of terms into documents easy
  Space overhead of pointers

   Dictionary      Postings
   Brutus    -> 2 -> 4 -> 8 -> 16 -> 32 -> 64 -> 128
   Calpurnia -> 1 -> 2 -> 3 -> 5 -> 8 -> 13 -> 21 -> 34
   Caesar    -> 13 -> 16

 Sorted by docID (more later on why).
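
A minimal Python sketch of this structure (plain sorted lists stand in for linked lists; data from the slide):

# Dictionary of terms -> postings lists of docIDs, kept sorted.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

def intersect(p1, p2):
    # Merge walk over two sorted lists: O(len(p1) + len(p2)).
    # This is one reason postings are kept sorted by docID.
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

print(intersect(index["Brutus"], index["Calpurnia"]))  # -> [2, 8]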
Inverted index construction
  Documents to be indexed:  Friends, Romans, countrymen.
        | Tokenizer
  Token stream:             Friends  Romans  Countrymen
        | Linguistic modules (more on these later)
  Modified tokens:          friend  roman  countryman
        | Indexer
  Inverted index:           friend     -> 2 -> 4
                            roman      -> 1 -> 2
                            countryman -> 13 -> 16
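
A minimal end-to-end sketch of this pipeline in Python (the regex tokenizer and the lowercase-only "linguistic module" are simplifying assumptions; real systems also stem, e.g. countrymen -> countryman):

import re
from collections import defaultdict

def tokenize(text):
    # Stand-in tokenizer: runs of letters become tokens.
    return re.findall(r"[A-Za-z]+", text)

def normalize(token):
    # Toy linguistic module: lowercasing only.
    return token.lower()

def build_index(docs):
    # docs: {docID: text}. Returns term -> sorted list of docIDs.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            postings[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

print(build_index({1: "Friends, Romans, countrymen.",
                   2: "Romans and friends."}))
# -> {'friends': [1, 2], 'romans': [1, 2], 'countrymen': [1], 'and': [2]}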
Basic Concepts
 Logical view of the documents
  documents represented by a set of index terms or keywords
  [Diagram: text operations pipeline: docs → accents/spacing → stopwords → noun groups → stemming → automatic or manual indexing → index terms; structure recognition yields document structure alongside the full text]
 Document representation viewed as a continuum: logical view of docs might shift (from full text down to a small set of index terms)
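
A toy sketch of two of these text operations in Python (the stopword list and suffix rules are illustrative assumptions; real stemmers, e.g. Porter's, are far more careful):

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def crude_stem(word):
    # Illustrative suffix stripping only.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = ["the", "racing", "cars", "of", "france"]
print([crude_stem(t) for t in tokens if t not in STOPWORDS])
# -> ['rac', 'car', 'france']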
The Retrieval Process
 [Diagram: the user need enters via the User Interface and passes through Text Operations to a logical view. Indexing side: the Indexer builds an inverted file (the Index) from the Document Collection. Retrieval side: Query Operations form the query, Searching runs it against the index, Ranking orders the retrieved docs, and user feedback flows back into Query Operations.]
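
A minimal sketch of the retrieval side in Python (the toy index and the match-count ranking are assumptions for illustration, not the slides' method):

import re
from collections import Counter

def search(query, index):
    # Query operations: tokenize + lowercase the query terms.
    terms = sorted({t.lower() for t in re.findall(r"[A-Za-z]+", query)})
    # Searching + a toy Ranking: score docs by distinct terms matched.
    scores = Counter()
    for term in terms:
        for doc_id in index.get(term, []):
            scores[doc_id] += 1
    return [doc for doc, _ in scores.most_common()]   # ranked docs

index = {"brutus": [2, 4], "caesar": [1, 2, 13]}      # toy inverted file
print(search("Brutus Caesar", index))                 # doc 2 ranks first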
Applications of IR
 Specialized Domains
  biomedical, legal, patents, intelligence
 Summarization
 Cross-lingual Retrieval, Information Access
 Question-Answering Systems
  Ask Jeeves
 Web/Text Mining
  data mining on unstructured text
 Multimedia IR
  images, document images, speech, music
 Web applications
  shopbots
  personal assistant agents
IR Techniques
 Machine learning
  clustering, SVM, latent semantic indexing, etc.
  improving relevance feedback, query processing, etc.
 Natural Language Processing, Computational Linguistics
  better indexing, query processing
  incorporating domain knowledge: e.g., synonym dictionaries
  use of NLP in IR: benefits yet to be shown for large-scale IR
  Information Extraction
   highly focused natural language processing (NLP)
   named entity tagging, relationship/event detection
 Text indexing and compression
 User interfaces and visualization
 AI
  advanced QA systems, inference, etc.