Conventional Text-Retrieval Systems

Modern Information Retrieval
Chapter 1: Introduction
Ricardo Baeza-Yates
Berthier Ribeiro-Neto
1
Motivation

Example of a user information need



Topic: NCAA college tennis team
Description: Find all the pages (documents) containing information on
college tennis teams which (1) are maintained by a university in the USA
and (2) participate in the NCAA tennis tournament.
Narrative: To be relevant, the page must include information on the
national ranking of the team in the last three years and the email or phone
number of the team coach.
2
IR Research

Information retrieval vs Data retrieval

Research





information search
information filtering (routing)
document classification and categorization
user interfaces and data visualization
cross-language retrieval
3
IR History

1970

1990, WWW
4
The User Task

Retrieval (Searching)


classic information search process where clear
objectives are defined
Browsing

a process where one’s main objectives are not clearly
defined and might change during the interaction with
the system
5
Logical View of the Documents

Text Operations



reduce the complexity of the document representation
a full text → a set of index terms
Steps
1. Stopword removal
2. Stemming
3. Noun groups
4. ...
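The first two steps above can be sketched as follows. This is a minimal illustration, not the method of any particular system: the stopword list, the suffix list, and the names `toy_stem` and `index_terms` are assumptions made for the example, and the stemmer is a naive suffix stripper rather than a real stemming algorithm. Noun-group detection is omitted.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}

# Suffixes removed by the toy stemmer, tried in this order.
SUFFIXES = ("ing", "es", "s")


def toy_stem(word: str) -> str:
    """Strip at most one suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text: str) -> list[str]:
    """Reduce a full text to a sorted list of distinct index terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOPWORDS]
    return sorted({toy_stem(t) for t in kept})


print(index_terms("The teams are participating in the college tournaments"))
```

The result is a much smaller representation than the full text, which is the point of the text operations.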
6
Past, Present, and Future

Early Development


Library


Index
Author name, title, subject headings, keywords
The Web and Digital Libraries

Hyperlinks
7
Conventional Text-Retrieval
Systems
Automatic Text Processing
G. Salton, Addison-Wesley, 1989.
(Chapter 9)
8
Data Retrieval

A specified set of attributes is used to characterize each
record.
EMPLOYEE(NAME, SSN, BDATE, ADDR, SEX, SALARY, DNO)

Exact match between the attributes used in
query formulations and those attached to the document.
SELECT BDATE, ADDR
FROM EMPLOYEE
WHERE NAME = 'John Smith'
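To make the exact-match behaviour concrete, the query above can be run against an in-memory SQLite database. This is only a sketch: the column types and the two employee rows are hypothetical, invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE EMPLOYEE "
    "(NAME TEXT, SSN TEXT, BDATE TEXT, ADDR TEXT, SEX TEXT, SALARY REAL, DNO INTEGER)"
)
# Hypothetical rows, invented only for this illustration.
conn.executemany(
    "INSERT INTO EMPLOYEE VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("John Smith", "000-00-0001", "1965-01-09", "1 Example St", "M", 30000.0, 5),
        ("Jon Smith", "000-00-0002", "1970-03-12", "2 Example St", "M", 28000.0, 4),
    ],
)

# Exact match: only the record whose NAME equals the query string is returned.
rows = conn.execute(
    "SELECT BDATE, ADDR FROM EMPLOYEE WHERE NAME = ?", ("John Smith",)
).fetchall()
print(rows)
```

The near-miss "Jon Smith" is not returned, which is the contrast with text retrieval, where degrees of match are allowed.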
9
Text-Retrieval Systems


Content identifiers (keywords, index terms, descriptors)
characterize the stored texts.
Degrees of coincidence between the sets of identifiers
attached to queries and documents
query formulation
content analysis
10
Possible Representation

Document representation (Text operation)
unweighted index terms (term vectors)
weighted index terms
…
Query (Query operation)
unweighted or weighted index terms
Boolean combinations (or, and, not)
…
Search operation must be effective (Indexing)
11
File Structures

Main requirements



fast-access for various kinds of searches
large number of indices
Alternatives



Inverted Files
Signature Files
PAT trees
12
Inverted Files

File is represented as an array of indexed documents.
        Term 1  Term 2  Term 3  Term 4
Doc 1   1       1       0       1
Doc 2   0       1       1       1
Doc 3   1       0       1       1
Doc 4   0       0       1       1
13
Inverted-file process

The document-term array is inverted (transposed).
        Doc 1   Doc 2   Doc 3   Doc 4
Term 1  1       0       1       0
Term 2  1       1       0       0
Term 3  0       1       1       1
Term 4  1       1       1       1
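The inversion of the example array can be sketched in a few lines. The variable names are assumptions for the illustration. The `postings` dictionary shows one compact way to store the result, listing per term only the documents that contain it.

```python
# Document-term array from the example: rows are Doc 1..4, columns are Term 1..4.
doc_term = [
    [1, 1, 0, 1],  # Doc 1
    [0, 1, 1, 1],  # Doc 2
    [1, 0, 1, 1],  # Doc 3
    [0, 0, 1, 1],  # Doc 4
]

# Inversion is a transpose: rows become terms, columns become documents.
term_doc = [list(column) for column in zip(*doc_term)]

# A compact inverted file keeps, per term, only the documents containing it.
postings = {
    f"Term {t + 1}": [f"Doc {d + 1}" for d, bit in enumerate(row) if bit]
    for t, row in enumerate(term_doc)
}

for term, docs in postings.items():
    print(term, docs)
```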
14
Inverted-file process
(Continued)

Take two or more rows of an inverted term-document
array, and produce a single combined list of document
identifiers.

Ex: Query = (term2 and term3)
                   Doc 1  Doc 2  Doc 3  Doc 4
term2              1      1      0      0
term3              0      1      1      1
---------------------------------------------
term2 and term3    0      1      0      0      <-- D2
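The row combination for this query can be sketched as a position-by-position AND. The function name `and_rows` is an assumption for the illustration.

```python
# Rows of the inverted array for the two query terms (columns are Doc 1..4).
term2 = [1, 1, 0, 0]
term3 = [0, 1, 1, 1]


def and_rows(*rows: list[int]) -> list[int]:
    """Combine rows position by position; a document survives only if every row has 1."""
    return [int(all(bits)) for bits in zip(*rows)]


combined = and_rows(term2, term3)
matches = [f"D{i + 1}" for i, bit in enumerate(combined) if bit]
print(combined, matches)
```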
15
List-merging for two ordered lists


The inverted-index operations to obtain answers
are based on a list-merging process.
Example
T1: {D1, D3}
T2: {D1, D2}
Merged(T1, T2): {D1, D1, D2, D3}
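A minimal sketch of the merge for the example lists follows, with documents written as integers. It assumes each input list is sorted and free of duplicates. Under that assumption, an identifier that appears twice in the merged list occurs in both inputs (T1 and T2), while the distinct identifiers give (T1 or T2).

```python
def merge(a: list[int], b: list[int]) -> list[int]:
    """Merge two ascending lists of document numbers, keeping duplicates."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])  # whatever is left of either list is already ordered
    merged.extend(b[j:])
    return merged


t1 = [1, 3]  # T1: {D1, D3}
t2 = [1, 2]  # T2: {D1, D2}
merged = merge(t1, t2)

# Neighbouring duplicates mark documents present in both lists (AND).
both = [x for x, y in zip(merged, merged[1:]) if x == y]
# Distinct identifiers are documents present in either list (OR).
either = sorted(set(merged))
print(merged, both, either)
```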
16
Extensions of Inverted Index Operations
(Distance Constraints)

Distance Constraints


(A within sentence B): terms A and B must co-occur in a common sentence
(A adjacent B): terms A and B must occur adjacently in the text
17
Extensions of Inverted Index Operations
(Distance Constraints)

Implementation


include term-location in the inverted indexes
information: {P345, P348, P350, …}
retrieval: {P123, P128, P345, …}
include sentence-location in the indexes
information: {P345, 25; P345, 37; P348, 10; P350, 8; …}
retrieval: {P123, 5; P128, 25; P345, 37; P345, 40; …}
18
Extensions of Inverted Index Operations
(Distance Constraints)



Include paragraph numbers in the indexes
sentence numbers within paragraphs
word numbers within sentences
information: {P345, 2, 3, 5; …}
retrieval: {P345, 2, 3, 6; …}
Query examples
(information adjacent retrieval)
(information within five words retrieval)
Cost: the size of indexes
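The two query examples can be sketched over index entries of the form shown above, read here as (document, paragraph, sentence, word). That reading, the function names, and the choice to require a common sentence for the "within n words" test are assumptions made for the illustration.

```python
# Index entries as (document, paragraph, sentence, word) tuples.
index = {
    "information": [("P345", 2, 3, 5)],
    "retrieval": [("P345", 2, 3, 6)],
}


def within_words(index: dict, a: str, b: str, max_distance: int) -> list[str]:
    """Documents where a and b share a sentence, at most max_distance words apart."""
    hits = set()
    for doc_a, par_a, sent_a, word_a in index.get(a, []):
        for doc_b, par_b, sent_b, word_b in index.get(b, []):
            same_sentence = (doc_a, par_a, sent_a) == (doc_b, par_b, sent_b)
            if same_sentence and 0 < abs(word_a - word_b) <= max_distance:
                hits.add(doc_a)
    return sorted(hits)


def adjacent(index: dict, a: str, b: str) -> list[str]:
    """Adjacency is the special case of a one-word distance."""
    return within_words(index, a, b, 1)


print(adjacent(index, "information", "retrieval"))
print(within_words(index, "information", "retrieval", 5))
```

Every extra location field stored per occurrence makes such queries possible, and also makes the index larger, which is the cost noted above.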
19
Retrieval models
Classic Models: Boolean, Vector, Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector, Latent Semantic Index, Neural Networks
Probabilistic: Inference Network, Belief Network
20
Classic IR Model

Basic concepts: Each document is described
by a set of representative keywords called
index terms.

Assign numerical weights to index terms to
reflect their differing relevance in describing a document.
21
Boolean model



Binary decision criterion
Data retrieval model
Advantage


clean formalism, simplicity
Disadvantage


It is not simple to translate an information need into a
Boolean expression.
exact matching may lead to retrieval of too few or too
many documents
22
Vector model

Assign non-binary weights to index terms in
queries and in documents. => TFxIDF

Compute the similarity between documents
and query. => Sim(Dj, Q)

More precise than the Boolean model: partial matches are ranked instead of rejected.
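A minimal sketch of the two steps, weighting and similarity, follows. It assumes one common variant of the scheme (raw term frequency times ln(N/df)) and cosine similarity for Sim(Dj, Q). The three tiny documents, the query, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict[str, list[str]]):
    """Return per-document TF x IDF vectors and the IDF table."""
    n = len(docs)
    df = Counter(term for terms in docs.values() for term in set(terms))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = {
        name: {term: tf * idf[term] for term, tf in Counter(terms).items()}
        for name, terms in docs.items()
    }
    return vectors, idf


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical documents, already reduced to index terms.
docs = {
    "D1": ["college", "tennis", "team"],
    "D2": ["tennis", "tournament", "ranking"],
    "D3": ["college", "admission"],
}
vectors, idf = tfidf_vectors(docs)
query = {t: idf.get(t, 0.0) for t in ["tennis", "team"]}
ranking = sorted(docs, key=lambda d: cosine(vectors[d], query), reverse=True)
print(ranking)
```

D2 shares only one query term and still receives a non-zero score, so it is ranked below D1 rather than discarded.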
23
Term Weights


Term Weights
Di={Ti1, 0.2; Ti2, 0.5; Ti3, 0.6}
Issues


How to generate the term weights?
How to apply the term weights?
• Sum the weights of all document terms that match
the given query.
• Rank the output documents in descending order
of summed weight.
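The two bullets can be sketched as follows. The weights for D1 follow the example vector above. D2, the query, and the function name `score` are assumptions for the illustration.

```python
def score(doc_weights: dict[str, float], query_terms: set[str]) -> float:
    """Sum the weights of the document terms that also occur in the query."""
    return sum(w for term, w in doc_weights.items() if term in query_terms)


docs = {
    "D1": {"T1": 0.2, "T2": 0.5, "T3": 0.6},
    "D2": {"T1": 0.7, "T2": 0.2, "T3": 0.1},  # assumed second document
}
query_terms = {"T1", "T2"}

# Rank documents in descending order of summed weight.
ranked = sorted(docs, key=lambda d: score(docs[d], query_terms), reverse=True)
print(ranked)
```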
24
Boolean Query with Term Weights



Transform a Boolean expression into disjunctive
normal form.
T1 and (T2 or T3) = (T1 and T2) or (T1 and T3)
For each conjunct, compute the minimum term
weight of any document term in that conjunct.
The document weight is the maximum of all the
conjunct weights.
25
Boolean Query with Term Weights

Example: Q=(T1 and T2) or T3
Document Vectors             Conjunct Weights       Query Weight
                             (T1 and T2)   (T3)     (T1 and T2) or T3
D1=(T1,0.2;T2,0.5;T3,0.6)    0.2           0.6      0.6
D2=(T1,0.7;T2,0.2;T3,0.1)    0.2           0.1      0.2
D1 is preferred.
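The min/max rule can be checked against this example with a short sketch. The query is given already in disjunctive normal form as a list of conjuncts. The function name `dnf_weight` and the choice to treat a missing term as weight 0.0 are assumptions for the illustration.

```python
def dnf_weight(doc_weights: dict[str, float], dnf_query: list[list[str]]) -> float:
    """Document weight for a DNF query: max over conjuncts of the min term weight."""
    conjunct_weights = [
        min(doc_weights.get(term, 0.0) for term in conjunct)  # absent term counts as 0.0
        for conjunct in dnf_query
    ]
    return max(conjunct_weights)


# Q = (T1 and T2) or T3
query = [["T1", "T2"], ["T3"]]
d1 = {"T1": 0.2, "T2": 0.5, "T3": 0.6}
d2 = {"T1": 0.7, "T2": 0.2, "T3": 0.1}

print(dnf_weight(d1, query), dnf_weight(d2, query))  # prints: 0.6 0.2
```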
26
Summary






Conventional IR systems
Evaluation
Text operations (Term selection)
Query operations (Pattern matching, Relevance feedback)
Indexing (File structure)
Modeling
27
Resources

Journals






Journal of the American Society for Information Science
ACM Transactions on Information Systems
Information Processing and Management
Information Systems (Elsevier)
Knowledge and Information Systems (Springer)
Conferences


ACM SIGIR, DL, CIKM, CHI, etc.
Text Retrieval Conference (TREC)
28