CS4485: Information Retrieval
 Who I am:






Dr. Lusheng WANG
Dept. of Computer Science
office: Y6429
phone: 2788 9820
e-mail: lwang@cs.cityu.edu.hk
web site: http://www.cs.cityu.edu.hk/~lwang/
Text Book:
• R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, Addison Wesley, 1999.
• Additional material will be provided in the handouts.
References:
• W.B. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
• I.H. Witten, A. Moffat, and T.C. Bell, Managing Gigabytes: Compressing and Indexing Documents and Images, Van Nostrand Reinhold, New York, 1994.
• Michael Lesk, Practical Digital Libraries: Books, Bytes, and Bucks, Morgan Kaufmann, 1997.
Information Retrieval
 User task:
  • Translate the information need into a query in some language
  • Provide some words (keywords)
 Information Retrieval vs. Browsing
  • Information retrieval: finding useful information.
  • Browsing: the objectives are not clearly defined and may change during the browsing process.
 Most systems combine the two types.
Logical view of the documents
 Classic view: a set of index terms or keywords
 Full-text logical view: keep the full text (feasible with computers)
  Still needs some special treatment (Chapter 7); see the sketch below:
   • Elimination of stopwords (common words that appear in nearly all documents and carry little meaning)
   • Use of stemming (reduces distinct words to their common grammatical root)
   • Identification of noun groups (eliminates adjectives, adverbs, and verbs)
   • Compression techniques
 Structure can also be used: structured text retrieval models (chapters, sections, subsections)
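
To make the text operations above concrete, here is a toy Python sketch of stopword elimination and a deliberately naive suffix-stripping stemmer; the stopword list and stemming rule are illustrative placeholders, not the techniques of Chapter 7.

    # Toy stopword elimination and naive stemming.
    STOPWORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

    def naive_stem(word):
        # Strip a few common suffixes; a real stemmer (e.g. Porter's) is far more careful.
        for suffix in ("ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    text = "The indexing of documents and the retrieval of documents"
    terms = [naive_stem(w) for w in text.lower().split() if w not in STOPWORDS]
    print(terms)   # ['index', 'document', 'retrieval', 'document']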
What we will cover
(Syllabus: http://www.cs.cityu.edu.hk//content/courses/index.html)
 Retrieval models for text (documents)
 Retrieval models for hypertext (searching the web)
 Retrieval evaluation
 Query languages
 Query operations
 Text operations
  • Chinese language text operations
 Indexing and searching (algorithmic issues)
 Brief introduction to multimedia IR
Evaluation
 50% coursework
 50% examination
Coursework:
 One assignment: 20%
 A midterm examination: 20%
 A project (done in pairs): 60%
Definitions
 A database is a collection of documents.
 A document is a sequence of terms,
expressing ideas about some topic in a natural
language.
 A term is a semantic unit: a word, a phrase, or potentially the root of a word.
 A query is a request for documents pertaining to
some topic.
Definitions (Cont.)
 An Information Retrieval (IR) System
attempts to find relevant documents to
respond to a user’s request.
 The real problem boils down to matching
the language of the query to the language of
the document.
Hard Parts of IR
 Simply matching on words is a very brittle
approach.
 One word can have a zillion different semantic meanings.
  Consider: “take”
   • “take a place at the table”
   • “take money to the bank”
   • “take a picture”
   • “take a lot of time”
   • “take drugs”
More problems with IR
 You often can’t even tell what part of speech a word has:
  • “I saw her duck.”
  • A query searching for “pictures of a duck” will also find documents that contain “I saw her duck away from the ball falling from the sky.”
More Problems with IR
 Proper nouns are often made up of regular old nouns.
 Consider a document containing “a man named Abraham owned a Lincoln”.
 A word-matching query for “Abraham Lincoln” may well find the above document.
What is Different about IR from
the rest of Computer Science
 Most algorithms in computer science have a
“right” answer:
 Consider the two problems:
  • Sort the following ten integers
  • Find the highest integer
 Now consider:
  • Find the document most relevant to “hippos in the zoo”
Measuring Effectiveness
 An algorithm is deemed incorrect if it does not produce the “right” answer.
 A heuristic tries to guess something close to
the right answer. Heuristics are measured on
“how close” they come to a right answer.
 IR techniques are essentially heuristics because
we do not know the right answer.
 So we have to measure how close to the right
answer we can come.
Precision / Recall Example
 Consider a query that retrieves 10 documents.
 Let’s say the result set is:
  D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
 If all ten were relevant, we would have 100 percent precision. If these were the only ten relevant documents in the whole collection, we would also have 100 percent recall.
Example (continued)
 Now let’s say that only documents two and five are relevant.
 Consider the same results:
  D1, D2, D3, D4, D5, D6, D7, D8, D9, D10
 Since we have retrieved ten documents and got two of them right, precision is 20 percent. Recall is 2 divided by the total number of relevant documents in the entire collection.
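
To make the arithmetic concrete, here is a minimal Python sketch (the document IDs and relevance judgements come from the example above; the function name is my own):

    def precision_recall(retrieved, relevant):
        # Precision = |retrieved ∩ relevant| / |retrieved|
        # Recall    = |retrieved ∩ relevant| / |relevant|
        hits = len(set(retrieved) & set(relevant))
        return hits / len(retrieved), hits / len(relevant)

    retrieved = ["D1", "D2", "D3", "D4", "D5", "D6", "D7", "D8", "D9", "D10"]
    relevant = {"D2", "D5"}   # all relevant documents in the collection (assumed here to be just these two)
    p, r = precision_recall(retrieved, relevant)
    print(p, r)   # 0.2 (20 percent precision); recall is 1.0 only because no other relevant documents exist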
Levels of Recall
 If we keep retrieving documents, we will
ultimately retrieve all documents and achieve
100 percent recall.
 That means we can keep retrieving documents until we reach any desired level of recall (x% recall).
Levels of Recall (example)
 Retrieve the top 2000 documents. Let’s say there are five relevant documents in total.
  Documents retrieved   DocID   Recall   Precision
  100                   A       0.20     0.01
  200                   B       0.40     0.01
  500                   C       0.60     0.006
  1000                  D       0.80     0.004
  1500                  E       1.0      0.003
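
A short Python sketch that reproduces the recall/precision figures in this table (five relevant documents, found at the ranks shown; the variable names are my own):

    # Recall and precision at the rank where each relevant document (A-E) appears.
    total_relevant = 5
    ranks = {"A": 100, "B": 200, "C": 500, "D": 1000, "E": 1500}
    for found, (doc, rank) in enumerate(sorted(ranks.items(), key=lambda x: x[1]), start=1):
        recall = found / total_relevant
        precision = found / rank
        print(doc, rank, round(recall, 2), round(precision, 3))
    # A 100 0.2 0.01, B 200 0.4 0.01, C 500 0.6 0.006, D 1000 0.8 0.004, E 1500 1.0 0.003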
How to evaluate the quality of the retrieval system
Let R be the set of all relevant documents.
A: the set of all documents reported as relevant by the system.
Ra = A ∩ R: the set of relevant documents that are reported.
Recall = |Ra|/|R|.
Recall = 10%: 10% of the relevant documents in R are found.
Precision = |Ra|/|A|.
 Precision = 90%: 90% of the reported documents are relevant (the remaining 10% are not).
1. Precision = 100% does not mean the system finds ALL relevant documents.
2. Recall = 100% does not mean all reported documents are relevant.
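
These definitions translate directly into set operations; a tiny Python illustration with hypothetical document IDs:

    # R: all relevant documents in the collection; A: documents reported by the system.
    R = {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"}
    A = {"d1", "d2", "d11"}
    Ra = A & R                        # relevant documents that were reported
    recall = len(Ra) / len(R)         # 2/10 = 0.2
    precision = len(Ra) / len(A)      # 2/3  ≈ 0.67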
Evaluating IR
 Recall is the fraction of relevant documents
retrieved from the set of total relevant
documents collection-wide.
 Precision is the fraction of relevant documents
retrieved from the total number retrieved.
 An IR system ranks documents by a similarity coefficient (SC), allowing the user to trade off between precision and recall.
Precision/Recall Tradeoff
[Figure: precision (vertical axis, up to 100%) plotted against recall (horizontal axis, up to 100%); as more documents are retrieved (top 10, top 100, top 1000), recall increases while precision falls.]
Strategy vs Utility
 An IR strategy is a technique by which a relevance
assessment is obtained between a query and a
document.
 An IR utility is a technique that may be used to
improve the assessment given by a strategy. A
utility may plug into any strategy.
2016/5/29
CS4485 Information Retrieval /WANG
Lusheng
Page 21
Strategies
 Manual
  • Boolean
 Automatic
  • Probabilistic
  • Inference Networks
  • Vector Space Model
  • Latent Semantic Indexing (LSI)
 Adaptive Models
  • Genetic Algorithms
  • Neural Networks
Retrieval: Ad hoc and Filtering
 Ad hoc retrieval: the documents in the collection remain relatively static while new queries are submitted to the system (e.g., a library).
 Filtering: the queries remain relatively static while new documents enter and leave the system (e.g., stock market news).
A formal Characterization of IR models
 Definition
An information retrieval model is a quadruple
[D,Q,F,R(qi,dj)] where
(continued)
 (1) D is a set composed of logical views (or representations) for
the documents in the collection.
 (2) Q is a set composed of logical views (or representations) for
the user information needs. Such representations are called
queries.
 (3) F is a framework for modeling document representations,
queries, and their relationships.
 (4) R(qi, dj) is a ranking function which associates a real number with a query qi ∈ Q and a document representation dj ∈ D. Such a ranking defines an ordering among the documents with regard to the query qi.
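
Purely as an illustration, the quadruple can be read as an interface; the class and attribute names below are hypothetical, not taken from the text:

    from typing import Protocol, Sequence

    class IRModel(Protocol):
        # D: logical views (representations) of the documents in the collection.
        documents: Sequence[object]

        # R(q_i, d_j): a real number used to order documents for a query;
        # F, the modelling framework, is whatever the concrete class uses internally.
        def rank(self, query: object, doc: object) -> float: ...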
Index terms
 A document is represented by a set of keywords, called
index terms.
 How to select keywords is an important issue and will
be discussed in Chapter 7.
 Some terms are more important than others; e.g., a term that appears in only five documents is more informative than a term that appears in most of the documents.
 The word “the” is not useful, while the word “cityU” is important for retrieving information related to our university.
Boolean Model:
1. Each document dj is represented by a vector dj = (w1,j, w2,j, …, wn,j), where wi,j = 0 if term ki does not appear in dj and wi,j = 1 if term ki is in dj.
2. A query is a Boolean expression represented in disjunctive normal form (DNF), e.g.,
    (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
An example of Boolean retrieval model
 Documents:
  • d1 = (1, 0, 1, 1, 1, 1, 1, 1), d2 = (0, 1, 0, 0, 1, 1, 1, 1)
  • d3 = (0, 0, 0, 1, 1, 1, 1, 1), d4 = (1, 1, 0, 0, 1, 1, 0, 0)
 Query: (1, 1, 1, 1, 1, 1, 1, 1) ∨ (1, 1, 0, 0, 1, 1, 0, 0)
 Result: only d4 is selected.
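
A minimal Python sketch of this matching rule, assuming (as the example suggests) that a document is selected when its binary vector equals one of the conjunctive components of the DNF query; the variable names are my own:

    # Boolean retrieval: select documents whose term vector matches a DNF component.
    docs = {
        "d1": (1, 0, 1, 1, 1, 1, 1, 1),
        "d2": (0, 1, 0, 0, 1, 1, 1, 1),
        "d3": (0, 0, 0, 1, 1, 1, 1, 1),
        "d4": (1, 1, 0, 0, 1, 1, 0, 0),
    }
    query_dnf = [(1, 1, 1, 1, 1, 1, 1, 1), (1, 1, 0, 0, 1, 1, 0, 0)]
    selected = [name for name, vec in docs.items() if vec in query_dnf]
    print(selected)   # ['d4']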
Representation of documents: Boolean model
 d1: Computer science department, computer study, computer algorithms
 d2: computer study, programming skills
 d3: department stores, notebook
 Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
 d1 = (1,1,1,0,1,1,0,0,0); d2 = (1,0,1,0,0,0,1,1,0); d3 = (0,0,0,1,1,0,0,0,1)
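
A small Python sketch that builds these binary vectors from the keyword list; the term lists below are the documents' words already reduced to the keyword forms (e.g. “stores” mapped to “store”), which is my own simplification:

    keywords = ["computer", "science", "study", "store", "dept",
                "algorithms", "programming", "skills", "notebook"]
    docs = {
        "d1": ["computer", "science", "dept", "computer", "study", "computer", "algorithms"],
        "d2": ["computer", "study", "programming", "skills"],
        "d3": ["dept", "store", "notebook"],
    }
    for name, terms in docs.items():
        vec = tuple(1 if k in terms else 0 for k in keywords)   # 1 if the keyword occurs at all
        print(name, vec)
    # d1 (1, 1, 1, 0, 1, 1, 0, 0, 0), d2 (1, 0, 1, 0, 0, 0, 1, 1, 0), d3 (0, 0, 0, 1, 1, 0, 0, 0, 1)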
Advantages
 Simple, easy for users to understand
 Precise semantics
 Neat formulation
 Received great attention in the past
Disadvantages
 Binary decision criterion (relevant or non-relevant)
 Hard to express the required information as a Boolean formula
Vector Space Model
1. Each document dj is represented by a vector dj = (w1,j, w2,j, …, wn,j), where wi,j ≥ 0.
2. Each query q is also represented by a vector q = (w1,q, w2,q, …, wn,q).
3. The similarity between the document and the query is defined as the cosine of the angle between the two vectors:

   sim(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{n} w_{i,q}^2}}
Example 1: dj = (2, 3, 1, 0) and q = (2, 3, 1, 0).
  sim(dj, q) = (4+9+1+0) / ((4+9+1+0)^0.5 · (4+9+1+0)^0.5) = 1.
Example 2: dj = (0, 0, 0, 5) and q = (2, 3, 1, 0).
  sim(dj, q) = 0 / ((25)^0.5 · (4+9+1+0)^0.5) = 0.
Example 3: dj = (1, 3, 1, 1) and q = (2, 3, 1, 0).
  sim(dj, q) = (2+9+1+0) / ((12)^0.5 · (14)^0.5) = (0.857)^0.5 ≈ 0.926.
Example 4: dj = (1, 3, 1, 0) and q = (2, 3, 1, 0).
  sim(dj, q) = (2+9+1+0) / ((11)^0.5 · (14)^0.5) ≈ 0.967 > 0.926.
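
The same numbers can be checked with a few lines of Python (cosine similarity as defined on the previous slide; the function name is my own):

    import math

    def sim(d, q):
        # Cosine of the angle between the document vector and the query vector.
        dot = sum(wd * wq for wd, wq in zip(d, q))
        norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
        return dot / norm if norm else 0.0

    q = (2, 3, 1, 0)
    for d in [(2, 3, 1, 0), (0, 0, 0, 5), (1, 3, 1, 1), (1, 3, 1, 0)]:
        print(d, round(sim(d, q), 3))   # 1.0, 0.0, 0.926, 0.967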
Note that since wi,j ≥ 0 and wi,q ≥ 0, sim(dj, q) lies in [0, 1].
 The documents are ranked according to their similarity to the query.
 Even if a document matches the query only partially, it may still be retrieved.
How to determine the weights wi,j on terms?
Definition:
Let N be the total number of documents in the system and ni be the number of documents in which the index term ki appears. Let freqi,j be the raw frequency of term ki in the document dj (i.e. the number of times the term ki is mentioned in the text of the document dj). Then the normalized frequency fi,j of term ki in the document dj is given by

   f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}

where the maximum is computed over all terms which are mentioned in the text of the document dj. If the term ki does not appear in the document dj, then fi,j = 0.
(continued)
 Further, let idfi, the inverse document frequency for ki, be given by

   idf_i = \log \frac{N}{n_i}

 The best-known term-weighting schemes use weights given by

   w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}

 Such term-weighting strategies are called tf-idf schemes.
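
A compact Python sketch of this weighting scheme; a base-10 logarithm is assumed because it reproduces the numbers in the worked examples that follow (e.g. log(3/2) ≈ 0.18):

    import math

    def tf_idf(freq):
        # freq[doc][term] = raw term counts; returns w[doc][term] = f * log10(N / n).
        N = len(freq)
        terms = {t for counts in freq.values() for t in counts}
        n = {t: sum(1 for counts in freq.values() if counts.get(t, 0) > 0) for t in terms}
        weights = {}
        for doc, counts in freq.items():
            max_f = max(counts.values())
            weights[doc] = {t: (c / max_f) * math.log10(N / n[t])
                            for t, c in counts.items() if c > 0}
        return weights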
[Figure: idf_i = ln(1000/n_i) plotted as a function of n_i.]

[Figure: idf_i = log(1000/n_i) plotted as a function of n_i.]
(continued)
 Several variations of the above expression for the weight wi,j are described in an interesting paper by Salton and Buckley which appeared in 1988.
 For the query term weights, Salton and Buckley suggest

   w_{i,q} = \left(0.5 + \frac{0.5\, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}
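
The query-weight formula in Python (again assuming a base-10 logarithm; the function name is my own):

    import math

    def query_weight(freq_iq, max_freq_q, N, n_i):
        # Salton-Buckley weight for query term k_i.
        return (0.5 + 0.5 * freq_iq / max_freq_q) * math.log10(N / n_i)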
Example 1:
 d1: Its term-weighting scheme improves retrieval performance;
 d2: Its partial matching strategy allows retrieval of documents that approximate the query conditions;
 d3: Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
In this example, N = 3. For the term ki = “retrieval”, ni = 2 (it appears in d1 and d2), idfi = log(3/2) = 0.176, freqi,1 = 1, fi,1 = 1, and wi,1 = 0.176.
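
A quick check of this computation (base-10 logarithm assumed, as before):

    import math
    N, n_i = 3, 2                 # "retrieval" occurs in d1 and d2
    freq_i1, max_freq_1 = 1, 1    # every term in d1 appears exactly once
    w_i1 = (freq_i1 / max_freq_1) * math.log10(N / n_i)
    print(round(w_i1, 3))         # 0.176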
Example 2:
 d1: Computer science department, computer study, computer algorithms
 d2: computer study, programming skills
 d3: department stores, notebook
 Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. programming, 8. skills, 9. notebook
 Raw term-frequency vectors: d1 = (3,1,1,0,1,1,0,0,0); d2 = (1,0,1,0,0,0,1,1,0); d3 = (0,0,0,1,1,0,0,0,1)
Raw term frequencies freq_{i,j}:

  freq_{i,j}   k1   k2   k3   k4   k5   k6   k7   k8   k9
  d1            3    1    1    0    1    1    0    0    0
  d2            1    0    1    0    0    0    1    1    0
  d3            0    0    0    1    1    0    0    0    1
Normalized frequencies f_{i,j} = freq_{i,j} / max_l freq_{l,j}:

  f_{i,j}   k1    k2    k3    k4    k5    k6    k7    k8    k9
  d1        1     0.33  0.33  0     0.33  0.33  0     0     0
  d2        1     0     1     0     0     0     1     1     0
  d3        0     0     0     1     1     0     0     0     1
Document frequencies n_i and inverse document frequencies idf_i = log(3/n_i):

  i       1     2     3     4     5     6     7     8     9
  n_i     2     1     2     1     2     1     1     1     1
  idf_i   0.18  0.48  0.18  0.48  0.18  0.48  0.48  0.48  0.48
tf-idf weights w_{i,j} = f_{i,j} × idf_i:

  w_{i,j}   k1    k2    k3    k4    k5    k6    k7    k8    k9
  d1        0.18  0.16  0.06  0     0.06  0.16  0     0     0
  d2        0.18  0     0.18  0     0     0     0.48  0.48  0
  d3        0     0     0     0.48  0.18  0     0     0     0.48
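
The four tables above can be reproduced with a few lines of Python (base-10 logarithm assumed, weights rounded to two decimals):

    import math

    freq = {                                     # raw term frequencies, k1..k9
        "d1": [3, 1, 1, 0, 1, 1, 0, 0, 0],
        "d2": [1, 0, 1, 0, 0, 0, 1, 1, 0],
        "d3": [0, 0, 0, 1, 1, 0, 0, 0, 1],
    }
    N = len(freq)
    n = [sum(1 for counts in freq.values() if counts[i] > 0) for i in range(9)]
    idf = [round(math.log10(N / ni), 2) for ni in n]          # [0.18, 0.48, ...]
    for name, counts in freq.items():
        max_f = max(counts)
        w = [round((c / max_f) * math.log10(N / n[i]), 2) for i, c in enumerate(counts)]
        print(name, w)
    # d1 [0.18, 0.16, 0.06, 0.0, 0.06, 0.16, 0.0, 0.0, 0.0]
    # d2 [0.18, 0.0, 0.18, 0.0, 0.0, 0.0, 0.48, 0.48, 0.0]
    # d3 [0.0, 0.0, 0.0, 0.48, 0.18, 0.0, 0.0, 0.0, 0.48]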
Example 3:
 d1: Computer science dept. Algorithms improve retrieval performance;
 d2: computer study, algorithm, programming skills, query conditions;
 d3: computer stores, notebook, printers
 d4: computer store sales CD’s and software
 Keywords: 1. computer, 2. science, 3. study, 4. store, 5. dept, 6. algorithms, 7. improve, 8. retrieval, 9. performance, 10. programming, 11. skills, 12. query, 13. conditions, 14. notebook, 15. printers, 16. sales, 17. CD’s, 18. software, 19. algorithm
 Question: 19 keywords or 18 keywords? This is a language-processing issue: should “algorithms” (6) and “algorithm” (19) be treated as the same keyword?
  • Every document contains many occurrences of “the”; do we need it as a keyword?
  • “Table” and “desk”: are they the same? Related?
Summary
 Information Retrieval models
  • Boolean model
  • Vector space model
Course Arrangement:
 No lecture or tutorial in week 2.
 A make-up class will be scheduled in week 3.