S534 - Information Retrieval: Theory and Practice

Information Retrieval to Knowledge Retrieval, one more step
Xiaozhong Liu
Assistant Professor
School of Library and Information Science
Indiana University Bloomington
What is Information?
What is Retrieval?
What is Information Retrieval?
I am a Retriever
How to find this book in the library?
Search something based on
User Information Need!!
How to express your information need?
Query
User Information Need!!
What is a good query?
What is a bad query?
Good query: query ≈ information need
Bad query: query ≠ information need
Query
Wait!!! The user NEVER makes mistakes!!!
It's OUR job!!!
Task 1: Given the user's information need, how do we help the user (or automatically help them) propose a better query?
If there is a query…
Perfect query: Query_optimize
User input query: Query_user
User Information Need!!
What are good results?
What are bad results?
Query
Given a query,
How to retrieve results?
Results
Task 2: Given a query (not perfect), how to retrieve documents from the collection?
F(query, doc)
Very large, unstructured text data!!!
Can you give me an example?
F(query, doc):
If the query terms exist in the doc → yes, this is a result
If the query terms do NOT exist in the doc → no, this is not a result
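A minimal sketch of this exact-match function in Python (the whitespace tokenizer and the all-terms-must-match rule are assumptions for illustration):

def tokenize(text):
    # naive lowercase + whitespace tokenizer (assumption)
    return text.lower().split()

def F(query, doc):
    # yes/no: every query term must occur in the document
    doc_terms = set(tokenize(doc))
    return all(term in doc_terms for term in tokenize(query))

print(F("cat dog", "my cat chases the dog"))   # True  -> this is a result
print(F("cat dog", "my cat sleeps all day"))   # False -> not a result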
Is there any problem in this function?
Brainstorm…
Query: Obama’s wife
Doc 1. My wife supports Obama’s new policy on…
Doc 2. Michelle, as the first lady of the United States…
Yes, this is a very challenging task!
Another problem
Collection size: 5 billion
Matching docs: 5
My algorithm successfully finds all 5 docs! In… 3 billion results…
User Information Need!!
How to help the user find what they need among all the retrieved results?
Query
Results
Task 3: Given the retrieved results, how to help users find what they need?
If the retrieval algorithm retrieves 1 billion results from the collection, what will you do???
Search with Google and click "next"???
Yes, we can help users find what they need!
Query: Indiana University Bloomington
Can you read the results one by one?
Or do you just use the ranked result page??
Diagram: User → User Information Need → (1) Query → System → (2) Results → (3) back to the User. The three numbered arrows correspond to Tasks 1–3 above.
They are not independent!
Information Retrieval applies to many media and sources: text, maps, images, music, …; and to the web, scholarly papers, documents, blogs, news, …
All of them need an Index.
Documents vs. Database Records
• Relational database records are typically made up of well-defined fields:
Select * from students where GPA > 2.5
Can we treat text the same way? Find all the docs containing "Xiaozhong":
Select * from documents where text like '%xiaozhong%'
We need a more effective way to index the text!
Collection C: doc1, doc2, doc3 ……… docN
Vocabulary V: w1, w2, w3 ……… wn
Document doc_i: d_i1, d_i2, d_i3 ……… d_im, where all d_ij ∈ V
Query q: q1, q2, q3 ……… qt, where each q_x is a query term
Collection C: doc1, doc2, doc3 ……… docN
Binary term–document matrix (1 = word occurs in the doc):

        w1   w2   w3   ………   wn
Doc1     1    0    0   ………    1
Doc2     0    0    0   ………    1
Doc3     1    1    1   ………    1
………
DocN     1    0    …   ………    …

Query q: 0, 1, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
Raw term-frequency matrix. Normalization is very important!

        w1   w2   w3   ………   wn
Doc1     3    0    0   ………    9
Doc2     0    0    0   ………    7
Doc3     2   11   21   ………    1
………
DocN     7    0    1   ………    2

Query q: 0, 3, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
Weighted (normalized) term–document matrix; each cell is a weight. Normalization is very important!

        w1     w2     w3    ………    wn
Doc1   0.41    0      0     ………   0.62
Doc2   0       0      0     ………   0.12
Doc3   0.42   0.11   0.34   ………   0.13
………
DocN   0.01    0     0.19   ………   0.24

Query q: 0, 0.37, 0 ………
Term weighting
TF * IDF
Term frequency: freq(w, doc) / |doc| (or other variants)
Inverse document frequency: 1 + log(N/k)
  N: total number of docs in the collection
  k: number of docs containing word w
An effective way to weight each word in a document.
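A small sketch of this weighting in Python (the TF normalization and the 1 + log(N/k) IDF follow the definitions above; the tokenization is an assumption):

import math

def tf(word, doc_tokens):
    # term frequency: freq(w, doc) / |doc|
    return doc_tokens.count(word) / len(doc_tokens)

def idf(word, docs_tokens):
    # inverse document frequency: 1 + log(N / k)
    N = len(docs_tokens)
    k = sum(1 for d in docs_tokens if word in d)
    return 1 + math.log(N / k) if k else 0.0

docs = [d.lower().split() for d in
        ["the cat sat on the mat", "the dog barked", "cat and dog play"]]
print(tf("cat", docs[0]) * idf("cat", docs))   # TF*IDF weight of "cat" in doc 0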
Retrieval Model? Ranking? Semantics? Speed? Space?
The Index (document representation) must meet the requirements of the retrieval system.
Stemming
Education, Educate, Educational, Educating, Educations → Educat
Very effective for improving system performance.
Some risk! E.g. LA Lakers = LA Lake?
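A quick check with NLTK's Porter stemmer (assuming nltk is available; note that Porter maps all of these forms to "educ" rather than the "Educat" shown above, and it also illustrates the "Lakers" risk):

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["Education", "Educate", "Educational", "Educating", "Educations", "Lakers", "Lake"]
print({w: stemmer.stem(w) for w in words})
# all Educat* forms collapse to 'educ'; 'Lakers' -> 'laker', 'Lake' -> 'lake'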
Inverted index
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
Vocabulary: i, love, my, cat, this, is, lovely, yellow, and, white
After stop-word removal and stemming: i, love, cat, thi, yellow, white
Do we lose something?
i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3
Inverted index
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3
With term frequencies (doc:freq):
i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1
We still lose something?
Inverted index
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
With term frequencies (doc:freq):
i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1
With word positions (doc:position):
i – 1:1
love – 1:2, 2:4
thi – 2:1
cat – 1:4, 2:2, 3:2, 3:5
yellow – 3:1
white – 3:4
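A compact sketch of building this positional index in Python (the stop-word list and the crude suffix stripping are assumptions chosen just to reproduce the postings above):

from collections import defaultdict

docs = {1: "I love my cat.", 2: "This cat is lovely!", 3: "Yellow cat and white cat."}
stopwords = {"my", "is", "and"}                  # assumption: tiny stop-word list

def stem(w):
    # assumption: crude stemmer so that "lovely" -> "love" and "this" -> "thi"
    for suffix in ("ly", "s"):
        if w.endswith(suffix):
            return w[: -len(suffix)]
    return w

index = defaultdict(list)                        # term -> list of (doc_id, position)
for doc_id, text in docs.items():
    tokens = text.lower().replace(".", " ").replace("!", " ").split()
    for pos, tok in enumerate(tokens, start=1):  # positions counted before stop-word removal
        if tok not in stopwords:
            index[stem(tok)].append((doc_id, pos))

for term, postings in sorted(index.items()):
    print(term, postings)                        # e.g. cat [(1, 4), (2, 2), (3, 2), (3, 5)]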
Why do you need position info?
Proximity of query terms
query: information retrieval
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.
Index – bag of words
query: information retrieval
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.
What’s the limitation of bag-of-words? Can we make it better?
n-gram (here bi-gram):
Doc 1: "information retrieval", "retrieval is", "is important", "important for" ……
Better semantic representation!
What's the limitation?
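A tiny sketch of generating the bi-grams above, assuming simple whitespace tokenization:

def bigrams(text):
    tokens = text.lower().split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("information retrieval is important for digital library"))
# ['information retrieval', 'retrieval is', 'is important', 'important for', ...]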
Index – bag of "phrases"?
Doc 1: …… big apple ……
Doc 2: …… apple ……
More precision, less ambiguity.
How to identify phrases in documents?
Identify syntactic phrases using POS tagging
n-grams
From existing resources
Noise detection
What counts as noise on a web page? Non-informative content…
Web crawler – freshness
The web keeps changing, but we cannot constantly re-check every page…
We need to find the most important pages and estimate how frequently they change:
www.nba.com
www.iub.edu
www.restaurant????.com
Sitemap: a list of URLs for each host, with modification time and change frequency.
Retrieval Model
Mathematical modeling is frequently used with the objective to understand, explain, reason about, and predict behavior or phenomena in the real world (Hiemstra, 2001).
e.g. some models help you predict tomorrow's stock price…
Vector Space Model
Hypothesis: the retrieval and ranking problem = a similarity problem!
Is that a good hypothesis? Why?
Retrieval function: Similarity(query, document)
It returns a score!!! We can rank the documents!!!
Vector Space Model
So, a query is just a short document.
Collection C: doc1, doc2, doc3 ……… docN

        w1     w2     w3    ………    wn
Doc1   0.41    0      0     ………   0.62
Doc2   0       0      0     ………   0.12
Doc3   0.42   0.11   0.34   ………   0.13
………
DocN   0.01    0     0.19   ………   0.24

Query q: 0, 0.37, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
Similarity between the Query Vector and each Doc Vector:

        w1     w2     w3    ………    wn
Doc1   0.41    0      0     ………   0.62
Doc2   0       0      0     ………   0.12
Doc3   0.42   0.11   0.34   ………   0.13
………
DocN   0.01    0     0.19   ………   0.24

Query q: 0, 0.37, 0 ………
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat
Plot: documents as vectors in a 2-D space with axes dog (x) and cat (y): doc 1 = (1, 2), doc 2 = (1, 1), doc 3 = (0, 0) since it contains neither term.
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat
Plot: the same vectors with the query added; doc 2 points in the same direction as the query, and θ is the angle between the query vector and doc 1.
F (q, doc) = cosine similarity (q, doc)
Why Cosine?
Vector Space Model
Vocabulary V: w1, w2, w3 ……… wn
Dimension = n = vocabulary size
Document doc_i: d_i1, d_i2, d_i3 ……… d_in
Query q: q1, q2, q3 ……… qn
All d_ij ∈ V
Same dimensional space!!!
Doc1: ……Cat……dog……cat……
Doc2: ……Cat……dog
Doc3: ……snake……
Query: dog cat
Try!
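A sketch of this "Try!" exercise with raw term counts as the weights (dimensions cat, dog, snake; using counts instead of TF*IDF is a simplification):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# dimensions: [cat, dog, snake], raw term counts
vectors = {"doc1": [2, 1, 0], "doc2": [1, 1, 0], "doc3": [0, 0, 1]}
query = [1, 1, 0]                                # "dog cat"
for name, vec in vectors.items():
    print(name, round(cosine(query, vec), 3))    # doc1 -> 0.949, doc2 -> 1.0, doc3 -> 0.0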
Term weighting
Doc vector: [ 0.42  0.11  0.34  0.13 ]
How do we weight each dimension?
TF * IDF
Term frequency: freq(w, doc) / |doc| (or other variants)
Inverse document frequency: 1 + log(N/k)
  N: total number of docs in the collection
  k: number of docs containing word w
More TF → higher weight.
Weighting is very important for the retrieval model!
We can improve TF, e.g. by replacing freq(term, doc) with log[freq(term, doc)].
BM25:
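The exact BM25 variant intended here is not reproduced, so for reference this is the common Okapi form:

BM25(q, d) = \sum_{t \in q} IDF(t) \cdot \frac{f(t,d)\,(k_1 + 1)}{f(t,d) + k_1\,(1 - b + b\,|d| / avgdl)}

where f(t, d) is the term frequency of t in d, |d| is the document length, avgdl is the average document length, and k_1, b are parameters (commonly k_1 ≈ 1.2 and b ≈ 0.75). Note how it combines a saturated TF, IDF, and document-length normalization.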
Vector Space Model
But…
Bag-of-words assumption = words are independent!
Query = document? Maybe not true!
Vectors and SEO (Search Engine Optimization)…
Synonyms? Semantically related words?
How about these…
Pivoted Normalization Method and Dirichlet Prior Method: both combine TF, IDF, length normalization, and an extra parameter.
Language model
A probability distribution over word sequences:
P(I love you) = 0.01
P(you love I) = 0.00001
P(love you I) = 0.0000001
If we have this information… we could build a generative model!
P(text | θ)
Language model – unigram
Generate text under the bag-of-words assumption (words are independent):
P(w1, w2, … wn) = P(w1) P(w2) … P(wn)
topic X = ??? (a distribution over words such as: food, orange, desk, USB, computer, Apple, Unix, milk, sport, superbowl, …)
Doc: I'm using Mac computer… remote access another computer… share some USB device…
P(Doc | topic 1) vs. P(Doc | topic 2)
topic 1: food, orange, desk, USB, computer, Apple, Unix, milk, yogurt, …
topic 2: iPad, NBA, sport, superbowl, NHL, score, information, unix, USB, …
Example topics as word distributions: one about Shakespeare (king, ghost, hamlet, play, romeo, juliet, ….) and one about Apple products (iPad, iPhone, 4s, tv, apple, play, store, ……).
food orange desk USB computer Apple Unix …. …. ….
10/10000
1000/10000
P(“computer” | topic X)
If we have enough data, i.e. docs about topic X
How to estimate???
milk sport superbowl
30/10000
query: sport game watch
P(query | doc 1) vs. P(query | doc 2)
doc 1: food, orange, desk, USB, computer, Apple, Unix, milk, yogurt, ….
doc 2: iPad, NBA, sport, superbowl, NHL, score, information, unix, USB, ….
A document → a document language model θ_doc:
query likelihood → query term likelihood
Retrieval problem ≈ query likelihood ≈ term likelihood P(q_i | doc)
But a document is only a small sample of the topic… the data is sparse!
Smoothing!
Smoothing
P(q_i | doc): what if q_i is not observed in doc? Is P(q_i | doc) = 0?
We want to give this a non-zero score!!!
We can make it better!
Smoothing
First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P(q_i | Doc) could be zero for unseen words (Zhai & Lafferty, 2004).
Second, smoothing helps to model the background (non-discriminative) words in the query.
Improve the language model estimate by using smoothing.
Smoothing
Another smoothing method: interpolate the document language model P(w | θ_doc) with the collection language model P(w | θ_collection):
P(w | θ) = (1 − λ) · P(w | θ_doc) + λ · P(w | θ_collection)
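A minimal sketch of query-likelihood scoring with this linear (Jelinek-Mercer) smoothing; the λ = 0.5 value and whitespace tokenization are assumptions:

import math
from collections import Counter

def score(query, doc, collection, lam=0.5):
    # log P(query | doc) with Jelinek-Mercer smoothing against the collection model
    doc_tokens = doc.lower().split()
    coll_tokens = [t for d in collection for t in d.lower().split()]
    doc_counts, coll_counts = Counter(doc_tokens), Counter(coll_tokens)
    log_p = 0.0
    for q in query.lower().split():
        p_doc = doc_counts[q] / len(doc_tokens)
        p_coll = coll_counts[q] / len(coll_tokens)
        p = (1 - lam) * p_doc + lam * p_coll        # smoothed term probability
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p

docs = ["information retrieval is important for digital library",
        "i need some information about dogs and a golden retriever"]
print(score("information retrieval", docs[0], docs))   # both terms seen: higher score
print(score("information retrieval", docs[1], docs))   # "retrieval" unseen but smoothed: still non-zero probability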
Smoothing
If we use the collection language model this way, the scoring function decomposes into term-frequency, IDF-like, and document-length-normalization components: TF-IDF is closely related to the language model and other retrieval models.
Language model: solid statistical foundation, flexible parameter settings, different smoothing methods.
Language model in a library?
If we have a paper… and a query…
Vector Space Model: Similarity(paper, query)
If a query word is not in the paper… Score = 0.
If we use a language model…
Language model in a library?
The likelihood of the query given a paper can be estimated by:
P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……
i.e. the likelihood of the query given the paper & author & journal & ……
e.g. what's the difference between web and doc retrieval???
F(doc, query) vs. F(web page, query)
web page = doc + hyperlinks + domain info + anchor text + metadata + …
Can you use those to improve system performance???
Knowledge
Score each topic by its level of interest: compare today's topic probability (current interest) with its historical interest to label it a hot topic, a diminishing topic, or a regular topic:

Score(Topic_n) =
  a · P(Day_today | Z_n) / mean[P(Day_i | Z_n)]   if P(Day_today | Z_n) > mean[P(Day_i | Z_n)] + STD[P(Day_i | Z_n)]   (hot topic)
  b · P(Day_today | Z_n) / mean[P(Day_i | Z_n)]   else if P(Day_today | Z_n) < mean[P(Day_i | Z_n)] − STD[P(Day_i | Z_n)]   (diminishing topic)
  P(Day_today | Z_n)                              else   (regular topic)

P(Day_today | Z_n) is the current interest; mean[P(Day_i | Z_n)] is the historical interest.
"Obama", Nov 5th 2008, after the election:
1. Win
2. Create history
3. First black president
(Chart: CIV score per day of the month, peaking on Nov 5th.)
Nov 5th CIV: Wiki:Barack_Obama; Wiki:Election; win; success; Wiki:President_of_the_United_States; Wiki:African_American; President; World; America; victory; record; first; president; 44th; History; Wiki:Victory_Records; Entity:first_black_president; Celebrate; black; african; Wiki:Colin_Powell; Wiki:Secretary_of_State; Wiki:United_States
Other days on the chart: Wiki:Sarah_Palin; sarah; palin; hillary; Secret; Wiki:Hillary_Rodham_Clinton; Clinton; newsweek; club; cloth
Google web results:

System          NDCG@3       NDCG@5       NDCG@10      t-test
CIV             0.35909366   0.399970894  0.479302401
CILM            0.356652652  0.387120299  0.483420045
Google          0.230423817  0.318737414  0.388792379  **
TFIDF           0.27596245   0.333012091  0.437831859  *
BM25            0.284599431  0.336961764  0.436466778  *
LM (linear)     0.32558799   0.382113457  0.473992963
LM (dirichlet)  0.34665084   0.358128576  0.45150825
LM (twostage)   0.349735965  0.358725227  0.450046444

BEST1: CIV (NDCG@3), CIV (NDCG@5), CILM (NDCG@10)
BEST2: CILM (NDCG@3), CILM (NDCG@5), CIV (NDCG@10)
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
Yahoo web results:

System          NDCG@3       NDCG@5       NDCG@10
CIV             0.351765133  0.38207777   0.475506721
CILM            0.391807685  0.40623334   0.482464858
Yahoo           0.288059321  0.326373542  0.410969176
TFIDF           0.24320988   0.282799657  0.404092457
BM25            0.245263974  0.277579262  0.395953269
LM (linear)     0.276208943  0.316889107  0.432428784
LM (dirichlet)  0.223253393  0.270017519  0.385936078
LM (twostage)   0.219225991  0.266537146  0.384349848

BEST1: CILM (NDCG@3), CILM (NDCG@5), CILM (NDCG@10)
BEST2: CIV (NDCG@3), CIV (NDCG@5), CIV (NDCG@10)
t-test marks as reported: ***, ***, *, ***, ***
Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
Knowledge Retrieval System
How to help users propose knowledge-based queries? (Query)
How to represent knowledge? (Knowledge Representation: knowledge within scientific literature)
How to match between the two? (Matching a knowledge-based information need against academic knowledge)
Query Recommendation & Feedback: query recommendation, query feedback.
Structural Keyword Generation – Features

Category            Feature                                   Description or Example
Keyword (Content)   Content_Of_Keyword                        content of the keyword, stemmed, case insensitive, stop words removed
                    Text                                      a vector of all the tokens in the keyword
                    CAP                                       whether the keyword is capitalized
                    Contain_Digit                             whether the keyword contains digits, e.g. TREC2002 → value = true
                    Character_Length_Of_Keyword               number of characters in the target keyword
                    Token_Length_Of_Keyword                   number of tokens in the keyword
                    Category_Length_Of_Keyword                number of tokens in the keyword; if the length is more than four, four is used as its category length
                    Exist_In_Title                            whether the keyword exists in the title (stemmed, case insensitive, stop words removed)
Title (Context)     Location_In_Title                         the position where the keyword appears in the title
                    Title_Text_POS                            unigram and its part of speech in the title (in a text window)
                    Title_Unigram                             unigram of the keyword in the title (in a text window)
                    Title_Bigram                              bigram of the keyword in the title (in a text window)
Abstract (Context)  Location_In_Abstract                      which sentence of the abstract the keyword appears in
                    Keyword_Position_In_Sentence_Of_Abstract  the keyword's position in the sentence (beginning, middle or end)
                    Abstract_Freq                             how many times the keyword appears in the abstract
                    Abstract_Text_POS                         unigram and its part of speech in the abstract (in a text window)
                    Abstract_Unigram                          unigram of the keyword in the abstract (in a text window)
                    Abstract_Bigram                           bigram of the keyword in the abstract (in a text window)
Evaluation – Domain Knowledge Generation
F measure (F1) comparison for Supervised Learning and Semi-Supervised Learning:

Feature set                                Concept             Supervised   Semi-supervised
Keyword-based features                     Research Question   0.637        0.662
                                           Methodology         0.479        0.516
                                           Dataset             0.824        0.816
                                           Evaluation          0.571        0.571
Keyword + Title-based features             Research Question   0.633        0.667
                                           Methodology         0.498        0.534
                                           Dataset             0.824        0.816
                                           Evaluation          0.571        0.571
Keyword + Title + Abstract-based features  Research Question   0.642        0.663
                                           Methodology         0.420        0.542
                                           Dataset             0.831        0.823
                                           Evaluation          0.621        0.662
GOOD! But not PERFECT…
Where does the knowledge come from?
• System? Machine learning, but… modest performance…
• User? No way! Very high cost! Authors won't contribute…
• System + User? Possible!
WikiBackyard (machine) + ScholarWiki (user)
Trigger: 1. a Wiki page edit improves; 2. the machine learning model improves; 3. all the other wiki pages improve; 4. the KR index improves!
YES! It helps!!!
Machine learning is powerful…
• Knowledge retrieval for scholarly publications…
• Knowledge from paper
• Knowledge from user
– Knowledge feedback
– Knowledge recommendation
• Knowledge from User vs. from Machine learning
• ScholarWiki (user) + WikiBackyard (machine)
Knowledge via Social
Network and Text Mining
CITATION? CO-OCCUR?
CO-AUTHOR?
Full text citation analysis
With further study of citation analysis, increasing numbers of
researchers have come to doubt the reasonableness of assuming
that the raw number of citations reflects an article’s influence
(MacRoberts & MacRoberts 1996). On the other hand, full-text
analysis has to some extent compensated for the weaknesses of
citation counts and has offered new opportunities for citation
analysis.
Content of each node?
Motivation of each citation?
With further study of citation analysis,
increasing numbers of researchers have come
to doubt the reasonableness of assuming that
the raw number of citations reflects an article’s
influence (MacRoberts & MacRoberts 1996).
On the other hand, full-text analysis has to
some extent compensated for the weaknesses
of citation counts and has offered new
opportunities for citation analysis.
Every word @ Citation Context will VOTE!!
Motivation? Topic? Reason??? Left and Right N words??
N = ??????????
Word effectiveness is decaying based on the distance!!!
Closer words make more significant contribution!!
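One way to sketch this distance-decayed "voting" of citation-context words (the exponential decay and window size N are assumptions for illustration):

import math
from collections import defaultdict

def citation_context_votes(tokens, cite_pos, N=10, decay=0.5):
    # each word within N tokens of the citation votes, with weight decaying by distance
    votes = defaultdict(float)
    for pos in range(max(0, cite_pos - N), min(len(tokens), cite_pos + N + 1)):
        if pos != cite_pos:
            votes[tokens[pos]] += math.exp(-decay * abs(pos - cite_pos))
    return votes

tokens = ("researchers doubt that the raw number of citations reflects influence "
          "[CITE] full-text analysis compensates for the weaknesses of citation counts").split()
print(citation_context_votes(tokens, tokens.index("[CITE]")))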
How about language model? Each node and edge
represented by a language model?
High dimensional space! Word difference?
Topic modeling – each node is represented by a topic
distribution (Prior Distribution); each edge is represented by
a topic distribution (Transitioning Probability Distribution)
Supervised topic modeling
1. Each topic has a label (YES! We can interpret each topic)
2. We DO KNOW the total number of topics
Each paper is a mixed probability distribution over the author-given keywords.
Each paper: p_zkeyi(paper) = p(z_keyi | abstract, title)
With further study of citation analysis, increasing numbers of researchers have come
to doubt the reasonableness of assuming that the raw number of citations reflects an
article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text
analysis has to some extent compensated for the weaknesses of citation counts and
has offered new opportunities for citation analysis.
Paper importance
Domain credit: 100. If we have 3 topics (keywords) key1, key2, key3 and four publications (pub 1 … pub 4), should we evenly share the credits? Each publication starts with 25.
For pub 1, with P(key1 | text) = 0.6, P(key2 | text) = 0.15, P(key3 | text) = 0.25:
Key1-Pub1 credit = 25 * 0.6
Pub 1's citations carry topic weights, e.g. P(key1 | citation) = 0.8, P(key2 | citation) = 0.1, P(key3 | citation) = 0.1 for one citation and 0.2 for the other, so:
Key1-Citation1 credit = 25 * 0.6 * [0.8 / (0.8 + 0.2)]
A citation is important if (1) it focuses on an important topic and (2) other citations focus on other topics.
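A tiny worked version of the credit assignment above (the numbers come from the example; the reading that pub 1's two citations carry key1 weights 0.8 and 0.2 is an assumption based on the figure):

domain_credit = 100
pubs = ["pub1", "pub2", "pub3", "pub4"]
pub_credit = domain_credit / len(pubs)              # 25.0 each, evenly shared

p_key1_given_text = 0.6                             # P(key1 | text of pub 1)
key1_pub1_credit = pub_credit * p_key1_given_text   # 25 * 0.6 = 15.0

p_key1_given_citations = [0.8, 0.2]                 # P(key1 | citation) for pub 1's two citations
key1_citation1_credit = (key1_pub1_credit *
                         p_key1_given_citations[0] / sum(p_key1_given_citations))
print(key1_pub1_credit, key1_citation1_credit)      # 15.0 12.0  (= 25 * 0.6 * [0.8 / (0.8 + 0.2)])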
Paper importance
Domain credit: 100. If we have 3 keywords key1, key2, key3, each publication starts with credits [25, 25, 25].
Key1-Pub1 credit = 25 * 0.6; Key1-Citation1 credit = 25 * 0.6 * [0.8 / (0.8 + 0.2)]
Propagating credit along the citation edges (weights 0.8 and 0.2) updates the credit vectors, e.g. from [25, 25, 25] to [29, 26, 28] or [27, 27, 26] for the cited publications.
This yields a domain publication ranking, a domain keyword topical ranking, and a topical citation tree.
The number of citations between a paper pair is IMPORTANT!
Different citations make different contributions to different topics (keywords) of the citing publication.
Citation transitioning topic prior
Publication/venue/author topic prior
NDCG for review citation recommendation:

Method        nDCG@10  nDCG@30  nDCG@50  nDCG@100  nDCG@300  nDCG@500  nDCG@1000  nDCG@3000  nDCG@5000  nDCG@ALL
PageRank      0.0093   0.0076   0.0084   0.0107    0.0198    0.0261    0.0392     0.0719     0.0917     0.1904
TFIDF         0.0674   0.0741   0.0833   0.0975    0.1266    0.1391    0.1541     0.1827     0.1932     0.2141
BM25          0.0689   0.0738   0.0832   0.0957    0.1260    0.1380    0.1525     0.1808     0.1895     0.2130
LM            0.0713   0.0757   0.0861   0.1006    0.1329    0.1446    0.1616     0.1872     0.1987     0.2174
PageRank_LM   0.1023   0.1130   0.1320   0.1563    0.1943    0.2100    0.2258     0.2400     0.2458     0.2593
Prior         0.1229   0.1451   0.1621   0.1847    0.2319    0.2514    0.2738     0.3074     0.3152     0.3445
Literature Review Citation recommendation
Input: Paper Abstract
Output: A list of ranked citations
MAP and NDCG evaluation
Given a paper abstract:
1. Word level match (language model)
2. Topic level match (KL-Divergence)
3. Topic importance
Use Inference Network to integrate each hypothesis
Inference Network: combines the content match, the topic match, and a publication topical prior (1. PageRank; 2. full-text PageRank with greedy match; 3. full-text PageRank with topic modeling) to produce the citation recommendation.
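A small sketch of the topic-level match as a KL divergence between keyword (topic) distributions; the ε smoothing constant is an assumption:

import math

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) over aligned topic distributions; smaller means a closer topic match
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

abstract_topics  = [0.6, 0.15, 0.25]    # P(key_i | input abstract)
candidate_topics = [0.8, 0.10, 0.10]    # P(key_i | candidate citation)
print(kl_divergence(abstract_topics, candidate_topics))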
Input: a paper abstract. Output: a ranked list of citations, e.g. [3], [2], [6], [8], [10], [1], ……
For MAP (cite or not?), each ranked item is judged YES/NO (here: YES, YES, NO, NO, YES, NO).
For NDCG (important citation?), each ranked item gets a graded relevance (here: 3, 2, 0, 0, 1, 0).
NDCG for citation recommendation based on the abstract (based on topic inference: 30 seconds; based on greedy match: 1 second):

Method                            nDCG@10  nDCG@30  nDCG@50  nDCG@100  nDCG@300  nDCG@500  nDCG@1000  nDCG@3000  nDCG@5000  nDCG@all
abstract_LM                       0.1183   0.1424   0.1586   0.1774    0.2022    0.2100    0.2202     0.2326     0.2375     0.2647
LM + PageRank                     0.1740   0.1951   0.2091   0.2297    0.2539    0.2615    0.2711     0.2847     0.2890     0.3157
LM + Fulltext                     0.1712   0.1929   0.2088   0.2290    0.2520    0.2607    0.2704     0.2832     0.2886     0.3148
LM + Fulltext + smoothing         0.2015   0.2281   0.2447   0.2670    0.2897    0.2989    0.3078     0.3199     0.3236     0.3445
LM + Fulltext + LLDA              0.2032   0.2317   0.2460   0.2648    0.2907    0.3007    0.3108     0.3241     0.3290     0.3470
LM + Fulltext + LLDA + smoothing  0.2137   0.2411   0.2563   0.2756    0.3035    0.3109    0.3208     0.3335     0.3374     0.3550
• Information Retrieval
  • Index
  • Retrieval Model
  • Ranking
  • User feedback
  • Evaluation
• Knowledge Retrieval
  • Machine Learning
  • User Knowledge
  • Integration
  • Social Network Analysis
Thank you!