Text and Web Search
Text Databases and IR

Text databases (document databases)


Large collections of documents from various sources:
news articles, research papers, books, digital libraries,
e-mail messages, Web pages, library databases, etc.
Information retrieval



A field developed in parallel with database systems
Information is organized into (a large number of)
documents
Information retrieval problem: locating relevant
documents based on user input, such as keywords or
example documents
Information Retrieval


Typical IR systems

Online library catalogs

Online document management systems
Information retrieval vs. database systems

Some DB problems are not present in IR, e.g., update,
transaction management, complex objects

Some IR problems are not addressed well in DBMS,
e.g., unstructured documents, approximate search
using keywords and relevance
Basic Measures for Text Retrieval
[Venn diagram: within All Documents, the Retrieved set and the Relevant set overlap in the Relevant & Retrieved region]


Precision: the percentage of retrieved documents that are in
fact relevant to the query (i.e., “correct” responses)
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall: the percentage of documents that are relevant to the
query and were, in fact, retrieved
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
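A minimal sketch of these two measures in Python (the document ids and result sets below are hypothetical, purely to illustrate the formulas):

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """precision = |relevant ∩ retrieved| / |retrieved|
       recall    = |relevant ∩ retrieved| / |relevant|"""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved documents are relevant; the collection holds 6 relevant ones
p, r = precision_recall({"d1", "d2", "d3", "d9"},
                        {"d1", "d2", "d3", "d4", "d5", "d6"})
print(p, r)   # 0.75 0.5
```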
Information Retrieval Techniques

Index Terms (Attribute) Selection:





Stop list
Word stem
Index terms weighting methods
Terms × Documents frequency matrices
Information Retrieval Models:



Boolean Model
Vector Model
Probabilistic Model
Problem - Motivation


Given a database of documents, find documents
containing “data”, “retrieval”
Applications:
 Web
 law + patent offices
 digital libraries
 information filtering
Problem - Motivation


Types of queries:
 boolean (‘data’ AND ‘retrieval’ AND NOT ...)
 additional features (‘data’ ADJACENT ‘retrieval’)
 keyword queries (‘data’, ‘retrieval’)
How to search a large collection of documents?
Full-text scanning

for single term:
 (naive: O(N*M))
[example: text = ABRACADABRA, pattern = CAB]
Full-text scanning

for single term:
 (naive: O(N*M))
 Knuth, Morris and Pratt (‘77)
 build a small FSA; visit every text letter once
only, by carefully shifting more than one step
[example: text = ABRACADABRA, pattern = CAB]
Full-text scanning
[figure: the pattern CAB sliding across the text ABRACADABRA, one position at a time]
Full-text scanning

for single term:
 (naive: O(N*M))
 Knuth, Morris and Pratt (‘77)
 Boyer and Moore (‘77)
 preprocess pattern; start from right to left &
skip!
[example: text = ABRACADABRA, pattern = CAB]
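A minimal Python sketch of the Knuth-Morris-Pratt idea (failure-function formulation; the text is the ABRACADABRA example from the slides, while the pattern ‘ABRA’ is an arbitrary choice just to show a match):

```python
def kmp_search(text: str, pattern: str) -> list:
    """Return all start positions of pattern in text in O(N + M) time."""
    if not pattern:
        return []
    # failure[i] = length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of it
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    # scan the text once, never moving backwards in it
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)
            k = failure[k - 1]
    return hits

print(kmp_search("ABRACADABRA", "ABRA"))  # [0, 7]
print(kmp_search("ABRACADABRA", "CAB"))   # [] -- the slide's pattern never matches
```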
Text - Detailed outline

text






problem
full text scanning
inversion
signature files
clustering
information filtering and LSI
Text – Inverted Files
Text – Inverted Files
Q: space overhead?
A: mainly, the postings lists
Text – Inverted Files



how to organize dictionary?
stemming – Y/N?
 Keep only the root of each word, e.g.,
inverted, inversion → invert
insertions?
Text – Inverted Files



how to organize dictionary?
 B-tree, hashing, TRIEs, PATRICIA trees, ...
stemming – Y/N?
insertions?
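A minimal sketch of an inverted file in Python, using a hash-based dictionary (one of the organizations listed above) and postings lists of document ids; the tiny document collection is made up:

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """term -> sorted postings list of the ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

docs = {1: "data retrieval system", 2: "lung ear", 3: "data system"}
index = build_inverted_index(docs)
print(index["data"])                                        # [1, 3]
# a boolean AND query is an intersection of postings lists
print(sorted(set(index["data"]) & set(index["system"])))    # [1, 3]
```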
Text – Inverted Files

postings lists follow a Zipf distribution; e.g., the rank-frequency
plot of the ‘Bible’:
[log-log plot of frequency vs. rank; approximately freq ~ 1 / (rank · ln(1.78 V))]
Text – Inverted Files

postings lists
 Cutting+Pedersen
 (keep first 4 in B-tree leaves)
 how to allocate space: [Faloutsos+92]
 geometric progression
 compression (Elias codes) [Zobel+] – down to 2%
overhead!

Conclusions: needs space overhead (2%-300%), but it is the
fastest
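A minimal sketch of the compression idea for postings lists: store the gaps between consecutive document ids and code each gap with an Elias gamma code (the postings list is made up, and this is one common Elias variant, not necessarily the exact scheme of [Zobel+]):

```python
def elias_gamma(x: int) -> str:
    """Elias gamma code of a positive integer: unary length prefix + binary value."""
    assert x >= 1
    b = bin(x)[2:]                  # binary representation, no leading zeros
    return "0" * (len(b) - 1) + b

postings = [3, 7, 11, 23]
gaps = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
print(gaps)                                    # [3, 4, 4, 12]
print("".join(elias_gamma(g) for g in gaps))   # 011 00100 00100 0001100, concatenated
```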
Vector Space Model and Clustering




Keyword (free-text) queries (vs Boolean)
each document: -> vector (HOW?)
each query: -> vector
search for ‘similar’ vectors
Vector Space Model and Clustering

main idea: each document is a vector of size d: d is
the number of different terms in the database
[figure: a document containing ‘...data...’ is ‘indexed’ into a vector with one position per vocabulary term (aaron, ..., data, ..., zoo); d = vocabulary size]
Document Vectors


Documents are represented as “bags of words”
Represented as vectors when used
computationally




A vector is like an array of floating-point numbers
Has direction and magnitude
Each vector holds a place for every term in the
collection
Therefore, most vectors are sparse
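A minimal sketch of such a sparse bag-of-words representation in Python (the sample sentence is made up):

```python
from collections import Counter

def doc_vector(text: str) -> Counter:
    """Sparse document vector: only the terms that actually occur are stored."""
    return Counter(text.lower().split())

v = doc_vector("data retrieval and data mining")
print(v)          # Counter({'data': 2, 'retrieval': 1, 'and': 1, 'mining': 1})
print(v["zoo"])   # 0 -- every absent vocabulary term implicitly has weight 0
```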
Document Vectors
One location for each word.
[table: raw term counts for documents A–I over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; e.g., “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A. (Blank means 0 occurrences.)]
Document Vectors
One location for each word.
[same table; e.g., “Film” occurs 5 times in text I, “Diet” occurs 1 time in text I, “Fur” occurs 3 times in text I, and “Hollywood” also occurs in text I.]
Document Vectors
[same table, with the row labels A–I highlighted as the document ids]
We Can Plot the Vectors
[2-D plot with axes “Star” and “Diet”, showing three document vectors: a doc about movie stars, a doc about astronomy, and a doc about mammal behavior]
Vector Space Model and Clustering
Then, group nearby vectors together
 Q1: cluster search?
 Q2: cluster generation?
Two significant contributions
 ranked output
 relevance feedback
Vector Space Model and Clustering

cluster search: visit the (k) closest superclusters;
continue recursively
[figure: two clusters of technical reports, CS TRs and MD TRs]
Vector Space Model and Clustering

ranked output: easy!
Vector Space Model and Clustering

relevance feedback (brilliant idea) [Rocchio’73]
Vector Space Model and Clustering


relevance feedback (brilliant idea) [Rocchio’73]
How?
Vector Space Model and Clustering

How? A: by adding the ‘good’ vectors and
subtracting the ‘bad’ ones
Cluster generation

Problem:

given N points in V dimensions,
 group them
Cluster generation

Problem:

given N points in V dimensions,
 group them (typically k-means or AGNES is used)
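A minimal sketch of plain k-means for the cluster-generation step (pure Python, no libraries; the 2-D points are made up, and a real system would run this on the document vectors):

```python
import random

def kmeans(points, k, iters=20):
    """Assign each point to its nearest centroid, then recompute each centroid
    as the mean of its assigned points; repeat."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[c])))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:   # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(xs) / len(members) for xs in zip(*members))
    return centroids, clusters

pts = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
print(kmeans(pts, 2)[0])   # two centroids, roughly (0.15, 0.15) and (0.85, 0.85)
```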
Assigning Weights to Terms



Binary Weights
Raw term frequency
tf x idf
 Recall the Zipf distribution
 Want to weight terms highly if they are
 frequent in relevant documents … BUT
 infrequent in the collection as a whole
Binary Weights

Only the presence (1) or absence (0) of a term is
included in the vector
docs   t1   t2   t3
D1     1    0    1
D2     1    0    0
D3     0    1    1
D4     1    0    0
D5     1    1    1
D6     1    1    0
D7     0    1    0
D8     0    1    0
D9     0    0    1
D10    0    1    1
D11    1    0    1
Raw Term Weights

The frequency of occurrence for the term in each
document is included in the vector
docs   t1   t2   t3
D1     2    0    3
D2     1    0    0
D3     0    4    7
D4     3    0    0
D5     1    6    3
D6     3    5    0
D7     0    8    0
D8     0    10   0
D9     0    0    1
D10    0    3    5
D11    4    0    1
Assigning Weights


tf x idf measure:
 term frequency (tf)
 inverse document frequency (idf) -- a way to deal
with the problems of the Zipf distribution
Goal: assign a tf * idf weight to each term in each
document
tf x idf
w_ik = tf_ik * log(N / n_k)

where
  T_k   = term k
  tf_ik = frequency of term T_k in document D_i
  idf_k = inverse document frequency of term T_k in collection C = log(N / n_k)
  N     = total number of documents in the collection C
  n_k   = the number of documents in C that contain T_k
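A minimal sketch of this weighting in Python (base-10 log, matching the IDF examples on the next slide; the three toy documents are made up):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_ik = tf_ik * log10(N / n_k) for every term k of every document i."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    n = Counter()                 # n_k: number of documents containing term k
    for toks in tokenized:
        n.update(set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append({t: tf[t] * math.log10(N / n[t]) for t in tf})
    return weights

for w in tfidf_weights(["data retrieval data", "data system", "lung ear"]):
    print(w)   # e.g. 'data' in doc 1: 2 * log10(3/2) ~ 0.352
```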
Inverse Document Frequency

IDF provides high values for rare words and low
values for common words
For a collection of 10,000 documents:

  log(10000 / 10000) = 0
  log(10000 / 5000)  = 0.301
  log(10000 / 20)    = 2.698
  log(10000 / 1)     = 4
Similarity Measures for
document vectors
|QD|
|QD|
2
|Q|| D|
|QD|
|QD|
|QD|
1
Simple matching (coordination level match)
Dice’s Coefficient
Jaccard’s Coefficient
1
|Q | | D | 2
|QD|
min(| Q |, | D |)
2
Cosine Coefficient
Overlap Coefficient
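A minimal sketch of the five coefficients for binary (set) representations of Q and D; the two term sets are made up:

```python
def similarities(Q: set, D: set) -> dict:
    """The coefficients above, with |.| as set size and Q, D as term sets."""
    inter = len(Q & D)
    return {
        "simple matching": inter,
        "dice":    2 * inter / (len(Q) + len(D)),
        "jaccard": inter / len(Q | D),
        "cosine":  inter / (len(Q) ** 0.5 * len(D) ** 0.5),
        "overlap": inter / min(len(Q), len(D)),
    }

print(similarities({"data", "retrieval"}, {"data", "system", "retrieval"}))
```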
tf x idf normalization

Normalize the term weights (so longer
documents are not unfairly given more
weight)

normalize usually means force all values to fall within a
certain range, usually between 0 and 1, inclusive.
w_ik = tf_ik * log(N / n_k) / sqrt( Σ_{k=1..t} (tf_ik)^2 * [log(N / n_k)]^2 )
Vector space similarity
(use the weights to compare the documents)
Now, the similarity of two documents is:

  sim(D_i, D_j) = Σ_{k=1..t} w_ik * w_jk = (v_i · v_j) / (||v_i|| ||v_j||)

This is also called the cosine, or normalized inner product.
Computing Similarity Scores
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
Q  = (0.4, 0.8)

cos α1 ≈ 0.74
cos α2 ≈ 0.98

[plot: Q, D1 and D2 drawn as vectors in the unit square, with α1 the angle between Q and D1 and α2 the angle between Q and D2]
Vector Space with Term Weights and
Cosine Matching
Q  = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)

D_i = (d_i1, w_di1; d_i2, w_di2; …; d_it, w_dit)
Q   = (q_1, w_q1; q_2, w_q2; …; q_t, w_qt)

sim(Q, D_i) = Σ_{j=1..t} w_qj * w_dij / ( sqrt(Σ_{j=1..t} (w_qj)^2) * sqrt(Σ_{j=1..t} (w_dij)^2) )

sim(Q, D2) = (0.4*0.2 + 0.8*0.7) / sqrt[ (0.4^2 + 0.8^2) * (0.2^2 + 0.7^2) ]
           = 0.64 / sqrt(0.42) ≈ 0.98

sim(Q, D1) = (0.4*0.8 + 0.8*0.3) / sqrt[ (0.4^2 + 0.8^2) * (0.8^2 + 0.3^2) ]
           = 0.56 / sqrt(0.58) ≈ 0.74

[plot: Term A vs. Term B axes, with Q, D1 and D2 plotted]
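A minimal sketch checking these two scores in Python (the exact value for D1 is about 0.73; the slide rounds the intermediate products and reports 0.74):

```python
import math

def cosine(q, d):
    """Normalized inner product of two weight vectors."""
    dot = sum(qi * di for qi, di in zip(q, d))
    return dot / (math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(x * x for x in d)))

Q, D1, D2 = (0.4, 0.8), (0.8, 0.3), (0.2, 0.7)
print(round(cosine(Q, D2), 2))   # 0.98
print(round(cosine(Q, D1), 2))   # 0.73
```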
Text - Detailed outline

Text databases






problem
full text scanning
inversion
signature files (a.k.a. Bloom Filters)
Vector model and clustering
information filtering and LSI
Information Filtering + LSI

[Foltz+,’92] Goal:



users specify interests (= keywords)
system alerts them on suitable news documents
Major contribution: LSI = Latent
Semantic Indexing

latent (‘hidden’) concepts
Information Filtering + LSI
Main idea
 map each document into some ‘concepts’
 map each term into some ‘concepts’
‘Concept’:~ a set of terms, with weights, e.g.

“data” (0.8), “system” (0.5), “retrieval” (0.6) ->
DBMS_concept
Information Filtering + LSI
Pictorially: term-document matrix
(BEFORE)
        'data'  'system'  'retrieval'  'lung'  'ear'
TR1     1       1         1
TR2     1       1         1
TR3                                    1       1
TR4                                    1       1
Information Filtering + LSI
Pictorially: concept-document matrix
and...
        'DBMS-concept'  'medical-concept'
TR1     1
TR2     1
TR3                     1
TR4                     1
Information Filtering + LSI
... and concept-term matrix
           'DBMS-concept'  'medical-concept'
data       1
system     1
retrieval  1
lung                       1
ear                        1
Information Filtering + LSI
Q: How to search, e.g., for ‘system’?
Information Filtering + LSI
A: find the corresponding concept(s); and
the corresponding documents
[the concept-term matrix maps ‘system’ to the ‘DBMS-concept’; the concept-document matrix then maps the ‘DBMS-concept’ to TR1 and TR2]
Information Filtering + LSI
Thus it works like an (automatically
constructed) thesaurus:
we may retrieve documents that DON’T
have the term ‘system’, but that contain
almost everything else (‘data’, ‘retrieval’)
SVD

LSI: find ‘concepts’
SVD - Definition
A[n x m] = U[n x r] L[r x r] (V[m x r])^T




A: n x m matrix (eg., n documents, m terms)
U: n x r matrix (n documents, r concepts)
L: r x r diagonal matrix (strength of each
‘concept’) (r : rank of the matrix)
V: m x r matrix (m terms, r concepts)
SVD - Example

A = U L VT - example:
A (7 documents x 5 terms; the first four documents are CS, the last three are MD):

        data  inf.  retrieval  brain  lung
  CS     1     1     1          0      0
  CS     2     2     2          0      0
  CS     1     1     1          0      0
  CS     5     5     5          0      0
  MD     0     0     0          2      2
  MD     0     0     0          3      3
  MD     0     0     0          1      1

= U x L x VT:

  U (7 x 2)        L (2 x 2)       VT (2 x 5)
  0.18  0
  0.36  0
  0.18  0          9.64  0         0.58  0.58  0.58  0     0
  0.90  0      x   0     5.29   x  0     0     0     0.71  0.71
  0     0.53
  0     0.80
  0     0.27
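A minimal numpy check of this decomposition (up to sign; numpy may return the singular vectors negated):

```python
import numpy as np

# the 7x5 document-term matrix from the example
A = np.array([[1, 1, 1, 0, 0],
              [2, 2, 2, 0, 0],
              [1, 1, 1, 0, 0],
              [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2],
              [0, 0, 0, 3, 3],
              [0, 0, 0, 1, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.round(s, 2))          # [9.64 5.29 0. 0. 0.] -- two non-zero concepts
print(np.round(U[:, :2], 2))   # doc-to-concept columns (0.18, 0.36, ..., 0.53, 0.80, 0.27)
print(np.round(Vt[:2], 2))     # term-to-concept rows (0.58 ... and 0.71 ...)
```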
SVD - Example

A = U L VT - example: the two columns of U (and the two rows of VT) correspond to the ‘CS-concept’ and the ‘MD-concept’ (same decomposition as above).
SVD - Example

A = U L VT - example: U is the doc-to-concept similarity matrix (same decomposition as above).
SVD - Example

A = U L VT - example: the diagonal entry 9.64 of L is the ‘strength’ of the CS-concept (same decomposition as above).
SVD - Example

A = U L VT - example: VT is the term-to-concept similarity matrix; its first row is the CS-concept (same decomposition as above).
SVD for LSI
‘documents’, ‘terms’ and ‘concepts’:
 U: document-to-concept similarity
matrix
 V: term-to-concept sim. matrix
 L: its diagonal elements: ‘strength’ of
each concept
SVD for LSI

Need to keep all the eigenvectors?

NO, just keep the first k (concepts)
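A minimal sketch of the truncation: keep only the k strongest concepts and fold a query into the concept space (k = 2 and the one-hot query are illustrative choices, mapping the query with q @ V_k is a common LSI convention, and numpy may flip the signs of the vectors):

```python
import numpy as np

def lsi(A: np.ndarray, k: int):
    """Rank-k truncated SVD: keep only the k strongest concepts."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k]

# the 7x5 example matrix from the SVD slides, keeping k = 2 concepts
A = np.array([[1, 1, 1, 0, 0], [2, 2, 2, 0, 0], [1, 1, 1, 0, 0], [5, 5, 5, 0, 0],
              [0, 0, 0, 2, 2], [0, 0, 0, 3, 3], [0, 0, 0, 1, 1]], dtype=float)
U_k, s_k, Vt_k = lsi(A, 2)

q = np.array([0, 0, 1, 0, 0], dtype=float)   # one-hot query on the 3rd term
print(np.round(q @ Vt_k.T, 2))                # loads on the CS-concept only (up to sign)
```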
Web Search

What about web search?





First you need to get all the documents of the
web: crawlers.
Then you have to index them (inverted files, etc)
Find the web pages that are relevant to the query
Report the pages with their links in a sorted order
Main difference with IR: web pages have links

may be possible to exploit the link structure for
sorting the relevant documents…
Kleinberg’s Algorithm (HITS)

Main idea: in many cases, when you search
the web using some terms, the most relevant
pages may not contain these terms (or may
contain them only a few times)



Harvard : www.harvard.edu
Search Engines: yahoo, google, altavista
Authorities and hubs
Kleinberg’s algorithm


Problem definition: given the web and a query,
find the most ‘authoritative’ web pages for
this query
Step 0: find all pages containing the query
terms (root set)
Step 1: expand by one move forward and
backward (base set)
Kleinberg’s algorithm

Step 1: expand by one move forward
and backward
Kleinberg’s algorithm


on the resulting graph, give high score
(= ‘authorities’) to nodes that many
important nodes point to
give high importance score (‘hubs’) to
nodes that point to good ‘authorities’
hubs
authorities
Kleinberg’s algorithm
observations
 recursive definition!
 each node (say, ‘i’-th node) has both an
authoritativeness score ai and a
hubness score hi
Kleinberg’s algorithm
Let E be the set of edges and A be the
adjacency matrix:
the (i,j) entry is 1 if the edge from i to j exists
Let h and a be [n x 1] vectors with the
‘hubness’ and ‘authoritativeness’ scores.
Then:
Kleinberg’s algorithm
Then:
[figure: nodes k, l and m all point to node i]

a_i = h_k + h_l + h_m

that is,
a_i = Σ h_j over all j such that the edge (j, i) exists

or
a = A^T h
Kleinberg’s algorithm
[figure: node i points to nodes n, p and q]

symmetrically, for the ‘hubness’:

h_i = a_n + a_p + a_q

that is,
h_i = Σ a_j over all j such that the edge (i, j) exists

or
h = A a
Kleinberg’s algorithm
In conclusion, we want vectors h and a
such that:
h = A a
a = A^T h

Start with a and h set to all 1s. Then apply the following trick:
h = A a = A (A^T h) = (A A^T) h = … = (A A^T)^k h
a = (A^T A)^k a
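A minimal sketch of that iteration in Python/numpy (the 4-node adjacency matrix is a made-up toy graph; normalizing after each step keeps the scores from growing without bound):

```python
import numpy as np

def hits(A: np.ndarray, iters: int = 50):
    """Power iteration for a = A^T h, h = A a, starting from all-ones vectors."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h
        h = A @ a
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)
    return a, h

# A[i, j] = 1 if there is an edge i -> j
A = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
auth, hub = hits(A)
print(np.round(auth, 2))   # node 2 gets the highest authority score
print(np.round(hub, 2))
```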
Kleinberg’s algorithm
In short, the solutions to
h = A a
a = A^T h
are the left- and right- eigenvectors of the
adjacency matrix A.
Starting from random a’ and iterating, we’ll
eventually converge
(Q: to which of all the eigenvectors? why?)
Kleinberg’s algorithm
(Q: to which of all the eigenvectors?
why?)
A: to the ones corresponding to the strongest
eigenvalue, because of the property:
(A^T A)^k v’ ~ (constant) v1
So, we can find the a and h vectors, and the pages with the
highest a values are reported!
Kleinberg’s algorithm - results
E.g., for the query ‘java’:
0.328 www.gamelan.com
0.251 java.sun.com
0.190 www.digitalfocus.com (“the java
developer”)
Kleinberg’s algorithm - discussion


‘authority’ score can be used to find
‘similar pages’ to page p
closely related to ‘citation analysis’,
social networks / ‘small world’
phenomena
google/page-rank algorithm



closely related: The Web is a directed graph
of connected nodes
imagine a particle randomly moving along the
edges (*)
compute its steady-state probabilities. That
gives the PageRank of each page (the
importance of this page)
(*) with occasional random jumps
PageRank Definition

Assume a page A and pages T1, T2, …, Tm
that point to A. Let d be a damping factor,
PR(A) the PageRank of A, and C(A) the out-degree of A. Then:

PR(A) = (1 - d) + d * ( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tm)/C(Tm) )
google/page-rank algorithm

Computing the PR of each page is essentially the
same problem as: given a Markov chain,
compute the steady-state probabilities
p1 ... p5

[figure: a 5-node example graph (nodes 1–5)]
Computing PageRank


Iterative procedure
Also, … navigate the web by randomly
following links, or jumping to a random page
with probability d. Let A be the adjacency
matrix (n x n) and c_i the out-degree of page i:

Prob(Ai -> Aj) = d * n^(-1) + (1 - d) * c_i^(-1) * A_ij
A’[i,j] = Prob(Ai -> Aj)
google/page-rank algorithm

Let A’ be the transition matrix
(=
adjacency matrix, row-normalized : sum of each row
= 1)
2
1
1
3
1/2
1
5
p1
p2
p2
p3
1
4
1/2
p1
1/2 1/2
=
p3
p4
p4
p5
p5
google/page-rank algorithm

A p = p

[figure: the same 5-node example, showing the transition matrix A multiplying p = (p1, …, p5) and giving back p]
google/page-rank algorithm


Ap=p
thus, p is the eigenvector that
corresponds to the highest eigenvalue
(=1, since the matrix is row-normalized)
Kleinberg/google - conclusions
SVD helps in graph analysis:
hub/authority scores: strongest left- and
right- eigenvectors of the adjacency
matrix
random walk on a graph: steady state
probabilities are given by the strongest
eigenvector of the transition matrix
References

Brin, S. and L. Page (1998). The Anatomy of a
Large-Scale Hypertextual Web Search Engine.
7th Intl. World Wide Web Conf.