IR_lab5_solution_2015

IR Lab 5 tf*idf solution for manual calculation
Q: Gold Silver Truck
D1: Shipment of gold damage in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck
Objective: Manually calculate the similarity of each of these documents (D1, D2, D3)
to the query (Q).
Solution: Use cosine similarity as the similarity measure
- Step 1: Represent Q, D1, D2 and D3 as vectors in the same vector space
- Step 2: Compute the cosine similarity between Q and each document
Step 1: Representing Q, D1, D2 and D3 as vectors in the same vector space.
Each document, and the query, is a vector in this vector space.
- What is the dimension of the vector space?
  - The number of distinct terms in D1, D2, D3 is the dimension of the
    vector space.
- For each document and query, what is the coordinate in each dimension?
  - The coordinate in dimension i is the weight of term i (w_i).
  - Weight of term i in document j: w_ij = tf_ij * idf_i
There are 11 distinct terms in the collection D1, D2, D3: gold, silver, truck,
shipment, of, damage, in, a, fire, delivery, arrived.
So the vector space has 11 dimensions. The next step is to represent Q, D1, D2
and D3 in this 11-dimensional space.
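The term count can be checked mechanically. The following is a minimal sketch, assuming that lowercasing and splitting on whitespace is enough for these three sentences; the `docs` dictionary and the `tokenize` helper are illustrative names, not part of the lab material.

```python
# The three documents of the exercise (dictionary layout is an assumption).
docs = {
    "D1": "Shipment of gold damage in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace (simplifying assumption)."""
    return text.lower().split()


# Collect each distinct term once, in the order it is first seen.
vocabulary = []
for text in docs.values():
    for term in tokenize(text):
        if term not in vocabulary:
            vocabulary.append(term)

print(len(vocabulary))  # dimension of the vector space
print(vocabulary)
```

The length of `vocabulary` is the dimension of the vector space; the order of the terms only fixes which coordinate belongs to which term.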
- IDF
We have:
idf_i = lg(N/n_i), in which:
lg = logarithm to base 10
N = number of documents in the collection (here N = 3)
n_i = number of documents containing term i.
So:
Idf(gold) = lg(3/2) = 0.176
Idf(silver) = lg(3/1)= 0.477
Idf(truck) = lg(3/2) = 0.176
Idf(shipment) = lg(3/2) = 0.176
Idf(of) = lg(3/3) = 0
Idf(damage) = lg(3/1) = 0.477
Idf(in) = lg(3/3) = 0
Idf(fire) = lg(3/1) = 0.477
Idf(delivery) = lg(3/1) = 0.477
Idf(arrived) = lg(3/2) = 0.176
Idf(a) = lg(3/3) = 0
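The idf values above can be reproduced with a short script. This is a sketch under the same assumptions as the manual calculation (base-10 logarithm, N = 3, whitespace tokenization); the variable names are illustrative.

```python
import math

# The collection used to compute idf (the query is not part of it).
docs = {
    "D1": "Shipment of gold damage in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

N = len(docs)

# Set of distinct terms per document: idf only asks whether a term occurs.
doc_terms = {name: set(text.lower().split()) for name, text in docs.items()}
vocabulary = sorted(set().union(*doc_terms.values()))

idf = {}
for term in vocabulary:
    # n_i: number of documents containing the term
    n_i = sum(1 for terms in doc_terms.values() if term in terms)
    idf[term] = math.log10(N / n_i)

for term in vocabulary:
    print(f"idf({term}) = {idf[term]:.3f}")
```

Terms that occur in all three documents (of, in, a) get idf 0, so they cannot contribute to any similarity score.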
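- TF
We have tf_ij = freq_ij / P_j, in which:
freq_ij: frequency (number of occurrences) of term i in document j
P_j: total number of terms in document j (3 for Q, 7 for D1, 8 for D2, 7 for D3).
So:

| tf | gold | silver | truck | shipment | of  | damage | in  | fire | delivery | arrived | a   |
|----|------|--------|-------|----------|-----|--------|-----|------|----------|---------|-----|
| Q  | 1/3  | 1/3    | 1/3   | 0        | 0   | 0      | 0   | 0    | 0        | 0       | 0   |
| D1 | 1/7  | 0      | 0     | 1/7      | 1/7 | 1/7    | 1/7 | 1/7  | 0        | 0       | 1/7 |
| D2 | 0    | 2/8    | 1/8   | 0        | 1/8 | 0      | 1/8 | 0    | 1/8      | 1/8     | 1/8 |
| D3 | 1/7  | 0      | 1/7   | 1/7      | 1/7 | 0      | 1/7 | 0    | 0        | 1/7     | 1/7 |

- Weights w_ij and representing Q, D1, D2, D3 in the vector space
We have: w_ij = tf_ij * idf_i
So:

| w_ij | gold  | silver | truck | shipment | of    | damage | in    | fire  | delivery | arrived | a     |
|------|-------|--------|-------|----------|-------|--------|-------|-------|----------|---------|-------|
| Q    | 0.059 | 0.159  | 0.059 | 0.000    | 0.000 | 0.000  | 0.000 | 0.000 | 0.000    | 0.000   | 0.000 |
| D1   | 0.025 | 0.000  | 0.000 | 0.025    | 0.000 | 0.068  | 0.000 | 0.068 | 0.000    | 0.000   | 0.000 |
| D2   | 0.000 | 0.119  | 0.022 | 0.000    | 0.000 | 0.000  | 0.000 | 0.000 | 0.060    | 0.022   | 0.000 |
| D3   | 0.025 | 0.000  | 0.025 | 0.025    | 0.000 | 0.000  | 0.000 | 0.000 | 0.000    | 0.025   | 0.000 |

- Each document and query is represented as a vector with 11 dimensions,
for example, from the table above: D1=(0.025, 0, 0, 0.025, 0, 0.068, 0, 0.068, 0,
0, 0)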
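The tf and weight calculations can be sketched in Python as follows. The sketch assumes whitespace tokenization, takes idf from the three documents only, and treats the query like a short document when computing tf; the names `texts`, `terms` and `weight_vector` are illustrative.

```python
import math
from collections import Counter

texts = {
    "Q": "Gold Silver Truck",
    "D1": "Shipment of gold damage in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}

# Fixed term order, so every vector uses the same coordinate layout.
terms = ["gold", "silver", "truck", "shipment", "of", "damage",
         "in", "fire", "delivery", "arrived", "a"]

tokens = {name: text.lower().split() for name, text in texts.items()}

# idf is computed over the document collection only (not the query).
collection = ["D1", "D2", "D3"]
N = len(collection)
idf = {
    t: math.log10(N / sum(1 for d in collection if t in tokens[d]))
    for t in terms
}


def weight_vector(name):
    """Return the tf*idf vector of a document or query, in `terms` order."""
    counts = Counter(tokens[name])   # freq_ij
    total = len(tokens[name])        # P_j
    return [counts[t] / total * idf[t] for t in terms]


weights = {name: weight_vector(name) for name in texts}
for name, vec in weights.items():
    print(name, [f"{w:.3f}" for w in vec])
```

Each printed row is one 11-dimensional vector; rounding to three decimals is only done for display.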
Step 2: Using cosine similarity to calculate the similarity
For example, for two vectors a1 = (x1, y1) and a2 = (x2, y2):
\[
d(a_1, a_2) = \cos\theta = \frac{a_1 \cdot a_2}{|a_1| \times |a_2|}
            = \frac{(x_1 \times x_2) + (y_1 \times y_2)}{\sqrt{x_1^2 + y_1^2} \times \sqrt{x_2^2 + y_2^2}}
\]
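The two-dimensional formula carries over unchanged to the 11-dimensional vectors of Step 1. As a sketch, with u_i and v_i denoting the i-th coordinates of two vectors u and v (symbols introduced here for illustration) and n = 11 in this exercise:

```latex
\[
\cos\theta = \frac{u \cdot v}{|u| \times |v|}
           = \frac{\sum_{i=1}^{n} u_i v_i}
                  {\sqrt{\sum_{i=1}^{n} u_i^2} \times \sqrt{\sum_{i=1}^{n} v_i^2}}
\]
```

Dimensions in which either vector has weight zero add nothing to the numerator, so only terms shared by the query and a document contribute there.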
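Applying this to the 11-dimensional vectors from Step 1 (terms equal to zero are omitted):
\[
\text{Similarity}(Q, D_1) = d(Q, D_1)
= \frac{0.025 \times 0.059}
       {\sqrt{0.025^2 + 0.025^2 + 0.068^2 + 0.068^2} \times \sqrt{0.059^2 + 0.159^2 + 0.059^2}}
\approx 0.08
\]
\[
\text{Similarity}(Q, D_2) = d(Q, D_2)
= \frac{0.119 \times 0.159 + 0.022 \times 0.059}
       {\sqrt{0.119^2 + 0.022^2 + 0.060^2 + 0.022^2} \times \sqrt{0.059^2 + 0.159^2 + 0.059^2}}
\approx 0.82
\]
\[
\text{Similarity}(Q, D_3) = d(Q, D_3)
= \frac{0.025 \times 0.059 + 0.025 \times 0.059}
       {\sqrt{0.025^2 + 0.025^2 + 0.025^2 + 0.025^2} \times \sqrt{0.059^2 + 0.159^2 + 0.059^2}}
\approx 0.33
\]
So D2 is the most similar to Q, followed by D3, then D1.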
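The cosine step can also be checked numerically. The sketch below assumes weight vectors obtained as w_ij = tf_ij * idf_i and rounded to three decimals, in the term order gold, silver, truck, shipment, of, damage, in, fire, delivery, arrived, a; the `cosine` helper and the dictionary names are illustrative.

```python
import math

# tf*idf weights rounded to three decimals (assumed input for this sketch).
# Term order: gold, silver, truck, shipment, of, damage, in, fire,
#             delivery, arrived, a
vectors = {
    "Q":  [0.059, 0.159, 0.059, 0, 0, 0, 0, 0, 0, 0, 0],
    "D1": [0.025, 0, 0, 0.025, 0, 0.068, 0, 0.068, 0, 0, 0],
    "D2": [0, 0.119, 0.022, 0, 0, 0, 0, 0, 0.060, 0.022, 0],
    "D3": [0.025, 0, 0.025, 0.025, 0, 0, 0, 0, 0, 0.025, 0],
}


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


similarity = {
    name: cosine(vectors["Q"], vec)
    for name, vec in vectors.items()
    if name != "Q"
}

# Documents ordered from most to least similar to the query.
ranking = sorted(similarity, key=similarity.get, reverse=True)
for name in ranking:
    print(f"Similarity(Q, {name}) = {similarity[name]:.2f}")
```

Because the inputs are rounded to three decimals, the third decimal of the results can differ slightly from a calculation with unrounded weights; the ordering of the documents is not affected.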