IR Lab 5: tf*idf solution for manual calculation

Q: Gold Silver Truck
D1: Shipment of gold damage in a fire
D2: Delivery of silver arrived in a silver truck
D3: Shipment of gold arrived in a truck

Objective: Manually calculate the similarity of each of these documents (D1, D2, D3) with the query (Q).

Solution: Use cosine similarity to calculate the similarity.
Step 1: Present Q, D1, D2 and D3 as vectors in the same vector space
Step 2: Use cosine similarity to calculate the similarity

Step 1: Presenting Q, D1, D2 and D3 as vectors in the same vector space

Each document and the query is a vector in the vector space.

What is the dimension of the vector space?
o The number of different terms in D1, D2, D3 is the dimension of the vector space.

For each document and the query, what is the coordinate in each dimension?
o The coordinate in each dimension is the weight of the corresponding term.
o Weight of term i in document j: $w_{ij} = tf_{ij} \times idf_i$

There are 11 different terms in the collection of D1, D2, D3: gold, silver, truck, shipment, of, damage, in, a, fire, delivery, arrived. So the dimension of the vector space is 11. The next step is to present the query and all documents D1, D2 and D3 in this 11-dimensional vector space.

IDF

We have $idf_i = \lg(N / n_i)$, in which:
N = number of documents in the collection
n_i = number of documents containing term i
(lg is the base-10 logarithm.)

So:
idf(gold) = lg(3/2) = 0.176
idf(silver) = lg(3/1) = 0.477
idf(truck) = lg(3/2) = 0.176
idf(shipment) = lg(3/2) = 0.176
idf(of) = lg(3/3) = 0
idf(damage) = lg(3/1) = 0.477
idf(in) = lg(3/3) = 0
idf(fire) = lg(3/1) = 0.477
idf(delivery) = lg(3/1) = 0.477
idf(arrived) = lg(3/2) = 0.176
idf(a) = lg(3/3) = 0

TF

We have $tf_{ij} = freq_{ij} / P_j$, in which:
freq_ij = frequency of term i in document j
P_j = total number of terms in document j
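The IDF and TF definitions above can be checked with a few lines of code. The following is only a minimal Python sketch, assuming lowercase, whitespace-split tokens and a base-10 logarithm; the names `docs`, `compute_idf` and `tf` are illustrative and not part of the lab.

```python
import math

# The three documents, lowercased and split on whitespace
# (an assumed tokenization; the lab does not prescribe one).
docs = {
    "D1": "shipment of gold damage in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}

def compute_idf(docs):
    """idf_i = lg(N / n_i), with n_i = number of documents containing term i."""
    n_docs = len(docs)
    vocab = sorted({t for terms in docs.values() for t in terms})
    return {
        t: math.log10(n_docs / sum(t in terms for terms in docs.values()))
        for t in vocab
    }

def tf(term, terms):
    """Frequency of `term` in a document divided by the document's total term count."""
    return terms.count(term) / len(terms)

idf = compute_idf(docs)
vocab = sorted(idf)

for t in vocab:
    print(f"idf({t}) = {idf[t]:.3f}")
print(f"tf(silver, D2) = {tf('silver', docs['D2']):.3f}")
```

Terms that occur in every document (of, in, a) get an idf of 0, so they contribute nothing to any weight regardless of their term frequency.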
So:

tf_ij  gold  silver  truck  shipment  of   damage  in   fire  delivery  arrived  a
Q      1/3   1/3     1/3    0         0    0       0    0     0         0        0
D1     1/7   0       0      1/7       1/7  1/7     1/7  1/7   0         0        1/7
D2     0     2/8     1/8    0         1/8  0       1/8  0     1/8       1/8      1/8
D3     1/7   0       1/7    1/7       1/7  0       1/7  0     0         1/7      1/7

Weights w_ij and presenting Q, D1, D2, D3 in the vector space

We have $w_{ij} = tf_{ij} \times idf_i$. So:

w_ij   gold   silver  truck  shipment  of     damage  in     fire   delivery  arrived  a
Q      0.059  0.159   0.059  0.000     0.000  0.000   0.000  0.000  0.000     0.000    0.000
D1     0.025  0.000   0.000  0.025     0.000  0.068   0.000  0.068  0.000     0.000    0.000
D2     0.000  0.119   0.022  0.000     0.000  0.000   0.000  0.000  0.060     0.022    0.000
D3     0.025  0.000   0.025  0.025     0.000  0.000   0.000  0.000  0.000     0.025    0.000

Each document and the query is presented as a vector with 11 dimensions. For example, from the table above:
D1 = (0.025, 0, 0, 0.025, 0, 0.068, 0, 0.068, 0, 0, 0)

Step 2: Using cosine similarity to calculate the similarity

For example, for two vectors a1 = (x1, y1) and a2 = (x2, y2):

$$d(a_1, a_2) = \cos\theta = \frac{a_1 \cdot a_2}{|a_1| \times |a_2|} = \frac{x_1 x_2 + y_1 y_2}{\sqrt{x_1^2 + y_1^2} \times \sqrt{x_2^2 + y_2^2}}$$

So, leaving out the terms that are zero:

$$\text{Similarity}(Q, D_1) = d(Q, D_1) = \frac{0.025 \times 0.059}{\sqrt{0.025^2 + 0.025^2 + 0.068^2 + 0.068^2} \times \sqrt{0.059^2 + 0.159^2 + 0.059^2}} \approx 0.080$$

Carrying the weights through without intermediate rounding (using the three-decimal idf values) gives:

Similarity(Q, D1) = d(Q, D1) = 0.0801
Similarity(Q, D2) = d(Q, D2) = 0.8248
Similarity(Q, D3) = d(Q, D3) = 0.3271
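The whole manual procedure (tf, idf, weights, cosine) can also be reproduced programmatically. What follows is a minimal Python sketch rather than a reference implementation. It assumes whitespace tokenization, a base-10 logarithm, and idf computed from the three documents only, so the query does not count toward N. The function names `tfidf_vector` and `cosine` are hypothetical. It uses unrounded logarithms rather than three-decimal idf values.

```python
import math

docs = {
    "D1": "shipment of gold damage in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

def compute_idf(docs):
    """idf_i = lg(N / n_i) over the document collection."""
    n_docs = len(docs)
    vocab = sorted({t for terms in docs.values() for t in terms})
    return {
        t: math.log10(n_docs / sum(t in terms for terms in docs.values()))
        for t in vocab
    }

def tfidf_vector(terms, idf):
    """Weights w_i = (freq_i / total terms) * idf_i, one per vocabulary term."""
    return [terms.count(t) / len(terms) * idf[t] for t in sorted(idf)]

def cosine(u, v):
    """Cosine of the angle between u and v (assumes neither is the zero vector)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

idf = compute_idf(docs)
q_vec = tfidf_vector(query, idf)
scores = {name: cosine(q_vec, tfidf_vector(terms, idf)) for name, terms in docs.items()}

# Print the documents from most to least similar to the query.
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"Similarity(Q, {name}) = {score:.4f}")
```

Because both vectors are divided by their lengths, the 1/P_j factor in tf cancels out of the cosine; only the relative term counts and the idf values affect the result.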