CS 4485 Test

Question 1. (10 points) In the vector space model, document j is represented as $(w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j})$, where there are n terms in total. Each $w_{i,j}$ is computed as

$$w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad idf_i = \log\frac{N}{n_i}, \qquad f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$

Explain the meaning of $N$, $n_i$, and $freq_{i,j}$.

Answer: $N$ is the total number of documents in the system, $n_i$ is the number of documents that contain term i, and $freq_{i,j}$ is the number of times that term i appears in document j.

Question 2. (10 points) Give the definitions of recall and precision.

Answer: Let R be the set of all relevant documents, A the set of all documents reported as relevant by the system, and $R_a = A \cap R$ the set of relevant documents reported. Recall = $|R_a|/|R|$. Precision = $|R_a|/|A|$.

Question 3. (15 points) Describe an O(n) time algorithm that takes a document (a string) as input and finds the number of occurrences of each term in the document.

Answer: Construct an automaton while reading the text. There is a starting state in the automaton, and for each word in the text there is a final state. Each final state has a counter indicating the number of times the corresponding word has been accepted. When reading the first letter of a word (words are separated by spaces), we go to the starting state of the automaton. If the word is new, create new states for the word (or for the part of the word not yet in the automaton). When a final state is reached and the next letter is a space, increase the counter on that state by one. (You can use your own words to describe the algorithm. The question tests whether you understand the algorithm for assignment one, and also your ability to describe algorithms.)

Question 4. (15 points) Let $X_1=(1,0)$, $X_2=(1,1)$, $X_3=(2,2)$, and $X_4=(2,3)$. $D_1=\{X_1, X_2\}$ and $D_2=\{X_3, X_4\}$. Use the Perceptron Algorithm to find a vector $w$ such that $\sum_{i=1}^{m} w_i x_{ij} > 0$ for each $X_j \in D_1$ and $\sum_{i=1}^{m} w_i x_{ij} < 0$ for each $X_j \in D_2$.

Answer:
(0) Change each vector from m-dimensional to (m+1)-dimensional by adding an element with value 1. For example, (2, 3) → (1, 2, 3).
(5, 8) → (1, 5, 8). (This step is missing from the handout.)
(1) For each $X \in D_1$, if $X \cdot W \le 0$ then increase the weight vector at the next iteration: $W = W_{old} + CX$.
(2) For each $X \in D_2$, if $X \cdot W \ge 0$ then decrease the weight vector at the next iteration: $W = W_{old} - CX$.
C is a constant. Repeat until $X \cdot W > 0$ for each $X \in D_1$ and $X \cdot W < 0$ for each $X \in D_2$.

Trace with C = 1, starting from W = (1.0, 1.0, 1.0). Each row gives the misclassified sample (augmented here with the 1 as the last component) and the weight vector after the update:

  X = (2.0, 2.0, 1.0)   W = (-1.0, -1.0, 0.0)
  X = (1.0, 0.0, 1.0)   W = (0.0, -1.0, 1.0)
  X = (1.0, 1.0, 1.0)   W = (1.0, 0.0, 2.0)
  X = (2.0, 2.0, 1.0)   W = (-1.0, -2.0, 1.0)
  X = (1.0, 0.0, 1.0)   W = (0.0, -2.0, 2.0)
  X = (1.0, 1.0, 1.0)   W = (1.0, -1.0, 3.0)
  X = (2.0, 2.0, 1.0)   W = (-1.0, -3.0, 2.0)
  X = (1.0, 1.0, 1.0)   W = (0.0, -2.0, 3.0)

The final W = (0.0, -2.0, 3.0) corresponds to the separating line y = 1.5, which is a solution. (Everybody gets 15 points for this question.)

Question 5. (10 points) Explain the following terms: stopword, stemming.

Answer:
(1) Stem: the portion of a word that is left after the removal of its affixes (i.e., prefixes or suffixes). Example: connect is the stem for {connected, connecting, connection, connections}.
(2) Stopword: words that appear in more than 80% of the documents in the collection are stopwords and are filtered out as potential index words.

Question 6. (10 points) PageRank of a page A is given as follows: PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Explain the meaning of PR(Ti) and C(Ti).

Answer: PR(Ti) is the PageRank of page Ti, where T1, ..., Tn are the pages that link to A. C(Ti) is the number of links going out of page Ti.

Question 7. (15 points, Bonus) Use the dynamic programming algorithm to compute the LCS between X=abbcdbba and Y=abcbdba.

Answer:

        a  b  b  c  d  b  b  a
     0  0  0  0  0  0  0  0  0
  a  0  1  1  1  1  1  1  1  1
  b  0  1  2  2  2  2  2  2  2
  c  0  1  2  2  3  3  3  3  3
  b  0  1  2  3  3  3  4  4  4
  d  0  1  2  3  3  4  4  4  4
  b  0  1  2  3  3  4  5  5  5
  a  0  1  2  3  3  4  5  5  6

The bottom-right cell gives the LCS length, 6; one such LCS is abcdba.

Question 8. (15 points) Consider the Fuzzy Information Retrieval model. Suppose that our system has d1=(1, 0, 0), d2=(1, 0, 0), d3=(0, 1, 1), d4=(0, 0, 1), and d5=(1, 0, 0). The query is k1 and k2. Compute $\mu_{q,j}$ for j=1 and 2.

Answer:
(1) $n_1=3$, $n_2=1$, $n_3=2$, $n_{1,2}=0$, $n_{1,3}=0$, $n_{2,3}=1$.
(2) $c_{1,2}=0$, $c_{1,3}=0$, and $c_{2,3}=0.5$.
(3) $q = (k_1 \wedge k_2 \wedge k_3) \vee (k_1 \wedge k_2 \wedge \neg k_3)$.
(4) $\mu_{1,1} = 1-(1-c_{1,1}) = 1$. $\mu_{2,1}=0$, $\mu_{3,1}=0$, $\mu_{1,2}=1$, $\mu_{2,2}=0$, $\mu_{3,2}=0$.
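(5) $\mu_{q,1} = \mu_{c_1+c_2,1} = 1-(1-\mu_{c_1,1})(1-\mu_{c_2,1}) = 1-(1-\mu_{1,1}\mu_{2,1}\mu_{3,1})(1-\mu_{1,1}\mu_{2,1}(1-\mu_{3,1})) = 1-(1-1 \cdot 0 \cdot 0)(1-1 \cdot 0 \cdot 1) = 0$, where $c_1$ and $c_2$ are the two conjunctive components of q from step (3). Since d2 is identical to d1, $\mu_{q,2} = 0$ as well.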
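
The steps of the Question 8 answer can be checked mechanically. Below is a minimal Python sketch, assuming the usual fuzzy-model definitions that the answer relies on: correlation c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l}), term membership mu_{i,j} = 1 - prod over terms l in d_j of (1 - c_{i,l}), and query membership taken over the conjunctive components of q. The names correlations, memberships, and query_membership are illustrative only and are not part of the course material.

```python
def correlations(docs):
    """Term-term correlation matrix c[i][l] = n_il / (n_i + n_l - n_il)."""
    t = len(docs[0])
    # n[i][l] = number of documents containing both term i and term l
    n = [[sum(1 for d in docs if d[i] and d[l]) for l in range(t)]
         for i in range(t)]
    c = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for l in range(t):
            denom = n[i][i] + n[l][l] - n[i][l]
            c[i][l] = n[i][l] / denom if denom else 0.0
    return c


def memberships(doc, c):
    """mu[i] = 1 - product over terms l present in doc of (1 - c[i][l])."""
    mu = []
    for i in range(len(doc)):
        prod = 1.0
        for l, present in enumerate(doc):
            if present:
                prod *= 1.0 - c[i][l]
        mu.append(1.0 - prod)
    return mu


def query_membership(mu, components):
    """components: one tuple of booleans per conjunctive component;
    True means the term appears positively, False means negated."""
    prod = 1.0
    for comp in components:
        m = 1.0
        for mu_i, positive in zip(mu, comp):
            m *= mu_i if positive else 1.0 - mu_i
        prod *= 1.0 - m
    return 1.0 - prod


DOCS = [(1, 0, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (1, 0, 0)]
# q = k1 AND k2, expanded over three terms as in step (3)
Q = [(True, True, True), (True, True, False)]

C = correlations(DOCS)
MU_Q = [query_membership(memberships(d, C), Q) for d in DOCS]
print(MU_Q[0], MU_Q[1])  # 0.0 0.0
```

Because no document in this collection contains both k1 and k2, and k2 is uncorrelated with k1, the query membership is 0 for d1 and d2, which agrees with step (5).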