CS 4485 Test

Question 1. (10 points) In the vector space model, document j is represented as
$(w_{1,j}, w_{2,j}, w_{3,j}, \ldots, w_{n,j})$, where there are n terms in total. Each $w_{i,j}$ is computed as
$$w_{i,j} = f_{i,j} \cdot \log\frac{N}{n_i}, \qquad idf_i = \log\frac{N}{n_i}, \qquad f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$$
Explain the meaning of N, $n_i$, and $freq_{i,j}$.
Answer: N is the total number of documents in the system, $n_i$ is the number of
documents that contain term i, and $freq_{i,j}$ is the number of times that term i appears in
document j.
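The weighting above can be sketched in a few lines of Python. This is a minimal illustration, not part of the test: the function name `tf_idf_weights`, the sample documents, and the use of the natural logarithm are all assumptions (the question does not fix a log base).

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """docs: list of token lists. Returns one {term: w_ij} dict per document."""
    N = len(docs)                      # total number of documents
    n = Counter()                      # n[i]: number of documents containing term i
    for doc in docs:
        n.update(set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)            # freq[i]: occurrences of term i in this document
        if not freq:                   # empty document: no weights
            weights.append({})
            continue
        max_freq = max(freq.values())  # max_l freq_{l,j}
        weights.append({
            t: (freq[t] / max_freq) * math.log(N / n[t])  # f_ij * idf_i (natural log assumed)
            for t in freq
        })
    return weights

# Hypothetical three-document collection.
docs = [["a", "b", "a"], ["b", "c"], ["b", "d", "d"]]
w = tf_idf_weights(docs)
```

In this sample, term b occurs in every document, so its idf is log(3/3) = 0 and its weight is 0 everywhere.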
Question 2. (10 points) Give the definitions of recall and precision.
Answer: Let R be the set of all relevant documents.
A: the set of all documents reported as relevant by the system.
Ra: A ∩ R, the set of relevant documents reported.
Recall = |Ra|/|R|.
Precision = |Ra|/|A|.
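As a small illustration of these two definitions, the sketch below computes both ratios with Python sets. The function name `recall_precision` and the document ids are hypothetical, and returning 0.0 for an empty set is a convention chosen here, not something the answer specifies.

```python
def recall_precision(relevant, retrieved):
    """Return (recall, precision) for a set of relevant and a set of reported documents."""
    R = set(relevant)    # all relevant documents
    A = set(retrieved)   # documents reported as relevant by the system
    Ra = R & A           # relevant documents that were reported
    recall = len(Ra) / len(R) if R else 0.0
    precision = len(Ra) / len(A) if A else 0.0
    return recall, precision

# Hypothetical example: 4 relevant documents, 3 reported, 2 in common.
r, p = recall_precision({"d1", "d2", "d3", "d4"}, {"d2", "d3", "d7"})
```

Here recall is 2/4 and precision is 2/3.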
Question 3: (15 points) Describe an O(n) time algorithm that takes a document (a string)
as input and finds the number of occurrences of each term in the document.
Answer: Construct an automaton while reading the text. There is a starting state in the
automaton, and for each word in the text, there is a final state. Each final state has a
counter indicating the number of times the corresponding word is accepted.
When reading the first letter of a word (words are separated by spaces), we start from the
starting state of the automaton and follow one transition per letter. If the word is new,
we create new states for the word (or for the part of the word that is not yet in the
automaton). When a final state is reached and the next character is a space, we increase the
counter on that state by one.
(You can use your own words to describe the algorithm. It tests if you understand
the algorithm for assignment one. Also, it tests your ability to describe
algorithms.)
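One way to realize this automaton is a trie whose nodes carry a counter. The sketch below is an assumption about how the description could be coded, not the assignment's reference solution; the names `Node` and `count_terms` are made up here, and only the space character is treated as a separator, as in the answer.

```python
class Node:
    """One automaton state: outgoing transitions plus an acceptance counter."""
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}
        self.count = 0

def count_terms(text):
    root = Node()                 # starting state
    finals = []                   # (word, final state) for each distinct word
    state, start = root, 0
    for i, ch in enumerate(text + " "):   # trailing space flushes the last word
        if ch == " ":
            if state is not root:         # a word just ended
                if state.count == 0:      # first time this word is accepted
                    finals.append((text[start:i], state))
                state.count += 1
            state, start = root, i + 1    # back to the starting state
        else:
            nxt = state.children.get(ch)
            if nxt is None:               # new word (or new suffix): create a state
                nxt = state.children[ch] = Node()
            state = nxt
    return {word: node.count for word, node in finals}

counts = count_terms("the cat and the hat the")
```

Each character triggers one dictionary lookup (expected constant time), and each distinct word is sliced out of the text once, so the total work is linear in the length of the text.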
Question 4: (15 points) Let X1=(1,0), X2 =(1,1), X3=(2,2), and X4=(2,3). D1={X1, X2}
and D2={X3, X4}. Use the Perceptron Algorithm to find a vector W such that
$\sum_{i=1}^{m} W_i X_{ij} > 0$ for each $X_j \in D_1$ and
$\sum_{i=1}^{m} W_i X_{ij} < 0$ for each $X_j \in D_2$.
Answer: (0) Change each vector from m-d to (m+1)-d by adding an element with
value 1. For example, (2, 3) → (1, 2, 3) and (5, 8) → (1, 5, 8). (It is missing in the handout.)
(1) For each X ∈ D1, if X·W ≤ 0 then increase the weight vector at the next iteration:
W = Wold + CX.
(2) For each X ∈ D2, if X·W ≥ 0 then decrease the weight vector at the next iteration:
W = Wold - CX.
C is a positive constant.
Repeat until X·W > 0 for each X ∈ D1 and X·W < 0 for each X ∈ D2.
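The line y = 1.5 is a solution: the run below ends at w = (0.0, -2.0, 3.0), i.e. -2y + 3 = 0.
Initial W = (1.0, 1.0, 1.0), C = 1. Each row lists the misclassified vector X (with the
added 1 as its last element) and the updated w:
X = (2.0, 2.0, 1.0)   w = (-1.0, -1.0, 0.0)
X = (1.0, 0.0, 1.0)   w = (0.0, -1.0, 1.0)
X = (1.0, 1.0, 1.0)   w = (1.0, 0.0, 2.0)
X = (2.0, 2.0, 1.0)   w = (-1.0, -2.0, 1.0)
X = (1.0, 0.0, 1.0)   w = (0.0, -2.0, 2.0)
X = (1.0, 1.0, 1.0)   w = (1.0, -1.0, 3.0)
X = (2.0, 2.0, 1.0)   w = (-1.0, -3.0, 2.0)
X = (1.0, 1.0, 1.0)   w = (0.0, -2.0, 3.0)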
(Everybody gets 15 points for this question.)
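The update rule can be checked with a short Python sketch. Several choices here are assumptions made to mirror the listed run rather than anything the question prescribes: the function name `perceptron`, C = 1, the initial weight (1.0, 1.0, 1.0), appending the constant 1 as the last element, sweeping X1, X2, X3, X4 in order, and updating when the product is zero as well as when it has the wrong sign.

```python
def perceptron(d1, d2, w, c=1.0, max_passes=100):
    """Find w with x.w > 0 for augmented x from d1 and x.w < 0 for augmented x from d2."""
    # Augment each vector with a constant 1 (appended last, as in the listed run).
    samples = [(tuple(x) + (1.0,), 1) for x in d1] + [(tuple(x) + (1.0,), -1) for x in d2]
    w = list(w)
    for _ in range(max_passes):
        updated = False
        for x, sign in samples:
            dot = sum(wi * xi for wi, xi in zip(w, x))
            if sign * dot <= 0:   # misclassified (or on the boundary)
                # D1: w = w + c*x ; D2: w = w - c*x
                w = [wi + sign * c * xi for wi, xi in zip(w, x)]
                updated = True
        if not updated:           # a full pass without updates: done
            return w
    raise RuntimeError("no separating vector found within max_passes")

w = perceptron([(1, 0), (1, 1)], [(2, 2), (2, 3)], w=(1.0, 1.0, 1.0))
```

Under these assumptions the function returns `[0.0, -2.0, 3.0]`, the same final vector as the run listed in the answer.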
Question 5. (10 points) Explain the following terms:
stopword, stemming.
Answer:
(1) Stem: the portion of a word that is left after the removal of its affixes (i.e.,
prefixes and suffixes). Stemming reduces words to their stems. Example: connect is the
stem of {connected, connecting, connection, connections}.
(2) Stopword: a word that appears in more than 80% of the documents in the collection;
stopwords are filtered out as potential index words.
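The 80% rule for stopwords can be illustrated with a document-frequency count. This is only a sketch: the function name `find_stopwords` and the tiny collection are hypothetical, and real systems often use a fixed stopword list instead.

```python
from collections import Counter

def find_stopwords(docs, threshold=0.8):
    """Return terms that occur in more than `threshold` of the documents."""
    if not docs:
        return set()
    df = Counter()                 # df[t]: number of documents containing t
    for doc in docs:
        df.update(set(doc))
    return {t for t, n in df.items() if n / len(docs) > threshold}

# Hypothetical collection of five tokenized documents.
docs = [
    ["the", "cat"],
    ["the", "dog"],
    ["the", "cat", "sat"],
    ["the", "bird"],
    ["the", "fish", "cat"],
]
stopwords = find_stopwords(docs)
```

In this collection only "the" appears in more than 80% of the documents ("cat" appears in 3 of 5), so it is the only term filtered out.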
Question 6. (10 points) PageRank of a page A is given as follows:
PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). Explain the meaning of
PR(Ti) and C(Ti).
Answer: PR(Ti) is the PageRank of page Ti, where T1, ..., Tn are the pages that link to
page A. C(Ti) is the number of links going out of page Ti.
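The formula can be applied repeatedly until the values settle. The sketch below does this on a small made-up link graph; the function name `pagerank`, the graph, the damping value d = 0.85, the starting value 1.0, and the fixed iteration count are all assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """links: {page: [pages it links to]}. Iterates PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    out_count = {p: len(links.get(p, [])) for p in pages}   # C(T)
    pr = {p: 1.0 for p in pages}                            # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for a in pages:
            # Sum over the pages T that link to A.
            incoming = sum(pr[t] / out_count[t] for t in pages if a in links.get(t, []))
            new_pr[a] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

With this form of the formula and no pages without outgoing links, the ranks sum to the number of pages (3 here). Page C, which is linked from both A and B, ends up with the highest value.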
Question 7. (15 points Bonus) Use the dynamic programming algorithm to compute
LCS between X=abbcdbba and Y=abcbdba.
Answer: (rows follow Y, columns follow X; the first row and column are for the empty prefix)
      a b b c d b b a
    0 0 0 0 0 0 0 0 0
  a 0 1 1 1 1 1 1 1 1
  b 0 1 2 2 2 2 2 2 2
  c 0 1 2 2 3 3 3 3 3
  b 0 1 2 3 3 3 4 4 4
  d 0 1 2 3 3 4 4 4 4
  b 0 1 2 3 3 4 5 5 5
  a 0 1 2 3 3 4 5 5 6
The LCS has length 6; one such subsequence is abcdba.
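The table can be recomputed with the standard LCS recurrence. The sketch below is a minimal version with a made-up function name `lcs_table`; it uses rows for Y and columns for X, with row 0 and column 0 for the empty prefix.

```python
def lcs_table(x, y):
    """Return the (len(y)+1) x (len(x)+1) table of LCS lengths for prefixes of y and x."""
    table = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i, yc in enumerate(y, 1):
        for j, xc in enumerate(x, 1):
            if yc == xc:
                # Matching characters extend the LCS of the two shorter prefixes.
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                # Otherwise drop one character from either string.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table

table = lcs_table("abbcdbba", "abcbdba")
for row in table:
    print(" ".join(str(v) for v in row))
```

The bottom-right entry `table[-1][-1]` is the length of the LCS of the two full strings.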
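Question 8. (15 points) Consider the Fuzzy Information Retrieval model.
Suppose that our system has d1=(1, 0, 0), d2=(1, 0, 0), d3=(0, 1, 1), d4=(0, 0, 1), and
d5=(1, 0, 0). The query is q = k1 AND k2. Compute $\mu_{q,j}$ for j=1 and 2.
Answer: (1) $n_1=3$, $n_2=1$, $n_3=2$, $n_{1,2}=0$, $n_{1,3}=0$, $n_{2,3}=1$.
(2) Using $c_{i,l} = n_{i,l}/(n_i + n_l - n_{i,l})$: $c_{1,2}=0$, $c_{1,3}=0$, and $c_{2,3}=0.5$.
(3) $q = k_1 k_2 k_3 \lor k_1 k_2 \lnot k_3$.
(4) $\mu_{1,1} = 1-(1-c_{1,1}) = 1$, $\mu_{2,1}=0$, $\mu_{3,1}=0$, $\mu_{1,2}=1$, $\mu_{2,2}=0$, $\mu_{3,2}=0$.
(5) $\mu_{q,1} = \mu_{cc_1+cc_2,1} = 1-(1-\mu_{cc_1,1})(1-\mu_{cc_2,1}) = 1-(1-\mu_{1,1}\mu_{2,1}\mu_{3,1})(1-\mu_{1,1}\mu_{2,1}(1-\mu_{3,1})) = 0$.
Since d2 = d1, $\mu_{q,2} = 0$ as well.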
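The steps of this answer can be reproduced numerically. The sketch below assumes the usual fuzzy-model definitions: the correlation c(i,l) = n(i,l) / (n(i) + n(l) - n(i,l)) and the membership of document j in the fuzzy set of term i computed as 1 minus the product of (1 - c(i,l)) over the terms l in the document. The function names are made up, and the query function is hard-coded for k1 AND k2 over three terms.

```python
docs = [(1, 0, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1), (1, 0, 0)]  # d1..d5 from the question

def correlations(docs, num_terms):
    """c[i][l] = n_il / (n_i + n_l - n_il), with 0-based term indices."""
    n = [sum(d[i] for d in docs) for i in range(num_terms)]
    c = [[0.0] * num_terms for _ in range(num_terms)]
    for i in range(num_terms):
        for l in range(num_terms):
            n_il = sum(1 for d in docs if d[i] and d[l])
            denom = n[i] + n[l] - n_il
            c[i][l] = n_il / denom if denom else 0.0
    return c

def membership(c, doc, i):
    """Degree to which doc belongs to the fuzzy set of term i."""
    prod = 1.0
    for l, present in enumerate(doc):
        if present:
            prod *= 1.0 - c[i][l]
    return 1.0 - prod

def mu_query_k1_and_k2(c, doc):
    """Membership for q = k1 AND k2, via the two conjunctive components of its DNF."""
    m1, m2, m3 = (membership(c, doc, i) for i in range(3))
    cc1 = m1 * m2 * m3            # k1 k2 k3
    cc2 = m1 * m2 * (1.0 - m3)    # k1 k2 (not k3)
    return 1.0 - (1.0 - cc1) * (1.0 - cc2)

c = correlations(docs, 3)
mu_q1 = mu_query_k1_and_k2(c, docs[0])
mu_q2 = mu_query_k1_and_k2(c, docs[1])
```

For d1 and d2, which contain only k1, the membership in the set of k2 is 0 because k1 and k2 never co-occur, so both query memberships come out as 0.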