Learning Techniques for Information Retrieval

• Perceptron algorithm
• Least-mean-square (LMS) adaptive linear model

• Let X1, X2, …, Xn be n vectors (one per document).
• D1 ∪ D2 = {X1, X2, …, Xn}, where D1 is the set of relevant documents and D2 is the set of irrelevant documents.
• D1 and D2 are obtained from user feedback.
• Question: find a weight vector W such that

  $\sum_{i=1}^{m} W_i X_{ij} > 0$ for each $X_j \in D_1$, and
  $\sum_{i=1}^{m} W_i X_{ij} < 0$ for each $X_j \in D_2$.

[Figure: a perceptron unit. Inputs X0, X1, …, Xn are multiplied by weights W0, W1, …, Wn, summed, and passed through a threshold; the output is sign(y) ∈ {+1, −1}.]

Remarks:
• W is the new vector for the query.
• W is computed from the feedback, i.e., from D1 and D2.
• The equation $\sum_{i=1}^{m} W_i X_i = 0$ defines a hyperplane.
• The hyperplane cuts the whole space into two parts; hopefully one part contains the relevant documents and the other contains the irrelevant documents.

Perceptron Algorithm

(1) For each X ∈ D1, if X·W < 0, increase the weight vector at the next iteration: W = W_old + C·X.
(2) For each X ∈ D2, if X·W > 0, decrease the weight vector at the next iteration: W = W_old − C·X.
Here C is a positive constant. Repeat until X·W > 0 for each X ∈ D1 and X·W < 0 for each X ∈ D2. A sketch of this loop appears after the convergence theorem below.

Perceptron Convergence Theorem

• The perceptron algorithm finds such a W in a finite number of iterations if the training set {X1, X2, …, Xn} is linearly separable.
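The update rule translates almost line-for-line into code. Below is a minimal sketch (not from the original notes; the function name and toy data are invented for illustration): document vectors are NumPy arrays, the constant C acts as a step size, and D1/D2 are the relevant and irrelevant feedback sets.

```python
import numpy as np

def perceptron(D1, D2, C=1.0, max_iters=1000):
    """Find W with X.W > 0 for all X in D1 and X.W < 0 for all X in D2.

    Returns W, or None if no separator is found within max_iters; the
    convergence theorem only guarantees termination when the training
    set is linearly separable.
    """
    W = np.zeros(len(D1[0]))
    for _ in range(max_iters):
        updated = False
        for X in D1:                    # relevant: want X.W > 0
            if np.dot(X, W) <= 0:
                W = W + C * np.asarray(X)
                updated = True
        for X in D2:                    # irrelevant: want X.W < 0
            if np.dot(X, W) >= 0:
                W = W - C * np.asarray(X)
                updated = True
        if not updated:                 # every document on the right side
            return W
    return None

# Toy usage: two relevant and two irrelevant document vectors.
D1 = [np.array([1.0, 2.0]), np.array([2.0, 1.5])]
D2 = [np.array([-1.0, -1.0]), np.array([-2.0, -0.5])]
print(perceptron(D1, D2))   # a W separating D1 from D2, e.g. [1. 2.]
```

The update tests use ≤ and ≥ rather than strict inequalities so that the all-zero starting point (where every dot product is 0) still triggers an update; this keeps the loop consistent with the strict stopping condition stated above.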
Query Expansion and Term Reweighting for the Vector Model

• Dr: set of relevant documents, as identified by the user, among the retrieved documents;
• Dn: set of non-relevant documents among the retrieved documents;
• Cr: set of relevant documents among all documents in the collection;
• |Dr|, |Dn|, |Cr|: number of documents in each of these sets;
• α, β, γ: tuning constants.

If the complete set Cr were known in advance, the optimal query would be

$$\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$$

Since Cr is not known, the query is modified incrementally from the feedback sets Dr and Dn:

Standard_Rocchio: $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$

Ide_Regular: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$

Ide_Dec_Hi: $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \cdot \text{max\_non\_relevant}(\vec{d}_j)$

where max_non_relevant($\vec{d}_j$) is a reference to the highest-ranked non-relevant document. (A sketch of the Rocchio update appears at the end of these notes.)

Evaluation of Relevance Feedback Strategies

• Simple way: use the modified query to search the collection again and recompute the results.
• Problem: the documents the user already judged are ranked again, so the measured improvement is not fair.
• Better way: evaluate only over the residual collection, i.e., the documents not used in the feedback.

Query Expansion Through Local Clustering

• Definition: Let V(s) be a non-empty set of words which are grammatical variants of each other. A canonical form s of V(s) is called a stem. For instance, if V(s) = {polish, polishing, polished}, then s = polish.
• Definition: For a given query q, the set Dl of documents retrieved is called the local document set. Further, the set Vl of all distinct words in the local document set is called the local vocabulary. The set of all distinct stems derived from the set Vl is referred to as Sl.

Association Clusters

• Definition: The frequency of a stem si in a document dj, dj ∈ Dl, is referred to as $f_{s_i,j}$. Let $m = (m_{i,j})$ be an association matrix with |Sl| rows and |Dl| columns, where $m_{i,j} = f_{s_i,j}$. Let $m^t$ be the transpose of m. The matrix $s = m m^t$ is a local stem-stem association matrix. Each element $s_{u,v}$ in s expresses a correlation $c_{u,v}$ between the stems $s_u$ and $s_v$, namely

$$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j} \qquad (5.5)$$
$$s_{u,v} = c_{u,v} \qquad (5.6)$$

• Normalize:

$$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}} \qquad (5.7)$$

• Definition: Consider the u-th row in the association matrix s (i.e., the row with all the associations for the stem $s_u$). Let $S_u(n)$ be a function which takes the u-th row and returns the set of n largest values $s_{u,v}$, where v varies over the set of local stems and v ≠ u. Then $S_u(n)$ defines a local association cluster around the stem $s_u$. If $s_{u,v}$ is given by equation (5.6), the association cluster is said to be unnormalized. If $s_{u,v}$ is given by equation (5.7), the association cluster is said to be normalized.
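Definitions (5.5)–(5.7) are easy to make concrete. The sketch below (not part of the original notes; the function name and toy documents are invented) builds the |Sl| × |Dl| frequency matrix m, forms s = m mᵗ, optionally applies the normalization (5.7), and returns Su(n) for every local stem. Documents are assumed to be already reduced to lists of stems, which stands in for a real stemmer.

```python
import numpy as np

def association_clusters(local_docs, n=3, normalized=True):
    """Compute S_u(n) for every stem in the local document set.

    local_docs: list of documents, each a list of stems.
    Returns a dict mapping each stem s_u to its n most associated stems.
    """
    stems = sorted({t for doc in local_docs for t in doc})
    index = {t: i for i, t in enumerate(stems)}

    # m[i, j] = f_{s_i, j}: frequency of stem i in document j.
    m = np.zeros((len(stems), len(local_docs)))
    for j, doc in enumerate(local_docs):
        for t in doc:
            m[index[t], j] += 1

    s = m @ m.T                             # eqs. (5.5)/(5.6): s = m m^t
    if normalized:                          # eq. (5.7)
        diag = np.diag(s)
        s = s / (diag[:, None] + diag[None, :] - s)

    clusters = {}
    for u, stem in enumerate(stems):
        row = s[u].copy()
        row[u] = -np.inf                    # exclude v = u
        top = np.argsort(row)[::-1][:n]     # indices of the n largest s_{u,v}
        clusters[stem] = [stems[v] for v in top]
    return clusters

docs = [["polish", "wood", "oil"],
        ["polish", "wood", "wax"],
        ["oil", "wax", "clean"]]
print(association_clusters(docs, n=2)["polish"])   # e.g. ['wood', 'oil']
```

The stems in Su(n) are the natural candidates for expanding the original query q, since they co-occur with its terms in the local document set.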
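Finally, here are the Rocchio-style updates promised at the end of the relevance-feedback section. This is a sketch under the assumption that the query and documents are dense vectors over a common term space; the default values for alpha, beta, gamma are illustrative choices, not values from the notes.

```python
import numpy as np

def rocchio(q, Dr, Dn, alpha=1.0, beta=0.75, gamma=0.15):
    """Standard_Rocchio: q_m = a*q + (b/|Dr|)*sum(Dr) - (g/|Dn|)*sum(Dn)."""
    q_m = alpha * np.asarray(q, dtype=float)
    if Dr:
        q_m += (beta / len(Dr)) * np.sum(Dr, axis=0)
    if Dn:
        q_m -= (gamma / len(Dn)) * np.sum(Dn, axis=0)
    return q_m

def ide_regular(q, Dr, Dn, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Regular: plain sums instead of centroids."""
    return (alpha * np.asarray(q, dtype=float)
            + beta * np.sum(Dr, axis=0)
            - gamma * np.sum(Dn, axis=0))

def ide_dec_hi(q, Dr, top_nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide_Dec_Hi: subtract only the highest-ranked non-relevant document."""
    return (alpha * np.asarray(q, dtype=float)
            + beta * np.sum(Dr, axis=0)
            - gamma * np.asarray(top_nonrelevant, dtype=float))

# Toy usage with a 3-term space.
q  = np.array([1.0, 0.0, 0.5])
Dr = [np.array([0.9, 0.1, 0.4]), np.array([0.8, 0.0, 0.6])]
Dn = [np.array([0.1, 0.9, 0.0])]
print(rocchio(q, Dr, Dn))
```

The three functions mirror the three formulas above; in practice the modified query q_m is then run against the collection (ideally evaluated over the residual collection, as noted earlier).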