NewsWeeder: Learning to Filter Netnews
By: Ken Lang
Presented by: Salah Omer

Outline: Introduction, Theoretical Framework, Approach, Analysis & Evaluation

Introduction:
NewsWeeder is a Netnews filtering system that addresses the classification problem by letting the user rate his or her interest level for each article read. The goal of an information filtering system is to sort through large volumes of information and present to the user those documents that are likely to satisfy his or her information requirement.

Introduction:
Several learning techniques are used in information filtering: a rule-based, user-driven mode; an explicit-learning mode; and an implicit-learning mode.

Introduction: Rule-based, user-driven mode
This mode requires direct and explicit user input to establish and maintain the user model. SIFT (Yan & Garcia-Molina, 1995), Lens (Malone et al., 1987), and Infoscope (Fischer & Stevens, 1991) are examples of systems that use this approach. Some research systems have extended it with complex rule-based keyword-matching profiles; however, average users are unlikely to invest the effort these complex systems require.

Introduction: Explicit-learning mode
This mode creates and updates the user model by eliciting explicit user feedback. Ratings on presented items, provided by users on one or more ordinal or qualitative scales, are often employed to directly indicate users' preferences.

Introduction: Implicit-learning mode
This mode aims to acquire users' preferences with minimal or no additional effort from them. Kass and Finin (1988) defined the implicit mode as "observing the behavior of the user and inferring facts about the user from the observed behavior".

Theoretical Framework
The framework rests on two ideas: TF-IDF and the Minimum Description Length (MDL) principle.

Theoretical Framework: TF-IDF
NewsWeeder uses the information retrieval technique of term frequency / inverse document frequency (tf-idf). As covered in class, the technique is based on two empirical observations. First, the more times a token appears in a document (its term frequency, TF), the more likely the term is relevant to that document. Second, the more documents in the collection a term appears in, the less it discriminates between documents (inverse document frequency, IDF).

Theoretical Framework: Minimum Description Length (MDL)
MDL is based on the following insight: any regularity in the data can be used to compress the data, i.e. to describe it using fewer symbols than are needed to describe the data literally. The more we are able to compress the data, the more we have learned about it.

Theoretical Framework: Minimum Description Length (MDL)
MDL provides a framework for balancing the tradeoff between model complexity and training error. In the NewsWeeder domain, the tradeoff involves the importance of each token and deciding which ones to drop and which to keep.

Theoretical Framework: Minimum Description Length (MDL)
MDL is grounded in Bayes's rule:
P(H|D) = P(D|H) P(H) / P(D)
To maximize P(H|D) over hypotheses H, we maximize P(D|H) P(H); equivalently, we minimize -log(P(D|H) P(H)) = -log P(D|H) - log P(H).

Theoretical Framework: Minimum Description Length (MDL)
The MDL interpretation of -log P(D|H) - log P(H), following Shannon's information theory, is to find the hypothesis that minimizes the total encoding length: the bits needed to encode the hypothesis plus the bits needed to encode the data given the hypothesis.
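To make the encoding-length view concrete, here is a minimal Python sketch (not taken from NewsWeeder) that compares two hypothetical models of a small binary sequence by total description length: bits to encode the model's parameters plus -log2 of the data's probability under the model. The data, the "fair coin" and "biased coin" hypotheses, and the 0.5·log2(n)-bit parameter cost are all illustrative assumptions.

```python
import math

def description_length_bits(data, p_one, model_cost_bits):
    """Total MDL cost: bits to encode the model's parameters plus
    bits to encode the data under that model (-log2 likelihood)."""
    data_bits = sum(-math.log2(p_one if x == 1 else 1.0 - p_one) for x in data)
    return model_cost_bits + data_bits

# A mostly-ones sequence: there is some regularity to exploit.
data = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]

# Hypothesis A: a fair coin, with no extra parameters to transmit.
cost_a = description_length_bits(data, p_one=0.5, model_cost_bits=0.0)

# Hypothesis B: a biased coin fit to the data; it fits the data better,
# but we also pay (here, 0.5 * log2(n) bits) to encode the extra parameter.
p_fit = sum(data) / len(data)
cost_b = description_length_bits(data, p_one=p_fit,
                                 model_cost_bits=0.5 * math.log2(len(data)))

print(f"fair coin:   {cost_a:.2f} bits")
print(f"biased coin: {cost_b:.2f} bits")
print("MDL prefers the", "biased" if cost_b < cost_a else "fair", "coin model")
```

Here the regularity (mostly ones) lets the biased-coin hypothesis compress the data enough to pay for its extra parameter, so MDL prefers it; with near-random data it would not.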
Approach: Representation
First, raw text is parsed into tokens; for NewsWeeder, tokens are kept at the word level. Second, a vector of token counts is created for each document, using the bag-of-words approach and keeping the words in their unstemmed form.

Approach: Learning: TF-IDF
NewsWeeder uses the weight obtained by multiplying tf and idf, as expressed in the formula
w(t,d) = tf(t,d) · log(|N| / df(t))
where tf(t,d) is the count of term t in document d, |N| is the number of documents, and df(t) is the number of documents containing t. The logarithm is used to dampen large values.

Approach: Learning: TF-IDF
In NewsWeeder the documents in each rating category are converted into tf-idf vectors, and their average forms the category's prototype vector. To classify a new document, it is first compared to each prototype vector and given a predicted rating based on its cosine similarity to each category; second, the result of this categorization step is converted to a continuous rating value using a linear regression. A sketch of this pipeline follows the algorithm summary below.

Approach: Learning: MDL
First, perform the categorization step, then convert the categorization similarities into a continuous rating prediction using
argmax_ci p(ci | Td, ld, Dtrain) = argmin_ci [ -log p(Td | ci, ld, Dtrain) - log p(ci | ld, Dtrain) ]
where d is a document, Td its token vector, ld its length, and Dtrain the training data. The most probable category ci for d is the one that minimizes the bits needed to encode Td plus the bits needed to encode ci.

Approach: Learning: MDL
Second, apply the probabilistic model to compute the similarity for each training document. The probability of the data in a document, given its length and category, is the product of the individual token probabilities:
p(Td | ci, ld, Dtrain) = ∏_i p(t_i,d | ci, ld, Dtrain)

Approach: Learning: MDL
To derive the probability estimate for t_i,d:
1. Compute the number of documents containing token ti, i.e. ∑_{j∈N} t_i,j.
2. Compute the correlation r_i,l between t_i,d and ld.
3. Combine the token probability distribution that is independent of document length with the one that depends on document length, weighted by r_i,l.

Approach: Learning: MDL
The model is based on the hypothesis that each token either has a specialized distribution for a category or is unrelated to it. MDL chooses the category-specific hypothesis only if the total bits saved by using it,
∑_{d ∈ N_ck} [ -log p(t_i,d | ld) - ( -log p(t_i,d | ld, ck) ) ],
is greater than the complexity cost of including the extra parameters.

Learning Algorithm Summary for TF-IDF and MDL
For both approaches:
1. Divide the articles into training and test (unseen) sets.
2. Parse the training articles, throwing out tokens that occur fewer than three times in total.
3. Compute ti and r_i,l for each token.

For TF-IDF:
1. Throw out the M most frequent tokens over the entire training set.
2. Compute the term weights, normalize the weight vector for each article, and find the average of the vectors for each rating category.
3. Compute the similarity of each training document to each rating-category prototype using the cosine similarity metric.

For MDL:
1. Decide for each token t and category c whether to use p(t|l,c) = p(t|l) or a category-dependent model for when t occurs in c; then pre-compute the encoding lengths for no tokens occurring, for documents in each category.
2. Compute the similarity of each training document to each rating category as the inverse of the number of bits needed to encode Td under that category's probabilistic model.

For both approaches:
1. Using the similarity measurements computed above (the TF-IDF or MDL similarity steps) on the training data, compute a linear regression from rating-category similarities to continuous rating predictions.
2. Apply the learned similarity models and the linear regression to the test articles to obtain their predicted ratings.
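As a rough illustration of the TF-IDF branch of the algorithm summary above, the following Python sketch builds per-rating-category prototype vectors from normalized tf-idf vectors and scores a new article by cosine similarity. The tokens, ratings, and helper names (tfidf_vector, cosine) are hypothetical rather than NewsWeeder's code, and the final linear-regression step from similarities to a continuous rating is only indicated in a comment.

```python
import math
from collections import Counter, defaultdict

def tfidf_vector(tokens, df, n_docs):
    """w(t,d) = tf(t,d) * log(|N| / df(t)), then L2-normalized."""
    tf = Counter(tokens)
    vec = {t: c * math.log(n_docs / df[t]) for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical training articles: (word-level tokens, rating category 1..5).
train = [
    (["nasa", "launch", "orbit"], 1),
    (["nasa", "shuttle", "orbit", "launch"], 2),
    (["sale", "cheap", "offer"], 5),
    (["offer", "sale", "sale"], 5),
]

# Document frequency of each token over the training set.
df = Counter(t for tokens, _ in train for t in set(tokens))
n_docs = len(train)

# Average the normalized tf-idf vectors within each rating category
# to obtain one prototype vector per category.
prototypes = defaultdict(lambda: defaultdict(float))
counts = Counter()
for tokens, rating in train:
    counts[rating] += 1
    for t, w in tfidf_vector(tokens, df, n_docs).items():
        prototypes[rating][t] += w
for rating, proto in prototypes.items():
    for t in proto:
        proto[t] /= counts[rating]

# Score a new article by cosine similarity to each prototype; NewsWeeder
# then maps these similarities to a continuous predicted rating with a
# linear regression (omitted here).
new_doc = tfidf_vector(["nasa", "orbit", "telescope"], df, n_docs)
sims = {r: cosine(new_doc, proto) for r, proto in prototypes.items()}
print(sorted(sims.items(), key=lambda kv: -kv[1]))
```

The token "telescope" never appears in training, so it is simply dropped from the new article's vector, mirroring the fact that unseen tokens contribute nothing to the similarity.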
Result Evaluation:
Two methods are used to evaluate performance on the results (see the sketch at the end of this section):
1. The precision metric: the ratio of relevant documents retrieved to all documents retrieved.
2. The confusion matrix of errors generated by the text classifier: each column of the matrix represents the instances in a predicted class, while each row represents the instances in an actual class.

Data
Table 1: Rating Labels. An article is considered "interesting" if it is rated 2 or better.
1 (Essential): for articles not to be missed if at all possible.
2 (Interesting): for articles of definite interest.
3 (Borderline): for articles the user is uncertain about his or her interest in and would rather not commit to.
4 (Boring): for articles that are not interesting.
5 (Gong): for articles the user wants to heavily weight against seeing again, perhaps because they are so clearly irritating to have in the list.
Skip: for articles the user does not even want to read (note that this label may cover several of the above ratings, as those can only be used if the user actually requests to see the article).

Rating Data
Table 2 summarizes the data collected over one year; the NewsWeeder experiment involved 40 individuals. Since the model assumes a stable distribution and has no temporal dependence, user interests that lasted less than the one-year period add some amount of error to the measured performance. User B rated 16% of the articles as interesting (a rating of 1 or 2); it is possible that a considerably smaller percentage of interesting articles appears in the newsgroups read by User A.

Table 2: Article/Rating Data for Two Users
Rating                        User A        User B
1                             27 (1%)       29 (3%)
2                             475 (11%)     143 (14%)
3                             854 (20%)     67 (6%)
4                             935 (22%)     56 (5%)
5                             57 (1%)       17 (2%)
Skip                          1995 (46%)    732 (70%)
Total                         4343          1044
Total interesting (1 or 2)    502 (12%)     172 (16%)

TF-IDF Performance Analysis:
Graph 1 shows the effect of removing the M most frequent words on precision within the top 10% of articles by predicted rating. Five trials were run, each with an 80/20 training/test split, removing from 100 to 400 words; the best precision, 43%, was reached at 300 words removed.

MDL Performance Analysis:
Graph 2 shows that MDL outperforms TF-IDF for both User A and User B. Performance is measured as the percentage of interesting articles found among the top 10% of articles with the highest predicted rating. MDL reaches a precision of 44% for User A and 59% for User B, compared with TF-IDF's 37% for A and 49% for B.

Table 3 shows the confusion matrix for the MDL categorization of articles for User A in a single trial.
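The sketch below, again a hypothetical Python illustration rather than NewsWeeder's evaluation code, shows one way to compute the two reported measures: precision within the top 10% of articles by predicted rating, and a confusion matrix with actual ratings as rows and predicted ratings as columns. The sample ratings are made up, and predictions are kept discrete for simplicity even though NewsWeeder's predicted ratings are continuous.

```python
from collections import Counter

def precision_at_top(actual, predicted, fraction=0.10, interesting=(1, 2)):
    """Fraction of the top `fraction` of articles (ranked by predicted
    rating, where 1 = essential is best) whose actual rating is 1 or 2."""
    ranked = sorted(zip(predicted, actual))          # best predictions first
    top = ranked[:max(1, int(len(ranked) * fraction))]
    return sum(1 for _, a in top if a in interesting) / len(top)

def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes (as in Table 3)."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

# Hypothetical ratings for a handful of test articles.
actual    = [1, 2, 4, 5, 2, 3, 4, 2, 5, 1]
predicted = [2, 1, 4, 4, 3, 3, 5, 2, 5, 2]

print("precision in top 10%:", precision_at_top(actual, predicted))
for row in confusion_matrix(actual, predicted, labels=[1, 2, 3, 4, 5]):
    print(row)
```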