NewsWeeder: Learning to Filter Netnews

By: Ken Lang
Presented by Salah Omer
Introduction
Theoretical Framework
Approach
Analysis & Evaluation
Introduction:
NewsWeeder is a Netnews filtering system
that addresses the classification problem
by letting the user rate his or her interest
level for each article being read.
The goal of an information filtering system
is to sort through large volumes of
information and present to the user those
documents that are likely to satisfy his or
her information requirement.
Introduction:
Several learning techniques are used in
information filtering
 Rule-based user-driven mode.
 Explicit-learning mode.
 Implicit-learning mode.
Introduction:
 Rule-based user-driven mode:
This mode requires direct and explicit user input
to establish and maintain the user model.
SIFT (Yan & Garcia-Molina, 1995), Lens
(Malone et al., 1987), and Infoscope (Fischer &
Stevens, 1991) are some examples of systems
that use this approach. Some research systems
have extended this approach with complex
rule-based keyword-matching profiles. However,
average users are unlikely to follow these
complex systems.
Introduction:
Explicit-learning mode: This mode
creates and updates the user model by
eliciting explicit user feedback. Ratings on
presented items, provided by users on
one or more ordinal or qualitative
scales, are often employed to directly
indicate users’ preferences.
Introduction:
Implicit-learning mode: This mode aims
to acquire users’ preferences with minimal
or no additional effort from them. Kass, &
Finin (1988) defined implicit mode as
“observing the behavior of the user and
inferring facts about the user from the
observed behavior”.
Theoretical Framework
TF-IDF
MDL
Theoretical Framework
TF-IDF
NewsWeeder uses the information-retrieval
technique of term frequency–inverse
document frequency (TF-IDF).
As covered in class, the technique is
based on two empirical observations.
Theoretical Framework
TF-IDF
First, the more times a token appears in a
document, i.e. its term frequency (TF), the
more likely the term is relevant to the
document.
Second, the more often a term appears
across the documents in the collection,
the less it discriminates between documents
(inverse document frequency, IDF).
Theoretical Framework
Minimum Description Length
MDL is based on the following insight:
Any regularity in the data can be used to
compress the data, i.e. to describe it using
fewer symbols than the number of symbols
needed to describe the data literally.
The more we are able to compress the data,
the more we have learned about the data.
Theoretical Framework
Minimum Description Length
MDL provides the framework for balancing
the tradeoff between model complexity
and training error.
In the NewsWeeder domain, the tradeoff
involves the importance of each token and
how to decide which one to drop and
which one to keep.
Theoretical Framework
Minimum Description Length
MDL is based on Bayes’ rule:
P(H|D) = P(D|H)P(H) / P(D)
To find the hypothesis H that maximizes
P(H|D), we need to maximize P(D|H)P(H),
or, equivalently,
Theoretical Framework
Minimum Description Length
P(H|D) = P(D|H)P(H) / P(D)
to minimize
-log(P(D|H)P(H))
which is equal to
-log(P(D|H)) - log(P(H))
Theoretical Framework
Minimum Description Length
The MDL interpretation of
-log(P(D|H)) - log(P(H)),
following Shannon’s information theory, is to
find the hypothesis that minimizes the total
encoding length: the bits needed to encode the
hypothesis plus the bits needed to encode the
data given the hypothesis.
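The two-part trade-off above can be sketched in a few lines of Python. This is an illustrative toy (a fair coin versus a fitted biased coin, with an assumed fixed parameter cost) and not part of the paper; it only shows how "bits for the model plus bits for the data" selects a hypothesis.

```python
import math

# Two-part MDL sketch (illustrative, not from the paper): score each
# hypothesis by -log2 P(D|H) (optimal code length for the data) plus
# an assumed cost in bits for encoding the hypothesis itself.
def description_length_bits(data, p, model_cost_bits):
    data_bits = sum(-math.log2(p if x else 1.0 - p) for x in data)
    return data_bits + model_cost_bits

data = [1] * 90 + [0] * 10            # a heavily biased binary sample
fair_bits = description_length_bits(data, 0.5, 0.0)   # no free parameters
p_hat = sum(data) / len(data)
biased_bits = description_length_bits(data, p_hat, 8.0)  # assumed 8-bit parameter
best = "biased" if biased_bits < fair_bits else "fair"
```

Here the biased model pays 8 extra bits for its parameter but saves far more on the data, so MDL prefers it; on a near-uniform sample the fair model would win.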
Approach:
Representation:
First, raw text is parsed into tokens;
for NewsWeeder, tokens are kept at the
word level.
Second, a vector of token counts is created
for each document, using the bag-of-words
approach and keeping words in their
unstemmed form.
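The two representation steps above can be sketched as follows. The exact tokenizer NewsWeeder used is not given in the slides, so the regex below is an assumption; the point is word-level tokens counted into a bag of words, unstemmed.

```python
import re
from collections import Counter

def tokenize(text):
    # Word-level tokens; lowercasing and this regex are assumptions.
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    # Count each unstemmed token: the bag-of-words vector.
    return Counter(tokenize(text))

doc = "Netnews filtering filters netnews articles"
counts = bag_of_words(doc)
# "filters" and "filtering" stay distinct because no stemming is applied
```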
Approach:
Learning:TF-IDF
NewsWeeder uses the weight derived
from the product of TF and IDF, as
expressed in the formula
w(t,d) = tf(t,d) · log(|N| / df(t))
where |N| is the number of documents and
df(t) is the number of documents containing t.
The logarithm dampens large values.
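The weight formula can be computed directly from token-count vectors; a minimal sketch (helper names are illustrative):

```python
import math
from collections import Counter

def tfidf_vectors(docs_tokens):
    # w(t,d) = tf(t,d) * log(|N| / df(t)) for each token in each document.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))        # document frequency: one count per doc
    vectors = []
    for tokens in docs_tokens:
        tf = Counter(tokens)
        vectors.append({t: c * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = [["news", "filter", "news"], ["news", "rating"], ["filter", "rule"]]
vecs = tfidf_vectors(docs)
# "news" occurs in 2 of 3 docs, so its weight in doc 0 is 2 * log(3/2);
# a token appearing in every document would get weight log(1) = 0.
```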
Approach:
Learning:TF-IDF
In NewsWeeder, the documents in
each category are converted into TF-IDF
vectors.
Approach
Learning:TF-IDF
To classify a new document, it is first
compared to each category’s prototype
vector and given a predicted rating based
on its cosine similarity to each category.
Second, the result of the categorization
procedure is converted to a continuous
value using linear regression.
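The two-stage prediction can be sketched as below. The prototype vectors and regression coefficients here are made-up placeholders; only the shape of the computation (cosine to each prototype, then a linear map to a rating) follows the slides.

```python
import math

def cosine(u, v):
    # Cosine similarity between sparse dict vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def predict_rating(doc_vec, prototypes, coeffs, intercept):
    # Similarity to each rating-category prototype, then linear regression
    # (coeffs/intercept would be fit on training data; these are assumed).
    sims = [cosine(doc_vec, p) for p in prototypes]
    return intercept + sum(c * s for c, s in zip(coeffs, sims))

protos = [{"great": 1.0}, {"boring": 1.0}]   # illustrative prototypes
doc = {"great": 2.0, "news": 1.0}
rating = predict_rating(doc, protos, coeffs=[-2.0, 2.0], intercept=3.0)
# high similarity to the first prototype pulls the rating toward 1 (best)
```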
Approach
Learning:MDL
 First, perform the categorization step, then
convert the categorization similarities into a
continuous rating prediction using
argmax over ci of p(ci | Td, ld, Dtrain) =
argmin over ci of { -log(p(Td | ci, ld, Dtrain)) - log(p(ci | ld, Dtrain)) }
Given
d = document,
Td = token vector of d,
ld = length of d,
Dtrain = training data,
the most probable category ci for d is the one that
minimizes the bits needed to encode Td plus the bits needed to encode ci.
Approach
Learning:MDL
Second, apply the probabilistic model to
compute the similarity for each training
document.
The probability of the tokens in a document,
given its length and category, is the product
of the individual token probabilities:
p(Td | ci, ld, Dtrain) = ∏ p(ti,d | ci, ld, Dtrain)
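In practice a product of many small probabilities is computed as a sum of logs. The sketch below scores a document under two categories with made-up token probabilities; the log-probability is also (up to sign and base) the encoding length in bits that MDL minimizes.

```python
import math

def log_prob_doc(tokens, token_probs, floor=1e-6):
    # log of the product of per-token probabilities, in log space to
    # avoid underflow; the floor for unseen tokens is an assumption.
    return sum(math.log(token_probs.get(t, floor)) for t in tokens)

# Illustrative per-category token models (not real NewsWeeder estimates):
probs_interesting = {"baseball": 0.02, "trade": 0.01}
probs_boring = {"baseball": 0.001, "trade": 0.001}
doc = ["baseball", "trade", "baseball"]

score_i = log_prob_doc(doc, probs_interesting)
score_b = log_prob_doc(doc, probs_boring)
# the category with the higher log-probability needs fewer bits to
# encode the document, so it wins the argmin over encoding lengths
```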
Approach
Learning:MDL
 Derive the probability estimate for ti,d:
Compute the total count of token i over all
documents, ti = ∑ ti,j for j Є N,
and the correlation ri,l between ti,d and ld.
 Combine the token probability distribution that
is independent of the document length with
the one that is dependent on the document
length, weighted by ri,l.
Approach
Learning:MDL
 The model is based on the hypothesis that each
token either has a specialized distribution for a
category or is unrelated to it. MDL chooses the
category-specific hypothesis if the total bits
saved by using that hypothesis,
∑ over d Є Nck of { -log(p(ti,d | ld)) - [-log(p(ti,d | ld, ck))] },
is greater than the complexity cost of
including the extra parameters.
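The selection test above reduces to a simple comparison of bits saved against parameter cost. A minimal sketch, with an assumed fixed parameter cost and constant per-occurrence probabilities for clarity:

```python
import math

def bits(p):
    # Shannon code length in bits for an event of probability p.
    return -math.log2(p)

def keep_category_model(p_shared, p_category, occurrences, param_cost_bits):
    # Bits saved by encoding each occurrence under the category-specific
    # distribution instead of the shared (category-independent) one;
    # keep the specialized model only if the savings exceed the cost.
    saved = occurrences * (bits(p_shared) - bits(p_category))
    return saved > param_cost_bits

# A token 10x more likely within a category than overall (illustrative
# numbers): frequent enough, the extra parameter pays for itself.
keep = keep_category_model(0.001, 0.01, occurrences=20, param_cost_bits=32.0)
```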
Learning Algorithm Summary for
TF-IDF and MDL
For the two approaches:
1. Divide the articles into training and test
(unseen) sets.
2. Parse the training articles, throwing out
tokens occurring fewer than three times
in total.
3. Compute ti and ri,l for each token.
Learning Algorithm Summary for
TF-IDF and MDL
 For TF-IDF
1. Throw out the M most frequent tokens over the
entire training set.
2. Compute the term weights, normalize the
weight vector for each article, and find the
average of the vectors for each rating
category.
3. Compute the similarity of each training
document to each rating category prototype
using the cosine similarity metric.
Learning Algorithm Summary for
TF-IDF and MDL
 For MDL
1. Decide, for each token t and category c,
whether to use p(t|l,c) = p(t|l) or a
category-dependent model for when t occurs
in c. Then pre-compute, for documents in
each category, the encoding length for the
case where no tokens occur.
2. Compute the similarity of each training
document to each rating category by taking the
inverse of the number of bits needed to
encode Td under the category’s probabilistic
model.
Learning Algorithm Summary for
TF-IDF and MDL
 For the two approaches:
1. Using the similarity measurements
computed on the training data (step 3 for
TF-IDF, step 2 for MDL), compute a linear
regression from rating category
similarities to continuous rating
predictions.
2. Apply the regression model to the
similarity measurements of the test
articles.
Result Evaluation:
 Two methods are used to evaluate the
performance of the results:
1. The precision metric (the ratio of relevant
documents retrieved to all documents retrieved).
2. The confusion matrix (each column of the matrix
represents the instances in a predicted class,
while each row represents the instances in an
actual class) of errors generated by the text
classifier.
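Both evaluation methods are a few lines of code; a self-contained sketch on toy data:

```python
from collections import Counter

def precision(retrieved, relevant):
    # Fraction of retrieved documents that are relevant.
    hits = sum(1 for d in retrieved if d in relevant)
    return hits / len(retrieved) if retrieved else 0.0

def confusion_matrix(actual, predicted, labels):
    # Rows = actual class, columns = predicted class.
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

retrieved = ["d1", "d2", "d3", "d4"]   # toy ranked retrieval
relevant = {"d1", "d3"}
p = precision(retrieved, relevant)

actual = [1, 2, 2, 3]                  # toy rating labels
predicted = [1, 2, 3, 3]
cm = confusion_matrix(actual, predicted, labels=[1, 2, 3])
# off-diagonal entries of cm are the classifier's errors
```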
Data
Table 1: Rating Labels
The “interesting” label indicates the article is rated 2 or better.

Rating  Label        Intended Use
1       Essential    For articles not to be missed if at all possible
2       Interesting  For articles of definite interest
3       Borderline   For articles the user is uncertain about his interest in and would rather not make a commitment
4       Boring       For articles not interesting
5       Gong         For articles the user wants to heavily weight against seeing again, perhaps because they are so clearly irritating to have in the list
Skip    Skip         For articles the user does not even want to read (note that this category may cover several of the above ratings, as they can only be used if the user actually requests to see the article)
Data
 Table 2 summarizes the data collected over
one year. The NewsWeeder experiment used 40
individuals over that year. Since the model
assumes a stable distribution pool and has no
temporal dependence, user interests that
lasted less than the one-year period add
some amount of error to the performance.
 Although User B rated 16% of the articles as
interesting (a rating of 1 or 2), it is possible for a
considerably smaller percentage of interesting
articles to be in the newsgroups read by User A.
Table 2: Article/Rating Data for Two Users

Rating                      User A      User B
1                           27 (1%)     29 (3%)
2                           475 (11%)   143 (14%)
3                           854 (20%)   67 (6%)
4                           935 (22%)   56 (5%)
5                           57 (1%)     17 (2%)
Skip                        1995 (46%)  732 (70%)
Total                       4343        1044
Total Interesting (1 or 2)  502 (12%)   172 (16%)
TF-IDF Performance Analysis:
Graph 1 shows the effect of removing the
M most frequent words on precision in the
top 10% of articles by predicted rating.
Five trials, each with an 80/20
training/test split, removed from 100 to 400
words; the best precision, 43%, was
reached at 300 words removed.
MDL Performance Analysis:
Graph 2 shows that MDL outperforms TF-IDF
for both User A and User B. Performance
is measured as the percentage of
interesting articles found in the top 10%
with the highest predicted ratings.
MDL reaches a precision of 44% for User A
and 59% for User B, compared to TF-IDF
(37% for A and 49% for B).
Table 3 shows the confusion matrix for the MDL categorization of articles
for User A in a single trial.