Hypertext Categorization
Rayid Ghani
IR Seminar - 10/3/00
“Standard” Approach
 Apply traditional text learning algorithms
 In many cases, the goal is not to classify hypertext but to test the algorithms
 Is it actually the right approach?
Results?
 Mixed results
 Positive results in most cases, BUT the goal was to test the algorithms
 Negative in a few cases, e.g. Chakrabarti, BUT the goal was to motivate their own algorithm
How is hypertext different?
 Link Information
 Diverse Authorship
 Short text – topic not obvious from the text
 Structure / position within the web graph
 Author-supplied features (meta-tags)
 Bold, italics, headings, etc.
How to use those extra features?
Specific approaches to classify hypertext
 Chakrabarti et al. SIGMOD 98
 Oh et al. SIGIR 00
 Slattery & Mitchell ICML 00
 Goal is not classification but retrieval:
   Bharat & Henzinger SIGIR 98
   Croft & Turtle 93
Chakrabarti et al. SIGMOD 98
 Use the page and linkage information
 Add words from the “neighbors” and treat them as belonging to the page itself
   Decrease in performance (not surprising)
   Link information is very noisy
 Use topic information from neighbors instead (see the sketch below)
Data Sets
 IBM Patent Database
   12 classes (630 train, 300 test for each class)
 Yahoo
   13 classes, 20,000 docs (for expts involving hypertext, only 900 documents were used) (?)
Experiments
 Using text from neighbors
   Local+Neighbor_Text
   Local+Neighbor_Text_Tagged
 Assume neighbors are pre-classified (error rates):
   Text – 36%
   Link – 34%
   Prefix – 22.1% (words in class hierarchy used)
   Text+Prefix – 21%
Oh et al. SIGIR 2000
 Relationship between the class of a web page and its neighbors in the training set is not consistent/useful (?)
 Instead, use the class and neighbor info of the page being classified (use regularities in the test set)
Classification
 Classify test instance d by:
   c^* = \arg\max_c P(C \mid G, T)
       \approx \arg\max_c \, P(C \mid T) \, P(C \mid G)
       \approx \arg\max_c \Big[ P(c) \prod_{i=1}^{|T|} P(t_i \mid c)^{N(t_i \mid d)} \Big] \cdot \mathit{Neighbor}_d(c)
 where
   \mathit{Neighbor}_d(c) = \sum_{l \in L_d} w_l \, l_d(c)
   (L_d: neighbors of d; w_l: confidence weight of neighbor l; l_d(c): l's vote for class c)
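As a worked sketch of this scoring rule (naming is mine, not Oh et al.'s code): the naive Bayes term product computed in log space, plus the log of the neighbor factor.

import math

def score(c, term_counts, prior, cond_prob, neighbor_votes):
    # Naive Bayes part: log P(c) + sum_i N(t_i|d) * log P(t_i|c)
    s = math.log(prior[c])
    for t, n in term_counts.items():
        s += n * math.log(cond_prob[(t, c)])
    # Neighbor part: Neighbor_d(c) = sum of w_l * l_d(c) over trustable
    # neighbors; neighbor_votes is a list of (w_l, l_d) pairs where l_d
    # maps a class to that neighbor's vote for it.
    nb = sum(w * l_d[c] for w, l_d in neighbor_votes)
    return s + math.log(max(nb, 1e-12))  # guard against an empty neighbor set

# Classify d as the argmax of score(c, ...) over all classes c.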
Algorithm
 For each test document d, generate a set A of “trustable” neighbors
 For all terms ti in d, adjust the term weights using the term weights from A
 For each doc a in A, assign a max confidence value if its class is known; otherwise assign a class probabilistically and give it partial confidence weight
 Classify d using the equation given earlier (see the sketch below)
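A hedged sketch of these steps; the threshold, helper names, and data structures are assumptions, since the slides do not specify them.

def trustable_neighbors(d_neighbors, labels, classifier, threshold=0.5):
    # Build the set A of "trustable" neighbors of a test document.
    # d_neighbors: ids of pages linked to/from the document.
    A = []
    for a in d_neighbors:
        if a in labels:
            A.append((1.0, a, labels[a]))           # known class: max confidence
        else:
            c, p = classifier.most_likely_class(a)  # assign probabilistically
            if p >= threshold:
                A.append((p, a, c))                 # partial confidence weight
    return A

def adjust_term_weights(weights, A, neighbor_weights, alpha=0.2):
    # Blend the document's term weights with those of its trusted neighbors.
    for conf, a, _ in A:
        for t, w in neighbor_weights[a].items():
            if t in weights:                        # only terms already in d
                weights[t] += alpha * conf * w
    return weights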
Experiments
 Reuters used to assess the algorithm on datasets without hyperlinks – only varying the size of the training set & # of features (?)
   Results not directly comparable, but numbers similar to reported results
 Articles from an encyclopedia – 76 classes, 20,836 documents
Results
 Terms+Classes > Only Classes > Only Terms > No use of inlinks
Other issues
 Link discrimination
 Knowledge of neighbor classes
 Use of links in training set
 Inclusion of new terms from neighbors
Comparison
                                 Chakrabarti   Oh et al.   Improvement
Links in training set                 Y            N           5%
Link discrimination                   N            Y           6.7%
Knowledge of neighbor class           Y            Y           6.6% / 1.9%
Iteration                             Y            N           1.5%
Using new terms from neighbors        Y            N           31.4%
Slattery & Mitchell ICML 00
 Given a problem setting in which the test set contains structural regularities, how can we find and use them?
Hubs and Authorities
Kleinberg (1998)
“.. a good hub is a page that points to many good authorities; a good authority is a page pointed to by many good hubs.”
[Diagram: hub pages pointing to authority pages]
Hubs and Authorities
Kleinberg (1998)
“Hubs and authorities exhibit what could be called a mutually reinforcing relationship”
Iterative relaxation:
   \mathit{Hub}(p) = \sum_{q : p \to q} \mathit{Authority}(q)
   \mathit{Authority}(p) = \sum_{q : q \to p} \mathit{Hub}(q)
[Diagram: hub pages pointing to authority pages]
The Plan
 Take an existing learning algorithm
 Extend it to exploit structural regularities in the test set
   Using Hubs and Authorities as inspiration
FOIL
Quinlan & Cameron-Jones (1993)
Learns relational rules like:
   target_page(A) :- has_research(A), link(A,B), has_publications(B).
For each test example:
 Pick the matching rule with the best training-set performance p
 Predict positive with confidence p
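A minimal sketch of this prediction step (the rule representation is an assumption): each learned rule carries the performance p it achieved on the training set, and a test page receives the confidence of its best matching rule.

def predict(page, rules):
    # rules: list of (matches, p) where matches(page) tests the rule body
    # and p is that rule's training-set performance.
    confidences = [p for matches, p in rules if matches(page)]
    return max(confidences, default=0.0)  # confidence that page is positive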
FOIL-Hubs Representation
Add two rules to a learned rule set:
 target_page(A) :- link(B,A), target_hub(B).
 target_hub(A) :- link(A,B), target_page(B).
Talk about confidence rather than truth:
 target_page(page15) = 0.75
Evaluate by summing instantiations:
   target_page(page15) = \sum_{B : link(B, page15)} target_hub(B)
FOIL-Hubs Algorithm
1. Apply learned FOIL rules: learned(A)
2. Iterate:
   1. Evaluate target_hub(A)
   2. Evaluate target_page(A)
   3. Set target_page(A) = target_page(A) × s + learned(A)
3. Report target_page(A)
FOIL-Hubs Algorithm
[Diagram: learned FOIL rules produce foil(A); target_page(A) and target_hub(A) reinforce each other]
1. Apply learned FOIL rules to the test set
2. Initialise target_page(A) confidence from foil(A)
3. Evaluate target_hub(A)
4. Evaluate target_page(A)
5. target_page(A) = target_page(A) × s + foil(A)
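A hedged sketch of this loop; the rescaling step is my own addition for numerical stability, since the slides only give the s-scaled combination in step 5.

def foil_hubs(pages, links, foil, s=0.5, iters=20):
    # foil[p]: confidence assigned to page p by the learned FOIL rules
    page_conf = dict(foil)                    # step 2: initialise from foil(A)
    hub_conf = {p: 0.0 for p in pages}
    for _ in range(iters):
        for a in pages:                       # target_hub(A) :- link(A,B), target_page(B)
            hub_conf[a] = sum(page_conf[b] for (x, b) in links if x == a)
        for a in pages:                       # target_page(A) :- link(B,A), target_hub(B)
            page_conf[a] = sum(hub_conf[b] for (b, x) in links if x == a)
        top = max(page_conf.values()) or 1.0  # rescaling: my assumption
        for a in pages:                       # step 5: mix FOIL confidence back in
            page_conf[a] = (page_conf[a] / top) * s + foil[a]
    return page_conf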
Data Set
 4127 pages from Computer Science departments of four universities: Cornell University, University of Texas at Austin, University of Washington, University of Wisconsin
 Hand-labeled into:
   Student – 558 web pages
   Course – 243 web pages
   Faculty – 153 web pages
Experiment
Three binary classification tasks:
1. Student Home Page
2. Course Home Page
3. Faculty Home Page
Leave-two-universities-out cross-validation (sketched below)
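A small sketch of this evaluation protocol, assuming the four universities named on the previous slide:

from itertools import combinations

UNIVERSITIES = ["cornell", "texas", "washington", "wisconsin"]

def leave_two_out_splits():
    # Every way of holding two universities out as the test set and
    # training on the pages from the remaining two.
    for held_out in combinations(UNIVERSITIES, 2):
        train = [u for u in UNIVERSITIES if u not in held_out]
        yield train, list(held_out)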
Student Home Page
[Precision-recall curves: FOIL-Hubs vs. FOIL]
Course Home Page
[Precision-recall curves: FOIL-Hubs vs. FOIL]
More Detailed Results
Partition the test data into:
 Examples covered by some learned FOIL rule
 Examples covered by no learned FOIL rule
Student – FOIL covered
[Precision-recall curves: FOIL-Hubs vs. FOIL]
Student – FOIL uncovered
[Precision-recall curves: FOIL-Hubs vs. FOIL]
Course – FOIL covered
[Precision-recall curves: FOIL-Hubs vs. FOIL]
Course – FOIL uncovered
[Precision-recall curves: FOIL-Hubs vs. FOIL]
Recap
 We’ve searched the test set for regularities of the form
   student_page(A) :- link(Web->KB members page, A)
 We consider this an instance of the regularity schema
   student_page(A) :- link(<page constant>, A)
Conclusions
 Test set regularities can be used to improve classification performance
 FOIL-Hubs used such regularities to outperform FOIL on three Web page classification problems
 We can potentially search for other regularity schemas using FOIL
Other work
 Using the structure of HTML to improve retrieval. Michal Cutler, Yungming Shih, Weiyi Meng. USENIX 1997
   Use tf-idf with different weights for text in different HTML tags
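A sketch of that idea; the tag weights below are placeholders, not the values from the paper. Term occurrences are weighted by the HTML tag they appear in before computing tf-idf.

TAG_WEIGHT = {"title": 4.0, "h1": 3.0, "h2": 2.5, "b": 2.0, "i": 2.0}

def weighted_tf(occurrences):
    # occurrences: list of (term, enclosing_tag) pairs from an HTML parser
    tf = {}
    for term, tag in occurrences:
        tf[term] = tf.get(term, 0.0) + TAG_WEIGHT.get(tag, 1.0)
    return tf  # feed into the usual tf-idf computation in place of raw counts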