Hypertext Categorization
Rayid Ghani, IR Seminar - 10/3/00

"Standard" Approach
- Apply traditional text learning algorithms
- In many cases the goal is not to classify hypertext but to test the algorithms
- Is it actually the right approach?

Results?
- Mixed results
- Positive results in most cases, BUT the goal was to test the algorithms
- Negative in a few, e.g. Chakrabarti, BUT the goal was to motivate their own algorithm

How is hypertext different?
- Link information
- Diverse authorship
- Short text - topic not obvious from the text alone
- Structure / position within the web graph
- Author-supplied features (meta-tags)
- Bold, italics, headings, etc.
- How do we use these extra features?

Specific approaches to classifying hypertext
- Chakrabarti et al., SIGMOD 98
- Oh et al., SIGIR 00
- Slattery & Mitchell, ICML 00
- Where the goal is retrieval rather than classification:
  - Bharat & Henzinger, SIGIR 98
  - Croft & Turtle, 93

Chakrabarti et al., SIGMOD 98
- Use the page and its linkage information
- First attempt: add words from the "neighbors" and treat them as belonging to the page itself
- Result: a decrease in performance (not surprising) - link information is very noisy
- Instead, use topic information from the neighbors

Data Sets
- IBM Patent Database: 12 classes (630 train, 300 test per class)
- Yahoo: 13 classes, 20000 docs (for expts involving hypertext, only 900 documents were used) (?)

Experiments
- Using text from neighbors:
  - Local+Neighbor_Text
  - Local+Neighbor_Text_Tagged
- Assuming neighbors are pre-classified (error rates):
  - Text: 36%
  - Link: 34%
  - Prefix: 22.1% (words in the class hierarchy used)
  - Text+Prefix: 21%

Oh et al., SIGIR 2000
- The relationship between a page's class and its neighbors in the training set is not consistent/useful (?)
- Instead, use the class and neighbor info of the page being classified (exploit regularities in the test set)

Classification
Classify test instance d by:

  argmax_c P(C | G, T)
  ≈ argmax_c P(C | T) · P(C | G)
  ≈ argmax_c [ P(c) · ∏_{i=1..|T|} P(t_i | c)^{N(t_i|d)} ] · Neighbor_d(c)

where T is d's text, G its link (neighbor) information, N(t_i|d) the count of term t_i in d, and Neighbor_d(c) the neighbor term: each link L in d's link set L_d contributes its weight w_L to the class evidence l_d(c).

Algorithm
For each test document d:
1. Generate a set A of "trustable" neighbors
2. For all terms t_i in d, adjust the term weights using the term weights from A
3. For each doc a in A, assign a max confidence value if its class is known; otherwise assign a class probabilistically and give it partial confidence weight
4. Classify d using the equation given earlier

Experiments
- Reuters used to assess the algorithm on datasets without hyperlinks - only varying the size of the training set and the # of features (?)
- Results not directly comparable, but the numbers are similar to reported results
- Articles from an encyclopedia: 76 classes, 20836 documents

Results
- Terms+Classes > Only Classes > Only Terms > No use of inlinks

Other issues
- Link discrimination
- Knowledge of neighbor classes
- Use of links in the training set
- Inclusion of new terms from neighbors

Comparison

                                   Chakrabarti   Oh et al.
  Links in training set            Y             N
  Link discrimination              N             Y
  Knowledge of neighbor class      Y             Y
  Iteration                        Y             N
  Using new terms from neighbors   Y             N

  Reported improvements: 5%, 6.7%, 6.6%, 1.9%, 1.5%, 31.4% (?)

Slattery & Mitchell, ICML 00
- Given a problem setting in which the test set contains structural regularities, how can we find and use them?

Hubs and Authorities, Kleinberg (1998): "..
a good hub is a page that points to many good authorities; a good authority is a page pointed to by many good hubs."

Hubs and Authorities, Kleinberg (1998)
- "Hubs and authorities exhibit what could be called a mutually reinforcing relationship"
- Iterative relaxation:

  Hub(p) = Σ_{q : p → q} Authority(q)
  Authority(p) = Σ_{q : q → p} Hub(q)

The Plan
- Take an existing learning algorithm
- Extend it to exploit structural regularities in the test set
- Use Hubs and Authorities as inspiration

FOIL, Quinlan & Cameron-Jones (1993)
- Learns relational rules like:
  target_page(A) :- has_research(A), link(A,B), has_publications(B).
- For each test example, pick the matching rule with the best training-set performance p and predict positive with confidence p

FOIL-Hubs Representation
- Add two rules to a learned rule set:
  target_page(A) :- link(B,A), target_hub(B).
  target_hub(A) :- link(A,B), target_page(B).
- Talk about confidence rather than truth: target_page(page15) = 0.75
- Evaluate by summing over instantiations:

  target_page(page15) = Σ_{B : link(B, page15)} target_hub(B)

FOIL-Hubs Algorithm
Learned FOIL rules: foil(A); scores: target_page(A), target_hub(A)
1. Apply the learned FOIL rules to the test set: foil(A)
2. Initialise the target_page(A) confidences from foil(A)
3. Iterate:
   a. Evaluate target_hub(A)
   b. Evaluate target_page(A)
   c. Set target_page(A) = s · target_page(A) + foil(A)
4. Report target_page(A)

Data Set
- 4127 pages from the Computer Science departments of four universities:
  - Cornell University
  - University of Texas at Austin
  - University of Washington
  - University of Wisconsin
- Hand-labeled into:
  - Student: 558 web pages
  - Course: 243 web pages
  - Faculty: 153 web pages

Experiment
- Three binary classification tasks:
  1. Student Home Page
  2. Course Home Page
  3. Faculty Home Page
- Leave-two-universities-out cross-validation

Student Home Page
[Precision-recall curves: FOIL-Hubs vs. FOIL]

Course Home Page
[Precision-recall curves: FOIL-Hubs vs. FOIL]

More Detailed Results
Partition the test data into:
- Examples covered by some learned FOIL rule
- Examples covered by no learned FOIL rule

Student - FOIL covered / Student - FOIL uncovered
[Precision-recall curves: FOIL-Hubs vs. FOIL]

Course - FOIL covered / Course - FOIL uncovered
[Precision-recall curves: FOIL-Hubs vs. FOIL]

Recap
- We've searched the test set for regularities of the form:
  student_page(A) :- link(<Web->KB members page>, A)
- We consider this an instance of the regularity schema:
  student_page(A) :- link(<page constant>, A)

Conclusions
- Test-set regularities can be used to improve classification performance
- FOIL-Hubs used such regularities to outperform FOIL on three web page classification problems
- We can potentially search for other regularity schemas using FOIL

Other work
- Using the structure of HTML to improve retrieval. Michal Cutler, Yungming Shih, Weiyi Meng. USENIX 1997
  - Uses tf-idf with different weights for text in different HTML tags
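The Oh et al. classification rule above (naive Bayes term score times a neighbor term) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the specific Neighbor_d(c) used here, a w_L-weighted vote over the classes of trustable neighbors, is an assumption, since the slides only give the factored form; the example classes, probabilities, and smoothing constants are invented.

```python
# Sketch of Oh et al.-style scoring, in log space:
#   score(c) = log P(c) + sum_i N(t_i|d) * log P(t_i|c) + log Neighbor_d(c)
# The neighbor term below (weighted fraction of trustable links whose
# class is c) is an illustrative assumption, not the paper's exact formula.
import math
from collections import Counter

def classify(doc_terms, neighbors, prior, cond, classes):
    """doc_terms: list of terms in test document d.
    neighbors: list of (class_label, w_L) pairs for d's trustable links L_d.
    prior[c]: P(c); cond[c][t]: P(t|c), assumed pre-smoothed."""
    counts = Counter(doc_terms)
    total_w = sum(w for _, w in neighbors) or 1.0
    best, best_score = None, -math.inf
    for c in classes:
        score = math.log(prior[c])
        for t, n in counts.items():
            score += n * math.log(cond[c].get(t, 1e-6))
        # Neighbor evidence: weight w_L of links whose class label is c,
        # smoothed so zero neighbor votes do not veto the term evidence.
        l_dc = sum(w for lab, w in neighbors if lab == c) / total_w
        score += math.log(l_dc + 1e-6)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical two-class example: term evidence and neighbor votes agree.
prior = {"course": 0.5, "faculty": 0.5}
cond = {"course": {"homework": 0.3, "syllabus": 0.2, "research": 0.01},
        "faculty": {"homework": 0.01, "syllabus": 0.01, "research": 0.4}}
label = classify(["homework", "syllabus"], [("course", 1.0), ("course", 0.5)],
                 prior, cond, ["course", "faculty"])
```

Doing the combination in log space avoids underflow when |T| is large; the "trustable" neighbor selection from the algorithm slide would happen before this function is called.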
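Kleinberg's iterative relaxation for the mutually reinforcing hub/authority scores can be sketched as a simple fixed-point iteration. The toy graph and the L2 normalization step are illustrative choices, not details from the talk:

```python
# Minimal sketch of Kleinberg-style hub/authority iteration (HITS).
import math

def hits(links, iterations=20):
    """links: list of (p, q) edges meaning page p points to page q."""
    nodes = {n for edge in links for n in edge}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority(p) = sum of Hub(q) over q with q -> p
        auth = {p: sum(hub[q] for (q, r) in links if r == p) for p in nodes}
        # Hub(p) = sum of Authority(q) over q with p -> q
        hub = {p: sum(auth[q] for (r, q) in links if r == p) for p in nodes}
        # Normalize (an assumed choice) so the scores stay bounded
        for d in (auth, hub):
            norm = math.sqrt(sum(v * v for v in d.values())) or 1.0
            for k in d:
                d[k] /= norm
    return hub, auth

# Hypothetical graph: h1 and h2 both point at a1 and a2,
# so h1/h2 emerge as hubs and a1/a2 as authorities.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2"), ("a1", "x")]
hub, auth = hits(edges)
```

The two update equations are exactly the ones on the "Iterative relaxation" slide; everything else here (iteration count, normalization, graph) is scaffolding.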
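The FOIL-Hubs loop (initialise page confidences from the learned rules, then alternately re-evaluate hub and page scores) can be sketched like this. The pages, links, foil(A) confidences, damping constant s, and the rescaling of the propagated term are all invented for illustration:

```python
# Sketch of the FOIL-Hubs iteration from the slides:
#   target_hub(A)  = sum of target_page(B) over B with link(A, B)
#   target_page(A) = sum of target_hub(B)  over B with link(B, A)
#   then: target_page(A) = s * target_page(A) + foil(A)

def foil_hubs(pages, links, foil, s=0.5, iterations=10):
    """links: list of (a, b) pairs meaning link(a, b); foil: rule confidences."""
    # Initialise target_page(A) confidence from foil(A)
    page = {p: foil.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # Evaluate target_hub(A) by summing instantiations
        hub = {a: sum(page[b] for (x, b) in links if x == a) for a in pages}
        # Evaluate target_page(A) from the hubs pointing at it
        prop = {a: sum(hub[b] for (b, x) in links if x == a) for a in pages}
        # Rescale the propagated term (an assumed choice, to keep scores bounded)
        m = max(prop.values()) or 1.0
        page = {a: s * (prop[a] / m) + foil.get(a, 0.0) for a in pages}
    return page

# Hypothetical example: one hub links to p1, p2, p3; FOIL's rules cover
# p1 and p2 but give p3 (an "uncovered" page) zero confidence.
pages = ["hub", "p1", "p2", "p3"]
links = [("hub", "p1"), ("hub", "p2"), ("hub", "p3")]
foil = {"hub": 0.0, "p1": 0.9, "p2": 0.8, "p3": 0.0}
scores = foil_hubs(pages, links, foil)
```

This mirrors the "uncovered examples" result in the talk: p3 gains confidence purely because the hub that points at the rule-covered pages p1 and p2 also points at it.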
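The Cutler et al. idea in the last bullet, weighting text differently by HTML tag before computing tf-idf, can be sketched as a tag-weighted term frequency. The weight values below are invented for illustration, not taken from the paper:

```python
# Toy sketch of tag-weighted term frequency: occurrences of a term are
# weighted by the HTML tag they appear in. Tag weights are hypothetical.
TAG_WEIGHTS = {"title": 4.0, "h1": 3.0, "b": 2.0, "body": 1.0}

def weighted_tf(occurrences):
    """occurrences: list of (term, tag) pairs extracted from one page."""
    tf = {}
    for term, tag in occurrences:
        tf[term] = tf.get(term, 0.0) + TAG_WEIGHTS.get(tag, 1.0)
    return tf

tf = weighted_tf([("hypertext", "title"), ("hypertext", "body"),
                  ("seminar", "b")])
# A title occurrence counts for more than a bold occurrence here.
```

The resulting weighted counts would then replace raw term frequency in a standard tf-idf computation.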