Classifying University Web Pages According to Academic Field Richard Wang Tim Isganitis

11-709 Read the Web: Project Proposal
Classifying University Web Pages
According to Academic Field
Richard Wang
Tim Isganitis
01/26/2006
Goal
• Learn how to classify web pages according to
the academic field they relate to.
– We (loosely) define academic fields to correspond to
academic departments. For example:
• Computer Science
• Biological Science
• Public Policy
– We predefine the department names, but an
alternative (harder) method is to recognize the names
of departments and cluster them according to a
broader notion of “field.”
Redundant Features
• Domain Name
– www.cs.cmu.edu (Computer Science)
– www.bio.indiana.edu (Biology)
– We assume that most pages under these domains
have to do with the given field.
• Text of Hyperlink
– <a href="www.csd.cs.cmu.edu">Computer Science Department</a>
• Words on a web page
– Incorporate word features
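As a rough illustration of the three redundant features, the sketch below pulls the link target, the hyperlink text, and the page words out of an HTML snippet using Python's standard `html.parser` module. The class name `LinkTextParser` and the sample snippet are assumptions made for this example; the proposal does not specify an extraction method.

```python
import re
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, anchor text) pairs and the words of a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (href, anchor text) pairs
        self.words = []      # lowercase words found in text nodes
        self._href = None    # href of the <a> tag currently open, if any
        self._anchor = []    # text fragments seen inside the open <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z]+", data.lower()))
        if self._href is not None:
            self._anchor.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse line breaks and repeated spaces inside the anchor text.
            text = " ".join("".join(self._anchor).split())
            self.links.append((self._href, text))
            self._href = None


parser = LinkTextParser()
parser.feed(
    '<p>Visit the <a href="www.csd.cs.cmu.edu">Computer Science\n'
    "Department</a> page.</p>"
)
print(parser.links)
print(parser.words)
```

The domain name comes from each `href`, the hyperlink text from the second element of each pair, and `words` can feed a bag-of-words representation.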
Domain Name Classifier
• Use a dictionary to associate strings that appear
in a domain name with types of field.
– Probably position dependent:
• Look for strings <dept> to fill www.<dept>.<school>.edu
– For example:
• Suppose 51% of web pages under www.cs.abc.edu are classified as
“Computer Science”
• Then assume all web pages under “www.cs.<any school>.edu”
are related to the field of Computer Science
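The position-dependent dictionary lookup described above could be sketched as follows. The dictionary contents and the function name `classify_domain` are assumptions for illustration; only the `www.<dept>.<school>.edu` pattern comes from the proposal.

```python
from urllib.parse import urlparse

# Hypothetical dictionary from department tokens to academic fields.
DEPT_TOKENS = {
    "cs": "Computer Science",
    "bio": "Biology",
    "ri": "Robotics",
}


def classify_domain(url):
    """Return the field implied by the <dept> slot of www.<dept>.<school>.edu, or None."""
    # urlparse only fills in the host when a scheme or leading "//" is present.
    parsed = urlparse(url if "://" in url else "//" + url)
    parts = (parsed.hostname or "").lower().split(".")
    # Position dependent: accept only hosts shaped exactly like www.<dept>.<school>.edu.
    if len(parts) == 4 and parts[0] == "www" and parts[3] == "edu":
        return DEPT_TOKENS.get(parts[1])
    return None


print(classify_domain("www.cs.cmu.edu"))
print(classify_domain("http://www.bio.indiana.edu/index.html"))
print(classify_domain("www.cmu.edu"))
```

Hosts that do not match the four-part pattern return `None` rather than a guess, which keeps the derived training labels conservative.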
Academic Page Classifier
• Train a classifier on academic web pages
– Labels of web pages are derived from the domain
name using Domain Name Classifier
– Initially try using simple features (e.g., bag-of-words) to
train the classifier
– We will try to use Minorthird
– For example:
• Domain Name Classifier indicates that www.ri.abc.edu is very
likely to be related to Robotics
• Then incorporate all web pages under www.ri.abc.edu as
training examples for the academic field Robotics
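To make the labeling step concrete, here is a minimal sketch that trains a bag-of-words classifier on pages whose labels come from their domain. It uses a small plain-Python multinomial Naive Bayes as a stand-in, not Minorthird, and the page texts, the `domain_labels` mapping, and all names are invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase word counts for one page."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (stand-in classifier)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # field -> word -> count
        self.doc_counts = Counter()              # field -> number of training pages
        self.vocab = set()

    def train(self, text, field):
        bag = bag_of_words(text)
        self.word_counts[field].update(bag)
        self.doc_counts[field] += 1
        self.vocab.update(bag)

    def classify(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_field, best_score = None, -math.inf
        for field, n_docs in self.doc_counts.items():
            counts = self.word_counts[field]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word, n in bag.items():
                score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_field, best_score = field, score
        return best_field


# Assumed output of the Domain Name Classifier for two hypothetical domains.
domain_labels = {"www.ri.abc.edu": "Robotics", "www.cs.abc.edu": "Computer Science"}

# Toy pages: every page inherits the label of the domain it lives under.
pages = [
    ("www.ri.abc.edu/lab", "robot arm motion planning and robot sensors"),
    ("www.ri.abc.edu/people", "mobile robot navigation research"),
    ("www.cs.abc.edu/courses", "algorithms compilers and programming languages"),
    ("www.cs.abc.edu/research", "operating systems and programming theory"),
]

model = NaiveBayes()
for url, text in pages:
    model.train(text, domain_labels[url.split("/")[0]])

print(model.classify("robot navigation sensors"))
print(model.classify("programming languages and compilers"))
```

Because the labels are inherited from the domain instead of being assigned by hand, some training pages will be mislabeled; the sketch makes no attempt to filter that noise.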
Learning Loop
• Given a URL token like “cs” or “bio”, we can
search for other domains of the form:
www.cs.<school>.edu
– The Domain Name Classifier labels all pages in these
domains as Computer Science pages
• Given a URL such as www.cs.cmu.edu, we can
search for other domains of the form:
www.<dept>.cmu.edu
– The text-based Academic Page Classifier labels the
abbreviation <dept> based on the content of the pages
in this domain.
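One way to read the loop is as two alternating steps over a pool of known hosts, sketched below. The host list, the school names `abc` and `xyz`, the stub `classify_text`, and the fixed number of rounds are all assumptions for illustration. In a real system the text-based classifier would fill the stub's role, and a search engine would supply the hosts.

```python
def split_host(host):
    """Return (dept, school) for hosts shaped like www.<dept>.<school>.edu, else None."""
    parts = host.lower().split(".")
    if len(parts) == 4 and parts[0] == "www" and parts[3] == "edu":
        return parts[1], parts[2]
    return None


def learning_loop(hosts, seed_tokens, classify_text, pages_by_host, rounds=2):
    """Alternate the two steps, growing the token -> field dictionary."""
    token_to_field = dict(seed_tokens)
    host_labels = {}
    for _ in range(rounds):
        # Step 1: a known token labels every www.<token>.<school>.edu host.
        for host in hosts:
            slot = split_host(host)
            if slot and slot[0] in token_to_field:
                host_labels[host] = token_to_field[slot[0]]
        # Step 2: at schools that already have a labeled host, classify the
        # remaining www.<dept>.<school>.edu hosts by their text to learn new tokens.
        known_schools = {split_host(h)[1] for h in host_labels}
        for host in hosts:
            slot = split_host(host)
            if slot and host not in host_labels and slot[1] in known_schools:
                field = classify_text(pages_by_host.get(host, ""))
                if field is not None:
                    token_to_field.setdefault(slot[0], field)
    return token_to_field, host_labels


def classify_text(text):
    """Stub standing in for the text-based classifier (hypothetical rule)."""
    return "Robotics" if "robot" in text else None


hosts = ["www.cs.abc.edu", "www.ri.abc.edu", "www.cs.xyz.edu", "www.ri.xyz.edu"]
pages_by_host = {"www.ri.abc.edu": "robot perception and manipulation"}

tokens, labels = learning_loop(
    hosts, {"cs": "Computer Science"}, classify_text, pages_by_host
)
print(tokens)
print(labels)
```

In this toy run the seed token "cs" labels both Computer Science hosts, the page text at www.ri.abc.edu teaches the loop that "ri" means Robotics, and the second round carries that label over to www.ri.xyz.edu even though no page text is available for it.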