11-709 Read the Web: Project Proposal
Classifying University Web Pages According to Academic Field
Richard Wang
Tim Isganitis
01/26/2006

Goal
• Learn how to classify web pages according to the academic field they relate to.
  – We (loosely) define academic fields to correspond to academic departments. For example:
    • Computer Science
    • Biological Science
    • Public Policy
  – We predefine the department names, but an alternative (harder) method is to recognize the names of departments and cluster them according to a broader notion of “field.”

Redundant Features
• Domain Name
  – www.cs.cmu.edu (Computer Science)
  – www.bio.indiana.edu (Biology)
  – We assume that most pages under these domains have to do with the given field.
• Text of Hyperlink
  – <a href="www.csd.cs.cmu.edu">Computer Science Department</a>
• Words on a web page
  – Incorporate word features

Domain Name Classifier
• Use a dictionary to associate strings that appear in a domain name with types of field.
  – Probably position dependent:
    • Look for strings <dept> to fill www.<dept>.<school>.edu
  – For example:
    • 51% of web pages under www.cs.abc.edu are classified as “Computer Science”
    • Assume all web pages under “www.cs.<any school>.edu” would be related to the field of Computer Science

Academic Page Classifier
• Train a classifier on academic web pages
  – Labels of web pages are derived from the domain name using the Domain Name Classifier
  – Initially try using simple features (e.g. bag-of-words) to train the classifier
  – We will try to use Minorthird
  – For example:
    • The Domain Name Classifier indicates that www.ri.abc.edu is very likely to be related to Robotics
    • Then incorporate all web pages under www.ri.abc.edu as training examples for the academic field Robotics

Learning Loop
• Given a URL token like “cs” or “bio” we can search for other domains of the form: www.cs.<school>.edu
  – The Domain Name Classifier labels all pages in these domains as Computer Science pages
• Given a URL such as www.cs.cmu.edu we can search for other domains of the form: www.<dept>.cmu.edu
  – The text-based classifier labels the abbreviation <dept> based on the content of the pages in this domain.
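
Sketch: domain name lookup and learning loop
• The following is a minimal Python sketch of the position-dependent lookup www.<dept>.<school>.edu and of the two directions of the learning loop. It is an illustration under stated assumptions, not the proposed Minorthird-based system. The seed dictionary entries, all function names, the example token "stat", and the strict-majority threshold of 0.5 (suggested only by the 51% example) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical seed dictionary: <dept> token -> academic field.
DEPT_TO_FIELD = {
    "cs": "Computer Science",
    "bio": "Biology",
    "ri": "Robotics",
}

# Position-dependent pattern: www.<dept>.<school>.edu
HOST_PATTERN = re.compile(
    r"^www\.(?P<dept>[a-z0-9-]+)\.(?P<school>[a-z0-9-]+)\.edu$"
)


def parse_host(host):
    """Return (dept, school) for hosts shaped like www.<dept>.<school>.edu, else None."""
    match = HOST_PATTERN.match(host.lower())
    if match is None:
        return None
    return match.group("dept"), match.group("school")


def label_from_domain(host, dept_to_field):
    """Domain-name direction: label a host by looking up its <dept> token."""
    parsed = parse_host(host)
    if parsed is None:
        return None
    return dept_to_field.get(parsed[0])


def majority_field(page_labels, threshold=0.5):
    """Return the most common per-page field if its share exceeds the threshold."""
    if not page_labels:
        return None
    field, count = Counter(page_labels).most_common(1)[0]
    return field if count / len(page_labels) > threshold else None


def learn_token(host, page_labels, dept_to_field):
    """Text-classifier direction: if the pages under a host mostly belong to
    one field, record the host's unknown <dept> token under that field."""
    parsed = parse_host(host)
    field = majority_field(page_labels)
    if parsed is not None and field is not None:
        dept_to_field.setdefault(parsed[0], field)
    return dept_to_field
```

• Usage: label_from_domain("www.cs.cmu.edu", DEPT_TO_FIELD) looks up "cs" in the dictionary, while learn_token("www.stat.abc.edu", labels, d) adds "stat" to d only when one field accounts for more than half of the per-page labels. Hosts that do not match the four-part pattern, such as www.csd.cs.cmu.edu, are left unlabeled by this sketch.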