Authors: Min-Yen Kan and Hoang Oanh Nguyen Thi
Conference: CIKM 2005
Reporter: Yi-Ren Yeh
• Introduction
• URL Feature Extraction
– Recursive segmentation
– Using URL feature classes
• Experimental results
• Conclusion
• A web page's uniform resource locator (URL) is the least expensive feature to obtain
• It is also one of the more informative sources with respect to classification
• The authors classify web pages using only their URLs
– Feature extraction from URL
– Apply machine learning algorithms
• Segment URL at non-alphanumeric characters and at URI-escaped entities (e.g., '%20') to create smaller tokens
• Baseline segmentation is straightforward to implement and typically results in 4-7 tokens
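The baseline segmentation can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `baseline_tokens`, the example URL, and the lowercasing step are assumptions made here.

```python
import re

# One separator is any run of URI-escaped entities (e.g. %20) and
# non-alphanumeric characters. The escaped form is tried first so that
# the "20" in "%20" is not left behind as a token.
_SEPARATOR = re.compile(r"(?:%[0-9A-Fa-f]{2}|[^A-Za-z0-9])+")

def baseline_tokens(url: str) -> list[str]:
    """Split a URL into alphanumeric tokens (lowercasing is an assumption)."""
    return [tok.lower() for tok in _SEPARATOR.split(url) if tok]

# Hypothetical example URL.
tokens = baseline_tokens("http://www.example.edu/~jsmith/my%20notes.html")
print(tokens)
```

Concatenated words such as "activatealert" survive this step as a single token, which is what the finer segmentation is meant to address.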
• Concatenated words (e.g., activatealert) are especially prevalent in website domain names
• Segmenting these tokens into their component words is likely to increase performance
• The paper additionally segments tokens by information content (entropy) reduction
• A token T can be split into n partitions if $\sum_{i=1}^{n} H(t_i) < H(T)$, where $t_i$ denotes the ith partition of T and H denotes entropy (information content)
• A partitioning that has lower entropy than others would be a more probable parse of the token
• Such entropies can be estimated by collecting the frequencies of tokens in a large corpus
• Apply a recursive tree-partition strategy (O(n log n)) instead of enumerating all 2^(|T|-1) possible partitions
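One way to picture the recursive strategy is a greedy binary split: a token is cut at the point that most reduces the total information content, and each half is then processed the same way. The sketch below rests on assumptions made here, not on the paper: the counts in `COUNTS`, the corpus size `TOTAL`, and the length-based penalty for unseen strings are all hypothetical, and the paper's exact procedure may differ.

```python
import math

# Hypothetical toy counts standing in for frequencies from a large corpus.
COUNTS = {"activate": 50, "alert": 80, "act": 120, "ale": 5}
TOTAL = 1_000_000  # assumed corpus size

def info(token: str) -> float:
    """Information content in bits, estimated as -log2 p(token)."""
    count = COUNTS.get(token)
    if count:
        return -math.log2(count / TOTAL)
    # Unseen string: assumed penalty that grows with its length.
    return math.log2(TOTAL) + len(token) * math.log2(26)

def segment(token: str) -> list[str]:
    """Greedy binary split; recurse only when a split lowers the total cost."""
    best_cost, best_i = info(token), None
    for i in range(1, len(token)):
        cost = info(token[:i]) + info(token[i:])
        if cost < best_cost:
            best_cost, best_i = cost, i
    if best_i is None:
        return [token]  # no split is cheaper than the whole token
    return segment(token[:best_i]) + segment(token[best_i:])

parts = segment("activatealert")
print(parts)
```

Each call tries only the single-cut positions of its token, so far fewer candidates are examined than in an exhaustive enumeration of all partitions.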
• First split the URL into its URI components: scheme :// host / path / document . extension ? query # fragment
• A token that occurs in different parts of URLs may contribute differently to classification
• The authors extend the feature set by qualifying tokens with the URI component in which they occur
• The absence of certain components can influence classification as well
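A rough sketch of component-qualified features, using Python's standard `urllib.parse.urlsplit`. The feature names (`host:edu`, `no_query`), the helper names, and the example URL are illustrative assumptions, not the paper's notation.

```python
import re
from urllib.parse import urlsplit

def components(url: str) -> dict[str, str]:
    """Map a URL onto scheme/host/path/document/extension/query/fragment."""
    parts = urlsplit(url)
    directory, _, last = parts.path.rpartition("/")
    if "." in last:
        document, _, extension = last.rpartition(".")
    else:
        document, extension = last, ""
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",
        "path": directory,
        "document": document,
        "extension": extension,
        "query": parts.query,
        "fragment": parts.fragment,
    }

def component_features(url: str) -> set[str]:
    """Emit each token URL-wide and qualified by its component."""
    feats = set()
    for name, text in components(url).items():
        tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", text) if t]
        if not tokens:
            feats.add(f"no_{name}")  # absence of a component is itself a feature
        for tok in tokens:
            feats.add(tok)
            feats.add(f"{name}:{tok}")
    return feats

# Hypothetical example URL.
feats = component_features("http://www.example.edu/courses/cs101/index.html")
print(sorted(feats))
```

With this encoding, "edu" in the host and "edu" in the path become distinct features, and a URL without a query string carries an explicit `no_query` feature.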
• Using the surface form of a token also presents challenges for generalization
– e.g. 2002 vs. 2003
• Add features for tokens containing capital letters and/or digits, distinguishing such tokens by their length
• These features are added both as general, URL-wide features and as URI component-specific ones
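The orthographic features might look like the following sketch, in which "2002" and "2003" map to the same feature because both are four-digit numeric tokens. The category names (`Numeric`, `AlphaNumeric`, `Caps`) and the `component:` prefix are assumptions made for illustration.

```python
def orthographic_features(token: str, component: str = "") -> list[str]:
    """Describe a token by its character classes and length, not its surface form."""
    has_upper = any(c.isupper() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if not (has_upper or has_digit):
        return []  # plain lowercase tokens get no orthographic feature
    if token.isdigit():
        kind = "Numeric"
    elif has_digit:
        kind = "AlphaNumeric"
    else:
        kind = "Caps"
    shape = f"{kind}{len(token)}"
    feats = [shape]  # general, URL-wide feature
    if component:
        feats.append(f"{component}:{shape}")  # component-specific feature
    return feats

print(orthographic_features("2002", "path"))
print(orthographic_features("2003", "path"))
```

A classifier trained on pages from 2002 can then generalize to an unseen 2003 token through the shared shape feature.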
• Token n-grams might also help in classification
– The authors use 2, 3, and 4-grams
• Sequential order among tokens also matters
– “web spider” and “spider web”
– model left-to-right precedence between tokens
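Both sequence feature types can be sketched as follows. The `_` and `>` separators and the function names are assumptions, and the precedence features are read here as "token a occurs somewhere to the left of token b", which may be broader or narrower than the paper's definition.

```python
from itertools import combinations

def ngram_features(tokens: list[str], sizes=(2, 3, 4)) -> list[str]:
    """Contiguous token n-grams for each requested size."""
    return [
        "_".join(tokens[i:i + n])
        for n in sizes
        for i in range(len(tokens) - n + 1)
    ]

def precedence_features(tokens: list[str]) -> list[str]:
    """One feature per ordered pair (a, b) where a appears before b."""
    return [f"{a}>{b}" for a, b in combinations(tokens, 2)]

print(ngram_features(["www", "example", "edu"]))
print(precedence_features(["web", "spider"]))
print(precedence_features(["spider", "web"]))
```

The two orderings of "web" and "spider" yield different precedence features, which a bag-of-tokens representation cannot distinguish.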
• Use a subset of the WebKB corpus containing 4,167 pages
• Four classes (student, faculty, course, and project)
• Use SVM and maximum entropy classifiers
• The macro-averaged F measure is used for evaluation
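For reference, macro-averaging computes an F score per class and then takes the unweighted mean, so small classes count as much as large ones. The sketch below assumes the balanced F1 variant (the slides do not state which is used), and the labels in the example are made up.

```python
def macro_f1(gold: list[str], predicted: list[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(gold) | set(predicted))
    scores = []
    for c in classes:
        pairs = list(zip(gold, predicted))
        tp = sum(g == c and p == c for g, p in pairs)
        fp = sum(g != c and p == c for g, p in pairs)
        fn = sum(g == c and p != c for g, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels over the four WebKB classes.
gold = ["student", "student", "faculty", "course", "project"]
pred = ["student", "faculty", "faculty", "course", "course"]
score = macro_f1(gold, pred)
print(score)
```

In this toy example the missed "project" page pulls the macro average down by a full quarter, even though it is only one of five pages.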
Evaluate On Hierarchical Categorization
• Evaluate on the Open Directory Project
• The snapshot is dated 3 August 2004 and encompasses over 4.4 M URLs categorized into 17 first-level and 508 second-level categories
• The authors use 100,000 randomly chosen ODP URLs to assemble a testing (and training) corpus for the two-level, hierarchical experiments
• Only 360 second-level categories are used.
• The authors have extended previous work and added features to model URL component length, content, orthography, token sequence and precedence
• They also evaluate these features over a large set of tasks, including relevance, categorization and PageRank prediction
• These features do not perform as well on typical web site entry points (i.e., just the domain name), as they attempt to leverage the internal path structure of the URL