Review of Web Page Classification Approaches and Applications
Luu-Ngoc Do, Quang-Nhat Vo

Contents
• Introduction
• Applications
• Features
• Algorithms
• Experiments

Introduction
• A very large number of web pages exists on the World Wide Web.
• Web page classification supports many web information retrieval tasks: crawling, searching, extracting knowledge bases (KBs), etc.

Introduction
• Subject classification: considers the subject or topic of a web page. Example: "business", "sports", etc.
• Functional classification: considers the role of a web page. Example: course page, researcher homepage, etc.

Applications
• Improving the quality of search results
• Building focused crawlers
• Extracting knowledge bases

Improving Search Results
• Helps resolve query ambiguity.
• The user is asked to specify a category before searching (Chekuri et al. [1997]).
• A categorized view of the results is presented to the user (Kaki [2005]).

Building Focused Crawlers
• When only domain-specific queries are expected, performing a full crawl is usually inefficient.
• Only documents relevant to a predefined set of topics are of interest (Chakrabarti et al. [1999]).

Extracting Knowledge Bases
• Store complex structured and unstructured information from the World Wide Web to build a computer-understandable environment.
• First step: recognize class instances by classifying web content (Craven et al. [1998]).

Feature Selection
• Candidate features: textual content, HTML tags, hyperlinks, anchor texts
• On-page features
• Neighbor features

On-page Features
• Textual content
▫ Bag-of-words
▫ N-gram representation: sequences of n consecutive words (Mladenic [1998]). Example: the bigram "New York" vs. the unigrams "new" and "york"
• HTML tags: Ardo [2005]
• URL: Kan and Thi [2005], Sujatha [2013]. Advantage: reduced processing time

Neighbor Features (1)
• Weak assumption: neighboring pages of pages belonging to the same category share common characteristics.
• Strong assumption: a page is much more likely to be surrounded by pages of the same category.

Neighbor Features (2) (figure slide)

Neighbor Features (3)
• Sibling pages are more useful than parent and child pages (Chakrabarti et al.
[1998], Qi and Davison [2006]).
• The content of neighboring pages must be sufficiently similar to the target page to be useful (Oh et al. [2000]).
• Alternatively, use only a portion of the content of parent and child pages: the title, the anchor text, and the text surrounding the anchor text on the parent pages.

Algorithms
• k-NN
• Co-training
• Naive Bayes

k-NN
• Kwon and Lee [2000]
• Bag-of-words features

Co-training
• Blum and Mitchell [1998]
• Uses both labeled and unlabeled data.
• Two classifiers trained on different sets of features classify the unlabeled instances.
• The predictions of each classifier are used to train the other.

Web Page Classification using Naive Bayes
• Bernoulli model: a document is represented by a feature vector with binary elements, taking the value 1 if the corresponding word is present in the document and 0 otherwise.
▫ Example: with a six-word vocabulary such as (blue, red, dog, cat, biscuit, apple), the short document "the blue dog ate a blue biscuit" gives the Bernoulli feature vector b = (1, 0, 1, 0, 1, 0)^T.
• Given a web page D whose class is C, we classify D as the class with the highest posterior probability P(C|D):
  P(C = k | D) ∝ P(D | C = k) P(C = k)

Web Page Classification using Naive Bayes
• The document likelihood, assuming words occur independently given the class:
  P(D_i | C = k) = Π_t [ b_it P(w_t | C = k) + (1 − b_it) (1 − P(w_t | C = k)) ]
  where
  b_i is the Bernoulli feature vector of document D_i (b_it is its t-th element),
  P(w_t | C = k) = n_k(w_t) / N_k is the probability of word w_t occurring in a document of class C = k,
  n_k(w_t) is the number of documents of class C = k in which w_t is observed, and
  N_k is the total number of documents of class C = k.
• The prior term: P(C = k) = N_k / N, where N is the total number of training documents.

Experimental Results
• Dataset: WebKB
▫ Contains 8,145 web pages.
▫ Seven categories: student, faculty, staff, course, project, department, and other.
▫ Pages were collected from the computer science departments of four universities (Cornell, Texas, Washington, Wisconsin) plus some pages from other universities.
• Experimental setup:
▫ Select the four most populous categories: student, faculty, course, and project.
▫ Training data: Cornell, Washington, Texas, and miscellaneous pages collected from other universities.
▫ Testing data: Wisconsin.
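Before the results, the Bernoulli Naive Bayes classifier described on the preceding slides can be sketched as follows. This is a minimal illustration, not the implementation evaluated in the experiments: the vocabulary, documents, and labels below are invented toy data, and Laplace smoothing is added to avoid zero probabilities (the slide formulas use the raw ratio n_k(w_t)/N_k).

```python
# Bernoulli Naive Bayes sketch (toy illustration, not the authors' code).
import math
from collections import defaultdict

def train(docs, labels, vocab):
    """Estimate P(w_t|C=k) ~ n_k(w_t)/N_k and P(C=k) = N_k/N."""
    N = len(docs)
    N_k = defaultdict(int)                        # documents per class
    n_k = defaultdict(lambda: defaultdict(int))   # per-class word-document counts
    for words, k in zip(docs, labels):
        N_k[k] += 1
        for w in set(words) & set(vocab):         # count presence, not frequency
            n_k[k][w] += 1
    priors = {k: N_k[k] / N for k in N_k}
    # Laplace smoothing (an addition to the slide formulas).
    cond = {k: {w: (n_k[k][w] + 1) / (N_k[k] + 2) for w in vocab} for k in N_k}
    return priors, cond

def classify(words, vocab, priors, cond):
    """Return argmax_k of log P(C=k) + sum_t log P(b_t|C=k)."""
    present = set(words)
    best, best_lp = None, -math.inf
    for k, prior in priors.items():
        lp = math.log(prior)
        for w in vocab:
            p = cond[k][w]
            lp += math.log(p if w in present else 1.0 - p)
        if lp > best_lp:
            best, best_lp = k, lp
    return best

# Toy usage with a hypothetical vocabulary and two classes:
vocab = ["course", "homework", "lecture", "research", "publication", "grant"]
docs = [["course", "homework", "lecture"],
        ["course", "lecture"],
        ["research", "publication", "grant"],
        ["research", "grant"]]
labels = ["course", "course", "faculty", "faculty"]
priors, cond = train(docs, labels, vocab)
print(classify(["homework", "course"], vocab, priors, cond))  # -> course
```

Working in log-probabilities is a standard design choice: the product over the whole vocabulary underflows quickly for realistic vocabulary sizes.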
Experimental Results
• Per-class results:

  Class    | # training pages | # testing pages | Accuracy
  ---------|------------------|-----------------|---------
  Faculty  |             1082 |              42 |   0.8182
  Course   |              845 |              85 |   0.8851
  Student  |             1485 |             156 |   0.7595
  Project  |              479 |              25 |   0.8148

THANK YOU
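As a quick sanity check on the table above, an overall (support-weighted) accuracy on the Wisconsin test set can be derived from the per-class numbers; note that this aggregate figure is not reported in the original slides, it is simple arithmetic on the table entries.

```python
# Support-weighted overall accuracy derived from the per-class results
# in the table above (the aggregate itself is not reported in the slides).
results = {                      # class: (test pages, per-class accuracy)
    "faculty": (42, 0.8182),
    "course":  (85, 0.8851),
    "student": (156, 0.7595),
    "project": (25, 0.8148),
}
total = sum(n for n, _ in results.values())                   # 308 test pages
overall = sum(n * acc for n, acc in results.values()) / total
print(f"{overall:.4f}")  # -> 0.8067
```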