Eliminating Noisy Information in Web Pages for Data Mining Lan Yi Bing Liu Xiaoli Li School of Computing National University of Singapore 3 Science Drive 2 Singapore 117543 Department of Computer Science University of Illinois at Chicago 851 S. Morgan Street Chicago, IL 60607-7053 Singapore-MIT alliance National University of Singapore 3 Science Drive 2 Singapore 117543 yilan@comp.nus.edu.sg liub@cs.uic.edu lixl@comp.nus.edu.sg ABSTRACT A commercial Web page typically contains many information blocks. Apart from the main content blocks, it usually has such blocks as navigation panels, copyright and privacy notices, and advertisements (for business purposes and for easy user access). We call these blocks that are not the main content blocks of the page the noisy blocks. We show that the information contained in these noisy blocks can seriously harm Web data mining. Eliminating these noises is thus of great importance. In this paper, we propose a noise elimination technique based on the following observation: In a given Web site, noisy blocks usually share some common contents and presentation styles, while the main content blocks of the pages are often diverse in their actual contents and/or presentation styles. Based on this observation, we propose a tree structure, called Style Tree, to capture the common presentation styles and the actual contents of the pages in a given Web site. By sampling the pages of the site, a Style Tree can be built for the site, which we call the Site Style Tree (SST). We then introduce an information based measure to determine which parts of the SST represent noises and which parts represent the main contents of the site. The SST is employed to detect and eliminate noises in any Web page of the site by mapping this page to the SST. The proposed technique is evaluated with two data mining tasks, Web page clustering and classification. Experimental results show that our noise elimination technique is able to improve the mining results significantly. Categories and Subject Descriptors H.3.3 [INFORMATION STORAGE AND RETRIEVAL]: Information Search and Retrieval – clustering, information filtering, selection process. General Terms Algorithm, Design, Experimentation, Theory. Keywords Noise detection, noise elimination, Web mining. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGKDD ’03, August 24-27, 2003, Washington, DC, USA. Copyright 2003 ACM 1-58113-737-0/03/0008…$5.00. 1. INTRODUCTION The rapid expansion of the Internet has made the WWW a popular place for disseminating and collecting information. Data mining on the Web thus becomes an important task for discovering useful knowledge or information from the Web [6][9]. However, useful information on the Web is often accompanied by a large amount of noise such as banner advertisements, navigation bars, copyright notices, etc. Although such information items are functionally useful for human viewers and necessary for the Web site owners, they often hamper automated information gathering and Web data mining, e.g., Web page clustering, classification, information retrieval and information extraction. Web noises can be grouped into two categories according to their granularities: Global noises: These are noises on the Web with large granularity, which are usually no smaller than individual pages. Global noises include mirror sites, legal/illegal duplicated Web pages, old versioned Web pages to be deleted, etc. Local (intra-page) noises: These are noisy regions/items within a Web page. Local noises are usually incoherent with the main contents of the Web page. Such noises include banner advertisements, navigational guides, decoration pictures, etc. In this work, we focus on detecting and eliminating local noises in Web pages to improve the performance of Web mining, e.g., Web page clustering and classification. This work is motivated by a practical application. A commercial company asked us to build a classifier for a number of products. They want to download product description and review pages from the Web and then use the classifier to classify the pages into different categories. In this paper, we will show that local noises in Web pages can seriously harm the accuracy of data mining. Thus cleaning the Web pages before mining becomes critical for improving the data mining results. We call this preprocessing step Web page cleaning. Figure 1 gives a sample page from PCMag1. This page contains an evaluation report of Samsung ML-1430 printer. The main content (segment 3 in Figure 1) only occupies 1/3 of the original Web page, and the rest of the page contains many advertisements, navigation links (e.g., segment 1 in Figure 1), magazine subscription forms, privacy statements, etc. If we perform clustering on a set of product pages like this page, such items are irrelevant and should be removed. 1 http://www.pcmag.com/ ② ③ ① Figure 1: A part of an example Web page with noises (dotted lines are drawn manually) Despite its importance, relatively little work has been done on Web page cleaning in the past (see our related work section). In this paper, we propose a highly effective technique to clean Web pages with the purpose of improving Web data mining. Note that although XML 2 Web pages are more powerful than HMTL pages for describing the contents of a page and one can use XML tags to find the main contents for various purposes, most current Web pages on the Web are still in HTML rather than in XML. The huge number of HTML pages on the Web are not likely to be transformed to XML pages in the near future. Hence, we focus our work on cleaning HTML pages. Our cleaning technique is based on the following observation. In a typical commercial Web site, Web pages tend to follow some fixed layouts or presentation styles as most pages are generated automatically. Those parts of a page whose layouts and actual contents (i.e., texts, images, links, etc) also appear in many other pages in the site are more likely to be noises, and those parts of a page whose layouts or actual contents are quite different from other pages are usually the main contents of the page. In this paper, we first introduce a new tree structure, called style tree, to capture the common layouts (or presentation styles) and the actual contents of the pages in a Web site. We then propose an information based measure to determine which parts of the style tree indicate noises and which parts of the style tree contain the 2 http://www.w3.org/XML/ main contents of the pages in the Web site. To clean a new page from the same site, we simply map the page to the style tree of the site. According to the mapping, we can decide the noisy parts and delete them. Our experiment results based on two popular Web mining tasks, i.e., Web page clustering and Web page classification, show that our cleaning technique is able to boost the mining results dramatically. For example, in classification, the average classification accuracy over all our datasets increases from 0.625 before cleaning to 0.954 after cleaning. This represents a remarkable improvement. We also compare our proposed method with the existing template based cleaning method [2]. Our results show that the proposed method outperforms this existing state-ofthe-art method substantially. Our contributions • A new tree structure, called Style Tree, is proposed to capture the actual contents and the common layouts (or presentation styles) of the Web pages in a Web site. An information (or entropy) based measure is also introduced to evaluate the importance of each element node in the style tree, which in turn helps us to eliminate noises in a Web page. • Experimental results show that the proposed page cleaning technique is able to improve the results of Web data mining dramatically. It also outperforms the existing template based cleaning technique given in [2] by a large margin. 2. RELATED WORK 3. THE PROPOSED TECHNIQUE Although Web page cleaning is an important task, relatively little work has been done in this field. In [17], a method is proposed to detect informative blocks in news Web pages. The concept of informative blocks is similar to our concept of main contents of a page. However, the work in [17] is limited by the following two assumptions: (1) the system knows a prori how a Web page can be partitioned into coherent content blocks, and (2) the system knows a priori which blocks are the same blocks in different Web pages. The proposed cleaning technique is based on the analysis of both the layouts and the actual contents (i.e., texts, images, etc.) of the Web pages in a given Web site. Thus, our first task is to find a suitable data structure to represent both the presentation styles (or layouts) and the actual contents of the Web pages in the site. We propose a Style Tree (ST) for this purpose. Below, we start by giving an overview of the DOM (Document Object Model) 3 tree, which is commonly used for representing the structure of a single Web page, and showing that it is insufficient for our purpose. We then present the style tree, which is followed by our entropy measure for evaluating the nodes in the style tree for noise detection. As we will see, partitioning a Web page and identifying corresponding blocks in different pages are actually two critical issues in Web page cleaning. Our system is able to perform these tasks automatically (with no user help). Besides, their work views a Web page as a flat collection of blocks which correspond to “TABLE” elements in Web pages, and each block is viewed as a collection of words. These assumptions are often true in news Web pages, which is the domain of their applications. In general, these assumptions are too strong. In [2], Web page cleaning is defined as a frequent template detection problem. They propose a frequency based data mining algorithm to detect templates and views those templates as noises. The cleaning method in [2] is not concerned with the context of a Web site, which can give useful clues for page cleaning. Moreover, in [2], the partitioning of a Web page is pre-fixed by considering the number of hyperlinks that an HTML element has. This partitioning method is simple and useful for a set of Web pages from different Web sites, while it is not suitable for Web pages that are all from the same Web site because a Web site typically has its own common layouts or presentation styles, which can be exploited to partition Web pages and to detect noises. We will compare the results of our method with those of the method in [2] and give a discussion in the experiment section. Other related work includes data cleaning for data mining and data Warehousing [13], duplicate records detection in textual databases [16] and data preprocessing for Web Usage Mining [7]. Our task is different as we deal with semi-structured Web pages and also we focus on removing noisy parts of a page rather than duplicate pages. Hence, different cleaning techniques are needed. Web page cleaning is also related to feature selection in traditional machine learning (see [18]). In feature selection, features are individual words or attributes. However, items in Web pages have some structures, which are reflected by their nested HTML tags. Hence, different methods are needed in the context of the Web. [8][10] propose some learning mechanisms to recognize banner ads, redundant and irrelevant links of Web pages. However, these techniques are not automatic. They require a large set of manually labeled training data and also domain knowledge to generate classification rules. [11] enhances the HITS algorithm [12] by using the entropy of anchor text to evaluate the importance of links. It focuses on improving HITS algorithm to find more informative structures in Web sites. Although it segments Web pages into content blocks to avoid unnecessary authority and hub propagations, it does not detect or eliminate noisy contents in Web pages. 3.1 DOM tree Each HTML page corresponds to a DOM tree where tags are internal nodes and the detailed texts, images or hyperlinks are the leaf nodes. Figure 2 shows a segment of HTML codes and its corresponding DOM tree. In the DOM tree, each solid rectangle is a tag node. The shaded box is the actual content of the node, e.g., for the tag IMG, the actual contents is “src=image.gif”. Notice that our study of HTML Web pages begins from the BODY tag since all the viewable parts are within the scope of BODY. Each node is also attached with its display properties. For convenience of analysis, we add a virtual root node without any attribute as the parent tag node of BODY in the DOM tree. <BODY bgcolor=WHITE> <TABLE width=800 height=200 > … </TABLE> <IMG src="image.gif" width=800> <TABLE bgcolor=RED> … </TABLE> </BODY> root BODY width=800 height=200 width=800 TABLE IMG bgcolor=white bgcolor=red TABLE Figure 2: A DOM tree example (lower level tags are omitted) Although a DOM tree is sufficient for representing the layout or presentation style of a single HTML page, it is hard to study the overall presentation style and content of a set of HTML pages and to clean them based on individual DOM trees. Thus, DOM trees are not enough in our cleaning work which considers both the presentation style and real content of the Web pages. We need a more powerful structure for this purpose. This structure is critical because our algorithm needs it to find common styles of the pages from a site in order to eliminate noises. We introduce a new tree structure, called style tree (ST), which is able to compress the common presentation styles of a set of related Web pages. A style tree example is given in Figure 3 as a combination of DOM trees d1 and d2. We observe that, except for the four tags (P, IMG, P and A) at the bottom level, all the tags in d1 have their corresponding tags in d2. Thus, d1 and d2 can be compressed. We use a count to indicate how many pages have a particular style at a particular level of the style tree. In Figure 3, we can see that both pages start with BODY, and thus BODY has a count of 2. Below BODY, both pages also have the same presentation style of TABLE-IMG-TABLE. We call this whole sequence of tags (TABLE-IMG-TABLE) a style node, which is enclosed in a dash3 http://www.w3.org/DOM/ lined rectangle in Figure 3. It represents a particular presentation style at this point. A style node is thus a sequence of tag nodes in a DOM tree. In the style tree, we call these tag nodes the element nodes so as to distinguish them from tag nodes in the DOM tree. For example, the TABLE-IMG-TABLE style node has three element nodes, TABLE, IMG and TABLE. An element node also contains slightly different information from a tag node in a DOM tree as will be defined later. d1 d2 root root bgcolor=white bgcolor=white BODY width=800 height=200 BODY width=800 height=200 bgcolor=red TABLE P IMG TABLE TABLE IMG P bgcolor=red IMG A P root Style tree BR P bgcolor=white 2 BODY width=800 height=200 bgcolor=red 2 TABLE IMG TABLE 1 P TABLE IMG P 1 A P BR P Figure 3: DOM trees and the style tree In Figure 3, we can see that below the right most TABLE tag, d1 and d2 diverge, which is reflected by two different style nodes in the style tree. The two style nodes are P-IMG-P-A and P-BR-P respectively. This means below the right TABLE node, we have two different presentation styles. The page count of these two style nodes are both 1. Clearly, the style tree is a compressed representation of the two DOM trees. It enables us to see which parts of the DOM trees are common and which parts are different. Building a style tree (called site style tree or SST) for the pages of a Web site is fairly straightforward. We first build a DOM tree for each page and then merge it into the style tree in a top-down fashion. At a particular element node E in the style tree, which has the corresponding tag node T in the DOM tree, we check whether the sequence of child tag nodes of T in the DOM tree is the same as the sequence of element nodes in a style node S below E (in the style tree). If the answer is yes, we simply increment the page count of the style node S, and then go down the style tree and the DOM tree to merge the rest of the nodes. If the answer is no, a new style node is created below the element node E in the style tree. The sub-tree of the tag node T in the DOM tree is copied to the style tree after converted to style nodes and element nodes of the style tree. 3.3 Determining the Noisy Elements in ST In our work, the definition of noise is based on the following assumptions: (1) The more presentation styles that an element node has, the more important it is, and vice versa. (2) The more diverse that the actual contents of an element node are, the more important the element node is, and vice versa. Both these importance values are used in evaluating the importance of an element node. The presentation importance aims at detecting noises with regular presentation styles while the content importance aims at identifying those main contents of the pages that may be presented in similar presentation styles. Hence, in the proposed method the importance of an element node is given by combining its presentation importance and content importance. The greater the combined importance of an element node is, the more likely it is the main content of the pages. root 100 Body 100 3.2 Style Tree (ST) Table We now define a style tree, which consists of two types of nodes, namely, style nodes and element nodes. 100 Definition: A style node (S) represents a layout or presentation style, which has two components, denoted by (Es, n), where Es is a sequence of element nodes (see below), and n is the number of pages that has this particular style at this node level. In Figure 3, the style node (in a dash-lined rectangle) P-IMG-P-A has 4 element nodes, P, IMG, P and A, and n = 1. Definition: An element node E has three components, denoted by (TAG, Attr, Ss), where • TAG is the tag name, e.g., “TABLE” and “IMG”; • Attr is the set of display attributes of TAG, e.g., bgcolor = RED, width = 100, etc. • Ss is a set of style nodes below E. Note that an element node corresponds to a tag node in the DOM tree, but points to a set of child style nodes Ss (see Figure 3). For convenience, we usually denote an element node by its tag name, and a style node by its sequence of tag names corresponding to its element node sequence. Tr Img Table Table 100 35 Tr Text 25 15 25 P Text A 100 P 100 A P P A P Img A Img A A A A Figure 4: An example site style tree (SST) In the example of Figure 4, the shaded parts of the SST are more likely to be noises since their presentation styles (together with their actual contents which cannot be shown in the figure) are highly regular and fixed and hence less important. The doublelined Table element node has many child style nodes, which indicate that the element node is likely to be important. That is, the double-lined Table is more likely to contain the main contents of the pages. Specially, the double-lined Text element node is also meaningful since its content is diverse although its presentation style is fixed. Let the SST be the style tree built using all the pages of a Web site. We need a metric to measure the importance of a presentation style. Information theory (or entropy) is a natural choice. Definition (node importance): For an element node E in the SST, let m be the number of pages containing E and l be the number of child style nodes of E (i.e., l = |E.Ss|), the node importance of E, denoted by NodeImp(E), is defined by l − pi log m pi Node Im p ( E ) = ∑ i =1 1 if if m >1 m =1 -0.35log1000.35 - 2*0.25log1000.25-0.15log1000.15 = 0.292 > 0 However, we cannot say that Body is a noisy item by considering only its node importance because it does not consider the importance of its descendents. We use composite importance to measure the importance of an element node and its descendents. Definition (composite importance): For an internal element node E in the SST, let l = |E.Ss|. The composite importance of E, denoted by CompImp(E), is defined by l CompImp ( E ) = (1 − γ l ) NodeImp ( E ) + γ l ∑ ( pi CompImp ( Si )) (2) i =1 where pi is the probability that E has the ith child style node in E.Ss. In the above equation, CompImp(Si) is the composite importance of a style node Si (∈ E.Ss), which is defined by CompImp( S i ) = j =1 k j m ∑p ij log m pij j =1 Intuitively, if l is small, the possibility that E is presented in different styles is small. Hence the value of NodeImp(E) is small. If E contains many presentation styles, then the value of NodeImp(E) is large. For example, in the SST of Figure 4, the importance of the element node Body is 0 (llog100 l = 0) since l = 1. That is, below Body, there is only one presentation style TableImg-Table-Table. The importance of the double-lined Table is k H (ai ) = − (1) where pi is the probability that a Web page uses the ith style node in E.Ss. ∑ CompImp( E 1 if m = 1 l (4) CompImp ( E ) = ∑ H ( a i ) 1 − i =1 if m > 1 l where ai is an actual feature of the content in E. H(ai) is the information entropy of ai within the context of E, where pij is the probability that ai appears in E of page j. Note that if m = 1, it means that only one page contains E, then E is a very important node, and its CompImp is 1 (all the values of CompImp are normalized to between 0 and 1). Calculating composite importance (using the CalcCompImp(E) procedure) for all element nodes and style nodes can be easily done by traversing the SST. We will not discuss it further here. 3.4 Noise Detection As mentioned earlier, our definition of noise is based on the assumptions that the more presentation styles that are used to compose an element node the more important the element node is and that the more diverse that the actual contents of an element node are, the more important the element node is. We now define what we mean by noises and give an algorithm to detect and to eliminate them. Definition (noisy): For an element node E in the SST, if all of its descendents and itself have composite importance less than a specified threshold t, then we say element node E is noisy. Figure 5 gives the algorithm MarkNoise(E) to identify noises in the SST. It first checks whether all E’s descendents are noisy or not. If any one of them is not noisy, then E is not noisy. If all its descendents are noisy and E’s composite importance is also small, then E is noisy. ) Input: E: root element node of a SST Return: TRUE if E and all of its descendents are noisy, else FALSE (3) where Ej is an element node in Si.E, and k = |Si.Es|, which is the number of element nodes in Si. In (2), γ is the attenuating factor, which is set to 0.9. It increases the weight of NodeImp(E) when l is large. It decreases the weight of NodeImp(E) when l is small. This means that the more child style nodes an element node has, the more its composite importance is focused on itself, and the fewer child style nodes it has, the more its composite importance is focused on its descendents. Leaf nodes are different from internal nodes since they only have actual content with no tags. We define the composite importance of a leaf element node based on the information in its actual contents (i.e., texts, images, links, etc.) Definition: For a leaf element node E in the SST, let l be the number of features (i.e., words, image files, link references, etc) appeared in E and let m be the number of pages containing E, the composite importance of E is defined by (5) 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: MarkNoise(E) for each S ∈E.Ss do for each e ∈ S.Es do if (MarkNoise(e) == FALSE) then return FALSE end if end for end for if (E.CompImp ≤ t) then mark E as “noisy” return TRUE else return FALSE end if Figure 5: Mark noisy element nodes in SST Definition (maximal noisy element node): If a noisy element node E in the SST is not a descendent of any other noisy element node, we call E a maximal noisy element node. In other words, if an element node E is noisy and none of its ancestor nodes is noisy, then E is a maximal noisy element node, which is also marked by the algorithm in Figure 5. Input: E: Root element node of the simplified SST Input: EPST: root element node of the page style tree Return: The main content of the page after cleaning Definition (meaningful): If an element node E in the SST does not contain any noisy descendent, we say that E is meaningful. MapSST (E, EP) 1: if E is noisy then 2: delete EP (and its descendents) as noises 3: return NULL 4: end if 5: if E is meaningful then 6: Ep is meaningful 7: return the content under EP 8: else returnContent = NULL 9: S2 is the (only) style node in EP.Ss 10: if ∃S1∈E.Ss ∧ S2 matches S1 then 11: e1,i is the ith element node in sequence S1.Es; 12: e2,i is the ith element node in sequence S2.Es; 13: for each pair (e1,i , e2,i) do 14: returnContent += MapSST(e1,i , e2,i) 15: end for 16: return returnContent 17: else EP is possibly meaningful; 18: return the content under EP 19: end if 20: end if Definition (maximal meaningful element node): If a meaningful element node E is not a descendent of any other meaningful element node, we say E is a maximal meaningful element node. Notice that some element nodes in the SST may be neither noisy nor meaningful, e.g., an element node containing both noisy and meaningful descendents. Similar to MarkNoise(EN), the algorithm MarkMeaningful(EN) marks all the maximal meaningful element nodes. Note that in the actual implementation, the function CalcCompImp(EN), MarkNoise(EN) and MarkMeaningful(EN) are all combined into one in order to reduce the number of scans of the SST. Here we discuss them separately for clarity. Since we are able to identify maximal meaningful element nodes and maximal noisy element nodes in the STT, we need not traverse the whole SST to detect and eliminate noises. Going down from the root of the SST, when we find a maximal noisy node, we can instantly confirm that the node and its descendents are noisy. So we can simplify the SST into a simpler tree by removing descendents of maximal noisy nodes and maximal meaningful nodes in the SST. Let us go back to the SST in Figure 4. Assume that we have identified the element nodes in the shaded areas to be noisy and the double-lined element nodes to be meaningful, the SST can be simplified to the one in Figure 6. root Body Table Tr Tr Img Table Table Text Figure 6: A simplified SST Figure 7: Map EP to E and return meaningful contents 3.5 The overall algorithm Figure 8 summarizes all the steps of our Web cleaning algorithm. Given a Web site, the system first randomly crawls a number of Web pages from the Web site (line 1) and builds the SST based on these pages (line 2-6). In many sites, we could not crawl all its pages because they are too large. By calculating the composite importance of each element node in the SST, we find the maximal noisy nodes and maximal meaningful nodes. To clean a new page P, we map its PST to the SST to eliminate noises (lines 10-13). 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: Randomly crawl k pages from the given Web site S Set null SST with virtual root E (representing the root); for each page W in the k pages do BuildPST(W); BuildSST(E, Ew) end for CalcCompImp(E); MarkNoise(E); MarkMeaningful(E); for each target Web page P do Ep = BuildPST(P) /* representing the root */ MapSST(E, Ep) end for We now give the algorithm for detecting and eliminating noises (Figure 7) given a SST and a new page from the same site. The algorithm basically maps the DOM tree of the page to the SST, and depending on where each part of the DOM tree is mapped to the SST, we can find whether the part is meaningful or noisy by checking if the corresponding element node in the SST is meaningful or noisy. If the corresponding element node is neither noisy nor meaningful, we simply go down to the lower level nodes. 3.6 Further Enhancements For easy presentation of the algorithm, we assume that the DOM tree of the page is converted to a style tree with only one page (called a page style tree or PST). The algorithm MapSST takes two inputs, an element node E in the SST and an element node Ep of the page style tree. At the beginning, they are the respective root nodes in the SST and the page style tree. 1. For any two style nodes S1 and S2 belonging to the same parent element node E in a SST, if e1∈S1.Es and e2∈ S2.Es, it is possible that e1 and e2 are the same element node appearing in different groups of Web pages (presented in different Figure 8: The overall algorithm The algorithm introduced above is the basic algorithm. Some minor tunings are needed to make it more effective. presentation styles). In this case, it is logical to view the element nodes e1 and e2 as one element node by merging them. The merging is accomplished in the following manner: If e1.TAG = e2.TAG and e1.Att r= e2.Attr, we compare their actual contents to see whether they are similar and can be merged. Let the characteristic feature set of ej be Ij = {featurek | freq(featurek) ≥ γ, featurek occurs in the actual contents of ej }, where j = 1, 2. freq(featurek) is the document frequency of featurek within ej and γ is a predefined constant between 0 and 1. If |Ij| > 0 (j = 1, 2) and |I1∩I2|/|I1∪I2| ≥ λ, then e1 and e2 are merged to form a new element node (e1 and e2 are deleted). Thus, in the process of building a SST, for any newly created element node E, all the element nodes immediately below E will be merged if possible and their corresponding tag nodes in DOM trees are grouped together to build the sub-trees under E. In our experiments, we set γ = 0.85 and λ = 0.85, which perform very well. By doing so, the original element nodes e1 and e2 become two pointers pointing to the newly created element node in the SST. The rest of the algorithm remains the same as in the basic algorithm. 2. The leaf tag nodes used for the algorithm should not be the actual leaf tag nodes as they tend to overly fragment the page. Instead, we use the parent nodes of the actual leaf tag nodes in the DOM tree as the (virtual) leaf tag nodes in building the SST and in computing the importance values of element nodes. 3. It is possible that although an element node in the SST is meaningful as a unit, it may still contain some noisy items. So, for each meaningful element node in the SST, we do not output those locally noisy features whose information entropy (see equation 5) is smaller than ε (ε = 0.01 is set as the default value of our system, which performs quite well). Thus, in the mapping algorithm of Figure 7, the contents in each meaningful element node should be output by first deleting those locally noisy features. 4. EMPIRICAL EVALUATION This section evaluates the proposed noise elimination algorithm. Since the purpose of our noise elimination is to improve Web data mining, we performed two data mining tasks, i.e., clustering and classification, to test our system. By comparing the mining results before and after cleaning, we show that our cleaning technique is able to improve mining results substantially. We also compare our results with the mining results obtained after cleaning using the template based technique proposed in [2]. To distinguish the method proposed in [2] with our method in discussion, we denote the method in [2] as the template based method and denote our method as the SST based method. Note that we could not compare our system with the technique in [17] as it is not suitable for our task. It is designed specifically for identifying main news articles in news Web pages, and it makes some assumptions that are not suitable for general page cleaning (see Section 2). Below, we first describe datasets used in our experiments and evaluation measures. We then present our experiment results of clustering and classification, and also give some discussions. 4.1 Datasets and Evaluation Measures Our empirical evaluation is done using Web pages from 5 commercial Web sites, Amazon 4 , CNet 5 , PCMag, J&R 6 and ZDnet7. These sites contain many introduction or overview pages of different kinds of products. To help the users navigate the site and to show advertisements, the pages from these sites all contain a large amount of noise. We will show that the noise misleads data mining algorithms to produce poor results (both in clustering and in classification). However, our technique of detecting and eliminating noise is able to improve the mining results substantially. The five Web sites contain Web pages of many categories or classes of products. We choose the Web pages that focus on the following categories of products: Notebook, Digital Camera, Mobile Phone, Printer and TV. Table 1 lists the number of documents downloaded from each Web site, and their corresponding classes. Table 1. Number of Web pages and their classes Web sites Notebook Camera Mobile Printer TV Amazon 434 402 45 767 719 CNet 480 219 109 500 449 J&R 51 80 9 104 199 PCMag 144 137 43 107 0 ZDnet 143 151 97 80 0 Since we test our system using clustering and classification, we use the popular F score measure to evaluate the results before and after cleaning. F score is defined as follows: F = 2p*r/(p+r), where p is the precision and r is the recall. F score measures the performance of a system on a particular class, and it reflects the average effect of both precision and recall. We will also include the accuracy results for classification. 4.2 Experimental Results We now present the experimental results of Web page clustering and classification before and after cleaning and compare our method with the template based method. For the experiments, we implemented the template based method given in [2]. This method first partitions all the parse trees of HTML pages into flattened pagelets according to the number of hyperlinks each HTML element contains (see Section 4.2.2 for the definition of pagelet). Then it uses the shingle technique [5] to determine the almost-similarities of pagelets. A shingle is a text fingerprint that is invariant under small perturbations [2]. For our application, we use the local template detection algorithm in [2] to detect templates. According to the algorithm, a group of (no less than 2) pagelets whose shingles are the same is treated as a template and is deleted. Additionally, we use the template based method to clean the Web pages in each individual site separately rather than cleaning all the pages from all the 5 sites altogether, which proves to be more effective. For cleaning in our method, the site style tree of each Web site is 4 5 6 7 http://www.amazon.com/ http://www.cnet.com/ http://www.jandr.com/ http://www.zdnet.com/ Cumulative Distribution of F scores in clustering number of results 450 400 350 300 F(N) F(T) F(S) 250 200 150 100 50 0 F(N) F(T) F(S) 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0 0 0 0 0 0 2 0 0 85 5 0 294 80 26 314 213 55 101 316 94 3 124 427 1 62 104 0 0 94 Sorted F score zones Figure 9: The distribution of F scores of clustering (F(N) represents the F score of clustering using the original Web pages without cleaning; F(T) represents the F score of clustering after cleaning is done using the template based technique in [2]; F(S) represents the F score of clustering after cleaning using the SST based method) built using 500 randomly sampled pages from the site. We also tried larger numbers. However, we found that 500 pages are sufficient and more sampled pages do not improve the cleaning results. Our cleaning algorithm needs a threshold to decide noisy and meaningful elements. We set the threshold for each Web site as follows: For each Web site, we choose a small number of pages (20), and then clean them using a number of threshold values. We then look at the cleaned pages, and according to these cleaned pages, we set the final threshold. 4.2.1 Clustering We use the popular k-means clustering algorithm [1]. We put all the 5 categories of Web pages into a big set, and use the clustering algorithm to cluster them into 5 clusters. Since the kmeans algorithm selects the initial cluster seeds randomly, we performed a large number of experiments (800) to show the behaviors of k-means clustering before and after page cleaning. The cumulative distributions of F scores before and after cleaning are plotted in Figure 9, where X-axis shows 10 bins of F score from 0 to 1 with each bin size of 0.1 and Y-axis gives the number of experiments whose F scores fall into each bin. The F score for each experiment is the average value of the 5 classes. It is computed as follows: By comparing the pages’ original classes and the k-means clustering results, we find the optimal assignment of classes to clusters that gives the best average F score for the 5 classes. From Figure 9, we can clearly observe that clustering results after our SST based cleaning are dramatically better than the results using the original noisy Web pages. Our method also helps to produce much better clustering results than the template based method. Table 2 gives the statistics of F scores over the 800 clustering runs using the original Web pages, the pages cleaned with the template based method and the pages cleaned with the SST based method respectively. We observe that over the 800 runs, the average F score for the noise case (without cleaning) is 0.506, and the average F score for the template based cleaning case is 0.631, while the average F score for the SST based cleaning case is 0.751, which is a remarkable improvement. Table 2. Statistics of k-means clustering results Method F(N) F(T) F(S) Ave(F) 0.506 0.631 0.751 F < 0.5 F >= 0.7 F >= 0.8 F >= 0.9 47.63% 0.50% 0.13% 0.00% 10.63% 23.25% 7.75% 0.00% 3.25% 78.13% 24.75% 11.75% More specifically, before cleaning, only 0.5% of the 800 results (4 out of 800) have the F scores no less than 0.7, and 47.63% lower than 0.5. After template based cleaning, 23.25% of the 800 clustering results have the F scores no less than 0.7, and 10.63% lower than 0.5. While after the SST based cleaning, 78.13% of the 800 results have F scores no less than 0.7, and only 3.25% lower than 0.5. Thus, we can conclude that our proposed noise elimination method is much more effective than the template based method for Web page clustering. 4.2.2 Classification For classification, we use the Naive Bayesian classifier (NB), which has been shown to perform very well in practice by many researchers [14][15]. The basic idea of NB is to use the joint probabilities of words and classes to estimate the probabilities of classes given a document. In order to study how Web page noise affects classification accuracy and to better understand the situations where noise elimination is most effective, we performed a comprehensive evaluation with different training (TR) and testing (TE) configurations. Table 3. Averaged F scores of classification C1 C2 notebook camera notebook mobile notebook printer notebook TV camera mobile camera printer camera TV mobile printer mobile TV printer TV Average F1(N) 0.992 0.946 0.991 0.967 0.979 0.968 0.794 0.984 0.677 0.956 0.918 F1(T) 0.932 0.977 0.996 0.984 0.985 0.988 0.943 0.983 0.872 0.99 0.965 F1(S) 0.994 0.954 0.991 0.998 0.996 0.996 0.986 0.998 0.977 0.997 0.989 F2(N) 0.923 0.779 0.832 0.724 0.767 0.763 0.694 0.783 0.649 0.719 0.763 F2(T) 0.965 0.903 0.973 0.935 0.872 0.943 0.916 0.923 0.819 0.935 0.918 F2(S) 0.976 0.911 0.979 0.976 0.963 0.975 0.974 0.987 0.959 0.979 0.968 F3(N) 0.699 0.672 0.634 0.559 0.629 0.589 0.542 0.581 0.537 0.516 0.596 F3(T) 0.871 0.836 0.834 0.847 0.800 0.817 0.796 0.866 0.801 0.840 0.831 F3(S) 0.952 0.886 0.954 0.961 0.938 0.944 0.946 0.941 0.944 0.969 0.944 A3(N) 0.705 0.734 0.639 0.583 0.674 0.603 0.567 0.639 0.565 0.541 0.625 A3(T) 0.874 0.894 0.838 0.857 0.852 0.826 0.806 0.924 0.823 0.853 0.855 A3(S) 0.953 0.925 0.956 0.962 0.954 0.948 0.947 0.967 0.958 0.970 0.954 Table 4. Averaged accuracies of classification C2 C1 notebook camera notebook mobile notebook printer notebook TV camera mobile camera printer camera TV mobile printer mobile TV printer TV Average A1(N) 0.992 0.961 0.991 0.967 0.985 0.969 0.821 0.991 0.690 0.957 0.932 A1(T) 0.934 0.985 0.996 0.984 0.990 0.989 0.943 0.991 0.884 0.990 0.969 A1(S) 0.994 0.971 0.991 0.998 0.997 0.996 0.986 0.999 0.985 0.997 0.991 In each experiment, we build a classifier based on training pages from two different classes, and then use the classifier to classify the test pages. We denote the two classes by C1 and C2, e.g., C1 may be camera and C2 may be notebook. Let the five Web sites be Site1, …, Site5. We experimented with three configurations of training and test sets from different Web sites: 1. TR = {C1(Sitei) and C2(Sitej)}, and TE = {all C1 and C2 pages except C1(Sitei) and C2(Sitej)}. This means that both classes of training pages are from the same Web site. The test pages are from the other sites. 2. TR = {C1(Sitei) and C2(Sitej)} (i ≠ j), and TE={all C1 and C2 pages except C1(Sitei), C2(Sitei), C1(Sitej) and C2(Sitej)}. This means that we use C1 pages from Sitei and C2 pages from Sitej (i ≠ j) for training and test on the C1 and C2 pages in the other three sites. 3. TR = {C1(Sitei) and C2(Sitej)} (i ≠ j), and TE = {all C1 and C2 pages except those pages in TR}. This means that we use C1 pages from Sitei and C2 pages from Sitej (i ≠ j) for training and test on the C1 and C2 pages in all five sites without the training pages. We tried all possible two class combinations of the 5 sites for the three configurations. Table 3 and Table 4 respectively show the average F scores and the average accuracies of the three configurations before and after cleaning. In Table 3 and Table 4, A2(N) 0.932 0.805 0.861 0.745 0.823 0.797 0.722 0.819 0.659 0.755 0.792 A2(T) 0.966 0.940 0.973 0.937 0.913 0.947 0.922 0.957 0.843 0.939 0.934 A2(S) 0.976 0.941 0.979 0.977 0.973 0.976 0.974 0.992 0.971 0.980 0.974 Fi (i = 1, 2, 3) and Ai (i = 1, 2, 3) respectively denote the average F score and accuracy of classification under the i-th configuration. The average F scores (or accuracies) are computed by averaging the F scores (or accuracies) of all possible two class combinations within 5 sites according to different configurations. Note that since there are no TV pages in PCMag and ZDnet sites, so we only averaged the results from those possible experiments. Again, in Table 3 and Table 4, N stands for no cleaning, T stands for cleaning using the template method, and S stands for the proposed method. From these two tables we can see that cleaning in general improves F score and accuracy in all cases. We also observe that in almost all cases the improvements made by our method are more significant than those made by the template based method. Below, we discuss the results of each configuration. • In the first configuration, since site specific noisy items occur in both C1 and C2 training data, the NB technique is able to discount them to a large extent. Thus, even without cleaning the classification results are reasonable. However, cleaning still improves the classification results significantly. • In the second configuration, cleaning makes a major difference because the noisy items in C1 and C2 training data (they are from different Web sites) are quite different, which confuses NB. The proposed SST based method also outperforms the template based method in this configuration. • In the third configuration, our SST technique still performs very well. However, the results produced by the template based method become significantly worse. The reason is that the test set includes pages from the same sites as the training sets. Since the template based method often under-cleans the pages (see the detailed discussion below), the pages from the same site are still more like each other although they may belong to different classes. We now explain why the template based method is not as effective as the SST based method. The template based method defines a pagelet as follows: An HTML element in the parse tree of a page is a pagelet if (1) none of its children contains at least k hyperlinks; and (2) none of its ancestor elements is a pagelet [2]. The HTML elements in the parse tree are actually the tag nodes in our DOM tree. The template based method gives the best results on average when we set k = 3 (which is the same as that given in [2]). However, since the granularity of partitioning the Web page completely depends on the number of linkages in HTML elements, the partitioning result may not coincide with the natural partitions of the Web page in question. This can result in under cleaning due to pagelets that are too large. For example, for k = 3, segment 2 in Figure 1 is a pagelet P after partitioning. It is obvious that most product pages from PCMag site have similar pagelets like P. The words “Home” and “Product Guides” in this pagelet are actually not useful for mining in our case. However, the pagelet P will not be removed because its content (together with the words “Printer” and “Samsung ML-1430”) are different from the pagelet contents in other Web pages. In our SST based method, segment 2 is a tag node T in the DOM tree of the page in Figure 1. In the SST of the PCMag site, similar tag nodes in the rest of the Web pages will be grouped together with T to form a leaf element node E in the SST. Within the element node E, the words “Home” and “Product Guides” are very likely to be identified as noises because they appear too frequently in E although the element node E is meaningful as a whole. The template based method may also result in excessive cleaning due to pagelets that are very small. Small pagelets tend to catch the idiosyncrasy of the pages and thus may result in removal of too much information from the pages because the template based method considers any repeating pagelet as noise. In contrast, our SST based method does not have these problems because it captures the natural layout of a Web site, and it also considers the importance of actual content features within the context of their host element nodes in SST. Execution time: In our experiments, we randomly sample 500 Web pages from each given Web site to build its SST. The time taken to build a SST is always below 20 seconds. The process of computing composite importance can always be finished in 2 seconds. The final step of cleaning each page takes less than 0.1 second. All our experiments were conducted on a Pentium 4 1.6GHz PC with 256 MB memory. 5. CONCLUSION In this paper, we proposed a technique to clean Web pages for Web data mining. Observing that the Web pages in a given Web site usually share some common layout or presentation styles, we propose a new tree structure, called Style Tree (ST) to capture those frequent presentation styles and actual contents of the Web site. The site style tree (SST) provides us with rich information for analyzing both the structures and the contents of the Web pages. We also proposed an information based measure to evaluate the importance of element nodes in SST so as to detect noises. To clean a page from a site, we simply map the page to its SST. Our cleaning technique is evaluated using two data mining tasks. Our results show that the proposed technique is highly effective. 6. ACKNOWLEDGEMENT Bing Liu was supported in part by the National Science Foundation (NSF) under the NSF grant IIS-0307239. 7. REFERENCES [1] Anderberg, M.R. Cluster Analysis for Applications, Academic Press, Inc. New York, 1973. [2] Bar-Yossef, Z. and Rajagopalan, S. Template Detection via Data Mining and its Applications, WWW 2002, 2002. [3] Beeferman, D., Berger, A. and Lafferty, J. A model of lexical attraction and repulsion. ACL-97, 1997. [4] Beeferman, D., Berger, A. and Lafferty, J. Statistical models for text segmentation. Machine learning, 34(1-3), 1999. [5] Broder, A., Glassman, S., Manasse, M. and Zweig, G. Syntactic clustering of the Web, Proceeding of WWW6, 1997. [6] Chakrabarti, S. Mining the Web: Discovering Knowledge from Hypertext Data. Morgan Kaufmann, 2002. [7] Cooley, R., Mobasher, B. and Srivastava, J. Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information Systems, (1) 1, 1999. [8] Davision, B.D. Recognizing Nepotistic links on the Web. Proceeding of AAAI 2000. [9] Han, J. and Chang, K. C.-C. Data Mining for Web Intelligence, IEEE Computer, Nov. 2002. [10] Jushmerick, N. Learning to remove Internet advertisements, AGENT-99, 1999. [11] Kao, J.Y., Lin, S.H. Ho, J.M. and Chen, M.S. Entropy-based link analysis for mining web informative structures, CIKM 2002. [12] Kleinberg, J. Authoritative Sources in a Hyperlinked Environment. ACM-SIAM Symposium on Discrete Algorithms, 1998. [13] Lee, M.L., Ling, W. and Low, W.L. Intelliclean: A knowledge-based intelligent data cleaner. KDD-2000, 2000. [14] Lewis, D. and Gale, W. A sequential algorithm for training text classifiers. Proceedings of SIGIR, 1994. [15] McCallum, A. and Nigam, K. A comparison of event models for naïve Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization. AAAI Press, 1998. [16] Nahm, U.Y., Bilenko, M. and Mooney R.J. Two Approaches to Handling Noisy Variation in Text Mining. ICML-2002 Workshop on Text Learning, 2002 [17] Shian-Hua Lin and Jan-Ming Ho. Discovering Informative Content Blocks from Web Documents, KDD-02, 2002. [18] Yang, Y. and Pedersen, J.O. A comparative study on feature selection in text categorization. ICML-97, 1997.