A CRF-based Feature Selection Algorithm for Web Information Extraction

Shandian ZHE1, Yan GUO1, Tian XIA1, Xueqi CHENG1,†
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
† Corresponding author. Email addresses: zheshandian@software.ict.ac.cn (S. Zhe), guoyan@ict.ac.cn (Y. Guo), xiatian@ict.ac.cn (T. Xia), cxq@ict.ac.cn (X. Cheng).

Abstract: Web information extraction plays a key role in many applications, such as information retrieval and knowledge base construction. However, current research in this field mainly focuses on directly improving traditional metrics such as precision and recall; the contributions of individual features are not treated separately and therefore remain largely unexplored. In addition, when building extraction tools, developers lack a principled way to evaluate features and can only rely on subjective judgment or personal experience, which is unreliable and inefficient. This paper proposes a novel feature selection algorithm for web information extraction. The algorithm uses a CRF model to evaluate feature importance and recursively eliminates unimportant features, starting from the full feature set. The major benefit of using CRFs is that features describing dependencies between extraction objects can be evaluated directly, together with all other features, and feature importance can be measured naturally by the model parameters. The algorithm is examined on two typical web extraction tasks: title and article extraction from news and blog pages. Compared with baseline methods, the proposed algorithm eliminates feature redundancy automatically and yields a better and more compact feature set. In particular, features describing positional dependencies between extraction objects are shown to be important, which further demonstrates the advantage of the proposed algorithm.

Keywords: Feature Selection; CRF; Web Information Extraction

1. Introduction

The web has become a huge and valuable resource along with its rapid growth, and web information extraction has therefore received much attention. There have been broad studies on various web extraction problems [1][2][3]. However, these studies mainly focus on directly improving traditional metrics such as precision and recall; the contributions of individual features are seldom treated separately and are thus often ignored or not fully explored.

We believe that evaluating and selecting important features can benefit web information extraction. On one hand, the effectiveness of an extraction algorithm comes from two aspects: effective features and a smart extraction process. If we can determine the contributions of the features being used, we can assess the effectiveness of the extraction process separately, so the algorithm can be developed and improved more easily. For example, to develop an extraction algorithm, we can first incorporate a rich set of candidate features, select an effective feature subset, and then focus on building a smart extraction process, rather than mixing these steps together. On the other hand, lacking an effective solution for feature evaluation and selection, developers of web extraction tools can only choose features based on subjective judgment or personal experience, which is unreliable and easily makes the development process inefficient. A typical example is that developers usually have to keep the old features when incorporating new ones to avoid degrading extraction quality. However, this often leads to a final extraction tool that takes too much memory or runs too slowly because so many features are included.
The developers then have to spend extra time on program optimization. In contrast, if developers can effectively evaluate and select features each time new ones are incorporated, performance can be maintained: important features are retained while redundant or useless ones are removed promptly.

Related work. Previous work has proposed a series of feature selection algorithms for different domains, which can be classified into two categories: filter methods and wrapper methods. Filter methods [4][5][6] use statistical criteria to measure a feature's importance, such as information gain, KL-divergence, and inter-class distances. Wrapper methods [7][8][9] evaluate features with a classification model, such as an SVM or a neural network. Because training a classifier aims purely at better classification performance, the features are explored more deeply, and their contributions can be evaluated more precisely than with simple statistical criteria. In addition, filter methods usually treat features independently, whereas in practical applications features often interact; wrapper methods, by contrast, usually evaluate a set of features simultaneously, so such interactions can be taken into account. Therefore, wrapper methods are generally expected to yield better results than filter methods, although repeatedly training classifiers may take much more time than the one-pass computation of statistics in filter methods.

This paper proposes a novel feature selection algorithm based on CRFs [10] for web information extraction. As discriminative graphical models, CRFs have been successfully applied in many domains [10][15][16][17][18][19][20] and have proven able to fully exploit diverse features to achieve high performance. In particular, CRFs can directly incorporate dependencies between extraction objects as features, so these dependencies can be evaluated consistently together with all other features, whereas classifiers such as SVMs, which assume the classification variables are independent, cannot. Such dependencies are quite useful for web information extraction, since practical applications often require extracting multiple objects from one page simultaneously, and their dependencies can naturally be exploited by extraction algorithms. For example, when extracting titles and articles from news pages, titles appear before articles, so once a piece of article text has been extracted, none of the following texts can be the news title. Besides, since CRFs find the best assignment of the variables by maximizing a linear combination of features [10], a feature's weight naturally reflects its influence on the assignment and hence on the classification result. Feature importance can therefore be measured naturally and conveniently by the feature weights, i.e., the model parameters. Owing to these advantages, and although CRFs have rarely been used for feature selection, we believe they can benefit feature selection for web information extraction. The proposed algorithm starts from the whole feature set and, in each iteration, trains a CRF model to identify and remove the most unimportant features.
Experiments on two typical problems, title and article extraction from news and blog pages, show that the algorithm yields a more compact feature set as well as better extraction quality than a series of baseline methods. In particular, the dependency features are shown to be important (they are preserved in the final best feature set, and better extraction results are obtained with them), which further demonstrates the advantage of the algorithm.

Our main contribution is that we point out the value of feature selection for web information extraction and propose a novel CRF-based feature selection algorithm. The algorithm can effectively evaluate all kinds of features (especially those describing dependencies between extraction objects) and yields a better and more compact result than baseline methods.

The rest of the paper is organized as follows: Section 2 gives a brief introduction to CRFs; Section 3 presents the CRF-based feature selection algorithm; Section 4 lists the features used in the experiments; Section 5 reports the experiments; Section 6 concludes the paper.

2. Conditional random fields (CRFs)

2.1. Preliminaries

Conditional random fields (CRFs) [10] are graphical models developed to model the conditional probability distribution of hidden variables given observed variable values. Let G = (V, E) be an undirected graph over a set of random variables X and Y, where X are observed and Y are hidden variables to be labeled. Any structure among Y may be used to indicate their probabilistic dependencies. The conditional distribution of Y given X generally takes the form

p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{c \in C} \lambda^T f_c(x_c, y_c) \Big\},   (2.1)

where C is the set of cliques of G; x_c and y_c are the components of X and Y associated with clique c; f_c is the vector of feature functions defined on clique c; λ is the vector of feature weights, i.e., the model parameters; and Z(x) is the partition function for normalization. Note that a clique c may contain multiple variables, so their dependencies can be defined directly in the features. Thus, unlike other classifiers that assume the classification variables are independent, CRFs drop this assumption and allow the incorporation of more potentially useful information. This is the main advantage underlying our CRF-based feature selection algorithm, as described in Section 3.

2.2. Learning parameters

Given training data, learning the parameters means finding a weight vector λ that optimizes the log-likelihood function. To avoid over-fitting, a spherical Gaussian prior with zero mean is usually introduced as a penalty, so the penalized log-likelihood is

L(\lambda) = \sum_{c \in C} \lambda^T f_c(x_c, y_c) - \log Z(x) - \frac{\lambda^T \lambda}{2\sigma^2}.   (2.2)

This concave function can be maximized by numerical gradient algorithms. We choose L-BFGS [11] for its outstanding performance compared with other algorithms [12]. Each component of the gradient vector can be calculated from (2.2):

\frac{\partial L(\lambda)}{\partial \lambda_k} = f_k(x, y) - E_{p(y \mid x, \lambda)}[f_k(x, y)] - \frac{\lambda_k}{\sigma^2},   (2.3)

where E_{p(y|x,λ)}[f_k(x, y)] is the expectation of the feature function f_k with respect to the current model distribution. The main cost in training is computing the partition function Z(x). The junction tree algorithm [13] can do this computation exactly and efficiently.
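For illustration only, the following sketch shows how a linear-chain CRF of this kind could be trained with L-BFGS and an L2 (Gaussian-prior) penalty and how the learned weights can be inspected. It relies on the third-party sklearn-crfsuite package and toy feature names, neither of which is part of this paper's implementation.

```python
# A minimal sketch (not the paper's implementation) of training a linear-chain CRF
# with L-BFGS and an L2 (Gaussian-prior) penalty, then reading the learned weights.
# Assumes the third-party package `sklearn-crfsuite`; feature names are illustrative.
import sklearn_crfsuite

# One "sequence" = the text nodes of one page, in document order.
X_train = [
    [  # page 1: three text nodes, each described by a feature dictionary
        {"tag_before=p": 1.0, "font_level_1": 1.0, "dom_depth": 0.3},
        {"tag_before=p": 1.0, "contains_because": 1.0, "dom_depth": 0.6},
        {"tag_before=li": 1.0, "dom_depth": 0.9},
    ],
]
y_train = [["Title", "Article", "Noise"]]  # gold labels for page 1

crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",   # L-BFGS optimization, as in Section 2.2
    c2=1.0,              # L2 penalty weight, playing the role of 1/(2*sigma^2)
    max_iterations=100,
)
crf.fit(X_train, y_train)

# Feature importance can be read off the learned weights (Section 3.2):
# state_features_ maps (feature, label) -> weight; transition_features_ maps
# (label_from, label_to) -> weight, i.e. dependency features as in Section 4.
for (feat, label), w in sorted(crf.state_features_.items(),
                               key=lambda kv: -abs(kv[1]))[:5]:
    print(f"{feat:>20s} -> {label:<8s} weight={w:+.3f}")
print(crf.transition_features_)
```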
2.3. Finding the Most Likely Assignment

Given the observations x and a trained CRF, we aim to find the most probable configuration of y. This can be done by the junction tree algorithm with a small modification: we only need to replace the summation operations with max in the belief propagation phase. The most likely assignment of each variable can then be read off from any clique containing it. Details can be found in [13].

3. CRF-based feature selection algorithm

3.1 Extraction and classification

Although there are various web extraction problems, all of them essentially amount to picking elements, i.e., DOM tree elements, from web pages. The extraction process can therefore be treated as classifying DOM tree elements into the types of interest or noise. For example, to extract titles and articles from news pages, we can classify each text node of the corresponding DOM tree as Title, Article, or Noise. Nodes labeled Title or Article are then returned as extraction results. In this way we can incorporate various features through a classifier, and this is the first step of feature selection for web information extraction.

3.2 The advantages of utilizing CRFs

We employ CRFs for feature evaluation and selection in web information extraction. The main advantages are:

(1) The capability of incorporating broad features, especially those describing dependencies between classification variables. Since one often needs to extract multiple objects from one page and their dependencies are often useful for extraction (e.g., in news pages articles come after titles, and comments after articles), such dependencies can be incorporated directly as features in a CRF (Section 2.1) and thus evaluated together with all other features, whereas classifiers (e.g., SVMs) that assume the classification variables are independent cannot do this.

(2) A direct and natural way to measure feature importance. To get the best value assignment (Section 2.3), we need to find a configuration that maximizes the linear combination of features in the numerator of Formula 2.1. A feature weight with a larger absolute value clearly has a stronger influence, so the corresponding feature is more important. Linear classifiers such as SVMs have a similar property, and in [7] the SVM feature weights are also used for evaluation. However, when a non-linear kernel function is applied, the feature vectors are mapped into a higher-dimensional space [14], so the weights in the original space may no longer indicate feature importance appropriately.

(3) Excellent performance. CRFs have been applied in many areas, such as NLP [10][15], computer vision [16][17], web page classification [18], and information extraction [19][20], and have shown excellent performance, so the contributions of various features can be explored deeply and measured fully.

3.3 The feature selection algorithm

Owing to the advantages described in Section 3.2, we propose our CRF-based feature selection algorithm, shown in Table 3.1. The algorithm adopts the RFE (Recursive Feature Elimination) strategy [7]: in each iteration the k most unimportant features are removed, and the remaining features are evaluated together by a CRF again. Such a backward process preserves potential dependencies among features that may affect classification, so a feature set can be evaluated objectively. The parameter k lets users adjust the selection granularity: a large k reduces the number of iterations and thus the running time, but important features may also be removed incorrectly. The parameters n and t further help users balance computing cost against selection quality. Since web extraction needs are diverse, the criterion for judging the best feature set is left to the user (evaluation function F), and a group of candidate feature sets is generated, partly bounded by the threshold α: when removing some features leads to a significant decrease in classification precision (Step 4), the remaining features are all more important (larger absolute weights), so we conclude that no more unimportant features remain and stop further training iterations at this bound.

Table 3.1 CRF-based Feature Selection Algorithm
Input: precision decrease threshold α, minimum feature number n, maximum iteration number t, feature number decrement k, evaluation function F
Output: effective feature subset S, feature importance ranking R
1: Prepare the training dataset and perform initialization (e.g., normalize the features into [0,1]); let S1 be the set of all features, i = 1
2: Train a CRF on the dataset with feature set Si and obtain the feature weights λi
3: Evaluate Si by F and calculate the classification precision pi (via cross-validation)
4: if pi-1 - pi > α then go to Step 10
5: i++
6: if i == t then go to Step 10
7: Rank Si by the absolute values of λi and remove the k features with the smallest values to obtain Si+1
8: if |Si+1| < n then go to Step 10
9: Go to Step 2
10: Select the feature set S with the maximum value of F over all iterations above
11: Rank S by the absolute values of λ and recursively append the ranks of the features removed in earlier iterations according to the corresponding λi
12: Output S, R
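For illustration, the selection loop of Table 3.1 can be sketched in code as follows. Here train_crf, evaluate, and precision are hypothetical user-supplied helpers standing in for Steps 2 and 3; this sketch is not the implementation used in our experiments, and Step 11 (the full ranking R) is omitted.

```python
# A minimal sketch of the selection loop in Table 3.1 (not the paper's implementation).
# `train_crf`, `evaluate` and `precision` are hypothetical helpers: train_crf returns
# one importance weight per feature, evaluate plays the role of F, precision of p_i.

def crf_feature_selection(features, train_crf, evaluate, precision,
                          alpha=0.1, n=10, t=100, k=10):
    candidates = []                       # (F value, feature set, weights) per iteration
    S, prev_p = list(features), None
    for _ in range(t):                                  # Step 6: at most t iterations
        weights = train_crf(S)                          # Step 2: CRF weight per feature in S_i
        score, p = evaluate(S), precision(S)            # Step 3: F value and p_i
        candidates.append((score, list(S), weights))
        if prev_p is not None and prev_p - p > alpha:   # Step 4: precision dropped too much
            break
        prev_p = p
        ranked = sorted(S, key=lambda f: abs(weights[f]))  # Step 7: rank by |weight|
        S = ranked[k:]                                  # drop the k smallest-weight features
        if len(S) < n:                                  # Step 8: too few features left
            break
    # Step 10: choose the candidate set with the best F value.
    best_score, best_set, best_weights = max(candidates, key=lambda c: c[0])
    return best_set, best_weights
```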
4. Features for title and article extraction

We apply the CRF-based feature selection algorithm to title and article extraction: each text node of a DOM tree is classified as Title, Article, or Noise, and we propose 1200 features. The features mainly come from the HTML source code, the DOM tree, the visual rendering of the page, and a set of keywords. In addition, to verify the effectiveness of dependencies between extraction objects, we introduce position dependency features. They are defined as binary indicator functions, i.e., f(y_i, y_{i+1}) = I_{u,v}(y_i, y_{i+1}), where y_i and y_{i+1} are the types of two text nodes whose texts appear in succession; I_{u,v}(x, y) equals 1 if u = x and v = y, and 0 otherwise; u and v iterate over the range of y. The feature distribution is shown in Table 4.1.

Table 4.1 Distribution of feature types

Type           Number   Description of example features
Html source    258      whether <p> appears before the text and the type is Title (Article, Noise);
                        whether <li> appears before the text and the type is Title (Article, Noise)
DOM tree       300      whether <div> is the text's parent and the type is Title (Article, Noise);
                        number of sibling nodes of the text and the type is Title (Article, Noise);
                        whether <p> is a sibling of the text's parent and the type is Title (Article, Noise);
                        the depth from <body> to the text and the type is Title (Article, Noise)
Vision-based   363      whether the text's font size is in level 1 (of 7 levels) and the type is Title (Article, Noise);
                        whether the text is bold (weighted) and the type is Title (Article, Noise);
                        whether the text's color is blue and the type is Title (Article, Noise);
                        relative x(y)-coordinate of the text on the browser screen and the type is Title (Article, Noise)
Keywords       270      whether the text contains 'because' and the type is Title (Article, Noise);
                        whether the text contains 'so that' and the type is Title (Article, Noise)
Dependency     9        whether the type is Title and the type of the following text is Article
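To make the feature types of Table 4.1 concrete, the following sketch shows how one DOM text node could be mapped to a feature dictionary and how the position dependency indicators could be written. All field and feature names are illustrative assumptions rather than the actual feature definitions used in the experiments.

```python
# A small, illustrative sketch (not the paper's code) of building features for one
# DOM text node, following the feature types of Table 4.1. Names are hypothetical.
LABELS = ("Title", "Article", "Noise")

def node_features(node):
    """node: a dict describing one DOM text node (fields here are hypothetical)."""
    return {
        # HTML source features
        "tag_before=" + node["tag_before"]: 1.0,                 # e.g. 'p', 'li'
        # DOM tree features
        "parent_tag=" + node["parent_tag"]: 1.0,                 # e.g. 'div'
        "num_siblings": min(node["num_siblings"] / 20.0, 1.0),   # normalized into [0,1]
        "depth_from_body": min(node["depth"] / 30.0, 1.0),
        # Vision-based features (from the rendered page)
        "font_level=" + str(node["font_level"]): 1.0,            # levels 1..7
        "is_bold": float(node["bold"]),
        "rel_x": node["rel_x"],                                  # relative x-coordinate in [0,1]
        # Keyword features
        "contains_because": float("because" in node["text"]),
    }

def dependency_feature(y_i, y_next, u, v):
    """Position dependency indicator I_{u,v}(y_i, y_{i+1})."""
    return 1.0 if (y_i == u and y_next == v) else 0.0

# With three labels there are 3 x 3 = 9 dependency indicators, matching Table 4.1.
all_dependency_pairs = [(u, v) for u in LABELS for v in LABELS]
# e.g. the feature 'the type is Title and the following type is Article'
print(dependency_feature("Title", "Article", u="Title", v="Article"))  # 1.0
```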
5. Experiment

5.1 Dataset

We crawled 3,365 pages in total from the news channels of 31 websites and 3,032 pages from the blog channels of 10 websites. These sites, such as Sina, Sohu, and Netease, are popular and well known in China. We then sampled 660 news pages and 618 blog pages for labeling and training. The dataset covers various topics, such as politics, economics, and entertainment, with no particularly dominant area.

5.2 Baseline Methods and Selection Results

In the experiments, we define the evaluation function as the weighted average F1 of classifying Title and Article:

Eval(s) = \alpha \cdot F1(\mathrm{Title}) + (1 - \alpha) \cdot F1(\mathrm{Article}),   (5.1)

where α is set to 0.5 so that titles and articles are considered equally important. Users can choose a different α according to their particular needs. Since other classifiers such as SVMs cannot evaluate the dependency features (Section 4), we use CRFs with 3-fold cross-validation to compute F1 on a feature set s.

We choose four baselines for comparison:

IG-RFE. A filter-method baseline: features are ranked by their information gain with respect to the classification variables (the higher the gain, the higher the rank); in each iteration the last k features in the ranking are removed and the remaining ones are evaluated by Formula 5.1. This process repeats until no features are left.

SVM-RFE. The method proposed in [7], which takes similar steps to our algorithm; the only difference is that in each iteration the features are evaluated and ranked by an SVM.

Perceptron-RFE. The perceptron [21] is another widely used linear classifier, and we additionally use its weight vector to evaluate and rank features as a baseline, analogously to SVM-RFE.

CRF-single. Since none of the above methods can directly incorporate the dependency features of Section 4, for a fair comparison and to further examine their importance, we also run our algorithm without these features as a baseline.

We denote our algorithm starting from all the features in Section 4 as CRF-all. For all methods, we set the precision decrease threshold α = 0.1 and choose k = 10 for its fine granularity. Note that, to keep the feature counts consistent, for each baseline the dependency features are set to 0, so that all methods start from 1200 features.

Figures 5.1 and 5.2 show the average F1 of all feature sets during the selection process on news and blog pages. CRF-all and CRF-single nearly always select the best feature sets, which demonstrates the advantage of our algorithm. The even better results of CRF-all further confirm the importance of the position dependency features of Section 4. Among the other methods, SVM-RFE is slightly inferior to CRF-single, possibly because features favored by an SVM are not the ones best suited to a CRF. Moreover, since an SVM cannot directly evaluate the dependency features, which have just been shown to be important, it is less advantageous than CRFs for feature selection in web extraction. Perceptron-RFE approaches SVM-RFE on news pages (Fig. 5.1), but on blog pages (Fig. 5.2) the gap becomes more pronounced. The reason may be that the perceptron is not a stable classifier [21] and on average is much inferior to stable classifiers such as the SVM. IG-RFE is the worst of all, possibly because simple information gain is quite far from a feature's real contribution to extraction, so important features are easily missed.
[Fig. 5.1 Feature selection for news pages: average F1 vs. number of features (1200 down to 0) for CRF-all, CRF-single, SVM-RFE, Perceptron-RFE, and IG-RFE]

[Fig. 5.2 Feature selection for blog pages: average F1 vs. number of features (1200 down to 0) for the same five methods]

All methods' results are shown in Table 5.1, where the last two rows list the original feature sets (all features: All; all except the dependency features: No dependency). CRF-all yields the most compact feature set together with the highest average F1, followed by CRF-single. For the last two rows, our statistical test shows that the differences in average F1 are significant for both news and blogs, as are the differences in average F1 between CRF-all and CRF-single. The effect of the dependency features is thus confirmed again, and our algorithm's advantage of evaluating broad features is demonstrated a second time.

Table 5.1 Feature selection results (news/blogs)

Methods          Feature Number   F1(Title)        F1(Article)      Average F1
CRF-all          90/120           0.9529/0.8825    0.9257/0.8659    0.9393/0.8742
CRF-single       90/150           0.9467/0.8551    0.8509/0.8321    0.8988/0.8436
SVM-RFE          120/210          0.9123/0.8526    0.8593/0.8268    0.8858/0.8397
Perceptron-RFE   400/750          0.9289/0.8266    0.8301/0.7622    0.8795/0.7944
IG-RFE           570/1100         0.8450/0.8485    0.8512/0.7387    0.8481/0.7936
All              1200/1200        0.9252/0.8421    0.8850/0.8461    0.9051/0.8441
No dependency    1191/1191        0.8844/0.8519    0.8472/0.7341    0.8658/0.7930

Furthermore, most of the methods greatly reduce the feature set while also improving the average F1. The benefit of feature selection thus confirms the value of our work.

6. Conclusions and future work

This work was motivated by the process of building a blog content extraction tool: we had collected many features but lacked an effective way to choose the important ones. We then realized the value of feature selection for web extraction and sought a solution. In future work, we would like to use approximate algorithms for training CRFs, since they can greatly reduce training time and thus accelerate the selection process, and we plan to examine their feasibility on several typical problems.

References

[1] Y. Hu, G. Xin, R. Song, et al. Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval. In Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR '05), Salvador, Brazil, ACM, 2005, pp. 250-257.
[2] B. Liu, R. Grossman, Y. Zhai. Mining Data Records in Web Pages. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD '03), ACM, 2003, pp. 601-606.
[3] J. Zhu, Z. Nie, J. Wen, et al. Simultaneous Record Detection and Attribute Labeling in Web Data Extraction. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD '06), ACM, 2006, pp. 494-503.
[4] I. Inza, P. Larranaga, R. Blanco, et al. Filter Versus Wrapper Gene Selection Approaches in DNA Microarray Domains. Artificial Intelligence in Medicine, 2004, Vol. 31(2), pp. 91-103.
[5] X. Zhou, X. Wang, E. R. Dougherty. Nonlinear Probit Gene Classification Using Mutual Information and Wavelet-based Feature Selection. Biological Systems, 2004, Vol. 12(3), pp. 371-386.
[6] X. Zhou, X. Wang, E. R. Dougherty. Gene Selection Using Logistic Regression Based on AIC, BIC and MDL Criteria. Journal of New Mathematics and Natural Computation, 2005, Vol. 1(1), pp. 129-145.
[7] I. Guyon, J. Weston, S. Barnhill, et al. Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning, 2002, Vol. 46(1-3), pp. 389-422.
[8] A. Verikas, M. Bacauskiene. Feature Selection with Neural Networks. Pattern Recognition Letters, 2002, Vol. 23(11), pp. 1323-1335.
[9] J. Weston, S. Mukherjee, O. Chapelle, et al. Feature Selection for SVMs. Advances in Neural Information Processing Systems, Vol. 13, Cambridge, USA, 2001, pp. 668-674.
[10] J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001, pp. 282-289.
[11] D. C. Liu, J. Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 1989, Vol. 45, pp. 503-528.
[12] R. Malouf. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of COLING '02, Taipei, August 2002, pp. 49-55.
[13] R. G. Cowell, A. P. Dawid, S. L. Lauritzen, D. J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999, Chapter 3.
[14] B. E. Boser, I. Guyon, V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM Press, 1992, pp. 144-152.
[15] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields. In Proceedings of Human Language Technology-NAACL, 2003.
[16] R. Hariharan, K. Toyama. Project Lachesis: Parsing and Modeling Location Histories. In Geographic Information Science, 2004.
[17] S. Kumar, M. Hebert. Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification. In Proceedings of the International Conference on Computer Vision, 2003.
[18] B. Taskar, P. Abbeel, D. Koller. Discriminative Probabilistic Models for Relational Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2002.
[19] T. Kristjansson, A. Culotta, P. Viola, A. McCallum. Interactive Information Extraction with Constrained Conditional Random Fields. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2004.
[20] F. Peng, A. McCallum. Accurate Information Extraction from Research Papers Using Conditional Random Fields. In HLT-NAACL, 2004.
[21] S. I. Gallant. Perceptron-based Learning Algorithms. IEEE Transactions on Neural Networks, 1990, Vol. 1, pp. 179-191.