A CRF-based Feature Selection Algorithm for Web Information Extraction
Shandian ZHE1, Yan GUO1, Tian XIA1, Xueqi CHENG1,†
1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
† Corresponding author. Email addresses: zheshandian@software.ict.ac.cn (S. Zhe), guoyan@ict.ac.cn (Y. Guo), xiatian@ict.ac.cn (T. Xia), cxq@ict.ac.cn (X. Cheng).
Abstract: Web information extraction plays a key role in many applications, such as information retrieval and knowledge base construction. However, current research in this field mainly focuses on directly improving traditional metrics such as precision and recall; the contributions of individual features are not treated separately and are therefore rarely explored in depth. In addition, due to the lack of solutions for feature evaluation, developers of extraction tools can only judge features through subjective cognition or personal experience, which is unreliable and inefficient. This paper proposes a novel feature selection algorithm for web information extraction. The algorithm leverages the CRF model to evaluate feature importance and recursively eliminates unimportant features, starting from the whole feature set. The major benefit of utilizing CRFs is that features describing dependencies between extraction objects can be evaluated directly together with other features, and feature importance can be measured naturally by the model parameters. The algorithm is examined on two typical web extraction tasks: title and article extraction from news and blog pages. Compared with baseline methods, the proposed algorithm eliminates feature redundancy automatically and yields a better and more compact feature set. In particular, the features describing position dependencies between extraction objects are shown to be important, which further demonstrates the advantage of the proposed algorithm.
Keywords: Feature Selection; CRF; Web Information Extraction
1. Introduction
With its rapid growth, the web has become a huge and valuable resource, and web information extraction has consequently received much attention. There have been broad studies on various web extraction problems [1][2][3]. However, these studies mainly focus on directly improving traditional metrics such as precision and recall; the contributions of individual features to extraction are rarely treated separately and are thus often ignored or not fully explored.
We believe that evaluating and selecting important features can benefit web information extraction. On one hand, the effectiveness of an extraction algorithm may come from two aspects: effective features or a smart extraction process. If we can determine the contributions of the leveraged features, we can then assess the effectiveness of the extraction process separately, so the algorithm can be developed and improved more easily. For example, to develop an extraction algorithm, we can first incorporate a rich set of candidate features, select an effective feature subset, and then focus on building a smart extraction process, rather than mixing these steps together. On the other hand, due to the lack of an effective solution for feature evaluation and selection, developers of web extraction tools can only choose features through subjective cognition or personal experience, which is unreliable and easily makes the development process inefficient. A typical example is that developers usually have to preserve old features when incorporating new ones, to avoid degrading extraction quality. However, this often leads to a final extraction tool that takes too much memory or runs too slowly because so many features are included, and the developers then have to spend extra time on program optimization. In contrast, if developers can effectively evaluate and select features each time new ones are incorporated, performance can be guaranteed, since important features are retained and redundant or useless ones are removed in time.
Related work. Previous work has proposed a variety of feature selection algorithms for different domains, which can be classified into two categories: filter methods and wrapper methods. Filter methods [4][5][6] measure a feature's importance with statistical information principles such as information gain, KL-divergence and inter-class distances. Wrapper methods [7][8][9] evaluate features with a classification model, such as an SVM or a neural network. Since training a classifier aims purely at better classification performance, the features are explored much more deeply for facilitating classification, so their contributions can be evaluated more precisely than with simple statistical principles. In addition, filter methods usually treat features independently, whereas in practical applications there are often interactions among different features. By contrast, wrapper methods usually evaluate a set of features simultaneously, so such interactions can be taken into account. Therefore, wrapper methods are generally expected to yield better results than filter methods, although repeatedly training classifiers may take much more time than the one-pass computation of information statistics in filter methods.
This paper proposes a novel feature selection algorithm based on CRFs [10] for web information extraction. As discriminative graphical models, CRFs have been successfully applied in many domains [10][15][16][17][18][19][20] and have proven able to fully utilize different features to achieve high performance. In particular, CRFs can directly incorporate dependencies between extraction objects as features, so these dependencies can be evaluated consistently together with other features, whereas classifiers such as SVMs, which assume the classification variables to be independent, cannot. Such dependencies are quite useful for web information extraction, since practical applications often require extracting multiple objects from one page simultaneously, and their dependencies can naturally be used in extraction algorithms. For example, when extracting titles and articles from news pages, titles appear before articles, so once a piece of article text has been extracted, none of the following texts can be the news title. Besides, since CRFs configure the best variable values by maximizing a linear combination of features [10], a feature's weight naturally reflects its influence on the value configuration and hence on the classification result. Feature importance can therefore be naturally reflected and conveniently measured by the feature weights, i.e. the model parameters.
Due to these advantages, and although selection algorithms leveraging CRFs are rare, we believe that CRFs can benefit feature selection for web information extraction. The proposed algorithm starts from the whole feature set and, in each iteration, trains a CRF model to pick out and remove the most unimportant features. Experiments on two typical problems, title and article extraction from news and blog pages, show that the algorithm yields a more compact feature set as well as better extraction quality than a series of baseline methods. In particular, the dependency features are shown to be important (they are preserved in the final best feature set, and better extraction quality is obtained with them), which further demonstrates the advantage of the algorithm.
Our main contribution is that we point out the value of feature selection in web information extraction and propose a novel CRF-based feature selection algorithm. The algorithm can effectively evaluate all kinds of features (especially those describing dependencies between extraction objects) and yields a better and more compact result than baseline methods.
The rest of the paper is organized as follows: Section 2 gives a brief introduction to CRFs; Section 3 presents the CRF-based feature selection algorithm; Section 4 lists the features used for experimentation; Section 5 presents the experiments; Section 6 concludes the paper.
2. Conditional random fields (CRFs)
2.1. Preliminaries
Conditional random fields (CRFs) [10] are graphical models developed to model the conditional probability distribution of hidden variables given observed variable values. Let G=(V,E) be an undirected graph over a set of random variables X and Y, where X are observed and Y are hidden and to be labeled. Any structure can be defined among Y to indicate their probabilistic dependencies. The conditional distribution of Y given X generally takes the following form:
p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{c \in C} \lambda^{T} f_c(x_c, y_c) \Big\},    (2.1)
where C is the set of cliques of G; x_c and y_c are the components of X and Y associated with clique c; f_c is the vector of feature functions defined on clique c; λ is the feature importance weight vector, i.e. the model parameters; and Z(x) is the partition function for normalization. Note that a clique c may contain multiple variables, so their dependencies can be defined directly in the features. Thus, unlike other classifiers that assume the classification variables to be independent, CRFs drop this assumption and allow the incorporation of more potentially useful information. This is the main motivation for constructing a CRF-based feature selection algorithm, as we will describe in Section 3.
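As a concrete illustration (our own worked special case, not part of the original formulation), for a linear chain of text-node labels y = (y_1, ..., y_T) the cliques are the consecutive pairs together with the nodes, and (2.1) specializes to

p(y \mid x) = \frac{1}{Z(x)} \exp\Big\{ \sum_{t=1}^{T} \lambda^{T} f(y_{t-1}, y_t, x, t) \Big\}, \qquad Z(x) = \sum_{y'} \exp\Big\{ \sum_{t=1}^{T} \lambda^{T} f(y'_{t-1}, y'_t, x, t) \Big\},

where each f(y_{t-1}, y_t, x, t) may contain both observation features of node t and transition features on the pair (y_{t-1}, y_t); the latter correspond to the consecutive-node dependency features introduced in Section 4.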
2.2. Learning parameters
Given the training data, learning the parameters means finding a weight vector λ that optimizes the log-likelihood function. To avoid over-fitting, a spherical Gaussian prior with zero mean is usually introduced as a penalty, so the penalized log-likelihood is written as:
L(\lambda) = \sum_{c \in C} \lambda^{T} f_c(x_c, y_c) - \log Z(x) - \frac{\lambda^{T}\lambda}{2\sigma^{2}}.    (2.2)
This concave function can be maximized (equivalently, its negative can be minimized) by numerical gradient algorithms. We choose L-BFGS [11] for its outstanding performance over other algorithms [12]. Each component of the gradient vector can be calculated from (2.2):
\frac{\partial L(\lambda)}{\partial \lambda_k} = f_k(x, y) - E_{p(y \mid x, \lambda)}\big[f_k(x, y)\big] - \frac{\lambda_k}{\sigma^{2}},    (2.3)
where E_{p(y|x,λ)}[f_k(x,y)] is the expectation of feature function f_k with respect to the current model distribution. The main cost in training is computing the partition function Z(x). The junction tree algorithm [13] can do this computation precisely and efficiently.
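To make the training objective concrete, the following is a minimal, self-contained sketch (our own illustration, not the paper's implementation) that maximizes the penalized log-likelihood (2.2) with L-BFGS for a toy linear-chain CRF over the three node types. Z(x) and the model expectations in (2.3) are computed by brute-force enumeration of label sequences, which is only feasible at toy scale; the paper relies on the junction tree algorithm [13] instead. The feature templates and data here are assumptions made purely for illustration.

```python
# Toy linear-chain CRF trained by maximizing the penalized log-likelihood (2.2)
# with L-BFGS; gradient follows (2.3). Brute-force enumeration replaces the
# junction tree algorithm used in the paper, so this only works at toy scale.
import itertools
import numpy as np
from scipy.optimize import minimize

LABELS = ["Title", "Article", "Noise"]

def features(x, y):
    """Global feature counts f(x, y) for one observation/label sequence pair.
    Two illustrative templates (not the paper's 1200 features):
    (a) label-observation indicators, (b) label-label transition indicators."""
    f = np.zeros(len(LABELS) * 2 + len(LABELS) ** 2)
    for t, (xt, yt) in enumerate(zip(x, y)):
        i = LABELS.index(yt)
        f[i * 2 + int(xt)] += 1.0                              # observation (xt is 0/1)
        if t > 0:
            j = LABELS.index(y[t - 1])
            f[len(LABELS) * 2 + j * len(LABELS) + i] += 1.0    # transition feature
    return f

def neg_penalized_loglik(lam, data, sigma2=10.0):
    """Negative of (2.2) and its gradient, the negative of (2.3), summed over data."""
    loss, grad = 0.0, np.zeros_like(lam)
    for x, y in data:
        all_y = list(itertools.product(LABELS, repeat=len(x)))
        feats = np.array([features(x, yy) for yy in all_y])
        scores = feats @ lam
        logZ = np.logaddexp.reduce(scores)          # log Z(x) by enumeration
        p = np.exp(scores - logZ)                   # p(y' | x, lambda)
        f_obs = features(x, y)
        loss -= f_obs @ lam - logZ
        grad -= f_obs - p @ feats                   # empirical minus expected counts
    loss += lam @ lam / (2 * sigma2)                # Gaussian prior penalty
    grad += lam / sigma2
    return loss, grad

if __name__ == "__main__":
    # Toy training set: each x holds one binary observation per text node.
    data = [([1, 0, 0], ["Title", "Article", "Noise"]),
            ([1, 0, 1], ["Title", "Article", "Noise"])]
    dim = len(LABELS) * 2 + len(LABELS) ** 2
    res = minimize(neg_penalized_loglik, np.zeros(dim), args=(data,),
                   jac=True, method="L-BFGS-B")
    print("learned weights:", np.round(res.x, 3))
```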
2.3. Finding the Most Likely Assignment
We aim to find the most probable configuration of y, given the observation set x with a trained CRF.
This can be done by the junction tree algorithm with a little modification: we only need to replace
the summation operations with max operations in the belief propagation phase. The most likely assignment of each variable can then be read off from the cliques containing it. Details can be found in [13].
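For the linear-chain case, this max-product modification can be sketched as Viterbi decoding (again our own illustrative code, not the paper's implementation); node_score and edge_score are assumed arrays of log-potentials, i.e. λ^T f evaluated for the node and transition features.

```python
# Viterbi decoding: the forward recursion of belief propagation with the sum
# replaced by a max, which yields the most likely label assignment.
import numpy as np

def viterbi(node_score, edge_score):
    """node_score: (T, L) array; edge_score: (L, L) array (prev label x next label)."""
    T, L = node_score.shape
    delta = np.zeros((T, L))
    back = np.zeros((T, L), dtype=int)
    delta[0] = node_score[0]
    for t in range(1, T):
        cand = delta[t - 1][:, None] + edge_score + node_score[t][None, :]
        back[t] = cand.argmax(axis=0)      # best previous label for each current label
        delta[t] = cand.max(axis=0)
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):          # backtrack to recover the full assignment
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Example with 3 labels (Title, Article, Noise) and 4 text nodes.
print(viterbi(np.random.randn(4, 3), np.random.randn(3, 3)))
```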
3. CRF-based feature selection algorithm
3.1 Extraction and classification
Although there are various web extraction problems, all of them essentially amount to picking elements from web pages, i.e. DOM tree elements. Thus, the extraction process can be treated as classifying DOM tree elements into the types of interest or noise. For example, to extract titles and articles from news pages, we can classify each text node of the corresponding DOM tree as Title, Article or Noise; nodes labeled Title or Article are then returned as extraction results. In this way we can incorporate various features through classifiers, and this is the first step of our feature selection for web information extraction.
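As a minimal sketch of this reduction (hypothetical code using only the Python standard library, not the paper's tool), the text nodes of a page can be collected so that each one becomes a classification instance; a real tool would additionally compute the HTML, DOM, visual and keyword features of Section 4.

```python
# Collect DOM text nodes so that each can be classified as Title, Article or Noise.
# The tag stack handling is deliberately simplified (no void-element handling).
from html.parser import HTMLParser

class TextNodeCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []        # current path of open tags
        self.nodes = []        # (parent tag, depth, text) for each text node

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text:
            parent = self.stack[-1] if self.stack else None
            self.nodes.append((parent, len(self.stack), text))

page = "<html><body><h1>Some title</h1><div><p>Article text...</p></div></body></html>"
collector = TextNodeCollector()
collector.feed(page)
for parent, depth, text in collector.nodes:
    print(parent, depth, text)   # each tuple becomes one classification instance
```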
3.2 The advantages of utilizing CRFs
We employ CRFs for feature evaluation and selection in web information extraction. The advantages mainly include: (1) The ability to incorporate a broad range of features, especially those describing dependencies between classification variables. Since one often needs to extract multiple objects from a page and their dependencies are often useful for extraction (e.g. in news pages, articles come after titles and are followed by comments), such dependencies can be incorporated directly as features in CRFs (Section 2.1) and thus evaluated together with the other features, while classifiers (e.g. SVMs) that assume the classification variables to be independent cannot do this. (2) A direct and natural way to measure feature importance. To obtain the best value assignment (Section 2.3), we need to find a configuration that maximizes the linear combination of features, i.e. the numerator of Formula 2.1. Obviously, a feature weight with a larger absolute value has a stronger influence, so the corresponding feature is more important. Linear classifiers such as SVMs have a similar property, and in [7] the feature weights of an SVM are also used for evaluation. However, when non-linear kernel functions are applied, the feature vectors are transformed into a higher-dimensional space [14], so the weights in the original space may no longer be appropriate indicators of feature importance. (3) Excellent performance. CRFs have been applied in many areas, such as NLP [10][15], computer vision [16][17], web page classification [18] and information extraction [19][20], and have shown excellent performance, so the contributions of various features can be deeply explored and are expected to be fully measured.
3.3 The feature selection algorithm
Owing to the advantages described in Section 3.2, we propose our CRF-based feature selection algorithm, shown in Table 3.1. The algorithm adopts the RFE (Recursive Feature Elimination) strategy [7]: in each iteration the k most unimportant features are removed and the remaining features are evaluated together by a CRF again. Such a backward process preserves the potential dependencies among features that may affect classification, so the feature set is evaluated objectively.
The parameter k lets users adjust the selection granularity: a large k reduces the number of iterations and thus the running time, but important features may also be removed incorrectly. The parameters n and t further help users balance the computational cost against the selection performance.
Since web extraction needs are diverse, the standard for evaluating the best feature set is left to the user (the evaluation function F), and a group of candidate feature sets is generated, bounded in part by the threshold α: when removing some features leads to a significant decrease in classification precision (Step 4), since the other features are even more important (larger absolute weights), we conclude that there are no more unimportant features, so the bound is reached and no further training iterations are performed.
Table 3.1 CRF-based Feature Selection Algorithm
Input: precision decrease threshold α, minimum feature number n, maximum iteration number t, feature number decrement k, evaluation function F
Output: effective feature subset S, feature importance ranking R
1: Prepare the training dataset and do some initialization (e.g. normalize the features into [0,1]); let S1 contain all features, i = 1
2: Train a CRF on the dataset with feature set Si and derive the feature importance λi
3: Evaluate Si by F and calculate the classification precision pi (n-fold cross validation)
4: if pi-1 − pi > α then goto Step 10
5: i ++
6: if i == t then goto Step 10
7: Rank Si by the absolute values of λi and remove the k features with minimum values from Si to derive Si+1
8: if |Si+1| < n then goto Step 10
9: Goto Step 2
10: Select the feature set S with the maximum value of F over all the above iterations
11: Rank S by the absolute values of λ and recursively append the ranks of the features removed in previous iterations according to the corresponding λi
12: Output S, R
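The control flow of Table 3.1 can be sketched as follows (an illustrative skeleton, not the paper's implementation, under the assumption that CRF training and the evaluation function F are supplied as callables; train_crf and evaluate are hypothetical names).

```python
# Skeleton of the selection loop in Table 3.1. For simplicity the same
# cross-validated score plays the role of both F (Step 3) and the precision
# p_i checked against alpha (Step 4).
import numpy as np

def crf_feature_selection(features, train_crf, evaluate,
                          alpha=0.1, n_min=10, t_max=100, k=10):
    history = []                       # (feature set, weights, score) per iteration
    S = list(features)
    prev_score = None
    for _ in range(t_max):             # Step 6: maximum iteration number t
        weights = np.asarray(train_crf(S))   # Step 2: train CRF, get importance lambda
        score = evaluate(S)                  # Step 3: evaluate S by F (cross-validation)
        history.append((list(S), weights, score))
        if prev_score is not None and prev_score - score > alpha:
            break                      # Step 4: significant precision drop, stop early
        prev_score = score
        if len(S) - k < n_min:
            break                      # Step 8: too few features would remain
        order = np.argsort(np.abs(weights))       # Step 7: rank by |lambda|
        S = [S[i] for i in sorted(order[k:])]     # drop the k least important features
    # Step 10: pick the feature set with the best value of F over all iterations.
    best_S, best_w, best_score = max(history, key=lambda h: h[2])
    ranking = [best_S[i] for i in np.argsort(-np.abs(best_w))]
    # Step 11 would further append the features removed in earlier iterations,
    # ranked by the lambda of the iteration in which they were dropped.
    return best_S, ranking, best_score
```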
4. Features for title and article extraction
We apply the CRF-based feature selection algorithm to title and article extraction: we classify the text nodes of DOM trees into three types, Title, Article and Noise, and propose 1200 features. The features mainly come from the HTML source code, the DOM tree, the visual rendering of the page and some keywords. In addition, to verify the effectiveness of dependencies between extraction objects, we introduce position dependency features. Specifically, they are defined as binary indicator functions, i.e. f(y_i, y_{i+1}) = I_{u,v}(y_i, y_{i+1}), where y_i and y_{i+1} are the types of text nodes whose texts appear in succession; I_{u,v}(x, y) equals 1 if u = x and v = y, and 0 otherwise; u and v iterate over the range of y. The feature distribution is shown in Table 4.1.
Table 4.1 Distribution of feature types

Type           Number   Description of Example Features
Html source    258      whether <p> appears before the text and the type is Title (Article, Noise);
                        whether <li> appears before the text and the type is Title (Article, Noise)
DOM tree       300      whether <div> is the text's parent and the type is Title (Article, Noise);
                        number of brother nodes of the text and the type is Title (Article, Noise);
                        whether <p> is the brother node of the text's parent and the type is Title (Article, Noise);
                        the depth from <body> to the text and the type is Title (Article, Noise)
Vision-based   363      whether the text's font size is in level 1 (7 levels in total) and the type is Title (Article, Noise);
                        whether the text is weighted and the type is Title (Article, Noise);
                        whether the text's color is blue and the type is Title (Article, Noise);
                        relative x(y)-coordinate of the text in the browser screen and the type is Title (Article, Noise)
Keywords       270      whether the text contains 'because' and the type is Title (Article, Noise);
                        whether the text contains 'so that' and the type is Title (Article, Noise)
Dependency     9        whether the type is Title and the following type is Article
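For illustration, the nine position dependency features can be generated as binary indicators over all ordered pairs of node types (a minimal sketch of the definition above, not the paper's code).

```python
# One binary indicator I_{u,v}(y_i, y_{i+1}) per ordered pair (u, v) of node types,
# giving 3 x 3 = 9 dependency features for each pair of consecutive text nodes.
from itertools import product

TYPES = ["Title", "Article", "Noise"]

def dependency_features(y_i, y_next):
    """Return the 9 indicator values for one pair of consecutive text nodes."""
    return {(u, v): int(y_i == u and y_next == v) for u, v in product(TYPES, TYPES)}

feats = dependency_features("Title", "Article")
print(feats[("Title", "Article")])   # 1: a title immediately followed by article text
print(feats[("Article", "Title")])   # 0
```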
5. Experiment
5.1 Dataset
We crawled a total of 3,365 pages from the news channels of 31 websites and 3,032 pages from the blog channels of 10 websites. These sites, such as Sina, Sohu and Netease, are popular and well known in China. We then sampled 660 news pages and 618 blog pages separately for labeling and training. The dataset covers various topics, such as politics, economics and entertainment, without any particularly dominant area.
5.2 Baseline Methods and Selection Results
In the experiment, we define the evaluation function as the weighted average F1 of classifying Title and Article:
\mathrm{Eval}(s) = \alpha \cdot F1(\mathrm{Title}) + (1 - \alpha) \cdot F1(\mathrm{Article}),    (5.1)
where α is set to 0.5 so that titles and articles are considered equally important; users can choose a different α according to their specific needs. Since other classifiers such as SVMs are not able to evaluate the dependency features (Section 4), we employ CRFs with 3-fold cross-validation to derive the F1 values on feature set s.
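For instance, the evaluation of one cross-validation fold could be computed as below (a small sketch with toy labels; scikit-learn's f1_score is used only for brevity and is not mentioned in the paper).

```python
# Per-class F1 of Title and Article from one fold, combined as in (5.1) with alpha = 0.5.
from sklearn.metrics import f1_score

def eval_s(y_true, y_pred, alpha=0.5):
    f1_title, f1_article = f1_score(
        y_true, y_pred, labels=["Title", "Article"], average=None)
    return alpha * f1_title + (1 - alpha) * f1_article

y_true = ["Title", "Article", "Article", "Noise", "Noise"]
y_pred = ["Title", "Article", "Noise", "Noise", "Noise"]
print(eval_s(y_true, y_pred))
```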
We choose four baselines for comparison:
• IG-RFE. A filter-method baseline: the features are ranked by their information gain with respect to the classification variables (the higher the gain, the higher the rank); in each iteration the last k features in the ranking are removed and the remaining ones are evaluated by Formula 5.1. This process repeats until no features are left.
• SVM-RFE. The method proposed in [7], which takes steps similar to our algorithm; the only difference is that in each iteration the features are evaluated and ranked by an SVM (a sketch follows below).
• Perceptron-RFE. The perceptron [21] is another widely used linear classifier, and we additionally leverage its weight vector to evaluate and rank features as a baseline, analogous to SVM-RFE.
• CRF-single. Since none of the above methods can directly incorporate the dependency features of Section 4, to enable a fair comparison and to further examine their importance, we additionally run our algorithm without these features as a baseline.
We denote our algorithm starting from all the features in Section 4 as CRF-all. For all methods, we set the precision decrease threshold α = 0.1 and choose k = 10 for its fine granularity. Note that, to keep the feature numbers consistent, for each baseline the dependency features are fixed to 0, so all methods start from 1200 features.
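For reference, the SVM-RFE baseline [7] can be approximated with scikit-learn's RFE and a linear SVM, removing k = 10 features per step (a hedged sketch on synthetic stand-in data; as noted above, the dependency features cannot be represented in this setting).

```python
# SVM-RFE sketch: recursively eliminate the 10 features with the smallest SVM
# weights until 100 features remain. X and y are synthetic stand-ins for the
# 1200 page features and the Title/Article/Noise labels.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.random((300, 1200))                # 300 text nodes, 1200 features
y = rng.integers(0, 3, size=300)           # 0 = Title, 1 = Article, 2 = Noise

selector = RFE(LinearSVC(dual=False, max_iter=5000),
               n_features_to_select=100, step=10)
selector.fit(X, y)
print("selected features:", np.flatnonzero(selector.support_)[:20], "...")
```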
Fig. 5.1 and Fig. 5.2 show the average F1 of all feature sets during the selection process on news and blog pages. CRF-all and CRF-single almost always select the best feature sets, which demonstrates the advantage of our algorithm, and the even better results of CRF-all further demonstrate the importance of the position dependency features of Section 4. Among the other methods, SVM-RFE is slightly inferior to CRF-single, perhaps because the features favored by SVMs are not the ones best suited to CRFs. Moreover, since an SVM cannot directly evaluate the dependency features, which have just been shown to be important, it is less advantageous than CRFs for feature selection in web extraction. Perceptron-RFE approaches SVM-RFE on news pages (Fig. 5.1), but on blog pages (Fig. 5.2) the gap becomes more distinct. The reason may be that the perceptron is not a stable classifier [21] and on average is much inferior to stable classifiers such as SVMs. IG-RFE is the worst of all, perhaps because simple information gain is quite far from a feature's real contribution to extraction, so important features are easily missed.
[Figure: curves of average F1 (y-axis) against the number of features (x-axis, from 1200 down to 0) for CRF-all, CRF-single, SVM-RFE, Perceptron-RFE and IG-RFE]
Fig. 5.1 Feature selection for news pages
[Figure: curves of average F1 (y-axis) against the number of features (x-axis, from 1200 down to 0) for CRF-all, CRF-single, SVM-RFE, Perceptron-RFE and IG-RFE]
Fig. 5.2 Feature selection for blog pages
The results of all methods are shown in Table 5.1, where the last two rows list the original sets (all features, denoted All; all features except the dependency ones, denoted No dependency). CRF-all yields the most compact feature set together with the highest average F1, followed by CRF-single. Moreover, our statistical tests show that the average F1 differences between the last two rows are significant on both news and blog pages, as are the differences between CRF-all and CRF-single. This again confirms the effect of the dependency features and, in turn, our algorithm's advantage of being able to evaluate such a broad range of features.
Table 5.1 Feature selection result (news/blogs)

Methods          Feature Number   F1(Title)        F1(Article)      Average F1
CRF-all          90/120           0.9529/0.8825    0.9257/0.8659    0.9393/0.8742
CRF-single       90/150           0.9467/0.8551    0.8509/0.8321    0.8988/0.8436
SVM-RFE          120/210          0.9123/0.8526    0.8593/0.8268    0.8858/0.8397
Perceptron-RFE   400/750          0.9289/0.8266    0.8301/0.7622    0.8795/0.7944
IG-RFE           570/1100         0.8450/0.8485    0.8512/0.7387    0.8481/0.7936
All              1200/1200        0.9252/0.8421    0.8850/0.8461    0.9051/0.8441
No dependency    1191/1191        0.8844/0.8519    0.8472/0.7341    0.8658/0.7930
Furthermore, most of the methods greatly reduce the feature set while improving the average F1. This benefit of feature selection confirms the value of our work.
6. Conclusions and future work
This work was motivated by the process of building a blog content extraction tool: we collected a large number of features but lacked an effective way to choose the important ones. We then realized the value of feature selection for web extraction and sought a solution. In future work, we would like to use approximate algorithms for training CRFs, since they can greatly reduce the training time and thus accelerate the selection process, and we want to examine their feasibility on several typical problems.
References
[1] Y. Hu, G. Xin, R. Song, et al. Title Extraction from Bodies of HTML Documents and Its Application to Web Page Retrieval. In Proceedings of the International Conference on Research and Development in Information Retrieval (SIGIR '05), Salvador, Brazil, ACM, 2005, pp. 250-257.
[2] B. Liu, R. Grossman, Y. Zhai. Mining Data Records in Web Pages. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD '03), ACM, 2003, pp. 601-606.
[3] J. Zhu, Z. Nie, J. Wen, et al. Simultaneous Record Detection and Attribute Labeling in Web Data Extraction. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD '06), ACM, 2006, pp. 494-503.
[4] I. Inza, P. Larranaga, R. Blanco, et al. Filter Versus Wrapper Gene Selection Approaches in DNA Microarray Domains. Artificial Intelligence in Medicine, 2004, Vol. 31(2), pp. 91-103.
[5] X. Zhou, X. Wang, E.R. Dougherty. Nonlinear Probit Gene Classification Using Mutual Information and Wavelet-Based Feature Selection. Journal of Biological Systems, 2004, Vol. 12(3), pp. 371-386.
[6] X. Zhou, X. Wang, E.R. Dougherty. Gene Selection Using Logistic Regression Based on AIC, BIC and MDL Criteria. Journal of New Mathematics and Natural Computation, 2005, Vol. 1(1), pp. 129-145.
[7] I. Guyon, J. Weston, S. Barnhill, et al. Gene Selection for Cancer Classification Using Support Vector Machines. Machine Learning, 2002, Vol. 46(1-3), pp. 389-422.
[8] A. Verikas, M. Bacauskiene. Feature Selection with Neural Networks. Pattern Recognition Letters, 2002, Vol. 23(11), pp. 1323-1335.
[9] J. Weston, S. Mukherjee, O. Chapelle, et al. Feature Selection for SVMs. Advances in Neural Information Processing Systems 13, 2001, pp. 668-674.
[10] J. Lafferty, A. McCallum, F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML), 2001, pp. 282-289.
[11] D.C. Liu, J. Nocedal. On the Limited Memory BFGS Method for Large Scale Optimization. Mathematical Programming, 1989, Vol. 45, pp. 503-528.
[12] R. Malouf. A Comparison of Algorithms for Maximum Entropy Parameter Estimation. In Proceedings of COLING '02, Taipei, 2002, pp. 49-55.
[13] R.G. Cowell, A.P. Dawid, S.L. Lauritzen, D.J. Spiegelhalter. Probabilistic Networks and Expert Systems. Springer, 1999, Chapter 3.
[14] B.E. Boser, I. Guyon, V. Vapnik. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, ACM, 1992, pp. 144-152.
[15] F. Sha, F. Pereira. Shallow Parsing with Conditional Random Fields. In Proceedings of HLT-NAACL, 2003.
[16] R. Hariharan, K. Toyama. Project Lachesis: Parsing and Modeling Location Histories. In Geographic Information Science, 2004.
[17] S. Kumar, M. Hebert. Discriminative Random Fields: A Discriminative Framework for Contextual Interaction in Classification. In Proceedings of the International Conference on Computer Vision (ICCV), 2003.
[18] B. Taskar, P. Abbeel, D. Koller. Discriminative Probabilistic Models for Relational Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2002.
[19] T. Kristjansson, A. Culotta, P. Viola, A. McCallum. Interactive Information Extraction with Constrained Conditional Random Fields. In Proceedings of the National Conference on Artificial Intelligence (AAAI), 2004.
[20] F. Peng, A. McCallum. Accurate Information Extraction from Research Papers Using Conditional Random Fields. In Proceedings of HLT-NAACL, 2004.
[21] S.I. Gallant. Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, 1990, Vol. 1, pp. 179-191.