Chapter 5
FEATURE SELECTION
5.1 Need for Feature Reduction
Many factors affect the success of machine learning on a given task. The
representation and quality of the example data are first and foremost. Nowadays, the need
to process large databases is becoming increasingly common. Learners for full-text
databases typically deal with tens of thousands of features; vision systems, spoken word and
character recognition problems all involve hundreds of classes and may have thousands of
input features. The majority of real-world classification problems require supervised
learning, where the underlying class probabilities and class-conditional probabilities are
unknown and each instance is associated with a class label. In real-world situations,
the relevant features are often unknown a priori. Therefore, many candidate features are
introduced to better represent the domain. Theoretically, having more features should
result in more discriminating power. Practical experience with machine learning
algorithms has shown, however, that this is not always the case: current machine learning
toolkits are insufficiently equipped to deal with contemporary datasets, and many
algorithms exhibit poor complexity with respect to the number of features. Furthermore,
when faced with many noisy features, some algorithms take an inordinately long time to
converge, or never converge at all. And even if they do converge, conventional algorithms
tend to construct poor classifiers [Kon94].
Many of the features introduced during the training of a classifier are either partially
or completely irrelevant or redundant with respect to the target concept: an irrelevant feature does not
affect the target concept in any way, and a redundant feature does not add anything new
to the target concept. In many applications, the dataset is so large that learning
may not work well until these unwanted features are removed. Research has
shown that common machine learning algorithms are adversely affected by irrelevant and
redundant training information. The simple nearest neighbor algorithm is sensitive to
irrelevant attributes: its sample complexity (the number of training examples needed to reach
a given accuracy level) grows exponentially with the number of irrelevant attributes (see
[Lan94a, Lan94b, Aha91]). Sample complexity for decision tree algorithms can grow
exponentially on some concepts (such as parity) as well. The naive Bayes classifier can
be adversely affected by redundant attributes because of its assumption that attributes are
independent given the class [Lan94c]. Decision tree algorithms such as C4.5 [Qui86,
Qui93] can sometimes overfit the training data, resulting in large trees. In many cases,
removing irrelevant and redundant information results in C4.5 producing smaller trees
[Koh96]. Neural networks are expected to cope with irrelevant and redundant features
only when enough training data is available to compensate for this drawback; otherwise
they too are affected by the amount of irrelevant information.
Reducing the number of irrelevant/redundant features drastically reduces the running
time of a learning algorithm and yields a more general concept. This helps in getting
better insight into the underlying concept of a real-world classification problem. Feature
selection methods try to pick a subset of features that are relevant to the target concept.
5.2 Feature Selection process
The problem introduced in the previous section can be alleviated by preprocessing the
dataset to remove noisy and low-information-bearing attributes.
“Feature selection is the problem of choosing a small subset of features that ideally
is necessary and sufficient to describe the target concept”
Kira & Rendell
From the terms “necessary” and “sufficient” in this definition, it can
be stated that feature selection attempts to select a minimally sized subset of features
according to the following criteria:
1. the classification accuracy does not significantly decrease; and
2. the resulting class distribution, given only the values for the selected
features, is as close as possible to the original class distribution, given all
features.
Figure 5.1 General criteria for a feature selection method.
Ideally, feature selection methods search through the subsets of features and try to
find the best one among the 2^N candidate subsets according to some evaluation function.
This procedure is exhaustive, since only an exhaustive search guarantees finding the best
subset; it may be too costly and practically prohibitive even for a medium-sized feature
set. Other methods, based on heuristic or random search, attempt to reduce the
computational complexity by compromising performance. These methods need a stopping
criterion to prevent an exhaustive search of the subsets. There are four basic steps in a
typical feature selection method (a sketch of a simple forward search covering these steps
is given after the list):
1. Starting point: Selecting a point in the feature subset space from which to begin
the search can affect the direction of the search. One option is to begin with no
features and successively add attributes. In this case, the search is said to proceed
forward through the search space. Conversely, the search can begin with all
features and successively remove them. In this case, the search proceeds
backward through the search space. Another alternative is to begin somewhere in
the middle and move outwards from this point.
2. Search organization: An exhaustive search of the feature subspace is prohibitive
for all but a small initial number of features. With N initial features there exist
2^N possible subsets. Heuristic search strategies are more feasible than exhaustive
ones and can give good results, although they do not guarantee finding the
optimal subset.
3. Evaluation strategy: How feature subsets are evaluated is the single biggest
differentiating factor among feature selection algorithms for machine learning.
One paradigm, dubbed the filter [Koh95, Koh96], operates independent of any
learning algorithm—undesirable features are filtered out of the data before
learning begins. These algorithms use heuristics based on general characteristics
of the data to evaluate the merit of feature subsets. Another school of thought
argues that the bias of a particular induction algorithm should be taken into
account when selecting features. This method, called the wrapper [Koh95,
Koh96], uses an induction algorithm along with a statistical re-sampling
technique such as cross-validation to estimate the final accuracy of feature
subsets.
4. Stopping criterion: A feature selector must decide when to stop searching
through the space of feature subsets. Depending on the evaluation strategy, a
feature selector might stop adding or removing features when none of the
alternatives improves upon the merit of a current feature subset. Alternatively, the
algorithm might continue to revise the feature subset as long as the merit does not
degrade. A further option could be to continue generating feature subsets until
reaching the opposite end of the search space and then select the best.
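To make these steps concrete, the following sketch (in R, the statistical environment also used for the regression models later in this chapter) implements a simple greedy forward search. The function names and the example evaluation measure are illustrative assumptions, not a method used in this work; the evaluation function could equally be a filter merit or a wrapper estimate such as cross-validated accuracy.

# Greedy forward selection: start with no features (starting point), add one
# feature at a time (search organization), score each candidate subset with a
# user-supplied evaluation function (evaluation strategy) and stop when no
# addition improves the current merit (stopping criterion).
forward_selection <- function(data, class_col, evaluate) {
  all_feats  <- setdiff(names(data), class_col)
  selected   <- character(0)
  best_merit <- -Inf
  repeat {
    candidates <- setdiff(all_feats, selected)
    if (length(candidates) == 0) break
    merits <- sapply(candidates,
                     function(f) evaluate(data, class_col, c(selected, f)))
    if (max(merits) <= best_merit) break   # no candidate improves the subset
    best_merit <- max(merits)
    selected   <- c(selected, candidates[which.max(merits)])
  }
  selected
}

# Illustrative filter-style evaluation: mean absolute correlation of the subset
# with a numeric class coding. A wrapper would instead return, for example, the
# cross-validated accuracy of the intended classifier on this subset.
corr_merit <- function(data, class_col, feats) {
  mean(abs(cor(data[feats], data[[class_col]])))
}

d <- transform(iris, Species = as.numeric(Species))   # toy data with numeric class
forward_selection(d, "Species", corr_merit)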
Many learning algorithms can be viewed as making a (biased) estimate of the
probability of the class label given a set of features. This is a complex, high dimensional
distribution. Unfortunately, induction is often performed on limited data. This makes
estimating the many probabilistic parameters difficult. In order to avoid overfitting the
training data, many algorithms employ an Occam’s Razor [Gam97] bias to build a simple
model that still achieves some acceptable level of performance on the training data. This
bias often leads an algorithm to prefer a small number of predictive attributes over a large
number of features that, if used in the proper combination, are fully predictive of the class
label. If there is too much irrelevant and redundant information present or the data is
noisy and unreliable, then learning during the training phase is more difficult.
Feature subset selection is the process of identifying and removing as much irrelevant
and redundant information as possible. This reduces the dimensionality of the data and
may allow learning algorithms to operate faster and more effectively. In some cases,
accuracy on future classification can be improved; in others, the result is a more compact,
easily interpreted representation of the target concept.
5.3 Feature selection methods overview
Feature subset selection has long been a research area within statistics and pattern
recognition [Dev82, Mil90]. It is not surprising that feature selection is as much of an
issue for machine learning as it is for pattern recognition, as both fields share the common
task of classification. In pattern recognition, feature selection can have an impact on the
economics of data acquisition and on the accuracy and complexity of the classifier
[Dev82]. This is also true of machine learning, which has the added concern of distilling
useful knowledge from data. Fortunately, feature selection has been shown to improve the
comprehensibility of extracted knowledge [Koh96].
                          GENERATION
EVALUATION                Heuristic                                 Complete                               Random

Distance                  Relief [Kir92], Relief-F [Kon94],         Branch & Bound [Nar77], BFF [XuL88],
                          Segen [Seg84]                             Bobrowski [Bob88]

Information               DTM [Car93], Koller & Sahami [Kol96]      MDLM [She90]

Dependency                POE1ACC [Muc71], PRESET [Mod93]

Consistency                                                         Focus [Alm92], Schlimmer [Sch93],      LVF [Liu96]
                                                                    MIFES-1 [Oli92]

Classifier Error Rate     SBS, SFS [Dev82], SBS-SLASH [Car94],      Ichino & Sklansky [Ichi84, Ichi84b]    LVW [Liu96b], GA [Vaf94],
                          PQSS, BDS [Doa92], Schemata search                                               SA, RGSS [Doa92],
                          [Moo94], RC [Dom96], Queiros &                                                   RMHC-PF1 [Ska94]
                          Gelsema [Que84]

Table 5.1 Different feature selection methods as stated by M. Dash and H. Liu [Das97].
There are a huge number of different feature selection methods. A study carried out
by M. Dash and H. Liu [Das97] presents 32 different methods, grouped according to the
type of generation procedure and evaluation function used in them. If the original feature
set contains N features, then the total number of competing candidate subsets to be
generated is 2^N, a huge number even for medium-sized N. Generation procedures are
different approaches to dealing with this problem, namely: complete, in which all the subsets
are evaluated; heuristic, in which subsets are generated by adding/removing
attributes (incremental/decremental); and random, which evaluates a certain number of
randomly generated subsets. On the other hand, the aim of an evaluation function is to
measure the discriminating ability of a feature or a subset to distinguish the different class
labels. There are two common approaches: a wrapper uses the intended learning
algorithm itself to evaluate the usefulness of features, while a filter evaluates features
according to heuristics based on general characteristics of the data. The wrapper approach
is generally considered to produce better feature subsets but runs much more slowly than
a filter [Hal99]. The study of M. Dash and H. Liu [Das97] divides evaluation functions
into five categories: distance, which evaluates differences between class-conditional
probabilities; information, based on the information gain of a feature; dependence, based
on correlation measures; consistency, in which an acceptable inconsistency rate is set
by the user; and classifier error rate, which uses the classifier itself as the evaluation function.
Under this taxonomy, only the last category, the classifier error rate, counts as a wrapper.
Table 5.1 summarises the classification of methods given in [Das97]. The blank cells in the
table signify that no method exists yet for these combinations. Since a deeper analysis of
each of the included feature selection techniques is beyond the purpose of this thesis,
references for further information about them are given in the table.
In [Hal99], M. A. Hall and L. A. Smith present a particular approach to feature
selection, Correlation-based Feature Selection (CFS), which uses a correlation-based
heuristic to evaluate the worth of features. Although this method has not been employed
as such during this work, various ideas about feature selection using correlation
measures between features and between features and output classes have been
taken from it, and its direct application is seriously considered for future
work. Consequently, a brief overview of this particular feature selection criterion is given in
the following section 5.3.1.
5.3.1 Correlation-based Feature Selection
The CFS algorithm relies on a heuristic for evaluating the worth or merit of a subset of
features. This heuristic takes into account the usefulness of the individual features for
predicting the class label along with the level of inter-correlation among them. The
hypothesis on which the heuristic is based can be stated as:
“Good feature subsets contain features highly correlated with (predictive of) the
class, yet uncorrelated with (not predictive of) each other” [Hal99]
Along the same lines, Genari [Gen89] states that “features are relevant if their
values vary systematically with category membership.” In other words, a feature is useful
if it is correlated with or predictive of the class; otherwise it is irrelevant. Empirical
evidence from the feature selection literature shows that, along with irrelevant features,
redundant information should be eliminated as well (see [Lan94c, Koh96, Koh95]). A
feature is said to be redundant if one or more of the other features are highly correlated
with it. The above definitions of relevance and redundancy lead to the idea that the best
features for a given classification are those that are highly correlated with one of the
classes and have an insignificant correlation with the rest of the features in the set.
If the correlation between each of the components in a test and the outside variable is
known, and the inter-correlation between each pair of components is given, then the
correlation between a composite1 consisting of the summed components and the outside
variable can be predicted from [Ghi64, Hog77, Zaj62]:
r_{zc} = \frac{k\,\overline{r_{zi}}}{\sqrt{k + k(k-1)\,\overline{r_{ii}}}}                (5.1)

1 Subset of features selected for evaluation.
Where
r_{zc} = correlation between the summed components and the outside variable,
k = number of components (features),
\overline{r_{zi}} = average of the correlations between the components and the outside variable,
\overline{r_{ii}} = average inter-correlation between the components.
Equation 5.1 represents Pearson’s correlation coefficient, where all the variables
have been standardised. The numerator can be thought of as giving an indication of how
predictive of the class a group of features is; the denominator, of how much redundancy
there is among them. Thus, equation 5.1 shows that the correlation between a composite
and an outside variable is a function of the number of component variables in the
composite and the magnitude of the inter-correlations among them, together with the
magnitude of the correlations between the components and the outside variable. Some
conclusions can be extracted from (5.1):
- The higher the correlations between the components and the outside variable, the
higher the correlation between the composite and the outside variable.
- As the number of components in the composite increases, the correlation between
the composite and the outside variable increases.
- The lower the inter-correlation among the components, the higher the correlation
between the composite and the outside variable.
Theoretically, when the number of components in the composite increases, the
correlation between the composite and the outside variable also increases. However, it is
unlikely that a group of components that are highly correlated with the outside variable
will at the same time bear low correlations with each other [Ghi64]. Furthermore, Hogart
[Hog77] notes that, when inclusion of an additional component is considered, low intercorrelation with the already selected components may well predominate over high
correlation with the outside variable.
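As an illustration, the following sketch computes the merit of equation 5.1 for a candidate subset. It assumes a data set with a numeric class column; the function and data names are illustrative and this is not Hall and Smith's implementation.

# Merit of a feature subset according to equation 5.1:
#   r_zc = k * mean(feature-class correlations) /
#          sqrt(k + k*(k - 1) * mean(feature-feature inter-correlations))
cfs_merit <- function(data, class_col, feats) {
  k   <- length(feats)
  rzi <- mean(abs(cor(data[feats], data[[class_col]])))  # average feature-class correlation
  if (k == 1) return(rzi)
  rff <- abs(cor(data[feats]))                           # feature-feature correlations
  rii <- mean(rff[upper.tri(rff)])                       # average inter-correlation
  (k * rzi) / sqrt(k + k * (k - 1) * rii)
}

d <- transform(iris, Species = as.numeric(Species))      # toy data with numeric class
cfs_merit(d, "Species", c("Petal.Length", "Petal.Width"))  # a highly inter-correlated (redundant) pair
cfs_merit(d, "Species", c("Petal.Length", "Sepal.Width"))  # a less redundant pair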
5.4 Feature Selection Procedures employed in this work
Neural networks, the principal classifier used during this thesis, are able to handle
redundant and irrelevant features, provided that enough training patterns are available to
estimate suitable weights during the learning process. Since our database hardly satisfies
this requirement, pre-processing of the input features becomes indispensable to improve
prediction accuracy. Regression models have mainly been applied as the selection procedure
during this work. However, other procedures, such as Fisher’s discriminant (F-Ratio),
NN pruning, correlation analysis and graphical analysis of feature statistics (boxplots),
have also been tested. The following subsections describe the basis of each of these
methods.
5.4.1 Regression models
Linear regression models are mainly employed during this thesis to select a subset of
features from a larger set in order to reduce the input dimensionality of the
classifier, as stated in former sections. For some specific cases (speaker independent),
quadratic regression models were also used.
Linear regression models the emotions as a linear combination of features and selects only
those features that significantly modify the model. Quadratic models work in a
similar way but also allow quadratic combinations of the features in the modelling of the emotions. Both
models are implemented using R2, a language and environment for statistical computing
and graphics. The resulting selected features are scored by R according to their influence
on the model. There are four different scores, ordered by degree of importance:
- three points (most influential),
- two points,
- one point and,
- just one remark (less influential).
Which features are taken into account for the classification tasks varies among the different
experiments and is specified in the condition descriptions in chapters 8 and 9.
2 http://www.r-project.org/
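A minimal sketch of this procedure in R might look as follows, with synthetic data standing in for the real feature tables (the actual feature names are not reproduced here). The significance codes printed by summary() correspond to the four scores listed above.

set.seed(1)
# Synthetic stand-in for a feature table: three features and a numeric emotion coding
feats <- data.frame(f1 = rnorm(100), f2 = rnorm(100), f3 = rnorm(100))
feats$emotion <- 2 * feats$f1 + 0.5 * feats$f2 + rnorm(100)

# Linear model of the emotion on all features; summary() marks each coefficient
# with a significance code: *** (3 points), ** (2), * (1), . (remark)
fit <- lm(emotion ~ ., data = feats)
summary(fit)

# Quadratic variant: pairwise interactions plus squared terms of the features
fit2 <- lm(emotion ~ .^2 + I(f1^2) + I(f2^2) + I(f3^2), data = feats)
summary(fit2)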
Feature sets selected through regression models are tested in the following experiments:
                      PROSODIC EXP.                         QUALITY EXP.
SPKR DEPENDENT        8.2.1.1, 8.2.1.4, 8.1.2.5, 8.2.2.2    9.2.2.2
SPKR INDEPENDENT      8.3.1.2                               9.3.2.1, 9.3.2.2

Table 5.2 Experiments where the regression-based feature selection is tested.
5.4.2 Fisher’s discriminant: F-Ratio
Fisher’s discriminant is a measure of the separability of the recognition classes. It is
based on the idea that the ability of a feature to separate two classes depends on the
distance between the classes and the scatter within the classes. Figure 5.2 makes clear that,
although the means of X are more widely separated, Y is better at separating the two
classes because there is no overlap between its distributions. Overlap depends on two
factors:
- the distance between the distributions for the two classes, and
- the width of (i.e. the scatter within) the distributions.
Figure 5.2 Performance of a recognition feature depends on the class-to-class difference relative to the
scatter within classes.
A reasonable way to characterise this numerically is to take the ratio of the
difference between the means to the standard deviation of the measurements or, since there are
two sets of measurements, to the average of the two standard deviations. Fisher’s
discriminant is based on this principle.
f = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}                (5.2)
Where
\mu_n = mean of the feature for class n,
\sigma_n^2 = variance of the feature for class n.
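A direct transcription of equation 5.2 in R, using synthetic two-class data purely for illustration:

# Fisher's discriminant for a single feature and two classes (equation 5.2)
fisher_f <- function(x1, x2) {
  (mean(x1) - mean(x2))^2 / (var(x1) + var(x2))
}

set.seed(1)
x_class1 <- rnorm(50, mean = 0, sd = 1)   # feature values for class 1
x_class2 <- rnorm(50, mean = 2, sd = 1)   # feature values for class 2
fisher_f(x_class1, x_class2)              # larger values indicate better separation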
Generally there will be many more than two classes. In that case, the class-to-class
separation of the feature over all the classes has to be considered. This estimation can be
done by representing each class by its mean and taking the variance of the means. This
variance is then compared to the average width of the distribution for each class, i.e. the
mean of the individual variances. This measure is commonly called the F-Ratio:
F\text{-}Ratio = \frac{\frac{1}{m-1}\sum_{j=1}^{m}(\mu_j - \mu)^2}{\frac{1}{m(n-1)}\sum_{j=1}^{m}\sum_{i=1}^{n}(x_{ij} - \mu_j)^2}                (5.3)
Where
n = number of measurements for each class,
m = number of different classes,
x_{ij} = i-th measurement for class j,
\mu_j = mean of all measurements for class j,
\mu = mean of all measurements over all classes.
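The following sketch applies equation 5.3 to rank several features, assuming an equal number of measurements per class (as the equation does); the data and names are illustrative.

# F-Ratio of one feature (equation 5.3): variance of the class means divided by
# the mean of the within-class variances.
f_ratio <- function(x, class) {
  class_means <- tapply(x, class, mean)
  var(class_means) / mean(tapply(x, class, var))
}

# Rank the numeric features of a data set by their F-Ratio (iris has equal class sizes)
sort(sapply(iris[1:4], f_ratio, class = iris$Species), decreasing = TRUE)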
The F-Ratio feature selection method is tested in experiment 9.3.2.3. There, it is
observed that the performance of the classifier does not improve when the set resulting
from the F-Ratio analysis of the features is used. However, there is an evident reason for
this lack of success: the F-Ratio evaluates a single feature, and when many potential
features are to be evaluated it is not safe simply to rank the features by their F-Ratio
and pick the best ones unless all the features are uncorrelated, which they usually are
not. Two possible responses to this problem are proposed for future work:
- Use various techniques for evaluating combinations of features.
- Transform the features into independent (uncorrelated) ones and then pick the best ones in the
transformed space.
5.4.3 NN pruning
This procedure is implicitly performed during the neural network training phase if the
pruning algorithm is selected. A pruned neural network eliminates all those nodes
(features) and/or links that are not relevant for the classification task. Since this selection
method is closer to being a neural network learning method, detailed information about
how it works can be found in section 6.3.3.3.
5.4.4 Correlation analysis
Although many algorithms have been proposed to implement correlation-based feature
selection, which generate subsets heuristically and select the best composite
among many different candidates, the implementation, testing and search for an optimal
algorithm would go far beyond the scope of the present thesis and was proposed as a
separate topic for a future Diploma Thesis. Therefore, the correlation analysis employed
in the first experiments of this thesis simply applies the main ideas presented in
section 5.3.1.
First, the correlation matrix was calculated taking into account both the features and
the outputs of the specific problem. Then, following the conclusions extracted from
section 5.3.1, the features that were most correlated with one of the outputs and, at the
same time, had a weak correlation with the rest of the features were selected among all
candidates.
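A minimal sketch of this procedure is given below; the ranking heuristic and the example class coding are illustrative assumptions, not the exact criterion used in the preliminary experiments.

# For every feature: relevance = strongest absolute correlation with any output,
# redundancy = average absolute correlation with the remaining features.
rank_by_correlation <- function(features, outputs) {
  rel <- apply(abs(cor(features, outputs)), 1, max)
  ff  <- abs(cor(features)); diag(ff) <- NA
  red <- rowMeans(ff, na.rm = TRUE)
  sort(rel - red, decreasing = TRUE)   # prefer relevant, non-redundant features
}

# Example: outputs coded as one 0/1 indicator column per class
outs <- model.matrix(~ Species - 1, data = iris)
rank_by_correlation(iris[1:4], outs)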
This procedure of feature selection was employed in the preliminary experiments
carried out with prosodic features. The results showed that, although the selected features
do appear to carry relevant information, the optimisation of a correlation-based
method would be a valuable proposal for future work, as has already been mentioned.
5.4.5 Graphical analysis of feature statistics (“boxplot”)
A “boxplot” provides an excellent visual summary of many important aspects of a
distribution. The box stretches from the lower hinge (defined as the 25th percentile3) to the
upper hinge (the 75th percentile) and therefore the length of the box contains the middle
half of the scores in the distribution.
Figure 5.3 Boxplot graphical representations of features (a) P1.5, (b) P1.7, (c) P1.14 and (d) P1.23 used
for the selection in experiment 8.2.2.2. Each box represents one of the three activation levels (3 classes).
The median4 is shown as a line across the box. Therefore 1/4 of the distribution is
between this line and the top of the box and 1/4 of the distribution is between this line and
the bottom of the box.
3 A percentile rank is the proportion of scores in a distribution that a specific score is greater than or equal
to. For instance, if you received a score of 95 on a math test and this score was greater than or equal to the
scores of 88% of the students taking the test, then your percentile rank would be 88. You would be in the
88th percentile.
4 The median is the middle of a distribution: half the scores are above the median and half are below the
median.
It is often useful to compare data from two or more groups by viewing boxplots
of the groups side by side. Boxplots make it easy to compare two or more samples by
their centre value (median) and variation (length of the box), and hence to compare
how well a given feature is capable of separating predetermined classes. For instance,
figure 5.3 shows the boxplot representations of four different features when three outputs
are considered (experiment 8.2.2.2). The first feature (a) would a priori represent a good
feature for discriminating among the three given classes, because its median values are
well separated and the boxes do not overlap significantly. Features (b)
and (c) were also considered for the selected set after inspection of their statistics.
However, feature (d) was omitted because its statistics do not distinguish
among the classes, as depicted in figure 5.3 (d).
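Boxplots of this kind can be produced directly in R; the feature and class columns below are illustrative stand-ins for the features and activation levels of figure 5.3.

# One box per class for a single feature: well-separated medians and little
# overlap between the boxes suggest the feature discriminates the classes well.
boxplot(Petal.Length ~ Species, data = iris,
        xlab = "class", ylab = "feature value",
        main = "Feature distribution per class")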
This graphical method has been employed during this thesis in combination with other
procedures, mainly linear regression, in order to achieve an enhanced analysis of the
features.