Machine Learning in Practice
Lecture 22
Carolyn Penstein Rosé
Language Technologies Institute / Human-Computer Interaction Institute
Plan for the Day

Announcements
 Questions?
Multi-Level Cross-Validation
 Feature Selection
Setting Up the Experimenter for Regression Problems

Cascading Classifiers
Advanced Cross Validation for Hierarchical Models

[Diagram: raw features feed Animal and Plant classifiers, whose predictions become new features for the Living Things target class]

Let's say you want to train a model to predict whether something is a living thing
Let's say you know that Animals have a lot in common with each other, and Plants have a lot in common with each other
You think it would be easier to predict Living Thing if you first have models to predict Animal and Plant
Describe the steps you'll go through to build the Living Thing model
Remember the cluster feature example

[Figure: two scatter plots of Class 1 vs. Class 2; the added cluster structure makes it easier to detect Class 1]
Advanced Cross Validation for Hierarchical Models

[Diagram: raw features feed Classifiers A and B, whose predictions become new features for Classifier C, which predicts the target class]

You will use the results of Classifiers A and B to train Classifier C
You need labeled data for Class A and Class B to train Classifiers A and B, but you don't want to train Classifier C with those perfect labels
You can use cross-validation to get "noisy" versions of A and B in the training data for C
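A minimal sketch of that idea outside Weka (scikit-learn here; the names X, y_A, y_B, y_C and the choice of logistic regression are assumptions for illustration): out-of-fold predictions stand in for the noisy A and B features when training Classifier C.

```python
# Illustrative sketch (not the course's Weka workflow): generate "noisy"
# intermediate labels with cross-validation before training the final classifier.
# Assumes X (raw features), y_A, y_B (intermediate labels), y_C (target) exist.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

clf_A = LogisticRegression(max_iter=1000)
clf_B = LogisticRegression(max_iter=1000)

# Out-of-fold predictions: each instance's A/B label comes from a model
# that never saw that instance, so the labels are realistically noisy.
noisy_A = cross_val_predict(clf_A, X, y_A, cv=10)
noisy_B = cross_val_predict(clf_B, X, y_B, cv=10)

# Train classifier C on raw features plus the noisy A/B predictions,
# never on the hand-labeled ("perfect") A/B columns.
X_C = np.column_stack([X, noisy_A, noisy_B])
clf_C = LogisticRegression(max_iter=1000).fit(X_C, y_C)
```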
Advanced Cross Validation

[Diagram: data table with raw features A, B, C, an intermediate label D (learned by classifier d), and the target label F (learned by classifier f), divided into segments 1-7]

Let's say each instance in your data has features A, B, and C
You are trying to predict F
You want to train feature D, which you think will help you detect F better
Advanced Cross Validation

[Diagram: the same table, with segment 1 held out and segments 2-7 used for training]

Fold 1:
 Train a d classifier over segments 2-7
 Use this model to apply D labels to segment 1
 Train an f classifier over 2-7
 Since D labels in 1 will be noisy, you need to train your f classifier with noisy D labels in 2-7
 You can get those noisy labels using cross-validation within 2-7
 Use the f model to apply F labels to segment 1
Advanced Cross Validation

[Diagram: the same table, showing cross-validation within segments 2-7 to generate noisy D labels]

Think about how to get those noisy labels using cross-validation within 2-7
 Train a d classifier on 3-7 to apply noisy D labels to 2
 Train a d classifier on 2+4-7 to apply noisy D labels to 3
 Etc.
Now you have noisy D labels for 2-7 in addition to the perfect D labels you started with
You will use these noisy labels to train f (not the perfect ones!)
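The full per-fold protocol can be written out as code. A sketch under assumed names (X for the raw A/B/C features, y_d for the D labels, y_f for F, and a decision tree as a stand-in classifier), not the lecture's Weka procedure:

```python
# Sketch of one outer fold of the procedure above (illustrative; variable
# names and the choice of classifier are assumptions, not the Weka setup).
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

def evaluate_fold(X, y_d, y_f, test_idx, train_idx, cv_inner=6):
    """Score the f classifier on one held-out segment, training it only
    on noisy D labels produced by inner cross-validation."""
    # 1. Train d on the training segments and label the held-out segment.
    clf_d = DecisionTreeClassifier().fit(X[train_idx], y_d[train_idx])
    d_test = clf_d.predict(X[test_idx])

    # 2. Get noisy D labels for the training segments via inner cross-validation.
    d_train_noisy = cross_val_predict(DecisionTreeClassifier(),
                                      X[train_idx], y_d[train_idx], cv=cv_inner)

    # 3. Train f on raw features + noisy D labels (never the perfect D column).
    clf_f = DecisionTreeClassifier().fit(
        np.column_stack([X[train_idx], d_train_noisy]), y_f[train_idx])

    # 4. Apply f to the held-out segment, using d's predictions as the D feature.
    return clf_f.score(np.column_stack([X[test_idx], d_test]), y_f[test_idx])
```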
Remember: Dumping Labels from Weka

 Save output buffer
 Pull results section out
 Use the predicted column
 NOTE: If you do this using Weka's cross-validation, you won't be able to match up the instance numbers!!!
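Side note, not part of the Weka recipe above: if you generate the predictions programmatically, scikit-learn's cross_val_predict returns one out-of-fold prediction per instance in the original row order, so predictions and instance numbers line up automatically.

```python
# Side note (not the Weka workflow): cross_val_predict keeps the original
# row order, so predictions can be written out next to instance ids.
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_predict

preds = cross_val_predict(GaussianNB(), X, y, cv=10)  # X, y assumed defined
pd.DataFrame({"instance": range(len(y)), "actual": y,
              "predicted": preds}).to_csv("predictions.csv", index=False)
```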

Feature Selection

Why do irrelevant features hurt performance?

Divide-and-conquer approaches have the problem that the further down in the tree you get, the less data you are paying attention to
 It's easy for the classifier to get confused
Naïve Bayes does not have this problem, but it has other problems, as we have discussed
SVM is relatively good at ignoring irrelevant attributes, but it can still suffer
 Also, it's very computationally expensive with large attribute spaces
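A quick way to see the effect (an illustrative sketch; the dataset and the number of junk features are arbitrary choices, not from the lecture):

```python
# Adding random, irrelevant features tends to hurt cross-validated accuracy.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)
X_noisy = np.hstack([X, rng.normal(size=(X.shape[0], 200))])  # 200 junk features

for name, data in [("original", X), ("with 200 random features", X_noisy)]:
    acc = cross_val_score(DecisionTreeClassifier(random_state=0), data, y, cv=10).mean()
    print(f"{name}: {acc:.3f}")
```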
Two Paradigms for Attribute Selection

Wrapper Method
 Evaluate the subset using the algorithm that will be used for the classification, in terms of how the classifier does with that subset
 Use a search method like Best First
Filter Method
 Use an independent metric of feature goodness
 Rank and then select
Don't be confused – not the "standard" usage of filter versus wrapper
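A compact sketch of the two paradigms in scikit-learn terms (the dataset, learner, and k are assumptions for illustration): the filter ranks features independently of the learner, while the wrapper scores whole subsets with the classifier you will actually use.

```python
# Filter vs. wrapper feature selection, sketched on a tiny dataset.
from itertools import combinations
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Filter method: score each attribute independently of the learner, then rank.
filter_sel = SelectKBest(score_func=chi2, k=2).fit(X, y)
print("filter keeps features:", filter_sel.get_support(indices=True))

# Wrapper method: evaluate candidate subsets with the classifier itself
# (here exhaustively, since there are only 4 attributes).
best = max(((cross_val_score(GaussianNB(), X[:, list(s)], y, cv=5).mean(), s)
            for k in range(1, 5) for s in combinations(range(4), k)))
print("wrapper keeps features:", best[1], "accuracy:", round(best[0], 3))
```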
How do you evaluate feature goodness apart from the learning algorithm?

Notice a combination of scoring heuristics and search methods
How do you evaluate feature goodness apart from the learning algorithm? (evaluating subsets of features)

Look for the smallest set of attributes that distinguishes every training instance from every other training instance
 Problem occurs if there are two instances with the same attribute values but different classes
You could use decision tree learning to pick out a subset of attributes to use with a different algorithm
 It will have no effect if you use it with decision trees
 It might work well with instance-based learning – to avoid having it be confused by irrelevant attributes
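A sketch of the second idea (assumed dataset and learners; in Weka this would be attribute selection feeding IBk): let a decision tree choose the attributes, then hand that subset to an instance-based learner.

```python
# Use the attributes a decision tree actually splits on as the feature
# subset for an instance-based learner (kNN).
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
used = np.where(tree.feature_importances_ > 0)[0]   # attributes the tree split on
# (for an unbiased estimate, the selection should really happen inside each fold)

knn = KNeighborsClassifier()
print("kNN, all attributes:     ", cross_val_score(knn, X, y, cv=10).mean().round(3))
print("kNN, tree-chosen subset: ", cross_val_score(knn, X[:, used], y, cv=10).mean().round(3))
```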
How do you evaluate feature goodness apart from the learning algorithm? (evaluating individual features)

You can rank attributes for decision trees using 1R to compensate for the bias towards selecting features that branch heavily
Look at the correlation between each feature and the class attribute
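A rough sketch of 1R-style ranking (a one-level decision stump stands in for 1R here, and the dataset is an arbitrary choice): score each attribute by how well it predicts the class on its own, then rank.

```python
# Rank attributes by the cross-validated accuracy of a one-feature stump.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
stump = DecisionTreeClassifier(max_depth=1)

scores = [(cross_val_score(stump, X[:, [j]], y, cv=5).mean(), j)
          for j in range(X.shape[1])]
for acc, j in sorted(scores, reverse=True)[:5]:
    print(f"attribute {j}: one-feature accuracy {acc:.3f}")
```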

Efficiently Navigating the Attribute Space

Evaluating individual attributes and then ranking them is the most efficient approach for attribute selection
 That's what we have been doing up until now with ChiSquaredAttributeEval
Searching for the optimal subset of features based on evaluating subsets together is more complex
 Exhaustive search for the optimal subset of attributes is not tractable
 Use a greedy search (BestFirst)
Efficiently Navigating the Attribute Space

Remember that greedy methods are efficient, but they sometimes get stuck in locally optimal solutions
Forward selection: Start with nothing and add attributes
 On each round, pick the attribute that will have the biggest estimated positive effect on performance
Backward elimination: Start with the whole set and prune
 On each round, select the attribute that seems to be dragging down performance the most
Bidirectional search methods combine these methods
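Both directions can be sketched with scikit-learn's SequentialFeatureSelector (an illustration; in Weka this role is played by the search method you pair with a subset evaluator, and the dataset, learner, and target size here are assumptions):

```python
# Forward selection and backward elimination over the same data and learner.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
nb = GaussianNB()

forward = SequentialFeatureSelector(nb, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)
backward = SequentialFeatureSelector(nb, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print("forward keeps: ", forward.get_support(indices=True))
print("backward keeps:", backward.get_support(indices=True))
```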
Forward Selection
* Pick the most predictive feature.
* Pick the next most predictive feature, or the one that gives the pair the most predictive power altogether – in the case of the wrapper method, using the classification algorithm you will eventually use.
* Keep going: on each round, pick the feature that gives the growing set the most predictive power altogether.
Backwards Elimination
* Pick the least predictive feature, i.e., the one you can drop without hurting the predictiveness of the remaining features, or even helping it.
Efficiently Searching the Attribute Space

You can use a beam search method rather than selecting a single attribute on each round
Race search: stop when you don't get any statistically significant increase from one round to the next
 Option: Schemata search is like race search except that you rank attributes first and then race the top-ranking attributes (more efficient!)
 Sometimes you include a random selection of other attributes in with the selected attributes
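A generic beam-search sketch over feature subsets (my own illustration of the idea, not a specific Weka search): keep the best few subsets each round instead of committing to a single greedy choice.

```python
# Beam search over feature subsets; beam_width = 1 would be plain greedy
# forward selection.
from sklearn.datasets import load_breast_cancer
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

def score(subset):
    return cross_val_score(GaussianNB(), X[:, sorted(subset)], y, cv=5).mean()

def beam_search(n_features, n_rounds=5, beam_width=3):
    beam = [frozenset([j]) for j in range(n_features)]
    beam = sorted(beam, key=score, reverse=True)[:beam_width]
    for _ in range(n_rounds - 1):
        # Expand every subset in the beam by one attribute, keep the best few.
        candidates = {s | {j} for s in beam
                      for j in range(n_features) if j not in s}
        beam = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(beam, key=score)

best = beam_search(X.shape[1])
print("best subset found:", sorted(best), "accuracy:", round(score(best), 3))
```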
Efficiently Searching the Attribute Space

Different approaches will make different mistakes
Backward elimination produces larger attribute sets and often better performance than forward selection
Forward selection is good for eliminating redundant attributes or attributes with dependencies between them, which is good for Naïve Bayes
 Better if you want to be able to understand the trained model
Selecting an Attribute Selection Technique
Attribute Selection Options: Evaluating Subsets

CfsSubsetEval: looks for a subset of attributes that are highly correlated with the predicted class but have low inter-correlation
ClassifierSubsetEval: evaluates a subset using a selected classifier to compute performance
ConsistencySubsetEval: evaluates the goodness of a subset of attributes based on consistency of instances that are close to each other in the reduced attribute space
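To make CfsSubsetEval's heuristic concrete, here is a rough sketch of a CFS-style merit score (Weka's CfsSubsetEval uses symmetrical uncertainty on discretized attributes; plain Pearson correlation here is a simplification, and the function is illustrative):

```python
# CFS-style merit: merit = k*r_cf / sqrt(k + k*(k-1)*r_ff), where r_cf is the
# mean feature-class correlation and r_ff the mean feature-feature correlation
# over the k features in the subset. High merit = correlated with the class,
# not with each other.
import numpy as np

def cfs_merit(X, y, subset):
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for i in subset for j in subset if i < j])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```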
Attribute Selection Options: Evaluating Subsets

SVMAttributeEval: backwards elimination, beam search technique; you specify the number or percent to get rid of on each iteration; stops when it doesn't help anymore
WrapperSubsetEval: just like ClassifierSubsetEval
ChiSquaredAttributeEval: evaluates the worth of a feature by computing the chi-squared statistic of the attribute in relation to the predicted class
GainRatioAttributeEval: like Chi-squared but using GainRatio
Attribute Selection Options: Evaluating Subsets

InfoGainAttributeEval: like Chi-squared but using Information Gain
OneRAttributeEval: like Chi-squared but looks at accuracy of using single attributes for classification
ReliefAttributeEval: evaluates the worth of an attribute based on surrounding instances, based on that attribute
SymmetricalUncertAttributeEval: like Chi-squared but uses symmetrical uncertainty
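A small sketch of what the single-attribute rankers compute (chi-squared and a mutual-information score standing in for ChiSquaredAttributeEval and InfoGainAttributeEval; the dataset is an arbitrary choice):

```python
# Rank individual attributes by chi-squared and by mutual information.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import chi2, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)   # non-negative features, so chi2 is valid

chi2_scores, _ = chi2(X, y)
mi_scores = mutual_info_classif(X, y, random_state=0)

print("top 5 by chi-squared:       ", np.argsort(chi2_scores)[::-1][:5])
print("top 5 by mutual information:", np.argsort(mi_scores)[::-1][:5])
```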
Remember Matrix Multiplication
* Notice you ended up with fewer attributes!
What can we do with that?

Using linear algebra
 Project one vector space onto a more compact one
 Then select the top N dimensions that as a set explain the most variance in your data
Subsets in a new vector space

PrincipalComponents: use principal components analysis to select a subset of eigenvectors from the diagonalized covariance matrix that accounts for a certain percentage of the variance in the data
 A cheaper method is to use a random projection onto a smaller vector space
  Not as good as principal components analysis
  But not that much worse either
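A sketch comparing the two options (dataset, component counts, and the downstream learner are assumptions for illustration):

```python
# PCA projection vs. a cheaper random projection, compared on the same learner.
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.random_projection import GaussianRandomProjection
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pca = make_pipeline(StandardScaler(), PCA(n_components=0.95), GaussianNB())
rp = make_pipeline(StandardScaler(),
                   GaussianRandomProjection(n_components=10, random_state=0),
                   GaussianNB())

print("PCA (95% variance):", cross_val_score(pca, X, y, cv=10).mean().round(3))
print("random projection: ", cross_val_score(rp, X, y, cv=10).mean().round(3))
```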

Selecting an Attribute Selection Technique
E.g., Forward Selection
* Forward selection and backward selection are much slower than just ranking attributes based on a metric that can be applied to one attribute at a time. So if you are using such a metric, just use a ranking selection technique rather than a search selection technique. Using search won't change the result; it will just waste a lot of time!
Does it matter which one you pick?

Consider the Spam data set
Decision trees worked well because we needed to consider interactions between attributes
CfsSubsetEval and ChiSquaredAttributeEval consider attributes out of context
 Both significantly reduce performance
 In this case we're harming performance because we're ignoring interactions between attributes at the selection stage
 CfsSubsetEval is significantly better than ChiSquaredAttributeEval for the same number of features
 But with ChiSquaredAttributeEval you can choose to have more features, and then you do not get a degradation in performance even though you reduce the feature space
Take Home Message

Multi-level cross-validation helps prevent over-estimating performance for tuned models
 Needed for cascaded classification in addition to more typical tuning
Wide range of feature selection techniques
 Make sure your options are consistent with one another
 Search-based approaches are much slower than simple ranking approaches