1/24/05 Monday Lecture Notes
Classification
We want to find a classifier φ that is good with respect to the generalization error, i.e. it
has to perform well with respect to the underlying probability distribution that describes the
data.
Problem
Find φ from some class of classifiers (e.g. neural networks, decision trees, etc.) such that φ
minimizes the generalization error R_P[φ] = P[φ(X1, X2, ..., Xn) ≠ c] = E_P[ℓ_φ]
We want a measure that gives us a 0/1 value for a particular instance of the input and output.
Then we simply take the average of that measure over the whole probability space and say
that it is the global measure of how bad the classifier is. Anything that minimizes this
measure gives us the best classifier.
This probability, one would claim, is simply the expected value, over the probability space,
of the 0/1 loss function.

We express it more fancily as

R_P[φ] = Σ_{x1, x2, ..., xn, c} P[x1, x2, ..., xn, c] * ℓ_φ(x1, x2, ..., xn, c)
Why are we using a 0/1 random variable? Because it has the interesting property that if X
is a 0/1 random variable, then E[X] = 0*P[X=0] + 1*P[X=1] = P[X=1]. Thus, E[X] = P[X=1],
so the expected 0/1 loss is exactly the probability of misclassification.
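As a quick numerical illustration (not part of the lecture; the probability 0.3 below is an arbitrary made-up value), the average of many 0/1 samples converges to P[X=1]:

    # Simulate a 0/1 (Bernoulli) random variable with P[X=1] = 0.3 and check
    # that the sample mean approximates that probability.
    import random

    random.seed(0)
    samples = [1 if random.random() < 0.3 else 0 for _ in range(100_000)]
    print(sum(samples) / len(samples))   # close to 0.3, i.e. P[X = 1]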
Empirical Risk / Generalization Error
It is called the "Generalization Error" because it characterizes the entire source of the data.
However, we must remember that we do not know what the characteristics of the entire data
are; otherwise there would not have been any reason to compute a classifier in the first place.
What would be the best prediction to make if we knew the underlying probability
distribution? We have to assign each data item to one class label or another (that is, we
cannot say "this class label with this probability and that class label with that probability").
If we had the probability distribution, the task would be easy, since we could simply
classify each item into the class with the highest probability.
Unfortunately, we do not have it!
Obviously, if we are able to get samples, i.e. training data, from the underlying
probability space, we can try to approximate it. This approximation is what is called the
Empirical Risk, and it serves as an estimate of the generalization error.
We have samples w1, w2, ..., wN drawn from Ω according to the distribution P. What does that
mean? It means that where the distribution puts more mass in a particular region of space, we
will see more points. If we somehow have the ability to generate enough data from
the underlying probability distribution, then we can actually estimate this quantity.
Simply compute the value of the loss for each of these samples and take their average:
R_P[φ] ≈ (1/N) * Σ_{i=1}^{N} ℓ_φ(w_i.X1, w_i.X2, ..., w_i.Xn, w_i.c)
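A minimal sketch (with hypothetical names, not taken from the lecture) of how this empirical average would be computed for a given classifier and a labeled sample:

    # Empirical risk: the average 0/1 loss of a classifier over labeled samples.
    # `classify` and the sample format are illustrative placeholders.
    def empirical_risk(classify, samples):
        # samples: list of (features, label) pairs drawn from the data source
        losses = [0 if classify(x) == c else 1 for x, c in samples]
        return sum(losses) / len(losses)

    # Toy example: a threshold classifier on a single feature.
    classify = lambda x: '+' if x[0] > 0.5 else '-'
    data = [((0.9,), '+'), ((0.2,), '-'), ((0.7,), '-'), ((0.1,), '-')]
    print(empirical_risk(classify, data))   # 0.25: one of the four points is misclassified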
[Figure: a complicated function f(x) enclosed in a box, with random points scattered inside, illustrating the Monte Carlo idea below.]
This was a general theory developed in the 1950s in connection with the design of nuclear
weapons. Suppose we want to find the integral of a complicated function f(x). One way to
do this is to draw an enclosing box around the function (in a multidimensional space, of
course) and throw in some uniformly distributed random points. The number of points that
fall under the function, divided by the number of trials and then multiplied by the area of
the box, approximately gives the area (integral) of the function f(x). This is the Monte
Carlo method, named after the casino. However, it requires a HUGE number of points to converge.
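A minimal sketch of the box-and-points idea just described, assuming a one-dimensional function on [0, 1] and a box height chosen by hand (both are illustrative assumptions, not from the lecture):

    import random

    # "Hit or miss" Monte Carlo: enclose the curve in the box [0,1] x [0, box_height]
    # and count how many uniform random points land under the curve.
    def mc_integral(f, box_height, n_points=100_000):
        hits = 0
        for _ in range(n_points):
            x = random.random()
            y = random.random() * box_height
            if y <= f(x):
                hits += 1
        box_area = 1.0 * box_height
        return (hits / n_points) * box_area

    random.seed(0)
    print(mc_integral(lambda x: x * x, box_height=1.0))   # close to 1/3, the true integral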
So, if one tries to find out how good a given classifier is, one way to do it is to simply go
and get a huge number of data points from the original data source. You compute this
average from the samples, and that is a very good estimate of the true generalization error.
But the trouble is that you do not have a huge amount of data.
In order to estimate this quantity for ANY classification method, you either have to use
something like what we just discussed, which is called the empirical risk minimization
method, or you have to develop fairly sophisticated math that tries to estimate, from the
particular empirical risk value or some other quantity, what the generalization error would be.
However, the best way to test the goodness of a classifier is to choose a sample from the
source that has not been used in the training data at all (so that it is not correlated with
the actual classifier), and then evaluate the classifier on it. The problem is complicated by
the fact that a lot of the datasets we use to build classifiers are very small. So, even
though you train your classifier and test it, you are nowhere close to the right number.
There are a number of tricks employed, and a number of very handy arguments for why they
apparently work.
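A minimal sketch of this held-out evaluation; the split function, its ratio, and the reuse of the empirical_risk sketch above are all illustrative assumptions:

    import random

    # Hold out part of the data: train on one part, then estimate the
    # generalization error on the part the classifier has never seen.
    def split_data(samples, test_fraction=0.3, seed=0):
        rng = random.Random(seed)
        shuffled = samples[:]
        rng.shuffle(shuffled)
        n_test = int(len(shuffled) * test_fraction)
        return shuffled[n_test:], shuffled[:n_test]   # (training set, test set)

    # The training set is used to fit the classifier; the test set is used only
    # to estimate its error, e.g. with the empirical_risk function sketched earlier.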
Overfitting
Potential danger of trusting one's training data too much.
For example, let's take a case which is not really classification but regression, the
continuous version of classification.
In the graph above, we increase the degree n of the polynomial P(x) = a_n x^n + a_{n-1} x^(n-1)
+ ... + a_2 x^2 + a_1 x + a_0 that fits the points. Regression is about finding the polynomial
P(x) that has the smallest generalization error with respect to the source of the data.
When the polynomial is of degree:
0 ==> P(x) = a, a straight line parallel to the X axis that does not seem to fit very well.
1 ==> P(x) = aX + b, another straight line, but it need not be parallel to the X axis. It
seems to fit better.
2 ==> P(x) = aX^2 + bX + c, a parabola, which captures most of the points pretty well,
although not perfectly.
However, if we keep increasing the degree (say, as high as 7), then we find a
polynomial that fits ALL the points in the training data PERFECTLY, but we can tell
even by naked-eye observation (which is in fact very good at generalizing) that it does
not truly represent the data.
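A minimal sketch of such a degree sweep; the noisy data and the degrees tried below are illustrative assumptions, not the lecture's example:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 1, 8)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)   # noisy samples

    for degree in (0, 1, 2, 7):
        coeffs = np.polyfit(x, y, degree)            # least-squares polynomial fit
        train_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
        print(degree, round(train_err, 4))           # training error shrinks to ~0 at degree 7

The degree-7 fit drives the training error to essentially zero, which is exactly the perfect-but-misleading fit described above.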
[Figure: error plotted against polynomial degree, with one curve for the error on the training data and one for the error on the whole data (the generalization error). The training error keeps decreasing with the degree, while the error on the whole data eventually goes back up.]
Binary classification
In binary classification we are trying to distinguish between two classes. Below is an
example of a binary classifier in a two-dimensional space. The two classes are labeled + and -.
[Figure: a plot with axes X1 and X2, showing points labeled + and - separated by a line labeled "binary classifier"; most + points lie above the line and most - points below it, with a few points on the wrong side.]
The binary classifier in our example works the following way: if a point is above the line, it
belongs to +; if it is below the line, it belongs to -.
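A minimal sketch of such a linear decision rule in two dimensions; the weights and constant below are arbitrary illustrative values:

    # A line w1*x1 + w2*x2 + b = 0 splits the plane; the sign of the left-hand
    # side tells us on which side of the line a point falls.
    def linear_classifier(x1, x2, w1=0.0, w2=1.0, b=-0.5):
        return '+' if w1 * x1 + w2 * x2 + b > 0 else '-'

    print(linear_classifier(0.3, 0.9))   # above the line -> '+'
    print(linear_classifier(0.3, 0.1))   # below the line -> '-'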
When choosing the classifier we have to be aware of the overfitting problem. The
classifier has to be a good predictor for the distribution and not just for the sample made
available from the distribution. This means that we cannot simply build a classifier that
is as good as possible with respect to the available sample, since the noise in the data will
be learned as well. This noise-learning phenomenon is called overfitting.
Approaches to deal with overfitting
A. Limit the class of classifiers
- Classifier φ is a polynomial decision surface
- φ is a classification tree (piecewise constant decision surface)
- φ is a neural network
This approach alone does not give a mechanism to limit/control overfitting. Each class of
classifiers is as powerful as any other: all of these classes can reproduce continuous
functions (except at a finite number of points). Therefore, every class is a universal learner.
B. Intuition: Limit the capacity of learner/class
- limit # of parameters
Example: Neural network
A neural network consists of inputs x1, x2, …, xn. All inputs are connected to all the neurons.
The connections have weights w1, w2, …, wn. The output is characterized by the function
σ(x1w1 + x2w2 + … + xnwn + c), where c is a constant, and by a threshold value.
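A minimal sketch of a single such unit, using the logistic function for σ and an arbitrary threshold (both are illustrative assumptions, not specified in the notes):

    import math

    # One unit: weighted sum of the inputs plus a constant c, passed through
    # sigma, then compared against a threshold to produce a class label.
    def neuron(xs, ws, c=0.0, threshold=0.5):
        activation = 1.0 / (1.0 + math.exp(-(sum(x * w for x, w in zip(xs, ws)) + c)))
        return '+' if activation > threshold else '-'

    print(neuron([1.0, 0.2], [2.0, -1.0]))   # sigma(1.8) ~ 0.86 > 0.5 -> '+'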
Problem:
Find the "right" capacity of the classifier. To do that we need other data, besides the training
data, to estimate it. Therefore, we have several options:
Option I
1. Pick various capacities
2. Learn the best classifier for each capacity using the training data
3. Estimate the error on a different data set (a sketch of this procedure follows below)
Option II
Allow a lot of parameters and use regularization methods
Option III
Build a large tree of classifiers and use extra data to pick part of the tree (chop the
tree).
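A minimal sketch of Option I, reusing the polynomial-degree setting from the overfitting example; the data, the range of degrees, and the even/odd split are all illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 40)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

    # Hold out part of the data to estimate the error of each capacity (degree).
    train_x, train_y = x[::2], y[::2]
    valid_x, valid_y = x[1::2], y[1::2]

    best = None
    for degree in range(0, 10):                        # 1. pick various capacities
        coeffs = np.polyfit(train_x, train_y, degree)  # 2. learn on the training data
        err = np.mean((np.polyval(coeffs, valid_x) - valid_y) ** 2)   # 3. estimate error on other data
        if best is None or err < best[1]:
            best = (degree, err)
    print("chosen degree:", best[0])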