The Normal Distribution with Many Feature Values

Suppose we have more than one feature value. For example, we are now
given two pieces of information about the unknown person X – the weight
and the height. How can we incorporate this into Bayesian classification?
We can assume a series of models of increasing complexity.
The simplest model we could assume is that the feature values are
independent and that the standard deviation is the same for each.
When we say the feature values are independent we mean that they are not
correlated in any way. In other words one feature value gives you no
information about any of the others.
This causes the probability distributions to have the shape shown below.
[Fig 1: two feature values x (horizontal axis) and y (vertical axis); circular probability contours for two classes.]
This diagram shows an example with two feature values x and y and two
classes. The horizontal axis represents x and the vertical axis represents y.
The contours represent lines of equal probability just as contours on a
geographical map represent lines of equal height. You could imagine the
probability distributions as hills rising up from the plane of the paper.
NB the contours do NOT represent limits of the distributions. Both
distributions go on to infinity in all directions. The distributions pass through
each other.
In this case the contours are circles centred on the means of each class.
Let us further assume that the probability distributions for each class have
the same shape. In other words the feature values for each class are
independent and the standard deviations are the same for each value.
Notice how the probability of belonging to a certain class depends on the
distance from the mean of that class. The distance d is given by
$$d^2 = x^2 + y^2$$
which is just Pythagoras' theorem, with x and y measured from the mean of the class.
The decision boundary between two classes will be the set of points which
are equally distant from the means of both classes. This will be a straight
line which passes through the midpoint of the line joining the two means.
This line is known as the “perpendicular bisector” of the line joining the two means.
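As a concrete illustration, here is a minimal NumPy sketch of this simplest model: classify an unknown object by its ordinary (Euclidean) distance to each class mean. The means and the test point are made-up illustrative values, not taken from the text.

```python
import numpy as np

# Simplest model: independent features, equal standard deviations.
# Classify by ordinary Euclidean distance to each class mean.
mean_1 = np.array([160.0, 60.0])   # illustrative (height, weight) mean for class 1
mean_2 = np.array([180.0, 80.0])   # illustrative (height, weight) mean for class 2

def classify(x):
    """Return the class whose mean is nearest to x."""
    d1 = np.sum((x - mean_1) ** 2)  # squared distance to the class 1 mean
    d2 = np.sum((x - mean_2) ** 2)  # squared distance to the class 2 mean
    return 1 if d1 < d2 else 2

print(classify(np.array([165.0, 63.0])))   # -> 1, closer to the class 1 mean
```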
Now the next most complex model assumes that the feature values are still
independent but that the standard deviations are now different. This leads to
the situation shown below.
[Fig 2: elliptical probability contours whose axes are parallel to the x and y axes.]
The probability contours are now elliptical and the axes of the ellipse are
parallel to the x and y axes.
Let's still assume that the shape of the probability distribution is the same for
each class. In other words only the position of the mean is different.
Now we can obtain Fig 1 from Fig 2 by a simple transformation. Imagine
that Fig 2 is drawn on a rubber sheet. If we squash the rubber sheet in the x
direction then we can convert the ellipses into circles. We could then draw a
decision boundary between the two classes. It will be the perpendicular
bisector between the two means.
We could then stretch the rubber sheet back to its original position. The
decision boundary will still be a straight line but its slope will have changed.
Alternatively we could also describe the probability distributions in terms of
distance from the means. But now we have to invent a new type of distance
metric:
$$d^2 = \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2}$$
This is called Mahalanobis distance. The decision boundary is the set of
points with equal Mahalanobis distance from the two means.
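Here is a minimal NumPy sketch of this diagonal-covariance Mahalanobis distance. The standard deviations, means and test point are assumed, illustrative values.

```python
import numpy as np

# Mahalanobis distance when the features are independent but have different
# standard deviations: d^2 = x^2/sigma_x^2 + y^2/sigma_y^2.
sigma = np.array([10.0, 5.0])        # assumed sigma_x, sigma_y (same for both classes)
mean_1 = np.array([160.0, 60.0])     # illustrative class means
mean_2 = np.array([180.0, 80.0])

def mahalanobis_sq(x, mean):
    """Squared Mahalanobis distance from `mean`, each feature scaled by its own sigma."""
    z = (x - mean) / sigma
    return np.sum(z ** 2)

x = np.array([168.0, 72.0])
# Assign x to whichever class mean is nearer in the Mahalanobis sense.
print(1 if mahalanobis_sq(x, mean_1) < mahalanobis_sq(x, mean_2) else 2)
```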
The next most complex model assumes that the feature values are no longer
independent. Or, in other words, that they are correlated with one another.
This implies that if we know one feature value then this tells us something
about the other feature values.
The probability contours now look like this.
[Fig 3: elliptical probability contours whose axes are tilted relative to the x and y axes.]
They are still elliptical but the axes of the ellipses are not parallel to the x and y axes.
In the case shown here the principal axes slope upwards. This implies that as
x increases y also tends to increase. So if you know that x is high you would
expect y also to be high.
Let us still assume that the probability distributions have the same shape for
each class.
But we can turn Fig 3 into Fig 2 by a simple transformation. We simply
rotate the co-ordinate system so that the axes are now parallel to the axes of
the ellipse.
We can then squash the co-ordinate system to turn the ellipses into circles.
We can then draw the decision boundary as before. We can then stretch and
rotate the co-ordinate system back to its original state. The decision
boundary will still be a straight line.
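This "rotate then squash" operation can be written down directly using an eigendecomposition of the shared covariance matrix. The sketch below is one way it might be done in NumPy, with an invented covariance matrix; it checks that the transformed data has circular contours, i.e. a covariance close to the identity.

```python
import numpy as np

# Turn tilted elliptical contours into circles: rotate onto the principal
# axes, then divide by the standard deviation along each axis.
cov = np.array([[4.0, 3.0],
                [3.0, 9.0]])                    # illustrative shared covariance
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 0.0], cov, size=5000)

eigvals, eigvecs = np.linalg.eigh(cov)          # columns of eigvecs are the ellipse axes
rotated = data @ eigvecs                        # rotate: axes now parallel to the ellipse axes
circular = rotated / np.sqrt(eigvals)           # squash: unit standard deviation on every axis

print(np.cov(circular.T))                       # approximately the 2x2 identity matrix
```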
The most complex model we can assume is that the feature values are not
independent and the probability distributions for the two classes have
different shapes.
One example of this is shown below.
[Fig 4: elliptical probability contours with a different shape for each class.]
The probability contours are still ellipses but the different classes are no
longer represented by the same ellipse.
Now the decision boundaries are no longer represented by straight lines. An
extreme example is shown below.
One class is represented by a broad shallow distribution but the other is
represented by a narrow highly peaked distribution. The second class is
concentrated in a small region and is “surrounded” by the first class.
The decision boundary will be a closed loop with a tail pointing away from the mean of class 1.
In general the decision boundaries will be quadratic curves. These include
circles, ellipses, parabolas and hyperbolas.
The problem is that a quadratic curve requires many more parameters to
define it than a straight line.
If we have N feature values a line requires N parameters but a quadratic curve requires $N^2$.
In image analysis it is quite common to have images with more than 1000
pixels. Since each pixel counts as a feature value a quadratic curve would
require more than 1 million parameters.
To fit the values of 1 million parameters we would need over 1 million
objects in our training set. It is very rare that we have this amount of data.
For this reason, it is common practice to assume that all classes have the
same shape even when this is unlikely to be the case. If the classes have the
same shape then the decision boundaries are straight lines or planes and so
require far fewer parameters.
However, there is an alternative way of dealing with this problem. It is
called Principal Component Analysis.
Principal Component Analysis
The main purpose of Principal Component Analysis is to find the vectors
which represent the axes of the ellipsoids.
It turns out that these vectors are the eigenvectors of the covariance matrix.
The covariance matrix is defined as follows.
The covariance matrix is an N×N matrix, where N is the number of feature values. Let's suppose we have a set of objects, each with N feature values.
The element $\sigma_{ij}^2$ in row i and column j represents the covariance of the ith and jth feature values. Let the ith and jth feature values of the kth object be called $x_k$ and $y_k$; then $\sigma_{ij}^2$ is given by
$$\sigma_{ij}^2 = \sum_k (x_k - m_x)(y_k - m_y)$$
where $m_x$ is the mean of the ith feature value and $m_y$ is the mean of the jth feature value.
Along the leading diagonal (where i = j), $\sigma_{ii}$ is just the standard deviation of the ith feature value. The off-diagonal elements are measures of how much feature value i is correlated with feature value j. If $\sigma_{ij} = 0$ then i and j are independent.
You can see from the above equation that $\sigma_{ij}$ must equal $\sigma_{ji}$. This implies that the covariance matrix is “symmetric”. That is, if you were to flip the matrix about the leading diagonal (in other words, interchange its rows and columns) the matrix would remain unchanged.
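A short NumPy sketch of this definition, using invented data, builds the covariance matrix from the sum above and confirms that it is symmetric.

```python
import numpy as np

# Build the N x N covariance matrix from K objects, each a row of N feature
# values, using the sum given in the text (no normalisation factor).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # 200 illustrative objects, N = 3 features

centred = X - X.mean(axis=0)           # subtract the mean of each feature (m_x, m_y, ...)
C = centred.T @ centred                # C[i, j] = sum_k (x_k - m_x)(y_k - m_y)

print(np.allclose(C, C.T))             # True: flipping about the leading diagonal changes nothing
# np.cov(X.T) gives the same matrix apart from a 1/(K-1) normalisation factor.
```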
In the situations discussed above the covariance matrices are as follows.
When the probability contours are circles (Fig 1) the covariance matrix is
$$\begin{pmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{pmatrix}$$
The off-diagonal elements are zero because the feature values are
independent. The elements on the leading diagonal are all equal because the
standard deviation is the same for each feature value.
In the second situation where the probability contours were ellipses with
their axes parallel to the x and y axes (Fig 2) the covariance matrix is
$$\begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix}$$
The off-diagonal elements are still zero because the feature values are still
independent. But the elements on the leading diagonal are no longer equal
because the standard deviations are different for each feature value.
In the third case where the probability contours were ellipses with their axes
at an angle (Fig 3) the covariance matrix is
$$\begin{pmatrix} \sigma_x^2 & \sigma_{xy}^2 \\ \sigma_{xy}^2 & \sigma_y^2 \end{pmatrix}$$
The off-diagonal elements are no longer zero because the feature values are
no longer independent.
In the final most complex case which we considered above the classes had
probability distributions with different shapes (Fig 4). The covariance matrix
is a description of the shape of the probability distribution. In this case each
class would have a different covariance matrix.
Eigenvalues and Eigenvectors
The eigenvalues and eigenvectors of a square matrix A are defined by
$$\mathbf{A}\mathbf{u}_k = \lambda_k \mathbf{u}_k$$
where $\mathbf{u}_k$ is the kth eigenvector and $\lambda_k$ is the kth eigenvalue. If A is an N×N matrix then there will be N eigenvectors and N eigenvalues. Each vector $\mathbf{u}_k$ will have N components. The eigenvalues $\lambda_k$ are scalars.
(Here we use the convention that vectors are symbolised by lower-case bold
characters e.g. u, matrices by upper-case bold e.g. A and scalars by italic
e.g. s.)
The above equation says that if you multiply any eigenvector $\mathbf{u}_k$ by the matrix A you get the same eigenvector back again, but multiplied by the scalar $\lambda_k$.
We saw above that covariance matrices are symmetric. If A is symmetric
then the eigenvectors have the following two properties
• The eigenvectors form a “complete” set. This means that they can be used as the axes of a co-ordinate system. Or, in other words, any N-dimensional vector can be expressed as a linear combination of the eigenvectors.
• The eigenvectors are orthogonal (and can be chosen to have unit length), so that
$$\mathbf{u}_i \cdot \mathbf{u}_j = \begin{cases} 0 & i \neq j \\ 1 & i = j \end{cases}$$
Here $\mathbf{u}_i \cdot \mathbf{u}_j$ is the “dot” product or scalar product of the two vectors.
In addition the eigenvalues of a covariance matrix have another property.
They are all greater than or equal to zero:
$$\lambda_k \geq 0$$
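These properties are easy to check numerically. The sketch below uses an invented symmetric matrix and NumPy's eigh routine (intended for symmetric matrices); it is only an illustration of the stated properties.

```python
import numpy as np

# Check the defining equation and the properties of a symmetric (covariance) matrix.
A = np.array([[4.0, 3.0],
              [3.0, 9.0]])                       # illustrative covariance matrix

eigvals, eigvecs = np.linalg.eigh(A)             # eigenvalues and eigenvectors (as columns)

print(np.all(eigvals >= 0))                                        # lambda_k >= 0
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))                 # u_i . u_j = 0 (i != j), 1 (i = j)
print(np.allclose(A @ eigvecs[:, 0], eigvals[0] * eigvecs[:, 0]))  # A u_k = lambda_k u_k
```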
The eigenvectors of the covariance matrix represent the directions of the
principal axes of the ellipsoids. The eigenvalues represent the squares of the
standard deviations along those axes.
[Figure: an ellipse with its principal axes labelled $\mathbf{u}_1$ (the longest axis) and $\mathbf{u}_2$.]
The eigenvectors are ranked in order of the size of their eigenvalues. So $\mathbf{u}_1$ is the eigenvector with the largest eigenvalue. $\mathbf{u}_1$ represents the axis along which the ellipsoid is widest. $\lambda_1$ is proportional to the square of the width of the ellipsoid along that axis.
$\mathbf{u}_2$ is at right angles to $\mathbf{u}_1$ and represents the axis along which the ellipsoid is next widest. $\lambda_2$ is proportional to the square of the width of the ellipsoid along that axis.
In N-dimensional space an ellipsoid will have N axes, all at right angles to one another. $\mathbf{u}_k$ represents the kth axis and $\lambda_k$ is proportional to the square of the width of the ellipsoid along that axis.
We can now transform the co-ordinate system of feature space so that the
eigenvectors now define the axes. This means that the data is now described
by a new set of feature values. The new feature values are given by taking
the dot product of the old feature vector with each eigenvector in turn.
When the data is expressed in terms of the new feature values the axes of the
ellipsoids are now parallel to the axes of the co-ordinate system. So the
feature values are now independent of each other. The off-diagonal elements
of the new covariance matrix are zero. The new covariance matrix is
$$\begin{pmatrix}
\sigma_1^2 & 0 & 0 & \cdots & 0 \\
0 & \sigma_2^2 & 0 & \cdots & 0 \\
0 & 0 & \sigma_3^2 & \cdots & 0 \\
\vdots & & & \ddots & \vdots \\
0 & 0 & 0 & \cdots & \sigma_N^2
\end{pmatrix}$$
where the diagonal elements are the eigenvalues. Remember the eigenvalues
are equal to the square of the standard deviations along the axes.
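The sketch below illustrates this with invented correlated data: projecting each feature vector onto the eigenvectors gives new feature values whose covariance matrix is (approximately) diagonal, with the eigenvalues along the diagonal.

```python
import numpy as np

# Express the data in the eigenvector co-ordinate system and check that the
# new covariance matrix is diagonal, with the eigenvalues on the diagonal.
rng = np.random.default_rng(2)
cov = np.array([[4.0, 3.0],
                [3.0, 9.0]])                     # illustrative covariance
data = rng.multivariate_normal([0.0, 0.0], cov, size=10000)

eigvals, eigvecs = np.linalg.eigh(np.cov(data.T))
new_features = data @ eigvecs                    # new feature k = old feature vector . u_k

print(np.cov(new_features.T))                    # off-diagonal elements near zero
print(eigvals)                                   # match the diagonal elements above
```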
It is usually the case that if N is large (say > 50) the eigenvalues fall very steeply.
[Figure: the eigenvalues $\lambda_k$ plotted against the index k, falling steeply towards zero.]
This means that the data is spread out in only a small number of directions.
There are only a small number of axes of the ellipsoid where the ellipsoid is
significantly wide. Along all the other axes the ellipsoid is very thin.
You can visualise this by considering a 3-dimensional space, for instance a
room in a building, and imagining that the data is spread out on the surface
of a table within the room. The data is spread out along only two
dimensions. It has very little variation perpendicular to the surface of the
table. So the third dimension gives us very little information about the data.
We can now ignore all those feature values whose eigenvalues are below a
certain threshold. The data has very little variation along these directions.
The data is described almost completely by the small number of feature
values with large eigenvalues.
This means that we have reduced the number of dimensions N necessary to
describe the data. Typically we can reduce N from > 50 down to maybe 5 or
6. But remember that the number of parameters needed to describe a
quadratic curve is proportional to $N^2$. So we reduce the number of
parameters from > 2500 down to 25 or 36. This is a huge reduction in the
complexity of the model.
This is the basic reason why we use Principal Component Analysis. It
reduces the number of dimensions and allows us to use simpler models.
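As a rough NumPy sketch of this reduction (with invented data and an arbitrary choice of how many components to keep):

```python
import numpy as np

# Keep only the eigenvectors with the largest eigenvalues and describe each
# object by its projections onto them.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 60))                   # 500 illustrative objects, N = 60 features

eigvals, eigvecs = np.linalg.eigh(np.cov(X.T))
order = np.argsort(eigvals)[::-1]                # rank eigenvectors by eigenvalue, largest first
eigvecs = eigvecs[:, order]

k = 5                                            # assumed number of components to keep
reduced = (X - X.mean(axis=0)) @ eigvecs[:, :k]
print(reduced.shape)                             # (500, 5): 60 feature values reduced to 5
```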
What do the Principal Components mean?
The eigenvectors usually represent independent modes of variation in the
data. Often these represent underlying “causes” in the real world which lead
to the different manifestations of the data.
For example, consider the distribution of height and weight among the male
population
[Figure: scatter of weight (vertical axis) against height (horizontal axis) for the male population; A and B mark the ends of the main axis of the ellipse, C and D the ends of the secondary axis.]
Here the main axis of the ellipse slopes upward showing that height and
weight are correlated. The second axis is perpendicular to the main axis.
These two axes form a “natural” co-ordinate system to describe the data. But
what do these two axes represent in the real world?
Consider the people around points A and B at opposite ends of the main
axis. People at point A have low weight and low height so they are small
people but have a normal body shape. People at point B have high weight
and high height so they are large people but also have a normal body shape.
So the main axis represents a change in size of the person while the body
shape remains constant.
[Figure: sketches of a small person (A) and a large person (B) with the same body shape.]
Consider people at points C and D at opposite ends of the secondary axis.
People at point C have low weight but high height so they are very thin
people. People at point D conversely have high weight and low height so
they are very fat people. So the secondary axis represents a change in body
shape from thin to fat. You could call this the “obesity” axis.
[Figure: sketches of a thin person (C) and a fat person (D).]
Let's take a more complicated example. Suppose we gathered many
thousands of hand-written examples of the digit ‘2’. We could find the
Principal Components of this data. We would probably find that some of the
eigenvectors represented variations like these:
• A variation in the tilt of the main axis
• A variation in the height of the upper loop
• A variation in the size of the lower loop
• A variation in the angle of the downstroke
Some Vector Mathematics
The easiest way to calculate decision boundaries is to use vector
mathematics.
[Figure: the two class means $\mathbf{c}_1$ and $\mathbf{c}_2$, their midpoint $\mathbf{m}$, and the boundary direction $\mathbf{b}$ through $\mathbf{m}$ at right angles to the line joining the means.]
In the simplest case where the distributions are circles it is very easy to
calculate the equation of the boundary. The boundary passes through the
midpoint m between the means of the two classes, where
$$\mathbf{m} = (\mathbf{c}_1 + \mathbf{c}_2)/2$$
Let the vector b represent the direction of the boundary. Let the vector n
represent the vector between the two means.
$$\mathbf{n} = \mathbf{c}_2 - \mathbf{c}_1$$
The boundary b is perpendicular to n. Therefore
$$\mathbf{b} \cdot \mathbf{n} = 0$$
We have a choice as to the exact values of the components of b. There are a
number of possibilities that will satisfy the above equation but the simplest
choice is to make
$$b_1 = n_2, \qquad b_2 = -n_1$$
Now we can represent the equation of the boundary by
p = m + sb
where p represents the position vector of any point on the boundary and s is
a scalar.
This also gives us an easy way of deciding which class a new object is in.
Let the feature values of the new object be vector d. Then if
$$(\mathbf{d} - \mathbf{m}) \cdot \mathbf{n} > 0$$
then it's in class 2, and if it's < 0 then it's in class 1 (remember that n points from $\mathbf{c}_1$ towards $\mathbf{c}_2$).
Now let’s see how to find the decision boundary in the case where the
probability distributions are elliptical. At the start of this chapter we said that
you could convert Fig 2 to Fig 1 by “squashing up” the axes so that the
ellipses became circles.
We can squash up the axes by dividing the x-values by  x and the y-values
by  y . This means that the standard deviations along both axes are now
equal to 1. The components of the vector n between the two means now
become
$$n_1/\sigma_x, \qquad n_2/\sigma_y$$
The vector b which represents the boundary now becomes
$$b_1 = n_2/\sigma_y, \qquad b_2 = -n_1/\sigma_x$$
We now have to “unsquash” the axes in order to return to the original state.
So we now have to multiply the x-values by  x and the y-values by  y . The
components of the boundary vector now become
$$b_1 = \frac{\sigma_x}{\sigma_y}\, n_2, \qquad b_2 = -\frac{\sigma_y}{\sigma_x}\, n_1$$
The decision boundary still passes through the midpoint between the two
means but the slope has now changed. So we can still write the equation of
the boundary as
p = m + sb
It's just that the values of b are different.
Now let us consider the third situation where the ellipses are at an angle. We
now have to rotate the x and y axes so that they are parallel to the axes of the
ellipse.
Let $\mathbf{u}_1$ and $\mathbf{u}_2$ be the eigenvectors which represent the first and second axes of the ellipses. Then if we rotate the x and y axes the new values of the components of n become
$$n_1 = \mathbf{n} \cdot \mathbf{u}_1, \qquad n_2 = \mathbf{n} \cdot \mathbf{u}_2$$
This gets us back to situation 2 so we now have to squash up the axes again
so that the ellipses become circles. So just as before we divide the
components of n by $\sigma_1$ and $\sigma_2$, the standard deviations along the first and second axes:
$$n_1/\sigma_1, \qquad n_2/\sigma_2$$
Of course, $\sigma_1$ is just $\sqrt{\lambda_1}$ and $\sigma_2$ is $\sqrt{\lambda_2}$, where $\lambda_1$ and $\lambda_2$ are the eigenvalues. We can now set the values of b as before
$$b_1 = n_2/\sigma_2, \qquad b_2 = -n_1/\sigma_1$$
Then we have to unsquash the axes
1
n2
2

b2   2 n1
1
b1 
And finally rotate them back to the original angles.
$$b_1 = \mathbf{b} \cdot \mathbf{v}, \qquad b_2 = \mathbf{b} \cdot \mathbf{w}$$
where v and w are the vectors in the rotated co-ordinate system which
represent the original x and y axes
$$v_1 = (1,0) \cdot \mathbf{u}_1, \quad v_2 = (1,0) \cdot \mathbf{u}_2, \quad w_1 = (0,1) \cdot \mathbf{u}_1, \quad w_2 = (0,1) \cdot \mathbf{u}_2$$
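The whole recipe for the tilted-ellipse case (rotate, squash, take the perpendicular, unsquash, rotate back) can be strung together as in the NumPy sketch below. The covariance matrix and the means are invented; the final print confirms that the resulting b satisfies $\mathbf{b} \cdot \Sigma^{-1}\mathbf{n} \approx 0$, which is the condition for lying along the boundary between the two classes.

```python
import numpy as np

# Boundary direction for the tilted-ellipse case, following the steps in the text.
cov = np.array([[4.0, 3.0],
                [3.0, 9.0]])                     # illustrative shared covariance
c1 = np.array([0.0, 0.0])                        # illustrative class means
c2 = np.array([4.0, 2.0])

eigvals, U = np.linalg.eigh(cov)                 # columns of U are u_1, u_2
n = c2 - c1
n_rot = U.T @ n                                  # rotate:  n_k = n . u_k
n_circ = n_rot / np.sqrt(eigvals)                # squash:  divide by sigma_k = sqrt(lambda_k)
b_circ = np.array([n_circ[1], -n_circ[0]])       # perpendicular direction in the circular frame
b_rot = b_circ * np.sqrt(eigvals)                # unsquash
b = U @ b_rot                                    # rotate back to the original x, y axes

m = (c1 + c2) / 2                                # the boundary is p = m + s b
print(b @ np.linalg.solve(cov, n))               # approximately zero: b lies along the boundary
```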
Vectors and Principal Components
We saw above that Principal Components Analysis tells us which are the
most important eigenvectors. These eigenvectors are the directions in
feature space which best describe the data.
Take for example the graph below, which shows the eigenvalues.
[Figure: the eigenvalues $\lambda_k$ plotted against k.]
All the eigenvalues except the first three are nearly zero. In other words only
the first three eigenvectors give any useful information about the data. The
variation in the data along the remaining eigenvectors is negligible. This
implies that the data lies in a three-dimensional sub-space within the total
feature space.
During the classification phase, when we wish to classify a new object, we
could simply test to see if the new object lies in the above three-dimensional
space. If it does not then we know that the object cannot belong to the class
represented by the above data. But how can we decide whether the new
object lies in this space?
Let us choose some point within the space to be the origin (the most
convenient point to choose would be the mean). Let the new object be
represented by point P. Then let us draw the vector from the origin to P.
Let’s call this vector p.
If P lies in the space then the scalar product of p with any eigenvector $\mathbf{u}_k$ where k > 3 will be very small. We could calculate each of these values and test to see if they are small, but this is not very computationally efficient since there may be many such eigenvectors.
It’s easier to use the following formula
$$\sum_{i=1}^{N} p_i^2 \;-\; \sum_{k=1}^{3} g_k^2 \;<\; T$$
which is equivalent to the above operation but is faster since it only involves
calculating three scalar products.
Here N is the total number of dimensions and
$$g_k = \mathbf{p} \cdot \mathbf{u}_k$$
is the scalar product of p with eigenvector $\mathbf{u}_k$.
T is a threshold value which the user is free to choose. But usually T will
depend on the eigenvalues which are left out of the sub-space. So a
reasonable choice would be
$$T = 3 \sum_{j>3}^{N} \lambda_j$$