PUT-November14
PR-ECS074
Solution
Q1. (a) Pattern Recognition: Pattern recognition consists of recognizing a pattern using a machine (computer). It can be defined in several ways.
• Definition 1: It is a study of ideas and algorithms that provide computers with a perceptual capability to put abstract objects, or patterns, into categories in a simple and reliable way.
• Definition 2: It is an ambitious endeavor of mechanization of the most fundamental function of cognition.
The design of a pattern recognition system usually entails the repetition of a number of different
activities: data collection, feature choice, model choice, training, and evaluation.
a) Data collection can account for a surprisingly large part of the cost of developing a pattern recognition system. It may be possible to perform a preliminary feasibility study with a small set of "typical" examples, but much more data will usually be needed to assure good performance in the fielded system.
b) The choice of the distinguishing features is a critical design step and depends on the characteristics of the problem domain.
c) How do we know when a hypothesized model differs significantly from the true model underlying our patterns, and thus a new model is needed? In short, how are we to know to reject a class of models and try another one?
d) In general, the process of using data to determine the classifier is referred to as training the classifier.
e) Evaluation is important both to measure the performance of the system and to identify the need for improvements in its components.
f) In more general terms, we may ask how an algorithm scales as a function of the number of feature dimensions, the number of patterns, or the number of categories. What is the tradeoff between computational ease and performance? In some problems we know we can design an excellent recognizer, but not within the engineering constraints. How can we optimize within such constraints?
1(b)-(i) A normal distribution in a variate x with mean μ and variance σ² is a statistical distribution with probability density function

P(x) = (1 / (σ √(2π))) e^(−(x − μ)² / (2σ²))   (1)

on the domain x ∈ (−∞, ∞).
The normal distribution is implemented in Mathematica as NormalDistribution[mu, sigma].
The so-called "standard normal distribution" is given by taking μ = 0 and σ = 1 in a general normal distribution. An arbitrary normal distribution can be converted to a standard normal distribution by changing variables to z = (x − μ)/σ, so dz = dx/σ, yielding

P(z) = (1 / √(2π)) e^(−z²/2).   (2)







Seven features of normal distributions are listed below:
• Normal distributions are symmetric around their mean.
• The mean, median, and mode of a normal distribution are equal.
• The area under the normal curve is equal to 1.0.
• Normal distributions are denser in the center and less dense in the tails.
• Normal distributions are defined by two parameters, the mean (μ) and the standard deviation (σ).
• Approximately 68% of the area of a normal distribution is within one standard deviation of the mean.
• Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
(ii) A learning agent comprises the following components: Sensors, Effectors, a Performance Element, a Learning Element, a Critic, and a Problem Generator.
• Learning Element: makes changes to the system based on how it is doing.
• Performance Element: the agent itself that acts in the world.
• Critic: tells the Learning Element how it is doing (e.g., success or failure) by comparing with a fixed standard of performance.
• Problem Generator: suggests "problems" or actions that will generate new examples or experiences that will aid in training the system further.
1-(c)
A random variable is a function that associates a unique numerical value with every outcome of
an experiment. The value of the random variable will vary from trial to trial as the experiment is
repeated. There are two types of random variable - discrete and continuous.
A discrete random variable is one which may take on only
a countable number of distinct values such as 0, 1, 2, 3, 4, ... Discrete random variables are
usually (but not necessarily) counts. If a random variable can take only a finite number of
distinct values, then it must be discrete. Examples of discrete random variables include the
number of children in a family, the Friday night attendance at a cinema, the number of patients
in a doctor's surgery, the number of defective light bulbs in a box of ten.
The probability distribution of a discrete random variable is a list of
probabilities associated with each of its possible values. It is also sometimes called the
probability function or the probability mass function.
Suppose a random variable X may take k different values, with the probability that X = xi defined
to be P(X = xi) = pi. The probabilities pi must satisfy the following:
1: 0 < pi < 1 for each i
2: p1 + p2 + ... + pk = 1.
A continuous random variable is one which takes an infinite number of possible values.
Continuous random variables are usually measurements. Examples include height, weight, the
amount of sugar in an orange, the time required to run a mile.
A continuous random variable is not defined at specific values. Instead, it
is defined over an interval of values, and is represented by the area under a curve (in advanced
mathematics, this is known as an integral). The probability of observing any single value is equal
to 0, since the number of values which may be assumed by the random variable is infinite.
Suppose a random variable X may take all values over an interval of real numbers. Then the
probability that X is in the set of outcomes A, P(A), is defined to be the area above A and under a
curve. The curve, which represents a function p(x), must satisfy the following:
1: The curve has no negative values (p(x) > 0 for all x)
2: The total area under the curve is equal to 1.
A curve meeting these requirements is known as a density curve.
All random variables (discrete and continuous) have a cumulative distribution function. It is a
function giving the probability that the random variable X is less than or equal to x, for every
value x. For a discrete random variable, the cumulative distribution function is found by
summing up the probabilities.
The chi-square test is used to determine whether there is a significant difference
between the expected frequencies and the observed frequencies in one or more
categories. Do the number of individuals or objects that fall in each category differ
significantly from the number you would expect? Is this difference between the
expected and observed due to sampling error, or is it a real difference?
Chi-Square Test Requirements:
1. Quantitative data.
2. One or more categories.
3. Independent observations.
4. Adequate sample size (at least 10).
5. Simple random sample.
6. Data in frequency form.
7. All observations must be used.
The chi-square formula used is
χ² = Σ (O − E)² / E
where O is the observed frequency in each category,
E is the expected frequency in the corresponding category,
df is the number of degrees of freedom (n − 1, where n is the number of categories), and
χ² is the chi-square statistic.
                        Green    Yellow
Observed (o)              639       241
Expected (e)              660       220
Deviation (d = o − e)     −21        21
Deviation squared (d²)    441       441
d²/e                    0.668     2.005

χ² = Σ d²/e = 0.668 + 2.005 = 2.673, with df = 2 − 1 = 1.
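As a sketch of how this computation can be checked numerically, the observed and expected counts from the table above can be passed to scipy.stats.chisquare; the degrees of freedom (number of categories − 1 = 1) are handled automatically.

```python
from scipy.stats import chisquare

observed = [639, 241]   # green, yellow
expected = [660, 220]   # expected counts under the assumed ratio

# Returns the chi-square statistic and its p-value (df = 2 - 1 = 1)
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.3f}, p-value = {p_value:.3f}")
```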
Q2.(a) Bayes' theorem (or Bayes' Law and sometimes Bayes' Rule) is a direct application
of conditional probabilities. The probability P(A|B) of "A assuming B" is given by the
formula
P(A|B) = P(A∩B) / P(B)
which for our purpose is better written as
P(A∩B) = P(A|B)·P(B).
The left hand side P(A∩B) depends on A and B in a symmetric manner and would be the
same if we started with P(B|A) instead:
P(B|A)·P(A) = P(A∩B) = P(A|B)·P(B).
Bayesian decision theory is a fundamental statistical approach to the problem of pattern
classification. This approach is based on quantifying the tradeoffs between various classification
decisions using probability and the costs that accompany such decisions. It makes the assumption
that the decision problem is posed in probabilistic terms, and that all of the relevant probability
values are known.
If the catch produced as much sea bass as salmon, we would say that the next fish is equally likely to be sea bass or salmon. More generally, we assume that there is some a priori probability (or simply prior) P(ω1) that the next fish is sea bass, and some prior probability P(ω2) that it is salmon. If we assume there are no other types of fish relevant here, then P(ω1) and P(ω2) sum to one.
These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or
salmon before the fish actually appears. It might, for instance, depend upon the time of year or
the choice of fishing area. Suppose for a moment that we were forced to make a decision about
the type of fish that will appear next without being allowed to see it. For the moment, we shall
assume that any incorrect classification entails the same cost or consequence, and that the only
information we are allowed to use is the value of the prior probabilities. If a decision must be
made with so little information, it seems logical to use the following decision rule:
Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
Decision Rule: This rule makes sense if we are to judge just one fish, but if we are to judge
many fish, using this rule repeatedly may seem a bit strange. After all, we would always make
the same decision even though we know that both types of fish will appear. How well it works
depends upon the values of the prior probabilities. If P(ω1) is very much greater than P(ω2), our
decision in favor of ω1 will be right most of the time. If P(ω1) = P(ω2), we have only a fifty-fifty chance of being right. In general, the probability of error is the smaller of P(ω1) and P(ω2), and we shall see later that under these conditions no other decision rule can yield a larger probability of being right.
Using the observed feature value x, the posterior probability is given by Bayes' formula

P(ωj | x) = p(x | ωj) P(ωj) / p(x),   (1)

where in this case of two categories

p(x) = p(x | ω1) P(ω1) + p(x | ω2) P(ω2).

Bayes' formula can be expressed informally in English by saying that

posterior = (likelihood × prior) / evidence.

Note: We generally use an upper-case P(·) to denote a probability mass function and a lower-case p(·) to denote a probability density function.
Bayes’ formula shows that by observing the value of x we can convert the prior probability P(ωj)
to the a posteriori probability (or posterior) probability P(ωj |x)— the probability of the state of
nature being ωj given that feature value x has been measured. We call p(x|ωj) the likelihood of
ωj with respect to x (a term chosen to indicate that, other things being equal, the category ωj for
which p(x|ωj) is large is more “likely” to be the true category). Notice that it is the product of the
likelihood and the prior probability that is most important in determining the posterior
probability; the evidence factor, p(x), can be viewed as merely a scale factor that guarantees that
the posterior probabilities sum to one, as all good probabilities must.
The variation of P(ωj |x) with x is illustrated in Fig. 2.2 for the case P(ω1) = 2/3 and P(ω2) =
1/3.
Figure 2.1: Hypothetical class-conditional probability density functions show the probability
density of measuring a particular feature value x given the pattern is in category ωi. If x
represents the length of a fish, the two curves might describe the difference in length of
populations of two types of fish. Density functions are normalized, and thus the area under each
curve is 1.0.
If we have an observation x for which P(ω1|x) is greater than P(ω2|x), we would naturally be
inclined to decide that the true state of nature is ω1. Similarly, if P(ω2|x) is greater than P(ω1|x),
we would be inclined to choose ω2. To justify this decision procedure, let us calculate the
probability of error whenever we make a decision. Whenever we observe a particular x,

P(error | x) = P(ω1 | x) if we decide ω2, and P(error | x) = P(ω2 | x) if we decide ω1.

Clearly, for a given x we can minimize the probability of error by deciding ω1 if P(ω1|x) > P(ω2|x) and ω2 otherwise. Of course, we may never observe exactly the same value of x twice. Will this rule minimize the average probability of error? Yes, because the average probability of error is given by

P(error) = ∫ P(error, x) dx = ∫ P(error | x) p(x) dx,
Figure 2.2: Posterior probabilities for the particular priors P(ω1) = 2/3 and P(ω2) = 1/3 for the
class-conditional probability densities shown in Fig. 2.1. Thus in this case, given that a pattern is
measured to have feature value x = 14, the probability it is in category ω2 is roughly 0.08, and
that it is in ω1 is 0.92. At every x, the posteriors sum to 1.0.
and if for every x we insure that P(error|x) is as small as possible, then the integral must be as
small as possible. Thus we have justified the following Bayes' decision rule for minimizing the probability of error:

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2,

and under this rule P(error|x) = min[P(ω1|x), P(ω2|x)].
This form of the decision rule emphasizes the role of the posterior probabilities. By using Eq. 1,
we can instead express the rule in terms of the conditional and prior evidence probabilities. First
note that the evidence, p(x), in Eq. 1 is unimportant as far as making a decision is concerned. It is
basically just a scale factor that states how frequently we will actually measure a pattern with
feature value x; its presence in Eq. 1 assures us that P(ω1|x) + P(ω2|x) = 1. By eliminating this
scale factor, we obtain the following completely equivalent decision rule:
Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2.   (8)

Some additional insight can be obtained by considering a few special cases. If for some x we have p(x|ω1) = p(x|ω2), then that particular observation gives us no information about the state of nature; in this case, the decision hinges entirely on the prior probabilities. On the other hand, if P(ω1) = P(ω2), then the states of nature are equally probable; in this case the decision is based entirely on the likelihoods p(x|ωj). In general, both of these factors are important in making a decision, and the Bayes decision rule combines them to achieve the minimum probability of error.
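A minimal sketch of this two-category rule in Python, assuming (purely for illustration) univariate Gaussian class-conditional densities with made-up means, variances, and priors:

```python
import numpy as np
from scipy.stats import norm

# Hypothetical class-conditional densities p(x|w1), p(x|w2) and priors
prior = {"w1": 2 / 3, "w2": 1 / 3}
likelihood = {"w1": norm(loc=11.0, scale=1.5),   # e.g. sea-bass lengths
              "w2": norm(loc=14.0, scale=1.8)}   # e.g. salmon lengths

def decide(x):
    """Bayes decision rule: pick the class with the larger p(x|wi) * P(wi)."""
    score1 = likelihood["w1"].pdf(x) * prior["w1"]
    score2 = likelihood["w2"].pdf(x) * prior["w2"]
    return "w1" if score1 > score2 else "w2"

for x in (10.0, 12.5, 15.0):
    print(x, "->", decide(x))
```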
2.(b) Case 1:
The simplest case occurs when the features that are measured are statistically independent of each other and when each feature has the same variance, σ². For example, if we were trying to recognize an apple from an orange, and we measured the colour and the weight as our feature vector, then chances are that there is no relationship between these two properties. The non-diagonal elements of the covariance matrix are the covariances of the two features x1 = colour and x2 = weight. But because these features are independent, their covariances would be 0. Therefore, the covariance matrix for both classes would be diagonal, being merely σ² times the identity matrix I.
As a second simplification, assume that the variance of colours is the same as the variance of
weights. This means that there is the same degree of spreading out from the mean of colours as
there is from the mean of weights. If this is true for some class i then the covariance matrix for
that class will have identical diagonal elements. Finally, suppose that the variance for the colour
and weight features is the same in both classes. This means that the degree of spreading for these
two features is independent of the class from which you draw your samples. If this is true, then
the covariance matrices will be identical. When normal distributions are plotted that have a
diagonal covariance matrix that is just a constant multiplied by the identity matrix, their cluster
points about the mean are spherical in shape.
Geometrically, this corresponds to the situation in which the samples fall in equal-size
hyperspherical clusters, the cluster for the ith class being centered about the mean vector µi (see Figure 4.12). The computation of the determinant and the inverse of Σi is particularly easy:

|Σi| = σ^(2d)  and  Σi⁻¹ = (1/σ²) I.   (4.42)

Because both |Σi| and the (d/2) ln 2π term are independent of i, they are unimportant additive constants that can be ignored. Thus, we obtain the simple discriminant functions
Figure 4.12: Since the bivariate normal densities have diagonal covariance matrices proportional to the identity, their contours are circular. Because each class has the exact same covariance matrix, the circular contours are the same size for both classes: identical covariance matrices imply that the two classes have identically shaped clusters about their mean vectors.
gi(x) = −‖x − µi‖² / (2σ²) + ln P(ωi),   (4.43)
where
‖x − µi‖² = (x − µi)ᵀ(x − µi).   (4.44)
If the prior probabilities are not equal, then Eq. 4.43 shows that the squared distance ‖x − µi‖² must be normalized by the variance σ² and offset by adding ln P(ωi); thus, if x is equally near two different mean vectors, the optimal decision will favor the a priori more likely category. Regardless of whether the prior probabilities are equal or not, it is not actually necessary to compute distances. Expansion of the quadratic form yields

gi(x) = −(1/(2σ²)) [xᵀx − 2µiᵀx + µiᵀµi] + ln P(ωi),   (4.45)
which appears to be a quadratic function of x. However, the quadratic term xTx is the same for all
i, making it an ignorable additive constant. Thus, we obtain the equivalent linear discriminant
functions
gi(x) = wiᵀx + wi0,   (4.46)
where
wi = µi / σ²   (4.47)
and
wi0 = −µiᵀµi / (2σ²) + ln P(ωi).   (4.48)
We call wi0 the threshold or bias for the ith category.
The decision boundaries for these discriminant functions are found by intersecting the functions
gi(x) and gj(x) where i and j represent the 2 classes with the highest a posteriori probabilities. As
in the univariate case, this is equivalent to determining the region for which gi(x) is the
maximum of all the discriminant functions. By setting gi(x) = gj(x) we have that:
(4.49)
Consider the term wi0 - wj0:
(4.50)
Now, by adding and subtracting the same term, we get:
(4.51)
By letting:
(4.52)
the result is:
(4.53)
But because of the way we define wi and wj, this is just:
(4.54)
So from the original equation we have:
(4.55)
and after multiplying through by variance the final decision boundary is given by:
(4.56)
Now let w = µi − µj. Then this boundary can be written as:

wᵀ(x − x0) = 0,   (4.57)

where

x0 = ½(µi + µj) − (σ² / ‖µi − µj‖²) ln[P(ωi)/P(ωj)] (µi − µj)   (4.58)

and

w = µi − µj.   (4.59)

This is called the normal form of the boundary equation. Geometrically, equations 4.57, 4.58, and 4.59 define a hyperplane through the point x0 that is orthogonal to the vector w. But since w = µi − µj, the hyperplane which separates Ri and Rj is orthogonal to the line that links their means. If P(ωi) = P(ωj), the second term on the right of Eq. 4.58 vanishes, and thus the point x0 is halfway between the means (it equally divides the distance between the 2 means, with a decision region on either side), and the hyperplane is the perpendicular bisector of the line between the means (see Figure 4.13).
Figure 4.13: Two bivariate normal distributions whose priors are exactly the same. Therefore, the decision boundary is exactly at the midpoint between the two means. The decision boundary is a line orthogonal to the line joining the two means.
If P(wi)P(wj) the point x0 shifts away from the more likely mean. Note, however, that if the
variance is small relative to the squared distance
, then the position of the decision
boundary is relatively insensitive to the exact values of the prior probabilities. In other words,
there are 80% apples entering the store. If you observe some feature vector of color and weight
that is just a little closer to the mean for oranges than the mean for apples, should the observer
classify the fruit as an orange? The answer depends on how far from the apple mean the feature
vector lies. In fact, if P(wi)>P(wj) then the second term in the equation for x0 will subtract a
positive amount from the first term. This will move point x0 away from the mean for Ri. If
P(wi)<P(wj) then x0 would tend to move away from the mean for Rj. So for the above example
and using the above decision rule, the observer will classify the fruit as an apple, simply because
it's not very close to the mean for oranges, and because we know there are 80% apples in total
(see Figure 4.14 and Figure 4.15).
Figure 4.14: As the priors change, the decision boundary through point x0 shifts away from the more common class mean (two-dimensional Gaussian distributions).
Figure 4.15: As the priors change, the decision boundary through point x0 shifts away from the more common class mean (one-dimensional Gaussian distributions).
If the prior probabilities P(wi) are the same for all c classes, then the ln P(wi) term becomes
another unimportant additive constant that can be ignored. When this happens, the optimum
decision rule can be stated very simply: the decision rule is based entirely on the distance from
the feature vector x to the different mean vectors. The object will be classified to Ri if it is
closest to the mean vector for that class. To classify a feature vector x, measure the Euclidean distance ‖x − µi‖ from x to each of the c mean vectors, and assign x to the category of the nearest mean. Such a classifier is called a minimum-distance classifier. If each mean vector is
thought of as being an ideal prototype or template for patterns in its class, then this is essentially
a template-matching procedure.
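A minimal sketch of such a minimum-distance (nearest-mean) classifier; the mean vectors below are made up for illustration:

```python
import numpy as np

# Hypothetical class means (one row per class), e.g. [colour, weight]
means = np.array([[0.2, 150.0],    # class 0: "apple"
                  [0.8, 180.0]])   # class 1: "orange"

def min_distance_classify(x, means):
    """Assign x to the class whose mean vector is nearest in Euclidean distance."""
    d = np.linalg.norm(means - x, axis=1)
    return int(np.argmin(d))

print(min_distance_classify(np.array([0.3, 155.0]), means))  # -> 0
```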
Case 2:
Another simple case arises when the covariance matrices for all of the classes are identical but
otherwise arbitrary. Since it is quite likely that we may not be able to measure features that are
independent, this section allows for any arbitrary covariance matrix for the density of each class.
In order to keep things simple, assume also that this arbitrary covariance matrix is the same for
each class wi. This means that we allow for the situation where the color of fruit may covary with
the weight, but the way in which it does is exactly the same for apples as it is for oranges. Instead
of having spherically shaped clusters about our means, the shapes may be any type of
hyperellipsoid, depending on how the features we measure relate to each other. However, the
clusters of each class are of equal size and shape and are still centered about the mean for that
class.
Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal
clusters of equal size and shape, the cluster for the ith class being centered about the mean vector
µi. Because both the ½ ln |Σi| and the (d/2) ln 2π terms in eq. 4.41 are independent of i, they can be ignored as superfluous additive constants. Using the general discriminant function for the normal density, the constant terms are removed. This simplification leaves the discriminant functions of the form:

gi(x) = −½ (x − µi)ᵀ Σ⁻¹ (x − µi) + ln P(ωi).   (4.60)
Note that, the covariance matrix no longer has a subscript i, since it is the same matrix for all
classes.
If the prior probabilities P(wi) are the same for all c classes, then the ln P(wi) term can be
ignored. In this case, the optimal decision rule can once again be stated very simply: To classify
a feature vector x, measure the squared Mahalanobis distance (x − µi)ᵀΣ⁻¹(x − µi) from x to each of the c mean vectors, and assign x to the category of the nearest mean. As before, unequal prior probabilities bias the decision in favor of the a priori more likely category.
Expansion of the quadratic form (x − µi)ᵀΣ⁻¹(x − µi) results in a sum involving a quadratic term xᵀΣ⁻¹x which here is independent of i. After this term is dropped from eq. 4.41, the resulting discriminant functions are again linear.
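A small sketch of this equal-covariance case: classification by comparing squared Mahalanobis distances offset by the log priors. The shared covariance matrix, means, and priors are illustrative assumptions:

```python
import numpy as np

# Hypothetical shared covariance matrix, class means, and priors
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
means = np.array([[0.0, 0.0],
                  [3.0, 2.0]])
priors = np.array([0.5, 0.5])

Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    """Pick the class minimizing squared Mahalanobis distance minus 2*ln P(wi)."""
    scores = []
    for mu, p in zip(means, priors):
        d = x - mu
        scores.append(d @ Sigma_inv @ d - 2.0 * np.log(p))
    return int(np.argmin(scores))

print(classify(np.array([1.0, 0.5])))
```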
After expanding out the first term in eq. 4.60,

gi(x) = −½ [xᵀΣ⁻¹x − 2µiᵀΣ⁻¹x + µiᵀΣ⁻¹µi] + ln P(ωi),   (4.61)

and dropping the xᵀΣ⁻¹x term, the discriminant functions can be written as

gi(x) = wiᵀx + wi0,   (4.62)

where

wi = Σ⁻¹µi   (4.63)

and

wi0 = −½ µiᵀΣ⁻¹µi + ln P(ωi).   (4.64)
The boundary between two decision regions is given by
(4.65)
Now examine the second term from eq. 4.64. Substituting the values for wi0 and wj0 yields:
(4.66)
Then, by adding and subtracting the same term:
(4.67)
Now if we let
(4.68)
Then the above line reduces to:
which is actually just:
=0 (4.69)
Now, starting with the original equation and substituting this line back in, the result is:
=0
(4.70)
So let
w = wi - wj
which means that the equation for the decision boundary is given by:
wTx - wTx0=0 (4.71)
Because the discriminants are linear, the resulting decision boundaries are again hyperplanes. If
Ri and Rj are contiguous, the boundary between them has the equation eq.4.71 where
w = Σ⁻¹(µi − µj)   (4.72)

and

x0 = ½(µi + µj) − [ln(P(ωi)/P(ωj)) / ((µi − µj)ᵀΣ⁻¹(µi − µj))] (µi − µj).   (4.73)
Again, this formula is called the normal form of the decision boundary.
As in case 1, a line through the point x0 defines this decision boundary between Ri and Rj. If the prior probabilities are equal then x0 is halfway between the means. If the prior probabilities are not equal, the optimal boundary hyperplane is shifted away from the more likely mean. The decision boundary is orthogonal to the vector w = Σ⁻¹(µi − µj). The difference from case 1 lies in the fact that w is no longer exactly in the direction of µi − µj. Instead, the vector between µi and µj is now also multiplied by the inverse of the covariance matrix. This means that the decision boundary is no longer orthogonal to the line joining the two mean vectors. Instead, the boundary line will be tilted depending on how the 2 features covary and their respective variances (see Figure 4.19). As before, with sufficient bias the decision plane need not lie between the two mean vectors.
To understand how this tilting works, suppose that the distributions for class i and class j are bivariate normal, that the variance of feature 1 is σ1² and that of feature 2 is σ2², and that the covariance of the 2 features is 0. Finally, let the mean of class i be at (a, b) and the mean of class j be at (c, d), where a > c and b > d for simplicity. Then the vector w will have the form:

w = Σ⁻¹(µi − µj) = [ (a − c)/σ1² , (b − d)/σ2² ]ᵀ.

This equation can provide some insight as to how the decision boundary will be tilted in relation to the covariance matrix. Note, though, that the direction of the decision boundary is orthogonal to this vector, and so the direction of the decision boundary is given by the vector orthogonal to w:

[ −(b − d)/σ2² , (a − c)/σ1² ]ᵀ.
Now consider what happens to the tilt of the decision boundary when the values of σ1² or σ2² are changed (Figure 4.16). Although the vector form of w provided above shows exactly which way the decision boundary will tilt, it does not illustrate how the contour lines for the 2 classes change as the variances are altered.
Figure 4.16: As the variance of feature 2 is increased, the x term in the vector will become less
negative. This means that the decision boundary will tilt vertically. Similarly, as the variance of
feature 1 is increased, the y term in the vector will decrease, causing the decision boundary to
become more horizontal.
Does the tilting of the decision boundary from the orthogonal direction make intuitive sense?
With a little thought, it is easy to see that it does. For example, suppose that you are again
classifying fruits by measuring their color and weight. Suppose that the color varies much more
than the weight does. Then consider making a measurement at point P in Figure 4.17:
Figure 4.17: The discriminant function evaluated at P is smaller for class apple than it is for
class orange.
In Figure 4.17, the point P is actually closer, in Euclidean distance, to the mean for the orange class. But as can be seen by the ellipsoidal contours extending from each mean, the discriminant function evaluated at P is smaller for class 'apple' than it is for class 'orange'. This is because it is much worse to be farther away in the weight direction than it is to be far away in the color direction. Thus, the total 'distance' from P to the means must take this into account. For this reason, the decision boundary is tilted.
The fact that the decision boundary is not orthogonal to the line joining the 2 means is the only thing that separates this situation from case 1. In both cases, the decision boundaries are straight lines that pass through the point x0. The position of x0 is affected in exactly the same way by the a priori probabilities.
Figure 4.18: The contour lines are elliptical in shape because the covariance matrix is not
diagonal. However, both densities show the same elliptical shape. The prior probabilities are the
same, and so the point x0 lies halfway between the 2 means. The decision boundary is not
orthogonal to the red line. Instead, it is tilted so that its points are of equal distance to the contour
lines in w1 and those in w2.
Figure 4.19: The contour lines are elliptical, but the prior probabilities are different. Although
the decision boundary is a parallel line, it has been shifted away from the more likely class. With
sufficient bias, the decision boundary can be shifted so that it no longer lies between the 2
means:
Case 3:
In the general multivariate normal case, the covariance matrices are different for each category.
This case assumes that the covariance matrix for each class is arbitrary. The discriminant
functions cannot be simplified: the only term that can be dropped from eq. 4.41 is the (d/2) ln 2π term, and the resulting discriminant functions are inherently quadratic:

gi(x) = −½ (x − µi)ᵀ Σi⁻¹ (x − µi) − ½ ln |Σi| + ln P(ωi),   (4.74)

which can equivalently be written as:

gi(x) = xᵀWi x + wiᵀx + wi0,

where

Wi = −½ Σi⁻¹,   wi = Σi⁻¹ µi,   (4.75)

and

wi0 = −½ µiᵀ Σi⁻¹ µi − ½ ln |Σi| + ln P(ωi).
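A short sketch of the general quadratic discriminant gi(x) = xᵀWi x + wiᵀx + wi0 of this case, with illustrative (made-up) class parameters:

```python
import numpy as np

def quadratic_discriminant(x, mu, Sigma, prior):
    """g_i(x) for an arbitrary Gaussian class: quadratic in x."""
    Sigma_inv = np.linalg.inv(Sigma)
    W = -0.5 * Sigma_inv
    w = Sigma_inv @ mu
    w0 = (-0.5 * mu @ Sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(Sigma))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

# Hypothetical parameters for two classes with different covariance matrices
params = [
    (np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
    (np.array([2.0, 2.0]), np.array([[2.0, 0.8], [0.8, 1.5]]), 0.5),
]

x = np.array([1.0, 1.2])
scores = [quadratic_discriminant(x, mu, S, p) for mu, S, p in params]
print(int(np.argmax(scores)))   # index of the winning class
```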
2.(c)
Q3.(a)
3(b)(i) GMM: A Gaussian mixture model is a weighted sum of a number of Gaussians, where the weights are determined by a mixing distribution π:

p(x) = Σ_{k=1}^{K} π_k N(x | µ_k, Σ_k),  with π_k ≥ 0 and Σ_k π_k = 1.

We can represent a GMM using a latent variable z that indicates which component generated each observation. The log-likelihood function for data x1, ..., xN is

ln p(X | π, µ, Σ) = Σ_{n=1}^{N} ln [ Σ_{k=1}^{K} π_k N(x_n | µ_k, Σ_k) ].

Setting the partial derivatives to zero gives the update equations, in which γ(z_nk) denotes the responsibility of component k for point x_n:

γ(z_nk) = π_k N(x_n | µ_k, Σ_k) / Σ_j π_j N(x_n | µ_j, Σ_j)

Optimization of means:  µ_k = (1/N_k) Σ_n γ(z_nk) x_n,  where N_k = Σ_n γ(z_nk)
Optimization of covariance:  Σ_k = (1/N_k) Σ_n γ(z_nk)(x_n − µ_k)(x_n − µ_k)ᵀ
Optimization of mixing term:  π_k = N_k / N

These are the maximum-likelihood (MLE) equations for a GMM; because the responsibilities themselves depend on the parameters, they are solved iteratively by EM.
EM for GMMs:
• Initialize the parameters and evaluate the log likelihood.
• Expectation step (E-step): evaluate the responsibilities γ(z_nk) using the current parameters.
• Maximization step (M-step): re-estimate the parameters (means, covariances, and mixing weights) using the current responsibilities.
• Evaluate the log likelihood and check for convergence; if not converged, return to the E-step.
Singularities - a problem with GMMs:
• When a mixture component collapses onto a single data point, the mean becomes that point and the variance goes to zero.
• Consider the likelihood function as the covariance goes to zero.
• The likelihood approaches infinity.
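A compact sketch of the EM loop outlined above for a one-dimensional GMM, written directly from the update equations (responsibilities, then means, variances, and mixing weights); the data are synthetic and the implementation details are only one reasonable choice:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(3.0, 0.5, 100)])

K = 2
pi = np.full(K, 1.0 / K)          # mixing weights
mu = rng.choice(x, K)             # initial means
var = np.full(K, x.var())         # initial variances

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

prev_ll = -np.inf
for _ in range(200):
    # E-step: responsibilities gamma[n, k] under the current parameters
    weighted = pi * gauss(x[:, None], mu, var)           # shape (N, K)
    gamma = weighted / weighted.sum(axis=1, keepdims=True)

    # Log likelihood under the current parameters; check convergence
    ll = np.log(weighted.sum(axis=1)).sum()
    if abs(ll - prev_ll) < 1e-6:
        break
    prev_ll = ll

    # M-step: re-estimate means, variances, and mixing weights
    Nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk
    pi = Nk / len(x)

print(mu, np.sqrt(var), pi)
```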
(b)(ii) Hidden Markov Models: While belief nets are a powerful method for representing the
dependencies and independencies among variables, we turn now to the problem of representing
particular but extremely important dependencies. In problems that have an inherent temporality -- that
is, consist of a process that unfolds in time -- we may have states at time t that are influenced directly by
a state at t- 1. Hidden Markov models (HMMs) have found greatest use in such problems, for instance
speech recognition or gesture recognition. While the notation and description is unavoidably more
complicated than the simpler models considered up to this point, we stress that the same underlying
ideas are exploited. Hidden Markov models have a number of parameters, whose values are set so as to
best explain training patterns for the known category. Later, a test pattern is classified by the model that
has the highest posterior probability, i.e., that best "explains" the test pattern.
First-order Markov models
We consider a sequence of states at successive times; the state at any time t is denoted w(t). A
particular sequence of length T is denoted by W^T = {w(1), w(2), ..., w(T)}; for instance we might have W^6 = {w1, w4, w2, w2, w1, w4}. Note that the system can revisit a state at different steps, and not every state
need be visited.
Our model for the production of any sequence is described by transition probabilities
P(wj(t+1) | wi(t)) = aij -- the time-independent probability of having state wj at step t+1 given that the
state at time t was wi. There is no requirement that the transition probabilities be symmetric (aij ≠ aji, in
general) and a particular state may be visited in succession (aii ≠ 0, in general), as illustrated (Fig 2.1).
Figure2.1: The discrete states, wi, in a basic Markov model are represented by nodes, and the transition
probabilities, aij, by links. In a first-order discrete-time Markov model, at any step t the full system is in a particular state w(t). The state at step t + 1 is a random function that depends solely on the state at step
t and the transition probabilities.
Suppose we are given a particular model θ -- that is, the full set of aij -- as well as a particular
sequence wT. In order to calculate the probability that the model generated the particular sequence we
simply multiply the successive probabilities. For instance, to find the probability that a particular model
generated the sequence described above, we would have P(W^T | θ) = a14 a42 a22 a21 a14. If there is a prior
probability on the first state P(w(1) = wi), we could include such a factor as well; for simplicity, we will
ignore that detail for now.
Up to here we have been discussing a Markov model, or technically speaking, a first-order
discrete-time Markov model, since the probability at t + 1 depends only on the state at t. For instance, in a Markov model for the production of spoken words, we might have states representing phonemes. Such a Markov model for the word "cat" would have states for /k/, /a/, and /t/, with transitions from /k/ to /a/, transitions from /a/ to /t/, and transitions from /t/ to a final silent state.
Note however that in speech recognition the perceiver does not have access to the states w(t).
Instead, we measure some properties of the emitted sound. Thus we will have to augment our Markov
model to allow for visible states -- which are directly accessible to external measurement -- as separate
from the w states, which are not.
First-order hidden Markov models:
We continue to assume that at every time step t the system is in a state w(t) but now we also
assume that it emits some (visible) symbol v(t). While sophisticated Markov models allow for the
emission of continuous functions (e.g., spectra), we will restrict ourselves to the case where a discrete
symbol is emitted. As with the states, we define a particular sequence of such visible states as VT = {v(1),
v(2), ..., v(T)} and thus we might have V^6 = {v5, v1, v1, v5, v2, v3}.
Our model is then that in any state w(t) we have a probability of emitting a particular visible state vk(t). We denote this probability P(vk(t) | wj(t)) = bjk. Because we have access only to the visible states, while the wj are unobservable, such a full model is called a hidden Markov model (Fig. 2.2).
Figure2.2 :Three hidden units in an HMM and the transitions between them are
shown in black while the visible states and the emission probabilities of visible
states are shown in red. This model shows all transitions as being possible; in
other HMMs, some such candidate transitions are not allowed.
Hidden Markov Model Computation:
Now we define some new terms and clarify our notation. In general networks such as those in
Fig. 2.2 are finite-state machines, and when they have associated transition probabilities, they are called
Markov networks. They are strictly causal- the probabilities depend only upon previous states. A Markov
model is called ergodic if every one of the states has a non-zero probability of occurring given some
starting state. A final or absorbing state w0 is one which, if entered, is never left (i.e., a00 = 1).
As mentioned, we denote by aij the transition probabilities among hidden states and by bjk the probability of the emission of a visible state:
aij = P(wj(t + 1) | wi(t)),   bjk = P(vk(t) | wj(t)).
We demand that some transition occur from step t to step t + 1 (even if it is to the same state), and that some visible symbol be emitted after every step. Thus we have the normalization conditions:
Σj aij = 1 for all i,  and  Σk bjk = 1 for all j,
where the limits on the summations are over all hidden states and all visible symbols, respectively.
With these preliminaries behind us, we can now focus on the three central issues in hidden
Markov models:
The Evaluation problem. Suppose we have an HMM, complete with transition probabilities aij and bjk.
Determine the probability that a particular sequence of visible states VT was generated by that model.
The Decoding problem. Suppose we have an HMM as well as a set of observations VT . Determine the
most likely sequence of hidden states wT that led to those observations.
The Learning problem. Suppose we are given the coarse structure of a model (the number of states and
the number of visible states) but not the probabilities aij and bjk. Given a set of training observations of
visible symbols, determine these parameters.
Evaluation
The probability that the model produces a sequence V^T of visible states is:
P(V^T) = Σ_{r=1}^{rmax} P(V^T | W_r^T) P(W_r^T),
where each r indexes a particular sequence W_r^T = {w(1), w(2), ..., w(T)} of T hidden states. In the general case of c hidden states, there will be rmax = c^T possible terms in the sum,
corresponding to all possible sequences of length T. Thus, according to Eq., in order to compute
the probability that the model generated the particular sequence of T visible states VT, we
should take each conceivable sequence of hidden states, calculate the probability they produce
VT, and then add up these probabilities. The probability of a particular visible sequence is
merely the product of the corresponding (hidden) transition probabilities aij and the (visible)
output probabilities bjk of each step.
Because we are dealing here with a first-order Markov process, the second factor, which describes the transition probability for the hidden states, can be rewritten as:
P(W_r^T) = Π_{t=1}^{T} P(w(t) | w(t − 1)),
that is, a product of the aij's according to the hidden sequence in question. Here w(T) = w0 is some final absorbing state, which uniquely emits the visible state v0. In speech recognition applications, w0 typically represents a null state or lack of utterance, and v0 is some symbol representing silence. Because of our assumption that the output probabilities depend only upon the hidden state, we can write the first factor as
P(V^T | W_r^T) = Π_{t=1}^{T} P(v(t) | w(t)),
that is, a product of bjk's according to the hidden state and the corresponding visible state. We can now combine these to express

P(V^T) = Σ_{r=1}^{rmax} Π_{t=1}^{T} P(v(t) | w(t)) P(w(t) | w(t − 1)).   (1)
Despite its formal complexity, Eq.(1) has a straightforward interpretation. The
probability that we observe the particular sequence of T visible states VT is equal to the sum
over all rmax possible sequences of hidden states of the conditional probability that the system
has made a particular transition multiplied by the probability that it then emitted the visible
symbol in our target sequence. All these are captured in our parameters aij and bkj, and thus Eq.
can be evaluated directly. This is an O(c^T T) calculation, which is quite prohibitive in practice. For instance, if c = 10 and T = 20, we must perform on the order of 10^21 calculations.
A computationally simpler algorithm for the same goal is as follows. We can calculate
P(V^T) recursively, since each term P(v(t) | w(t)) P(w(t) | w(t − 1)) involves only v(t), w(t) and w(t − 1). We do this by defining
αj(t) = [ Σ_i αi(t − 1) aij ] bjk v(t)   (with αj(0) = 1 for the known initial state and 0 otherwise),
where the notation bjk v(t) means the transition probability bjk selected by the visible
state emitted at time t. Thus the only non-zero contribution to the sum is for the index k which
matches the visible state v(t). Thus αj(t) represents the probability that our HMM is in hidden
state wj at step t having generated the first t elements of VT . This calculation is implemented in
the Forward algorithm in the following way:
Algorithm (HMM Forward):
where in line 5, α0 denotes the probability of the associated sequence ending in the known final state. The Forward algorithm thus has a computational complexity of O(c²T) -- far
more efficient than the complexity associated with exhaustive enumeration of paths of Eq. (1).
For the illustration of c = 10, T = 20 above, we would need only on the order of 2000
calculations -- more than 17 orders of magnitude faster than that to examine each path
individually.
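Since the pseudocode of the Forward algorithm did not survive in this copy, here is a minimal sketch consistent with the recursion αj(t) = [Σi αi(t−1) aij] bjk v(t); the transition matrix A, emission matrix B, and initial distribution pi0 are made-up illustrations. Each loop iteration costs O(c²), giving the O(c²T) total mentioned above.

```python
import numpy as np

def forward(A, B, pi0, obs):
    """HMM Forward algorithm: P(V^T | model) in O(c^2 T) time.

    A[i, j] : transition probability a_ij = P(w_j(t+1) | w_i(t))
    B[j, k] : emission probability  b_jk = P(v_k | w_j)
    pi0[j]  : probability of starting in hidden state j
    obs     : sequence of observed (visible) symbol indices
    """
    alpha = pi0 * B[:, obs[0]]                 # alpha_j(1)
    for v in obs[1:]:
        alpha = (alpha @ A) * B[:, v]          # alpha_j(t) = [sum_i alpha_i(t-1) a_ij] b_jv
    return alpha.sum()

# Illustrative 2-state, 3-symbol model (all numbers are made up)
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
pi0 = np.array([0.6, 0.4])

print(forward(A, B, pi0, obs=[0, 1, 2, 1]))
```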
We shall have cause to use the backward algorithm, which is the time-reversed version
of the Forward algorithm.
Algorithm (HMM Backward):
Figure2.3 :The computation of probabilities by the Forward algorithm can be visualized
by means of a trellis -- a sort of "unfolding" of the HMM through time. Suppose we seek
the probability that the HMM was in state w2 at t = 3 and generated the observed visible
symbol up through that step (including the observed visible symbol v k). The probability
the HMM was in state wj(t = 2) and generated the observed sequence through t = 2 is
αj(2) for j = 1, 2, ..., c. To find α2(3) we must sum these contributions, each weighted by the corresponding transition probability aj2, and multiply by the probability that state w2 emitted the observed symbol vk. Formally, for this particular illustration we have
α2(3) = [ Σ_j αj(2) aj2 ] b2k.
Decoding
Given a sequence of visible states VT , the decoding problem is to find the most probable
sequence of hidden states. While we might consider enumerating every possible path and
calculating the probability of the visible sequence observed, this is an O(c^T T) calculation and
prohibitive. Instead, we use perhaps the simplest decoding algorithm:
Algorithm (HMM decoding):
A closely related algorithm uses logarithms of the probabilities and calculates total probabilities by addition of such logarithms; this method has complexity O(c²T).
3.(c) The class statistics (scatter/covariance matrices) are

µ1 = [3.0, 3.6]ᵀ,   µ2 = [8.4, 7.6]ᵀ

S1 = [ 0.80  −0.40 ;  −0.40  2.64 ],   S2 = [ 1.84  −0.04 ;  −0.04  2.64 ]

The between- and within-class scatter matrices are

SB = (µ1 − µ2)(µ1 − µ2)ᵀ = [ 29.16  21.6 ;  21.6  16.0 ]

SW = S1 + S2 = [ 2.64  −0.44 ;  −0.44  5.28 ]

The LDA projection is then obtained as the solution of the generalized eigenvalue problem SW⁻¹ SB w = λ w, which for two classes reduces to

w* ∝ SW⁻¹ (µ2 − µ1) ≈ [0.91, 0.39]ᵀ (after normalization).
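The numbers above can be checked with a few lines of NumPy; SW and SB are built from the given class statistics, and for two classes the LDA direction reduces to SW⁻¹(µ2 − µ1) (up to scale, sign, and rounding):

```python
import numpy as np

mu1 = np.array([3.0, 3.6])
mu2 = np.array([8.4, 7.6])
S1 = np.array([[0.80, -0.40], [-0.40, 2.64]])
S2 = np.array([[1.84, -0.04], [-0.04, 2.64]])

SW = S1 + S2                                   # within-class scatter
SB = np.outer(mu1 - mu2, mu1 - mu2)            # between-class scatter

# For two classes, the leading eigenvector of SW^{-1} SB is proportional
# to SW^{-1}(mu2 - mu1), so the eigenproblem need not be solved explicitly.
w = np.linalg.solve(SW, mu2 - mu1)
w /= np.linalg.norm(w)
print(w)    # approximately [0.92, 0.39], matching the stated answer up to rounding
```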
Q4(a). In general, classification methods allow you to reduce the dimensionality of a
complex data set by grouping the data into a set number of classes.
With traditional (crisp) classification methods, each sample/location is placed
into one class or another. In crisp classification, class membership is binary, a
sample is a member of a class or not. Crisp class membership values can be either
"1" when that class is the best fit, or "0" (for all other classes).
In fuzzy classification, a sample can have membership in many different classes to
different degrees. Typically, the membership values are constrained so that all of the
membership values for a particular sample sum to 1.
Why use fuzzy classes?
Fuzzy classes are appropriate for continuous data that does not fall neatly into
discrete classes, such as climatic data, vegetation type, soil classification, and many other engineering, geological, and medical applications. Fuzzy classes can better
represent transitional areas than hard classification, as class membership is not
binary (yes/no) but instead one location can belong to a few classes.
Brown (1998) identifies fuzzy classification as appropriate for data with 1) "attribute
ambiguity" and 2) "spatial vagueness." Attribute ambiguity occurs when class
membership is partial or unclear. Ambiguity is particularly a problem for some
remotely-sensed data, such as aerial photography, which is not interpreted
consistently. Spatial vagueness emerges when the sampling resolution is not fine
enough to catch boundary locations, when gradual transitions occur between classes,
or when there is some location uncertainty in the data.
Fuzzy classes depict the spatial and attribute uncertainty present in most data sets
more accurately than hard classification.
Fuzzy classification can reduce the dimensionality of multivariate data sets, by
assigning the objects in the data set to k fuzzy classes. You, the user, choose the
number of classes, k.
BoundarySeer uses a k-means technique to create fuzzy classes. First, it assigns the
locations randomly to classes. It then refines the class membership, reducing the
variation within a class and maximizing the between-class variation. This process
results in a new data set where the original spatial locations are described only by
membership in the k classes.
Steps:
1. Initialization.
a. An initial partition of k clusters is established. Cluster membership is initially
random.
b. Select a value for the fuzziness exponent, phi (values can be between 1 and infinity; 2 is a good initial value).
c. Select a value for the stopping criterion, epsilon. It determines the level of convergence necessary before quitting (epsilon = 0.001 is recommended).
2. Refinement. BoundarySeer compares dissimilarity between classes using Euclidean
distance. BoundarySeer rearranges class memberships iteratively to minimize
the within-class least squared-error function, J.
3. Finalization.
a. The procedure terminates when the largest proportional difference between
the membership matrices is less than epsilon, the stopping criterion.
b. Once the final partition has been selected, it is saved as a new data set with
the same X-Y values as the original data set, and variable(s) denoting class
membership. Unless renamed by the user, the data set has a "Classes" suffix.
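A minimal sketch of this fuzzy k-means (fuzzy c-means) refinement loop, using the standard membership and center update rules; BoundarySeer's exact implementation may differ, and the data here are synthetic:

```python
import numpy as np

def fuzzy_kmeans(X, k, phi=2.0, eps=1e-3, max_iter=100, seed=0):
    """Fuzzy k-means: returns cluster centers and an (n, k) membership matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, k))
    U /= U.sum(axis=1, keepdims=True)          # random initial memberships, rows sum to 1

    for _ in range(max_iter):
        # Update centers as membership-weighted means
        W = U ** phi
        centers = (W.T @ X) / W.sum(axis=0)[:, None]

        # Update memberships from Euclidean distances to the centers
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (phi - 1.0)))
        U_new /= U_new.sum(axis=1, keepdims=True)

        # Stop when the largest proportional change in membership is below eps
        if np.max(np.abs(U_new - U) / np.maximum(U, 1e-12)) < eps:
            U = U_new
            break
        U = U_new
    return centers, U

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
centers, U = fuzzy_kmeans(X, k=2)
print(centers)
```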
Choosing fuzzy classification parameters
To perform a fuzzy classification, you must choose values for the number of classes (k), the
fuzziness of the classification (phi), and the stopping criterion (epsilon). BoundarySeer
provides some preset defaults for these settings, so you may classify your data without
entering any values. You may wish to test the influence of these parameters on the
classification by repeating the analysis and varying the parameters.
How many classes? Choosing a value for k:
Choosing an appropriate number of classes is the eternal classification problem.
Classification techniques will produce the number of clusters specified, regardless of
whether they are meaningful distinctions. The k-means technique for fuzzy classification
maximizes between-cluster variation for a set number of clusters (k). You may wish to
check on how the chosen value of k influences the clustering by comparing the outcomes for
a range of k values.
If you have a sense of the number of clusters that is appropriate for your data, use that. For
a first pass, you might try a "rule-of-thumb" from hard clustering: k = n½ where n = the
number of objects in the data set.
How fuzzy? Choosing a value for phi:
The fuzziness exponent, phi, determines the fuzziness of the classification. When phi is set to one (not possible in
BoundarySeer), the clustering is hard clustering, with binary class membership (yes/no). Phi
values for fuzzy clustering can range from just above 1 to infinity. Yet, at very high phi
values, the classification may be so fuzzy as to not distinguish any classes at all. The choice
of phi will balance the need for structure (distinguishable classes) from continuity
(fuzziness). A common starting place is phi = 2. As phi approaches one, clustering becomes
more difficult, so values lower than 1.1 may not produce good results.
How optimal? Choosing a value for epsilon:
BoundarySeer will continually reallocate class membership values between the classes until it arrives at an optimal arrangement. The cutoff for the optimization is epsilon.
BoundarySeer minimizes the within-class least-squared error term. Once BoundarySeer is
changing the matrix of membership values by very small amounts, it is time to stop
optimization. BoundarySeer compares matrices of membership values by the largest
proportional difference between membership values (i.e. if a membership value is 0.75 and
it changes by 0.03, then the proportional difference is 0.03/0.75 = 0.04). All proportionate
differences for each class membership value for each location are calculated, and the largest
must be less than epsilon.
Detecting boundaries on fuzzy classes
Fuzzy classification produces a new multivariate data set with the same spatial support as
the original data set. In this new data set, the locations are associated with new variables,
fuzzy membership values for each of the classes. BoundarySeer can find boundaries for this
new data set in many ways.
Boundary Membership Values (BMVs) can be derived from (1) wombling on the fuzzy
classes, (2) wombling with location uncertainty on the classes, (3) spatially
constrained clustering, (4) the confusion index, or (5) the classification
entropy index.
You may find boundaries using wombling, confusion index, and classification entropy
directly from the fuzzy classification dialog. For location uncertainty and spatially
constrained clustering, first create fuzzy classes, then perform the boundary detection
procedure.
Confusion Index
The confusion index is simply the ratio of the second highest class membership value to
the highest. If the two values are similar, the confusion index returns a value close to one,
indicating high confusion about class membership. If the two values are very different, then
the confusion index is closer to zero, indicating less confusion about class membership.
BoundarySeer uses the confusion index as a Boundary Likelihood Value (BLV).
BoundarySeer calculates the confusion index for each spatial location, then all the confusion
indices for the data set are used to create BMVs. The confusion index values are scaled to
between 0-1, with the lowest confusion index set to 0 and the highest to 1.0. Locations with
high confusion index are most transitional between classes and, therefore, most boundary-like.
Classification entropy
Classification entropy at location i, h(i), is:
h(i) = −(1 / ln k) Σ_{c=1}^{k} m_ic ln(m_ic),
where k is the number of classes, and m_ic is the fuzzy membership value for location i in
class c. Entropy results parallel those of the confusion index, with entropy values close to
one when membership is spread among the classes, and closer to zero when membership is
primarily in one class.
BoundarySeer uses entropy as a BLV. BoundarySeer calculates the entropy for each spatial
location, then it scales all entropy values for the entire data set to make BMVs. Entropy
values are scaled to between 0-1, with the lowest value set to 0 and the highest to 1.0.
Locations with high classification entropy are most transitional between classes and
therefore, most boundary-like.
(b)(i) Nearest Neighbor Classification Algorithm:
(ii) K-Nearest Neighbor Classification Algorithm:
It is a generalization of the NN rule. The k-nearest neighbors rule is as follows:
Given a set of training samples {x1, ..., xn} and a test point x, find the k training points closest to x, x1*, ..., xk*. Collect the associated labels θ1*, ..., θk* and classify x to the class which has the greatest number of representatives in θ1*, ..., θk*.
In other words, the classification is performed by taking the majority vote among the k nearest neighbors of x.
The computational load of the NN classification of a single test point is O(nd), where d is the dimension of the feature vectors and n is the number of training samples. This is a lot of computation, particularly if n is large. The difference from most other classification techniques is that with KNN the training points are needed during classification, whereas usually they are needed only during training. Since training is performed only once and classification many times, we are more concerned about the time consumption of classification.
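A minimal sketch of the k-nearest-neighbor rule described above (brute-force O(nd) distance computation per query, then a majority vote); the training data are made up:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    dists = np.linalg.norm(X_train - x, axis=1)      # O(nd) per query
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative training data: two 2-D classes
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [2.0, 2.0], [2.2, 1.9], [1.8, 2.1]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_classify(np.array([0.3, 0.2]), X_train, y_train, k=3))  # -> 0
```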
(iii) Modified K- Nearest Neighbour Classification Algorithm:
It is similar to the KNN algorithm. The only difference is that the k nearest neighbors are weighted according to their distance from the test point. Each neighbor is associated with a weight wj, defined as follows:

wj = (dk − dj) / (dk − d1)   if dk ≠ d1
wj = 1                       if dk = d1

where j = 1, ..., k, d1 and dk are the distances to the nearest and the farthest of the k neighbors, and wj ∈ [0, 1]. The weight is
• 0 for the neighbor having the maximum distance out of the k,
• 1 for the neighbor having the minimum distance out of the k,
• in between 0 and 1 for the remaining neighbors.
(c) The Parzen-window approach to estimating densities assumes that the region Rn is a d-dimensional hypercube:

Vn = hn^d   (hn: length of an edge of Rn).

Let φ(u) be the following window function:

φ(u) = 1 if |uj| ≤ 1/2 for j = 1, ..., d
φ(u) = 0 otherwise.

– φ(u) defines a unit hypercube centered at the origin.
– φ((x − xi)/hn) is equal to unity if xi falls within the hypercube of volume Vn centered at x, and equal to zero otherwise.

The number of samples in this hypercube is:

kn = Σ_{i=1}^{n} φ((x − xi)/hn).

By substituting kn, we obtain the following estimate:

pn(x) = (1/n) Σ_{i=1}^{n} (1/Vn) φ((x − xi)/hn).

pn(x) estimates p(x) as an average of functions of x and the samples xi (i = 1, ..., n). These window functions φ can be general!
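A small sketch of this hypercube Parzen-window estimate in one dimension (so Vn = hn), evaluated on synthetic samples; the window width is chosen arbitrarily:

```python
import numpy as np

def parzen_estimate(x, samples, h):
    """Hypercube Parzen-window density estimate p_n(x) in 1-D (V_n = h)."""
    u = (x - samples) / h
    phi = (np.abs(u) <= 0.5).astype(float)     # unit hypercube window
    return phi.sum() / (len(samples) * h)

rng = np.random.default_rng(0)
samples = rng.normal(0.0, 1.0, 500)            # synthetic data from N(0, 1)

for x in (-1.0, 0.0, 1.0):
    print(x, parzen_estimate(x, samples, h=0.5))
```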
• The estimate pn(x) should be a legitimate density function, that is, it should be nonnegative and integrate to one. This can be assured by requiring the window function itself to be a density function.
• To be more precise, if we require that
φ(u) ≥ 0  and  ∫ φ(u) du = 1,
and if we maintain the relation Vn = hn^d, then it follows at once that pn(x) also satisfies these conditions.
• In discussing convergence, we must recognize that we are talking about the convergence of a sequence of random variables, because for any fixed x the value of pn(x) depends on the random samples x1, . . . , xn.
• Thus, pn(x) has some mean p̄n(x) and variance σn²(x). We shall say that the estimate pn(x) converges to p(x) if
lim_{n→∞} p̄n(x) = p(x)  and  lim_{n→∞} σn²(x) = 0.
• To prove convergence we must place conditions on the unknown density p(x), on the window function φ(u), and on the window width hn. The following additional conditions assure convergence:
sup_u φ(u) < ∞,
lim_{‖u‖→∞} φ(u) Π_{i=1}^{d} ui = 0,
lim_{n→∞} Vn = 0,
lim_{n→∞} n Vn = ∞.
The first two conditions keep φ(·) well-behaved, and they are satisfied by most density functions. The last two state that the volume Vn must approach zero, but at a rate slower than 1/n.
Convergence of the mean:
p̄n(x) = E[pn(x)] = ∫ (1/Vn) φ((x − v)/hn) p(v) dv,   (1)
a convolution (blurring) of the true density with the window function; as Vn → 0 the window approaches a delta function and p̄n(x) approaches p(x).
Convergence of the variance:
By dropping the second term, bounding φ(·), and using Eq. 1, we obtain
σn²(x) ≤ sup(φ(·)) p̄n(x) / (n Vn),
which goes to zero provided n Vn → ∞.
Q5(a). Cluster: a collection of data objects that are
– similar to one another within the same cluster, and
– dissimilar to the objects in other clusters.
• Cluster analysis: grouping a set of data objects into clusters.
• Clustering is unsupervised classification: there are no predefined classes.
• Note: the classes identified are abstract, in the sense that we obtain 'cluster 0' ... 'cluster n' as our classes.
Steps for Clustering:
• A good clustering method will produce high-quality clusters with
– high intra-class similarity
– low inter-class similarity
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
K-MEANS CLUSTERING:
• The k-means algorithm is an algorithm to cluster n objects, based on attributes, into k partitions, where k < n.
• It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that they both attempt to find the centers of natural clusters in the data.
• It assumes that the object attributes form a vector space.
It is an algorithm for partitioning (or clustering) N data points into K disjoint subsets Sj so as to minimize the sum-of-squares criterion

J = Σ_{j=1}^{K} Σ_{n ∈ Sj} ‖xn − µj‖²,

where xn is a vector representing the nth data point and µj is the geometric centroid of the data points in Sj.
Step 1: Begin with a decision on the value of k = number of clusters .
Step 2: Put any initial partition that classifies the data into k clusters. You may
assign the training samples randomly, or systematically as the following:
1.Take the first k training sample as single-element clusters
2. Assign each of the remaining (N-k) training sample to the cluster with the nearest
centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and compute its distance from the centroid of
each of the clusters. If a sample is not currently in the cluster with the closest
centroid, switch this sample to that cluster and update the centroid of the cluster
gaining the new sample and the cluster losing the sample.
• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
A Simple example showing the implementation of k-means algorithm (using
K=2)
Step 1:
Initialization: We randomly choose the following two centroids (k = 2) for two clusters. In this case the 2 centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).
Step 2:
• Thus, we obtain two clusters containing:
{1,2,3} and {4,5,6,7}.
•
Their new centroids are:
Step 3:
• Now using these centroids we compute the Euclidean distance of each object, as
shown in table.
Therefore, the new clusters are:
{1,2} and {3,4,5,6,7}
• Next centroids are: m1=(1.25,1.5) and m2 = (3.9,5.1)
Step 4:
•
The clusters obtained are:
{1,2} and {3,4,5,6,7}
• Therefore, there is no change in the cluster.
• Thus, the algorithm comes to a halt here and final result consist of 2 clusters {1,2}
and {3,4,5,6,7}.
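Since the data table for this worked example did not survive in this copy, here is a generic sketch of the same Steps 1-4 on illustrative 2-D points (the initial centroids are chosen at random rather than fixed to m1 and m2):

```python
import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    """Basic k-means: pick initial centroids, then iterate until no centroid moves."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # initial centroids
    for _ in range(max_iter):
        # Assign each point to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster becomes empty
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # convergence: no change
            break
        centroids = new_centroids
    return centroids, labels

# Illustrative 2-D points loosely forming two clusters
X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```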
b(i) Cluster Validity:
• How do we evaluate the “goodness” of the resulting clusters?
Different Aspects of Cluster Validation:
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether
non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to
externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference to
external information.
4. Comparing the results of two different sets of cluster analyses to determine which is
better.
5. Determining the ‘correct’ number of clusters.
Measures of Cluster Validity:
• Numerical measures that are applied to judge various aspects of cluster validity, are
classified into the following three types.
– External Index: Used to measure the extent to which cluster labels match
externally supplied class labels.
• E.g., entropy, precision, recall
– Internal Index: Used to measure the goodness of a clustering structure
without reference to external information.
• E.g., Sum of Squared Error (SSE)
– Relative Index: Used to compare two different clusterings or clusters.
• Often an external or internal index is used for this function, e.g., SSE
or entropy
• Sometimes these are referred to as criteria instead of indices
– However, sometimes criterion is the general strategy and index is the
numerical measure that implements the criterion.
Measuring Cluster Validity Via Correlation:
• Two matrices
– Similarity or Distance Matrix
• One row and one column for each data point
• An entry is the similarity or distance of the associated pair of points
– “Incidence” Matrix
• One row and one column for each data point
• An entry is 1 if the associated pair of points belong to the same cluster
• An entry is 0 if the associated pair of points belongs to different
clusters
• Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
• High correlation (positive for similarity, negative for distance) indicates that points
that belong to the same cluster are close to each other.
• Not a good measure for some density or contiguity based clusters.
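A sketch of this correlation-based check: build the distance matrix and the cluster incidence matrix, then correlate their n(n−1)/2 upper-triangular entries; the points and labels below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)          # hypothetical clustering result

# Distance matrix and incidence matrix (1 if same cluster, else 0)
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
I = (labels[:, None] == labels[None, :]).astype(float)

# Correlate only the n(n-1)/2 upper-triangular entries (both matrices are symmetric)
iu = np.triu_indices(len(X), k=1)
corr = np.corrcoef(D[iu], I[iu])[0, 1]
print(corr)     # strongly negative for a good clustering of well-separated data
```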
b(ii) Hierarchical Clustering:
• Many times, clusters are not disjoint, but a cluster may have subclusters, in turn
having sub-subclusters, etc.
• Consider a sequence of partitions of the n samples into c clusters
• The first is a partition into n cluster, each one containing exactly one sample
• The second is a partition into n-1 clusters, the third into n-2, and so on, until
the n-th in which there is only one cluster containing all of the samples
• At the level k in the sequence, c = n-k+1.
• Given any two samples x and x′, they will be grouped together at some level, and if they are grouped at level k, they remain grouped for all higher levels.
• Hierarchical clustering leads to a tree representation called a dendrogram.
• Hierarchical clustering can be divided into agglomerative and divisive approaches.
• Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters.
• Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters.
Many variants exist for defining the closest pair of clusters:
1. Single-link: similarity of the closest (most cosine-similar) pair of points.
2. Complete-link: similarity of the "furthest" points, i.e., the least cosine-similar pair.
3. Centroid: clusters whose centroids (centers of gravity) are the most cosine-similar.
4. Average-link: average cosine similarity between pairs of elements.
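A short sketch of agglomerative clustering with SciPy showing several of these linkage variants; Euclidean distance is used here for simplicity in place of the cosine similarity mentioned above, and the points are synthetic:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])

# Agglomerative (bottom-up) clustering: each linkage rule defines the "closest pair"
for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                     # merge history (dendrogram data)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
    print(method, labels)
```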
(c) The sum-of-squared-error criterion
– Let ni be the number of samples in Di, and mi the mean of those samples:

mi = (1/ni) Σ_{x ∈ Di} x.

Then, the sum of squared error is defined as

Je = Σ_{i=1}^{c} Σ_{x ∈ Di} ‖x − mi‖².

• This criterion defines clusters by their mean vectors mi: it minimizes the sum of the squared lengths of the errors x − mi within each Di.
• Clustering of this type is called the minimum-variance partition, i.e., the partition that minimizes Je.
• Results:
– Good when clusters form well-separated compact clouds.
– Bad with large differences in the number of samples in different clusters.
The k-means algorithm described in Q5(a) above is the standard iterative procedure for (approximately) minimizing this sum-of-squared-error criterion.