BAYES DECISION THEORY
Chapter 2 DHS
Introduction
• Bayes decision theory is a fundamental statistical approach to pattern classification.
• The approach quantifies the tradeoffs between classification decisions using probability and the costs that accompany those decisions.
• Assumption: the decision problem is posed in probabilistic terms and all relevant probability values are known.
Bayes Terms
• Suppose an observer watching fish arrive along the conveyor belt finds it hard to predict what type will emerge next, and the sequence of types of fish appears to be random.
• State of nature: $\omega = \omega_1$ if the fish is a sea bass, $\omega = \omega_2$ if it is a salmon. Because the state of nature is unpredictable, we need to describe $\omega$ probabilistically.
• A priori probability (or simply prior): $P(\omega_1)$ and $P(\omega_2)$, with $P(\omega_1) + P(\omega_2) = 1$ [we assume that no other types of fish are relevant here].
• These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or a salmon before the fish actually appears. They might, for instance, depend on the time of year or the choice of fishing area.
Decision Rule
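• Decision rule: suppose for a moment that we were forced to make a decision about the type of fish that will appear next without being allowed to see it.
• The only information we are allowed to use is the value of the prior probabilities.
• Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$.
• How well does it work? The rule always makes the same decision, so its probability of error is the smaller of the two priors, $\min[P(\omega_1), P(\omega_2)]$.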
Class Conditional Density for Decision Rule
• In most circumstances we are not asked to make decisions with so little information.
• In our example, we might for instance use a lightness measurement $x$ to improve our classifier.
• Class-conditional probability density function: we consider $x$ to be a continuous random variable whose distribution depends on the state of nature and is expressed as $p(x \mid \omega_1)$.
• This is the probability density function for $x$ given that the state of nature is $\omega_1$. (It is also sometimes called the state-conditional probability density.)
• The difference between $p(x \mid \omega_1)$ and $p(x \mid \omega_2)$ describes the difference in lightness between populations of sea bass and salmon.
Bayes Formula
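• Posterior probability (a function of the likelihood and the prior): if the observed lightness value is $x$, how does it influence our belief about the true state of nature?
$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$
• In this case (two categories), the evidence is
$$p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j)$$
• Bayes formula, expressed in English:
posterior = (likelihood × prior) / evidence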
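As a minimal sketch of Bayes formula for the fish example, the snippet below computes both posteriors from Gaussian class-conditional densities. The means, standard deviations and priors are made-up illustration values, not taken from the slides, and the helper names `gaussian_pdf` and `posteriors` are hypothetical.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def posteriors(x, priors, mus, sigmas):
    """Return P(w_j | x) for every class j via Bayes formula."""
    likelihoods = np.array([gaussian_pdf(x, m, s) for m, s in zip(mus, sigmas)])
    joint = likelihoods * np.asarray(priors)   # likelihood x prior
    evidence = joint.sum()                     # p(x), the scale factor
    return joint / evidence

# Assumed (illustrative) parameters: class 0 = sea bass, class 1 = salmon
priors = [2 / 3, 1 / 3]
mus, sigmas = [4.0, 7.0], [1.0, 1.5]

post = posteriors(5.0, priors, mus, sigmas)
decision = int(np.argmax(post))  # pick the class with the larger posterior
```

Dividing by the evidence only rescales the two joint terms, so the decision would be the same if the normalization were skipped.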
Bayes Decision Rule
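• We always choose the state of nature with the larger posterior value. However, we cannot always be right; the probability of error given $x$ is
$$P(\text{error} \mid x) = \begin{cases} P(\omega_1 \mid x) & \text{if we decide } \omega_2 \\ P(\omega_2 \mid x) & \text{if we decide } \omega_1 \end{cases}$$
• Average error (or unconditional error):
$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\, dx = \int_{-\infty}^{\infty} P(\text{error} \mid x)\, p(x)\, dx$$
• Bayes decision rule: decide $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise decide $\omega_2$.
• The resulting probability of error is:
$$P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)]$$
• For each observation $x$, the Bayes decision rule minimizes the probability of error, and therefore it also minimizes the average error.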
Role of Evidence
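• The evidence is $p(x)$.
• Recall: $P(\omega_j \mid x) = p(x \mid \omega_j)\, P(\omega_j) / p(x)$, where the denominator $p(x)$ is just a scale factor.
• The evidence $p(x)$ can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1.
• Eliminating the evidence, we have an equivalent decision rule: decide $\omega_1$ if $p(x \mid \omega_1)\, P(\omega_1) > p(x \mid \omega_2)\, P(\omega_2)$; otherwise decide $\omega_2$.
• In the special case $p(x \mid \omega_1) = p(x \mid \omega_2)$, the observation gives no information and the decision depends on the priors only.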
Bayesian Decision Theory –Continuous Features
• We shall now formalize the ideas just considered, and generalize them in four ways:
  • by allowing the use of more than one feature
  • by allowing more than two states of nature
  • by allowing actions other than merely deciding the state of nature
  • by introducing a "loss function", which is more general than the "probability of error"
Bayesian Decision Theory –Continuous Features
• Allowing the use of more than one feature merely requires replacing the scalar $x$ by the feature vector $\mathbf{x}$, where $\mathbf{x}$ is in a $d$-dimensional Euclidean space $\mathbb{R}^d$, called the feature space.
• Allowing more than two states of nature provides us with a useful generalization for a small notational expense.
• Allowing actions other than classification primarily allows the possibility of "rejection".
• Rejection: the input pattern is rejected when it is difficult to decide between two classes or the pattern is too noisy.
• The loss function specifies the cost of each action, and is used to convert a probability determination into a decision.
Bayesian Decision Theory –Continuous Features
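• The finite set of $c$ states of nature (or categories): $\omega_1, \ldots, \omega_c$
• The finite set of $a$ possible actions: $\alpha_1, \ldots, \alpha_a$
• The loss incurred for taking action $\alpha_i$ when the state of nature is $\omega_j$: $\lambda(\alpha_i \mid \omega_j)$. Remember: not all actions are equally costly!
• The probability density function for $\mathbf{x}$ conditioned on $\omega_j$ being the true state of nature: $p(\mathbf{x} \mid \omega_j)$
• Bayes formula:
$$P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}$$
• The evidence is now:
$$p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$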
Conditional Risk
• Suppose that we observe a particular $\mathbf{x}$ and that we contemplate taking action $\alpha_i$.
• If the true state of nature is $\omega_j$, by definition we will incur the loss $\lambda(\alpha_i \mid \omega_j)$.
• Since $P(\omega_j \mid \mathbf{x})$ is the probability that the true state of nature is $\omega_j$, the expected loss (or conditional risk) associated with taking action $\alpha_i$ is merely
$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$
• Whenever we encounter a particular observation $\mathbf{x}$, we can minimize our expected loss by selecting the action that minimizes the conditional risk.
• This Bayes decision procedure provides the optimal performance on the overall risk.
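The conditional risk is a matrix-vector product between the loss matrix and the posterior vector, which the following sketch illustrates. The loss values, the posterior, and the third "reject" action are assumptions chosen for illustration; `conditional_risks` is a hypothetical helper name.

```python
import numpy as np

def conditional_risks(loss, posterior):
    """R(a_i | x) = sum_j loss[i, j] * P(w_j | x), for every action i."""
    return np.asarray(loss, dtype=float) @ np.asarray(posterior, dtype=float)

# Assumed loss matrix: rows are actions, columns are true states of nature.
# Row 2 is a hypothetical "reject" action with a constant cost of 0.3.
loss = [[0.0, 1.0],
        [1.0, 0.0],
        [0.3, 0.3]]

posterior = [0.6, 0.4]                 # assumed P(w_1 | x), P(w_2 | x)
risks = conditional_risks(loss, posterior)
best_action = int(np.argmin(risks))    # action with minimum conditional risk
```

With these numbers the posteriors are too close to justify either classification, so the reject action (index 2) has the lowest conditional risk.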
Overall Risk
• A general decision rule is a function $\alpha(\mathbf{x})$ that tells us which action to take for every possible observation.
• In other words, for every $\mathbf{x}$ the decision function $\alpha(\mathbf{x})$ assumes one of the $a$ values $\alpha_1, \ldots, \alpha_a$.
• The overall risk $R$ is
$$R = \int R(\alpha(\mathbf{x}) \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}$$
• But remember: we need to choose this decision rule with the objective of minimizing the risk of misclassification.
• So the goal is to choose $\alpha(\mathbf{x})$ so that $R(\alpha(\mathbf{x}) \mid \mathbf{x})$ is as small as possible for every $\mathbf{x}$; the overall risk will then be minimized.
Bayes Risk
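• To minimize the overall risk, compute the conditional risk
$$R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x})$$
for $i = 1, \ldots, a$ and select the action $\alpha_i$ for which $R(\alpha_i \mid \mathbf{x})$ is minimum.
• The resulting minimum overall risk is called the Bayes risk, denoted $R^*$, and is the best performance that can be achieved.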
Two Category Problem
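• Here action $\alpha_1$ corresponds to deciding that the true state of nature is $\omega_1$, and action $\alpha_2$ corresponds to deciding that it is $\omega_2$.
• Let $\lambda_{ij} = \lambda(\alpha_i \mid \omega_j)$ be the loss incurred for deciding $\omega_i$ when the true state of nature is $\omega_j$.
• The conditional risk can be written as:
$$R(\alpha_1 \mid \mathbf{x}) = \lambda_{11} P(\omega_1 \mid \mathbf{x}) + \lambda_{12} P(\omega_2 \mid \mathbf{x})$$
$$R(\alpha_2 \mid \mathbf{x}) = \lambda_{21} P(\omega_1 \mid \mathbf{x}) + \lambda_{22} P(\omega_2 \mid \mathbf{x})$$
• The fundamental rule is to decide $\omega_1$ if $R(\alpha_1 \mid \mathbf{x}) < R(\alpha_2 \mid \mathbf{x})$. In terms of the posterior probabilities, we decide $\omega_1$ if
$$(\lambda_{21} - \lambda_{11})\, P(\omega_1 \mid \mathbf{x}) > (\lambda_{12} - \lambda_{22})\, P(\omega_2 \mid \mathbf{x})$$
• Or equivalently, using Bayes formula, decide $\omega_1$ if
$$(\lambda_{21} - \lambda_{11})\, p(\mathbf{x} \mid \omega_1)\, P(\omega_1) > (\lambda_{12} - \lambda_{22})\, p(\mathbf{x} \mid \omega_2)\, P(\omega_2)$$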
Likelihood Ratio
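• We can rewrite the above rule as: decide $\omega_1$ if
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$
(assuming $\lambda_{21} > \lambda_{11}$, i.e., an error costs more than a correct decision).
• We can consider $p(\mathbf{x} \mid \omega_j)$ as a function of $\omega_j$ (and treat it as a likelihood function).
• Thus the Bayes decision rule can be interpreted as calling for deciding $\omega_1$ if the likelihood ratio exceeds a threshold value ($\theta_\lambda$) that is independent of the observation $\mathbf{x}$.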
Minimum Error Rate Classification
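• Zero-one loss:
$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \qquad i, j = 1, \ldots, c$$
This loss function assigns no loss to a correct decision and assigns a unit loss to any error; thus, all errors are equally costly.
• The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is
$$R(\alpha_i \mid \mathbf{x}) = \sum_{j \neq i} P(\omega_j \mid \mathbf{x}) = 1 - P(\omega_i \mid \mathbf{x})$$
where $P(\omega_i \mid \mathbf{x})$ is the conditional probability that action $\alpha_i$ is correct.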
Minimum Error-rate Solution
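• The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk.
• With the zero-one loss function we obtain a decision rule that minimizes the probability of error: decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \neq i$.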
Minimum Error-rate Solution
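• Recall: we choose class 1 if the following holds:
$$\frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} = \theta_\lambda \text{ (say)}$$
• Now for the 0-1 loss function: with $\lambda_{12} = \lambda_{21} = 1$, $\lambda_{11} = \lambda_{22} = 0$, the threshold involves only the priors:
$$\theta_\lambda = \frac{P(\omega_2)}{P(\omega_1)} = \theta_a$$
• If $\lambda_{12} = 2$, $\lambda_{21} = 1$, $\lambda_{11} = \lambda_{22} = 0$:
$$\theta_\lambda = \frac{2 P(\omega_2)}{P(\omega_1)} = \theta_b$$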
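The effect of the loss values on the likelihood-ratio threshold can be checked numerically. This is a sketch under assumed values: equal priors, unit-variance Gaussian class-conditional densities with means 0 and 2, and hypothetical helper names `normal_pdf`, `threshold` and `decide`.

```python
import math

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def threshold(l11, l12, l21, l22, p1, p2):
    """theta = (l12 - l22) / (l21 - l11) * P(w2) / P(w1)."""
    return (l12 - l22) / (l21 - l11) * p2 / p1

def decide(x, theta, mu1=0.0, mu2=2.0, sigma=1.0):
    """Return 1 if the likelihood ratio p(x|w1) / p(x|w2) exceeds theta, else 2."""
    ratio = normal_pdf(x, mu1, sigma) / normal_pdf(x, mu2, sigma)
    return 1 if ratio > theta else 2

p1, p2 = 0.5, 0.5                          # assumed equal priors
theta_a = threshold(0, 1, 1, 0, p1, p2)    # zero-one loss
theta_b = threshold(0, 2, 1, 0, p1, p2)    # deciding w1 when the truth is w2 costs double

choice_a = decide(0.8, theta_a)
choice_b = decide(0.8, theta_b)
```

At x = 0.8 the likelihood ratio is about 1.49, so the zero-one loss picks class 1, while the costlier error raises the threshold to 2 and flips the decision to class 2.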
Minimum Error rate Solution
Classifiers, Discriminant Functions and Decision Surfaces
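• There are many different ways to represent classifiers or decision rules.
• One of the most useful is in terms of "discriminant functions". The multi-category case: given discriminant functions $g_i(\mathbf{x})$, $i = 1, \ldots, c$, the classifier assigns $\mathbf{x}$ to class $\omega_i$ if $g_i(\mathbf{x}) > g_j(\mathbf{x})$ for all $j \neq i$.
• A Bayes classifier can be represented in this way, but the choice of discriminant function is not unique. For the general case with risks:
$$g_i(\mathbf{x}) = -R(\alpha_i \mid \mathbf{x})$$
• For the minimum error rate, we take $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$, or equivalently (dropping terms common to all classes and taking logarithms)
$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$$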
Classifiers, Discriminant Functions and Decision Surfaces
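• A decision rule partitions the feature space into $c$ decision regions $\{\mathcal{R}_1, \ldots, \mathcal{R}_c\}$. A sample $\mathbf{x}$ is in $\mathcal{R}_i$ if
$$g_i(\mathbf{x}) > g_j(\mathbf{x}) \quad \forall j \neq i$$
• Two-category case:
  • Here a classifier is a "dichotomizer" that has two discriminant functions $g_1$ and $g_2$.
  • Combine them together: $g(\mathbf{x}) = g_1(\mathbf{x}) - g_2(\mathbf{x})$
  • Decide $\omega_1$ if $g(\mathbf{x}) > 0$, else decide $\omega_2$.
  • Solve as:
$$g(\mathbf{x}) = P(\omega_1 \mid \mathbf{x}) - P(\omega_2 \mid \mathbf{x}) \quad \text{or} \quad g(\mathbf{x}) = \ln \frac{p(\mathbf{x} \mid \omega_1)}{p(\mathbf{x} \mid \omega_2)} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$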
The Normal Density
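• The normal density is analytically tractable.
• It is a continuous density with two parameters (mean, variance).
• A number of processes are asymptotically Gaussian (central limit theorem, CLT).
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern.
• The univariate normal density, with mean $\mu$ and variance $\sigma^2$, is
$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$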
The Normal Density, Implementation
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# define constants: distribution parameters and the spec limits
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100

# calculate the z-transform of the spec limits
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

x = np.arange(z1, z2, 0.001)       # range of z within spec
x_all = np.arange(-10, 10, 0.001)  # entire range of z, both in and out of spec

# mean = 0, stddev = 1, since the z-transform was calculated
y = norm.pdf(x, 0, 1)
y2 = norm.pdf(x_all, 0, 1)

# set the style before creating the figure so that it takes effect
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x_all, y2)
ax.fill_between(x, y, 0, alpha=0.3)  # shade the in-spec region (x and y were otherwise unused)
plt.show()
Multivariate Normal Density
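• Multivariate normal density in $d$ dimensions:
$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$
where $\mathbf{x} \in \mathbb{R}^d$, mean $\boldsymbol{\mu} \in \mathbb{R}^d$, covariance matrix $\Sigma \in \mathbb{R}^{d \times d}$.
• We have
$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} \quad \text{and} \quad \Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T]$$
• The covariance matrix is symmetric and positive semidefinite; we assume $\Sigma$ is positive definite so the determinant of $\Sigma$ is strictly positive.
• The multivariate normal density is completely specified by $[d + d(d+1)/2]$ parameters.
• For the $i$th component $x_i$ of $\mathbf{x}$:
$$\mu_i = E[x_i], \qquad \sigma_{ij} = E[(x_i - \mu_i)(x_j - \mu_j)]$$
• What happens when, for some $i$ and $j$, $x_i$ and $x_j$ are statistically independent? Then $\sigma_{ij} = 0$.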
Multivariate Normal Density
• Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed.
• If $A \in \mathbb{R}^{d \times k}$ and $\mathbf{y} = A^T \mathbf{x}$, where $\mathbf{y}$ is a $k$-component vector, then
$$p(\mathbf{y}) \sim N(A^T \boldsymbol{\mu}, A^T \Sigma A)$$
• If $k = 1$, $y = \mathbf{a}^T \mathbf{x}$ is a scalar, the projection of $\mathbf{x}$ onto a line in the direction of $\mathbf{a}$ (for unit-length $\mathbf{a}$).
• $\mathbf{a}^T \Sigma \mathbf{a}$ is the variance of the projection of $\mathbf{x}$ onto $\mathbf{a}$.
• In general then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.
• It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix $I$.
• Whitening transform: if $\Phi$ is the matrix whose columns are the orthonormal eigenvectors of $\Sigma$, and $\Lambda$ the diagonal matrix of the corresponding eigenvalues, then the transformation
$$A_w = \Phi \Lambda^{-1/2}$$
yields a transformed distribution with covariance matrix equal to the identity matrix.
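The whitening transform can be verified numerically: applying $A_w^T \Sigma A_w$ should give the identity. The covariance matrix below is an assumed example, and `whitening_matrix` is a hypothetical helper name.

```python
import numpy as np

def whitening_matrix(cov):
    """Return A_w = Phi @ Lambda^(-1/2) for a symmetric positive definite cov."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # columns of eigvecs are orthonormal
    return eigvecs @ np.diag(eigvals ** -0.5)

cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])                 # assumed example covariance
A_w = whitening_matrix(cov)
whitened_cov = A_w.T @ cov @ A_w             # covariance of y = A_w^T x
```

`np.linalg.eigh` is used here because it is intended for symmetric matrices and returns real eigenvalues with orthonormal eigenvectors.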
Multivariate Gaussian Density
• From the definition of the multivariate normal density function, the loci of points of constant density are hyperellipsoids for which the quadratic form $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is constant.
• The principal axes of these hyperellipsoids are given by the eigenvectors of $\Sigma$ (described by $\Phi$).
• The eigenvalues (described by $\Lambda$, a diagonal matrix whose entries are the eigenvalues of $\Sigma$) determine the lengths of these axes.
• Mahalanobis distance: the squared distance between $\mathbf{x}$ and $\boldsymbol{\mu}$ is
$$r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$$
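A short sketch of the squared Mahalanobis distance follows. The mean, covariance and test points are assumed values chosen so the result is easy to check by hand; `mahalanobis_sq` is a hypothetical helper name.

```python
import numpy as np

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T cov^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    # Solving the linear system avoids forming the explicit inverse
    return float(diff @ np.linalg.solve(cov, diff))

mu = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],
                [0.0, 1.0]])                   # assumed covariance
r2_a = mahalanobis_sq([2.0, 0.0], mu, cov)     # along the high-variance axis
r2_b = mahalanobis_sq([0.0, 2.0], mu, cov)     # along the low-variance axis
```

Both points are at Euclidean distance 2 from the mean, yet the point along the low-variance axis is four times farther in squared Mahalanobis distance.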
Reference for Implementation
• https://docs.scipy.org/doc//numpy-1.10.0/reference/generated/numpy.random.multivariate_normal.html
• Draws random samples from a multivariate normal distribution.
• The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or "center") and variance (standard deviation, or "width," squared) of the one-dimensional normal distribution.
• numpy.random.multivariate_normal(mean, cov[, size])
  • mean : 1-D array_like, of length N. Mean of the N-dimensional distribution.
  • cov : 2-D array_like, of shape (N, N). Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
  • out : ndarray. The drawn samples, of shape size, if that was provided. If not, the shape is (N,).
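A small usage sketch of `numpy.random.multivariate_normal` is given below. The mean vector, covariance matrix, sample count and seed are assumed values for illustration only.

```python
import numpy as np

np.random.seed(0)                     # fixed seed, only for reproducibility
mean = [1.0, -2.0]                    # assumed mean vector
cov = [[2.0, 0.6],
       [0.6, 1.0]]                    # assumed symmetric positive definite covariance
samples = np.random.multivariate_normal(mean, cov, size=5000)

# Sample estimates should be close to the parameters used for drawing
sample_mean = samples.mean(axis=0)
sample_cov = np.cov(samples, rowvar=False)
```

With `size=5000` the result has shape (5000, 2), one row per drawn sample.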
Multivariate Normal Density, Implementation
import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Parameters to set
mu_x = 0
variance_x = 2
mu_y = 0
variance_y = 10

# Create grid and multivariate normal
x = np.linspace(-10, 10, 200)
y = np.linspace(-10, 10, 200)
X, Y = np.meshgrid(x, y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
rv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])

# Make a 3D plot (fig.gca(projection='3d') no longer works in recent Matplotlib)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.plot_surface(X, Y, rv.pdf(pos), cmap=cm.coolwarm, linewidth=0)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show()
Discriminant Functions for the Normal Density
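• Minimum error-rate classification can be achieved by the discriminant functions
$$g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$$
• In the case of multivariate normal densities, we assume $p(\mathbf{x} \mid \omega_i) \sim N(\boldsymbol{\mu}_i, \Sigma_i)$.
• Thus, we have
$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$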
Discriminant Functions for the Normal Density
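• Case 1: $\Sigma_i = \sigma^2 I$. The features are statistically independent, and each feature has the same variance, $\sigma^2$. In this case the covariance matrix is diagonal (i.e., samples fall within equal-size hyperspherical clusters).
• We have $|\Sigma_i| = \sigma^{2d}$ and $\Sigma_i^{-1} = (1/\sigma^2) I$. The resulting discriminant function is:
$$g_i(\mathbf{x}) = -\frac{\|\mathbf{x} - \boldsymbol{\mu}_i\|^2}{2\sigma^2} + \ln P(\omega_i)$$
• Expanded form:
$$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2} \left[\mathbf{x}^T \mathbf{x} - 2 \boldsymbol{\mu}_i^T \mathbf{x} + \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i\right] + \ln P(\omega_i)$$
The quadratic term $\mathbf{x}^T \mathbf{x}$ is the only term that prevents $g_i(\cdot)$ from being linear. However, it is the same for all $i$, so for the purpose of deciding on $i$ it acts like an additive constant.
• Therefore, an equivalent form is:
$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
where
$$\mathbf{w}_i = \frac{1}{\sigma^2} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2\sigma^2} \boldsymbol{\mu}_i^T \boldsymbol{\mu}_i + \ln P(\omega_i)$$
• Linear machine: the decision surfaces are pieces of hyperplanes defined by the linear equations $g_i(\mathbf{x}) = g_j(\mathbf{x})$ for the two categories with the highest posterior probabilities.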
Discriminant Functions for the Normal Density
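The decision surfaces $g_i(\mathbf{x}) = g_j(\mathbf{x})$ can also be rewritten as
$$\mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0$$
where
$$\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j, \qquad \mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$
• This hyperplane passes through $\mathbf{x}_0$ and is orthogonal to $\mathbf{w}$.
• By the structure of $\mathbf{w}$, the hyperplane dividing $\mathcal{R}_i$ and $\mathcal{R}_j$ is orthogonal to the line linking the means.
• If $P(\omega_i) = P(\omega_j)$, then $\mathbf{x}_0 = \frac{1}{2} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j)$. This means that $\mathbf{x}_0$ is halfway between the means. Otherwise, $\mathbf{x}_0$ shifts away from the more likely mean.
• However, if $\sigma^2$ is small relative to the squared distance $\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2$, then the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.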
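The effect of the priors on the Case 1 boundary can be sketched as follows, assuming the standard expressions $\mathbf{w} = \boldsymbol{\mu}_i - \boldsymbol{\mu}_j$ and $\mathbf{x}_0 = \frac{1}{2}(\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) - \frac{\sigma^2}{\|\boldsymbol{\mu}_i - \boldsymbol{\mu}_j\|^2} \ln \frac{P(\omega_i)}{P(\omega_j)} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$. The means, variance and priors are assumed values, and the function names are hypothetical.

```python
import numpy as np

def case1_boundary(mu_i, mu_j, sigma2, p_i, p_j):
    """Return (w, x0) of the hyperplane w^T (x - x0) = 0 for Sigma_i = sigma^2 I."""
    mu_i, mu_j = np.asarray(mu_i, dtype=float), np.asarray(mu_j, dtype=float)
    w = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 / (w @ w) * np.log(p_i / p_j) * w
    return w, x0

def g(x, mu, sigma2, prior):
    """Case 1 discriminant: -||x - mu||^2 / (2 sigma^2) + ln P(w)."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return -(d @ d) / (2 * sigma2) + np.log(prior)

mu_1, mu_2, sigma2 = [0.0, 0.0], [4.0, 0.0], 1.0   # assumed values
w_eq, x0_eq = case1_boundary(mu_1, mu_2, sigma2, 0.5, 0.5)
w_uneq, x0_uneq = case1_boundary(mu_1, mu_2, sigma2, 0.9, 0.1)

# On the boundary both discriminants must agree
gap = g(x0_uneq, mu_1, sigma2, 0.9) - g(x0_uneq, mu_2, sigma2, 0.1)
```

With equal priors `x0_eq` is the midpoint of the means; with priors 0.9 and 0.1 the boundary point moves toward the second mean, i.e., away from the more likely class.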
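Discriminant Function with the Normal Density
• Recall: $g_i(\mathbf{x}) = \ln p(\mathbf{x} \mid \omega_i) + \ln P(\omega_i)$
• Case 2: $\Sigma_i = \Sigma$. A simple case arises when the covariance matrices for all of the classes are identical but otherwise arbitrary.
• Recall:
$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$
The terms $\frac{d}{2} \ln 2\pi$ and $\frac{1}{2} \ln |\Sigma_i|$ are superfluous for this case, since they are the same for all classes.
• With this case's condition we have
$$g_i(\mathbf{x}) = -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_i)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}_i) + \ln P(\omega_i)$$
If the priors are the same for all the classes, the Mahalanobis distance (the first term) alone is enough for classification.
• Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster for the $i$th class being centered about the mean vector $\boldsymbol{\mu}_i$.
• Rewriting it, we have
$$g_i(\mathbf{x}) = \mathbf{w}_i^T \mathbf{x} + w_{i0}, \qquad \mathbf{w}_i = \Sigma^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i)$$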
Discriminant Function with the Normal Density: Case 2 Derivation
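• For the decision boundary between $\mathcal{R}_i$ and $\mathcal{R}_j$ we have:
$$\mathbf{w}_i^T \mathbf{x} + w_{i0} = \mathbf{w}_j^T \mathbf{x} + w_{j0}$$
$$(\mathbf{w}_i^T - \mathbf{w}_j^T)\, \mathbf{x} + w_{i0} - w_{j0} = 0$$
$$(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T (\Sigma^{-1})^T \mathbf{x} + w_{i0} - w_{j0} = 0$$
$$(\Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j))^T \mathbf{x} + w_{i0} - w_{j0} = 0$$
• Now consider:
$$w_{i0} - w_{j0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \Sigma^{-1} \boldsymbol{\mu}_i + \ln P(\omega_i) + \frac{1}{2} \boldsymbol{\mu}_j^T \Sigma^{-1} \boldsymbol{\mu}_j - \ln P(\omega_j)$$
$$= -\frac{1}{2} (\boldsymbol{\mu}_i^T - \boldsymbol{\mu}_j^T)\, \Sigma^{-1} (\boldsymbol{\mu}_i + \boldsymbol{\mu}_j) + \ln \frac{P(\omega_i)}{P(\omega_j)}$$
$$= -(\boldsymbol{\mu}_i^T - \boldsymbol{\mu}_j^T)\, (\Sigma^{-1})^T \left[\frac{\boldsymbol{\mu}_i + \boldsymbol{\mu}_j}{2} - \frac{\ln [P(\omega_i) / P(\omega_j)]}{(\boldsymbol{\mu}_i^T - \boldsymbol{\mu}_j^T)\, \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)\right] = -\mathbf{w}^T \mathbf{x}_0$$
• Finally, recall that the boundary is $(\Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j))^T \mathbf{x} + w_{i0} - w_{j0} = 0$, which therefore becomes
$$\mathbf{w}^T (\mathbf{x} - \mathbf{x}_0) = 0$$
with
$$\mathbf{w} = \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j), \qquad \mathbf{x}_0 = \frac{\boldsymbol{\mu}_i + \boldsymbol{\mu}_j}{2} - \frac{\ln [P(\omega_i) / P(\omega_j)]}{(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)^T \Sigma^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)}\, (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$$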
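The Case 2 result can be checked numerically: at $\mathbf{x}_0$, and at any point displaced from it orthogonally to $\mathbf{w} = \Sigma^{-1}(\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)$, the two discriminants should coincide. The covariance, means and priors below are assumed values, and the function names are hypothetical.

```python
import numpy as np

def case2_boundary(mu_i, mu_j, cov, p_i, p_j):
    """Return (w, x0) of the hyperplane w^T (x - x0) = 0 for a shared covariance."""
    mu_i, mu_j = np.asarray(mu_i, dtype=float), np.asarray(mu_j, dtype=float)
    diff = mu_i - mu_j
    w = np.linalg.solve(cov, diff)                 # Sigma^{-1} (mu_i - mu_j)
    x0 = 0.5 * (mu_i + mu_j) - np.log(p_i / p_j) / (diff @ w) * diff
    return w, x0

def g(x, mu, cov, prior):
    """Case 2 discriminant: -0.5 * squared Mahalanobis distance + ln prior."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return -0.5 * d @ np.linalg.solve(cov, d) + np.log(prior)

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])                       # assumed shared covariance
mu_1, mu_2 = [0.0, 0.0], [3.0, 1.0]                # assumed means
w, x0 = case2_boundary(mu_1, mu_2, cov, 0.7, 0.3)

# A second boundary point: move from x0 in a direction orthogonal to w
x_other = x0 + 2.5 * np.array([-w[1], w[0]])
gap_x0 = g(x0, mu_1, cov, 0.7) - g(x0, mu_2, cov, 0.3)
gap_other = g(x_other, mu_1, cov, 0.7) - g(x_other, mu_2, cov, 0.3)
```

Because `w` is generally not parallel to the difference of the means, the hyperplane is in general not orthogonal to the line joining them.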
Discriminant Function with the Normal Density: Case 2 Visualization
Discriminant Function with the Normal Density
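• Case 3: $\Sigma_i$ is arbitrary. The covariance matrices are different for each category, and the resulting discriminant functions are inherently quadratic:
$$g_i(\mathbf{x}) = \mathbf{x}^T W_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$
where
$$W_i = -\frac{1}{2} \Sigma_i^{-1}, \qquad \mathbf{w}_i = \Sigma_i^{-1} \boldsymbol{\mu}_i, \qquad w_{i0} = -\frac{1}{2} \boldsymbol{\mu}_i^T \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)$$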
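For arbitrary covariances, a sketch of the discriminant can be written assuming the standard quadratic form $g_i(\mathbf{x}) = \mathbf{x}^T W_i \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$ with $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $\mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i$ and $w_{i0} = -\frac{1}{2}\boldsymbol{\mu}_i^T \Sigma_i^{-1} \boldsymbol{\mu}_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$. The one-dimensional example (two classes with the same mean but variances 1 and 9) is an assumption chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def quadratic_discriminant(x, mu, cov, prior):
    """g_i(x) = x^T W x + w^T x + w0 for a normal density with arbitrary cov."""
    x, mu = np.asarray(x, dtype=float), np.asarray(mu, dtype=float)
    cov_inv = np.linalg.inv(cov)
    W = -0.5 * cov_inv
    w = cov_inv @ mu
    w0 = (-0.5 * mu @ cov_inv @ mu
          - 0.5 * np.log(np.linalg.det(cov))
          + np.log(prior))
    return x @ W @ x + w @ x + w0

def classify(x, mus, covs, priors):
    """Return the index of the class with the largest discriminant value."""
    scores = [quadratic_discriminant(x, m, c, p) for m, c, p in zip(mus, covs, priors)]
    return int(np.argmax(scores))

# Assumed 1-D example: same mean, narrow class 0 and wide class 1
mus = [[0.0], [0.0]]
covs = [np.array([[1.0]]), np.array([[9.0]])]
priors = [0.5, 0.5]

labels = [classify([v], mus, covs, priors) for v in (-5.0, 0.0, 5.0)]
```

Here the wide class wins on both sides of the narrow one, so its decision region consists of two disjoint intervals.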
Discriminant Function with the Normal Density
• Even in one dimension, for arbitrary variances the decision regions need not be simply connected.
Discriminant Function with the Normal Density
Error Probabilities
• There are two ways in which a classification error can occur: either an observation $x$ falls in $\mathcal{R}_2$ and the true state of nature is $\omega_1$, or $x$ falls in $\mathcal{R}_1$ and the true state of nature is $\omega_2$. Since these events are mutually exclusive and exhaustive, the probability of error is
$$P(\text{error}) = P(x \in \mathcal{R}_2, \omega_1) + P(x \in \mathcal{R}_1, \omega_2) = \int_{\mathcal{R}_2} p(x \mid \omega_1)\, P(\omega_1)\, dx + \int_{\mathcal{R}_1} p(x \mid \omega_2)\, P(\omega_2)\, dx$$
• Because the decision point $x^*$ (and hence the regions $\mathcal{R}_1$ and $\mathcal{R}_2$) was chosen arbitrarily for the accompanying figure, the probability of error is not as small as it might be. In particular, the triangular area marked "reducible error" can be eliminated if the decision boundary is moved to $x_B$.
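The reducible error can be illustrated numerically by comparing an arbitrary decision point with the Bayes-optimal one. This sketch assumes two equal-variance Gaussian class-conditional densities with equal priors, for which the optimal point $x_B$ is the midpoint of the means; all numeric values and function names are assumptions.

```python
import math

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def error_probability(x_star, p1, mu1, p2, mu2, sigma):
    """P(error) when deciding w1 for x < x_star and w2 otherwise (mu1 < mu2)."""
    miss_w1 = 1.0 - normal_cdf(x_star, mu1, sigma)   # x in R2 but truth is w1
    miss_w2 = normal_cdf(x_star, mu2, sigma)         # x in R1 but truth is w2
    return p1 * miss_w1 + p2 * miss_w2

p1 = p2 = 0.5
mu1, mu2, sigma = 0.0, 2.0, 1.0
err_bayes = error_probability(1.0, p1, mu1, p2, mu2, sigma)      # x_B = midpoint
err_arbitrary = error_probability(1.6, p1, mu1, p2, mu2, sigma)  # some other x*
```

The difference `err_arbitrary - err_bayes` corresponds to the reducible error for this choice of threshold.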
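Probability of Correctness in Multicategory Case
$$P(\text{correct}) = \sum_{i=1}^{c} P(\mathbf{x} \in \mathcal{R}_i, \omega_i) = \sum_{i=1}^{c} \int_{\mathcal{R}_i} p(\mathbf{x} \mid \omega_i)\, P(\omega_i)\, d\mathbf{x}$$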
Signal Detection Theory and Operating Characteristics
• We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal $x$ in the detector has mean $\mu_1$ when the pulse is absent and $\mu_2$ when it is present.
• The detector uses a threshold $x^*$ to determine the presence of the pulse.
• Because of random noise, within and outside the detector itself, the actual value is a random variable. We assume the distributions are normal with different means but the same variance, i.e., $p(x \mid \omega_i) \sim N(\mu_i, \sigma^2)$.
• Discriminability: a measure of the ease of discriminating whether the pulse is present or not, in a form independent of the choice of $x^*$. It describes the inherent and unchangeable properties due to noise and the strength of the external signal, but not the decision strategy (i.e., the actual choice of $x^*$).
• This discriminability is defined as:
$$d' = \frac{|\mu_2 - \mu_1|}{\sigma}$$
Receiver Operating Characteristic (ROC)
• Experimentally compute hit and false alarm rates for a fixed $x^*$.
• Changing $x^*$ will change the hit and false alarm rates.
• A plot of hit rate against false alarm rate is called the ROC curve.
• Observe that the definition of ROC curves does not depend on the underlying assumption about the distributions.
  • In practice, distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted.
  • Vary a single control parameter for the decision rule and plot the resulting hit and false alarm rates.
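For the equal-variance Gaussian signal detection model, the ROC points can be computed in closed form by sweeping the threshold. This is a sketch with assumed means, variance and threshold grid; `normal_sf` and `roc_points` are hypothetical helper names, and the discriminability is taken as $d' = |\mu_2 - \mu_1| / \sigma$.

```python
import math

def normal_sf(x, mu, sigma):
    """Survival function P(X > x) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def roc_points(mu1, mu2, sigma, thresholds):
    """Return (false_alarm, hit) pairs, deciding 'pulse present' when x > x*."""
    return [(normal_sf(t, mu1, sigma), normal_sf(t, mu2, sigma)) for t in thresholds]

mu1, mu2, sigma = 0.0, 1.5, 1.0                  # assumed: pulse absent / present
d_prime = abs(mu2 - mu1) / sigma                 # discriminability
thresholds = [-3 + 0.5 * k for k in range(19)]   # x* from -3 to 6
points = roc_points(mu1, mu2, sigma, thresholds)
```

Each threshold gives one point on the curve; a larger `d_prime` pushes the curve toward the upper-left corner, while moving the threshold only slides along the same curve.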
ROC Curve Implementation*
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curve of a single class (class 2)
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
# (np.interp replaces scipy.interp, which is no longer available in recent SciPy)
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
                   ''.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Bayes Decision Theory: Discrete Features
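• The features can as well be discrete: binary, ternary, or higher integer valued.
• The probability density function then becomes singular, so the Bayes formula involves probabilities rather than probability densities and is modified as
$$P(\omega_j \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{P(\mathbf{x})}, \qquad P(\mathbf{x}) = \sum_{j=1}^{c} P(\mathbf{x} \mid \omega_j)\, P(\omega_j)$$
• However, the definition of the conditional risk remains the same, and the basic rule for minimizing the error rate is still to decide $\omega_i$ if $P(\omega_i \mid \mathbf{x}) > P(\omega_j \mid \mathbf{x})$ for all $j \neq i$.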
Independent Binary Features
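• As an example of a classification involving discrete features, consider the two-category problem in which the components of the feature vector are binary-valued and conditionally independent. Let $\mathbf{x} = (x_1, \ldots, x_d)^T$, where each $x_i$ is 0 or 1, with
$$p_i = \Pr[x_i = 1 \mid \omega_1], \qquad q_i = \Pr[x_i = 1 \mid \omega_2]$$
• So, each feature gives us a yes/no answer about the pattern.
• Assuming conditional independence, we can conveniently write
$$P(\mathbf{x} \mid \omega_1) = \prod_{i=1}^{d} p_i^{x_i} (1 - p_i)^{1 - x_i}, \qquad P(\mathbf{x} \mid \omega_2) = \prod_{i=1}^{d} q_i^{x_i} (1 - q_i)^{1 - x_i}$$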
Independent Binary Features
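• The likelihood ratio is:
$$\frac{P(\mathbf{x} \mid \omega_1)}{P(\mathbf{x} \mid \omega_2)} = \prod_{i=1}^{d} \left(\frac{p_i}{q_i}\right)^{x_i} \left(\frac{1 - p_i}{1 - q_i}\right)^{1 - x_i}$$
• And the discriminant function is:
$$g(\mathbf{x}) = \sum_{i=1}^{d} \left[x_i \ln \frac{p_i}{q_i} + (1 - x_i) \ln \frac{1 - p_i}{1 - q_i}\right] + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
• Since this is linear in the $x_i$:
$$g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0$$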
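• The weights of the linear discriminant $g(\mathbf{x}) = \sum_{i=1}^{d} w_i x_i + w_0$ are
$$w_i = \ln \frac{p_i (1 - q_i)}{q_i (1 - p_i)} = \ln \frac{1/q_i - 1}{1/p_i - 1}, \quad i = 1, \ldots, d, \qquad w_0 = \sum_{i=1}^{d} \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}$$
• Recall that we decide $\omega_1$ if $g(\mathbf{x}) > 0$ and $\omega_2$ otherwise.
• The weight $w_i$ indicates the relevance of a "yes" answer for $x_i$.
• If $p_i = q_i$, $x_i$ provides no information about the true state of nature, and $w_i = 0$.
• Otherwise, if $p_i > q_i$, a "yes" answer for $x_i$ contributes $w_i$ votes for $\omega_1$. For fixed $q_i < 1$, $w_i$ gets larger as $p_i$ gets larger. If $p_i < q_i$, $w_i$ is negative and a "yes" answer contributes $|w_i|$ votes for $\omega_2$.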
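The weights for independent binary features can be computed directly from $p_i$ and $q_i$, assuming the standard expressions $w_i = \ln \frac{p_i(1 - q_i)}{q_i(1 - p_i)}$ and $w_0 = \sum_i \ln \frac{1 - p_i}{1 - q_i} + \ln \frac{P(\omega_1)}{P(\omega_2)}$. The probabilities, priors and feature vector below are assumed values, and `binary_feature_weights` is a hypothetical helper name.

```python
import numpy as np

def binary_feature_weights(p, q, prior1, prior2):
    """Return (w, w0) of the linear discriminant g(x) = w^T x + w0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    w = np.log(p * (1 - q) / (q * (1 - p)))
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
    return w, w0

p = [0.8, 0.5, 0.3]   # assumed P(x_i = 1 | w1)
q = [0.2, 0.5, 0.6]   # assumed P(x_i = 1 | w2)
w, w0 = binary_feature_weights(p, q, 0.5, 0.5)

x = np.array([1, 0, 0])            # "yes" only for the first feature
g = w @ x + w0
decision = 1 if g > 0 else 2
```

The second feature has equal probabilities under both classes, so its weight is zero; the third has a smaller probability under class 1, so its weight is negative.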
Missing Features
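• We let $\mathbf{x} = [\mathbf{x}_g, \mathbf{x}_b]$, where $\mathbf{x}_g$ represents the known or "good" features and $\mathbf{x}_b$ represents the "bad" ones, i.e., either unknown or missing. We seek the Bayes rule given the good features, and for that the posterior probabilities are needed. In terms of the good features the posteriors are
$$P(\omega_i \mid \mathbf{x}_g) = \frac{p(\omega_i, \mathbf{x}_g)}{p(\mathbf{x}_g)} = \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b)\, d\mathbf{x}_b}{p(\mathbf{x}_g)} = \frac{\int P(\omega_i \mid \mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}_b}{\int p(\mathbf{x})\, d\mathbf{x}_b} = \frac{\int g_i(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x}_b}{\int p(\mathbf{x})\, d\mathbf{x}_b}$$
where the integrals marginalize over all bad/missing features and $g_i(\mathbf{x}) = P(\omega_i \mid \mathbf{x})$.
• Finally we use the Bayes decision rule on the resulting posterior probabilities, i.e., choose $\omega_i$ if $P(\omega_i \mid \mathbf{x}_g) > P(\omega_j \mid \mathbf{x}_g)$ for all $j \neq i$.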
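Marginalizing over a missing feature can be sketched with a small discrete example, where the integral over the bad feature becomes a sum. The joint probability table is an assumption made up for illustration (two classes, one binary good feature, one binary missing feature), and `posterior_given_good` is a hypothetical helper name.

```python
import numpy as np

# Assumed joint probabilities P(w_i, x_g, x_b) for 2 classes and binary features;
# axes are (class, x_g, x_b) and the entries sum to 1.
joint = np.array([[[0.20, 0.05], [0.10, 0.15]],
                  [[0.05, 0.10], [0.05, 0.30]]])

def posterior_given_good(joint, x_g):
    """P(w_i | x_g): sum the joint over the missing feature x_b, then normalize."""
    marginal = joint[:, x_g, :].sum(axis=1)   # P(w_i, x_g)
    return marginal / marginal.sum()          # divide by P(x_g)

post = posterior_given_good(joint, x_g=0)
decision = int(np.argmax(post))               # Bayes rule on the marginal posterior
```

For `x_g = 0` the marginal posteriors are 0.625 and 0.375, so the first class is chosen without ever observing the missing feature.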
Noisy Features
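• We assume we have uncorrupted (good) features $\mathbf{x}_g$, as before, and a noise model, expressed as $p(\mathbf{x}_b \mid \mathbf{x}_t)$. Here $\mathbf{x}_t$ denotes the true value of the observed $\mathbf{x}_b$ features, i.e., without the noise present; that is, the $\mathbf{x}_b$ are observed instead of the true $\mathbf{x}_t$. We assume that if $\mathbf{x}_t$ were known, $\mathbf{x}_b$ would be independent of $\omega_i$ and $\mathbf{x}_g$. From this assumption we get:
$$P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b) = \frac{\int p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\, d\mathbf{x}_t}{p(\mathbf{x}_g, \mathbf{x}_b)}$$
• Now note that
$$p(\omega_i, \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t)$$
and, by the independence assumption, $P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_t)$ and $p(\mathbf{x}_g, \mathbf{x}_b, \mathbf{x}_t) = p(\mathbf{x}_b \mid \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_t)$.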
Noisy Features
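• We put these together and thereby obtain
$$P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_b) = \frac{\int P(\omega_i \mid \mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_b \mid \mathbf{x}_t)\, d\mathbf{x}_t}{\int p(\mathbf{x}_g, \mathbf{x}_t)\, p(\mathbf{x}_b \mid \mathbf{x}_t)\, d\mathbf{x}_t}$$
• which we use as discriminant functions for classification in the manner dictated by Bayes.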