
Bayes Decision Theory

Chapter 2 DHS
Introduction
• Bayes decision theory is a fundamental statistical approach to the problem of pattern classification.
• The approach quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions.
• Assumption: the decision problem is posed in probabilistic terms, and all of the relevant probability values are known.
Bayes Terms
• Suppose that an observer watching fish arrive along the conveyor belt finds it hard to predict what type will emerge next, and that the sequence of types of fish appears to be random.
• State of nature: ω = ω1 if it is sea bass, ω = ω2 if it is salmon. We need to describe ω probabilistically.
• Prior probability (or simply prior): P(ω1) and P(ω2). We assume that no other types of fish are relevant here, so P(ω1) + P(ω2) = 1.
• These prior probabilities reflect our prior knowledge of how likely we are to get a sea bass or a salmon before the fish actually appears. They might, for instance, depend upon the time of year or the choice of fishing area.
Decision Rule
• Decision rule: suppose for a moment that we were forced to make a decision about the type of fish that will appear next without being allowed to see it.
• The only information we are allowed to use is the value of the prior probabilities.
• Decide ω1 if P(ω1) > P(ω2); otherwise decide ω2.
• How well does it work? This rule always makes the same decision, and its probability of error is min[P(ω1), P(ω2)].
Class Conditional Density for Decision Rule
• In most circumstances we are not asked to make decisions with so little information.
• In our example, we might for instance use a lightness measurement x to improve our classifier.
• Class-conditional probability density function: we take x to be a continuous random variable whose distribution depends on the state of nature and is expressed as p(x|ω1).
• This is the probability density function for x given that the state of nature is ω1. (It is also sometimes called the state-conditional probability density.)
• The difference between p(x|ω1) and p(x|ω2) describes the difference in lightness between populations of sea bass and salmon.
Bayes Formula
• Posterior probability (the posterior is a function of the likelihood and the prior): if the measured lightness value is x, how does it influence our belief about the true state of nature?
• In this case, Bayes formula gives
  P(ωj|x) = p(x|ωj) P(ωj) / p(x),  where the evidence is p(x) = Σj p(x|ωj) P(ωj)
• Bayes formula, expressed in English:
  posterior = (likelihood × prior) / evidence
Bayes Decision Rule
• Based on the larger posterior value we always choose the corresponding state of nature. However, we cannot always be right; there is therefore a probability of error:
  P(error|x) = P(ω1|x) if we decide ω2,  P(error|x) = P(ω2|x) if we decide ω1
• Average (or unconditional) error:
  P(error) = ∫ P(error|x) p(x) dx
• Bayes decision rule: decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2.
• The resulting probability of error is:
  P(error|x) = min[P(ω1|x), P(ω2|x)]
• For each observation x, the Bayes decision rule minimizes the probability of error, and hence it also minimizes the average probability of error.
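A minimal numerical sketch of this rule for the two-class fish example. The Gaussian lightness densities and the priors below are illustrative assumptions, not values from the slides:

import numpy as np
from scipy.stats import norm

# Assumed priors P(w1), P(w2) and class-conditional lightness densities p(x|w_j)
priors = {'seabass': 2/3, 'salmon': 1/3}
likelihoods = {'seabass': norm(loc=7.0, scale=1.0),
               'salmon':  norm(loc=4.0, scale=1.0)}

def posteriors(x):
    """Return P(w_j | x) for each class via Bayes formula."""
    joint = {c: likelihoods[c].pdf(x) * priors[c] for c in priors}
    evidence = sum(joint.values())            # p(x), the scale factor
    return {c: joint[c] / evidence for c in joint}

x = 5.5                                       # an observed lightness value
post = posteriors(x)
decision = max(post, key=post.get)            # Bayes rule: pick the larger posterior
p_error = 1.0 - post[decision]                # P(error|x) = min posterior
print(post, decision, p_error)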
Role of Evidence
• The evidence is p(x).
• Recall: P(ωj|x) = p(x|ωj) P(ωj) / p(x). The evidence is just a scale factor.
• The evidence p(x) can be viewed as a scale factor that guarantees that the posterior probabilities sum to one.
• Eliminating the evidence, we have an equivalent decision rule: decide ω1 if p(x|ω1) P(ω1) > p(x|ω2) P(ω2); otherwise decide ω2.
• In the special case p(x|ω1) = p(x|ω2), the decision depends on the priors only.
Bayesian Decision Theory – Continuous Features
• We shall now formalize the ideas just considered, and generalize them in four ways:
  • by allowing the use of more than one feature
  • by allowing more than two states of nature
  • by allowing actions other than merely deciding the state of nature
  • by introducing a 'loss function' which is more general than the 'probability of error'
Bayesian Decision Theory – Continuous Features
• Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where x is in a d-dimensional Euclidean space R^d, called the feature space.
• Allowing more than two states of nature provides us with a useful generalization at a small notational expense.
• Allowing actions other than classification primarily allows the possibility of "rejection".
• Rejection: the input pattern is rejected when it is difficult to decide between two classes or when the pattern is too noisy.
• The loss function specifies the cost of each action, and is used to convert a probability determination into a decision.
Bayesian Decision Theory – Continuous Features
• The finite set of c states of nature (or categories): ω1, …, ωc.
• The finite set of a possible actions: α1, …, αa.
• The loss incurred for taking action αi when the state of nature is ωj: λ(αi|ωj). Remember: not all actions are equally costly!
• The probability density function for x conditioned on ωj being the true state of nature: p(x|ωj).
• Bayes formula:
  P(ωj|x) = p(x|ωj) P(ωj) / p(x)
• The evidence is now:
  p(x) = Σ_{j=1}^{c} p(x|ωj) P(ωj)
Conditional Risk
• Suppose that we observe a particular x and that we contemplate taking action αi.
• If the true state of nature is ωj, by definition we will incur the loss λ(αi|ωj).
• Since P(ωj|x) is the probability that the true state of nature is ωj, the expected loss (or conditional risk) associated with taking action αi is
  R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)
• Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk.
• The Bayes decision procedure actually provides the optimal performance in terms of the overall risk.
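A small sketch of evaluating the conditional risk and picking the minimum-risk action. The loss table λ(αi|ωj), the extra "reject" action, and the posterior values are illustrative assumptions:

import numpy as np

# Rows = actions (decide w1, decide w2, reject); columns = true states w1, w2 (assumed losses)
loss = np.array([[0.0, 2.0],
                 [1.0, 0.0],
                 [0.3, 0.3]])           # a "reject" action with a small fixed cost (assumed)

posterior = np.array([0.4, 0.6])        # P(w1|x), P(w2|x) for some observed x (assumed)

cond_risk = loss @ posterior            # R(alpha_i | x) = sum_j lambda(alpha_i|w_j) P(w_j|x)
best_action = np.argmin(cond_risk)      # Bayes rule: take the minimum-risk action
print(cond_risk, best_action)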
Overall Risk
• A general decision rule is a function α(x) that tells us which action to take for every possible observation.
• In other words, for every x the decision function α(x) assumes one of the a values α1, …, αa.
• The overall risk R is the expected loss associated with a given decision rule:
  R = ∫ R(α(x)|x) p(x) dx
• But remember: we need to choose this decision rule with the objective of minimizing the risk of misclassification.
• So the goal is to choose α(x) such that R(α(x)|x) is as small as possible for every x; then the overall risk will be minimized.
Bayes Risk
• Compute the conditional risk
  R(αi|x) = Σ_{j=1}^{c} λ(αi|ωj) P(ωj|x)
  for i = 1, …, a and select the action αi for which R(αi|x) is minimum.
• The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
Two Category Problem
• Here action α1 corresponds to deciding that the true state of nature is ω1, and action α2 corresponds to deciding that it is ω2.
• Let λij = λ(αi|ωj) be the loss incurred for deciding ωi when the true state of nature is ωj.
• The conditional risks can be written as:
  R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
  R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
• The fundamental rule is to decide ω1 if R(α1|x) < R(α2|x). In terms of the posterior probabilities, we decide ω1 if
  (λ21 − λ11) P(ω1|x) > (λ12 − λ22) P(ω2|x)
• Or, equivalently, decide ω1 if
  (λ21 − λ11) p(x|ω1) P(ω1) > (λ12 − λ22) p(x|ω2) P(ω2)
Likelihood Ratio
• Under the reasonable assumption that λ21 > λ11, we can rewrite the above rule as: decide ω1 if
  p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
• We can consider p(x|ωj) as a function of ωj and treat it as a likelihood function; the left-hand side is then the likelihood ratio.
• Thus the Bayes decision rule can be interpreted as calling for deciding ω1 if the likelihood ratio exceeds a threshold value θλ that is independent of the observation x.
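A brief sketch of this likelihood-ratio test. The Gaussian likelihoods, losses, and priors are assumed for illustration only:

from scipy.stats import norm

p1, p2 = norm(7.0, 1.0), norm(4.0, 1.0)        # assumed p(x|w1), p(x|w2)
P1, P2 = 0.5, 0.5                              # assumed priors
l11, l12, l21, l22 = 0.0, 3.0, 1.0, 0.0        # assumed losses lambda_ij = lambda(alpha_i|w_j)

theta = ((l12 - l22) / (l21 - l11)) * (P2 / P1)   # threshold theta_lambda, independent of x

def decide(x):
    ratio = p1.pdf(x) / p2.pdf(x)              # likelihood ratio at the observation
    return 'w1' if ratio > theta else 'w2'

print(theta, decide(5.0), decide(6.5))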
Minimum Error Rate Classification
• Zero-one loss: λ(αi|ωj) = 0 if i = j, and 1 if i ≠ j, for i, j = 1, …, c.
• This loss function assigns no loss to a correct decision and a unit loss to any error; thus, all errors are equally costly.
• The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk is
  R(αi|x) = Σ_{j≠i} P(ωj|x) = 1 − P(ωi|x)
  where P(ωi|x) is the conditional probability that action αi is correct.
Minimum Error-rate Solution
• The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk.
• With the zero-one loss function, this yields a decision rule that minimizes the probability of error: decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
Minimum Error-rate Solution
• Recall: we choose class ω1 if the likelihood ratio exceeds the threshold
  p(x|ω1) / p(x|ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)] = θλ
• Now consider the 0-1 loss function:
• With λ12 = λ21 = 1 and λ11 = λ22 = 0, the threshold involves only the priors: θλ = P(ω2)/P(ω1) = θa.
• If λ12 = 2, λ21 = 1, λ11 = λ22 = 0, then θλ = 2P(ω2)/P(ω1) = θb.
Classifiers, Discriminant Functions and Decision Surfaces
• There are many different ways to represent classifiers or decision rules.
• One of the most useful is in terms of "discriminant functions" gi(x), i = 1, …, c. In the multi-category case, the classifier assigns x to class ωi if gi(x) > gj(x) for all j ≠ i.
• The Bayes classifier can be represented in this way, but the choice of discriminant function is not unique. In general we can take
  gi(x) = −R(αi|x)
• For minimum error rate, we take
  gi(x) = P(ωi|x), or equivalently gi(x) = ln p(x|ωi) + ln P(ωi)
Classifiers, Discriminant Functions and Decision Surfaces
• A decision rule partitions the feature space into c decision regions R1, …, Rc. A sample x is in Ri if
  gi(x) > gj(x) for all j ≠ i
• Two-category case:
  • Here a classifier is a "dichotomizer" that has two discriminant functions g1 and g2.
  • Combine them into a single function: g(x) = g1(x) − g2(x).
  • Decide ω1 if g(x) > 0; otherwise decide ω2.
  • Convenient forms: g(x) = g1(x) − g2(x) = P(ω1|x) − P(ω2|x), or equivalently
    g(x) = ln [p(x|ω1)/p(x|ω2)] + ln [P(ω1)/P(ω2)]
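A short sketch of a dichotomizer built from the log form of g(x). The class-conditional densities and priors are assumed for illustration:

import numpy as np
from scipy.stats import norm

p1, p2 = norm(7.0, 1.0), norm(4.0, 1.5)   # assumed p(x|w1), p(x|w2)
P1, P2 = 0.6, 0.4                         # assumed priors

def g(x):
    # g(x) = ln p(x|w1)/p(x|w2) + ln P(w1)/P(w2); decide w1 if g(x) > 0
    return np.log(p1.pdf(x) / p2.pdf(x)) + np.log(P1 / P2)

for x in (3.0, 5.5, 8.0):
    print(x, g(x), 'w1' if g(x) > 0 else 'w2')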
The Normal Density
• The normal density is analytically tractable.
• It is a continuous density with two parameters (mean and variance).
• A number of processes are asymptotically Gaussian (central limit theorem).
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern.
• With mean μ and variance σ², the univariate normal density is
  p(x) = 1/(√(2π) σ) exp[ −(x − μ)² / (2σ²) ]
The Normal Density, Implementation
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# define constants: mean, standard deviation, and an interval [x1, x2] of interest
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100

# standardize the interval endpoints (z-transform)
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

x = np.arange(z1, z2, 0.001)        # range of x inside the interval
x_all = np.arange(-10, 10, 0.001)   # entire range of x, both in and out of the interval

# mean = 0, stddev = 1, since the z-transform was applied
y = norm.pdf(x, 0, 1)
y2 = norm.pdf(x_all, 0, 1)

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x_all, y2)
ax.fill_between(x, y, 0, alpha=0.3)   # shade the probability mass between x1 and x2
ax.set_xlabel('z')
ax.set_ylabel('p(z)')
plt.show()
Multivariate Normal Density
• The multivariate normal density in d dimensions is
  p(x) = 1/((2π)^{d/2} |Σ|^{1/2}) exp[ −½ (x − μ)^T Σ^{-1} (x − μ) ]
  where x ∈ R^d, the mean μ ∈ R^d, and the covariance matrix Σ ∈ R^{d×d}.
• We have μ = E[x] and Σ = E[(x − μ)(x − μ)^T].
• The covariance matrix is symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive.
• The multivariate normal density is completely specified by d + d(d+1)/2 parameters (the elements of μ and the independent elements of Σ).
• For the ith component xi of x, μi = E[xi]; the entries of Σ are σij = E[(xi − μi)(xj − μj)].
• What happens when, for some i and j, xi and xj are statistically independent? Then σij = 0, and if this holds for all pairs, Σ is diagonal.
Multivariate Normal Density
• Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed.
• If A ∈ R^{d×k} and y = A^T x, where y is a k-component vector, then p(y) ~ N(A^T μ, A^T Σ A).
• If k = 1, y = a^T x is a scalar, the projection of x onto a line in the direction of a, and a^T Σ a is the variance of the projection of x onto a.
• In general, then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.
• It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix I.
• Whitening transform: let Φ be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. The transform
  Aw = Φ Λ^{-1/2}
  yields a transformed distribution with covariance matrix equal to the identity matrix.
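A short sketch of the whitening transform Aw = Φ Λ^{-1/2}; the covariance matrix and mean below are illustrative assumptions. The sample covariance of the transformed data should come out close to the identity:

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])          # assumed covariance matrix

X = rng.multivariate_normal(mu, Sigma, size=5000)   # samples from N(mu, Sigma)

eigvals, eigvecs = np.linalg.eigh(Sigma)             # Lambda (entries) and Phi (columns)
A_w = eigvecs @ np.diag(eigvals ** -0.5)             # whitening matrix A_w = Phi Lambda^{-1/2}

Y = (X - mu) @ A_w                                   # whitened data: y = A_w^T (x - mu)
print(np.cov(Y, rowvar=False))                       # approximately the identity matrix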
Multivariate Gaussian Density
• From the definition of the multivariate normal density function, the loci of points of constant density are hyperellipsoids on which the quadratic form
  (x − μ)^T Σ^{-1} (x − μ)
  is constant.
• The principal axes of these hyperellipsoids are given by the eigenvectors of Σ (described by Φ).
• The eigenvalues (described by Λ, the diagonal matrix whose entries are the eigenvalues of Σ) determine the lengths of these axes.
• Mahalanobis distance: the distance between x and μ, given by r² = (x − μ)^T Σ^{-1} (x − μ).
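A tiny sketch computing the squared Mahalanobis distance r² = (x − μ)^T Σ^{-1} (x − μ); μ, Σ, and x are illustrative values:

import numpy as np

mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([1.0, 2.0])

diff = x - mu
r2 = diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu), via a linear solve
print(r2)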
Reference for Implementation
• https://docs.scipy.org/doc//numpy-1.10.0/reference/generated/numpy.random.multivariate_normal.html
• Draws random samples from a multivariate normal distribution.
• The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or "center") and variance (standard deviation, or "width," squared) of the one-dimensional normal distribution.
• numpy.random.multivariate_normal(mean, cov[, size])
  • mean : 1-D array_like, of length N. Mean of the N-dimensional distribution.
  • cov : 2-D array_like, of shape (N, N). Covariance matrix of the distribution. It must be symmetric and positive-semidefinite for proper sampling.
  • out : ndarray. The drawn samples, of shape size, if that was provided. If not, the shape is (N,).
Multivariate Normal Density, Implementation
import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

# Parameters to set
mu_x = 0
variance_x = 2
mu_y = 0
variance_y = 10

# Create grid and multivariate normal
x = np.linspace(-10, 10, 200)
y = np.linspace(-10, 10, 200)
X, Y = np.meshgrid(x, y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
rv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])

# Make a 3D surface plot of the density
fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') is removed in recent matplotlib
ax.plot_surface(X, Y, rv.pdf(pos), cmap=cm.coolwarm, linewidth=0)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show()
Discriminant Functions for the Normal Density
• Minimum error-rate classification can be achieved by the discriminant functions
• Recall: gi(x) = ln p(x|ωi) + ln P(ωi)
• In the case of multivariate normal densities, we assume p(x|ωi) ~ N(μi, Σi).
• Thus we have
  gi(x) = −½ (x − μi)^T Σi^{-1} (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)
Discriminant Functions for the Normal Density
• Case 1: Σi = σ²I. The features are statistically independent, and each feature has the same variance, σ². In this case the covariance matrix is diagonal (i.e., samples fall within equal-size hyperspherical clusters).
• We have |Σi| = σ^{2d} and Σi^{-1} = (1/σ²)I. The resulting discriminant function is:
  gi(x) = −||x − μi||² / (2σ²) + ln P(ωi)
• Expanded form:
  gi(x) = −(1/(2σ²)) [x^T x − 2 μi^T x + μi^T μi] + ln P(ωi)
  The quadratic term x^T x is the only term that prevents gi(x) from being linear. However, it is the same for all i, so for the purpose of deciding i it acts like an additive constant.
• Therefore, an equivalent linear form is:
  gi(x) = wi^T x + wi0,  where wi = μi / σ² and wi0 = −μi^T μi / (2σ²) + ln P(ωi)
• Linear machine: the decision surfaces are pieces of hyperplanes defined by the linear equations gi(x) = gj(x) for the two categories with the highest posterior probabilities.
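A compact sketch of this Case 1 linear machine, gi(x) = wi^T x + wi0 with wi = μi/σ² and wi0 = −μi^Tμi/(2σ²) + ln P(ωi). The means, σ², and priors are assumed for illustration:

import numpy as np

means = np.array([[0.0, 0.0],
                  [3.0, 3.0],
                  [0.0, 4.0]])            # assumed mu_i for three classes
sigma2 = 1.5                               # assumed common variance sigma^2
priors = np.array([0.5, 0.3, 0.2])         # assumed P(w_i)

W = means / sigma2                                               # w_i = mu_i / sigma^2
w0 = -np.sum(means**2, axis=1) / (2 * sigma2) + np.log(priors)   # w_i0

def classify(x):
    g = W @ x + w0                         # g_i(x) = w_i^T x + w_i0
    return np.argmax(g)

print(classify(np.array([1.0, 1.0])), classify(np.array([2.5, 3.5])))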
Discriminant Functions for the Normal Density
• The decision surfaces gi(x) = gj(x) can also be rewritten as
  w^T (x − x0) = 0
  where w = μi − μj and
  x0 = ½ (μi + μj) − (σ² / ||μi − μj||²) ln [P(ωi)/P(ωj)] (μi − μj)
• This hyperplane runs through x0 and is orthogonal to w.
• From the structure of w we can say that the hyperplane dividing Ri and Rj is orthogonal to the line linking the means.
• If P(ωi) = P(ωj), then x0 = ½(μi + μj); that is, x0 is halfway between the means. Otherwise, x0 shifts away from the more likely mean.
• However, if σ² is small relative to the squared distance ||μi − μj||², the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.
Recall: gi(x) = ln p(x|ωi) + ln P(ωi)
Discriminant Function with the Normal Density
• Case 2: Σi = Σ. A simple case arises when the covariance matrices for all of the classes are identical but otherwise arbitrary.
• Recall:
  gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) − (d/2) ln 2π − ½ ln |Σ| + ln P(ωi)
  The terms (d/2) ln 2π and ½ ln |Σ| are superfluous in this case, since they are the same for every class. If the priors are also the same for all classes, only the Mahalanobis distance (the first term) is needed for classification.
• With this case's condition we have
  gi(x) = −½ (x − μi)^T Σ^{-1} (x − μi) + ln P(ωi)
• Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster for the ith class being centered about the mean vector μi.
• Expanding the quadratic form and dropping the term x^T Σ^{-1} x, which is the same for all i, we can rewrite it in the linear form
  gi(x) = wi^T x + wi0,  where wi = Σ^{-1} μi and wi0 = −½ μi^T Σ^{-1} μi + ln P(ωi)
Discriminant Function with the Normal Density: Case 2 Derivation
• For the decision boundary between Ri and Rj we have:
  wi^T x + wi0 = wj^T x + wj0
  (wi^T − wj^T) x + wi0 − wj0 = 0
  (μi − μj)^T (Σ^{-1})^T x + wi0 − wj0 = 0
  (Σ^{-1}(μi − μj))^T x + wi0 − wj0 = 0
• Now consider the constant term:
  wi0 − wj0 = −½ μi^T Σ^{-1} μi + ln P(ωi) + ½ μj^T Σ^{-1} μj − ln P(ωj)
            = −½ (μi^T − μj^T) Σ^{-1} (μi + μj) + ln [P(ωi)/P(ωj)]
            = −(Σ^{-1}(μi − μj))^T [ (μi + μj)/2 − ( ln [P(ωi)/P(ωj)] / ((μi − μj)^T Σ^{-1} (μi − μj)) ) (μi − μj) ]
            = −w^T x0
  with w = Σ^{-1}(μi − μj) and x0 = (μi + μj)/2 − ( ln [P(ωi)/P(ωj)] / ((μi − μj)^T Σ^{-1} (μi − μj)) ) (μi − μj).
• Finally, recalling (Σ^{-1}(μi − μj))^T x + wi0 − wj0 = 0, the decision boundary becomes
  w^T (x − x0) = 0
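A short numerical check of this derivation under an assumed shared covariance, means, and priors: the point x0 should lie on the boundary, i.e., the two discriminants should agree there:

import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])            # assumed shared covariance
mu_i = np.array([0.0, 0.0])
mu_j = np.array([3.0, 1.0])
P_i, P_j = 0.6, 0.4                        # assumed priors

Sigma_inv = np.linalg.inv(Sigma)
diff = mu_i - mu_j

w = Sigma_inv @ diff
x0 = 0.5 * (mu_i + mu_j) - (np.log(P_i / P_j) / (diff @ Sigma_inv @ diff)) * diff

def g(x, mu, P):
    """Case 2 discriminant g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) + ln P(w_i)."""
    d = x - mu
    return -0.5 * d @ Sigma_inv @ d + np.log(P)

print(g(x0, mu_i, P_i) - g(x0, mu_j, P_j))   # ~0: x0 lies on the boundary w^T (x - x0) = 0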
Discriminant Function with the Normal Density: Case 2 Visualization
Discriminant Function with the Normal Density
• Case 3: Σi is arbitrary. The covariance matrices are different for each category; only the (d/2) ln 2π term can be dropped, and the resulting discriminant functions are inherently quadratic:
  gi(x) = x^T Wi x + wi^T x + wi0
  where Wi = −½ Σi^{-1}, wi = Σi^{-1} μi, and wi0 = −½ μi^T Σi^{-1} μi − ½ ln |Σi| + ln P(ωi).
Discriminant Function with the Normal Density
Even in one dimension, for arbitrary covariance the decision regions need not be simply connected: the decision region for one class may consist of several disjoint intervals.
Error Probabilities
• There are two ways in which a classification error can occur: either an observation x falls in R2 and the true state of nature is ω1, or x falls in R1 and the true state of nature is ω2. Since these events are mutually exclusive and exhaustive, the probability of error is
  P(error) = P(x ∈ R2, ω1) + P(x ∈ R1, ω2)
           = ∫_{R2} p(x|ω1) P(ω1) dx + ∫_{R1} p(x|ω2) P(ω2) dx
• Because the decision point x* (and hence the regions R1 and R2) was chosen arbitrarily in the accompanying figure, the probability of error is not as small as it might be. In particular, the triangular area marked "reducible error" can be eliminated if the decision boundary is moved to the Bayes-optimal point xB.
Probability of Correctness in the Multicategory Case
• In the multicategory case it is simpler to compute the probability of being correct:
  P(correct) = Σ_{i=1}^{c} P(x ∈ Ri, ωi) = Σ_{i=1}^{c} ∫_{Ri} p(x|ωi) P(ωi) dx
Signal Detection Theory and Operating Characteristics
• We are interested in detecting a single weak pulse, e.g. a radar reflection; the internal signal x in the detector has mean μ1 when the pulse is absent and μ2 when it is present.
• The detector uses a threshold x* to determine the presence of the pulse.
• Because of random noise, within and outside the detector itself, the actual value is a random variable. We assume the distributions are normal with different means but the same variance, i.e., p(x|ωi) ~ N(μi, σ²).
• Discriminability: a measure of the ease of discriminating whether the pulse is present or not, in a form independent of the choice of x*. It describes the inherent and unchangeable properties due to the noise and the strength of the external signal, but not the decision strategy (i.e., the actual choice of x*).
• The discriminability is defined as:
  d' = |μ2 − μ1| / σ
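A small sketch relating d', the threshold x*, and the hit/false-alarm rates under the Gaussian model above; the numeric values are assumed for illustration:

from scipy.stats import norm

mu1, mu2, sigma = 0.0, 2.0, 1.0        # assumed: pulse absent (mu1), pulse present (mu2)
x_star = 1.2                           # assumed detection threshold

d_prime = abs(mu2 - mu1) / sigma                               # discriminability d'
hit_rate = 1 - norm.cdf(x_star, loc=mu2, scale=sigma)          # P(x > x* | pulse present)
false_alarm = 1 - norm.cdf(x_star, loc=mu1, scale=sigma)       # P(x > x* | pulse absent)
print(d_prime, hit_rate, false_alarm)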
Receiver Operating Characteristic (ROC)
• Experimentally, compute the hit and false alarm rates for a fixed x*.
• Changing x* will change the hit and false alarm rates.
• A plot of the hit rate versus the false alarm rate as x* varies is called the ROC curve.
• Observe that the definition of ROC curves does not depend on the underlying distributional assumption.
• In practice, the distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted:
• Vary a single control parameter of the decision rule and plot the resulting hit and false alarm rates.
ROC curve Implementation*
import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,
                                                    random_state=0)

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curve for a single class
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange',
         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])  # scipy.interp is deprecated; np.interp is equivalent

# Finally average it and compute AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
Bayes Decision Theory: Discrete Features
• The features can as well be discrete: binary, ternary, or higher integer valued.
• In this case the probability density function becomes singular, and the Bayes formula involves probabilities rather than probability densities; it is modified to
  P(ωj|x) = P(x|ωj) P(ωj) / P(x),  where P(x) = Σ_{j=1}^{c} P(x|ωj) P(ωj)
• However, the definition of the conditional risk remains the same, and the basic rule for minimizing the error rate is unchanged: decide ωi if P(ωi|x) > P(ωj|x) for all j ≠ i.
Independent Binary Features
• As an example of a classification involving discrete features, consider the two-category problem in which the components of the feature vector are binary-valued and conditionally independent.
• So each feature gives us a yes/no answer about the pattern. Let pi = P(xi = 1|ω1) and qi = P(xi = 1|ω2).
• Assuming conditional independence, we can conveniently write
  P(x|ω1) = Π_{i=1}^{d} pi^{xi} (1 − pi)^{1−xi}  and  P(x|ω2) = Π_{i=1}^{d} qi^{xi} (1 − qi)^{1−xi}
Independent Binary Features
• The likelihood ratio is:
  P(x|ω1)/P(x|ω2) = Π_{i=1}^{d} (pi/qi)^{xi} ((1 − pi)/(1 − qi))^{1−xi}
• And the discriminant function is:
  g(x) = Σ_{i=1}^{d} [ xi ln(pi/qi) + (1 − xi) ln((1 − pi)/(1 − qi)) ] + ln [P(ω1)/P(ω2)]
• Since this is linear in x:
  g(x) = Σ_{i=1}^{d} wi xi + w0
  where wi = ln [pi(1 − qi) / (qi(1 − pi))] and w0 = Σ_{i=1}^{d} ln [(1 − pi)/(1 − qi)] + ln [P(ω1)/P(ω2)].
• The magnitude of the weight wi = ln [pi(1 − qi) / (qi(1 − pi))] indicates the relevance of a "yes" answer for feature xi.
• Recall that we decide ω1 if g(x) > 0 and ω2 otherwise.
• If pi = qi, then wi = 0 and xi provides no information about the true state of nature.
• Otherwise, if pi > qi, a "yes" answer for xi contributes wi votes to ω1. For fixed qi < 1, wi gets larger as pi gets larger. If pi < qi, wi is negative and a "yes" answer contributes |wi| votes for ω2.
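A brief sketch of this linear discriminant for binary features. The probabilities pi, qi and the priors are assumed for illustration:

import numpy as np

p = np.array([0.8, 0.7, 0.6])       # assumed P(x_i = 1 | w1)
q = np.array([0.3, 0.4, 0.6])       # assumed P(x_i = 1 | w2)
P1, P2 = 0.5, 0.5                   # assumed priors

w = np.log(p * (1 - q) / (q * (1 - p)))                    # weight of a "yes" for each feature
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)   # bias term

def g(x):
    return w @ x + w0               # decide w1 if g(x) > 0

print(w)                            # note w[2] = 0: p_3 = q_3 gives no information
print(g(np.array([1, 1, 0])), g(np.array([0, 0, 1])))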
Missing Features
• We let x = [xg, xb], where xg represents the known or "good" features and xb represents the "bad" ones, i.e., either unknown or missing. We seek the Bayes rule given the good features, and for that the posterior probabilities are needed. In terms of the good features, the posteriors are obtained by marginalizing over all bad/missing features:
  P(ωi|xg) = p(ωi, xg) / p(xg) = ∫ p(ωi, xg, xb) dxb / ∫ p(xg, xb) dxb = ∫ P(ωi|x) p(x) dxb / ∫ p(x) dxb
• Finally, we use the Bayes decision rule on the resulting posterior probabilities, i.e., choose the ωi that maximizes P(ωi|xg).
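A minimal sketch of this marginalization with one good and one missing feature, assuming Gaussian class-conditional densities with the illustrative parameters below; the integral over xb is approximated on a grid (the constant grid spacing cancels when we normalize):

import numpy as np
from scipy.stats import multivariate_normal

# Assumed class priors and two-dimensional class-conditional densities p(x_g, x_b | w_i)
classes = [
    {'prior': 0.5, 'mean': [0.0, 0.0], 'cov': [[1.0, 0.5], [0.5, 2.0]]},
    {'prior': 0.5, 'mean': [2.0, 1.0], 'cov': [[1.5, 0.0], [0.0, 1.0]]},
]

x_g = 1.0                                # observed good feature
xb_grid = np.linspace(-10, 10, 2001)     # grid over the missing feature x_b

joint = []
for c in classes:
    rv = multivariate_normal(c['mean'], c['cov'])
    pts = np.column_stack([np.full_like(xb_grid, x_g), xb_grid])
    joint.append(c['prior'] * rv.pdf(pts).sum())   # proportional to P(w_i) * integral of p(x_g, x_b | w_i) dx_b

posterior = np.array(joint) / sum(joint)           # P(w_i | x_g)
print(posterior)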
Noisy Features
• We assume we have uncorrupted (good) features xg, as before, and a noise model, expressed as p(xb|xt). Here we let xt denote the true value of the observed xb features, i.e., without the noise present; that is, xb is observed instead of the true xt. We assume that if xt were known, xb would be independent of ωi and xg. From such an assumption we get:
  P(ωi|xg, xb) = ∫ p(ωi, xg, xb, xt) dxt / ∫ p(xg, xb, xt) dxt
• Now note that, by the independence assumption,
  p(ωi, xg, xb, xt) = p(xb|xt) P(ωi|xg, xt) p(xg, xt)
Noisy Features
• We put these together and thereby obtain
  P(ωi|xg, xb) = ∫ P(ωi|xg, xt) p(xb|xt) p(xg, xt) dxt / ∫ p(xb|xt) p(xg, xt) dxt
• which we use as discriminant functions for classification in the manner dictated by Bayes.