BAYES DECISION THEORY (Chapter 2, DHS)

1 Introduction
• Bayes decision theory is a fundamental statistical approach to pattern classification.
• The approach quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions.
• Assumption: the decision problem is posed in probabilistic terms, and all relevant probability values are known.

2 Bayes Terms
• Suppose an observer watching fish arrive along the conveyor belt finds it hard to predict which type will emerge next; the sequence of fish types appears random.
• State of nature: ω = ω₁ if the fish is a sea bass, ω = ω₂ if it is a salmon. We need to describe ω probabilistically.
• Prior probability (or simply prior): P(ω₁) and P(ω₂) [we assume no other types of fish are relevant here].
• These priors reflect our prior knowledge of how likely we are to get a sea bass or a salmon before the fish actually appears. They might, for instance, depend on the time of year or the choice of fishing area.

3 Decision Rule
• Suppose for a moment that we are forced to decide which type of fish will appear next without being allowed to see it.
• The only information we are allowed to use is the value of the prior probabilities.
• Decide ω₁ if P(ω₁) > P(ω₂); otherwise decide ω₂.
• How well does this rule work?

4 Class-Conditional Density for the Decision Rule
• In most circumstances we are not asked to make decisions with so little information.
• In our example, we might for instance use a lightness measurement x to improve our classifier.
• Class-conditional probability density function: x is a continuous random variable whose distribution depends on the state of nature, written p(x|ω₁).
• This is the probability density function for x given that the state of nature is ω₁ (also sometimes called the state-conditional probability density).
• The difference between p(x|ω₁) and p(x|ω₂) describes the difference in lightness between the populations of sea bass and salmon.

5 Bayes Formula
• Posterior probability (the posterior is a function of the likelihood and the prior): if the measured lightness value is x, how does it influence our belief about the true state of nature?
  P(ωⱼ|x) = p(x|ωⱼ) P(ωⱼ) / p(x)
• In this two-class case the evidence is
  p(x) = p(x|ω₁) P(ω₁) + p(x|ω₂) P(ω₂)
• Bayes formula, expressed in English:
  posterior = (likelihood × prior) / evidence

6 Bayes Decision Rule
• We always choose the state of nature with the larger posterior value. We cannot always be right, so there is a probability of error:
  P(error|x) = P(ω₁|x) if we decide ω₂, and P(ω₂|x) if we decide ω₁
• Average (unconditional) error:
  P(error) = ∫ P(error|x) p(x) dx
• Bayes decision rule: decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂.
• The resulting probability of error is
  P(error|x) = min[ P(ω₁|x), P(ω₂|x) ]
• For each observation x, the Bayes decision rule minimizes the probability of error.

7 Role of Evidence
• The evidence is p(x). Recall: it is just a scale factor.
• The evidence p(x) can be viewed as a scale factor that guarantees the posterior probabilities sum to 1.
• Eliminating the evidence gives an equivalent decision rule: decide ω₁ if p(x|ω₁) P(ω₁) > p(x|ω₂) P(ω₂); otherwise decide ω₂.
• In the special case p(x|ω₁) = p(x|ω₂), the decision depends on the priors only.
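A minimal sketch of the two-class Bayes rule for the fish example. The priors and the Gaussian class-conditional lightness densities (means and standard deviations) below are made-up illustrative values, not numbers from the text.

import numpy as np
from scipy.stats import norm

# Hypothetical priors and class-conditional lightness densities (illustrative values only)
P_w1, P_w2 = 2/3, 1/3                    # P(omega_1) = sea bass, P(omega_2) = salmon
p_x_w1 = norm(loc=4.0, scale=1.0).pdf    # p(x|omega_1), assumed Gaussian
p_x_w2 = norm(loc=7.0, scale=1.2).pdf    # p(x|omega_2), assumed Gaussian

def posteriors(x):
    """Bayes formula: P(omega_j|x) = p(x|omega_j) P(omega_j) / p(x)."""
    joint1 = p_x_w1(x) * P_w1
    joint2 = p_x_w2(x) * P_w2
    evidence = joint1 + joint2           # p(x), the scale factor
    return joint1 / evidence, joint2 / evidence

def decide(x):
    """Bayes decision rule: pick the state of nature with the larger posterior."""
    post1, post2 = posteriors(x)
    label = 'omega_1 (sea bass)' if post1 > post2 else 'omega_2 (salmon)'
    return label, min(post1, post2)      # P(error|x) = min of the two posteriors

for x in (3.0, 5.5, 8.0):
    label, p_err = decide(x)
    print(f'x = {x}: decide {label}, P(error|x) = {p_err:.3f}')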
8 Bayesian Decision Theory – Continuous Features
• We now formalize the ideas just considered and generalize them in four ways:
  • by allowing the use of more than one feature
  • by allowing more than two states of nature
  • by allowing actions other than merely deciding the state of nature
  • by introducing a loss function that is more general than the probability of error

9 Bayesian Decision Theory – Continuous Features
• Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where x lies in a d-dimensional Euclidean space ℝᵈ called the feature space.
• Allowing more than two states of nature provides a useful generalization at only a small notational expense.
• Allowing actions other than classification primarily allows the possibility of “rejection”.
• Rejection: an input pattern is rejected when it is difficult to decide between two classes or the pattern is too noisy.
• The loss function states the cost of each action and is used to convert a probability determination into a decision.

10 Bayesian Decision Theory – Continuous Features
• The finite set of c states of nature (or categories): ω₁, …, ω_c
• The finite set of a possible actions: α₁, …, α_a
• The loss incurred for taking action αᵢ when the state of nature is ωⱼ: λ(αᵢ|ωⱼ). Remember: not all actions are equally costly!
• The probability density function for x conditioned on ωⱼ being the true state of nature: p(x|ωⱼ)
• Bayes formula:
  P(ωⱼ|x) = p(x|ωⱼ) P(ωⱼ) / p(x)
• The evidence is now:
  p(x) = ∑ⱼ₌₁..c p(x|ωⱼ) P(ωⱼ)

11 Conditional Risk
• Suppose that we observe a particular x and that we contemplate taking action αᵢ.
• If the true state of nature is ωⱼ, by definition we incur the loss λ(αᵢ|ωⱼ).
• Since P(ωⱼ|x) is the probability that the true state of nature is ωⱼ, the expected loss (or risk) associated with taking action αᵢ is
  R(αᵢ|x) = ∑ⱼ₌₁..c λ(αᵢ|ωⱼ) P(ωⱼ|x)
• Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk.
• The Bayes decision procedure actually provides optimal performance on the overall risk.

12 Overall Risk
• A general decision rule is a function α(x) that tells us which action to take for every possible observation.
• In other words, for every x the decision function α(x) assumes one of the a values α₁, …, α_a.
• The overall risk is
  R = ∫ R(α(x)|x) p(x) dx
• But remember: we need to choose this decision rule with the objective of minimizing the risk of misclassification.
• The goal is therefore to choose α(x) so that R(α(x)|x) is as small as possible for every x; then the overall risk is minimized.

13 Bayes Risk
• Compute the conditional risk R(αᵢ|x) for i = 1, …, a and select the action αᵢ for which R(αᵢ|x) is minimum.
• The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
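A minimal sketch of selecting the minimum-conditional-risk action given posteriors and a loss matrix. The loss values and posterior values below are made-up numbers for illustration only (the third row shows a rejection action with a fixed cost, as mentioned above).

import numpy as np

# Hypothetical loss matrix: rows = actions alpha_i, columns = states omega_j,
# entry [i, j] = lambda(alpha_i | omega_j). Values are illustrative only.
loss = np.array([[0.0, 10.0],    # alpha_1: decide omega_1
                 [5.0,  0.0],    # alpha_2: decide omega_2
                 [1.0,  1.0]])   # alpha_3: reject (same cost under either state)

# Hypothetical posteriors P(omega_j | x) for some observed x
posterior = np.array([0.7, 0.3])

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x)
cond_risk = loss @ posterior
best = np.argmin(cond_risk)      # Bayes rule: take the minimum-risk action

print('R(alpha_i | x) =', cond_risk)           # here: [3.0, 3.5, 1.0]
print('Bayes action: alpha_%d' % (best + 1))   # rejection wins for these numbers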
14 Two-Category Problem
• Here action α₁ corresponds to deciding that the true state of nature is ω₁, and action α₂ to deciding that it is ω₂.
• Let λᵢⱼ = λ(αᵢ|ωⱼ) be the loss incurred for deciding ωᵢ when the true state of nature is ωⱼ.
• The conditional risks can be written as:
  R(α₁|x) = λ₁₁ P(ω₁|x) + λ₁₂ P(ω₂|x)
  R(α₂|x) = λ₂₁ P(ω₁|x) + λ₂₂ P(ω₂|x)
• The fundamental rule is to decide ω₁ if R(α₁|x) < R(α₂|x).
• In terms of the posterior probabilities, we decide ω₁ if
  (λ₂₁ − λ₁₁) P(ω₁|x) > (λ₁₂ − λ₂₂) P(ω₂|x)
• Or, equivalently, using Bayes formula,
  (λ₂₁ − λ₁₁) p(x|ω₁) P(ω₁) > (λ₁₂ − λ₂₂) p(x|ω₂) P(ω₂)

15 Likelihood Ratio
• We can rewrite the above rule as: decide ω₁ if
  p(x|ω₁) / p(x|ω₂) > [(λ₁₂ − λ₂₂) / (λ₂₁ − λ₁₁)] · [P(ω₂) / P(ω₁)]
• We can consider p(x|ωⱼ) as a function of ωⱼ and treat it as a likelihood function.
• Thus the Bayes decision rule can be interpreted as calling for deciding ω₁ whenever the likelihood ratio exceeds a threshold value θ_a that is independent of the observation x.

16 Minimum Error Rate Classification
• Zero-one loss: λ(αᵢ|ωⱼ) = 0 if i = j, and 1 if i ≠ j. This loss function assigns no loss to a correct decision and a unit loss to any error; thus all errors are equally costly.
• The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk becomes
  R(αᵢ|x) = ∑ⱼ≠ᵢ P(ωⱼ|x) = 1 − P(ωᵢ|x),
  i.e., one minus the conditional probability that action αᵢ is correct.

17 Minimum Error-Rate Solution
• The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk.
• With the zero-one loss function we therefore obtain a decision rule that minimizes the probability of error: decide ωᵢ if P(ωᵢ|x) > P(ωⱼ|x) for all j ≠ i.

18 Minimum Error-Rate Solution
• Recall: we choose class ω₁ if the following holds:
  p(x|ω₁) / p(x|ω₂) > [(λ₁₂ − λ₂₂) / (λ₂₁ − λ₁₁)] · [P(ω₂) / P(ω₁)], say = θ_a
• Now for the 0-1 loss function, with λ₁₂ = λ₂₁ = 1 and λ₁₁ = λ₂₂ = 0, the threshold involves only the priors:
  θ_a = P(ω₂) / P(ω₁)
• If instead λ₁₂ = 2, λ₂₁ = 1, λ₁₁ = λ₂₂ = 0, then
  θ_b = 2 P(ω₂) / P(ω₁)

19 Minimum Error-Rate Solution
[Figure: the likelihood ratio p(x|ω₁)/p(x|ω₂) plotted against x, with the thresholds θ_a and θ_b defining the corresponding decision regions.]

20 Classifiers, Discriminant Functions and Decision Surfaces
• There are many different ways to represent classifiers or decision rules; one of the most useful is in terms of discriminant functions gᵢ(x), i = 1, …, c.
• The multi-category case: assign x to class ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
• The Bayes classifier can be represented in this way, but the choice of discriminant function is not unique:
  gᵢ(x) = −R(αᵢ|x)
• For minimum error rate we can take
  gᵢ(x) = P(ωᵢ|x), or equivalently gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)

21 Classifiers, Discriminant Functions and Decision Surfaces
• A decision rule partitions the feature space into c decision regions R₁, …, R_c. A sample is in Rᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
• Two-category case:
  • Here the classifier is a “dichotomizer” with two discriminant functions g₁ and g₂.
  • Combine them into a single function: g(x) = g₁(x) − g₂(x).
  • Decide ω₁ if g(x) > 0, else decide ω₂.
  • For example, g(x) = P(ω₁|x) − P(ω₂|x), or equivalently
    g(x) = ln [p(x|ω₁) / p(x|ω₂)] + ln [P(ω₁) / P(ω₂)]

23 The Normal Density
• The normal density is analytically tractable.
• It is a continuous density with two parameters (mean and variance).
• A number of processes are asymptotically Gaussian (central limit theorem).
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern.
• Univariate form, with mean μ and variance σ²:
  p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]

24 The Normal Density, Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# define constants
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100

# calculate the z-transform of the spec limits
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

x = np.arange(z1, z2, 0.001)       # range of x in spec
x_all = np.arange(-10, 10, 0.001)  # entire range of x, both in and out of spec

# mean = 0, stddev = 1, since the z-transform was applied
y = norm.pdf(x, 0, 1)
y2 = norm.pdf(x_all, 0, 1)

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x_all, y2)
ax.fill_between(x, y, 0, alpha=0.3)  # shade the in-spec region
plt.show()

25 Multivariate Normal Density
• The multivariate normal density in d dimensions is
  p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ]
  where x ∈ ℝᵈ, the mean μ ∈ ℝᵈ, and the covariance matrix Σ ∈ ℝ^(d×d).
• We have μ = E[x] and Σ = E[(x − μ)(x − μ)ᵀ].
• The covariance matrix is symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive.
• The multivariate normal density is completely specified by d + d(d + 1)/2 parameters (the elements of μ and the independent elements of Σ).
• For the i-th component xᵢ of x: μᵢ = E[xᵢ] and σᵢⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)].
• What happens when, for some i and j, xᵢ and xⱼ are statistically independent? (Then σᵢⱼ = 0.)
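A small sketch that evaluates the multivariate normal density directly from the formula above and cross-checks it against scipy.stats.multivariate_normal. The mean vector, covariance matrix, and query point are arbitrary illustrative values.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (not from the text)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric, positive definite
x = np.array([0.5, 2.5])

d = len(mu)
Sigma_inv = np.linalg.inv(Sigma)
det = np.linalg.det(Sigma)

# p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))
quad = (x - mu) @ Sigma_inv @ (x - mu)     # the quadratic form
p_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * det)

p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)                   # the two values should agree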
26 Multivariate Normal Density
• Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed.
• For A ∈ ℝ^(d×k) and y = Aᵀx, where y is a k-component vector:
  p(y) ~ N(Aᵀμ, AᵀΣA)
• If k = 1, so that y = aᵀx is a scalar, y is the projection of x onto a line in the direction a, and aᵀΣa is the variance of that projection.
• In general, then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.
• It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix I.
• Whitening transform: let Φ be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. With
  A_w = Φ Λ^(−1/2)
  the transformed distribution has a covariance matrix equal to the identity matrix.

28 Multivariate Gaussian Density
• From the definition of the multivariate normal density function, the loci of points of constant density are hyperellipsoids on which the quadratic form
  (x − μ)ᵀ Σ⁻¹ (x − μ)
  is constant.
• The principal axes of these hyperellipsoids are given by the eigenvectors of Σ (the columns of Φ).
• The eigenvalues (the diagonal entries of Λ) determine the lengths of these axes.
• Mahalanobis distance r between x and μ:
  r² = (x − μ)ᵀ Σ⁻¹ (x − μ)

30 Reference for Implementation
• https://docs.scipy.org/doc//numpy-1.10.0/reference/generated/numpy.random.multivariate_normal.html
• Draws random samples from a multivariate normal distribution.
• The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.
• numpy.random.multivariate_normal(mean, cov[, size])
• mean: 1-D array_like of length N, the mean of the N-dimensional distribution.
• cov: 2-D array_like of shape (N, N), the covariance matrix of the distribution. It must be symmetric and positive semidefinite for proper sampling.
• out: ndarray, the drawn samples, of shape `size` if that was provided; if not, the shape is (N,).
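A small sketch of the whitening transform A_w = Φ Λ^(−1/2): draw samples with numpy, whiten them, and check that the sample covariance becomes approximately the identity; the Mahalanobis distance then reduces to a Euclidean distance in the whitened coordinates. The mean and covariance values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the text)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=20000)   # samples, shape (n, d)

# Eigendecomposition of Sigma: columns of Phi are orthonormal eigenvectors,
# eigvals holds the corresponding eigenvalues (diagonal of Lambda).
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)                 # whitening transform A_w = Phi Lambda^(-1/2)

Y = (X - mu) @ A_w                                   # y = A_w^T (x - mu), row-vector form
print(np.cov(Y.T))                                   # approximately the identity matrix

# Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu) equals the squared
# Euclidean norm of the whitened vector.
x = np.array([3.0, 0.0])
r2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
print(r2, np.sum(((x - mu) @ A_w) ** 2))             # the two values should agree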
31 Multivariate Normal Density, Implementation

import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D

# Parameters to set
mu_x = 0
variance_x = 2
mu_y = 0
variance_y = 10

# Create grid and multivariate normal
x = np.linspace(-10, 10, 200)
y = np.linspace(-10, 10, 200)
X, Y = np.meshgrid(x, y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
rv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])

# Make a 3D plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') is deprecated in recent matplotlib
ax.plot_surface(X, Y, rv.pdf(pos), cmap=cm.coolwarm, linewidth=0)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show()

32 Discriminant Functions for the Normal Density
• Minimum error-rate classification can be achieved using the discriminant functions
  gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)
• In the case of multivariate normal densities we assume p(x|ωᵢ) ~ N(μᵢ, Σᵢ).
• Thus we have
  gᵢ(x) = −½ (x − μᵢ)ᵀ Σᵢ⁻¹ (x − μᵢ) − (d/2) ln 2π − ½ ln |Σᵢ| + ln P(ωᵢ)

33 Discriminant Functions for the Normal Density
• Case 1: Σᵢ = σ²I. The features are statistically independent and each feature has the same variance σ². In this case the covariance matrix is diagonal (i.e., samples fall within equal-size hyperspherical clusters).
• We have |Σᵢ| = σ^(2d) and Σᵢ⁻¹ = (1/σ²) I. The resulting discriminant function is
  gᵢ(x) = −‖x − μᵢ‖² / (2σ²) + ln P(ωᵢ)
• Expanded form:
  gᵢ(x) = −(1/(2σ²)) [xᵀx − 2μᵢᵀx + μᵢᵀμᵢ] + ln P(ωᵢ)
  The quadratic term xᵀx is the only term that prevents gᵢ(·) from being linear. However, it is the same for all i, so for the purpose of deciding i it acts as an additive constant and can be dropped.
• Therefore an equivalent linear form is
  gᵢ(x) = wᵢᵀx + wᵢ₀, where wᵢ = μᵢ/σ² and wᵢ₀ = −μᵢᵀμᵢ/(2σ²) + ln P(ωᵢ)
• Linear machine: the decision surfaces are pieces of hyperplanes defined by the linear equations gᵢ(x) = gⱼ(x) for the two categories with the highest posterior probabilities.

34 Discriminant Functions for the Normal Density
• The decision surfaces gᵢ(x) = gⱼ(x) can also be rewritten as
  wᵀ(x − x₀) = 0, with w = μᵢ − μⱼ and
  x₀ = ½(μᵢ + μⱼ) − [σ² / ‖μᵢ − μⱼ‖²] ln[P(ωᵢ)/P(ωⱼ)] (μᵢ − μⱼ)
• This hyperplane passes through x₀ and is orthogonal to w.
• By the structure of w, the hyperplane dividing Rᵢ and Rⱼ is orthogonal to the line linking the means.
• If P(ωᵢ) = P(ωⱼ), then x₀ = ½(μᵢ + μⱼ). This means that x₀ lies halfway between the means. Otherwise x₀ shifts away from the more likely mean.
• However, if σ² is small relative to the squared distance ‖μᵢ − μⱼ‖², the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.
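A minimal sketch of the Case 1 linear machine: with Σᵢ = σ²I every class reduces to a linear discriminant gᵢ(x) = wᵢᵀx + wᵢ₀, and the two-class boundary is the hyperplane through x₀ orthogonal to μ₁ − μ₂. The means, shared variance, and priors below are made-up illustrative values.

import numpy as np

# Illustrative two-class, two-feature setup (Sigma_i = sigma^2 I for all classes)
sigma2 = 1.5
means  = np.array([[0.0, 0.0],
                   [3.0, 2.0]])           # mu_1, mu_2
priors = np.array([0.6, 0.4])             # P(omega_1), P(omega_2)

# Linear discriminants: g_i(x) = w_i . x + w_i0
W  = means / sigma2                                        # w_i = mu_i / sigma^2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0                        # g_i(x) for every class
    return np.argmax(g) + 1               # decide the class with the largest g_i

# The boundary g_1(x) = g_2(x) is a hyperplane through x0, orthogonal to mu_1 - mu_2
w  = means[0] - means[1]
x0 = 0.5 * (means[0] + means[1]) - sigma2 / np.dot(w, w) * np.log(priors[0] / priors[1]) * w

print(classify(np.array([1.0, 0.5])))     # a point near mu_1 -> class 1
print(classify(np.array([2.5, 2.0])))     # a point near mu_2 -> class 2
print(x0)                                 # point the separating hyperplane passes through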
35 Discriminant Function with the Normal Density
• Recall: gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)
• Case 2: Σᵢ = Σ. A simple case arises when the covariance matrices for all of the classes are identical but otherwise arbitrary.
• The terms (d/2) ln 2π and ½ ln |Σᵢ| are the same for every class and hence superfluous in this case, so
  gᵢ(x) = −½ (x − μᵢ)ᵀ Σ⁻¹ (x − μᵢ) + ln P(ωᵢ)
• If the priors are the same for all classes, the Mahalanobis distance term alone is good enough for classification.
• Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster for the i-th class being centered about the mean vector μᵢ.
• Expanding the quadratic form and dropping the xᵀΣ⁻¹x term (which is the same for all i), we can rewrite the discriminant as a linear function:
  gᵢ(x) = wᵢᵀx + wᵢ₀, where wᵢ = Σ⁻¹μᵢ and wᵢ₀ = −½ μᵢᵀΣ⁻¹μᵢ + ln P(ωᵢ)

36 Discriminant Function with the Normal Density: Case 2 Derivation
• For the decision boundary between Rᵢ and Rⱼ we have:
  wᵢᵀx + wᵢ₀ = wⱼᵀx + wⱼ₀
  (wᵢ − wⱼ)ᵀx + (wᵢ₀ − wⱼ₀) = 0
  (Σ⁻¹(μᵢ − μⱼ))ᵀ x + (wᵢ₀ − wⱼ₀) = 0
• Now consider:
  wᵢ₀ − wⱼ₀ = −½ μᵢᵀΣ⁻¹μᵢ + ln P(ωᵢ) + ½ μⱼᵀΣ⁻¹μⱼ − ln P(ωⱼ)
            = −½ (μᵢ − μⱼ)ᵀ Σ⁻¹ (μᵢ + μⱼ) + ln[P(ωᵢ)/P(ωⱼ)]
• Substituting back, the boundary can finally be written as
  wᵀ(x − x₀) = 0, with w = Σ⁻¹(μᵢ − μⱼ) and
  x₀ = ½(μᵢ + μⱼ) − [ln(P(ωᵢ)/P(ωⱼ)) / ((μᵢ − μⱼ)ᵀΣ⁻¹(μᵢ − μⱼ))] (μᵢ − μⱼ)

37 Discriminant Function with the Normal Density: Case 2 Visualization
[Figure: decision boundaries for classes sharing an arbitrary covariance matrix; the separating hyperplane passes through x₀ but is generally not orthogonal to the line between the means.]

38 Discriminant Function with the Normal Density
• Case 3: Σᵢ arbitrary. The covariance matrices differ for each category, and the discriminant functions are inherently quadratic:
  gᵢ(x) = xᵀWᵢx + wᵢᵀx + wᵢ₀,
  where Wᵢ = −½ Σᵢ⁻¹, wᵢ = Σᵢ⁻¹μᵢ, and wᵢ₀ = −½ μᵢᵀΣᵢ⁻¹μᵢ − ½ ln |Σᵢ| + ln P(ωᵢ)
• The resulting decision surfaces are hyperquadrics.

39 Discriminant Function with the Normal Density
• Even in one dimension, for arbitrary covariance the decision regions need not be simply connected.
[Figure: two one-dimensional Gaussians with unequal variances; the decision region for one class splits into two disjoint intervals.]

40 Discriminant Function with the Normal Density
[Figure: further examples of decision boundaries for the general case of arbitrary Σᵢ.]

41 Error Probabilities
• There are two ways in which a classification error can occur: either an observation x falls in R₂ and the true state of nature is ω₁, or x falls in R₁ and the true state of nature is ω₂. Since these events are mutually exclusive and exhaustive, the probability of error is
  P(error) = P(x ∈ R₂, ω₁) + P(x ∈ R₁, ω₂)
           = ∫_{R₂} p(x|ω₁) P(ω₁) dx + ∫_{R₁} p(x|ω₂) P(ω₂) dx
• Because the decision point x* (and hence the regions R₁ and R₂) were chosen arbitrarily for that figure, the probability of error is not as small as it might be. In particular, the triangular area marked “reducible error” can be eliminated if the decision boundary is moved to x_B.

42 Probability of Correctness in the Multicategory Case
• In the multicategory case it is simpler to work with the probability of being correct:
  P(correct) = ∑ᵢ₌₁..c P(x ∈ Rᵢ, ωᵢ) = ∑ᵢ₌₁..c ∫_{Rᵢ} p(x|ωᵢ) P(ωᵢ) dx
• The Bayes classifier maximizes this probability by choosing the regions so that the integrand is maximal for every x.

43 Signal Detection Theory and Operating Characteristics
• We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal x in the detector has mean μ₁ when the pulse is absent and μ₂ when it is present.
• The detector uses a threshold x* to determine the presence of the pulse.
• Because of random noise, within and outside the detector itself, the actual value is a random variable. We assume the distributions are normal with different means but the same variance, i.e., p(x|ωᵢ) ~ N(μᵢ, σ²).
• Discriminability: a measure of the ease of discriminating whether the pulse is present or not, in a form independent of the choice of x*. It describes the inherent and unchangeable properties due to the noise and the strength of the external signal, but not the decision strategy (i.e., the actual choice of x*).
• The discriminability is defined as
  d′ = |μ₂ − μ₁| / σ
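A small sketch of the quantities in this signal-detection setup: the discriminability d′ and the hit and false-alarm rates produced by a given threshold x*, under the equal-variance Gaussian assumption. The means, σ, and x* are made-up illustrative values.

from scipy.stats import norm

# Illustrative values (not from the text): pulse absent ~ N(mu1, sigma^2),
# pulse present ~ N(mu2, sigma^2), decision threshold x*
mu1, mu2, sigma = 0.0, 2.0, 1.0
x_star = 1.2

d_prime = abs(mu2 - mu1) / sigma                       # discriminability d'

hit_rate         = 1 - norm.cdf(x_star, mu2, sigma)    # P(x > x* | pulse present)
false_alarm_rate = 1 - norm.cdf(x_star, mu1, sigma)    # P(x > x* | pulse absent)
miss_rate        = norm.cdf(x_star, mu2, sigma)        # P(x < x* | pulse present)
correct_reject   = norm.cdf(x_star, mu1, sigma)        # P(x < x* | pulse absent)

print(f"d' = {d_prime:.2f}")
print(f"hit = {hit_rate:.3f}, false alarm = {false_alarm_rate:.3f}")
# Sweeping x* and plotting (false_alarm_rate, hit_rate) pairs traces out the ROC curve.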
44 Receiver Operating Characteristic (ROC)
• Experimentally compute hit and false-alarm rates for a fixed x*.
• Changing x* changes the hit and false-alarm rates.
• A plot of hit rate versus false-alarm rate is called the ROC curve.
• Observe that the definition of the ROC curve does not depend on the underlying distributional assumption.
• In practice, the distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted.
• Vary a single control parameter of the decision rule and plot the resulting hit and false-alarm rates.

45 ROC Curve Implementation*

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Learn to predict each class against the others
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curve of a single class
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange', lw=lw,
         label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute the macro-average AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

46 Bayes Decision Theory: Discrete Features
• The features can just as well be discrete: binary, ternary, or higher integer valued.
• The probability density function then becomes singular; the Bayes formula involves probabilities rather than probability densities, and integrals are replaced by sums:
  P(ωⱼ|x) = P(x|ωⱼ) P(ωⱼ) / P(x), with P(x) = ∑ⱼ₌₁..c P(x|ωⱼ) P(ωⱼ)
• However, the conditional risk remains the same, and the basic rule for minimizing the error rate is unchanged: decide ωᵢ if P(ωᵢ|x) > P(ωⱼ|x) for all j ≠ i.
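A minimal sketch of the Bayes rule with a discrete feature, where the class-conditional densities become probability tables and the evidence is a sum rather than an integral. The priors and the conditional probability table are made-up illustrative numbers.

import numpy as np

# Hypothetical setup: one ternary feature x in {0, 1, 2}, two classes.
priors = np.array([0.6, 0.4])                 # P(omega_1), P(omega_2)
# P(x | omega_j): rows = classes, columns = feature values (each row sums to 1)
P_x_given_w = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

def posterior(x):
    joint = P_x_given_w[:, x] * priors        # P(x|omega_j) P(omega_j)
    return joint / joint.sum()                # divide by P(x) = sum_j P(x|omega_j) P(omega_j)

for x in range(3):
    post = posterior(x)
    print(f'x = {x}: P(omega|x) = {np.round(post, 3)}, decide omega_{np.argmax(post) + 1}')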
47 Independent Binary Features
• As an example of a classification problem involving discrete features, consider the two-category problem in which the components of the feature vector are binary-valued and conditionally independent.
• So each feature gives us a yes/no answer about the pattern.
• Let pᵢ = P(xᵢ = 1|ω₁) and qᵢ = P(xᵢ = 1|ω₂). Assuming conditional independence, we can conveniently write
  P(x|ω₁) = ∏ᵢ₌₁..d pᵢ^xᵢ (1 − pᵢ)^(1−xᵢ) and P(x|ω₂) = ∏ᵢ₌₁..d qᵢ^xᵢ (1 − qᵢ)^(1−xᵢ)

48 Independent Binary Features
• The likelihood ratio is
  P(x|ω₁) / P(x|ω₂) = ∏ᵢ₌₁..d (pᵢ/qᵢ)^xᵢ ((1 − pᵢ)/(1 − qᵢ))^(1−xᵢ)
• And the discriminant function is
  g(x) = ∑ᵢ₌₁..d [ xᵢ ln(pᵢ/qᵢ) + (1 − xᵢ) ln((1 − pᵢ)/(1 − qᵢ)) ] + ln [P(ω₁)/P(ω₂)]
• Since this is linear in x, we can write
  g(x) = ∑ᵢ₌₁..d wᵢ xᵢ + w₀

49 Independent Binary Features
• The weights are
  wᵢ = ln [ pᵢ(1 − qᵢ) / (qᵢ(1 − pᵢ)) ], i = 1, …, d
  w₀ = ∑ᵢ₌₁..d ln [ (1 − pᵢ)/(1 − qᵢ) ] + ln [P(ω₁)/P(ω₂)]
• Recall that we decide ω₁ if g(x) > 0 and ω₂ otherwise.
• The weight wᵢ indicates the relevance of a “yes” answer for feature xᵢ.
• If pᵢ = qᵢ, then xᵢ provides no information about the true state of nature.
• Otherwise, if pᵢ > qᵢ, a “yes” answer for xᵢ contributes wᵢ votes for ω₁; for fixed qᵢ < 1, wᵢ gets larger as pᵢ gets larger. If pᵢ < qᵢ, wᵢ is negative and a “yes” answer contributes |wᵢ| votes for ω₂.

50 Missing Features
• We let x = [x_g, x_b], where x_g represents the known or “good” features and x_b represents the “bad” ones, i.e., either unknown or missing. We seek the Bayes rule given the good features, and for that the posterior probabilities are needed. In terms of the good features the posteriors are
  P(ωᵢ|x_g) = p(ωᵢ, x_g) / p(x_g) = ∫ P(ωᵢ|x) p(x) dx_b / ∫ p(x) dx_b,
  i.e., we marginalize over all bad/missing features.
• Finally, we use the Bayes decision rule on the resulting posterior probabilities.

51 Noisy Features
• We assume we have uncorrupted (good) features x_g, as before, and a noise model, expressed as p(x_b|x_t). Here x_t denotes the true value of the observed x_b features, i.e., without the noise present; the x_b are observed instead of the true x_t. We assume that if x_t were known, x_b would be independent of ωᵢ and x_g. From this assumption we get
  p(ωᵢ, x_g, x_b, x_t) = P(ωᵢ|x_g, x_t) p(x_g, x_t) p(x_b|x_t)
• Now note that
  P(ωᵢ|x_g, x_b) = ∫ p(ωᵢ, x_g, x_b, x_t) dx_t / p(x_g, x_b)

52 Noisy Features
• Putting these together we obtain
  P(ωᵢ|x_g, x_b) = ∫ P(ωᵢ|x_g, x_t) p(x_g, x_t) p(x_b|x_t) dx_t / ∫ p(x_g, x_t) p(x_b|x_t) dx_t
• which we use as discriminant functions for classification in the manner dictated by Bayes.
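A minimal sketch of handling a missing feature by marginalizing it out of the posterior. For tractability the “bad” feature is treated as discrete here, so the integral over x_b becomes a sum; all probability tables and priors are made-up illustrative numbers.

import numpy as np

# Hypothetical model: two classes, feature vector x = (x_g, x_b), both features binary.
priors = np.array([0.5, 0.5])                      # P(omega_1), P(omega_2)

# P(x_g, x_b | omega_j), indexed as [class, x_g, x_b]; each 2x2 table sums to 1.
P_x_given_w = np.array([[[0.40, 0.30],
                         [0.20, 0.10]],
                        [[0.10, 0.10],
                         [0.30, 0.50]]])

def posterior_given_good(x_g):
    """P(omega_i | x_g): marginalize the missing feature x_b out of the joint."""
    # p(omega_i, x_g) = sum over x_b of P(x_g, x_b | omega_i) P(omega_i)
    joint = P_x_given_w[:, x_g, :].sum(axis=1) * priors
    return joint / joint.sum()                     # normalize by p(x_g)

for x_g in (0, 1):
    post = posterior_given_good(x_g)
    print(f'x_g = {x_g}: P(omega|x_g) = {np.round(post, 3)}, decide omega_{np.argmax(post) + 1}')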