BAYES DECISION THEORY (Chapter 2, DHS)

1 Introduction
• Bayes decision theory is a fundamental statistical approach to pattern classification.
• The approach quantifies the tradeoffs between various classification decisions using probabilities and the costs that accompany such decisions.
• Assumption: the decision problem is posed in probabilistic terms, and all relevant probability values are known.

2 Bayes Terms
• Suppose an observer watching fish arrive along the conveyor belt finds it hard to predict which type will emerge next; the sequence of fish types appears random.
• State of nature: ω = ω₁ if the fish is a sea bass, ω = ω₂ if it is a salmon. We need to describe ω probabilistically.
• Prior probability (or simply prior): P(ω₁) and P(ω₂) [we assume no other types of fish are relevant here].
• These priors reflect our prior knowledge of how likely we are to get a sea bass or a salmon before the fish actually appears. They might, for instance, depend on the time of year or the choice of fishing area.

3 Decision Rule
• Suppose for a moment that we are forced to decide which type of fish will appear next without being allowed to see it.
• The only information we are allowed to use is the value of the prior probabilities.
• Decide ω₁ if P(ω₁) > P(ω₂); otherwise decide ω₂.
• How well does this rule work?

4 Class-Conditional Density for the Decision Rule
• In most circumstances we are not asked to make decisions with so little information.
• In our example, we might for instance use a lightness measurement x to improve our classifier.
• Class-conditional probability density function: x is a continuous random variable whose distribution depends on the state of nature, written p(x|ω₁).
• This is the probability density function for x given that the state of nature is ω₁ (also sometimes called the state-conditional probability density).
• The difference between p(x|ω₁) and p(x|ω₂) describes the difference in lightness between the populations of sea bass and salmon.

5 Bayes Formula
• Posterior probability (the posterior is a function of the likelihood and the prior): if the measured lightness value is x, how does it influence our belief about the true state of nature?
  P(ωⱼ|x) = p(x|ωⱼ) P(ωⱼ) / p(x)
• In this two-class case the evidence is
  p(x) = p(x|ω₁) P(ω₁) + p(x|ω₂) P(ω₂)
• Bayes formula, expressed in English:
  posterior = (likelihood × prior) / evidence

6 Bayes Decision Rule
• We always choose the state of nature with the larger posterior value. We cannot always be right, so there is a probability of error:
  P(error|x) = P(ω₁|x) if we decide ω₂, and P(ω₂|x) if we decide ω₁
• Average (unconditional) error:
  P(error) = ∫ P(error|x) p(x) dx
• Bayes decision rule: decide ω₁ if P(ω₁|x) > P(ω₂|x); otherwise decide ω₂.
• The resulting probability of error is
  P(error|x) = min[ P(ω₁|x), P(ω₂|x) ]
• For each observation x, the Bayes decision rule minimizes the probability of error.

7 Role of Evidence
• The evidence is p(x). Recall: it is just a scale factor.
• The evidence p(x) can be viewed as a scale factor that guarantees the posterior probabilities sum to 1.
• Eliminating the evidence gives an equivalent decision rule: decide ω₁ if p(x|ω₁) P(ω₁) > p(x|ω₂) P(ω₂); otherwise decide ω₂.
• In the special case p(x|ω₁) = p(x|ω₂), the decision depends on the priors only.
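A minimal sketch of the two-class Bayes rule for the fish example. The priors and the Gaussian class-conditional lightness densities (means and standard deviations) below are made-up illustrative values, not numbers from the text.

import numpy as np
from scipy.stats import norm

# Hypothetical priors and class-conditional lightness densities (illustrative values only)
P_w1, P_w2 = 2/3, 1/3                    # P(omega_1) = sea bass, P(omega_2) = salmon
p_x_w1 = norm(loc=4.0, scale=1.0).pdf    # p(x|omega_1), assumed Gaussian
p_x_w2 = norm(loc=7.0, scale=1.2).pdf    # p(x|omega_2), assumed Gaussian

def posteriors(x):
    """Bayes formula: P(omega_j|x) = p(x|omega_j) P(omega_j) / p(x)."""
    joint1 = p_x_w1(x) * P_w1
    joint2 = p_x_w2(x) * P_w2
    evidence = joint1 + joint2           # p(x), the scale factor
    return joint1 / evidence, joint2 / evidence

def decide(x):
    """Bayes decision rule: pick the state of nature with the larger posterior."""
    post1, post2 = posteriors(x)
    label = 'omega_1 (sea bass)' if post1 > post2 else 'omega_2 (salmon)'
    return label, min(post1, post2)      # P(error|x) = min of the two posteriors

for x in (3.0, 5.5, 8.0):
    label, p_err = decide(x)
    print(f'x = {x}: decide {label}, P(error|x) = {p_err:.3f}')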
8 Bayesian Decision Theory – Continuous Features
• We now formalize the ideas just considered and generalize them in four ways:
  • by allowing the use of more than one feature
  • by allowing more than two states of nature
  • by allowing actions other than merely deciding the state of nature
  • by introducing a loss function that is more general than the probability of error

9 Bayesian Decision Theory – Continuous Features
• Allowing the use of more than one feature merely requires replacing the scalar x by the feature vector x, where x lies in a d-dimensional Euclidean space ℝᵈ called the feature space.
• Allowing more than two states of nature provides a useful generalization at only a small notational expense.
• Allowing actions other than classification primarily allows the possibility of “rejection”.
• Rejection: an input pattern is rejected when it is difficult to decide between two classes or the pattern is too noisy.
• The loss function states the cost of each action and is used to convert a probability determination into a decision.

10 Bayesian Decision Theory – Continuous Features
• The finite set of c states of nature (or categories): ω₁, …, ω_c
• The finite set of a possible actions: α₁, …, α_a
• The loss incurred for taking action αᵢ when the state of nature is ωⱼ: λ(αᵢ|ωⱼ). Remember: not all actions are equally costly!
• The probability density function for x conditioned on ωⱼ being the true state of nature: p(x|ωⱼ)
• Bayes formula:
  P(ωⱼ|x) = p(x|ωⱼ) P(ωⱼ) / p(x)
• The evidence is now:
  p(x) = ∑ⱼ₌₁..c p(x|ωⱼ) P(ωⱼ)

11 Conditional Risk
• Suppose that we observe a particular x and that we contemplate taking action αᵢ.
• If the true state of nature is ωⱼ, by definition we incur the loss λ(αᵢ|ωⱼ).
• Since P(ωⱼ|x) is the probability that the true state of nature is ωⱼ, the expected loss (or risk) associated with taking action αᵢ is
  R(αᵢ|x) = ∑ⱼ₌₁..c λ(αᵢ|ωⱼ) P(ωⱼ|x)
• Whenever we encounter a particular observation x, we can minimize our expected loss by selecting the action that minimizes the conditional risk.
• The Bayes decision procedure actually provides optimal performance on the overall risk.

12 Overall Risk
• A general decision rule is a function α(x) that tells us which action to take for every possible observation.
• In other words, for every x the decision function α(x) assumes one of the a values α₁, …, α_a.
• The overall risk is
  R = ∫ R(α(x)|x) p(x) dx
• But remember: we need to choose this decision rule with the objective of minimizing the risk of misclassification.
• The goal is therefore to choose α(x) so that R(α(x)|x) is as small as possible for every x; then the overall risk is minimized.

13 Bayes Risk
• Compute the conditional risk R(αᵢ|x) for i = 1, …, a and select the action αᵢ for which R(αᵢ|x) is minimum.
• The resulting minimum overall risk is called the Bayes risk, denoted R*, and is the best performance that can be achieved.
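A minimal sketch of selecting the minimum-conditional-risk action given posteriors and a loss matrix. The loss values and posterior values below are made-up numbers for illustration only (the third row shows a rejection action with a fixed cost, as mentioned above).

import numpy as np

# Hypothetical loss matrix: rows = actions alpha_i, columns = states omega_j,
# entry [i, j] = lambda(alpha_i | omega_j). Values are illustrative only.
loss = np.array([[0.0, 10.0],    # alpha_1: decide omega_1
                 [5.0,  0.0],    # alpha_2: decide omega_2
                 [1.0,  1.0]])   # alpha_3: reject (same cost under either state)

# Hypothetical posteriors P(omega_j | x) for some observed x
posterior = np.array([0.7, 0.3])

# Conditional risk R(alpha_i | x) = sum_j lambda(alpha_i | omega_j) P(omega_j | x)
cond_risk = loss @ posterior
best = np.argmin(cond_risk)      # Bayes rule: take the minimum-risk action

print('R(alpha_i | x) =', cond_risk)           # here: [3.0, 3.5, 1.0]
print('Bayes action: alpha_%d' % (best + 1))   # rejection wins for these numbers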
14 Two-Category Problem
• Here action α₁ corresponds to deciding that the true state of nature is ω₁, and action α₂ to deciding that it is ω₂.
• Let λᵢⱼ = λ(αᵢ|ωⱼ) be the loss incurred for deciding ωᵢ when the true state of nature is ωⱼ.
• The conditional risks can be written as:
  R(α₁|x) = λ₁₁ P(ω₁|x) + λ₁₂ P(ω₂|x)
  R(α₂|x) = λ₂₁ P(ω₁|x) + λ₂₂ P(ω₂|x)
• The fundamental rule is to decide ω₁ if R(α₁|x) < R(α₂|x).
• In terms of the posterior probabilities, we decide ω₁ if
  (λ₂₁ − λ₁₁) P(ω₁|x) > (λ₁₂ − λ₂₂) P(ω₂|x)
• Or, equivalently, using Bayes formula,
  (λ₂₁ − λ₁₁) p(x|ω₁) P(ω₁) > (λ₁₂ − λ₂₂) p(x|ω₂) P(ω₂)

15 Likelihood Ratio
• We can rewrite the above rule as: decide ω₁ if
  p(x|ω₁) / p(x|ω₂) > [(λ₁₂ − λ₂₂) / (λ₂₁ − λ₁₁)] · [P(ω₂) / P(ω₁)]
• We can consider p(x|ωⱼ) as a function of ωⱼ and treat it as a likelihood function.
• Thus the Bayes decision rule can be interpreted as calling for deciding ω₁ whenever the likelihood ratio exceeds a threshold value θ_a that is independent of the observation x.

16 Minimum Error Rate Classification
• Zero-one loss: λ(αᵢ|ωⱼ) = 0 if i = j, and 1 if i ≠ j. This loss function assigns no loss to a correct decision and a unit loss to any error; thus all errors are equally costly.
• The risk corresponding to this loss function is precisely the average probability of error, since the conditional risk becomes
  R(αᵢ|x) = ∑ⱼ≠ᵢ P(ωⱼ|x) = 1 − P(ωᵢ|x),
  i.e., one minus the conditional probability that action αᵢ is correct.

17 Minimum Error-Rate Solution
• The Bayes decision rule to minimize risk calls for selecting the action that minimizes the conditional risk.
• With the zero-one loss function we therefore obtain a decision rule that minimizes the probability of error: decide ωᵢ if P(ωᵢ|x) > P(ωⱼ|x) for all j ≠ i.

18 Minimum Error-Rate Solution
• Recall: we choose class ω₁ if the following holds:
  p(x|ω₁) / p(x|ω₂) > [(λ₁₂ − λ₂₂) / (λ₂₁ − λ₁₁)] · [P(ω₂) / P(ω₁)], say = θ_a
• Now for the 0-1 loss function, with λ₁₂ = λ₂₁ = 1 and λ₁₁ = λ₂₂ = 0, the threshold involves only the priors:
  θ_a = P(ω₂) / P(ω₁)
• If instead λ₁₂ = 2, λ₂₁ = 1, λ₁₁ = λ₂₂ = 0, then
  θ_b = 2 P(ω₂) / P(ω₁)

19 Minimum Error-Rate Solution
[Figure: the likelihood ratio p(x|ω₁)/p(x|ω₂) plotted against x, with the thresholds θ_a and θ_b defining the corresponding decision regions.]

20 Classifiers, Discriminant Functions and Decision Surfaces
• There are many different ways to represent classifiers or decision rules; one of the most useful is in terms of discriminant functions gᵢ(x), i = 1, …, c.
• The multi-category case: assign x to class ωᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
• The Bayes classifier can be represented in this way, but the choice of discriminant function is not unique:
  gᵢ(x) = −R(αᵢ|x)
• For minimum error rate we can take
  gᵢ(x) = P(ωᵢ|x), or equivalently gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)

21 Classifiers, Discriminant Functions and Decision Surfaces
• A decision rule partitions the feature space into c decision regions R₁, …, R_c. A sample is in Rᵢ if gᵢ(x) > gⱼ(x) for all j ≠ i.
• Two-category case:
  • Here the classifier is a “dichotomizer” with two discriminant functions g₁ and g₂.
  • Combine them into a single function: g(x) = g₁(x) − g₂(x).
  • Decide ω₁ if g(x) > 0, else decide ω₂.
  • For example, g(x) = P(ω₁|x) − P(ω₂|x), or equivalently
    g(x) = ln [p(x|ω₁) / p(x|ω₂)] + ln [P(ω₁) / P(ω₂)]

23 The Normal Density
• The normal density is analytically tractable.
• It is a continuous density with two parameters (mean and variance).
• A number of processes are asymptotically Gaussian (central limit theorem).
• Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted (noisy) versions of a single typical or prototype pattern.
• Univariate form, with mean μ and variance σ²:
  p(x) = (1 / (√(2π) σ)) exp[ −(x − μ)² / (2σ²) ]

24 The Normal Density, Implementation

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# define constants
mu = 998.8
sigma = 73.10
x1 = 900
x2 = 1100

# calculate the z-transform of the spec limits
z1 = (x1 - mu) / sigma
z2 = (x2 - mu) / sigma

x = np.arange(z1, z2, 0.001)       # range of x in spec
x_all = np.arange(-10, 10, 0.001)  # entire range of x, both in and out of spec

# mean = 0, stddev = 1, since the z-transform was applied
y = norm.pdf(x, 0, 1)
y2 = norm.pdf(x_all, 0, 1)

plt.style.use('fivethirtyeight')
fig, ax = plt.subplots(figsize=(9, 6))
ax.plot(x_all, y2)
ax.fill_between(x, y, 0, alpha=0.3)  # shade the in-spec region
plt.show()

25 Multivariate Normal Density
• The multivariate normal density in d dimensions is
  p(x) = (1 / ((2π)^(d/2) |Σ|^(1/2))) exp[ −½ (x − μ)ᵀ Σ⁻¹ (x − μ) ]
  where x ∈ ℝᵈ, the mean μ ∈ ℝᵈ, and the covariance matrix Σ ∈ ℝ^(d×d).
• We have μ = E[x] and Σ = E[(x − μ)(x − μ)ᵀ].
• The covariance matrix is symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive.
• The multivariate normal density is completely specified by d + d(d + 1)/2 parameters (the elements of μ and the independent elements of Σ).
• For the i-th component xᵢ of x: μᵢ = E[xᵢ] and σᵢⱼ = E[(xᵢ − μᵢ)(xⱼ − μⱼ)].
• What happens when, for some i and j, xᵢ and xⱼ are statistically independent? (Then σᵢⱼ = 0.)
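A small sketch that evaluates the multivariate normal density directly from the formula above and cross-checks it against scipy.stats.multivariate_normal. The mean vector, covariance matrix, and query point are arbitrary illustrative values.

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters (not from the text)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])   # symmetric, positive definite
x = np.array([0.5, 2.5])

d = len(mu)
Sigma_inv = np.linalg.inv(Sigma)
det = np.linalg.det(Sigma)

# p(x) = (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x - mu)^T Sigma^{-1} (x - mu))
quad = (x - mu) @ Sigma_inv @ (x - mu)     # the quadratic form
p_manual = np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * det)

p_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(p_manual, p_scipy)                   # the two values should agree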
26 Multivariate Normal Density
• Linear combinations of jointly normally distributed random variables, independent or not, are normally distributed.
• For A ∈ ℝ^(d×k) and y = Aᵀx, where y is a k-component vector:
  p(y) ~ N(Aᵀμ, AᵀΣA)
• If k = 1, so that y = aᵀx is a scalar, y is the projection of x onto a line in the direction a, and aᵀΣa is the variance of that projection.
• In general, then, knowledge of the covariance matrix allows us to calculate the dispersion of the data in any direction, or in any subspace.
• It is sometimes convenient to perform a coordinate transformation that converts an arbitrary multivariate normal distribution into a spherical one, i.e., one having a covariance matrix proportional to the identity matrix I.
• Whitening transform: let Φ be the matrix whose columns are the orthonormal eigenvectors of Σ, and Λ the diagonal matrix of the corresponding eigenvalues. With
  A_w = Φ Λ^(−1/2)
  the transformed distribution has a covariance matrix equal to the identity matrix.

28 Multivariate Gaussian Density
• From the definition of the multivariate normal density function, the loci of points of constant density are hyperellipsoids on which the quadratic form
  (x − μ)ᵀ Σ⁻¹ (x − μ)
  is constant.
• The principal axes of these hyperellipsoids are given by the eigenvectors of Σ (the columns of Φ).
• The eigenvalues (the diagonal entries of Λ) determine the lengths of these axes.
• Mahalanobis distance r between x and μ:
  r² = (x − μ)ᵀ Σ⁻¹ (x − μ)

30 Reference for Implementation
• https://docs.scipy.org/doc//numpy-1.10.0/reference/generated/numpy.random.multivariate_normal.html
• Draws random samples from a multivariate normal distribution.
• The multivariate normal, multinormal or Gaussian distribution is a generalization of the one-dimensional normal distribution to higher dimensions. Such a distribution is specified by its mean and covariance matrix. These parameters are analogous to the mean (average or “center”) and variance (standard deviation, or “width,” squared) of the one-dimensional normal distribution.
• numpy.random.multivariate_normal(mean, cov[, size])
• mean: 1-D array_like of length N, the mean of the N-dimensional distribution.
• cov: 2-D array_like of shape (N, N), the covariance matrix of the distribution. It must be symmetric and positive semidefinite for proper sampling.
• out: ndarray, the drawn samples, of shape `size` if that was provided; if not, the shape is (N,).
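A small sketch of the whitening transform A_w = Φ Λ^(−1/2): draw samples with numpy, whiten them, and check that the sample covariance becomes approximately the identity; the Mahalanobis distance then reduces to a Euclidean distance in the whitened coordinates. The mean and covariance values are arbitrary illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (not from the text)
mu = np.array([1.0, -2.0])
Sigma = np.array([[4.0, 1.5],
                  [1.5, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=20000)   # samples, shape (n, d)

# Eigendecomposition of Sigma: columns of Phi are orthonormal eigenvectors,
# eigvals holds the corresponding eigenvalues (diagonal of Lambda).
eigvals, Phi = np.linalg.eigh(Sigma)
A_w = Phi @ np.diag(eigvals ** -0.5)                 # whitening transform A_w = Phi Lambda^(-1/2)

Y = (X - mu) @ A_w                                   # y = A_w^T (x - mu), row-vector form
print(np.cov(Y.T))                                   # approximately the identity matrix

# Mahalanobis distance r^2 = (x - mu)^T Sigma^{-1} (x - mu) equals the squared
# Euclidean norm of the whitened vector.
x = np.array([3.0, 0.0])
r2 = (x - mu) @ np.linalg.inv(Sigma) @ (x - mu)
print(r2, np.sum(((x - mu) @ A_w) ** 2))             # the two values should agree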
31 Multivariate Normal Density, Implementation

import numpy as np
from matplotlib import cm
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal
from mpl_toolkits.mplot3d import Axes3D

# Parameters to set
mu_x = 0
variance_x = 2
mu_y = 0
variance_y = 10

# Create grid and multivariate normal
x = np.linspace(-10, 10, 200)
y = np.linspace(-10, 10, 200)
X, Y = np.meshgrid(x, y)
pos = np.empty(X.shape + (2,))
pos[:, :, 0] = X
pos[:, :, 1] = Y
rv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])

# Make a 3D plot
fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') is deprecated in recent matplotlib
ax.plot_surface(X, Y, rv.pdf(pos), cmap=cm.coolwarm, linewidth=0)
ax.set_xlabel('X axis')
ax.set_ylabel('Y axis')
ax.set_zlabel('Z axis')
plt.show()

32 Discriminant Functions for the Normal Density
• Minimum error-rate classification can be achieved using the discriminant functions
  gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)
• In the case of multivariate normal densities we assume p(x|ωᵢ) ~ N(μᵢ, Σᵢ).
• Thus we have
  gᵢ(x) = −½ (x − μᵢ)ᵀ Σᵢ⁻¹ (x − μᵢ) − (d/2) ln 2π − ½ ln |Σᵢ| + ln P(ωᵢ)

33 Discriminant Functions for the Normal Density
• Case 1: Σᵢ = σ²I. The features are statistically independent and each feature has the same variance σ². In this case the covariance matrix is diagonal (i.e., samples fall within equal-size hyperspherical clusters).
• We have |Σᵢ| = σ^(2d) and Σᵢ⁻¹ = (1/σ²) I. The resulting discriminant function is
  gᵢ(x) = −‖x − μᵢ‖² / (2σ²) + ln P(ωᵢ)
• Expanded form:
  gᵢ(x) = −(1/(2σ²)) [xᵀx − 2μᵢᵀx + μᵢᵀμᵢ] + ln P(ωᵢ)
  The quadratic term xᵀx is the only term that prevents gᵢ(·) from being linear. However, it is the same for all i, so for the purpose of deciding i it acts as an additive constant and can be dropped.
• Therefore an equivalent linear form is
  gᵢ(x) = wᵢᵀx + wᵢ₀, where wᵢ = μᵢ/σ² and wᵢ₀ = −μᵢᵀμᵢ/(2σ²) + ln P(ωᵢ)
• Linear machine: the decision surfaces are pieces of hyperplanes defined by the linear equations gᵢ(x) = gⱼ(x) for the two categories with the highest posterior probabilities.

34 Discriminant Functions for the Normal Density
• The decision surfaces gᵢ(x) = gⱼ(x) can also be rewritten as
  wᵀ(x − x₀) = 0, with w = μᵢ − μⱼ and
  x₀ = ½(μᵢ + μⱼ) − [σ² / ‖μᵢ − μⱼ‖²] ln[P(ωᵢ)/P(ωⱼ)] (μᵢ − μⱼ)
• This hyperplane passes through x₀ and is orthogonal to w.
• By the structure of w, the hyperplane dividing Rᵢ and Rⱼ is orthogonal to the line linking the means.
• If P(ωᵢ) = P(ωⱼ), then x₀ = ½(μᵢ + μⱼ). This means that x₀ lies halfway between the means. Otherwise x₀ shifts away from the more likely mean.
• However, if σ² is small relative to the squared distance ‖μᵢ − μⱼ‖², the position of the decision boundary is relatively insensitive to the exact values of the prior probabilities.
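A minimal sketch of the Case 1 linear machine: with Σᵢ = σ²I every class reduces to a linear discriminant gᵢ(x) = wᵢᵀx + wᵢ₀, and the two-class boundary is the hyperplane through x₀ orthogonal to μ₁ − μ₂. The means, shared variance, and priors below are made-up illustrative values.

import numpy as np

# Illustrative two-class, two-feature setup (Sigma_i = sigma^2 I for all classes)
sigma2 = 1.5
means  = np.array([[0.0, 0.0],
                   [3.0, 2.0]])           # mu_1, mu_2
priors = np.array([0.6, 0.4])             # P(omega_1), P(omega_2)

# Linear discriminants: g_i(x) = w_i . x + w_i0
W  = means / sigma2                                        # w_i = mu_i / sigma^2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0                        # g_i(x) for every class
    return np.argmax(g) + 1               # decide the class with the largest g_i

# The boundary g_1(x) = g_2(x) is a hyperplane through x0, orthogonal to mu_1 - mu_2
w  = means[0] - means[1]
x0 = 0.5 * (means[0] + means[1]) - sigma2 / np.dot(w, w) * np.log(priors[0] / priors[1]) * w

print(classify(np.array([1.0, 0.5])))     # a point near mu_1 -> class 1
print(classify(np.array([2.5, 2.0])))     # a point near mu_2 -> class 2
print(x0)                                 # point the separating hyperplane passes through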
35 Discriminant Function with the Normal Density
• Recall: gᵢ(x) = ln p(x|ωᵢ) + ln P(ωᵢ)
• Case 2: Σᵢ = Σ. A simple case arises when the covariance matrices for all of the classes are identical but otherwise arbitrary.
• The terms (d/2) ln 2π and ½ ln |Σᵢ| are the same for every class and hence superfluous in this case, so
  gᵢ(x) = −½ (x − μᵢ)ᵀ Σ⁻¹ (x − μᵢ) + ln P(ωᵢ)
• If the priors are the same for all classes, the Mahalanobis distance term alone is good enough for classification.
• Geometrically, this corresponds to the situation in which the samples fall in hyperellipsoidal clusters of equal size and shape, the cluster for the i-th class being centered about the mean vector μᵢ.
• Expanding the quadratic form and dropping the xᵀΣ⁻¹x term (which is the same for all i), we can rewrite the discriminant as a linear function:
  gᵢ(x) = wᵢᵀx + wᵢ₀, where wᵢ = Σ⁻¹μᵢ and wᵢ₀ = −½ μᵢᵀΣ⁻¹μᵢ + ln P(ωᵢ)

36 Discriminant Function with the Normal Density: Case 2 Derivation
• For the decision boundary between Rᵢ and Rⱼ we have:
  wᵢᵀx + wᵢ₀ = wⱼᵀx + wⱼ₀
  (wᵢ − wⱼ)ᵀx + (wᵢ₀ − wⱼ₀) = 0
  (Σ⁻¹(μᵢ − μⱼ))ᵀ x + (wᵢ₀ − wⱼ₀) = 0
• Now consider:
  wᵢ₀ − wⱼ₀ = −½ μᵢᵀΣ⁻¹μᵢ + ln P(ωᵢ) + ½ μⱼᵀΣ⁻¹μⱼ − ln P(ωⱼ)
            = −½ (μᵢ − μⱼ)ᵀ Σ⁻¹ (μᵢ + μⱼ) + ln[P(ωᵢ)/P(ωⱼ)]
• Substituting back, the boundary can finally be written as
  wᵀ(x − x₀) = 0, with w = Σ⁻¹(μᵢ − μⱼ) and
  x₀ = ½(μᵢ + μⱼ) − [ln(P(ωᵢ)/P(ωⱼ)) / ((μᵢ − μⱼ)ᵀΣ⁻¹(μᵢ − μⱼ))] (μᵢ − μⱼ)

37 Discriminant Function with the Normal Density: Case 2 Visualization
[Figure: decision boundaries for classes sharing an arbitrary covariance matrix; the separating hyperplane passes through x₀ but is generally not orthogonal to the line between the means.]

38 Discriminant Function with the Normal Density
• Case 3: Σᵢ arbitrary. The covariance matrices differ for each category, and the discriminant functions are inherently quadratic:
  gᵢ(x) = xᵀWᵢx + wᵢᵀx + wᵢ₀,
  where Wᵢ = −½ Σᵢ⁻¹, wᵢ = Σᵢ⁻¹μᵢ, and wᵢ₀ = −½ μᵢᵀΣᵢ⁻¹μᵢ − ½ ln |Σᵢ| + ln P(ωᵢ)
• The resulting decision surfaces are hyperquadrics.

39 Discriminant Function with the Normal Density
• Even in one dimension, for arbitrary covariance the decision regions need not be simply connected.
[Figure: two one-dimensional Gaussians with unequal variances; the decision region for one class splits into two disjoint intervals.]

40 Discriminant Function with the Normal Density
[Figure: further examples of decision boundaries for the general case of arbitrary Σᵢ.]

41 Error Probabilities
• There are two ways in which a classification error can occur: either an observation x falls in R₂ and the true state of nature is ω₁, or x falls in R₁ and the true state of nature is ω₂. Since these events are mutually exclusive and exhaustive, the probability of error is
  P(error) = P(x ∈ R₂, ω₁) + P(x ∈ R₁, ω₂)
           = ∫_{R₂} p(x|ω₁) P(ω₁) dx + ∫_{R₁} p(x|ω₂) P(ω₂) dx
• Because the decision point x* (and hence the regions R₁ and R₂) were chosen arbitrarily for that figure, the probability of error is not as small as it might be. In particular, the triangular area marked “reducible error” can be eliminated if the decision boundary is moved to x_B.

42 Probability of Correctness in the Multicategory Case
• In the multicategory case it is simpler to work with the probability of being correct:
  P(correct) = ∑ᵢ₌₁..c P(x ∈ Rᵢ, ωᵢ) = ∑ᵢ₌₁..c ∫_{Rᵢ} p(x|ωᵢ) P(ωᵢ) dx
• The Bayes classifier maximizes this probability by choosing the regions so that the integrand is maximal for every x.

43 Signal Detection Theory and Operating Characteristics
• We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal x in the detector has mean μ₁ when the pulse is absent and μ₂ when it is present.
• The detector uses a threshold x* to determine the presence of the pulse.
• Because of random noise, within and outside the detector itself, the actual value is a random variable. We assume the distributions are normal with different means but the same variance, i.e., p(x|ωᵢ) ~ N(μᵢ, σ²).
• Discriminability: a measure of the ease of discriminating whether the pulse is present or not, in a form independent of the choice of x*. It describes the inherent and unchangeable properties due to the noise and the strength of the external signal, but not the decision strategy (i.e., the actual choice of x*).
• The discriminability is defined as
  d′ = |μ₂ − μ₁| / σ
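A small sketch of the quantities in this signal-detection setup: the discriminability d′ and the hit and false-alarm rates produced by a given threshold x*, under the equal-variance Gaussian assumption. The means, σ, and x* are made-up illustrative values.

from scipy.stats import norm

# Illustrative values (not from the text): pulse absent ~ N(mu1, sigma^2),
# pulse present ~ N(mu2, sigma^2), decision threshold x*
mu1, mu2, sigma = 0.0, 2.0, 1.0
x_star = 1.2

d_prime = abs(mu2 - mu1) / sigma                       # discriminability d'

hit_rate         = 1 - norm.cdf(x_star, mu2, sigma)    # P(x > x* | pulse present)
false_alarm_rate = 1 - norm.cdf(x_star, mu1, sigma)    # P(x > x* | pulse absent)
miss_rate        = norm.cdf(x_star, mu2, sigma)        # P(x < x* | pulse present)
correct_reject   = norm.cdf(x_star, mu1, sigma)        # P(x < x* | pulse absent)

print(f"d' = {d_prime:.2f}")
print(f"hit = {hit_rate:.3f}, false alarm = {false_alarm_rate:.3f}")
# Sweeping x* and plotting (false_alarm_rate, hit_rate) pairs traces out the ROC curve.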
44 Receiver Operating Characteristic (ROC)
• Experimentally compute hit and false-alarm rates for a fixed x*.
• Changing x* changes the hit and false-alarm rates.
• A plot of hit rate versus false-alarm rate is called the ROC curve.
• Observe that the definition of the ROC curve does not depend on the underlying distributional assumption.
• In practice, the distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted.
• Vary a single control parameter of the decision rule and plot the resulting hit and false-alarm rates.

45 ROC Curve Implementation*

import numpy as np
import matplotlib.pyplot as plt
from itertools import cycle
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier

# Import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(0)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# Shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5, random_state=0)

# Learn to predict each class against the others
classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,
                                         random_state=random_state))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)

# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

# Plot the ROC curve of a single class
plt.figure()
lw = 2
plt.plot(fpr[2], tpr[2], color='darkorange', lw=lw,
         label='ROC curve (area = %0.2f)' % roc_auc[2])
plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic example')
plt.legend(loc="lower right")
plt.show()

# First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(n_classes)]))

# Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(n_classes):
    mean_tpr += np.interp(all_fpr, fpr[i], tpr[i])

# Finally average it and compute the macro-average AUC
mean_tpr /= n_classes
fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

# Plot all ROC curves
plt.figure()
plt.plot(fpr["micro"], tpr["micro"],
         label='micro-average ROC curve (area = {0:0.2f})'.format(roc_auc["micro"]),
         color='deeppink', linestyle=':', linewidth=4)
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
colors = cycle(['aqua', 'darkorange', 'cornflowerblue'])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Some extension of Receiver operating characteristic to multi-class')
plt.legend(loc="lower right")
plt.show()

* https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html

46 Bayes Decision Theory: Discrete Features
• The features can just as well be discrete: binary, ternary, or higher integer valued.
• The probability density function then becomes singular; the Bayes formula involves probabilities rather than probability densities, and integrals are replaced by sums:
  P(ωⱼ|x) = P(x|ωⱼ) P(ωⱼ) / P(x), with P(x) = ∑ⱼ₌₁..c P(x|ωⱼ) P(ωⱼ)
• However, the conditional risk remains the same, and the basic rule for minimizing the error rate is unchanged: decide ωᵢ if P(ωᵢ|x) > P(ωⱼ|x) for all j ≠ i.
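A minimal sketch of the Bayes rule with a discrete feature, where the class-conditional densities become probability tables and the evidence is a sum rather than an integral. The priors and the conditional probability table are made-up illustrative numbers.

import numpy as np

# Hypothetical setup: one ternary feature x in {0, 1, 2}, two classes.
priors = np.array([0.6, 0.4])                 # P(omega_1), P(omega_2)
# P(x | omega_j): rows = classes, columns = feature values (each row sums to 1)
P_x_given_w = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])

def posterior(x):
    joint = P_x_given_w[:, x] * priors        # P(x|omega_j) P(omega_j)
    return joint / joint.sum()                # divide by P(x) = sum_j P(x|omega_j) P(omega_j)

for x in range(3):
    post = posterior(x)
    print(f'x = {x}: P(omega|x) = {np.round(post, 3)}, decide omega_{np.argmax(post) + 1}')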
47 Independent Binary Features
• As an example of a classification problem involving discrete features, consider the two-category problem in which the components of the feature vector are binary-valued and conditionally independent.
• So each feature gives us a yes/no answer about the pattern.
• Let pᵢ = P(xᵢ = 1|ω₁) and qᵢ = P(xᵢ = 1|ω₂). Assuming conditional independence, we can conveniently write
  P(x|ω₁) = ∏ᵢ₌₁..d pᵢ^xᵢ (1 − pᵢ)^(1−xᵢ) and P(x|ω₂) = ∏ᵢ₌₁..d qᵢ^xᵢ (1 − qᵢ)^(1−xᵢ)

48 Independent Binary Features
• The likelihood ratio is
  P(x|ω₁) / P(x|ω₂) = ∏ᵢ₌₁..d (pᵢ/qᵢ)^xᵢ ((1 − pᵢ)/(1 − qᵢ))^(1−xᵢ)
• And the discriminant function is
  g(x) = ∑ᵢ₌₁..d [ xᵢ ln(pᵢ/qᵢ) + (1 − xᵢ) ln((1 − pᵢ)/(1 − qᵢ)) ] + ln [P(ω₁)/P(ω₂)]
• Since this is linear in x, we can write
  g(x) = ∑ᵢ₌₁..d wᵢ xᵢ + w₀

49 Independent Binary Features
• The weights are
  wᵢ = ln [ pᵢ(1 − qᵢ) / (qᵢ(1 − pᵢ)) ], i = 1, …, d
  w₀ = ∑ᵢ₌₁..d ln [ (1 − pᵢ)/(1 − qᵢ) ] + ln [P(ω₁)/P(ω₂)]
• Recall that we decide ω₁ if g(x) > 0 and ω₂ otherwise.
• The weight wᵢ indicates the relevance of a “yes” answer for feature xᵢ.
• If pᵢ = qᵢ, then xᵢ provides no information about the true state of nature.
• Otherwise, if pᵢ > qᵢ, a “yes” answer for xᵢ contributes wᵢ votes for ω₁; for fixed qᵢ < 1, wᵢ gets larger as pᵢ gets larger. If pᵢ < qᵢ, wᵢ is negative and a “yes” answer contributes |wᵢ| votes for ω₂.

50 Missing Features
• We let x = [x_g, x_b], where x_g represents the known or “good” features and x_b represents the “bad” ones, i.e., either unknown or missing. We seek the Bayes rule given the good features, and for that the posterior probabilities are needed. In terms of the good features the posteriors are
  P(ωᵢ|x_g) = p(ωᵢ, x_g) / p(x_g) = ∫ P(ωᵢ|x) p(x) dx_b / ∫ p(x) dx_b,
  i.e., we marginalize over all bad/missing features.
• Finally, we use the Bayes decision rule on the resulting posterior probabilities.

51 Noisy Features
• We assume we have uncorrupted (good) features x_g, as before, and a noise model, expressed as p(x_b|x_t). Here x_t denotes the true value of the observed x_b features, i.e., without the noise present; the x_b are observed instead of the true x_t. We assume that if x_t were known, x_b would be independent of ωᵢ and x_g. From this assumption we get
  p(ωᵢ, x_g, x_b, x_t) = P(ωᵢ|x_g, x_t) p(x_g, x_t) p(x_b|x_t)
• Now note that
  P(ωᵢ|x_g, x_b) = ∫ p(ωᵢ, x_g, x_b, x_t) dx_t / p(x_g, x_b)

52 Noisy Features
• Putting these together we obtain
  P(ωᵢ|x_g, x_b) = ∫ P(ωᵢ|x_g, x_t) p(x_g, x_t) p(x_b|x_t) dx_t / ∫ p(x_g, x_t) p(x_b|x_t) dx_t
• which we use as discriminant functions for classification in the manner dictated by Bayes.
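A minimal sketch of handling a missing feature by marginalizing it out of the posterior. For tractability the “bad” feature is treated as discrete here, so the integral over x_b becomes a sum; all probability tables and priors are made-up illustrative numbers.

import numpy as np

# Hypothetical model: two classes, feature vector x = (x_g, x_b), both features binary.
priors = np.array([0.5, 0.5])                      # P(omega_1), P(omega_2)

# P(x_g, x_b | omega_j), indexed as [class, x_g, x_b]; each 2x2 table sums to 1.
P_x_given_w = np.array([[[0.40, 0.30],
                         [0.20, 0.10]],
                        [[0.10, 0.10],
                         [0.30, 0.50]]])

def posterior_given_good(x_g):
    """P(omega_i | x_g): marginalize the missing feature x_b out of the joint."""
    # p(omega_i, x_g) = sum over x_b of P(x_g, x_b | omega_i) P(omega_i)
    joint = P_x_given_w[:, x_g, :].sum(axis=1) * priors
    return joint / joint.sum()                     # normalize by p(x_g)

for x_g in (0, 1):
    post = posterior_given_good(x_g)
    print(f'x_g = {x_g}: P(omega|x_g) = {np.round(post, 3)}, decide omega_{np.argmax(post) + 1}')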