CS 188 Discussion, Wed April 4, 2007

Classification
- Predict labels (classes) for inputs.
- Eg. spam detection (input: email, classes: spam / not spam); OCR (input: image, classes: characters).
- Data: inputs x, class labels c.

Basic Setup
- Training data: lots of <x, c> pairs.
- Feature extractors: obtain Fi, attributes of example x.
- Test data: more x's; predict c.

Bayes Nets for Classification
c = argmax_c P(c | F1, ..., Fn)
Example: x is an input image.
- Fij is 1 or 0 depending on whether the pixel intensity at (i, j) is > 0.5 (may not be the best choice of features).
- Features can be something more complex, eg. the height and width of the non-blank region, or how many horizontal lines the image contains.
- c can be one of {0, 1, 2, ..., 9}.

Naive Bayes
Features (effects) are independent given the class:
P(C | F1 ... Fn) = α P(C) Π_i P(Fi | C)
We only need to specify how each feature depends on the class (a prediction sketch appears at the end of these notes).
What do we need to learn (what are the parameters of the network)? How do we learn them from the training data?

Maximum a posteriori (MAP) hypothesis
- θ is a parameter in the network, eg. P(C = c1) or P(F1 = f1,1 | C = c1).
- Need to have some idea about the prior distribution of the parameter.

Maximum likelihood (ML) hypothesis
Assume P(θ) is uniform; MAP then reduces to ML.
ML parameter learning steps:
1. Write down an expression for the likelihood of the data as a function of the parameter.
2. Take the logarithm and differentiate with respect to each parameter.
3. Find the parameter values at which the derivatives are 0.
Example: from the Spring 2002 final, available from http://www.cs.berkeley.edu/~russell/classes/cs188/f05/

Generalization and Overfitting
ML is easier but can have the problem of overfitting.
Eg. if I flip a coin once and it comes up heads, what is P(heads)? How about 10 flips with 8 heads? How about 10M flips with 8M heads?
Idea: small data -> trust the prior; large data -> trust the data.
Laplace smoothing: assume we have seen every outcome one extra time (see the worked estimates at the end of these notes).
Eg. for part (a) of the example above, add 1 to both p and n; for part (b), append entries for all possible values of X1, X2, Y (not so useful in this case).

Generative vs. Discriminative
- Generative (eg. Naive Bayes): build a causal model of the variables, then query the model for causes given evidence.
- Discriminative (eg. perceptron): no causal model, no Bayes rule, no probabilities (mostly). Try to predict the output directly. Mistake-driven rather than model-driven.

Binary Perceptron
If the activation is positive, output 1; if negative, output 0.
Eg. How to represent AND?
i1  i2  y
0   0   0
0   1   0
1   0   0
1   1   1
Eg. w01 = 0.5, w02 = 0.5, w0b = -0.9 (weight on the bias input).

Geometric Interpretation
Treat inputs as points in a high-dimensional space, and find a hyperplane that separates the points into two groups. This works perfectly if the data is "linearly separable".
Note: the activation is a summation Σ_i wi fi, i.e. a dot product w · f.

Multiclass Perceptron
If we have more than 2 classes, keep a weight vector for each class, calculate an activation for each class, and the highest activation wins.

Perceptron Update Rules
- Start with 0 weights.
- Try to classify each example.
- If correct, no change.
- If wrong, lower the score of the wrong answer and raise the score of the right answer (c* is the correct answer): subtract the feature vector from the wrong class's weight vector and add it to c*'s weight vector (sketch at the end of these notes).
This is not the ideal update rule: a better update rule, used together with the idea presented below, leads to one of the state-of-the-art classifiers, the SVM (beyond the scope of this class).

Non-Linear Separators
The original feature space can always be mapped into a higher-dimensional space where it is linearly separable.
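
Code sketches for several of the points above follow. First, the pixel features Fij described in the Bayes-nets-for-classification section. This is a minimal sketch only; the image format (a 2D list of intensities in [0, 1]) and the function name are assumptions, not from the notes.

```python
def extract_pixel_features(image):
    """Return a dict mapping (i, j) -> 1 if the pixel intensity at (i, j) is > 0.5, else 0."""
    features = {}
    for i, row in enumerate(image):
        for j, intensity in enumerate(row):
            features[(i, j)] = 1 if intensity > 0.5 else 0
    return features

# Tiny 2x2 "image" as an example.
print(extract_pixel_features([[0.9, 0.1], [0.4, 0.7]]))
# {(0, 0): 1, (0, 1): 0, (1, 0): 0, (1, 1): 1}
```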
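
Next, a minimal sketch of the Naive Bayes prediction rule c = argmax_c P(c) Π_i P(fi | c), computed in log space to avoid numerical underflow. The spam/ham classes, the feature names, and the probability tables are toy numbers chosen for illustration, not learned values.

```python
import math

prior = {"spam": 0.4, "ham": 0.6}                      # P(C)
cond = {                                               # P(Fi = 1 | C)
    "spam": {"free": 0.8, "meeting": 0.1},
    "ham":  {"free": 0.2, "meeting": 0.7},
}

def predict(features):
    """features: dict feature_name -> 0/1. Returns argmax_c P(c) * prod_i P(fi | c)."""
    best_class, best_score = None, float("-inf")
    for c in prior:
        score = math.log(prior[c])
        for f, value in features.items():
            p = cond[c][f] if value == 1 else 1.0 - cond[c][f]
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict({"free": 1, "meeting": 0}))   # -> "spam" with these toy numbers
```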
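
A worked instance of the three ML learning steps, for a single binary parameter θ = P(heads) with p heads and n tails observed. This specific likelihood is an illustrative choice, not the Spring 2002 exam problem referenced above.

```latex
L(\theta) = \theta^{p}(1-\theta)^{n}
\qquad
\log L(\theta) = p\log\theta + n\log(1-\theta)
\qquad
\frac{d}{d\theta}\log L(\theta) = \frac{p}{\theta} - \frac{n}{1-\theta} = 0
\;\Longrightarrow\;
\theta_{\mathrm{ML}} = \frac{p}{p+n}
```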
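
The coin example from the overfitting section, comparing the raw ML estimate with the Laplace-smoothed (add-one) estimate. The function names are assumptions for the sketch.

```python
def ml_estimate(heads, flips):
    """Maximum-likelihood estimate of P(heads): the relative frequency."""
    return heads / flips

def laplace_estimate(heads, flips, num_outcomes=2):
    """Add-one (Laplace) smoothed estimate: pretend every outcome was seen once more."""
    return (heads + 1) / (flips + num_outcomes)

# One flip, one head: ML says P(heads) = 1.0, which overfits.
print(ml_estimate(1, 1), laplace_estimate(1, 1))                    # 1.0  0.666...
# 8 heads out of 10:
print(ml_estimate(8, 10), laplace_estimate(8, 10))                  # 0.8  0.75
# 8M heads out of 10M: with large data the smoothed estimate trusts the data.
print(ml_estimate(8_000_000, 10_000_000),
      laplace_estimate(8_000_000, 10_000_000))                      # 0.8  ~0.7999998
```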
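
A quick check that the AND weights given above (w1 = 0.5, w2 = 0.5, bias weight = -0.9) reproduce the truth table under the "output 1 iff the activation is positive" rule.

```python
def perceptron_output(weights, bias_weight, inputs):
    """Binary perceptron: the activation is a dot product; output 1 if it is positive."""
    activation = bias_weight + sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > 0 else 0

weights, bias_weight = [0.5, 0.5], -0.9
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, perceptron_output(weights, bias_weight, [i1, i2]))
# Prints the AND truth table: only input (1, 1) yields output 1.
```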
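
A sketch of the multiclass perceptron update rule described above: one weight vector per class, predict the class with the highest activation, and on a mistake subtract the feature vector from the wrong class's weights and add it to the correct class c*'s weights. The dictionary-based feature representation and function names are assumptions for the sketch.

```python
from collections import defaultdict

def activation(weights, features):
    """Dot product between a class's weight vector and a feature vector (both dicts)."""
    return sum(weights[f] * v for f, v in features.items())

def train_multiclass_perceptron(data, classes, passes=10):
    """data: list of (features_dict, correct_class) pairs. Start with all-zero weights."""
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(passes):
        for features, c_star in data:
            guess = max(classes, key=lambda c: activation(weights[c], features))
            if guess != c_star:                   # mistake-driven: only update on errors
                for f, v in features.items():
                    weights[guess][f] -= v        # lower the score of the wrong answer
                    weights[c_star][f] += v       # raise the score of the right answer
    return weights
```

With linearly separable training data, this loop eventually stops making mistakes.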
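
Finally, a small illustration of the non-linear separator idea. One-dimensional points labeled by whether |x| < 1 cannot be split by any single threshold on the line, but after mapping each point to (x, x^2) the line x2 = 1 separates them. This particular mapping and labeling are illustrative choices, not from the notes.

```python
def lift(x):
    """Map a 1-D input into the 2-D feature space (x, x**2)."""
    return (x, x * x)

# Label is 1 when |x| < 1; not linearly separable in 1-D, but separable in (x, x**2) space.
points = [-2.0, -0.5, 0.3, 1.5]
for x in points:
    print(x, lift(x), 1 if abs(x) < 1 else 0)
```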