Empirical Research Methods in Computer Science
Lecture 7, November 30, 2005
Noah Smith

Using Data
Data --(estimation; regression; learning; training)--> Model --(classification; decision)--> Action
Also known as: pattern classification, machine learning, statistical inference, ...

Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
Two questions:
1. Where do models come from?
2. If we have a model, how do we use it?

Using a Model
We want to classify a message x as spam or mail: y ∈ {spam, mail}.
Feed x into the model, which produces P(spam | x) and P(mail | x). Then predict
  ŷ = spam if P(spam | x) ≥ P(mail | x), otherwise mail.

Bayes' Rule
  P(y | x) = P(x | y) P(y) / P(x)
- P(x | y): the likelihood, one distribution over (complex) observations per y
- P(y): the prior
- P(y | x): what we said the model must define
- P(x) = Σ_{y'} P(y') P(x | y') normalizes the numerator into a distribution

Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm). Let
  P(x | y) = Π_{i=1..m} P(x_i | y)

Naive Bayes: Graphical Model
[Figure: graphical model with Y as the parent of X1, X2, X3, ..., Xm.]

Part II
Where do the model parameters come from?

Using Data
Data --(estimation; regression; learning; training)--> Model --> Action

Warning
This is a HUGE topic. We will barely scratch the surface.

Forms of Models
Recall that a model defines P(x | y) and P(y).
These can have a simple multinomial form, like
  P(mail) = 0.545, P(spam) = 0.455,
or they can take some other form, like a binomial, Gaussian, etc.

Example: Gaussian
Suppose y ∈ {male, female}, and one observed variable is H, height.
  P(H | male) ~ N(μ_m, σ_m²)
  P(H | female) ~ N(μ_f, σ_f²)
How do we estimate μ_m, σ_m², μ_f, σ_f²?

Maximum Likelihood
Pick the model that makes the data as likely as possible:
  max_model P(data | model)

Maximum Likelihood (Gaussian)
Estimating the parameters μ_m, σ_m², μ_f, σ_f² can be seen as
- fitting the data
- estimating an underlying statistic (a point estimate)
With #males the number of training examples with y_i = male:
  μ̂_m = (1 / #males) · Σ_{i: y_i = male} h_i
  σ̂_m² = (1 / (#males − 1)) · Σ_{i: y_i = male} (h_i − μ̂_m)²

Using the model
[Figure: plot of the fitted densities p(H | male) and p(H | female) against height H.]

Using the model
[Figure: plot of the resulting posteriors P(male | H) and P(female | H) against height H.]

Example: Regression
Suppose y is actual runtime and x is input length.
Regression tries to predict some continuous variables from others.

Regression
Linear regression: assume a linear relationship and fit a line.
We can turn this into a model!

Linear Model
Given x, predict y:
  y = β₁x + β₀ + N(0, σ²)
Here β₁x + β₀ is the true regression line and the N(0, σ²) term is a random deviation.

Principle of Least Squares
Minimize the sum of squared vertical deviations from the fitted line.
Unique, closed-form solution!

Other kinds of regression
- transform one or both variables (e.g., take a log)
- polynomial regression (least squares still reduces to a linear system)
- multivariate regression
- logistic regression

Example: text categorization
Bag-of-words model: x is a histogram of counts for all words, and y is a topic.
  P(x | y) = Π_w p_uni(w | y)^count(w; x)

MLE for Multinomials
"Count and Normalize":
  p̂_uni(w | y) = count(w; training) / count(*; training)
(counts taken over the training documents with label y)

The Truth about MLE
You will never see all the words, so for many models MLE isn't safe.
To understand why, consider a typical evaluation scenario.

Evaluation
Train your model on some data.
How good is the model? Test on different data that the system never saw before.
Why?

Tradeoff
- A model that overfits the training data doesn't generalize.
- A model that is too simple has low variance but low accuracy.
Testing on held-out data shows where your model sits on this tradeoff.

Text categorization again
Suppose 'v1@gra' never appeared in any document in training, ever.
  P(x | y) = Π_w p_uni(w | y)^count(w; x)
What is the above probability for a new document containing 'v1@gra' at test time?
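The answer is zero: the MLE assigns p_uni('v1@gra' | y) = 0, so the whole product collapses. Below is a minimal sketch of "count and normalize" and of that failure mode; the function names and toy documents are made up for illustration and are not from the lecture.

```python
from collections import Counter

def mle_unigram(docs):
    """MLE for a unigram/multinomial model: count(w) / count(*)."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_likelihood(doc, p_uni):
    """P(x | y) = product over words of p_uni(w | y)^count(w; x)."""
    prob = 1.0
    for w in doc.split():
        prob *= p_uni.get(w, 0.0)   # an unseen word contributes a factor of 0
    return prob

# Toy training data for the 'spam' class (made up for illustration).
spam_train = ["buy cheap meds now", "cheap meds cheap meds"]
p_spam = mle_unigram(spam_train)

print(doc_likelihood("cheap meds", p_spam))    # positive
print(doc_likelihood("cheap v1@gra", p_spam))  # 0.0, because 'v1@gra' was never seen
```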
Solutions
- Regularization: prefer less extreme parameters.
- Smoothing: "flatten out" the distribution.
- Bayesian estimation: construct a prior over model parameters, then train to maximize P(data | model) × P(model).

One More Point
Building models is not the only way to be empirical.
- Neural networks, SVMs, instance-based learning
MLE and smoothed/Bayesian estimation are not the only ways to estimate.
- Minimize error, for example ("discriminative" estimation)

Assignment 3
Spam detection:
- We provide a few thousand examples.
- Perform EDA and pick features.
- Estimate probabilities.
- Build a Naive Bayes classifier.
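To connect the pieces, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-alpha (Laplace) smoothing. The class name, toy examples, and the choice of add-one smoothing are illustrative assumptions, not the required assignment solution.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Bag-of-words Naive Bayes: ŷ = argmax_y P(y) · Π_w p(w | y)^count(w; x)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing (alpha = 1 is Laplace / add-one)

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.prior = Counter(labels)              # class counts for P(y)
        self.word_counts = defaultdict(Counter)   # per-class word counts
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_score(self, doc, y):
        # log P(y) + Σ_w count(w; x) · log p(w | y), the unnormalized log-posterior
        counts = self.word_counts[y]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.prior[y] / sum(self.prior.values()))
        for w in doc.split():
            score += math.log((counts[w] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.labels, key=lambda y: self.log_score(doc, y))

# Toy usage (made-up examples, stand-ins for the provided data):
nb = NaiveBayes().fit(
    ["buy cheap meds now", "meeting at noon", "cheap v1@gra deals", "lunch tomorrow?"],
    ["spam", "mail", "spam", "mail"])
print(nb.predict("cheap meds"))    # -> 'spam'
print(nb.predict("noon meeting"))  # -> 'mail'
```

Working in log space avoids numerical underflow on long documents, and the smoothing term keeps unseen words like 'v1@gra' from zeroing out the score.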