Empirical Research Methods in Computer Science
Lecture 7, November 30, 2005
Noah Smith

Using Data
Data --(estimation; regression; learning; training)--> Model --(classification; decision)--> Action
Also known as: pattern classification, machine learning, statistical inference, ...

Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
Two questions:
1. Where do models come from?
2. If we have a model, how do we use it?

Using a Model
We want to classify a message x as spam or mail: y ∈ {spam, mail}.
Feed x into the model, which produces P(spam | x) and P(mail | x). Then predict
  ŷ = spam if P(spam | x) ≥ P(mail | x), otherwise mail.

Bayes' Rule
  P(y | x) = P(x | y) P(y) / P(x)
- P(x | y): the likelihood, one distribution over (complex) observations per y
- P(y): the prior
- P(y | x): what we said the model must define
- P(x) = Σ_{y'} P(y') P(x | y') normalizes the numerator into a distribution

Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm). Let
  P(x | y) = Π_{i=1..m} P(x_i | y)

Naive Bayes: Graphical Model
[Figure: graphical model with Y as the parent of X1, X2, X3, ..., Xm.]

Part II
Where do the model parameters come from?

Using Data
Data --(estimation; regression; learning; training)--> Model --> Action

Warning
This is a HUGE topic. We will barely scratch the surface.

Forms of Models
Recall that a model defines P(x | y) and P(y).
These can have a simple multinomial form, like
  P(mail) = 0.545, P(spam) = 0.455,
or they can take some other form, like a binomial, Gaussian, etc.

Example: Gaussian
Suppose y ∈ {male, female}, and one observed variable is H, height.
  P(H | male) ~ N(μ_m, σ_m²)
  P(H | female) ~ N(μ_f, σ_f²)
How do we estimate μ_m, σ_m², μ_f, σ_f²?

Maximum Likelihood
Pick the model that makes the data as likely as possible:
  max_model P(data | model)

Maximum Likelihood (Gaussian)
Estimating the parameters μ_m, σ_m², μ_f, σ_f² can be seen as
- fitting the data
- estimating an underlying statistic (a point estimate)
With #males the number of training examples with y_i = male:
  μ̂_m = (1 / #males) · Σ_{i: y_i = male} h_i
  σ̂_m² = (1 / (#males − 1)) · Σ_{i: y_i = male} (h_i − μ̂_m)²

Using the model
[Figure: plot of the fitted densities p(H | male) and p(H | female) against height H.]

Using the model
[Figure: plot of the resulting posteriors P(male | H) and P(female | H) against height H.]

Example: Regression
Suppose y is actual runtime and x is input length.
Regression tries to predict some continuous variables from others.

Regression
Linear regression: assume a linear relationship and fit a line.
We can turn this into a model!

Linear Model
Given x, predict y:
  y = β₁x + β₀ + N(0, σ²)
Here β₁x + β₀ is the true regression line and the N(0, σ²) term is a random deviation.

Principle of Least Squares
Minimize the sum of squared vertical deviations from the fitted line.
Unique, closed-form solution!

Other kinds of regression
- transform one or both variables (e.g., take a log)
- polynomial regression (least squares still reduces to a linear system)
- multivariate regression
- logistic regression

Example: text categorization
Bag-of-words model: x is a histogram of counts for all words, and y is a topic.
  P(x | y) = Π_w p_uni(w | y)^count(w; x)

MLE for Multinomials
"Count and Normalize":
  p̂_uni(w | y) = count(w; training) / count(*; training)
(counts taken over the training documents with label y)

The Truth about MLE
You will never see all the words, so for many models MLE isn't safe.
To understand why, consider a typical evaluation scenario.

Evaluation
Train your model on some data.
How good is the model? Test on different data that the system never saw before.
Why?

Tradeoff
- A model that overfits the training data doesn't generalize.
- A model that is too simple has low variance but low accuracy.
Testing on held-out data shows where your model sits on this tradeoff.

Text categorization again
Suppose 'v1@gra' never appeared in any document in training, ever.
  P(x | y) = Π_w p_uni(w | y)^count(w; x)
What is the above probability for a new document containing 'v1@gra' at test time?
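The answer is zero: the MLE assigns p_uni('v1@gra' | y) = 0, so the whole product collapses. Below is a minimal sketch of "count and normalize" and of that failure mode; the function names and toy documents are made up for illustration and are not from the lecture.

```python
from collections import Counter

def mle_unigram(docs):
    """MLE for a unigram/multinomial model: count(w) / count(*)."""
    counts = Counter(w for doc in docs for w in doc.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def doc_likelihood(doc, p_uni):
    """P(x | y) = product over words of p_uni(w | y)^count(w; x)."""
    prob = 1.0
    for w in doc.split():
        prob *= p_uni.get(w, 0.0)   # an unseen word contributes a factor of 0
    return prob

# Toy training data for the 'spam' class (made up for illustration).
spam_train = ["buy cheap meds now", "cheap meds cheap meds"]
p_spam = mle_unigram(spam_train)

print(doc_likelihood("cheap meds", p_spam))    # positive
print(doc_likelihood("cheap v1@gra", p_spam))  # 0.0, because 'v1@gra' was never seen
```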
Solutions
- Regularization: prefer less extreme parameters.
- Smoothing: "flatten out" the distribution.
- Bayesian estimation: construct a prior over model parameters, then train to maximize P(data | model) × P(model).

One More Point
Building models is not the only way to be empirical.
- Neural networks, SVMs, instance-based learning
MLE and smoothed/Bayesian estimation are not the only ways to estimate.
- Minimize error, for example ("discriminative" estimation)

Assignment 3
Spam detection:
- We provide a few thousand examples.
- Perform EDA and pick features.
- Estimate probabilities.
- Build a Naive Bayes classifier.
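To connect the pieces, here is a minimal sketch of a bag-of-words Naive Bayes classifier with add-alpha (Laplace) smoothing. The class name, toy examples, and the choice of add-one smoothing are illustrative assumptions, not the required assignment solution.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Bag-of-words Naive Bayes: ŷ = argmax_y P(y) · Π_w p(w | y)^count(w; x)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # add-alpha smoothing (alpha = 1 is Laplace / add-one)

    def fit(self, docs, labels):
        self.labels = set(labels)
        self.prior = Counter(labels)              # class counts for P(y)
        self.word_counts = defaultdict(Counter)   # per-class word counts
        for doc, y in zip(docs, labels):
            self.word_counts[y].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_score(self, doc, y):
        # log P(y) + Σ_w count(w; x) · log p(w | y), the unnormalized log-posterior
        counts = self.word_counts[y]
        denom = sum(counts.values()) + self.alpha * len(self.vocab)
        score = math.log(self.prior[y] / sum(self.prior.values()))
        for w in doc.split():
            score += math.log((counts[w] + self.alpha) / denom)
        return score

    def predict(self, doc):
        return max(self.labels, key=lambda y: self.log_score(doc, y))

# Toy usage (made-up examples, stand-ins for the provided data):
nb = NaiveBayes().fit(
    ["buy cheap meds now", "meeting at noon", "cheap v1@gra deals", "lunch tomorrow?"],
    ["spam", "mail", "spam", "mail"])
print(nb.predict("cheap meds"))    # -> 'spam'
print(nb.predict("noon meeting"))  # -> 'mail'
```

Working in log space avoids numerical underflow on long documents, and the smoothing term keeps unseen words like 'v1@gra' from zeroing out the score.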