Empirical Research Methods in Computer Science
Lecture 6, November 16, 2005
Noah Smith

Getting Empirical about Software
Example: given a file, is it text or binary?
- the file command
- if /the/ matches, then text; else binary

Getting Empirical about Software
Example: early spam filtering
- regular expressions: /viagra/
- email address
- originating IP address

Other reasons
[charts: Spam in 2006, Spam in 2005]
Code re-use: two programs may work in essentially the same way, but for entirely different applications.
Empirical techniques work!

Using Data
Data --(estimation; regression; learning; training)--> Model --(classification; decision)--> Action
This pipeline goes by many names: pattern classification, machine learning, statistical inference, ...

Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?

Using a Model
We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.
x --> Model --> P(spam | x), P(mail | x)
ŷ = spam if P(spam | x) ≥ P(mail | x); otherwise mail

Bayes Minimum-Error Decision Criterion
Decide yi if P(yi | x) > P(yj | x) for all j ≠ i.
(Pick the most likely y, given x.)

Example
X = number of /viagra/ matches, Y ∈ {spam, mail}
From data, estimate:
  P(spam | X > 0) = 0.99    P(mail | X > 0) = 0.01
  P(spam | X = 0) = 0.45    P(mail | X = 0) = 0.55
BDC: if X > 0 then spam, else mail.
What is the probability of error, given X > 0? Given X = 0?

Improving our view of the data
- Why just use m/viagra/?  Cialis, diet, stock, bank, ...
- Why just use {<50%, >50%}?  X could be a histogram of words!

Tradeoff
- simple features: limited descriptive power
- complex features: data sparseness

Problem
Need to estimate P(spam | x) for each x, and there are lots of word histograms!
- length ∈ {1, 2, 3, ...}
- |Vocabulary| = huge!
- number of documents: Σ_{length ≥ 1} |V|^length

"Data Sparseness"
You will never see every x, so you can't estimate distributions that condition on each x.
Not just in text: anything dealing with continuous variables, or just darn big sets.

Other simple examples
- Classify fish into {salmon, sea bass} by X = (Weight, Length).
- Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender).

Magic Trick
Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural.
- P(y): prior (how much mail is spam?)
- P(x | y): likelihood
  - P(x | spam) models what spam looks like
  - P(x | mail) models what mail looks like

Bayes' Rule
P(y | x) = P(x | y) P(y) / P(x)
- P(y | x): what we said the model must define
- P(x | y): likelihood (one distribution over complex observations per y)
- P(y): prior
- P(x): normalizes into a distribution, P(x) = Σ_{y'} P(y') P(x | y')

Example
P(spam) = 0.455, P(mail) = 0.545

  X                                   P(x | spam)   P(x | mail)
  known sender, >50% dict. words         .00           .70
  known sender, <50% dict. words         .01           .06
  unknown sender, >50% dict. words       .19           .24
  unknown sender, <50% dict. words       .80           .00

Resulting Classifier

  X                P(x | spam)  P(x | mail)  P(spam, x)  P(mail, x)  decision
                                             (× .455)    (× .545)
  known, >50%         .00          .70          .00         .38       mail
  known, <50%         .01          .06          .005        .03       mail
  unknown, >50%       .24          .24          .11         .13       mail
  unknown, <50%       .75          .00          .34         .00       spam
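To make the decision rule concrete, here is a minimal Python sketch of the Bayes minimum-error classifier implied by the table above; the names (PRIOR, LIKELIHOOD, classify) and the dictionary layout are illustrative assumptions, not code from the lecture.

    # Bayes minimum-error decision rule for the spam/mail example above.
    # PRIOR and LIKELIHOOD hold the numbers from the "Resulting Classifier" table;
    # the names and data layout are illustrative, not from the lecture.

    PRIOR = {"spam": 0.455, "mail": 0.545}

    # P(x | y) for the four message types (sender known?, dictionary-word rate)
    LIKELIHOOD = {
        "spam": {("known", ">50%"): 0.00, ("known", "<50%"): 0.01,
                 ("unknown", ">50%"): 0.24, ("unknown", "<50%"): 0.75},
        "mail": {("known", ">50%"): 0.70, ("known", "<50%"): 0.06,
                 ("unknown", ">50%"): 0.24, ("unknown", "<50%"): 0.00},
    }

    def classify(x):
        """Return the label y maximizing P(y | x), i.e. maximizing P(x | y) * P(y)."""
        joint = {y: LIKELIHOOD[y][x] * PRIOR[y] for y in PRIOR}
        return max(joint, key=joint.get)

    for x in LIKELIHOOD["spam"]:
        print(x, "->", classify(x))

Running it reproduces the decision column: mail for the first three rows and spam for (unknown, <50%).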
Possible improvement
P(spam) = 0.455, P(mail) = 0.545
Let X = (S, N, D):
- S ∈ {known sender, unknown sender}
- N = length in words
- D = # dictionary words
P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)

Modeling N and D
  p(n | y) = κ(y)^n (1 - κ(y))                      geometric, with parameter κ(y)
  p(d | n, y) = C(n, d) δ(y)^d (1 - δ(y))^(n - d)   binomial, with parameter δ(y)

Resulting Classifier

  X = (S, N, D)   P(x | spam)  P(x | mail)  P(spam, x)  P(mail, x)  decision
                                            (× .455)    (× .545)
  known, 1, 0
  known, 1, 1
  known, 2, 0
  ...

Old model vs. New model

                                   Old model   New model
  How many different x?                4           ∞
  How many degrees of freedom?
    P(y):                              2           2
    P(x | y):                          6           4
  Which is better?

Old model vs. New model
- The first model had a Boolean variable: "Are > 50% of the words in the dictionary?"
- The second model made an independence assumption about S and (D, N).

Graphical Models
[Diagrams. Old model: the prior predicts Y, and Y predicts the pair (S, rnd(D/N)) via P(x | y).
New model: the prior predicts Y; Y predicts S via P(s | y), N via the geometric, and D (given N) via the binomial.]

Generative Story
1. First, pick y: spam or mail? Use the prior, P(Y).
2. Given that it's spam, decide whether the sender is known. Use P(S | spam).
3. Given that it's spam, pick the length. Use the geometric.
4. Given spam and n, decide how many of the words are from the dictionary. Use the binomial.

Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm). Let
  P(x | y) = Π_{i=1}^{m} P(xi | y)

Naive Bayes: Graphical Model
[Diagram: Y, with an arrow from Y to each of X1, X2, X3, ..., Xm.]

Noisy Channel Models
- Y is produced by a source.
- Y is corrupted as it goes through a channel; it turns into X.
- Example: speech recognition.
- P(y) is the source model.
- P(x | y) is the channel model.
[Diagram: Y --> X]

Loss Functions
Some errors are more costly than others. Let cost(ŷ | y) be the cost of deciding ŷ when the truth is y:
- cost(spam | spam) = $0
- cost(mail | mail) = $0
- cost(mail | spam) = $1
- cost(spam | mail) = $100
What to do?

Risk
Conditional risk:
  R(y | x) = Σ_{y'} cost(y | y') P(y' | x)
Minimize expected loss by picking the y that minimizes R.
Minimizing error is a special case where cost(y | y) = $0 and cost(y | y') = $1 for y ≠ y'.

Risk

  X                P(x | spam)  P(x | mail)  P(spam | x)  P(mail | x)  R(spam | x)  R(mail | x)
  known, >50%         .00          .70           .00          1.00        $100          $0
  known, <50%         .01          .06           .02           .98        $98           $.02
  unknown, >50%       .24          .24           .46           .54        $54           $.46
  unknown, <50%       .75          .00          1.00           .00        $0            $1

Determinism and Randomness
If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?
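As a closing sketch, here is the same example under the minimum-risk decision rule from the Risk slides, again in Python. The cost matrix and posteriors are the numbers from the slides; the names (COST, POSTERIOR, conditional_risk, decide) are illustrative assumptions, not code from the lecture.

    # Minimum-risk (Bayes risk) decision rule from the Risk slides.
    # COST[d][y] is the cost of deciding d when the true label is y;
    # POSTERIOR holds P(y | x) from the Risk table. Names are illustrative.

    COST = {
        "spam": {"spam": 0.0, "mail": 100.0},   # calling real mail spam costs $100
        "mail": {"spam": 1.0, "mail": 0.0},     # letting spam through costs $1
    }

    POSTERIOR = {
        ("known", ">50%"):   {"spam": 0.00, "mail": 1.00},
        ("known", "<50%"):   {"spam": 0.02, "mail": 0.98},
        ("unknown", ">50%"): {"spam": 0.46, "mail": 0.54},
        ("unknown", "<50%"): {"spam": 1.00, "mail": 0.00},
    }

    def conditional_risk(decision, x):
        """R(decision | x) = sum over true labels y of cost(decision | y) * P(y | x)."""
        return sum(COST[decision][y] * POSTERIOR[x][y] for y in ("spam", "mail"))

    def decide(x):
        """Pick the decision with the smallest conditional risk."""
        return min(("spam", "mail"), key=lambda d: conditional_risk(d, x))

    for x in POSTERIOR:
        risks = {d: conditional_risk(d, x) for d in ("spam", "mail")}
        print(x, risks, "->", decide(x))

On these four rows the minimum-risk decisions happen to match the minimum-error ones, but with a posterior closer to the boundary (say P(spam | x) = 0.6) the $100 penalty would push the decision to mail. Note also that the rule itself is a deterministic argmin over decisions.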