Empirical Research Methods in Computer Science
Lecture 6, November 16, 2005
Noah Smith

Getting Empirical about Software
Example: given a file, is it text or binary?
- the file command
- if /the/ matches, then text; else binary

Getting Empirical about Software
Example: early spam filtering
- regular expressions: /viagra/
- email address
- originating IP address

Other reasons
[charts: Spam in 2006, Spam in 2005]
Code re-use: two programs may work in essentially the same way, but for entirely different applications.
Empirical techniques work!

Using Data
Data --(estimation; regression; learning; training)--> Model --(classification; decision)--> Action
This pipeline goes by many names: pattern classification, machine learning, statistical inference, ...

Probabilistic Models
Let X and Y be random variables (continuous, discrete, structured, ...).
Goal: predict Y from X.
A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?

Using a Model
We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.
x --> Model --> P(spam | x), P(mail | x)
ŷ = spam if P(spam | x) ≥ P(mail | x); otherwise mail

Bayes Minimum-Error Decision Criterion
Decide yi if P(yi | x) > P(yj | x) for all j ≠ i.
(Pick the most likely y, given x.)

Example
X = number of /viagra/ matches, Y ∈ {spam, mail}
From data, estimate:
  P(spam | X > 0) = 0.99    P(mail | X > 0) = 0.01
  P(spam | X = 0) = 0.45    P(mail | X = 0) = 0.55
BDC: if X > 0 then spam, else mail.
What is the probability of error, given X > 0? Given X = 0?

Improving our view of the data
- Why just use m/viagra/?  Cialis, diet, stock, bank, ...
- Why just use {<50%, >50%}?  X could be a histogram of words!

Tradeoff
- simple features: limited descriptive power
- complex features: data sparseness

Problem
Need to estimate P(spam | x) for each x, and there are lots of word histograms!
- length ∈ {1, 2, 3, ...}
- |Vocabulary| = huge!
- number of documents: Σ_{length ≥ 1} |V|^length

"Data Sparseness"
You will never see every x, so you can't estimate distributions that condition on each x.
Not just in text: anything dealing with continuous variables, or just darn big sets.

Other simple examples
- Classify fish into {salmon, sea bass} by X = (Weight, Length).
- Classify people into {undergrad, grad, professor} by X = (Age, Hair-length, Gender).

Magic Trick
Often, P(y | x) is hard, but P(x | y) and P(y) are easier to get, and more natural.
- P(y): prior (how much mail is spam?)
- P(x | y): likelihood
  - P(x | spam) models what spam looks like
  - P(x | mail) models what mail looks like

Bayes' Rule
P(y | x) = P(x | y) P(y) / P(x)
- P(y | x): what we said the model must define
- P(x | y): likelihood (one distribution over complex observations per y)
- P(y): prior
- P(x): normalizes into a distribution, P(x) = Σ_{y'} P(y') P(x | y')

Example
P(spam) = 0.455, P(mail) = 0.545

  X                                   P(x | spam)   P(x | mail)
  known sender, >50% dict. words         .00           .70
  known sender, <50% dict. words         .01           .06
  unknown sender, >50% dict. words       .19           .24
  unknown sender, <50% dict. words       .80           .00

Resulting Classifier

  X                P(x | spam)  P(x | mail)  P(spam, x)  P(mail, x)  decision
                                             (× .455)    (× .545)
  known, >50%         .00          .70          .00         .38       mail
  known, <50%         .01          .06          .005        .03       mail
  unknown, >50%       .24          .24          .11         .13       mail
  unknown, <50%       .75          .00          .34         .00       spam
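To make the decision rule concrete, here is a minimal Python sketch of the Bayes minimum-error classifier implied by the table above; the names (PRIOR, LIKELIHOOD, classify) and the dictionary layout are illustrative assumptions, not code from the lecture.

    # Bayes minimum-error decision rule for the spam/mail example above.
    # PRIOR and LIKELIHOOD hold the numbers from the "Resulting Classifier" table;
    # the names and data layout are illustrative, not from the lecture.

    PRIOR = {"spam": 0.455, "mail": 0.545}

    # P(x | y) for the four message types (sender known?, dictionary-word rate)
    LIKELIHOOD = {
        "spam": {("known", ">50%"): 0.00, ("known", "<50%"): 0.01,
                 ("unknown", ">50%"): 0.24, ("unknown", "<50%"): 0.75},
        "mail": {("known", ">50%"): 0.70, ("known", "<50%"): 0.06,
                 ("unknown", ">50%"): 0.24, ("unknown", "<50%"): 0.00},
    }

    def classify(x):
        """Return the label y maximizing P(y | x), i.e. maximizing P(x | y) * P(y)."""
        joint = {y: LIKELIHOOD[y][x] * PRIOR[y] for y in PRIOR}
        return max(joint, key=joint.get)

    for x in LIKELIHOOD["spam"]:
        print(x, "->", classify(x))

Running it reproduces the decision column: mail for the first three rows and spam for (unknown, <50%).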
Possible improvement
P(spam) = 0.455, P(mail) = 0.545
Let X = (S, N, D):
- S ∈ {known sender, unknown sender}
- N = length in words
- D = # dictionary words
P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)

Modeling N and D
  p(n | y) = κ(y)^n (1 - κ(y))                      geometric, with parameter κ(y)
  p(d | n, y) = C(n, d) δ(y)^d (1 - δ(y))^(n - d)   binomial, with parameter δ(y)

Resulting Classifier

  X = (S, N, D)   P(x | spam)  P(x | mail)  P(spam, x)  P(mail, x)  decision
                                            (× .455)    (× .545)
  known, 1, 0
  known, 1, 1
  known, 2, 0
  ...

Old model vs. New model

                                   Old model   New model
  How many different x?                4           ∞
  How many degrees of freedom?
    P(y):                              2           2
    P(x | y):                          6           4
  Which is better?

Old model vs. New model
- The first model had a Boolean variable: "Are > 50% of the words in the dictionary?"
- The second model made an independence assumption about S and (D, N).

Graphical Models
[Diagrams. Old model: the prior predicts Y, and Y predicts the pair (S, rnd(D/N)) via P(x | y).
New model: the prior predicts Y; Y predicts S via P(s | y), N via the geometric, and D (given N) via the binomial.]

Generative Story
1. First, pick y: spam or mail? Use the prior, P(Y).
2. Given that it's spam, decide whether the sender is known. Use P(S | spam).
3. Given that it's spam, pick the length. Use the geometric.
4. Given spam and n, decide how many of the words are from the dictionary. Use the binomial.

Naive Bayes Models
Suppose X = (X1, X2, X3, ..., Xm). Let
  P(x | y) = Π_{i=1}^{m} P(xi | y)

Naive Bayes: Graphical Model
[Diagram: Y, with an arrow from Y to each of X1, X2, X3, ..., Xm.]

Noisy Channel Models
- Y is produced by a source.
- Y is corrupted as it goes through a channel; it turns into X.
- Example: speech recognition.
- P(y) is the source model.
- P(x | y) is the channel model.
[Diagram: Y --> X]

Loss Functions
Some errors are more costly than others. Let cost(ŷ | y) be the cost of deciding ŷ when the truth is y:
- cost(spam | spam) = $0
- cost(mail | mail) = $0
- cost(mail | spam) = $1
- cost(spam | mail) = $100
What to do?

Risk
Conditional risk:
  R(y | x) = Σ_{y'} cost(y | y') P(y' | x)
Minimize expected loss by picking the y that minimizes R.
Minimizing error is a special case where cost(y | y) = $0 and cost(y | y') = $1 for y ≠ y'.

Risk

  X                P(x | spam)  P(x | mail)  P(spam | x)  P(mail | x)  R(spam | x)  R(mail | x)
  known, >50%         .00          .70           .00          1.00        $100          $0
  known, <50%         .01          .06           .02           .98        $98           $.02
  unknown, >50%       .24          .24           .46           .54        $54           $.46
  unknown, <50%       .75          .00          1.00           .00        $0            $1

Determinism and Randomness
If we build a classifier from a model, and use a Bayes decision rule to make decisions, is the algorithm randomized, or is it deterministic?
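As a closing sketch, here is the same example under the minimum-risk decision rule from the Risk slides, again in Python. The cost matrix and posteriors are the numbers from the slides; the names (COST, POSTERIOR, conditional_risk, decide) are illustrative assumptions, not code from the lecture.

    # Minimum-risk (Bayes risk) decision rule from the Risk slides.
    # COST[d][y] is the cost of deciding d when the true label is y;
    # POSTERIOR holds P(y | x) from the Risk table. Names are illustrative.

    COST = {
        "spam": {"spam": 0.0, "mail": 100.0},   # calling real mail spam costs $100
        "mail": {"spam": 1.0, "mail": 0.0},     # letting spam through costs $1
    }

    POSTERIOR = {
        ("known", ">50%"):   {"spam": 0.00, "mail": 1.00},
        ("known", "<50%"):   {"spam": 0.02, "mail": 0.98},
        ("unknown", ">50%"): {"spam": 0.46, "mail": 0.54},
        ("unknown", "<50%"): {"spam": 1.00, "mail": 0.00},
    }

    def conditional_risk(decision, x):
        """R(decision | x) = sum over true labels y of cost(decision | y) * P(y | x)."""
        return sum(COST[decision][y] * POSTERIOR[x][y] for y in ("spam", "mail"))

    def decide(x):
        """Pick the decision with the smallest conditional risk."""
        return min(("spam", "mail"), key=lambda d: conditional_risk(d, x))

    for x in POSTERIOR:
        risks = {d: conditional_risk(d, x) for d in ("spam", "mail")}
        print(x, risks, "->", decide(x))

On these four rows the minimum-risk decisions happen to match the minimum-error ones, but with a posterior closer to the boundary (say P(spam | x) = 0.6) the $100 penalty would push the decision to mail. Note also that the rule itself is a deterministic argmin over decisions.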