How does a computer know what is spam and what is ham?

Attempt 1:

(define (spam? email)
  (cond ((email from known sender) False)
        ((email contains "viagra") True)
        ((email begins with "Dear Mr/Mrs.") True)
        ((email contains URL) True)
        ((email contains attachment) True)
        ...))

Problem: (email contains URL) is an indication, NOT a PROOF.

Features:                               Score:
email from known sender                   -50
email contains "viagra"                    75
email begins with "Dear Mr/Mrs."           70
email contains URL                         10
email contains attachment                   5
...                                       ...

If Total Sum > 100, classify as spam.

Problems:
- How to determine the score?
- How to combine the score?

Key Idea: Learn which features are important through examples.

Training Set: lots of emails with correct labels (both spam and ham)

The Naive Bayes Algorithm:

Step 1. Gather Statistics inside Training Set:
- Count percentage of spams in Training Set: P(spam)
- Count percentage of hams in Training Set: P(ham)
- For every feature F_1, F_2, F_3 ...
    - Count percentage of spams with feature F_i: P(F_i | spam)
    - Count percentage of hams with feature F_i: P(F_i | ham)

Say,
F_1 = email contains "viagra"
F_2 = email begins with "Dear Mr/Mrs."

From Training Set, we discovered:
P(spam) = 0.85                 P(ham) = 0.15
P(F_1 | spam) = 0.2            P(F_1 | ham) = 0.001
P(NOT F_1 | spam) = 0.8        P(NOT F_1 | ham) = 0.999
P(F_2 | spam) = 0.99           P(F_2 | ham) = 0.0001
P(NOT F_2 | spam) = 0.01       P(NOT F_2 | ham) = 0.9999

Step 2. On a new Instance:
- Find what features the new instance has
- Use Bayes' Rule to compute the probability of each label
- Take the most probable label

Example: Optical Character Recognition

GOAL: recognize scanned hand-written numbers

............................
......++++++................
......##############++......
......+++++##########+......
............+.+++++##+......
..................+##.......
.................+##+.......
.................+##+.......
................+##+........
................+#+.........
................##+.........
...............+#+..........
..............+##+..........
..............##+...........
.............###+...........
............+##+............
...........+##+.............
...........+##+.............
..........+###+.............
..........+###+.............
..........+##...............
............................

............................
............................
............+#..............
..........+###..............
.........+####+.............
.........+######+...........
........+###+####+..........
........+##..+####..........
........+#+...+##+..........
........+#+...###+..........
........+##+++####+.........
.........#####++##+.........
.........+###+..+##+........
..........+++....+#+........
.................+##........
..................+#+.......
..................+##+......
...................+#+......
...................+##+.....
....................+#+.....
.....................+#+....
......................#+....
............................

Instance – scanned image of hand-written number
Labels – 1,2,3,4,5,6,7,8,9
Features – (for project) every 2x2 pixel square

Steps.
- Turn image-file into a stream of Images (Abstract Data Type) (done for you)
- Gather feature statistics from Training File (mostly done for you)
- Implement Bayes' Rule (mostly your own work)
- Evaluate your OCR by guessing labels on Validation File (mostly done for you)
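
To make the "Implement Bayes' Rule" step concrete, here is a minimal sketch of Step 2 of the Naive Bayes algorithm, run on the spam statistics given earlier (P(spam) = 0.85, P(F_1 | spam) = 0.2, and so on). It is written in Python rather than the Scheme-like pseudocode of the slides, and the names PRIOR, LIKELIHOOD, posterior and classify are hypothetical, chosen only for illustration; the project's own data structures may look quite different. The "naive" assumption is that features are independent given the label, so their probabilities are simply multiplied.

```python
from math import prod

# Statistics copied from the spam example in these slides.
PRIOR = {"spam": 0.85, "ham": 0.15}

# P(feature present | label); P(feature absent | label) is 1 minus this.
LIKELIHOOD = {
    "spam": {"F_1": 0.2, "F_2": 0.99},
    "ham": {"F_1": 0.001, "F_2": 0.0001},
}


def posterior(present):
    """Return P(label | observed features) for every label.

    `present` maps each feature name to True (present) or False (absent).
    """
    joint = {}
    for label, prior in PRIOR.items():
        # Naive assumption: features are independent given the label.
        terms = [
            LIKELIHOOD[label][name] if has else 1.0 - LIKELIHOOD[label][name]
            for name, has in present.items()
        ]
        joint[label] = prior * prod(terms)
    # Bayes' Rule: divide by the total so the results sum to 1.
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


def classify(present):
    """Take the most probable label."""
    post = posterior(present)
    return max(post, key=post.get)


# An email that contains "viagra" but does not begin with "Dear Mr/Mrs.".
print(classify({"F_1": True, "F_2": False}))
# An email with neither feature.
print(classify({"F_1": False, "F_2": False}))
```

With only two features the products stay comfortably large. For OCR, where every 2x2 pixel square is a feature, multiplying that many small probabilities may underflow, so a common workaround is to add logarithms of the probabilities instead of multiplying them.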