spam.ppt

How does a computer know what is spam and what is ham?
Attempt 1:
(define (spam? email)
  (cond ((email from known sender) False)
        ((email contains "viagra") True)
        ((email begins with "Dear Mr/Mrs.") True)
        ((email contains URL) True)
        ((email contains attachment) True)
        ( ...
Problem: (email contains URL) is an indication, NOT a proof.
Features:                            Score:
email from known sender              -50
email contains "viagra"               75
email begins with "Dear Mr/Mrs."      70
email contains URL                    10
email contains attachment              5
...                                  ...
If Total Sum > 100, classify as spam.
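The scoring rule above can be sketched in a few lines of Python. This is only an illustration: the feature names (`known_sender`, `contains_viagra`, and so on) are hypothetical labels chosen here, and the email is assumed to be already reduced to the set of features it has. The scores and the threshold of 100 are the ones listed above.

```python
# Scores copied from the list above; the dictionary keys are hypothetical names.
SCORES = {
    "known_sender": -50,
    "contains_viagra": 75,
    "begins_dear_mr_mrs": 70,
    "contains_url": 10,
    "contains_attachment": 5,
}
THRESHOLD = 100


def total_score(features):
    """Sum the scores of the features present in one email."""
    return sum(SCORES[f] for f in features)


def is_spam(features):
    """Classify as spam when the total score exceeds the threshold."""
    return total_score(features) > THRESHOLD


# 75 + 70 = 145, which is above 100
print(is_spam({"contains_viagra", "begins_dear_mr_mrs"}))
# -50 + 75 + 10 = 35, which is below 100
print(is_spam({"known_sender", "contains_viagra", "contains_url"}))
```

Note that both the individual scores and the rule for combining them (a plain sum against a fixed threshold) are hand-picked here.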
Problems:
- How to determine the scores?
- How to combine the scores?
Key Idea:
Learn which features are important through examples.
Training Set:
Lots of emails with correct labels (both spam and ham).
The Naive Bayes Algorithm:
Step 1. Gather Statistics inside Training Set:
- Compute the fraction of emails in the Training Set that are spam: P(spam)
- Compute the fraction of emails in the Training Set that are ham: P(ham)
- For every feature F_1, F_2, F_3, ... :
  = Compute the fraction of spam emails that have feature F_i: P(F_i | spam)
  = Compute the fraction of ham emails that have feature F_i: P(F_i | ham)
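A minimal sketch of Step 1 in Python follows. It assumes a representation that is not in the slides: the training set is a list of `(features, label)` pairs, where `features` is the set of feature names an email has and `label` is `"spam"` or `"ham"`. The function name `gather_statistics` is also hypothetical, and the sketch assumes both labels occur at least once in the training set.

```python
def gather_statistics(training_set, feature_names):
    """Estimate P(label) and P(feature | label) by counting.

    training_set: list of (features, label) pairs (assumed representation).
    feature_names: the features F_1, F_2, ... to collect statistics for.
    """
    labels = ("spam", "ham")
    label_counts = {label: 0 for label in labels}
    feature_counts = {label: {f: 0 for f in feature_names} for label in labels}

    for features, label in training_set:
        label_counts[label] += 1
        for f in features:
            if f in feature_counts[label]:
                feature_counts[label][f] += 1

    total = len(training_set)
    # P(spam) and P(ham): fraction of the training set with each label.
    prior = {label: label_counts[label] / total for label in labels}
    # P(F_i | label): fraction of emails with that label that have F_i.
    likelihood = {
        label: {f: feature_counts[label][f] / label_counts[label]
                for f in feature_names}
        for label in labels
    }
    return prior, likelihood


# Tiny made-up training set, for illustration only.
training = [
    ({"viagra"}, "spam"),
    ({"viagra", "dear"}, "spam"),
    ({"dear"}, "spam"),
    (set(), "ham"),
]
prior, likelihood = gather_statistics(training, ["viagra", "dear"])
print(prior)
print(likelihood)
```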
The Naive Bayes Algorithm:
Say, F_1 = email contains “viagra”
F_2 = email begins with “Dear Mr/Mrs.”
From Training Set, we discovered:
P(spam) = 0.85
P(ham) = 0.15
P(F_1 | spam) = 0.2
P(F_1 | ham) = 0.001
P(NOT F_1 | spam) = 0.8
P(NOT F_1 | ham) = 0.999
P(F_2 | spam) = 0.99
P(F_2 | ham) = 0.0001
P(NOT F_2 | spam) = 0.01
P(NOT F_2 | ham) = 0.9999
The Naive Bayes Algorithm:
Step 2. On a new Instance:
- Find which features the new instance has
- Use Bayes' Rule to compute the probability of each label given those features
- Take the most probable label
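As a worked illustration of Step 2, the sketch below plugs in the example numbers from the training set (P(spam) = 0.85, P(F_1 | spam) = 0.2, and so on). For each label it multiplies P(label) by P(F_i | label) for every feature that is present and by P(NOT F_i | label) = 1 - P(F_i | label) for every feature that is absent, then normalizes so the results sum to 1. Treating the features as independent given the label is the "naive" assumption. The function names and the dictionary layout are hypothetical.

```python
PRIOR = {"spam": 0.85, "ham": 0.15}
LIKELIHOOD = {
    "spam": {"F_1": 0.2, "F_2": 0.99},
    "ham": {"F_1": 0.001, "F_2": 0.0001},
}


def posterior(present):
    """Return P(label | features) for each label.

    present: set of feature names the new email has.
    """
    joint = {}
    for label, prior in PRIOR.items():
        p = prior
        for feature, p_feature in LIKELIHOOD[label].items():
            # P(F_i | label) if the feature is present, else P(NOT F_i | label).
            p *= p_feature if feature in present else 1.0 - p_feature
        joint[label] = p
    evidence = sum(joint.values())
    return {label: p / evidence for label, p in joint.items()}


def classify(present):
    """Take the most probable label."""
    post = posterior(present)
    return max(post, key=post.get)


print(classify({"F_1", "F_2"}))  # both features present
print(classify(set()))           # neither feature present
print(posterior(set()))
```

With these numbers, an email that contains "viagra" and begins with "Dear Mr/Mrs." is classified as spam. An email with neither feature is classified as ham, with P(spam | features) of about 0.043, even though 85% of the training set is spam.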
Example:
Optical Character Recognition
GOAL: recognize scanned hand-written numbers
............................
......++++++................
......##############++......
......+++++##########+......
............+.+++++##+......
..................+##.......
.................+##+.......
.................+##+.......
................+##+........
................+#+.........
................##+.........
...............+#+..........
..............+##+..........
..............##+...........
.............###+...........
............+##+............
...........+##+.............
...........+##+.............
..........+###+.............
..........+###+.............
..........+##...............
............................
............................
............................
............+#..............
..........+###..............
.........+####+.............
.........+######+...........
........+###+####+..........
........+##..+####..........
........+#+...+##+..........
........+#+...###+..........
........+##+++####+.........
.........#####++##+.........
.........+###+..+##+........
..........+++....+#+........
.................+##........
..................+#+.......
..................+##+......
...................+#+......
...................+##+.....
....................+#+.....
.....................+#+....
......................#+....
............................
Instance – scanned image of hand-written number
Labels – 1,2,3,4,5,6,7,8,9
Features – (for project) every 2x2 square of pixels
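One possible way to turn an image into 2x2 features is sketched below. The details are assumptions, not the project's actual representation: the image is taken to be a list of equal-length strings (as in the pictures above), the squares are allowed to overlap, and each feature is identified by the row and column of its top-left corner together with the four characters it covers. The function name `two_by_two_features` is hypothetical.

```python
def two_by_two_features(image):
    """Map the top-left corner (row, col) of every 2x2 square to its pixels.

    image: list of equal-length strings, one string per row (assumed format).
    The four pixels are returned as a 4-character string: top row, then bottom row.
    """
    features = {}
    for r in range(len(image) - 1):
        for c in range(len(image[r]) - 1):
            features[(r, c)] = image[r][c:c + 2] + image[r + 1][c:c + 2]
    return features


# A tiny 3x4 made-up image, for illustration only.
tiny = [
    "..+#",
    ".+##",
    ".##+",
]
feats = two_by_two_features(tiny)
print(len(feats))     # (3 - 1) * (4 - 1) = 6 squares
print(feats[(0, 2)])  # top row "+#", bottom row "##"
```

Under this assumption, a 28-column image with R rows yields (R - 1) * 27 features.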
Steps.
- Turn image-file into a stream of Images (Abstract Data Type)
(done for you)
- Gather feature statistics from Training File
(mostly done for you)
- Implement Bayes' Rule
(mostly your own work)
- Evaluate your OCR by guessing labels on Validation File
(mostly done for you)
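The evaluation step can be pictured as follows. This is a hedged sketch, not the project's code: it assumes the validation data is available as a list of `(instance, true_label)` pairs and that `classify` is any function that guesses a label for an instance. The names `accuracy` and `always_guess_one` are hypothetical.

```python
def accuracy(classify, validation_set):
    """Fraction of (instance, true_label) pairs that classify gets right."""
    correct = sum(1 for instance, label in validation_set
                  if classify(instance) == label)
    return correct / len(validation_set)


def always_guess_one(instance):
    """A deliberately poor stand-in classifier that ignores its input."""
    return 1


# Made-up validation pairs; the instances are placeholders, not real images.
validation = [("img_a", 1), ("img_b", 7), ("img_c", 1), ("img_d", 9)]
print(accuracy(always_guess_one, validation))  # 2 of 4 correct
```

Comparing a real classifier against a trivial baseline like this one shows whether the learned statistics are actually helping.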