Lecture 33: Text Categorization

Text Categorization
(Actually, these methods apply to categorizing anything into fixed categories: tagging, WSD, PP attachment, ...)
600.465 - Intro to NLP - J. Eisner
Why Text Categorization?
• Is it spam?
• Is it Spanish?
• Is it interesting to this user?
  - News filtering
  - Helpdesk routing
• Is it interesting to this NLP program?
  - e.g., should my calendar system try to interpret this email as an appointment (using information extraction)?
• Where should it go in the directory?
  - Yahoo! / Open Directory / digital libraries
  - Which mail folder? (work, friends, junk, urgent ...)
Measuring Performance
• Classification accuracy: What % of messages were classified correctly?
• Is this what we care about?

            Overall accuracy   Accuracy on spam   Accuracy on gen (genuine mail)
System 1    95%                99.99%             90%
System 2    95%                90%                99.99%

• Which system do you prefer?
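The two systems in the comparison above have the same overall accuracy but make very different mistakes. A minimal sketch of that arithmetic follows; the 50/50 mix of spam and genuine mail and the batch of 10,000 genuine messages are assumptions for illustration, not figures from the slide.

```python
def overall_accuracy(acc_spam, acc_gen, frac_spam):
    """Overall accuracy as a mixture of the two per-class accuracies."""
    return frac_spam * acc_spam + (1 - frac_spam) * acc_gen

# Per-class accuracies are from the slide; the 50/50 class mix is assumed.
system1 = overall_accuracy(acc_spam=0.9999, acc_gen=0.90, frac_spam=0.5)
system2 = overall_accuracy(acc_spam=0.90, acc_gen=0.9999, frac_spam=0.5)
print(round(system1, 3), round(system2, 3))

# How many of 10,000 genuine messages would each system misfile as spam?
lost_good_1 = round(10000 * (1 - 0.90))    # System 1
lost_good_2 = round(10000 * (1 - 0.9999))  # System 2
print(lost_good_1, lost_good_2)
```

Under these assumptions both systems score about 95% overall, yet System 1 loses far more genuine mail, which is usually the costlier error for a mail filter.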
Measuring Performance
Precision vs. Recall of Good (non-spam) Email
• Precision = (good messages kept) / (all messages kept)
• Recall = (good messages kept) / (all good messages)
[Figure: precision-recall curve; x-axis: Recall (0% to 100%), y-axis: Precision (0% to 100%)]
Trade off precision vs. recall by setting a threshold.
Measure the curve on annotated dev data (or test data).
Choose a threshold where the user is comfortable.
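As a rough illustration of how moving the threshold trades precision against recall, the sketch below scores a handful of messages and evaluates a few thresholds. The scores, the labels, and the convention that a higher score means "more likely good" are all made up for the example.

```python
def precision_recall(scores, is_good, threshold):
    """Keep messages scoring >= threshold; measure precision and recall of good mail."""
    kept = [g for s, g in zip(scores, is_good) if s >= threshold]
    good_kept = sum(kept)
    all_good = sum(is_good)
    precision = good_kept / len(kept) if kept else 1.0  # convention when nothing is kept
    recall = good_kept / all_good if all_good else 1.0
    return precision, recall

# Hypothetical dev-set scores (higher = more likely good) and gold labels.
scores  = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
is_good = [True, True, False, True, True, False, True, False]

for t in (0.9, 0.5, 0.0):
    p, r = precision_recall(scores, is_good, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}")
```

Sweeping the threshold over all observed scores would trace out the full curve on the dev data.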
F-measure = 1 / (average(1/precision, 1/recall))
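The formula above is the harmonic mean of precision and recall. A small sketch, with the zero-handling convention being an assumption of this example:

```python
def f_measure(precision, recall):
    """Harmonic mean: 1 / average(1/precision, 1/recall)."""
    if precision == 0 or recall == 0:
        return 0.0  # assumed convention: the harmonic mean is 0 if either term is 0
    return 1.0 / ((1.0 / precision + 1.0 / recall) / 2.0)

print(f_measure(0.5, 0.5))   # balanced system
print(f_measure(1.0, 0.1))   # very precise, but misses most good mail
```

Because it is a harmonic mean, the score is pulled toward the smaller of the two values, so a system cannot do well by maximizing only one of them.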
Measuring Performance
Precision vs. Recall of Good (non-spam) Email
[Figure: precision-recall curve; x-axis: Recall (0% to 100%), y-axis: Precision (0% to 100%). Annotations on the curve:]
• High threshold: all we keep is good, but we don’t keep much. OK for search engines (maybe).
• Low threshold: keep all the good stuff, but a lot of the bad too. OK for spam filtering and legal search.
• Point where precision = recall (sometimes reported).
• High precision and high recall: would prefer to be here!
More Complicated Cases of Measuring Performance
• For multi-way classifiers:
  - Average accuracy (or precision or recall) of 2-way distinctions: Sports or not, News or not, etc.
  - Better, estimate the cost of different kinds of errors
    e.g., how bad is each of the following?
      putting Sports articles in the News section
      putting Fashion articles in the News section
      putting News articles in the Fashion section
  - Now tune system to minimize total cost
• For ranking systems: Which articles are most Sports-like? Which articles / webpages are most relevant?
  - Correlate with human rankings?
  - Get active feedback from user?
  - Measure user’s wasted time by tracking clicks?
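One way to read "estimate the cost of different kinds of errors" and "minimize total cost" is as a minimum-expected-cost decision rule. The sketch below assumes the classifier outputs a posterior distribution over categories; the cost values and the posterior are invented for illustration.

```python
# COST[true][predicted]: hypothetical cost of filing a `true` article under `predicted`.
COST = {
    "Sports":  {"Sports": 0, "News": 1, "Fashion": 2},
    "News":    {"Sports": 2, "News": 0, "Fashion": 5},
    "Fashion": {"Sports": 2, "News": 1, "Fashion": 0},
}

def min_cost_label(posterior):
    """Pick the label with the lowest expected cost under p(true category | document)."""
    def expected_cost(predicted):
        return sum(p * COST[true][predicted] for true, p in posterior.items())
    return min(COST, key=expected_cost)

posterior = {"Sports": 0.1, "News": 0.4, "Fashion": 0.5}
print(max(posterior, key=posterior.get))  # most probable category
print(min_cost_label(posterior))          # cheapest category in expectation
```

With these made-up costs, putting a News article in the Fashion section is expensive enough that the cheaper decision differs from the most probable one.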
How to Categorize?
Subject: would you like to . . . .
. . drive a new vehicle for free ? ? ? this is not hype or a
hoax , there are hundreds of people driving brand new cars ,
suvs , minivans , trucks , or rvs . it does not matter to us
what type of vehicle you choose . if you qualify for our
program , it is your choice of vehicle , color , and options
. we don ' t care . just by driving the vehicle , you are
promoting our program . if you would like to find out more
about this exciting opportunity to drive a brand new vehicle
for free , please go to this site : http : / / 209 . 134 . 14
. 131 / ntr to watch a short 4 minute audio / video
presentation which gives you more information about our
exciting new car program . if you do n't want to see the
short video , but want us to send you our information package
that explains our exciting opportunity for you to drive a new
vehicle for free , please go here : http : / / 209 . 134 . 14
. 131 / ntr / form . htm we would like to add you the group
of happy people driving a new vehicle for free . happy
motoring .
How to Categorize?
(supervised)
We’ve seen lots of options in this course!
1. Build n-gram model of each category


Question: How to classify test message?
Answer: Bayes’ Theorem
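A minimal sketch of this option: train one language model per category, then pick the category that maximizes p(category) × p(message | category). The bigram order, add-one smoothing, and toy training messages are assumptions of the example, not choices stated on the slide.

```python
import math
from collections import Counter

class BigramModel:
    """Add-one-smoothed bigram language model for one category."""
    def __init__(self, docs):
        self.bigrams, self.contexts, self.vocab = Counter(), Counter(), set()
        for doc in docs:
            tokens = ["<s>"] + doc.split() + ["</s>"]
            self.vocab.update(tokens)
            for prev, word in zip(tokens, tokens[1:]):
                self.bigrams[prev, word] += 1
                self.contexts[prev] += 1

    def log_prob(self, doc, vocab_size):
        tokens = ["<s>"] + doc.split() + ["</s>"]
        return sum(
            math.log((self.bigrams[prev, word] + 1) / (self.contexts[prev] + vocab_size))
            for prev, word in zip(tokens, tokens[1:])
        )

def classify(doc, models, priors):
    """Bayes' Theorem: argmax over categories of log p(cat) + log p(doc | cat)."""
    # Shared vocabulary size, plus one slot for unseen words.
    vocab_size = len(set().union(*(m.vocab for m in models.values()))) + 1
    return max(models, key=lambda c: math.log(priors[c]) + models[c].log_prob(doc, vocab_size))

# Toy training data, invented for illustration.
train = {
    "spam": ["drive a new vehicle for free", "free cash bonus for you"],
    "gen":  ["meeting moved to friday", "see you at the seminar on friday"],
}
models = {c: BigramModel(docs) for c, docs in train.items()}
n_docs = sum(len(docs) for docs in train.values())
priors = {c: len(docs) / n_docs for c, docs in train.items()}

print(classify("a new vehicle for free", models, priors))
print(classify("seminar on friday", models, priors))
```

The denominator p(message) in Bayes' Theorem is the same for every category, so it can be dropped when only the best category is needed.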
How to Categorize?
(supervised)
We’ve seen lots of options in this course!
2. Represent each document as a vector
(must choose representation and distance measure; use SVD?)


Question: How to classify test message?
Answer 1: Category whose centroid is most similar
(may not work well if category is diverse)

Answer 2: Cluster each category into subcategories
(then use answer 1 to pick a subcategory)
(return the category that the subcategory is in)
(this can also be useful for n-gram models)

Answer 3: Just look at labels of nearby training docs
(e.g., let the k nearest neighbors vote – flexible!)
(maybe the closer ones get a bigger vote)
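Answer 3 can be sketched as a similarity-weighted k-nearest-neighbor vote. The bag-of-words representation, cosine similarity, k = 3, and the training messages below are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict

def vectorize(text):
    """Bag-of-words count vector, stored sparsely as a Counter."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def knn_classify(doc, training, k=3):
    """Let the k most similar training docs vote; closer docs get a bigger vote."""
    query = vectorize(doc)
    neighbors = sorted(((cosine(query, vectorize(text)), label) for text, label in training),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for sim, label in neighbors:
        votes[label] += sim
    return max(votes, key=votes.get)

# Hypothetical labeled training documents.
training = [
    ("win a free car today", "spam"),
    ("free cash bonus offer", "spam"),
    ("claim your free prize", "spam"),
    ("lunch meeting on friday", "gen"),
    ("notes from the friday seminar", "gen"),
    ("project meeting agenda", "gen"),
]
print(knn_classify("free car offer", training))
print(knn_classify("friday meeting notes", training))
```

Replacing the vote with "nearest category centroid" would give Answer 1; the neighbor vote is more flexible when a category is diverse.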
How to Categorize?
(supervised)
We’ve seen lots of options in this course!
3. Treat it like word-sense disambiguation
a) Vector model – use all the features (we just saw this)
b) Decision list – use single most indicative feature
c) Naive Bayes – use all the features, weighted by how
well they discriminate among the categories
d) Decision tree – use some of the features in sequence
e) Other options from machine learning, like perceptron,
Support Vector Machine (SVM), logistic regression, …
Features matter more than which machine learning method
Review: Vector Model
These two documents are similar:
  (0, 0, 3, 1, 0, 7, ..., 1, 0)
  (0, 0, 1, 0, 0, 3, ..., 0, 1)
After normalizing vector length to 1, they are
• close in Euclidean space (similar endpoint)
• high in dot product (similar direction)
Can play lots of encoding games when creating vector:
Remove function words or reduce their weight
Use features other than unigrams
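The two similarity criteria in this review coincide once vectors are normalized to length 1. The sketch below checks this numerically; it uses only the components visible on the slide and simply drops the elided "..." entries, which is an assumption of the example.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Visible components only; the elided middle entries are omitted (assumed irrelevant here).
a = normalize([0, 0, 3, 1, 0, 7, 1, 0])
b = normalize([0, 0, 1, 0, 0, 3, 0, 1])

cosine = dot(a, b)
sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
print(round(cosine, 3), round(sq_dist, 3))

# For unit vectors: squared Euclidean distance = 2 - 2 * (dot product),
# so "similar endpoint" and "similar direction" rank documents the same way.
```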
slide courtesy of D. Yarowsky (modified)
Review: Decision Lists
To disambiguate a token of lead:
• Scan down the sorted list
• The first cue that is found gets to make the decision all by itself
• Not as subtle as combining cues, but works well for WSD
Cue’s score is its log-likelihood ratio (with smoothed probabilities):
log [ p(cue | sense A) / p(cue | sense B) ]
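A rough sketch of building and applying such a list. Sorting by the absolute value of the log-likelihood ratio, the add-alpha smoothing constant, and the cue counts for the two senses are assumptions made for the example.

```python
import math

def build_decision_list(counts_a, counts_b, alpha=0.1):
    """Score each cue by |log p(cue|A) / p(cue|B)| (add-alpha smoothed); strongest first."""
    cues = set(counts_a) | set(counts_b)
    total_a = sum(counts_a.values()) + alpha * len(cues)
    total_b = sum(counts_b.values()) + alpha * len(cues)
    rules = []
    for cue in cues:
        llr = math.log(((counts_a.get(cue, 0) + alpha) / total_a) /
                       ((counts_b.get(cue, 0) + alpha) / total_b))
        rules.append((abs(llr), cue, "A" if llr > 0 else "B"))
    return sorted(rules, reverse=True)

def classify(context_cues, rules, default="A"):
    """The first (strongest) cue found in the context decides all by itself."""
    for _, cue, sense in rules:
        if cue in context_cues:
            return sense
    return default

# Hypothetical cue counts for two senses of "lead": A = the metal, B = to guide.
counts_a = {"zinc": 20, "pipe": 15, "poisoning": 12, "the": 50}
counts_b = {"follow": 18, "role": 14, "took": 10, "the": 55}
rules = build_decision_list(counts_a, counts_b)
print(classify({"the", "zinc", "took"}, rules))
```

Smoothing matters here because a cue seen only with one sense would otherwise get an infinite score from a single observation.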
slide courtesy of D. Yarowsky (modified)
Review: Combining Cues via Naive Bayes
These stats come from term papers of known authorship (i.e., supervised training).
slide courtesy of D. Yarowsky (modified)
Review: Combining Cues via Naive Bayes
Would this kind of sentence be more typical of a student A paper or a student B paper?
“Naive Bayes” model for classifying text
(Note the naive independence assumptions!)
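A minimal sketch of a Naive Bayes text classifier in the spirit of this review: each word is treated as an independent cue given the author. The add-one smoothing, the choice to ignore unseen words, and the toy "term paper" snippets are assumptions of the example.

```python
import math
from collections import Counter

def train_naive_bayes(labeled_docs):
    """Collect per-class word counts and class counts from (text, label) pairs."""
    word_counts, class_counts, vocab = {}, Counter(), set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
        vocab.update(text.split())
    return word_counts, class_counts, vocab

def log_posteriors(text, model):
    """Unnormalized log p(class) + sum of log p(word | class), add-one smoothed."""
    word_counts, class_counts, vocab = model
    n_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values()) + len(vocab)
        score = math.log(class_counts[label] / n_docs)
        for word in text.split():
            if word in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][word] + 1) / total)
        scores[label] = score
    return scores

# Invented snippets of known authorship (supervised training data).
papers = [
    ("moreover the results clearly show", "A"),
    ("moreover we clearly see the trend", "A"),
    ("so basically the results are good", "B"),
    ("basically we see a good trend", "B"),
]
model = train_naive_bayes(papers)
scores = log_posteriors("moreover the trend is clearly good", model)
print(max(scores, key=scores.get))
```

Working in log space avoids underflow when many small word probabilities are multiplied together.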
example from Manning & Schütze
Decision Trees
Is this Reuters article an Earnings Announcement?
All docs: 2301/7681 = 0.3
(split on the feature that reduces our uncertainty most)
  contains “cents” ≥ 2 times: 1607/1704 = 0.943
    contains “versus” ≥ 2 times: 1398/1403 = 0.996  (“yes”)
    contains “versus” < 2 times: 209/301 = 0.694
  contains “cents” < 2 times: 694/5977 = 0.116
    contains “net” ≥ 1 time: 422/541 = 0.780
    contains “net” < 1 time: 272/5436 = 0.050  (“no”)
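"Reduces our uncertainty most" is commonly measured as information gain, i.e., the drop in entropy after the split. The sketch below applies that measure to the counts shown for the root and its “cents” split; treating entropy as the uncertainty measure is an assumption, since the slide does not name one.

```python
import math

def entropy(positives, total):
    """Binary entropy (in bits) of a node with `positives` yes-documents out of `total`."""
    p = positives / total
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    parent_pos, parent_total = parent
    remainder = sum(total / parent_total * entropy(pos, total) for pos, total in children)
    return entropy(parent_pos, parent_total) - remainder

# Counts from the tree: (earnings announcements, all articles) at each node.
root = (2301, 7681)
cents_split = [(1607, 1704), (694, 5977)]  # "cents" >= 2 times vs. < 2 times
gain = information_gain(root, cents_split)
print(f"root entropy = {entropy(*root):.3f} bits, gain from 'cents' split = {gain:.3f} bits")
```

A tree learner would compute this gain for every candidate feature and threshold, split on the best one, and recurse on each child.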
Features Besides Unigrams
• All these approaches (except n-gram model) can use “interesting” features, not just unigrams.
• There’s generally a heuristic feature selection problem
  - Use some very large set of features defined by a template
  - Maybe restrict to features that look useful in isolation?
  - Add features greedily, one at a time
  - Measure or guess expected improvement of each feature
  - Make sure to smooth when doing this – why?
  - At the end, remove features that hurt performance on held-out data
• What does SpamAssassin use?
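The "add features greedily, one at a time" step from the feature-selection slide can be sketched as forward selection against a held-out score. The toy classifier (flag a message as spam if any selected feature fires), the held-out messages, and the stopping rule are assumptions for illustration; the final pruning pass is not shown.

```python
def greedy_select(candidates, evaluate, max_features=10):
    """Repeatedly add the single feature that most improves the held-out score."""
    selected, best_score = [], evaluate([])
    while len(selected) < max_features:
        scores = {f: evaluate(selected + [f]) for f in candidates if f not in selected}
        if not scores:
            break
        feature = max(scores, key=scores.get)
        if scores[feature] <= best_score:  # nothing helps any more
            break
        selected.append(feature)
        best_score = scores[feature]
    return selected, best_score

# Hypothetical held-out messages: (features that fire, is_spam).
HELD_OUT = [
    ({"free", "viagra"}, True),
    ({"free", "click"}, True),
    ({"click", "meeting"}, False),
    ({"meeting"}, False),
    ({"viagra"}, True),
    ({"free", "meeting"}, False),
]

def evaluate(features):
    """Held-out accuracy of a toy rule: call it spam if any selected feature fires."""
    chosen = set(features)
    correct = sum(bool(fired & chosen) == is_spam for fired, is_spam in HELD_OUT)
    return correct / len(HELD_OUT)

print(greedy_select(["free", "click", "meeting", "viagra"], evaluate))
```

In a real system `evaluate` would retrain the classifier with the candidate feature set, which is why guessing each feature's expected improvement can be much cheaper than measuring it.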
SpamAssassin Features
Score   Feature
100     From: address is in the user's black-list
4.0     Sender is on www.habeas.com Habeas Infringer List
3.994   Invalid Date: header (timezone does not exist)
3.970   Written in an undesired language
3.910   Listed in Razor2, see http://razor.sf.net/
3.801   Subject is full of 8-bit characters
3.472   Claims compliance with Senate Bill 1618
3.437   exists:X-Precedence-Ref
3.371   Reverses Aging
3.350   Claims you can be removed from the list
3.284   'Hidden' assets
3.283   Claims to honor removal requests
3.261   Contains "Stop Snoring"
3.251   Received: contains a name with a faked IP-address
3.250   Received via a relay in list.dsbl.org
3.200   Character set indicates a foreign language
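Scores like these are typically used additively: each feature that fires contributes its score, and the total is compared with a threshold. The sketch below assumes that scheme and a threshold of 5.0; the three rules are taken from this slide by pairing each score with the description in the same position, and the function names are hypothetical.

```python
# (description -> score) for a few rules from the slide above.
RULE_SCORES = {
    "Reverses Aging": 3.371,
    "Claims to honor removal requests": 3.283,
    'Contains "Stop Snoring"': 3.261,
}

def spam_score(fired_rules, rule_scores=RULE_SCORES):
    """Sum the scores of every rule that fired on the message."""
    return sum(rule_scores[rule] for rule in fired_rules)

def is_spam(fired_rules, threshold=5.0, rule_scores=RULE_SCORES):
    """Flag the message when the total score reaches the (assumed) threshold."""
    return spam_score(fired_rules, rule_scores) >= threshold

print(is_spam(["Reverses Aging"]))
print(is_spam(["Reverses Aging", 'Contains "Stop Snoring"']))
```

Seen this way, the feature list is a linear classifier: the features are hand-written tests and the scores are their weights.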
SpamAssassin Features
3.198   Forged eudoramail.com 'Received:' header found
3.193   Free Investment
3.180   Received via SBLed relay, see http://www.spamhaus.org/sbl/
3.140   Character set doesn't exist
3.123   Dig up Dirt on Friends
3.090   No MX records for the From: domain
3.072   X-Mailer contains malformed Outlook Express version
3.044   Stock Disclaimer Statement
3.009   Apparently, NOT Multi Level Marketing
3.005   Bulk email software fingerprint (jpfree) found in headers
2.991   exists:Complain-To
2.975   Bulk email software fingerprint (VC_IPA) found in headers
2.968   Invalid Date: year begins with zero
2.932   Mentions Spam law "H.R. 3113"
2.900   Received forged, contains fake AOL relays
2.879   Asks for credit card details
SpamAssassin Features
2.858   To: username at front of subject
2.851   Claims you actually asked for this spam
2.842   To header contains 'recipient' marker
2.826   Compare Rates
2.800   Received: says mail bounced all around the world
2.800   Mentions Spam Law "UCE-Mail Act"
2.796   Received via buggy SMTP server (MDaemon 2.7.4SP4R)
2.795   Bulk email software fingerprint (StormPost) found in headers
2.786   Broken CGI script message
2.784   Message-Id generated by a spam tool
2.783   Urges you to call now
2.782   Tells you it's an ad
2.782   RAND found, spammer forgot to run the random-ID generator
2.748   Cable Converter
2.744   No Age Restrictions
2.737   Possible porn - Celebrity Porn
SpamAssassin Features
2.782   Tells you it's an ad
2.782   RAND found, spammer forgot to run the random-ID generator
2.748   Cable Converter
2.744   No Age Restrictions
2.737   Possible porn - Celebrity Porn
2.735   Bulk email software fingerprint (JiXing) found in headers
2.730   DNSBL: sender is Confirmed Spam Source
2.726   Bulk email software fingerprint (MMailer) found in headers
2.720   exists:X-Encoding
2.720   DNSBL: sender is Confirmed Open Relay
2.702   SEC-mandated penny-stock warning -- thanks SEC
2.695   Claims you can be removed from the list
2.693   Removes Wrinkles
2.668   Offers a stock alert
2.660   Listed in DCC, see http://rhyolite.com/anti-spam/dcc/
2.658   Common pyramid scheme phrase (1)
SpamAssassin Features
2.654   Offers a free consultation
2.645   Bulk email software fingerprint (EVAMAIL) found in headers
2.642   Possible porn - Amateur Porn
2.640   Listed in Razor1, see http://razor.sf.net/
2.639   Subject contains lots of white space
2.622   exists:X-x
2.620   Received via a relay in relays.visi.com
2.611   Bulk email software fingerprint (IMktg) found in headers
2.566   Compete for your business
2.565   Possible porn - Pay Site
2.541   Contains "CBYI"
2.516   Spam phrases score is 34 to 55 (high)
2.513   Possible porn - Lesbian Site
2.510   Contains 'free installation' with capitals
2.502   Free Grant Money
2.500   Listed in Pyzor, see http://pyzor.sf.net/
SpamAssassin Features
2.500   Treść zawiera 'odesłanie z dopiskiem NIE'
2.500   Treść zawiera 'Artykul 25 ust 2 punkt 2'
2.500   Tresc zawiera 'przepraszamy za zajęty czas'
2.500   Tresc zawiera 'Zamów teraz!!!'
2.500   Tresc zawiera 'Jeżeli (Państwo) nie życzycie(sz) sobie'
2.500   Tresc zawiera 'Aby usunąć adres e-mail...'
2.496   Spam tool pattern in MIME boundary
2.492   'Message-Id' was added by a relay
2.488   Bulk email software fingerprint (screwup 1) found in headers
2.456   University Diplomas
2.450   Character set indicates foreign language body
2.445   Claims you can be removed from the list
2.443   Headers include 3 consecutive 8-bit characters
2.425   Date: is 24 to 48 hours after Received: date
2.421   'From' juno.com does not match 'Received' headers
2.398   Meet Singles
SpamAssassin Features
2.362   Serious Enquiries Only.
2.361   Claims auto-email removal
2.357   MiME-Version header (oddly capitalized)
2.357   A "microsoft" header was found
2.351   X-Mailer contains "OutLook Express 3.14159"
2.334   Possible porn - Rape
2.331   "Collect Child Support" Scam
2.314   Claims spam helps the environment
2.292   Free Leads
2.290   Fake name used in SMTP HELO command
2.280   Received via a relay in ipwhois.rfc-ignorant.org
2.276   Possible porn - Cum Shot
2.261   Amazing Stuff
2.250   Received via a relay in orbs.dorkslayers.com
2.242   Possible porn - Mega Porn
2.240   Offers pure profit
SpamAssassin Features
2.216   Received contains a faked HELO hostname
2.210   Tells you it's an ad
2.209   Uses control sequences inside a URL's hostname
2.206   Claims spam helps the environment
2.203   Tells you to 'take action now!'
2.203   Cash Bonus
2.202   From an address @btamail.net.cn
2.180   exists:X-Library
2.176   Contains "My wife, Jody" testimonial
2.170   Possible porn - Nasty Girls
2.145   Promise you ...!
2.114   Claims to be in accordance with some Spam law
2.109   Uses a numeric IP address in URL
2.100   Possible porn - Live Porn
2.088   Discusses search engine listings
2.083   HTML comments which obfuscate text
SpamAssassin Features
2.066   Information on getting a larger penis or breasts (2)
2.066   Contains 'free preview' with capitals
2.060   A foreign language charset used in headers
2.052   Says "We strongly oppose the use of spam email"
2.044   trail of Received: headers seems to be forged
2.030   Credit Bureaus
2.022   Claims compliance with House Bill 4176
2.011   No Investment
2       Treść zawiera 'adres e-mail zostal znaleziony/pozyskany'
2       Treść zawiera 'adres (e-mail) pochodzi z ogólnodostępnych....'
2       Treść zawiera 'Ustawy o ochronie danych osobowych'
2       Tresc zawiera 'temat USUN'
2       Tresc zawiera 'na podstawie adresow e-mail publicznie...'
2       Tresc zawiera 'kliknij w poniższy link'
2       Tresc zawiera 'do nabycia u nas'
2       Tresc zawiera 'Wysłać pusty mail'
SpamAssassin Features
2       Tresc zawiera 'Wiadomość nadano na podstawie...'
2       Tresc zawiera 'Wiadomość nadano jednorazowo...'
2       Tresc zawiera 'USUN Z BAZY'
2       Tresc zawiera 'Prosimy o przesłanie pustego maila'
2       Tresc zawiera 'Jeżeli nie interesują...'
2       Tresc zawiera 'Jeżeli nie chcesz (otrzymywac)...'
2       Tresc zawiera '...prosimy o zwrotny e-mail...'
2       Tresc zawiera '...adres z bazy...'
2       Dice cumplir con la ley
2       Clama cumplir con la normativa SPAM
1.995   Serious cash
1.984   Viagra and other drugs
1.977   If only it were that easy
1.952   Nigerian scam key phrase (million dollars)
1.910   Drastically Reduced
1.904   Contains "Temple Kiff"
SpamAssassin Features
1.889   Forged 'by gw05' 'Received:' header found
1.889   Credit Card Offers
1.880   Find out Anything
1.858   Contains "Gentle Ferocity"
1.856   Spam phrases score is 21 to 34 (high)
1.844   Possible Porn - Porn membership
1.842   Potential Earnings
1.839   Bulk email software fingerprint (Group Mail) found in headers
1.836   Once in a lifetime, apparently
1.831   Offers Free (often stolen) Passwords
1.824   Contains 'Dear (something)'
1.813   Possible porn - Porn Password
1.778   Message is 90-100% HTML tags
1.772   Sent using a trial version of CommuniGate
1.754   Date: is 48 to 96 hours after Received: date
1.744   To: has no local-part before @ sign
SpamAssassin Features
1.739   Talk about a check or money order
1.721   Contains 'for only pennies a day'
1.697   Spam tool pattern in MIME boundary
1.690   Form for checking email address
1.687   Subject: contains advertising tag
1.686   Talks about bulk email
1.682   Claims you registered with some kind of partner
1.681   Long Distance Phone Offer
1.663   Additional Income
1.640   Spam phrases score is 05 to 08 (medium)
1.640   Contains 'subject to credit approval'
1.639   Talks about tracing by SSN
1.631   Possible Porn - XXX Photos
1.625   Contains 'earn (dollar) something per week'
1.598   Message-Id has characters often found in spam
1.591   'X-Mailer' line contains gibberish
SpamAssassin Features
1.591   Cures Baldness
1.578   Subject starts with "Hello"
1.552   "Refinance your home"
1.548   Doing something with my income
1.546   Date: is 96 hours or more before Received: date
1.544   To: address contains spaces
1.539   Cents on the Dollar
1.526   Uses a username in a URL
1.523   Secretly Recorded
1.518   Invalid Date: header (not RFC 2822)
1.506   From and To are same (3)
1.505   Valid-looking To "undisclosed-recipients"
1.503   exists:Date-warning
1.500   Temat zawiera 'oferta'
1.500   Treść zawiera 'Zaprosić państwo'
1.500   Treść zawiera 'Szanowni Państwo'
SpamAssassin Features
1.500   Tresc zawiera 'publicznie dostępny (email)'
1.500   Tresc zawiera 'Upowaznienie do wystawiania faktur VAT...'
1.500   Tresc zawiera '...mail z tematem...'
1.495   Possible registry spammer
1.490   Possible porn - Adult Web Sites
1.486   'one time mailing' doesn't mean it isn't spam
1.479   Forged hotmail.com 'Received:' header found
1.470   Talks about opting in
1.466   Possible porn - Barely Legal
1.459   Claims compliance with Senate Bill 1618
1.435   Direct Marketing
1.410   Money back guarantee.
1.404   Date: is 48 to 96 hours before Received: date
1.404   Instructions on how to increase something
1.400   NOS CHILLAN PARA DECIR QUE ES GRATIS
1.394   Plugs Viagra
SpamAssassin Features
1.385   Spam phrases score is 08 to 13 (medium)
1.382   URL uses words and phrases which indicate porn (4)
1.373   As seen on national TV!
1.370   Message text disguised using base-64 encoding
1.368   Date: is 3 to 6 hours after Received: date
1.363   Score with babes!
1.361   From and To are same (6)
1.352   'From' yahoo.com does not match 'Received' headers
1.337   Spam phrases score is 13 to 21 (high)
1.332   Not intended for residents of XYZ.
1.319   Faked To "Undisclosed-Recipients"
1.314   From and To are same (5)
1.306   Only thing addresses on CD are useful for is spam
1.302   Contains "Vjestika Aphrodisia"
1.301   Lower Monthly Payment
1.293   HTML comment has 3 consecutive 8-bit characters
SpamAssassin Features
1.285   From: does not include a real name
1.283   Uses a dotted-decimal IP address in URL
1.275   Contains link without http:// prefix
1.274   'Subject' contains G.a.p.p.y-T.e.x.t
1.273   Marketing Solutions
1.270   Spam tool pattern in MIME boundary
1.269   'Prestigious Non-Accredited Universities'
1.253   Spam tool pattern in MIME boundary
1.253   Incorporates a tracking ID number
1.247   From and To are same (2)
1.246   Contains 'free sample' with capitals
1.231   Claims compliance with spam regulations
1.226   Online Pharmacy
1.224   Received via SMTPD32 server (SMTPD32-n.n)
1.218   Includes a form which will send an email
1.201   While you Sleep
SpamAssassin Features
1.187   Uses non-standard port number for HTTP
1.175   Possible porn - in ALL CAPS
1.148   Subject contains a unique ID
1.146   Bulk email software fingerprint (hash 2) found in headers
1.138   Get Paid
1.131   Contains 'URGENT BUSINESS'
1.119   Why Pay More?
1.118   Requires Initial Investment
1.112   Javascript to open a new window
1.110   exists:X-List-Unsubscribe
1.099   Date: is 6 to 12 hours after Received: date
1.098   Subject starts with dollar amount
1.092   Increase your ejaculation!
1.084   Subject: contains Korean unsolicited email tag
1.084   Spam phrases score is 03 to 05 (medium)
1.078   Plugs "Herbal Viagra"
SpamAssassin Features
1.077   Apparently, you'll be amazed
1.057   People just leave money laying around
1.045   Bulk email software fingerprint (eGroups) found in headers
1.042   Date: is 24 to 48 hours before Received: date
1.039   Talks about direct email
1.038   Unneeded encoding of HTML tags
1.023   Javascript to move windows around
1.021   No such thing as a free lunch (3)
1.009   Save big money
1       Frequent SPAM content
1       Frequent SPAM content
1       Frequent SPAM content
1       Frequent SPAM content
1       Frequent SPAM content
1       Frequent SPAM content
1       Frequent SPAM content
SpamAssassin Features
1       Filename is just a '\#'; probably a JS trick
1       Old Murkowski disclaimer
1       Obfuscated action attribute in HTML form
1       Mentions monsterhut.com
1       Form for verifying email address
1       Contains signature of unregistered spam tool
1       Publicidad por e-mail
1       Contiene la palabra gratis en las cabeceras
1       exists:X-Fix
1       To: non-existent 'Investors' address
1       Subject contains 'Your Membership Exchange'
1       Spam tool pattern in MIME boundary
1       Reply-To: is empty
1       Received via a relay in bl.spamcop.net
1       Received via RSSed relay, see http://www.mail-abuse.org/rss/
1       Received via RBLed relay, see http://www.mail-abuse.org/rbl/
SpamAssassin Features
1       Received from first hop dialup, see http://www.mail-abuse.org/dul/
1       Received from dialup, see http://www.mail-abuse.org/dul/
1       Received contains fake 'Post.cz' hostname
1       From an address @email-publisher.com
1       Bulk email software fingerprint (xmailer tag) found in headers
1       Bulk email software fingerprint (pascual) found in headers
1       Bulk email software fingerprint (eBizmailer) found in headers
1       Bulk email software fingerprint (charset) found in headers
1       Bulk email software fingerprint (Yam) found in headers
1       Bulk email software fingerprint (V3161) found in headers
1       Bulk email software fingerprint (Uproar) found in headers
1       Bulk email software fingerprint (Seednet) found in headers
1       Bulk email software fingerprint (PowerCampaign) found in headers
1       Bulk email software fingerprint (Opt-In Lightning) found in headers
1       Bulk email software fingerprint (Matchmaker) found in headers
1       Bulk email software fingerprint (Mail Bomber) found in headers
SpamAssassin Features
1       Bulk email software fingerprint (Henry Su) found in headers
1       Bulk email software fingerprint (GRMessageQueue) found in heade
1       Bulk email software fingerprint (EPaper) found in headers
1       Bulk email software fingerprint (DiffondiCool) found in headers
1       Bulk email software fingerprint (CurrentMailer) found in headers
1       Bulk email software fingerprint (Caretop) found in headers
1       Bulk email software fingerprint (Campaign Blaster) found in header
1       Bulk email software fingerprint ("outlook") found in headers
1       'Received:' contains huge hostname
1       'From' contains more than one address
1       Treść jest od wydawnictwa Verlag Dashofer (spamerzy)
1       Tresc zawiera 'Za zaliczeniem pocztowym...'
1       /zam.wieni/i
1       /zainteresowan.{0,50}wsp..prac/
1       /www\.adresy\.org/i
1       /specjaln.{0,50}ofert/i
SpamAssassin Features
1       Presentación de un nuevo producto.
1       Porno gratis.
1       Para dejar de fumar
1       Pago contra reembolso.
1       Nos animan a contestar si estamos interesados
1       No se puede considerar spam
1       Mensaje enviado por error
1       Mas informacion.
1       Los regalos no existen, salvo de nuestros amigos.
1       Inmigración legal (?) a los Estados Unidos
1       Informacion y reserva
1       If you want to subscribe...
1       If you send an email you will be OptOut
1       IMPERATIVOS EN MAYUSCULAS.
1       Haga click aqui.
1       Ha sido ganador.
SpamAssassin Features
1       Ha sido ganador.
1       El correo como alternativa comercial
1       Conviertete en Spammer.
1       Claims you can opt-out
1       Claims you can be removed in Spanish
1       Claims not to be spam in Spanish
1       Alta en buscadores hispanos.
1       spam software: PopLaunch
1       mentions Cyber FirePower!, a spam-tool
1       Will not Belive your Eyes!
1       Well known spam senders
1       Wants you to do business online
1       Things incredible
1       They keep your money -- No Refund!
1       Terms and conditions
1       Suspect you might have received the message by mistake
SpamAssassin Features
1       Slashed Price
1       SSPL found, spammer forgot to run the random-ID generator
1       Psychics Scam
1       Prices won't Last
1       Possible porn - Galleries of Pictures
1       Plugs "Natural Viagra"
1       Outstanding Values
1       Orders shipped by priority mail
1       No Middleman
1       No Medical Exams
1       No Gimmick
1       Nigerian scam, cf http://www.snopes2.com/inboxer/scams/nigeria.htm
1       New Customers Only
1       More Internet Traffic
1       Luxury Car
1       List removal information
SpamAssassin Features
1       Get Started Now
1       Cyber FirePower! rant about losing dropboxes
1       Confidentially on all orders
1       Claims you were on a list
1       Claims to listen to some removal request list
1       Claims not to be spam
1       Claims not to be selling anything
1       Claims compliance with spam regulations
1       Claims compliance with spam regulations
1       Claims "This is not junk email"
1       Cell Phone Cancer Scam
1       Buying judgements
1       Achieve Wealth
0.982   Talks about future mailings
0.977   Excessive quoted-printable encoding in body
0.975   Multi Level Marketing mentioned
SpamAssassin Features
0.968   Possible porn - Hardcore Porn
0.959   Missing To: header
0.954   From: has no local-part before @ sign
0.952   Targeted Traffic / Email Addresses
0.948   Information on getting a larger penis or breasts
0.947   Message is 70-90% HTML tags
0.935   Free Membership
0.931   To: and Cc: contain similar domains at least 8 times
0.910   Received contains a (dollar) variable reference
0.908   Claims compliance with spam regulations
0.906   'From' ebay.com does not match 'Received' headers
0.904   Unlimited in caps
0.900   Accept Credit Cards
0.893   From: ends in numbers
0.885   'Message-Id' was added by a relay (3)
0.882   Gives information about an opportunity
SpamAssassin Features
0.874   Don't delete me! Nooooo!!!!
0.863   Fast Viagra Delivery
0.853   Frequent SPAM content
0.849   exists:X-Stormpost-To
0.849   Missing Date: header
0.849   List removal information
0.838   Consolidate Debt and Credit
0.820   Financial Freedom
0.817   Lots and lots of Cc: headers
0.810   Received via a relay in multihop.dsbl.org
0.796   Contains word 'guarantee' in all-caps
0.795   Claims you can be removed from the list
0.781   Spam phrases score is 00 to 01 (low)
0.781   HTML message is a saved web page
0.781   Claims compliance with Senate Bill 1618
0.779   exists:X-PMFLAGS
SpamAssassin Features
0.676   See for yourself
0.673   You'd better read all of this spam!
0.670   Easy Terms
0.666   Contains "Toner Cartridge"
0.665   Human Growth Hormone
0.658   Trying to sell insurance online
0.653   No experience needed!
0.646   Claims to be legitimate email
0.643   Subject: starts with advertising tag
0.630   Frequent SPAM content
0.628   illegal Nigerian transactions (2)
0.622   Subject GUARANTEED
0.620   DNSBL: sender ip address in in a dialup block
0.614   Possible porn - Must be 18
0.612   Tells you to click on a URL (in caps)
0.612   Free Quote
SpamAssassin Features
0.611   Refinance Home
0.610   Received via a relay in relays.ordb.org
0.608   Contains 'free access' with capitals
0.606   Uses a long numeric IP address in URL
0.605   Have you been turned down?
0.601   Includes a URL link to send an email with the subject 'remove'
0.601   No Credit Check
0.600   No Inventory
0.594   To: has a malformed address
0.573   Be your own boss
0.563   Information on how to work at home (2)
0.560   Contains mail-in order form
0.556   One hundred percent guaranteed
0.553   Guaranteed Stuff
0.552   Information on mortgage rates
0.549   Frequent SPAM content
SpamAssassin Features
0.544   From and To the same (1)
0.542   Bulk email software fingerprint (screwup 2) found in headers
0.542   Gives an excuse for why message was sent
0.541   Avoid Bankruptcy
0.539   Includes a link for AOL users to click
0.536   Form for changing email address
0.531   Apply online (with capital O)
0.525   List removal information
0.521   Date: is 12 to 24 hours after Received: date
0.518   Asks you for your signature on a form
0.514   Subject talks about losing pounds
0.513   Lower Interest Rates
0.511   Do it Today
0.506   Unsecured Credit/Debt
0.506   The best Rates
0.505   From: starts with nums
SpamAssassin Features
0.505   Spam phrases score 55 or higher (high)
0.505   Impotence cure
0.503   Vacation Offers
0.503   Spam is 100% natural?!
0.501   Possible porn - Free Porn
0.501   Possible porn - Best, Largest Porn Collections
0.500   Spam phrases score is 01 to 02 (low)
0.496   Can not be combined with any other offer
0.489   Message contains disclaimer
0.488   Claims to be Legal
0.483   Subject is all capitals
0.466   MS-Outlook-style To "<Undisclosed-Recipient:;>"
0.466   Date: is 96 hours or more after Received: date
0.459   Spam tool pattern in MIME boundary
0.448   Date: is 6 to 12 hours before Received: date
0.448   Says: "to be removed, reply via email" or similar
SpamAssassin Features
0.448   Possible porn - Porn Fest
0.446   Sent with 'X-Priority' set to high
0.443   Local part containing a "4u" variant
0.443   HTML font color is magenta
0.435   Join Millions of Americans
0.434   Asks for a billing address
0.431   Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(doll
0.431   Claims "This is not spam"
0.429   Sent with 'X-Msmail-Priority' set to high
0.428   Subject contains "FREE" in CAPS
0.426   exists:X-MailingID
0.424   MIME section missing boundary
0.424   Asks you to fill out a form
0.422   HTML font color is unknown to us
0.422   Domain name containing a "4u" variant
0.421   HTML font color is yellow
SpamAssassin Features
0.419   Includes a link to send a mail with a subject
0.419   Standard investment opportunity spam
0.418   Javascript to hide URLs in browser
0.417   Offers Extra Cash
0.416   Eliminate Bad Credit
0.415   Lose Weight Spam
0.414   Subject talks about savings
0.414   Subject ends with lots of white space
0.414   Offers a full refund
0.414   Gives instructions for removal from list
0.413   Free Cell Phone
0.412   Frontpage used to create the message
0.411   Offers a limited time offer
0.410   Claims you can be removed from the list
0.408   Attempt at obfuscating the word "mortgage"
0.407   Opportunity - What a deal!
SpamAssassin Features
0.407   Nobody's perfect
0.406   Tells you about a strong buy
0.406   HTML table has thick border
0.406   Buy Direct
0.405   Instant Access button
0.405   HTML font color is green
0.405   HTML font color is cyan
0.405   Discusses money making
0.405   Asks you to click below (in caps)
0.404   Uses open redirection service
0.404   exists:X-ServerHost
0.404   Claims you can be removed from the list
0.403   List removal information
0.402   Message with extraneous Content-type:...type= header
0.402   There is no obligation.
0.402   Talks about lots of money
SpamAssassin Features
0.402   Contains 'Get it now' with capitals
0.401   Supplies are Limited
0.401   No such thing as a free lunch (2)
0.400   You won't be dissapointed.
0.400   Possible porn - Offers Instant Access
0.400   Nigerian scam key phrase ((dollar)NN,NNN,NNN.NN)
0.400   How dear can you be if you don't know my name?
0.386   No Strings Attached
0.382   HTML with embedded plugin object
0.380   Received via a relay in relays.osirusoft.com
0.369   Off Shore Scams
0.365   Information on how to work at home (1)
0.364   Possible porn - Hot, Nasty, Wild, Young
0.364   Contains word 'amazing' in all-caps
0.362   exists:X-SMTPExp-Version
0.362   There is no catch.
SpamAssassin Features
0.361  sent to you@you.com or similar
0.360  Received from first hop dialup listed in relays.osirusoft.com
0.344  HTML font color is same as background
0.336  Subject: is empty or missing
0.335  FONT Size +2 and up or 3 and up
0.334  Lowest Price
0.333  HTML font color has unusual name
0.333  Contains word 'profits' in all-caps
0.330  HTML font color is gray
0.329  What are you waiting for
0.329  One Time Rip Off
0.327  Talks about prizes
0.327  Free Website
0.326  To: and Cc: contain similar usernames at least 5 times
0.325  HTML font face is not a commonly used face
0.324  Quoted-printable line longer than 76 characters
SpamAssassin Features
0.324  From: has a malformed address
0.323  exists:X-SMTPExp-Registration
0.323  Message-Id has no @ sign
0.323  No such thing as a free lunch (1)
0.321  URL of CGI script called "unsubscribe" or "remove"
0.321  Satisfaction Guaranteed
0.321  "if you do not wish to receive any more"
0.320  Message contains a lot of ^M characters
0.320  exists:x-esmtp
0.320  Claims you are a winner
0.319  From: contains numbers mixed in with letters
0.318  Can't live without?
0.317  HTML mail with non-white background
0.315  Talks about email marketing
0.315  Save big money
0.315  HTML font color is red
SpamAssassin Features
0.315  3 WHOLE LINES OF YELLING DETECTED
0.313  Save Up To
0.313  Domain registration spam body
0.312  Tells you to click on a URL
0.312  Subject: domain registration spam subject
0.308  URL contains spamhaus signature: numbered servers
0.308  Name Brand
0.307  Asks you to click below
0.307  Act Now! Don't Hesitate!
0.306  Talks about Hidden Charges
0.305  Message is 50-70% HTML tags
0.304  While Supplies Last
0.302  Easily-executed JavaScript code
0.302  Subject starts with "Free"
0.302  HTML font color not within safe 6x6x6 palette
0.301  No Purchase Necessary
SpamAssassin Features
0.301  Auto-executing JavaScript code
0.300  DNSBL: sender is a Spamware site or vendor
0.300  Significant Savings
0.300  No Fees
0.300  Click-to-remove with PHP/ASP action found
0.299  X-Mailer header indicates a non-spam MUA (TheBat!)
0.296  'remove' URL contains an email address
0.294  Being a Member
0.281  Investment Decision
0.279  Date: is 3 to 6 hours before Received: date
0.245  Contains a Privacy Statement
0.242  Tells you how to stop further spam
0.239  Month Trial Offer
0.229  Save (dollar) (dollar) (dollar)
0.224  Sign up Free Today
0.222  To: repeats address as real name
SpamAssassin Features
0.218  Congratulations - you've been scammed?
0.217  2 WHOLE LINES OF YELLING DETECTED
0.216  Weekend Getaway
0.214  Trying to offer you something
0.212  Member Stuff
0.212  HTML font color is missing hash (
0.212  Doesn't ask any questions
0.212  Contains 'Special Promotion'
0.212  A WHOLE LINE OF YELLING DETECTED
0.211  To: is empty
0.211  Winning in Caps
0.211  Stuff on Sale
0.211  Only (dollar) (dollar) (dollar)
0.211  Encourages you to waste no time in ordering
0.210  Who really wins?
0.210  HTML font face has excess capital characters
SpamAssassin Features
0.209  Free DVD
0.207  Date: is 12 to 24 hours before Received: date
0.207  JavaScript code
0.206  Header with all capitals found
0.205  HTML font color is blue
0.204  Winner in Caps
0.204  HTML font face is not a word
0.204  Fantastic Deal
0.203  Includes a 'remove' email address
0.203  Includes a URL link to send an email
0.203  Possible porn - Large Number of movies, pics
0.203  Free Offer
0.202  Contains a tollfree number
0.201  illegal Nigerian transactions (1)
0.201  Image tag with an ID code to identify you
0.201  Frame wanted to load outside URL
SpamAssassin Features
0.201  Contains 'for only' some amount of cash
0.181  X-Mailer header indicates a non-spam MUA (Outlook Express)
0.150  Spam tool pattern in MIME boundary
0.146  Cancel at any time!
0.144  Talks about social security numbers
0.137  Click to perform an action on an account
0.134  Gives an excuse about why you were sent this spam
0.127  Nigerian scam key phrase ((dollar) NNN.Nm/USDNNN.N m/US(doll
0.123  Contains a comment with nothing but unique ID
0.117  No Claim Forms
0.114  'Message-Id' was added by a relay (2)
0.114  Free Trial
0.111  They're just giving it away!
0.108  Message-Id has characters indicating spam
0.107  Dear you@you.com?
0.106  Free Hosting
SpamAssassin Features
0.105  Contains an ASCII-formatted form
0.104  I wonder how many emails they sent in error...
0.103  URL of page called "unsubscribe"
0.102  Subject has exclamation mark and question mark
0.101  Offer Expires
0.101  Contains 'Dear Somebody'
0.100  Javascript protocol in a URI
0.100  Message includes Microsoft executable program
0.100  MIME filename does not match content
0.100  Spam tool pattern in MIME boundary
0.038  'Received:' has 'may be forged' warning
0.032  Message-Id is not valid, according to RFC 2822
0.031  Offers Coupon
0.028  Please read this! Please oh please oh please!
0.014  Shopping Spree
0.009  Contains a line >=199 characters long
SpamAssassin Features
0.009   Spam tool pattern in MIME boundary
0.009   Risk free. Suuurreeee....
0.008   Reserves the right
0.008   Expect to earn
0.005   Contains 'G.a.p.p.y-T.e.x.t'
0.004   Gift Certificate
0.003   Big Bucks
-0.006  X-Mailer header indicates a non-spam MUA (Outlook)
-0.019  From Majordomo
-0.026  Missing From: header
-0.069  Free money!
-0.075  Forwarded email (Outlook style)
-0.102  Email came from some known mailing list software
-0.118  Mailer daemon failure notice (1)
-0.123  Message text is over 40K in size
-0.133  Came via Internet Mail Service plugin
SpamAssassin Features
-0.137  Correct for MIME 'null block'
-0.143  X-Mailer header indicates a non-spam MUA (Netscape)
-0.196  Mailing list headers are suspicious
-0.200  exists:Resent-To
-0.207  exists:X-Authentication-Warning
-0.211  Where are you working at?
-0.215  exists:X-Accept-Language
-0.217  Subject contains newsletter header (list)
-0.231  'Message-Id' was added by yahoo.com, that's OK
-0.233  exists:X-Loop
-0.240  X-Mailer header indicates a non-spam MUA (AOL)
-0.298  To: repeats local-part as real name
-0.300  User-Agent header indicates a non-spam MUA (Entourage)
-0.301  Short signature present (no empty lines)
-0.302  exists:X-Mailing-List
-0.304  Long signature present (empty lines)
SpamAssassin Features
-0.484  Subject contains a month name - probable newsletter (2)
-0.484  Subject contains a month name - probable newsletter
-0.489  Common footer for Hotmail
-0.506  Contains a PGP-signed message
-0.506  Appears to be from yahoo groups
-0.506  Yahoo! Groups message
-0.518  exists:User-Agent
-0.522  Has a valid-looking References header
-0.558  Forwarded email
-0.601  User-Agent header indicates a non-spam MUA (Mozilla)
-0.605  User-Agent header indicates a non-spam MUA (Outlook Express)
-0.616  Subject contains newsletter header (news)
-0.641  Message-Id indicates a non-spam MUA (Pine)
-0.695  Contains what looks like an 'E-Mail Disclaimer'
-0.708  Contains a PGP-signed message (signature attached)
-0.708  Message text is over 20K in size
SpamAssassin Features
-0.725  Subject contains a frequency - probable newsletter
-0.754  X-Mailer header indicates a non-spam MUA (T-Offline)
-0.832  Contains what looks like a quoted email text
-0.847  exists:In-Reply-To
-0.864  Has an Approved-By moderated list header
-0.897  User-Agent header indicates a non-spam MUA (IMP)
-0.949  Contains what looks like a patch from diff -u
-0.986  Mailer daemon failure notice (2)
-1      X-Mailer header indicates a non-spam MUA (Gnus)
-1      User-Agent header indicates a non-spam MUA (Gnus)
-1      Subject contains newsletter header (in review)
-1      From: looks like US Telephone Number
-1      recommended page from MailBits.com
-1      Talks about tracking numbers
-1      Common footer for MSN
-1      A MailMan confirm-your-address message
SpamAssassin Features
-1.118  Common footer for MSN
-1.128  Contains a password retrieval system
-1.152  Something about registration
-1.176  User-Agent header indicates a non-spam MUA (Mutt)
-1.301  Came from MSN Communities
-1.334  exists:X-Cron-Env
-1.433  Subject looks like order info
-1.451  From the Mailer-Daemon
-1.596  Subject contains a date
-1.628  Contains what looks like an email attribution
-1.696  Common footer for Hotmail
-1.780  X-Mailer header indicates a non-spam MUA (AppleMail)
-1.801  Common footer for Hotmail
-1.898  Sent through Microsoft's ListBuilder service
-2.092  Short signature present (empty lines)
-2.170  Common footer for Hotmail
SpamAssassin Features
-2.174  Message from eBay
-2.442  Contains what looks like a patch from diff -c
-2.473  Looks like a Debian BTS bug
-2.475  Common footer for Hotmail
-2.550  Subject is an eBay question
-2.699  Looks like a Bugzilla bug
-2.863  User-Agent header indicates a non-spam MUA (KMail)
-3.052  non-spam Yahoo! Groups banner found
-3.127  Long signature present (no empty lines)
-4.0    Uses the Habeas warrant mark (http://www.habeas.com/)
-6      User is listed in 'whitelist_to'
-10     Not Matt's Scripts formmail.pl
-10     Bonded sender, see http://www.bondedsender.org/referred.html
-20     User is listed in 'more_spam_to'
-100    User is listed in 'all_spam_to'
-100    From: address is in the user's white-list
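A minimal sketch of how a weight list like this could be used, assuming (it is not stated on these slides) that a message's score is the sum of the weights of the features that fire, and that the message is called spam once the score reaches a threshold. The four weights are copied from the lists above; the names spam_score and is_spam are made up for illustration, and the 5.0 threshold is an assumed default that a user would tune.

```python
# Weights copied from the feature lists above; everything else is illustrative.
WEIGHTS = {
    "Nobody's perfect": 0.407,
    "Talks about lots of money": 0.402,
    "Message from eBay": -2.174,
    "From: address is in the user's white-list": -100.0,
}

def spam_score(fired, weights=WEIGHTS):
    """Sum the weights of the features that fired on one message."""
    return sum(weights[name] for name in fired)

def is_spam(fired, threshold=5.0, weights=WEIGHTS):
    """Assumed decision rule: spam if the total score reaches the threshold."""
    return spam_score(fired, weights) >= threshold

print(spam_score(["Nobody's perfect", "Talks about lots of money"]))
print(is_spam(["Talks about lots of money",
               "From: address is in the user's white-list"]))
```

Under this rule a single large negative weight (such as the white-list feature at -100) overrides any realistic number of small positive ones, which may be why those entries are so much larger in magnitude than the rest.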
How to Categorize?
(unsupervised)
What if we don’t have supervised training data?
Might try an iterative approach as usual:
1. Cluster the messages
2. Train n-gram, Naive Bayes, or decision list
model to discriminate among the clusters
3. Use the model to reassign messages to clusters
(most will stay put but some will move)
4. Return to step 2 until convergence
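The loop above can be sketched as follows. This is only an illustration under assumptions: step 1 (the initial clustering) is taken as given in the form of a starting label list, the model is a bag-of-words Naive Bayes with add-one smoothing, and the function names train_nb, classify and recluster are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, k, vocab):
    """Step 2: fit an add-one-smoothed Naive Bayes model, one class per cluster."""
    model = []
    for c in range(k):
        members = [d for d, y in zip(docs, labels) if y == c]
        counts = Counter(w for d in members for w in d)
        total = sum(counts.values())
        log_prior = math.log((len(members) + 1) / (len(docs) + k))
        log_like = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model.append((log_prior, log_like))
    return model

def classify(model, doc):
    """Return the index of the cluster with the highest log score."""
    scores = [lp + sum(ll[w] for w in doc if w in ll) for lp, ll in model]
    return max(range(len(model)), key=scores.__getitem__)

def recluster(docs, labels, k, max_iters=20):
    """Steps 2-4: retrain and reassign until no message changes cluster."""
    vocab = {w for d in docs for w in d}
    for _ in range(max_iters):
        model = train_nb(docs, labels, k, vocab)
        new_labels = [classify(model, d) for d in docs]   # step 3
        if new_labels == labels:                          # converged
            break
        labels = new_labels
    return labels

docs = [
    ["cheap", "pills", "buy"],
    ["buy", "cheap", "pills", "now"],
    ["meeting", "agenda", "monday"],
    ["agenda", "meeting", "notes"],
    ["cheap", "pills", "now"],
]
# Assumed output of step 1; the last message starts in the "wrong" cluster.
print(recluster(docs, [0, 0, 1, 1, 1], k=2))
```

In this toy run most messages stay put and only the last one moves, which is the behavior the slide describes for step 3.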
How to Categorize?
(semisupervised)
What if we have only a little supervised data?
Could try bootstrapping like Yarowsky’s WSD:
1. Start with very small, rather accurate classes
2. Train n-gram, Naive Bayes, or decision list
model to discriminate among the classes
3. Augment each class with new messages
that the model confidently classifies there
(maybe also move or remove some existing messages)
4. Return to step 2 until convergence
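A rough sketch of this bootstrapping loop, under stated assumptions: the model is again an add-one-smoothed Naive Bayes, "confidently classifies" is taken to mean that the posterior of the best class reaches a threshold, and the optional move/remove part of step 3 is left out. The names train_nb, posteriors and bootstrap, and the threshold values, are hypothetical.

```python
import math
from collections import Counter

def train_nb(labeled, classes, vocab):
    """Step 2: add-one-smoothed Naive Bayes from (doc, label) pairs."""
    model = {}
    for c in classes:
        docs = [d for d, y in labeled if y == c]
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values())
        log_prior = math.log((len(docs) + 1) / (len(labeled) + len(classes)))
        log_like = {w: math.log((counts[w] + 1) / (total + len(vocab)))
                    for w in vocab}
        model[c] = (log_prior, log_like)
    return model

def posteriors(model, doc):
    """Normalized class probabilities for one document."""
    logs = {c: lp + sum(ll.get(w, 0.0) for w in doc)
            for c, (lp, ll) in model.items()}
    top = max(logs.values())
    exps = {c: math.exp(s - top) for c, s in logs.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def bootstrap(seed, unlabeled, classes, threshold=0.9, max_iters=10):
    """Steps 2-4: grow each class with confidently classified messages."""
    labeled, pool = list(seed), list(unlabeled)
    vocab = {w for d, _ in labeled for w in d} | {w for d in pool for w in d}
    for _ in range(max_iters):
        model = train_nb(labeled, classes, vocab)
        confident, rest = [], []
        for doc in pool:
            post = posteriors(model, doc)
            best = max(post, key=post.get)
            (confident if post[best] >= threshold else rest).append((doc, best))
        if not confident:          # nothing new to add: stop
            break
        labeled.extend(confident)  # step 3: augment the classes
        pool = [doc for doc, _ in rest]
    return labeled, pool

# Step 1: a very small seed set (documents are tuples of words).
seed = [(("cheap", "pills"), "spam"), (("meeting", "agenda"), "ham")]
unlabeled = [("cheap", "pills", "now"),
             ("agenda", "meeting", "monday"),
             ("now", "offer")]
labeled, pool = bootstrap(seed, unlabeled, ["spam", "ham"], threshold=0.6)
print(labeled, pool)
```

In this toy run the message ("now", "offer") cannot be classified from the seed alone; it becomes classifiable only after ("cheap", "pills", "now") has joined the spam class, which is the point of iterating. With a stricter threshold it would simply stay unlabeled.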
How to Categorize?
(adaptive)
What if we gradually get more new data over time?
 User feedback (active or passive) on our classifications
 News / email systems that categorize, or judge relevance
 Add new articles / messages to training data
 If they’re unlabeled (no supervision), label them automatically
 Add them only if we’re confident? Add them fractionally, like EM?
So model adjusts over time:
 E.g., change the cluster centroids or n-gram parameters
 May want to weight the more recent data more heavily,
since the future is more like the present than the past
 E.g., message from k days ago has weight 0.9^k (k=0,1,2, ...)
 So today’s model = today’s data + 0.9 * yesterday’s model
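A small sketch of that update rule, assuming the "model" is nothing more than a table of word counts (a real system might instead decay n-gram counts or cluster centroids). The function name update and the toy data are hypothetical.

```python
from collections import Counter

def update(model, today_counts, decay=0.9):
    """today's model = today's data + decay * yesterday's model."""
    new_model = Counter()
    for word, weight in model.items():
        new_model[word] = decay * weight
    for word, count in today_counts.items():
        new_model[word] += count
    return new_model

# Three days on which the word "offer" is seen once per day.
model = Counter()
for day_counts in [Counter(offer=1), Counter(offer=1), Counter(offer=1)]:
    model = update(model, day_counts)

# After the loop, "offer" has weight 1 + 0.9 + 0.9**2: the message from
# k days ago contributes 0.9**k, as on the slide.
print(model["offer"])
```

Applying the one-step update every day gives the same result as storing every past message with weight 0.9^k, without having to keep the old messages around.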
How to Categorize?
(hierarchical)
What if we are putting a document in a Yahoo! category?
 There are thousands of categories (at least) – too hard!
 Choose one of the 14 top-level categories, e.g., Science
 Then use a Science-specific classifier to choose one of the
54 second-level categories within Science (14 are symlinks)
 Continue working your way down the tree ...
 When you can’t classify with high confidence, ask a human
(then use the human’s answer as more training data)
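A toy sketch of this top-down procedure. Everything here is an assumption made for illustration: each internal node carries its own classifier returning a (child name, confidence) pair, the keyword-overlap classifier stands in for whatever model is actually trained at each node, and the names Node, keyword_classifier, categorize and the 0.8 confidence cutoff are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    classify: object = None                         # doc -> (child name, confidence)
    children: dict = field(default_factory=dict)    # child name -> Node
    extra_training: list = field(default_factory=list)

def keyword_classifier(keywords):
    """Stand-in classifier: pick the child whose keyword set overlaps most."""
    def classify(doc):
        hits = {child: len(words & set(doc)) for child, words in keywords.items()}
        best = max(hits, key=hits.get)
        total = sum(hits.values())
        return best, (hits[best] / total if total else 0.0)
    return classify

def categorize(doc, root, ask_human, min_conf=0.8):
    """Walk down the tree; fall back to a human when confidence is low."""
    path, node = [root.name], root
    while node.children:
        child, conf = node.classify(doc)
        if conf < min_conf:
            child = ask_human(doc, node)
            node.extra_training.append((doc, child))   # keep the answer for retraining
        node = node.children[child]
        path.append(node.name)
    return path

science = Node(
    "Science",
    keyword_classifier({"Physics": {"quantum", "physics"},
                        "Biology": {"cell", "gene"}}),
    {"Physics": Node("Physics"), "Biology": Node("Biology")},
)
root = Node(
    "Top",
    keyword_classifier({"Science": {"physics", "biology", "experiment"},
                        "Sports": {"football", "score"}}),
    {"Science": science, "Sports": Node("Sports")},
)

print(categorize(["quantum", "physics", "experiment"], root,
                 ask_human=lambda doc, node: "Biology"))
```

Each node only has to separate its own handful of children, so no single classifier faces the full set of thousands of categories.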