Eric P. Jiang
University of San Diego
SIAM International Conference on Data Mining 2009 – Text Mining Workshop – Sparks, Nevada – May 2, 2009
• Spam is a plague on the Internet; during the 1st quarter of 2008, spam accounted for about 9 out of every 10 emails sent over the Internet
• Spam filtering can be performed
  - At the server level (e.g., by querying DNSBL in real time) or
  - At the client level (e.g., by examining email content in greater detail)
• Each approach has pros and cons, and it would be better to combine both approaches
• For content filtering, supervised machine learning for text classification can be applied
• This study considers 5 content-based algorithms:
  - Naïve Bayes
  - SVM
  - LogitBoost
  - Augmented LSI space
  - RBF network
• We evaluate the algorithms by
  - Applying them directly to 2 spam corpora constructed from 2 different languages
  - Varying the feature size to analyze the usefulness of feature selection to the algorithms
• Spam filtering can be cost-sensitive
  - False positive errors are generally more expensive
• Primary objectives of the work
  - To understand whether and to what extent the algorithms are applicable to the cost-sensitive spam filtering problem
  - To identify which characteristics of the algorithms contribute to this applicability
• Naïve Bayes
  - A probabilistic learning algorithm based on Bayesian decision theory
  - For spam classification, the probability of a message d being in class c is estimated by P(c|d) ≈ P(c) Π_k P(t_k|c) (sketched in code below)
  - It is based on the naïve assumption that a feature in a class is completely independent of any other features
  - In practice, it can work surprisingly well and produce impressive classification results
  - The implementation of naïve Bayes has linear complexity
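As a rough, hypothetical illustration of the estimate above (not the paper's implementation), the following Python sketch scores a tokenized message under each class using precomputed class priors and smoothed per-class token probabilities:

import math
from collections import Counter

def nb_log_score(tokens, prior, token_prob):
    # log P(c) + sum over tokens of log P(t_k | c), all probabilities precomputed
    score = math.log(prior)
    for t, n in Counter(tokens).items():
        score += n * math.log(token_prob.get(t, 1e-9))  # tiny floor for unseen tokens
    return score

def nb_classify(tokens, priors, token_probs):
    # pick the class (spam or ham) with the larger log-score
    # e.g. priors = {"spam": 0.44, "ham": 0.56}; token_probs["spam"] maps token -> P(t|spam)
    return max(priors, key=lambda c: nb_log_score(tokens, priors[c], token_probs[c]))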
• LogitBoost
  - A popular boosting algorithm that implements forward stage-wise modeling to fit an additive logistic regression model
  - It adds base (weak) learners iteratively and updates sample weights adaptively through the iterations
  - For spam classification, if f_m is the m-th base learner, then the probability of a message d being in class c is estimated by P(c|d) = e^F(d) / [1 + e^F(d)], where F(d) = ½ Σ_m f_m(d) (sketched in code below)
  - It uses a decision stump as the base learner
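A minimal sketch of the probability estimate above, assuming the decision-stump base learners f_m have already been fitted; the stump shown is a hypothetical single-feature threshold, and a full LogitBoost implementation would also reweight the training samples at every iteration:

import math

def logitboost_spam_prob(d, stumps):
    # F(d) = 1/2 * sum_m f_m(d); P(spam | d) = e^F / (1 + e^F)
    F = 0.5 * sum(f(d) for f in stumps)
    return 1.0 / (1.0 + math.exp(-F))

# hypothetical decision stump: +1 / -1 vote based on a threshold on feature index 42
stump = lambda d: 1.0 if d[42] > 0.5 else -1.0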
• SVM
  - A top choice and widely used for text classification
  - It uses linear models to implement nonlinear class boundaries by transforming instance spaces through mappings
  - It maximizes hyperplane margins
  - Nonlinear cases can be solved by kernel functions
  - We use an SVM with a linear kernel
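For reference, a linear-kernel SVM of this kind can be trained with a standard library call; the scikit-learn classifier and the tiny feature matrix below are illustrative assumptions, not the paper's setup:

import numpy as np
from sklearn.svm import LinearSVC

# toy data: rows are messages encoded as feature-weight vectors (hypothetical values)
X_train = np.array([[0.0, 1.2, 0.0], [0.9, 0.0, 0.3], [0.0, 1.5, 0.2], [1.1, 0.0, 0.0]])
y_train = np.array([1, 0, 1, 0])            # 1 = spam, 0 = ham

clf = LinearSVC(C=1.0)                      # linear kernel
clf.fit(X_train, y_train)
print(clf.predict(np.array([[0.0, 1.3, 0.1]])))  # likely classified as spam on this toy data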
• Augmented LSI spaces
  - LSI is a well-known conceptual IR approach using SVD
  - It can be used for text classification by changing the notion of query relevance to the notion of category membership
  - LSI is completely unsupervised
    o When applied to email classification, the important category information in the training data should be exploited to boost model accuracy
  - The augmented LSI space model applies
    o An unsupervised-supervised combined feature selection procedure
    o Two separate LSI spaces, one for each email category
• Augmented LSI spaces
  - Conceptually, individualized LSI spaces should offer more accurate content profiles, but in practice they can still encounter difficulty in spam classification
  - We construct an augmented LSI space by adding some training samples that are close to the class in appearance but belong to the other class in label
  - Cluster centroids are used to expand the training samples for the learning spaces
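A generic per-category LSI sketch (illustrative only; the augmented construction described above would additionally include nearby opposite-class samples and cluster centroids in each space): each category's term-document matrix is reduced with an SVD, and a new message is assigned to the category whose space captures more of it.

import numpy as np

def lsi_basis(term_doc_matrix, k):
    # term_doc_matrix: terms x documents for ONE category (possibly augmented)
    U, _, _ = np.linalg.svd(term_doc_matrix, full_matrices=False)
    return U[:, :k]                    # k leading left singular vectors

def fit_score(q, U_k):
    proj = U_k @ (U_k.T @ q)           # project message vector q onto the LSI space
    return np.linalg.norm(proj) / (np.linalg.norm(q) + 1e-12)

def classify(q, U_spam, U_ham):
    return "spam" if fit_score(q, U_spam) >= fit_score(q, U_ham) else "ham"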
• RBF neural networks
  - Radial basis function nets have many applications in science and engineering
  - An RBF network has a feed-forward structure with 3 layers: input, a processing middle layer, and output
  - The middle layer neurons use a nonlinear RBF function Φ as their activations
  - The output layer neurons use a weighted sum of the middle layer activations
[Figure: three-layer RBF network diagram with input units x, middle-layer RBF units Φ, and output units y]
• RBF neural networks
  - RBF training can be done by a global optimization algorithm, but it is more computationally efficient to use a two-stage procedure for determining the network parameters
  - The first stage forms a representation of the density distribution in the input space in terms of the RBF parameters
    o Can be done by unsupervised clustering models
  - The second stage determines the weights of the output layer
    o Can be done by supervised linear models
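A minimal sketch of such a two-stage procedure, under the assumption of Gaussian basis functions, k-means clustering for the unsupervised first stage, and ridge regression for the supervised second stage (the paper's exact choices may differ):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def train_rbf(X, y, n_centers=20, width=1.0):
    # stage 1: summarize the input-space density with cluster centers (unsupervised)
    centers = KMeans(n_clusters=n_centers, n_init=10).fit(X).cluster_centers_

    def phi(Z):
        # middle-layer activations: Gaussian RBF around each center
        d2 = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-d2 / (2.0 * width ** 2))

    # stage 2: fit the output-layer weights with a supervised linear model
    output = Ridge(alpha=1e-3).fit(phi(X), y)
    return lambda Z: output.predict(phi(Z))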
• Feature selection
  - Two objectives:
    o Reducing the dimensionality of the feature space while preserving email content
    o Eliminating irrelevant features, which is particularly useful for some algorithms (e.g., RBF networks)
  - Two steps:
    o Unsupervised – removing stop words, applying word stemming, and removing very low-frequency and very high-frequency words
    o Supervised – using frequency distributions to identify the features that are distributed most differently between spam and ham (e.g., using Information Gain; see the sketch below)
  - Features can be reduced from about 20k to tens, hundreds, or thousands
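A hypothetical sketch of the supervised step using Information Gain computed from each term's spam/ham document frequencies; the terms with the largest gain would be retained:

import math

def entropy(p):
    return 0.0 if p in (0.0, 1.0) else -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def info_gain(n_spam_with, n_ham_with, n_spam, n_ham):
    # gain of a binary term feature, from its document frequencies in each class
    N = n_spam + n_ham
    n_with = n_spam_with + n_ham_with
    n_without = N - n_with
    gain = entropy(n_spam / N)
    if n_with:
        gain -= (n_with / N) * entropy(n_spam_with / n_with)
    if n_without:
        gain -= (n_without / N) * entropy((n_spam - n_spam_with) / n_without)
    return gain

# e.g., keep the 500 terms with the largest info_gain over the training corpus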
• Each message is encoded as a numeric vector of the values of the retained features
• Each feature value in a vector represents a combination of the feature's local and global weights
  - Experiments indicate that this weight coding is more informative than a simple binary coding
• The traditional log(tf)-idf weighting scheme is used
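One plausible reading of this weighting (the exact local/global variant is an assumption): a local weight log(1 + tf) multiplied by a global weight idf = log(N / df) for every retained feature.

import math
from collections import Counter

def encode(tokens, vocab, doc_freq, n_docs):
    # vocab: retained features; doc_freq[t]: number of training messages containing t
    tf = Counter(tokens)
    return [math.log(1 + tf[t]) * math.log(n_docs / doc_freq[t]) if t in tf else 0.0
            for t in vocab]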
• Spam filtering can be cost-sensitive, i.e., false positive errors are more costly than false negatives
• Most traditional measures do not take such an unbalanced cost into consideration
• We use the Weighted Accuracy measure (see the code below): WA(λ) = [λ·nTN + nTP] / [λ·(nTN + nFP) + (nTP + nFN)]
• It may be debatable whether a misclassification cost can be quantified by a constant
  - We use λ = 9 or a similar quantity to observe if and how the performance of an algorithm changes when a cost-sensitive condition is imposed
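The measure above translates directly to code; the counts in the example call are hypothetical:

def weighted_accuracy(n_tp, n_tn, n_fp, n_fn, lam=9):
    # ham (the legitimate, negative class) is weighted lam times more heavily
    return (lam * n_tn + n_tp) / (lam * (n_tn + n_fp) + (n_tp + n_fn))

print(weighted_accuracy(n_tp=470, n_tn=600, n_fp=18, n_fn=11, lam=9))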
• We use two public spam testing corpora of real email messages, each collected by a single user (X and Y, respectively)
• PU1 Dataset
  - Has 618 ham and 481 spam messages
  - Email messages are numerically encoded
• ZH1 Dataset
  - Has 428 ham and 1,205 spam messages
  - Constructed similarly to PU1, but
  - Written in Chinese (which has a vastly different linguistic structure, a huge vocabulary, and no explicit word boundaries)
• Email content here refers to the subject line and body parts
  - A limit imposed by the corpora we used
  - All of the algorithms, however, would work for expanded content (e.g., by including additional header fields)
  - Better filtering results can be expected with expanded email content
• Features are statistically extracted from the text in the email subject and body
  - Alternatively, they can also be generated heuristically by a rule-based system (e.g., SpamAssassin)
  - It should be interesting and useful to combine both
• Evaluation is done by 10-fold cross-validation (a code sketch follows this slide)
  - A corpus is partitioned into 10 equally sized subsets, and each experiment takes one subset for testing and the remaining subsets for training
  - The process repeats 10 times, with each subset taking a turn for testing
  - The performance is evaluated by averaging over the 10 experiments
• Feature size
  - We use various sizes, ranging from 50 to 1,650 with an increment of 100, to analyze the usefulness of feature selection to the algorithms
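A sketch of the protocol described above, using scikit-learn's KFold as an assumed convenience (any equivalent partitioning would do); X and y are NumPy arrays, and make_classifier returns a fresh estimator with fit/predict:

import numpy as np
from sklearn.model_selection import KFold

def ten_fold_cv(X, y, make_classifier, score):
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
        clf = make_classifier().fit(X[train_idx], y[train_idx])
        scores.append(score(y[test_idx], clf.predict(X[test_idx])))
    return np.mean(scores)   # average over the 10 experiments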
• Spam filtering is a special and challenging text classification task
  - Two categories (ham and spam)
  - Cost-sensitive, with unbalanced misclassification costs
  - Very difficult (many spam messages are carefully crafted to look like ham email)
• Some category characteristics should not be overlooked
  - Ham email has in general a broader vocabulary than spam email
  - Ham email has a more eclectic subject matter than spam email
• We would like to present some characteristics of the individual algorithms revealed by our experiments and analysis
• Naïve Bayes (NB)
  - Simple, and fast in model learning
  - Works well for general text classification
  - Can benefit from effective feature selection (due to its simplistic feature independence assumption)
  - Can perform poorly if data sets have heavy feature dependencies, which can lead to inaccurate probability estimation (e.g., the Chinese dataset ZH1)
• LogitBoost (LB)
  - Simple base learner, but the ensemble construction can still take time
  - Generally delivers competitive results
  - Seems insensitive to feature size – large feature sizes may not help improve performance, so a relatively small feature size such as 250 may be used
  - Its ability to learn a category profile may be influenced by the number of available training samples
• Support Vector Machines (SVM)
  - Very stable and scalable with respect to feature dimensionality
  - Consistently performs as the best or a very competitive classifier in this study
• Support Vector Machines (SVM)
  - Provides superior results, particularly when cost-insensitive classification is concerned
  - Relatively fast in model training
• Radial Basis Function Networks (RBF)
  - Based on an RBF network with a fast two-stage training procedure
  - Performs reasonably well, in particular when used in cost-sensitive learning
  - Seems sensitive to feature size, and excessive feature reduction should be avoided
• Augmented LSI spaces (LSI)
  - Constructs separate LSI spaces, one for each category
  - A very reliable classifier with consistently good results
  - Like RBF, it seems well suited to cost-sensitive spam filtering, in part due to its integrated clustering component for constructing the augmented LSI learning spaces
  - Good performance generally requires a feature size of about 500 or larger
  - Model training can be expensive when the feature size gets very large
• This study considers 5 algorithms (among the most popularly used or most recently proposed) for evaluation
• Experiments and analysis have shown that
  - Overall, LSI, RBF and SVM are the top performers
  - Both LSI and RBF show their strength when applied to cost-sensitive spam filtering
  - Algorithms for spam filtering can likely benefit from an integrated clustering process that enhances the accuracy of their ham email profiles