Eric P. Jiang, University of San Diego
SIAM International Conference on Data Mining 2009 – Text Mining Workshop – Sparks, Nevada – May 2, 2009

• Spam is a plague on the Internet; during the 1st quarter of 2008, spam accounted for about 9 out of every 10 email messages sent over the Internet
• Spam filtering can be performed
  – At the server level (e.g., by querying DNSBL in real time), or
  – At the client level (e.g., by examining email content in greater detail)
• Each approach has pros and cons, and it is better to combine both
• For content filtering, supervised machine learning for text classification can be applied

• This study considers 5 content-based algorithms:
  – Naïve Bayes
  – SVM
  – LogitBoost
  – Augmented LSI space
  – RBF network
• We evaluate the algorithms by
  – Applying them directly to 2 spam corpora constructed from 2 different languages
  – Varying the feature size to analyze the usefulness of feature selection to the algorithms
• Spam filtering can be cost-sensitive
  – False positive errors are generally more expensive
• Primary objectives of the work
  – To understand whether, and to what extent, the algorithms are applicable to the cost-sensitive spam filtering problem
  – To identify which characteristics of the algorithms contribute to this applicability

• Naïve Bayes
  – A probabilistic learning algorithm based on Bayesian decision theory
  – For spam classification, the probability of a message d being in class c is estimated by P(c|d) ≈ P(c) · Π_k P(t_k|c), where the t_k are the features of d
  – It is based on the naïve assumption that a feature in a class is completely independent of any other feature
  – In practice, it can work surprisingly well and produce impressive classification results
  – The implementation of naïve Bayes has linear complexity
• LogitBoost
  – A popular boosting algorithm that implements forward stage-wise modeling to form an additive logistic regression
  – It adds base (weak) learners iteratively and updates sample weights adaptively through the iterations
  – For spam classification, if f_m is the m-th base learner, the probability of a message d being in class c is estimated by P(c|d) = e^F(d) / [1 + e^F(d)], where F(d) = ½ Σ_m f_m(d)
  – It uses a decision stump as the base learner

• SVM
  – A top choice, widely used for text classification
  – It uses linear models to implement nonlinear class boundaries by transforming instance spaces through mappings
  – It maximizes hyperplane margins
  – Nonlinear cases can be handled by kernel functions
  – We use an SVM with a linear kernel
• Augmented LSI spaces
  – LSI is a well-known conceptual IR approach based on the SVD
  – It can be used for text classification by changing the notion of query relevance to the notion of category membership
  – LSI is completely unsupervised
    o When applied to email classification, the important category information in the training data should be exploited to boost model accuracy
  – The augmented LSI space model applies
    o An unsupervised-supervised combined feature selection procedure
    o Two separate LSI spaces, one for each email category

• Augmented LSI spaces (continued)
  – Conceptually, individualized LSI spaces should offer more accurate content profiles, but in practice they can still encounter difficulty in spam classification
  – We construct an augmented LSI space by adding some training samples that are close to the class in appearance but belong to the other class in label
  – Cluster centroids are used to expand the training samples for the learning spaces (a minimal sketch of the two-space classification idea follows this list)
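The two-space classification idea above can be sketched with a few lines of linear algebra. The sketch below is illustrative only: it assumes tf-idf document vectors are already available as the columns of one terms-by-documents matrix per category, the augmentation step (borderline samples and cluster centroids) is indicated only by a comment, and the function names are hypothetical rather than the paper's implementation.

```python
import numpy as np

def build_lsi_space(doc_vectors, rank):
    """Return an orthonormal basis (terms x rank) spanning one category's LSI space.

    doc_vectors: terms x docs matrix of tf-idf vectors for ham or spam; in the
    augmented model its columns would also include borderline samples from the
    other category and cluster centroids of the training data.
    """
    u, _, _ = np.linalg.svd(doc_vectors, full_matrices=False)
    return u[:, :rank]

def similarity_to_space(basis, query):
    """How much of the query vector lies inside the space (cosine to its projection)."""
    projection = basis @ (basis.T @ query)
    return np.linalg.norm(projection) / (np.linalg.norm(query) + 1e-12)

def classify(query, ham_basis, spam_basis):
    """Assign the message to whichever category space represents it better."""
    if similarity_to_space(spam_basis, query) >= similarity_to_space(ham_basis, query):
        return "spam"
    return "ham"
```

A new message is thus assigned to the category whose space reconstructs its vector better, mirroring the category-membership reading of LSI described above.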
• RBF neural networks
  – Radial basis function networks have many applications in science and engineering
  – An RBF network has a feed-forward structure with 3 layers: input, a processing middle layer, and output
  – The middle-layer neurons use a nonlinear RBF function Φ as their activation
  – The output-layer neurons use a weighted sum of the middle-layer activations
  [Figure: a three-layer RBF network with inputs x_1, …, x_k, middle-layer RBF units Φ_1, …, Φ_m, and outputs y_1, …, y_n]

• RBF neural networks (continued)
  – RBF training can be done by a global optimization algorithm, but it is computationally more efficient to use a two-stage training procedure to determine the network parameters
  – The first stage forms a representation of the density distribution in the input space in terms of the RBF parameters
    o Can be done by unsupervised clustering models
  – The second stage determines the weights of the output layer
    o Can be done by supervised linear models
• Feature selection
  – Two objectives:
    o Reducing the dimensionality of the feature space while preserving email content
    o Eliminating irrelevant features, which is particularly useful for some algorithms (e.g., RBF networks)
  – Two steps:
    o Unsupervised – removing stop words, applying word stemming, and removing very low-frequency as well as very high-frequency words
    o Supervised – using frequency distributions to identify the features that are distributed most differently between spam and ham (e.g., using Information Gain)
  – Features can be reduced from about 20k to tens, hundreds, or thousands

• Each message is encoded as a numeric vector of the values of the retained features
• Each feature value in a vector combines the feature's local and global weights
  – Experiments indicate that such a weighted coding is more informative than a simple binary coding
• The traditional log(tf)-idf weighting scheme is used (the information-gain ranking and this weighting are sketched after this group of slides)

• Spam filtering can be cost-sensitive, i.e., false positive errors are more costly than false negatives
• Most traditional measures do not take such an unbalanced cost into consideration
• We use the Weighted Accuracy measure:
  WA(λ) = [λ·n_TN + n_TP] / [λ·(n_TN + n_FP) + (n_TP + n_FN)]
  where n_TN and n_FP count ham messages classified as ham and as spam, and n_TP and n_FN count spam messages classified as spam and as ham
• It is debatable whether a misclassification cost can be quantified by a constant
  – We use λ = 9 or a similar quantity to observe if and how the performance of an algorithm changes when a cost-sensitive condition is imposed

• We use two public spam testing corpora of real email messages, each collected by a single user
• PU1 dataset
  – Has 618 ham and 481 spam messages
  – Email messages are numerically encoded
• ZH1 dataset
  – Has 428 ham and 1,205 spam messages
  – Constructed similarly to PU1, but written in Chinese (which has a vastly different linguistic structure, a huge vocabulary, and no explicit word boundaries)
• Email content here refers to the subject line and body parts
  – A limit imposed by the corpora we used
  – All algorithms, however, would work with expanded content (e.g., by including additional header fields)
  – Better filtering results can be expected with expanded email content
• Features are statistically extracted from the text in the email subject and body
  – Alternatively, they could also be generated heuristically by a rule-based system (e.g., SpamAssassin)
  – It should be interesting and useful to combine both
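As a concrete illustration of the supervised selection step and the weighting scheme above, here is a minimal sketch. It assumes messages are given as bags of terms, uses the common (1 + log tf)·log(N/df) form of log(tf)-idf, and its function names are hypothetical rather than the exact implementation evaluated in the study.

```python
import math
from collections import Counter

def information_gain(docs, labels, term):
    """Information gain of a binary term-presence feature with respect to spam/ham labels.

    docs: list of sets of terms (one per message); labels: parallel list of 'spam'/'ham'.
    """
    def entropy(label_list):
        counts = Counter(label_list)
        total = len(label_list)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    n = len(docs)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]
    conditional = sum(len(part) / n * entropy(part) for part in (present, absent) if part)
    return entropy(labels) - conditional

def log_tf_idf(tf, df, n_docs):
    """Combined local log(tf) and global idf weight for one term in one message."""
    return (1.0 + math.log(tf)) * math.log(n_docs / df) if tf > 0 else 0.0
```

Terms would be ranked by information_gain, the top 50 to 1,650 retained, and each retained term's entry in a message vector computed with log_tf_idf.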
• Evaluation is done by 10-fold cross-validation
  – A corpus is partitioned into 10 equally sized subsets, and each experiment takes one subset for testing and the remaining subsets for training
  – The process is repeated 10 times, with each subset taking a turn as the test set
  – Performance is evaluated by averaging over the 10 experiments (a sketch of this protocol appears at the end of these notes)
• Feature size
  – We use various sizes ranging from 50 to 1,650, with an increment of 100, to analyze the usefulness of feature selection to the algorithms

• Spam filtering is a special and challenging text classification task
  – Two categories (ham and spam)
  – Cost-sensitive, with unbalanced misclassification costs
  – Very difficult (many spam messages are carefully crafted to look like ham email)
• Some category characteristics should not be overlooked
  – Ham email in general has a broader vocabulary than spam email
  – Ham email has a more eclectic subject matter than spam email

• We would like to present some characteristics of the individual algorithms revealed by the experiments and analysis
• Naïve Bayes (NB)
  – Simple and fast in model learning
  – Works well for general text classification
  – Can benefit from effective feature selection (due to its simplistic feature-independence assumption)
  – Can perform poorly if data sets have potentially heavy feature dependencies, which can lead to inaccurate probability estimation (e.g., the Chinese dataset ZH1)
• LogitBoost (LB)
  – The base learner is simple, but the ensemble construction can still take time
  – Generally delivers competitive results
  – Seems insensitive to feature size – large feature sizes may not help improve performance, so a relatively small feature size such as 250 may be used
  – Its ability to learn a category profile may be influenced by the number of available training samples
• Support Vector Machines (SVM)
  – Very stable and scalable with respect to feature dimensionality
  – Consistently performs as the best, or a very competitive, classifier in this study

• Support Vector Machines (SVM) (continued)
  – Provides superior results, particularly when cost-insensitive classification is concerned
  – Relatively fast in model training
• Radial Basis Function Networks (RBF)
  – RBF-network based, with a fast two-stage training procedure
  – Performs reasonably well, in particular when used in cost-sensitive learning
  – Seems sensitive to feature size, and excessive feature reduction should be avoided
• Augmented LSI spaces (LSI)
  – Constructs separate LSI spaces, one for each category
  – A very reliable classifier with consistently good results
  – Like RBF, it seems well suited to cost-sensitive spam filtering, in part due to its integrated clustering component for constructing the augmented LSI learning spaces
  – Good performance generally requires a feature size of about 500 or larger
  – Model training can be expensive when the feature size gets very large

• This study considers 5 algorithms (among the most popularly used or most recently proposed) for an evaluation
• Experiments and analysis have shown that
  – Overall, LSI, RBF and SVM are the top performers
  – Both LSI and RBF show their strength when applied to cost-sensitive spam filtering
  – Algorithms for spam filtering can likely benefit from an integrable clustering process that enhances the accuracy of their ham email profiles
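To make the evaluation protocol concrete, the sketch below runs a 10-fold cross-validation and averages WA(λ) with λ = 9, assuming ham is encoded as 0 and spam as 1; scikit-learn's MultinomialNB appears purely as a stand-in classifier, so this illustrates the protocol rather than the five implementations evaluated in the study.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.naive_bayes import MultinomialNB  # stand-in classifier for illustration

def weighted_accuracy(y_true, y_pred, lam=9):
    """WA(lambda) with ham encoded as 0 (legitimate) and spam as 1."""
    tn = np.sum((y_true == 0) & (y_pred == 0))   # ham kept as ham
    fp = np.sum((y_true == 0) & (y_pred == 1))   # ham misclassified as spam (costly)
    tp = np.sum((y_true == 1) & (y_pred == 1))   # spam caught
    fn = np.sum((y_true == 1) & (y_pred == 0))   # spam missed
    return (lam * tn + tp) / (lam * (tn + fp) + (tp + fn))

def cross_validate(X, y, lam=9, folds=10):
    """Average WA(lambda) over a k-fold cross-validation, mirroring the protocol above."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=folds, shuffle=True, random_state=0).split(X):
        model = MultinomialNB().fit(X[train_idx], y[train_idx])
        scores.append(weighted_accuracy(y[test_idx], model.predict(X[test_idx]), lam))
    return float(np.mean(scores))
```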