IJCSMS International Journal of Computer Science and Management Studies, Vol. **, Issue **, Month Year
ISSN (Online): 2231-5268
www.ijcsms.com
Grouping Tournament: A New Email Classification Approach
Sabah Sayed 1, Samir AbdelRahman 2 and Ibrahim Farag 3

1 TA, Computer Science, Faculty of Computers and Information, Cairo, Egypt
s.sayed@fci-cu.edu.eg

2 Assoc. Professor, Computer Science, Faculty of Computers and Information, Cairo, Egypt
s.elsayed@fci-cu.edu.eg

3 Professor, Computer Science, Faculty of Computers and Information, Cairo, Egypt
i.farag@fci-cu.edu.eg
Abstract
Email has become one of the fastest, easiest, and cheapest means of communication. The increased volume of email has led to many automated tools for email management, and email classification has recently attracted much attention as a way to help users organize their emails into folders. This paper presents a new email classification approach called grouping tournament, in which classes or folders compete with each other for a new incoming email under rules similar to those of the World Cup finals. It uses the Un-weighted Pair-Group Method with Arithmetic mean (UPGMA) similarity algorithm to group the classes according to their similarities. The proposed Grouping Tournament approach is applied to the Winnow and Maximum Entropy classifiers using the cross-validation evaluation method. The experimental evaluation on the Enron corpus shows that the introduced classification method outperforms the compared email classification methods.
Keywords: Email Classification, Round Robin Tournament,
Elimination Tournament, Grouping Tournament, multi-class
binarization.
1. Introduction
Text mining [1] refers generally to the process of
extracting interesting and non-trivial patterns or
information from text documents. It can be viewed as an
extension of data mining or knowledge discovery from
databases [2, 3]. It also deals with higher complexity
because text is by nature unstructured. Text mining is a
multidisciplinary field [4] involving information retrieval,
text analysis, information extraction, clustering,
classification, database technology, and machine learning.
Text classification (or text categorization) is an important
text mining application. It is the task of classifying new
(unseen before) text to one or more pre-defined class(es)
based on the knowledge accumulated during the training
process.
Email has been an efficient and popular communication mechanism as the number of internet users increases. We spend ever more time filtering email messages and organizing them into folders to facilitate later retrieval and searching [5]. Email management is therefore an important and growing problem for individuals and organizations, as the volume of email messages increases every minute. Over the past decades, many tools have been introduced for automatically managing email data, such as detecting spam, searching emails, and organizing emails into specific folders. For these reasons, email classification [5] is one of the most important text classification problems; here the data store is the user's email inbox, which contains a set of email messages. Email classification is the process of automatically assigning new emails to pre-defined classes or specific folders based on their contents and properties [5].
Email classification is different from, and more complex than, document classification. This is because the email domain presents several challenges, and categorization habits vary from one user to another; that is why automated categorization methods may perform well for one user and fail for another [6]. An email is a semi-structured document with few, informal sentences, which makes preprocessing more difficult. Email users create new folders and abandon others, so many folders and messages are created, destroyed, and reorganized into other folders. Users may create folders to hold groups of different topics, project groups, certain senders, or unfinished tasks, which means that these folders do not necessarily correspond to unified semantic topics. Also, some users may categorize email messages not according to their content but according to their reaction to these messages, such as replying, forwarding to specific persons, or keeping them as reminders of tasks to do. Furthermore, folder hierarchies contain a large number of different folders that may be similar in content, and a subfolder may not be a subtopic of its parent folder's content.
We present a new approach for classifying emails, grouping tournament, in which the multi-class classification process is broken down into a set of binary classification tasks. The grouping tournament has a simple implementation and reduces execution time by dividing the main problem into smaller ones and running them in parallel. It groups classes into groups of four and inherits its rules from the World Cup finals. The grouping tournament approach combines clustering thinking with the email classification problem in a simple way. It utilizes the Un-weighted Pair-Group Method with Arithmetic mean (UPGMA) [7] as a similarity measure in forming the classes' clusters according to their similarities. In the grouping tournament approach, all email parts are exploited, without ignoring any field, to extract hidden useful information from the user's email data. This is very effective in the feature construction and selection steps.
To accomplish our goals, the following four decisions
were made:
1) A large set of email messages, Enron corpus [8]
was chosen as a benchmark data for our
experiments. The Enron corpus was made public
during the legal investigation concerning the
Enron Corporation [8].
2) Wide-margin winnow and Maximum Entropy
classifiers were chosen for applying the email
classification task.
3) The cross-validation evaluation method was used to calculate accuracies and other evaluation measures. To the best of our knowledge it is the best way to measure the real effectiveness of an email classifier, because it considers each email message in both the training and testing phases.
4) Our approach was tested against the results of the current tournament methods, the classical N-way method, and a proposed voting classification method.
The remainder of this paper is organized as follows: Section 2 gives a brief overview of related work in email classification. In Section 3 we state our design decisions on issues such as data preprocessing and feature construction. In Section 4 we briefly explore the classification procedures against which we compare our proposed approach. Section 5 explains the proposed email classification approach. How we implement these classification methods, and on what data, is described in Section 6 together with the experimental analysis and results. Finally, Section 7 concludes the work and highlights some important future directions.
2. Related Works
There are many studies, with different features and design aspects and using different types of classifiers, for categorizing emails into spam/non-spam or into multi-topic folders.

Svetlana Kiritchenko and Stan Matwin [9] proposed an email classification system that explores co-training. Co-training is an algorithm that uses unlabeled data along with a few labeled examples to boost the performance of a classifier on the email domain. Their results showed that the performance of co-training depends on the learning algorithm it uses. They also argued that Support Vector Machines significantly outperform Naive Bayes on email classification.
Svetlana Kiritchenko, Stan Matwin, and Suhayya Abu-Hakima [10] proposed a method of combining temporal (time-related) features with the traditional content-based features in email classification. They noticed that, when temporal features were added to the traditional ones, SVM and Naive Bayes performed better than decision trees. They concluded that the temporal characteristics have complex dependencies with the content-based features: when tested alone, their results were not promising, and they had value only when combined with content in more sophisticated ways, as in Naive Bayes and SVM.
Bekkerman, McCallum, and Huang [6] presented a benchmark case study of email foldering on both the Enron and SRI email datasets, comparing the email classification accuracies of four classifiers (Maximum Entropy, Naive Bayes, SVM, and Winnow). They found that SVM outperforms the other three classifiers and that Naive Bayes is the worst. They also proposed an evaluation method that divides the dataset into time-based splits and incrementally tests classifier performance on each split while training on all previous splits. Unfortunately, this evaluation method is more complex to implement and results in lower accuracies compared with random training/testing splits.
Yunqing Xia, Wei Liu, Louise Guthrie, Kam-Fai Wong
[11] implemented tournament-like classification scheme
using a probabilistic classification method and showed that
the tournament methods round robin and elimination
outperform the N-way method by 11.7% regarding
precision. Also, they found that round robin takes more
execution time due to its complex implementation. They
used an email collection of ten users and 50 folders, mixing all the folders together and extracting the 15 largest folders by number of emails they contain. This corpus is not a standard email corpus for email categorization evaluation. Also, these 15 folders belong to ten different users, which is not enough to reflect users' habits in the email categorization task and makes the folders' relationships disappear.
3. Design Issues
There exist many design choices for how to conduct the
email classification task [6]. In this section we state our
design decisions in preprocessing the data, constructing
features, evaluating our experiment performance, and
choosing classifier algorithms.
3.1 Dataset Preprocessing
Raw email datasets are usually unstructured. The Enron
dataset is no exception [6]. We apply some preprocessing and organization steps to the data before running the experiments. In [6] the authors removed the non-topical folders "all documents", "calendar", "contacts", "deleted items", "discussion threads", "inbox", "notes inbox", "sent", "sent items" and "sent mail", and then flattened all the folder hierarchies. They also removed the attachments and the X-folder field in the message headers, which actually contains the class label. We follow the same steps, but we also ignore small folders containing fewer than ten messages, to provide the classifiers with a sufficient number of training examples.
3.2 Feature Construction
The accuracy of a classifier increases for all classification methods as the feature set grows [12]. From this we conclude that how features are extracted from an email is very important for the classifier. Each email field (subject, sender, recipient, body, etc.) gives a piece of information about the email. To build an effective classifier we use all email fields without ignoring any of them, which helps in extracting hidden useful information from the user's email documents. All email fields are therefore parsed, tokenized, and down-cased. Date-related information is also considered and combined with the information from the other email fields. The bag-of-words representation is used, and each email message is represented as a vector of word counts. Stop words, such as auxiliary words, adverbs, and conjunctions, are removed because they do not help the classification process. Finally, features that appear more than 100 times, or just once, across all messages are ignored.
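The feature construction steps above can be sketched as follows. This is an illustrative sketch only: the stop-word list and field layout are placeholder assumptions, while the actual system uses MALLET's tokenizer and standard English stop-list.

```python
from collections import Counter

# Placeholder stop-word list; the real system uses MALLET's standard English stop-list.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "to", "of", "in"}

def tokenize(text):
    """Down-case a field and split it into alphanumeric word tokens."""
    return [t for t in text.lower().split() if t.isalnum()]

def build_features(emails):
    """emails: list of dicts with fields such as 'subject', 'sender', 'body'.
    Returns one bag-of-words Counter per email, after removing stop words
    and dropping features that occur more than 100 times or only once
    across all messages (Section 3.2)."""
    bags = []
    for mail in emails:
        tokens = []
        for field in mail.values():          # use all email fields
            tokens.extend(tokenize(field))
        bags.append(Counter(t for t in tokens if t not in STOP_WORDS))
    total = Counter()                        # corpus-wide frequency filter
    for bag in bags:
        total.update(bag)
    keep = {t for t, c in total.items() if 1 < c <= 100}
    return [Counter({t: c for t, c in bag.items() if t in keep}) for bag in bags]
```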
3.3 Evaluation Method
There are many approaches that can be employed in the evaluation phase. In [6] an incremental time-based split approach is employed. This approach has the problem that emails are usually related to other emails received in nearby periods, rather than to emails received long before or after, so low classification accuracies result [6]. In [11] the email collection is split into training and testing parts, with 20% used for training and 80% for testing. The problem with this approach is that only a small portion of the emails is used for training, which means that if this portion changes, the classification performance may differ.
The ten-fold cross-validation method is chosen to evaluate our experiments. The data is divided into ten partitions, taken randomly but with care to include all categories in both the training and testing splits. The classification task is applied ten times; each time, one partition is held out for testing while the remaining nine partitions are randomly mixed for training. This ensures that all email messages in all categories appear in both the training and testing sets, which is more reliable for email classification.
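A minimal sketch of the ten-fold splitting just described (the per-category stratification applied in our experiments is omitted here for brevity):

```python
import random

def ten_fold_splits(n_emails, seed=0):
    """Yield (train, test) index lists for ten-fold cross-validation.
    The indices are shuffled once and divided into ten partitions; in each
    round one partition is held out for testing and the other nine form the
    training set, so every message is tested exactly once."""
    idx = list(range(n_emails))
    rng = random.Random(seed)
    rng.shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    for k in range(10):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test
```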
3.4 Classifiers Algorithms
A classifier is a system that classifies text into discrete sets of predefined categories [12]. Here, two classification algorithms are used: Maximum Entropy and Wide-Margin Winnow.
3.4.1 Maximum Entropy Classifier
Maximum entropy classifier [13] does not assume
statistical independence of the independent variables or
features that serve as predictors. As mentioned in [6], the unique distribution that satisfies the constraints in the training data and has the maximum entropy belongs to the exponential family, as in Eq. (1):

P(c | d) = (1 / Z(d)) exp( Σ_i λ_i f_i(d, c) )    (1)

where Z(d) is the normalization factor that depends on the document d, f_i(d, c) is a feature, and λ_i is the weight or relevance of the feature. The Maximum Entropy classifier is implemented as part of the Mallet software system [14].
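For illustration, the scoring in Eq. (1) can be computed as below. The weight table here is a hypothetical stand-in for the parameters that MALLET learns during training; only the inference step is sketched.

```python
import math

def maxent_probs(features, weights, classes):
    """Score a document under Eq. (1): P(c|d) = exp(sum_i w[f_i,c] * count) / Z(d).
    `features`: dict of feature counts for the document.
    `weights`: dict mapping (feature, class) -> weight; in the real system
    these are learned from training data, here they are assumed given."""
    scores = {}
    for c in classes:
        s = sum(cnt * weights.get((f, c), 0.0) for f, cnt in features.items())
        scores[c] = math.exp(s)
    z = sum(scores.values())              # normalization factor Z(d)
    return {c: v / z for c, v in scores.items()}
```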
3.4.2 Wide-Margin Winnow
Winnow [6] belongs to the family of on-line learning
algorithms. It attempts to minimize the number of
incorrect guesses during the training phase while training
examples are presented to it one by one. It uses a
multiplicative scheme that allows it to perform much
better when many dimensions are irrelevant. The winnow
training implementation in [14] is used with θ = 0.5, α =
2.0, and β = 2.0.
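The multiplicative scheme can be sketched as follows. This is a simplified binary Winnow, not MALLET's exact wide-margin implementation: on a false negative the weights of active features are multiplied by α, and on a false positive they are divided by β, with prediction threshold θ.

```python
def winnow_train(examples, n_features, theta=0.5, alpha=2.0, beta=2.0, epochs=5):
    """Simplified binary Winnow with the multiplicative update.
    `examples`: list of (set_of_active_feature_indices, label), label in {0, 1}.
    Predict 1 when the summed weight of the active features exceeds theta;
    on a mistake, promote (x alpha) or demote (/ beta) the active weights."""
    w = [1.0 / n_features] * n_features       # uniform initial weights
    for _ in range(epochs):
        for active, label in examples:
            pred = 1 if sum(w[i] for i in active) > theta else 0
            if pred == label:
                continue                       # correct guess: no update
            factor = alpha if label == 1 else 1.0 / beta
            for i in active:
                w[i] *= factor
    return w
```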
4. Classification Procedures
In this section we briefly describe the candidate
classification methods which are used to compare against
our proposed classification approach.
4.1 N-Way Classification
It is a multi-class classification method where one
classifier trains on a portion of data as training examples,
and then is tested on the rest portions of the data, with a
condition that all categories are included in training and
testing portions of data.
4.2 Tournament Classification
The tournament classification methods perform classification on multi-class tasks using a tournament-like decision process, in which classes compete against each other to produce the final winner class [11]. The tournament classification method has two implementations, the elimination tournament and the round robin tournament; we briefly explain each.
4.2.1 The Elimination Tournament Method (ET)
The idea behind elimination tournament classification is inherited from the Wimbledon Championship [11]. In this method, every class competes against a pre-determined set of classes. When one competition is over, one of the two classes earns a place in the next round and the losing class is eliminated. The winner class of the last round is the optimal class for the incoming email message.
4.2.2 The Round Robin Tournament Method (RRT)
Round robin tournament classification follows the competition rules of the English Premiership [11]. Classes compete pairwise as in the elimination method, but instead of eliminating the losing classes, the round robin method assigns scores to both classes after each competition. Every class thus accumulates a total score after all competitions are over, and the class with the highest score is considered the optimal class for the incoming email message. A score table for the winner and loser classes is maintained.

Under the round robin tournament method, a tie may occur in the final score table [11]. A tie occurs when two or more classes have the same final score in the competition table.
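The round-robin scoring just described can be sketched as below; `beats(a, b)` is a placeholder for the binary classifier of the pair (a, b), and ties are left to a separate tie-breaking step.

```python
from itertools import combinations

def round_robin(classes, beats):
    """Round Robin Tournament: every pair of classes competes once, the
    winner of each match earns one point, and the class with the highest
    total score is returned together with the score table.
    `beats(a, b)` stands in for the pairwise binary classifier."""
    score = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        score[beats(a, b)] += 1
    return max(classes, key=lambda c: score[c]), score
```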
5. The Proposed Classification Approach:
"Grouping Tournament"
The grouping tournament classification approach (GTA) is another way to perform multi-class classification, where a multi-class problem is broken into a set of binary-class problems. All classes then compete with each other in groups under competition rules similar to those of the World Cup finals. There are two approaches for mapping one multi-class problem to a set of binary-class problems:
1) One-against-all approach or unordered binarization [15], in which the number of binary classifiers equals the number of classes (C). Each classifier is trained on the whole training set, where the training examples of one class are taken as positive examples and all the others as negative examples.
2) Pair-wise classification or Round Robin binarization [15], in which there is one classifier for each pair of classes, so the number of classifiers is C(C-1)/2, where C is the number of classes and the classifier for classes (i, j) is the same as the classifier for classes (j, i). A classifier for two classes is trained only on the training data of those two classes and ignores all other training data.
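The Round Robin binarization in item 2 can be sketched as follows; `fit(pos, neg)` is a placeholder for any binary learner (Winnow or Maximum Entropy in our experiments).

```python
from itertools import combinations

def pairwise_classifiers(train_by_class, fit):
    """Round Robin binarization: one binary classifier per unordered pair of
    classes, trained only on the examples of those two classes, yielding
    C(C-1)/2 classifiers in total. `fit(pos, neg)` stands in for a binary
    learner and returns a trained model."""
    models = {}
    for a, b in combinations(sorted(train_by_class), 2):
        models[(a, b)] = fit(train_by_class[a], train_by_class[b])
    return models
```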
Figure 1 shows a 4-class problem with the multi-class, pair-wise, and one-against-all approaches. In 1(a) one classifier separates all classes from each other. The one-against-all approach needs 4 classifiers for this problem; the one shown in 1(b) separates class 2 from all other classes. The pair-wise approach produces 6 classifiers, one for each pair of classes; 1(c) shows the classifier that separates class 1 and class 4.
Fig. 1 Pair-wise versus One-against-all approach
It is clear from the figure that in the pair-wise approach each classifier uses fewer training examples, which gives a more efficient boundary between the two classes than the classifiers of the one-against-all approach. The conclusion is that the pair-wise approach may yield significant improvements over the one-against-all approach without a high risk of decreasing performance [15]. GTA relies on the pair-wise classification approach, learning one classifier for each pair of classes using only the training data of those two classes and ignoring all others. In GTA, classes are organized into groups of four classes each. RRT is then applied within each group to determine the group's winner class, and ET is then applied on the winner classes of all groups. The winner class of the last round is the optimal class for the incoming email message. GTA for 3 groups is illustrated in Figure 2.
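The group-then-tournament flow just described can be sketched as below; `beats(a, b)` is a placeholder for the pairwise binary classifier, and the classes are assumed to arrive already ordered by the UPGMA grouping step.

```python
from itertools import combinations

def grouping_tournament(ordered_classes, beats):
    """GTA sketch: classes are split into groups of four, a round robin is
    played inside each group, and the group winners then meet in a
    single-elimination phase. `beats(a, b)` stands in for the pairwise
    binary classifier and returns the winner of one match."""
    groups = [ordered_classes[i:i + 4] for i in range(0, len(ordered_classes), 4)]
    winners = []
    for group in groups:                       # RRT within each group
        score = {c: 0 for c in group}
        for a, b in combinations(group, 2):
            score[beats(a, b)] += 1
        winners.append(max(group, key=lambda c: score[c]))
    while len(winners) > 1:                    # ET among the group winners
        nxt = []
        for i in range(0, len(winners), 2):
            if i + 1 < len(winners):
                nxt.append(beats(winners[i], winners[i + 1]))
            else:
                nxt.append(winners[i])         # odd winner gets a bye
        winners = nxt
    return winners[0]
```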
Fig. 2 Grouping Tournament method

The difficult point here lies in class grouping: finding the scheme that produces the optimal grouping. One simple class grouping approach is random pick, in which classes are randomly picked out to form groups. Another approach is to group classes in the order in which they were seen by the algorithm in the training step. In GTA, classes are clustered according to their similarities using the UPGMA similarity measure described in Section 5.1. The ET and RRT methods are designed and implemented according to [11].

5.1 The UPGMA Similarity Measure

The Un-weighted Pair-Group Method with Arithmetic mean (UPGMA) [7] is a popular distance analysis method. In this method, the distance between two classes is calculated as the average distance between all pairs of emails in the two different classes, i.e., the similarities of all emails in the first class with all emails in the second class; it does not include the internal similarities of the emails within each class. UPGMA differs from the other similarity methods in that it uses information about all pairs of distances, not just the nearest or the furthest. For this reason, it is preferred over the single and complete linkage similarity methods for grouping analysis [7].

We calculate the similarity between two classes C1 and C2 according to Eq. (2):

sim(C1, C2) = (1 / (N1 · N2)) Σ_{i=1..N1} Σ_{j=1..N2} sim(d_i, d_j)    (2)

where:
- N1, N2: number of emails in the first class C1 and the second class C2,
- d_i ∈ C1, d_j ∈ C2: email documents in C1 and C2.

5.2 Tie-Breaking Technique
The tie that may occur in the score table of the RRT
method has two types:
1) A two-way tie occurs when two classes have the same final score in the competition table. In this case, the binary classification process between these two classes is run, and the winner class is the one that wins this classification.
2) A many-way tie occurs when three or more classes have the same final score in the competition table. In this case, the many-way tie is divided into successive two-way ties, solved one by one: the winner class of one two-way tie competes with another class from the many-way tie to form the next two-way tie, and so on until only one two-way tie remains. The winner of the many-way tie is the winner of the last two-way tie.
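The many-way tie reduction can be sketched as a chain of two-way matches; `beats(a, b)` again stands in for the pairwise binary classifier.

```python
def break_tie(tied, beats):
    """Tie-breaking sketch: a many-way tie among three or more classes is
    reduced to successive two-way ties; the winner of each two-way match
    meets the next tied class until one class remains."""
    winner = tied[0]
    for challenger in tied[1:]:
        winner = beats(winner, challenger)
    return winner
```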
5.3 Voting Classification
The voting classification method performs multi-class classification using all the mentioned classification methods. It applies the N-way, tournament, and GTA methods and then performs a voting strategy between the winner classes of these methods. If the three winner classes (or at least two of them) are the same, that class is the final winner. If the three winner classes are all different, the binary classifier of the first two classes is applied, and the resulting class then competes with the third winner class using their binary classifier. The resulting class is the winner of the voting method.
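The voting strategy can be sketched as follows; `beats(a, b)` is a placeholder for the binary classifier of the pair (a, b).

```python
def vote(nway_winner, tournament_winner, gta_winner, beats):
    """Voting sketch: if at least two of the three method winners agree,
    that class wins; otherwise successive binary matches decide."""
    winners = [nway_winner, tournament_winner, gta_winner]
    for w in winners:
        if winners.count(w) >= 2:
            return w
    # All three differ: first two compete, then the survivor meets the third.
    return beats(beats(nway_winner, tournament_winner), gta_winner)
```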
6. Experiments

The evaluation of an email classification system depends on two main factors, namely the corpus data collection and the evaluation measure(s). This section first presents the corpus data analysis, then states the evaluation measures used, and finally describes and analyzes the experimental results.

6.1 Corpus Description

Currently there are several public email corpora, such as Ling-spam, PU1, PU123A, and the Enron corpus [5]. We evaluated our experiments on the Enron benchmark corpus for the email classification problem. The complete Enron dataset, along with an explanation of its origin, is available at [8]. We decided to use the email directories of seven former Enron employees that are especially large: beck-s, farmer-d, kaminski-v, kitchen-l, lokay-m, sanders-r and williams-w3. From each Enron user directory, we chose the largest folders according to the number of emails. This resulted in sixteen folders each for beck-s, farmer-d, kaminski-v, kitchen-l, and sanders-r. For lokay-m and williams-w3 we chose the largest folders containing more than ten messages, which are eight and twelve folders respectively. The reason behind this is to provide the classifier with enough training examples and to ensure that all folders contribute examples to each data split in the ten-fold cross-validation execution. Our filtered dataset version contains around 17,861 emails (~89 MB) for 7 users distributed among 100 folders. Table 1 shows statistics on the seven resulting datasets.

Table 1: Statistics on our portion of the Enron datasets

User          Folders   Messages   Smallest folder   Largest folder
                                   (messages)        (messages)
beck-s        16        1030       39                166
farmer-d      16        3578       18                1192
kaminski-v    16        3871       21                547
kitchen-l     16        3119       21                715
lokay-m       8         2446       51                1159
sanders-r     16        1077       20                420
williams-w3   12        2740       11                1398

It is notable from the table that even the smallest dataset contains more than 1000 emails, which is a suitable number for training an email classifier.

6.2 Evaluation Criteria

We used the Accuracy measure to evaluate all the mentioned classification methods. We also use Micro Average Precision, Micro Average Recall, and Micro Average F-measure to present the overall system performance for each test split of the cross-validation splits. Suppose we have n classes for one Enron dataset user, where:

- M_i: number of emails manually classified (by the user) to class i,
- P_i: number of emails classified by the program to class i,
- T_i: number of emails correctly classified by the program to class i.

Micro Average Precision (MAP), Micro Average Recall (MAR), Micro Average F-measure (MAF), and Accuracy are calculated according to Eq. (3), Eq. (4), Eq. (5), and Eq. (6):

MAP = (Σ_{i=1..n} T_i) / (Σ_{i=1..n} P_i)    (3)
MAR = (Σ_{i=1..n} T_i) / (Σ_{i=1..n} M_i)    (4)
MAF = 2 · MAP · MAR / (MAP + MAR)            (5)
Accuracy = (Σ_{i=1..n} T_i) / N              (6)

where N is the total number of test emails.
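The micro-averaged measures of Section 6.2 can be computed as below. The symbol names (M_i, P_i, T_i for manually filed, program-assigned, and correct counts per class) are our labels for the quantities defined there.

```python
def micro_averages(per_class):
    """Micro-averaged measures. `per_class` maps each class i to a tuple
    (M_i, P_i, T_i): emails manually filed to i, emails the program assigned
    to i, and correct assignments to i. Counts are pooled over all classes
    before dividing, as micro-averaging requires."""
    M = sum(m for m, _, _ in per_class.values())
    P = sum(p for _, p, _ in per_class.values())
    T = sum(t for _, _, t in per_class.values())
    map_ = T / P                           # Micro Average Precision, Eq. (3)
    mar = T / M                            # Micro Average Recall, Eq. (4)
    maf = 2 * map_ * mar / (map_ + mar)    # Micro Average F-measure, Eq. (5)
    return map_, mar, maf
```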
6.3 Experiment Description

In this experiment, the effectiveness of the email classification methods is evaluated. A Java-based open source package called MALLET [14] (A Machine Learning for Language Toolkit) is used to implement our email classification system. The system applies the tokenization method described in [14], which splits the emails into words used as terms or units. The stop words from the standard English stop-list in [14] are then removed, after which the feature construction steps of Section 3 are applied. The four classification methods are applied to each user in the Enron dataset using the ten-fold cross-validation evaluation method. For each Enron user's data, the following steps are performed:

(1) Randomly divide the whole data set into ten splits, where each split contains email messages from all folders or classes.
(2) Prepare the training set for each execution by randomly combining nine splits of the data; the remaining split is the testing set for that execution.
(3) Split the training examples of each class into a separate training set.
(4) Produce all binary classifiers, one for each pair of different classes, by randomly combining the separate training examples of only those two classes.
(5) Produce the multi-class classifier using all the training data from step (2), to be used by the N-way classification method.
(6) Apply the four classification methods ten times, with different training and testing sets each time, and calculate the evaluation measures according to Section 6.2.
6.4 Experimental Results
To evaluate the performance of our proposed email classification approach GTA, it is compared with the current popular email classification methods, tournament and N-way, as well as with the proposed voting classification method of Section 5.3. Our experimental study is performed through two experiments.
6.4.1 Experiment 1

This experiment was conducted on the seven Enron email directories mentioned in Section 6.1 using the Wide-Margin Winnow classifier. Figure 3 reports the results of this experiment, showing the average accuracy values for the four classification methods. It is clear from the figure that GTA outperforms the tournament, N-way, and voting methods for all seven users.

Fig. 3 Winnow Accuracy values of the 4 methods for the 7 Enron users

From Figure 3 we note that the voting method's accuracy values are almost identical to those of GTA for all users except lokay-m: the GTA accuracy for lokay-m is 53.81%, while the voting method's accuracy is 47.59%. This can be explained by noting that the voting method produces accuracy values that average over the GTA, tournament, and N-way methods; for lokay-m the difference between the three methods' accuracy values is large, so the voting accuracy falls below the GTA accuracy.

We also note in Figure 3 that on the williams-w3 dataset the accuracies of all methods are very high. This is explained by observing that in this dataset 4 classes hold 2587 of the total 2740 emails, while the remaining 8 classes hold only 153 emails, so predicting one of the four big classes for any new email has a high probability of being correct.

Figures 4, 5, and 6 show the Micro Average Precision, Micro Average Recall, and Micro Average F-measure values of the proposed GTA, tournament, and N-way methods over the ten cross-validation runs for three Enron users.

Fig. 4 Kitchen-l Micro Avg Precision

Fig. 5 Farmer-d Micro Avg Recall
Fig. 6 Beck-s Micro Avg F-Measure

All values in the figures demonstrate that the GTA method is superior to both the N-way and tournament methods on all the mentioned evaluation measures. This also holds for all other Enron users.

6.4.2 Experiment 2

In this experiment, we study the effect of applying the proposed GTA to the Maximum Entropy classifier. This experiment was conducted on the seven Enron email directories mentioned in Section 6.1. Figure 7 shows the average accuracy values for the four classification methods. It is clear from the figure that GTA is superior to the tournament, N-way, and voting methods for all seven users.

Fig. 7 Maximum Entropy Accuracy values of the 4 methods for the 7 Enron users

Figures 8, 9, and 10 show the Micro Average Precision, Micro Average Recall, and Micro Average F-measure values of the proposed GTA and the N-way method over the ten cross-validation runs for three Enron users.

Fig. 8 Williams-w3 Micro Avg Precision

Fig. 9 Kaminski-v Micro Avg Recall

Fig. 10 Beck-s Micro Avg F-Measure
It is clear from the figures that GTA outperforms the N-way classification method; this also holds for all other Enron users. We can conclude from all the values in these figures that GTA improves the Maximum Entropy classifier's accuracy for all seven Enron users.
7. Conclusions and Future Work
In this paper, we presented a new email classification approach to categorize user email data into predefined folders; it can also be used in general multi-class classification problems. Our classification approach utilizes the UPGMA similarity algorithm to cluster user email classes into groups according to their similarities, applying rules similar to those of the World Cup finals. Our classification method is applied using the Winnow classifier and, in addition, the Maximum Entropy classifier. In all experiments, the cross-validation evaluation technique is used, which applies random training/test splits and obtains much higher results than the incremental time-based splits.

We carried out empirical evaluations on the email data of seven users from the Enron email dataset. The results showed that our proposed Grouping Tournament approach outperforms the compared classification methods. Hence, our classification method reflects email users' needs in the classification task, obtaining both better performance and lower execution cost. The results also showed that running the Winnow and Maximum Entropy classifiers with the Grouping Tournament classification method improves their accuracy, with a slightly greater improvement for Winnow than for Maximum Entropy.

In the future, we plan to investigate other similarity strategies to find the optimal grouping scheme for our Grouping Tournament classification method. We will also study how to benefit from the user profile to enhance it.
References
[1] R. Feldman and I. Dagan, "Knowledge Discovery in Textual Databases", in Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), Montreal, Canada, 1995, pp. 112-117.
[2] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, "From Data Mining to Knowledge Discovery: An Overview", 1996, pp. 1-36.
[3] E. Simoudis, "Reality Check for Data Mining", IEEE Expert, 1996, 11(5).
[4] A.-H. Tan, "Text Mining: The State of the Art and the Challenges", in Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, 1999, pp. 65-70.
[5] P. Li, J. Li, and Q. Zhu, "An Approach to Email Categorization with the ME Model", 2005, pp. 229-234.
[6] R. Bekkerman, A. McCallum, and G. Huang, "Automatic Categorization of Email into Folders: Benchmark Experiments on Enron and SRI Corpora", Technical Report IR-418, 2004, pp. 1-23.
[7] G. McLachlan, "Cluster Analysis", 1984, pp. 361-392.
[8] Enron Email Dataset, http://www.cs.cmu.edu/~enron/
[9] S. Kiritchenko and S. Matwin, "Email Classification with Co-Training", 2001.
[10] S. Kiritchenko, S. Matwin, and S. Abu-Hakima, "Email Classification with Temporal Features", International Intelligent Information Systems, 2004, pp. 523-533.
[11] Y. Xia, W. Liu, L. Guthrie, and K.-F. Wong, "Email Categorization with Tournament Methods", in Natural Language Processing and Information Systems, Lecture Notes in Computer Science, Springer Berlin/Heidelberg, 2005, pp. 150-160.
[12] S. Youn and D. McLeod, "A Comparative Study for Email Classification", 2006, pp. 387-391.
[13] Maximum Entropy Classifier, http://en.wikipedia.org/wiki/Maximum_entropy_classifier
[14] A. K. McCallum, "MALLET: A Machine Learning for Language Toolkit", http://mallet.cs.umass.edu, 2002.
[15] J. Fürnkranz, "Round Robin Classification", The Journal of Machine Learning Research, 2002, vol. 2, pp. 721-747.