Essay2_NaiveBayesClassiffier

Naive Bayes Classifier
The Naïve Bayes classifier is a simple probabilistic classifier that applies Bayes' theorem,
which relates the conditional probabilities of two events. I used to ask myself why the
name of the classifier starts with the word “naïve”, which connotes a lack of experience or
understanding, but I later learned that “naïve” can also mean simple, and I felt more
comfortable with it. The simplification in question is the assumption that the features are
independent of one another. The Naïve Bayes classifier is indeed simple and very fast to
execute.
To illustrate the idea behind Naïve Bayes, the probability that an event has occurred is
updated based on evidence from a related event. For example, the probability that someone
is “drunk” depends on the result of an alcohol test performed on that person. If the test
is 98% accurate, we can treat the remaining 2% as its error rate and say that there is a 2%
false positive rate and a 2% false negative rate. Given these rates, together with an
estimate of how common drunkenness is among the people tested, Naïve Bayes can calculate
the probability that someone is drunk given the test result. This is just a simple example
that describes one aspect of the Naïve Bayes classifier.
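The alcohol test example can be worked through with Bayes' theorem. The sketch below is only
an illustration: the 2% error rates come from the example above, while the 10% share of
drunk people among those tested is an assumed figure, and the function name is made up.

```python
def probability_drunk_given_positive(prior, sensitivity, false_positive_rate):
    """Bayes' theorem: P(drunk | positive test)."""
    # P(positive test), summed over both possibilities: drunk and not drunk.
    p_positive = sensitivity * prior + false_positive_rate * (1.0 - prior)
    return sensitivity * prior / p_positive


# Assumed figure (not from the essay): 10% of the people tested are drunk.
prior_drunk = 0.10
# From the example: 2% false negatives, so 98% of drunk people test positive,
# and 2% false positives among sober people.
posterior = probability_drunk_given_positive(prior_drunk, 0.98, 0.02)
print(f"P(drunk | positive test) = {posterior:.3f}")
```

Under these assumptions a positive test gives a probability of about 0.84 rather than 0.98,
because sober people far outnumber drunk ones among those tested.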
The Naïve Bayes classifier, which a few people call “Idiot’s Bayes” although that is not a
technically accepted replacement for the name, is used in many aspects of our lives. Most
of the time we use it unknowingly, because it is just an algorithm hidden behind
applications we use on a daily basis. The classifier is simple and fast, but one of its
drawbacks is that the probabilities it produces are not completely accurate, which can
adversely affect the result.
Quite a number of systems make use of the Naïve Bayes classifier. The major examples are
email systems such as Mozilla, Microsoft Outlook and Hotmail. We interact with email
systems on a daily basis, and the classifier is very useful for sifting spam from non-spam
mail. It works on the assumption that features are independent, i.e. the occurrence of one
feature does not determine the occurrence of another, and that makes classification
easier. In the context of spam, the Naïve Bayes classifier says that the probability that
an email is spam, given that it has certain words in it, is equal to the probability of
finding those words in spam email, times the probability that any email is spam, divided
by the probability of finding those words in any email.
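Written as a formula, the statement above is Bayes' theorem, and the independence assumption
lets the probability of the whole set of words be approximated by a product over individual
words. The symbols w_1, ..., w_n for the words in the email are notation introduced here,
not taken from the essay.

```latex
P(\text{spam} \mid w_1, \dots, w_n)
  = \frac{P(w_1, \dots, w_n \mid \text{spam}) \, P(\text{spam})}{P(w_1, \dots, w_n)},
\qquad
P(w_1, \dots, w_n \mid \text{spam}) \approx \prod_{i=1}^{n} P(w_i \mid \text{spam})
```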
After training, the word probabilities are used to compute the probability that an email
with a particular set of words in it belongs to either category. Each word in the email
contributes to the email’s spam probability. The spam probability is then computed over all
the words in the email, and if it exceeds a certain threshold, the filter marks the email
as spam and takes the necessary action; otherwise the email is not regarded as spam.
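The scoring step can be sketched as follows. This is a minimal illustration rather than the
method of any particular mail system: the word probabilities, the prior P_SPAM and the
threshold SPAM_THRESHOLD are all made-up values, and skipping unknown words is an assumption.

```python
import math

# Hypothetical per-word probabilities learned during training:
# word -> (P(word | spam), P(word | legitimate)).
WORD_PROBS = {
    "free": (0.30, 0.02),
    "winner": (0.20, 0.01),
    "meeting": (0.01, 0.15),
    "report": (0.02, 0.10),
}
P_SPAM = 0.5          # assumed prior probability that any email is spam
SPAM_THRESHOLD = 0.9  # assumed cut-off above which an email is marked as spam


def spam_probability(words):
    """Combine the per-word probabilities into P(spam | words)."""
    # Work with logarithms so that many small probabilities do not underflow.
    log_spam = math.log(P_SPAM)
    log_ham = math.log(1.0 - P_SPAM)
    for word in words:
        if word in WORD_PROBS:  # words never seen in training are skipped
            p_word_spam, p_word_ham = WORD_PROBS[word]
            log_spam += math.log(p_word_spam)
            log_ham += math.log(p_word_ham)
    # Normalise: P(spam | words) = spam / (spam + legitimate).
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


def is_spam(text):
    """Mark the email as spam when its probability exceeds the threshold."""
    return spam_probability(text.lower().split()) > SPAM_THRESHOLD


print(is_spam("free winner"))     # both words are far more common in spam
print(is_spam("meeting report"))  # both words are far more common in legitimate mail
```

With these numbers the first message is marked as spam and the second is not.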
The Naïve Bayes classifier can also filter spam on a per-user basis, which is a plus. To
achieve this, the filter needs to be trained by the user manually indicating that an email
is spam. This process builds a database of spam words: for every word in each training
email, the filter adjusts the probabilities, stored in its database, that the word will
appear in spam or in legitimate email.
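A per-user word database of this kind might look like the sketch below. The class name, the
label strings and the add-one smoothing are assumptions made for illustration; the essay
does not specify how the probabilities are stored or adjusted.

```python
from collections import Counter


class WordDatabase:
    """Per-user word counts, updated each time the user labels an email."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "legitimate": Counter()}
        self.email_counts = {"spam": 0, "legitimate": 0}

    def train(self, text, label):
        """Record one email that the user marked as 'spam' or 'legitimate'."""
        self.email_counts[label] += 1
        # Count each distinct word once per email.
        for word in set(text.lower().split()):
            self.word_counts[label][word] += 1

    def word_probability(self, word, label):
        """Estimate P(word appears | label) with add-one smoothing (an assumption)."""
        return (self.word_counts[label][word] + 1) / (self.email_counts[label] + 2)


db = WordDatabase()
db.train("win a free prize now", "spam")
db.train("free entry in a prize draw", "spam")
db.train("agenda for the project meeting", "legitimate")

print(db.word_probability("free", "spam"))
print(db.word_probability("free", "legitimate"))
```

Every call to train shifts the stored estimates, so the filter gradually adapts to the mail
that this particular user considers spam.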
The classifier also treats mail received from friends, or from names stored in the
recipient’s contact list, differently. The list is used by the classifier when deciding
the spam probability of each email received. For example, if an email contains some spam
words but the recipient knows the sender, its spam score is greatly reduced and the email
is unlikely to be regarded as spam. However, if the email contains a lot of spam words,
the probability that it is spam may still shoot above the threshold set for spam mail, and
such mail can end up in the junk folder.
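The essay does not say how the contact list enters the calculation. One plausible reading,
and it is only an assumption, is that a known sender lowers the prior probability of spam
while the words in the email still count as evidence. The function names, priors and
threshold below are hypothetical.

```python
SPAM_THRESHOLD = 0.9  # assumed cut-off for marking an email as spam


def prior_for_sender(sender, contacts, known_prior=0.05, unknown_prior=0.5):
    """Assumed rule: senders in the contact list start with a much lower spam prior."""
    return known_prior if sender.lower() in contacts else unknown_prior


def posterior_spam(likelihood_ratio, prior_spam):
    """P(spam | words) from the prior and P(words | spam) / P(words | legitimate)."""
    odds = likelihood_ratio * prior_spam / (1.0 - prior_spam)
    return odds / (1.0 + odds)


contacts = {"friend@example.com"}

# A few spam words (ratio 30): flagged from a stranger, not from a contact.
stranger = posterior_spam(30, prior_for_sender("someone@example.org", contacts))
friend = posterior_spam(30, prior_for_sender("friend@example.com", contacts))
# Many spam words (ratio 3000): flagged even though the sender is a contact.
friend_heavy = posterior_spam(3000, prior_for_sender("friend@example.com", contacts))

print(stranger > SPAM_THRESHOLD, friend > SPAM_THRESHOLD, friend_heavy > SPAM_THRESHOLD)
```

This reproduces the behaviour described above: knowing the sender pulls the score down, but
enough spam words can still push the email over the threshold.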
This is why friends so often complain that they did send an email which the recipient never
saw. Since most people don’t bother to check the junk/bulk folder, they never get to see
such mail. In fact, many people empty the junk/bulk folder without checking the mail in it,
and in the process have trashed important mail.
Another approach to mail filtering, different from examining the textual content of the
mail, is to filter emails based on geographical domains. The idea is to build a dataset of
domains classified as illegitimate; whenever an email is received by the email system, the
classifier checks its originating domain and scores it. The action taken can be to delete
such emails completely, move them to a dedicated folder or mark them as flagged. The
classifier can also score the email and send it to another classifier whose function is to
organize emails by putting them in the appropriate folder based on the result of the
previous classifier. If the email is spam it is put into the junk/spam folder; otherwise it
is placed in the recipient’s Inbox. This type of system makes use of multiple classifiers
in determining the status and destination of emails.
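The two-stage arrangement described above can be sketched as a small pipeline. The domain
names, folder names and threshold are invented for the example, and a real first stage
would produce graded scores rather than the 0-or-1 score used here.

```python
# Hypothetical dataset of domains previously classified as illegitimate.
ILLEGITIMATE_DOMAINS = {"spam-offers.example", "cheap-pills.example"}


def score_by_domain(sender_address):
    """First classifier: score an email by its originating domain."""
    domain = sender_address.rsplit("@", 1)[-1].lower()
    return 1.0 if domain in ILLEGITIMATE_DOMAINS else 0.0


def route(score, threshold=0.5):
    """Second classifier: choose the destination folder from the first score."""
    return "Junk" if score >= threshold else "Inbox"


def deliver(sender_address):
    """Run both stages and return the folder the email ends up in."""
    return route(score_by_domain(sender_address))


print(deliver("promo@spam-offers.example"))
print(deliver("friend@university.example"))
```

Keeping scoring and routing separate means the second stage can be reused unchanged if the
first stage is replaced by a text-based classifier.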
Although the Naïve Bayes classifier has been very useful in spam removal, it also has its
own downsides, which spammers have taken advantage of. Spammers tend to send spam emails
containing a lot of legitimate words but only a few spam words. Because the legitimate
words far outnumber the spam words, the classifier still scores the mail very low, and it
is unlikely to reach the threshold for spam. Another strategy spammers have adopted is to
represent spam words as images in emails; this works because the Naïve Bayes classifier can
only interpret the text in emails. More recently, however, Google came up with a system
that can read images and extract the text embedded in them. If the text in the image is
spam, this increases the probability that the email is spam. This innovation has reduced
the number of spam mails received by Gmail customers.
We can also have another classifier that checks whether a mail is spam or not and puts the
mail in the appropriate folder based on its type. An email user can define custom folders
with certain characteristics, such as a sender’s name or email address, and when mail is
received the classifier puts it in the matching folder. Most email systems provide a
standard junk mail folder, with a classifier that puts junk mail into that folder by
default.
The Naïve Bayes classifier is also applied to text classification. A text classification
system extracts words and word sequences from the texts to be analyzed. The extracted words
and word sequences are compared with training data, which comprises words and word
sequences together with a measure of probability. Some examples of text classification are
classifying business names by industry, classifying jokes as funny or not funny, and
classifying movie reviews as favorable, unfavorable or neutral; the list is endless.
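As a small illustration of text classification, the sketch below labels movie reviews with
a word-count version of Naïve Bayes. The training reviews are made up, only single words
(not word sequences) are used, and add-one smoothing is an assumption.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples: list of (text, label). Returns the counts needed by classify()."""
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of training texts
    vocabulary = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocabulary.update(words)
    return word_counts, label_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior probability."""
    word_counts, label_counts, vocabulary = model
    total_texts = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_texts)
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the product.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training reviews, for illustration only.
reviews = [
    ("a wonderful and moving film", "favorable"),
    ("great acting and a great story", "favorable"),
    ("boring plot and terrible acting", "unfavorable"),
    ("a dull and terrible film", "unfavorable"),
]
model = train(reviews)
print(classify("a great and moving story", model))
print(classify("terrible and boring", model))
```

The same two functions work for any set of labels, such as industries for business names,
as long as labelled training texts are available.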
Another means of blocking unwanted emails with a classifier filter is to track the Internet
Protocol (IP) address of each email. An email is identified as unwanted by its IP address.
The source IP address of the unwanted email is determined from the training set. Other
source IP addresses owned or registered by the owner of that source IP address are then
determined and added to the blacklist. Subsequent emails from the source IP address and
the other IP addresses are blocked. This stops a spammer who shifts to a new source IP
address when spam is blocked from the original one.
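The owner-based blocking step can be sketched as below. The IP_OWNER table stands in for
whatever registry lookup a real system would use and is entirely hypothetical; the
addresses are taken from ranges reserved for documentation.

```python
# Hypothetical registry mapping each IP address to its registered owner.
IP_OWNER = {
    "203.0.113.5": "bulk-mailer-inc",
    "203.0.113.6": "bulk-mailer-inc",
    "198.51.100.7": "bulk-mailer-inc",
    "192.0.2.10": "some-university",
}

blacklist = set()


def block_sender(source_ip):
    """Blacklist the unwanted email's source IP and every IP with the same owner."""
    blacklist.add(source_ip)
    owner = IP_OWNER.get(source_ip)
    if owner is not None:
        blacklist.update(ip for ip, o in IP_OWNER.items() if o == owner)


def is_blocked(source_ip):
    """True if later emails from this address should be rejected."""
    return source_ip in blacklist


block_sender("203.0.113.5")        # an email from this address was unwanted
print(is_blocked("198.51.100.7"))  # same owner, different address
print(is_blocked("192.0.2.10"))    # unrelated owner
```

Because the whole set of addresses belonging to one owner is blocked at once, moving to a
sibling address does not help the spammer.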
The Naïve Bayes classifier is also applied in the Short Message Service (SMS). The
classifier is used to prevent what is called SMS flooding. The system is deployed on the
mobile operator’s servers, where it checks the message traffic. The system builds two
types of list, a white list and a black list. The white list is made up of allowed senders,
SMS-Center addresses, keywords or other parameters, while the black list is made up of
blocked senders, SMS-Center addresses, keywords or other parameters. Each SMS is matched
against these lists, and an SMS that does not match the black list, or is found in the
white list, is passed through to its destination. An SMS detected on the black list is
stopped and the sender is sent an acknowledgment, which could be negative or positive
depending on the message customized by the mobile operator.
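The list-matching step, taken on its own, is simple enough to sketch. The list contents and
function names are hypothetical, only senders and keywords are modelled, and checking the
white list before the black list is an assumption about how conflicts are resolved.

```python
# Hypothetical lists maintained by the mobile operator.
WHITE_LIST = {"senders": {"+15550100"}, "keywords": set()}
BLACK_LIST = {"senders": {"+15550199"}, "keywords": {"lottery", "prize"}}


def matches(sms_sender, sms_text, rules):
    """True if the sender or any word of the message appears in the given list."""
    words = set(sms_text.lower().split())
    return sms_sender in rules["senders"] or bool(words & rules["keywords"])


def filter_sms(sms_sender, sms_text):
    """Return 'deliver' or 'block' for one message."""
    # The white list is checked first (an assumption), so a permitted sender
    # is never blocked because of a keyword.
    if matches(sms_sender, sms_text, WHITE_LIST):
        return "deliver"
    if matches(sms_sender, sms_text, BLACK_LIST):
        return "block"  # the operator would also send its acknowledgment here
    return "deliver"


print(filter_sms("+15550142", "You won the lottery"))
print(filter_sms("+15550100", "claim your prize"))
print(filter_sms("+15550142", "see you at noon"))
```

The first message is blocked on a keyword, the second is delivered because its sender is
white-listed, and the third matches neither list and passes through.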
Mobile operators have found SMS filters to be very helpful, and the filters have helped to
increase revenue. This is because customers on such networks hardly receive SMS that are
more or less spam, and a customer could naturally decide to switch to another mobile
operator if he receives a lot of undesirable SMS. Customers are also protected from fraud,
as fraudulent SMS are filtered out before reaching the recipient.
The Naïve Bayes classifier can also be applied to recognizing a user’s handwriting,
although there are other algorithmic techniques better suited to handwriting recognition.
The idea is that the user inputs words and a recognition operation is performed on the
handwritten input to produce an initial recognition result. An incorrect recognition is
detected when the user’s normal writing style is not consistent with the initial
recognition result. The initial recognition result is compared with samples previously
provided by the user. If the comparison indicates that the initial recognition result is
not consistent with at least one sample, the result is identified as possibly incorrect.
This process is applied in modern devices such as PDAs, palmtops and mobile phones.
The Naïve Bayes classifier is also found in network security, in intrusion detection
systems. Intrusion detection is used to protect computers on a network from hackers who are
continuously finding ways to steal information such as credit card details, passwords and
other confidential information from the system. Intrusion detection works by periodically
checking the network for suspicious behavior that can indicate potential system attacks.