
DWM Classification

Classification
Classification is a form of data analysis that extracts models describing important data
classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels.
Example: we can build a classification model to categorize bank loan applications as either
safe or risky. Such analysis can help provide us with a better understanding of the data at large.
 Many classification methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
 Most algorithms are memory resident, typically assuming a small data size.
 Recent data mining research has built on such work, developing scalable
classification and prediction techniques capable of handling large amounts of disk-resident data.
Classification has numerous applications, including fraud detection, target marketing,
performance prediction, manufacturing, and medical diagnosis.
Suppose that a marketing manager wants to predict how much a given customer will spend during a
sale at an e-commerce website.
This data analysis task is an example of numeric prediction, where the model constructed
predicts a continuous-valued function, or ordered value, as opposed to a class label.
This model is a predictor. Regression analysis is a statistical methodology that is most often
used for numeric prediction.
Hence the two terms tend to be used synonymously, although other methods for numeric
prediction exist. Classification and numeric prediction are the two major types of prediction
problems.
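To make the distinction concrete, here is a minimal sketch of numeric prediction using linear regression. The use of scikit-learn and the tiny made-up data set are assumptions for illustration only; the point is that the predictor outputs a continuous value rather than a class label.

# Numeric prediction: predict a continuous amount, not a class label.
# Assumes scikit-learn is installed; the training data below are made up.
from sklearn.linear_model import LinearRegression

# Hypothetical training tuples: [age, income] -> amount spent during the sale
X_train = [[25, 30000], [35, 40000], [45, 60000], [50, 80000]]
y_train = [120.0, 250.0, 400.0, 520.0]

predictor = LinearRegression()
predictor.fit(X_train, y_train)            # learn the regression model

print(predictor.predict([[35, 40000]]))    # predicts a dollar amount for a new customer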
“How does classification work?”
Data classification is a two-step process consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels
for given data).
[Figure: the two-step classification process. Training data is fed to a classification algorithm to build a classifier (model); the classifier is then applied to test data.]
Because the class label of each training tuple is provided, this step is also known as supervised
learning. It contrasts with unsupervised learning (or clustering), in which the class label of
each training tuple is not known and the number or set of classes to be learned may not be
known in advance.
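As a rough sketch of the two steps, the following trains a decision tree classifier on class-labeled tuples and then applies it to test data. The choice of scikit-learn and the toy loan data are assumptions for illustration; any classification algorithm fits the same two-step pattern.

# Learning step: build a classifier (model) from class-labeled training tuples.
# Assumes scikit-learn is installed; features and labels are hypothetical.
from sklearn.tree import DecisionTreeClassifier

X_train = [[25, 30000], [40, 90000], [55, 20000], [35, 60000]]   # [age, income]
y_train = ["risky", "safe", "risky", "safe"]                     # class labels

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

# Classification step: use the model to predict class labels for given test data.
X_test = [[30, 50000]]
print(classifier.predict(X_test))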
“What about classification accuracy?”
The performance of a model (classifier) can be evaluated by comparing the predicted labels
against the true labels of instances. This information can be summarized in a table called a
confusion matrix.
The following table depicts the confusion matrix for a binary classification problem:

                   Predicted Class 1   Predicted Class 0
Actual Class 1           f11                 f10
Actual Class 0           f01                 f00

Each entry fij denotes the number of instances from class i predicted to be of class j. For
example, f01 is the number of instances from class 0 incorrectly predicted as class 1.
The number of correct predictions made by the model is (f11 + f00)
The number of incorrect predictions is (f10 + f01).
Although a confusion matrix provides the information needed to determine how well a
classification model performs, summarizing this information into a single number makes it
more convenient to compare the relative performance of different models.
This can be done using an evaluation metric such as accuracy, which is computed in the
following way:
Accuracy = (Number of correct predictions) / (Total number of predictions) = (f11 + f00) / (f11 + f10 + f01 + f00)
The error rate is another related metric, which is defined as follows for binary classification
problems:
Error rate = (Number of incorrect predictions) / (Total number of predictions) = (f10 + f01) / (f11 + f10 + f01 + f00)
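To make the metrics concrete, the sketch below tallies the confusion-matrix entries from lists of true and predicted labels and then computes accuracy and error rate. The label vectors are made up for illustration.

# Hypothetical true and predicted labels for a binary (0/1) problem
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix entries: f[i][j] = number of class-i instances predicted as class j
f = [[0, 0], [0, 0]]
for t, p in zip(y_true, y_pred):
    f[t][p] += 1

correct = f[1][1] + f[0][0]
incorrect = f[1][0] + f[0][1]
total = correct + incorrect

accuracy = correct / total
error_rate = incorrect / total
print(accuracy, error_rate)        # error rate is 1 - accuracy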
Decision Tree Induction
 Decision tree induction is the learning of decision trees from class-labeled training
tuples.
 A decision tree is a flowchart-like tree structure, where each internal node (non-leaf
node) denotes a test on an attribute, each branch represents an outcome of the test, and
each leaf node (or terminal node) holds a class label.
 The topmost node in a tree is the root node.
Basic Algorithm to Build a Decision Tree
One of the earliest methods is Hunt's algorithm, which is the basis for many current
implementations of decision tree classifiers, including ID3, C4.5, and CART.
Hunt's Algorithm
 In Hunt’s algorithm, a decision tree is grown in a recursive fashion.
 The tree initially contains a single root node that is associated with all the training
instances.
 If a node is associated with instances from more than one class, it is expanded using an
attribute test condition that is determined using a
splitting criterion.
 A child leaf node is created for each outcome of the attribute test condition and the
instances associated with the parent node are distributed to the children based on the
test outcomes. This node expansion step can then be recursively applied to each child
node, as long as it has labels of more than one class.


 If all the instances associated with a leaf node have identical class labels, then the node
is not expanded any further.
 Each leaf node is assigned the class label that occurs most frequently in the training
instances associated with the node.
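The following is a minimal sketch of the recursive node-expansion idea behind Hunt's algorithm, not a complete implementation: the splitting criterion is left abstract as a hypothetical best_split function supplied by the caller, and instances are simple (features, label) pairs.

from collections import Counter

def grow_tree(instances, best_split):
    # instances: list of (features, label) pairs associated with this node.
    # best_split(instances) is assumed to return (test, partitions), where
    # partitions maps each test outcome to the instances routed to that child,
    # or None if no useful split exists.
    labels = [label for _, label in instances]
    majority = Counter(labels).most_common(1)[0][0]

    # A node whose instances all share one class label is not expanded further.
    if len(set(labels)) == 1:
        return {"label": majority}

    split = best_split(instances)
    if split is None:                       # cannot split: make a leaf
        return {"label": majority}          # label = most frequent class

    test, partitions = split
    children = {outcome: grow_tree(part, best_split)
                for outcome, part in partitions.items()}
    return {"test": test, "children": children, "label": majority}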
[Figure: example class distributions illustrating node impurity. A pure node (Class 0: 0, Class 1: 8) has instances of only one class; impurity increases through (Class 0: 1, Class 1: 8) and (Class 0: 3, Class 1: 5) and is highest for the evenly split node (Class 0: 4, Class 1: 4).]

Impurity Measure for a Single Node
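Two standard impurity measures for a single node are the Gini index and entropy, both computed from the node's class proportions. The sketch below evaluates them for the class distributions in the figure above; the function names are illustrative only.

import math

def gini(counts):
    # Gini index: 1 minus the sum of squared class proportions (0 for a pure node).
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    # Entropy: -sum p*log2(p) over classes with nonzero counts (0 for a pure node).
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

for counts in [(0, 8), (1, 8), (3, 5), (4, 4)]:
    print(counts, round(gini(counts), 3), round(entropy(counts), 3))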
How to Split the Decision Tree (Splitting Criterion)?
Binary Attributes: The test condition for a binary attribute generates two potential outcomes.
Nominal Attributes: Since a nominal attribute can have many values, its attribute test condition
can be expressed in two ways, as a multiway split or a binary split.
Ordinal Attributes: Ordinal attributes can also produce binary or multiway splits. Ordinal
attribute values can be grouped as long as the grouping does not violate the order property of
the attribute values.
Continuous Attributes: For continuous attributes, the attribute test condition can be expressed
as a comparison test (e.g., A < v) producing a binary split, or as a range query of the form
vi ≤ A < vi+1, for i = 1, ..., k, producing a multiway split.
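As a sketch of how an attribute test condition of the form A < v can be chosen for a continuous attribute, the code below tries each candidate threshold and keeps the one whose binary split gives the lowest weighted Gini impurity in the two children. The data and function names are made up for illustration.

def gini_of_labels(group):
    # Gini index of a list of class labels (0 when the group is pure or empty).
    n = len(group)
    if n == 0:
        return 0.0
    return 1.0 - sum((group.count(c) / n) ** 2 for c in set(group))

def best_threshold(values, labels):
    # Choose v so that the split A < v minimizes the weighted impurity of the children.
    n = len(values)
    best_v, best_score = None, float("inf")
    for v in sorted(set(values)):
        left = [l for a, l in zip(values, labels) if a < v]
        right = [l for a, l in zip(values, labels) if a >= v]
        score = (len(left) / n) * gini_of_labels(left) + (len(right) / n) * gini_of_labels(right)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score

# Hypothetical continuous attribute (income in $1000s) with class labels
values = [30, 40, 60, 80, 95]
labels = ["risky", "risky", "safe", "safe", "safe"]
print(best_threshold(values, labels))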
Bayes Classification Methods
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities such as the probability that a given tuple
belongs to a particular class.
Bayesian classification is based on Bayes' theorem, described next. Studies comparing
classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian
classifier to be comparable in performance with decision tree and selected neural network
classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large
databases.
Naive Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class-conditional
independence.
It is made to simplify the computations involved and, in this sense, is considered “naïve.”
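Concretely, for a tuple X = (x1, x2, ..., xn) and a class Ci, class-conditional independence lets the naïve Bayesian classifier estimate
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci),
i.e., a product of one-dimensional probabilities, each of which is easy to estimate from the training data.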
Bayes' Theorem
Bayes' theorem is named after Thomas Bayes.
Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is described
by measurements made on a set of n attributes. Let H be some hypothesis such as that the data
tuple X belongs to a specified class C. For classification problems, we want to determine
P(H|X), the probability that hypothesis H holds given the “evidence” or observed data tuple X.
In other words, we are looking for the probability that tuple X belongs to class C, given that
we know the attribute description of X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For
example, suppose our world of data tuples is confined to customers described by the attributes
age and income, respectively, and that X is a 35-year-old customer with an income of $40,000.
Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects
the probability that customer X will buy a computer given that we know the customer’s age
and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this
is the probability that any given customer will buy a computer, regardless of age, income, or
any other information, for that matter. The posterior probability, P(H|X), is based on more
information (e.g., customer information) than the prior probability, P(H), which is independent
of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability
that a customer, X, is 35 years old and earns $40,000, given that we know the customer will
buy a computer.
P(X) is the prior probability of X. Using our example, it is the probability that a person from
our set of customers is 35 years old and earns $40,000.
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from the
given data, as we shall see next. Bayes’ theorem is useful in that it provides a way of
calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X)
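As a quick worked sketch of the formula, the snippet below plugs in made-up probabilities for the buys-a-computer example; the numbers are illustrative, not estimates from real data.

# Hypothetical probabilities for H = "customer buys a computer" and
# X = "customer is 35 years old and earns $40,000" (made-up values).
p_h = 0.4            # P(H): prior probability that any customer buys a computer
p_x_given_h = 0.2    # P(X|H): probability of X's attribute values among buyers
p_x = 0.1            # P(X): prior probability of X's attribute values overall

p_h_given_x = p_x_given_h * p_h / p_x    # Bayes' theorem
print(p_h_given_x)                       # 0.8 = posterior probability P(H|X)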