databases classifiers

1. How are decision trees used for classification? Explain the decision tree induction
algorithm for classification. [Dec-14/Jan 2015][10marks]
To illustrate how classification with a decision tree works, consider a simpler version of the
vertebrate classification problem described in the previous section. Instead of classifying the
vertebrates into five distinct groups of species, we assign them to two categories:
The tree has three types of nodes:
• A root node that has no incoming edges and zero or more outgoing edges.
• Internal nodes, each of which has exactly one incoming edge and two or more outgoing edges.
• Leaf or terminal nodes, each of which has exactly one incoming edge and no outgoing edges.
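The node types above can be sketched in code. This is a minimal illustration, not the textbook's implementation: the `Node` class, the attribute names (`body_temperature`, `gives_birth`), and the two class labels are hypothetical and chosen only to show how a record is routed from the root to a leaf.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Node:
    """A tree node: internal nodes test an attribute, leaves carry a class label."""
    attribute: Optional[str] = None  # attribute tested here (None for a leaf)
    children: Dict[str, "Node"] = field(default_factory=dict)  # outgoing edges, keyed by attribute value
    label: Optional[str] = None  # class label (set only on leaves)

    def is_leaf(self) -> bool:
        # A leaf has no outgoing edges.
        return not self.children


def classify(node: Node, record: Dict[str, str]) -> Optional[str]:
    """Start at the root and follow the matching edge until a leaf is reached."""
    while not node.is_leaf():
        node = node.children[record[node.attribute]]
    return node.label


# Hypothetical two-class tree (attribute names and labels are illustrative only).
tree = Node(
    attribute="body_temperature",
    children={
        "cold": Node(label="non-mammal"),
        "warm": Node(
            attribute="gives_birth",
            children={"yes": Node(label="mammal"), "no": Node(label="non-mammal")},
        ),
    },
)

print(classify(tree, {"body_temperature": "warm", "gives_birth": "yes"}))
```

The root and the internal node each apply one attribute test, and the class label is read off the leaf where the path ends.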
2. How can the accuracy of classification be improved? Explain. [Dec-14/Jan 2015][5marks]
There are several tricks for increasing classification accuracy. We focus on ensemble methods. An ensemble
for classification is a composite model, made up of a combination of classifiers. The individual
classifiers vote, and a class label prediction is returned by the ensemble based on the collection of
votes. Ensembles tend to be more accurate than their component classifiers. We start off by
introducing ensemble methods in general. Bagging, boosting, and random forests are popular
ensemble methods. Traditional learning models assume that the data classes are well distributed.
In many real-world data domains
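The voting step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the three threshold classifiers are hypothetical stand-ins for trained models, and `bootstrap_sample` only shows the sampling-with-replacement step that bagging applies before training each component.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Record = Tuple[float, ...]
Classifier = Callable[[Record], str]


def majority_vote(classifiers: Sequence[Classifier], x: Record) -> str:
    """Return the class label predicted by most of the component classifiers."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


def bootstrap_sample(data: List, rng: random.Random) -> List:
    """Draw len(data) items with replacement (the resampling step used by bagging)."""
    return [rng.choice(data) for _ in data]


# Three hypothetical threshold classifiers that disagree near the class boundary.
classifiers: List[Classifier] = [
    lambda x: "pos" if x[0] > 0.4 else "neg",
    lambda x: "pos" if x[0] > 0.5 else "neg",
    lambda x: "pos" if x[0] > 0.6 else "neg",
]

# Two of the three components vote "pos" for this record.
print(majority_vote(classifiers, (0.55,)))
```

In bagging, each component would be trained on its own bootstrap sample of the training data before the votes are combined.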
3. Explain the importance of evaluation criteria for classification methods [Dec-14/Jan
2015][8marks]
The input data for a classification task is a collection of records. Each record, also known as an
instance or example, is characterized by a tuple (x, y), where x is the attribute set and y is a special
attribute, designated as the class label. Consider a sample data set used for classifying vertebrates into one of
the following categories: mammal, bird, fish, reptile, or amphibian. The attribute set includes
properties of a vertebrate such as its body temperature, skin cover, method of reproduction, ability
to fly, and ability to live in water. The attribute set can also contain continuous features. The
class label, on the other hand, must be a discrete attribute. This is a key characteristic that
distinguishes classification from regression, a predictive modelling task in
4. Explain a. Continuous b. Discrete c. Asymmetric attributes with examples. [June/July 2014]
[10marks]
A discrete attribute has a finite or countably infinite set of values. Such attributes can be categorical
or numeric. Discrete attributes are often represented by integer values, e.g. zip codes, counts, etc.
A continuous attribute is one whose values are real numbers. Continuous attributes are typically
represented as floating-point variables, e.g. temperature, height, etc. For asymmetric attributes, only
presence (a non-zero attribute value) is regarded as important. E.g. consider a data set where each
object is a student and each attribute records whether or not the student took a particular course at
a university. For a specific student, an attribute has a value of 1 if the student took the course and a
value of 0 otherwise. Because each student takes only a small fraction of all available courses, most of the
values in such a data set would be 0. Therefore it is more meaningful and more efficient to focus on
the non-zero values.
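The student/course example can be sketched as below. The course names and enrolment rows are hypothetical. The Jaccard coefficient is not mentioned in the text above; it is used here as one common similarity measure that ignores 0-0 matches, which is the usual reason for treating such attributes as asymmetric.

```python
from typing import Dict, List, Set

# Hypothetical course catalogue and enrolment data (illustrative only).
courses: List[str] = ["math", "physics", "history", "biology", "music", "art"]
dense: Dict[str, List[int]] = {
    "alice": [1, 0, 0, 1, 0, 0],
    "bob": [0, 0, 1, 0, 0, 0],
    "carol": [1, 0, 0, 0, 0, 0],
}


def to_sparse(row: List[int]) -> Set[str]:
    """Keep only the courses whose attribute value is non-zero."""
    return {course for course, taken in zip(courses, row) if taken}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Similarity based only on courses taken; shared zeros do not count."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


sparse = {name: to_sparse(row) for name, row in dense.items()}
print(sparse["alice"])
print(jaccard(sparse["alice"], sparse["carol"]))
```

Storing only the non-zero entries is the efficiency gain; ignoring shared zeros when comparing students is what makes the comparison more meaningful.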
5. Explain Hunt's algorithm and illustrate its working. [june/july 2015][10 marks]
[June/July 2014][10marks]
Data cleaning: This refers to the preprocessing of data in order to remove or reduce noise (by
applying smoothing techniques, for example) and the treatment of missing values (e.g., by
replacing a missing value with the most commonly occurring value for that attribute, or with the
most probable value based on statistics). Although most classification algorithms have some
mechanisms for handling noisy or missing data, this step can help reduce confusion during
learning.
Relevance analysis: Many of the attributes in the data may be redundant. Correlation analysis can
be used to identify whether any two given attributes are statistically related. For example, a strong
correlation between attributes A1 and A2 would suggest that one of the two could be removed from
further analysis. A database may also contain irrelevant attributes. Attribute subset selection can
be used in these cases to find a reduced set of attributes such that the resulting probability
distribution of the data classes is as close as possible to the original distribution obtained using all
attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset
selection, can be used to detect attributes that do not contribute to the classification or prediction
task. Including such attributes may otherwise slow down, and possibly mislead, the learning step.
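Both preprocessing steps can be sketched in a few lines. This is a minimal sketch, assuming missing values are encoded as `None` and that the Pearson coefficient is the correlation measure used (the text does not name a specific one). The sample values for A1 and A2 are made up.

```python
from collections import Counter
from math import sqrt
from typing import List, Optional


def fill_missing_with_mode(values: List[Optional[str]]) -> List[str]:
    """Data cleaning: replace missing entries with the most commonly occurring value."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def pearson(a: List[float], b: List[float]) -> float:
    """Relevance analysis: Pearson correlation between two numeric attributes."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical attribute columns: A2 is exactly twice A1, so one of them is redundant.
A1 = [1.0, 2.0, 3.0, 4.0]
A2 = [2.0, 4.0, 6.0, 8.0]

print(fill_missing_with_mode(["red", None, "blue", "red"]))
print(pearson(A1, A2))
```

A coefficient close to 1 or -1 suggests that one of the two attributes could be dropped from further analysis.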
6. What is a rule-based classifier? Explain how a rule-based classifier works. [Dec-14/Jan
2015][10marks] [Dec 13/jan14][7 marks]
Using IF-THEN Rules for Classification
Rules are a good way of representing information or bits of knowledge. A rule-based classifier
uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form IF
condition THEN conclusion.
An example is rule R1,
R1: IF age = youth AND student = yes THEN buys_computer = yes.
The “IF”-part (or left-hand side) of a rule is known as the rule antecedent or precondition.
The “THEN”-part (or right-hand side) is the rule consequent. In the rule antecedent, the
condition consists of one or more attribute tests (such as age = youth, and student = yes) that are
logically ANDed. The rule’s consequent contains a class prediction (in this case, we are predicting
whether a customer will buy a computer). R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).
If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple,
we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule
covers the tuple.
A rule R can be assessed by its coverage and accuracy. Given a tuple, X, from a class-labeled data
set, D, let n_covers be the number of tuples covered by R; n_correct be the number of tuples correctly
classified by R; and |D| be the number of tuples in D. We can define the coverage and accuracy of
R as
    coverage(R) = n_covers / |D|
    accuracy(R) = n_correct / n_covers
That is, a rule’s coverage is the percentage of tuples that are covered by the rule (i.e., whose
attribute values hold true for the rule’s antecedent). For a rule’s accuracy, we look at the tuples
that it covers and see what percentage of them the rule can correctly classify.
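The two measures can be computed directly from their definitions. The sketch below applies rule R1 to a small, made-up data set D. The four tuples and the helper names (`covers`, `coverage_and_accuracy`) are hypothetical and serve only to show the arithmetic.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]

# Hypothetical class-labeled data set D (values are illustrative only).
D: List[Row] = [
    {"age": "youth", "student": "yes", "buys_computer": "yes"},
    {"age": "youth", "student": "yes", "buys_computer": "no"},
    {"age": "youth", "student": "no", "buys_computer": "no"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]


def covers(antecedent: Row, row: Row) -> bool:
    """A rule covers a tuple when every attribute test in its antecedent holds."""
    return all(row[attr] == value for attr, value in antecedent.items())


def coverage_and_accuracy(
    antecedent: Row, class_attr: str, predicted: str, data: List[Row]
) -> Tuple[float, float]:
    """Return (n_covers / |D|, n_correct / n_covers) for one IF-THEN rule."""
    covered = [row for row in data if covers(antecedent, row)]
    correct = [row for row in covered if row[class_attr] == predicted]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy


# R1: IF age = youth AND student = yes THEN buys_computer = yes
r1_antecedent = {"age": "youth", "student": "yes"}
print(coverage_and_accuracy(r1_antecedent, "buys_computer", "yes", D))
```

Here R1 covers 2 of the 4 tuples (coverage 0.5) and classifies 1 of those 2 correctly (accuracy 0.5).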
7. Write the algorithm for k-nearest neighbour classification. [june/july 2015] [Dec 13/jan14][3 marks]
Data clustering is under vigorous development. Contributing areas of research include data mining,
statistics, machine learning, spatial database technology, biology, and marketing. Owing to the
huge amounts of data collected in databases, cluster analysis has recently become a highly active
topic in data mining research. As a branch of statistics, cluster analysis has been extensively
studied for many years, focusing mainly on distance-based cluster analysis. Cluster analysis tools
based on k-means, k-medoids, and several other methods have also been built into many statistical
analysis software packages or systems, such as S-Plus, SPSS, and SAS. In machine learning,
clustering is an example of unsupervised learning.
Unlike classification, clustering and unsupervised learning do not rely on predefined classes and
class-labeled training examples. For this reason, clustering is a form of learning by observation,
rather than learning by examples. In data mining, efforts have focused on finding methods for
efficient and effective cluster analysis in large databases. Active themes of research focus on the
scalability of clustering methods and the effectiveness of methods for clustering complex shapes.
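The passage above discusses clustering and does not spell out the k-nearest neighbour procedure that the question asks for. A minimal sketch of k-NN classification follows, assuming Euclidean distance and an unweighted majority vote (common choices, not taken from this text). The training points and labels are made up.

```python
from collections import Counter
from math import dist
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def knn_classify(train: List[Example], query: Sequence[float], k: int = 3) -> str:
    """Label the query with the majority class among its k nearest training examples."""
    # Step 1: sort the training examples by distance to the query point.
    by_distance = sorted(train, key=lambda example: dist(example[0], query))
    # Step 2: keep the k closest examples.
    neighbours = by_distance[:k]
    # Step 3: return the most frequent class label among those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training set with two well-separated classes.
train: List[Example] = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]

print(knn_classify(train, (1.0, 0.9), k=3))
```

Unlike clustering, this procedure relies on class-labeled training examples, so it is a form of learning by examples.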