randomly mining

1. Explain (a) the origins of data mining and (b) data mining tasks in brief. [June/July 2014]
[10 marks]
We have observed various types of databases and information repositories on which data mining
can be performed. Let us now examine the kinds of data patterns that can be mined. Data mining
functionalities are used to specify the kind of patterns to be found in data mining tasks. In general,
data mining tasks can be classified into two categories: descriptive and predictive. Descriptive
mining tasks characterize the general properties of the data in the database. Predictive mining tasks
perform inference on the current data in order to make predictions.
In some cases, users may have no idea regarding what kinds of patterns in their data may be
interesting, and hence may like to search for several different kinds of patterns in parallel. Thus it
is important to have a data mining system that can mine multiple kinds of patterns to accommodate
different user expectations or applications. Furthermore, data mining systems should be able to
discover patterns at various granularities (i.e., different levels of abstraction).
Data mining systems should also allow users to specify hints to guide or focus the search for
interesting patterns. Because some patterns may not hold for all of the data in the database, a
measure of certainty or “trustworthiness” is usually associated with each discovered pattern. Data
mining functionalities, and the kinds of patterns they can discover, are described below.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts. For example, in the AllElectronics store, classes
of items for sale include computers and printers, and concepts of customers include bigSpenders
and budgetSpenders. It can be useful to describe individual classes and concepts in summarized,
concise, and yet precise terms. Such descriptions of a class or a concept are called class/concept
descriptions. These descriptions can be derived via (1) data characterization, by summarizing the
data of the class under study (often called the target class) in general terms, or (2) data
discrimination, by comparison of the target class with one or a set of comparative classes (often
called the contrasting classes), or (3) both data characterization and discrimination.
Data characterization is a summarization of the general characteristics or features of a target class
of data. The data corresponding to the user-specified class are typically collected by a database
query. For example, to study the characteristics of software products whose sales increased by
10% in the last year, the data related to such products can be collected by executing an SQL query.
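As a rough illustration of collecting the target class with an SQL query, the sketch below uses Python's built-in sqlite3 module. The table name `products`, its columns, and the sample rows are hypothetical and not taken from the AllElectronics example; "increased by 10%" is interpreted here as "increased by at least 10%", which is an assumption.

```python
import sqlite3

# In-memory database with a hypothetical products table (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (name TEXT, category TEXT, sales_prev REAL, sales_curr REAL)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        ("EditPro", "software", 100.0, 115.0),
        ("CalcSuite", "software", 200.0, 210.0),
        ("PhotoFix", "software", 50.0, 60.0),
        ("LaserJet", "printer", 300.0, 400.0),
    ],
)

# Collect the target class: software products whose sales rose by at least 10%.
rows = conn.execute(
    """
    SELECT name, (sales_curr - sales_prev) / sales_prev AS growth
    FROM products
    WHERE category = 'software' AND sales_curr >= 1.10 * sales_prev
    ORDER BY name
    """
).fetchall()

target_names = [name for name, _ in rows]
# A very simple characterization: average growth over the target class.
average_growth = sum(growth for _, growth in rows) / len(rows)
```

The rows returned by the query are the data of the target class; any summary computed over them (here, the average growth) is one simple form of characterization.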
2. What is Bayes' theorem? Show how it is used for classification. [Dec 2014/Jan 2015] [10 marks],
[June/July 2014] [10 marks], [June/July 2015]
3. Discuss the methods for estimating the predictive accuracy of a classification method. [Dec
2013/Jan 2014] [7 marks]
How can we use the above measures to obtain a reliable estimate of classifier accuracy (or predictor
accuracy in terms of error)? Holdout, random subsampling, cross-validation, and the bootstrap are
common techniques for assessing accuracy based on randomly sampled partitions of the given data.
Using such techniques to estimate accuracy increases the overall computation time, yet is useful
for model selection.
Holdout Method and Random Subsampling
The holdout method is what we have alluded to so far in our discussions about accuracy. In this
method, the given data are randomly partitioned into two independent sets, a training set and a test
set. Typically, two-thirds of the data are allocated to the training set, and the remaining one-third is
allocated to the test set. The training set is used to derive the model, whose accuracy is estimated
with the test set. The estimate is pessimistic because only a portion of the initial data is used to
derive the model.
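The holdout procedure described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the sample data and the majority-label "model" are assumptions chosen to keep the example short, not part of the original answer.

```python
import random

def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Randomly partition data into independent training and test sets."""
    rng = random.Random(seed)
    shuffled = data[:]  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_majority(train):
    """Toy model (assumption): always predict the most common training label."""
    labels = [label for _, label in train]
    return max(sorted(set(labels)), key=labels.count)

def accuracy(predicted_label, test):
    """Fraction of test tuples whose label matches the prediction."""
    correct = sum(1 for _, label in test if label == predicted_label)
    return correct / len(test)

# Hypothetical labeled tuples: (feature, class label).
data = [(i, "yes" if i % 3 else "no") for i in range(30)]

train, test = holdout_split(data)          # two-thirds train, one-third test
model = train_majority(train)              # derive the model from the training set
holdout_accuracy = accuracy(model, test)   # estimate accuracy on the test set
```

Any real classifier could take the place of `train_majority`; the point is only that the model never sees the test tuples during training.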
Random subsampling is a variation of the holdout method in which the holdout method is repeated
k times. The overall accuracy estimate is taken as the average of the accuracies obtained from each
iteration. (For prediction, we can take the average of the predictor error rates.)
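Random subsampling, as described above, simply repeats the holdout split k times and averages the results. The sketch below assumes the same kind of toy majority-label model and made-up data as a stand-in for a real classifier; the function names are illustrative only.

```python
import random
from statistics import mean

def holdout_accuracy(data, rng, train_fraction=2 / 3):
    """One holdout round: random split, train a toy model, score on the test set."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    # Toy model (assumption): predict the most common training label.
    labels = [label for _, label in train]
    majority = max(sorted(set(labels)), key=labels.count)
    return sum(label == majority for _, label in test) / len(test)

def random_subsampling(data, k=10, seed=0):
    """Repeat the holdout method k times and average the accuracies."""
    rng = random.Random(seed)
    accuracies = [holdout_accuracy(data, rng) for _ in range(k)]
    return mean(accuracies), accuracies

# Hypothetical labeled tuples: (feature, class label).
data = [(i, "yes" if i % 3 else "no") for i in range(30)]
overall_accuracy, per_run = random_subsampling(data, k=5)
```

Because each round draws a fresh random split, a given tuple may land in the test set several times or not at all across the k rounds.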
Cross-validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive
subsets or “folds,” D1, D2, ..., Dk, each of approximately equal size. Training and testing are
performed k times. In iteration i, partition Di is reserved as the test set, and the remaining partitions
are collectively used to train the model. That is, in the first iteration, subsets D2, ..., Dk
collectively serve as the training set in order to obtain a first model, which is tested on D1; the
second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on. Unlike the
holdout and random subsampling methods above, here, each sample is used the same number of
times for training and once for testing. For classification, the accuracy estimate is the overall
number of correct classifications from the k iterations, divided by the total number of tuples in the
initial data. For prediction, the error estimate can be computed as the total loss from the k iterations,
divided by the total number of initial tuples.
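A minimal sketch of k-fold cross-validation for classification follows. It uses only the standard library, made-up data, and a toy majority-label model as a placeholder for a real classifier; all of these are assumptions for illustration.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Randomly partition indices 0..n-1 into k mutually exclusive folds
    of approximately equal size."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k=5, seed=0):
    """Accuracy estimate: total correct classifications over the k iterations,
    divided by the total number of tuples in the initial data."""
    folds = k_fold_indices(len(data), k, seed)
    total_correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train = [data[j] for j in range(len(data)) if j not in held_out]
        test = [data[j] for j in test_idx]
        # Toy model (assumption): predict the most common training label.
        labels = [label for _, label in train]
        majority = max(sorted(set(labels)), key=labels.count)
        total_correct += sum(label == majority for _, label in test)
    return total_correct / len(data)

# Hypothetical labeled tuples: (feature, class label).
data = [(i, "yes" if i % 3 else "no") for i in range(30)]
folds = k_fold_indices(len(data), 5)
cv_accuracy = cross_validate(data, k=5)
```

Each tuple appears in exactly one fold, so it is tested once and used for training in the other k - 1 iterations.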
4. What are the two approaches for extending binary classifiers to handle multiclass
problems? [Dec 2013/Jan 2014] [7 marks]
Supervised learning (classification)
1. Supervision: The training data (observations, measurements, etc.) are accompanied by labels
indicating the class of the observations.
2. New data are classified based on the training set.
Unsupervised learning (clustering)
1. The class labels of the training data are unknown.
2. Given a set of measurements, observations, etc., the aim is to establish the existence of
classes or clusters in the data.