CIS600/CSE 690: Analytical Data Mining
Some Basic Concepts in Data Mining
(This set will be updated as needed during the semester)
September 8, 2009

1. Supervised and unsupervised learning
2. Model selection and assessment
3. Training, validation and test data
4. Cross validation

(1) Supervised and Unsupervised Learning

Supervised Learning
This type of learning is "similar" to human learning from experience. Since computers have no experience, we provide previous data, called training data, as a substitute. It is analogous to learning from a teacher, hence the name supervised. Two such tasks in data mining are classification and prediction. In classification, data attributes are related to a class label, while in prediction they are related to a numerical value.

Unsupervised Learning
In this type of learning, we discover patterns in the data attributes in order to learn about or better understand the data. Clustering algorithms are used to discover such patterns, i.e., to determine data clusters. Algorithms are employed to organize the data into groups (clusters) whose members are similar in some way to one another and different from the members of other groups.

(2) Model Selection and Assessment

In data mining applications we seek a model that learns well from the available data and also has good generalization performance. Toward this objective, we need a way to manage model complexity and a method to measure the performance of the chosen model. A common approach is to divide the data into three sets: training, validation and test. The training data are used to learn or develop candidate models, the validation set is used to select a model, and the test set is used to assess the performance of the selected model on future data. However, in many applications only two sets are created, training and test.

(3) Training, Validation and Test Data

Example:

(A) We have data on 16 items, their d attributes (KNOWN FOR ALL DATA ITEMS) and their class labels. RANDOMLY divide them into 8 items for training, 4 for validation and 4 for testing:

    Item No.   Class   Set
        1        0     Training
        2        0     Training
        3        1     Training
        4        1     Training
        5        1     Training
        6        1     Training
        7        0     Training
        8        0     Training
        9        0     Validation
       10        0     Validation
       11        1     Validation
       12        0     Validation
       13        0     Test
       14        0     Test
       15        1     Test
       16        1     Test

(B) Next, suppose we develop three classification models A, B, C from the training data. Let the training errors of these models be as shown below (recall that the models do not necessarily provide perfect results on the training data, nor are they required to):

    Item No.   True Class   Model A   Model B   Model C
        1          0           0         1         1
        2          0           0         0         0
        3          1           0         1         0
        4          1           1         0         1
        5          1           0         0         0
        6          1           1         1         1
        7          0           0         0         0
        8          0           0         0         0
    Classification Error      2/8       3/8       3/8

(C) Next, use the three models A, B, C to classify each item in the validation set based on its attribute values. Recall that we know the true labels of these items as well. Suppose we get the following results:

    Item No.   True Class   Model A   Model B   Model C
        9          0           1         0         0
       10          0           0         1         0
       11          1           0         1         0
       12          0           0         1         0
    Classification Error      2/4       2/4       1/4

If we use minimum validation error as the model selection criterion, we select model C.

(D) Now use model C to determine the class value of each data point in the test set. We do so by substituting the (known) attribute values into classification model C. Again, recall that we know the true label of each of these items, so we can compare the values obtained from the classification model with the true labels to determine the classification error on the test set. Suppose we get the following results:

    Item No.   True Class   Model C
       13          0           0
       14          0           0
       15          1           0
       16          1           1
    Classification Error      1/4
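The selection step in (C) and the assessment step in (D) amount to picking the model with minimum validation error and then measuring that one model's error on the held-out test set. A minimal Python sketch of these two steps follows; the labels and predictions are copied from the tables above, and the error_rate helper is illustrative rather than part of the original example.

    def error_rate(true_labels, predictions):
        # Fraction of items whose predicted class differs from the true class.
        wrong = sum(t != p for t, p in zip(true_labels, predictions))
        return wrong / len(true_labels)

    # Validation set (items 9-12): true classes and each model's predictions.
    val_true = [0, 0, 1, 0]
    val_preds = {"A": [1, 0, 0, 0], "B": [0, 1, 1, 1], "C": [0, 0, 0, 0]}

    # Select the model with minimum validation error
    # (validation errors: A 2/4, B 2/4, C 1/4, so C is chosen).
    best = min(val_preds, key=lambda m: error_rate(val_true, val_preds[m]))

    # Assess the selected model once on the test set (items 13-16).
    test_true = [0, 0, 1, 1]
    test_pred_c = [0, 0, 0, 1]
    print(best, error_rate(test_true, test_pred_c))   # prints: C 0.25

Note that the test set plays no role in choosing among A, B and C; it is used only once, to estimate how the chosen model will perform on future data.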
(E) Based on the above, an estimate of the generalization error is 25%. This means that if we use model C to classify future items, for which only the attributes will be known (not the class labels), we are likely to make incorrect classifications about 25% of the time.

(F) A summary of the above (all errors in %):

    Model   Training   Validation   Test
      A       25          50         --
      B       37.5        50         --
      C       37.5        25         25

(4) Cross Validation

If the available data are limited, we employ Cross Validation (CV). In this approach, the data are randomly divided into k approximately equal sets. Training is done on (k-1) of the sets and the remaining set is used for testing. This process is repeated k times (k-fold CV). The average error over the k repetitions is used as a measure of the test error. For the special case where k equals the number of data items, so that each test set consists of a single item, the procedure is called Leave-One-Out Cross Validation (LOO-CV).

EXAMPLE: Consider the above data consisting of 16 items.

(A) Let k = 4, i.e., 4-fold cross validation. Divide the data into four sets of 4 items each. Suppose the following setup occurs and the errors obtained are as shown:

             Training           Test          Error on test set (assumed)
    Set 1    Items 1-12         Items 13-16   25%
    Set 2    Items 1-8, 13-16   Items 9-12    35%
    Set 3    Items 1-4, 9-16    Items 5-8     28%
    Set 4    Items 5-16         Items 1-4     32%

    Estimated Classification Error (CE) = (25 + 35 + 28 + 32)/4 = 30%

(B) LOO-CV
For this, the data are divided into 16 sets, each consisting of 15 training items and one test item. Since each test set contains a single item, the error on it is either 0% or 100%.

             Training           Test      Error on test set (assumed)
    Set 1    Items 1-15         Item 16   0%
    Set 2    Items 1-14, 16     Item 15   100%
    ...      ...                ...       ...
    Set 15   Items 1, 3-16      Item 2    100%
    Set 16   Items 2-16         Item 1    100%

Suppose the average classification error based on the values in the last column is CE = 32%. Then the estimate of the test error is 32%.
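The mechanics of dividing the data into folds and averaging the fold errors can be sketched in Python as follows. The train_and_test(train_items, test_items) routine is a hypothetical stand-in, assumed to fit a model on the training items and return its error rate on the test items; folds of exactly equal size are assumed for simplicity.

    import random

    def k_fold_cv_error(items, k, train_and_test):
        # Randomly divide the items into k equal folds; for each fold,
        # train on the other k-1 folds and test on the held-out fold.
        items = list(items)
        random.shuffle(items)
        fold_size = len(items) // k          # assumes len(items) % k == 0
        errors = []
        for i in range(k):
            test_fold = items[i * fold_size:(i + 1) * fold_size]
            train_folds = items[:i * fold_size] + items[(i + 1) * fold_size:]
            errors.append(train_and_test(train_folds, test_fold))
        # The CV estimate is the average error over the k repetitions.
        return sum(errors) / k

With the 16 items above and k = 4, the assumed fold errors of 25%, 35%, 28% and 32% average to the 30% estimate in (A); setting k = 16 gives LOO-CV, where each fold error is 0% or 100% as in (B).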