
CIS600/CSE 690: Analytical Data Mining
Some Basic Concepts in Data Mining
(This set will be updated as needed during the semester) September 8, 2009
1. Supervised and unsupervised learning
2. Model selection and assessment
3. Training, validation and test data
4. Cross validation
(1) Supervised and unsupervised learning
Supervised Learning
This type of learning is “similar” to human learning from experience. Since computers have no
experience, we provide previous data, called training data, as a substitute. It is analogous to learning
from a teacher, hence the name supervised. Two such tasks in data mining are classification and
prediction. In classification, data attributes are related to a class label, while in prediction they are
related to a numerical value.
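To make the distinction concrete, here is a minimal sketch in Python, assuming scikit-learn is available; the toy attribute vectors, targets, and choice of decision-tree models are invented purely for illustration and are not part of the course material:

    # A minimal sketch, assuming scikit-learn is installed; toy data invented.
    from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

    X = [[0.5, 1.2], [1.0, 0.8], [2.1, 3.3], [1.9, 2.7]]  # attribute vectors

    # Classification: the supervised targets are class labels.
    clf = DecisionTreeClassifier().fit(X, [0, 0, 1, 1])
    print(clf.predict([[2.0, 3.0]]))   # -> a class label, e.g. [1]

    # Prediction: the supervised targets are numerical values.
    reg = DecisionTreeRegressor().fit(X, [10.0, 12.5, 30.2, 28.9])
    print(reg.predict([[2.0, 3.0]]))   # -> a numerical estimate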
Unsupervised Learning
In this type of learning, we discover patterns in the data attributes in order to learn about or better
understand the data. Clustering algorithms are used to discover such patterns, i.e., to determine data
clusters. The algorithms organize the data into groups (clusters) whose members are similar to one
another in some way and different from the members of other groups.
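For instance, a minimal k-means sketch (again assuming scikit-learn; the unlabeled points are invented for illustration) that organizes data into two groups without any class labels:

    # A minimal sketch, assuming scikit-learn; unlabeled toy points invented.
    from sklearn.cluster import KMeans

    X = [[1.0, 1.1], [0.9, 1.0], [8.0, 8.2], [8.1, 7.9]]  # no class labels
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)   # group membership discovered from the data, e.g. [0 0 1 1]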
(2) Model selection and assessment
In data mining applications we seek a model that not only learns well from the available data but also
has good generalization performance. Towards this objective, we need a way to manage model
complexity and a method to measure the performance of the chosen model. A common approach is to
divide the data into three sets: training, validation and test. The training data are used to learn or
develop candidate models, the validation set is used to select a model, and the test set is used to assess
the performance of the selected model on future data. However, in many applications only two sets are
created, training and test.
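In plain Python, such a three-way random split might look as follows; the 8/4/4 sizes anticipate the 16-item example in the next section:

    # Randomly partition 16 item numbers into 8 training, 4 validation, 4 test.
    import random

    items = list(range(1, 17))       # item numbers 1..16
    random.shuffle(items)            # RANDOM division, as in the example below
    train, validation, test = items[:8], items[8:12], items[12:]
    print(train, validation, test)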
(3) Training, Validation and Test Data
Example:
(A). We have data on 16 data items, their attributes and class labels.
RANDOMLY divide them into 8 for training, 4 for validation and 4 for testing, as shown below.
Item No.   Class   Set
 1         0       Training
 2         0       Training
 3         1       Training
 4         1       Training
 5         1       Training
 6         1       Training
 7         0       Training
 8         0       Training
 9         0       Validation
10         0       Validation
11         1       Validation
12         0       Validation
13         0       Test
14         0       Test
15         1       Test
16         1       Test

(The d attribute values are KNOWN FOR ALL DATA ITEMS and are omitted from the table.)
(B). Next, suppose we develop three classification models A, B, C from the training
data. Let the training errors of these models be as shown below (recall that the
models do not necessarily provide perfect results on the training data, nor are they
required to).
Item No.   True Class   Model A   Model B   Model C
1          0            0         1         1
2          0            0         0         0
3          1            0         1         0
4          1            1         0         1
5          1            0         0         0
6          1            1         1         1
7          0            0         0         0
8          0            0         0         0

Classification error:   2/8       3/8       3/8

(The d attribute values are ALL KNOWN; the Model A, B, C columns give the classification results from each model.)
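These error rates are simply misclassification counts divided by the number of training items; a short sketch that recomputes them from the table above:

    # Recompute the training errors from the table above.
    true_class = [0, 0, 1, 1, 1, 1, 0, 0]            # items 1-8
    predictions = {
        "A": [0, 0, 0, 1, 0, 1, 0, 0],
        "B": [1, 0, 1, 0, 0, 1, 0, 0],
        "C": [1, 0, 0, 1, 0, 1, 0, 0],
    }
    for model, pred in predictions.items():
        wrong = sum(p != t for p, t in zip(pred, true_class))
        print(model, f"{wrong}/{len(true_class)}")   # A 2/8, B 3/8, C 3/8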
(C). Next, use the three models A, B, C to classify each item in the validation set
based on its attribute values. Recall that we know their true labels as well.
Suppose we get the following results:
Item No.   True Class   Model A   Model B   Model C
 9         0            1         0         0
10         0            0         1         0
11         1            0         1         0
12         0            0         1         0

Classification error:   2/4       2/4       1/4
If we use minimum validation error as the model selection criterion, we would select
model C.
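In code, this selection rule is just a minimum over the validation error rates tabulated above:

    # Model selection: pick the model with minimum validation error.
    validation_error = {"A": 2/4, "B": 2/4, "C": 1/4}
    best = min(validation_error, key=validation_error.get)
    print(best)   # -> C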
(D). Now use model C to determine the class value of each data point in the test set.
We do so by substituting the (known) attribute values into classification model C.
Again, recall that we know the true label of each of these data items, so we can
compare the values obtained from the classification model with the true labels to
determine the classification error on the test set. Suppose we get the following results.
Item No.   True Class   Model C
13         0            0
14         0            0
15         1            0
16         1            1

Classification error:   1/4

(The d attribute values are ALL KNOWN.)
(E). Based on the above, an estimate of the generalization error is 25%.
This means that if we use model C to classify future items, for which only the
attributes will be known, not the class labels, we are likely to make incorrect
classifications about 25% of the time.
(F). A summary of the above error rates (in %) is as follows:

Model   Training   Validation   Test
A       25         50           --
B       37.5       50           --
C       37.5       25           25
(4) Cross Validation
If the available data are limited, we employ Cross Validation (CV). In this approach, the data are
randomly divided into k sets of (almost) equal size. Training is done on (k-1) of the sets and the
remaining set is used for testing. This process is repeated k times, with each set serving once as the
test set (k-fold CV). The average error over the k repetitions is used as a measure of the test error.
For the special case k = n, where n is the number of data items, the above is called Leave-One-Out
Cross Validation (LOO-CV).
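A skeleton of this procedure in plain Python; `evaluate` is a hypothetical stand-in for whatever train-and-test routine is used, and the contiguous fold scheme matches the 4-fold example below:

    # Skeleton of k-fold cross validation; `evaluate` is a hypothetical
    # stand-in for the train-and-test step.
    def k_fold_cv(items, k, evaluate):
        n = len(items)
        folds = [items[i * n // k:(i + 1) * n // k] for i in range(k)]  # k near-equal sets
        errors = []
        for i in range(k):
            test = folds[i]                                    # hold out the i-th set
            train = [x for j in range(k) if j != i for x in folds[j]]
            errors.append(evaluate(train, test))               # test error on fold i
        return sum(errors) / k                                 # average over k repetitions

    # With k = len(items), each fold holds a single item: LOO-CV.

For the 16-item data with k = 4, k_fold_cv(list(range(1, 17)), 4, evaluate) produces exactly the four train/test splits tabulated below.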
EXAMPLE:
Consider the above data consisting of 16 items.
(A). Let k = 4, i.e., 4-fold Cross Validation.
Divide the data into four sets of 4 items each.
Suppose the following setup occurs and the errors obtained are as shown.
         Training            Test           Error on test set (assumed)
Set 1    Items 1-12          Items 13-16    25%
Set 2    Items 1-8, 13-16    Items 9-12     35%
Set 3    Items 1-4, 9-16     Items 5-8      28%
Set 4    Items 5-16          Items 1-4      32%
Estimated Classification Error (CE) = (25 + 35 + 28 + 32)/4 = 30%
(B). LOO-CV
For this, the data are divided into 16 training/test splits, each consisting of 15 training
items and one test item.
          Training            Test       Error on test set (assumed)
Set 1     Items 1-15          Item 16    0%
Set 2     Items 1-14, 16      Item 15    100%
...       ...                 ...        ...
Set 15    Items 1, 3-16       Item 2     100%
Set 16    Items 2-16          Item 1     100%
Suppose the average classification error, based on the error values in the last column, is
CE = 32%.
Then the estimate of the test error is 32%.