Lec10 QA Summary

1. Training/Validation/Testing accuracy
Fig. 1 Training/Validation/Test partition: the training set is for training the ML models; the validation set is for validation, i.e., choosing the parameter or model; the test set is for testing the “real” performance and generalization.
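As a rough sketch of how such a partition might be created in practice (Python with scikit-learn; the 60/20/20 ratio and the toy data are assumptions for illustration, not something fixed by the lecture):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy data standing in for a real dataset (hypothetical).
X, y = make_classification(n_samples=1000, random_state=0)

# First carve out the test set (20% of the data), then split the rest
# into training (60% overall) and validation (20% overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)  # 0.25 * 80% = 20%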
Example: Assume I want to build a Random Forest. I have a parameter to decide: shall I have
• 100 Trees?
• 200 Trees?
What we do next is to use the training set to train two classifiers:
• C1: Random Forest with 100 trees
• C2: Random Forest with 200 trees

We can then apply C1 and C2 on the training set, and compute the accuracy of C1 and C2, which is known as the Training Accuracy. In other words, we check the performance of C1 and C2 on the same dataset which is used for training them.

We then apply C1 and C2 on the validation set, and compare their accuracy. In this case, we are checking the accuracy of C1 and C2 on a dataset that is different from the training one. The accuracy of C1 and C2 on the validation set is known as the Validation Accuracy.

They have the following accuracy:
1. C1: Random Forest with 100 trees: training accuracy 92%, validation accuracy 90%
2. C2: Random Forest with 200 trees: training accuracy 95%, validation accuracy 88%

Which one do we choose for the real application, i.e., testing? The one with the higher validation accuracy, i.e., C1, the Random Forest with 100 trees!

Since C1 has a higher validation accuracy than C2, we will select C1 for deployment and use it for testing. When we apply C1 on the test set, the accuracy is known as the Testing (or Test) Accuracy.
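To make the whole procedure concrete, here is a minimal sketch in Python with scikit-learn, reusing the split from above (the variable names and random_state are illustrative assumptions):

from sklearn.ensemble import RandomForestClassifier

# Train C1 (100 trees) and C2 (200 trees) on the training set only.
c1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
c2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Training accuracy: performance on the same data used for training.
train_acc_c1 = c1.score(X_train, y_train)
train_acc_c2 = c2.score(X_train, y_train)

# Validation accuracy: performance on data NOT used for training.
val_acc_c1 = c1.score(X_val, y_val)
val_acc_c2 = c2.score(X_val, y_val)

# Select the model with the higher validation accuracy...
best = c1 if val_acc_c1 >= val_acc_c2 else c2

# ...and only then measure its test accuracy, the "real" performance.
test_acc = best.score(X_test, y_test)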
2. Is benchmarking on the test set considered as validation?
No, it is not validation. As mentioned during the lecture, validation is involved in the training stage and for the sake of parameter selection. Benchmarking on the test set is to compare the generalization capability (or the real performance) of trained models.
Imagine the following scenario:
Xinchao trained a classifier, and he had two parameter settings to choose from. Hence, he trained two models (C1, C2) using the training set. He then applied C1 and C2 on the validation set, and found that C1 works better than C2. Hence, Xinchao decided to use C1 for deployment (or testing).

Vincent trained another type of classifier, and he had three parameter settings to choose from. Hence, he trained three models (C3, C4, C5) using the training set. He then applied C3, C4, and C5 on the validation set, and found that C4 works better than C3 and C5. Hence, Vincent decided to use C4 for deployment (or testing).
Then the question is: given the same test set, whose classifier is better? Xinchao’s or Vincent’s?

To answer this question, we will check the accuracy of Xinchao’s C1 and Vincent’s C4 on the test set. If Vincent’s C4 yields a better accuracy, then Vincent’s classifier beats Xinchao’s. In other words, we are comparing the real performance, not doing any parameter selection here.
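In code, this benchmarking step boils down to one score per selected model (a sketch; xinchao_c1 and vincent_c4 are hypothetical names standing in for the classifiers each person already picked on his own validation set):

# The test set is touched only once, for the final comparison;
# no parameter selection happens here.
xinchao_test_acc = xinchao_c1.score(X_test, y_test)
vincent_test_acc = vincent_c4.score(X_test, y_test)
print("Better real performance:",
      "Vincent" if vincent_test_acc > xinchao_test_acc else "Xinchao")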
3. Can we use the False Negative/False Positive/Confusion Matrix for validation (i.e., for selecting parameters)?
Yes, we can.
For example, C1 is the Random Forest with 100 trees and C2 is the Random Forest with 200
trees.
If our focus is to select a parameter with a lower False Positive, we can compare C1 and C2 using the metric of False Positive on the validation set, and choose the one with the lower False Positive. The same goes for other metrics.
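As an example, assuming binary labels and the two Random Forests from Question 1, False-Positive-based selection on the validation set might look like this (a sketch, not the only way to do it):

from sklearn.metrics import confusion_matrix

def false_positives(model, X, y):
    # For binary labels the confusion matrix is [[TN, FP], [FN, TP]].
    tn, fp, fn, tp = confusion_matrix(y, model.predict(X)).ravel()
    return fp

# Compare C1 and C2 by False Positives on the validation set,
# and keep whichever produces fewer of them.
best = c1 if false_positives(c1, X_val, y_val) <= false_positives(c2, X_val, y_val) else c2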
4. DET/ROC/AUC Curves
Don’t worry about exactly how to derive these curves -- we won’t ask you such questions in the final exam. As long as you know that we *can* derive AUC from ROC, and ROC from DET, it would be great.

From Pages 25 to 31, the message is that, if we change the threshold on Y, we will end up having different FN/FP/TP/TN.

What if we want a metric that is invariant to the threshold on Y? Well, we use AUC, which is a single number that accounts for all the thresholds.
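To see the threshold dependence concretely, here is a small sketch (assuming a scikit-learn classifier with predict_proba, reusing the model selected above; the thresholds 0.3/0.5/0.7 are arbitrary examples):

from sklearn.metrics import confusion_matrix, roc_auc_score

# Scores for the positive class, i.e., the "Y" being thresholded.
scores = best.predict_proba(X_test)[:, 1]

# Different thresholds give different TP/FP/FN/TN counts.
for t in (0.3, 0.5, 0.7):
    preds = (scores >= t).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"threshold={t}: TP={tp} FP={fp} FN={fn} TN={tn}")

# AUC is a single number that integrates over all thresholds.
print("AUC:", roc_auc_score(y_test, scores))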