CZ4041/CE4041: Machine Learning
Lecture 1b: Overview of Supervised Learning
Sinno Jialin PAN
School of Computer Science and Engineering, NTU, Singapore
Homepage: http://www3.ntu.edu.sg/home/sinnopan/
Supervised Learning
 Recall: in supervised learning, the examples presented to the computer are pairs of inputs and their corresponding outputs (i.e., labeled training data); the goal is to "learn" a map, or model, from inputs to labels.
[Figure: face images as inputs; the corresponding outputs are the labels Female, Female, Male, Male]
Supervised Learning (cont.)
In mathematics,
 Given: a set of labeled training data (x_i, y_i) for i = 1, …, N, where x_i = [x_{i1}, x_{i2}, …, x_{im}] is a vector of features (the input) and y_i is a scalar (the output, or label),
 Goal: to learn a mapping f: x → y (the classifier) by requiring f(x_i) = y_i.
 The learned mapping f is expected to make precise predictions on any unseen x* as f(x*).
Example 1: Sentiment Classification
Review 1: "Compact; easy to operate; very good picture quality; looks sharp!" → y_1 = +1
Review 2: "It is also quite blurry in very dark settings. I will never_buy HP again." → y_2 = −1
Bag-of-words representation, with dictionary F1 = compact, F2 = easy, F3 = quite, F4 = blurry, F5 = good, F6 = never_buy, …

        F1  F2  F3  F4  F5  F6  …    y
 x_1     1   1   0   0   1   0  …   +1
 x_2     0   0   1   1   0   1  …   −1
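As a rough illustration of the bag-of-words idea above, the short Python sketch below builds the binary feature vectors x_1 and x_2 from the two reviews using the six dictionary words on this slide; the tokenisation is deliberately naive and is only an assumption for illustration.

    # Minimal bag-of-words sketch for the two reviews above.
    # The dictionary mirrors the slide (F1..F6); real systems build it from the corpus.
    dictionary = ["compact", "easy", "quite", "blurry", "good", "never_buy"]

    def bag_of_words(text, dictionary):
        """Return a binary feature vector: 1 if the dictionary word occurs in the text."""
        tokens = text.lower().replace(";", " ").replace("!", " ").replace(".", " ").split()
        return [1 if word in tokens else 0 for word in dictionary]

    x1 = bag_of_words("Compact; easy to operate; very good picture quality; looks sharp!", dictionary)
    x2 = bag_of_words("It is also quite blurry in very dark settings. I will never_buy HP again.", dictionary)
    print(x1)  # [1, 1, 0, 0, 1, 0]
    print(x2)  # [0, 0, 1, 1, 0, 1]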
Other Examples
[Figure: four face images x_1, x_2, x_3, x_4 with outputs Female, Female, Male, Male, i.e., y = −1, −1, +1, +1; image processing & computer vision techniques turn each image into a feature vector]
x_1 = [x_{11}, x_{12}, …, x_{1m}]
x_2 = [x_{21}, x_{22}, …, x_{2m}]
x_3 = [x_{31}, x_{32}, …, x_{3m}]
x_4 = [x_{41}, x_{42}, …, x_{4m}]
Wearable sensors (activity recognition):
x_t = [x_{t1}, x_{t2}, …, x_{tm}], y_t = l, where l ∈ {1, …, L}
E.g., 1 – working, 2 – sleeping, etc.
Other Examples (cont.)
1
Home Marital Taxable
Cheat
Owner Status Income
125K
Yes
Single
No
2
No
Married
3
No
Divorced 95K
Tid
100K
No
Yes
One-hot encoding
6
HO_Yes HO_No MS_S
𝒙1 1
0
1
1
0
𝒙2 0
𝒙3
0
1
0
MS_M
0
MS_D
0
TaxInc
125K
1
0
100K
0
1
95K
Cheat
−1
−1
+1
𝑦1
𝑦2
𝑦3
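A small sketch of how such a one-hot encoding can be produced in practice; this uses pandas' get_dummies, whose generated column names differ slightly from the slide's HO_*/MS_* names.

    import pandas as pd

    # The three records from the table above.
    records = pd.DataFrame({
        "HomeOwner":     ["Yes", "No", "No"],
        "MaritalStatus": ["Single", "Married", "Divorced"],
        "TaxableIncome": [125_000, 100_000, 95_000],
        "Cheat":         ["No", "No", "Yes"],
    })

    # One-hot encode the categorical columns; keep the numeric income column as-is.
    X = pd.get_dummies(records[["HomeOwner", "MaritalStatus"]])
    X["TaxInc"] = records["TaxableIncome"]

    # Map the Cheat label to +1 / -1 as on the slide.
    y = records["Cheat"].map({"Yes": 1, "No": -1})

    print(X)
    print(y.tolist())   # [-1, -1, 1]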
Classification vs. Regression
 For classification, y is discrete
   If y is binary, it is binary classification
   If y is nominal but not binary, it is multi-class classification
 For regression, y is continuous
 Can an example be assigned to more than one category?
Typical Learning Procedure
Two phases:
 Training phase
   Given labeled training data (x_i, y_i), for i = 1, …, N
   Apply a supervised learning algorithm to (x_i, y_i) to learn a model f such that f(x_i) = y_i
 Testing or prediction phase
   Given unseen test data x*_i, for i = 1, …, T
   Use the trained model f to make predictions f(x*_i)
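A minimal sketch of these two phases, using scikit-learn's logistic regression purely as an example classifier (the slides do not prescribe a particular algorithm); the tiny training set reuses the bag-of-words vectors from Example 1.

    from sklearn.linear_model import LogisticRegression

    # Training phase: labeled data (x_i, y_i), i = 1, ..., N
    X_train = [[1, 1, 0, 0, 1, 0],
               [0, 0, 1, 1, 0, 1]]
    y_train = [+1, -1]
    f = LogisticRegression().fit(X_train, y_train)

    # Testing / prediction phase: unseen data x*_i, i = 1, ..., T
    X_test = [[1, 0, 0, 0, 1, 0]]
    print(f.predict(X_test))   # predicted label f(x*)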
[Figure: inductive learning workflow]
Training data: feature vectors x_1, x_2, …, x_N over features F1, F2, F3, …, with labels y_1 = +1, y_2 = −1, …, y_N = −1
   → some classification algorithm → learned model f: x → y
Test data: an unseen feature vector x*
   → predicted label y* = f(x*)
Inductive Learning
 How to learn a predictive model f(·)?
   Bayesian Classifiers
   Decision Trees
   Artificial Neural Networks
   Support Vector Machines
   Regularized Regression Models
Bayesian Classifiers:
Naïve Bayes Classifier
[Figure: nodes Marital Status, Home Owner, and Taxable Income feeding into Cheat, with probability tables such as:]
P(MS=Married) = 0.5, P(MS=Single) = 0.4
P(HO=Yes) = 0.4
P(TI > 80K) = 0.4
P(C=Yes | MS=Married, HO=Yes, TI>80K) = 0.8
P(C=Yes | MS=Married, HO=No, TI>80K) = 0.6
P(C=Yes | MS=Married, HO=Yes, TI≤80K) = 0.7
P(C=Yes | MS=Married, HO=No, TI≤80K) = 0.5
…
Bayesian Classifiers:
Bayesian Belief Networks
Decision Tree
[Figure: an example decision tree]
Refund?
  Yes → NO
  No  → MarSt?
          Married → NO
          Single, Divorced → TaxInc?
                               < 80K → NO
                               > 80K → YES
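Read as code, the tree above corresponds to the following sketch; it is hand-written here just to make the decision logic concrete, whereas in practice the tree structure and the 80K threshold would be learned from training data.

    def predict_cheat(refund, marital_status, taxable_income):
        """Walk the example decision tree above and return the predicted class."""
        if refund == "Yes":
            return "NO"
        if marital_status == "Married":
            return "NO"
        # Single or Divorced: split on taxable income at 80K
        return "YES" if taxable_income > 80_000 else "NO"

    print(predict_cheat("No", "Divorced", 95_000))   # YES
    print(predict_cheat("Yes", "Single", 125_000))   # NO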
Artificial Neural Networks
[Figure: a feed-forward network with an input layer (x_1, x_2, …, x_m), one or more hidden layers, and an output layer (y_1, y_2, …, y_k)]
Support Vector Machines (SVMs)
[Figure: linear SVMs — a separating hyperplane (B2); nonlinear SVMs — handled via the kernel trick]
Regularized Regression Model
[Figure: fitted curves of t against x — linear regression on the left, nonlinear regression (kernel trick) on the right]
Lazy Learning
[Figure: the labeled training data x_1, x_2, …, x_N (features F1, F2, …, with labels y_1 = +1, y_2 = −1, …, y_N = −1) is stored directly; for an unseen test instance x*, the predicted label y* is produced from the stored data without first learning an explicit model f]
E.g., K nearest neighbor classifier
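A minimal K-nearest-neighbor sketch of this idea; the distance metric, the choice of K, and the toy data below are assumptions for illustration, not from the slides.

    from collections import Counter

    def knn_predict(x_star, X_train, y_train, k=3):
        """Predict the label of x_star by majority vote among its k nearest training points."""
        # Squared Euclidean distance from x_star to every stored training point
        dists = [sum((a - b) ** 2 for a, b in zip(x, x_star)) for x in X_train]
        # Indices of the k closest training points
        nearest = sorted(range(len(X_train)), key=lambda i: dists[i])[:k]
        # Majority vote over their labels
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

    X_train = [[1, 1, 0], [0, 0, 1], [1, 0, 1], [0, 1, 0]]
    y_train = [+1, -1, -1, +1]
    print(knn_predict([1, 1, 1], X_train, y_train, k=3))   # -1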
Evaluation of Supervised Learning
 Recall: the learned predictive model should be able to make predictions on previously unseen data as accurately as possible
 Solution: the whole training data is divided into a "training" set and a "test" set, with the "training" set used to learn the predictive model and the "test" set used to validate the performance of the learned model
[Figure: the original training data is split into Training Data and Validation Data]
[Figure: splitting the labeled data for evaluation]
All training data: (x_1, y_1), …, (x_N, y_N), with features F1, F2, F3, …
Used for training: (x_1, y_1), …, (x_L, y_L) → learn f: x → y
Used for evaluation: (x_{L+1}, y_{L+1}), …, (x_N, y_N) → compute ŷ_i = f(x_i) for i = L+1, …, N
Compare y_{L+1}, …, y_N vs. ŷ_{L+1}, …, ŷ_N using some evaluation metric
Common Evaluation Metrics
 Classification: accuracy, error rate, F-score, etc.

        Actual   Prediction
 x_1      +1        −1
 x_2      −1        −1
 x_3      +1        +1
 x_4      −1        +1
 x_5      −1        −1
 x_6      +1        +1
 x_7      −1        −1
 x_8      +1        +1

 Accuracy = 5/8
 Error rate = 3/8

 Regression: Mean Absolute Error (MAE), Root Mean Squared Error (RMSE)
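A small sketch of how these metrics are computed; the toy labels and values below are illustrative only, not taken from the table above.

    import math

    def accuracy(y_true, y_pred):
        """Fraction of predictions that match the actual labels; error rate = 1 - accuracy."""
        return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

    def mae(y_true, y_pred):
        """Mean Absolute Error for regression."""
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    def rmse(y_true, y_pred):
        """Root Mean Squared Error for regression."""
        return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

    # Classification example
    y_true = [+1, -1, +1, -1]
    y_pred = [+1, +1, +1, -1]
    print(accuracy(y_true, y_pred), 1 - accuracy(y_true, y_pred))   # accuracy, error rate

    # Regression example
    print(mae([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]), rmse([1.0, 2.0, 3.0], [1.5, 2.0, 2.0]))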
Split of Training and Evaluation Sets
 Random subsampling
 Cross validation
 Bootstrap
Random Subsampling
 Repeat k times: randomly split the original data (sampled without replacement) into a training set and a validation set, learn on the training set, and measure accuracy acc_i on the validation set
[Figure: 1st random sampling → Training Data + Validation Data → acc_1; 2nd random sampling → acc_2; …]
 Average the results: acc_avg = (1/k) Σ_{i=1}^{k} acc_i
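A minimal sketch of random subsampling; the split fraction, the value of k, and the stand-in majority-class scorer are assumptions made only to keep the sketch runnable.

    import random

    def train_and_score(train, val):
        # Stand-in for "learn a model on train, measure accuracy on val":
        # a trivial majority-class classifier, just to make the sketch self-contained.
        labels = [y for _, y in train]
        majority = max(set(labels), key=labels.count)
        return sum(1 for _, y in val if y == majority) / len(val)

    def random_subsampling(data, k=5, val_fraction=0.3):
        accs = []
        for _ in range(k):
            shuffled = random.sample(data, len(data))          # sampled without replacement
            n_val = max(1, int(len(data) * val_fraction))
            val, train = shuffled[:n_val], shuffled[n_val:]
            accs.append(train_and_score(train, val))
        return sum(accs) / k                                    # acc_avg = (1/k) * sum(acc_i)

    data = [([1, 0], +1), ([0, 1], -1), ([1, 1], +1), ([0, 0], -1), ([1, 0], +1), ([0, 1], -1)]
    print(random_subsampling(data, k=5))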
Cross Validation
 k-fold cross-validation: partition the data into k subsets of the same size
 Hold aside one subset for testing and use the rest to build the model
[Figure: k iterations; in each iteration a different subset serves as the test set and the remaining subsets are used for training]
5-fold Cross-Validation
 Partition the data into 5 subsets
 In each of 5 iterations, hold aside one subset for testing, use the other 4 for training, and evaluate performance
    Iteration 1: test on subset 1, train on subsets 2–5
    Iteration 2: test on subset 2, train on subsets 1, 3–5
    Iteration 3: test on subset 3, train on subsets 1, 2, 4, 5
    Iteration 4: test on subset 4, train on subsets 1–3, 5
    Iteration 5: test on subset 5, train on subsets 1–4
 Across the 5 iterations, every subset is used as the test set exactly once
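A minimal sketch of how the k folds can be generated, in pure Python just to show the mechanics; libraries such as scikit-learn provide ready-made versions of this.

    def k_fold_splits(n, k=5):
        """Yield (train_indices, test_indices) for each of the k folds."""
        fold_size = n // k
        indices = list(range(n))
        for i in range(k):
            test = indices[i * fold_size : (i + 1) * fold_size]
            train = indices[:i * fold_size] + indices[(i + 1) * fold_size:]
            yield train, test

    for fold, (train_idx, test_idx) in enumerate(k_fold_splits(10, k=5), start=1):
        print(f"fold {fold}: train on {train_idx}, test on {test_idx}")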
Leave-One-Out
 A special case of k-fold cross-validation:
   leave-one-out sets k = n, where n is the number of data instances available for training
Bootstrap
[Figure: 1st random sampling (sampled with replacement): Original Data → Training Data + Validation Data; 2nd random sampling: …; repeated as in random subsampling]
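A minimal bootstrap-sampling sketch; treating the instances never drawn into the training sample as the validation set is a common convention and is stated here as an assumption, since the slide only shows the resulting split.

    import random

    def bootstrap_split(data):
        """Draw a training set of size n with replacement; unused instances form the validation set."""
        n = len(data)
        train_idx = [random.randrange(n) for _ in range(n)]    # sampled with replacement
        chosen = set(train_idx)
        train = [data[i] for i in train_idx]
        val = [data[i] for i in range(n) if i not in chosen]   # assumed "out-of-sample" validation set
        return train, val

    data = list(range(10))
    train, val = bootstrap_split(data)
    print("training:", train)
    print("validation:", val)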
Next Week …
 We start by introducing Bayesian Classifiers
Thank you!