9 Exploring advanced methods

Advanced Computer Algorithms
School of Electrical and Computer Engineering
홍기주

If we build a model that predicts whether a patient is healthy from only their height and
weight, what problems can arise? (limited features)
◦ Training variance
 Even a small change in the training set can change the predictions substantially.
◦ Non-monotone effect
 The ideal healthy weight is in a bounded range, not arbitrarily heavy or arbitrarily light.
◦ Linearly inseparable data
 The data cannot be separated by a linear boundary.




Reducing training variance with bagging and
random forests
Learning non-monotone relationships with
generalized additive models
Increasing data separation with kernel methods
Modeling complex decision boundaries with
support vector machines

Decision trees are an attractive method for a
number of reasons:
◦ They take any type of data, numerical or categorical,
without any distributional assumptions and without
preprocessing.
◦ Most implementations (in particular, R’s) handle missing
data; the method is also robust to redundant and nonlinear
data.
◦ The algorithm is easy to use, and the output (the tree) is
relatively easy to understand.
◦ Once the model is fit, scoring is fast.

On the other hand, decision trees do have some
drawbacks:
◦ They have a tendency to overfit, especially without pruning
◦ They have high training variance: samples drawn from the
same population can produce trees with different structures
and different prediction accuracy
◦ Prediction accuracy can be low, compared to other methods
Bagging or random forests can be used to address these drawbacks.

Using bagging to improve prediction
◦ Data set
 spamD.tsv
(https://github.com/WinVector/zmPDSwR/raw/master/Spambase/spamD.tsv)

Using bagging to improve prediction
◦ Preparing Spambase data
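A minimal sketch of this step, assuming spamD.tsv has the book's layout (a spam label column
plus an rgroup column used for the train/test split):

  spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')      # load the data
  spamD$spam <- as.factor(spamD$spam)                              # make sure the label is a factor
  spamTrain <- subset(spamD, spamD$rgroup >= 10)                   # ~90% for training
  spamTest <- subset(spamD, spamD$rgroup < 10)                     # ~10% held out
  spamVars <- setdiff(colnames(spamD), list('rgroup', 'spam'))     # predictor columns
  spamFormula <- as.formula(paste('spam == "spam"',                # classification formula
                                  paste(spamVars, collapse = ' + '), sep = ' ~ '))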

Using bagging to improve prediction
◦ Evaluating the performance of decision trees
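A sketch of fitting a single decision tree and scoring it, reusing spamTrain, spamTest, and
spamFormula from the preparation step. The accuracyMeasures() helper below is an illustrative
version of the book's scoring function (accuracy, F1, normalized deviance):

  library(rpart)
  treemodel <- rpart(spamFormula, spamTrain, method = 'class')

  accuracyMeasures <- function(pred, truth, name = 'model') {
    pred <- pmax(pmin(pred, 1 - 1e-6), 1e-6)      # guard against probabilities of exactly 0 or 1
    dev.norm <- -2 * sum(log(ifelse(truth, pred, 1 - pred))) / length(pred)
    ctable <- table(truth = truth, pred = (pred > 0.5))
    accuracy <- sum(diag(ctable)) / sum(ctable)
    precision <- ctable[2, 2] / sum(ctable[, 2])
    recall <- ctable[2, 2] / sum(ctable[2, ])
    f1 <- 2 * precision * recall / (precision + recall)
    data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm = dev.norm)
  }

  accuracyMeasures(predict(treemodel, newdata = spamTrain)[, 2],
                   spamTrain$spam == 'spam', name = 'tree, training')
  accuracyMeasures(predict(treemodel, newdata = spamTest)[, 2],
                   spamTest$spam == 'spam', name = 'tree, test')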

Using bagging to improve prediction
◦ Evaluating the performance of decision trees
• The accuracy and F1 scores
both degrade on the test set,
and the deviance increases

Using bagging to improve prediction
◦ Bagging decision trees
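A sketch of bagging by hand, reusing the objects above: draw ntree bootstrap samples, fit one
rpart tree per sample, and average the trees' probability predictions.

  ntrain <- dim(spamTrain)[1]
  n <- ntrain
  ntree <- 100

  samples <- sapply(1:ntree,                      # one bootstrap sample per tree
                    FUN = function(iter) { sample(1:ntrain, size = n, replace = TRUE) })
  treelist <- lapply(1:ntree,                     # fit a tree to each bootstrap sample
                     FUN = function(iter) {
                       samp <- samples[, iter]
                       rpart(spamFormula, spamTrain[samp, ], method = 'class')
                     })
  predict.bag <- function(treelist, newdata) {    # average the individual trees' scores
    preds <- sapply(1:length(treelist),
                    FUN = function(iter) { predict(treelist[[iter]], newdata = newdata)[, 2] })
    rowMeans(preds)
  }

  accuracyMeasures(predict.bag(treelist, newdata = spamTrain),
                   spamTrain$spam == 'spam', name = 'bagging, training')
  accuracyMeasures(predict.bag(treelist, newdata = spamTest),
                   spamTest$spam == 'spam', name = 'bagging, test')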

Using bagging to improve prediction
◦ Bagging decision trees
Bagging improves accuracy and F1, and
reduces deviance over both the
training and test sets when compared to
the single decision tree
(less generalization error)

Using random forests to further improve prediction
◦ Drawback of bagging
 The feature set used by each individual tree ends up nearly the same.
What is a random forest?
The random forest algorithm draws bootstrap samples (sampling with replacement) from the
training dataset, repeating the process N times to produce N bootstrap samples. A decision tree
is grown on each sample, and at every node only m randomly selected explanatory variables are
considered for the split. By reducing the correlation between trees, random forests reduce
variance compared to plain bagging.

Using random forests to further improve prediction
◦ The random forest method does the following:
1. Draws a bootstrapped sample from the training data
2. For each sample, grows a decision tree, and at each node of the tree:
   1. Randomly draws a subset of mtry variables from the p total features that are available
   2. Picks the best variable and the best split from that set of mtry variables
3. Continues until the tree is fully grown


Using random forests to further improve prediction
◦ Using random forests
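A sketch using the randomForest package on the same Spambase split (the ntree and nodesize
values here are illustrative settings, not the only reasonable ones):

  library(randomForest)
  set.seed(5123512)                                    # for reproducibility
  fmodel <- randomForest(x = spamTrain[, spamVars],
                         y = spamTrain$spam,
                         ntree = 100,                  # 100 trees, as in the bagging example
                         nodesize = 7,                 # minimum node size
                         importance = TRUE)            # also compute variable importance

  accuracyMeasures(predict(fmodel, newdata = spamTrain[, spamVars], type = 'prob')[, 'spam'],
                   spamTrain$spam == 'spam', name = 'random forest, train')
  accuracyMeasures(predict(fmodel, newdata = spamTest[, spamVars], type = 'prob')[, 'spam'],
                   spamTest$spam == 'spam', name = 'random forest, test')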

Using random forests to further improve prediction
◦ Report the model quality
The random forest model performed
dramatically better than the other two
models in both training and test. But
the random forest’s generalization
error was comparable to that of a
single decision tree (and almost twice
that of the bagged model).

Using random forests to further improve prediction
◦ Examining Variable Importance
 Setting importance = TRUE in the randomForest() call tells it to compute variable importance.

Using random forests to further improve prediction
◦ Examining Variable Importance
Selecting only the most important variables makes it possible to build smaller, faster trees,
and the selected variables can also be used with other modeling algorithms.


Using random forests to further improve prediction
◦ Fitting with fewer variables
The smaller model performs just as well as the random forest model built using all 57 variables.
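A sketch of refitting on only the most important variables (here the top 25 by the first
importance column; treat the exact count as a tuning choice):

  selVars <- names(sort(varImp[, 1], decreasing = TRUE))[1:25]   # keep the top 25 variables
  fsel <- randomForest(x = spamTrain[, selVars],
                       y = spamTrain$spam,
                       ntree = 100, nodesize = 7, importance = TRUE)
  accuracyMeasures(predict(fsel, newdata = spamTest[, selVars], type = 'prob')[, 'spam'],
                   spamTest$spam == 'spam', name = 'RF small, test')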

Bagging and random forest takeaways
◦ Bagging stabilizes decision trees and improves accuracy by
reducing variance.
◦ Bagging reduces generalization error.
◦ Random forests further improve decision tree performance by decorrelating the individual trees in the bagging ensemble.
◦ Random forests’ variable importance measures can help you
determine which variables are contributing the most strongly to
your model.
◦ Because the trees in a random forest ensemble are unpruned and
potentially quite deep, there’s still a danger of overfitting. Be sure
to evaluate the model on holdout data to get a better estimate of
model performance.

Understanding GAMs
◦ For an underweight patient, gaining weight improves health, but only up to a point
(a non-monotone relationship).

A one-dimensional regression example
◦ Preparing an artificial problem
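A sketch of generating the artificial data along the book's lines: a noisy nonlinear function of
a single input x, split into train and test sets.

  set.seed(602957)
  x <- rnorm(1000)
  noise <- rnorm(1000, sd = 1.5)
  y <- 3 * sin(2 * x) + cos(0.75 * x) - 1.5 * (x^2) + noise   # nonlinear ground truth
  select <- runif(1000)
  frame <- data.frame(y = y, x = x)
  train <- frame[select > 0.1, ]                               # ~90% train
  test <- frame[select <= 0.1, ]                               # ~10% test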

A one-dimensional regression example
◦ Linear regression applied to our artificial example
Since the data was generated with sin() and cos(), the relationship is not linear;
the R-squared of the linear fit is only about 0.04.
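A sketch of the linear fit and its (poor) fit statistics:

  lin.model <- lm(y ~ x, data = train)
  summary(lin.model)                                  # R-squared is very low here
  resid.lin <- train$y - predict(lin.model)           # training residuals
  sqrt(mean(resid.lin^2))                             # training RMSE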

A one-dimensional regression example
◦ Linear regression applied to our artificial example
The current model's errors are heteroscedastic: the residuals show systematic structure
rather than looking like uniform noise.

A one-dimensional regression example
◦ GAM applied to our artificial example
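A sketch using mgcv: wrapping x in s() lets gam() learn a spline transform of x instead of
assuming a straight-line effect.

  library(mgcv)
  glin.model <- gam(y ~ s(x), data = train)   # s(x): learn a smooth spline for x
  glin.model$converged                         # check that the fit converged
  summary(glin.model)                          # note the edf (effective degrees of freedom)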

A one-dimensional regression example
◦ GAM applied to our artificial example
The GAM has been fit to be
homoscedastic

A one-dimensional regression example
◦ Comparing linear regression and GAM performance
The GAM performed similarly on both sets (RMSE of 1.40 on test versus 1.45
on training; R-squared of 0.78 on test versus 0.83 on training).
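A sketch of the comparison on the held-out data, reusing lin.model and glin.model:

  actual <- test$y
  pred.lin <- predict(lin.model, newdata = test)
  pred.glin <- predict(glin.model, newdata = test)
  sqrt(mean((actual - pred.lin)^2))     # linear model test RMSE
  sqrt(mean((actual - pred.glin)^2))    # GAM test RMSE
  cor(actual, pred.glin)^2              # squared correlation as an R-squared-like measure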

Extracting the nonlinear relationships
◦ Extracting a learned spline from a GAM
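A sketch of pulling out the learned s(x) term with predict(..., type = "terms") and plotting it
against the data:

  library(ggplot2)
  sx <- predict(glin.model, type = 'terms')        # matrix of per-term contributions
  xframe <- cbind(train, sx = sx[, 1])
  ggplot(xframe, aes(x = x)) +
    geom_point(aes(y = y), alpha = 0.4) +          # the raw data
    geom_line(aes(y = sx))                         # the learned spline for x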

Using GAM on actual data
◦ Applying linear regression (with and without GAM) to health data
Dataset
CDC 2010 natality dataset
(https://github.com/WinVector/zmPDSwR/blob/master/CDC/NatalBirthData.rData)
Predict the newborn's birth weight (DBWT) from the given data (see the sketch below).
Independent variables:
 mother's weight (PWGT)
 mother's pregnancy weight gain (WTGAIN)
 mother's age (MAGER)
 number of prenatal medical visits (UPREVIS)
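A sketch of setting up both models, assuming the file loads a data frame sdata with an
ORIGRANDGROUP column used for the train/test split (as in the book's listings):

  load('NatalBirthData.rData')                        # assumed to load the data frame sdata
  train <- sdata[sdata$ORIGRANDGROUP <= 5, ]
  test <- sdata[sdata$ORIGRANDGROUP > 5, ]

  form.lin <- as.formula('DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS')
  linmodel <- lm(form.lin, data = train)              # plain linear regression

  library(mgcv)
  form.gam <- as.formula('DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)')
  gammodel <- gam(form.gam, data = train)             # GAM: one spline per variable
  summary(gammodel)                                   # check the edf of each s() term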

Using GAM on actual data
◦ Applying linear regression (with and without GAM) to health data

Using GAM on actual data
◦ Applying linear regression (with and without GAM) to health data
Since the edf (effective degrees of freedom) is greater than 1 for each term, all four
variables can be said to have a nonlinear relationship with the outcome.

Using GAM on actual data
◦ Plotting GAM results

Using GAM on actual data
◦ Plotting GAM results
The shape of the learned s() spline is similar to that of a smoothed curve through the data.

Using GAM on actual data
◦ Checking GAM model performance on hold-out data
Performance is close to that on the training set, so the model does not appear to be
badly overfit.
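A sketch of the hold-out check, reusing linmodel, gammodel, and test:

  pred.lin <- predict(linmodel, newdata = test)
  pred.gam <- predict(gammodel, newdata = test)
  cor(pred.lin, test$DBWT)^2      # R-squared of the linear model on hold-out data
  cor(pred.gam, test$DBWT)^2      # R-squared of the GAM on hold-out data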

Using GAM for logistic regression
Predicting whether a newborn's birth weight is below 2000 grams (DBWT < 2000)
◦ GLM logistic regression
◦ GAM logistic regression
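A sketch of both versions, reusing the natality train split; family = binomial gives logistic
regression in each case.

  form.glm <- as.formula('DBWT < 2000 ~ PWGT + WTGAIN + MAGER + UPREVIS')
  logmod <- glm(form.glm, data = train, family = binomial(link = 'logit'))

  form.gamlog <- as.formula('DBWT < 2000 ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS)')
  glogmod <- gam(form.gamlog, data = train, family = binomial(link = 'logit'))
  glogmod$converged
  summary(glogmod)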

Using GAM for logistic regression
◦ GAM logistic regression

GAM takeaways
◦ GAMs let you represent nonlinear and non-monotonic
relationships between variables and outcome in a linear or logistic
regression framework.
◦ In the mgcv package, you can extract the discovered relationship
from the GAM model using the predict() function with the
type="terms" parameter.
◦ You can evaluate the GAM with the same measures you'd use for
standard linear or logistic regression: residuals, deviance, R-squared,
and pseudo R-squared. The gam() summary also gives you an indication
of which variables have a significant effect on the model.
◦ Because GAMs have increased complexity compared to standard
linear or logistic regression models, there’s more risk of overfit.

Synthetic variables?
◦ When the currently available variables are not enough to build a good model, you can
combine the existing data into new variables; these are called synthetic variables.
◦ Kernel methods provide a way to create such new variables and improve the performance
of machine learning models.

Kernel method
◦ A kernel method systematically generates synthetic variables (combinations of the existing
variables) so that simple linear methods can model more complex relationships.

Understanding kernel functions
◦ An artificial kernel example
k(u,v) = phi(u) %*% phi(v)
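A small worked sketch: for the kernel below, k(u, v) computed directly equals the dot product
of the transformed vectors phi(u) and phi(v).

  u <- c(1, 2)
  v <- c(3, 4)
  k <- function(u, v) {                          # kernel evaluated directly on u and v
    u[1] * v[1] + u[2] * v[2] +
      u[1] * u[1] * v[1] * v[1] + u[2] * u[2] * v[2] * v[2] +
      u[1] * u[2] * v[1] * v[2]
  }
  phi <- function(x) {                           # the matching explicit transform
    x <- as.numeric(x)
    c(x, x * x, combn(x, 2, FUN = prod))
  }
  print(k(u, v))                                 # both print the same value (108)
  print(phi(u) %*% phi(v))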

Understanding kernel functions
The goal is to use a kernel transformation to lift the data into a space where it can be
separated linearly.

Using an explicit kernel on a problem
◦ Applying stepwise linear regression to PUMS data
(https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData)
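A sketch of the baseline model, assuming psub.RData loads a data frame psub with an
ORIGRANDGROUP column and the PUMS variable names used in the book (PINCP, AGEP, SEX, COW, SCHL):

  load('psub.RData')                                       # assumed to load the data frame psub
  dtrain <- subset(psub, ORIGRANDGROUP >= 500)             # train/test split
  dtest <- subset(psub, ORIGRANDGROUP < 500)

  m1 <- step(lm(log(PINCP, base = 10) ~ AGEP + SEX + COW + SCHL, data = dtrain),
             direction = 'both')                           # stepwise variable selection
  rmse <- function(y, f) { sqrt(mean((y - f)^2)) }
  rmse(log(dtest$PINCP, base = 10), predict(m1, newdata = dtest))   # baseline test RMSE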

Using an explicit kernel on a problem
◦ Applying an example explicit kernel transform
New modeling variables are created by applying phi() to the original variables.
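A condensed sketch of the idea (phi() and phiNames() here are illustrative helpers, not
necessarily the book's exact listing): phi() maps a numeric vector to itself plus all squares
and pairwise products, and phiNames() builds matching column names such as AGEP_AGEP for
AGEP squared.

  phi <- function(x) {
    x <- as.numeric(x)
    c(x, x * x, combn(x, 2, FUN = prod))        # original values, squares, cross terms
  }
  phiNames <- function(nms) {
    c(nms, paste(nms, nms, sep = '_'),
      combn(nms, 2, FUN = paste, collapse = '_'))
  }
  phi(c(2, 3))                  # 2 3 4 9 6
  phiNames(c('AGEP', 'X'))      # "AGEP" "X" "AGEP_AGEP" "X_X" "AGEP_X"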

Using an explicit kernel on a problem
◦ Modeling using the explicit kernel transform
The RMSE improves slightly compared to the model without the kernel-derived variables.

Using an explicit kernel on a problem
◦ Inspecting the results of the explicit kernel model
A new synthetic variable, AGEP_AGEP, is used to capture the non-monotone relationship between
age and log income.

Kernel takeaways
◦ Kernels provide a systematic way of creating interactions
and other synthetic variables that are combinations of
individual variables
◦ The goal of kernelizing is to lift the data into a space where
the data is separable, or where linear methods can be used
directly

Understanding support vector machines
Linearly inseparable data (left) is lifted into a higher-dimensional kernel space (right),
where the problem becomes finding a hyperplane that linearly separates the classes.

Trying an SVM on artificial example data
◦ Setting up the spirals data as an example classification problem
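A sketch along the lines of the book's listing: use kernlab's spirals dataset and spectral
clustering (specc) to label the two spiral arms as classes.

  library(kernlab)
  data(spirals)                                 # kernlab's built-in spirals dataset
  sc <- specc(spirals, centers = 2)             # spectral clustering recovers the two arms
  s <- data.frame(x = spirals[, 1], y = spirals[, 2],
                  class = as.factor(sc))        # use the cluster labels as class labels

  library(ggplot2)
  ggplot(data = s) +
    geom_text(aes(x = x, y = y, label = class, color = class)) +
    coord_fixed() + theme_bw()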

SUPPORT VECTOR MACHINES WITH THE WRONG KERNEL
◦ SVM with a poor choice of kernel
An example of choosing a kernel that is poorly matched to the problem.
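A sketch: fit ksvm with a linear (vanilladot) kernel, which cannot separate the spirals. The
train/test split here is an illustrative choice.

  set.seed(2335246L)
  s$group <- sample.int(100, size = dim(s)[[1]], replace = TRUE)     # hold out ~10% for test
  sTrain <- subset(s, group > 10)
  sTest <- subset(s, group <= 10)

  mSVMV <- ksvm(class ~ x + y, data = sTrain, kernel = 'vanilladot')  # linear kernel
  sTest$predSVMV <- predict(mSVMV, newdata = sTest, type = 'response')
  # plotting the predictions shows a straight decision boundary cutting across both spirals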

SUPPORT VECTOR MACHINES WITH A GOOD KERNEL
◦ SVM with a good choice of kernel
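The same fit with a Gaussian radial basis kernel (rbfdot), which can separate the spirals:

  mSVMG <- ksvm(class ~ x + y, data = sTrain, kernel = 'rbfdot')      # Gaussian kernel
  sTest$predSVMG <- predict(mSVMG, newdata = sTest, type = 'response')
  # now the predicted labels follow the two spiral arms on the held-out points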

Using SVMs on real data
◦ Revisiting the Spambase example with GLM
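A sketch of the logistic regression baseline on the Spambase split from earlier in the chapter:

  glmM <- glm(spamFormula, family = binomial(link = 'logit'), data = spamTrain)
  spamTest$glmPred <- predict(glmM, newdata = spamTest, type = 'response')
  table(y = spamTest$spam, glmPred = spamTest$glmPred > 0.5)    # confusion matrix at 0.5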

Using SVMs on real data
◦ Applying an SVM to the Spambase example
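A sketch using kernlab's ksvm with a Gaussian kernel. The C and class.weights settings shown
here are illustrative choices for penalizing false positives more heavily, not necessarily the
book's exact values.

  library(kernlab)
  spamFormulaV <- as.formula(paste('spam',
                                   paste(spamVars, collapse = ' + '), sep = ' ~ '))
  svmM <- ksvm(spamFormulaV, data = spamTrain,
               kernel = 'rbfdot',                     # Gaussian radial basis kernel
               C = 10,                                # soft-margin penalty
               prob.model = TRUE, cross = 5,
               class.weights = c('spam' = 1, 'non-spam' = 10))   # discourage false positives
  spamTest$svmPred <- predict(svmM, newdata = spamTest, type = 'response')
  print(with(spamTest, table(y = spam, svmPred = svmPred)))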

Using SVMs on real data
◦ Printing the SVM results summary

COMPARING RESULTS
◦ Shifting decision point to perform an apples-to-apples comparison
Because the SVM flagged 162 messages as spam (false positives included), the GLM's decision
threshold is also adjusted so that it flags the same number, giving an apples-to-apples comparison.
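A sketch of matching the counts, reusing the GLM scores: choose the GLM threshold so that it
also flags 162 messages as spam (the count reported for the SVM on this test set).

  numFlagged <- 162                                             # messages the SVM called spam
  sameCut <- sort(spamTest$glmPred)[length(spamTest$glmPred) - numFlagged]
  print(with(spamTest, table(y = spam, glmPred = glmPred > sameCut)))   # apples-to-apples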

Support vector machine takeaways
◦ SVMs are a kernel-based classification approach where the kernels
are represented in terms of a (possibly very large) subset of the
training examples.
◦ SVMs try to lift the problem into a space where the data is linearly
separable (or as near to separable as possible).
◦ SVMs are useful in cases where the useful interactions or other
combinations of input variables aren’t known in advance. They’re
also useful when similarity is strong evidence of belonging to the
same class.




Bagging and random forests—To reduce the sensitivity of
models to early modeling choices and reduce modeling
variance
Generalized additive models—To remove the (false)
assumption that each model feature contributes to the model
in a monotone fashion
Kernel methods—To introduce new features that are
nonlinear combinations of existing features, increasing the
power of our model
Support vector machines—To use training examples as
landmarks (support vectors), again increasing the power of
our model
END