Advanced Computer Algorithms, School of Electronic, Electrical and Computer Engineering, 홍기주

If we build a model that decides whether a patient is healthy from only height and weight, what problems can arise? (limited features)
◦ Training variance: even a small change in the training set can change the predictions considerably.
◦ Non-monotone effects: the ideal healthy weight lies in a bounded range; it is not arbitrarily heavy or arbitrarily light.
◦ Linearly inseparable data: the classes cannot be separated by a linear boundary.

Methods covered in this section:
◦ Reducing training variance with bagging and random forests
◦ Learning non-monotone relationships with generalized additive models
◦ Increasing data separation with kernel methods
◦ Modeling complex decision boundaries with support vector machines

Decision trees are an attractive method for a number of reasons:
◦ They take any type of data, numerical or categorical, without any distributional assumptions and without preprocessing.
◦ Most implementations (in particular, R's) handle missing data; the method is also robust to redundant and nonlinear data.
◦ The algorithm is easy to use, and the output (the tree) is relatively easy to understand.
◦ Once the model is fit, scoring is fast.

On the other hand, decision trees do have some drawbacks:
◦ They have a tendency to overfit, especially without pruning.
◦ They have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy.
◦ Prediction accuracy can be low compared to other methods.
Bagging and random forests are used to mitigate these drawbacks.

Using bagging to improve prediction
◦ Data set: spamD.tsv (https://github.com/WinVector/zmPDSwR/raw/master/Spambase/spamD.tsv)
◦ Preparing the Spambase data
◦ Evaluating the performance of a single decision tree
 • The accuracy and F1 scores both degrade on the test set, and the deviance increases (a code sketch of the fit and evaluation follows at the end of this subsection).
◦ Bagging decision trees
 • Bagging improves accuracy and F1, and reduces deviance over both the training and test sets when compared to the single decision tree (less generalization error).
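A minimal sketch of the single-tree baseline, in the spirit of the Practical Data Science with R listings these slides follow. The rgroup column used for the train/test split and the helper names (loglikelihood(), accuracyMeasures()) are assumptions about the spamD.tsv layout and are illustrative, not the authoritative course code.

```r
library(rpart)

spamD <- read.table('spamD.tsv', header = TRUE, sep = '\t')
spamTrain <- subset(spamD, rgroup >= 10)      # roughly 90% for training
spamTest  <- subset(spamD, rgroup < 10)       # roughly 10% held out
spamVars <- setdiff(colnames(spamD), c('rgroup', 'spam'))
spamFormula <- as.formula(paste('spam == "spam"',
                                paste(spamVars, collapse = ' + '), sep = ' ~ '))

# log likelihood of the labels under the predicted scores
loglikelihood <- function(y, py) {
  pysmooth <- ifelse(py == 0, 1e-12, ifelse(py == 1, 1 - 1e-12, py))
  sum(y * log(pysmooth) + (1 - y) * log(1 - pysmooth))
}

# normalized deviance, accuracy, and F1 for a vector of scores
accuracyMeasures <- function(pred, truth, name = "model") {
  dev.norm <- -2 * loglikelihood(as.numeric(truth), pred) / length(pred)
  ctable <- table(truth = truth, pred = (pred > 0.5))
  accuracy <- sum(diag(ctable)) / sum(ctable)
  precision <- ctable[2, 2] / sum(ctable[, 2])
  recall <- ctable[2, 2] / sum(ctable[2, ])
  f1 <- 2 * precision * recall / (precision + recall)
  data.frame(model = name, accuracy = accuracy, f1 = f1, dev.norm = dev.norm)
}

# single decision tree: training performance looks better than test performance
treemodel <- rpart(spamFormula, spamTrain)
accuracyMeasures(predict(treemodel, newdata = spamTrain),
                 spamTrain$spam == "spam", name = "tree, training")
accuracyMeasures(predict(treemodel, newdata = spamTest),
                 spamTest$spam == "spam", name = "tree, test")
```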
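A sketch of the bagging step, reusing spamTrain, spamFormula, and accuracyMeasures() from the sketch above; the number of trees (100) is an illustrative choice.

```r
# bagging: fit many trees on bootstrap resamples and average their scores
ntrain <- nrow(spamTrain)
ntree <- 100

# one bootstrap sample (drawn with replacement) per tree
samples <- sapply(1:ntree, FUN = function(iter)
  sample(1:ntrain, size = ntrain, replace = TRUE))

treelist <- lapply(1:ntree, FUN = function(iter)
  rpart(spamFormula, spamTrain[samples[, iter], ]))

# score new data by averaging the individual trees' predictions
predict.bag <- function(treelist, newdata) {
  preds <- sapply(1:length(treelist),
                  FUN = function(iter) predict(treelist[[iter]], newdata = newdata))
  rowMeans(preds)
}

accuracyMeasures(predict.bag(treelist, newdata = spamTrain),
                 spamTrain$spam == "spam", name = "bagging, training")
accuracyMeasures(predict.bag(treelist, newdata = spamTest),
                 spamTest$spam == "spam", name = "bagging, test")
```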
Using random forests to further improve prediction
◦ A drawback of bagging: the individual trees end up using nearly the same feature set, so they remain correlated.
◦ What is a random forest? The random forest algorithm draws bootstrap samples, with replacement, from the training data set; repeating this N times gives N bootstrap data sets. When the decision tree algorithm is applied to each of them, m explanatory variables are chosen at random at each node. By reducing the correlation between trees, the random forest method reduces variance more than bagging does.
◦ The random forest method does the following:
 1. Draws a bootstrapped sample from the training data.
 2. For each sample, grows a decision tree, and at each node of the tree:
  a. Randomly draws a subset of mtry variables from the p total features that are available.
  b. Picks the best variable and the best split from that set of mtry variables.
 3. Continues until the tree is fully grown.
◦ Using random forests (fitting the model)
◦ Report the model quality
 • The random forest model performed dramatically better than the other two models in both training and test. But the random forest's generalization error was comparable to that of a single decision tree (and almost twice that of the bagged model).
◦ Examining variable importance
 • Setting importance = T in the randomForest() call makes the model compute variable importance.
 • Selecting only the important variables makes it possible to build smaller, faster trees, and the selected variables can also be used with other modeling algorithms.
◦ Fitting with fewer variables
 • The smaller model performs just as well as the random forest model built using all 57 variables (a code sketch follows below).
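A sketch of the random forest fit, the variable importance calculation, and the reduced-variable refit, again reusing spamTrain, spamTest, spamVars, and accuracyMeasures() from the bagging sketch. The ntree, nodesize, seed, and the choice of 25 variables are illustrative values, not prescribed by the slides.

```r
library(randomForest)
set.seed(5123512)   # illustrative seed for reproducibility

# random forest with variable importance enabled (importance = TRUE)
fmodel <- randomForest(x = spamTrain[, spamVars], y = as.factor(spamTrain$spam),
                       ntree = 100, nodesize = 7, importance = TRUE)

accuracyMeasures(predict(fmodel, newdata = spamTrain[, spamVars], type = 'prob')[, 'spam'],
                 spamTrain$spam == "spam", name = "random forest, train")
accuracyMeasures(predict(fmodel, newdata = spamTest[, spamVars], type = 'prob')[, 'spam'],
                 spamTest$spam == "spam", name = "random forest, test")

# examine variable importance
varImp <- importance(fmodel)      # matrix of per-variable importance measures
varImpPlot(fmodel, type = 1)      # plot mean decrease in accuracy

# refit using only the most important variables
selVars <- names(sort(varImp[, 'MeanDecreaseAccuracy'], decreasing = TRUE))[1:25]
fsel <- randomForest(x = spamTrain[, selVars], y = as.factor(spamTrain$spam),
                     ntree = 100, nodesize = 7, importance = TRUE)
```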
Bagging and random forest takeaways
◦ Bagging stabilizes decision trees and improves accuracy by reducing variance.
◦ Bagging reduces generalization error.
◦ Random forests further improve decision tree performance by decorrelating the individual trees in the bagging ensemble.
◦ Random forests' variable importance measures can help you determine which variables are contributing the most strongly to your model.
◦ Because the trees in a random forest ensemble are unpruned and potentially quite deep, there's still a danger of overfitting. Be sure to evaluate the model on holdout data to get a better estimate of model performance.

Understanding GAMs
◦ For an underweight patient, gaining weight can make them healthier, but only up to a point: the effect is non-monotone.

A one-dimensional regression example
◦ Preparing an artificial problem
◦ Linear regression applied to our artificial example
 • Because the data were generated with sin() and cos(), the relationship is not linear; the R-squared of 0.04 is very low.
 • The errors of this model are heteroscedastic.
◦ GAM applied to our artificial example
 • The GAM has been fit to be homoscedastic (see the code sketch at the end of this GAM discussion).
◦ Comparing linear regression and GAM performance
 • The GAM performed similarly on both sets (RMSE of 1.40 on test versus 1.45 on training; R-squared of 0.78 on test versus 0.83 on training).

Extracting the nonlinear relationships
◦ Extracting a learned spline from a GAM

Using GAM on actual data
◦ Applying linear regression (with and without GAM) to health data
 • Dataset: CDC 2010 natality dataset (https://github.com/WinVector/zmPDSwR/blob/master/CDC/NatalBirthData.rData)
 • Goal: predict a newborn's birth weight from the given data.
 • Independent variables: mother's weight (PWGT), mother's pregnancy weight gain (WTGAIN), mother's age (MAGER), number of prenatal medical visits (UPREVIS).
 • Because the edf (effective degrees of freedom) of each smooth term is greater than 1, all four variables can be said to have a nonlinear relationship with the outcome.
 • A code sketch for these fits appears at the end of this GAM discussion.
◦ Plotting GAM results
 • The shapes of the learned s() splines and of smooth curves fit directly to the data are similar.
◦ Checking GAM model performance on hold-out data
 • The hold-out performance differs little from the training performance, so the model is not badly overfit.

Using GAM for logistic regression
◦ Predicting whether a newborn's birth weight is under 2000 g (DBWT < 2000)
◦ GLM logistic regression
◦ GAM logistic regression
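A minimal sketch of the one-dimensional example above, assuming the mgcv package; the exact sin()/cos() generator, sample size, and seed are illustrative stand-ins for the slide's artificial data.

```r
library(mgcv)

# artificial data: a nonlinear (and non-monotone) function of x plus noise
set.seed(602957)                       # illustrative seed
x <- rnorm(1000)
y <- 3 * sin(2 * x) + cos(0.75 * x) - 1.5 * x^2 + rnorm(1000, sd = 1.5)
select <- runif(1000)
train <- data.frame(x = x, y = y)[select > 0.1, ]
test  <- data.frame(x = x, y = y)[select <= 0.1, ]

# linear regression: R-squared is very low and the residuals are heteroscedastic
lin.model <- lm(y ~ x, data = train)
summary(lin.model)

# GAM: s(x) lets gam() learn a spline transform of x
glin.model <- gam(y ~ s(x), data = train)
summary(glin.model)

# compare RMSE on training and hold-out data
rmse <- function(model, data) sqrt(mean((data$y - predict(model, newdata = data))^2))
c(train = rmse(glin.model, train), test = rmse(glin.model, test))
```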
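A sketch of the natality-data fits (linear, GAM, and GAM logistic), assuming the .rData file loads a data frame named sdata with an ORIGRANDGROUP column for the train/test split, as in the zmPDSwR examples; adjust the names if the file's contents differ.

```r
library(mgcv)

load('NatalBirthData.rData')                      # assumed to provide a data frame `sdata`
train <- sdata[sdata$ORIGRANDGROUP <= 5, ]
test  <- sdata[sdata$ORIGRANDGROUP > 5, ]

# plain linear regression baseline
linmodel <- lm(DBWT ~ PWGT + WTGAIN + MAGER + UPREVIS, data = train)

# GAM: wrap each predictor in s() so a spline is learned per variable;
# edf > 1 in summary() indicates a nonlinear relationship for that term
glinmodel <- gam(DBWT ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS), data = train)
summary(glinmodel)
plot(glinmodel)                                   # plots the learned splines

# check hold-out performance
pred <- predict(glinmodel, newdata = test)
cor(pred, test$DBWT)^2                            # R-squared on the hold-out set

# GAM logistic regression: probability that birth weight is under 2000 g
glogmodel <- gam(I(DBWT < 2000) ~ s(PWGT) + s(WTGAIN) + s(MAGER) + s(UPREVIS),
                 data = train, family = binomial(link = "logit"))
summary(glogmodel)
```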
GAM takeaways
◦ GAMs let you represent nonlinear and non-monotonic relationships between variables and outcome in a linear or logistic regression framework.
◦ In the mgcv package, you can extract the discovered relationship from the GAM model using the predict() function with the type="terms" parameter.
◦ You can evaluate the GAM with the same measures you'd use for standard linear or logistic regression: residuals, deviance, R-squared, and pseudo R-squared. The gam() summary also gives you an indication of which variables have a significant effect on the model.
◦ Because GAMs have increased complexity compared to standard linear or logistic regression models, there's more risk of overfit.

Synthetic variables and the kernel method
◦ When the currently available variables are not enough to build a good model, new variables can be created by combining the data you already have; these are called synthetic variables.
◦ Kernel methods are one way to create such new variables and thereby improve the performance of machine learning models.

Understanding kernel functions
◦ An artificial kernel example: k(u,v) = phi(u) %*% phi(v) (a code sketch appears at the end of this section).
◦ The goal is to use a kernel transformation so that the data can be separated linearly.

Using an explicit kernel on a problem
◦ Applying stepwise linear regression to PUMS data (https://github.com/WinVector/zmPDSwR/raw/master/PUMS/psub.RData)
◦ Applying an example explicit kernel transform
 • phi() is used to generate new modeling variables.
◦ Modeling using the explicit kernel transform
 • The RMSE improves slightly.
◦ Inspecting the results of the explicit kernel model
 • The new variable AGEP_AGEP is used to capture the non-monotone relationship between age and log income.

Kernel takeaways
◦ Kernels provide a systematic way of creating interactions and other synthetic variables that are combinations of individual variables.
◦ The goal of kernelizing is to lift the data into a space where the data is separable, or where linear methods can be used directly.

Understanding support vector machines
◦ The problem is to lift data that cannot be separated linearly (left) into a higher-dimensional kernel space (right) and find a hyperplane that separates the data there.

Trying an SVM on artificial example data
◦ Setting up the spirals data as an example classification problem (a code sketch appears at the end of this section).
◦ Support vector machines with the wrong kernel: with a poor choice of kernel, the SVM fails to separate the spirals.
◦ Support vector machines with a good kernel: with a suitable kernel, the SVM separates the two spirals well.

Using SVMs on real data
◦ Revisiting the Spambase example with GLM
◦ Applying an SVM to the Spambase example
◦ Printing the SVM results summary

Comparing results
◦ Shifting the decision point to perform an apples-to-apples comparison
 • Because the SVM predicted 162 messages as spam (false positives included), the GLM threshold is also adjusted so that it flags the same number of messages.

Support vector machine takeaways
◦ SVMs are a kernel-based classification approach where the kernels are represented in terms of a (possibly very large) subset of the training examples.
◦ SVMs try to lift the problem into a space where the data is linearly separable (or as near to separable as possible).
◦ SVMs are useful in cases where the useful interactions or other combinations of input variables aren't known in advance. They're also useful when similarity is strong evidence of belonging to the same class.

Summary
◦ Bagging and random forests: to reduce the sensitivity of models to early modeling choices and reduce modeling variance.
◦ Generalized additive models: to remove the (false) assumption that each model feature contributes to the model in a monotone fashion.
◦ Kernel methods: to introduce new features that are nonlinear combinations of existing features, increasing the power of our model.
◦ Support vector machines: to use training examples as landmarks (support vectors), again increasing the power of our model.
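The artificial kernel example referenced above, as a runnable sketch: an explicit feature map phi() and a matching kernel k() that returns the same value as the inner product of the mapped vectors. The particular quadratic feature map is illustrative.

```r
# explicit feature map: original coordinates, their squares, and their cross product
phi <- function(x) {
  x <- as.numeric(x)
  c(x, x * x, combn(x, 2, FUN = prod))
}

# the matching kernel function for two 2-dimensional vectors
k <- function(u, v) {
  u[1] * v[1] + u[2] * v[2] +
    u[1] * u[1] * v[1] * v[1] + u[2] * u[2] * v[2] * v[2] +
    u[1] * u[2] * v[1] * v[2]
}

u <- c(1, 2)
v <- c(3, 4)
k(u, v)               # 108
phi(u) %*% phi(v)     # also 108: k(u, v) = phi(u) %*% phi(v)
```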
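A sketch of the spirals experiment, assuming the kernlab package (its built-in spirals data set, specc() to label the two spirals, and ksvm()); the seed, split fractions, and the commented Spambase parameters are illustrative assumptions.

```r
library(kernlab)

data(spirals)                                   # two intertwined spirals, no labels
sc <- specc(spirals, centers = 2)               # spectral clustering recovers the two spirals
s <- data.frame(x = spirals[, 1], y = spirals[, 2], class = as.factor(sc))

set.seed(2335246L)                              # illustrative seed
s$group <- sample.int(100, size = nrow(s), replace = TRUE)
sTrain <- subset(s, group > 10)
sTest  <- subset(s, group <= 10)

# wrong kernel: a linear (vanilladot) kernel cannot separate the spirals
mSVMV <- ksvm(class ~ x + y, data = sTrain, kernel = 'vanilladot')
table(truth = sTest$class, pred = predict(mSVMV, newdata = sTest, type = 'response'))

# good kernel: a Gaussian radial (rbfdot) kernel separates them well
mSVMG <- ksvm(class ~ x + y, data = sTrain, kernel = 'rbfdot')
table(truth = sTest$class, pred = predict(mSVMG, newdata = sTest, type = 'response'))

# the same idea applies to the Spambase data, e.g. (hypothetical parameter choices):
# ksvm(as.factor(spam) ~ ., data = spamTrain, kernel = 'rbfdot', C = 10, prob.model = TRUE)
```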