Potpourri of Methods.

Class 4 – More Classifiers
Ramoza Ahsan, Yun Lu, Dongyun Zhang, Zhongfang
Zhuang, Xiao Qin, Salah Uddin Ahmed
Lesson 4.1
Classification Boundaries
• Visualization of the data in the
training stage of building a
classifier can provide guidance
in parameter selection
• Weka visualization tool
• 2-dimensional data set
Boundary Representation With OneR
• The color diagram shows the decision boundaries together with the training data
• Spatial representation of the decision boundary of the OneR algorithm
Boundary Representation With IBk
• Lazy classifier (instance-based learner)
• Classifies an instance according to its nearest training instance(s)
• Piecewise linear boundary
• Increasing k blurs the boundary
Boundary Representation With Naïve Bayes
• Naïve Bayes treats the two attributes as contributing equally and independently to the decision
• Multiplying the probability distributions along the two dimensions gives a checkerboard pattern of probabilities
Boundary Representation With J48
• Increasing the minNumObj parameter results in a simpler tree
Classification Boundaries
• Different classifiers have different capabilities for carving up instance space (their “bias”)
• Usefulness:
• Important visualization tool
• Provides insight into how the algorithm behaves on the data
• Limitations:
• Restricted to numeric attributes and 2-dimensional plots
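These parameter effects can be reproduced outside the GUI. Below is a minimal sketch with the Weka Java API that builds the four classifiers discussed above with the parameters mentioned (k for IBk, minNumObj for J48) and cross-validates each; the file name boundary2d.arff is a hypothetical placeholder for any two-attribute numeric dataset with a nominal class.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.OneR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoundaryClassifiers {
    public static void main(String[] args) throws Exception {
        // Hypothetical 2-attribute numeric dataset; last attribute is the class
        Instances data = new DataSource("boundary2d.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        IBk knn = new IBk();
        knn.setKNN(5);              // larger k -> smoother ("blurrier") boundary

        J48 tree = new J48();
        tree.setMinNumObj(10);      // larger minNumObj -> simpler tree, coarser boundary

        Classifier[] models = { new OneR(), knn, new NaiveBayes(), tree };
        for (Classifier c : models) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));   // 10-fold cross-validation
            System.out.printf("%s: %.1f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

The boundary plots themselves come from Weka's boundary visualizer GUI; the code above only builds the same models.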
Lesson 4.2
Linear Regression
What Is Linear Regression?
• In statistics, linear regression is an approach to model the
relationship between a dependent variable y and one or
more explanatory variables denoted X.
– Straight-line regression analysis: one explanatory variable
– Multiple linear regression: more than one explanatory variable
• In data mining, we use this method to make predictions for numeric classes based on numeric attributes
– The NominalToBinary filter can convert nominal attributes into numeric ones first
Why Linear Regression?
• A regression models the past relationship between variables
to predict the future behavior.
• Businesses use regression to predict such things as future
sales, stock prices, currency exchange rates, and productivity
gains resulting from a training program.
• Example: a person’s salary is related to years of experience. The dependent variable here is salary, and the explanatory variable (also called the independent variable) is experience.
Mathematics Of Simple Linear Regression
• The simplest form of regression function is:
𝑦 = 𝑏 + 𝑤𝑥
Where y is the dependent variable, x is the explanatory variable, and b and w are regression coefficients. Thinking of the regression coefficients as weights, we can write:
𝑦 = 𝑤0 + 𝑤1 𝑥
Where:
w₁ = Σ (xᵢ − x̄)(yᵢ − ȳ) / Σ (xᵢ − x̄)²   (sums over the D training instances, i = 1 … D)
w₀ = ȳ − w₁·x̄
Here x̄ and ȳ are the means of x and y over the training data.
Previous Example
Salary Dataset
X: Years of Experience    Y: Salary in $1000s
 3                         30
 8                         57
 9                         64
13                         72
 3                         36
 6                         43
11                         59
21                         90
 1                         20
16                         83
• From the given dataset we get: x̄ = 9.1; ȳ = 55.4
• w₁ = [(3−9.1)(30−55.4) + (8−9.1)(57−55.4) + … + (16−9.1)(83−55.4)] / [(3−9.1)² + (8−9.1)² + … + (16−9.1)²] ≈ 3.5
• w₀ = 55.4 − 3.5 × 9.1 = 23.6
• Thus the model is y = 23.6 + 3.5x
• Using this model, we predict that a person with 10 years of experience will earn a salary of about $58,600 per year.
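As a sanity check on the arithmetic above, here is a small plain-Java sketch (no Weka needed) that computes w₁ and w₀ from the salary data and reproduces the 10-year prediction:

```java
public class SimpleLinearRegression {
    public static void main(String[] args) {
        double[] x = { 3, 8, 9, 13, 3, 6, 11, 21, 1, 16 };       // years of experience
        double[] y = { 30, 57, 64, 72, 36, 43, 59, 90, 20, 83 }; // salary in $1000s

        // Means of x and y
        double xBar = 0, yBar = 0;
        for (int i = 0; i < x.length; i++) { xBar += x[i]; yBar += y[i]; }
        xBar /= x.length;   // 9.1
        yBar /= y.length;   // 55.4

        // w1 = sum((xi - xBar)(yi - yBar)) / sum((xi - xBar)^2),  w0 = yBar - w1 * xBar
        double num = 0, den = 0;
        for (int i = 0; i < x.length; i++) {
            num += (x[i] - xBar) * (y[i] - yBar);
            den += (x[i] - xBar) * (x[i] - xBar);
        }
        double w1 = num / den;          // about 3.5
        double w0 = yBar - w1 * xBar;   // about 23.6

        System.out.printf("y = %.1f + %.1f * x%n", w0, w1);
        // about 58.6, i.e. roughly $58,600, matching the slide's rounded figures
        System.out.printf("Predicted salary for 10 years' experience: %.1f thousand dollars%n",
                w0 + w1 * 10);
    }
}
```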
Run This Dataset On Weka
Multiple Linear Regression
• Multiple linear regression extends straight-line regression to more than one predictor variable. It models the dependent variable y as a linear function of k predictor variables x₁, x₂, …, xₖ:
y = w₀ + w₁x₁ + w₂x₂ + … + wₖxₖ = Σⱼ wⱼxⱼ   (with x₀ = 1, sum over j = 0 … k)
The weights are then adjusted to minimize the squared error on the n training instances:
Σᵢ ( y⁽ⁱ⁾ − Σⱼ wⱼ xⱼ⁽ⁱ⁾ )²   (outer sum over i = 1 … n, where y⁽ⁱ⁾ is the actual class value of instance i)
– This equation is hard to solve by hand, so we need a tool like Weka to do it.
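A minimal sketch of letting Weka fit the regression from code; the file name salary.arff is a hypothetical placeholder for any dataset with a numeric class attribute in the last position.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MultipleRegressionDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file with a numeric class attribute in the last position
        Instances data = new DataSource("salary.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        LinearRegression lr = new LinearRegression();
        lr.buildClassifier(data);
        System.out.println(lr);   // prints the fitted weights w0, w1, ..., wk

        // 10-fold cross-validation reports correlation and error measures
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new LinearRegression(), data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}
```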
Non-linear Regression
• Often there is no linear relationship between the dependent variable (the class attribute) and the explanatory variables.
• We can often approximate such a relationship with a patchwork of local linear regression models.
– In Weka, the “model tree” method M5P handles this. A model tree is a decision tree in which each leaf holds a linear regression model; we compute the coefficients of each leaf’s linear function and then make predictions with the tree.
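A minimal sketch of the model-tree approach with Weka's M5P learner; nonlinear.arff is a hypothetical placeholder for a numeric-class dataset.

```java
import weka.classifiers.trees.M5P;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelTreeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("nonlinear.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        M5P modelTree = new M5P();          // Weka's M5P model tree learner
        modelTree.buildClassifier(data);
        System.out.println(modelTree);      // tree structure plus one linear model per leaf

        // Predict the (numeric) class value of the first instance
        double prediction = modelTree.classifyInstance(data.instance(0));
        System.out.println("Prediction for first instance: " + prediction);
    }
}
```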
Lesson 4.3
Classification By Regression
Review: Linear Regression
• Several numeric attributes: a₁, a₂, …, aₖ
• A weight for each attribute plus a constant: w₀, w₁, w₂, …, wₖ
• The prediction is the weighted sum of the attributes: w₀ + w₁a₁ + … + wₖaₖ = Σⱼ wⱼaⱼ (with a₀ = 1)
• Choose the weights to minimize the squared error on the training data: Σᵢ ( x⁽ⁱ⁾ − Σⱼ wⱼaⱼ⁽ⁱ⁾ )², where x⁽ⁱ⁾ is the actual class value of instance i
Using Regression In Classification
• Convert the class values to numeric values (usually binary)
• Decide the class according to the regression result
• The regression output is NOT a probability!
• Set a threshold on the output
2-Class Problems
• Assign the binary values to the two classes
• Training: Linear Regression
• Output prediction
Multi-Class Problems
• Multi-Response Linear Regression
• Divide into n regression problems (one per class)
• Build a different model for each problem
• Select the class whose model gives the largest output
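Weka's ClassificationViaRegression meta-learner implements essentially this multi-response scheme: it binarizes the class, builds one regression model per class value, and predicts the class whose model produces the largest output. A minimal sketch, using the iris.arff file that ships with Weka and LinearRegression as the base learner:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.LinearRegression;
import weka.classifiers.meta.ClassificationViaRegression;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RegressionForClassification {
    public static void main(String[] args) throws Exception {
        // iris.arff ships in Weka's data directory; 3-valued nominal class attribute
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // One linear regression model per class value; predict the class
        // whose regression output is largest (multi-response scheme)
        ClassificationViaRegression cvr = new ClassificationViaRegression();
        cvr.setClassifier(new LinearRegression());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(cvr, data, 10, new Random(1));
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```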
More Investigations
• Cool stuff
• Leads to the foundation of logistic regression
• Convert the class value to binary
• Add the linear regression output as a new attribute
• Detect the split point using OneR
Lesson 4.4
Logistic Regression
Logistic Regression
• In linear regression, the prediction is the weighted sum
𝛽₀ + 𝛽₁x₁ + 𝛽₂x₂ + … + 𝛽ₙxₙ
whose weights 𝛽 are learned from the training data.
• In logistic regression, we estimate the class probability directly with
Pr[y = 1 | x] = 1 / (1 + e^−(𝛽₀ + 𝛽₁x₁ + 𝛽₂x₂ + … + 𝛽ₙxₙ))
Classification
• Email: Spam / Not Spam?
• Online Transactions: Fraudulent (Yes / No)?
• Tumor: Malignant / Benign?
𝑦 ∈ {0, 1}
0: “negative class” (e.g., Benign tumor)
1: “positive class” (e.g., Malignant tumor)
Coursera – Machine Learning – Prof. Andrew Ng from Stanford University
[Plots: hypothesis h(x) against tumor size x₁, showing the Malignant? class (Yes = 1 / No = 0) and the 0.5 decision threshold]
Threshold the classifier output h(x) at 0.5:
If h(x) ≥ 0.5, predict y = 1
If h(x) < 0.5, predict y = 0
Classification: y = 0 or 1
In Linear Regression, ℎ(𝑥) can be >1 or <0
Logistic regression: 0 ≤ ℎ(𝑥) ≤ 1
Logistic Regression Model
• We want 0 ≤ h(x) ≤ 1
• Linear model: h_w(x) = w₀ + w₁x
• Sigmoid (logistic) function: g(z) = 1 / (1 + e^−z)
• Combining the two: h_w(x) = g(w₀ + w₁x) = 1 / (1 + e^−(w₀ + w₁x))
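A tiny sketch of the model itself, with made-up weights just to show how the sigmoid squashes the linear sum into a value between 0 and 1 that can be read as a class probability:

```java
public class LogisticModel {
    // Sigmoid (logistic) function: g(z) = 1 / (1 + e^(-z))
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    public static void main(String[] args) {
        // Hypothetical weights w0, w1 for a single attribute x (e.g. tumor size)
        double w0 = -4.0, w1 = 0.1;

        for (double x : new double[] { 10, 40, 70 }) {
            double h = sigmoid(w0 + w1 * x);   // estimated Pr(y = 1 | x)
            System.out.printf("x = %.0f  ->  Pr(y=1) = %.2f, Pr(y=0) = %.2f%n",
                    x, h, 1 - h);
        }
    }
}
```

In Weka, the corresponding classifier is functions > Logistic, which fits the weights from the training data.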
Interpretation Of Hypothesis Output
h_w(x) = estimated probability that y = 1 for input x
Example: if x = [x₀; x₁] = [1; tumorSize] and h_w(x) = 0.7, tell the patient there is a 70% chance of the tumor being malignant (y = 1).
h_w(x) = Pr[y = 1 | x; w] = 0.7
“The probability that y = 1, given x, parameterized by w.”
Pr[y = 1 | x; w] + Pr[y = 0 | x; w] = 1
Lesson 4.5
Support Vector Machine
Things About SVM
• Works well on small data sets and on data that is not linearly separable
• Maps the data from low dimensions to high dimensions
• Support vectors
• Maximum Marginal Hyperplane
Overview
• The support vectors are the most difficult tuples to classify and give the most information regarding classification.
• SVM searches for the hyperplane with the largest margin, that is, the Maximum Marginal Hyperplane (MMH).
SVM Demo
• CMsoft SVM Demo Tool
• Question(s)
More
Very resilient to overfitting
• Boundary depends on a few points
• Parameter setting (regularization)
Weka: functions>SMO
Restricted to two classes
• So use Multi-Response linear regression … or pairwise linear regression
Weka: functions>libsvm
• External library for support vector machines
• Faster than SMO, more sophisticated options
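A minimal sketch of running SMO from code, again on the iris.arff file that ships with Weka; SMO handles more than two classes internally via pairwise classification.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SvmDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("iris.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        SMO svm = new SMO();   // sequential minimal optimization, linear kernel by default
        svm.setC(1.0);         // complexity (regularization) parameter

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(svm, data, 10, new Random(1));
        System.out.printf("SMO accuracy: %.1f%%%n", eval.pctCorrect());
        System.out.println(eval.toMatrixString());   // confusion matrix
    }
}
```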
Lesson 4.6
Ensemble Learning
Ensemble Learning
• Take the training data
• Derive several different training sets from it
• Learn a model from each training set
• Combine them to produce an ensemble of learned
models
• In brief, instead of depending on a single classifier, we take a vote among different classifiers to reach a verdict.
Ensemble Learning
We will discuss 4 types of ensemble methods:
• Bagging
• Randomization – Random Forest
• Boosting
• Stacking
Ensemble Learning -- Bagging
• Several training datasets of the same size are chosen at random
from the problem domain or produced by sampling with replacement
• A particular machine learning technique is used to build a model for
each dataset
• For each new test instance we get the prediction from each model
• The class that has the largest support or vote from the models
becomes the resultant class
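A minimal Weka sketch of bagging, with J48 as the base learner (one common choice); train.arff is a hypothetical placeholder for any classification dataset.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.Bagging;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaggingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet();   // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        Bagging bagger = new Bagging();
        bagger.setClassifier(new J48());     // base learner applied to each bootstrap sample
        bagger.setNumIterations(10);         // number of sampled-with-replacement training sets
        bagger.setBagSizePercent(100);       // each bag is the same size as the original data

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(bagger, data, 10, new Random(1));
        System.out.printf("Bagged J48 accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```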
Ensemble Learning -- Bagging In Weka
Ensemble Learning -- Randomization
• One training dataset
• Randomize the algorithm’s internal choices to build several models from the same dataset
• For each new test instance we get a prediction from each model
• Very similar to bagging: the class that has the largest support or vote from the models becomes the resultant class
• Ex: Random Forest – builds J48-style decision trees; instead of choosing the best attribute at each split, it picks randomly from the k best options
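Random Forest is available directly in Weka (trees > RandomForest); a minimal sketch, with the number of trees and the size k of the random attribute subset left at their defaults, again on the hypothetical train.arff.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet();   // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        // Each tree picks its split attribute from a random subset of the attributes
        RandomForest forest = new RandomForest();

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(forest, data, 10, new Random(1));
        System.out.printf("Random forest accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```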
Ensemble Learning -- Random Forest In Weka
Ensemble Learning -- Boosting
• One training dataset
• A particular classifier is used iteratively several times to produce several models
• The output of one iteration becomes the input for the next iteration
• Extra weight is assigned to misclassified instances to encourage the next iteration to classify them correctly
• For a test instance, the class that has the largest (weighted) vote from the models becomes the resultant class
• Ex: AdaBoostM1 with C4.5 or J48
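A minimal Weka sketch of boosting with AdaBoostM1 and a J48 base learner, again on the hypothetical train.arff.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.AdaBoostM1;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BoostingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet();   // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        AdaBoostM1 booster = new AdaBoostM1();
        booster.setClassifier(new J48());    // base learner re-run on reweighted data each round
        booster.setNumIterations(10);        // number of boosting rounds / models

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(booster, data, 10, new Random(1));
        System.out.printf("Boosted J48 accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```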
Ensemble Learning -- Boosting In Weka
Ensemble Learning -- Stacking
• One training dataset is the input for several base learners
or level-0 learners
• Output predictions of base learners become input of a meta
learner or level-1 learner
• Base learners are different classifiers
• An instance is first fed into the level-0 models; their guesses are fed into the level-1 model, which combines them into the final prediction
• Ex: StackingC with LinearRegression
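A minimal Weka sketch of stacking: three different level-0 learners and a level-1 meta-learner that combines their predictions. The plain Stacking class is used here with Logistic as the meta-learner, one reasonable choice; the StackingC variant mentioned on the slide uses a linear-regression-based meta-level internally. The file train.arff is again a hypothetical placeholder.

```java
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.Logistic;
import weka.classifiers.lazy.IBk;
import weka.classifiers.meta.Stacking;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class StackingDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("train.arff").getDataSet();   // hypothetical dataset
        data.setClassIndex(data.numAttributes() - 1);

        Stacking stacker = new Stacking();
        // Level-0 (base) learners: deliberately different kinds of classifier
        stacker.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new IBk() });
        // Level-1 (meta) learner combines the base learners' predictions
        stacker.setMetaClassifier(new Logistic());

        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(stacker, data, 10, new Random(1));
        System.out.printf("Stacking accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```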
Ensemble Learning -- Stacking In Weka
Ensemble Learning
• Usefulness:
• Diversity helps, especially with “unstable”
learners
• Disadvantages:
• Hard to analyze: it is not easy to understand which factors are contributing to the improved decisions