Class 12-13 Notes
S1
S2
x∙𝑤 =𝑏
x∙𝑤 =𝑏−1
x∙𝑤 =𝑏+1
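S3: Another Proof of margin for the curious:
margin is the distance between points u and v
on their respective planes lying along the vector w
u lies on the plane x∙w = b+1 with u = αw:
$1 = u \cdot w - b = \alpha\, w \cdot w - b \;\Rightarrow\; \alpha = \frac{1+b}{\|w\|^2}$
similarly, v lies on the plane x∙w = b−1 with v = βw:
$-1 = v \cdot w - b = \beta\, w \cdot w - b \;\Rightarrow\; \beta = \frac{-1+b}{\|w\|^2}$
$\text{margin} = \|u - v\| = \left\| \frac{(1+b)}{\|w\|^2}\, w - \frac{(-1+b)}{\|w\|^2}\, w \right\| = \left\| \frac{2w}{\|w\|^2} \right\| = \frac{2\|w\|}{\|w\|^2} = \frac{2}{\|w\|}$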
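The margin result 2/||w|| can be checked numerically. The following is a minimal sketch; the values of w and b are arbitrary illustrative choices, not taken from the notes, and `margin_points` is a hypothetical helper name.

```python
import math

def margin_points(w, b):
    """Return the points u and v that lie along w on the planes
    x.w = b + 1 and x.w = b - 1 respectively."""
    norm_sq = sum(wj * wj for wj in w)
    alpha = (1 + b) / norm_sq     # u = alpha * w
    beta = (-1 + b) / norm_sq     # v = beta * w
    u = [alpha * wj for wj in w]
    v = [beta * wj for wj in w]
    return u, v

w = [3.0, 4.0]   # illustrative weight vector, ||w|| = 5
b = 2.0          # illustrative bias
u, v = margin_points(w, b)

margin = math.dist(u, v)            # ||u - v||
expected = 2 / math.hypot(*w)       # 2 / ||w||
print(margin, expected)
```

The two printed values should agree up to floating-point rounding for any nonzero w.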
S4
S5
x∙𝑤 =𝑏
x∙𝑤 =𝑏−1
x∙𝑤 =𝑏+1
S6: Approximate Misclassification error with a nicer convex function
[Figure: misclassification error and SVM error (a.k.a. hinge loss) plotted against yf(x)]
Misclassification error for point (x,y): step(-y(x∙w - b))
SVM error for point (x,y): max(1 - y(x∙w - b), 0)
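To make the two error measures concrete, here is a small sketch that evaluates both for a few points. It uses the f(x) = x∙w - b sign convention of the plane equations; the weights and points are made up for illustration, and treating step(0) as 0 is an assumption.

```python
def decision_value(x, w, b):
    """f(x) = x.w - b, matching the planes x.w = b, b - 1, b + 1."""
    return sum(xj * wj for xj, wj in zip(x, w)) - b

def misclassification_error(x, y, w, b):
    """step(-y f(x)): 1 if the point is on the wrong side, else 0.
    Assumption: step(0) is taken to be 0."""
    return 1 if -y * decision_value(x, w, b) > 0 else 0

def hinge_loss(x, y, w, b):
    """SVM error: max(1 - y f(x), 0)."""
    return max(1 - y * decision_value(x, w, b), 0)

w, b = [1.0, 0.0], 0.0                       # illustrative model
points = [([2.0, 0.0], 1),                   # correct, outside the margin
          ([0.5, 0.0], 1),                   # correct, inside the margin
          ([-1.0, 0.0], 1)]                  # misclassified

for x, y in points:
    print(x, y, misclassification_error(x, y, w, b), hinge_loss(x, y, w, b))
```

The hinge loss is never smaller than the misclassification error, which is why it can serve as a convex upper bound.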
S7: AMPL Model File
HW3_QSARSVM_mod.txt
model; #CONSTANTS
param p; # number of predictor variables
param n; # number of observations
param c; # SVM model parameter
set P:={1..p}; # indices of predictor variables
set N:={1..n}; # indices of observations
param X{N,P}; # observation data
param Y{N}; # classes of each observation (1 or -1)
#VARIABLES of model
var w{P};
var b;
var z{N};
#OBJECTIVE FUNCTION for SVM classification
minimize objective:
c*sum{i in N} z[i] + sum{j in P} w[j]*w[j] ;
#CONSTRAINTS
subject to trainerror {i in N}:
Y[i]*( sum {j in P} X[i,j]*w[j]-b ) + z[i] >= 1;
subject to slack {i in N}:
z[i] >= 0;
S8: AMPL Data File
sampleSVM_dat2.txt
#PARAMETERS
param p:= 2; # number of features
param n:= 7; # total number of observations
param c:=10; # constant used in SVM model
# All X Data
param X :  1     2 :=
1          1     0
2         -1    -2
3         -1     1
4          1     2
5          2     3
6          0     4
7          0    -0.5
;
# Y Data (must be -1 or 1)
param Y :=
1  1
2  1
3  1
4 -1
5 -1
6 -1
7 -1
;
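As a sanity check on the model and data files, the sketch below evaluates the SVM objective for one candidate (w, b) on the seven sample points. It assumes the X table reads as two columns (feature 1 and feature 2); the candidate `w_guess` and `b_guess` are hypothetical values chosen by hand, not the solution returned by the solver.

```python
# Sample data, assuming the X table has one row per observation
X = [[1, 0], [-1, -2], [-1, 1], [1, 2], [2, 3], [0, 4], [0, -0.5]]
Y = [1, 1, 1, -1, -1, -1, -1]
c = 10

def svm_objective(w, b):
    """Return (objective, slacks) for a given w and b.

    Each slack z[i] is the smallest nonnegative value satisfying
    Y[i]*(X[i].w - b) + z[i] >= 1, as in the model constraints.
    """
    slacks = []
    for x, y in zip(X, Y):
        f = sum(xj * wj for xj, wj in zip(x, w)) - b
        slacks.append(max(0.0, 1 - y * f))
    objective = c * sum(slacks) + sum(wj * wj for wj in w)
    return objective, slacks

w_guess = [-1.0, -1.0]   # hypothetical candidate, not the optimum
b_guess = 0.0
objective, slacks = svm_objective(w_guess, b_guess)
print(objective, slacks)
```

Any (w, b) returned by the solver should give an objective value no larger than the one from a hand-picked candidate like this.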
S9: AMPL Command File
HW3_QSARSVM_cmd.txt
# Command file to train and test SVM
# Created by Kristin Bennett, March 2015.
#Solve SVM Model
solve;
# Display Solutions weights, bias, and objective
display w;
display b;
S8: Calling from NEOS using Minos solver
• http://www.neos-server.org/neos/solvers/nco:MINOS/AMPL.html
S9: Minos Results
S11: Estimating Generalization Error of a Model
S12: Classification Homework: Drug Discovery Problem
• Costs about $1 billion to develop a drug
• First-year sales > $1 billion/drug
• 1 drug approved/1000 compounds tested
• 1 out of 100 drugs succeeds to market
• Trend: more drug failures in late-stage clinical trials or even after reaching the market!
• Pressure to reduce drug costs/development time
• Data-driven models to predict bioactivities of drugs have the potential to accelerate the drug discovery process
• Efficacy
• Toxicity
• Absorption
• Distribution
• Metabolism
• Excretion
• Biodegradability
S13: Classification Task
The European REACH regulation requires information on
ready biodegradation, which is a screening test to assess
the biodegradability of chemicals. At the same time
REACH encourages the use of alternatives to animal
testing which includes predictions from quantitative
structure–activity relationship (QSAR) models.
Your assignment is to build a classification QSAR model to
predict ready biodegradation of chemicals based on the
provided training set, estimate the generalization error using
the testing set, and estimate the biodegradability of a new point.
S14: Data Available
Experimental values for the chemicals were collected from the webpage of
the National Institute of Technology and Evaluation of Japan (NITE).
The training data set consists of these 833 chemicals, whether they
were readily biodegradable (1 = yes, -1 = no), and 41 molecular
descriptors.
A testing set of 42 molecules with known biodegradability
Mystery point with unknown biodegradability
Data files are in the form
IDNUM, descriptor 1, ..., descriptor 41, class
Molecules and Molecular descriptors are proprietary. No details
provided except cryptic names in a separate file.
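A file in the "IDNUM, descriptors, class" form can be read with a few lines of Python. This is a minimal sketch: `load_qsar` is a hypothetical helper, and the two sample rows are invented with only three descriptors each so the example stays short (the real files have 41).

```python
import csv
import io

def load_qsar(text):
    """Parse rows of the form: IDNUM, descriptor 1, ..., descriptor k, class.

    Returns (ids, X, y) where X holds the descriptors as floats and
    y holds the class labels (1 or -1) as ints.
    """
    ids, X, y = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue                      # skip blank lines
        fields = [f.strip() for f in row]
        ids.append(fields[0])
        X.append([float(v) for v in fields[1:-1]])
        y.append(int(fields[-1]))
    return ids, X, y

# Invented sample rows for illustration only
sample = "101, 0.5, 1.2, 3.0, 1\n102, 0.1, 0.0, 2.5, -1\n"
ids, X, y = load_qsar(sample)
print(ids, X, y)
```

For a real data file, the same function could be called on the result of `open(path).read()`.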
S15: Overall Experimental Design
Data provided
Training set of 833 41-dimensional points
Test set of 50 41-dimensional points
Train classification model on training set to determine model
parameters using NEOS.
Calculate error on the training set.
Test classification model with the model parameters found on
the test set to see how well it works on future data.
+ some other stuff
Metric of success:
small number of points misclassified on test set
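Once the solver has returned w and b, the metric of success can be computed by classifying each test point with the sign of x∙w - b. The sketch below is illustrative: the model parameters and the four test points are made up, and predicting +1 when x∙w - b is exactly 0 is an assumption.

```python
def predict(x, w, b):
    """Classify x as 1 or -1 using the sign of x.w - b.
    Assumption: a value of exactly 0 is assigned to class 1."""
    f = sum(xj * wj for xj, wj in zip(x, w)) - b
    return 1 if f >= 0 else -1

def count_misclassified(X, Y, w, b):
    """Number of points whose predicted class differs from the true class."""
    return sum(1 for x, y in zip(X, Y) if predict(x, w, b) != y)

# Hypothetical model parameters and test set, for illustration only
w, b = [1.0, -1.0], 0.0
X_test = [[2, 1], [0, 3], [1, 2], [3, 0]]
Y_test = [1, -1, 1, 1]

errors = count_misclassified(X_test, Y_test, w, b)
error_rate = errors / len(Y_test)
print(errors, error_rate)
```

Running the same function on the training set gives the training error, so the two can be compared directly.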
S16: Many Data Sets are Published with Associated Papers.
Reference: Mansouri et al., Training and testing linear
SVM models to predict biodegradability of
chemicals based on data, 2013
Location:
http://pubs.acs.org/doi/abs/10.1021/ci4000213
S18: Regression Data Set: Energy Efficiency Data Set
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with
respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We
simulate various settings as functions of the afore-mentioned characteristics to obtain 499 building shapes. The
dataset comprises 80 samples and 8 features, aiming to predict two real valued responses.
The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes,
denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.
X1 Relative Compactness
X2 Surface Area
X3 Wall Area
X4 Roof Area
X5 Overall Height
X6 Orientation
X7 Glazing Area
X8 Glazing Area Distribution
y1 Heating Load
y2 Cooling Load
Reference A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings
using statistical machine learning tools', Energy and Buildings, Vol. 49, pp. 560-567, 2012
Location http://archive.ics.uci.edu/ml/datasets/Energy+efficiency
S20: Homework 3 Extra Credit
• Design an experiment to explore the effect of
the tradeoff parameter λ in regression and C
in classification on the training and testing
errors.
• Describe what you observed in terms of
training and testing errors. Indicate which
model is best for each problem and why.
3 pts for Regression
3 pts for Classification
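One possible shape for such an experiment is a loop over parameter values that records training and testing error for each. The sketch below uses a one-variable ridge regression solved in closed form as a stand-in, which is an assumption; the data are invented, and the homework models themselves would be trained through NEOS.

```python
def fit_ridge_1d(xs, ys, lam):
    """Minimize sum (y - w*x)^2 + lam * w^2 over a single weight w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def mse(xs, ys, w):
    """Mean squared error of the prediction w*x."""
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Invented data for illustration only
train_x = [0.0, 1.0, 2.0, 3.0, 4.0]
train_y = [0.1, 1.9, 4.2, 5.8, 8.1]
test_x = [0.5, 1.5, 2.5, 3.5]
test_y = [1.0, 3.0, 5.0, 7.0]

results = []
for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    w = fit_ridge_1d(train_x, train_y, lam)
    results.append((lam, mse(train_x, train_y, w), mse(test_x, test_y, w)))

for lam, train_err, test_err in results:
    print(f"lambda={lam:<6} train={train_err:.4f} test={test_err:.4f}")
```

The same loop structure would apply to C in classification, with the misclassification count in place of the squared error.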
S21: Project
Project Description and Checklist will be
available on Prof. K’s web page in DM folder.
Deliverables per group:
• Project Proposal - April 3
• Project Progress Report – April 24
• Project Final Report - May 12
• Project Poster Presentation (Grad Only)
May 22 during exam slot
S22: Sample Project Like homework
Find a problem that can be solved using data and a classification or
regression model (can be SVM or other optimization based model).
• Pick one or more data problems from sources like UCI or published
papers
• Make train and test sets for each problem.
• Create one or more predictive models using NEOS (easy to make
many if you change parameters like C)
• Run models on the training sets
• Evaluate training set errors
• Test models on the testing sets, calculate
• Discuss your findings. How well did the models work? Which
models were preferable? Compare the performance of the models
on the different data sets in terms of efficiency, error, and
interpretability of solutions.
• Extra credit for creativity and exploration
• See project description for more details.
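The "make train and test sets" step above can be sketched as a seeded random split. This is one simple approach, not a prescribed one; `train_test_split` here is a hypothetical helper written for this example, and the 25% test fraction is an arbitrary choice.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows reproducibly and split them into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)              # copy so the input is not modified
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))                 # stand-in for 20 data rows
train, test = train_test_split(rows)
print(len(train), len(test))
```

Keeping the seed fixed means every model in the comparison sees the same training and testing points.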
S23: Finding Problems or Methods
1) scholar.google.com or pubmed
Find a paper and topic that interests you. Try to
duplicate it
2) Start with a machine learning task and look up
some relevant papers
Many learning algorithms have an optimization
problem as their engine
3) Textbook, Wikipedia, a problem in another class
4) Ask your favorite professor
S24: Finding Data
1) Machine learning repositories
DELVE
http://www.cs.toronto.edu/~delve/data/datasets.html
UCI
http://archive.ics.uci.edu/ml/
2) Pubmed frequently has data in supplementary material
http://www.ncbi.nlm.nih.gov/pubmed
3) Domain specific data repositories
finance, astronomy, biology, ???
4) Your local research project
5) Simulated data
S25: Help available for formulating project
Regular office hours
M 2:30 to 3:30
F 9 to 11
Call 6899 or email bennek@rpi.edu
if you want to come by or set up an
appointment