Class 12-13 Notes

S1

S2
[Figure: separating plane x∙w = b with margin planes x∙w = b−1 and x∙w = b+1]

S3: Another Proof of margin for the curious
The margin is the distance between points u and v that lie on their respective planes along the vector w.
Let $u = \alpha w$ lie on the plane $x \cdot w = b + 1$:
$$u \cdot w = \alpha \, w \cdot w = b + 1 \;\Rightarrow\; \alpha = \frac{1 + b}{\|w\|^2}$$
Similarly, for $v = \beta w$ on the plane $x \cdot w = b - 1$:
$$\beta = \frac{-1 + b}{\|w\|^2}$$
Then
$$\text{margin} = \|u - v\| = \left\| \frac{(1 + b) - (-1 + b)}{\|w\|^2} \, w \right\| = \frac{2 \|w\|}{\|w\|^2} = \frac{2}{\|w\|}$$

S4

S5
[Figure: separating plane x∙w = b with margin planes x∙w = b−1 and x∙w = b+1]

S6: Approximate Misclassification error with a nicer convex function
[Figure: misclassification error (step function) and SVM error, a.k.a. hinge loss, plotted against yf(x), with axis ticks at -1, 0, and 1]
Misclassification error for point (x,y): step(−y(x∙w − b))
SVM error for point (x,y): max(1 − y(x∙w − b), 0)

S7: AMPL Model File HW3_QSARSVM_mod.txt
    #CONSTANTS
    param p;   # number of predictor variables
    param n;   # number of observations
    param c;   # SVM model parameter
    set P:={1..p};   # indices of predictor variables
    set N:={1..n};   # indices of observations
    param X{N,P};    # observation data
    param Y{N};      # classes of each observation (1 or -1)
    #VARIABLES of model
    var w{P};
    var b;
    var z{N};
    #OBJECTIVE FUNCTION for SVM classification model
    minimize objective: c*sum{i in N} z[i] + sum{j in P} w[j]*w[j];
    #CONSTRAINTS
    subject to trainerror {i in N}: Y[i]*( sum {j in P} X[i,j]*w[j] - b ) + z[i] >= 1;
    subject to slack {i in N}: z[i] >= 0;

S8: AMPL Data File sampleSVM_dat2.txt
    #PARAMETERS
    param p:= 2;    # number of features
    param n:= 7;    # total number of observations
    param c:=10;    # constant used in SVM model
    # All X Data
    param X :  1     2   :=
        1      1     0
        2     -1    -2
        3     -1     1
        4      1     2
        5      2     3
        6      0     4
        7      0    -0.5 ;
    # Y Data (must be -1 or 1)
    param Y :=
        1   1
        2   1
        3   1
        4  -1
        5  -1
        6  -1
        7  -1 ;

S9: AMPL Command File HW3_QSARSVM_cmd.txt
    # Command file to train and test SVM
    # Created by Kristin Bennett, March 2015.
    #Solve SVM Model
    solve;
    # Display Solutions weights, bias, and objective
    display w;
    display b;

S8: Calling from NEOS using Minos solver
• http://www.neos-server.org/neos/solvers/nco:MINOS/AMPL.html

S9: Minos Results

S11: Estimating Generalization Error of a Model

S12: Classification Homework: Drug Discovery Problem
• Costs about $1 billion to develop a drug
• First-year sales > $1 billion/drug
• 1 drug approved/1000 compounds tested
• 1 out of 100 drugs succeeds to market
• Trend: more drug failures in late-stage clinical trials or even after reaching the market!
• Pressure to reduce drug costs/development time
• Data-driven models to predict bioactivities of drugs have the potential to accelerate the drug discovery process
    • Efficacy
    • Toxicity
    • Absorption
    • Distribution
    • Metabolism
    • Excretion
    • Biodegradability

S13: Classification Task
The European REACH regulation requires information on ready biodegradation, which is a screening test to assess the biodegradability of chemicals. At the same time, REACH encourages the use of alternatives to animal testing, which include predictions from quantitative structure–activity relationship (QSAR) models.
Your assignment is to build a classification QSAR model to predict ready biodegradation of chemicals based on the provided training set, and to estimate the generalization error using the testing set. Estimate the biodegradability of a new point.

S14: Data Available
Experimental values for chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE).
• The training data set consists of these 833 chemicals, whether they were readily biodegradable (1 = yes, -1 = no), and 41 molecular descriptors.
• A testing set of 42 molecules with known biodegradability
• Mystery point with unknown biodegradability
Data files are in the form IDNUM, descriptor 1, ….., descriptor 41, class
Molecules and molecular descriptors are proprietary. No details are provided except cryptic names in a separate file.
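Once NEOS returns w and b, the training and testing errors can be computed outside AMPL. The Python sketch below is one possible way to do that, not part of the homework files: it assumes each data row is comma-separated in the stated form (IDNUM, descriptors, class) and uses the decision value f(x) = x∙w − b from the model file. The function names, the toy rows, and the weights are all made up for illustration.

```python
import csv
import io


def load_rows(text):
    """Parse lines of the form: IDNUM, descriptor 1, ..., descriptor d, class."""
    ids, X, y = [], [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        ids.append(row[0].strip())
        X.append([float(v) for v in row[1:-1]])
        y.append(int(float(row[-1])))
    return ids, X, y


def decision_value(x, w, b):
    """f(x) = x.w - b, matching the AMPL constraint sum_j X[i,j]*w[j] - b."""
    return sum(xj * wj for xj, wj in zip(x, w)) - b


def evaluate(X, y, w, b):
    """Return (number misclassified, error rate, total hinge loss)."""
    wrong = 0
    hinge = 0.0
    for x, label in zip(X, y):
        margin = label * decision_value(x, w, b)
        if margin <= 0:  # a point exactly on the plane is counted as an error
            wrong += 1
        hinge += max(1.0 - margin, 0.0)
    return wrong, wrong / len(y), hinge


# Made-up toy rows with 2 descriptors (the homework files have 41).
toy_text = """\
101, 2.0, 0.0, 1
102, 3.0, 1.0, 1
103, 0.0, 0.0, -1
104, 1.5, 0.0, -1
"""
# Hypothetical weights and bias, standing in for the values NEOS reports.
w_example = [1.0, 0.0]
b_example = 1.0

ids, X_toy, y_toy = load_rows(toy_text)
n_wrong, rate, hinge_total = evaluate(X_toy, y_toy, w_example, b_example)
print(n_wrong, rate, hinge_total)
```

Running the same `evaluate` call on the training rows and then on the testing rows gives the two error counts to compare; the hinge total is the sum of the z[i] slack values the model would assign for that w and b.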
S15: Overall Experimental Design
Data provided
• Training set of 833 41-dimensional points
• Test set of 50 41-dimensional points
Train the classification model on the training set to determine the model parameters using NEOS.
Calculate the error on the training set.
Test the classification model, with the model parameters found, on the test set to see how well it works on future data.
+ some other stuff
Metric of success: small number of points misclassified on the test set

S16: Many Data Sets are Published with Associated Papers.
Reference: Mansour et al, Training and testing linear SVM models to predict biodegradability of chemicals based on data, 2013
Location: http://pubs.acs.org/doi/abs/10.1021/ci4000213

S18: Regression Data Set: Energy Efficiency Data Set
We perform energy analysis using 12 different building shapes simulated in Ecotect. The buildings differ with respect to the glazing area, the glazing area distribution, and the orientation, amongst other parameters. We simulate various settings as functions of the aforementioned characteristics to obtain 499 building shapes. The dataset comprises 80 samples and 8 features, aiming to predict two real-valued responses.
The dataset contains eight attributes (or features, denoted by X1...X8) and two responses (or outcomes, denoted by y1 and y2). The aim is to use the eight features to predict each of the two responses.
X1 Relative Compactness
X2 Surface Area
X3 Wall Area
X4 Roof Area
X5 Overall Height
X6 Orientation
X7 Glazing Area
X8 Glazing Area Distribution
y1 Heating Load
y2 Cooling Load
Reference: A. Tsanas, A. Xifara: 'Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools', Energy and Buildings, Vol. 49, pp.
560-567, 2012
Location: http://archive.ics.uci.edu/ml/datasets/Energy+efficiency

S20: Homework 3 Extra Credit
• Design an experiment to explore the effect of the tradeoff parameter λ in regression and C in classification on the training and testing errors.
• Describe what you observed in terms of training and testing errors. Indicate which model is best for each problem and why.
3 pts for Regression
3 pts for Classification

S21: Project
Project Description and Checklist will be available on Prof. K’s web page in the DM folder.
Deliverables per group:
• Project Proposal - April 3
• Project Progress Report - April 24
• Project Final Report - May 12
• Project Poster Presentation (Grad Only) - May 22 during exam slot

S22: Sample Project
Like the homework: find a problem that can be solved using data and a classification or regression model (can be SVM or another optimization-based model).
• Pick one or more data problems from sources like UCI or published papers
• Make train and test sets for each problem.
• Create one or more predictive models using NEOS (easy to make many if you change parameters like C)
• Run models on the training sets
• Evaluate training set errors
• Test models on the testing sets, calculate testing set errors
• Discuss your findings. How well did the models work? Which models were preferable? Compare the performance of the models on the different data sets in terms of efficiency, error, and interpretability of solutions.
• Extra credit for creativity and exploration
• See project description for more details.

S23: Finding Problems or Methods
1) scholar.google.com or PubMed: Find a paper and topic that interests you.
Try to duplicate it.
2) Start with a machine learning task and look up some relevant papers. Many learning algorithms have an optimization problem as their engine.
3) Textbook, Wikipedia, a problem in another class
4) Ask your favorite professor

S24: Finding Data
1) Machine learning repositories
    DELVE http://www.cs.toronto.edu/~delve/data/datasets.html
    UCI http://archive.ics.uci.edu/ml/
2) PubMed frequently has data in supplementary material http://www.ncbi.nlm.nih.gov/pubmed
3) Domain-specific data repositories: finance, astronomy, biology, ???
4) Your local research project
5) Simulated data

S25: Help available for formulating project
Regular office hours: M 2:30 to 3:30, F 9 to 11
Give a call at 6899 or email bennek@rpi.edu if you want to come by or set up an appointment.