Data Mining Packages in R: Logistic Regression and SVM
Jiang Du
March 2008

Logistic Regression
• lrm in package "Design"
  – http://biostat.mc.vanderbilt.edu/s/Design/html/lrm.html
• glm in package "stats"
  – http://finzi.psych.upenn.edu/R/library/stats/html/glm.html
• …

Logistic Regression: lrm

Usage
lrm(formula, data, subset, na.action=na.delete, method="lrm.fit",
    model=FALSE, x=FALSE, y=FALSE, linear.predictors=TRUE, se.fit=FALSE,
    penalty=0, penalty.matrix, tol=1e-7, strata.penalty=0,
    var.penalty=c('simple','sandwich'), weights, normwt, ...)

Arguments
• formula
  – a formula object. An offset term can be included. The offset causes fitting of a model such as logit(Y=1) = Xβ + W, where W is the offset variable having no estimated coefficient. The response variable can be any data type; lrm converts it in alphabetic or numeric order to an S factor variable and recodes it 0,1,2,... internally.
• data
  – data frame to use. Default is the current frame.

Usage
## S3 method for class 'lrm':
predict(object, ..., type=c("lp", "fitted", "fitted.ind", "mean", "x",
    "data.frame", "terms", "adjto", "adjto.data.frame", "model.frame"),
    se.fit=FALSE, codes=FALSE)

Arguments
• object
  – an object created by lrm
• ...
  – arguments passed to predict.Design, such as kint and newdata (which is used if you are predicting out of data). See predict.Design to see how NAs are handled.
• type
  – …

Logistic Regression: lrm
• Fitting training data
  – model = lrm(Class ~ X + Y + Z, data=train)
• Prediction on new data
  – To get logit(Y=1):
    • predict(model, newdata = test, type = "lp")
  – To get Pr(Y=1):
    • predict(model, newdata = test, type = "fitted.ind")

?formula
• The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
• In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b is interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c), which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also be used to remove the intercept term: y ~ x - 1 is a line through the origin. A model with no intercept can also be specified as y ~ x + 0 or y ~ 0 + x.

Logistic Regression: glm
• Fitting training data
  – model = glm(Class ~ X + Y + Z, data=train, family=binomial(logit))
• Prediction on new data (a runnable sketch follows below)
  – To get logit(Y=1):
    • predict(model, newdata = test)
  – To get Pr(Y=1):
    • predict(model, newdata = test, type = "response")
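The following is a minimal, self-contained sketch of the glm workflow on the slide above, using base R only. The data frame and the Class, X, Y, Z variables are synthetic stand-ins for illustration; substitute your own training and test sets.

# Synthetic data, purely for illustration
set.seed(1)
n <- 200
dat <- data.frame(X = rnorm(n), Y = rnorm(n), Z = rnorm(n))
dat$Class <- rbinom(n, 1, plogis(0.5 * dat$X - 0.8 * dat$Y + 0.3 * dat$Z))
train <- dat[1:150, ]
test  <- dat[151:200, ]

# Fit the logistic regression model on the training data
model <- glm(Class ~ X + Y + Z, data = train, family = binomial(logit))

# Linear predictor logit(Pr(Y = 1)) for the new data
lp <- predict(model, newdata = test)

# Predicted probabilities Pr(Y = 1) for the new data
pr <- predict(model, newdata = test, type = "response")

# Turn probabilities into 0/1 class labels with a 0.5 cutoff
pred_class <- as.integer(pr > 0.5)

# The lrm() workflow from the Design package follows the same pattern
# (see the lrm slides above), with type = "lp" or type = "fitted.ind".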
SVM
• svm in "e1071"
  – http://www.potschi.de/svmtut/svmtut.html
• ksvm in "kernlab"
  – http://rss.acs.unt.edu/Rdoc/library/kernlab/html/ksvm.html

SVM: svm

kernel
• the kernel used in training and predicting. You might consider changing some of the following parameters, depending on the kernel type.
  – linear:
    • u'*v
  – polynomial:
    • (gamma*u'*v + coef0)^degree
  – radial basis:
    • exp(-gamma*|u-v|^2)
  – sigmoid:
    • tanh(gamma*u'*v + coef0)

SVM: svm
• Training
  – model = svm(Class ~ X + Y + Z, data=train, type = "C-classification", kernel = "linear")
• Prediction (see the sketch below)
  – predict(model, newdata = test)
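To make the svm calls above concrete, here is a minimal sketch assuming the e1071 package is installed. The synthetic data frame and the Class, X, Y, Z variable names are illustrative stand-ins, not part of the package.

library(e1071)

# Synthetic two-class data, purely for illustration
set.seed(1)
n <- 200
dat <- data.frame(X = rnorm(n), Y = rnorm(n), Z = rnorm(n))
dat$Class <- factor(ifelse(dat$X + dat$Y - dat$Z > 0, "pos", "neg"))
train <- dat[1:150, ]
test  <- dat[151:200, ]

# Train a C-classification SVM with a linear kernel
model <- svm(Class ~ X + Y + Z, data = train,
             type = "C-classification", kernel = "linear")

# Predict class labels for the new data
pred <- predict(model, newdata = test)

# Quick check of held-out accuracy: predicted vs. true labels
table(pred, test$Class)

Had type not been given, svm would have defaulted to C-classification here because Class is a factor; for a numeric response it defaults to eps-regression instead.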