Data Mining Packages in R

Data Mining Packages in R:
logistic regression and SVM
Jiang Du
March 2008
Logistic Regression
• lrm in package "Design"
– http://biostat.mc.vanderbilt.edu/s/Design/html/lrm.html
• glm in package "stats"
– http://finzi.psych.upenn.edu/R/library/stats/html/glm.html
• …
Logistic Regression: lrm
Usage
lrm(formula, data, subset,
na.action=na.delete, method="lrm.fit",
model=FALSE, x=FALSE, y=FALSE,
linear.predictors=TRUE, se.fit=FALSE,
penalty=0, penalty.matrix, tol=1e-7,
strata.penalty=0,
var.penalty=c('simple','sandwich'), weights,
normwt, ...)
Arguments
• formula
– a formula object. An offset term can be
included. The offset causes fitting of a
model such as logit(Y=1) = Xβ + W,
where W is the offset variable having no
estimated coefficient. The response
variable can be any data type; lrm converts
it in alphabetic or numeric order to an S
factor variable and recodes it 0,1,2,...
internally.
• data
– data frame to use. Default is the current
frame.
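The offset idea can be tried out in base R. The sketch below uses glm from "stats" instead of lrm, an assumption made only so that it runs without extra packages; the data are simulated, and the names x, w, y and fit are illustrative.

```r
# Simulated data: w is a known offset, x gets an estimated coefficient.
set.seed(1)
n <- 200
x <- rnorm(n)
w <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x + w))

# offset(w) enters the linear predictor with its coefficient fixed at 1.
fit <- glm(y ~ x + offset(w), family = binomial)

# Only the intercept and x are estimated; w has no coefficient.
coef_names <- names(coef(fit))
print(coef_names)

# The fitted linear predictor is intercept + beta * x + w.
lp_manual <- coef(fit)[1] + coef(fit)[2] * x + w
max_diff <- max(abs(lp_manual - predict(fit)))
print(max_diff)
```

The printed difference should be at the level of floating-point rounding, which is what "no estimated coefficient" means in practice: W is added to Xβ unchanged.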
Usage
## S3 method for class 'lrm':
predict(object, ..., type=c("lp", "fitted",
"fitted.ind", "mean", "x", "data.frame", "terms",
"adjto","adjto.data.frame", "model.frame"),
se.fit=FALSE, codes=FALSE)
Arguments
• object
– an object created by lrm
• ...
– arguments passed to predict.Design, such
as kint and newdata (which is used if you
are predicting out of data). See
predict.Design to see how NAs are
handled.
• type
– …
Logistic Regression: lrm
• Fitting training data
– model = lrm(Class ~ X + Y + Z, data=train)
• Prediction on new data
– To get logit(Y=1)
• predict(model, newdata = test, type = "lp")
– To get Pr(Y=1)
• predict(model, newdata = test, type = "fitted.ind")
?formula
• The models fit by, e.g., the lm and glm functions are specified in a compact symbolic form. The ~ operator is basic in the formation of such models. An expression of the form y ~ model is interpreted as a specification that the response y is modelled by a linear predictor specified symbolically by model. Such a model consists of a series of terms separated by + operators. The terms themselves consist of variable and factor names separated by : operators. Such a term is interpreted as the interaction of all the variables and factors appearing in the term.
• In addition to + and :, a number of other operators are useful in model formulae. The * operator denotes factor crossing: a*b is interpreted as a+b+a:b. The ^ operator indicates crossing to the specified degree. For example (a+b+c)^2 is identical to (a+b+c)*(a+b+c), which in turn expands to a formula containing the main effects for a, b and c together with their second-order interactions. The %in% operator indicates that the terms on its left are nested within those on the right. For example a + b %in% a expands to the formula a + a:b. The - operator removes the specified terms, so that (a+b+c)^2 - a:b is identical to a + b + c + b:c + a:c. It can also be used to remove the intercept term: y ~ x - 1 is a line through the origin. A model with no intercept can also be specified as y ~ x + 0 or y ~ 0 + x.
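The expansions described above can be checked directly in base R. This is a small sketch using terms() from "stats"; the helper name expand is made up for illustration, and a, b, c, x, y are just symbols (no data is needed, because terms() does not evaluate the variables).

```r
# term.labels lists the terms R derives from the right-hand side of a formula.
expand <- function(f) attr(terms(f), "term.labels")

crossed <- expand(~ a * b)                # main effects plus a:b
squared <- expand(~ (a + b + c)^2)        # main effects plus all two-way interactions
removed <- expand(~ (a + b + c)^2 - a:b)  # the same, with a:b taken out

print(crossed)
print(squared)
print(removed)

# The intercept flag is 1 by default and 0 after "- 1" or "0 +".
has_int    <- attr(terms(y ~ x), "intercept")
no_int     <- attr(terms(y ~ x - 1), "intercept")
no_int_alt <- attr(terms(y ~ 0 + x), "intercept")
print(c(has_int, no_int, no_int_alt))
```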
Logistic Regression: glm
• Fitting training data
– model = glm(Class ~ X + Y + Z, data=train,
family=binomial(logit))
• Prediction on new data
– To get logit(Y=1)
• predict(model, newdata = test)
– To get Pr(Y=1)
• predict(model, newdata = test, type = "response")
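A runnable version of the glm workflow might look like the sketch below. The data are simulated here (an assumption, since the slides do not supply a data set); the column names Class, X, Y, Z and the train/test split mirror the calls above.

```r
# Simulated stand-in for the train/test data frames used on the slide.
set.seed(42)
n <- 300
d <- data.frame(X = rnorm(n), Y = rnorm(n), Z = rnorm(n))
d$Class <- factor(rbinom(n, 1, plogis(0.8 * d$X - 0.5 * d$Y)))

train <- d[1:200, ]
test  <- d[201:300, ]

model <- glm(Class ~ X + Y + Z, data = train, family = binomial(logit))

# Default type is "link": predictions on the logit scale.
logit_hat <- predict(model, newdata = test)
# type = "response": probability of the second factor level.
prob_hat  <- predict(model, newdata = test, type = "response")

# The two scales are linked by the inverse logit.
same_scale <- all.equal(unname(prob_hat), unname(plogis(logit_hat)))
print(same_scale)

# Hard class labels from a 0.5 cutoff (the cutoff is an illustrative choice).
lev <- levels(train$Class)
pred_class <- ifelse(prob_hat > 0.5, lev[2], lev[1])
print(table(predicted = pred_class, actual = test$Class))
```

With a factor response, glm treats the first level as "failure" and the remaining level as "success", so Pr(Y=1) here is the probability of the second level of Class.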
SVM
• svm in "e1071"
– http://www.potschi.de/svmtut/svmtut.html
• ksvm in "kernlab"
– http://rss.acs.unt.edu/Rdoc/library/kernlab/html/ksvm.html
SVM: svm
Kernel
• the kernel used in training and predicting. You might
consider changing some of the following parameters,
depending on the kernel type.
– linear:
• u'*v
– polynomial:
• (gamma*u'*v + coef0)^degree
– radial basis:
• exp(-gamma*|u-v|^2)
– sigmoid:
• tanh(gamma*u'*v + coef0)
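To make the four kernel formulas concrete, they can be written as plain R functions. This is only an illustration of the formulas above, not how e1071 computes them internally; the function names and the example vectors are made up.

```r
# u and v are numeric vectors of equal length; u'*v is the inner product.
linear_kernel  <- function(u, v) sum(u * v)
poly_kernel    <- function(u, v, gamma, coef0, degree) (gamma * sum(u * v) + coef0)^degree
rbf_kernel     <- function(u, v, gamma) exp(-gamma * sum((u - v)^2))
sigmoid_kernel <- function(u, v, gamma, coef0) tanh(gamma * sum(u * v) + coef0)

u <- c(1, 2)
v <- c(3, 4)

k_lin  <- linear_kernel(u, v)                                    # 1*3 + 2*4 = 11
k_poly <- poly_kernel(u, v, gamma = 0.5, coef0 = 1, degree = 2)  # (5.5 + 1)^2 = 42.25
k_rbf  <- rbf_kernel(u, v, gamma = 0.5)                          # exp(-0.5 * 8) = exp(-4)
k_sig  <- sigmoid_kernel(u, v, gamma = 0.1, coef0 = 0)           # tanh(1.1)

print(c(linear = k_lin, polynomial = k_poly, radial = k_rbf, sigmoid = k_sig))
```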
SVM: svm
• Training
– model = svm(Class ~ X + Y + Z, data=train, type =
"C-classification", kernel = "linear")
• Prediction
– predict(model, newdata = test)
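An end-to-end sketch of the same two calls, assuming the e1071 package is installed. The built-in iris data and the 100/50 split are stand-ins chosen for illustration; they are not from the slides.

```r
pred <- NULL  # stays NULL if e1071 is not available

if (requireNamespace("e1071", quietly = TRUE)) {
  # Random train/test split of the built-in iris data.
  set.seed(7)
  idx   <- sample(nrow(iris), 100)
  train <- iris[idx, ]
  test  <- iris[-idx, ]

  # The response must be a factor for C-classification.
  model <- e1071::svm(Species ~ Sepal.Length + Sepal.Width + Petal.Length,
                      data = train, type = "C-classification",
                      kernel = "linear")

  # predict() returns a factor of predicted class labels.
  pred <- predict(model, newdata = test)
  print(table(predicted = pred, actual = test$Species))
} else {
  message("Package e1071 is not installed; skipping the SVM example.")
}
```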