Lecture: Data Mining in R
732A44 Programming in R

Logistic regression: two classes
• Consider a logistic model with one predictor:
  X = price of the car, Y = equipment
• Logistic model:
  $\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \beta_0 + \beta_1 x$
  $P(Y=1\mid X=x) = \frac{\exp(\beta_0+\beta_1 x)}{1+\exp(\beta_0+\beta_1 x)}$
• Use the function glm(formula, family, data)
  – Formula: Response ~ Model
    • The model is built from terms a+b (addition), a:b (interaction terms), a*b (addition and interaction), . (all predictors)
  – Family: specify binomial

reg <- glm(X3...Equipment ~ Price.in.SEK., family=binomial, data=mydata)

Logistic regression: several predictors
  $\log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \beta_0 + \beta_1^{T} x$
  $P(Y=1\mid X=x) = \frac{\exp(\beta_0+\beta_1^{T} x)}{1+\exp(\beta_0+\beta_1^{T} x)}$
• Example: data about contraceptive use
  – Response: matrix of successes/failures
  – Several diagnostic plots can be obtained with plot(lrfit)

Logistic regression: further comments
• Nominal logistic regression (library mlogit, function mlogit)
• Stepwise model selection: step() function
• Prediction: predict() function (see the sketch below)
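A minimal prediction sketch, assuming the reg model and the mydata data frame fitted above; the 0.5 cut-off and the cross-tabulation are illustrative choices, not taken from the slides:

# predicted probabilities P(Y=1 | X=x) from the fitted logistic model
prob <- predict(reg, newdata=mydata, type="response")
# turn probabilities into 0/1 class labels with a 0.5 threshold (our choice)
pred <- ifelse(prob > 0.5, 1, 0)
# compare predicted against observed equipment labels
table(mydata$X3...Equipment, pred)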
Smoothing splines
Minimize a penalized sum of squared residuals
  $\mathrm{RSS}(f,\lambda) = \sum_{i=1}^{N}\bigl(y_i - f(x_i)\bigr)^2 + \lambda\int\bigl(f''(t)\bigr)^2\,dt$
where λ is the smoothing parameter:
  λ = 0: any function interpolating the data
  λ = +∞: the least squares line fit

Smoothing splines
• smooth.spline(x, y, df, spar, cv, …)
  – df: degrees of freedom
  – spar: penalty parameter
  – cv: TRUE = ordinary (leave-one-out) CV, FALSE = GCV, NA = no cross-validation

plot(m2$Kilometer, m2$Price, main="df=40")
res <- smooth.spline(m2$Kilometer, m2$Price, df=40)
lines(res, col="blue")

Generalized additive models
A function g of the expected response is additive in the set of inputs, i.e.,
  $g\bigl(E(Y\mid X_1,\dots,X_p)\bigr) = s_1(X_1) + \dots + s_p(X_p)$
Example: nonlinear logistic regression of a binary response
  $\log\frac{E(Y\mid X=x)}{1-E(Y\mid X=x)} = \log\frac{P(Y=1\mid X=x)}{P(Y=0\mid X=x)} = \beta_0 + s(x)$

GAM
• gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp)
  – Formula: usual terms and spline terms s(…)
  – Method: method for selecting the smoothing parameters
  – Select: TRUE – variable selection is performed
  – Sp: smoothing parameters (maximal df)
• Library: mgcv
• Car properties example:
  bp <- gam(MPG ~ s(WT, sp=2) + s(SP, sp=1), data=m3)
  vis.gam(bp, theta=10, phi=30)
• predict.gam() can be used for predictions

GAM: smoothing components
plot(bp, pages=1)

Principal components analysis
[Figure: data points in the (X1, X2) plane with the new axes PC1 and PC2 overlaid]
Idea: introduce a new coordinate system (PC1, PC2, …) where
• The first principal component (PC1) is the direction that maximizes the variance of the projected data
• The second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed
• …
In the new coordinate system, the coefficients corresponding to the last principal components are very small, so those columns can be dropped.

Principal components analysis
• princomp(x, ...)

m4 <- m3
m4$MODEL <- c()
res <- princomp(m4)
loadings(res)
plot(res)
biplot(res)
summary(res)

Decision trees
[Figure: an example tree splitting on X1 and X2 (thresholds 9, 15, 16, 7) with 0/1 leaves, together with the corresponding partition of the (X1, X2) plane; regression tree example]

Training-validation-test
• Training-validation split (60/40):
  sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
  training <- m2[sub, ]
  validation <- m2[-sub, ]
• If a training-validation-test split is required, use a similar strategy

Decision trees by CART: growing a full tree
Library "tree".
• Create a tree: tree(formula, data, subset, split = c("deviance", "gini"), …)
  – Subset: if only a subset of cases is to be used for training
  – Split: splitting criterion
  – More parameters via the control argument
• Prune the tree with the help of a validation set:
  prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
• Prune the tree with cross-validation:
  cv.tree(object, FUN = prune.tree, K = 10, ...)
  – K is the number of folds in the cross-validation

Classification trees: CART
Example: olive oils in Italy
sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree <- tree(Area~.-Region-X, data=training)
summary(mytree)
plot(mytree, type="uniform")
text(mytree, cex=0.5)

Classification trees: CART
• Dependence of the misclassification rate on the size of the tree (see the evaluation sketch at the end of this topic):
treeseq1 <- prune.tree(mytree, newdata=validation, method="misclass")
plot(treeseq1); title("Validation")
treeseq2 <- cv.tree(mytree, method="misclass")
plot(treeseq2); title("CV")

Regression trees: CART
mytree2 <- tree(eicosenoic~linoleic+linolenic+palmitic+palmitoleic, data=training)
mytree3 <- prune.tree(mytree2, best=4)  # 4 leaves in total
print(mytree3)
summary(mytree3)
plot(mytree3)
text(mytree3)

Decision trees: other techniques
• Conditional inference trees. Library: party
  training$X <- c(); training$Area <- c()
  mytree4 <- ctree(Region~., data=training)
  print(mytree4)
  plot(mytree4, type="simple")  # gives nice plots
• CART is also available in another library, "rpart"
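To check how well a classification tree generalizes, its misclassification rate on the validation set can be computed with predict() and table(). A hedged sketch, assuming the mytree model and the validation set from the olive-oil example above; the names pred and conf are introduced here purely for illustration:

# predicted class labels on the validation set
pred <- predict(mytree, newdata=validation, type="class")
# confusion matrix: true areas vs. predicted areas
conf <- table(validation$Area, pred)
conf
# misclassification rate = share of off-diagonal entries
1 - sum(diag(conf)) / sum(conf)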
Neural network
• Input nodes, input layer
• [Hidden nodes, hidden layer(s)]
• Output nodes, output layer
• Weights
• Activation functions
• Combination functions
[Figure: feed-forward network with input nodes x1, …, xp, hidden nodes z1, …, zM and output nodes f1, …, fK]

Neural networks
• Feed-forward NNs. Library: neuralnet
• neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
  – Hidden: vector giving the number of hidden neurons in each layer
  – Rep: number of runs of the network
  – Startweights: starting weights
  – Algorithm: "backprop", "rprop+", "sag", "slr"
  – Err.fct: any function, plus "sse" and "ce" (cross-entropy)
  – Act.fct: any function, plus "logistic" and "tanh"
  – Linear.output: TRUE if no activation is applied at the output
• confidence.interval(x, alpha = 0.05): confidence intervals for the weights
• compute(x, covariate): prediction
• plot(x, …): plot the given neural network

Neural networks
• Example
mynet <- neuralnet(Region~eicosenoic+linoleic+linolenic+palmitic, data=training, rep=5, hidden=c(2,2), act.fct="tanh")
plot(mynet)
mynet$result.matrix

Neural networks
• Prediction with compute()
• Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks
• Another package, ready for a qualitative response (classical nnet):
mynet1 <- nnet(Region~eicosenoic+linoleic, data=training, size=3)
coef(mynet1)
predict(mynet1, newdata=validation)

Clustering
• The purpose is to identify (separated) groups of observations in the input space
  – K-means
  – Hierarchical
  – Density-based

K-means
• The number of clusters K must be given
• Starting seed positions are needed
• kmeans(x, centers, iter.max = 10, nstart = 1)
  – x: data frame
  – Centers: either the value of K or a set of initial cluster centers
  – Iter.max: maximum number of iterations

res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2)

K-means
• One way to visualize:
plot(m5$linoleic, m5$eicosenoic, col=res$cluster)
points(res$centers[,1], res$centers[,2], col = 1:2, pch = 8, cex=2)

Hierarchical clustering
• Agglomerative
  – Place each point into its own cluster
  – Merge the nearest clusters until only 1 cluster remains
• What does "two objects are close" mean?
  – A measure of proximity (e.g., for quantitative variables, Euclidean distance)
  • Similarity measure s_rs (= 1 if same object, < 1 otherwise)
    – e.g., correlation
  • Dissimilarity measure δ_rs (= 0 if same object, > 0 otherwise)
    – e.g., Euclidean distance

Hierarchical clustering
• hclust(d, method = "complete", members = NULL)
  – d: dissimilarity measure (e.g., from dist())
  – Method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
  Returns: a tree showing the merging sequence
• cutree(tree, k = NULL, h = NULL)
  – k: number of clusters to make
  – h: the height at which to cut
  Returns: cluster indices

Hierarchical clustering
• Example
x <- data.frame(m5$linolenic, m5$eicosenoic)
m5_dist <- dist(x)
m5_dend <- hclust(m5_dist, method="complete")
plot(m5_dend)

Hierarchical clustering
• Example
clust <- cutree(m5_dend, k=2)
plot(m5$linoleic, m5$eicosenoic, col=clust)
DO NOT forget to standardize! (see the closing sketch at the end of the lecture)

Density-based clustering
• Kernel-based density estimation. Library: pdfCluster
• pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
  – x: data to be partitioned
  – h: a vector of smoothing parameters
  – hmult: shrinkage factor
x <- data.frame(m5$linolenic, m5$eicosenoic)
res <- pdfCluster(x)
plot(res)

Reference
http://cran.r-project.org/doc/contrib/YanchangZhaorefcard-data-mining.pdf
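A closing sketch of the standardization advice above. It assumes the m5 olive-oil data used in the clustering examples; the choice of two clusters simply mirrors the earlier k-means call, and scale() (centre each variable and divide by its standard deviation) is one common way to standardize:

# standardize the two variables so they contribute on comparable scales
x <- scale(data.frame(m5$linoleic, m5$eicosenoic))
# k-means on the standardized data, K = 2 as in the earlier example
res <- kmeans(x, centers=2)
plot(x[,1], x[,2], col=res$cluster)
# the same standardized data can be passed to dist()/hclust() for hierarchical clustering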