Biostatistics 278, discussion 3:
R code for supervised learning: k-nearest neighbor predictors, linear discriminant analysis, rpart, error estimation

Steve Horvath
E-mail: shorvath@mednet.ucla.edu
http://www.ph.ucla.edu/biostat/people/horvath.htm

The notes are based in part on R code described in http://bioinformatics.med.yale.edu/proteomics/BioSupp1.html by the Yale NHLBI/Proteomics Center.
All analyses were done within the R statistical software.
R links: http://www.r-project.org/ for general information, and http://cran.r-project.org/ for downloading.

We will use the following functions/libraries:
1) LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis)
   R package: MASS; functions: lda, qda
2) KNN (k-nearest neighbor)
   R package: class; function: knn
3) Bagging, boosting classification trees
   R packages: rpart, tree; functions: rpart, tree
   Our bagging/boosting programs are based on the functions "rpart" and "tree" from these two packages.
4) SVM (Support Vector Machine)
   R package: e1071; function: svm
   The underlying C code is from libsvm.
5) RF (Random forest)
   R package: randomForest; function: randomForest
   The underlying Fortran code is from Leo Breiman.
6) Error estimation: cv-10 (10-fold cross-validation); .632+
   R package: ipred, which requires packages mlbench, survival, nnet, mvtnorm.
   ipred provides very convenient wrappers to various statistical methods.

Download the relevant libraries as follows:
i) click the button "Packages" on the R session bar
ii) choose "Install packages from CRAN.."
    Hint: the computer has to be connected to the internet.
iii) To find out the contents of a library, type help(package="ipred")
iv) read the libraries into the R session by using the library() command, see below.

R SESSION
library(MASS)
library(class)
library(rpart) # recursive partitioning, tree predictors....
library(tree)
library(e1071)
library(randomForest)
library(mlbench); library(survival); library(nnet); library(mvtnorm)
library(ipred)

# the following function takes a table and computes the error rate.
# it assumes that the rows are predicted class outcomes while the
# columns are observed (test set) outcomes
if (exists("misclassification.rate")) rm(misclassification.rate)
misclassification.rate=function(tab){
  num1=sum(diag(tab))
  denom1=sum(tab)
  signif(1-num1/denom1,3)
}

# Chapter 1: Simulated data set with 50 observations.
# set a random seed for reproducing results later, any integer
set.seed(123)
# Binary outcome, 25 observations are class 1, 25 are class 2
no.obs=50
# class outcome
y=rep(c(1,2),c(no.obs/2,no.obs/2))
# the following covariate contains a signal
x1=y+0.8*rnorm(no.obs)
# the remaining covariates contain noise (random permutations of x1)
x2=sample(x1)
x3=sample(x1)
x4=sample(x1)
x5=sample(x1)
dat1=data.frame(y,x1,x2,x3,x4,x5)
dim(dat1)
names(dat1)

# RPART (tree analysis)
rp1=rpart(factor(y)~x1+x2+x3+x4+x5,data=dat1)
plot(rp1)
text(rp1)

[Figure: tree plot with a single split x1 < 1.421; left leaf predicts class 1, right leaf predicts class 2]

summary(rp1)
Call:
rpart(formula = factor(y) ~ x1 + x2 + x3 + x4 + x5, data = dat1)
  n= 50

    CP nsplit rel error xerror      xstd
1 0.64      0      1.00   1.36 0.1319394
2 0.01      1      0.36   0.40 0.1131371

Node number 1: 50 observations,    complexity param=0.64
  predicted class=1  expected loss=0.5
    class counts:    25    25
   probabilities: 0.500 0.500
  left son=2 (24 obs) right son=3 (26 obs)
  Primary splits:
      x1 < 1.421257 to the left,  improve=10.2564100, (0 missing)
      x4 < 2.640618 to the left,  improve= 2.0764120, (0 missing)
      x3 < 0.525794 to the left,  improve= 0.7475083, (0 missing)
      x2 < 1.686658 to the left,  improve= 0.6493506, (0 missing)
      x5 < 1.089018 to the right, improve= 0.4010695, (0 missing)
  Surrogate splits:
      x4 < 1.964868  to the left, agree=0.64, adj=0.250, (0 split)
      x2 < 0.7332517 to the left, agree=0.60, adj=0.167, (0 split)
      x5 < 0.820739  to the left, agree=0.58, adj=0.125, (0 split)
      x3 < 0.7332517 to the left, agree=0.56, adj=0.083, (0 split)

Node number 2: 24 observations
  predicted class=1  expected loss=0.1666667
    class counts:    20     4
   probabilities: 0.833 0.167

Node number 3: 26 observations
  predicted class=2  expected loss=0.1923077
    class counts:     5    21
   probabilities: 0.192 0.808

# Let us now eliminate the signal variable!!!
# further we choose 4 fold cross-validation and a cost complexity parameter=0
rp1=rpart(factor(y)~x2+x3+x4+x5,control=rpart.control(xval=4, cp=0), data=dat1)
plot(rp1)
text(rp1)

[Figure: tree plot with splits x4 < 2.641 and x3 < 1.883; leaves labeled 2, 1, 2]

Note that the above tree overfits the data since its split variables x4 and x3 (like all of x2,...,x5) have nothing to do with y!
From the following output you can see that the cross-validated relative error rate is 1.28, i.e. it is worse than the naive predictor (stump tree) that assigns each observation to class 1.

summary(rp1)
Call:
rpart(formula = factor(y) ~ x2 + x3 + x4 + x5, data = dat1, control = rpart.control(xval = 4, cp = 0))
  n= 50

    CP nsplit rel error xerror      xstd
1 0.20      0      1.00   1.12 0.1403994
2 0.12      1      0.80   1.24 0.1372880
3 0.00      2      0.68   1.28 0.1357645
ETC

# let us cross-tabulate learning set predictions versus true learning set outcomes:
tab1=table(predict(rp1,newdata=dat1,type="class"),dat1$y)
tab1
     1  2
  1 18 10
  2  7 15
misclassification.rate(tab1)
[1] 0.34
# Note the error rate is unrealistically low, given that the predictors have nothing to do
# with the outcome. This illustrates that the "resubstitution" error rate is biased downward.

# Let's create a test set as follows
ytest=sample(1:2,100,replace=TRUE)
x1test=ytest+0.8*rnorm(100)
dattest=data.frame(y=ytest, x1=sample(x1test), x2=sample(x1test),
                   x3=sample(x1test), x4=sample(x1test), x5=sample(x1test))
# Now let's cross-tabulate the test set predictions with the test set outcomes:
tab1=table(predict(rp1,newdata=dattest,type="class"),dattest$y)
tab1
     1  2
  1 34 26
  2 20 20
misclassification.rate(tab1)
[1] 0.46
# this test set error rate is realistic given that the predictor contained no information.
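The gap between the resubstitution error (0.34) and the test set error (0.46) comes from a single random data set. The following is a minimal sketch, not part of the original notes, that first re-derives both error rates by hand from the printed tables and then repeats the noise-only experiment many times. The helper name one.rep, the number of repetitions (200) and the choice xval=0 are assumptions made for illustration.

```r
library(rpart)

misclassification.rate <- function(tab) {
  signif(1 - sum(diag(tab)) / sum(tab), 3)
}

# Hand check of the two tables printed above (rows = predicted, columns = observed).
# matrix() fills column by column, so c(18, 7, 10, 15) gives rows (18, 10) and (7, 15).
tab.resub <- matrix(c(18, 7, 10, 15), nrow = 2)
tab.test  <- matrix(c(34, 20, 26, 20), nrow = 2)
err.resub <- misclassification.rate(tab.resub)  # 1 - 33/50
err.test  <- misclassification.rate(tab.test)   # 1 - 54/100

# Hypothetical helper: one repetition of the noise-only experiment.
# The covariates are permuted, so they carry no information about y.
one.rep <- function(n.train = 50, n.test = 100) {
  make.noise <- function(n) {
    y <- sample(1:2, n, replace = TRUE)
    x <- y + 0.8 * rnorm(n)
    data.frame(y = factor(y, levels = 1:2),
               x2 = sample(x), x3 = sample(x), x4 = sample(x), x5 = sample(x))
  }
  train <- make.noise(n.train)
  test  <- make.noise(n.test)
  # cp = 0 grows the tree without cost-complexity pruning; xval = 0 skips internal CV
  fit <- rpart(y ~ ., data = train, control = rpart.control(cp = 0, xval = 0))
  err <- function(d) mean(predict(fit, newdata = d, type = "class") != d$y)
  c(resub = err(train), test = err(test))
}

set.seed(123)
res <- replicate(200, one.rep())   # 2 x 200 matrix of error rates
avg <- rowMeans(res)
print(avg)
```

On average the resubstitution error should sit clearly below the test error, while the test error should hover near 0.5, the error rate of guessing.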
# Linear Discriminant Analysis
dathelp=data.frame(x1,x2,x3,x4,x5)
lda1=lda(factor(y)~ . , data=dathelp, CV=FALSE, method="moment")
lda1
Call:
lda(factor(y) ~ ., data = dathelp, CV = FALSE, method = "moment")

Prior probabilities of groups:
  1   2
0.5 0.5

Group means:
         x1       x2       x3       x4       x5
1 0.9733358 1.474684 1.450246 1.405641 1.491884
2 2.0817099 1.580361 1.604800 1.649404 1.563162

Coefficients of linear discriminants:
          LD1
x1 1.31534493
x2 0.12657254
x3 0.16943895
x4 0.06726993
x5 0.07174623

# resubstitution error
tab1=table(predict(lda1)$class,y)
tab1
   y
     1  2
  1 19  6
  2  6 19
misclassification.rate(tab1)
[1] 0.24

### leave one out cross-validation analysis
lda1=lda(factor(y)~.,data=dathelp,CV=TRUE, method="moment")
tab1=table(lda1$class,y)
tab1
   y
     1  2
  1 18  7
  2  7 18
misclassification.rate(tab1)
[1] 0.28

# Chapter 2: The Iris Data
data(iris)
### parameter values setup
cv.k = 10   ## 10-fold cross-validation
B = 100     ## using 100 Bootstrap samples in .632+ error estimation
C.svm = 10  ## Cost parameter for svm, needs to be tuned for different datasets

# Linear Discriminant Analysis
ip.lda <- function(object, newdata) predict(object, newdata = newdata)$class
# 10 fold cross-validation
errorest(Species ~ ., data=iris, model=lda, estimator="cv",
         est.para=control.errorest(k=cv.k), predict=ip.lda)$err
[1] 0.02
# The above is the 10 fold cross validation error rate, which depends
# on how the observations are assigned to 10 random bins!

# Bootstrap error estimator .632+
errorest(Species ~ ., data=iris, model=lda, estimator="632plus",
         est.para=control.errorest(nboot=B), predict=ip.lda)$err
[1] 0.02315164
# The above is the bootstrap estimate of the error rate.
# Note that it is comparable to
# the cross-validation estimate of the error rate

# Quadratic Discriminant Analysis
ip.qda <- function(object, newdata) predict(object, newdata = newdata)$class
# 10 fold cross-validation
errorest(Species ~ ., data=iris, model=qda, estimator="cv",
         est.para=control.errorest(k=cv.k), predict=ip.qda)$err
[1] 0.02666667
# Bootstrap error estimator .632+
errorest(Species ~ ., data=iris, model=qda, estimator="632plus",
         est.para=control.errorest(nboot=B), predict=ip.qda)$err
[1] 0.02373598
# Note that both error rate estimates are higher in QDA than in LDA

# k-nearest neighbor predictors
# Currently, there is an error in the underlying wrapper code for "knn" in package ipred.
# The error is due to the name conflict of variable "k" used in the wrapper function
# "ipredknn" and the original function "knn".
# We need to change variable "k" to something else (here "kk") to avoid the conflict.
bwpredict.knn <- function(object, newdata) predict.ipredknn(object, newdata, type="class")

## 10 fold cross validation, 1 nearest neighbor
errorest(Species ~ ., data=iris, model=ipredknn, estimator="cv",
         est.para=control.errorest(k=cv.k), predict=bwpredict.knn, kk=1)$err
[1] 0.03333333
## 10 fold cross validation, 3 nearest neighbors
errorest(Species ~ ., data=iris, model=ipredknn, estimator="cv",
         est.para=control.errorest(k=cv.k), predict=bwpredict.knn, kk=3)$err
[1] 0.04
## .632+, 1 nearest neighbor
errorest(Species ~ ., data=iris, model=ipredknn, estimator="632plus",
         est.para=control.errorest(nboot=B), predict=bwpredict.knn, kk=1)$err
[1] 0.04141241
## .632+, 3 nearest neighbors
errorest(Species ~ ., data=iris, model=ipredknn, estimator="632plus",
         est.para=control.errorest(nboot=B), predict=bwpredict.knn, kk=3)$err
[1] 0.03964991
# Note that the two estimators disagree about the ranking: under the .632+ estimate the
# k=3 nearest neighbor predictor has a slightly lower error rate than the k=1 NN predictor
# (0.0396 versus 0.0414), while under 10 fold cross-validation it is slightly higher
# (0.04 versus 0.0333). The differences are small.
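The name conflict described above can also be sidestepped by running the cross-validation loop by hand and calling knn from package class directly. The following is a minimal sketch under that assumption; the function name cv.knn.error and the seed are hypothetical, and the fold assignment is not stratified by species, so the numbers need not match the errorest results.

```r
library(class)
data(iris)

# Hypothetical helper: plain k-fold cross-validation error of a k-NN classifier.
# x: data frame or matrix of covariates, y: factor of class labels,
# k: number of neighbors, folds: number of cross-validation bins.
cv.knn.error <- function(x, y, k, folds = 10) {
  n <- nrow(x)
  # randomly assign each observation to one of the bins
  fold.id <- sample(rep(seq_len(folds), length.out = n))
  wrong <- 0
  for (f in seq_len(folds)) {
    test <- fold.id == f
    pred <- knn(train = x[!test, , drop = FALSE],
                test  = x[test, , drop = FALSE],
                cl    = y[!test],
                k     = k)
    wrong <- wrong + sum(as.character(pred) != as.character(y[test]))
  }
  wrong / n
}

set.seed(1)
err.k1 <- cv.knn.error(iris[, 1:4], iris$Species, k = 1)
err.k3 <- cv.knn.error(iris[, 1:4], iris$Species, k = 3)
print(c(k1 = err.k1, k3 = err.k3))
```

Because the bins are random, rerunning with a different seed changes the estimates slightly, which is the same bin dependence noted for errorest.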
# Random forest predictor
# out-of-bag error estimation
randomForest(Species ~ ., data=iris, mtry=2, ntree=B, keep.forest=FALSE)$err.rate[B]
[1] 0.04
## compare this to 10 fold cross-validation
errorest(Species ~ ., data=iris, model=randomForest, estimator="cv",
         est.para=control.errorest(k=cv.k), ntree=B, mtry=2)$err
[1] 0.05333333

# bagging rpart trees
# Use function "bagging" in package "ipred", which calls "rpart" for classification.
## The error returned is the out-of-bag estimate.
bag1=bagging(Species ~ ., data=iris, nbagg=B,
             control=rpart.control(minsplit=2, cp=0, xval=0),
             comb=NULL, coob=TRUE, ns=dim(iris)[1], keepX=TRUE)
bag1
Bagging classification trees with 100 bootstrap replications

Call: bagging.data.frame(formula = Species ~ ., data = iris, nbagg = B,
    control = rpart.control(minsplit = 2, cp = 0, xval = 0),
    comb = NULL, coob = TRUE, ns = dim(iris)[1], keepX = TRUE)

Out-of-bag estimate of misclassification error:  0.06

# The following table lists the out-of-bag predictions versus the observed species
table(predict(bag1),iris$Species)
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46         5
  virginica       0          4        45
# Note that the OOB error rate is 0.06=9/150

# support vector machine (SVM)
## 10 fold cross-validation, note the misclassification cost
errorest(Species ~ ., data=iris, model=svm, estimator="cv",
         est.para=control.errorest(k = cv.k), cost=C.svm)$error
[1] 0.03333333
## .632+
errorest(Species ~ ., data=iris, model=svm, estimator="632plus",
         est.para=control.errorest(nboot = B), cost=C.svm)$error
[1] 0.03428103

Chapter 3: How to filter genes and use filtered genes in kNN predictors

Let's first install Bioconductor as follows.
1) Copy and paste the following functions into your R session.

getBioC <- function (libName = "default", relLevel = "release", destdir,
                     versForce=TRUE, verbose = TRUE, bundle = TRUE,
                     force=TRUE, getAllDeps=TRUE, method="auto") {
    ## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ## !!!
    ## Always change version number when updating this file
    getBioCVersion <- "1.2.52"
    writeLines(paste("Running getBioC version ",getBioCVersion,"....\n",
                     "If you encounter problems, first make sure that\n",
                     "you are running the latest version of getBioC()\n",
                     "which can be found at: www.bioconductor.org/getBioC.R",
                     "\n\n",
                     "Please direct any concerns or questions to",
                     " bioconductor@stat.math.ethz.ch.\n",sep=""))
    ## !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

    MINIMUMRel <- "1.8.1"
    MINIMUMDev <- "1.9.0"
    rInfo <- R.Version()
    rVers <- paste(rInfo$major,rInfo$minor,sep=".")
    ## !!! Again, using compareVersion() here as we
    ## !!! have not bootstrapped w/ reposTools for versionNumber
    ## !!! class
    MINIMUM <- switch(relLevel,
                      "devel"=MINIMUMDev,
                      MINIMUMRel)
    if (compareVersion(rVers,MINIMUM) < 0) {
        stop(paste("\nYou are currently running R version ",rVers,
                   ", however R version ",MINIMUM,
                   " is required.",sep=""))
    }

    ## Check the specified libName. If it is 'all' we want to warn
    ## that they're about ready to get a metric ton of packages.
    if (libName == "all") {
        ## Make sure they want to get all packages
        msg <- paste("\nYou are downloading all of the Bioconductor",
                     " packages and any dependencies.\n",
                     "Depending on your system this will be about ",
                     "60-65 packages and be quite large.\n",
                     "\nAre you sure that you want to do this?", sep="")
        out <- GBCuserQuery(msg,c("y","n"))
        if (out == "n") {
            cat("\nNot downloading. if you wish to see other download options,\n",
                "please go to the URL:",
                " http://www.bioconductor.org/faq.html#getBioC\n", sep="")
            return(invisible(NULL))
        }
    }

    curLibPaths <- .libPaths()
    on.exit(.libPaths(curLibPaths), add=TRUE)

    ## make sure to expand out the destdir param
    if (!missing(destdir))
        destdir <- path.expand(destdir)

    ## Check the specified relLevel
    validLevels <- c("release", "devel")
    if (!(relLevel %in% validLevels))
        stop(paste("Invalid relLevel parameter: ",relLevel,
                   ". Must be one of: ",
                   paste(validLevels, collapse=", "),
                   ".", sep=""))

    ## Stifle the "connected to www.... garbage output
    curNetOpt <- getOption("internet.info")
    on.exit(options(internet.info=curNetOpt), add=TRUE)
    options(internet.info=3)

    ## First check to make sure they have HTTP capability. If they do
    ## not, there is no point to this exercise.
    http <- as.logical(capabilities(what="http/ftp"))
    if (http == FALSE) {
        stop(paste("Your R is not currently configured to allow HTTP",
                   "\nconnections, which is required for getBioC to",
                   "work properly."))
    }

    ## find out where we think that bioC is
    bioCoption <- getOption("BIOC")
    if (is.null(bioCoption))
        bioCoption <- "http://www.bioconductor.org"

    ## Now check to see if we can connect to the BioC website
    biocURL <- url(paste(bioCoption,"/main.html",sep=""))
    options(show.error.messages=FALSE)
    test <- try(readLines(biocURL)[1])
    options(show.error.messages=TRUE)
    if (inherits(test,"try-error"))
        stop(paste("Your R can not connect to the Bioconductor",
                   "website, which is required for getBioc to",
                   "work properly. The most likely cause of this",
                   "is the internet configuration of R"))
    else
        close(biocURL)

    ## Get the destination directory
    if (missing(destdir)) {
        lP <- .libPaths()
        if (length(lP) == 1)
            destdir <- lP
        else {
            dDval <- menu(lP,
                          title="Please select an installation directory:")
            if (dDval == 0)
                stop("No installation directory selected")
            else
                destdir <- lP[dDval]
        }
    }
    else
        .libPaths(destdir)

    if (length(destdir) > 1)
        stop("Invalid destdir parameter, must be of length 1")

    PLATFORM <- .Platform$OS.type
    if (file.access(destdir,mode=0) < 0)
        stop(paste("Directory",destdir,"does not seem to exist.\n",
                   "Please check your 'destdir' parameter and try again."))
    if (file.access(destdir,mode=2) < 0)
        stop(paste("You do not have write access to",destdir,
                   "\nPlease check your permissions or provide",
                   "a different 'destdir' parameter"))

    messages <- paste("Your packages are up to date.",
                      "No downloading/installation will be performed.",
                      sep="\n")

    packs <- NULL
    ## Get the names of packages specified by the user
    if(bundle){
        for(i in libName){
            packs <- c(packs, getPackNames(i))
        }
    }else{
        packs <- libName
    }

    ## Download and install reposTools and Biobase first
    ## Get the package description file from Bioconductor
    getReposTools <- getReposTools(relLevel, PLATFORM, destdir,
                                   method=method, bioCoption=bioCoption)
    require(reposTools) || stop("Needs reposTools to continue")

    ## Get Repository entries from Bioconductor
    urlPath <- switch(PLATFORM,
                      "unix"= "/Source",
                      "/Win32")
    bioCRepURL <- getReposURL(relLevel,urlPath, bioCoption)
    bioCEntries <- getReposEntry(bioCRepURL)

    curOps <- getOption("repositories2")
    on.exit(options(repositories2=curOps), add=TRUE)
    repNames <- names(curOps)
    curOps <- gsub("http://www.bioconductor.org",bioCoption,curOps)
    names(curOps) <- repNames
    if (relLevel == "devel")
        optReps <- curOps[c("BIOCDevel","BIOCData",
                            "BIOCCourses","BIOCcdf",
                            "BIOCprobes", "CRAN",
                            "BIOCOmegahat")]
    else
        optReps <- curOps[c("BIOCRel1.3","BIOCData",
                            "BIOCCourses","BIOCcdf",
                            "BIOCprobes", "CRAN",
                            "BIOCOmegahat")]
    options(repositories2=optReps)

    ## Sync lib list
    syncLocalLibList(destdir)

    reposToolsVersion <- package.description("reposTools",
                                             lib.loc=destdir,
                                             fields="Version")
    if (compareVersion(reposToolsVersion, "1.3.12") < 0) {
        ## This is the old style reposTools, need to do old style
        ## getBioC
        out <- install.packages2(packs, bioCEntries, lib=destdir,
                                 type = ifelse(PLATFORM == "unix",
                                               "Source", "Win32"),
                                 versForce=versForce, recurse=FALSE,
                                 getAllDeps=getAllDeps, method=method,
                                 force=force, searchOptions=TRUE)
    }
    else {
        ## 'packs' might be NULL, implying everything in the
        ## main repository (release/devel)
        if (is.null(packs))
            packs <- repPkgs(bioCEntries)

        ## Need to determine which 'packs' are already
        ## installed and which are not. Call install on the latter
        ## and update on the former.
        load.locLib(destdir)
        locPkgs <- unlist(lapply(locLibList, Package))
        havePkgs <- packs %in% locPkgs
        installPkgs <- packs[!havePkgs]
        updatePkgs <- packs[havePkgs]

        out <- new("pkgStatusList", statusList=list())
        if (length(updatePkgs) > 0) {
            updateList <- update.packages2(updatePkgs, bioCEntries,
                                           libs=destdir,
                                           type = ifelse(PLATFORM == "unix",
                                                         "Source", "Win32"),
                                           versForce=versForce, recurse=FALSE,
                                           getAllDeps=getAllDeps,
                                           method=method, force=force,
                                           searchOptions=TRUE)
            statusList(out) <- updateList[[destdir]]
        }
        syncLocalLibList(destdir, quiet=TRUE)
        if (length(installPkgs) > 0)
            statusList(out) <- install.packages2(installPkgs, bioCEntries,
                                                 lib=destdir,
                                                 type = ifelse(PLATFORM == "unix",
                                                               "Source", "Win32"),
                                                 versForce=versForce,
                                                 recurse=FALSE,
                                                 getAllDeps=getAllDeps,
                                                 method=method, force=force,
                                                 searchOptions=TRUE)
    }

    if (length(updated(out)) == 0)
        print("All requested packages are up to date")
    else
        print(out)

    ## Windows doesn't currently have Rgraphviz or rhdf5
    if (PLATFORM != "windows") {
        if (libName %in% c("all","prog","graph")) {
            otherPkgsOut <- paste("Packages Rgraphviz and rhdf5 require",
                                  " special libraries to be installed.\n",
                                  "Please see the URL ",
                                  "http://www.bioconductor.org/faq.html#Other Notes",
                                  " for\n",
                                  "more details on installing these packages",
                                  " if they fail\nto install properly\n\n",
                                  sep="")
            cat(otherPkgsOut)
        }
    }

    ## If they are using 'default', alert the user that they have not
    ## gotten all packages
    if (libName == "default") {
        out <- paste("You have downloaded a default set of packages.\n",
                     "If you wish to see other download options, please",
                     " go to the URL:\n",
                     "http://www.bioconductor.org/faq.html#getBioC\n",
                     sep="")
        cat(out)
    }
}

getReposTools <- function(relLevel, platform, destdir=NULL,
                          method="auto", bioCoption) {
    ## This function will check to see if reposTools needs to be
    ## updated, and if so will download/install it
    PACKAGES <- getPACKAGES(relLevel, bioCoption)

    ### check reposTools ala checkLibs
    if (checkReposTools(PACKAGES)) {
        sourceUrl <- getDLURL("reposTools", PACKAGES, platform)
        ## Get the package file name for reposTools
        fileName <- getFileName(sourceUrl, destdir)
        ## Try the connection first before downloading
        options(show.error.messages = FALSE)
        tryMe <- try(url(sourceUrl, "r"))
        options(show.error.messages = TRUE)
        if(inherits(tryMe, "try-error"))
            stop("Could not get the required package reposTools")
        else {
            ## Close the connection for checking
            close(tryMe)
            ## Download and install
            print("Installing reposTools ...")
            download.file(sourceUrl, fileName,
                          mode = getMode(platform), quiet = TRUE,
                          method=method)
            installPack(platform, fileName, destdir)
            if (!("reposTools" %in%
                  installed.packages(lib.loc=destdir)[,"Package"]))
                stop("Failed to install package reposTools")
            unlink(fileName)
        }
        return(invisible(NULL))
    }
}

packNameOutput <- function() {
    out <- paste("\ndefault:\ttargets affy, cdna and exprs.\n",
                 "exprs:\t\tpackages Biobase, annotate, genefilter, ",
                 "geneplotter, edd, \n\t\tROC, multtest, pamr and limma.\n",
                 "affy:\t\tpackages affy, affydata, ",
                 "annaffy, affyPLM, makecdfenv,\n\t\t",
                 "matchprobes and vsn plus 'exprs'.\n",
                 "cdna:\t\tpackages marrayInput, marrayClasses, ",
                 "marrayNorm, marrayPlots,\n\t\tmarrayTools, vsn,",
                 " plus 'exprs'.\nprog:\t\tpackages graph, hexbin, ",
                 "externalVector.\n",
                 "graph:\t\tpackages graph, Rgraphviz, RBGL",
                 "\nwidgets:\tpackages tkWidgets, widgetTools,",
                 " DynDoc.\ndesign:\t\tpackages daMA and factDesign\n",
                 "externalData:\tpackages externalVector and rhdf5.\n",
                 "database:\tAnnBuilder, SAGElyzer, Rdbi and ",
                 "RdbiPgSQL.\n",
                 "analyses:\tpackages Biobase, ctc, daMA, edd, ",
                 "factDesign,\n\t\tgenefilter, geneplotter, globaltest, ",
                 "gpls, limma,\n\t\tMAGEML, multtest, pamr, ROC, ",
                 "siggenes and splicegear.\n",
                 "annotation:\tpackages annotate, AnnBuilder, ",
                 "humanLLMappings\n\t\tKEGG, GO, SNPtools, ",
                 "makecdfenv and ontoTools.",
                 "\nall:\t\tAll of the Bioconductor packages.\n",
                 sep="")
    out
}

## This function puts together a vector containing Bioconductor's
## packages based on a defined libName
getPackNames <- function(libName) {
    error <- paste("The library ", libName, " is not valid.\n",
                   "Usage:\n", packNameOutput())
    AFFY <- c("affy", "vsn", "affydata", "annaffy", "affyPLM",
              "matchprobes", "gcrma", "makecdfenv")
    CDNA <- c("marrayInput", "marrayClasses", "marrayNorm",
              "marrayPlots", "marrayTools", "vsn")
    EXPRS <- c("Biobase", "annotate", "genefilter", "geneplotter",
               "edd", "ROC", "multtest", "pamr", "limma", "MAGEML",
               "siggenes", "globaltest")
    PROG <- c("graph", "hexbin", "externalVector", "DynDoc", "Ruuid")
    GRAPH <- c("graph", "Rgraphviz", "RBGL")
    WIDGETS <- c("tkWidgets", "widgetTools", "DynDoc")
    DATABASE <- c("AnnBuilder", "SAGElyzer", "Rdbi", "RdbiPgSQL")
    DESIGN <- c("daMA", "factDesign")
    ANNOTATION <- c("annotate", "AnnBuilder", "humanLLMappings",
                    "KEGG", "GO", "SNPtools", "makecdfenv", "ontoTools")
    ANALYSES <- c("Biobase", "ctc", "daMA", "edd", "factDesign",
                  "genefilter", "geneplotter", "globaltest", "gpls",
                  "limma", "MAGEML", "multtest", "pamr", "ROC",
                  "siggenes", "splicegear")
    EXTERNALDATA <- c("externalVector", "rhdf5")

    packs <- switch(tolower(libName),
                    "all"=NULL,
                    "default" = c(EXPRS, AFFY, CDNA),
                    "exprs" = EXPRS,
                    "affy" = c(EXPRS, AFFY),
                    "cdna" = c(EXPRS, CDNA),
                    "prog" = PROG,
                    "graph" = GRAPH,
                    "widgets" = WIDGETS,
                    "design" = DESIGN,
                    "annotation" = ANNOTATION,
                    "database" = DATABASE,
                    "analyses" = ANALYSES,
                    "externaldata" = EXTERNALDATA,
                    stop(error))
    packs <- unique(packs)
    packs
}

## Returns the mode that is going to be used to call download.file
## depending on the platform
getMode <- function(platform){
    switch(platform,
           "unix" = return("w"),
           "windows" = return("wb"),
           stop(paste(platform,"is not currently supported")))
}

## Installs a given package
installPack <- function(platform, fileName, destdir=NULL){
    if(platform == "unix"){
        cmd <- paste(file.path(R.home(), "bin", "R"), "CMD INSTALL")
        if (!is.null(destdir))
            cmd <- paste(cmd, "-l", destdir)
        cmd <- paste(cmd, fileName)
        system(cmd)
    }else{
        if(platform == "windows"){
            zip.unpack(fileName, .libPaths()[1])
        }else{
            stop(paste(platform,"is not currently supported"))
        }
    }
}

## Returns the source url for a given package
getDLURL <- function(pakName, rep, platform){
    temp <- rep[rep[, "Package"] == pakName]
    names(temp) <- colnames(rep)
    switch(platform,
           "unix" = return(temp[names(temp) == "SourceURL"]),
           "windows" = return(temp[names(temp) == "WIN32URL"]),
           stop(paste(platform,"is not currently supported")))
}

## Returns the description file (PACKAGE) that contains the name,
## version number, url, ... of Bioconductor packages.
getPACKAGES <- function (relLevel, bioCoption){
    URL <- getReposURL(relLevel,"/PACKAGES", bioCoption)
    con <- url(URL)
    options(show.error.messages = FALSE)
    tryMe <- try(read.dcf(con))
    options(show.error.messages = TRUE)
    if(inherits(tryMe, "try-error"))
        stop(paste("The url:",URL,
                   "does not seem to have a valid PACKAGES file."))
    close(con)
    return(tryMe)
}

## Returns the url for some files that are needed to perform the
## functions. name is added to the end of the URL
getReposURL <- function(relLevel, name="", bioCoption){
    URL <- switch(relLevel,
                  "devel"= paste(bioCoption,
                                 "repository/devel/package", name, sep ="/"),
                  "release"=paste(bioCoption,
                                  "repository/release1.3", "/package",
                                  name,sep="/"),
                  character())
    URL
}

## Returns the file name with the destination path (if any) attached
getFileName <- function(url, destdir){
    temp <- unlist(strsplit(url, "/"))
    if(is.null(destdir))
        return(temp[length(temp)])
    else
        return(file.path(destdir, temp[length(temp)]))
}

## getBioC has to check to see if "reposTools" has
## already been loaded and generates a message if any has.
checkReposTools <- function(PACKAGES){
    pkgVers <- PACKAGES[,"Version"]
    ## First get package version
    ## !!!
    ## Not yet using VersionNumber classes here
    ## !!! bootstrapping issue as this comes from reposTools
    ## !!! use compareVersion for now
    if ("reposTools" %in% installed.packages()[,"Package"]) {
        curVers <- package.description("reposTools",fields="Version")
        if (compareVersion(curVers,pkgVers) < 0) {
            if ("package:reposTools" %in% search()) {
                error <- paste("reposTools is out of date but",
                               " currently loaded in your R session.",
                               "\nIf you would like to continue,",
                               " please either detach this package",
                               " or restart\nyour R session before",
                               " running getBioC.", sep="")
                stop(error)
            }
        }
        else
            return(FALSE)
    }
    return(TRUE)
}

## From reposTools
GBCuserQuery <- function(msg, allowed=c("yes","y","no","n")) {
    ## Prompts the user with a string and for an answer
    ## repeats until it gets allowable input
    repeat {
        allowMsg <- paste("[",paste(allowed,collapse="/"), "] ", sep="")
        outMsg <- paste(msg,allowMsg)
        cat(outMsg)
        ans <- readLines(n=1)
        if (ans %in% allowed)
            break
        else
            cat(paste(ans,"is not a valid response, try again.\n"))
    }
    ans
}

2) Now activate the installation by typing
getBioC()

3) In the following we will use the following Bioconductor libraries
library(Biobase)
library(genefilter)

# Let's read in the data
# change working directory to where the data are, then type
dat1=read.csv("MicroarrayExample.csv",header=T,row.names=1)

# Now we will use the filter functions that are described in vignette 1.
# Vignettes are pdf files that can be accessed by typing
openVignette("genefilter")

TASK 1: Let's select genes that are expressed above 500 in at least 10 samples.
# create a filter function
f1=kOverA(10,500)
# assemble the filter functions into a filtering function
ffun=filterfun(f1)
# apply the filtering function to the expression matrix
which=genefilter(dat1,ffun)
table(which)
which
FALSE  TRUE
  880   120
# To arrive at the gene names of the corresponding genes type
which[which]

TASK 2: Let us now filter genes by a multigroup comparison test (ANOVA)
# Recall that the following tissues are in the data set
names(dat1)
 [1] "E1"  "E2"  "E3"  "E4"  "E5"  "E6"  "E7"  "E8"  "E9"  "E10" "E11" "E12"
[13] "E13" "E14" "E15" "E16" "E17" "E18" "E19" "B1"  "B2"  "B3"  "B4"  "B5"
[25] "N1"  "N2"  "N3"  "N4"  "N5"  "N6"  "N7"  "Q1"  "Q2"  "Q3"
# Therefore we define the following 4 tissue types
tissue1=factor(rep(c("E","B","N","Q"),c(19,5,7,3)))

# Now we define the Anova filter,
# which selects (keeps) genes that are significantly different across the 4 tissues (p<0.01)
Afilter=Anova(tissue1,0.01)
aff=filterfun(Afilter)
which2=genefilter(dat1,aff)
table(which2)

# TASK 3: Let us now select genes that are significantly different across
# the tissues AND are expressed above 100 in at least 10 samples
Afilter=Anova(tissue1,0.01)
f1=kOverA(10,100)
aff=filterfun(Afilter,f1)
which2=genefilter(dat1,aff)
table(which2)
which2
FALSE  TRUE
  483   517

library(class)
# Here we use a new definition of a cross validation function for k-nearest neighbor
# compare it to knn.cv in the class library
if (exists("knnCV")) rm(knnCV)
knnCV = function(EXPR, selectfun, cov, Agg, pselect = 0.01, Scale = FALSE) {
  nc <- ncol(EXPR)
  outvals <- rep(NA, nc)
  for (i in 1:nc) {
    # hold out sample i, select genes on the remaining samples only
    v1 <- EXPR[, i]
    expr <- EXPR[, -i]
    glist <- selectfun(expr, cov[-i], p = pselect)
    expr <- expr[glist, ]
    if (Scale) {
      expr <- scale(expr)
      v1 <- as.vector(scale(v1[glist]))
    }
    else v1 <- v1[glist]
    out <- paste("iter ", i, " num genes= ", sum(glist), sep = "")
    print(out)
    # record which genes were selected in this fold
    Aggregate(row.names(expr), Agg)
    # k = 5 selects the 5 nearest neighbors
    if (length(v1) == 1)
      outvals[i] <- knn(expr, v1, cov[-i], k = 5)
    else outvals[i] <- knn(t(expr), v1, cov[-i], k = 5)
  }
  return(outvals)
}

if (exists("gfun")) rm(gfun)
gfun <- function(expr, cov, p = 0.05, k1=5, level1=100) {
  f2 <- Anova(cov, p = p)
  f3 = kOverA(k1,level1)
  ffun <- filterfun(f2,f3)
  which <- genefilter(expr, ffun)
}
Agg <- new("aggregator")

# Now we do leave one out cross-validation where
# genes are selected on each training set!
testcase <- knnCV(dat1[1:200,], gfun, tissue1, Agg, pselect = 0.05)
testcase
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 3 1 1 4 3 3 3 3 3 3 3 3 1 4
# the predictions are stored as integer codes of the factor levels (1=B, 2=E, 3=N, 4=Q)
tab1=table(testcase,tissue1)
tab1
        tissue1
testcase  B  E  N  Q
       1  3  0  0  1
       2  0 19  0  0
       3  1  0  7  1
       4  1  0  0  1
misclassification.rate(tab1)
[1] 0.118

genes.used=multiget(ls(env=aggenv(Agg)),env=aggenv(Agg))
genes.counts=as.numeric(genes.used)
names(genes.counts)=names(genes.used)
sort(genes.counts)
.....
  AFFX.HSAC07/X00351.5.at AFFX.HUMGAPDH/M33197.5.at
                       51                        51
AFFX.HUMGAPDH/M33197.M.at   AFFX.HSAC07/X00351.M.at
                       51                        51
The variable genes.counts contains for each gene the number of times it was selected in the cross validation.

Homework problems 2: Microarray Data and Supervised Learning
Biostats 278, Steve Horvath

To understand this homework, read the corresponding discussion notes carefully.
Use the data set MicroarrayExample.csv for the following analyses.

0) Fit an rpart tree using genes as covariates and tissue1 as outcome.
   Hint: (dat2 and tissueE are defined in problems 2 and 3 below)
   rp1=rpart(factor(tissueE)~., data=data.frame(t(dat2)))
   plot(rp1);text(rp1)
1) Select genes that have an expression value above 200 in at least 5 samples.
   How many genes satisfy the condition?
   Hint: Use the following functions in the library genefilter (adjust the arguments of kOverA to the thresholds stated above)
   f1=kOverA(3,300)
   ffun=filterfun(f1)
   which1=genefilter(dat1,ffun)
   table(which1)
2) Create a new data set that contains the genes found in 1). Note that this data set contains a subset of genes that was found without looking at tissue type.
   Hint: dat2=dat1[which1,]
3) Let's assume that we want to form a predictor for comparing E tissues versus all other tissues (B,N,Q), i.e. this is a 2 class outcome.
   Hint: tissueE=as.numeric(tissue1=="E")
   A) Use data set dat2 to estimate the 10 fold cross-validation and .632+ bootstrap error rates for predicting E with
      i) LDA
      ii) QDA
      iii) rpart
      iv) support vector machines
   B) Compute the out of bag estimates when using random forest predictors.
      Hint: RF1=randomForest(factor(tissueE)~., data=data.frame(t(dat2)), ntree=1000)
      RF1
   C) Compute the out of bag estimate when using bagged rpart trees.
      Hint: (the outcome is passed as a factor so that classification trees are grown)
      bag1=bagging(factor(tissueE) ~ ., data=data.frame(t(dat2)), nbagg=B,
                   control=rpart.control(minsplit=2, cp=0, xval=0),
                   comb=NULL, coob=TRUE, ns=dim(t(dat2))[1], keepX=TRUE)
      tab1=table(predict(bag1),tissueE); misclassification.rate(tab1)
4) Again use tissueE as outcome in the following. Use the function knnCV to record the leave one out cross-validation error of k-nearest neighbor predictors that use k=1,3,5,11 neighbors. As done in the discussion notes, use an Anova filter and the kOverA function to select genes in the training data. To speed up the analysis you may want to restrict the analysis to the first 200 genes.
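
The point of knnCV in problem 4 is that genes are re-selected inside every leave-one-out fold, so the held-out sample never influences the filter. The following is a minimal sketch of that idea on simulated data, using only base R and package class. It is not the genefilter-based code from the notes: the helper select.genes (a per-gene two-sample t-test filter), the simulated matrix expr and all sizes are assumptions made for illustration.

```r
library(class)
set.seed(42)

# Simulated expression matrix: genes in rows, samples in columns (as in dat1)
n.genes <- 200
n.samp  <- 30
cls  <- factor(rep(c("A", "B"), each = n.samp / 2))
expr <- matrix(rnorm(n.genes * n.samp), nrow = n.genes)
# make the first 10 genes informative by shifting class B upward
expr[1:10, cls == "B"] <- expr[1:10, cls == "B"] + 2

# Hypothetical filter: keep genes whose two-sample t-test p-value is below p
select.genes <- function(e, g, p = 0.01) {
  pvals <- apply(e, 1, function(v) t.test(v ~ g)$p.value)
  pvals < p
}

loo.pred <- character(n.samp)
n.selected <- integer(n.samp)
for (i in seq_len(n.samp)) {
  # gene selection uses the training samples only (sample i is left out)
  keep <- select.genes(expr[, -i], cls[-i])
  if (!any(keep)) keep[] <- TRUE   # fall back to all genes if none pass
  n.selected[i] <- sum(keep)
  # knn expects samples in rows, hence the transposes
  pred <- knn(train = t(expr[keep, -i, drop = FALSE]),
              test  = t(expr[keep, i, drop = FALSE]),
              cl    = cls[-i],
              k     = 3)
  loo.pred[i] <- as.character(pred)
}

loo.err <- mean(loo.pred != as.character(cls))
print(loo.err)
print(table(predicted = loo.pred, observed = cls))
```

Selecting the genes once on all samples and then cross-validating only the k-NN step would reuse the held-out sample in the filter and tends to give an optimistically low error estimate.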