Package ‘mRMRe’

February 15, 2013

Type Package
Title R package for parallelized mRMR ensemble feature selection
Version 1.0.2
Date 2012-08-20
Author Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
Maintainer Benjamin Haibe-Kains <bhaibeka@ircm.qc.ca>
Description This package contains a set of functions to compute mutual information matrices from continuous, categorical and survival variables. It also contains functions to perform feature selection with mRMR and a new ensemble mRMR technique.
License Artistic-1.0
Imports Rcpp (>= 0.9.10), survival
Depends
LinkingTo Rcpp
Repository CRAN
Date/Publication 2012-08-24 06:40:27
NeedsCompilation yes

R topics documented:

    build.mim
    cgps
    compute.causality
    correlate
    mRMR.classic
    mRMR.ensemble
    set.thread.count
    Index


build.mim           Constructs a mutual information matrix (MIM)

Description

    This function constructs a mutual information matrix (MIM).

Usage

    build.mim(data, strata, weights, uses_ranks, outX, bootstrap_count)

Arguments

    data            A data frame with rows and columns respectively corresponding to samples and features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
    strata          A vector of factors specifying the stratum associated with each sample. Defaults to all samples being in the same stratum.
    weights         A vector containing the weight associated with each sample.
                    Defaults to all samples having the same weight.
    uses_ranks      If set to TRUE, the Spearman correlation is used between numeric features. If set to FALSE, the Pearson correlation is used. Defaults to TRUE.
    outX            Boolean value on whether ties should be counted. Defaults to TRUE.
    bootstrap_count If a positive integer is given, samples of each stratum will be recombined by the generic inverse-variance weighted average based on sample bootstrapping. Defaults to 0.

Value

    A symmetric matrix whose rows and columns correspond to features. Each entry (i, j) of the matrix is the mutual information between features i and j.

Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

See Also

    mRMR.classic, mRMR.ensemble

Examples

    ## Test on a dummy dataset
    library(survival)
    dd <- data.frame(
      "surv1"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
      "cont1"=runif(100),
      "cat1"=factor(sample(1:5, 100, replace=TRUE), ordered=TRUE),
      "surv2"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
      "cont2"=runif(100),
      "cont3"=runif(100),
      "surv3"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
      "cat2"=factor(sample(1:5, 100, replace=TRUE), ordered=TRUE))
    print(head(dd))
    build.mim(data=dd)

    ## Test on the 'cgps' dataset
    data(cgps)
    ## The variables are all of continuous type
    mim <- build.mim(data.frame(cgps_ic50, cgps_ge)) # Uses Spearman as correlation estimator
    print(head(mim))
    mim <- build.mim(data.frame(cgps_ic50, cgps_ge), uses_ranks=FALSE) # Uses Pearson as correlation estimator
    print(head(mim))


cgps                Part of the large pharmacogenomic dataset published by Garnett et al. in the Cancer Genome Project (CGP)

Description

    This dataset contains gene expression of 200 cancer cell lines for which sensitivity (IC50) to Irinotecan was measured.
Usage

    data(cgps)

Format

    The cgps dataset is composed of three objects:

    cgps_annot      Data frame containing gene annotations
    cgps_ge         Matrix containing expressions of 1000 genes; cell lines in rows, genes in columns
    cgps_ic50       Drug sensitivity measurements (IC50) for Irinotecan

Details

    Irinotecan is a drug mainly used in colorectal cancer.

Source

    http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-783
    http://www.nature.com/nature/journal/v483/n7391/extref/natu s2.zip

References

    Garnett MJ et al. "Systematic identification of genomic markers of drug sensitivity in cancer cells", Nature, 483:570-575, 2012.

Examples

    data(cgps)
    message("Gene expression data:")
    print(cgps_ge[1:3, 1:3])
    message("Gene annotations:")
    print(head(cgps_annot))
    message("Drug sensitivity (IC50) values:")
    print(head(cgps_ic50))


compute.causality   Computes the causality

Description

    This function computes causality scores based on mutual information theory.

Usage

    compute.causality(data, target_index, mim, solutions, estimator)

Arguments

    data            An mRMReObject, or a data frame with rows and columns respectively corresponding to samples and features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv". If an mRMReObject is used, the other arguments need not be supplied.
    target_index    Index of the target feature (column) from the data matrix.
    mim             A mutual information matrix that will be used in computing causality.
    solutions       A matrix where each row represents an mRMR solution.
    estimator       String containing the estimator to be used for computing the correlation between elements. Must be one of the following: "pearson", "spearman" or "kendall".

Value

    Returns a matrix of causality coefficients for every pair of features considered in solutions.

Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

References

    Bell AJ: The co-information lattice.
    In Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation: ICA 2003. Edited by Amari S et al. 2003.

See Also

    mRMR.classic, mRMR.ensemble

Examples

    data(cgps)
    data <- data.frame(target=cgps_ic50, cgps_ge)
    ensemble <- mRMR.ensemble(data, 1, c(10, 5, rep(1, 5)))
    compute.causality(ensemble)
    compute.causality(data=data, target_index=1, mim=NULL, solutions=ensemble$paths, estimator="spearman")


correlate           Computes the degree of association between two variables

Description

    This function computes the degree of association between two variables (correlation).

Usage

    correlate(x, y, strata, weights, method, outX, bootstrap_count)

Arguments

    x, y            Two vectors containing sample data.
    strata          A vector of factors specifying the stratum associated with each sample. Defaults to all samples being in the same stratum.
    weights         A vector containing the weight associated with each sample. Defaults to all samples having the same weight.
    method          A string containing the estimator to be used for computing the correlation between x and y. Must be one of the following methods: "cramer", "pearson", "spearman", "cindex", "kendall".
    outX            Boolean value on whether ties should be counted. Defaults to TRUE.
    bootstrap_count If a positive integer is given, samples of each stratum will be recombined by the generic inverse-variance weighted average based on sample bootstrapping. Defaults to 0.

Value

    A value depending on the estimator. "cramer" returns Cramer’s V, "pearson" returns Pearson’s rho, "spearman" returns Spearman’s rho, "cindex" returns the concordance index, "kendall" returns Kendall’s tau.
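As a rough illustration of what the estimators named in the Value section measure, the sketch below computes Pearson's rho, Spearman's rho, Kendall's tau and Cramer's V with base R only. It is not mRMRe's implementation: it ignores strata, weights, outX and bootstrap_count, the helper cramers_v and the simulated data are hypothetical, and the concordance index is left out.

```r
# Base-R sketch of the statistics named above; NOT mRMRe's own code.
set.seed(1)
x <- rnorm(50)
y <- x + rnorm(50)  # y is positively associated with x by construction

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

# Hypothetical helper: Cramer's V from the chi-squared statistic of a
# contingency table, V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
cramers_v <- function(a, b) {
  tab  <- table(a, b)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n    <- sum(tab)
  k    <- min(nrow(tab), ncol(tab))
  unname(sqrt(chi2 / (n * (k - 1))))
}

a <- sample(c(0, 1, 2), 100, replace = TRUE)
b <- sample(c(0, 1), 100, replace = TRUE)
v <- cramers_v(a, b)

print(c(pearson = pearson, spearman = spearman, kendall = kendall, cramer = v))
```

The three correlation coefficients lie in [-1, 1], while Cramer's V lies in [0, 1], so values returned under different method settings are not directly comparable.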
Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

See Also

    build.mim

Examples

    data(cgps)

    # Compute c-index between feature 1 and 2
    correlate(cgps_ge[, 1], cgps_ge[, 2], method="cindex")

    # Compute Cramer's V
    x <- sample(c(0, 1, 2), 100, replace=TRUE)
    y <- sample(c(0, 1), 100, replace=TRUE)
    correlate(x, y, method="cramer")

    # Compute Pearson coefficient with random strata and sample weights
    # between feature 1 and 2
    strata <- sample(as.factor(c("STRATUM_1","STRATUM_2","STRATUM_3")), nrow(cgps_ge), replace=TRUE)
    weights <- runif(nrow(cgps_ge))
    correlate(cgps_ge[, 1], cgps_ge[, 2], strata=strata, weights=weights, method="pearson", bootstrap_count=1000)


mRMR.classic        Performs an mRMR feature selection

Description

    This function performs an mRMR feature selection.

Usage

    mRMR.classic(data, target_index, feature_count, strata, weights, uses_ranks, outX, bootstrap_count)

Arguments

    data            A data frame with rows and columns respectively corresponding to samples and features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
    target_index    Index of the target feature (column) from the data matrix.
    feature_count   The number of features to be selected by mRMR feature selection.
    strata          A vector containing the stratum associated with each sample. Defaults to all samples being in the same stratum.
    weights         A vector containing the weight associated with each sample. Defaults to all samples having the same weight.
    uses_ranks      If set to TRUE, the Spearman correlation is used between numeric features. If set to FALSE, the Pearson correlation is used. Defaults to TRUE.
    outX            Boolean value on whether ties should be counted. Defaults to TRUE.
    bootstrap_count If a positive integer is given, samples of each stratum will be recombined by the generic inverse-variance weighted average based on sample bootstrapping. Defaults to 0.
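The bootstrap_count description refers to an inverse-variance weighted average across strata. One common reading of that phrase is sketched below in base R, under assumptions that are not taken from mRMRe's source: a correlation is bootstrapped within each stratum, and the per-stratum estimates are pooled with weights equal to the reciprocal of their bootstrap variance. The helper bootstrap_cor and the simulated data are hypothetical.

```r
# Sketch of inverse-variance pooling across strata; NOT mRMRe's own code.
set.seed(42)
n <- 150
strata <- factor(sample(c("S1", "S2", "S3"), n, replace = TRUE))
x <- rnorm(n)
y <- 0.5 * x + rnorm(n)

# Hypothetical helper: bootstrap the Pearson correlation of one stratum and
# return the mean and variance of the bootstrap replicates.
bootstrap_cor <- function(x, y, bootstrap_count) {
  reps <- replicate(bootstrap_count, {
    idx <- sample(length(x), replace = TRUE)  # resample indices with replacement
    cor(x[idx], y[idx])
  })
  c(estimate = mean(reps), variance = var(reps))
}

# One column per stratum, rows "estimate" and "variance".
per_stratum <- sapply(levels(strata), function(s) {
  keep <- strata == s
  bootstrap_cor(x[keep], y[keep], bootstrap_count = 200)
})

# Inverse-variance weights: more precise strata count for more.
w <- 1 / per_stratum["variance", ]
pooled <- sum(w * per_stratum["estimate", ]) / sum(w)

print(per_stratum)
print(pooled)
```

Because the weights are positive, the pooled value always falls between the smallest and largest per-stratum estimates.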
Details

    The mutual information matrix is computed using Pearson’s or Spearman’s rho between numeric features (depending on the uses_ranks argument), Cramer’s V between ordered factors, and the concordance index for any other combination.

Value

    Returns an mRMReObject object with the following attributes:

    paths           Index vector in which elements correspond to the features selected.
    scores          Score vector associated with mRMR feature selection.
    mim             Mutual information matrix used for mRMR feature selection.

Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

See Also

    mRMR.classic, mRMR.ensemble

Examples

    data(cgps)
    data <- data.frame(target=cgps_ic50, cgps_ge)
    mRMR.classic(data, 1, 30)


mRMR.ensemble       Performs an ensemble mRMR feature selection

Description

    This function performs an ensemble mRMR feature selection, that is, multiple mRMR selections in parallel.

Usage

    mRMR.ensemble(data, target_index, levels, strata, weights, uses_ranks, outX, bootstrap_count)

Arguments

    data            A data frame with rows and columns respectively corresponding to samples and features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
    target_index    Index of the target feature (column) from the data matrix.
    levels          A vector containing the number of features to be selected at each turn of ensemble mRMR feature selection.
    strata          A vector of factors specifying the stratum associated with each sample. Defaults to all samples being in the same stratum.
    weights         A vector containing the weight associated with each sample. Defaults to all samples having the same weight.
    uses_ranks      If set to TRUE, the Spearman correlation is used between numeric features. If set to FALSE, the Pearson correlation is used. Defaults to TRUE.
    outX            Boolean value on whether ties should be counted. Defaults to TRUE.
    bootstrap_count If a positive integer is given, samples of each stratum will be recombined by the generic inverse-variance weighted average based on sample bootstrapping. Defaults to 0.

Details

    The mutual information matrix is computed using Pearson’s or Spearman’s rho between numeric features (depending on the uses_ranks argument), Cramer’s V between ordered factors, and the concordance index for any other combination.

Value

    Returns an mRMReObject object with the following attributes:

    paths           Index matrix in which rows correspond to solutions and columns to the features selected for each solution.
    scores          Score matrix associated with ensemble mRMR feature selection.
    mim             Mutual information matrix used for ensemble mRMR feature selection.

Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

See Also

    build.mim, mRMR.classic

Examples

    data(cgps)
    data <- data.frame(target=cgps_ic50, cgps_ge)
    mRMR.ensemble(data, 1, rep.int(1, 30)) # For mRMR.classic-like results
    mRMR.ensemble(data, 1, c(5:1, 4))


set.thread.count    Sets the maximum number of threads that can be launched

Description

    This function sets the maximum number of threads that can be launched.

Usage

    set.thread.count(thread_count)

Arguments

    thread_count    Maximum number of threads.

Value

    None

Author(s)

    Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains

Examples

    data(cgps)
    set.thread.count(4)
    mim <- build.mim(data.frame(target=cgps_ic50, cgps_ge))
    head(mim)


Index

    ∗Topic datasets
        cgps

    build.mim
    cgps
    cgps_annot (cgps)
    cgps_ge (cgps)
    cgps_ic50 (cgps)
    compute.causality
    correlate
    mRMR.classic
    mRMR.ensemble
    set.thread.count