Package `mRMRe`

Package ‘mRMRe’
February 15, 2013
Type Package
Title R package for parallelized mRMR ensemble feature selection
Version 1.0.2
Date 2012-08-20
Author Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
Maintainer Benjamin Haibe-Kains <bhaibeka@ircm.qc.ca>
Description This package contains a set of functions to compute mutual
information matrices from continuous, categorical and survival
variables. It also contains functions to perform feature
selection with mRMR and a new ensemble mRMR technique.
License Artistic-1.0
Imports Rcpp (>= 0.9.10), survival
Depends
LinkingTo Rcpp
Repository CRAN
Date/Publication 2012-08-24 06:40:27
NeedsCompilation yes
R topics documented:
build.mim
cgps
compute.causality
correlate
mRMR.classic
mRMR.ensemble
set.thread.count
Index
build.mim
Constructs a mutual information matrix (MIM)
Description
This function constructs a mutual information matrix (MIM).
Usage
build.mim(data, strata, weights, uses_ranks, outX, bootstrap_count)
Arguments
data
A data frame with rows and columns respectively corresponding to samples and
features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
strata
A vector of factors specifying the stratum associated with each sample. Defaults
to all samples being in the same stratum.
weights
A vector containing the weight associated with each sample. Defaults to all
samples having the same weight.
uses_ranks
If set to TRUE, the Spearman correlation is used between numeric features. If
set to FALSE, the Pearson correlation is used. Defaults to TRUE.
outX
Boolean value indicating whether ties should be counted. Defaults to TRUE.
bootstrap_count
If a positive integer is given, the per-stratum estimates are combined by an inverse-variance weighted average based on that number of bootstrap resamples. Defaults to 0.
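The stratified combination described for bootstrap_count can be pictured as follows: estimate the correlation within each stratum, bootstrap the samples of that stratum to estimate the variance of the estimate, then average the estimates with weights proportional to the inverse variances. The base-R sketch below works under those assumptions only. `sketch_stratified_cor` is a hypothetical helper, not a function of this package, and the package's compiled implementation may differ in detail.

```r
# Hypothetical helper (not part of mRMRe): inverse-variance weighted
# average of per-stratum correlation estimates.
sketch_stratified_cor <- function(x, y, strata, bootstrap_count = 200,
                                  method = "pearson") {
  per_stratum <- lapply(split(seq_along(x), strata), function(idx) {
    estimate <- cor(x[idx], y[idx], method = method)
    # Resample this stratum with replacement to estimate the variance
    # of the correlation estimate.
    boot <- replicate(bootstrap_count, {
      resampled <- sample(idx, replace = TRUE)
      cor(x[resampled], y[resampled], method = method)
    })
    c(estimate = estimate, variance = var(boot, na.rm = TRUE))
  })
  stats <- do.call(rbind, per_stratum)
  w <- 1 / stats[, "variance"]          # inverse-variance weights
  sum(w * stats[, "estimate"]) / sum(w)
}

set.seed(42)
x <- rnorm(120)
y <- x + rnorm(120)
strata <- factor(rep(c("A", "B", "C"), each = 40))
pooled <- sketch_stratified_cor(x, y, strata)
print(pooled)
```

Because the weights are positive, the pooled value always lies between the smallest and largest per-stratum estimates; strata with noisier estimates simply contribute less.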
Value
A symmetric matrix whose rows and columns correspond to features. Each entry (i, j) of the matrix is the
mutual information between features i and j.
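For intuition about the returned matrix, a correlation-based MIM for purely numeric data can be sketched in base R. The sketch assumes the common Gaussian approximation I = -0.5 * log(1 - rho^2) for turning a correlation into a mutual information estimate; this manual does not state the exact transformation used, so treat the formula and the hypothetical helper `sketch_mim` as illustration rather than as the package's implementation.

```r
# Hypothetical helper (not part of mRMRe): correlation-based MIM for
# numeric columns, assuming I = -0.5 * log(1 - rho^2).
sketch_mim <- function(data, uses_ranks = TRUE) {
  method <- if (uses_ranks) "spearman" else "pearson"
  rho <- cor(as.matrix(data), method = method)
  mim <- -0.5 * log(1 - rho^2)
  diag(mim) <- NA  # a feature with itself is unbounded under this formula
  mim
}

set.seed(1)
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$c <- toy$a + rnorm(50, sd = 0.1)  # c is nearly a copy of a
mim <- sketch_mim(toy)
print(round(mim, 3))
```

In the toy data, the entry for the pair (a, c) is much larger than the entry for (a, b), which reflects that c carries almost the same information as a.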
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
See Also
mRMR.classic, mRMR.ensemble
Examples
## Test on a dummy dataset
library(survival)
dd <- data.frame("surv1"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
                 "cont1"=runif(100),
                 "cat1"=factor(sample(1:5, 100, replace=TRUE), ordered=TRUE),
                 "surv2"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
                 "cont2"=runif(100),
                 "cont3"=runif(100),
                 "surv3"=Surv(runif(100), sample(0:1, 100, replace=TRUE)),
                 "cat2"=factor(sample(1:5, 100, replace=TRUE), ordered=TRUE))
print(head(dd))
build.mim(data=dd)
## Test on the ’cgps’ dataset
data(cgps)
## The variables are all of continuous type
mim <- build.mim(data.frame(cgps_ic50, cgps_ge)) # Uses Spearman as correlation estimator
print(head(mim))
mim <- build.mim(data.frame(cgps_ic50, cgps_ge), uses_ranks=FALSE) # Uses Pearson as correlation estimator
print(head(mim))
cgps
Part of the large pharmacogenomic dataset published by Garnett et al.
in the Cancer Genome Project (CGP)
Description
This dataset contains gene expression of 200 cancer cell lines for which sensitivity (IC50) to Irinotecan was measured.
Usage
data(cgps)
Format
The cgps dataset is composed of three objects:
cgps_annot    Data frame containing gene annotations
cgps_ge       Matrix containing expressions of 1000 genes; cell lines in rows, genes in columns
cgps_ic50     Drug sensitivity measurements (IC50) for Irinotecan
Details
Irinotecan is a drug mainly used in colorectal cancer.
Source
http://www.ebi.ac.uk/arrayexpress/experiments/E-MTAB-783 http://www.nature.com/nature/journal/v483/n7391/extref/natu
s2.zip
References
Garnett MJ et al. "Systematic identification of genomic markers of drug sensitivity in cancer cells",
Nature, 483:570-575, 2012.
Examples
data(cgps)
message("Gene expression data:")
print(cgps_ge[1:3, 1:3])
message("Gene annotations:")
print(head(cgps_annot))
message("Drug sensitivity (IC50) values:")
print(head(cgps_ic50))
compute.causality
Computes causality scores
Description
This function computes causality scores based on mutual information theory.
Usage
compute.causality(data, target_index, mim, solutions, estimator)
Arguments
data
An mRMReObject, or a data frame with rows and columns respectively corresponding to samples and features. Only columns of the following types are
supported: "numeric", "ordered_factor" and "Surv".
If an mRMReObject is given, the other arguments need not be supplied.
target_index
Index of the target feature (column) from the data matrix.
mim
A mutual information matrix that will be used in computing causality.
solutions
A matrix where each row represents an mRMR solution.
estimator
String containing the estimator to be used for computing the correlation between
elements. Must be one of the following: "pearson", "spearman" or "kendall".
Value
Returns a matrix of causality coefficients for every pair of features considered in solutions.
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
References
Bell AJ. "The co-information lattice". In: Amari S et al. (eds.), Proceedings of the Fifth International Workshop on Independent Component Analysis and Blind Signal Separation (ICA 2003), 2003.
See Also
mRMR.classic, mRMR.ensemble
Examples
data(cgps)
data <- data.frame(target=cgps_ic50, cgps_ge)
ensemble <- mRMR.ensemble(data, 1, c(10, 5, rep(1, 5)))
compute.causality(ensemble)
compute.causality(data=data, target_index=1, mim=NULL, solutions=ensemble$paths, estimator="spearman")
correlate
Computes the degree of association between two variables.
Description
This function computes the degree of association between two variables (correlation).
Usage
correlate(x, y, strata, weights, method, outX, bootstrap_count)
Arguments
x, y
Two vectors containing sample data.
strata
A vector of factors specifying the stratum associated with each sample. Defaults
to all samples being in the same stratum.
weights
A vector containing the weight associated with each sample. Defaults to all
samples having the same weight.
method
A string containing the estimator to be used for computing the correlation between x and y. Must be one of the following methods: "cramer", "pearson",
"spearman", "cindex", "kendall".
outX
Boolean value indicating whether ties should be counted. Defaults to TRUE.
bootstrap_count
If a positive integer is given, the per-stratum estimates are combined by an inverse-variance weighted average based on that number of bootstrap resamples. Defaults to 0.
Value
A value depending on the estimator. "cramer" returns Cramer’s V, "pearson" returns Pearson’s
rho, "spearman" returns Spearman’s rho, "cindex" returns the concordance index, "kendall" returns
Kendall’s tau.
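To make the returned quantities concrete, some of these estimators have simple base-R counterparts. Pearson's rho, Spearman's rho and Kendall's tau are available through `cor()`. Cramer's V and a concordance index for two numeric vectors can be sketched as below. `cramers_v` and `c_index` are hypothetical helpers written for illustration; they ignore strata and weights, and dropping tied pairs in `c_index` is an assumption that is not necessarily how outX behaves in this package.

```r
# Hypothetical helpers (not part of mRMRe).

# Cramer's V from the chi-squared statistic of the contingency table.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Concordance index: fraction of untied pairs that x and y order the
# same way. Enumerates all pairs, so only suitable for small inputs.
c_index <- function(x, y) {
  pairs <- combn(length(x), 2)
  dx <- x[pairs[1, ]] - x[pairs[2, ]]
  dy <- y[pairs[1, ]] - y[pairs[2, ]]
  usable <- dx != 0 & dy != 0   # drop tied pairs (an assumption)
  mean(sign(dx[usable]) == sign(dy[usable]))
}

x <- c(1, 3, 2, 5, 4)
y <- c(2, 1, 4, 3, 5)
print(c_index(x, y))
print(cor(x, y, method = "kendall"))
print(cramers_v(c(0, 0, 1, 1), c(0, 0, 1, 1)))
```

When there are no ties, the two rank-based measures are linked by tau = 2 * c - 1, so a concordance index of 0.5 corresponds to no association and 1 to perfect agreement.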
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
See Also
build.mim
Examples
data(cgps)
# Compute c-index between feature 1 and 2
correlate(cgps_ge[,1],cgps_ge[,2], method="cindex")
# Compute Cramer’s V
x <- sample(c(0, 1, 2), 100, replace=TRUE)
y <- sample(c(0, 1), 100, replace=TRUE)
correlate(x, y, method="cramer")
# Compute Pearson coefficient with random strata and sample weights
# between feature 1 and 2
strata <- sample(as.factor(c("STRATUM_1","STRATUM_2","STRATUM_3")), nrow(cgps_ge), replace=TRUE)
weights <- runif(nrow(cgps_ge))
correlate(cgps_ge[, 1],cgps_ge[, 2], strata=strata, weights=weights, method="pearson", bootstrap_count=1000)
mRMR.classic
Performs an mRMR feature selection
Description
This function performs an mRMR feature selection.
Usage
mRMR.classic(data, target_index, feature_count, strata, weights, uses_ranks, outX, bootstrap_count)
Arguments
data
A data frame with rows and columns respectively corresponding to samples and
features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
target_index
Index of the target feature (column) from the data matrix.
feature_count
The number of features to be selected by mRMR feature selection.
strata
A vector containing the stratum associated with each sample. Defaults to all
samples being in the same stratum.
weights
A vector containing the weight associated with each sample. Defaults to all
samples having the same weight.
uses_ranks
If set to TRUE, the Spearman correlation is used between numeric features. If
set to FALSE, the Pearson correlation is used. Defaults to TRUE.
outX
Boolean value indicating whether ties should be counted. Defaults to TRUE.
bootstrap_count
If a positive integer is given, the per-stratum estimates are combined by an inverse-variance weighted average based on that number of bootstrap resamples. Defaults to 0.
Details
The resulting mutual information matrix is computed using Pearson's or Spearman's rho between numeric features (depending on the uses_ranks argument), Cramer's V between ordered factors,
and the concordance index for any other combination.
Value
Returns an mRMReObject with the following attributes:
paths
Index vector in which elements correspond to the features selected.
scores
Score vector associated with mRMR feature selection.
mim
Mutual information matrix used for mRMR feature selection.
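For intuition about paths and scores, a greedy mRMR pass over a precomputed matrix can be sketched in base R. The sketch assumes the standard difference criterion (relevance to the target minus mean redundancy with the features already selected); `sketch_mrmr` is a hypothetical helper and is not claimed to reproduce the package's scoring exactly.

```r
# Hypothetical helper (not part of mRMRe): greedy mRMR on a matrix of
# pairwise scores such as a mutual information matrix.
sketch_mrmr <- function(mim, target_index, feature_count) {
  candidates <- setdiff(seq_len(ncol(mim)), target_index)
  stopifnot(feature_count <= length(candidates))
  selected <- integer(0)
  scores <- numeric(0)
  for (step in seq_len(feature_count)) {
    score <- vapply(candidates, function(j) {
      relevance <- mim[j, target_index]
      redundancy <- if (length(selected) > 0) mean(mim[j, selected]) else 0
      relevance - redundancy
    }, numeric(1))
    best <- which.max(score)
    selected <- c(selected, candidates[best])
    scores <- c(scores, score[best])
    candidates <- candidates[-best]
  }
  list(paths = selected, scores = scores)
}

# Toy matrix: features 2 and 3 are both relevant to the target (1) but
# redundant with each other; feature 4 is less relevant but independent.
mim <- matrix(c(0.0, 0.9, 0.8, 0.5,
                0.9, 0.0, 0.9, 0.0,
                0.8, 0.9, 0.0, 0.0,
                0.5, 0.0, 0.0, 0.0), nrow = 4, byrow = TRUE)
result <- sketch_mrmr(mim, target_index = 1, feature_count = 2)
print(result)
```

On the toy matrix the sketch selects feature 2 first and then feature 4 rather than feature 3: although feature 3 is more relevant to the target, its redundancy with feature 2 lowers its score below that of feature 4.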
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
See Also
build.mim, mRMR.ensemble
Examples
data(cgps)
data <- data.frame(target=cgps_ic50, cgps_ge)
mRMR.classic(data, 1, 30)
mRMR.ensemble
Performs an ensemble mRMR feature selection
Description
This function performs an ensemble mRMR feature selection, that is, multiple mRMR selections run in
parallel.
Usage
mRMR.ensemble(data, target_index, levels, strata, weights, uses_ranks, outX, bootstrap_count)
Arguments
data
A data frame with rows and columns respectively corresponding to samples and
features. Only columns of the following types are supported: "numeric", "ordered_factor" and "Surv".
target_index
Index of the target feature (column) from the data matrix.
levels
A vector containing the number of features to be selected at each turn of ensemble mRMR feature selection.
strata
A vector of factors specifying the stratum associated with each sample. Defaults
to all samples being in the same stratum.
weights
A vector containing the weight associated with each sample. Defaults to all
samples having the same weight.
uses_ranks
If set to TRUE, the Spearman correlation is used between numeric features. If
set to FALSE, the Pearson correlation is used. Defaults to TRUE.
outX
Boolean value indicating whether ties should be counted. Defaults to TRUE.
bootstrap_count
If a positive integer is given, the per-stratum estimates are combined by an inverse-variance weighted average based on that number of bootstrap resamples. Defaults to 0.
Details
The resulting mutual information matrix is computed using Pearson's or Spearman's rho between numeric features (depending on the uses_ranks argument), Cramer's V between ordered factors,
and the concordance index for any other combination.
Value
Returns an mRMReObject with the following attributes:
paths
Index matrix in which rows correspond to solutions and columns to the features
selected for each solution.
scores
Score matrix associated with ensemble mRMR feature selection.
mim
Mutual information matrix used for ensemble mRMR feature selection.
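One way to picture how levels shapes the paths matrix is as a tree: at depth d the levels[d] best-scoring candidates are kept, and each one starts its own branch. Under that reading, which is an assumption here rather than something this manual spells out, the result has prod(levels) solutions of length(levels) features each. The base-R sketch below uses the hypothetical helper `sketch_ensemble` with the usual relevance-minus-redundancy score and absolute Spearman correlation as a stand-in for the mutual information matrix.

```r
# Hypothetical helper (not part of mRMRe): tree expansion in which
# levels[d] children are kept at depth d.
sketch_ensemble <- function(mim, target_index, levels) {
  expand <- function(selected, depth) {
    if (depth > length(levels)) return(list(selected))
    candidates <- setdiff(seq_len(ncol(mim)), c(target_index, selected))
    score <- vapply(candidates, function(j) {
      redundancy <- if (length(selected) > 0) mean(mim[j, selected]) else 0
      mim[j, target_index] - redundancy
    }, numeric(1))
    keep <- candidates[order(score, decreasing = TRUE)][seq_len(levels[depth])]
    # Each kept candidate starts a branch that is explored independently.
    do.call(c, lapply(keep, function(j) expand(c(selected, j), depth + 1)))
  }
  do.call(rbind, expand(integer(0), 1))
}

set.seed(7)
toy <- matrix(rnorm(60 * 8), ncol = 8)
mim <- abs(cor(toy, method = "spearman"))  # stand-in score matrix
paths <- sketch_ensemble(mim, target_index = 1, levels = c(3, 2, 1))
print(dim(paths))
print(paths)
```

With levels = c(3, 2, 1) the sketch returns a 6 x 3 matrix, one row per solution. Setting every level to 1 leaves a single branch, which matches the manual's remark that rep.int(1, 30) gives mRMR.classic-like results.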
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
See Also
build.mim, mRMR.classic
Examples
data(cgps)
data <- data.frame(target=cgps_ic50, cgps_ge)
mRMR.ensemble(data, 1, rep.int(1, 30)) # For mRMR.classic-like results
mRMR.ensemble(data, 1, c(5:1, 4))
set.thread.count
Sets the maximum number of threads that can be launched
Description
This function sets the maximum number of threads that can be launched.
Usage
set.thread.count(thread_count)
Arguments
thread_count
Maximum number of threads.
Value
None
Author(s)
Nicolas De Jay, Simon Papillon-Cavanagh, Benjamin Haibe-Kains
Examples
data(cgps)
set.thread.count(4)
mim <- build.mim(data.frame(target=cgps_ic50, cgps_ge))
head(mim)
Index
∗Topic datasets
cgps

build.mim
cgps
cgps_annot (cgps)
cgps_ge (cgps)
cgps_ic50 (cgps)
compute.causality
correlate
mRMR.classic
mRMR.ensemble
set.thread.count