package edgeR

R packages for the differential analysis of high-throughput sequencing data
with replicates
Note:
1) Some packages include a data normalization step that precedes the differential analysis
and that can be difficult to bypass.
2) The main normalization techniques are detailed in the accompanying .ppt document.
________________________________ package edgeR _____________________________________
Principle:
edgeR assumes that the number X of reads associated with gene i and sample j (cDNA, RNA, …)
under condition k follows a negative binomial distribution, which can be written:
X_ijk ~ NB(μ_ijk, φ)
where the X_ijk are assumed independent, with:
E(X_ijk) = μ_ijk
Var(X_ijk) = μ_ijk(1 + φ·μ_ijk)
and where φ is an overdispersion parameter to be estimated.
… If φ = 0, the model reduces to a Poisson distribution.
The mean is assumed to factor as μ_ijk = λ_ik·N_j, where N_j is the size of library j and λ_ik
the relative abundance of gene i under condition k.
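As a rough numerical check of this mean-variance relationship, the sketch below evaluates the negative binomial probability mass function in its mean/dispersion form, with Var = μ(1 + φμ), and recovers both moments by direct summation. The function names are illustrative and are not part of edgeR.

```python
import math

def nb_log_pmf(x, mu, phi):
    """Log-probability of count x under an NB with mean mu and dispersion phi > 0."""
    r = 1.0 / phi          # NB "size" parameter
    p = r / (r + mu)       # success probability
    return (math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
            + r * math.log(p) + x * math.log(1.0 - p))

def nb_moments(mu, phi, upper=2000):
    """Mean and variance by summing the pmf over 0..upper (tail assumed negligible)."""
    probs = [math.exp(nb_log_pmf(x, mu, phi)) for x in range(upper + 1)]
    mean = sum(x * q for x, q in enumerate(probs))
    var = sum((x - mean) ** 2 * q for x, q in enumerate(probs))
    return mean, var

mean, var = nb_moments(mu=20.0, phi=0.1)
# Theory: mean = 20 and variance = 20 * (1 + 0.1 * 20) = 60

# As phi approaches 0 the variance approaches the mean (Poisson limit).
_, var_small_phi = nb_moments(mu=20.0, phi=1e-6)
```

With φ close to 0 the computed variance is close to the mean, which is the Poisson limit of the model.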
The estimation of φ combines two quantities:
- Log-likelihood of φ for gene i: l_i(φ)
- Common log-likelihood of φ: l_c(φ) = ∑_i l_i(φ)
The dispersion φ_i of gene i is then estimated by maximizing a weighted version of the likelihood:
WL(φ_i) = l_i(φ_i) + α·l_c(φ_i)
Two options are implemented for this estimation (the user chooses one):
- « tag-wise dispersion »: α = 0
- « tag-wise + common dispersion »: α > 0
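The effect of α can be illustrated with a small grid search. This is a simplified sketch, not edgeR's implementation: it assumes equal library sizes, fixes each gene's mean at its sample mean, and uses the plain NB log-likelihood where edgeR uses a conditional likelihood. The function names, toy counts and grid are all illustrative.

```python
import math

def nb_loglik(counts, phi):
    """NB log-likelihood of one gene's counts at dispersion phi.

    Simplifying assumptions: equal library sizes, and the mean is fixed at
    the sample mean (which must be > 0) instead of being conditioned out.
    """
    mu = sum(counts) / len(counts)
    r = 1.0 / phi
    p = r / (r + mu)
    return sum(
        math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
        + r * math.log(p) + x * math.log(1.0 - p)
        for x in counts
    )

def weighted_dispersions(rows, alpha, grid):
    """Maximize WL(phi_i) = l_i(phi_i) + alpha * l_c(phi_i) over a grid, gene by gene."""
    per_gene = [[nb_loglik(row, phi) for phi in grid] for row in rows]
    common = [sum(column) for column in zip(*per_gene)]   # l_c evaluated on the grid
    estimates = []
    for l_i in per_gene:
        wl = [li + alpha * lc for li, lc in zip(l_i, common)]
        best = max(range(len(grid)), key=wl.__getitem__)
        estimates.append(grid[best])
    return estimates

rows = [[10, 12, 9, 11], [5, 30, 2, 40], [100, 80, 120, 95]]   # toy counts, 3 genes x 4 samples
grid = [0.01 * k for k in range(1, 301)]
tagwise = weighted_dispersions(rows, alpha=0.0, grid=grid)      # each gene on its own
moderated = weighted_dispersions(rows, alpha=1e6, grid=grid)    # dominated by l_c
```

With α = 0 each gene keeps its own estimate. As α grows, all estimates are pulled toward the maximizer of the common likelihood.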
References:
Robinson MD, Smyth GK.
Moderated statistical tests for assessing differences in tag abundance.
Bioinformatics (2007) 23(21): 2881-2887.
Robinson MD, Smyth GK.
Small-sample estimation of negative binomial dispersion, with applications to SAGE data.
Biostatistics (2008) 9(2): 321-332.
Robinson MD, McCarthy DJ, Smyth GK.
edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.
Bioinformatics (2009).
Robinson MD, Oshlack A.
A scaling normalization method for differential expression analysis of RNA-seq data.
Genome Biology (2010) 11: R25.
Remark(s):
1) « The counterpart of the limma test, adapted to sequencing data (cf. the moderated t-test
by the same author) »
2) « TMM normalization is implemented in the package »
______________________________ Package DESeq _______________________________________
Principle: the Anders & Huber algorithm
Let X_ij be the number of reads for gene i in sample j. We assume:
X_ij ~ NB(μ_ij, σ²_ij)
Idea: the biological variance of a gene is a smooth function of its expression level
under the condition considered.
Three assumptions:
1) The expected count is the product of a term that depends on the gene and its
condition and a term that depends on the library:
E[X_ij] = μ_ij = q_i,ρ(j)·s_j
where ρ(j) is the condition of sample j and s_j represents the coverage (sequencing depth) of library j
2) Overall variance of the gene = shot noise (technical variance) + raw variance (biological
replicates):
σ²_ij = μ_ij + s_j²·ν_i,ρ(j)
3) The variance parameter ν_i,ρ(j) of the gene is a smooth function of its expression level:
ν_i,ρ(j) = ν_ρ(q_i,ρ(j))
It is estimated by pooling the data from genes with similar expression levels
(shrinkage)
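A short derivation, using only assumptions 1 and 2, shows what happens to the mean and variance when a count is divided by the depth term s_j:

```latex
\operatorname{E}\!\left[\frac{X_{ij}}{s_j}\right] = q_{i,\rho(j)},
\qquad
\operatorname{Var}\!\left[\frac{X_{ij}}{s_j}\right]
  = \frac{\mu_{ij} + s_j^{2}\,\nu_{i,\rho(j)}}{s_j^{2}}
  = \frac{q_{i,\rho(j)}}{s_j} + \nu_{i,\rho(j)}
```

The first term is the shot-noise contribution, which shrinks as the library gets deeper. The second is the raw variance, which does not depend on depth.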
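In practice …
\hat{s}_j = \operatorname{median}_i \frac{X_{ij}}{\left( \prod_{v=1}^{m} X_{iv} \right)^{1/m}}
… where m is the number of samples
Overall expression (baseMean):
\text{baseMean} = \frac{1}{n} \sum_{j=1}^{n} \frac{X_{ij}}{\hat{s}_j}
Fold change:
FC = \frac{\text{baseMean}_B}{\text{baseMean}_A}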
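The size-factor estimate, the baseMean and the fold change can be sketched in a few lines. This is an illustrative re-implementation, not the DESeq code: the function names and the toy count matrix are assumptions, and genes with a zero count are simply left out of the size-factor step because their geometric mean is zero.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors; counts is a list of gene rows, one count per sample."""
    m = len(counts[0])
    usable = [row for row in counts if all(x > 0 for x in row)]   # zero counts have no finite log
    geo_means = [math.exp(sum(math.log(x) for x in row) / m) for row in usable]
    return [median(row[j] / g for row, g in zip(usable, geo_means)) for j in range(m)]

def base_mean(row, s, columns):
    """Mean of the depth-normalized counts of one gene over the given sample columns."""
    return sum(row[j] / s[j] for j in columns) / len(columns)

# Toy data: samples 0-1 are condition A, samples 2-3 are condition B.
# Samples 1 and 3 are sequenced twice as deep; only the first gene changes.
counts = [[10, 20, 20, 40], [100, 200, 100, 200], [5, 10, 5, 10], [8, 16, 8, 16]]
s = size_factors(counts)
cond_a, cond_b = [0, 1], [2, 3]
fold_changes = [base_mean(r, s, cond_b) / base_mean(r, s, cond_a) for r in counts]
```

Taking the median makes the size factors insensitive to the single gene that changes, so its fold change is not absorbed into the normalization.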
Variance estimation
1) All computations are carried out after excluding genes with at least one zero
expression value
2) The model estimates one function ν_ρ for each condition ρ
3) The function varianceFitDiagnostics is used to check that the estimated variance is not
too far from the empirical variance of the q_ij(ρ).
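The empirical quantity used in such a check can be sketched as follows. Under the variance decomposition above, a depth-normalized count X_ij/s_j has variance q/s_j + ν, so the sample variance of the normalized counts overestimates the raw variance ν by an average shot-noise term. The code below subtracts that term. It is an illustration of the idea only: it does not reproduce varianceFitDiagnostics, and the names w and z are chosen here for convenience.

```python
def raw_variance_estimate(row, s):
    """Per-gene estimates within one condition.

    row: counts of one gene over the replicates of the condition
    s:   size factors of those replicates
    Returns (q_hat, w, w - z): mean normalized count, sample variance of the
    normalized counts, and that variance minus the shot-noise term.
    """
    m = len(row)
    normalized = [x / sj for x, sj in zip(row, s)]
    q_hat = sum(normalized) / m
    w = sum((y - q_hat) ** 2 for y in normalized) / (m - 1)
    z = q_hat / m * sum(1.0 / sj for sj in s)   # average shot noise q / s_j
    return q_hat, w, w - z

q_hat, w, raw = raw_variance_estimate([10, 20, 30], [1.0, 1.0, 1.0])
# q_hat = 20, w = 100, shot-noise term = 20, raw-variance estimate = 80
```

For a single gene the difference w - z is noisy and can be negative, which is one reason for fitting a smooth function across genes rather than using per-gene values directly.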
References:
Anders S, Huber W.
Differential expression analysis for sequence count data.
Nature Precedings (2010).
In Bioconductor: Analysing RNA-Seq data with the « DESeq » package (Anders S.)
___________________________________________________________________________________________
The other packages in brief …
- GPseq
“Using the generalized Poisson distribution to model sequence read counts from high throughput
sequencing experiments”
Abstract
Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quantify transcriptomes.
However, there are significant challenges in the analysis of RNA-seq data, such as how to separate signals
from sequencing bias and how to perform reasonable normalization. Here, we focus on a fundamental
question in RNA-seq analysis: the distribution of the position-level read counts. Specifically, we propose a
two-parameter generalized Poisson (GP) model to the position-level read counts. We show that the GP
model fits the data much better than the traditional Poisson model. Based on the GP model, we can better
estimate gene or exon expression, perform a more reasonable normalization across different samples, and
improve the identification of differentially expressed genes and the identification of differentially spliced
exons. The usefulness of the GP model is demonstrated by applications to multiple RNA-seq data sets.
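The abstract does not spell out the distribution. Assuming it refers to the generalized Poisson of Consul (1989), with parameters θ > 0 and 0 ≤ λ < 1, the sketch below evaluates its probability mass function and checks the mean θ/(1-λ) and variance θ/(1-λ)³ numerically. The function names are illustrative and are not taken from the GPseq package.

```python
import math

def gp_log_pmf(x, theta, lam):
    """Log pmf of the generalized Poisson (Consul form), theta > 0 and 0 <= lam < 1."""
    return (math.log(theta) + (x - 1) * math.log(theta + x * lam)
            - theta - x * lam - math.lgamma(x + 1))

def gp_moments(theta, lam, upper=2000):
    """Total mass, mean and variance by summing over 0..upper (tail assumed negligible)."""
    probs = [math.exp(gp_log_pmf(x, theta, lam)) for x in range(upper + 1)]
    total = sum(probs)
    mean = sum(x * p for x, p in enumerate(probs))
    var = sum((x - mean) ** 2 * p for x, p in enumerate(probs))
    return total, mean, var

total, mean, var = gp_moments(theta=3.0, lam=0.3)
# Theory: mean = theta / (1 - lam), variance = theta / (1 - lam) ** 3
```

Setting λ = 0 gives back the ordinary Poisson distribution, and λ > 0 yields a variance larger than the mean.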
References
Consul, P. C. (1989) Generalized Poisson Distributions: Properties and Applications. New York:
Marcel Dekker.
Sudeep Srivastava, Liang Chen. A two-parameter generalized Poisson model to improve the
analysis of RNA-Seq data. Nucleic Acids Research, Advance Access published July 29, 2010.
doi: 10.1093/nar/gkq670
Remark
… Proposes a specific normalization based on per-position read counts
- baySeq (included in Bioconductor)
“Empirical Bayesian analysis of patterns of differential expression in count data”
Introduction
We assume that we have discrete data from a set of sequencing or other high-throughput experiments,
arranged in a matrix such that each column describes a sample and each row describes some entity for
which counts exist. For example, the rows may correspond to the different sequences observed in a
sequencing experiment. The data then consists of the number of times each sequence is observed in each
sample. We wish to determine which, if any, rows of the data correspond to some patterns of differential
expression across the samples. This problem has been addressed for pairwise differential expression by the
edgeR [2] package.
However, baySeq takes an alternative approach to analysis that allows more complicated patterns of
differential expression than simple pairwise comparison, and thus is able to cope with more complex
experimental designs. We also observe that the methods implemented in baySeq perform at least as well,
and in some circumstances considerably better than those implemented in edgeR [1].
baySeq uses empirical Bayesian methods to estimate the posterior likelihoods of each of a set of models that
define patterns of differential expression for each row. This approach begins by considering a distribution for
the row defined by a set of underlying parameters for which some prior distribution exists. By estimating this
prior distribution
from the data, we are able to assess, for a given model about the relatedness of our underlying parameters
for multiple libraries, the posterior likelihood of the model.
In forming a set of models upon the data, we consider which patterns are biologically likely to occur in the
data. For example, suppose we have count data from some organism in condition A and condition B.
Suppose further that we have two biological replicates for each condition, and hence four libraries
A1;A2;B1;B2, where A1, A2 and B1, B2 are the replicates. It is reasonable to suppose that at least some of
the rows may be unaffected by our experimental conditions A and B, and the count data for each sample in
these rows will be equivalent. These data need not in general be identical across each sample due to
random effects and different library sizes, but they will share the same underlying parameters. However,
some of the rows may be influenced by the different experimental conditions A and B. The count data for the
samples A1 and A2 will then be equivalent, as will the count data for the samples B1 and B2. However, the
count data
between samples A1; A2; B1; B2 will not be equivalent. For such a row, the data from samples A1 and A2
will then share the same set of underlying parameters, the data from samples B1 and B2 will share the same
set of underlying parameters, but, crucially, the two sets will not be identical.
Our task is thus to determine the posterior likelihood of each model for each row of the data. We can do this
by considering either a Poisson or negative-binomial distribution upon the sequencing count data. The
Poisson method is considerably faster as a closed form conjugate prior exists for this distribution.
The negative-binomial solution is slower as it requires a numerical solution for the prior, but is probably a
better fit for most data. In experimental data, we have found that the Poisson method is likely to give poor
results if true biological replicates are not available; in most human studies, for example. In general,
therefore, the use of the negative-binomial methods is recommended.
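The Poisson case mentioned above can be illustrated directly, because the Gamma prior is conjugate and the marginal likelihood of each model has a closed form. The sketch below compares a "no differential expression" model (one shared rate) with a "differential expression" model (one rate per condition) for a single row. It is not the baySeq implementation: the Gamma parameters `a` and `b` and the prior model probability are fixed by hand here, whereas baySeq estimates the prior from the data, and all function names are illustrative.

```python
import math

def log_marginal(counts, lib_sizes, a, b):
    """Log marginal likelihood of x_j ~ Poisson(lib_j * theta), theta ~ Gamma(shape=a, rate=b)."""
    s = sum(counts)
    total_lib = sum(lib_sizes)
    data_term = sum(x * math.log(l) - math.lgamma(x + 1) for x, l in zip(counts, lib_sizes))
    return (data_term + a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + s) - (a + s) * math.log(b + total_lib))

def posterior_de(counts, lib_sizes, groups, a=1.0, b=0.05, prior_de=0.5):
    """Posterior probability of the DE model (one rate per group) against no-DE (shared rate)."""
    log_nde = log_marginal(counts, lib_sizes, a, b) + math.log(1.0 - prior_de)
    log_de = math.log(prior_de)
    for g in set(groups):
        idx = [j for j, label in enumerate(groups) if label == g]
        log_de += log_marginal([counts[j] for j in idx], [lib_sizes[j] for j in idx], a, b)
    top = max(log_nde, log_de)   # subtract the maximum for numerical stability
    return math.exp(log_de - top) / (math.exp(log_de - top) + math.exp(log_nde - top))

groups = ["A", "A", "B", "B"]          # libraries A1, A2, B1, B2
libs = [1.0, 1.0, 1.0, 1.0]            # equal library sizes, for simplicity
p_changed = posterior_de([10, 12, 50, 55], libs, groups)
p_stable = posterior_de([10, 12, 11, 9], libs, groups)
```

A row whose counts differ clearly between A and B gets a posterior probability of differential expression close to 1, while a row with similar counts in all four libraries favours the shared-rate model.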
Reference
Thomas J. Hardcastle and Krystyna A. Kelly. baySeq: Empirical Bayesian Methods For Identifying
Differential Expression In Sequence Count Data. BMC Bioinformatics (2010)
Remark
No built-in normalization
- NBPseq (release in progress)
Details
Overview: For assessing evidence for differential gene expression from RNA-Seq read counts, it is critical to
adequately model the count variability between independent biological replicates.
Negative binomial (NB) distribution offers a more realistic model for RNA-Seq count variability than Poisson
distribution and still permits an exact (non-asymptotic) test for comparing two groups.
For each individual gene, a NB distribution uses a dispersion parameter φ_i to model the extra-Poisson
variation between biological replicates. Across all genes, the NBP parameterization of the NB distribution (the
NBP model) uses two parameters (φ, α) to model extra-Poisson variation over the entire range of expression
levels. The NBP model allows the NB dispersion parameter to be an arbitrary power function of the mean.
The NBP model includes the Poisson model as a limiting case (as φ tends to 0) and the NB2 model as a
special case (when α = 2).
Under the NB2 model, the dispersion parameter is a constant and does not vary with the mean expression
levels. NBP model is more flexible and is the recommended default option.
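The relationship between the NBP, NB2 and Poisson models can be made concrete with the variance function. The sketch assumes the parameterization Var = μ + φ·μ^α, so that the NB dispersion is φ·μ^(α-2). The symbols φ and α and the function names are assumptions made for illustration and are not calls into the NBPSeq package.

```python
def nbp_variance(mu, phi, alpha):
    """Variance of a count with mean mu under the assumed NBP form mu + phi * mu**alpha."""
    return mu + phi * mu ** alpha

def nbp_dispersion(mu, phi, alpha):
    """NB dispersion implied at mean mu, i.e. d such that variance = mu + d * mu**2."""
    return phi * mu ** (alpha - 2)

# NB2 special case (alpha = 2): the dispersion is the same at every expression level.
nb2_low, nb2_high = nbp_dispersion(10.0, 0.1, 2), nbp_dispersion(1000.0, 0.1, 2)

# alpha < 2: the dispersion decreases as the mean expression increases.
nbp_low, nbp_high = nbp_dispersion(10.0, 0.1, 1.5), nbp_dispersion(1000.0, 0.1, 1.5)

# Poisson limit (phi = 0): the variance equals the mean.
poisson_var = nbp_variance(10.0, 0.0, 1.5)
```

Under this form, a value of α below 2 lets highly expressed genes have a smaller dispersion than weakly expressed ones, which the NB2 model cannot represent.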
Count Normalization: We take gene expression to be indicated by relative frequency of RNASeq reads
mapped to a gene, relative to library sizes (column sums of the count matrix). Since the relative frequencies
sum to 1 in each library (one column of the count matrix), the increased relative frequencies of truly over
expressed genes in each column must be accompanied by decreased relative frequencies of other genes,
even when those others do not truly differently express. Robinson and Oshlack (2010) presented examples
where this problem is noticeable.
A simple fix is to compute the relative frequencies relative to effective library sizes—library sizes multiplied by
normalization factors. By default, nbp.test assumes the normalization factors are 1 (i.e. no normalization is
needed). Users can specify normalization factors through the argument norm.factors. Many authors
(Robinson and Oshlack (2010), Anders and Huber (2010)) propose to estimate the normalization factors
based on the assumption that most genes are NOT differentially expressed.
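The composition effect and its fix can be shown on a toy matrix. In the sketch below only the third gene is truly over-expressed in the second library, yet the plain relative frequencies of the other two genes drop. Dividing by effective library sizes removes that artefact. The function name is illustrative, and the normalization factor 0.6 is chosen by hand for this example rather than estimated by any of the cited methods.

```python
def relative_frequencies(counts, norm_factors=None):
    """Counts divided by effective library sizes (column sums times normalization factors)."""
    n_samples = len(counts[0])
    if norm_factors is None:
        norm_factors = [1.0] * n_samples          # default: no normalization
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    effective = [size * f for size, f in zip(lib_sizes, norm_factors)]
    return [[row[j] / effective[j] for j in range(n_samples)] for row in counts]

# Three genes, two libraries; only the third gene differs between libraries.
counts = [[100, 100], [100, 100], [100, 300]]
naive = relative_frequencies(counts)                  # library sizes 300 and 500
adjusted = relative_frequencies(counts, [1.0, 0.6])   # effective sizes 300 and 300
```

With the adjustment the first two genes have the same relative frequency in both libraries, and only the third gene shows a change.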
Library Size Adjustment: The exact test requires that the effective library sizes (column sums of the count
matrix multiplied by normalization factors) are approximately equal. By default, nbp.test will thin
(downsample) the counts to make the effective library sizes equal. Thinning may lose statistical efficiency,
but is unlikely to introduce bias.
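Thinning can be sketched as binomial downsampling: every read of library j is kept independently with probability equal to the smallest effective library size divided by the effective size of library j. This is an illustration of the idea under that assumption, not the routine used by nbp.test, and the function name and seed argument are hypothetical.

```python
import random

def thin_counts(counts, norm_factors, seed=0):
    """Downsample each column so that all effective library sizes match the smallest one."""
    rng = random.Random(seed)                       # fixed seed for reproducibility
    n_samples = len(counts[0])
    effective = [sum(row[j] for row in counts) * norm_factors[j] for j in range(n_samples)]
    target = min(effective)
    keep = [target / e for e in effective]          # per-library retention probability
    # Each read is kept independently with probability keep[j] (binomial thinning).
    return [[sum(rng.random() < keep[j] for _ in range(row[j])) for j in range(n_samples)]
            for row in counts]

counts = [[50, 100], [30, 60], [20, 40]]            # second library is twice as deep
thinned = thin_counts(counts, norm_factors=[1.0, 1.0])
```

The smallest library is left untouched, and the counts of the deeper library are reduced at random, which discards some information but treats all genes in the same way.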
Reference :
Di, Y, D. W. Schafer, J. S. Cumbie, and J. H. Chang: "The NBP Negative Binomial Model for
Assessing Differential Gene Expression from RNA-Seq", SAGMB, accepted.

- Samseqr (release in progress: the counterpart of SAM for microarrays, by the same authors …)
Description
This package implements a method for normalization, testing, and false discovery rate estimation for
RNA-sequencing data. We estimate the sequencing depths of the experiments using a new method based on
Poisson goodness-of-fit statistic, calculate a score statistic on the basis of a Poisson log-linear model, and
then estimate the false discovery rate using a modified version of permutation plug-in method.
Reference :
Li J, Witten DM, Johnstone I, Tibshirani R (2011). Normalization, testing, and false discovery rate
estimation for RNA-sequencing data. Submitted.