R packages proposed for the differential analysis of high-throughput sequencing data involving replicates

Notes:
1) Some packages build in a data normalization step that precedes the differential analysis and that can be difficult to bypass.
2) The main normalization techniques are detailed in the .ppt document.

________________________________ Package edgeR _____________________________________

Principle:
edgeR assumes that the number X of reads associated with gene i and sample j (cDNA, RNA, ...) under condition k follows a negative binomial distribution, which can be written:

$X_{ijk} \sim \mathrm{NB}(\mu_{ijk}, \phi)$

where the $X_{ijk}$ are assumed i.i.d., with

$E(X_{ijk}) = \mu_{ijk}$
$\mathrm{Var}(X_{ijk}) = \mu_{ijk}\,(1 + \mu_{ijk}\,\phi)$

and where $\phi$ is an overdispersion parameter to be estimated. If $\phi = 0$, the model reduces to a Poisson distribution.

The mean $\mu_{ijk}$ is assumed to factor as:

$\mu_{ijk} = \lambda_{ik}\, N_j$

with $\lambda_{ik}$ the relative abundance of gene i under condition k and $N_j$ the library size of sample j.

The estimation of $\phi$ combines two quantities:
- the log-likelihood of $\phi$ for gene i: $l_i(\phi)$
- the common log-likelihood of $\phi$: $l_c(\phi) = \sum_i l_i(\phi)$

The gene-wise dispersion $\phi_i$ is then estimated by maximizing a weighted likelihood:

$WL(\phi_i) = l_i(\phi_i) + \alpha\, l_c(\phi_i)$

Two options are implemented for this estimation (the choice is left to the user):
- "tag-wise dispersion": $\alpha = 0$
- "tag-wise + common dispersion": $\alpha > 0$

References:
Robinson MD and Smyth GK. Moderated statistical tests for assessing differences in tag abundance. Bioinformatics (2007) 23(21): 2881-2887
Robinson MD and Smyth GK. Small-sample estimation of negative binomial dispersion, with applications to SAGE data. Biostatistics (2008) 9(2): 321-332
Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics (2009)
Robinson MD and Oshlack A.
A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology (2010) 11: R25

Remarks:
1) "The counterpart of the limma test, adapted to sequencing data (cf. the moderated t-test by the same author)"
2) "TMM normalization is implemented in the package"

______________________________ Package DESeq _______________________________________

Principle: algorithm of Anders & Huber

Let $X_{ij}$ be the number of reads corresponding to gene i and sample j. We assume:

$X_{ij} \sim \mathrm{NB}(\mu_{ij}, \sigma^2_{ij})$

Idea: the biological variance of a gene is a smooth function of its expression level in the condition considered.

Three assumptions:
1) The expected count is the product of a term depending on the gene and its condition and a term depending on the sample:
$E[X_{ij}] = \mu_{ij} = q_{i,\rho(j)}\, s_j$
where $\rho(j)$ is the condition of sample j and $s_j$ represents the coverage (sequencing depth) of library j.
2) Overall variance of the gene = shot noise (technical variance) + raw variance (biological replicates):
$\sigma^2_{ij} = \mu_{ij} + s_j^2\, \nu_{i,\rho(j)}$
3) The variance parameter $\nu_{i,\rho(j)}$ of the gene is a smooth function of its expression level:
$\nu_{i,\rho(j)} = \nu_{\rho}(q_{i,\rho(j)})$
It is estimated by pooling the data of genes with similar expression levels (shrinkage).

In practice ...

Size factors:
$\hat{s}_j = \operatorname{median}_i \dfrac{X_{ij}}{\left(\prod_{v=1}^{m} X_{iv}\right)^{1/m}}$
with m the number of samples.

Overall expression (baseMean):
$\mathrm{baseMean} = \dfrac{1}{n} \sum_{j=1}^{n} \dfrac{X_{ij}}{\hat{s}_j}$

Fold change:
$FC = \dfrac{\mathrm{baseMean}_B}{\mathrm{baseMean}_A}$

Variance estimation:
1) All computations are carried out after excluding genes with at least one zero expression value.
2) The model estimates one function $\nu_{\rho}$ per condition $\rho$.
3) The function varianceFitDiagnostics is used to check that the estimated variance is not too far from the empirical variance of the $q_{i,\rho(j)}$.

References:
Anders S and Huber W. Differential expression analysis for sequence count data. Nature Precedings (2010)
In Bioconductor: Analysing RNA-Seq data with the "DESeq" package (Anders S.)
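The size-factor, baseMean and fold-change computations described above for DESeq can be sketched in a few lines. This is a minimal illustration in plain Python, not the package's own code: the function names (size_factors, base_mean, nb_variance) and the toy count matrix are assumptions made for the example, and the smooth variance function of assumption 3 is not fitted here (a value of nu is simply passed in).

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample (column).

    counts: list of genes, each a list of per-sample read counts.
    Genes with at least one zero count are skipped, as stated in the notes.
    """
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) <= 0:
            continue
        # Geometric mean of the gene across all samples, computed in log space.
        geo_mean = math.exp(sum(math.log(x) for x in row) / n_samples)
        for j, x in enumerate(row):
            ratios[j].append(x / geo_mean)
    return [median(r) for r in ratios]


def base_mean(row, s, samples):
    """Mean of the size-factor-scaled counts of one gene over the given samples."""
    return sum(row[j] / s[j] for j in samples) / len(samples)


def nb_variance(mu, s_j, nu):
    """Assumption 2: shot noise (mu) plus scaled raw variance (s_j^2 * nu)."""
    return mu + s_j ** 2 * nu


# Hypothetical toy data: 3 genes, samples 0-1 in condition A, 2-3 in condition B.
# Samples 1 and 3 are sequenced twice as deep; gene 2 is 4 times higher in B.
counts = [
    [10, 20, 10, 20],
    [30, 60, 30, 60],
    [5, 10, 20, 40],
]
s = size_factors(counts)
cond_a, cond_b = [0, 1], [2, 3]
for i, row in enumerate(counts):
    fc = base_mean(row, s, cond_b) / base_mean(row, s, cond_a)
    print(f"gene {i}: fold change B/A = {fc:.3f}")
```

On this toy matrix the depth difference between samples is absorbed by the size factors, so only the third gene shows a fold change different from 1.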
___________________________________________________________________________________________

Other packages in brief ...

GPseq

"Using the generalized Poisson distribution to model sequence read counts from high throughput sequencing experiments"

Abstract

Deep sequencing of RNAs (RNA-seq) has been a useful tool to characterize and quantify transcriptomes. However, there are significant challenges in the analysis of RNA-seq data, such as how to separate signals from sequencing bias and how to perform reasonable normalization. Here, we focus on a fundamental question in RNA-seq analysis: the distribution of the position-level read counts. Specifically, we propose a two-parameter generalized Poisson (GP) model to the position-level read counts. We show that the GP model fits the data much better than the traditional Poisson model. Based on the GP model, we can better estimate gene or exon expression, perform a more reasonable normalization across different samples, and improve the identification of differentially expressed genes and the identification of differentially spliced exons. The usefulness of the GP model is demonstrated by applications to multiple RNA-seq data sets.

References

Consul PC (1989). Generalized Poisson Distributions: Properties and Applications. New York: Marcel Dekker.
Srivastava S and Chen L. A two-parameter generalized Poisson model to improve the analysis of RNA-Seq data. Nucleic Acids Research, Advance Access published July 29, 2010. doi: 10.1093/nar/gkq670

Remark: proposes a specific normalization based on per-position read counts.

baySeq (included in Bioconductor)

"Empirical Bayesian analysis of patterns of differential expression in count data"

Introduction

We assume that we have discrete data from a set of sequencing or other high-throughput experiments, arranged in a matrix such that each column describes a sample and each row describes some entity for which counts exist.
For example, the rows may correspond to the different sequences observed in a sequencing experiment. The data then consists of the number of times each sequence is observed in each sample. We wish to determine which, if any, rows of the data correspond to some patterns of differential expression across the samples. This problem has been addressed for pairwise differential expression by the edgeR [2] package. However, baySeq takes an alternative approach to analysis that allows more complicated patterns of differential expression than simple pairwise comparison, and thus is able to cope with more complex experimental designs. We also observe that the methods implemented in baySeq perform at least as well as, and in some circumstances considerably better than, those implemented in edgeR [1].

baySeq uses empirical Bayesian methods to estimate the posterior likelihoods of each of a set of models that define patterns of differential expression for each row. This approach begins by considering a distribution for the row defined by a set of underlying parameters for which some prior distribution exists. By estimating this prior distribution from the data, we are able to assess, for a given model about the relatedness of our underlying parameters for multiple libraries, the posterior likelihood of the model.

In forming a set of models upon the data, we consider which patterns are biologically likely to occur in the data. For example, suppose we have count data from some organism in condition A and condition B. Suppose further that we have two biological replicates for each condition, and hence four libraries A1, A2, B1, B2, where A1, A2 and B1, B2 are the replicates. It is reasonable to suppose that at least some of the rows may be unaffected by our experimental conditions A and B, and the count data for each sample in these rows will be equivalent.
These data need not in general be identical across each sample due to random effects and different library sizes, but they will share the same underlying parameters. However, some of the rows may be influenced by the different experimental conditions A and B. The count data for the samples A1 and A2 will then be equivalent, as will the count data for the samples B1 and B2. However, the count data between samples A1, A2, B1, B2 will not be equivalent. For such a row, the data from samples A1 and A2 will then share the same set of underlying parameters, the data from samples B1 and B2 will share the same set of underlying parameters, but, crucially, the two sets will not be identical.

Our task is thus to determine the posterior likelihood of each model for each row of the data. We can do this by considering either a Poisson or negative-binomial distribution upon the sequencing count data. The Poisson method is considerably faster as a closed form conjugate prior exists for this distribution. The negative-binomial solution is slower as it requires a numerical solution for the prior, but is probably a better fit for most data. In experimental data, we have found that the Poisson method is likely to give poor results if true biological replicates are not available; in most human studies, for example. In general, therefore, the use of the negative-binomial methods is recommended.

Reference

Hardcastle TJ and Kelly KA. baySeq: Empirical Bayesian Methods For Identifying Differential Expression In Sequence Count Data. BMC Bioinformatics (2010)

Remark: no built-in normalization.

NBPseq (release in progress)

Details

Overview: For assessing evidence for differential gene expression from RNA-Seq read counts, it is critical to adequately model the count variability between independent biological replicates.
Negative binomial (NB) distribution offers a more realistic model for RNA-Seq count variability than Poisson distribution and still permits an exact (non-asymptotic) test for comparing two groups. For each individual gene, a NB distribution uses a dispersion parameter φ_i to model the extra-Poisson variation between biological replicates. Across all genes, the NBP parameterization of the NB distribution (the NBP model) uses two parameters (φ, α) to model extra-Poisson variation over the entire range of expression levels. The NBP model allows the NB dispersion parameter to be an arbitrary power function of the mean. The NBP model includes the Poisson model as a limiting case (as φ tends to 0) and the NB2 model as a special case (when α = 2). Under the NB2 model, the dispersion parameter is a constant and does not vary with the mean expression levels. The NBP model is more flexible and is the recommended default option.

Count Normalization: We take gene expression to be indicated by the relative frequency of RNA-Seq reads mapped to a gene, relative to library sizes (column sums of the count matrix). Since the relative frequencies sum to 1 in each library (one column of the count matrix), the increased relative frequencies of truly overexpressed genes in each column must be accompanied by decreased relative frequencies of other genes, even when those other genes are not truly differentially expressed. Robinson and Oshlack (2010) presented examples where this problem is noticeable. A simple fix is to compute the relative frequencies relative to effective library sizes, that is, library sizes multiplied by normalization factors. By default, nbp.test assumes the normalization factors are 1 (i.e. no normalization is needed). Users can specify normalization factors through the argument norm.factors. Many authors (Robinson and Oshlack (2010), Anders and Huber (2010)) propose to estimate the normalization factors based on the assumption that most genes are NOT differentially expressed.
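The composition effect described in the Count Normalization paragraph, and the fix based on effective library sizes, can be illustrated with a small sketch. This is plain Python written for the example, not NBPseq code: the function names, the toy counts and the hand-picked normalization factors are all assumptions, and no method for estimating the factors is implemented.

```python
def effective_library_sizes(counts, norm_factors=None):
    """Column sums of the count matrix multiplied by normalization factors.

    counts: list of genes, each a list of per-sample read counts.
    norm_factors: one factor per sample; defaults to 1 (no normalization).
    """
    n_samples = len(counts[0])
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    lib_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    return [size * f for size, f in zip(lib_sizes, norm_factors)]


def relative_frequencies(counts, norm_factors=None):
    """Counts divided by the effective library size of their sample."""
    eff = effective_library_sizes(counts, norm_factors)
    return [[x / eff[j] for j, x in enumerate(row)] for row in counts]


# Hypothetical toy data: genes 0 and 1 have identical counts in both samples,
# gene 2 is strongly overexpressed in sample 1 and inflates its library size.
counts = [
    [100, 100],
    [100, 100],
    [100, 700],
]

raw = relative_frequencies(counts)
# Hand-chosen factors (an assumption) that bring both effective sizes to 300.
adjusted = relative_frequencies(counts, norm_factors=[1.0, 1.0 / 3.0])

print("gene 0, raw frequencies:     ", raw[0])
print("gene 0, adjusted frequencies:", adjusted[0])
```

With factors of 1, gene 0 appears to drop threefold between the two samples although its count is unchanged; with the adjusted factors its two relative frequencies agree.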
Library Size Adjustment: The exact test requires that the effective library sizes (column sums of the count matrix multiplied by normalization factors) are approximately equal. By default, nbp.test will thin (downsample) the counts to make the effective library sizes equal. Thinning may lose statistical efficiency, but is unlikely to introduce bias.

Reference: Di Y, Schafer DW, Cumbie JS, and Chang JH. "The NBP Negative Binomial Model for Assessing Differential Gene Expression from RNA-Seq", SAGMB, accepted.

Samseqr (release in progress: the counterpart of SAM for microarrays, by the same authors ...)

Description

This package implements a method for normalization, testing, and false discovery rate estimation for RNA-sequencing data. We estimate the sequencing depths of the experiments using a new method based on a Poisson goodness-of-fit statistic, calculate a score statistic on the basis of a Poisson log-linear model, and then estimate the false discovery rate using a modified version of the permutation plug-in method.

Reference: Li J, Witten DM, Johnstone I, Tibshirani R (2011). Normalization, testing, and false discovery rate estimation for RNA-sequencing data. Submitted.