Some examples of statistical inference in genomics

Peter J. Bickel
Department of Statistics, University of California at Berkeley, USA
Joint work with Ben Brown, Haiyan Huang, Nancy Zhang, Nathan Boley, Jessica Li, and the ENCODE Consortium

Outline
The ENCODE Project
The first question: Testing the hypothesis of lack of association between two features of the genome
a) Modeling issues
b) A minimal nonparametric model
c) Theory and practical applications of our nonparametric view
The second question: Determining the reliability of genomic features derived by different algorithms from ChIP-seq and other assays
a) The method is based on consistency of biological replicates, since ground truth is rarely, if ever, available
b) A curve, a copula model, and an analogue of the False Discovery Rate

The ENCODE Project
The ENCODE Project Consortium (2004). "The ENCODE (ENCyclopedia Of DNA Elements) Project". Science 306 (5696), 22 October.

The Genome Structural Correction

References for Part I
Peter J. Bickel, Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R. Zhang (2010). "Subsampling methods for genomic inference". Annals of Applied Statistics 4 (4): 1660-1697.
The ENCODE Project Consortium (2012). "Initial Analysis of the Encyclopedia of DNA Elements in the Human Genome". Nature, in press.
Gerstein et al. (2010). "Integrative Analysis of the Caenorhabditis elegans Genome by the modENCODE Project". Science 330 (6012): 1775-1787.
Birney E, et al. (2007). "Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project". Nature 447: 799-816.
Margulies EH, et al. (2007). "Analysis of deep mammalian sequence alignments and constraint predictions for 1% of the human genome". Genome Research 17: 760-774.

Association of functional annotations in the Human Genome
[Figure: Transcription Start Sites (TSSs) and GENCODE exons marked on both strands, 5' to 3' and 3' to 5']
The ENCODE Consortium found that many Transcription Start Sites are anti-sense to GENCODE exons
They also found vastly more TSSs than previously supposed
Is the association between TSSs and exons in the anti-sense direction real, or experimental noise in TSS identification?

Association of experimental annotations across whole chromosomes
Do two factors tend to bind together more closely or more often than other pairs of factors?
Does a factor's binding site relative to TSSs tend to change across genomic regions?

Feature Overlap: the question
A mathematical question arises:
[Figure: transcription fragments and conserved sequence marked along a 5' to 3' stretch of the genome]
Do these features overlap more, or less, than "expected at random"?

Our formulation
Defining "expectation" and "at random":
The genome is highly structured
Analysis of feature inter-dependence must account for superficial structure
"Expected at random" becomes: overlap between two feature sets bearing structure, under no biological constraints

Naïve Method
Treating bases as being independent with the same distribution (ordinary bootstrap)
Hypothesis: Feature markings are independent
Specific object: test based on % Feature Overlap - (% Feature 1)(% Feature 2) and standard statistics
Why naïve? Bases are NOT independent
Better method: keeping one type of feature fixed and simulating moving the start site of another feature uniformly (feature bootstrap)
Why still a problem? Even if feature occurrences are independent functionally, there can be clumping caused by the complex underlying genome sequence structure (i.e.
inhomogeneity, local sequence dependence)

A nonparametric model
Requirements:
a) It should roughly reflect known statistics of the genome
b) It should encompass the methods listed
c) It should be possible to do inference, tests, and set confidence bounds meaningfully

Segmented Stationary Model
Let $X_i$ = base at position $i$, $i = 1,\dots,n$, with
$(X_1,\dots,X_n) = (X_{11},\dots,X_{1n_1},\dots,X_{r1},\dots,X_{rn_r})$, $n = n_1 + \dots + n_r$,
such that for each $k = 1,\dots,r$, the segment $X_{k1},\dots,X_{kn_k}$ is:
Stationary (homogeneity within blocks)
Mixing (bases at distant positions are nearly independent)
and $r \ll n$
[Figure: positions 1, …, n divided into r segments of lengths $n_1, n_2, \dots, n_{r-1}, n_r$]

Empirical Interpretations
Within a segment:
For k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment
Knowledge of the first k-mer does not help in predicting a distant k-mer
Remark: If this model holds, it also applies to derived local features, e.g. $\{I_1,\dots,I_n\}$ where $I_k = 1$ if position $k$ belongs to a binding site for a given factor

Using our model for inference
Many genomic statistics are functions of one or more sums S of per-position terms, e.g.
a term that is 1 or 0 depending on the presence or absence of a feature or features
When the summands are small compared to S: Gaussian case
Example: Region overlap for common features, or rare features over large regions
Under segmented stationarity, these distributions can be estimated from the data

Some theory
Theorem 1: Segmented stationarity, exponential mixing, and fraction of short segments → 0 together imply asymptotic normality of linear statistics
Theorem 2: If the ordinary stationary bootstrap (Politis/Romano) is used under suitable conditions on L, and different stationary segments are present, then the asymptotic bootstrap distribution is heavier tailed than a Gaussian of the same variance
Theorem 3: If the true segmentation is estimated in an approximately consistent way, then, for approximately linear statistics, the resulting segmented bootstrap is consistent
By the delta method, Gaussianity holds for smooth functions of vectors of linear statistics, and the segmented bootstrap and the previous theorems carry over as well

Distributions of feature overlaps

The Block Bootstrap
We can't observe independent occurrences of ENCODE regions, but if our hypothesis of segmented stationarity holds, then the distribution of sum statistics and their functions can be approximated as follows

Block Bootstrap for r = 1
Algorithm 4.1:
1) Given $L \ll n$, choose a number $N$ uniformly at random from $\{1,\dots,n-L\}$
2) Given the statistic $T_n(X_1,\dots,X_n)$, under the assumption that $X_1,\dots,X_n$ is stationary, compute $T_L^* = T_L(X_{N+1},\dots,X_{N+L})$
3) Repeat B times to obtain $T_{L1}^*,\dots,T_{LB}^*$
4) Estimate the distribution of the statistic by the empirical distribution of $T_{L1}^*,\dots,T_{LB}^*$ (appropriately centered and scaled)
By Theorem 4.2.1 of Politis, Romano and Wolf (1999), this is asymptotically valid

Block Bootstrap Animation, r = 1
Observed sequence: X
Statistic: S = f(X)
Draw a block of length L from the original sequence; this is the block-bootstrapped sequence X*
Calculate the statistic on the block-bootstrapped sequence
Repeat this procedure identically B times
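
The block-drawing steps for r = 1 can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the code used for ENCODE: the function name block_subsample, the overlap_fraction statistic, and the simulated feature tracks are hypothetical, and the centering and scaling of the draws is left out.

```python
import numpy as np


def block_subsample(x, statistic, block_len, n_draws, rng=None):
    """Evaluate `statistic` on n_draws random contiguous blocks of x.

    Hypothetical helper: follows the slide's indexing, where N is uniform
    on {1, ..., n - L} and the block is X_{N+1}, ..., X_{N+L}.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x)
    n = len(x)
    if not 0 < block_len < n:
        raise ValueError("block_len must satisfy 0 < block_len < n")
    # Upper bound is exclusive, so N ranges over 1, ..., n - block_len.
    starts = rng.integers(1, n - block_len + 1, size=n_draws)
    # With 0-based arrays, x[N:N + L] holds X_{N+1}, ..., X_{N+L}.
    return np.array([statistic(x[s:s + block_len]) for s in starts])


def overlap_fraction(block):
    """Fraction of positions in the block covered by both features."""
    return float(np.mean(block[:, 0] * block[:, 1]))


# Simulated data (an assumption for illustration only): two independent
# 0/1 feature tracks, one column per feature, each covering ~30% of bases.
sim_rng = np.random.default_rng(0)
tracks = (sim_rng.random((20000, 2)) < 0.3).astype(int)

draws = block_subsample(tracks, overlap_fraction,
                        block_len=2000, n_draws=500, rng=1)
print("mean of block statistics:", draws.mean())
print("sd of block statistics:  ", draws.std(ddof=1))
```

The array draws plays the role of $T_{L1}^*,\dots,T_{LB}^*$; a histogram or normal QQ plot of it gives the kind of display discussed on the following slides.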

Observing the distributions
Block bootstrap distribution of the Region Overlap Statistic, shown here with the PDF of the normal distribution with the same mean and variance
The histogram of the block bootstrap statistic is approximately the same as the density of the matching normal distribution
[Figure: QQ plot of the block bootstrap (BB) distribution vs. standard normal]

What if r > 1
The estimated distribution is always heavier tailed, leading to conservative p-values
But it can be enormously so if the segment means of the statistic differ substantially
Less so, but still meaningfully, if the means agree but the variances differ

Solutions
1) Segment using biological knowledge
Essentially done in ENCODE: poor segmentation occasionally led to non-Gaussian distributions (excessively conservative)
2) Segment using a particular linear statistic which we expect to identify homogeneous segments

Block Bootstrap given Segmentation
1. Draw a subsample of length L, made up of blocks of lengths $f_1 L$, $f_2 L$, $f_3 L$ from the three segments shown: $X^*$
2. Compute the statistic on the subsample: $T(X^*)$
3. Do this B times: $T(X_1^*),\dots,T(X_B^*)$
[Figure panels: True distribution; Uniform Start Site Shuffling; Block Bootstrap without Segmentation; Block Bootstrap with True Segmentation; Block Bootstrap with Estimated Segmentation]

Testing Association
Question: How do we estimate the null distribution given only data for which we believe the null is false?

Testing Association (bp overlap)
[Figure: observed sequence with Feature 1 and Feature 2 marked]
Sample two blocks of equal length
Align Feature 1 of the first block with Feature 2 of the second block, and vice versa
Calculate the overlap in the blocks after swapping: $(X_1)(Y_2) + (X_2)(Y_1)$
Statistic is: $(X_1)(Y_2) + (X_2)(Y_1)$, properly normalized and set to mean 0
Under the null hypothesis of independence, this should be Gaussian

Correlating DNA copy number variation with genomic content
Redon et al (2007) claimed that copy number variant regions (CNVs) are significantly (p < 0.05 with multiple testing correction) negatively correlated with coding regions, i.e. have less overlap than expected at random.
Their analysis is based on random shufflings of start positions.
Our analysis suggests that the effect is probably an artifact.

Issues
Choosing a method of segmentation, e.g. dyadic, and its tuning parameters
Block size for bootstrap using: stability, segmentation

Test Statistic
$H_0$: Features not associated in each segment (so-called "dummy overlap")
Then the normalized overlap statistic has a Gaussian distribution.
We form the test statistic from the following per-segment quantities:
Length of segment i divided by n
% of basepairs in segment i identified as Feature 1
% of basepairs in segment i identified as Feature 2

Measuring reproducibility of high-throughput experiments
Qunhua Li, James B. Brown, Haiyan Huang, and Peter J. Bickel
Annals of Applied Statistics, Volume 5, Number 3 (2011), 1752-1779.

A consistency measure

Our method of fitting

Nancy Zhang, Jessica Li, Ben Brown, Qunhua Li, Anshul Kundaje, Haiyan Huang, Nathan Boley, Peter Bickel