Lecture Slides - Department of Statistics and Probability

Some examples of statistical
inference in genomics
Peter J. Bickel
Department of Statistics
University of California at Berkeley, USA
Joint work with Ben Brown, Haiyan Huang, Nancy Zhang,
Nathan Boley, Jessica Li, and the ENCODE Consortium
Outline

The ENCODE Project

The first question: Testing the hypothesis of lack of association between two
features of the genome


a) Modeling issues

b) A minimal nonparametric model

c) Theory and practical applications of our nonparametric view
The second question: Determining the reliability of genomic features derived by
different algorithms from ChIP-seq and other assays

a) The method is based on consistency of biological replicates since ground
truth is rarely, if ever, available

b) A curve, a copula model, and an analogue of the False Discovery Rate
The ENCODE Project
The ENCODE Project Consortium. "The ENCODE (ENCyclopedia Of DNA
Elements) Project". 2004. Science 306 (5696): 636-640.
The Genome Structural Correction
References for Part I

Peter J. Bickel, Nathan Boley, James B. Brown, Haiyan Huang, and Nancy R.
Zhang. “Subsampling methods for genomic inference”. Annals of Applied
Statistics, Volume 4, Number 4 (2010), 1660-1697.

The ENCODE Project Consortium. “Initial Analysis of the Encyclopedia of
DNA Elements in the Human Genome”. 2012. Nature, in press.

Gerstein et al. “Integrative Analysis of the Caenorhabditis elegans Genome by
the modENCODE Project”. Science. (2010): Vol. 330 no. 6012 pp. 1775-1787.

Birney E et al (2007). “Identification and analysis of functional elements in 1%
of the human genome by the ENCODE pilot project”. Nature. 447: 799-816.

Margulies EM, et al. (2007). “Analysis of deep mammalian sequence alignments
and constraint predictions for 1% of the human genome”. Genome Research.
17: 760-774.
Association of functional annotations in the Human
Genome
[Figure: Transcription Start Sites (TSSs) and GENCODE exons marked on both strands (5' to 3' and 3' to 5')]
The ENCODE Consortium found that many Transcription Start Sites are
anti-sense to GENCODE exons
They also found vastly more TSSs than previously supposed
Is the association between TSSs and exons in the anti-sense direction real, or
experimental noise in TSS identification?
Association of experimental annotations
across whole chromosomes
Do two factors tend to bind together more closely or more often than other
pairs of factors? Does a factor’s binding site relative to TSSs tend to
change across genomic regions?
Feature Overlap: the question

A mathematical question arises:
Do these features overlap more, or less, than “expected at random”?
[Figure: transcription fragments and conserved sequence marked along a strand, 5' to 3']
Our formulation

Defining “expectation” and “at random”:
The genome is highly structured
  Analysis of feature inter-dependence must account for superficial structure


“Expected at random” becomes:

Overlap between two feature sets bearing structure,
under no biological constraints
Naïve Method

Treating bases as being independent with same distribution
(ordinary bootstrap)





Hypothesis: Feature markings are independent
Specific Object Test based on
% Feature Overlap – (% Feature1)(% Feature2)
and standard statistics
Why naïve? Bases are NOT independent
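As a rough illustration of the naïve statistic above, the following sketch computes % overlap minus (% Feature 1)(% Feature 2) from two 0/1 indicator tracks. The function name `naive_overlap_statistic` and the toy tracks are assumptions made for illustration, not part of the original analysis.

```python
def naive_overlap_statistic(feature1, feature2):
    """Return the overlap fraction minus the product of the marginal fractions.

    feature1, feature2: equal-length sequences of 0/1 indicators, one per base.
    If bases were independent and identically distributed, the expected
    value of this quantity would be 0.
    """
    if len(feature1) != len(feature2):
        raise ValueError("feature tracks must have the same length")
    n = len(feature1)
    overlap = sum(a * b for a, b in zip(feature1, feature2)) / n
    frac1 = sum(feature1) / n
    frac2 = sum(feature2) / n
    return overlap - frac1 * frac2


# Hypothetical toy tracks over 10 bases.
f1 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
f2 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
stat = naive_overlap_statistic(f1, f2)  # 3/10 - (5/10)(3/10)
```

Turning this number into a p-value with "standard statistics" is exactly the step that relies on independent bases.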
Better method: keeping one type of feature fixed and simulating
moving start site of another feature uniformly (feature bootstrap)
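A minimal sketch of the start-site shuffling idea, assuming features are given as hypothetical `(start, length)` intervals on a sequence of length `n`. One feature set stays fixed while each interval of the other gets a new start chosen uniformly. Shuffled intervals are allowed to land on top of each other, a simplification made here.

```python
import random


def to_mask(intervals, n):
    """Convert (start, length) intervals into a 0/1 indicator list of length n."""
    mask = [0] * n
    for start, length in intervals:
        for i in range(start, min(start + length, n)):
            mask[i] = 1
    return mask


def overlap_bases(mask_a, mask_b):
    """Number of positions marked in both tracks."""
    return sum(a * b for a, b in zip(mask_a, mask_b))


def shuffled_overlaps(fixed, moving, n, num_shuffles, rng):
    """Overlap of `fixed` with `moving` after uniform start-site shuffling."""
    fixed_mask = to_mask(fixed, n)
    results = []
    for _ in range(num_shuffles):
        # Each moving interval keeps its length and gets a uniform new start.
        moved = [(rng.randint(0, n - length), length) for _, length in moving]
        results.append(overlap_bases(fixed_mask, to_mask(moved, n)))
    return results


rng = random.Random(0)
n = 1000
fixed = [(100, 50), (400, 80)]   # hypothetical Feature 1 intervals
moving = [(120, 40), (420, 30)]  # hypothetical Feature 2 intervals
observed = overlap_bases(to_mask(fixed, n), to_mask(moving, n))
null = shuffled_overlaps(fixed, moving, n, 200, rng)
p_upper = sum(v >= observed for v in null) / len(null)
```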
Why still a problem?

Even if feature occurrences are independent functionally, there can be
clumping caused by the complex underlying genome sequence structure
(i.e. inhomogeneity, local sequence dependence)
A non parametric model
Requirements:
a) It should roughly reflect known statistics of the genome
b) It should encompass methods listed
c) It should be possible to do inference, tests, set confidence bounds meaningfully
Segmented Stationary Model
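Let $X_i$ = base at position $i$, $i = 1,\dots,n$
$(X_1,\dots,X_n) = (X_{11},\dots,X_{1n_1},\dots,X_{r1},\dots,X_{rn_r})$, $n = n_1 + \dots + n_r$
such that for each $k = 1,\dots,r$, $(X_{k1},\dots,X_{kn_k})$ is:
  Stationary (homogeneity within blocks)
  Mixing (bases at distant positions are nearly independent)
  $r \ll n$
[Figure: positions 1,…,n divided into consecutive segments of lengths $n_1, n_2, \dots, n_{r-1}, n_r$]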
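To make the model concrete, the sketch below simulates a 0/1 sequence with r = 3 segments. Each segment is a two-state Markov chain started from its stationary distribution, so it is stationary and mixing within the segment, while the parameters change between segments. The Markov-chain choice, the function names, and the parameter values are assumptions for illustration; the model itself is nonparametric.

```python
import random


def simulate_segment(length, a, b, rng):
    """Two-state chain with P(0->1) = a and P(1->0) = b.

    The initial state is drawn from the stationary distribution
    P(state = 1) = a / (a + b), so the whole segment is stationary.
    """
    state = 1 if rng.random() < a / (a + b) else 0
    out = []
    for _ in range(length):
        out.append(state)
        if state == 0:
            state = 1 if rng.random() < a else 0
        else:
            state = 0 if rng.random() < b else 1
    return out


def simulate_segmented(segments, rng):
    """Concatenate independent stationary segments given as (length, a, b)."""
    seq = []
    for length, a, b in segments:
        seq.extend(simulate_segment(length, a, b, rng))
    return seq


rng = random.Random(1)
# Hypothetical segmentation: n = 10000, r = 3, so r << n.
segments = [(5000, 0.01, 0.09), (3000, 0.05, 0.05), (2000, 0.02, 0.18)]
x = simulate_segmented(segments, rng)
```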
Empirical Interpretations

Within a segment:
  For k small compared to the minimum segment length, statistics of random k-mers do not differ between large subsegments of the segment
  Knowledge of the first k-mer does not help in predicting a distant k-mer


Remark:

If this model holds, it also applies to derived local
features, e.g. {I1,…,In} where Ik = 1 if position k
belongs to a binding site for a given factor
Using our model for inference
Many genomic statistics are functions of one or more sums of the form:
e.g.
is 1 or 0 depending on the presence or absence of a feature or features
When the summands are small compared to S:
Gaussian case
Example: Region overlap for common features, or rare features
over large regions
Under segmented stationarity, these distributions can be
estimated from the data
Some theory

Theorem 1: Segmented stationarity, exponential mixing and
fraction of short segments → 0 implies asymptotic normality of
linear statistics

Theorem 2: If the ordinary stationary bootstrap is used
(Politis/Romano) under suitable conditions on L, and different
stationary segments are present, then the asymptotic bootstrap
distribution is heavier tailed than a Gaussian of the same variance

Theorem 3: If the true segmentation is estimated in an
approximately consistent way, then, for approximately linear
statistics, the resulting segmented bootstrap is consistent

By the delta method, Gaussianity also holds for smooth functions of
vectors of linear statistics, and the segmented bootstrap and the
preceding theorems extend to such functions
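As a worked illustration of the delta-method remark, here is the standard statement in notation assumed for this sketch (the symbols $T_n$, $\theta$, $\Sigma$, $g$ are not taken from the slides). If a vector of linear statistics $T_n$ is asymptotically normal and $g$ is differentiable at $\theta$, then $g(T_n)$ is asymptotically normal as well:

```latex
\sqrt{n}\,(T_n - \theta) \xrightarrow{d} N(0, \Sigma)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(T_n) - g(\theta)\bigr) \xrightarrow{d}
N\bigl(0,\ \nabla g(\theta)^{\top}\, \Sigma\, \nabla g(\theta)\bigr).

% Example: overlap fraction minus the product of the marginal fractions,
% with T_n = (s_{12}, s_1, s_2):
g(s_{12}, s_1, s_2) = s_{12} - s_1 s_2,
\qquad
\nabla g(s_{12}, s_1, s_2) = \bigl(1,\ -s_2,\ -s_1\bigr)^{\top}.
```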
Distributions of feature overlaps

The Block Bootstrap
We can’t observe independent occurrences of ENCODE regions, but if our
hypothesis of segmented stationarity holds, then the distribution of sum
statistics and their functions can be approximated as follows
Block Bootstrap for r = 1
Algorithm 4.1:




1) Given $L \ll n$, choose a number $N$ uniformly at random from $\{1,\dots,n-L\}$
2) Given the statistic $T_n(X_1,\dots,X_n)$, under the assumption that $X_1,\dots,X_n$ is
stationary, compute $T^*_{L1} = T_L(X_{N+1},\dots,X_{N+L})$
3) Repeat B times to obtain $T^*_{L1},\dots,T^*_{LB}$
4) Estimate the distribution of $T^*_{L1},\dots,T^*_{LB}$ by their empirical distribution
By Theorem 4.2.1 of Politis, Romano and Wolf (1999),
this is asymptotically valid
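The steps of Algorithm 4.1 can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the block drawn in step 2 is $X_{N+1},\dots,X_{N+L}$, and the helper names `block_bootstrap` and `mean` and the toy track are hypothetical.

```python
import random


def block_bootstrap(x, statistic, block_len, num_draws, rng):
    """Evaluate `statistic` on `num_draws` randomly placed blocks of length `block_len`."""
    n = len(x)
    if not 0 < block_len < n:
        raise ValueError("block_len must satisfy 0 < block_len < len(x)")
    draws = []
    for _ in range(num_draws):
        # N is uniform on {1, ..., n - L}; the 1-based block
        # X_{N+1}, ..., X_{N+L} is the Python slice x[N:N + L].
        start = rng.randint(1, n - block_len)
        draws.append(statistic(x[start:start + block_len]))
    return draws


def mean(values):
    """Fraction of marked bases when values are 0/1 indicators."""
    return sum(values) / len(values)


# Hypothetical 0/1 feature track: 7 marked bases in every 21.
x = [1 if i % 21 < 7 else 0 for i in range(2000)]
rng = random.Random(0)
draws = block_bootstrap(x, mean, block_len=200, num_draws=500, rng=rng)
```

The draws live on the scale of a length-L block. Relating their spread to that of the full-length statistic requires a rescaling step that this sketch leaves out.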
Block Bootstrap Animation
r = 1
Observed sequence: X; statistic: S = f(X)
Draw a block of length L from the original sequence; this is the block-bootstrapped sequence X*.
Calculate the statistic on the block-bootstrapped sequence.
Repeat this procedure identically B times.
Observing the distributions
[Figure: Block bootstrap distribution of the Region Overlap Statistic, shown with the PDF of the normal distribution with the same mean and variance; the histogram is approximately the same as the normal density]
[Figure: QQ plot of the block bootstrap (BB) distribution vs. the standard normal]
What if r > 1



The estimated distribution is always heavier tailed, leading to conservative p-values
But it can be enormously so if the segment means of the statistic differ substantially
Less so, but still meaningfully, if the means agree but the variances differ
Solutions
1) Segment using biological knowledge
   Essentially done in ENCODE: poor segmentation occasionally led to
   non-Gaussian distributions (excessively conservative)
2) Segment using a particular linear statistic which we expect to
   identify homogeneous segments
Block Bootstrap given Segmentation
1. Draw a subsample of length L (blocks of lengths $f_1 L$, $f_2 L$, $f_3 L$, one per segment)
2. Compute the statistic on the subsample: $T(X^*)$
3. Do this B times: $T(X_1^*),\dots,T(X_B^*)$
[Figure panels: True distribution; Uniform Start Site Shuffling; Block Bootstrap without Segmentation; Block Bootstrap with True Segmentation; Block Bootstrap with Estimated Segmentation]
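A minimal sketch of the block bootstrap given a segmentation, assuming segment i contributes a block of length about $f_i L$ with $f_i$ equal to its share of the sequence. The function name, the half-open `(start, end)` boundary format, and the toy sequence are assumptions for illustration. The toy example has two segments with very different means, the case in which ignoring the segmentation inflates the spread of the bootstrap distribution.

```python
import random


def segmented_block_bootstrap(x, boundaries, statistic, total_len, num_draws, rng):
    """Draw one block per segment, concatenate the blocks, evaluate `statistic`.

    boundaries: list of half-open (start, end) index pairs that tile x.
    """
    n = len(x)
    draws = []
    for _ in range(num_draws):
        sample = []
        for seg_start, seg_end in boundaries:
            seg_len = seg_end - seg_start
            # Block length proportional to the segment's share f_i = n_i / n.
            block_len = min(seg_len, max(1, round(total_len * seg_len / n)))
            start = rng.randint(seg_start, seg_end - block_len)
            sample.extend(x[start:start + block_len])
        draws.append(statistic(sample))
    return draws


def mean(values):
    return sum(values) / len(values)


# Hypothetical sequence: feature absent in segment 1, present throughout segment 2.
x = [0] * 600 + [1] * 400
rng = random.Random(0)
with_seg = segmented_block_bootstrap(x, [(0, 600), (600, 1000)], mean, 100, 200, rng)
without_seg = segmented_block_bootstrap(x, [(0, 1000)], mean, 100, 200, rng)
```

With the segmentation, every draw contains 60 bases from the first segment and 40 from the second, so the statistic does not vary at all. Without it, the draws range between 0 and 1.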
Testing Association
Question: How do we estimate the null
distribution given only data for which
we believe the null is false?
Testing Association (bp overlap)
Observed sequence with two features marked (Feature 1 = X, Feature 2 = Y):
Sample two blocks of equal length.
Align Feature 1 of the first block with Feature 2 of the second block, and vice versa.
Calculate overlap in the blocks after swapping = $(X_2)(Y_1) + (X_1)(Y_2)$
Statistic is: $(X_2)(Y_1) + (X_1)(Y_2)$, properly normalized and set to mean 0.
Under the null hypothesis of independence, this should be Gaussian.
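The block-swapping idea can be sketched as follows: sample two non-overlapping blocks of equal length, pair Feature 1 from one block with Feature 2 from the other (and vice versa), and record the resulting base-pair overlap. The function names and toy tracks are hypothetical, and the slide's normalization is not specified, so the sketch only centers the draws.

```python
import random


def dot(a, b):
    """Base-pair overlap of two aligned 0/1 indicator blocks."""
    return sum(p * q for p, q in zip(a, b))


def swapped_overlap_draws(f1, f2, block_len, num_draws, rng):
    """Overlap after swapping features between two random disjoint blocks."""
    n = len(f1)
    if len(f2) != n or 2 * block_len > n:
        raise ValueError("tracks must match and hold two disjoint blocks")
    draws = []
    while len(draws) < num_draws:
        s1 = rng.randint(0, n - block_len)
        s2 = rng.randint(0, n - block_len)
        if abs(s1 - s2) < block_len:
            continue  # the two blocks overlap; redraw
        x1, y1 = f1[s1:s1 + block_len], f2[s1:s1 + block_len]
        x2, y2 = f1[s2:s2 + block_len], f2[s2:s2 + block_len]
        # Feature 1 of one block against Feature 2 of the other, and vice versa.
        draws.append(dot(x2, y1) + dot(x1, y2))
    return draws


# Hypothetical tracks: Feature 2 is Feature 1 shifted by 5 bases.
f1 = [1 if i % 40 < 10 else 0 for i in range(4000)]
f2 = [1 if 5 <= i % 40 < 15 else 0 for i in range(4000)]
rng = random.Random(0)
draws = swapped_overlap_draws(f1, f2, block_len=200, num_draws=300, rng=rng)
center = sum(draws) / len(draws)
centered = [d - center for d in draws]  # set to mean 0
```

Because the two blocks come from different places in the sequence, any association within a block is broken by the swap, which is what lets these draws stand in for the null distribution.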
Correlating DNA copy number variation with genomic
content
Redon et al (2007) claimed that copy number variant regions
(CNVs) are significantly (p < 0.05 with multiple testing correction)
negatively correlated with coding regions, i.e. have less overlap
than expected at random. Their analysis is based on random
shufflings of start positions. Our analysis suggests that the effect is
probably an artifact.
Issues


Choosing a method of segmentation, e.g. dyadic, and its tuning parameters
Block size for bootstrap using:
  stability
  segmentation
Test Statistic
H : Features not associated in each segment (so-called “dummy overlap”)
Then
has a Gaussian distribution.
We form the test statistic:
where:
(Length of segment i) / n
% of basepairs in segment i identified as Feature 1
% of basepairs in segment i identified as Feature 2
Measuring reproducibility of high-throughput experiments
Qunhua Li, James B. Brown, Haiyan Huang,
and Peter J. Bickel
Annals of Applied Statistics, Volume 5,
Number 3 (2011), 1752-1779.
A consistency measure
Our method of fitting
Nancy Zhang
Jessica Li
Ben Brown
Qunhua Li
Anshul Kundaje
Haiyan Huang
Nathan Boley
Peter Bickel