L8_DE_BAYES.pptx

Differential Expressions
Bayesian Techniques
Lecture Topic 8
Why Bayes?
A friend of mine who is Bayesian said the following when
asked this question:
• Some problems very hard to solve by classical techniques
• e.g. Behrens-Fisher problem
• Every new problem requires a new solution
• Bayes provides a coherent path
The Frequentist Paradigm
• Probability refers to a limiting relative frequency.
Probabilities are OBJECTIVE properties of the real world.
• Parameters are fixed unknown constants, NO probability
statement is possible about a parameter.
• Statistical procedures should be designed to have well-defined LONG-RUN frequency properties. For example, a
95% confidence interval should trap the true value of the
parameter with a limiting frequency of 95%.
Bayesian Philosophy
• Probability describes a DEGREE OF BELIEF not a
relative frequency. As such you can make probability
statements about anything, not just data
• We CAN make probability statements about parameters
even if they are fixed constants.
• We make inferences about a parameter by producing its
probability distribution. Inferences such as point or
interval estimates may be extracted from the probability
distribution of the parameter.
The Contrasts
• According to Larry Wasserman: “Bayesian inference is a
controversial approach as it embraces a subjective notion
of probability”.
• In general Bayesian methods have NO guarantees for long
run performance.
Advantages of Bayesian Methods
• Provide the ability to formally incorporate prior information
• Inference is conditional on the actual data (not on what might have been)
• More easily interpretable by non-specialists (e.g. interval estimates, compared with confidence intervals)
• All analyses follow directly from the posterior distribution
• The stopping rule does not affect inference
• Any question can be answered directly, e.g. bioequivalence
– H0: θ0 ≠ θ1
– H1: θ0 = θ1
■ The roles of null and alternative are reversed
■ Hard with traditional testing methods, easy in a Bayesian framework
Disadvantages
• Initial Bayesians were subjectivist
• Results not “objective,” could be manipulated to yield any
desired result
• How to set the prior in general?
• Computationally difficult
• Need to evaluate complex integrals even for simple
problems
• Need inexpensive high speed computing
How Bayesian Method Works
• Choose a probability density f(θ), called the PRIOR distribution, that expresses our beliefs about a parameter θ BEFORE we see any data.
• Choose a statistical model f(x | θ) that reflects our beliefs about x
given θ. We write it as f(x | θ), NOT f(x; θ) as in the frequentist
world.
• After OBSERVING the data X1, …, Xn, we update our belief in the
parameter and calculate the posterior distribution f(θ | x).
• The update essentially uses Bayes' theorem to calculate the posterior
distribution.
Bayes Theorem: Discrete Version
A Simple Probability Result
• Let B1, B2, …, Bn be disjoint sets with P(Bk) > 0 for all k
• P(B1 ∪ B2 ∪ … ∪ Bn) = 1
• (Mutually exclusive and exhaustive)
• For any event A
• P(Bj | A) = P(Bj) P(A | Bj) / Σk P(Bk) P(A | Bk)
EXAMPLE:
• Disease incidence in population – P(D)=0.001
• Diagnostic test
– false positive rate 0.05 , P(+|not D) = 0.05
– false negative rate 0.01, P(-|D) = 0.01
• If a person drawn at random tests +, what is the probability
that they have the disease, D?
$$
P(D \mid +) = \frac{P(D)\,P(+ \mid D)}{P(D)\,P(+ \mid D) + P(D^C)\,P(+ \mid D^C)}
= \frac{(0.001)(0.99)}{(0.001)(0.99) + (0.999)(0.05)} = 0.0194
$$
Comment
• Hence, the probability that you HAVE the disease given that
you have TESTED positive is still pretty LOW, even with
very small FALSE POSITIVE and FALSE NEGATIVE rates,
because the disease itself is so rare.
• This rule is very useful in numerous other situations.
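The arithmetic of the disease example can be checked with a few lines of Python. This is only a sketch of the calculation on the slide; the function name `posterior_disease` is made up for illustration.

```python
def posterior_disease(prevalence, false_pos, false_neg):
    """P(D | +) by Bayes' theorem for a binary diagnostic test."""
    sensitivity = 1.0 - false_neg                 # P(+ | D)
    numerator = prevalence * sensitivity          # P(D) P(+ | D)
    evidence = numerator + (1.0 - prevalence) * false_pos  # P(+)
    return numerator / evidence


# Numbers from the slide: P(D) = 0.001, P(+ | not D) = 0.05, P(- | D) = 0.01
p = posterior_disease(prevalence=0.001, false_pos=0.05, false_neg=0.01)
print(round(p, 4))  # 0.0194
```

Raising the prevalence to 0.1 with the same test gives a posterior probability above 0.6, which shows how strongly the answer depends on the prior P(D).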
Bayes Theorem: The Continuous Version
• Let f(θ) be our prior distribution (density) for our
parameter θ.
• Suppose we have the data X1, …, Xn, with density f(x1,
…, xn | θ), also written as the likelihood Ln(θ, x1…xn)
$$
f(\theta \mid x_1 \ldots x_n)
= \frac{L_n(\theta, x_1 \ldots x_n)\, f(\theta)}{\int L_n(\theta, x_1 \ldots x_n)\, f(\theta)\, d\theta}
= \frac{f(x_1 \ldots x_n \mid \theta)\, f(\theta)}{\int f(x_1 \ldots x_n \mid \theta)\, f(\theta)\, d\theta}
$$
Some Simplifications
• The denominator is sometimes very hard to deal with,
since the integration over the parameters is not trivial.
• We call it the normalizing constant. In most cases we
do not evaluate it explicitly, and instead use the idea that:
$$
f(\theta \mid x_1 \ldots x_n) \propto L(x_1 \ldots x_n \mid \theta)\, f(\theta)
$$
Bayes’ Idea
• Think of a model for data y1, . . . , yn
f(y1, . . . , yn|θ) e.g. Normal, Binomial, etc.
• θ random with prior density g(.)
• Bayes Rule says that:
p(θ | y1, . . . , yn) ∝ g(θ) f(y1, . . . , yn | θ)
• Hence, the posterior is proportional to the prior
multiplied by the likelihood of the data given the
parameter.
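A minimal sketch of "posterior ∝ prior × likelihood": evaluate the product on a grid of θ values and normalize numerically. The Binomial model, the flat prior, the data (7 successes in 10 trials) and the function name are assumptions chosen for illustration, not taken from the slides.

```python
def grid_posterior(y, n, prior, grid_size=1001):
    """Posterior density on a grid for y successes in n Binomial trials."""
    width = 1.0 / (grid_size - 1)
    thetas = [i * width for i in range(grid_size)]
    # Unnormalized posterior: prior times likelihood (constants dropped)
    unnorm = [prior(t) * t**y * (1.0 - t) ** (n - y) for t in thetas]
    # Normalizing constant by the trapezoid rule
    const = width * (sum(unnorm) - 0.5 * (unnorm[0] + unnorm[-1]))
    return thetas, [u / const for u in unnorm], width


# Hypothetical data: 7 successes in 10 trials, flat prior on (0, 1)
thetas, post, width = grid_posterior(y=7, n=10, prior=lambda t: 1.0)
post_mean = sum(t * p for t, p in zip(thetas, post)) * width
print(post_mean)
```

With a flat prior the exact posterior is Beta(8, 4), whose mean is 8/12, so the grid answer can be checked against a known result.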
Hypothesis Testing: Classical vs. Bayesian
Classical: Set up null, alternative hypotheses, perform a
test, calculate a p-value, reject or fail to reject
the null
Bayesian: Inference based on posterior distribution,
p(θ|y1, . . . , yn)
• Consider evidence in favor of certain parameter values
• Data as well as prior beliefs influence inference
Major Challenge 1: Setting Priors
Approaches
• Subjective - based on beliefs of individual, expert, etc.
• issues:
– how to do this in practice?
– people are inconsistent
– elicitation can help
• Non-informative - based on “prior ignorance” about
parameter
• issues:
– often hard to define
– may lead to improper posteriors
– sensitive to parameterization
Setting Priors: Conjugate Priors
• Conjugate priors are priors that, combined with the model, give a
posterior with a KNOWN distribution.
• issues:
– choice of convenience
– avoids computational problems
– exists only for limited families
• Example:
• y ~ Bin(n, θ), θ ~ Beta(α, β), then θ | y ~ Beta(α + y, β + n − y)
• Normal conjugate is Normal for location
• Poisson conjugate is Gamma
• Inverse Gamma is often used as a prior for the Normal variance σ².
• Generally, all members of the exponential family have conjugate
priors.
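The Beta-Binomial example needs no integration at all, only an update of the two Beta parameters. A small sketch follows; the prior Beta(2, 2) and the data are hypothetical values.

```python
def beta_binomial_update(alpha, beta, y, n):
    """Posterior Beta parameters after y successes in n Binomial trials."""
    return alpha + y, beta + n - y


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Hypothetical prior Beta(2, 2) and data: 7 successes in 10 trials
a_post, b_post = beta_binomial_update(alpha=2.0, beta=2.0, y=7, n=10)
print(a_post, b_post, beta_mean(a_post, b_post))
```

The posterior mean 9/14 lies between the prior mean 0.5 and the sample proportion 0.7.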
Setting Priors: Non-informative
• Assuming we have no REAL information about the
parameter, we can model it with a “non-informative” prior.
• For example, if θ is discrete with possible values θ1, …, θn, we can take
– P(θi) = 1/n for i = 1, …, n
• If we know an interval (a, b) in which θ lies, we can define
– Prior as P(θ) = 1/(b − a), a < θ < b.
• We can also define
– P(θ) = c, c > 0 (an improper prior, since it is not a pdf).
Setting Priors: Jeffreys' Prior
• Uniform non-informative priors are criticized because they are
not invariant under transformation of the parameter.
• Jeffreys' prior is often used instead, since it IS invariant under
transformation.
• P(θ) ∝ [I(θ)]^{1/2}, where I is the Fisher information:
$$
I(\theta) = -E_{X \mid \theta}\left[ \frac{\partial^2}{\partial \theta^2} \log f(X \mid \theta) \right]
$$
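As a worked instance of Jeffreys' prior, take a single Bernoulli(θ) observation, a model chosen here only for illustration. The sketch computes the Fisher information from the expected second derivative of the log density and takes its square root.

```python
import math


def fisher_information_bernoulli(theta):
    """I(theta) = -E[d^2/dtheta^2 log f(X | theta)] for X ~ Bernoulli(theta)."""
    # log f(x | theta) = x log(theta) + (1 - x) log(1 - theta)
    def second_derivative(x):
        return -x / theta**2 - (1 - x) / (1.0 - theta) ** 2

    # Expectation over x in {0, 1} with P(X = 1) = theta
    expectation = theta * second_derivative(1) + (1.0 - theta) * second_derivative(0)
    return -expectation


def jeffreys_unnormalized(theta):
    """Jeffreys' prior up to a constant: square root of the Fisher information."""
    return math.sqrt(fisher_information_bernoulli(theta))


print(fisher_information_bernoulli(0.3), jeffreys_unnormalized(0.5))
```

Here I(θ) simplifies to 1/(θ(1 − θ)), so the prior is proportional to θ^(−1/2)(1 − θ)^(−1/2), which is the Beta(1/2, 1/2) density.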
Major Challenge II: Computation
• Need to evaluate complicated high dimensional integrals
• Lots of technology developed in last 20-25 years
Approaches
• Earliest solutions: approximations and numerical integration
• Noniterative Monte Carlo: direct sampling, indirect sampling
(importance, rejection)
• Markov Chain Monte Carlo (MCMC): Gibbs sampling, Metropolis-Hastings algorithm, hybrid methods . . .
• MCMC is the most popular approach and can be implemented in high-dimensional
situations.
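To make the MCMC idea concrete, here is a minimal random-walk Metropolis sampler (a special case of Metropolis-Hastings with a symmetric proposal). The target, a Binomial likelihood with 7 successes in 10 trials and a flat prior, and all tuning values are assumptions for illustration. Only the unnormalized posterior is needed, so the hard integral never has to be evaluated.

```python
import math
import random


def log_target(theta, y=7, n=10):
    """Unnormalized log posterior: Binomial likelihood, flat prior on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return y * math.log(theta) + (n - y) * math.log(1.0 - theta)


def metropolis(log_density, start, n_draws, step, rng):
    """Random-walk Metropolis with a Normal(0, step) proposal."""
    draws, current = [], start
    current_lp = log_density(current)
    for _ in range(n_draws):
        proposal = current + rng.gauss(0.0, step)
        proposal_lp = log_density(proposal)
        # Accept with probability min(1, target(proposal) / target(current))
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        draws.append(current)
    return draws


rng = random.Random(0)
draws = metropolis(log_target, start=0.5, n_draws=20000, step=0.2, rng=rng)
kept = draws[2000:]  # discard an initial stretch of draws
post_mean = sum(kept) / len(kept)
print(post_mean)
```

The exact posterior here is Beta(8, 4) with mean 2/3, so the sample mean of the retained draws should land close to that value.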
Simple Example
Simple Example contd…
• Posterior mean is weighted average of prior mean and data
mean
■ Sample average is shrunk toward prior mean
■ Weight depends on relative variability of prior and data
• Posterior precision is sum of prior precision and data
precision
• Samples from posterior are easy to get given data, σ², μ, τ²
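The formulas of the example did not survive on these slides, so the sketch below assumes the standard normal-normal model that the bullets describe: data y1, …, yn ~ N(θ, σ²) with σ² known, and prior θ ~ N(μ, τ²). The function name and the numbers are hypothetical.

```python
def normal_posterior(ybar, n, sigma2, mu, tau2):
    """Posterior mean and variance of theta in the assumed normal-normal model."""
    prior_precision = 1.0 / tau2
    data_precision = n / sigma2
    # Posterior precision is the sum of prior and data precision
    post_precision = prior_precision + data_precision
    # Posterior mean is a precision-weighted average of ybar and mu
    weight = data_precision / post_precision
    post_mean = weight * ybar + (1.0 - weight) * mu
    return post_mean, 1.0 / post_precision


# Hypothetical values: sample mean 2.0 from n = 4 points, sigma2 = 4, prior N(0, 1)
post_mean, post_var = normal_posterior(ybar=2.0, n=4, sigma2=4.0, mu=0.0, tau2=1.0)
print(post_mean, post_var)
```

In this example prior and data carry equal precision, so the sample average 2.0 is shrunk halfway toward the prior mean 0.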
Lessons from Example
General principle: posterior is compromise between prior
and data
• μ and τ² not known
■ Empirical Bayes: estimate μ and τ²
■ Hierarchical Bayes: put prior on μ and τ² as well
Bayesian Hypothesis Testing
• The idea is due to Jeffreys (1961).
• Idea: Based on the data that each hypothesis is supposed to
predict, one applies Bayes' Theorem and computes the
posterior probability that the first hypothesis is correct.
• UNLIKE classical methods, the hypotheses DO NOT have
to be nested within each other.
Mechanics of Bayesian Hypothesis Testing
• Suppose we have two hypotheses H0 and H1 (Bayesians prefer
to use the word "models" as opposed to hypotheses, but we will keep
"hypothesis" to be consistent with the classical ideas).
• Let H0 and H1 be two hypotheses concerning the data Y, and let θ0 and
θ1 be the associated parameters.
• We define pi(θi) as the corresponding priors.
• Let fi(y | θi) be the corresponding distributions of the data under each hypothesis.
• We can use Bayes' Theorem to calculate the posteriors P(θi | y).
• Bayesian hypothesis testing consists of finding the following and using
pre-specified cut-offs for decisions:
– B = [P(θ0 | y) / P(θ1 | y)] / [P(θ0) / P(θ1)] (Bayes factor: posterior odds divided by prior odds)
– P(θ0 | Y = y), P(θ0 | Y ≥ y) (Bayesian p-values)
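A small sketch of the Bayes factor as posterior odds over prior odds, for the simplest case of two point hypotheses about a Binomial proportion. The model, the values θ0 = 0.5 and θ1 = 0.7, and the data are assumptions for illustration.

```python
import math


def binom_pmf(y, n, theta):
    """Binomial probability of y successes in n trials."""
    return math.comb(n, y) * theta**y * (1.0 - theta) ** (n - y)


def bayes_factor(y, n, theta0, theta1, prior0=0.5):
    """Return (Bayes factor for H0 against H1, posterior probability of H0)."""
    prior1 = 1.0 - prior0
    joint0 = prior0 * binom_pmf(y, n, theta0)
    joint1 = prior1 * binom_pmf(y, n, theta1)
    post0 = joint0 / (joint0 + joint1)
    post1 = joint1 / (joint0 + joint1)
    return (post0 / post1) / (prior0 / prior1), post0


# Hypothetical data: 7 successes in 10 trials, H0: theta = 0.5, H1: theta = 0.7
b_equal, post0_equal = bayes_factor(y=7, n=10, theta0=0.5, theta1=0.7)
b_skewed, post0_skewed = bayes_factor(y=7, n=10, theta0=0.5, theta1=0.7, prior0=0.2)
print(b_equal, post0_equal, b_skewed, post0_skewed)
```

For point hypotheses the Bayes factor reduces to the likelihood ratio, so it is the same under both priors even though the posterior probability of H0 changes.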
Bayesian Hypothesis Tests in Microarrays
• Let
Hg1: gene is differentially expressed
Hg0: gene is not differentially expressed
• Traditional Bayesians would write this as
$$
v_g = \begin{cases} 1 & \text{if the gene is differentially expressed} \\ 0 & \text{otherwise} \end{cases}
$$
Method 1
• Differential Expression Score
• Use the t-statistic or Wilcoxon rank-sum statistic, zg
• Then calculate P(H0 | zg = z) or P(H0 | zg ≥ z), or equivalently
• P(vg = 0 | zg = z) or P(vg = 0 | zg ≥ z)
• McClure and Wit (2004) show that the second quantity is
identical to using the FDR method for controlling error.
Fully Bayesian Analysis
• In general we are interested in the quantity given below, where p0 is the
fraction of inactive genes in the array, F0 is the distribution of the test
statistic under the null hypothesis (v = 0), and F is the overall
distribution of the test statistic:
$$
\hat{P}(v_g = 0 \mid z_g \ge z) = \hat{p}_0 \, \frac{1 - F_0(z)}{1 - \hat{F}(z)}
$$
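The tail-area probability P(vg = 0 | zg ≥ z) = p0 (1 − F0(z)) / (1 − F(z)) can be sketched as follows. The standard normal null F0, the empirical estimate of F, the value of p0 and the scores are all assumptions for illustration; capping the result at 1 is a convenience added here.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, used here as the assumed null distribution F0."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tail_posterior_null(z, scores, p0):
    """Estimate P(v_g = 0 | z_g >= z) as p0 * (1 - F0(z)) / (1 - Fhat(z))."""
    null_tail = 1.0 - normal_cdf(z)
    # Empirical tail: fraction of all observed scores at or above z
    empirical_tail = sum(1 for s in scores if s >= z) / len(scores)
    if empirical_tail == 0.0:
        raise ValueError("no scores at or above z")
    return min(1.0, p0 * null_tail / empirical_tail)


# Hypothetical scores for ten genes and an assumed fraction p0 = 0.8
scores = [-1.2, 0.3, 0.8, 2.5, 3.1, -0.4, 1.1, 2.9, 0.0, -2.0]
prob_null = tail_posterior_null(2.5, scores, p0=0.8)
print(prob_null)
```

Three of the ten scores are at or above 2.5, far more than the null tail predicts, so the estimated probability that such a gene is inactive is small.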
Bayesian t test
• The t statistic is given by:
$$
t = \frac{\bar{x}_{g1} - \bar{x}_{g2}}{se}
$$
• Assume: zg | {vg = 0} ~ N(0, σ0²)
• zg | {vg = 1} ~ N(0, σ1²)
• Hence, zg ~ (1 − p1) N(0, σ0²) + p1 N(0, σ1²)
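Given values of p1, σ0² and σ1², the two-component mixture gives the posterior probability of differential expression for a single score directly by Bayes' theorem. A minimal sketch, with hypothetical parameter values:

```python
import math


def normal_pdf(z, var):
    """Density of N(0, var) at z."""
    return math.exp(-z * z / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def prob_de(z, p1, var0, var1):
    """P(v_g = 1 | z_g = z) under the two-component normal mixture."""
    null_part = (1.0 - p1) * normal_pdf(z, var0)  # not differentially expressed
    alt_part = p1 * normal_pdf(z, var1)           # differentially expressed
    return alt_part / (null_part + alt_part)


# Hypothetical values: 10% of genes DE, null variance 1, alternative variance 9
print(prob_de(1.0, p1=0.1, var0=1.0, var1=9.0))
print(prob_de(4.0, p1=0.1, var0=1.0, var1=9.0))
```

A score near zero is attributed mostly to the null component, while a large score is attributed mostly to the wider alternative component.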
Bayesian t test: Priors
• p1 ~ Uniform(0,1)
• vg ~ Bernoulli (p1)
• 1/σ0² ~ Gamma(α, β), 1/σ1² ~ Gamma(γ, δ)
• β ~ Gamma(λ1, τ1), δ ~ Gamma(λ2, τ2)
• θ = (v, p1, σ0², α, β, σ1², γ, δ, λ1, τ1, λ2, τ2)
• These are all conjugate priors, chosen to make the calculations
easier.
• One uses the Gibbs sampler to simulate from P(θ | z) and so
estimate p1, σ0², σ1², from which the required probability is calculated.
Gibbs Sampler
• It is used to calculate the posterior mean.
• It does not calculate P(θ | y) explicitly. It simulates draws from this
distribution. Using sample summaries we get a good idea of the joint
posterior as well as the marginal distribution of interest, P(v | y).
• It samples in turn from each conditional distribution P(θi | θ−i, y) until the chain converges to a
stationary distribution. The initial period before convergence is called "burn-in".
• After burn-in, each draw of θ is a draw from the posterior distribution.
• Bayes' Theorem states that the conditional distribution P(θi | θ−i, y) is
proportional to the likelihood times the prior, P(y | θ) P(θ), as a function of
θi.
• If these conditional distributions have a known form
(generally the case when using conjugate priors), this procedure can be applied easily.
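A minimal Gibbs sampler sketch for a model much simpler than the microarray one: data yi ~ N(m, 1/prec) with a Normal prior on the mean m and a Gamma prior on the precision prec. The model, the prior settings, the simulated data and the burn-in length are all assumptions for illustration. Both full conditionals are standard distributions because the priors are conjugate.

```python
import random


def gibbs_normal(data, n_iter, rng, m0=0.0, v0=1e4, a=0.01, b=0.01):
    """Gibbs sampler for (mean, precision) of normally distributed data."""
    n, total = len(data), sum(data)
    mean, prec = total / n, 1.0
    draws = []
    for _ in range(n_iter):
        # mean | prec, data is Normal (prior on mean: N(m0, v0))
        post_prec = n * prec + 1.0 / v0
        post_mean = (prec * total + m0 / v0) / post_prec
        mean = rng.gauss(post_mean, post_prec**-0.5)
        # prec | mean, data is Gamma(a + n/2, rate = b + SSE/2);
        # random.gammavariate takes (shape, scale), so scale = 1 / rate
        sse = sum((y - mean) ** 2 for y in data)
        prec = rng.gammavariate(a + n / 2.0, 1.0 / (b + 0.5 * sse))
        draws.append((mean, prec))
    return draws


rng = random.Random(1)
data = [rng.gauss(5.0, 2.0) for _ in range(50)]  # simulated, hypothetical data
draws = gibbs_normal(data, n_iter=3000, rng=rng)
kept = draws[500:]  # discard burn-in
post_mean_m = sum(m for m, _ in kept) / len(kept)
print(post_mean_m, sum(data) / len(data))
```

With such a vague prior the posterior mean of m should sit very close to the sample average, which gives a quick sanity check on the sampler.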
Empirical Bayes Idea
• The prior distributions depend upon unknown parameters which in
turn may need a second or higher stage prior in some hierarchical
setting.
• But at some point we HAVE to specify all remaining parameters of the
hyper-prior.
• In other words we HAVE to use our knowledge to specify our prior.
• The Empirical Bayes method uses sample data to estimate the
parameters for the final stage prior.
• The idea is: if we are interested in θ | y, let θ ~ P(η1), η1 ~ P(η2), …,
ηL−1 ~ P(ηL).
• In empirical Bayes we use the data to estimate the parameter
ηL, taking the value that maximizes the marginal likelihood
P(Y | ηL).
• We substitute the estimate η̂L into the priors, and the posterior
distribution is now P(θ | y, η̂L).
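A small sketch of the empirical Bayes recipe in a model assumed here for illustration: yi | θi ~ N(θi, 1) and θi ~ N(0, τ²), so that τ² plays the role of the last-stage hyper-parameter. Marginally yi ~ N(0, 1 + τ²), which gives a closed-form maximizer of the marginal likelihood.

```python
def empirical_bayes_normal_means(y):
    """Estimate tau2 from the marginal likelihood, then shrink each y_i."""
    n = len(y)
    # Marginal likelihood N(0, 1 + tau2) is maximized at 1 + tau2 = mean(y^2),
    # subject to tau2 >= 0
    tau2_hat = max(0.0, sum(v * v for v in y) / n - 1.0)
    # Posterior mean of theta_i given y_i, with tau2_hat plugged in
    shrink = tau2_hat / (1.0 + tau2_hat)
    return tau2_hat, [shrink * v for v in y]


# Hypothetical observations
y = [2.0, -1.0, 0.5, 3.0, -2.5]
tau2_hat, theta_hat = empirical_bayes_normal_means(y)
print(tau2_hat, theta_hat)
```

Every estimate is pulled toward the prior mean 0 by the same factor, and that factor is learned from the data rather than specified in advance.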
Empirical Bayes’ Idea in Differential Expression
• Average log fold change.
– Problem: non-DE genes with large variances have too much chance of being selected.
• t-statistics
– Problem: apparently DE genes with very small sample variances are suspect.
• Moderated t-statistics
– A happy compromise between the two above: an empirical Bayes estimate, using data to estimate the new se, s̃g. Generally s̃g = sg + c.
The moderated t statistic
• Smoothed standard deviations: shrink towards a common value
$$
\tilde{s}_g = \sqrt{\frac{d_0 s_0^2 + d_g s_g^2}{d_0 + d_g}}
$$
• Eliminates large t-statistics due merely to very small s
values, and reduces the impact of very large s values.
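A sketch of the shrinkage, assuming the usual limma-style form s̃g² = (d0 s0² + dg sg²) / (d0 + dg), where s0² and d0 are the prior variance and prior degrees of freedom. The hyper-parameter values and the gene-level numbers are hypothetical; in practice s0² and d0 are estimated from all genes.

```python
import math


def shrunken_variance(s2_gene, d_gene, s2_prior, d_prior):
    """Weighted average of the gene-wise and prior variances."""
    return (d_prior * s2_prior + d_gene * s2_gene) / (d_prior + d_gene)


def t_statistic(diff, variance, scale):
    """Mean difference divided by its standard error, sqrt(variance) * scale."""
    return diff / (math.sqrt(variance) * scale)


# Hypothetical gene: small mean difference but a tiny sample variance
diff, s2_gene, d_gene = 0.5, 0.01, 4
s2_prior, d_prior = 0.25, 4
scale = math.sqrt(1.0 / 3.0 + 1.0 / 3.0)  # two groups of three arrays

s2_tilde = shrunken_variance(s2_gene, d_gene, s2_prior, d_prior)
t_ordinary = t_statistic(diff, s2_gene, scale)
t_moderated = t_statistic(diff, s2_tilde, scale)
print(t_ordinary, t_moderated)
```

The ordinary t is large only because the sample variance is tiny; after shrinking the variance toward the prior value, the moderated t is far less extreme.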
EB Idea
• Posterior odds (for DE)
• The posterior probability of differential expression for any
gene is a monotonic function of t̃² for constant d.
Estimating hyper-parameters
Closed-form estimators with good properties are available:
• for s0 and d0, in terms of the first two moments of log s²
• for c0, in terms of quantiles of the |t̃g|
Nowadays the EB estimate is used most often for differential expression,
and the genes are ranked by their EB estimates.
Instead of strict error control, the top g genes are examined, using the
EB estimates for ranking purposes. Sometimes |t̃g| > 4 is used as an
empirical cut-off.
limma in R uses empirical Bayes estimates to decide which genes are
differentially expressed.