Binomial distribution

The binomial distribution is a discrete distribution that describes the number of successes in a sequence of n independent trials, each of which succeeds with probability p.
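Consider one particular sequence of successes and failures, where k is the number of successes among the n trials:
P(success) = $p$
P(failure) = $q = 1 - p$
P(this particular sequence) = P(success) · P(failure) · P(success) · P(success) · P(failure) · ...
$= p^k q^{n-k}$
$= p^k (1-p)^{n-k}$
P(to have k successes, in any order) = ?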
The probability P(k; n, p) to observe k successes is given by:
$$P(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$$
This formula can easily be understood intuitively:
We want k successes (this occurs with probability $p^k$) and (n-k) failures (this occurs with probability $(1-p)^{n-k}$).
Since the k successes can occur anywhere among the n trials, we have to count all the possibilities. The number of ways to distribute the k successes in a sequence of n trials is given by the binomial coefficient $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$.
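As a minimal sketch, the formula can be evaluated directly with Python's standard library. The function name `binomial_pmf` is an illustrative choice, not something defined in these slides.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    if not 0 <= k <= n:
        return 0.0  # impossible outcomes have probability zero
    # comb(n, k) counts the placements of the k successes among n trials
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Sanity check: the probabilities over k = 0..n should sum to 1
total = sum(binomial_pmf(k, 10, 0.2) for k in range(11))
print(total)

# Probability of exactly 2 successes in 10 trials with p = 0.2
print(binomial_pmf(2, 10, 0.2))
```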
Example of a binomial distribution: consider a genetic sequence of length n (here n = 100) in which each nucleotide has a probability p to mutate (here p = 0.2). Then generate 10000 sequences from the initial one. Let's call a mutation a "success". Then the distribution of the number k of mutations (successes) may look like this:
[Figure: histogram of the number of mutations k, with the binomial distribution overlaid.]
The blue bars are the data obtained from the 10000 generated sequences.
The red curve is the corresponding binomial distribution.
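The experiment can be sketched in a few lines of Python. This is an illustrative simulation under stated assumptions, not the code used to produce the figure: the function name `simulate_mutation_counts` and the fixed random seed are hypothetical choices, and each nucleotide is assumed to mutate independently.

```python
import random
from collections import Counter
from math import comb

def simulate_mutation_counts(n, p, n_sequences, seed=0):
    """Count how many generated sequences carry k mutations, for each k."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    counts = Counter()
    for _ in range(n_sequences):
        # Each of the n positions mutates independently with probability p
        k = sum(rng.random() < p for _ in range(n))
        counts[k] += 1
    return counts

n, p, n_sequences = 100, 0.2, 10000
counts = simulate_mutation_counts(n, p, n_sequences)

# Compare observed frequencies with the binomial probabilities
for k in range(10, 31, 5):
    observed = counts[k] / n_sequences
    expected = comb(n, k) * p**k * (1 - p)**(n - k)
    print(f"k={k:2d}  observed={observed:.4f}  binomial={expected:.4f}")
```

The observed frequencies should scatter around the binomial values, with the largest counts near k = np = 20.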
Bernoulli experiment
Such a success/failure (or, if you prefer, 0/1) experiment is called a Bernoulli experiment. When n=1, the binomial distribution is in fact a Bernoulli distribution.
Example 1: Tossing coins
success = head
Example 2: Probability of drawing a blue ball from an urn containing 8 red balls and 2 blue balls
success = blue
Poisson distribution
If the number of trials n is very large and the probability of success p is very small, the binomial distribution is well approximated by a Poisson distribution. The Poisson distribution, sometimes referred to as the law of rare events, was first described by Siméon Denis Poisson (1781-1840) and has nothing to do with fishes. Let's define λ = np (with n very large and p very small); then the probability P(k; λ) to observe k successes is given by:
$$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
The parameter λ is the expected number of successes. It corresponds to the mean of the Poisson distribution.
Note that, more generally, the Poisson distribution can be used to describe the probability distribution of the number of events occurring in a fixed period of time if these events occur with a known average rate and independently of the time since the last event. Example: the number of mRNA molecules synthesized in a given period of time follows a Poisson distribution.
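Let's show that the Poisson distribution can be obtained as the limit of the binomial distribution when the number of trials n becomes very large and the probability of success p very small, such that the product λ = np remains constant. Substituting p = λ/n:
$$P(k; n, p) = \frac{n!}{k!\,(n-k)!} \left(\frac{\lambda}{n}\right)^{k} \left(1-\frac{\lambda}{n}\right)^{n-k}$$
Note that:
$$P(k; n, p) = \frac{n}{n}\cdot\frac{n-1}{n}\cdots\frac{n-k+1}{n}\cdot\frac{\lambda^k}{k!}\cdot\left(1-\frac{\lambda}{n}\right)^{n}\cdot\left(1-\frac{\lambda}{n}\right)^{-k}$$
Recall that:
$$\lim_{n\to\infty}\left(1-\frac{\lambda}{n}\right)^{n} = e^{-\lambda}$$
As n → ∞ with k fixed, each of the k factors n/n, (n-1)/n, ..., (n-k+1)/n tends to 1, the factor $\lambda^k/k!$ is constant, $(1-\lambda/n)^{n}$ tends to $e^{-\lambda}$, and $(1-\lambda/n)^{-k}$ tends to 1. Therefore:
$$\lim_{n\to\infty} P(k; n, \lambda/n) = \frac{\lambda^k e^{-\lambda}}{k!} = P(k; \lambda)$$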
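The limit can also be checked numerically. The sketch below is only an illustration: it holds λ = np fixed and increases n, and the values λ = 2 and k = 3 and the function names are arbitrary choices made here.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return lam**k * exp(-lam) / factorial(k)

lam, k = 2.0, 3  # illustrative values; any fixed lam and k would do
for n in (10, 100, 1000, 10000):
    p = lam / n  # keep the product n * p equal to lam
    b = binomial_pmf(k, n, p)
    print(f"n={n:5d}  binomial={b:.6f}  poisson={poisson_pmf(k, lam):.6f}")
```

The binomial column should approach the Poisson value as n grows.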
Binomial vs Poisson distribution
Comparison of the binomial and the Poisson distributions: let's return to the "mutations" example and compare the distributions obtained for a genetic sequence of length n = 100 with a mutation probability p = 0.5 (left panel) and for n = 500 with p = 0.1 (right panel). In both cases λ = np = 50, and 10000 sequences were generated from the initial one. As expected, the Poisson distribution is closer to the binomial distribution, and fits the data well, when n is large and p is small.
[Figure: left panel, n = 100, p = 0.5, λ = 50; right panel, n = 500, p = 0.1, λ = 50.]
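One way to quantify how close the two distributions are in each case is the total variation distance, half the sum of the absolute differences between the two probability mass functions. This metric is a choice made for this sketch and is not used in the slides. The Poisson probabilities are computed in log space, since evaluating λ^k and k! directly overflows for λ = 50 and large k.

```python
from math import comb, exp, lgamma, log

def binomial_pmf(k, n, p):
    if not 0 <= k <= n:
        return 0.0  # no probability mass outside 0..n
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    # log(k!) = lgamma(k + 1); working in log space avoids overflow
    return exp(k * log(lam) - lam - lgamma(k + 1))

def total_variation(n, p):
    """Half the summed absolute difference between the two distributions."""
    lam = n * p
    # Sum up to 2n so that Poisson mass above n is included;
    # the mass beyond 2n is ignored
    return 0.5 * sum(
        abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
        for k in range(2 * n + 1)
    )

for n, p in ((100, 0.5), (500, 0.1)):
    print(f"n={n}, p={p}: distance = {total_variation(n, p):.4f}")
```

The distance should be smaller for n = 500, p = 0.1 than for n = 100, p = 0.5, in line with the comparison above.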
Poisson distribution & BLAST
Poisson distribution and database searches (BLAST)
When searching for sequences in databases (e.g. with BLAST), we perform a large number of trials (alignments) and we expect a small number of successes (high-scoring segment pairs, or HSPs). We are therefore in the conditions of the Poisson distribution. The probability to find exactly k HSPs by chance can thus be described by:
$$P(k; \lambda) = \frac{\lambda^k e^{-\lambda}}{k!}$$
where λ is the expected number of hits (the E-value).
The probability to observe at least k HSPs by chance is described by:
$$P(\geq k; \lambda) = 1 - \sum_{i=0}^{k-1} \frac{\lambda^i e^{-\lambda}}{i!}$$
The total number of trials is N = nm, where m is the length of the query sequence and n is the length of the database.
[Figure: Poisson distribution and inverse-cumulative Poisson distribution.]
A particular case is the probability to observe at least one HSP by chance (k = 1):
$$P(\geq 1; \lambda) = 1 - P(0; \lambda) = 1 - e^{-\lambda}$$
This probability is called the database-wise P-value (or family-wise error rate).
This P-value represents the probability of finding at least one spurious match, with a score greater than or equal to S, in the whole database search.
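The conversion from an E-value to these probabilities can be sketched as follows. The function names are hypothetical and this is not how BLAST itself is implemented; it only applies the Poisson formulas, taking λ to be the E-value.

```python
from math import exp, expm1, factorial

def prob_exactly(k, e_value):
    """Poisson probability of exactly k chance HSPs, with lambda = E-value."""
    return e_value**k * exp(-e_value) / factorial(k)

def prob_at_least(k, e_value):
    """Probability of at least k chance HSPs (complement of fewer than k)."""
    return 1.0 - sum(prob_exactly(i, e_value) for i in range(k))

def database_p_value(e_value):
    """P(at least one chance HSP) = 1 - exp(-E)."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E
    return -expm1(-e_value)

for e_value in (0.001, 0.1, 1.0, 10.0):
    print(f"E={e_value:<6}  P(>=1)={database_p_value(e_value):.6f}  "
          f"P(>=2)={prob_at_least(2, e_value):.6f}")
```

For small E-values the P-value is nearly equal to E, since 1 - e^(-E) ≈ E when E is close to 0.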