Pr{X=k}

Bernoulli Trials
A Bernoulli trial is an experiment with only two outcomes:
• Success: This event occurs with probability p
• Failure: This event occurs with probability q = 1 - p
A coin flip is the classic example of a Bernoulli trial
In the context of finding a thymine or ‘CT’ in DNA, it could mean:
• ‘T’ is success, any other nucleotide is failure
• A pyrimidine is success, a purine is failure
When we refer to “trials” in the plural, it’s assumed they
are independent and that they have the same probabilities
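A minimal sketch of the two DNA readings of "success" described above, plus a simulation of repeated trials. The helper names (`bernoulli_trial`, `is_success`), the example sequence, and the choice p = 0.25 (all four bases equally likely) are assumptions made for illustration only.

```python
import random

def bernoulli_trial(p, rng):
    """Return True (success) with probability p, otherwise False (failure)."""
    return rng.random() < p

def is_success(nucleotide, successes="T"):
    """Read a nucleotide as a Bernoulli outcome: success if it is in `successes`."""
    return nucleotide in successes

rng = random.Random(0)   # fixed seed so the sketch is repeatable
p = 0.25                 # assumption: all four bases equally likely
trials = [bernoulli_trial(p, rng) for _ in range(10000)]
print(sum(trials) / len(trials))   # observed success frequency, close to p

seq = "ACGTTGCA"         # hypothetical example sequence
print([is_success(b) for b in seq])         # 'T' is success
print([is_success(b, "CT") for b in seq])   # a pyrimidine (C or T) is success
```

Each call to `bernoulli_trial` is independent of the others and uses the same p, which is what "trials" in the plural assumes.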
Geometric Distribution
The geometric distribution follows immediately from the
idea of conducting multiple Bernoulli trials
Q: What’s the probability that it takes k trials to get a success?
• Before we can succeed at trial k, we must first have had k-1 failures!
• Each failure occurred with probability q, so there is a term with:
$q^{k-1}$
• Finally, a single success occurs with probability p, so there is a term:
$p^1$
Because the trials are mutually independent, we can multiply these terms:
$\Pr\{X=k\} = q^{k-1} p$
Geometric Distribution
Geometric distribution for p = 1/5
Image from encyclopedia of math
$\Pr\{X=k\} = q^{k-1} p$
What is the probability that the random variable X takes on the value k?
X represents the number of trials required to get a success
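The geometric formula can be checked numerically with a short sketch. The function name `geometric_pmf` is hypothetical; p = 1/5 is taken from the figure above.

```python
def geometric_pmf(k, p):
    """Pr{X=k}: the first success occurs on trial k (k = 1, 2, ...)."""
    if k < 1:
        return 0.0
    q = 1.0 - p
    return q ** (k - 1) * p   # k-1 failures, then one success

p = 0.2
for k in range(1, 6):
    print(k, geometric_pmf(k, p))

# The probabilities over all k should sum to 1 (up to a negligible tail).
print(sum(geometric_pmf(k, p) for k in range(1, 200)))
```

The probabilities shrink by a factor of q with every extra trial, which gives the steadily decaying shape of the plot.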
Binomial Distribution
The binomial distribution also arises naturally from the
idea of conducting multiple Bernoulli trials
Q: What’s the probability that we’ll get k successes in n trials?
• We’re looking for k successes, so there must be a term with:
$p^k$
(each success has probability p)
• If we’ve had k successes there must be a term reflecting n-k failures:
$q^{n-k}$
(each failure has probability q)
• Again trials are mutually independent, so we can write:
$q^{n-k} p^k$
BUT WAIT!!!
We’re not done!
Binomial Distribution
We also need to consider how many different ways we can generate
those k successes from n trials. Here we show all the ways you can
get 3 T’s in a total of 5 nucleotides (symbol V here is the “non-T”
nucleotide ambiguity code – a failure in our Bernoulli trials):
['TTTVV', 'TTVTV', 'TTVVT', 'TVTTV', 'TVTVT',
'TVVTT', 'VTTTV', 'VTTVT', 'VTVTT', 'VVTTT']
Does this seem like a familiar problem? It should!
This is just “n choose k”! Therefore we must also have a term:
$\binom{n}{k}$
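One way to generate such a list is to choose which k of the n positions hold a success. This is a sketch only: the function name `arrangements` is hypothetical, and `math.comb` assumes Python 3.8 or later.

```python
from itertools import combinations
from math import comb

def arrangements(n, k, success="T", failure="V"):
    """List every length-n string with exactly k successes."""
    result = []
    for positions in combinations(range(n), k):   # which k slots get a success
        seq = [failure] * n
        for i in positions:
            seq[i] = success
        result.append("".join(seq))
    return result

ways = arrangements(5, 3)
print(ways)
print(len(ways), comb(5, 3))   # both counts are "5 choose 3"
```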
Binomial Distribution
Putting it all together:
$\Pr\{X=k\} = \binom{n}{k} q^{n-k} p^k$
What is the probability that the random variable X takes on the
value k?
X represents the number of successes k out of a total of n
trials
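As a sketch, the binomial formula translates directly into code. The function name `binomial_pmf` is hypothetical, and the values n = 5, p = 0.25 (three T's in five nucleotides, assuming equally likely bases) are chosen for illustration only.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Pr{X=k}: exactly k successes in n independent Bernoulli trials."""
    if k < 0 or k > n:
        return 0.0
    q = 1.0 - p
    return comb(n, k) * q ** (n - k) * p ** k

n, p = 5, 0.25
print(binomial_pmf(3, n, p))   # 10 * 0.75**2 * 0.25**3, about 0.0879

# The probabilities for k = 0..n should sum to 1.
print(sum(binomial_pmf(k, n, p) for k in range(n + 1)))
```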
*One small fly in the ointment: DNA has four bases, not just two, so really we
want a multinomial distribution, a generalization of the binomial distribution.
But close enough for government work, eh?
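For comparison, here is a rough sketch of the multinomial probability of seeing a given count of each base. The function name `multinomial_pmf` and the equal base probabilities are assumptions for illustration. Collapsing the four bases into two categories (non-T, T) gives back the binomial value.

```python
from math import factorial

def multinomial_pmf(counts, probs):
    """Probability of observing exactly counts[i] outcomes of category i."""
    n = sum(counts)
    coeff = factorial(n)
    for c in counts:
        coeff //= factorial(c)   # n! / (c1! c2! ... cm!), exact at every step
    prob = float(coeff)
    for c, p in zip(counts, probs):
        prob *= p ** c
    return prob

# Counts of (A, C, G, T) in 5 nucleotides, assuming equally likely bases.
print(multinomial_pmf((1, 1, 1, 2), (0.25, 0.25, 0.25, 0.25)))

# Two categories (non-T, T): the same number as the binomial with n=5, k=3, p=0.25.
print(multinomial_pmf((2, 3), (0.75, 0.25)))
```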
Binomial Distribution
Image from zoonek2.free.fr
n = 100, p = 0.5
What is the “expected value” of these distributions?
Normal curve with same mean and SD drawn over top
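For a binomial distribution the mean is np and the standard deviation is sqrt(npq). The sketch below computes both directly from the probabilities for n = 100, p = 0.5, then compares a few values with a normal curve that has the same mean and SD. The helper names are hypothetical.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n, p):
    """Pr{X=k} for k successes in n trials."""
    return comb(n, k) * (1 - p) ** (n - k) * p ** k

def normal_pdf(x, mean, sd):
    """Density of a normal curve with the given mean and SD."""
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

n, p = 100, 0.5
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
mean = sum(k * pk for k, pk in enumerate(pmf))               # expected value
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))
sd = sqrt(var)

print(mean, n * p)                   # both are 50, up to rounding
print(sd, sqrt(n * p * (1 - p)))     # both are 5, up to rounding
for k in (40, 45, 50):
    print(k, pmf[k], normal_pdf(k, mean, sd))   # binomial vs normal curve
```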
Poisson Distribution
Another common limiting case of the binomial arises when N (the number
of trials) is large and p is small, such that the expected (mean) value Np
is a moderate number (roughly 0 to 10). The distribution is then close to a
Poisson distribution
Binomial(10,.1)
Poisson(1)
Characteristics of Poisson
Single parameter (mean): $\lambda = Np$
$P(k \mid \lambda) = e^{-\lambda} \lambda^k / k!$
Variance = Mean = $\lambda$ ; SD = $\sqrt{\lambda}$ ;
For $\lambda > 10$, Normal approximation $N(\lambda, \lambda)$ is fine
$\lambda = 3$
$\lambda = 7$
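A small sketch comparing Binomial(10, 0.1) with Poisson(1), and checking that the Poisson mean and variance both equal the parameter. The function names are hypothetical, and the sum is truncated at k = 50 as an approximation to the infinite sum.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """P(k | lambda) = exp(-lambda) * lambda**k / k!"""
    return exp(-lam) * lam ** k / factorial(k)

def binomial_pmf(k, n, p):
    """Pr{X=k} for k successes in n trials."""
    return comb(n, k) * (1 - p) ** (n - k) * p ** k

n, p = 10, 0.1
lam = n * p   # Poisson parameter matching the binomial mean
for k in range(6):
    print(k, round(binomial_pmf(k, n, p), 4), round(poisson_pmf(k, lam), 4))

# Mean and variance of Poisson(3), summed over k = 0..50.
probs = [poisson_pmf(k, 3.0) for k in range(51)]
mean = sum(k * pk for k, pk in enumerate(probs))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(probs))
print(mean, var)   # both close to 3
```

With N this small the two columns agree only roughly. The match tightens as N grows and p shrinks with Np held fixed.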
Scientific Computing in Python
SciPy, NumPy, Matplotlib ... and more
http://scipy.org
Scientific Computing in Python
NumPy implements very efficient low-level
n-dimensional array processing and other basic
numerical routines
Our interest in NumPy is mostly restricted to the fact that
both the SciPy library and Matplotlib depend on NumPy
http://docs.scipy.org/doc/scipy-0.13.0/reference/
Scientific Computing in Python
SciPy is the name of both the whole ecosystem
and a specific scientific computing library!
The SciPy library has many numerical algorithms, but
also domain-specific toolboxes
Our interest is primarily in the statistics toolbox
http://docs.scipy.org/doc/scipy-0.13.0/reference/
Distributions in scipy.stats
SciPy supports both continuous and
discrete random variables and associated distributions
Each distribution in turn supports a number of
methods, e.g.:
• rvs: Random variates
• pdf: Probability Density Function (discrete distributions provide pmf, the Probability Mass Function, instead)
• cdf: Cumulative Distribution Function
• sf: Survival Function (1-CDF)
• ppf: Percent Point Function (Inverse of CDF)
• isf: Inverse Survival Function (Inverse of SF)
• stats: Return mean, variance, (Fisher’s) skew, or
(Fisher’s) kurtosis
• moment: non-central moments of the distribution
http://docs.scipy.org/doc/scipy/reference/tutorial/stats.html
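A short sketch of several of these methods on a binomial random variable. It assumes SciPy is installed. The values n = 100, p = 0.5 and the name `rv` are chosen for illustration. Because the binomial is discrete, it uses pmf rather than pdf.

```python
from scipy.stats import binom

n, p = 100, 0.5
rv = binom(n, p)                 # distribution with n and p fixed ("frozen")

print(rv.pmf(50))                # Pr{X = 50}
print(rv.cdf(50))                # Pr{X <= 50}
print(rv.sf(50))                 # Pr{X > 50}, i.e. 1 - CDF
print(rv.ppf(0.5))               # smallest k with CDF(k) >= 0.5
print(rv.isf(0.5))               # inverse of the survival function

mean, var = rv.stats(moments="mv")   # mean and variance
print(mean, var)

sample = rv.rvs(size=5, random_state=0)   # five random variates, seeded
print(sample)
```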
Distributions in Python
The scipy.stats library supports a rich collection of
distributions and their methods. One example:
from scipy.stats import binom
n, p = 100, 0.5
random_var = binom(n, p)  # binomial distribution with n and p fixed
myList = []
for k in range(100):  # range replaces Python 2's xrange
    # .pmf is the "probability mass function"
    myList.append(random_var.pmf(k))
myList here will contain the probabilities associated with
the first 100 values of k (k = 0 to 99), and if plotted would
recapitulate the earlier binomial distribution histogram
http://docs.scipy.org/doc/scipy/reference/stats.html
The matplotlib python library
A very powerful tool for professional-quality plots
Many usage examples are given in the documentation
http://matplotlib.org/