Statistics and Entropy

Basic Statistics and Shannon Entropy
Ka-Lok Ng
Asia University
Mean and Standard Deviation (SD)
• Compare distributions having the same mean value
  – a small SD value → a narrow distribution
  – a large SD value → a wide distribution
Pearson correlation coefficient or Covariance
Statistics – standard deviation and variance, Var(X) = s^2, for 1-dimensional data:

Var(X) = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}
How about higher-dimensional data ?
- It is useful to have a similar measure to find out how much the
dimensions vary from the mean with respect to each other.
- Covariance is measured between 2 dimensions.
- Suppose one has a 3-dimensional data set (X, Y, Z); then one can calculate
Cov(X,Y), Cov(X,Z) and Cov(Y,Z)
Cov(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}
- to compare heterogeneous pairs of variables, define the correlation
coefficient or Pearson correlation coefficient, -1 ≤ r_XY ≤ +1:

r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}

-1 → perfect anticorrelation
 0 → no linear correlation (not necessarily independence)
+1 → perfect correlation
The squared Pearson correlation coefficient
• Pearson correlation coefficient is useful for examining correlations in
the data
• One may imagine an instance, for example, in which an event can
cause both enhancement and repression, so correlation and anti-correlation
are equally interesting.
• A better alternative in that case is the squared Pearson correlation
coefficient (PCC),
r_{sq} = r_{XY}^2 = \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)}
The squared PCC takes values in the range 0 ≤ r_sq ≤ 1.
0 → uncorrelated vectors
1 → perfectly correlated or anti-correlated vectors
The PCC is a measure of similarity.
Similarity and distance have a reciprocal relationship:
similarity ↑ → distance ↓
→ d = 1 – r is typically used as a measure of distance
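A minimal sketch (not from the slides; the sample data are made up for illustration) of the quantities defined above, using NumPy:

# Pearson r, squared PCC, and the distance d = 1 - r for two sample vectors.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, so r should be close to +1

cov_xy = np.cov(x, y, ddof=1)[0, 1]        # Cov(X, Y) with the n-1 denominator
r_xy   = cov_xy / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))  # Pearson r
r_sq   = r_xy ** 2                         # squared PCC, in [0, 1]
d      = 1 - r_xy                          # similarity-to-distance conversion

print(r_xy, r_sq, d)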
Pearson correlation coefficient or Covariance
- The resulting rXY value will be above 0 if X and Y tend to increase (or
decrease) together, below 0 if one tends to increase while the other
decreases, and near 0 if they are uncorrelated.
Remark: rXY only tests whether there is a linear dependence, Y = aX + b
- if two variables are independent → low rXY,
- a low rXY may or may not → independence; there may be a non-linear relation, for
example, y = sin x
- a high rXY is a sufficient but not necessary condition for variable dependence
Pearson correlation coefficient
• To test for a non-linear relation among the data,
one could make a transformation by variable
substitution
• Suppose one wants to test the relation u(v) = av^n
• Take the logarithm of both sides
• log u = log a + n log v
• Set Y = log u, b = log a, and X = log v
• → a linear relation, Y = b + nX
• → log u correlates (n > 0) or anti-correlates (n < 0)
with log v
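A small sketch (assumed example, not from the slides) of this variable substitution: test a power-law relation u = a·v^n by computing the Pearson r of log u versus log v.

import numpy as np

v = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
u = 3.0 * v ** 2          # synthetic data following u = a*v^n with a = 3, n = 2

X = np.log(v)             # X = log v
Y = np.log(u)             # Y = log u = log a + n log v

r = np.corrcoef(X, Y)[0, 1]            # ~ +1 for a clean power law
n_est, b_est = np.polyfit(X, Y, 1)     # slope ~ n, intercept ~ log a

print(r, n_est, np.exp(b_est))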
Pearson correlation coefficient or Covariance matrix
A covariance matrix is merely a collection of many covariances in the
form of a d × d matrix:
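The matrix itself did not survive the slide export; a reconstruction, assuming the 3-dimensional data set (X, Y, Z) used earlier, would be:

\[
C =
\begin{pmatrix}
\mathrm{Cov}(X,X) & \mathrm{Cov}(X,Y) & \mathrm{Cov}(X,Z)\\
\mathrm{Cov}(Y,X) & \mathrm{Cov}(Y,Y) & \mathrm{Cov}(Y,Z)\\
\mathrm{Cov}(Z,X) & \mathrm{Cov}(Z,Y) & \mathrm{Cov}(Z,Z)
\end{pmatrix},
\qquad C_{ij} = \mathrm{Cov}(X_i, X_j) = C_{ji},
\qquad \mathrm{Cov}(X_i, X_i) = \mathrm{Var}(X_i).
\]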
Spearman’s rank correlation
• One of the problems with using the PCC is that it is susceptible to
being skewed by outliers: a single data point can make the data
appear correlated, even when all the other data points suggest
that they are not.
• Spearman’s rank correlation (SRC) is a non-parametric measure
of correlation that is robust to outliers.
• SRC is a measure that ignores the magnitude of the changes.
The idea of the rank correlation is to transform the original values
into ranks, and then to compute the correlation between the
series of ranks.
• First we order the values of gene A and B in ascending order, and
assign the lowest value with rank 1. The SRC between A and B is
defined as the PCC between ranked A and B.
• In case of ties, assign midranks → e.g. if two values tie for ranks
5 and 6, each is assigned a rank of 5.5
Spearman’s rank correlation
The SRC can be calculated by the following formula, where x_i and y_i
denote the ranks of x and y, respectively:

r_{SRC}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}}

A simplified formula, exact when there are no ties and approximate in the
presence of ties, is given by

r_{SRC}(X, Y) = 1 - \frac{6\sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}
Distances in discretized space
• Sometimes one has to deal with
discretized values
• The similarity between two discretized
vectors can be measured by the notion of
Shannon entropy.
Entropy and the Second Law of Thermodynamics:
Disorder and the Unavailability of energy
Entropy always increases.
When ice melts, it becomes more disordered and less structured.
Statistical Interpretation of Entropy and the
Second Law
S = k ln W
S = entropy, k = Boltzmann constant,
ln W = natural logarithm of the number of microstates W
corresponding to the given macrostate.
L. Boltzmann (1844-1906)
http://automatix.physik.uos.de/~jgemmer/hintergrund_en.html
Entropy and the Second Law of Thermodynamics:
Disorder and the Unavailability of energy
Concept of entropy
Toss 5 coins; the possible outcomes and their numbers of microstates:
5H0T : 1
4H1T : 5
3H2T : 10
2H3T : 10
1H4T : 5
0H5T : 1
A total of 32 microstates.
Propose entropy, S ~ no. of microstates W, i.e. S ~ W
Generate coin tosses with Excel (a short simulation sketch is given below)
The most probable macrostates are 3H2T and 2H3T (10 microstates each).
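A short sketch (an alternative to the Excel approach mentioned on the slide): enumerate all 2^5 = 32 outcomes and count the microstates in each macrostate.

from itertools import product
from collections import Counter

tosses = list(product("HT", repeat=5))           # all 32 microstates
macro = Counter(t.count("H") for t in tosses)    # macrostate = number of heads

for heads in range(5, -1, -1):
    print(f"{heads}H{5 - heads}T : {macro[heads]} microstates")
print("total:", len(tosses))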
Shannon entropy
• Shannon entropy is related to physical entropy
• Shannon asked the question "What is information ?"
• Energy is defined as the capacity to do work, not the work itself. Work is a
form of energy.
• By analogy, define information as the capacity to store and transmit meaning or
knowledge, not the meaning or knowledge itself.
• For example, the WWW carries a lot of information, but that does not mean
knowledge
• Shannon suggested that entropy is the measure of this capacity
Summary
Define information → capacity to store knowledge → entropy is the measure
→ Shannon entropy
Entropy ~ randomness
~ measure of the capacity to store and transmit knowledge
Reference: Gatlin L.L., Information Theory and the Living System, Columbia
University Press, New York, 1972.
Shannon entropy
• How are randomness and this measure of capacity related ?
Microstates of the 5-coin toss:
5H0T : 1
4H1T : 5
3H2T : 10
2H3T : 10
1H4T : 5
0H5T : 1
Shannon entropy
Assuming equal probability of each individual microstate, p_i:
p_i = 1/W
S = - k ln p_i
Physical entropy
S = k ln W
Information ~ 1/p_i = W
If p_i = 1 → there is no information, because it
means certainty
If p_i << 1 → there is more information; that is,
information is a decrease in uncertainty
Distances in discretized space
Sometimes it is advantageous to use a discretized expression
matrix as the starting point, e.g. to assign the values 0
(expression unchanged), 1 (expression increased) and -1
(expression decreased).
The similarity between two discretized vectors can be
measured by the notion of Shannon entropy.
Shannon entropy, H1
- p_i = probability of observing a particular symbol or event i
within a given sequence
- Consider a binary system: an element X has two states, 0 or 1

H_1 = -\sum_{i=1}^{n} p_i \log_2 p_i ,  and  \sum_{i=1}^{n} p_i = 1
Claude Shannon
- father of information theory
- H1 measures the “uncertainty” of a probability
distribution
- it is the expectation (average) value of the information
References:
1. http://www.cs.unm.edu/~patrik/networks/PSB99/genetutorial.pdf
2. http://www.smi.stanford.edu/projects/helix/psb98/liang.pdf
3. plus.maths.org/issue23/features/data/
Shannon Entropy (logarithms taken to base 2)
p_i                        H1 = -Σ p_i log2 p_i
1, 0 (certain)             -1·log2(1) = 0                  → no information
1/2, 1/2 (uniform)         -2·(1/2)·log2(1/2) = 1          → maximal value
2^2 states, each 1/4       -4·(1/4)·log2(1/4) = 2
2^N states, each 1/2^N     -2^N·(1/2^N)·log2(1/2^N) = N
DNA seq. → n = 4 states, maximum H1 = - 4*(1/4)*log2(1/4) = 2 bits
Protein seq. → n = 20 states, maximum H1 = - 20*(1/20)*log2(1/20) = 4.322 bits,
which is between 4 and 5 bits.
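A minimal sketch (not from the slides) of H1 in bits from the symbol frequencies of a sequence; a uniform 4-letter composition gives 2 bits.

import math
from collections import Counter

def shannon_entropy(seq):
    counts = Counter(seq)
    n = len(seq)
    # H1 = -sum p_i log2 p_i, written as log2(n/c) = -log2(c/n) to keep the sign explicit
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(shannon_entropy("ACGTACGTACGT"))   # equal base composition -> 2.0 bits
print(shannon_entropy("AAAAAAAAAAAA"))   # a single symbol -> 0.0 bits (certainty)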
The Divergence from equi-probability
• When all letters are equi-probable, p_i = 1/n,
• H1 = log2(n)
• H1 = log2(n) → the maximum value H1 can take
• Define Hmax1 = log2(n)
• Define the divergence from this equi-probable state, D1:
• D1 = Hmax1 - H1
• D1 tells us how much of the total divergence from the maximum
entropy state is due to the divergence of the base composition from
a uniform distribution
• For example, the E. coli genome has no divergence from equi-probability
because H1(E. coli) = 2 bits; but for the M. lysodeikticus genome, H1 = 1.87 bits,
so D1 = 2.00 – 1.87 = 0.13 bit
Divergence from independence
Single-letter events  which contains no information about
how these letters are arranged in a linear sequence
Divergence from independence – Conditional
Entropy
Question
Does the occurrence of any one base along the DNA seq. alter the
probability of occurrence of the base next to it ?
→ What are the numerical values of the conditional probabilities ?
→ p(X|Y) = prob. of event X conditioned on event Y
→ p(A|A), p(T|A), p(C|A), p(G|A) … etc.
→ If they were independent, p(A|A) = p(A), p(T|A) = p(T) ….
→ Extreme ordering case, an equi-probable seq.,
AAAA…TTTT…CCCC…GGGG…
→ here p(A|A) is very high, p(T|A) is very low, p(C|A) = 0, p(G|A) = 0
→ Another extreme case, ATCGATCGATCG….
→ Here p(T|A) = p(C|T) = p(G|C) = p(A|G) = 1, and all others are 0
→ Equi-probable state ≠ independent events
Divergence from independence – Conditional
Entropy
• Consider the space of DNA dimers (nearest neighbors)
• S2 = {AA, AT, …. TT}
• Entropy of S2: H2 = -[p(AA)log p(AA) + p(AT) log p(AT) + …. + p(TT)
log p(TT)]
• If the single-letter events are independent, p(X|Y) = p(X),
• so the dimer probabilities factorize: p(AA) = p(A)p(A),
p(AT) = p(A)p(T), ….
• If the dimers are not independent, p(XY) = p(X)p(Y|X), such as p(AA) =
p(A)p(A|A), p(AT) = p(A)p(T|A) … etc.
• HInp2 = the dimer entropy assuming completely independent letters
• Divergence from independence: D2 = HInp2 – H2
• D1 + D2 = the total divergence from the maximum entropy state
Divergence from independence – Conditional
Entropy
• Calculate D1 and D2 for M. phlei DNA, where p(A)=0.164, p(T)=0.162,
p(C)=0.337, p(G)=0.337.
• H1 = -(0.164 log2 0.164 + 0.162 log2 0.162 + …) = 1.910 bits
• D1 = 2.000 – 1.910 = 0.090 bit
• See the Excel file (a Python sketch is also given below)
• D2 = HInp2 – H2 = 3.8216 – 3.7943 = 0.0273 bit
• Total divergence, D1 + D2 = 0.090 + 0.0273 = 0.1173 bit
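A minimal sketch of the H1 and D1 calculation for the M. phlei base composition quoted above; D2 would additionally require the observed dimer (nearest-neighbor) frequencies, which are not reproduced on the slide.

import math

p = {"A": 0.164, "T": 0.162, "C": 0.337, "G": 0.337}

H1 = -sum(pi * math.log2(pi) for pi in p.values())   # ≈ 1.911 bits
Hmax1 = math.log2(len(p))                            # 2 bits for 4 bases
D1 = Hmax1 - H1                                      # ≈ 0.09 bit

H2_indep = 2 * H1     # dimer entropy if the letters were independent, ≈ 3.8216 bits
# D2 = H2_indep - H2, where H2 must be computed from the observed dimer probabilities.

print(H1, D1, H2_indep)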
Divergence from independence – Conditional Entropy
- Compare different sequences using H
to establish relationships
- Given the knowledge of one sequence, say X, can we
estimate the uncertainty of Y relative to X ?
- Relation between X, Y, and the conditional entropies
H(X|Y) and H(Y|X)
- conditional entropy is the uncertainty relative to
known information
H(X,Y) = H(Y|X) + H(X)
       = the uncertainty of Y given knowledge of X, H(Y|X),
         plus the uncertainty of X; together they sum to the
         joint entropy of X and Y
       = H(X|Y) + H(Y)
For the joint-probability example on the next slide:
H(Y|X) = H(X,Y) – H(X) = 1.85 – 0.97 = 0.88 bit
Note on changing the base of the logarithm: if Y = 2^x, then
log10 Y = x log10 2, so x = log10 Y / log10 2.
Shannon Entropy – Mutual Information
Joint entropy H(X,Y):

H(X, Y) = -\sum_{i,j} p_{ij} \log_2 p_{ij} ,  and  \sum_{i,j=1}^{n} p_{ij} = 1

where p_ij is the joint probability of finding x_i and y_j
• probabilities of finding (X,Y):
• p00 = 0.1, p01 = 0.3, p10 = 0.4, p11 = 0.2
Mutual information entropy, M(X,Y)
• M(X,Y) is the information shared by X and Y; it can also be
used as a similarity measure between X and Y
• H(X,Y) = H(X) + H(Y) – M(X,Y) → like in
set theory, |A∪B| = |A| + |B| – |A∩B|
• M(X,Y) = H(X) + H(Y) - H(X,Y)
         = H(X) – H(X|Y)
         = H(Y) – H(Y|X)
         = 1.00 – 0.88 = 0.12 bit
• Numerically, M(X,Y) = H(X) + H(Y) – H(X,Y)
                      = 0.97 + 1.00 – 1.85
                      = 0.12 bit
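A minimal sketch reproducing the numbers on this slide from the joint probabilities p00 = 0.1, p01 = 0.3, p10 = 0.4, p11 = 0.2.

import math

p = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}   # p[(x, y)]

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

px = [sum(v for (x, _), v in p.items() if x == i) for i in (0, 1)]   # marginal of X
py = [sum(v for (_, y), v in p.items() if y == j) for j in (0, 1)]   # marginal of Y

Hx, Hy, Hxy = H(px), H(py), H(p.values())
M = Hx + Hy - Hxy            # mutual information, ≈ 0.12 bit
H_y_given_x = Hxy - Hx       # conditional entropy H(Y|X), ≈ 0.88 bit

print(round(Hx, 2), round(Hy, 2), round(Hxy, 2), round(M, 2), round(H_y_given_x, 2))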
Shannon Entropy – Conditional Entropy
Conditional entropy
For a particular value x of X:

H(Y | X = x) = -\sum_{y} f(Y = y | X = x) \log_2 f(Y = y | X = x)

Averaging over all x's, weighted by f_X(x):

H(Y | X) = \sum_{x} f_X(x)\, H(Y | X = x)

For the joint probabilities (p00, p01, p10, p11) = (0.1, 0.3, 0.4, 0.2), the
conditional distributions f(Y | X = x) and weights are:
X = 0:  f(Y=0|X=0) = 1/4,  f(Y=1|X=0) = 3/4,  weight f_X(0) = 4/10
X = 1:  f(Y=0|X=1) = 4/6,  f(Y=1|X=1) = 2/6,  weight f_X(1) = 6/10

H(Y | X) = -[ (4/10)((1/4) log2(1/4) + (3/4) log2(3/4))
            + (6/10)((4/6) log2(4/6) + (2/6) log2(2/6)) ]
         = 0.87549 bit
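A short sketch of the weighted-average formula on this slide, using the same conditional distributions (1/4, 3/4) and (4/6, 2/6); it reproduces the 0.88 bit obtained above as H(X,Y) – H(X).

import math

def H(dist):
    return -sum(q * math.log2(q) for q in dist if q > 0)

weights = {0: 4/10, 1: 6/10}                 # f_X(x)
cond = {0: [1/4, 3/4], 1: [4/6, 2/6]}        # f(Y | X = x)

H_y_given_x = sum(weights[x] * H(cond[x]) for x in weights)
print(round(H_y_given_x, 5))    # 0.87549 bit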