Tutorial1

Answer Hints to Bioinformatics Tutorial:
Differential Gene Expression Analysis
1.
Book Question, examples provided in lecture notes!
2.
Gene       A       B       C       D
N        128     256    1024    1000
D        256     128     512    1000
FC         1      -1      -1       0
FC = (log(D) - log(N))/log(2), i.e. FC = log2(D/N)
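As a quick check of the table in question 2, the fold change formula can be evaluated directly. This is only a minimal sketch: the function name log2_fold_change and the dictionary layout are illustrative choices, not something taken from the lecture notes.

```python
import math

def log2_fold_change(normal, diseased):
    """Return (log(D) - log(N)) / log(2), which equals log2(D / N)."""
    return (math.log(diseased) - math.log(normal)) / math.log(2)

# (N, D) expression values for genes A to D, copied from question 2
genes = {"A": (128, 256), "B": (256, 128), "C": (1024, 512), "D": (1000, 1000)}

for name, (n, d) in genes.items():
    print(name, round(log2_fold_change(n, d), 6))
```

A doubling gives +1, a halving gives -1 and no change gives 0, which is why the log scale is convenient: up and down changes of the same size are symmetric around zero.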
3.
T-test
a.
When would you use a t-test as opposed to a z-test?
Use the t-test for small data sets where the population standard deviation is unknown and has to
be estimated from the sample, and the z-test for larger data sets (or when the population standard
deviation is known).
The z-test assumes that the test statistic follows a normal distribution.
The t-test assumes that the data set (sample) was drawn from a normal distribution, but because
the standard deviation is estimated from only a small sample, the test statistic follows a
t-distribution rather than a normal distribution.
b.
What is meant by paired and unpaired experiments? How do they affect the calculation of a
t-test?
Paired experiments involve measurements from the same individuals (or very similar individuals,
e.g. twins) under different conditions. In such a case you can compare the measurements of each
individual directly, so the t-test is calculated on the per-individual differences.
In unpaired experiments this assumption does not hold (the individuals are different and you
cannot relate individual measurements), so you have to compare the averages across the two data
sets.
The formulae for paired/unpaired are given in the lecture notes.
c.
What is meant by a two tail t-test? Right tail / Left tail t-tests?
In a two tail test you are only interested in whether the two populations are different (you don’t
care if the change was positive or negative, so long as the measurements in one group are far
enough in either direction from the mean value of the other group).
Right tail and left tail tests are directional: they test specifically whether the measurements in
one group are higher (right tail) or lower (left tail) than those in the other group, so the whole
rejection area sits in one tail.
d.
What is meant when we say that the t-value is a Signal-to-Noise ratio?
The Signal and Noise represent the two components of the t-value (Signal represents the
numerator, Noise represents the denominator).
Signal is the average difference between both groups (High signal means the difference is high),
and Noise is the fluctuation in that difference (Low noise means small fluctuations).
A large SNR means the differences are high and the fluctuation (noise) is low.
e.
What is the number of degrees of freedom for a paired t-test when each of the samples has 10
data elements?
10-1 = 9
f.
What is the number of degrees of freedom for two unpaired data sets, the first having 4
elements and the second having 6 elements?
4+6-2=8
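The answers to question 3 can be illustrated with a short sketch of both calculations. The lecture-note formulae are not reproduced in this sheet, so the code below assumes the standard textbook forms: the paired test works on per-individual differences, and the unpaired test uses a pooled variance (which is what gives n1 + n2 - 2 degrees of freedom). The function names and the before/after measurements are hypothetical.

```python
import math
from statistics import mean, stdev, variance

def paired_t(x, y):
    """Paired t-test: work on the per-individual differences."""
    diffs = [b - a for a, b in zip(x, y)]
    signal = mean(diffs)                          # average difference
    noise = stdev(diffs) / math.sqrt(len(diffs))  # standard error of that difference
    return signal / noise, len(diffs) - 1         # (t, degrees of freedom)

def unpaired_t(x, y):
    """Unpaired (pooled-variance) t-test: compare the two group means."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    signal = mean(y) - mean(x)                    # difference between group means
    noise = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return signal / noise, df                     # (t, degrees of freedom)

# Hypothetical measurements on the same four individuals, before and after
before = [10, 20, 30, 40]
after = [12, 23, 31, 45]
print(paired_t(before, after))    # pairing removes the variation between individuals
print(unpaired_t(before, after))  # same signal, much larger noise
```

With these made-up numbers the signal (2.75) is the same in both tests, but the noise is far smaller in the paired case, so the paired t-value is much larger.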
4.
P-value
a.
What does a p-value of 1 mean?
Hard Luck! Your data are exactly what the null hypothesis predicts, so you have no evidence at
all against it (strictly speaking you can never prove the null hypothesis, you can only fail to
reject it).
In case you were trying to check whether a particular value does not belong to a given population,
you just discovered that this value coincides exactly with the mean of the population. In case of
testing for differences between the means of two samples, you have just found that there is no
difference between their means. The area under the curve between this value and +/- infinity is 1.
b.
What does a p-value of 0.05 mean? Explain your answer graphically using a normal
distribution.
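Congratulations, you can reject the null hypothesis at the 5% significance level. A p-value of 0.05
means that, if the null hypothesis were true, the probability of seeing a result at least as extreme
as yours would be only 5%. On a normal distribution this is the area under the curve in the tails
beyond your value (2.5% in each tail for a two tail test).
In case you were trying to check whether a particular value does not belong to a given population,
you just discovered that this value is very far from the mean of the population: a value this far
out would turn up by chance only 5% of the time if it really belonged to the population.
In case of testing for differences between the means of two samples, a difference this large would
be seen only 5% of the time if the true means were equal, so you conclude that the means are
likely to be different.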
c.
What is the difference between a normal distribution and a t-distribution?
The t-distribution is lower at the mean and has heavier tails, i.e. it takes longer to approach
zero on both sides. For any positive value on the x-axis, the area under the curve to the right of
that value (and, by symmetry, to the left of its negative) is bigger for the t-distribution than it
is for the normal distribution. Note that the t-distribution approximates a normal distribution
when the number of degrees of freedom is high (>30).
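The shape difference can be checked numerically by evaluating both density functions at the mean and out in the tail. This is a minimal sketch using the standard density formulas; the function names normal_pdf and t_pdf are illustrative.

```python
import math

def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    # lgamma is used instead of gamma so that large df does not overflow
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

# Height at the mean (x = 0) and in the tail (x = 3) for increasing df
for df in (3, 30, 1000):
    print(df, round(t_pdf(0, df), 4), round(t_pdf(3, df), 4))
print("normal", round(normal_pdf(0), 4), round(normal_pdf(3), 4))
```

For small df the t-distribution is lower at x = 0 and higher at x = 3 than the normal distribution, and the two sets of numbers move together as df grows.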
d.
What is meant by a critical t-value for a p=0.05, how does this value depend on the number
of samples in an experiment?
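This is the value on the x-axis where the area under the curve to its right is 0.05 (for a one
tail test; for a two tail test the area to its right is 0.025, with another 0.025 to the left of
its negative).
For your experiment to have a p-value of 0.05 or less, the magnitude of the t-value you calculate
must be at least the critical t-value.
The critical t-value depends on the number of degrees of freedom, and hence on the number of
samples: the more samples, the smaller the critical t-value. The t-value you calculate also
changes with the number of samples, because the noise term shrinks as samples are added.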
e.
Using the p-value table at the bottom of the next page, find the critical t-value for a paired
t-test (2 samples each having 4 elements) that provides 95% confidence that the two samples are
different.
v = 4 - 1 = 3, and it is a two tail test. The critical t-value is the value for which the area
under the curve in each tail is 0.025 (since the curve is symmetric and it is a two tail test).
In the new table below this is 3.182 (please note this value was missing in the original tutorial
sheet).
T        v   p
10.95    6   0.000034
10.95    3   0.001631
12.05    6   0.000020
12.05    3   0.001230
8.4      6   0.000155
8.4      3   0.003539
2.353    6   0.056825
3.182    3   0.025
2.353    3   0.100033
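If no table is to hand, the tail areas and critical values can be approximated numerically. The sketch below is one possible way to do it with the standard library only: it integrates the t-distribution density with Simpson's rule and finds the critical value by bisection. The function names, the step count and the search range of 0 to 100 are assumptions made for this illustration.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log(1 + x * x / df))

def t_p_value(t, df, tails=2, steps=2000):
    """Area in the tail(s) beyond |t|, using Simpson's rule on [0, |t|]."""
    t = abs(t)
    h = t / steps                        # steps must be even for Simpson's rule
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    centre_area = total * h / 3          # area between 0 and |t|
    return tails * (0.5 - centre_area)   # each tail holds 0.5 minus that area

def critical_t(p, df, tails=2):
    """t-value whose tail area equals p, found by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_p_value(mid, df, tails) > p:
            lo = mid                     # tail area still too big: move right
        else:
            hi = mid
    return hi

print(round(critical_t(0.05, 3), 3))            # two tail, v = 3
print(round(critical_t(0.05, 3, tails=1), 3))   # one tail, v = 3
print(round(t_p_value(10.95, 6), 6))            # compare with the table row T = 10.95, v = 6
```

Calling critical_t with increasing df shows the critical value falling towards the normal-distribution value, which is the dependence on sample size asked about in part d.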
5.
Volcano Plot
a.
Explain the volcano plot method for assessing the effect and significance over a large number
of genes. Why is it useful?
You are trying to compare a very large number of fold changes to quickly assess which genes have
an effect that is both high and significant. You use a scatter plot in which each point represents
a gene. The co-ordinates of a point represent the magnitude of the effect for that gene and its
significance.
b.
How are effect and significance calculated in the volcano plot?
Effect is calculated as the difference between the means of the two groups; significance is
calculated as the p-value from a t-test.
c.
What is the numerical interpretation of the X-axis in a volcano plot?
This represents the average fold change on a log2 scale. A value of 0 means no change, a value of
+1 means the expression of the gene is doubled (-1 means it is halved), a value of +2 means it is
4 times higher, etc.
d.
What is the numerical interpretation of the y-axis in the volcano plot?
This represents the significance as -log10 of the p-value, which is roughly the number of zeros
after the decimal point in the p-value. The higher you are on the y-axis, the lower the p-value
(and hence the higher your confidence that the effect is true and not just by chance).
6.
The table below shows gene expression values for a number of genes. Each gene is measured for the
same type of tissue cell in normal state in four samples (N1..N4) and in diseased state in another
set of four samples (D1..D4).
Gene    N1    N2    N3    N4    D1    D2    D3    D4
A       10    20    30    40   110   120   130   140
B       11    18    27    44    50    60    70    80
C        9    17    32    43    15    25    35    45
D       10    20    30    40     1    11    21    31
E       10    20    30    40    20    10    40    30
F       10    20    48    40   100   120   130    70
G      100   120   130   140    10    20    30    40
H       50    60    70    80    10    20    30    40
I       14    26    33    37    10    20    30    40
J        1    11    21    31    10    20    30    40
K       19     8    42    46    10    20    30    40
L      110   120   130    70    10    20    30    40
M       10    20    30    40    10    20    30    40
N       10    20    30    40   120   130   140   150
O       11    19    26    36   110   120   130    70
P      100   120   130    70    10    20    30    40
Q      120   130   140   150    10    20    30    40
R      120   130   140   150    10    20    30    40
S       10    10    35    40   100   120   130   140
T       11    19    32    39   110   120   130   140
U       20    30    20    30    25    40    35    37
V      100   100   100   100   100   100   100   100
a.
Consider Gene V. Without going through lengthy calculations, is there any
change between both states? What do you expect the p-value to be?
For Gene V, all measurements are 100 in both groups. So there is no change between
the expression values for the normal and diseased states. The p-value is going to be 1.
b.
Calculate t-value for Genes A, E, N, Q, S, V
I used Excel (WT = normal samples N1..N4, KO = diseased samples D1..D4)
Gene   WT1   WT2   WT3   WT4   KO1   KO2   KO3   KO4   Mean WT   SD WT      Mean KO   SD KO
A       10    20    30    40   110   120   130   140   25        12.90994   125       12.90994
E       10    20    30    40    20    10    40    30   25        12.90994   25        12.90994
N       10    20    30    40   120   130   140   150   25        12.90994   135       12.90994
Q      120   130   140   150    10    20    30    40   135       12.90994   25        12.90994
S       10    10    35    40   100   120   130   140   23.75     16.00781   122.5     17.07825
V      101   101   101   101   101   101   101   101   101       1E-14      101       1.00E-14
                 Significance
Gene   Effect    Unpaired T      P_Unpaired
A      100       10.95445115     0.000034
E      0         0               1
N      110       12.04989627     0.000020
Q      -110      -12.04989627    0.000020
S      98.75     8.437423426     0.000151
V      0         0               1
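The Excel figures for question 6b can be reproduced with a few lines of code. This sketch assumes the unpaired t-value is the pooled-variance form, with effect taken as the diseased mean minus the normal mean; it uses the gene values from the question table (so Gene V is 100 throughout). The function name effect_and_t and the handling of the zero-noise case are choices made for this illustration.

```python
import math
from statistics import mean, variance

# N1..N4 and D1..D4 values for the selected genes, taken from the table in question 6
data = {
    "A": ([10, 20, 30, 40], [110, 120, 130, 140]),
    "E": ([10, 20, 30, 40], [20, 10, 40, 30]),
    "N": ([10, 20, 30, 40], [120, 130, 140, 150]),
    "Q": ([120, 130, 140, 150], [10, 20, 30, 40]),
    "S": ([10, 10, 35, 40], [100, 120, 130, 140]),
    "V": ([100, 100, 100, 100], [100, 100, 100, 100]),
}

def effect_and_t(normal, diseased):
    """Return (effect, unpaired t) with effect = mean(diseased) - mean(normal)."""
    effect = mean(diseased) - mean(normal)
    n1, n2 = len(normal), len(diseased)
    pooled = ((n1 - 1) * variance(normal) + (n2 - 1) * variance(diseased)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    if se == 0:
        # Assumption: no spread and no difference (gene V) is reported as t = 0
        return effect, 0.0 if effect == 0 else math.copysign(math.inf, effect)
    return effect, effect / se

for gene, (normal, diseased) in data.items():
    effect, t = effect_and_t(normal, diseased)
    print(f"{gene}: effect = {effect}, t = {t:.6f}")
```

The degrees of freedom for every gene are 4 + 4 - 2 = 6, so the p-values come from the v = 6 rows of a t-table.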
7.
a.
Using the table of p-values below:
Calculate the effect and significance for genes A, E, N, Q, S, V and plot them on a scatter plot
(Volcano plot)
Calculations from above table plotted in figure below, each gene is represented by a square on the
plot and labelled.
b.
Compare the effect and significance between genes A and S
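Directly from the plot, S has a higher effect but lower significance than A (as a fold change,
122.5/23.75 ≈ 5.2 for S against 125/25 = 5 for A, although the difference between the means is
slightly smaller for S: 98.75 against 100).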
[Figure: volcano plot with points labelled N, Q, A, S and E,V]
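The co-ordinates for a plot like this can be computed from the group means and p-values. The sketch below assumes the x-axis is the log2 fold change and the y-axis is -log10 of the p-value, as described in question 5; the original figure may instead have used the difference of means for the effect. The means and p-values are the Excel results from question 6, and the name volcano_point is illustrative.

```python
import math

# (mean normal, mean diseased, unpaired p-value) from the Excel results for question 6
results = {
    "A": (25, 125, 0.000034),
    "E": (25, 25, 1),
    "N": (25, 135, 0.000020),
    "Q": (135, 25, 0.000020),
    "S": (23.75, 122.5, 0.000151),
    "V": (101, 101, 1),
}

def volcano_point(mean_normal, mean_diseased, p_value):
    """Return (x, y) = (log2 fold change, -log10 p) for one gene."""
    x = math.log2(mean_diseased / mean_normal)  # effect
    y = -math.log10(p_value)                    # significance
    return x, y

points = {gene: volcano_point(*vals) for gene, vals in results.items()}
for gene, (x, y) in points.items():
    print(f"{gene}: x = {x:+.2f}, y = {y:.2f}")
```

Under these assumptions E and V both land at the origin, N and Q mirror each other left and right, and S sits slightly to the right of A but lower down.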