Answer Hints to Bioinformatics Tutorial: Differential Gene Expression Analysis

1. Book Question, examples provided in lecture notes!

2.
        A      B      C      D
N       128    256    1024   1000
D       256    128    512    1000
FC      1      -1     -1     0

FC = (log(D) - log(N))/log(2)

3. T-test

a. When would you use a t-test as opposed to a z-test?
Use the t-test for small data sets and the z-test for larger data sets. The z-test assumes that the test statistic follows a normal distribution, which holds when the population standard deviation is known or the sample is large. The t-test assumes that the data set (sample) was drawn from a normal distribution, but because the standard deviation has to be estimated from a small sample, the test statistic follows a t-distribution.

b. What is meant by paired and unpaired experiments? How do they affect the calculation of a t-test?
Paired experiments involve measurements from the same individuals (or very similar individuals, e.g. twins) under different conditions. In that case you can compare the two measurements of each individual directly. In unpaired experiments this assumption does not hold (the individuals are different and you cannot relate individual measurements), so you have to compare the averages across the two data sets. The formulae for paired/unpaired tests are given in the lecture notes.

c. What is meant by a two-tail t-test? Right-tail / left-tail t-tests?
In a two-tail test you are only interested in whether the two populations are different (you don’t care whether the change is positive or negative, so long as the measurements in one group are far enough in either direction from the mean value of the other group). Right-tail and left-tail tests are stronger: they require that the measurements in one group are higher (right tail) or lower (left tail) than those in the other group.

d. What is meant when we say that the t-value is a signal-to-noise ratio?
The Signal and Noise represent the two components of the t-value (Signal is the numerator, Noise is the denominator).
Signal is the average difference between the two groups (high signal means the difference is large), and Noise is the fluctuation in that difference (low noise means small fluctuations). A large SNR means the difference is large and the fluctuation (noise) is low.

e. What is the number of degrees of freedom for a paired t-test when each of the samples has 10 data elements?
10 - 1 = 9

f. What is the number of degrees of freedom for two unpaired data sets, the first having 4 elements and the second having 6 elements?
4 + 6 - 2 = 8

4. P-value

a. What does a p-value of 1 mean?
Hard luck! Your data give no evidence against the null hypothesis. If you were checking whether a particular value does not belong to a given population, you have found that this value coincides exactly with the population mean. If you were testing for a difference between the means of two samples, you have found that the two sample means are identical. The area under the curve from this value out to +/- infinity (both tails together) is 1.

b. What does a p-value of 0.05 mean? Explain your answer graphically using a normal distribution.
Congratulations, you can reject the null hypothesis at the 5% level. A p-value of 0.05 means that, if the null hypothesis were true, a result at least this extreme would occur only 5% of the time. If you were checking whether a particular value does not belong to a given population, you have found that this value is very far from the population mean: on a normal distribution, the area under the curve beyond this value (in the tails) is only 0.05. If you were testing for a difference between the means of two samples, you have found strong evidence that their means are different.

c. What is the difference between a normal distribution and a t-distribution?
The t-distribution is lower at the mean and flatter, i.e. it takes longer to reach zero on both sides.
For any value on the x-axis, the tail area beyond that value (to the left of a negative value, or to the right of a positive one) is bigger for the t-distribution than it is for the normal distribution. Note that the t-distribution approximates a normal distribution when the number of degrees of freedom is high (>30).

d. What is meant by a critical t-value for p = 0.05? How does this value depend on the number of samples in an experiment?
This is the value on the x-axis where the area under the curve to its right is 0.05 (for a one-tail test). For your experiment to have a p-value of 0.05 or less, the t-value you calculate must be at least as large as the critical t-value. Both the t-value you calculate and the critical t-value change as the number of degrees of freedom changes; the critical t-value gets smaller as the number of degrees of freedom (and hence the number of samples) increases.

e. Using the p-value table at the bottom of the next page, find the critical t-value for a paired t-test (2 samples each having 4 elements) that provides 95% confidence that the two samples are different.
v = 4 - 1 = 3, and it is a two-tailed test. The critical t-value is the value for which the area under the curve in each tail is 0.025 (since the curve is symmetric and it is a two-tailed test). In the new table below this is 3.182. (Please note this value was missing in the original tutorial sheet.)

T        v    p
10.95    6    0.000034
10.95    3    0.001631
12.05    6    0.000020
12.05    3    0.001230
8.4      6    0.000155
8.4      3    0.003539
2.353    6    0.056825
3.182    3    0.025
2.353    3    0.100033

5. Volcano Plot

a. Explain the volcano plot method for assessing the effect and significance over a large number of genes. Why is it useful?
You are trying to compare a very large number of fold changes to quickly assess which genes have an effect that is both large and significant. You use a scatter plot in which each point represents a gene. The coordinates of a point represent the magnitude of the effect for that gene and its significance.

b. How are effect and significance calculated in the volcano plot?
Effect is calculated as the difference between the means of the two groups. Significance is calculated as the p-value from a t-test.

c. What is the numerical interpretation of the x-axis in a volcano plot?
This represents the average fold change on a log2 scale. A value of 0 means no change, a value of +1 means the expression of the gene is doubled (-1 means it is halved), a value of +2 means it is 4 times higher, etc.

d. What is the numerical interpretation of the y-axis in the volcano plot?
This represents roughly the number of decimal places in the calculated p-value (i.e. -log10 of the p-value). The higher you are on the y-axis, the lower the p-value (and hence the higher your confidence that the effect is real and not just due to chance).

6. The table below shows gene expression values for a number of genes. Each gene is measured for the same type of tissue cell in the normal state in four samples (N1..N4) and in the diseased state in another set of four samples (D1..D4).

Gene   N1    N2    N3    N4    D1    D2    D3    D4
A      10    20    30    40    110   120   130   140
B      11    18    27    44    50    60    70    80
C      9     17    32    43    15    25    35    45
D      10    20    30    40    1     11    21    31
E      10    20    30    40    20    10    40    30
F      10    20    48    40    100   120   130   70
G      100   120   130   140   10    20    30    40
H      50    60    70    80    10    20    30    40
I      14    26    33    37    10    20    30    40
J      1     11    21    31    10    20    30    40
K      19    8     42    46    10    20    30    40
L      110   120   130   70    10    20    30    40
M      10    20    30    40    10    20    30    40
N      10    20    30    40    120   130   140   150
O      11    19    26    36    110   120   130   70
P      100   120   130   70    10    20    30    40
Q      120   130   140   150   10    20    30    40
R      120   130   140   150   10    20    30    40
S      10    10    35    40    100   120   130   140
T      11    19    32    39    110   120   130   140
U      20    30    20    30    25    40    35    37
V      100   100   100   100   100   100   100   100

a. Consider Gene V. Without going through lengthy calculations, is there any change between the two states? What do you expect the p-value to be?
For Gene V, all measurements are 100 in both groups, so there is no change between the expression values for the normal and diseased states. The p-value is going to be 1.

b. Calculate the t-value for Genes A, E, N, Q, S, V.
I used Excel (here WT = normal, KO = diseased):

Gene   WT1   WT2   WT3   WT4   KO1   KO2   KO3   KO4   Mean WT   SD WT      Mean KO   SD KO
A      10    20    30    40    110   120   130   140   25        12.90994   125       12.90994
E      10    20    30    40    20    10    40    30    25        12.90994   25        12.90994
N      10    20    30    40    120   130   140   150   25        12.90994   135       12.90994
Q      120   130   140   150   10    20    30    40    135       12.90994   25        12.90994
S      10    10    35    40    100   120   130   140   23.75     16.00781   122.5     17.07825
V      101   101   101   101   101   101   101   101   101       1E-14      101       1.00E-14

                Significance
Gene   Effect   Unpaired T      P_Unpaired
A      100      10.95445115     0.000034
E      0        0               1
N      110      12.04989627     0.00002
Q      -110     -12.04989627    0.000020
S      98.75    8.437423426     0.000151
V      0        0               1

7.

a. Using the table of p-values below: calculate the effect and significance for genes A, E, N, Q, S, V and plot them on a scatter plot (volcano plot).
The calculations from the table above are plotted in the figure below; each gene is represented by a labelled square on the plot.

b. Compare the effect and significance between genes A and S.
Directly from the plot: S has a higher effect but lower significance than A.

[Figure: volcano plot of the six genes; point labels N, Q, A, S, E,V]
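As a cross-check on the Excel numbers in 6b, the effect and unpaired t-value can be recomputed with a short Python sketch. It assumes a pooled-variance (equal-variance) two-sample t-test, which is consistent with the 6 degrees of freedom used in the p-value table. The names `GENES`, `unpaired_t` and `results` are illustrative only, and gene V is entered as 100 throughout, as in the question 6 table.

```python
from math import copysign, inf, sqrt
from statistics import mean, stdev

# Expression values copied from the question 6 table: (N1..N4, D1..D4).
GENES = {
    "A": ([10, 20, 30, 40], [110, 120, 130, 140]),
    "E": ([10, 20, 30, 40], [20, 10, 40, 30]),
    "N": ([10, 20, 30, 40], [120, 130, 140, 150]),
    "Q": ([120, 130, 140, 150], [10, 20, 30, 40]),
    "S": ([10, 10, 35, 40], [100, 120, 130, 140]),
    "V": ([100, 100, 100, 100], [100, 100, 100, 100]),
}


def unpaired_t(normal, diseased):
    """Return (effect, t, degrees of freedom) for a pooled-variance t-test.

    Assumption: equal variances in both groups (pooled estimate).
    """
    n1, n2 = len(normal), len(diseased)
    effect = mean(diseased) - mean(normal)  # signal: difference of means
    dof = n1 + n2 - 2
    pooled_var = (
        (n1 - 1) * stdev(normal) ** 2 + (n2 - 1) * stdev(diseased) ** 2
    ) / dof
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))  # noise
    if std_err == 0:
        # No spread at all: report t = 0 when there is also no change
        # (as the answer sheet does for gene V), otherwise +/- infinity.
        t = 0.0 if effect == 0 else copysign(inf, effect)
    else:
        t = effect / std_err
    return effect, t, dof


results = {gene: unpaired_t(n, d) for gene, (n, d) in GENES.items()}

for gene, (effect, t, dof) in results.items():
    print(f"{gene}: effect={effect:g}  t={t:.4f}  dof={dof}")
```

If SciPy is available, a two-tailed p-value could then be obtained from each t-value and its degrees of freedom, for example with `2 * scipy.stats.t.sf(abs(t), dof)`; that step is left out here to keep the sketch dependency-free.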