Practical: Statistical Principles of Experimental Design

Exercise 1: Estimating the variability of a microarray experiment

In this exercise we estimate the coefficient of variability between hybridisations. We do this by looking at how an identical reference sample, hybridised to 20 different arrays, varies as a function of hybridisation intensity.

1.1 Load Excel.

1.2 Open the file “perou.xls”. This file contains data for the 20 breast cancer patients treated with doxorubicin. Go to the worksheet called “Reference Sample”. The first column contains the spot number. The next 20 columns contain the background-subtracted raw data for the green channel for the 20 patients. They have been labelled “g7”, “g10” etc., where the number is a patient ID. The following 20 columns contain the background-subtracted raw data for the red channel for the 20 patients. The green channel contains data for the reference sample, which is identical on every array. The red channel contains the data from the patient samples.

1.3 We start by normalising the data: we take logs of the green data and then centre the data. The columns AP to BI have been set up for the logged data. Into cell AP2, type the formula:

    = LN(B2) / LN(2)

This calculates the log of the intensity to base 2. Copy this formula into all of the cells in columns AP to BI and rows 2 to 6351.

1.4 To centre the data, we need the average and standard deviation of each array. Into cell AP6352 type the formula:

    = AVERAGE(AP2:AP6351)

Into cell AP6353 type the formula:

    = STDEV(AP2:AP6351)

Copy these formulae into the other columns (AQ to BI).

1.5 The normalised (centred) data will be put into columns BJ to CC. To centre the data, we subtract the average of the array and divide by its standard deviation. Into cell BJ2, type the formula:

    = (AP2 - AP$6352)/AP$6353

The $ signs are very important: they fix the row numbers, so that every row is centred using the average and standard deviation stored in rows 6352 and 6353 of its own column. Copy this formula into columns BJ through CC and rows 2 through 6351.
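As a cross-check on steps 1.3 to 1.5, the same log transform and centring can be sketched in R. This is only an illustration under stated assumptions: the matrix called green is simulated and merely stands in for the 20 green-channel columns of the worksheet, and all variable names are made up for this sketch.

```r
# Simulated stand-in for the 20 green-channel columns (6350 spots x 20 arrays).
# ASSUMPTION: the real values would be read from the "Reference Sample" worksheet.
set.seed(1)
green <- matrix(rlnorm(6350 * 20, meanlog = 7, sdlog = 1),
                nrow = 6350, ncol = 20)

# Step 1.3: log to base 2, the equivalent of LN(B2)/LN(2) in Excel
log_green <- log2(green)

# Step 1.4: average and standard deviation of each array (column).
# sd() uses the n-1 denominator, as Excel's STDEV does.
array_mean <- colMeans(log_green)
array_sd <- apply(log_green, 2, sd)

# Step 1.5: subtract each array's average, then divide by its standard deviation
centred <- sweep(log_green, 2, array_mean, "-")
centred <- sweep(centred, 2, array_sd, "/")

# Every centred column should now have average 0 and standard deviation 1
print(round(colMeans(centred)[1:3], 10))
print(round(apply(centred, 2, sd)[1:3], 10))
```

With real data, note that log2() returns -Inf for a zero intensity and NaN for a negative one, so any such background-subtracted spots would need attention before the averages are computed.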
1.6 We are now in a position to calculate the mean expression of the reference sample for each gene. Into cell CD2, type the formula:

    = AVERAGE(BJ2:CC2)

Copy this formula into the remainder of the column.

1.7 We calculate the deviates from the average and place them in columns CE through CX. Into cell CE2, type the formula:

    = BJ2-$CD2

Copy this formula into columns CE through CX and rows 2 to 6351.

1.8 We use MVA plots to check that the standard deviation of the deviates is independent of intensity. Select columns CD and CE and produce a scatter plot. Is the variability independent of intensity? Do the same using column CD and different columns for the y-axis.

1.9 We would also like to plot histograms of the deviates, which is not so easy (for me) in Excel. If you know how, load your data into R and plot a histogram there.

1.10 Calculate the standard deviation for all the deviates by typing into cell CY2:

    = STDEV(CE2:CX6351)

If we assume that the data are log-normally distributed, this can be converted to a coefficient of variability by typing into cell CY3:

    = SQRT(EXP((CY2*LN(2))^2)-1)

What is the coefficient of variability?

Exercise 2: Estimating the Population Variability in a Microarray Experiment

In this exercise we go through a similar process to estimate the coefficient of variability of gene expression in the patient population itself. We use exactly the same method as in Exercise 1.

2.1 Using the same spreadsheet “perou.xls”, go to the worksheet called “Population Variability”. The first column contains the spot number. The next 20 columns contain normalised (centred) log ratios of red/green for the 20 patients; we have already performed the normalisation.

2.2 Use column V to calculate the mean log ratio for each gene (as in point 1.6 of Exercise 1).

2.3 Use columns W through AP to calculate the deviates from the mean (as in point 1.7).

2.4 Produce MVA plots of the deviates as a function of the mean. What do you notice?
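For those who prefer R, steps 2.2 to 2.4 (the same calculation as steps 1.6 to 1.8), the histogram mentioned in step 1.9, and the conversion from step 1.10 can be sketched as follows. The data are simulated, so the numbers printed are not the answers to the exercise; the matrix norm_data and the other names are assumptions made for this sketch.

```r
# Simulated stand-in for 20 columns of normalised data (rows = spots).
# ASSUMPTION: the real values would come from the worksheet.
set.seed(2)
norm_data <- matrix(rnorm(6350 * 20, mean = 0, sd = 0.4),
                    nrow = 6350, ncol = 20)

# Mean for each spot across the 20 arrays (steps 1.6 and 2.2)
spot_mean <- rowMeans(norm_data)

# Deviates from the spot mean (steps 1.7 and 2.3).
# The vector is recycled down the columns, so row i has spot_mean[i] subtracted.
deviates <- norm_data - spot_mean

# MVA-style scatter plot for the first array (steps 1.8 and 2.4)
plot(spot_mean, deviates[, 1], pch = ".",
     xlab = "Mean over arrays", ylab = "Deviate, array 1")

# Histogram of all the deviates (step 1.9)
hist(as.vector(deviates), breaks = 100,
     main = "Deviates from the spot mean", xlab = "Deviate")

# Overall standard deviation and coefficient of variability (step 1.10),
# assuming log-normally distributed data on a log2 scale
pooled_sd <- sd(as.vector(deviates))
cv <- sqrt(exp((pooled_sd * log(2))^2) - 1)
print(c(sd = pooled_sd, cv = cv))
```

Changing the column index in deviates[, 1] gives the scatter plots for the other arrays.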
2.5 Calculate the standard deviation and coefficient of variability. What is the coefficient of variability? How does this compare with the coefficient of variability for the reference sample? What do you conclude?

Exercise 3: Estimating the Power of Paired and Unpaired Microarray Experiments

In this exercise, we show how to use the power.t.test() function in R to calculate the power or the number of replicates for paired and unpaired microarray experiments. We use R because it provides this function and can handle very small significance levels.

3.1 Open “R” from Start -> Programs -> R -> R 1.5.0

3.2 Look at the help for the power.t.test function by typing:

    help(power.t.test)

A new window should open with information about this function.

3.3 First, we convert the coefficient of variability into a standard deviation on the log (base 2) scale. Suppose an experiment has a coefficient of variability of 50%. To get the standard deviation in log to base 2, type:

    sqrt(log(0.5^2+1))/log(2)

What is the standard deviation for an experiment with a 30% coefficient of variability? The breast cancer experiment has a coefficient of variability of 45%: what is the standard deviation?

3.4 The fold-ratio we are trying to detect converts to a difference of means simply by taking logs. The difference of means in log to base 2 corresponding to a 2-fold ratio is 1. The difference for a 3-fold ratio is given by:

    log(3)/log(2)

What is the difference of means corresponding to a 1.5-fold difference?

3.5 We start by calculating the power of the breast cancer experiment. We calculated the standard deviation in 3.3. Suppose we want to detect 2-fold regulated genes; the difference in means is 1. The number of replicates is 20. There are 6350 genes in the analysis. Suppose we are prepared to accept about one false positive. The significance level we need to use is then 1/6350.
Calculate the power by typing in:

    power.t.test(sd=0.62,delta=1,n=20,sig.level=1/6350,type="one.sample",alternative="two.sided")

The type argument specifies that this is a one-sample test rather than a two-sample test. The alternative of "two.sided" means that we are testing for both up-regulated and down-regulated genes. Note that R requires straight quotation marks around these values. What is the power? Do you think this is good?

3.6 What is the power for detecting 1.5-fold and 3-fold regulated genes? What do you notice?

3.7 What happens to the power if you change the coefficient of variability to 30% or 50%?

3.8 What happens to the power if you change the type of test to "two.sample"?

Exercise 4: Estimating the Number of Replicates for a Microarray Experiment

In this exercise, we use the same R function as in the previous exercise to estimate the number of replicates needed in microarray experiments.

4.1 We now estimate the number of replicates for an experiment to investigate the toxic effects of Benzo(a)pyrene (BP) on rats. There are two groups of rats, treated either with BP or with a control substance. Is this a one-sample or a two-sample experiment?

4.2 The coefficient of variability is 40%. What is the standard deviation?

4.3 Suppose there are 10,000 genes in the experiment and we are happy with approximately 1 false positive. What is the significance level?

4.4 We are interested in finding 2-fold differentially expressed genes with 90% power. Type the formula:

    power.t.test(sd=0.56,delta=1,sig.level=0.0001,power=0.9,type="two.sample",alternative="two.sided")

How many rats do you need in each group?

4.5 How many rats would you need in each group for 80% power or 95% power?

4.6 What is the power if there are only 5000 genes on the array and you are happy with 1 false positive?
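The conversions from 3.3 and 3.4 and the two kinds of power.t.test() call used in Exercises 3 and 4 can be collected into one short R sketch. The helper names cv_to_log2_sd and fold_to_delta are made up for this illustration and are not part of R; power.t.test() itself is the standard function from the stats package. The sketch passes unrounded standard deviations, so its results may differ slightly from those obtained with the rounded values 0.62 and 0.56.

```r
# Hypothetical helper: coefficient of variability -> SD on the log2 scale (step 3.3)
cv_to_log2_sd <- function(cv) sqrt(log(cv^2 + 1)) / log(2)

# Hypothetical helper: fold ratio -> difference of means on the log2 scale (step 3.4)
fold_to_delta <- function(fold) log(fold) / log(2)

# Power of a one-sample design (step 3.5): n is given, power is left out
res_power <- power.t.test(n = 20,
                          delta = fold_to_delta(2),
                          sd = cv_to_log2_sd(0.45),
                          sig.level = 1 / 6350,
                          type = "one.sample",
                          alternative = "two.sided")
print(res_power$power)

# Replicates for a two-sample design (step 4.4): power is given, n is left out
res_n <- power.t.test(delta = fold_to_delta(2),
                      sd = cv_to_log2_sd(0.40),
                      sig.level = 1 / 10000,
                      power = 0.9,
                      type = "two.sample",
                      alternative = "two.sided")
# n is reported as a fraction; round up to whole animals per group
print(ceiling(res_n$n))
```

Exactly one of n, delta, sd, sig.level and power must be left unspecified (or set to NULL); power.t.test() solves for that one. Changing the arguments to the helpers is enough to work through questions 3.6 to 3.8 and 4.5.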