Practical: Statistical Principles of Experimental Design
Exercise 1: Estimating the variability of a microarray
experiment
In this exercise we estimate the coefficient of variability between
hybridisations. We do this by examining, as a function of hybridisation intensity,
how much an identical reference sample varies when hybridised to 20 different arrays.
1.1
Load Excel
1.2
Open up the file “perou.xls”. This file contains data for the 20 breast cancer
patients treated with doxorubicin. Go to the worksheet called “Reference
Sample”. The first column contains the spot number. The next 20 columns
contain the background subtracted raw data for the green channel for the 20
patients. They have been labelled “g7”, “g10” etc, where the number is a
patient ID. The following 20 columns contain the background subtracted raw
data for the red channel for the 20 patients. The green channel contains data
for the reference sample, which is identical on every array. The red channel
contains the data from the patient samples.
1.3
We start by normalising the data by taking logs of the green data and then
centering the data. The columns AP to BI have been set up for the logged data.
Into cell AP2, type the formula:
= LN(B2) / LN(2)
This calculates the log of the intensity to base 2. Copy this formula into all of
the cells in columns AP to BI and rows 2 to 6351.
1.4
To centre the data, we will need to compute the average and standard
deviation of each array. Into cell AP6352 type the formula:
= AVERAGE(AP2:AP6351)
Into cell AP6353 type the formula:
= STDEV(AP2:AP6351)
Copy these formulae into the other columns.
1.5
The normalised (centered) data will be put into columns BJ to CC. To center
the data, we must subtract the average of the array and divide by the standard
deviation. Into cell BJ2, type the formula:
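= (AP2 - AP$6352)/AP$6353
The $ signs are essential: they fix the row, so every copied formula still
refers to the average (row 6352) and standard deviation (row 6353) of its own
column. Copy this formula into columns BJ through CC and rows 2 through 6351.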
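For comparison, steps 1.3 to 1.5 can be sketched in R. This is only an illustration under stated assumptions: the matrix `green` is synthetic data standing in for the green-channel columns, not the real worksheet, and the variable names are made up for this example.

```r
# Synthetic stand-in for the background-subtracted green-channel data:
# 100 spots (rows) by 20 arrays (columns). This is NOT the real data set.
set.seed(1)
green <- matrix(rlnorm(100 * 20, meanlog = 7, sdlog = 1), nrow = 100, ncol = 20)

# Step 1.3: log to base 2 (equivalent to LN(x) / LN(2) in Excel)
log_green <- log2(green)

# Steps 1.4 and 1.5: for each array (column), subtract its mean and
# divide by its sample standard deviation
centred <- scale(log_green, center = TRUE, scale = TRUE)

# Each column should now have mean 0 and standard deviation 1
col_means <- colMeans(centred)
col_sds <- apply(centred, 2, sd)
print(round(col_means, 10))
print(col_sds)
```

Like Excel's STDEV, `scale()` divides by the sample standard deviation (n - 1 denominator), so the two approaches should agree.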
1.6
We are now in a position to calculate the mean expression of the reference
sample for each gene. Into cell CD2, type the formula:
=AVERAGE(BJ2:CC2)
Copy this formula into the remainder of the column.
1.7
We calculate the deviates from the average and place them in columns CE
through CX. Into cell CE2, type the formula:
= BJ2-$CD2
Copy this formula into columns CE through CX and rows 2 to 6351.
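Steps 1.6 and 1.7 could look roughly like this in R. The matrix `centred` is again synthetic and only stands in for columns BJ to CC; the names are assumptions made for this sketch.

```r
# Synthetic stand-in for the centred log data (100 spots by 20 arrays)
set.seed(2)
centred <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)

# Step 1.6: mean expression of the reference sample for each gene (row)
gene_mean <- rowMeans(centred)

# Step 1.7: deviates from the per-gene mean; sweep() subtracts by default
deviates <- sweep(centred, 1, gene_mean)

# The deviates of each gene sum to zero, so each row mean should be ~0
print(range(rowMeans(deviates)))
```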
1.8
We check that the deviates have standard deviation independent of intensity
using MVA plots. Select columns CD and CE and produce a scatter plot. Is the
variability independent of intensity? Do the same using column CD and
different columns for the y-axis.
1.9
It is also useful to plot a histogram of the deviates, which is not so easy
(for me) in Excel. If you know how, load your data into R and plot the
histogram there.
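A minimal sketch of the histogram in R follows. It assumes the deviates are already available as a numeric matrix; here a synthetic matrix is used in place of the real deviates, which would first have to be exported from Excel.

```r
# Synthetic stand-in for the deviates (100 spots by 20 arrays)
set.seed(3)
deviates <- matrix(rnorm(100 * 20, sd = 0.3), nrow = 100, ncol = 20)

# Pool all deviates into one vector and plot their distribution
h <- hist(as.vector(deviates), breaks = 50,
          main = "Deviates from the per-gene mean",
          xlab = "Deviate (log2 units)")
```

A roughly symmetric, bell-shaped histogram would support the log-normal assumption used later to convert the standard deviation into a coefficient of variability.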
1.10
Calculate the standard deviation for all the deviates by typing into cell CY2:
=STDEV(CE2:CX6351)
If we assume that the data is log-normally distributed, this can be converted to
a coefficient of variability by typing into cell CY3:
=SQRT(EXP((CY2*LN(2))^2)-1)
What is the coefficient of variability?
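The same conversion can be sketched in R. The deviates below are synthetic, so the printed value says nothing about the real experiment; the point is only to show the formula at work.

```r
# Synthetic stand-in for the deviates in log base 2 units
set.seed(4)
deviates <- matrix(rnorm(100 * 20, sd = 0.3), nrow = 100, ncol = 20)

# Pooled standard deviation of all deviates (as STDEV over the whole block)
sd_log2 <- sd(as.vector(deviates))

# Convert to natural-log units, then apply the log-normal relation
# CV = sqrt(exp(sigma^2) - 1)
sd_ln <- sd_log2 * log(2)
cv <- sqrt(exp(sd_ln^2) - 1)
print(cv)
```

The factor `log(2)` (Excel's LN(2)) is needed because the log-normal formula is stated for natural logarithms, while the data are in log base 2.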
Exercise 2: Estimating the Population Variability in a
Microarray Experiment
In this exercise we will go through a similar process to estimate the coefficient of
variability of gene expression in the patient population itself. We use exactly the same
method as Exercise 1.
2.1
Using the same spreadsheet “perou.xls”, go to the worksheet called
“Population Variability”. The first column contains the spot number. The next
20 columns contain normalised (centered) log ratios of red/green for the 20
patients – we have already performed the normalisation.
2.2
Use column V to calculate the mean log ratio for each gene (as point 1.6 in
Exercise 1).
2.3
Use columns W through AP to calculate the deviates from the mean (as point
1.7).
2.4
Produce MVA plots of the deviates as a function of the mean. What do you
notice?
2.5
Calculate the standard deviation and coefficient of variability. What is the
coefficient of variability? How does this compare with the coefficient of
variability for the reference sample? What do you conclude?
Exercise 3: Estimating the Power of Paired and Unpaired
Microarray Experiments
In this exercise, we shall show how to use the power.t.test() function in R to calculate
the power or number of replicates for paired and unpaired microarray experiments.
We use R because it provides this function and is able to handle very
small significance levels.
3.1
Open “R” from Start -> Programs -> R -> R 1.5.0
3.2
Look at the help for the power.t.test function by typing:
help(power.t.test)
A new window should open with information about this function.
3.3
First, we convert the coefficient of variability into a standard deviation on
the log base 2 scale. Suppose an experiment has coefficient of variability 50%.
To get the standard deviation in log to base 2, type:
sqrt(log(0.5^2+1))/log(2)
What is the standard deviation for an experiment with 30% coefficient of
variability? The breast cancer experiment has a coefficient of variability of
45%: what is the standard deviation?
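If you need to repeat this conversion, it can be wrapped in a small helper. The function name `cv_to_sd_log2` is made up for this sketch and is not part of R.

```r
# Hypothetical helper: convert a coefficient of variability (as a fraction,
# e.g. 0.5 for 50%) into a standard deviation in log base 2 units,
# assuming log-normally distributed data
cv_to_sd_log2 <- function(cv) {
  sqrt(log(cv^2 + 1)) / log(2)
}

# The function is vectorised, so several values can be converted at once
sd_values <- cv_to_sd_log2(c(0.30, 0.45, 0.50))
print(sd_values)
```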
3.4
The fold-ratio we are trying to detect converts to a difference of means simply
by taking logs. The difference of means in log to base 2 corresponding to a
2-fold ratio is 1. The difference for a 3-fold ratio is given by:
log(3)/log(2)
What is the difference of means corresponding to a 1.5-fold difference?
3.5
We start by calculating the power of the breast cancer experiment. We
calculated the standard deviation in 3.3. Suppose we want to detect 2-fold
regulated genes; the difference in mean is 1. The number of replicates is 20.
There are 6350 genes in the analysis. Suppose we are willing to accept about
one false positive. The significance level we need to use is then 1/6350.
Calculate the power by typing in:
power.t.test(sd=0.62,delta=1,n=20,sig.level=1/6350,
type="one.sample",alternative="two.sided")
The type specifies that this is a one-sample test rather than a two-sample test.
The alternative of “two.sided” means that we are testing for both up-regulated
and down-regulated genes. What is the power? Do you think this is good?
3.6
What is the power for detecting 1.5-fold and 3-fold regulated genes? What do
you notice?
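Rather than retyping the call for each fold change, one option is to loop over them. This is a sketch only; `cv_to_sd_log2` is a helper name made up for this example, and the other settings are those of the breast cancer experiment in 3.5.

```r
# Hypothetical helper: coefficient of variability -> sd in log base 2 units
cv_to_sd_log2 <- function(cv) sqrt(log(cv^2 + 1)) / log(2)

folds <- c(1.5, 2, 3)

# Power of the one-sample test for each fold change, with n = 20 arrays,
# 45% coefficient of variability and a significance level of 1/6350
power_for_fold <- sapply(folds, function(fold) {
  power.t.test(n = 20, delta = log2(fold), sd = cv_to_sd_log2(0.45),
               sig.level = 1 / 6350, type = "one.sample",
               alternative = "two.sided")$power
})
names(power_for_fold) <- paste0(folds, "-fold")
print(power_for_fold)
```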
3.7
What happens to the power if you change the coefficient of variability to 30%
or 50%?
3.8
What happens to the power if you change the type of test to “two.sample”?
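The comparisons in 3.7 and 3.8 can be scripted in the same spirit. The sketch below assumes a 2-fold change, 20 replicates (per group in the two-sample case) and a significance level of 1/6350; the helper names are made up for this example.

```r
# Hypothetical helper: coefficient of variability -> sd in log base 2 units
cv_to_sd_log2 <- function(cv) sqrt(log(cv^2 + 1)) / log(2)

# Hypothetical wrapper around power.t.test() that returns only the power
power_for <- function(cv, type) {
  power.t.test(n = 20, delta = 1, sd = cv_to_sd_log2(cv),
               sig.level = 1 / 6350, type = type,
               alternative = "two.sided")$power
}

# 3.7: vary the coefficient of variability for the one-sample test
cvs <- c(0.30, 0.45, 0.50)
power_by_cv <- sapply(cvs, power_for, type = "one.sample")
names(power_by_cv) <- paste0(cvs * 100, "%")
print(power_by_cv)

# 3.8: same settings, one-sample versus two-sample
power_one <- power_for(0.45, "one.sample")
power_two <- power_for(0.45, "two.sample")
print(c(one.sample = power_one, two.sample = power_two))
```

Note that for `type = "two.sample"`, `n` is the number of observations in each group, so the two-sample design uses twice as many arrays in total.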
Exercise 4: Estimating the Number of Replicates for a
Microarray Experiment
In this exercise, we use the same R formula as the previous exercise to estimate the
number of replicates needed in microarray experiments.
4.1
Now we shall use the formula to estimate the number of replicates for an
experiment to investigate the toxic effects of Benzo(a)pyrene on rats. There
are two groups of rats, treated either with BP, or with a control substance. Is
this a one-sample or two sample experiment?
4.2
The coefficient of variability is 40%. What is the standard deviation?
4.3
Suppose there are 10,000 genes in the experiment and we are happy with
approximately 1 false positive. What is the significance level?
4.4
We are interested in finding 2-fold differentially expressed genes with 90%
power. Type the formula:
power.t.test(sd=0.56,delta=1,sig.level=0.0001,
power=0.9,type="two.sample",alternative="two.sided")
How many rats do you need in each group?
4.5
How many rats would you need in each group for 80% power or 95% power?
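To compare several target powers at once, the call from 4.4 can be wrapped in a loop. This is a sketch using the same assumed values (sd = 0.56, 2-fold difference, significance level 1/10000); the variable names are made up for this example.

```r
powers <- c(0.80, 0.90, 0.95)

# Solve for n (per group) at each target power; leaving n unspecified
# tells power.t.test() to compute it
n_per_group <- sapply(powers, function(p) {
  power.t.test(delta = 1, sd = 0.56, sig.level = 1 / 10000, power = p,
               type = "two.sample", alternative = "two.sided")$n
})
names(n_per_group) <- paste0(powers * 100, "% power")

# n is returned as a fraction; round up to whole animals
print(ceiling(n_per_group))
```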
4.6
What is the power if there are only 5000 genes on the array and you are happy
with 1 false positive?