Microarray data analysis
Pre-processing of Affymetrix data using R Bioconductor

1 Introduction

Producing gene expression data using Affymetrix GeneChip probe arrays is, although to a large extent automated and standardized, an elaborate procedure, and the quality of the data is therefore susceptible to a number of factors. The Expression Analysis Technical Manual, available from the Affymetrix web site at http://www.affymetrix.com, describes a number of steps and measures that ensure satisfactory quality throughout sample preparation. To monitor the quality of sample hybridisation to the arrays, and of the subsequent scanning and generation of signal intensities, Affymetrix further provides a number of metrics describing the quality of external and internal controls, as well as a variety of parameters related to the performance of the experiment instrumentation.

This exercise will familiarize you with various Bioconductor packages that can be used to analyse and compare these parameters across the samples included in an experiment, detailing how to identify sample, array and hybridisation anomalies and outlier samples. The latter part of the exercise will familiarize you with the most commonly used techniques for removing systematic bias from arrays, so that arrays can be compared without bias.

1.1 Example data

The data set consists of a total of 6 Affymetrix HGU-133A probe arrays, each containing 22 283 probe sets. Three of the arrays were hybridized with RNA from cases with a certain defect in the central nervous system, while the remaining three were hybridized with RNA from age- and gender-matched healthy controls. After isolating the RNA for the disease samples it was noted that for one of the samples the amount of RNA was not sufficient for hybridizing according to the standard Affymetrix protocol. Hence, it was decided to run the sample through a round of amplification using the Affymetrix small sample protocol.
To control for possible 3'/5' signal bias, one of the control samples was also subjected to the same procedure. The hybridized probe arrays were scanned using default settings, and the resulting images were segmented and quantified into probe-level data with the Affymetrix MAS 5 algorithm. The values were output into cell intensity (CEL) files, for which the relevant information is given in the table below:

Output file          Disease status   Protocol
sick_1.cel           sick             standard
sick_2.cel           sick             standard
sick_3_amp.cel       sick             small sample
control_1.cel        healthy          standard
control_1.cel        healthy          standard
control_1_amp.cel    healthy          small sample

1.2 Bioconductor packages

Numerous packages collected in the Bioconductor project allow manipulation and analysis of data collected using Affymetrix probe arrays (have a look at www.bioconductor.org for an extensive list). For pre-processing purposes the most intuitive and commonly used ones are the affy, affyQCReport and simpleaffy packages (note that R is case sensitive, so double-check your spelling!).

2 Exercise - importing and handling data

2.1 Loading data

In order to load the data contained in the CEL files into the working memory of R, one first needs to load the affyQCReport, simpleaffy and affy packages into memory, either with the library() command or with the [Load package(s)...] item under the [Packages] menu. If the packages are not stored locally (on the hard drive of the computer where the R software is installed) they first need to be installed from one of the remote servers that host the Bioconductor repositories. This is done following these steps:

1. Open the [Packages->Set CRAN mirror...] menu and select a download site located close to Finland.
2. Open the [Packages->Select repositories...]
menu and select Bioconductor.
3. Open the [Packages->Install package(s)...] menu and select the affyQCReport package from the pull-down menu.

You will notice that affy, simpleaffy and a number of other packages are installed and loaded automatically as well, since some of their functions are used by the functions of the affyQCReport package.

Now that the packages are loaded into memory it is time to create a variable called dataset that collects the data from all the CEL files, using the ReadAffy() function, and to add the sample annotation information to it. The dataset variable will be of the class AffyBatch, for which many functions have special features, as we will see later. ReadAffy() without parameters reads all CEL files in the working directory. The experimental design can be put in a csv file and loaded with read.AnnotatedDataFrame(). Use the phenoData() function to add it to dataset.

2.2 Obtaining summary information about a data set

Once a data set has been created, a summary of the data set itself and of the associated sample information can conveniently be displayed by simply typing their names. To get an output similar to the summary table given in the introduction above, apply the pData() function to the data set.

2.3 Modifying sample names

The names of the samples can be accessed, or changed, using the sampleNames() function. For easier display in plots it can be convenient to remove the .CEL extension from the names. This can be done with nchar(), which returns the length of a character string, and substr(), which extracts a substring from it. Afterwards, check that the names were changed correctly.

3 Exercise - quality control

3.1 Assessing hybridisation artefacts by visual inspection

The first and perhaps most obvious, but quite often overlooked, thing to check when assessing the quality of the acquired data is to look at the images of the scanned arrays.
With simple visual inspection it is easy to pick up hybridisation artefacts arising from scratches, air bubbles, strands of hair or problems with staining, mixing or washing. However, this can of course be problematic if the actual image files, or .DAT files as Affymetrix calls them, are not available. Luckily, the affy package provides functionality for reconstructing the image of the scanned array from the CEL files. Using the image() function on an AffyBatch object will display a pseudo-image of the intensities of all features on each array, arranged as they are physically laid out on the array. Since the distribution of intensities is usually highly skewed, with a very long right tail, the pseudo-images tend to become rather dim, making them harder to assess for artefacts. To alleviate the problem it is possible to give the function an argument that sets the intensity scale to logarithmic mode. Display the pseudo-images of all arrays in one single image (figure 3).

Figure 3: Pseudo-images of the 6 arrays of the example data.

Can you spot any significant anomalies for any of the samples? Can you spot any systematic differences between sample types?

3.2 Assessing the quality of arrays using Affymetrix quality controls

When pre-processing the data using the Affymetrix MAS 5.0 or GCOS software packages, a number of quantities that can be used for assessing the quality of arrays are calculated and reported. In the publication GeneChip Expression Arrays: Data Analysis Fundamentals, available from http://www.affymetrix.com, Affymetrix gives guidelines on how to interpret these quantities and what cut-off values to use when determining whether an array is of acceptable quality or not.
When importing raw data from CEL files into R no control quantities are calculated, but they can be obtained using functions implemented in the simpleaffy package. Specifically, the functions will calculate average background, scale factor, percent present and 3'/5' ratios of housekeeping controls, and also report the expression values for the spike-in control probes and cross-species control probes. The quantities can be calculated and stored in a QCStats object using the qc() function. The functions that extract the individual statistics are listed in the examples section of the help page of qc().

Average background

According to Affymetrix, and assuming that the arrays were scanned with the PMT setting at 10%, the average background values should normally fall between 20 and 100. So, there are no apparent problems for any of the samples, not even the amplified ones.

Scale factors

Since the scale factors depend not only on sample quality but also on what target intensity is selected for scaling (note that the qc() function applies a target intensity of 100 by default), Affymetrix does not recommend any absolute threshold for determining whether an array is of poor quality. Rather, they suggest that the factors should be similar among samples and not vary more than about 2 to 3-fold from each other. The samples handled with the amplification protocol do not pass this criterion. In practice, this means that it would not be a good idea to include the samples handled with the standard protocol and the ones handled with the amplification protocol in the same analysis, due to the apparent systematic bias introduced during the amplification process.

Percent present

This kind of bias often shows up in the percent present values as well. In this case, however, it appears that the overall lower signal intensities observed for the arrays with amplified samples have not resulted in a noticeably lower proportion of probes scored as present.
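The 2 to 3-fold rule for scale factors is easy to check numerically. The following is a minimal sketch in base R. The scale factors are made-up illustration values, not results from the example data, and the variable names are hypothetical. With real data the vector would be taken from the QCStats object (assuming simpleaffy's sfs() accessor, as listed in the help of qc).

```r
# Hypothetical scale factors, for illustration only (NOT from the example data).
# With real data one would use something like: scale_factors <- sfs(qc_data)
scale_factors <- c(array_A = 0.9, array_B = 1.1, array_C = 1.0, array_D = 3.6)

# Largest fold difference between any two arrays
fold_range <- max(scale_factors) / min(scale_factors)

# Affymetrix guideline quoted in the text: factors should stay within about 3-fold
within_3_fold <- fold_range <= 3

# Fold difference of each array relative to the median array,
# which helps to see which arrays cause a failure
fold_vs_median <- scale_factors / median(scale_factors)

print(fold_range)
print(within_3_fold)
print(round(fold_vs_median, 2))
```

Comparing each array to the median, rather than only reporting the overall range, makes it easier to tell a single outlying array from a systematic split between two groups of arrays.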
Note that the percent present scores vary considerably with tissue type and with the experimental condition under study, and consequently no absolute quality cut-offs can be recommended.

As mentioned before, a number of control oligonucleotides are added prior to loading the samples onto the arrays. Four of these are spiked into the sample mixture at different concentrations, thus allowing for later monitoring of the hybridisation performance. The concentrations range from the detection limit of approximately 3 copies of mRNA per cell (1.5 pM) for bioB up to roughly 200 copies per cell (100 pM) for cre. With the exception of the earliest generation of Affymetrix arrays, even the spike-in with the lowest concentration should always be called present by the MAS 5.0 algorithm. The bioBCalls slot of a QCStats object, containing information regarding the bioB detection calls, can be viewed by:

    qc_data@bioBCalls

Evidently, all arrays were hybridised and scanned without any mishaps. Would you have expected the amplified samples to perform worse in this regard?

It is also useful to closely examine the actual expression values for the spike-in controls, which can be obtained by applying the spikeInProbes() function to the QCStats object. Notice that in addition to being scaled to achieve a target intensity of 100, the values have also been transformed using the base 2 logarithm. What is important to examine is that, for each array, the values increase roughly linearly with the logarithm of the concentration. As neither the affy, affyQCReport nor simpleaffy packages include functions for checking this, some longer code is required to do this.
The following lines of code will create a scatterplot (figure 4) of the values and superimpose a line of best fit for each array:

    # set up and calculate x and y values
    # spike-in concentrations in pM (bioB, bioC, bioD, cre), natural log scale
    concentration <- log(c(1.5, 5, 25, 100))
    # one row per array, one column per spike-in
    x_values <- array(concentration, c(4, length(dataset)))
    x_values <- t(x_values)
    y_values <- spikeInProbes(qc_data)

    # plot the values in a scatterplot of y as a function of x
    plot(x_values, y_values, col=1:6, main="Spike-in performance",
         xlab="log(concentration in pM)", ylab="log2(expression)")

    # add legend box with sample names
    legend(legend=sampleNames(dataset), x=3.7, y=7, lty=1, col=1:6, cex=0.75)

    # add lines of best fit, one per array
    for (loop_count in 1:length(dataset)) {
      y_fit <- spikeInProbes(qc_data)[loop_count, ]
      lm_spike <- lm(y_fit ~ concentration)
      slope <- coef(lm_spike)[2]
      intercept <- coef(lm_spike)[1]
      abline(intercept, slope, col=loop_count)
    }

Interpret the results. Do the results show what you expected?

Figure 4: Plot of the expression levels observed for the spike-in control probes as a function of the spike-in concentration. Lines of best fit are superimposed.

3.3 Assessing 3' to 5' bias using data for control probes or whole array

The final quality control metric to examine monitors possible problems with RNA degradation or inefficient in vitro transcription steps during sample preparation. It is calculated by taking the ratio between the 3' and the 5' located probe sets targeting the housekeeping genes GAPDH and β-actin. Since the values stored in the QCStats object have been transformed using the base 2 logarithm, they have to be transformed back to linear scale before calculation of the ratio.
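The back-transformation can be written in two equivalent ways, because a ratio on the linear scale corresponds to a difference on the log2 scale: 2^a / 2^b = 2^(a - b). A small base R sketch with made-up log2 values (hypothetical numbers and names, not taken from the example data):

```r
# Hypothetical log2 expression values for a 3' and a 5' probe set on two arrays
# (illustration only, NOT from the example data)
log2_3prime <- c(array_1 = 12.0, array_2 = 11.5)
log2_5prime <- c(array_1 = 11.8, array_2 = 9.5)

# Back-transform each value to linear scale, then take the ratio
ratio_direct <- 2^log2_3prime / 2^log2_5prime

# Equivalent: subtract on the log2 scale, then back-transform once
ratio_via_difference <- 2^(log2_3prime - log2_5prime)

print(ratio_direct)
print(ratio_via_difference)
```

A difference of 2 on the log2 scale, as for array_2 here, therefore corresponds to a 4-fold 3'/5' ratio on the linear scale.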
The 3'/5' ratios for the GAPDH gene can be displayed with the following code:

    (2^qc_data@qc.probes[,4]) / (2^qc_data@qc.probes[,6])

The corresponding values for the β-actin gene:

    (2^qc_data@qc.probes[,1]) / (2^qc_data@qc.probes[,3])

Affymetrix suggests different cut-off thresholds for GAPDH and β-actin, since the latter is considerably longer than the former and hence has a greater distance between the probes located at the 3' and 5' ends. Ratios for GAPDH are usually very close to 1, and even a deviation as small as 1.25 should be considered suspect. The corresponding value for β-actin is 3.

What do the results show and what could be the reason for this? How would you proceed in the downstream analysis?

Assessing RNA quality based on the performance of only a couple of probe sets may seem precarious, especially considering how small a proportion of the total isolated RNA is monitored through this somewhat blunt tool. A more robust way to evaluate RNA degradation problems is offered through a couple of functions in the affy package. The idea is to order the probes within each probe set according to the physical position at which they recognize the target gene, and then average the values for each position across all probe sets. Averaging over this many probe sets will efficiently cancel out any probe-specific effects and should bring out even slight trends of decreasing intensities towards the more 5' located probes. Use AffyRNAdeg() and summaryAffyRNAdeg() to evaluate the extent of such trends mathematically through linear regression.

Are these results consistent with the previous findings?

It is also possible to display the data graphically using plotAffyRNAdeg(), giving a more intuitive way of interpreting the results (figure 5).

Figure 5: Averaged and log2-transformed probe intensities as a function of the 5' to 3' probe position.
Note that the individual plots are slightly shifted by vertical staggering for a clearer view.

It is interesting to note that the slope values (and the graph) indicate some degree of 3' to 5' bias also for the arrays hybridised with non-amplified samples prepared with the standard Affymetrix protocol. This is not unique to this specific dataset; indeed, even arrays of the highest quality will display this behaviour. Depending on what array type has been used, the 3'/5' trend will be more or less pronounced due to differences in the physical spacing between the individual probes in a probe set. Early Affymetrix designs with 20 or 16 probe pairs per probe set have a shorter inter-probe distance and hence less 3'/5' bias, while later designs with 11 probe pairs per probe set and larger inter-probe spacing are associated with increased bias levels. As stated before, the absolute values do not per se constitute a meaningful threshold for accepting or discarding arrays from further analysis. Rather, what is important is to have agreement between arrays.

4 Exercise - Identifying and removing systematic bias (normalization)

4.1 Checking the distribution of data

Systematic bias, as well as other anomalies in the distribution of expression values, is most easily identified by plotting the data in histograms or boxplots. In the Basic statistics using R exercise the hist() and boxplot() functions were used to accomplish this for simple data sets. The same commands can also be applied without modification to more complex data sets containing data for many different samples. There is hence no need to extract the data for the individual samples before applying the functions.
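A minimal sketch of how hist() and boxplot() can be applied to a whole AffyBatch at once is given below. It assumes that the affy package is installed and that the CEL files of the exercise are in the working directory. The helper name plot_intensity_distributions is hypothetical, and the colour and layout choices are only suggestions.

```r
library(affy)  # provides ReadAffy(), list.celfiles() and AffyBatch plot methods

# Hypothetical helper: draws boxplots and smoothed histograms side by side
plot_intensity_distributions <- function(batch) {
  n_arrays <- length(batch)            # number of arrays in the AffyBatch
  old_par <- par(mfrow = c(1, 2))      # two panels in one figure
  on.exit(par(old_par))                # restore the layout afterwards
  boxplot(batch, col = seq_len(n_arrays), las = 2)   # log2 intensities per array
  hist(batch, col = seq_len(n_arrays), lty = 1)      # one density curve per array
  invisible(NULL)
}

# Only run the example if CEL files are actually present
if (length(list.celfiles()) > 0) {
  dataset <- ReadAffy()                # reads all CEL files in the working directory
  plot_intensity_distributions(dataset)
}
```

Both calls take the AffyBatch object directly, so no per-sample extraction is needed.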
Apply boxplot() and hist() directly to dataset, which should yield the plot in figure 6.

Figure 6: The distribution of the log2-transformed intensities viewed as boxplots and smoothed histograms.

What conclusions can you make from the resulting plots?

Non-linear bias is most effectively identified in an MA-plot, which depicts the log-transformed ratio between two samples as a function of the log-transformed average of the samples. This generates a scatter plot where genes with equal expression levels in the compared samples fall along a horizontal line with a y-intercept at 0. Systematic deviations are easily picked up as vertical shifts in the whole distribution of data points, while non-linear or intensity-dependent effects are clearly spotted as curvatures in the data cloud. The affy package implementation of the MA-plot uses base 2 for the log-transformation and compares all samples to an artificial reference sample, which is created from the median of all samples. The function is called MAplot(), and by default it creates as many scatter plots as there are arrays. Try running MAplot() on dataset (and be patient, because it is very computing intensive and may take a while), which should result in the plots shown in figure 7.

Figure 7: MA-plots for each of the six arrays compared to the calculated median of all arrays. Lines of perfect correlation (blue) and LOESS (red) are overlaid.

Can you see any systematic and/or non-linear effects for any of the samples?

4.2 Normalizing data from Affymetrix arrays

The most common ways to normalize Affymetrix data are simple linear scaling, as implemented in Affymetrix's own MAS 5 algorithm, and the more complex RMA or GC-RMA methods developed by Terence Speed et al.
The affy package offers an extensive array (no joke intended) of methods to perform the different steps (background correction, quantification and summarisation) that make up these normalization procedures. There are also wrapper functions that combine a specific sequence of normalization steps into one single, and often more computationally efficient, step. For a more detailed description of what these steps are and what alternatives are associated with each of them, have a look at the Affy and BuiltinMethods vignettes.

Let's first have a look at the options provided for the individual normalization steps, which should be stated as arguments to the expresso() function. To list the available options simply type the following:

    bgcorrect.methods()
    [1] "mas"  "none" "rma"  "rma2"

    pmcorrect.methods()
    [1] "mas"        "pmonly"     "subtractmm"

    normalize.methods(dataset)
    [1] "constant"         "contrasts"        "invariantset"     "loess"
    [5] "qspline"          "quantiles"        "quantiles.robust"

    express.summary.stat.methods()
    [1] "avgdiff"      "liwong"       "mas"          "medianpolish" "playerout"

Note that not all combinations of background correction, pm-probe correction, normalization and summary methods are possible. More specifically, rma background correction can only be used in conjunction with pmonly pm-probe correction. Moreover, since mas and medianpolish summarization apply a base 2 logarithm transformation to the data, a pm-probe correction method like subtractmm, which may result in negative values, is not advised.

Calling

    rma_data <- expresso(dataset, bgcorrect.method="rma",
                         normalize.method="quantiles",
                         pmcorrect.method="pmonly",
                         summary.method="medianpolish")

is in fact equivalent to running the wrapper function rma() like this:

    rma_data <- rma(dataset)

The latter version is preferable because it runs considerably faster, as it is implemented in C rather than in R code.
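To make two of the normalization options listed above more concrete, the following base R sketch applies a simplified "constant" scaling and a simplified "quantiles" normalization to a small toy matrix. This is only an illustration of the ideas, under stated assumptions: the functions constant_normalize and quantile_normalize are hypothetical helpers written for this sketch, not the implementations used in the affy package, and the toy data are simulated.

```r
# Simulated toy intensities: 8 probes (rows) on 3 arrays (columns),
# with array b brighter and array c dimmer than array a
set.seed(42)
toy <- cbind(a = rlnorm(8, 6, 1),
             b = 2.5 * rlnorm(8, 6, 1),
             c = 0.4 * rlnorm(8, 6, 1))

# Simplified constant scaling: multiply each array by one factor so that
# its mean intensity equals the mean of a reference array
constant_normalize <- function(x, ref = 1) {
  target <- mean(x[, ref])
  sweep(x, 2, target / colMeans(x), `*`)
}

# Simplified quantile normalization: give every array the same empirical
# distribution, namely the mean of the sorted columns
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank of each value per array
  sorted <- apply(x, 2, sort)                        # each array sorted ascending
  target <- rowMeans(sorted)                         # shared target distribution
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

const_norm <- constant_normalize(toy)
quant_norm <- quantile_normalize(toy)

print(colMeans(const_norm))        # identical means, shapes may still differ
print(apply(quant_norm, 2, sort))  # identical sorted values in every column
```

Constant scaling only aligns one summary value per array, whereas quantile normalization forces the whole distribution to agree. That difference is what the boxplots of the two methods are meant to reveal.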
Normalization according to the Affymetrix MAS5 algorithm can likewise be specified either as:

    mas5_data <- expresso(dataset, bgcorrect.method="mas",
                          normalize.method="constant",
                          pmcorrect.method="mas",
                          summary.method="mas")

or using the quicker wrapper function mas5() like this:

    mas5_data <- mas5(dataset)

Let's have a closer look at the normalization steps of MAS5 and RMA to see if we can notice any differences in how they treat the data and how efficient they are. Type the following to do the normalization, and make the plots shown in figures 8-9:

    mas5_norm <- normalize(dataset, method="constant")
    rma_norm <- normalize(dataset, method="quantiles")

Figure 8: Boxplots of data normalized using the constant and quantile normalizations employed in the MAS5 and RMA algorithms, respectively.

Recently, a more sophisticated method for correcting the pm-probes, which accounts for non-specific binding in a sequence-specific manner, was introduced in the GC-RMA procedure. This is supposed to result in decreased signal variance on a par with RMA, but with accuracy close to what MAS5 gives. The normalization and summarization steps are identical to the ones implemented in RMA, so there is no point in comparing those. Instead, let's apply all steps of the normalization procedures employed in MAS5, RMA and GC-RMA to the whole dataset and compare the MA-plots for the worst of the samples, i.e. the array with amplified RNA from one of the sick individuals, against one of the other sick samples. Of course, first we need to install and load the gcrma package, but you should know how to do that by now. Make data.frames from the expression data (function exprs()) for the differently normalised data sets, and use mva.pairs() to draw MA graphs for sick2 vs sick3 (figure 10).

So, which normalization procedure would you choose for the data and why?
What would you do with the data from the arrays with amplified RNA?

Figure 10: MA-plots for data normalized with MAS5, RMA and GC-RMA.
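A comparison like the one in figure 10 could be produced roughly along the following lines. This is a sketch under several assumptions: the affy and gcrma packages (and the annotation packages they need for this array type) are installed, the CEL files are in the working directory, and the sample names still start with sick_2 and sick_3 as in the table in section 1.1. The helper name compare_normalizations is hypothetical. Note that mas5() returns values on the linear scale while rma() and gcrma() return log2 values, which is why log.it differs between the calls.

```r
library(affy)   # mas5(), rma(), mva.pairs(), ReadAffy(), list.celfiles()
library(gcrma)  # gcrma()

# Hypothetical helper: one MA-plot per normalization method for two arrays.
# 'pair' is a vector with the column indices (or names) of the two arrays.
compare_normalizations <- function(batch, pair) {
  mas5_expr  <- exprs(mas5(batch))[, pair]   # linear scale
  rma_expr   <- exprs(rma(batch))[, pair]    # already log2
  gcrma_expr <- exprs(gcrma(batch))[, pair]  # already log2

  # Each call draws a new plot, so step through them or save them to files
  mva.pairs(mas5_expr,  log.it = TRUE,  main = "MAS5")
  mva.pairs(rma_expr,   log.it = FALSE, main = "RMA")
  mva.pairs(gcrma_expr, log.it = FALSE, main = "GC-RMA")
  invisible(NULL)
}

# Only run the example if CEL files are actually present
if (length(list.celfiles()) > 0) {
  dataset <- ReadAffy()
  # Assumption: file names as in the table of section 1.1
  pair <- grep("^sick_[23]", sampleNames(dataset), ignore.case = TRUE)
  compare_normalizations(dataset, pair)
}
```

Selecting the two arrays by name pattern rather than by fixed column numbers keeps the sketch independent of the order in which the CEL files were read.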