Computation of discriminant values for a new sample

The new microarray data must be normalised using the RMA method, which is available in the affy package for the R software (R Development Core Team, 2011). To install R, download the installer for your operating system (Windows, Mac or Linux). We consider here the installation on a Windows computer. Go to the website http://cran.r-project.org/index.html, click on Windows and then on base. There you will find a link to download the latest stable version of R. Once downloaded, run the installer and follow the instructions. Starting R from the Start menu opens the R console.

You can install the affy package by copying this code into the console:

    source("http://bioconductor.org/biocLite.R")
    biocLite("affy")

Set the working directory to the folder where the .cel files are. This can be done by using the “Change dir...” option from the “File” menu. Load the data and normalise it as follows:

    library(affy)
    Data<-ReadAffy()
    DataRMA<-rma(Data)

This may take a long time, depending on the number of microarrays to be normalised: around 30 minutes for 218 cases on a cluster like Fornax (SGI Altix 350 remote cluster composed of 30 64-bit Intel Itanium processors running at 1.5 GHz with 64 GB of shared RAM, Suse SLES9/SGI ProPack 4; for more details visit the page of the cluster, http://cibercluster.upf.edu/EN/Pages/que_es.aspx). On a common desktop computer, the process is likely to stall.

The object “DataRMA” contains the result of the normalisation. The matrix with expression values is extracted as follows:

    DataRMAmat<-exprs(DataRMA)

This matrix contains cases in columns and probesets in rows.
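The layout of the expression matrix (probesets in rows, cases in columns) can be checked before going further. The following is only a sketch on a toy matrix standing in for DataRMAmat; the probeset IDs, file names and values are hypothetical and serve just to show the orientation.

```r
# Toy stand-in for DataRMAmat (hypothetical probeset IDs, file names and values).
toy <- matrix(
  c(9.95, 8.10,
    7.64, 8.30,
    11.45, 9.02),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("probesetA", "probesetB", "probesetC"),  # rows: probesets
    c("case1.CEL", "case2.CEL")                # columns: cases
  )
)

n_probesets <- nrow(toy)  # number of probesets
n_cases <- ncol(toy)      # number of cases (one per .cel file)

# All expression values of one case sit in a single column.
case1 <- toy[, "case1.CEL"]

# Transpose when a cases-in-rows layout is needed, e.g. for dist().
toy_by_case <- t(toy)
```

With the real data, the same checks would apply to DataRMAmat, whose column names are expected to correspond to the .cel files read from the working directory.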
The matrix with the RMA-normalised values of the probesets of interest can be created as follows:

    Names<-c("probeset1","probeset2","probeset3","probeset4")
    DataRMAmatProb<-DataRMAmat[c(na.omit(match(Names,rownames(DataRMAmat)))),]

The values of the probesets of interest are used for the computation of a discriminant formula. We consider the probesets from our local signature. First, we standardise the expression values of our local dataset. For this we need the mean (M(COL1A,POSTN,NNMT,DCN)Lee) and standard deviation (SD(COL1A,POSTN,NNMT,DCN)Lee) from Lee’s dataset:

    M(COL1A,POSTN,NNMT,DCN)Lee = (8.58, 8.17, 8.53, 10.08)
    SD(COL1A,POSTN,NNMT,DCN)Lee = (2.06, 2.60, 2.04, 1.29)

If we select the first case of our local dataset:

    Local1(COL1A,POSTN,NNMT,DCN) = (9.95, 7.64, 11.45, 8.90)

the standardised values are obtained by subtracting the mean and dividing by the standard deviation:

    Local1(COL1A,POSTN,NNMT,DCN)Std = ((9.95-8.58)/2.06, (7.64-8.17)/2.60, (11.45-8.53)/2.04, (8.90-10.08)/1.29) = (0.67, -0.20, 1.43, -0.91)

We repeat this for all cases in our local dataset and perform a hierarchical clustering to assign each case to a group. We compute the discriminant function as described in section “Materials and Methods”, which provides the discriminant coefficients reported in section “Results”. To obtain the discriminant score for one case from Lee’s dataset, we must standardise its expression values as above, multiply each standardised value by the corresponding discriminant coefficient and sum the products:

    DSC(COL1A,POSTN,NNMT,DCN)LeeStd1 = (-0.181)*(-1.55) + 1.421*(-1.01) + (-0.146)*(-1.13) + 0.600*(-0.89) = -1.52

As cases with negative discriminant scores are classified as GLE, this example case would be classified as GLE. The same procedure should be repeated for all cases from Lee’s dataset. In doing so, we can assign a group to each case using the formula above.
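The standardisation and the discriminant score above can be reproduced in R. This is a minimal sketch that uses only the numbers quoted in the text; the helper discriminant_score() and the label "not GLE" for non-negative scores are hypothetical names introduced for illustration.

```r
genes <- c("COL1A", "POSTN", "NNMT", "DCN")

# Reference mean and standard deviation from Lee's dataset (values quoted above).
m_lee  <- setNames(c(8.58, 8.17, 8.53, 10.08), genes)
sd_lee <- setNames(c(2.06, 2.60, 2.04, 1.29), genes)

# First case of the local dataset, standardised element by element.
local1 <- setNames(c(9.95, 7.64, 11.45, 8.90), genes)
local1_std <- (local1 - m_lee) / sd_lee
round(local1_std, 2)  # 0.67 -0.20 1.43 -0.91, as in the text

# Discriminant coefficients and the standardised Lee case used in the example.
coefs    <- setNames(c(-0.181, 1.421, -0.146, 0.600), genes)
lee1_std <- setNames(c(-1.55, -1.01, -1.13, -0.89), genes)

# Hypothetical helper: sum of coefficient * standardised value.
discriminant_score <- function(z, coefs) sum(coefs * z)

score <- discriminant_score(lee1_std, coefs)
round(score, 2)  # -1.52

# Negative scores are classified as GLE; the other label is a placeholder.
group <- if (score < 0) "GLE" else "not GLE"
```

Applying discriminant_score() to every standardised case gives one score, and hence one group, per case.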
As we set the a priori group of each case by hierarchical clustering of the standardised cases, the error of the formula above is computed by comparison with the classification provided by the hierarchical clustering.

To use the discriminant coefficients of the other signatures, both datasets must be standardised using the respective mean and standard deviation values:

    M(TIMP3,Hs.301281,FAM64A,ECT2)Lee = (8.52, 9.05, 7.29, 5.39)
    SD(TIMP3,Hs.301281,FAM64A,ECT2)Lee = (0.92, 1.79, 0.93, 0.93)
    M(CHI3L1,LDHA,LGALS1,IGFBP3)Lee = (11.17, 11.69, 11.37, 8.66)
    SD(CHI3L1,LDHA,LGALS1,IGFBP3)Lee = (2.13, 0.79, 1.00, 1.55)

For a sample from a new dataset, the mean and standard deviation values must be computed for its standardisation. This was performed for Gravendeel’s datasets.
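For a new dataset, the standardisation can be sketched as follows, assuming the mean and standard deviation are computed per probeset across the cases of that dataset and that the sample standard deviation (n-1 denominator, as returned by sd() in R) is intended. The matrix values and case names are hypothetical; only the probeset labels come from the text.

```r
# Hypothetical RMA-normalised values: probesets in rows, cases in columns.
new_mat <- matrix(
  c(8.1, 9.4, 7.7, 10.2,
    9.0, 8.8, 9.6, 8.2,
    7.3, 6.9, 7.8, 7.2,
    5.1, 5.9, 4.8, 5.6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("TIMP3", "Hs.301281", "FAM64A", "ECT2"),
                  paste0("case", 1:4))
)

# Mean and standard deviation of each probeset across the cases.
m_new  <- rowMeans(new_mat)
sd_new <- apply(new_mat, 1, sd)

# A vector of length nrow() is recycled down the columns, so every value
# is standardised with the mean and SD of its own probeset (row).
new_std <- (new_mat - m_new) / sd_new
```

Each column of new_std can then be multiplied by the discriminant coefficients of the corresponding signature and summed, in the same way as for the worked example with Lee’s dataset.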