Computation of discriminant values for a new sample

The new microarray data must be normalised using the RMA method, which is
available in the affy package for the R software (R Development Core Team, 2011).
To install R, download the installer for your operating system
(Windows, Mac or Linux). We consider here the installation on a Windows computer.
Go to the website http://cran.r-project.org/index.html, click on Windows and then on
base. There you will find a link to download the latest stable version of R. Once
downloaded, run the installer and follow the instructions. If you run R from the Start
menu, the R console opens.
You can install the affy package by copying this code into the console:
source("http://bioconductor.org/biocLite.R")
biocLite("affy")
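On recent Bioconductor releases the biocLite script has been superseded by the BiocManager package, so the two lines above may no longer work. A minimal sketch of the equivalent installation follows; the helper name `install_affy` is hypothetical and only wraps the two install calls so that nothing is downloaded until it is called.

```r
# Hypothetical helper: installs BiocManager (from CRAN) and affy
# (from Bioconductor) only if they are not already available.
install_affy <- function() {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  if (!requireNamespace("affy", quietly = TRUE)) {
    BiocManager::install("affy")
  }
  # TRUE if affy can now be loaded, FALSE otherwise
  invisible(requireNamespace("affy", quietly = TRUE))
}

# Usage (requires an internet connection):
# install_affy()
```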
Set the working directory to the folder where the .cel files are. This can be done by
using the “Change dir...” option from the “File” menu. Load data and normalise it as
follows:
library(affy)
Data<-ReadAffy()
DataRMA<-rma(Data)
This may take a long time, depending on the number of microarrays to be normalised:
around 30 minutes for 218 cases on a cluster such as Fornax (an SGI Altix 350 remote
cluster composed of 30 64-bit Intel Itanium processors running at 1.5 GHz with 64 GB of
shared RAM, under Suse SLES9/SGI ProPack 4; for more details visit the page of
the cluster http://cibercluster.upf.edu/EN/Pages/que_es.aspx). On a common desktop
computer, R is likely to hang. The object “DataRMA” contains the result of
the normalisation. The matrix with expression values is extracted as follows:
DataRMAmat<-exprs(DataRMA)
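For scripted use, the reading, normalisation and extraction steps above can be wrapped in one function that takes the folder of .cel files as an argument instead of relying on the “Change dir...” menu option. This is only a sketch: the function name `read_rma_matrix` and its argument `cel_dir` are hypothetical, and it assumes the affy package (and Biobase, which it depends on) is installed.

```r
# Hypothetical helper: returns the RMA-normalised expression matrix
# for all .cel files found in cel_dir.
read_rma_matrix <- function(cel_dir) {
  stopifnot(dir.exists(cel_dir))
  if (!requireNamespace("affy", quietly = TRUE)) {
    stop("the affy package is required")
  }
  raw  <- affy::ReadAffy(celfile.path = cel_dir)  # read every .cel file in the folder
  norm <- affy::rma(raw)                          # RMA normalisation
  Biobase::exprs(norm)                            # expression matrix
}

# Usage (the path is a placeholder):
# DataRMAmat <- read_rma_matrix("C:/path/to/cel/files")
```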
This matrix contains cases in columns and probesets in rows. The matrix with the
RMA-normalised values of the probesets of interest can be created as follows:
Names<-c("probeset1","probeset2","probeset3","probeset4")
DataRMAmatProb<- DataRMAmat[c(na.omit(match(Names,rownames(DataRMAmat)))),]
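The behaviour of this subsetting step can be checked on a small toy matrix. The sketch below uses made-up probeset identifiers and values (they are not real data) to show that match() returns the row positions in the order requested, and that identifiers absent from the matrix yield NA and must be dropped before indexing.

```r
# Toy matrix with hypothetical identifiers: probesets in rows, cases in columns.
toy <- matrix(1:12, nrow = 4,
              dimnames = list(c("probesetA", "probesetB", "probesetC", "probesetD"),
                              c("case1", "case2", "case3")))

wanted <- c("probesetC", "missing_id", "probesetA")

idx  <- match(wanted, rownames(toy))  # row position of each ID, NA when absent
keep <- idx[!is.na(idx)]              # drop IDs that are not in the matrix

# drop = FALSE keeps the result a matrix even if only one probeset is found.
sub <- toy[keep, , drop = FALSE]
print(sub)
```

The rows of the result follow the order of the requested identifiers, not the order of the original matrix.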
The values of the probesets of interest are used for the computation of a discriminant
formula. We consider the probesets from our local signature. First, we standardise the
expression values of our local dataset. For this we need the mean
(M(COL1A,POSTN,NNMT,DCN)Lee) and standard deviation (SD(COL1A,POSTN,NNMT,DCN)Lee) from
Lee’s dataset:
M(COL1A,POSTN,NNMT,DCN)Lee = (8.58, 8.17, 8.53, 10.08)
SD(COL1A,POSTN,NNMT,DCN)Lee = (2.06, 2.60, 2.04, 1.29)
If we select the first case of our local dataset:
Local1(COL1A,POSTN,NNMT,DCN) = (9.95, 7.64, 11.45, 8.90)
the standardized values are:
Local1(COL1A,POSTN,NNMT,DCN)Std = ((9.95-8.58)/2.06, (7.64-8.17)/2.60, (11.45-8.53)/2.04, (8.90-10.08)/1.29)
= (0.67, -0.20, 1.43, -0.91)
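The same standardisation can be reproduced in R as a quick check of the arithmetic. The sketch below uses only the values given above; the variable names (`lee_mean`, `lee_sd`, `local1`) are chosen here for illustration.

```r
genes <- c("COL1A", "POSTN", "NNMT", "DCN")

# Mean and standard deviation from Lee's dataset, as listed above.
lee_mean <- setNames(c(8.58, 8.17, 8.53, 10.08), genes)
lee_sd   <- setNames(c(2.06, 2.60, 2.04, 1.29), genes)

# First case of the local dataset.
local1 <- setNames(c(9.95, 7.64, 11.45, 8.90), genes)

# Standardised value = (value - mean) / standard deviation, per probeset.
local1_std <- (local1 - lee_mean) / lee_sd
print(round(local1_std, 2))
```

Rounded to two decimals, this gives the standardised values shown above.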
We repeat this for all cases in our local dataset and perform hierarchical clustering to
assign the group of each case. We compute the discriminant function as described in
section “Materials and Methods”, which provides the discriminant coefficients as
depicted in section “Results”. To obtain the discriminant score for one case from
Lee’s dataset, we must standardise its expression values as above and multiply each
standardised value by the corresponding discriminant coefficient:
DSC(COL1A,POSTN,NNMT,DCN)LeeStd1 = -0.181*(-1.55) + 1.421*(-1.01) + (-0.146)*(-1.13) + 0.600*(-0.89)
= -1.52
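The weighted sum above can be checked in R. This is a sketch using the coefficients and standardised values from the worked example; it assumes that the coefficients are listed in the order COL1A, POSTN, NNMT, DCN, as the subscript suggests.

```r
# Discriminant coefficients (order assumed: COL1A, POSTN, NNMT, DCN).
coef <- c(COL1A = -0.181, POSTN = 1.421, NNMT = -0.146, DCN = 0.600)

# Standardised expression values of the example case from Lee's dataset.
lee1_std <- c(COL1A = -1.55, POSTN = -1.01, NNMT = -1.13, DCN = -0.89)

# Discriminant score = sum of coefficient * standardised value.
score <- sum(coef * lee1_std)
print(round(score, 2))

# Negative scores are assigned to the GLE group in this document.
is_gle <- score < 0
```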
As cases are classified as GLE when displaying negative discriminant scores, this
example case would be classified as GLE. The same procedure should be repeated for
all cases from Lee’s dataset. In doing so, we can assign a group to each case using
the formula above. As we set the a priori group of each case by hierarchical clustering
of the standardised cases, the error of the formula above is computed by
comparison with the classification provided by the hierarchical clustering. To use the
discriminant coefficients of the other signatures, both datasets must be standardised
using the respective mean and standard deviation values:
M(TIMP3, Hs.301281,FAM64A,ECT2)Lee = (8.52, 9.05, 7.29, 5.39)
SD(TIMP3, Hs.301281,FAM64A,ECT2)Lee = (0.92, 1.79, 0.93, 0.93)
M(CHI3L1,LDHA,LGALS1,IGFBP3)Lee = (11.17, 11.69, 11.37, 8.66)
SD(CHI3L1,LDHA,LGALS1,IGFBP3)Lee = (2.13, 0.79, 1.00, 1.55)
For a sample from a new dataset, the mean and standard deviation values of that dataset
must be computed for its standardisation. This was performed for Gravendeel’s datasets.
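The whole procedure for a new dataset can be sketched in R as follows. The expression matrix is simulated (the values are hypothetical, not real data), the coefficients are those of the worked example above and are assumed to be in the order COL1A, POSTN, NNMT, DCN, and the clustering settings (Euclidean distance, complete linkage, two groups) are assumptions, since the settings actually used are described in “Materials and Methods”.

```r
set.seed(1)  # reproducible toy data
genes <- c("COL1A", "POSTN", "NNMT", "DCN")

# Hypothetical RMA-normalised matrix: probesets in rows, cases in columns.
new_mat <- matrix(rnorm(4 * 10, mean = 9, sd = 2), nrow = 4,
                  dimnames = list(genes, paste0("case", 1:10)))

# Standardise each probeset with the new dataset's own mean and SD.
new_mean <- rowMeans(new_mat)
new_sd   <- apply(new_mat, 1, sd)
new_std  <- (new_mat - new_mean) / new_sd

# Discriminant score per case (coefficient order assumed, see above).
coef   <- c(COL1A = -0.181, POSTN = 1.421, NNMT = -0.146, DCN = 0.600)
scores <- colSums(coef[genes] * new_std)
predicted <- ifelse(scores < 0, "GLE", "not GLE")

# A priori groups from hierarchical clustering of the standardised cases
# (assumed settings: Euclidean distance, complete linkage, two groups).
hc       <- hclust(dist(t(new_std)))
a_priori <- cutree(hc, k = 2)

# Cross-tabulate the clustering against the formula-based assignment.
agreement <- table(cluster = a_priori, formula = predicted)
print(agreement)
```

The cluster labels 1 and 2 are arbitrary, so the table has to be read to see which cluster corresponds to GLE before an error rate is derived from it.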