A Tutorial on Tissue Heterogeneity Correction by Computational Source Separation
1. Introduction
The computational source separation (BSS) algorithm has been implemented in MATLAB for
computational dissection of tissue heterogeneity (CDTH). Because of tissue heterogeneity, a gene
expression profile measured by microarray often represents a composite of more than one cell source.
Signatures from other cells in the sample therefore act as “artifacts” that can make the profile misleading
(Wang, 2005). BSS is a computational method that blindly decomposes such mixed gene expression
profiles into their source profiles. Figure BSS shows a snapshot of the BSS software for CDTH.
2. A Brief Description of the BSS Method
In the algorithm, we select the final ISGs that maximize the sum of two correlation coefficients: the
one between S1source and S1recovered, and the one between S2source and S2recovered. There are several
ways to estimate the model; most estimation principles and objective functions are equivalent, at least in
theory. To evaluate the performance of the algorithm, we use two criteria: (1) correlation
coefficients and (2) the performance index E1. The performance index is defined as (Hyvärinen, 2001):
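\[
E_1 = \sum_{i=1}^{2} \left( \sum_{j=1}^{2} \frac{|p_{ij}|}{\max_k |p_{ik}|} - 1 \right)
    + \sum_{j=1}^{2} \left( \sum_{i=1}^{2} \frac{|p_{ij}|}{\max_k |p_{kj}|} - 1 \right)
\qquad (1)
\]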
where p_ij is the (i, j)th element of the 2 × 2 matrix P = WA, the product of the demixing matrix W and
the actual mixing matrix A. If the source profiles have been separated perfectly, P reduces to a
permutation matrix. The index E1 attains its minimum value, zero, for an ideal permutation matrix.
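The two evaluation criteria can be illustrated with a minimal sketch. It is written in Python rather than MATLAB and is not taken from the BSS software; the function names performance_index and correlation are hypothetical, absolute values are used in the index, and the sketch assumes that no row or column of P is entirely zero.

```python
import math


def performance_index(p):
    """Return the index E1 for a square matrix p given as a list of rows."""
    n = len(p)
    abs_p = [[abs(v) for v in row] for row in p]
    total = 0.0
    # Row term: each row's absolute sum relative to its largest entry, minus 1.
    for i in range(n):
        total += sum(abs_p[i]) / max(abs_p[i]) - 1.0
    # Column term: the same quantity computed down each column.
    for j in range(n):
        col = [abs_p[i][j] for i in range(n)]
        total += sum(col) / max(col) - 1.0
    return total


def correlation(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# A scaled permutation matrix (perfect separation) gives E1 = 0.
print(performance_index([[0.0, 2.0], [3.0, 0.0]]))  # 0.0
# Residual cross-talk in P raises the index.
print(performance_index([[1.0, 0.5], [0.0, 1.0]]))  # 1.0
```

A recovered source that matches its ground-truth profile up to scale gives a correlation coefficient of 1, so a smaller E1 and larger correlation coefficients both indicate better separation.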
3. A Procedural Description of the BSS Method
In this section, we describe the procedure for running the BSS algorithm. The following steps are
described and illustrated in detail: (1) preparing data, (2) reading in data, (3) CDTH using BSS, and (4)
exporting recovered data to files and plotting result figures.
3.1. Prepare Data
The input data set has two parts: the expression values of the samples and the corresponding
mixture labels. Both are saved in a single tab-delimited text file. All data file names should be
lowercase. The probe ID (or image ID) file is also a text file, and its name is composed of the data
file's name and imageid.
The data variable should be arranged with samples of different mixing ratios as columns and
dimensions as rows. The label variable should be a row vector with one element, and therefore one
number, per sample. We suggest using consecutive numbers starting from 1 as labels. For example, the
LCC1 vs. MCF7 data set includes four samples, so the label vector [1 2 3 4] corresponds to the mixture
ratios 1:0, 0:1, 3:1 and 1:3, respectively. Note that the samples of every biological mixing data set
should be arranged in the same order as in this example.
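The layout described above can be sketched as a small parser. This is an illustration in Python, not the MATLAB reader shipped with the software. The exact file layout is an assumption: the tutorial does not state where the label row sits, so the sketch takes the first line as the label vector and every later line as one row of expression values. The function name parse_mixture_table and the numbers in the example are made up.

```python
def parse_mixture_table(text):
    """Split tab-delimited text into (labels, data).

    Assumed layout (hypothetical): line 1 holds one integer label per sample;
    every later line holds one row of expression values, one value per sample.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    labels = [int(tok) for tok in lines[0].split("\t")]
    data = [[float(tok) for tok in ln.split("\t")] for ln in lines[1:]]
    for row in data:
        if len(row) != len(labels):
            raise ValueError("each row needs one value per labelled sample")
    return labels, data


# Made-up numbers for illustration: four samples (columns), three rows.
example = (
    "1\t2\t3\t4\n"
    "10.0\t2.0\t8.0\t4.0\n"
    "1.0\t9.0\t3.0\t7.0\n"
    "5.0\t5.0\t5.0\t5.0\n"
)
labels, data = parse_mixture_table(example)
print(labels)        # [1, 2, 3, 4]
print(len(data))     # 3
```

Checking that every row has as many values as there are labels catches the most common preparation mistake, a sample column that is missing or misaligned.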
3.2. Read in Data
1. Start MATLAB.
2. Choose the data set on which to run PICA by entering a number at the prompt in the command window:
Run BSS using simulation data or biological data? (0: exit; 1:
simulation; 2: biological):
0: Exit the program;
1: Simulation data;
2: Biological data;
Other number: wrong option.
3. If you choose the simulation data, the following prompt appears in the command
window:
Run BSS using simulation data or biological data? (0: exit; 1:
simulation; 2: biological):1
numerical mix...
load the pure data set...
Please input the file name of the pure data set (*.txt)[Enter:
default data set]:
There are two options here: enter the name of your own data file (in the form ***.txt), or
simply press the RETURN key. If the RETURN key is pressed, the program
automatically uses the default simulation data file, nci_63.txt (2308 genes from neuroblastoma
(NB) and Ewing family of tumors (EWS) profiles).
4. If you choose the biological data, the following prompt appears in the command
window:
Run BSS using simulation data or biological data? (0: exit; 1:
simulation; 2: biological):2
NOT numerical mix...
load the pure data set...
Please input the file name of the pure data set (*.txt)[Enter:
default data set]:
There are two options here: enter the name of your own data file (in the form ***.txt), or
simply press the RETURN key. If the RETURN key is pressed, the program
automatically uses the default biological data file, lcc9_m1.txt.txt (LCC1 vs. MCF-7 data sets).
5. For the simulation data, the program generates the mixture ratios automatically. For the
biological data, the current mixture ratios are 1:3 and 3:1.
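To make the mixture ratios concrete, the sketch below mixes two pure profiles numerically. It is an illustration in Python, not the mixing routine of the BSS software. The function name mix_profiles and the profile values are made up, and the sketch assumes that a ratio a:b means weights a/(a+b) and b/(a+b).

```python
def mix_profiles(s1, s2, ratio):
    """Mix two pure profiles with ratio (a, b), normalised so the weights sum to 1."""
    a, b = ratio
    w1, w2 = a / (a + b), b / (a + b)
    return [w1 * x + w2 * y for x, y in zip(s1, s2)]


# Made-up pure profiles with three values each.
s1 = [10.0, 1.0, 5.0]
s2 = [2.0, 9.0, 5.0]

# Ratios from the four-sample example: 1:0, 0:1, 3:1 and 1:3.
ratios = [(1, 0), (0, 1), (3, 1), (1, 3)]
samples = [mix_profiles(s1, s2, r) for r in ratios]

print(samples[2])  # 3:1 mixture -> [8.0, 3.0, 5.0]
print(samples[3])  # 1:3 mixture -> [4.0, 7.0, 5.0]
```

Under this assumption, the 3:1 and 1:3 samples correspond to a 2 × 2 mixing matrix A with rows (0.75, 0.25) and (0.25, 0.75).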
3.3. Run the ISG-PICA Algorithm
Once the data set has been selected, the software automatically selects appropriate methods and runs
the PICA algorithm. The command window reports progress as follows:
-------------perform BSS...
-------------Calculating covariance...
Dimension not reduced.
Selected [ 2 ] dimensions.
Smallest remaining (non-zero) eigenvalue [ 2.36061e+006 ]
Largest remaining (non-zero) eigenvalue [ 2.50717e+007 ]
Sum of removed eigenvalues [ 0 ]
[ 100 ] % of (non-zero) eigenvalues retained.
Whitening...
Check: covariance differs from identity by [ 1.11022e-014 ].
Convergence after 8 steps
normalization/registration of sources and mixing matrix for
comparison...
the E1 of demixing-mixing matrix:
0.79997
the correlation coefficient of sig and PICA is:
1
0.94882
the correlation coefficient of sig and PICA is:
2
0.99209
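The covariance, eigenvalue and whitening lines in the output above correspond to a standard preprocessing step for two mixtures, which can be sketched as follows. The sketch is in Python, is not the code of the BSS software, and uses hypothetical function names and made-up data. It assumes two mixture profiles that are not perfectly correlated, so that both eigenvalues are positive.

```python
import math


def covariance_2d(x1, x2):
    """Return the sample covariance entries (c11, c12, c22) of two profiles."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    c11 = sum((a - m1) ** 2 for a in x1) / (n - 1)
    c22 = sum((b - m2) ** 2 for b in x2) / (n - 1)
    c12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)
    return c11, c12, c22


def whiten_2d(x1, x2):
    """Whiten two mixture profiles so that their sample covariance is the identity."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    c11, c12, c22 = covariance_2d(x1, x2)
    # Closed-form eigen-decomposition of the symmetric 2 x 2 covariance matrix.
    half_trace = (c11 + c22) / 2.0
    root = math.hypot((c11 - c22) / 2.0, c12)
    eigvals = (half_trace + root, half_trace - root)
    theta = 0.5 * math.atan2(2.0 * c12, c11 - c22)
    eigvecs = ((math.cos(theta), math.sin(theta)),
               (-math.sin(theta), math.cos(theta)))
    # Project the centred data on each eigenvector and scale to unit variance.
    z = []
    for lam, (e1, e2) in zip(eigvals, eigvecs):
        scale = 1.0 / math.sqrt(lam)
        z.append([scale * (e1 * (a - m1) + e2 * (b - m2))
                  for a, b in zip(x1, x2)])
    return eigvals, z


# Made-up mixture observations, five values each.
mix1 = [8.0, 3.0, 5.0, 6.0, 1.0]
mix2 = [4.0, 7.0, 5.0, 2.0, 9.0]
eigvals, z = whiten_2d(mix1, mix2)

# Analogue of the "covariance differs from identity by" check in the log.
c11, c12, c22 = covariance_2d(z[0], z[1])
deviation = max(abs(c11 - 1.0), abs(c12), abs(c22 - 1.0))
print(deviation)  # close to zero, up to floating-point error
```

After whitening, the remaining demixing step only has to find a rotation, which is why the log reports convergence after a small number of steps.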
3.4. Export Recovered Data to Files and Plot Result Figures
After the BSS software finishes, scatter plots of the pure cases are displayed. Further figures show the
profiles of the ground truth, the mixture observations and the recovered sources for
comparison, together with a collection of the recovered sources in the gene space. All of these figures
are also saved as emf files.
4. Summary
In this tutorial, we have presented a procedural description of the BSS method, which has been
implemented in MATLAB. Section 2 gives a brief summary of the algorithm, and Section 3 details and
illustrates the procedural steps for performing CDTH using BSS. This document is limited to the
implementation of the BSS algorithm and does not discuss its theory. For more detail on the underlying
algorithm, the reader should refer to the cited references and to related books and publications
(such as Hyvärinen, 2001).
References
1. A. Hyvärinen, Independent Component Analysis, John Wiley & Sons, Inc., 2001.
2. Y. Wang, J. Zhang, J. Khan, R. Clarke, and Z. Gu, “Partially-independent component analysis for tissue
heterogeneity correction in microarray gene expression analysis,” Proc. IEEE Workshop on Neural
Networks for Signal Processing, pp. 24-32, Toulouse, France, September 2003.
3. RunICA software package, 2005. http://sccn.ucsd.edu/eeglab/maintut/ICA_decomposition.html