SUPPLEMENT 3
Figure 1 shows the initial plots of the 2330 differentially expressed genes onto the
(a) 1st and 2nd, (b) 2nd and 3rd, and (c) 1st and 3rd principal components. Black
genes have different expression levels in all three tissues (e.g., RC > QC > PM), blue genes
are RC up-regulated genes (RC > QC ≈ PM), red genes are RC down-regulated genes
(RC < QC ≈ PM), and the other four colors (green, orange, light blue, pink) correspond to
the up- or down-regulated expression patterns in QC or PM. This gene categorization was
determined by Fisher’s LSD test.
From Figure 1, we see that the direct application of PCA separates only the two
dominant expression patterns [RC up-regulated (1361/2330) and RC down-regulated
(477/2330) genes] and fails to recognize the other patterns because their signals are weak
(in more detail, red genes are well separated from blue genes in Figure 1(a), but both are
mixed together with genes of other colors). The failure of PCA to classify the
differentially expressed genes, even when all principal components are used, may be
attributed to the imbalance of the gene set, in which the majority of genes are either RC
up- or down-regulated.
Therefore, a more “balanced” simulation dataset, with the same number of genes for each
pattern, was generated to amplify the weak signals and to obtain a more “appropriate”
covariance matrix for capturing the correct class information. Using this simulation
dataset, the following covariance matrix was obtained:

$$\mathrm{COV}(X) = \begin{bmatrix} 5.94 & 2.14 & 2.15 \\ 2.14 & 5.94 & 2.13 \\ 2.15 & 2.13 & 5.94 \end{bmatrix}$$

The three eigenvalues ($\lambda_1$, $\lambda_2$, $\lambda_3$) and their corresponding
eigenvectors ($e_1$, $e_2$, $e_3$) were also calculated from this covariance matrix:

$$\lambda_1 = 10.22, \quad \lambda_2 = \lambda_3 = 3.8,$$
$$e_1 = [0.58 \;\; 0.58 \;\; 0.58], \quad e_2 = [0 \;\; {-0.7} \;\; 0.7], \quad e_3 = [0.8 \;\; {-0.4} \;\; {-0.4}].$$
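The eigenvalues and the share of variance carried by each component can be checked numerically from the reported covariance matrix. The following is a minimal sketch using NumPy; the variable names are illustrative and are not taken from the original analysis.

```python
import numpy as np

# Covariance matrix as reported in the text.
cov = np.array([
    [5.94, 2.14, 2.15],
    [2.14, 5.94, 2.13],
    [2.15, 2.13, 5.94],
])

# eigh is intended for symmetric matrices; it returns eigenvalues in
# ascending order with the matching eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so that the largest eigenvalue (PC1) comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance captured by each component.
explained = eigenvalues / eigenvalues.sum()

print("eigenvalues:", np.round(eigenvalues, 2))
print("explained variance:", np.round(explained, 2))
print("first eigenvector:", np.round(eigenvectors[:, 0], 2))
```

Because the second and third eigenvalues are nearly equal, a numerical solver may return a second and third eigenvector that are rotated within their shared plane (or flipped in sign) relative to the ones quoted above; any such pair spans the same plane.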
When the tissue specific genes (only the colored genes in Figure 1) were plotted onto
this new component space (see Figure 2), the second and third components, PC2 and PC3,
clearly separated the 6 classes of tissue specific genes.
The first component, which is associated with the eigenvector $e_1 = [0.58, 0.58, 0.58]$,
captures only the gene’s average expression level across the three tissues, and
thus does not provide useful signals for discriminating between different gene classes,
even though it captures the largest amount of variance in the data (57%). This agrees with
existing arguments (Yeung and Ruzzo 2001, Chang 1983) that the component with the largest
variance is not necessarily the most informative component for classification. The
precise description of the six class separating regions for tissue specific genes, whose
centers are the dotted lines in Figure 2, is provided in Table 1. For example, the genes
around the line PC2 = 0 with PC3 < 0 are RC down-regulated genes.
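To illustrate why PC1 carries only the average expression level while PC2 and PC3 carry the pattern, the sketch below projects a few made-up expression profiles onto orthonormal directions that round to the eigenvectors quoted above. Two things are assumptions and not stated in the source: the tissue order of the three variables is taken to be (RC, QC, PM), and the signs of the second and third directions are chosen so that RC up-regulated profiles fall on PC2 = 0 with PC3 > 0, as in Table 1. The profiles and the `project` helper are hypothetical.

```python
import numpy as np

# Orthonormal directions that round to the eigenvectors in the text.
# Assumption: variables are ordered (RC, QC, PM); signs are illustrative.
e1 = np.array([1.0, 1.0, 1.0]) / np.sqrt(3.0)    # ~[0.58, 0.58, 0.58]
e2 = np.array([0.0, -1.0, 1.0]) / np.sqrt(2.0)   # ~[0, -0.7, 0.7]
e3 = np.array([2.0, -1.0, -1.0]) / np.sqrt(6.0)  # ~[0.8, -0.4, -0.4]
basis = np.vstack([e1, e2, e3])


def project(profile):
    """Return (PC1, PC2, PC3) scores of one (RC, QC, PM) profile."""
    return basis @ np.asarray(profile, dtype=float)


# Hypothetical profiles (arbitrary units).
rc_up = [3.0, 1.0, 1.0]          # RC > QC = PM
rc_up_shifted = [8.0, 6.0, 6.0]  # same pattern, higher overall level
rc_down = [1.0, 3.0, 3.0]        # RC < QC = PM

for name, profile in [("RC up", rc_up),
                      ("RC up, shifted", rc_up_shifted),
                      ("RC down", rc_down)]:
    pc1, pc2, pc3 = project(profile)
    print(f"{name:15s} PC1={pc1:6.2f} PC2={pc2:6.2f} PC3={pc3:6.2f}")
```

Adding a constant to all three tissues changes only the PC1 score, because the second and third directions each sum to zero; the position in the PC2/PC3 plane depends only on the expression pattern.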
Furthermore, we applied the above analysis to all 2330 differentially expressed genes to
adjust the list of tissue specific genes, as some genes may be mis-categorized by Fisher’s
LSD test (Figure 3). In more detail, we projected the genes onto the circle $x^2 + y^2 = 2$ in
the new 2nd and 3rd component space, and used the silhouette method (Rousseeuw 1987)
to adjust the partition of the genes into 12 classes. Figure 3(b) shows the clustering
results from this method, where colored genes correspond to tissue specific genes
(e.g., RC > PM ≈ QC) and black genes correspond to genes that have different
expression levels in all three tissues (e.g., RC > QC > PM). The additional tissue specific
genes identified by this method are included in Supplement 2.
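The two steps described above, projection onto the circle and silhouette evaluation, can be sketched as follows. This is an illustration under stated assumptions, not the original implementation: the function names, the toy points, and the two-class labels are hypothetical, and the source does not specify how silhouette values were used to move genes between the 12 classes. The silhouette value of a point is computed as (b - a) / max(a, b), where a is its mean distance to the other members of its own class and b is the smallest mean distance to any other class (Rousseeuw 1987).

```python
import numpy as np


def project_to_circle(points, radius=np.sqrt(2.0)):
    """Rescale each 2-D point so that x^2 + y^2 = radius^2."""
    points = np.asarray(points, dtype=float)
    norms = np.linalg.norm(points, axis=1, keepdims=True)
    return radius * points / norms


def silhouette_values(points, labels):
    """Per-point silhouette values using Euclidean distances."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    classes = np.unique(labels)
    values = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton class gets silhouette 0
        # Mean distance to the other members of the same class.
        a = dist[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other class.
        b = min(dist[i, labels == c].mean()
                for c in classes if c != labels[i])
        values[i] = (b - a) / max(a, b)
    return values


# Toy (PC3, PC2) scores for six hypothetical genes in two classes.
scores = np.array([[1.0, 0.05], [2.0, -0.1], [0.5, 0.0],
                   [-1.0, 0.1], [-3.0, -0.2], [-0.7, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

on_circle = project_to_circle(scores)
sil = silhouette_values(on_circle, labels)
print(np.round(sil, 2))
```

After the projection only the direction of each gene in the PC2/PC3 plane matters, not its magnitude. Genes with low or negative silhouette values sit closer to another class than to their own and would be natural candidates for reassignment.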
Figure 1. All the differentially expressed genes plotted onto the (a) 1st and 2nd,
(b) 2nd and 3rd, and (c) 1st and 3rd principal components.
Figure 2. The tissue specific genes plotted onto the 2nd and 3rd components in the new
component space.
Figure 3. All 2330 differentially expressed genes plotted onto the new component space:
(a) 1st and 2nd components, (b) 2nd and 3rd components, and (c) 1st and 3rd components.
Red and blue genes: RC down-/up-regulated genes (PM ≈ QC); pink and light blue genes: PM
down-/up-regulated genes (RC ≈ QC); orange and green genes: QC down-/up-regulated genes
(RC ≈ PM); black genes: patterns PM > QC > RC, PM > RC > QC, RC > PM > QC,
RC > QC > PM, QC > RC > PM, and QC > PM > RC.
Table 1. Six expression patterns and their separating regions described by PC2 and PC3

Class index   Expression pattern   Center of separating region (described by PC2 and PC3)
1             PM > (QC ≈ RC)       PC2 = 1.73 PC3 < 0
2             PM < (QC ≈ RC)       PC2 = 1.73 PC3 > 0
3             QC > (PM ≈ RC)       PC2 = 1.73 PC3 > 0
4             QC < (PM ≈ RC)       PC2 = 1.73 PC3 < 0
5             RC > (PM ≈ QC)       PC2 = 0; PC3 > 0
6             RC < (PM ≈ QC)       PC2 = 0; PC3 < 0
References
Chang, W.C. 1983. On using principal components before separating a mixture of two
multivariate normal distributions. Appl. Statist. 32: 267-275.
Yeung, K.Y. and Ruzzo, W.L. 2001. Principal component analysis for clustering gene
expression data. Bioinformatics 17: 763-774.
Rousseeuw, P.J. 1987. Silhouettes: a graphical aid to the interpretation and validation of
cluster analysis. J. Comput. Appl. Math. 20: 53-65.