SUPPLEMENT 3

Figure 1 shows the initial projections of the 2330 differentially expressed genes onto the (a) 1st and 2nd, (b) 2nd and 3rd, and (c) 1st and 3rd principal components. Black genes have different expression levels in all three tissues (e.g., RC > QC > PM), blue genes are RC up-regulated (RC > QC ≈ PM), red genes are RC down-regulated (RC < QC ≈ PM), and the other four colors (green, orange, light blue, pink) correspond to genes that are up- or down-regulated in QC or PM. This gene categorization was determined by Fisher's LSD test. Figure 1 shows that direct application of PCA separates only the two dominant expression patterns, RC up-regulated (1361/2330) and RC down-regulated (477/2330) genes, and fails to recognize the other patterns of interest because their signals are weak: red genes are well separated from blue genes in Figure 1(a), but both are mixed with genes of other colors.

The failure of PCA to classify the differentially expressed genes, even when all principal components are used, may be attributed to the imbalance of the gene set, in which the majority of genes are either RC up- or down-regulated. Therefore, a more "balanced" simulation dataset, with the same number of genes for each pattern, was generated to amplify the weak signals and to obtain a more "appropriate" covariance matrix for capturing the correct class information. From this simulation dataset, the following covariance matrix was obtained:

COV(X) = [5.94 2.14 2.15; 2.14 5.94 2.13; 2.15 2.13 5.94]

The three eigenvalues (λ1, λ2, λ3) and their corresponding eigenvectors (e1, e2, e3), each determined up to sign, were calculated from this covariance matrix:

λ1 ≈ 10.22, λ2 ≈ λ3 ≈ 3.8
e1 ≈ [0.58 0.58 0.58], e2 ≈ [0 0.7 −0.7], e3 ≈ [0.8 −0.4 −0.4]
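The eigen-decomposition quoted above can be checked numerically. The following is a minimal sketch in Python with NumPy, not the authors' code. It assumes the reported eigenvectors are rounded versions of [1 1 1]/√3, [0 1 −1]/√2 and [2 −1 −1]/√6, with the minus signs chosen so that the three vectors are mutually orthogonal. The names `reported` and `eigen_residual` are introduced here for illustration only.

```python
import numpy as np

# Covariance matrix quoted in the text for the balanced simulation dataset.
cov = np.array([
    [5.94, 2.14, 2.15],
    [2.14, 5.94, 2.13],
    [2.15, 2.13, 5.94],
])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order;
# reverse them so that the largest comes first.
eigenvalues = np.linalg.eigvalsh(cov)[::-1]
explained = eigenvalues / eigenvalues.sum()

# ASSUMPTION: exact forms and signs of the rounded eigenvectors in the text.
reported = {
    "e1": np.array([1.0, 1.0, 1.0]) / np.sqrt(3),
    "e2": np.array([0.0, 1.0, -1.0]) / np.sqrt(2),
    "e3": np.array([2.0, -1.0, -1.0]) / np.sqrt(6),
}


def eigen_residual(matrix, vector):
    """Return the Rayleigh quotient of a unit vector and the norm of
    matrix @ vector - quotient * vector (zero for an exact eigenvector)."""
    quotient = float(vector @ matrix @ vector)
    residual = float(np.linalg.norm(matrix @ vector - quotient * vector))
    return quotient, residual


residuals = {name: eigen_residual(cov, v) for name, v in reported.items()}

print("eigenvalues:", np.round(eigenvalues, 2))
print("share of variance:", np.round(explained, 3))
for name, (quotient, residual) in residuals.items():
    print(f"{name}: Rayleigh quotient {quotient:.2f}, residual {residual:.4f}")
```

The sketch tests the reported vectors directly instead of comparing them with the solver's eigenvectors. Because λ2 and λ3 are nearly equal, a numerical solver may return any rotated basis of the same plane, so its second and third eigenvectors need not match the ones listed in the text.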
When the tissue-specific genes (only the colored genes in Figure 1) were plotted in this new component space (see Figure 2), the second and third components, PC2 and PC3, clearly separated the six classes of tissue-specific genes. The first component, associated with the eigenvector e1 ≈ [0.58 0.58 0.58], captures only a gene's average expression level across the three tissues. It therefore provides no useful signal for discriminating between gene classes, even though it captures the largest share of the variance in the data (57%). This agrees with existing arguments (Chang 1983; Yeung and Ruzzo 2001) that the component with the largest variance is not necessarily the most informative one for classification. Table 1 gives the precise description of the six class-separating regions for tissue-specific genes, whose centers are the dotted lines in Figure 2. For example, the genes around the line PC2 = 0 with PC3 < 0 are RC down-regulated genes.

Furthermore, we applied the above analysis to all 2330 differentially expressed genes to adjust the list of tissue-specific genes, because some genes may have been mis-categorized by Fisher's LSD test (Figure 3). Specifically, we projected the genes onto the circle x² + y² = 2 in the new 2nd and 3rd component space and used the silhouette method (Rousseeuw 1987) to adjust the partition of the genes into 12 classes. Figure 3(b) shows the resulting clustering, where colored genes are tissue-specific genes (e.g., RC > PM ≈ QC) and black genes have different expression levels in all three tissues (e.g., RC > QC > PM). The additional tissue-specific genes identified by this method are included in Supplement 2.

Figure 1. All the differentially expressed genes plotted onto the (a) 1st and 2nd, (b) 2nd and 3rd, and (c) 1st and 3rd principal components.

Figure 2. The tissue-specific genes plotted onto the 2nd and 3rd components of the new component space.

Figure 3. All 2330 differentially expressed genes plotted in the new component space: (a) 1st and 2nd components, (b) 2nd and 3rd components, and (c) 1st and 3rd components. Red and blue genes: RC down-/up-regulated genes (PM ≈ QC); pink and light blue genes: PM down-/up-regulated genes (RC ≈ QC); orange and green genes: QC down-/up-regulated genes (RC ≈ PM); black genes: patterns PM > QC > RC, PM > RC > QC, RC > PM > QC, RC > QC > PM, QC > RC > PM and QC > PM > RC.

Table 1. Six expression patterns and the centers of their separating regions described by PC2 and PC3

Class index   Expression pattern   Center of separating region
1             PM > (QC ≈ RC)       PC2 = 1.73 PC3 < 0
2             PM < (QC ≈ RC)       PC2 = 1.73 PC3 > 0
3             QC > (PM ≈ RC)       PC2 = −1.73 PC3 > 0
4             QC < (PM ≈ RC)       PC2 = −1.73 PC3 < 0
5             RC > (PM ≈ QC)       PC2 = 0; PC3 > 0
6             RC < (PM ≈ QC)       PC2 = 0; PC3 < 0

References

Chang, W.C. 1983. On using principal components before separating a mixture of two multivariate normal distributions. Appl. Statist. 32: 267-275.
Rousseeuw, P.J. 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20: 53-65.
Yeung, K.Y. and Ruzzo, W.L. 2001. Principal component analysis for clustering gene expression data. Bioinformatics 17: 763-774.
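The geometry behind the separating regions of Table 1 can be illustrated with a short Python sketch. It is a simplified stand-in, not the authors' procedure: it assigns a profile to the nearest of the six tissue-specific class centers on the circle x² + y² = 2, and it does not implement the silhouette-based adjustment or the 12-class partition. The tissue order (RC, QC, PM), the eigenvector signs, the idealized 0/1 profiles and the names `to_plane`, `to_circle` and `nearest_class` are all assumptions made for illustration.

```python
import numpy as np

# ASSUMPTION: profiles are ordered (RC, QC, PM), and the second and third
# eigenvectors are the exact forms of the rounded vectors in the text.
E2 = np.array([0.0, 1.0, -1.0]) / np.sqrt(2)
E3 = np.array([2.0, -1.0, -1.0]) / np.sqrt(6)

# Idealized (RC, QC, PM) profiles for the six tissue-specific classes.
PROTOTYPES = {
    1: np.array([0.0, 0.0, 1.0]),  # PM > (QC ≈ RC)
    2: np.array([1.0, 1.0, 0.0]),  # PM < (QC ≈ RC)
    3: np.array([0.0, 1.0, 0.0]),  # QC > (PM ≈ RC)
    4: np.array([1.0, 0.0, 1.0]),  # QC < (PM ≈ RC)
    5: np.array([1.0, 0.0, 0.0]),  # RC > (PM ≈ QC)
    6: np.array([0.0, 1.0, 1.0]),  # RC < (PM ≈ QC)
}


def to_plane(profile):
    """Project an (RC, QC, PM) profile onto (PC2, PC3). Both axes are
    orthogonal to (1, 1, 1), so the average expression level drops out."""
    profile = np.asarray(profile, dtype=float)
    return np.array([profile @ E2, profile @ E3])


def to_circle(point, radius=np.sqrt(2.0)):
    """Rescale a non-zero (PC2, PC3) point onto the circle of the given radius."""
    norm = np.linalg.norm(point)
    if norm == 0.0:
        raise ValueError("profile has no tissue-specific signal")
    return point * (radius / norm)


# Class centers on the circle, one per tissue-specific pattern.
CENTERS = {label: to_circle(to_plane(p)) for label, p in PROTOTYPES.items()}

# Slope PC2 / PC3 of the center line for the four PM and QC classes.
slopes = {label: CENTERS[label][0] / CENTERS[label][1] for label in (1, 2, 3, 4)}


def nearest_class(profile):
    """Return the class index whose center is closest to the profile on the circle."""
    point = to_circle(to_plane(profile))
    return min(CENTERS, key=lambda label: np.linalg.norm(point - CENTERS[label]))


print({label: round(float(s), 2) for label, s in slopes.items()})
print(nearest_class([9.0, 5.1, 5.0]))  # hypothetical RC up-regulated profile
```

Under these assumptions the four PM and QC center lines have slopes of ±√3 ≈ ±1.73, and the RC classes lie on PC2 = 0, which is the arrangement of six directions spaced 60° apart that Table 1 describes.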