The Report of HIV Data Analysis

Part I: Patients Clustering

1. Data Set Introduction

In our HIV data set there are 36 distinct patients, with 41 genes measured for each patient.

1.1 About Genes

The 41 genes come from 6 families: the MMP family, the Apoptosis family, the Chemokine family, the Cytokine family, the Signal transduction molecules family, and the Basement membrane proteins family. Detailed information about the genes is shown in Table 1-1 below.

| Family name                        | Genes |
| MMP family (10)                    | MMP-1, MMP-2, MMP-10, MMP-11, MMP-13, MMP-14, TIMP-1, TIMP-2, TIMP-3, TIMP-4 |
| Apoptosis family (9)               | BCL2, CASP2, CASP3, CASP5, CASP6, CASP7, CASP8, CD40L, FASL |
| Chemokine family (10)              | CCL3, CCL4, CCL5, CXCL1, CXCL2, CXCL3, CXCL5, CCR7, CCL2, CXCR4 |
| Cytokine family (7)                | TNF, VEGF-A, TGFb1, IL1B, IL-17, TGFb2, TNFa |
| Signal transduction molecules (3)  | MAP2K4, MAP2K5, MAP2K7 |
| Basement membrane proteins (2)     | CLDN7, COL1A1 |

Table 1-1 The info of the 6 gene families

1.2 About Patients

We have 36 patients from 3 groups: NP No Meds (normal progressors with no meds), LTNP (long-term normal progressors), and NP with Meds (normal progressors with meds). Detailed information about the patients is shown in Table 1-2 below.

| Patient group                                 | Patient IDs |
| NP No Meds (normal progressors with no meds)  | MR000582565, MR000829742, MR000680263, MR000681587, MR000944008, MR001024668, MR000153425, MR000738082, MR000890309, MR000750185, MR000278554, MR000793267 |
| LTNP (long-term normal progressors)           | MR001054915, MR000393119, MR000698654, MR000650839, MR000650855, MR000484265, MR000704971, MR000384372, MR000067052, MR001056420, MR001048432, M001047735 |
| NP with Meds (normal progressors with meds)   | MR000784719, MR000829742, MR000834420, MR000835659, MR000944328, MR000864519, MR000324700, MR000972175, MR000703081, MR000667018, MR000731854, MR00060496 |

Table 1-2 The info of the patient groups

2. GPX Introduction

We have designed and implemented a clustering tool named GPX (Gene Pattern eXplorer), which is implemented in Java.
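The data set described above can be viewed as a 36 × 41 matrix (patients as rows, genes as columns). As a minimal sketch of how such a matrix might be represented and standardized before clustering, random values stand in for the real expression data, and the per-gene z-scoring step is a common preprocessing choice rather than anything the report prescribes:

```python
import random

random.seed(0)

# Hypothetical expression values; the real data set is 36 patients x 41 genes.
N_PATIENTS, N_GENES = 36, 41
expr = [[random.gauss(0, 1) for _ in range(N_GENES)] for _ in range(N_PATIENTS)]

def zscore_columns(matrix):
    """Standardize each gene (column) to zero mean and unit variance."""
    out_cols = []
    for col in zip(*matrix):
        mean = sum(col) / len(col)
        sd = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
        out_cols.append([(v - mean) / sd for v in col])
    return [list(row) for row in zip(*out_cols)]

expr_z = zscore_columns(expr)
print(len(expr_z), len(expr_z[0]))  # 36 41
```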
2.1 The Contributions of GPX

* Proposes a framework of interactive exploration for the analysis of multidimensional data. This approach supports exploration guided by the users' domain knowledge and accommodates disparate user requirements for varying degrees of coherence in different parts of the data set.
* Develops a novel strategy for handling "intermediate" data items. A user first identifies coherent patterns, and can then determine the borders of groups of co-expressed genes on the basis of the distance between a data item and the coherent patterns. In particular, an "intermediate" data item is allowed to participate in more than one cluster.
* Designs a coherent pattern index to give users ranked indications of the existence of coherent patterns. To derive the index, we adopt a density-based model to describe the coherence relationship between data items and devise an attraction-tree structure to summarize the coherence information for interactive exploration.

2.2 Features of GPX

2.2.1 Interactive Exploration Operations

A user can explore the data set and its coherent patterns by unfolding a hierarchy of the data items and patterns. The exploration starts from the root. To help the user decide how to split the data items and detect patterns, a coherent pattern index graph [7] is shown at each node of the tree to illustrate the cluster structure in the corresponding subset of data. Each pulse in the index graph indicates the potential existence of a coherent pattern, and a higher pulse represents a stronger indication. Based on the index graph, GPX supports several exploration operations [8]. Two essential ones are drill-down and roll-up.

Drill-down. A user can select pulse(s) in the index graph, and the system will split the data items accordingly. Each split subset of data items becomes a child node of the current node.
If the user does not specify any pulse, the system will choose the highest pulse by default and split the data items accordingly.

Roll-up. A user can revoke any drill-down operation: the user selects a node and undoes the drill-down from it, and all descendants of that node are deleted. The user can also roll up a node A into its parent P. This is equivalent to skipping the selection of the pulse in P's index graph that corresponds to A and undoing the drill-down operation for P.

2.2.2 A Robust Model for Clusters and Patterns

Most existing clustering methods try to find clusters based on some global criterion and then derive the coherent patterns as the centroids of the clusters. Such strategies may be sensitive to a large number of intermediate data items in the data set. In contrast to those methods, GPX adopts a novel strategy: it first explores the hierarchy of coherent patterns in the data set and then finds the groups of data items according to the coherent patterns.

In GPX, a cluster of data items is modeled as a dense area in the multidimensional data space. Data items at the center of the dense area have relatively high density and present the coherent pattern of the whole cluster. Data items at the periphery of the dense area have relatively low density and are attracted toward the center area level by level. Through this density-based model, GPX can distinguish coherent data items from intermediate ones by their relative density. The coherent pattern of a dense area is represented by the profile of the data item with the highest local density in that area. Other data items in the same dense area can be sorted in a list by the similarity (from high to low) between their profiles and the coherent pattern. Since the intermediate data items have low similarity to the coherent pattern, they fall at the rear of the sorted list.
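The density-based ranking just described can be sketched in a few lines. This is an illustrative approximation only, with made-up radius and threshold parameters, not GPX's actual attraction-tree implementation:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def local_density(items, radius):
    """Count of neighbours within `radius`: a simple local-density estimate."""
    return [sum(1 for q in items if euclid(p, q) <= radius) for p in items]

def extract_cluster(items, radius=1.5, cut=2.0):
    """Take the densest item's profile as the coherent pattern, sort the rest
    by distance to it, and cut items farther than the threshold."""
    dens = local_density(items, radius)
    pattern = items[max(range(len(items)), key=lambda i: dens[i])]
    ranked = sorted(items, key=lambda p: euclid(p, pattern))
    core = [p for p in ranked if euclid(p, pattern) <= cut]
    return pattern, core

# Hypothetical 2-D profiles: a tight group plus one far "intermediate" point
items = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.1, 0.1), (3.0, 3.0)]
pattern, core = extract_cluster(items)
print(pattern)    # (0.0, 0.0)
print(len(core))  # 4  (the far point falls past the cut)
```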
Users can set a similarity threshold and thus cut the intermediate data items from the cluster.

2.2.3 Graphical Interface

A graphical interface gives users a direct impression of the trends of the data coherence levels. Users may interpret the underlying biological process and decide whether the coherent data items should be further split into finer subgroups. In GPX, we use parallel coordinates to illustrate the profiles of the data items. We also visualize the whole hierarchical structure of the coherent patterns: users can browse the hierarchical tree, select a node, and apply the exploration operations.

3. Clustering by GPX

3.1 The Meaning of Clustering on Our Data

In this part we cluster the 36 patient samples based on the 41 gene features, which tells us which patients belong to the same group. For any given unknown patients, we can apply the clustering based on gene features, which tells us which patients may be in the same situation (for example, taking the same medicine/treatment).

3.2 The Clustering on Our Data

Since our 36 patients come from 3 groups, we hope to separate the data into three clusters and compare the result with what we already know. The expression profiles are shown in Figure 3-1, and for cluster number k = 3 the coherent pattern curve is shown in Figure 3-2.

Figure 3-1. Figure 3-2.

Since we hope to get three clusters from the data, we need to pick 2 points in the coherence graph to separate the data into three groups. According to the meaning of our algorithm, we pick the two points with the greatest drop height, as shown in Figure 3-3.

Figure 3-3.

Then we get the clustering result tree shown in Figure 3-4, which includes each cluster's number and its data indices.

Figure 3-4.

The resulting distribution matrix is as follows (rows a-c are the three clusters, columns the three known patient groups):

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| a(22) | 5              | 8        | 9                |
| b(10) | 5              | 3        | 2                |
| c(4)  | 2              | 1        | 1                |

3.3 Clustering Result Validation

Validation is a very important step for any such clustering algorithm.
It helps rank clustering results, so that we know the quality and reliability of the results. To judge the performance of our algorithm, we choose external indices to evaluate it:

Purity: One way of measuring the quality of a clustering solution is cluster purity. Let k be the number of clusters of the dataset D, and let the size of cluster C_j be |C_j|. Let |C_j|_{class=i} denote the number of items of class i assigned to cluster j. The purity of this cluster is given as:

    Purity(C_j) = (1 / |C_j|) max_i |C_j|_{class=i}

The overall purity of a clustering solution can be expressed as a weighted sum over the individual clusters:

    Purity = \sum_{j=1}^{k} (|C_j| / |D|) Purity(C_j)

In general, the larger the purity value, the better the solution. It should be noted that cluster entropy is a better measure than purity; however, for the purpose of this report we stick to purity.

Rand: The Rand index measures agreement between pairwise decisions. Given a set S of n elements and two partitions X and Y of S to compare, we define the following:

* SS: the number of pairs of elements in S that are in the same set in X and in the same set in Y
* DD: the number of pairs of elements in S that are in different sets in X and in different sets in Y
* SD: the number of pairs of elements in S that are in the same set in X and in different sets in Y
* DS: the number of pairs of elements in S that are in different sets in X and in the same set in Y

    Rand = |Agree| / (|Agree| + |Disagree|) = (|SS| + |DD|) / (|SS| + |SD| + |DS| + |DD|)

We can construct a distribution matrix from our GPX algorithm as follows. A, B, C are the three clusters obtained from the algorithm; the number after each cluster name is that cluster's number of members. Each row gives the distribution of one cluster over the known patient groups; for example, the first row means cluster A contains 4 members, of which 2 come from NP No Meds, 1 from LTNP, and 1 from NP with Meds.
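Both indices are straightforward to compute from cluster and class labels. A toy sketch with hypothetical labels, not the report's data:

```python
from itertools import combinations

def purity(clusters, classes):
    """Weighted purity: sum each cluster's majority-class count, divide by n."""
    n = len(classes)
    correct = 0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        correct += max(members.count(g) for g in set(members))
    return correct / n

def rand_index(x, y):
    """Rand = (|SS| + |DD|) / (number of all pairs)."""
    pairs = list(combinations(range(len(x)), 2))
    agree = sum(1 for i, j in pairs if (x[i] == x[j]) == (y[i] == y[j]))
    return agree / len(pairs)

# Hypothetical toy labels
clusters = ['A', 'A', 'B', 'B', 'B', 'C']
classes = ['p', 'p', 'p', 'q', 'q', 'q']
print(round(purity(clusters, classes), 3))      # 0.833
print(round(rand_index(clusters, classes), 3))  # 0.6
```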
|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(4)  | 2              | 1        | 1                |
| B(10) | 5              | 3        | 2                |
| C(22) | 5              | 8        | 9                |

Based on this matrix and the definition of purity, we get Purity = 0.417 and Rand = 0.5339.

To sum up, when the cluster number is 3 and all 41 genes are used as features, the performance of GPX is as follows:

|        | GPX    |
| Purity | 0.417  |
| Rand   | 0.5339 |

4. Comparison with Other Clustering Methods

In order to judge the performance of GPX, we choose 3 algorithms to compare with it: PAM (partitional clustering), C4.5 (hierarchical clustering), and a fuzzy algorithm (fuzzy clustering).

4.1 The Performance of the Other Algorithms

PAM (Partitioning Around Medoids). PAM has the following features:

* It operates on the dissimilarity matrix of the given data set; when presented with an n × p data matrix, the algorithm first computes a dissimilarity matrix.
* It is robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
* It provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters.

The algorithm proceeds in two steps:

* BUILD step: sequentially selects k "centrally located" objects to be used as the initial medoids.
* SWAP step: if the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, the swap is carried out. This continues until the objective function can no longer be decreased.

The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(2)  | 0              | 1        | 1                |
| B(6)  | 1              | 2        | 3                |
| C(28) | 11             | 9        | 8                |

|        | PAM    |
| Purity | 0.417  |
| Rand   | 0.4274 |

C4.5: C4.5 is an algorithm used to generate a decision tree, developed by Ross Quinlan [5]. C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other.
Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute to split the data. The attribute with the highest normalized information gain is chosen to make the decision, and the algorithm then recurses on the smaller sublists.

The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(1)  | 0              | 1        | 0                |
| B(33) | 12             | 10       | 11               |
| C(2)  | 0              | 1        | 1                |

|        | C4.5   |
| Purity | 0.389  |
| Rand   | 0.3904 |

Fuzzy: In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster, and associated with each element is a set of membership levels. These indicate the strength of the association between that data element and a particular cluster. Fuzzy clustering is the process of assigning these membership levels and then using them to assign data elements to one or more clusters.

The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(7)  | 1              | 2        | 4                |
| B(27) | 11             | 9        | 7                |
| C(2)  | 0              | 1        | 1                |

|        | Fuzzy  |
| Purity | 0.444  |
| Rand   | 0.4861 |

4.2 Analysis

The following table summarizes the performance of all listed algorithms:

|        | GPX    | PAM    | C4.5   | Fuzzy  |
| Purity | 0.417  | 0.417  | 0.389  | 0.444  |
| Rand   | 0.5339 | 0.4274 | 0.3904 | 0.4861 |

We can see from the table that GPX has the highest Rand value among the four algorithms, which means GPX obtained relatively compact and well-separated clusters. Also, the sizes of the GPX clusters are closer to the sizes of the true groups. Its purity, however, is not as good, being lower than that of the fuzzy algorithm. In order to get better performance, we decided to perform feature selection first, which may help optimize the GPX performance. The strengths and weaknesses of the algorithms are shown in Table 1-2 below.
| Algorithm    | Strength | Weakness |
| Fuzzy        | It converges to a local minimum or a saddle point. It is generally more flexible, since elements may belong to more than one cluster. | Unable to handle noisy data and outliers. Needs the number of clusters k to be specified in advance. |
| K-means      | Relatively efficient for large data sets, since its running time is O(t*k*n), where n is the number of objects, k is the number of clusters, and t is the number of iterations. Often terminates at a local optimum. | Unable to handle noisy data and outliers. Needs k to be specified in advance. Applicable only when a mean is defined. |
| Hierarchical | Suitable for data sets with an inherent hierarchical structure, e.g. microarray data. | Does not scale well, since it is O(N^3). Cannot handle non-convex-shaped data. Too sensitive to outliers if single linkage is chosen to measure the distance between clusters. |
| GPX          | Clusters can have arbitrary shape and size. Guided by the users' domain knowledge. The tree structure shows the clusters more clearly. Extensive performance studies exist on both synthetic data sets and real-world gene expression data. | May not be as efficient as some clustering algorithms. In some situations it is sensitive to the boundaries chosen. |

Table 1-2 The strengths and shortcomings of the algorithms

5. Feature Selection

In this part, we use two methods for feature selection: one is based on SVM, the other is the Relief attribute evaluator.

5.1 SVM

The high generalization ability of SVMs is based on the idea of maximizing the margin. The inverse square of the margin is given by:

    \|w\|^2 = \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle

Feature selection methods that use this quantity as a criterion have been proposed by several authors [1,2]. For linear SVMs, it is possible to decompose the quantity into a sum of terms corresponding to the original features:

    \|w\|^2 = \sum_k \sum_{i,j} \alpha_i \alpha_j y_i y_j x_{ki} x_{kj}

where x_{ki} is the kth component of x_i.
Therefore, the contribution of the kth feature to the inverse-square margin is given by:

    c_k = \sum_{i,j} \alpha_i \alpha_j y_i y_j x_{ki} x_{kj} = \left( \sum_i \alpha_i y_i x_{ki} \right)^2

The importance of each feature is evaluated according to its value of c_k, and features with c_k close to zero can be discarded without deteriorating classification performance.

5.2 Relief Attribute Evaluator

Relief evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same class and of a different class. It can operate on both discrete- and continuous-class data [2,3,4]. The original Relief algorithm is given in [2].

5.3 The Feature Selection Results

5.3.1 SVM

The features ranked by SVM are as follows (families taken from Table 1-1):

| Rank | Gene   | Family          | Rank | Gene   | Family          |
| 1    | TIMP_2 | MMP             | 22   | MMP_1  | MMP             |
| 2    | TIMP_4 | MMP             | 23   | CCL5   | Chemokine       |
| 3    | CXCL3  | Chemokine       | 24   | CASP3  | Apoptosis       |
| 4    | MAP2K4 | S.t. molecules  | 25   | TIMP_1 | MMP             |
| 5    | MAP2K5 | S.t. molecules  | 26   | VEGF_A | Cytokine        |
| 6    | MMP_10 | MMP             | 27   | MAP2K7 | S.t. molecules  |
| 7    | CXCL5  | Chemokine       | 28   | COLIAI | B. membrane     |
| 8    | CASP8  | Apoptosis       | 29   | CCR7   | Chemokine       |
| 9    | MMP_14 | MMP             | 30   | IL1B   | Cytokine        |
| 10   | MMP_13 | MMP             | 31   | MMP_2  | MMP             |
| 11   | TGFb_1 | Cytokine        | 32   | CASP5  | Apoptosis       |
| 12   | CCL4   | Chemokine       | 33   | BCL_2  | Apoptosis       |
| 13   | CXCL1  | Chemokine       | 34   | CCL3   | Chemokine       |
| 14   | TNF    | Cytokine        | 35   | CXCR4  | Chemokine       |
| 15   | TNF_a  | Cytokine        | 36   | CXCL2  | Chemokine       |
| 16   | CD40L  | Apoptosis       | 37   | CASP2  | Apoptosis       |
| 17   | MMP_11 | MMP             | 38   | MMP-1  | MMP             |
| 18   | CASP7  | Apoptosis       | 39   | FASL   | Apoptosis       |
| 19   | IL_17  | Cytokine        | 40   | TIMP_3 | MMP             |
| 20   | TGFb2  | Cytokine        | 41   | CASP6  | Apoptosis       |
| 21   | CCL2   | Chemokine       |      |        |                 |

5.3.2 Relief Attribute Evaluator

The features ranked by the Relief attribute evaluator are as follows:

| Rank | Gene   | Family          | Rank | Gene   | Family          |
| 1    | CXCL3  | Chemokine       | 22   | IL_17  | Cytokine        |
| 2    | TIMP_2 | MMP             | 23   | CLDN7  | B. membrane     |
| 3    | TIMP_4 | MMP             | 24   | IL1B   | Cytokine        |
| 4    | MMP_13 | MMP             | 25   | MMP_11 | MMP             |
| 5    | CXCL1  | Chemokine       | 26   | TGFb2  | Cytokine        |
| 6    | MMP_10 | MMP             | 27   | MAP2K4 | S.t. molecules  |
| 7    | CCL4   | Chemokine       | 28   | TIMP_1 | MMP             |
| 8    | MAP2K5 | S.t. molecules  | 29   | CD40L  | Apoptosis       |
| 9    | TGFb_1 | Cytokine        | 30   | TIMP_3 | MMP             |
| 10   | CXCL2  | Chemokine       | 31   | CXCR4  | Chemokine       |
| 11   | MMP_1  | MMP             | 32   | CASP6  | Apoptosis       |
| 12   | CCL2   | Chemokine       | 33   | BCL_2  | Apoptosis       |
| 13   | TNF_a  | Cytokine        | 34   | CCL3   | Chemokine       |
| 14   | CCL5   | Chemokine       | 35   | CASP2  | Apoptosis       |
| 15   | MMP_14 | MMP             | 36   | CASP3  | Apoptosis       |
| 16   | CXCL5  | Chemokine       | 37   | CCR7   | Chemokine       |
| 17   | CASP8  | Apoptosis       | 38   | CASP7  | Apoptosis       |
| 18   | MAP2K7 | S.t. molecules  | 39   | VEGF_A | Cytokine        |
| 19   | TNF    | Cytokine        | 40   | COLIAI | B. membrane     |
| 20   | FASL   | Apoptosis       | 41   | MMP_2  | MMP             |
| 21   | CASP5  | Apoptosis       |      |        |                 |

5.3.3 Analysis of the Results

After multiple experiments using different combinations of features based on the above results, we decided to use four genes, TIMP_2, TIMP_4, MMP_13 and MMP_10, instead of all 41 genes. The reasons are as follows:

(1) The ranks of the features. All four genes rank in the top 10 under both feature selection methods. Moreover, TIMP_2 and TIMP_4 rank 1st and 2nd by SVM and 2nd and 3rd by the Relief attribute evaluator, from which we can see they play an important role in clustering our HIV data. The higher a gene ranks, the more important it is.

(2) They come from the same family. All four genes come from the MMP family, so they may have a stronger connection to each other.

(3) The experiments themselves. As mentioned before, in order to find the best combination of gene features we conducted many experiments on different combinations of genes, and the combination of these four genes gave the best performance.

6. Clustering Based on the Selected Features

After feature selection, we ran the clustering again and compared the performance, so that we can decide whether the selected features perform well.

6.1 GPX Clustering Result

The coherent pattern curve based on the selected genes is shown in Figure 6-1.
Figure 6-1.

Since we hope to get three clusters from the data, we need to pick 2 points in the coherence graph to separate the data into three groups. According to the meaning of our algorithm, we pick the two points with the greatest drop height, as in Figure 6-2.

Figure 6-2.

Then we get the clustering result tree shown in Figure 6-3, which includes each cluster's number and the data indices it includes.

Figure 6-3.

The resulting distribution matrix is as follows:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(22) | 7              | 4        | 11               |
| B(7)  | 2              | 5        | 0                |
| C(7)  | 3              | 3        | 1                |

Result:

|        | GPX   |
| Purity | 0.537 |
| Rand   | 0.579 |

6.2 Results of the Other Clustering Algorithms

PAM. The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(3)  | 1              | 0        | 2                |
| B(31) | 11             | 12       | 8                |
| C(2)  | 0              | 0        | 2                |

|        | PAM    |
| Purity | 0.417  |
| Rand   | 0.4367 |

C4.5. The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(33) | 11             | 12       | 10               |
| B(1)  | 0              | 0        | 1                |
| C(2)  | 1              | 0        | 1                |

|        | C4.5  |
| Purity | 0.389 |
| Rand   | 0.375 |

Fuzzy. The distribution matrix:

|       | NP No Meds(12) | LTNP(12) | NP with Meds(12) |
| A(3)  | 0              | 0        | 3                |
| B(31) | 11             | 12       | 8                |
| C(2)  | 1              | 0        | 1                |

|        | Fuzzy  |
| Purity | 0.444  |
| Rand   | 0.4861 |

6.3 Analysis of the Feature Selection

The following tables compare the algorithms before and after feature selection.

Before gene selection (41 genes):

| Algorithm | Purity | Rand   |
| PAM       | 0.417  | 0.4274 |
| C4.5      | 0.389  | 0.3904 |
| Fuzzy     | 0.444  | 0.4861 |
| GPX       | 0.417  | 0.5339 |

After gene selection (4 genes):

| Algorithm | Purity | Rand   |
| PAM       | 0.417  | 0.4367 |
| C4.5      | 0.389  | 0.375  |
| Fuzzy     | 0.444  | 0.4861 |
| GPX       | 0.537  | 0.579  |

6.4 Conclusion

From the above tables we find that:

1. After gene selection, the purity of GPX increased from 0.417 to 0.537, and its Rand value increased from 0.5339 to 0.579.
2. Since the dimensionality was reduced from 41 to 4, the runtime decreases dramatically.
3. Among the other algorithms, only PAM's Rand value improved slightly; C4.5's decreased and the fuzzy result was unchanged, so the selected features mainly benefit GPX.

Part 2: Genes Clustering

1.
Clustering Based on NP with Meds (Normal Progressors with Meds)

In this part, we cluster the genes based on the normal progressors who take medications; this group has 12 patients (MR000784719, MR000829742, MR000834420, MR000835659, MR000944328, MR000864519, MR000324700, MR000972175, MR000703081, MR000667018, MR000731854, MR00060496).

1.1 The Clustering Result of GPX

1.1.1 Clustering result

The expression profiles are shown in Figure 1-1 and the coherent pattern curve in Figure 1-2.

Figure 1-1. Figure 1-2.

Since the genes belong to 6 families, we pick 5 pulses in the index graph, as in Figure 1-3. The corresponding clustering result is shown in Figure 1-4.

Figure 1-3.

| Cluster | Gene indices |
| 1 | 2, 3, 9, 10, 19, 21, 22, 26, 27, 29, 30, 31, 32, 35, 38 |
| 2 | 11, 39, 41 |
| 3 | 1, 4, 5 |
| 4 | 7, 12, 14, 18, 20, 23, 24, 25, 28, 33, 34, 40 |
| 5 | 8, 15, 17, 36, 37 |
| 6 | 6, 13, 16 |

Figure 1-4.

The distribution matrix (rows a-f are the clusters of Figure 1-4, columns A-F the 6 gene families with their sizes in parentheses):

|   | A(7) | B(10) | C(9) | D(3) | E(10) | F(2) |
| a | 2    | 2     | 4    | 2    | 5     | 0    |
| b | 0    | 1     | 0    | 0    | 1     | 1    |
| c | 3    | 0     | 0    | 0    | 0     | 0    |
| d | 1    | 2     | 5    | 1    | 2     | 1    |
| e | 0    | 3     | 0    | 0    | 2     | 0    |
| f | 1    | 2     | 0    | 0    | 0     | 0    |

1.1.2 Clustering result validation

|        | GPX    |
| Purity | 0.415  |
| Rand   | 0.6871 |

1.2 The Clustering Results of the Other Algorithms

In order to judge the performance of GPX, we choose 3 algorithms to compare with it: PAM (partitional clustering), C4.5 (hierarchical clustering), and a fuzzy algorithm (fuzzy clustering).

|        | PAM    | C4.5   | Fuzzy  |
| Purity | 0.317  | 0.2683 | 0.3415 |
| Rand   | 0.4789 | 0.3587 | 0.5574 |

1.3 The Comparison Between the Algorithms

|        | GPX    | PAM    | C4.5   | Fuzzy  |
| Purity | 0.415  | 0.317  | 0.2683 | 0.3415 |
| Rand   | 0.6871 | 0.4789 | 0.3587 | 0.5574 |

Comparing the other algorithms with GPX on the NP with Meds group, we can see that GPX performs much better, which gives us confidence in the quality of its clustering results. The clusters obtained by GPX are more compact and better separated than those of the other three algorithms.

2.
Clustering Based on NP without Meds (Normal Progressors with No Meds)

In this part, we cluster the genes based on the normal progressors who take no medications; this group has 12 patients (MR000582565, MR000829742, MR000680263, MR000681587, MR000944008, MR001024668, MR000153425, MR000738082, MR000890309, MR000750185, MR000278554, MR000793267).

2.1 The Clustering Result of GPX

The expression profiles are shown in Figure 2-1 and the coherent pattern curve in Figure 2-2.

Figure 2-1. Figure 2-2.

Since the genes belong to 6 families, we pick 5 pulses in the index graph, as in Figure 2-3. The corresponding clustering result is shown in Figure 2-4.

Figure 2-3.

| Cluster | Gene indices |
| 1 | 19, 30, 31, 36, 38, 40 |
| 2 | 17, 18, 20, 21, 22, 24, 32, 35 |
| 3 | 9, 11, 12, 13, 14, 34, 37, 39 |
| 4 | 1, 3, 6, 7, 25, 29, 33, 41 |
| 5 | 2, 5, 8 |
| 6 | 4, 10, 15, 16, 23, 26, 27, 28 |

Figure 2-4.

The distribution matrix:

|   | A(7) | B(10) | C(9) | D(3) | E(10) | F(2) |
| a | 0    | 0     | 1    | 0    | 4     | 1    |
| b | 0    | 1     | 5    | 0    | 2     | 0    |
| c | 0    | 5     | 0    | 0    | 3     | 0    |
| d | 4    | 0     | 1    | 1    | 1     | 1    |
| e | 2    | 1     | 0    | 0    | 0     | 0    |
| f | 1    | 3     | 2    | 2    | 0     | 0    |

2.1.2 Clustering result validation

|        | GPX    |
| Purity | 0.5121 |
| Rand   | 0.765  |

2.2 The Clustering Results of the Other Algorithms

|        | PAM    | C4.5   | Fuzzy  |
| Purity | 0.317  | 0.317  | 0.317  |
| Rand   | 0.3659 | 0.3659 | 0.4261 |

2.3 The Comparison Between the Algorithms

|        | GPX    | PAM    | C4.5   | Fuzzy  |
| Purity | 0.5121 | 0.317  | 0.317  | 0.317  |
| Rand   | 0.765  | 0.3659 | 0.3659 | 0.4261 |

Comparing the other algorithms with GPX on the NP without Meds group, we can see that GPX performs much better, which gives us confidence in the quality of its clustering results. The clusters obtained by GPX are more compact and better separated than those of the other three algorithms.

3. Clustering Based on LTNP (Long-Term Normal Progressors)

In this part, we cluster the genes based on the long-term normal progressors; this group has 12 patients (MR001054915, MR000393119, MR000698654, MR000650839, MR000650855, MR000484265, MR000704971, MR000384372, MR000067052, MR001056420, MR001048432, M001047735).
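The distribution matrices used throughout this part are plain contingency tables of cluster assignment against gene family. A small sketch of how such a table can be built, with hypothetical gene labels rather than the report's assignments:

```python
from collections import Counter

def distribution_matrix(cluster_of, family_of, genes):
    """Cross-tabulate gene families (rows) against clusters (columns)."""
    counts = Counter((family_of[g], cluster_of[g]) for g in genes)
    families = sorted(set(family_of.values()))
    clusters = sorted(set(cluster_of.values()))
    rows = {f: [counts[(f, c)] for c in clusters] for f in families}
    return rows, clusters

# Hypothetical toy assignment, not the report's data
cluster_of = {'g1': 'A', 'g2': 'A', 'g3': 'B', 'g4': 'B'}
family_of = {'g1': 'MMP', 'g2': 'Chemokine', 'g3': 'MMP', 'g4': 'MMP'}
rows, cols = distribution_matrix(cluster_of, family_of, list(cluster_of))
print(cols)         # ['A', 'B']
print(rows['MMP'])  # [1, 2]
```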
3.1 The Clustering Result of GPX

3.1.1 Clustering result

The expression profiles are shown in Figure 3-1 and the coherent pattern curve in Figure 3-2.

Figure 3-1. Figure 3-2.

Since the genes belong to 6 families, we pick 5 pulses in the index graph, as in Figure 3-3. The corresponding clustering result is shown in Figure 3-4.

Figure 3-3.

| Cluster | Gene indices |
| 1 | 3, 6, 9, 10, 11, 12, 13, 14, 16, 19, 21, 22, 25, 26, 30, 31, 35, 36, 38, 40 |
| 2 | 18, 28, 29, 32, 33, 34 |
| 3 | 1, 4, 5, 7, 15, 17, 24, 27 |
| 4 | 2, 41 |
| 5 | 20, 23, 37 |
| 6 | 8, 39 |

Figure 3-4.

The distribution matrix:

|   | A(7) | B(10) | C(9) | D(3) | E(10) | F(2) |
| a | 2    | 7     | 5    | 0    | 5     | 1    |
| b | 0    | 0     | 1    | 2    | 3     | 0    |
| c | 4    | 2     | 1    | 1    | 0     | 0    |
| d | 1    | 0     | 0    | 0    | 0     | 1    |
| e | 0    | 0     | 2    | 1    | 0     | 0    |
| f | 0    | 1     | 0    | 0    | 1     | 0    |

3.1.2 Clustering result validation

|        | GPX    |
| Purity | 0.439  |
| Rand   | 0.6657 |

3.2 The Clustering Results of the Other Algorithms

|        | PAM    | C4.5   | Fuzzy  |
| Purity | 0.317  | 0.2683 | 0.3415 |
| Rand   | 0.3837 | 0.3587 | 0.5145 |

3.3 The Comparison Between the Algorithms

|        | GPX    | PAM    | C4.5   | Fuzzy  |
| Purity | 0.439  | 0.317  | 0.2683 | 0.3415 |
| Rand   | 0.6657 | 0.3837 | 0.3587 | 0.5145 |

Comparing the other algorithms with GPX on the LTNP group, we can see that GPX performs much better, which gives us confidence in the quality of its clustering results. The clusters obtained by GPX are more compact and better separated than those of the other three algorithms.

References

1. I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene selection for cancer classification using support vector machines. Machine Learning, 46:389-422, 2002.
2. Kenji Kira, Larry A. Rendell. A practical approach to feature selection. In: Ninth International Workshop on Machine Learning, 249-256, 1992.
3. Igor Kononenko. Estimating attributes: analysis and extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994.
4. Marko Robnik-Sikonja, Igor Kononenko. An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.
5. J. R. Quinlan. C4.5: Programs for Machine Learning.
Morgan Kaufmann Publishers, 1993.
6. S. B. Kotsiantis. Supervised machine learning: a review of classification techniques. Informatica, 31:249-268, 2007.
7. D. Jiang, J. Pei and A. Zhang. Interactive exploration of coherent patterns in time-series gene expression data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.
8. J. Pei. A general model for online analytical processing of complex data. In Proceedings of the 22nd International Conference on Conceptual Modeling (ER'03), Chicago, IL, October 13-16, 2003.