Clustering Report

The Report of HIV Data Analysis
Part I Patients Clustering
1. Data Set Introduction
Our HIV data set contains 36 distinct patients, with expression values for 41 genes for each patient.
1.1 About Genes
The 41 genes come from 6 families: the MMP family, the Apoptosis family, the Chemokine
family, the Cytokine family, the Signal transduction molecules family, and the Basement
membrane proteins family. The detailed information about the genes is shown in Table 1-1 below.
Family name                            Genes
MMP family (10)                        MMP-1, MMP-2, MMP-10, MMP-11, MMP-13, MMP-14, TIMP-1, TIMP-2, TIMP-3, TIMP-4
Apoptosis family (9)                   BCL2, CASP2, CASP3, CASP5, CASP6, CASP7, CASP8, CD40L, FASL
Chemokine family (10)                  CCL3, CCL4, CCL5, CXCL1, CXCL2, CXCL3, CXCL5, CCR7, CCL2, CXCR4
Cytokine family (7)                    TNF, VEGF-A, TGFb1, IL1B, IL-17, TGFb2, TNFa
Signal transduction molecules (3)      MAP2K4, MAP2K5, MAP2K7
Basement membrane proteins (2)         CLDN7, COL1A1
Table 1-1 The info of the 6 gene families
1.2 About Patients
We have 36 patients from 3 groups: NP No Meds (normal progressors with no meds),
LTNP (long-term normal progressors), and NP with Meds (normal progressors with meds).
The detailed information about the patients is shown in Table 1-2 below.
Patient group                                  Patient numbers
NP No Meds (normal progressors with no meds)   MR000582565, MR000829742, MR000680263, MR000681587, MR000944008, MR001024668, MR000153425, MR000738082, MR000890309, MR000750185, MR000278554, MR000793267
LTNP (long-term normal progressors)            MR001054915, MR000393119, MR000698654, MR000650839, MR000650855, MR000484265, MR000704971, MR000384372, MR000067052, MR001056420, MR001048432, M001047735
NP with Meds (normal progressors with meds)    MR000784719, MR000829742, MR000834420, MR000835659, MR000944328, MR000864519, MR000324700, MR000972175, MR000703081, MR000667018, MR000731854, MR00060496
Table 1-2 The info of the patient groups
2. GPX Introduction
We have designed and implemented a clustering tool named GPX (Gene Pattern eXplorer),
written in Java.
2.1 The Contribution of GPX
• Proposes a framework of interactive exploration for the analysis of multidimensional data.
This approach supports exploration by users as guided by their domain knowledge and
accommodates disparate user requirements for varying degrees of coherence in different
parts of the data set.
• Develops a novel strategy for handling “intermediate” data items. A user first identifies
coherent patterns. He/she can then determine the borders of groups of coexpressed genes on
the basis of the distance between a data item and the coherent patterns. In particular, an
“intermediate” data item is allowed to participate in more than one cluster.
• Designs a coherent pattern index to give users ranked indications of the existence of coherent
patterns. To derive a coherent pattern index, we adopt a density-based model to describe the
coherence relationship between data items and devise an attraction tree structure to
summarize the coherence information for the interactive exploration.
2.2 Features of GPX
2.2.1 Interactive Exploration Operations
A user can explore the data set and its coherent patterns by unfolding a hierarchy of the data
items and patterns. The exploration starts from the root. To help a user decide how to split
the data items and detect the patterns, a coherent pattern index graph [7] is used at each node of
the tree to illustrate the cluster structure in the corresponding subset of data at the node. Each
pulse in the coherent pattern index graph indicates the potential existence of a coherent pattern,
and a higher pulse represents a stronger indication. Based on the index graph, GPX supports
several exploration operations [8]. Two essential ones are drill-down and roll-up.
Drill-down.
A user can select the pulse(s) in the index graph, and the system will split the data items
accordingly. Each split subset of data items becomes a child node of the current node. If the user
does not specify any pulse, the system will choose the highest pulse by default and split the data
items accordingly.
Roll-up.
A user can revoke any drill-down operation. The user can select a node and undo the drill-down
operation from this node. All descendants of this node will be deleted. The user can also roll up
one node A to its parent P. This is equivalent to skipping the selection of the pulse in P's index
graph that corresponds to A, and undoing the drill-down operation for P.
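As an illustration, here is a minimal Python sketch of how such an exploration hierarchy could be represented. It is not GPX's actual code; `split_fn` is a hypothetical stand-in for the index-graph-based split described above.

```python
class ExploreNode:
    """One node of the exploration hierarchy: a subset of the data items."""

    def __init__(self, items):
        self.items = items        # indices of the data items at this node
        self.children = []        # child subsets created by drill-down

    def drill_down(self, split_fn, selected_pulses=None):
        # split_fn(items, pulses) -> list of item subsets; if no pulses are
        # given, it is expected to split on the highest pulse by default.
        for subset in split_fn(self.items, selected_pulses):
            self.children.append(ExploreNode(subset))
        return self.children

    def roll_up(self):
        # Revoke the drill-down at this node: all descendants are deleted.
        self.children = []
```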
2.2.2 A Robust Model for Clusters and Patterns
Most existing clustering methods try to find the clusters based on some global criteria, and then
derive the coherent patterns as the centroids of the clusters. Such strategies may be sensitive to a
large number of intermediate data items in data sets. In contrast to those methods, GPX adopts a
novel strategy: it first explores the hierarchy of coherent patterns in the data set and then finds
the groups of data items according to the coherent patterns.
In GPX, a cluster of data items is modeled as a dense area in the multidimensional data space.
Data items at the center of the dense area have relatively high density and present the coherent
pattern of the whole cluster. Data items at the periphery of the dense area have relatively low
density and will be attracted toward the center area level by level. Through this density-based
model, GPX can distinguish the coherent data items from intermediate data items by their
relative density. The coherent pattern in a dense area is represented by the profile of the data item
that has the highest local density in the dense area. Other data items in the same dense area can
be sorted in a list according to the similarity (from high to low) between their profiles and the
coherent pattern. Since the intermediate data items have low similarity to the coherent pattern,
they are at the rear part of the sorted list. Users can set up a similarity threshold and thus cut the
intermediate data items from the cluster.
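A minimal sketch of this density-based model, assuming Euclidean distance, a fixed radius for the local density, and inverse distance as the similarity measure (GPX's exact definitions may differ):

```python
import numpy as np

def extract_cluster(X, radius=1.0, sim_threshold=0.5):
    """X: (n_items, n_dims) profiles of the data items in one dense area."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    density = (dist < radius).sum(axis=1)      # local density of each item
    center = int(density.argmax())             # the highest-density item ...
    pattern = X[center]                        # ... represents the coherent pattern
    sim = 1.0 / (1.0 + np.linalg.norm(X - pattern, axis=1))
    order = np.argsort(-sim)                   # sort members from high to low similarity
    # intermediate items have low similarity and are cut off by the threshold
    members = [int(i) for i in order if sim[i] >= sim_threshold]
    return pattern, members
```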
2.2.3 Graphical Interface
A graphical interface provides users with a direct impression of the trends in the data's coherence levels.
Users may interpret the underlying biological process and decide whether the coherent data items
should be further split into finer subgroups. In GPX, we use parallel coordinates to illustrate
the profiles of data items. We also visualize the whole hierarchical structure of the coherent
patterns. Users can browse the hierarchical tree, select a node and apply the exploration
operations.
3. Clustering by GPX
3.1 The meaning of Clustering on our data
In this part we will cluster the 36 patient samples based on the 41 gene features, which can tell
us which patients belong to the same group. For any given unknown patients, we can apply
clustering based on the gene features, which could tell us which patients may be in the
same situation (for example, taking the same medicine/treatment).
3.2 The Clustering on our data
Since our 36 patients come from 3 groups, what we hope to do is separate the data into
three clusters and compare the result with what we already know. The expression profiles are
shown in Figure 3-1, and the coherent pattern curve for clustering number k=3 is shown in Figure 3-2.
Figure 3-1
Figure 3-2
Since we hope to get three clusters from the data, we need to pick 2 pulses in the coherent
pattern index graph to separate the data into three groups. Following the meaning of our algorithm,
we pick the two pulses with the greatest drop height, as shown in Figure 3-3.
Figure 3-3
Then we can get the clustering result tree as shown in Figure 3-4, which shows each cluster and
the indices of its data items.
Figure 3-4
The resulting distribution matrix is as follows (rows a-c are the clusters found by GPX;
columns A-C are the three known groups of 12 patients each):

       A(12)   B(12)   C(12)
a        5       8       9
b        5       3       2
c        2       1       1
3.3 Clustering result validation
Validation is a very important step for any clustering algorithm. It helps rank clustering
results, so that we know the quality and reliability of the clustering results. To judge the
performance of our algorithm, we choose the following indices:
External Index:
Purity:
One of the ways of measuring the quality of a clustering solution is cluster purity. Let k be the
number of clusters of the dataset D and let |C_j| be the size of cluster C_j. Let |C_j|_{class=i}
denote the number of items of class i assigned to cluster j. The purity of this cluster is given as:

    P(C_j) = \frac{1}{|C_j|} \max_i |C_j|_{class=i}

The overall purity of a clustering solution can be expressed as a weighted sum over the
individual clusters:

    Purity = \sum_j \frac{|C_j|}{|D|} P(C_j)

In general, the larger the value of purity, the better the solution. It should be noted that
cluster entropy is a better measure than purity; however, for the purpose of this report, we
will stick to purity.
Rand
The Rand index measures agreement between pairwise decisions. Given a set S of n elements and
two partitions X and Y of S to compare, we define the following:

• SS: the number of pairs of elements of S that are in the same set in X and in the same set in Y
• DD: the number of pairs of elements of S that are in different sets in X and in different sets in Y
• SD: the number of pairs of elements of S that are in the same set in X and in different sets in Y
• DS: the number of pairs of elements of S that are in different sets in X and in the same set in Y

    Rand = \frac{|Agree|}{|Agree| + |Disagree|} = \frac{|SS| + |DD|}{|SS| + |SD| + |DS| + |DD|}
We can construct a distribution matrix from our GPX algorithm as follows.
A, B, and C are the three clusters obtained from the algorithm; the number following each
cluster name is the number of members in that cluster. Each row gives the distribution of one
known group; for example, the NP No Meds row means that this group has 12 members, of which
2 are placed in cluster A, 5 in cluster B, and 5 in cluster C.
                     A(4)   B(10)   C(22)
NP No Meds(12)         2      5       5
LTNP(12)               1      3       8
NP with Meds(12)       1      2       9
Based on this matrix and the definition of purity, we get Purity = 0.417, and the Rand index is
Rand = 0.5339. To sum up, when the cluster number is 3 and all 41 genes are used as features,
the performance of GPX is as follows:

        Purity    Rand
GPX     0.417     0.5339
4. Compare with other Clustering method
In order to judge the performance of GPX, we choose 3 algorithms to compare with it: PAM
(partitional clustering), C4.5 (hierarchical clustering), and the Fuzzy algorithm (fuzzy
clustering).
4.1 The performance of the other algorithms
PAM (Partitioning Around Medoids)
PAM has the following features:
• It operates on the dissimilarity matrix of the given data set; when it is presented with an
n × p data matrix, the algorithm first computes a dissimilarity matrix.
• It is robust, because it minimizes a sum of dissimilarities instead of a sum of squared
Euclidean distances.
• It provides a novel graphical display, the silhouette plot, which allows the user to select
the optimal number of clusters.
The algorithm proceeds in two steps:
• BUILD step: this step sequentially selects k "centrally located" objects to be used as the
initial medoids.
• SWAP step: if the objective function can be reduced by interchanging (swapping) a selected
object with an unselected object, the swap is carried out. This continues until the objective
function can no longer be decreased.
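A compact sketch of the two steps on a precomputed dissimilarity matrix. The BUILD step here is a simplified greedy variant, so this is an illustration of the idea rather than the exact PAM implementation used in this report:

```python
import numpy as np

def pam(dist, k, max_iter=100):
    """dist: (n, n) dissimilarity matrix; returns medoid indices and cluster labels."""
    n = dist.shape[0]
    # BUILD (simplified): greedily pick k objects minimizing total dissimilarity
    medoids = [int(dist.sum(axis=1).argmin())]
    while len(medoids) < k:
        cost = [dist[:, medoids + [c]].min(axis=1).sum() if c not in medoids
                else np.inf for c in range(n)]
        medoids.append(int(np.argmin(cost)))
    # SWAP: exchange a medoid with a non-medoid while the objective decreases
    best = dist[:, medoids].min(axis=1).sum()
    improved, it = True, 0
    while improved and it < max_iter:
        improved, it = False, it + 1
        for m in range(k):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:m] + [c] + medoids[m + 1:]
                cost = dist[:, trial].min(axis=1).sum()
                if cost < best:
                    medoids, best, improved = trial, cost, True
    return medoids, dist[:, medoids].argmin(axis=1)
```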
The distribution matrix:

                     A(2)   B(6)   C(28)
NP No Meds(12)         0      1      11
LTNP(12)               1      2       9
NP with Meds(12)       1      3       8

        Purity    Rand
PAM     0.417     0.4274
C4.5:
C4.5 is an algorithm used to generate a decision tree developed by Ross Quinlan [5]. C4.5 builds
decision trees from a set of training data in the same way as ID3, using the concept of
information entropy. At each node of the tree, C4.5 chooses one attribute of the data that most
effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is
the normalized information gain (difference in entropy) that results from choosing an attribute
for splitting the data. The attribute with the highest normalized information gain is chosen to
make the decision. The C4.5 algorithm then recurses on the smaller sublists.
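For illustration, the sketch below computes the entropy-based information gain for one numeric split; C4.5 additionally normalizes this gain by the split information (the gain ratio), which is omitted here for brevity:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(values, labels, threshold):
    """Entropy reduction from splitting numeric `values` at `threshold`."""
    left = labels[values <= threshold]
    right = labels[values > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
    return entropy(labels) - weighted
```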
The distribution matrix:

                     A(1)   B(33)   C(2)
NP No Meds(12)         0      12      0
LTNP(12)               1      10      1
NP with Meds(12)       0      11      1

        Purity    Rand
C4.5    0.389     0.3904
Fuzzy:
In hard clustering, data is divided into distinct clusters, where each data element belongs to
exactly one cluster. In fuzzy clustering, data elements can belong to more than one cluster, and
associated with each element is a set of membership levels. These indicate the strength of the
association between that data element and a particular cluster. Fuzzy clustering is a process of
assigning these membership levels, and then using them to assign data elements to one or more
clusters.
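As an illustration, the sketch below computes the standard fuzzy c-means membership matrix for given cluster centers. The report does not state which fuzzy variant was used, so this is one common choice, not necessarily the exact algorithm compared here:

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    """One membership update of fuzzy c-means: u[i, c] in [0, 1], rows sum to 1.

    m > 1 is the fuzzifier; larger m gives softer memberships.
    """
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    d = np.fmax(d, 1e-12)   # avoid division by zero at a center
    # u[i, c] = 1 / sum_k (d[i, c] / d[i, k]) ** (2 / (m - 1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)
```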
The distribution matrix:

                     A(7)   B(27)   C(2)
NP No Meds(12)         1      11      0
LTNP(12)               2       9      1
NP with Meds(12)       4       7      1

        Purity    Rand
Fuzzy   0.444     0.4861
4.2 Analysis
The following table summarizes the performance of all listed algorithms.

          GPX      PAM      C4.5     Fuzzy
Purity    0.417    0.417    0.389    0.444
Rand      0.5339   0.4274   0.3904   0.4861
We can see from the table that GPX has the highest Rand value among the four algorithms, which
means GPX produced relatively compact and well-separated clusters. Also, the sizes of the groups
found by GPX are closer to the sizes of the true groups. However, its purity is not as good,
being lower than that of Fuzzy. In order to get better performance, we decided to perform
feature selection first, which may help optimize GPX's performance. The strengths and
weaknesses of each algorithm are shown in Table 4-1 below.
Fuzzy
  Strengths:
  • It converges to a local minimum or a saddle point.
  • It is generally more flexible, since a data element may belong to more than one cluster.
  Weaknesses:
  • Unable to handle noisy data and outliers.
  • Need to specify k, the number of clusters, in advance.

Kmeans
  Strengths:
  • It is relatively efficient for large datasets, since its running time is O(t*k*n), where n
    is the number of objects, k is the number of clusters, and t is the number of iterations.
  Weaknesses:
  • Applicable only when a mean is defined.
  • Unable to handle noisy data and outliers.
  • Need to specify k, the number of clusters, in advance.
  • It often terminates at a local optimum.

Hierarchical
  Strengths:
  • Suitable for data sets with an inherent hierarchical structure, e.g. microarray data.
  Weaknesses:
  • Does not scale well, since it is O(N^3).
  • It cannot handle non-convex-shaped data.
  • Too sensitive to outliers if single linkage is chosen to measure the distance between
    clusters.

GPX
  Strengths:
  • Clusters can have arbitrary shape and size.
  • Guided by users' domain knowledge.
  • The tree structure shows the clusters more clearly.
  • Extensive performance study on both synthetic data sets and some real-world gene
    expression data.
  Weaknesses:
  • May not be as efficient as some clustering algorithms.
  • In some situations it is sensitive to the boundaries you choose.

Table 4-1 The strengths and weaknesses of the algorithms
5. Feature Selection
In this part, we use two methods to do feature selection: one is based on SVM, and the other is
Relief attribute evaluation.
5.1 SVM
The high generalization ability of the SVM is based on the idea of maximizing the margin. The
inverse square of the margin is given by:

    \|w\|^2 = \sum_k w_k^2

Feature selection methods that use the above quantity as a criterion have been proposed by
several authors [1,2]. For linear SVMs, it is possible to decompose the above quantity into a
sum of terms corresponding to the original features:

    \|w\|^2 = \sum_k \Big( \sum_i \alpha_i y_i x_{ki} \Big)^2

where x_{ki} is the kth component of x_i. Therefore, the contribution of the kth feature to the
inverse-square margin can be given by:

    w_k^2 = \Big( \sum_i \alpha_i y_i x_{ki} \Big)^2

The importance of each feature is evaluated according to its value of w_k^2, and features
having w_k^2 close to zero can be discarded without deteriorating classification performance.
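A sketch of this criterion using scikit-learn's LinearSVC, ranking features by their squared weights w_k^2 (summed over the one-vs-rest classifiers in the multiclass case). The hyperparameters are assumptions; the exact settings used for this report are not recorded:

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_feature_ranking(X, y, C=1.0):
    """X: (n_patients, n_genes) expression matrix; y: group labels."""
    clf = LinearSVC(C=C, max_iter=10000).fit(X, y)
    w2 = (clf.coef_ ** 2).sum(axis=0)   # w_k^2 per feature
    return np.argsort(-w2)              # feature indices, most important first
```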
5.2 Relief Attribution Evaluates
Relief evaluates the worth of an attribute by repeatedly sampling an instance and considering
the value of the given attribute for the nearest instance of the same and of a different class.
It can operate on both discrete and continuous class data [2,3,4]. The original Relief
algorithm is sketched below.
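The following is a reconstruction of the original Relief algorithm from Kira and Rendell [2], for two-class data with numeric attributes (the nearest-neighbour distance and the normalization are the usual choices, stated here as assumptions):

```python
import numpy as np

def relief(X, y, n_samples=100, seed=None):
    """Estimate a relevance weight for each of the p attributes of X."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0) + 1e-12   # normalizes diffs per attribute
    w = np.zeros(p)
    for _ in range(n_samples):
        i = rng.integers(n)                        # randomly select instance R
        dist = np.abs(X - X[i]).sum(axis=1)        # Manhattan distance to R
        dist[i] = np.inf                           # exclude R itself
        hits = np.where(y == y[i])[0]
        misses = np.where(y != y[i])[0]
        hit = hits[np.argmin(dist[hits])]          # nearest hit (same class)
        miss = misses[np.argmin(dist[misses])]     # nearest miss (other class)
        # weights rise for attributes that separate R from its nearest miss
        # and fall for attributes that differ from its nearest hit
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span
    return w / n_samples
```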
5.3 The feature selection result
5.3.1 SVM
The features ranked by SVM are as follows:

Feature #  Gene Name  Family             Feature #  Gene Name  Family
1          TIMP_2     MMP                22         MMP_1      MMP
2          TIMP_4     MMP                23         CCL5       Chemokine
3          CXCL3      Chemokine          24         CASP3      Apoptosis
4          MAP2K4     S.t.molecules      25         TIMP_1     MMP
5          MAP2K5     S.t.molecules      26         VEGF_A     Cytokine
6          MMP_10     MMP                27         MAP2K7     S.t.molecules
7          CXCL5      Chemokine          28         COL1A1     B.membrane
8          CASP8      Apoptosis          29         CCR7       Chemokine
9          MMP_14     MMP                30         IL1B       Cytokine
10         MMP_13     MMP                31         MMP_2      MMP
11         TGFb_1     Cytokine           32         CASP5      Apoptosis
12         CCL4       Chemokine          33         BCL_2      Apoptosis
13         CXCL1      Chemokine          34         CCL3       Chemokine
14         TNF        Cytokine           35         CXCR4      Chemokine
15         TNF_a      Cytokine           36         CXCL2      Chemokine
16         CD40L      Apoptosis          37         CASP2      Apoptosis
17         MMP_11     MMP                38         MMP-1      MMP
18         CASP7      Apoptosis          39         FASL       Apoptosis
19         IL_17      Cytokine           40         TIMP_3     MMP
20         TGFb2      Cytokine           41         CASP6      Apoptosis
21         CCL2       Chemokine
5.3.2 Relief Attribution Evaluates
The features ranked by Relief attribute evaluation are as follows:

Feature #  Gene Name  Family             Feature #  Gene Name  Family
1          CXCL3      Chemokine          22         IL_17      Cytokine
2          TIMP_2     MMP                23         CLDN7      B.membrane
3          TIMP_4     MMP                24         IL1B       Cytokine
4          MMP_13     MMP                25         MMP_11     MMP
5          CXCL1      Chemokine          26         TGFb2      Cytokine
6          MMP_10     MMP                27         MAP2K4     S.t.molecules
7          CCL4       Chemokine          28         TIMP_1     MMP
8          MAP2K5     S.t.molecules      29         CD40L      Apoptosis
9          TGFb_1     Cytokine           30         TIMP_3     MMP
10         CXCL2      Chemokine          31         CXCR4      Chemokine
11         MMP_1      MMP                32         CASP6      Apoptosis
12         CCL2       Chemokine          33         BCL_2      Apoptosis
13         TNF_a      Cytokine           34         CCL3       Chemokine
14         CCL5       Chemokine          35         CASP2      Apoptosis
15         MMP_14     MMP                36         CASP3      Apoptosis
16         CXCL5      Chemokine          37         CCR7       Chemokine
17         CASP8      Apoptosis          38         CASP7      Apoptosis
18         MAP2K7     S.t.molecules      39         VEGF_A     Cytokine
19         TNF        Cytokine           40         COL1A1     B.membrane
20         FASL       Apoptosis          41         MMP_2      MMP
21         CASP5      Apoptosis
5.3.3 Analysis based on the results
After multiple experiments using different combinations of features based on the above results,
we decided to use four genes instead of the whole set of 41 genes:

TIMP_2
TIMP_4
MMP_13
MMP_10
The reasons are as follows:
(1) The ranks of the features
All four genes rank in the top 10 under both feature selection methods. What's more, TIMP_2 and
TIMP_4 rank 1st and 2nd in SVM and 2nd and 3rd in Relief attribute evaluation, from which we
can see that they play an important role in clustering our HIV data. The higher a gene ranks,
the more important it is.
(2) They come from the same family
All four genes come from the gene family MMP, so these 4 genes may have a stronger connection.
(3) Based on the experiments
As mentioned before, in order to find the best combination of gene features, we conducted many
experiments on different combinations of genes. The combination of these four genes gave us the
best performance.
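For reproducibility, restricting the expression matrix to the four selected genes is a one-liner; the sketch below assumes the data lives in a pandas DataFrame with one column per gene (the file name and column labels are hypothetical):

```python
import pandas as pd

selected = ["TIMP_2", "TIMP_4", "MMP_13", "MMP_10"]    # the four chosen genes
expr = pd.read_csv("hiv_expression.csv", index_col=0)  # hypothetical file: 36 patients x 41 genes
expr_selected = expr[selected]                         # 36 patients x 4 genes
```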
6. Clustering based on the feature selection
After feature selection, we conducted the clustering again and compared the performance, so
that we can decide whether the features we have selected perform well.
6.1 GPX Clustering result
The coherent pattern curve based on the selected genes is shown in Figure 6-1.
Figure 6-1
Since we hope to get three clusters from the data, we need to pick 2 pulses in the coherent
pattern index graph to separate the data into three groups. Following the meaning of our
algorithm, we pick the two pulses with the greatest drop height, as shown in Figure 6-2.
Figure 6-2
Then we can get the clustering result tree as in Figure 6-3, which shows each cluster and the
indices of its data items.
Figure 6-3
The resulting distribution matrix is as follows:

                     A(22)   B(7)   C(7)
NP No Meds(12)         7      2      3
LTNP(12)               4      5      3
NP with Meds(12)      11      0      1
Result:

        Purity    Rand
GPX     0.537     0.579
6.2 The other Clustering algorithms result
PAM:
The distribution matrix:

                     A(3)   B(31)   C(2)
NP No Meds(12)         1     11      0
LTNP(12)               0     12      0
NP with Meds(12)       2      8      2

        Purity    Rand
PAM     0.417     0.4367
C4.5:
The distribution matrix:

                     A(1)   B(33)   C(2)
NP No Meds(12)         0     11      1
LTNP(12)               0     12      0
NP with Meds(12)       1     10      1

        Purity    Rand
C4.5    0.389     0.375
Fuzzy:
The distribution matrix:

                     A(3)   B(31)   C(2)
NP No Meds(12)         0     11      1
LTNP(12)               0     12      0
NP with Meds(12)       3      8      1

        Purity    Rand
Fuzzy   0.444     0.4861
6.3 The analysis of feature selection
The following table shows the comparison between the algorithms before and after feature
selection.

Before gene selection (gene num: 41)
Algorithm    Purity    Rand
PAM          0.417     0.4274
C4.5         0.389     0.3904
Fuzzy        0.444     0.4861
GPX          0.417     0.5339

After gene selection (gene num: 4)
Algorithm    Purity    Rand
PAM          0.417     0.4367
C4.5         0.389     0.375
Fuzzy        0.444     0.4861
GPX          0.537     0.579
6.4 Conclusion
From the above tables we can find that:
1. After gene selection, the purity of GPX increased from 0.417 to 0.537, and its Rand value
increased from 0.5339 to 0.579.
2. Since the dimensionality was reduced from 41 to 4, the runtime decreases dramatically.
3. The Rand value of PAM also improved slightly, while C4.5 and Fuzzy stayed roughly the same
or decreased, so the selected features mainly benefit GPX.
Part 2 Genes Clustering
1. Clustering based on NP with Meds (normal progressors with meds)
In this part, we cluster the genes based on the normal progressors who take medicines. This
group has 12 patients (MR000784719, MR000829742, MR000834420, MR000835659, MR000944328,
MR000864519, MR000324700, MR000972175, MR000703081, MR000667018, MR000731854, MR00060496).
1.1 The clustering result of GPX
1.1.1 Clustering result
The expression profiles are shown in Figure 1-1 and the coherent pattern curve is shown in Figure 1-2.
Figure 1-1
Figure 1-2
Since there are 6 families of genes, we need to pick 5 pulses in the index graph, as in
Figure 1-3. The corresponding clustering result is shown in Figure 1-4.
Figure 1-3
Cluster 1: 2, 3, 9, 10, 19, 21, 22, 26, 27, 29, 30, 31, 32, 35, 38
Cluster 2: 11, 39, 41
Cluster 3: 1, 4, 5
Cluster 4: 7, 12, 14, 18, 20, 23, 24, 25, 28, 33, 34, 40
Cluster 5: 8, 15, 17, 36, 37
Cluster 6: 6, 13, 16
Figure 1-4
The distribution matrix (rows a-f are the clusters found by GPX; columns A-F are the six
known gene families):

       A(7)   B(10)   C(9)   D(3)   E(10)   F(2)
a        2      2       4      2      5       0
b        0      1       0      0      1       1
c        3      0       0      0      0       0
d        1      2       5      1      2       1
e        0      3       0      0      2       0
f        1      2       0      0      0       0
1.1.2 Clustering result validation

        Purity    Rand
GPX     0.415     0.6871
1.2 The clustering result of other algorithms
In order to judge the performance of GPX, we choose 3 algorithms to compare with it: PAM
(partitional clustering), C4.5 (hierarchical clustering), and the Fuzzy algorithm (fuzzy
clustering).
         Purity    Rand
PAM      0.317     0.4789
C4.5     0.2683    0.3587
Fuzzy    0.3415    0.5574
1.3 The comparison between the algorithms

         Purity    Rand
GPX      0.415     0.6871
PAM      0.317     0.4789
C4.5     0.2683    0.3587
Fuzzy    0.3415    0.5574
Comparing the performance of the other algorithms in the table above with our GPX algorithm on
the NP with Meds group, we can easily see that GPX performs much better, from which we can
trust that GPX produces higher-quality clustering results. The clusters obtained by GPX are
more compact and better separated than those of the other three algorithms.
2. Clustering based on NP No Meds (normal progressors with no meds)
In this part, we cluster the genes based on the normal progressors who do not take medicines.
This group has 12 patients (MR000582565, MR000829742, MR000680263, MR000681587, MR000944008,
MR001024668, MR000153425, MR000738082, MR000890309, MR000750185, MR000278554, MR000793267).
2.1 The clustering result of GPX
The expression profiles are shown in Figure 2-1 and the coherent pattern curve is shown in Figure 2-2.
Figure 2-1
Figure 2-2
Since there are 6 families of genes, we need to pick 5 pulses in the index graph, as in
Figure 2-3. The corresponding clustering result is shown in Figure 2-4.
Figure 2-3
Cluster 1: 19, 30, 31, 36, 38, 40
Cluster 2: 17, 18, 20, 21, 22, 24, 32, 35
Cluster 3: 9, 11, 12, 13, 14, 34, 37, 39
Cluster 4: 1, 3, 6, 7, 25, 29, 33, 41
Cluster 5: 2, 5, 8
Cluster 6: 4, 10, 15, 16, 23, 26, 27, 28
Figure 2-4
The distribution matrix (rows a-f are the clusters found by GPX; columns A-F are the six
known gene families):

       A(7)   B(10)   C(9)   D(3)   E(10)   F(2)
a        0      0       1      0      4       1
b        0      1       5      0      2       0
c        0      5       0      0      3       0
d        4      0       1      1      1       1
e        2      1       0      0      0       0
f        1      3       2      2      0       0
2.1.2 Clustering result validation

        Purity    Rand
GPX     0.5121    0.765
2.2 The clustering result of other algorithms

         Purity    Rand
PAM      0.317     0.3659
C4.5     0.317     0.3659
Fuzzy    0.317     0.4261
2.3 The comparison between the algorithms

         Purity    Rand
GPX      0.5121    0.765
PAM      0.317     0.3659
C4.5     0.317     0.3659
Fuzzy    0.317     0.4261
Comparing the performance of the other algorithms in the table above with our GPX algorithm on
the NP No Meds group, we can easily see that GPX performs much better, from which we can trust
that GPX produces higher-quality clustering results. The clusters obtained by GPX are more
compact and better separated than those of the other three algorithms.
3. Clustering based on LTNP (long-term normal progressors)
In this part, we cluster the genes based on the long-term normal progressors. This group has 12
patients (MR001054915, MR000393119, MR000698654, MR000650839, MR000650855, MR000484265,
MR000704971, MR000384372, MR000067052, MR001056420, MR001048432, M001047735).
3.1 The clustering result of GPX
3.1.1 Clustering result
The expression profiles are shown in Figure 3-1 and the coherent pattern curve is shown in Figure 3-2.
Figure 3-1
Figure 3-2
Since there are 6 families of genes, we need to pick 5 pulses in the index graph, as in
Figure 3-3. The corresponding clustering result is shown in Figure 3-4.
Figure 3-3
Cluster 1: 3, 6, 9, 10, 11, 12, 13, 14, 16, 19, 21, 22, 25, 26, 30, 31, 35, 36, 38, 40
Cluster 2: 18, 28, 29, 32, 33, 34
Cluster 3: 1, 4, 5, 7, 15, 17, 24, 27
Cluster 4: 2, 41
Cluster 5: 20, 23, 37
Cluster 6: 8, 39
Figure 3-4
The distribution matrix (rows a-f are the clusters found by GPX; columns A-F are the six
known gene families):

       A(7)   B(10)   C(9)   D(3)   E(10)   F(2)
a        2      7       5      0      5       1
b        0      0       1      2      3       0
c        4      2       1      1      0       0
d        1      0       0      0      0       1
e        0      0       2      1      0       0
f        0      1       0      0      1       0
3.1.2 Clustering result validation

        Purity    Rand
GPX     0.439     0.6657
3.2 The clustering result of other algorithms

         Purity    Rand
PAM      0.317     0.3837
C4.5     0.2683    0.3587
Fuzzy    0.3415    0.5145
3.3 The comparison between the algorithms

         Purity    Rand
GPX      0.439     0.6657
PAM      0.317     0.3837
C4.5     0.2683    0.3587
Fuzzy    0.3415    0.5145
Comparing the performance of the other algorithms in the table above with our GPX algorithm on
the LTNP group, we can easily see that GPX performs much better, from which we can trust that
GPX produces higher-quality clustering results. The clusters obtained by GPX are more compact
and better separated than those of the other three algorithms.
References
1. I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene selection for cancer classification using
support vector machines. Machine Learning, 46:389-422, 2002.
2. K. Kira, L. A. Rendell. A practical approach to feature selection. In: Ninth International
Workshop on Machine Learning, 249-256, 1992.
3. I. Kononenko. Estimating attributes: analysis and extensions of RELIEF. In: European
Conference on Machine Learning, 171-182, 1994.
4. M. Robnik-Sikonja, I. Kononenko. An adaptation of Relief for attribute estimation in
regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.
5. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
6. S. B. Kotsiantis. Supervised machine learning: a review of classification techniques.
Informatica, 31:249-268, 2007.
7. D. Jiang, J. Pei, A. Zhang. Interactive exploration of coherent patterns in time-series gene
expression data. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (KDD'03), Washington, DC, USA, August 24-27, 2003.
8. J. Pei. A general model for online analytical processing of complex data. In Proceedings of
the 22nd International Conference on Conceptual Modeling (ER'03), Chicago, IL, October 2003.