Clustering with Fitch-Margoliash and UPGMA algorithm
Introduction
Genotype and environment together determine the phenotype of an organism. Sometimes the
environment is most important, at other times the genes are dominant (Cantor et al., 1999). To
investigate the importance of a gene, one must first know its function. The regulation of a gene
shows when the gene becomes important and gives clues to its function: if the function of a gene
is not known, knowledge of its regulation might point in the direction of a certain function.
The traditional way of investigating the function of a gene is to mutate the gene (or a regulatory
component) of interest and to study how the functioning of the organism changes. This method
gives very precise information about a single gene (Misener et al., 2000), but it is slow and
laborious. In view of the large number of identified genes it is no longer feasible, so another
approach is needed.
A new approach to investigating the regulation and function of genes is the simultaneous monitoring
of the activity of many (1000+) genes. A single measurement gives limited information; it is only a
snapshot of the genes’ activity. To get a dynamic view of the activity, and with it a better
understanding of the regulation and perhaps the function, the simultaneous monitoring is repeated in
a time series. The resulting record of a gene’s activity over time is called a Gene Activity
Profile (GAP, Figure 1).
Figure 1. A normalized (maximum = 1) GAP. The vertical axis shows the normalized relative activity of a
hypothetical gene, the horizontal axis the time.
A gene’s activity is basically the rate at which it is expressed. As this rate is difficult to determine,
derived quantities are used to estimate the gene’s activity. In principle both the RNA and the resulting
protein can be used as an approximation. Simply put, the concentration of mRNA is taken as a good
measure for the concentration of the protein the gene codes for. It is also possible to measure the
protein concentration directly, but this is impractical because a specific marker (protein) would be
necessary for each protein. Measuring mRNA levels is more practical; this can be done with the Reverse
Transcriptase Polymerase Chain Reaction (RT-PCR). This reaction copies the RNA into DNA and then
amplifies specific sequences exponentially by doubling them repeatedly. As a result they can be
measured much more accurately (Carr et al., 1997; Eisen et al., 1998; Micheals et al., 1998; Herwig et al., 1999).
Once the GAPs are measured, they have to be analysed. The number of GAPs equals the number of
genes investigated (possibly 1000+), which gives a huge amount of data. Clustering sorts the GAPs
into groups of similar profiles. The idea behind clustering is that two similar GAPs might be
subject to the same regulatory mechanism or have other similarities (Carr et al., 1998; Wen et al.,
1998).
There are many algorithms to cluster datasets: sequential, hierarchical, optimisation-based and other
algorithms. Sequential algorithms are the simplest and fastest; each object is presented only once
and the number of clusters is not restricted. Hierarchical algorithms produce a hierarchy of nested
clusters, either by grouping the objects or by splitting them from each other. The UPGMA method is an
example of a grouping method. At the start each gene forms a cluster on its own. Then, in each
iteration step, the two closest clusters are merged. Optimisation-based algorithms rely on the
optimisation of a cost function. The algorithm evaluates the cost function for a given number of
iterations, provided the number of clusters is prescribed (Houte, van, 2000). The Fitch-Margoliash
clustering algorithm is an optimisation-based algorithm.
Clustering might give a clue to the regulation or function of groups of genes. Genes with a similar
activity profile may be regulated in the same way (Carr et al., 1998; Wen et al., 1998). With prior
knowledge of genes investigated in the traditional way, further studies could be made on genes that
are not yet investigated but seem interesting. To visualize the clustering, a dendrogram can be
constructed from the clustering result, which is stored as a text file in the Newick format
(Felsenstein, 1993); this format has the form (C,(A,B)). A simple dendrogram is shown in Figure 2.
Figure 2. A simple dendrogram with the ‘species’ A, B and C. C was the first to split off, and then A and B split
off from their common branch. In the Newick format this is represented as (C,(A,B)).
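The Newick notation can be generated mechanically from a nested description of the tree. The following is a minimal sketch in Python, assuming the tree is held as nested tuples with strings as leaves; this representation and the function name `to_newick` are illustrative choices and not part of PHYLIP.

```python
def to_newick(node):
    """Serialize a nested-tuple tree to Newick text (without the final ';')."""
    if isinstance(node, str):
        # A leaf is written as its label.
        return node
    # An internal node is a tuple of subtrees, written between brackets.
    return "(" + ",".join(to_newick(child) for child in node) + ")"


tree = ("C", ("A", "B"))      # C splits off first, then A and B
print(to_newick(tree) + ";")  # prints (C,(A,B));
```

In a Newick tree file the description of a tree is closed with a semicolon, which is why the example appends one when printing.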
In view of the relatively large amount of data and the need for flexibility in calculating the distance
matrices, a computer program was written with which the data can easily be modified as one wishes:
DAMAP (DAta MAnipulation Program). The program normalizes the data, calculates different
distance measures and, if needed, adds noise (uniformly or Gaussian distributed) to the data. The
output (a distance matrix) is used in the PHYLIP phylogenetic program package (Felsenstein, 1993),
or more specifically in its FITCH program, which is used in many articles on clustering.
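The two preprocessing steps mentioned here can be sketched as follows. This is not the DAMAP code, which is not reproduced in this text; the function names, the normalization to a maximum of 1 (taken from the caption of Figure 1) and the meaning of the `amplitude` parameter are assumptions made for illustration.

```python
import random


def normalize_max(profile):
    """Scale a profile so that its maximum equals 1."""
    peak = max(profile)
    if peak == 0:
        raise ValueError("cannot normalize a profile whose maximum is 0")
    return [value / peak for value in profile]


def add_noise(profile, amplitude, kind="uniform", rng=random):
    """Return a noisy copy of a profile.

    amplitude is the half-width of the interval for uniform noise and
    the standard deviation for Gaussian noise (both are assumptions).
    """
    if kind == "uniform":
        return [value + rng.uniform(-amplitude, amplitude) for value in profile]
    if kind == "gaussian":
        return [value + rng.gauss(0.0, amplitude) for value in profile]
    raise ValueError("kind must be 'uniform' or 'gaussian'")


gap = [2.0, 4.0, 8.0, 4.0]
print(normalize_max(gap))  # prints [0.25, 0.5, 1.0, 0.5]
```

Passing a seeded `random.Random` instance as `rng` makes a noisy data set reproducible.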
(Dis-)Similarity
In order to cluster the data with some algorithms, the (dis)similarity must be calculated. The
hierarchical Fitch-Margoliash and UPGMA (Unweighted Pair Group Method with Averages)
algorithms require a distance matrix as input. The matrix gives the distance from each gene to
every other gene. The distance between objects a and b can be calculated as the Euclidean distance

$$D_{ab} = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$$

This distance measure is calculated by determining the difference, or dissimilarity, of the two data
points at each time point (see Figure 3), squaring it, summing the squares over all time points, and
taking the square root of this sum. The order of the time points is thus not taken into account. The
vectors a and b are two different profiles and N is the number of data points within the profiles (the
features), so only profiles with the same number of data points can be compared. The index i denotes
the time point of the values a_i and b_i.
Figure 3. Two GAPs with on the y-axis the normalized relative activity of the gene and on the horizontal axis the
time. The distance per time point or data point is indicated with the vertical lines
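The Euclidean distance between two GAPs translates directly into a few lines of Python. The function name and the example profiles below are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two profiles of equal length."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of data points")
    # Sum the squared differences per time point, then take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


a = [0.0, 0.75, 1.0]
b = [0.0, 0.0, 0.0]
print(euclidean_distance(a, b))  # prints 1.25
```

Because the squares are simply summed, reordering the time points of both profiles in the same way leaves the result unchanged.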
If slopes are neglected, no information on the shape of the GAP is used. Slopes contain some
information on the shape. The combination of differences between absolute values, and between
slopes, makes a comparison between two GAPs more sensitive.
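One possible way to combine values and slopes is to append the differences between consecutive time points to each profile and to take the Euclidean distance of the extended vectors. The text does not specify how the slopes are computed or weighted, so the unit time step, the equal weighting and the function names in this sketch are assumptions.

```python
import math


def with_slopes(profile):
    """Append the slopes between consecutive time points (unit time step assumed)."""
    slopes = [later - earlier for earlier, later in zip(profile, profile[1:])]
    return list(profile) + slopes


def distance_with_slopes(a, b):
    """Euclidean distance of two profiles extended with their slopes."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of data points")
    extended_a, extended_b = with_slopes(a), with_slopes(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(extended_a, extended_b)))


peak = [0.0, 1.0, 0.0]
flat = [0.0, 0.0, 0.0]
print(with_slopes(peak))  # prints [0.0, 1.0, 0.0, 1.0, -1.0]
print(distance_with_slopes(peak, flat))
```

With the slopes included, the order of the time points does influence the distance, in contrast to the plain Euclidean distance.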
With the UPGMA algorithm generally a similarity measure is used. Pearson’s correlation
coefficient r is such a similarity measure. The correlation coefficient r_ab of two samples a and b
(Heath, 1995) can be calculated with the following formula

$$r_{ab} = \frac{\sum_{i=1}^{N} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{N} (a_i - \bar{a})^2 \, \sum_{i=1}^{N} (b_i - \bar{b})^2}}, \qquad \bar{a} = \frac{1}{N} \sum_{i=1}^{N} a_i, \quad \bar{b} = \frac{1}{N} \sum_{i=1}^{N} b_i$$
Eisen et al. (1998) first log-transformed the primary data and set the mean terms to zero; the mean
terms can also be taken equal to some reference state.
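The role of the mean terms can be made explicit in code. The sketch below computes Pearson’s correlation coefficient by default and accepts optional offsets that replace the means by a reference state; the function name and the `offset_a` and `offset_b` parameters are hypothetical.

```python
import math


def correlation(a, b, offset_a=None, offset_b=None):
    """Correlation of two profiles; an offset replaces the corresponding mean."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of data points")
    n = len(a)
    ref_a = sum(a) / n if offset_a is None else offset_a
    ref_b = sum(b) / n if offset_b is None else offset_b
    numerator = sum((x - ref_a) * (y - ref_b) for x, y in zip(a, b))
    denominator = math.sqrt(sum((x - ref_a) ** 2 for x in a)
                            * sum((y - ref_b) ** 2 for y in b))
    if denominator == 0:
        raise ValueError("a profile without variation has no defined correlation")
    return numerator / denominator


rising = [1.0, 2.0, 3.0]
falling = [3.0, 2.0, 1.0]
print(correlation(rising, falling))            # prints -1.0
print(correlation(rising, falling, 0.0, 0.0))  # about 0.714
```

The example shows why the choice of reference state matters: relative to their means the two profiles are perfectly anti-correlated, while relative to zero they are still fairly similar.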
Algorithms
The distances between all GAPs are collected in a matrix denoted by Dij; this square matrix has
size M, the number of GAPs. When the Euclidean distance is used there are zeros on the
diagonal, since the distance from a GAP to itself is zero. When, on the other hand, the correlation
coefficient (a similarity measure) is used there are ones on the diagonal, since each GAP is
perfectly similar to itself.
With the Fitch-Margoliash clustering algorithm the aim is to find a matrix dij such that the following
least-squares sum is minimized

$$SS = \sum_{i} \sum_{j} \frac{(D_{ij} - d_{ij})^2}{D_{ij}^{P}}$$

where both indices i and j range from 1 to M, the number of GAPs, and P is a power which
weights the individual squares of the differences between Dij and dij, often referred to as the
observed and the expected distances. The expected distance dij is the distance between GAPs i and j
measured along the branches of the tree.
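The least-squares criterion can be evaluated for any pair of observed and expected matrices. The sketch below only computes the sum and does not search for the best tree; the function name is hypothetical, and skipping the diagonal (where the observed distance is zero) is an assumption made to avoid a division by zero.

```python
def least_squares_sum(observed, expected, power=2.0):
    """Weighted sum of squared differences between two distance matrices."""
    m = len(observed)
    total = 0.0
    for i in range(m):
        for j in range(m):
            if i == j:
                continue  # assumption: the zero diagonal is left out of the sum
            difference = observed[i][j] - expected[i][j]
            total += difference ** 2 / observed[i][j] ** power
    return total


observed = [[0.0, 2.0], [2.0, 0.0]]
expected = [[0.0, 1.0], [1.0, 0.0]]
print(least_squares_sum(observed, expected))             # prints 0.5
print(least_squares_sum(observed, expected, power=0.0))  # prints 2.0
```

With P = 2 a given absolute error counts less for large observed distances; with P = 0 all errors are weighted equally.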
UPGMA is an agglomerative method. At the start each gene forms a cluster. The two clusters
most similar (or closest) to each other, i and j, are merged into a new hypothetical cluster. Let q
denote the merged cluster and s any other cluster; then the distance matrix is updated with

$$D_{qs} = \frac{n_i}{n_i + n_j} D_{is} + \frac{n_j}{n_i + n_j} D_{js}$$

where n_i and n_j are the numbers of genes in clusters i and j. In Eisen et al. (1998) the GAP of the
newly formed cluster is computed by averaging the GAPs of the joined clusters, weighted by the number
of genes they contain, and the distance matrix is computed again. This procedure is repeated until all
genes are merged into one big cluster that forms the root of the binary tree.
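The merging procedure with the matrix update can be sketched in Python as follows. This is an illustration and not the code of the NEIGHBOR program: it returns only the tree topology in Newick notation, without branch lengths, together with the list of merges, and the function name and data layout are assumptions.

```python
def upgma(names, dist):
    """Cluster with UPGMA; returns (newick_topology, merges)."""
    # Active clusters: id -> (Newick label, number of genes).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Distances between active clusters, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Select the closest pair of clusters.
        (i, j), d_ij = min(d.items(), key=lambda item: item[1])
        (label_i, n_i), (label_j, n_j) = clusters.pop(i), clusters.pop(j)
        merges.append((label_i, label_j, d_ij))
        del d[(i, j)]
        # Distance from the merged cluster q to every other cluster s:
        # the average of D_is and D_js weighted by the cluster sizes.
        q = next_id
        next_id += 1
        for s in clusters:
            d_is = d.pop((min(i, s), max(i, s)))
            d_js = d.pop((min(j, s), max(j, s)))
            d[(s, q)] = (n_i * d_is + n_j * d_js) / (n_i + n_j)
        clusters[q] = ("(" + label_i + "," + label_j + ")", n_i + n_j)
    newick, _ = next(iter(clusters.values()))
    return newick + ";", merges


names = ["A", "B", "C"]
dist = [[0.0, 2.0, 6.0],
        [2.0, 0.0, 8.0],
        [6.0, 8.0, 0.0]]
tree, merges = upgma(names, dist)
print(tree)  # prints (C,(A,B));
```

In this example A and B are merged first at distance 2.0, after which the distance from the merged cluster to C becomes (6.0 + 8.0) / 2 = 7.0.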
Both algorithms produce a hierarchy of clusters, presented as a dendrogram or binary tree. The
requested clusters therefore have to be extracted from the dendrogram. Generally a similarity level is
chosen, and the genes that are more similar to each other than this threshold together form one cluster.
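For a tree built by average-linkage merging, choosing a threshold amounts to stopping the merging as soon as the closest pair of clusters is farther apart than the chosen level. The sketch below works on distances rather than similarities and uses the same size-weighted update as UPGMA; the function name and the example data are hypothetical.

```python
def clusters_at_threshold(names, dist, threshold):
    """Merge clusters until the closest pair is farther apart than threshold."""
    clusters = {i: [name] for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while d:
        (i, j), d_ij = min(d.items(), key=lambda item: item[1])
        if d_ij > threshold:
            break  # all remaining clusters are less similar than the chosen level
        members_i, members_j = clusters.pop(i), clusters.pop(j)
        n_i, n_j = len(members_i), len(members_j)
        del d[(i, j)]
        for s in clusters:
            d_is = d.pop((min(i, s), max(i, s)))
            d_js = d.pop((min(j, s), max(j, s)))
            d[(s, next_id)] = (n_i * d_is + n_j * d_js) / (n_i + n_j)
        clusters[next_id] = members_i + members_j
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


names = ["A", "B", "C", "D"]
dist = [[0.0, 1.0, 5.0, 5.0],
        [1.0, 0.0, 5.0, 5.0],
        [5.0, 5.0, 0.0, 1.5],
        [5.0, 5.0, 1.5, 0.0]]
print(clusters_at_threshold(names, dist, 2.0))  # prints [['A', 'B'], ['C', 'D']]
```

Raising the threshold above 5.0 merges everything into one cluster, while a threshold below 1.0 leaves every gene on its own, which illustrates how strongly the result depends on the chosen level.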
In the PHYLIP package (Phylogeny Inference Package) by Felsenstein (1993) the FITCH program
implements the Fitch-Margoliash algorithm and the NEIGHBOR program implements the UPGMA algorithm.
Case-study
The data were derived from neural cells of the rat cervical spinal cord. At nine developmental time
points the mRNA levels of 112 genes were determined (Wen et al., 1998). These data have already been
clustered and studied (Somogyi et al., 1997; Carr et al., 1997; Eisen et al., 1998; Wen et al., 1998).
In their clustering analysis Somogyi and co-workers used the Euclidean distance combined with the
slopes of the GAPs. Subsequently the matrices were clustered with FITCH, a phylogenetic clustering
program that uses the Fitch-Margoliash algorithm.
The FITCH and NEIGHBOR programs cluster the 112 genes of Micheals et al. with the Fitch-Margoliash
and the UPGMA algorithm, respectively. After clustering, the output file of FITCH or NEIGHBOR, which
contains the tree in Newick format, is entered in the program DRAWGRAM, which produces a graphical
cladogram.
The Fitch-Margoliash clustering algorithm is a grouping hierarchical method, so a dendrogram can be
constructed from its output. The FITCH program from the PHYLIP package automatically saves the tree
in an internationally accepted format, which can be read by several graphical dendrogram programs,
including PHYLIP’s own DRAWGRAM and DRAWTREE. An example of a clustering of the data from Wen et
al., done with FITCH and visualized with DRAWGRAM, is depicted in Figure 5 in the cladogram style.
Figure 6 presents the clustering in the phenogram style. Finally, the FITCH clustering results are
given in table form in Table 1.
Figure 5. Figure from Somogyi et al. (1997). They clustered expression values, combined with slopes, via a
Euclidean distance matrix with FITCH. The 112 gene expression patterns were grouped into six different groups;
the mean ‘wave’ of each group is depicted next to it, except for the ‘other’ group.
Figure 6. FM clustering of Euclidean distances with slopes in phenogram style. This clustering was derived in
exactly the same way as by Somogyi et al. (1997) (see Figure 5). No differences from their result were
detected, except for the swap of nAChRa3 and nAChRa4, which is due to a mix-up between the two labels.
Group (genes)   Gene name
1 (27)          Brm, cellubrevin, cyclin B, G67180/86, G67186, GDNF, IGF_II, IGFR1, IGFR2, Ins1, IP3R3,
                keratin, MK2, nAChRa3, nAChRa5, nAChRa6, nAChRd, nACHRe, nestin, NMDA2D, NT3,
                PDGFb, PDGFR, SC6, SC7, TH, trk
2 (20)          ACHE, GAD65, GAT1, Gra2, Gra3, Gra5, GRb1, GRb3, GRg2, MAP2, mGluR3, mGluR8,
                nAChRa4, neno, NMDA2B, NOS, pre-GAD67, statin, synaptophysin, trkC
3 (21)          5HT1b, 5HT1c, 5HT2, 5HT3, GAD67, Gra4, GRb2, GRg3, L1, mAChR2, mAChR3, mGluR2,
                mGluR4, mGluR5, mGluR6, mGLuR7, nAChRa2, nAChRa7, NFL, NMDA1, NMDA2C
4 (17)          aFGF, bFGF, cfos, ChAT, cjun, CNTF, GFAP, GRg1, IP3R2, mAChR4, mGluR1, MOG, NFH,
                NFM, NGF, NMDA2A, S100
constant (21)   actin, BDNF, CCO1, CCO2, CNTFR, CRAF, cyclin A, DD63.2, FGFR, H2AZ, IGF_I, Ins2, InsR,
                IP3R1, ODC, PDGFa, PTN, SC1, SOD, TCP, TGFR
other (6)       EFGR, EGF, GAP43, Gra1, SC2, trkB
Table 1. Results of the FITCH clustering in table form. This is the same clustering as shown in Figure 6.
The numbers in the group column are the original group designations from Somogyi et al. (1997).
nAChRa4 and nAChRa3 have already been swapped to their correct groups for convenience, and the numbers
between brackets are the numbers of genes in each cluster.
We now analyse a subset of the 112 genes, namely the nine genes underlined in Table 1: keratin,
cellubrevin, nestin, MAP2, GAP43, L1, NFL, NFM and NFH. Besides the Fitch-Margoliash algorithm we
also use the UPGMA algorithm.
The distance matrix Dij reads

9
keratin      0.000 1.506 1.537 2.348 2.241 1.961 2.083 1.896 2.017
cellubrevin  1.506 0.000 1.866 2.050 1.667 1.964 1.645 1.791 1.629
nestin       1.537 1.866 0.000 1.843 1.902 1.756 1.812 1.743 2.008
MAP2         2.348 2.050 1.843 0.000 0.770 1.057 1.322 1.606 1.389
GAP43        2.241 1.667 1.902 0.770 0.000 1.125 1.256 1.685 1.522
L1           1.961 1.964 1.756 1.057 1.125 0.000 1.119 1.794 1.634
NFL          2.083 1.645 1.812 1.322 1.256 1.119 0.000 1.829 1.678
NFM          1.896 1.791 1.743 1.606 1.685 1.794 1.829 0.000 0.932
NFH          2.017 1.629 2.008 1.389 1.522 1.634 1.678 0.932 0.000

The first line gives the number of genes. In the input file for the PHYLIP programs the gene name
occupies a field of 10 characters and the distances are separated by a single space.
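Writing a matrix in this layout is easy to automate. The sketch below is an illustration and not the DAMAP output routine: the function name is hypothetical, names longer than 10 characters are simply truncated, and the extra space after the name field is added on the assumption that the PHYLIP programs skip blanks when reading the numbers.

```python
def format_phylip_matrix(names, dist):
    """Return a square distance matrix as text in the PHYLIP input layout."""
    lines = [str(len(names))]  # first line: the number of genes
    for name, row in zip(names, dist):
        field = name[:10].ljust(10)  # name field of exactly 10 characters
        values = " ".join(f"{value:.3f}" for value in row)
        lines.append(field + " " + values)
    return "\n".join(lines) + "\n"


names = ["keratin", "cellubrevin"]
dist = [[0.0, 1.506],
        [1.506, 0.0]]
print(format_phylip_matrix(names, dist), end="")
```

Note that cellubrevin is written as cellubrevi because of the 10-character name field, so names that only differ after the tenth character would have to be renamed first.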
This distance matrix is used to generate trees with the Fitch-Margoliash algorithm, and the UPGMA
algorithm. The tree calculated with the Fitch-Margoliash algorithm is given in Figure 7.
Figure 7. The dendrogram of the Fitch-Margoliash algorithm for the reduced set of GAPs. This tree can be
compared with the tree presented in Figure 6 for the complete set of 112 genes.
With the UPGMA algorithm the tree is similar (Figure 8).
Figure 8. The dendrogram of the UPGMA algorithm for the reduced set of GAPs. This tree can be compared with
the tree presented in Figure 6 for the complete set of 112 genes and with the tree in Figure 7 for the Fitch-Margoliash algorithm.
From the two trees presented in Figures 7 and 8 we conclude that they are similar and would lead to
the same clusters. However, GAP43 now belongs to the same cluster as MAP2, while in the tree for the
112 genes it did not. This is not surprising, for GAP43 belonged to the ‘garbage’ cluster
‘other’ in Figures 5 and 6. Furthermore, L1 belonged to the same cluster as NFL in the full set of
genes, but in the tree of the reduced set of genes it is closer to MAP2 than to NFL. These results
indicate that the number of genes taken into account in the calculation of the tree, and
subsequently of the final clusters, is also important.
Remarks
A problem of dendrograms is that defining the final clusters is subjective, for the user has to decide
where to draw the line that creates clusters from the tree. The choice of the best number of clusters,
that is, of the threshold in the tree, should be based on an algorithm, but in practice it is often
made by visual inspection.
Literature
Attwood TK, Parry-Smith DJ; Introduction to bioinformatics; Addison Wesley Longman; 1999.
Eisen MB, Spellman PT, Brown PO, Botstein D; Cluster analysis and display of genome wide expression
patterns, Proc. Natl. Acad. Sci. USA; Vol 95; pp. 14863-14868, 1998.
Cantor CR, Cassandra LS; Genomics; Wiley-Interscience; 1999.
Carr DB, Somogyi R, Micheals G; Templates for Looking at Gene Expression Clustering; Stat. Comput. &
Stat. Graph. Newsltr.; April, pp. 20-29, 1997.
Felsenstein J; PHYLIP (Phylogeny Inference Package) version 3.5c; 1993.
Hamming RW; Coding and Information Theory 2nd Ed.; Prentice-Hall; 1986.
Herwig RH, Poutska AJ, Müller C, Bull C, Lehrach H, O’Brien J; Large-scale clustering of cDNA-fingerprinting data; Genome Research; Vol. 9; pp. 1093-1105, 1999.
Houte, van BPP, Cluster analyse: Theorie en toepassingen in de Biologie (in Dutch, Cluster analysis: Theory
and Application in Biology), Masters Thesis, 2000.
Jongman R, Braak ter C, Tongeren van O; Data analysis in community and landscape ecology; Pudoc;
1987.
Micheals GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R; Cluster analysis and data visualization
of large scale gene expression data; Pac. Symp. Biocomputing 3; pp. 42-53, 1998.
Misener S, Krawetz SA; Methods in Molecular Biology Vol. 132, Bioinformatics Methods and Protocols;
Humana Press, 2000.
Somogyi R, Fuhrman S, Askenazi M, Wuensche A; The Gene Expression Matrix: Towards The Extraction of
Genetic Network Architectures; Proc. of second World. Cong. of Nonlin. Analysts; Vol. 30(3); pp 1815-1824,
1997.
Theodoridis S, Koutroumbas K; Pattern recognition; Academic Press; 1999.
Wen X, Fuhrman S, Micheals GS, Carr DB, Smith S, Barker JL, Somogyi R; Large-scale temporal gene
expression mapping of central nervous system development; Proc. Natl. Acad. Sci. USA; Vol. 95; pp. 334-339, 1998.
Zhang MQ; Large-Scale Gene Expression Data Analysis: A New Challenge to Computational Biologists;
Genome Research; Vol. 9; pp. 681-688, 1999.