Clustering with Fitch-Margoliash and UPGMA algorithm

Introduction

Genotype and environment together determine the phenotype of an organism. Sometimes the environment is most important, but at other times the genes are dominant (Cantor et al., 1999). To investigate the importance of a gene one must first know its function. Furthermore, a gene’s regulation shows when the gene becomes important and gives clues about its function: if a gene’s function is not known, knowledge of its regulation might point towards a certain function.

The traditional way of investigating the function of a gene is to mutate the gene (or regulatory component) of interest and to investigate how the functioning of the organism changes. This method yields very precise information about a single gene (Misener et al., 2000), but it is slow and costs much effort. In view of the large number of identified genes it is no longer feasible, so another approach is needed.

A new approach to investigating the regulation and function of genes is the simultaneous monitoring of the activity of many (1000+) genes. A single measurement gives limited information; it is only a snapshot of the genes’ activity. To get a dynamic view of the genes’ activity, and with it a better understanding of the regulation and perhaps the function, the simultaneous monitoring is repeated in a time series. This gives a dynamic portrait of the genes’ activity in time. These data are called a Gene Activity Profile (GAP, Figure 1).

Figure 1. A normalized (maximum = 1) GAP, with the normalized relative activity of a hypothetical gene on the vertical axis and time on the horizontal axis.

A gene’s activity is basically the rate at which it is expressed. As this rate is difficult to determine, derived quantities are used to estimate the gene’s activity. In principle both the RNA and the resulting protein can be used as an approximation.
Simply put, the concentration of mRNA is a good measure of the concentration of the protein the gene codes for. It is also possible to measure the protein concentration directly, but this is impractical because a specific marker (protein) would be needed for each protein. Measuring mRNA levels is more practical; this can be done with Reverse Transcriptase Polymerase Chain Reaction (RT-PCR). This reaction first copies the RNA into cDNA and then amplifies specific sequences exponentially by doubling them repeatedly. As a result it is possible to measure them much more accurately (Carr et al., 1997; Eisen et al., 1998; Micheals et al., 1998; Herwig et al., 1999).

Once the GAPs are measured, they have to be analysed. The number of GAPs equals the number of genes investigated (possibly 1000+), which gives a huge amount of data to be analysed. With clustering the GAPs are sorted into groups of similar profiles. The idea behind clustering is that two similar GAPs might be subject to the same regulatory mechanism or have other similarities (Carr et al., 1997; Wen et al., 1998).

There are many algorithms to cluster datasets: sequential, hierarchical, optimum-based and other algorithms. Sequential algorithms are the simplest and fastest; each object is submitted only once and the number of clusters is not prescribed. Hierarchical algorithms produce a hierarchy of nested clusters, either by grouping the objects or by splitting them from each other. The UPGMA method is an example of a grouping method. At the start each gene forms a cluster on its own. Then, in each iteration step, two clusters are merged. Optimum-based algorithms rely on the optimisation of a cost function. The algorithm calculates the cost of the clustering of the dataset over a given number of iterations, provided the number of clusters is prescribed (van Houte, 2000). The Fitch-Margoliash clustering algorithm is an optimum-based algorithm.

Clustering might give a clue to the regulation or function of groups of genes.
Genes with a similar activity profile may be regulated in the same way (Carr et al., 1997; Wen et al., 1998). With prior knowledge of genes investigated in the traditional way, further studies could be made on genes that have not yet been investigated but seem interesting.

With the clustering information a dendrogram can be constructed to visualize the clustering. The tree is stored in a text file in the Newick format, which has the form (C,(A,B)) (Felsenstein, 1993). A simple dendrogram is shown in Figure 2.

Figure 2. A simple dendrogram with the ‘species’ A, B and C. C was the first to split off, and then A and B split off from their common branch. In the Newick format this is represented as (C,(A,B)).

In view of the relatively large amount of data and the need for flexibility in calculating the distance matrices, a computer program was written that can easily modify the data as one wishes: DAMAP (DAta MAnipulation Program). The program normalizes the data, calculates different distance measures and, if needed, adds noise (uniformly or Gaussian distributed) to the data. The output (a distance matrix) is used in the PHYLIP phylogenetic program package (Felsenstein, 1993), more specifically in the FITCH program, which is used in many articles on clustering.

(Dis-)Similarity

In order to cluster the data with some algorithms the (dis)similarity must be calculated. The hierarchical Fitch-Margoliash and UPGMA (Unweighted Pair Group Method with Averages) algorithms require a distance matrix as input. The matrix gives the distance from each gene to every other gene. The distance between objects a and b can be calculated as the Euclidean distance

$$D_{ab} = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2}$$

It is a distance measure, calculated by taking the difference (dissimilarity) between the two profiles at each time point (see Figure 3), squaring it, summing the squares over all time points, and then taking the square root of this sum. The order of the time points is thus not taken into account.
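As an illustration of the normalization and the Euclidean distance described above, the following Python sketch may help. It is not the DAMAP program; the function names and the two example profiles are hypothetical, and the normalization assumes that each profile has a positive maximum.

```python
import math

def normalize(profile):
    """Scale a GAP so that its maximum equals 1 (assumes a positive maximum)."""
    peak = max(profile)
    return [value / peak for value in profile]

def euclidean(a, b):
    """Euclidean distance between two profiles with the same number of time points."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same number of time points")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical raw activities of two genes at five time points.
gap_a = normalize([2.0, 4.0, 8.0, 6.0, 4.0])
gap_b = normalize([1.0, 3.0, 5.0, 5.0, 2.0])
print(euclidean(gap_a, gap_b))

# Reordering the time points of both profiles in the same way leaves the
# distance unchanged: the order of the time points plays no role.
order = [4, 2, 0, 3, 1]
print(euclidean([gap_a[k] for k in order], [gap_b[k] for k in order]))
```

Both printed values agree up to rounding, which is the sense in which the Euclidean distance ignores the order of the time points.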
The vectors a and b are two different profiles and N is the number of data points within the profiles (the features), so only profiles with the same number of data points can be compared. The index i denotes the time point of the values a_i and b_i.

Figure 3. Two GAPs, with the normalized relative activity of the gene on the vertical axis and time on the horizontal axis. The distance per time point (data point) is indicated by the vertical lines.

If slopes are neglected, no information on the shape of the GAP is used. Slopes contain some information on the shape. Combining the differences between absolute values with the differences between slopes makes a comparison between two GAPs more sensitive.

With the UPGMA algorithm a similarity measure is generally used. Pearson's correlation coefficient r is such a similarity measure. The correlation coefficient r_ab of two samples a and b (Heath, 1995) can be calculated with the following formula

$$r_{ab} = \frac{\sum_i a_i b_i - \frac{1}{N}\sum_i a_i \sum_i b_i}{\sqrt{\left(\sum_i a_i^2 - \frac{1}{N}\left(\sum_i a_i\right)^2\right)\left(\sum_i b_i^2 - \frac{1}{N}\left(\sum_i b_i\right)^2\right)}}$$

In Eisen et al. (1998) the mean terms are set to zero, although they can also be set equal to some reference state; the primary data were log-transformed first.

Algorithms

The distances between all GAPs are collected in a matrix denoted by D_ij. This square matrix has size M × M, where M is the number of GAPs. When the Euclidean distance is used there are zeros on the diagonal, since the distance from a GAP to itself is zero. When, on the other hand, the correlation coefficient (a similarity measure) is used, there are ones on the diagonal, since each GAP is perfectly similar to itself.
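The correlation formula and the remark about the diagonal of the matrix can be checked with a small Python sketch. The function names and the three profiles are hypothetical, and the sketch assumes that no profile is constant, since a constant profile would make the denominator zero.

```python
import math

def pearson(a, b):
    """Pearson correlation of two profiles, written in terms of sums."""
    n = len(a)
    sum_a, sum_b = sum(a), sum(b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    sum_aa = sum(x * x for x in a)
    sum_bb = sum(y * y for y in b)
    numerator = sum_ab - sum_a * sum_b / n
    # Assumption: neither profile is constant, so the denominator is non-zero.
    denominator = math.sqrt((sum_aa - sum_a ** 2 / n) * (sum_bb - sum_b ** 2 / n))
    return numerator / denominator

def similarity_matrix(profiles):
    """M x M matrix of pairwise correlation coefficients."""
    return [[pearson(p, q) for q in profiles] for p in profiles]

# Three hypothetical normalized GAPs measured at five time points.
profiles = [
    [0.25, 0.5, 1.0, 0.75, 0.5],
    [0.2, 0.6, 1.0, 1.0, 0.4],
    [1.0, 0.8, 0.4, 0.2, 0.1],
]
S = similarity_matrix(profiles)
for row in S:
    print(" ".join("%6.3f" % value for value in row))
```

The printed matrix is symmetric with ones on the diagonal. The first and third profiles move in opposite directions over most of the time course, so their coefficient is negative.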
With the Fitch-Margoliash clustering algorithm the aim is to find a matrix d_ij such that the following least-squares sum is minimized

$$SS = \sum_i \sum_j \frac{(D_{ij} - d_{ij})^2}{D_{ij}^P}$$

where both indices i and j range from 1 to M, the number of GAPs, and P is a power that weights the individual squared differences between D_ij and d_ij, often referred to as the observed and expected distances.

The UPGMA is an agglomerative method. At the start each gene forms a cluster. The two clusters most similar (or closest) to each other, i and j, are merged, forming a hypothetical cluster. Let q denote the merged cluster and s any other cluster; then the distance matrix is updated with

$$D_{qs} = \frac{n_i}{n_i + n_j} D_{is} + \frac{n_j}{n_i + n_j} D_{js}$$

where n_i and n_j are the numbers of genes in clusters i and j. In Eisen et al. (1998) the GAP of this newly formed cluster is computed by averaging the GAPs of the joined clusters, weighted by the number of genes they contain, and the distance matrix is computed again. This procedure is repeated until all genes are merged into one big cluster that forms the root of the binary tree.

Both algorithms produce a hierarchy of clusters, presented as dendrograms or binary trees. The requested clusters therefore have to be extracted from the dendrogram. Generally a similarity level is chosen, and the genes that are more similar to each other than this threshold together form one cluster. In the PHYLIP package (Phylogeny Inference Package) by Felsenstein (1993) the FITCH program implements the Fitch-Margoliash algorithm and the NEIGHBOR program implements the UPGMA algorithm.

Case-study

The data were derived from neural cells of the rat cervical spinal cord. At nine developmental time points the mRNA levels of 112 genes were determined (Wen et al., 1998). These data have already been clustered and studied (Somogyi et al., 1997; Carr et al., 1997; Eisen et al., 1998; Wen et al., 1998). In their clustering analysis Somogyi and co-workers used the Euclidean distance combined with slopes of the GAPs.
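How values and slopes are combined is not spelled out here. One possible reading, shown in the Python sketch below, is to append the differences between successive time points to each profile and to take the Euclidean distance of the extended profiles. This is an assumption for illustration only: the function names and profiles are hypothetical, and any scaling of the slopes or correction for unequal time intervals is left out.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def with_slopes(profile):
    """Extend a profile with its slopes (differences between successive points).

    Assumption: equally spaced time points and no rescaling of the slopes.
    """
    slopes = [later - earlier for earlier, later in zip(profile, profile[1:])]
    return list(profile) + slopes

# Two hypothetical normalized GAPs with similar values but different shapes.
flat = [0.5, 0.5, 0.5]
wavy = [0.4, 0.6, 0.4]

print(euclidean(flat, wavy))                            # values only
print(euclidean(with_slopes(flat), with_slopes(wavy)))  # values and slopes
```

The second distance is larger than the first because the slope terms add the difference in shape to the difference in level.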
Subsequently the matrices were clustered with FITCH, a phylogenetic clustering program that uses the Fitch-Margoliash algorithm. The FITCH and NEIGHBOR programs cluster the 112 genes of Micheals et al. with the Fitch-Margoliash and UPGMA algorithms, respectively. After clustering, the output file of FITCH or NEIGHBOR, a tree in Newick format, is entered into the program DRAWGRAM, which produces a graphical cladogram.

The Fitch-Margoliash clustering algorithm uses a combining hierarchical method, so a dendrogram can be constructed from its output. The FITCH program from the PHYLIP package automatically saves the tree in an internationally accepted format, which can be read by several graphical dendrogram programs, including PHYLIP's own DRAWGRAM and DRAWTREE. An example of a clustering of the data derived from Wen et al., done with FITCH and visualized with DRAWGRAM, is depicted in Figure 5 in the cladogram style. Figure 6 presents the clustering in the phenogram style. Finally, the FITCH clustering results are given in table form in Table 1.

Figure 5. Figure from Somogyi et al. (1997). They clustered expression values, combined with slopes, via a Euclidean distance matrix with FITCH. The 112 gene expression patterns were grouped into six different groups, with the mean ‘wave’ depicted next to each group except the ‘other’ group.

Figure 6. FM clustering of Euclidean distances with slopes, in phenogram style. This clustering was derived in exactly the same way as by Somogyi et al. (1997) (see Figure 5). No differences from that phenogram were detected, except for the swap of nAChRa3 and nAChRa4, which is due to a mix-up between the two labels.
Group      Genes  Gene names
1          27     Brm, cellubrevin, cyclin B, G67180/86, G67186, GDNF, IGF_II, IGFR1, IGFR2, Ins1, IP3R3, keratin, MK2, nAChRa3, nAChRa5, nAChRa6, nAChRd, nACHRe, nestin, NMDA2D, NT3, PDGFb, PDGFR, SC6, SC7, TH, trk
2          20     ACHE, GAD65, GAT1, Gra2, Gra3, Gra5, GRb1, GRb3, GRg2, MAP2, mGluR3, mGluR8, nAChRa4, neno, NMDA2B, NOS, pre-GAD67, statin, synaptophysin, trkC
3          21     5HT1b, 5HT1c, 5HT2, 5HT3, GAD67, Gra4, GRb2, GRg3, L1, mAChR2, mAChR3, mGluR2, mGluR4, mGluR5, mGluR6, mGLuR7, nAChRa2, nAChRa7, NFL, NMDA1, NMDA2C
4          17     aFGF, bFGF, cfos, ChAT, cjun, CNTF, GFAP, GRg1, IP3R2, mAChR4, mGluR1, MOG, NFH, NFM, NGF, NMDA2A, S100
constant   21     actin, BDNF, CCO1, CCO2, CNTFR, CRAF, cyclin A, DD63.2, FGFR, H2AZ, IGF_I, Ins2, InsR, IP3R1, ODC, PDGFa, PTN, SC1, SOD, TCP, TGFR
other      6      EFGR, EGF, GAP43, Gra1, SC2, trkB

Table 1. Results of the FITCH clustering in table form. This table comes from the same clustering as the result of Figure 7. The entries in the group column are the original group designations from Somogyi et al. (1997). nAChRa4 and nAChRa3 have already been swapped to their correct groups for convenience, and the second column gives the number of genes in each cluster.

We now analyse a subset of the 112 genes, consisting of the nine genes keratin, cellubrevin, nestin, MAP2, GAP43, L1, NFL, NFM and NFH (underlined in the original Table 1). Besides the Fitch-Margoliash algorithm we also use the UPGMA algorithm.
The distance matrix D_ij reads

    9
    keratin      0.000 1.506 1.537 2.348 2.241 1.961 2.083 1.896 2.017
    cellubrevin  1.506 0.000 1.866 2.050 1.667 1.964 1.645 1.791 1.629
    nestin       1.537 1.866 0.000 1.843 1.902 1.756 1.812 1.743 2.008
    MAP2         2.348 2.050 1.843 0.000 0.770 1.057 1.322 1.606 1.389
    GAP43        2.241 1.667 1.902 0.770 0.000 1.125 1.256 1.685 1.522
    L1           1.961 1.964 1.756 1.057 1.125 0.000 1.119 1.794 1.634
    NFL          2.083 1.645 1.812 1.322 1.256 1.119 0.000 1.829 1.678
    NFM          1.896 1.791 1.743 1.606 1.685 1.794 1.829 0.000 0.932
    NFH          2.017 1.629 2.008 1.389 1.522 1.634 1.678 0.932 0.000

The first line gives the number of genes. In the input file for PHYLIP each row starts with the gene name in a fixed-width field (marked as 10 characters in the original listing), followed by the distances separated by single spaces.

This distance matrix is used to generate trees with the Fitch-Margoliash algorithm and with the UPGMA algorithm. The tree calculated with the Fitch-Margoliash algorithm is given in Figure 7.

Figure 7. The dendrogram of the Fitch-Margoliash algorithm for the reduced set of GAPs. This tree can be compared with the tree presented in Figure 6 for the complete set of 112 genes.

With the UPGMA algorithm the tree is similar (Figure 8).

Figure 8. The dendrogram of the UPGMA algorithm for the reduced set of GAPs. This tree can be compared with the tree presented in Figure 6 for the complete set of 112 genes and with the tree in Figure 7 for the Fitch-Margoliash algorithm results.

From the two trees presented in Figures 7 and 8 we conclude that they are similar and would lead to the same clusters. However, gene GAP43 now belongs to the same cluster as MAP2, whereas in the tree for the 112 genes it did not. This is not surprising, for GAP43 belonged to the ‘garbage’ cluster ‘other’ in Figures 5 and 6. Furthermore, L1 belonged to the same cluster as NFL in the full set of genes, but in the tree of the reduced set of genes it is closer to MAP2 than to NFL. We conclude that the number of genes taken into account in the calculation of the tree, and subsequently of the final clusters, is also important.
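The UPGMA procedure can be retraced on the nine-gene distance matrix with the following Python sketch. It is not the NEIGHBOR or FITCH program: the function names are hypothetical, the Newick string is written without branch lengths, and the least-squares sum is only evaluated for the UPGMA tree (with P = 2 chosen as an example and the diagonal skipped), whereas FITCH searches for the tree that minimizes it.

```python
names = ["keratin", "cellubrevin", "nestin", "MAP2", "GAP43", "L1", "NFL", "NFM", "NFH"]
D = [
    [0.000, 1.506, 1.537, 2.348, 2.241, 1.961, 2.083, 1.896, 2.017],
    [1.506, 0.000, 1.866, 2.050, 1.667, 1.964, 1.645, 1.791, 1.629],
    [1.537, 1.866, 0.000, 1.843, 1.902, 1.756, 1.812, 1.743, 2.008],
    [2.348, 2.050, 1.843, 0.000, 0.770, 1.057, 1.322, 1.606, 1.389],
    [2.241, 1.667, 1.902, 0.770, 0.000, 1.125, 1.256, 1.685, 1.522],
    [1.961, 1.964, 1.756, 1.057, 1.125, 0.000, 1.119, 1.794, 1.634],
    [2.083, 1.645, 1.812, 1.322, 1.256, 1.119, 0.000, 1.829, 1.678],
    [1.896, 1.791, 1.743, 1.606, 1.685, 1.794, 1.829, 0.000, 0.932],
    [2.017, 1.629, 2.008, 1.389, 1.522, 1.634, 1.678, 0.932, 0.000],
]

def upgma(names, D):
    """Return the Newick string, the list of merges and the tree distances d_ij."""
    clusters = {k: (name, [k]) for k, name in enumerate(names)}  # id -> (label, leaves)
    dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
    tree_d = [[0.0] * len(names) for _ in names]
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        # Merge the two closest clusters i and j.
        (i, j), d_ij = min(dist.items(), key=lambda item: item[1])
        del dist[(i, j)]
        label_i, members_i = clusters.pop(i)
        label_j, members_j = clusters.pop(j)
        n_i, n_j = len(members_i), len(members_j)
        merges.append((label_i, label_j, d_ij))
        # In the UPGMA tree, leaves joined in this step are d_ij apart.
        for a in members_i:
            for b in members_j:
                tree_d[a][b] = tree_d[b][a] = d_ij
        # Update rule: D_qs = (n_i * D_is + n_j * D_js) / (n_i + n_j).
        q = next_id
        next_id += 1
        for s in clusters:
            d_is = dist.pop((min(i, s), max(i, s)))
            d_js = dist.pop((min(j, s), max(j, s)))
            dist[(s, q)] = (n_i * d_is + n_j * d_js) / (n_i + n_j)
        clusters[q] = ("(%s,%s)" % (label_i, label_j), members_i + members_j)
    newick = next(iter(clusters.values()))[0] + ";"
    return newick, merges, tree_d

def sum_of_squares(D, d, power=2.0):
    """Least-squares sum SS over all pairs i != j (the diagonal is skipped)."""
    m = len(D)
    return sum((D[i][j] - d[i][j]) ** 2 / D[i][j] ** power
               for i in range(m) for j in range(m) if i != j)

newick, merges, tree_d = upgma(names, D)
ss = sum_of_squares(D, tree_d)
print(newick)
print(merges[0])
print(ss)
```

The first merge joins MAP2 and GAP43 at distance 0.770, the smallest entry of the matrix, and L1 subsequently joins this pair rather than NFL, in line with the observation above that L1 is closer to MAP2 than to NFL in the reduced set.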
Remarks

A problem with dendrograms is that defining the final clusters is subjective, for the user has to decide where to draw the line that creates clusters from the tree. The choice of the best number of clusters, that is, the threshold in the tree, should be made by an algorithm, but in practice it is often made by visual inspection.

Literature

Attwood TK, Parry-Smith DJ; Introduction to bioinformatics; Addison Wesley Longman; 1999.
Cantor CR, Cassandra LS; Genomics; Wiley-Interscience; 1999.
Carr DB, Somogyi R, Micheals G; Templates for Looking at Gene Expression Clustering; Stat. Comput. & Stat. Graph. Newsltr.; April; pp. 20-29; 1997.
Eisen MB, Spellman PT, Brown PO, Botstein D; Cluster analysis and display of genome-wide expression patterns; Proc. Natl. Acad. Sci. USA; Vol. 95; pp. 14863-14868; 1998.
Felsenstein J; PHYLIP (Phylogeny Inference Package) version 3.5c; 1993.
Hamming RW; Coding and Information Theory, 2nd Ed.; Prentice-Hall; 1986.
Herwig RH, Poutska AJ, Müller C, Bull C, Lehrach H, O’Brien J; Large-scale clustering of cDNA-fingerprinting data; Genome Research; Vol. 9; pp. 1093-1105; 1999.
Houte BPP van; Cluster analyse: Theorie en toepassingen in de Biologie (in Dutch; Cluster analysis: Theory and Applications in Biology); Masters Thesis; 2000.
Jongman R, Braak C ter, Tongeren O van; Data analysis in community and landscape ecology; Pudoc; 1987.
Micheals GS, Carr DB, Askenazi M, Fuhrman S, Wen X, Somogyi R; Cluster analysis and data visualization of large scale gene expression data; Pac. Symp. Biocomputing 3; pp. 42-53; 1998.
Misener S, Krawetz SA; Methods in Molecular Biology Vol. 132, Bioinformatics Methods and Protocols; Humana Press; 2000.
Somogyi R, Fuhrman S, Askenazi M, Wuensche A; The Gene Expression Matrix: Towards the Extraction of Genetic Network Architectures; Proc. of second World Cong. of Nonlin. Analysts; Vol. 30(3); pp. 1815-1824; 1997.
Theodoridis S, Koutroumbas K; Pattern recognition; Academic Press; 1999.
Wen X, Fuhrman S, Micheals GS, Carr DB, Smith S, Barker JL, Somogyi R; Large-scale temporal gene expression mapping of central nervous system development; Proc. Natl. Acad. Sci. USA; Vol. 95; pp. 334-339; 1998.
Zhang MQ; Large-Scale Gene Expression Data Analysis: A New Challenge to Computational Biologists; Genome Research; Vol. 9; pp. 681-688; 1999.
