Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data Li Zhang and Aidong Zhang Department of Computer Science and Engineering State University of New York at Buffalo Buffalo, NY 14260 {lizhang,azhang}@cse.buffalo.edu Abstract DNA microarray technology provides a broad snapshot of the state of the cell by measuring the expression levels of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns and relationships in a complex dataset by presenting the data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to present an interactive visualization technique conveying the temporal patterns of gene expression data in a form intuitive for non-specialized end-users. The first Fourier harmonic projection (FFHP) was introduced to translate the multi-dimensional time series data into a two dimensional scatter plot. The spatial relationship of the points reflect the structure of the original dataset and relationships among clusters become two dimensional. The proposed method was tested using two published, arrayderived gene expression datasets. Our results demonstrate the effectiveness of the approach. Keywords: visualization, gene expression, time series, Fourier harmonic projection 1 INTRODUCTION Knowledge of the spectrum of genes expressed at a certain time or under given conditions proves instrumental to understand the working of a living cell. DNA microarray technology allows measurements of expression levels for thousands of genes simultaneously [9]. Extensive research has been conducted on the study of temporal patterns of gene expressions [10, 16]. Clustering methods which group genes or samples with similar patterns have become mainstream analysis tool [14]. Visualization can facilitate the discovery of structures, features, patterns, and relationships in data and may provide more insightful information than Murali Ramanathan Department of Pharmaceutical Sciences State University of New York at Buffalo Buffalo, NY 14260 murali@acsu.buffalo.edu traditional numerical methods. By visualization, we hope to gain some intuition regarding the data, but more importantly, we would like to understand the relationships between data points and detect the intrinsic structure, or possible cluster tendencies. Visualization is especially important in the early stages of data analysis in which qualitative analysis is primary to quantitative. Early success will enhance the users’ performance in the remaining stages of analysis. Array-derived gene expression datasets present analysis and visualization challenges because of their dimensionality, noisy environment, and pattern varieties. The parallel coordinates approach [18] is perhaps the simplest method to display patterns of gene expression profiles. Here the data in each dimension are plotted along a separate axis. Holter et al. [8] have used parallel coordinates to visualize the temporal progression in the yeast cell cycle data. Self-organizing maps (SOM) [12] is another approach. More recently, Hautaniemi et al. [6] have presented a heat map-based strategy for visualizing the U-matrix from SOM. The most prominent visualization-enhanced analysis tool for gene expression data is TreeView [5], which provides a user-friendly computational and graphical environment for assessing the results from hierarchical clustering. The graphical presentations from TreeView include a dendrogram to reflect the distance relationships between clusters and a heat plot to visually convey gene expression changes between samples. The heat plot can be viewed as variation of the parallel coordinates plot in which color is used to convey dimension values. Here, we present an alternative mapping for multidimensional data that is based on the first harmonic of the discrete Fourier transform. The mapping has interesting properties and preserves certain key characteristics of a variety of data sets, especially time series data. Unlike parallel coordinates and heat plot which display all individual dimensional information, our approach uses a two dimensional point to represent each gene over the time at the com- putational cost of O(N log N ). It focuses on one very important aspect of the visualization: revealing the structure of the entire dataset. Tested using two published, array-derived gene expression time series datasets, our results indicated that temporal patterns were well reflected in the visualization: cluster relationship became two dimensional, clusters were apparent, and outliers were clear. An interactive visualization tool, VizStruct, was implemented to perform the visualization. The remainder of this paper is organized as follows. Section 2 presents the model of visualization. In section 3, we show our analysis results. The final section discusses other issues in this approach. Proofs for all mapping propositions are included as appendix. 2 METHODS AND SYSTEM The Mapping Mapping converts multi-dimensional data to twodimensions for visualization. An array-derived profile for M genes with N measurements results in M N dimensional data point containing real valued numerical data. Time series data in its simplest form is merely a set of data {yt , t = 0, . . . , N − 1} where the subscript t indicates the time at which the datum yt was observed [4]. On the other hand, a discrete-time real signal on N evenly distributed time points is represented as an indexed sequence of N real numbers 0, . . . , N − 1 denoted by x[n] [1]. Each term of x[n] is denoted by x[n]. The denotation similarity between time series and digital signal suggests that we can view each data point in a time series as a discrete-time real signal (It is not necessary for the signal’s time index to comply with the actual time points). In this scenario, the problem of a two dimensional visualization of the time series is transformed into the problem of finding a two-dimensional point estimation for signals (data points). The frequency domain representation of discrete-time signals is through discrete-time Fourier transform, or DFT [13]. The DFT of a N -point signal x[n] is a frequency sequence with N complex values: F(x[n]) = [Fk (x[n])], where each Fk (x[n]) = N −1 X nk x[n]WN , k = 0, . . . , N −1. (1) n=0 WN = e−i2π/N is called twiddle factor. Each harmonic Fk in the DFT is a measure of the kth sinusoidal frequency component in the signal. For example, the zero harmonic, F0 , is the mean value, the first harmonic F1 measures the base frequency component, the second harmonic F2 measures the component in the signal that is twice the base frequency, and so forth. Because Fourier harmonics, in general, are complex numbers, they provide the two-dimensional point estimate for mapping a multidimensional signal. For this reason, we refer to the mapping as the Fourier harmonic projections. In particular, we are interested in the first Fourier hamonic projection (FFHP): F1 (x[n]) = N −1 X n x[n]WN = n=0 N −1 X x[n]e−i2πn/N . (2) n=0 The relationship between the DFT and the mapping allows the fast Fourier transform algorithm (FFT), originally discovered by Cooley and Tukey [13], to be used for computation. The FFT is a computationally efficient algorithm and has a complexity of O(N log N ). The complex number of F1 (x[n]) in Equation (2) can be expressed in terms of magnitude r and phase θ to provide a useful geometric interpretation of the mapping. The data set was normalized so that the range of values of each dimension across the dataset was 0 to 1. For a data point with N dimensions, the complex exponential divides a unit circle centered at the origin of the complex plane into N equally spaced angles. The value of the first dimension is projected on the radial line corresponding to θ = 0 and, similarly, the value of the kth dimension is projected on to the radial line corresponding to the θ = 2π(1 − k)/N radians. The overall two-dimensional FFHP mapping is the complex sum of all N projections from a data point. Figure 1 illustrates the geometric interpretation for a point containing 6 dimensions. W46 W56 unit circle F1(x[n]) W36 W06 x[n] 2 W6 1 W6 Figure 1. A geometric interpretation of the first Fourier harmonic projection. A normalized 6-dimensional data point is shown on the right by the stem plot. The powers of twiddle factor W6 divide the unit circle centered at the origin into 6 equal angles and each dimension of the data point is projected onto a different radial angle (open circle). The projections are taken complex number sum to give a 2-dimensional image (indicated by a filled circle). +a xa origin origin amplitude scaling amplitude shifting time shifting (A) time shifting (B) Figure 2. Illustration of the effect of (A) amplitude shifting and multiplying, and (B) time shifting of the first Fourier harmonic projection. Properties of FFHP Theory in Practice The first Fourier harmonic projection has useful properties that preserve the correlation between dimensions in the multi-dimensional data point. We summarize them as propositions below. Detailed proofs for propositions are provided as the appendix. 1. Data points with equal values for all the dimensions are mapped to the origin. If x[n] = [a, . . . , a], then F1 (x[n]) = 0. 2. Two data points whose dimension values differ due to the amplitude shifting of a constant are mapped to the same point. If y[n] = x[n] + a, then F1 (y[n]) = F1 (x[n]). 3. Two data points whose dimension values differ due to the amplitude multiplying a constant are mapped to the two points on a line through the origin. If y[n] = a x[n], then F1 (y[n]) = a F1 (x[n]). See Figure 2A. 4. Two data points whose dimension values are transposing each other, i.e. symmetric regarding the middle time point, are mapped to the points symmetric to the real axis. If y[n] = x[N − n − 1], then F1 (y[n]) = F1 (x[n]). 5. Data points that differ only because they are “time-shifted” by d dimensions relative to each other are mapped to the circumference of the circle that is concentric with the unit circle and the angle between the points in the visualization is φ = 2πd/N . If y[n] = x[n − d], then F1 (y[n]) = d F1 (x[n])WN . This property is illustrated in Figure 2B. 6. Let w[n] = x[n] − y[n] be the difference between the two N -dimensional points, x[n] and y[n]. The distance between these two points in the visualization is: à ! N −1 X 2 kF1 (w[n])k = g0 N 1 + 2 rk cos(2πk/N ) (3) k=1 We will demonstrate in the next section that the relative locations of temporal profiles’ mapping images can be predicted by Propositions 1–5. For example, genes with relatively low levels of expression throughout the time line are mapped close to the origin. Genes that steadily increase over the time and genes that steadily decrease over the time line are likely mapped symmetric to the real axis. In Equation (3), the g0 is the variance of w[n], rk is the kth-sample autocorrelation coefficient of w[n]. It can be shown [4] that −1 ≤ rk ≤ 1 and for mutually independent random sequences or white noise, rk = 0. Equation (3) provides insight into the cluster delineation capabilities of the first Fourier harmonic projection. The variance and kth-sample autocorrelation coefficients of the difference between two points within a given cluster are likely to be small because they will share similarities across many dimensions. Thus, points within cluster are likely to map close to each other in the visualization. The first Fourier harmonic projection is well suited for visualizing temporal patterns. FFHP has a better class separation capability when the first harmonic is more dominant among all harmonics. Since the first harmonic is a measure of the base sinusoidal frequency component in the signal, a lower frequency signal tends to have a larger first harmonic. This is the case for a large number of time series data where most trend patterns show low frequency variations (not having multiple cycles). The first Fourier harmonic projection is sensitive to dimension order. For time series data, the time points (dimensions) are naturally ordered. VizStruct System The first Fourier harmonic projection approach is implemented as a visualization tool called VizStruct, which is written in Java and is available on request from the first author (lizhang@cse.buffalo.edu). The name VizStruct em- (A) (B) (C) Figure 3. Three snapshots from a dimension tour through a synthetic three-dimensional data set containing 25, 000 points. The parameter settings for (A)–(C) were h0.5, 0.5, 0.5i, h0.3, 0.5, 1i, and h0.05, 1, 1i, respectively. phasizes its capability of visualizing the structure of the dataset. 3 RESULTS Data Sets for Visualization Dimension Tour The dimension tour is a feature of VizStruct that allows the user to interact with the data via dynamic animations. It is analogous to the grand tour [2] and the user interacts with the visualization by changing the dimension parameter associated with each dimension. The default value for each dimension parameter is 0.5 and the individual parameters can be changed over the range −1 to +1 either manually or systematically using the program. Because no twodimensional mapping can capture all the interesting properties of the original multi-dimensional space, two points that are close in the visualization can theoretically be far apart in the multi-dimensional space. The dimension tour, which creates animations that explore dimension parameter space, can reveal structures in the multi-dimensional input that were hidden due to overlap with other points in the visualization. Figure 3 illustrates the capabilities of the dimension tour using a synthetic dataset containing 25, 000 points in three dimensions. At the default settings of dimensional parameters (Figure 3A), 5 clusters are apparent. However, during the course of the animations (Figures 3B and 3C), the multi-layered structures of the original 5 clusters become increasingly apparent. Our approach was tested using two published arrayderived data sets. The rat kidney array dataset of Stuart et al. [16] containes measurements of gene expressions during rat kidney organogenesis. The data were downloaded from http://organogenesis.ucsd.edu/data.html. It consists of 873 genes which vary significantly during kidney development at 7 different time points: gestational day 13, 15, 17, 19; newborn (N); 1 week (W); and nonpregnant adult (A). The fibroblasts dataset of Iyer et al. [10] is the result of a study of the response of human fibroblasts to serum. It consists of gene expressions measuring the temporal changes in mRNA levels of 517 human genes at 13 time points, ranging from 15 minutes to 24 hours after serum stimulation. The data were downloaded from http://genomewww.stanford.edu/serum. Rat Kidney Dataset In the rat kidney dataset, there are 5 discrete patterns or groups of gene expression during nephrogenesis. Figure 4 illustrates these temporal profiles characterized by the idealized gene expressions. Figure 5A shows how genes are classified by a hierarchical clustering algorithm. It copies the Figure 3 in [16]. Figure 5B shows the parallel coordinates of the dataset. Patterns of genes in each group comply to the profiles depicted in Figure 4 (with some noise). 3 1 2.5 2 1.5 1 0.5 0 13151719 N W A 2 13151719 N W A 3 5 4 13151719 N W A 13151719 N W A 13151719 N W A Figure 4. Idealized temporal gene expression profiles during kidney development. The groups were named 1 through 5 based on the timing of their peak expression during development. Seven time points were 13, 15, 17, 19 embryonic days; N , newborn; W , 1 week old; A, adult. 3 3 3 2 2 2 1 1 1 0 All Data 0 Group 1 0 3 3 3 2 2 2 1 1 1 0 Group 3 0 (A) Group 4 0 Group 2 Group 5 (B) Figure 5. Visualization of the rat kidney dataset in heat plot and parallel coordinates. (A) Dendrogram and a heat plot from hierarchical clustering algorithm. (B) Parallel coordinates for the entire datset and each of the gene groups. Figures 6A-B show the visualization of the rat kidney dataset in VizStruct under the first Fourier harmonic projection for two dimension parameter settings. There are 5 sets of colored symbols for each of the 5 gene groups. Each symbol represents one gene across 7 time points. In Figure 6A, two big clusters are clearly apparent from the visualization. The top cluster consists of genes from groups 3, 4, and 5. The bottom cluster is comprised of genes from groups 1 and 2. Furthermore, genes from each group are aggregated. The formation of two large clusters can be interpreted by the temporal profiles. Groups 1 and 2 with genes which have very high relative levels of expression in early development are quite different from groups 3, 4, and 5 for genes that have a relatively steady increase in expression throughout development. The visualization also indicates that points in the upper cluster are symmetric to the points in the lower cluster. Properties of FFHP may suggest the reason. Temporal profiles of groups 1 and 4 suggest that they are somewhat symmetric to the middle time point (gestational day 19). By Proposition 4, they would be mapped to points symmetric to the real axis. On the other hand, groups 4 and group 5 are mapped closely since they have similar profiles except for the significantly up-regulated in the last time point. The same arguments can be applied in the case of group 1 vs. group 2, or group 3 vs. group 4. Due to the noise, most boundaries between groups are not very clear. However, the separation between group 4 and group 5 improves in Figure 6B compared to Figure 6A. Fibroblast Dataset Temporal patterns are slightly complicated in the fibroblast dataset. Data has been classified into 10 groups using the hierarchical clustering algorithm by the original author [10]. Figure 7A shows the result of the hierarchical clustering. It is a copy of Figure 3 from [10]. Figure 7B gives the 6 12 G2 G4 10 G5 4 G3 8 G3 2 6 Im[F1] Im[F1] G4 0 4 2 G1 0 −2 −2 G2 −4 G5 −4 G1 −6 −6 −4 −3 −2 −1 0 1 Re[F1] 2 3 4 5 −6 −4 −2 0 (A) 2 Re[F1] 4 6 8 10 (B) Figure 6. Visualization of the rat kidney dataset in VizStruct. Five gene groups were represented by blue plus symbols, red circles, green triangles, magenta stars, and black cross symbols. The dimension parameters for (A) and (B) were h0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5i and h1, 0.5, −1, −0.5, 0.5, 1, −1i, respectively. 15 10 5 0 All Group A Group B Group C Group D Group E Group F Group G Group H Group I Group J Group U 15 10 5 0 (A) (B) Figure 7. Visualization of the fibroblast dataset in heat plot and parallel coordinates. (A) Dendrogram and a heat plot from hierarchical clustering algorithm. (B) Parallel coordinates for the entire datset and each of the gene groups. The gene group labelled U consists genes without label after hierarchical clustering. 8 8 8 4 4 4 8 8 8 8 10 30 9 8 8 8 6 44 8 7 4 4 444 4 8 888 87 7 44 4 8 988 8 7 4 44 44 744 8 8888 78 777 444444 8 88 2 88 8877774 87 6 8 888 888811 888884424424 9811 8 88 6 4 6 8 4 8 2 11 4 88 4 8 9 66 22 66 2 8611 10 4 11 2 11 2 811 9 6 4 11 9 2 11 2 2 4 9 2 2 6 2 6 4 2 9911 6 96 66 11 11 11 99 24 2 2 2 1 12 2 11 11 2 2 11 2 1 1 1 1 6 9969 669 111 66 2 11 3 2 12 2 1 10 1 6 1 10 1 2 210 6 1 3 6 11 1 110 2 1 11 3 6 6 66610 1 11 10 31 1333 3 333 3 6 10 10 10 3 11 6 611 11 3 10 110 0 1 03 10 5 10 3 33310 10 55 5 10 10 0 −10 3 10 10 10 5 9 −20 −100 −80 9 −60 Re[F1] −40 2 0 6 −2 6 6 6 −4 6 6 −6 5 10 6 −120 4 Im[F1] Im[F1] 8 −8 −20 0 −15 20 44 4 7 4 47 44 4 7 8 74 44 8 7 7 44 42 8 88 888888 8 88 4 7 4 24 7 4 887 8 8 8 78 8 4 4 88 8 811 2 98 88 8 4 8 6 8 11 88 868 2 444 118 86 9 8 4 6 2 11 2 10 411 2 68 2 2 8 11211 9 6 11 2442 22222 4 222222 11 222 699 9 22 2 66 11 11 611 2 4 2222222 2222 11 9 2 11 11 2222 2 11 96 99999 242 212 222 66 222 2 11 1 2 2 11 21 1 2 22222 11 6 6 111111 1 111 211 6 22 222 12 2 22 2 2 11 2 1 1 1 9 2 1 2 1 1 1 2 6 1 2 1 1 1 1 1 9 11 1113111 1 11111221 1112 2 2 10 1 96 6 1 2 910 6 1 1 6 10 11211111 2 10 6 11 1333 111111 11 2 111111 66 333 1 1 11 11 10 10 3313 33 3333313 3 3 3 333 1 6 6 10 10 1010 3 3 3 3 11 1110 10 11 10 5 10 10 3 3 3 10 3 5 10 55 10 10 6 8 8 4 4 4 7 8 88 8 20 7 8 8 8 8 8 4 7 78 8 −10 −5 0 5 5 10 Re[F1] (A) (B) 10 10 5 5 were used for each Figure 8. Visualization of the fibroblast dataset in VizStruct. 11 colored numbers of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization 10 in (A). 11 7 2 7 2 10 30 2 7 8 2 0 −10 4 −20 −100 −80 −60 Re[F1] −40 2 0 10 −2 10 10 10 −4 10 −6 5 1 −120 4 Im[F1] Im[F1] 10 −8 −20 0 20 (A) −15 11 11 1111 9 1111 9 9 99 9 9 11 9 99 3 99 11 99 39 9 9 9 3 9 8 22 8 99 9 2 2 2 2 11 2 11 9 9 11 11 11 11 1111 2 2 2 9 99 9 11 1111 2 222 8 9 9 11 11 99999 2 2228 9 9911 2 88 3 9999999399911 8 8 8 8 8 8 8 2 2 888 999 2 8889 8 8 8 3 8 8 83333 810 888888 3 9 88 33 10 3933 8888810 38 38 8 8 3 33333 33 10 10 3 2 10 88383 10 10 3 3 3 810 810 33 33 8 8 33 3 33 10 1010 88 10 3 833 3 3 33 10 33 10 3 88 10 3 38 33 10 10 3 3 883 101010 10 3 3 3 3 33333 10 6 336836 10 1010 10 3 3 160 63 6336 66 66 6 6 6 6 6 6 6 9 9 9 9 88 88 8888 8 88 9 8 9 8 8 8 8 98 8 8 888 8 8 88 88 8 8 8 3 39 10 8 8 88 3 88 8 8 3 8 33 33 10 9 88 8 8 3333 3 3 3 8 38 8 10 3 3333 3333 333333 333 333 888 10 10 3 3 10 3 3 10 333333 33333 33 2 3 888 333 33 3333 88 10 108 3 33 3333 3333333 8 3 8 33 3333 3 8 8 3 10 333333333 3 333333 3 333 333 3333 33 8 3 3 3 3 3 10 10 10 10 3 8 3 3 3 3333333333333 3 8810 3 10 8 33 333333 3333 33 10 3333 3 10 3 8 3 3 3 3 3 33 3 10 3 3 10 3 33 810 8 3 3333 33333333333 3 3 3 333 3 10 10 6 8 63 3 3 1010 33 3 10 10 6 6 6 33 3 6 6 3 6 6 6 66 6 2 6 2 7 9 2 2 2 20 9 9 −10 −5 0 6 5 10 Re[F1] (B) Figure 9. Visualization of the fibroblast dataset in VizStruct. Genes were grouped by the k-means clustering method. 11 colored numbers were used for each gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization in (A). parallel coordinates the entire dataset and each gene group. Figure 8A shows the visualization of the fibroblasts dataset in VizStruct under the first Fourier harmonic projection. Colored numbers 1 through 11 were used for each gene group. The visualization reveals several outliers and no distinct clusters. The majority of data are too dense to be seen. By zooming technology, the enlarged detail is shown in Figure 8B. More numbers spread out, but most blue numbers (1) were still covered by numbers 2 and 3. Two observations can be made: (1) a large number of genes are close to the center. (2) most clusters have a radial shape emitting from the center. They can be interpreted by the FFHP properties. (1) Temporal patterns of group A and group B have very flat shape and relative smaller values. By Propositions 1 and 2, they tend to be mapped closed to the origin. (2) In hierarchical clustering, Pearson’s correlation coefficient was used to measure the similarity. Genes whose time values differ due to the amplitude shifting or multiply- 4 6 4 2 2 Im[F1] y 1 0 0 −1 −2 −2 −4 −3 3 2 1 801 y 3 4 516 444 629 480451 536528 671 489 519 511 572 582 600 601 452 690 471 670 566 676 420503 574 466 687 520 672 605 795 775 669 529661 611 571 443 649 424 531 630 607 624 680 620 713 625 465 660 657 513 662 825 486 652 614 645 493 619 639 506 527 656 525 866 653 450 742 508 457 554 655 689 558 683 488 779 793 704 569 398 469 651 856 504 562 470 754 524 477 584 646 460 515 635 534 468 628 502 637 449 664 426 710 402 583 564 602 563 650 603 799 467 741 567 463 733 715 474 850 458 552 834 764 705 831 618 706 707 640 634 440 411 654 717 753 535 851 540 559 612 530 432 835 577 638 827 523 517 538 668 697 785 716 514 501 838 446 833 702 695418 596 868 410 678 857 500 553 461 748 841 8712 60 859 573 755 696 867 475 632 555 568 814 401 633 855 580 666 485 512 586 663 539 824 576 826 522 863 759 597 729 773 693 585 399 836 497 476 575 798 613 479 865 473 578 721 726 641 849 509 679 423 848 487 694 482 803 718 692 763 526 551 720 811 409 546 828 734 598 459 760 483 544 783 421 631 595 581 491 788 556 708 862 434 691 846 615 809 853 771 818 496 806 543 782 472 454 667 561 688 762 845 821 644 507 832 819 744 787 681 797 790 665 864 709 725 430 684 394 647 840 686 738 839 484 490 740 429 636 842 626 495 415 453 604 627 419 404 858 756 768 510 673 776 549 565 761 587 579 730 658 412 675 455 784 804 674 570 621 752 617 428 822 829 810 659 403 532 478 700 541 414 789 816 808 791 677 802 591 438 719 777 609 847 743 594 616 442 550 815 745 844 518 781 746 547 439 852 437 778 682 758 462 750 807 643 794 820 545 610 492 505 648 770 727 698 622 642 796 416 405 751 448 736737 456 701 589 703 711 435 800 830 422 823 780 735 431 521 724 805 427 623 817 837 557 722 739 699 749 685 854 408407 723 772 537 732 747 766 731 396 786 792 395 397 714 341 599 588 494 606 400 533 447 590 433 608 813 767 481 498 464 441 765 593 560 812 757 386 365 357 380 774 843 406499 370 436 861 385 413 391 417 387 361 346 445 362 769 358 372 592 548 247 342 49 97309 327 324373 355 728139 231 184 376 326368 425 352367 266 359 7 331 329 332 34 252 129 171 369 351 383 542 299 215 350 334 112 183 221 65 304 313 238 98 356 328 113 33 185 390 389 330 348 333 381 197 188 382 80 224 13 255 293 250 392337 288 354 220 152 375 107 364 343 11320 321 344 287 101 39 2 369 677 338379 310 349 345 377 339 256 353 23204 46 41208 210 38 147 384 8 292 92296 213 162 1245 295 173 5048 393 388 218 159 156 164227 322 106 66 118 78 219 336 335340 306 40 241 347 378 300 109 264 189 30 285 194 61 94145 72 168 140 86 125 317 91 1020 26 245 100 96 120 294 57 52 206 363 283 277 73 27 174 36 60 75 366 311 137 28 274 284 68314 232 126 32 325 298265 226 53 18132 144 212 281 35 315 170 151 222 119 192 286 44 318 127 58 172 85 117 160 279 374 289 81 43 229 248 63 95 149 240 74 205 190 1154 133 180 239 166 360 8831 54 235 143 108 4186 225 263 233 55 278 51 24 25 163 267 134 253 181 47 146 223 179 157 167 102 116 130 254 202 19 29200 191 262 104 280 136 211 141 70 216 138 243 79 105 148 150 175 214 62 249 76 128 42 83 209 242 307 269 178 319 273 237 371 9217 228 90 22 260 165 114 89169 21 282 272 308 203 99 234 207182 268 193 230 323 67 246 123 124 7137 261 201 131 259 303 176 270 158 199 316 177 8784 122 305 5121 115 111 271 275 17 103 257 15 14 64 251 244 142 195 16 110 155 56 82 135 236 198 297 93 161 153 276196 312 59290 258 302187 0 −1 −2 −3 291 301 −4 −3 −2 −1 0 x 1 2 3 −6 −4 −3 −2 −1 (A) 0 1 Re[F1] 2 3 4 857 717 707 848762 835846 742 777 839 804 715826 756 845 746 748 779 840 758 815 582 451 775795 825 754 782 770 730 725 781 716 866 713 741 760 807 705798 745 511 571 778 763 799 796 704 793 727 853 744 831 771 709 856 624 734 720 830 833 465 649 834 452 827 790 824 721 850 764 787 719 718 768 712 735 785 789 867 844 635 676 863 761 800 751 471 443 829 516 862 706 729 723 759 818 740 524 655 836 605 536 601 838 868 652 733 743 672 529 788 466 842 726 859 784 474 776 783 806 851 816 819 860 753 558463 849 480489 687619 797 737 786 708 773 660 852 821 467 671 444 710 520620 855 752 832 780 614 847 470 618 457 865 517 724 506 639 858 567 750 792 535 569 772 528 562 703 553 808 854 803 811 828 822732 486 755 476 634 629 578 664 523 661525 487 650 458 656 459 731765 767 794 502 539 814 531 563 739 736 491 508 514 810 693 572 820 597 515 864 841 526 581 738 645 577 823 640 666 423 791 654 670 689 510 401 749 632 575 453 564 747 584 552 534 659 667 714 722 625 507 837 690 658 503 662 668 817 805 541 711 555 801 477 678 700 497 475 688 468 701 702 607651 583 456 679 766757 455 449 540 646 696 576 512 538 446 697 460 580 428 469 532 695 566 802 568 462 500 545 653 809 573 680 615 504 630 513 579 546 501 521 549 600 574657450 473 394 420 843 519 585 604 698537 813 686 621 565 530692 627 544 647 570 533774 559 613 589 557608 496 442 424 554 505 419 488 493 812 663 590 683 492 611 669 561 454 637 665 441 681 550 448 399602 641 596 461 691 495642 606 769 728 479 398 628587631 643 426 402 414 447 861 411 610 494560 603612 410527 633 522595 648 542 412 440 622 547 638 431 464 472623 598 636 675 432 616 418 684 434 309 685 694 586 617 644 490 599 588 593 139 551 409437 413699416 415 421 591 556 184 485 677 509 397 435 478 482 425 518 673 345 483 405 422 427 429 439 147 543 594 430 484 404 499 49 407403682 400 438 626 97 548381 481 38 370 674 255 12 408395 386 433 396380 39 150 159 445 371 112 200 310 304171 436 129 185 362 32 592 365417 262 188 220 227 113 238 197 391357 609 164 327 297 375 406 237 78132 65 33 389 385 231 346 329368 334 45 6 81 315 13 204 253 342333 2205 252 20 336 325 285 498341 355 28 162 25 10 247224 226 210211 349 101 229 215 277 324 331390 105 183 154 240 87 51 358 283 212 55 344 59 146 254 30 18 242 125 191 21122 300 134235 364 356 328 77208 2622 151 189 6973 80374 60 165 248 340 281 15 387 330 52 260 424 44170 332 61 23 286 388 263 19 95 120219 41 124 383 294 156 282 64 100 393 270 256 299 382 109 192 46 209 264 36 222 74 54 72 99 103 169 379 127 366 243 392 68 174 7 35934 384168 343 232 126 108 303 354 27 216 373 172 40157 43 123 90284 136 163 89 86106 14 138 319 1 351 58 83 372 367 363 133 376 130 117 63 142 361 338 118 128 153 308 8 249 5 71 279 239 236 32657 339 233 131 140 33511 160 116 257 42 312 273 84 119 76 137 241 114 56 234 144 70 311 186 275 258 337 274 259 3 360 177 251 67 246 190 98 316 47 203 75 305 352350 217 182 228 35 152 272 107 158 225 261 195 317 149 16 121 265 193 377 91278 214 347 369 353 161 53 135 88 145 115 267 298 175 230 199 181 37244 104 148 85 296 82187 223 268 111 321 280 293194 290 302291 178 102 266 289 269 201 318 166 31 93 378 323 348179 306 202 92 9 29 94 295 167 176 196 207 276 180 221 322 143 62 79 250 48 110 218 314 198 155 320 17 66 271 245 287 50 292 301 173 288 313 96 307 141 213 206 −4 5 (B) −3 −2 −1 0 x 1 2 3 (C) Figure 10. Comparison of Sammon’s mapping and the first Fourier harmonic projection. (A) The result from Sammon’s mapping of the rat kidney dataset. Five gene groups were represented by 5 colored symbols the same as were used in Figure 6. (B)–(C) Comparison of relative gene locations between (B) VizStruct and (C) Sammon’s mapping. Five colors were used for each of the gene groups. The color schema is the same as in (A). Notice that the visualization layout of (A) and (C) is identical. The purpose of panel (C) is to compare the gene by gene location to the panel (B). ing a constant have very high coefficient values. By Propositions 2 and 3, they tend to be mapped closely along a line though the origin. Some relative gene group locations can be predicted. For example, by Proposition 4, groups 4 and 5 are mapped symmetric to the real axis. Close inspection suggests that temporal pattern of group 8 is somewhat a 2 or 3 right time-shift from the pattern of group 6. By Proposition 5, each time point shift correspondents to roughly 2/13π ≈ 30◦ rotation. In Figure 8B, genes in group 8 are mapped about 60◦ clockwise to genes in group 6. Lacking clear clusters in the visualization indicates that different clustering methods may yield different results. Figure 9 shows the grouping aspect of k-means clustering. Euclidean distance was used in k-means as the distance measure. The visualization reveals that genes closed to the origin are grouped as one. Comparison of FFHP to Other Visualizations Heat plot and parallel coordinates display all individual dimensional information. In parallel coordinates, a polyline represents a gene over time. This format is a concise and intuitive for displaying temporal profiles. However, the parallel coordinate has obvious drawbacks: when the data size becomes larger it becomes increasingly unreadable as indicated in the first panel of Figure 5 and Figure 7. Heat plot uses a color mosaic instead of a polyline to overcome over- lapping. Combined with the dendrogram from hierarchical clustering, it gives a global view as well as individual clusters of the dataset by grouping genes with similar patterns together. Cluster relationships in heat plot is one dimensional: clusters of genes are listed one by one as shown in Figure 5A and Figure 7A. The first Fourier harmonic projection takes a different approach. It uses a two dimensional point to represent each gene over time. By doing so, it takes advantage of the spatial relationship of the points to reflect the structure of the original dataset. Thus the cluster relationships become two dimensional and FFHP has a better capability of displaying outliers than heat plot. Unlike heat plot, which requires algorithms to group similar genes to make a meaningful visualization, FFHP directly mapped the multi-dimensional data onto a two dimensional space without any prior knowledge of the dataset. Multidimensional scaling (MDS) [3] is a competing approach for visualizing multi-dimensional data. We compared FFHP to Sammon’s mapping [15], a variant of MDS that optimizes the following stress function E: 1 E = XX i X X (d∗ij − dij )2 , d∗ij d∗ij i j<i (4) j<i where d∗ij is the distance between points i and j in the N dimensional space and dij is the distance between i and j in the visualization. Figure 10 shows Sammon’s mapping of the rat kidney dataset. A comparison of Figure 10A to Figure 6A reveals the extensive similarities between FFHP and Sammon’s mapping. The relative locations of individual samples are also remarkably similar. This is indicated in Figure 10B and Figure 10C. Sammon’s mapping has some drawbacks: (1) it provides a single final result and the user cannot intervene interactively during visualization, (2) the incremental addition of even a single point requires a complete repetition of the optimization procedure and possible extensive reorganization of all the previously mapped points to new locations, and (3) it requires time-consuming optimization procedures of time complexity O(N 2 ) or greater. Our results illustrate some of the strengths and weaknesses of FFHP. The visualization reflects the structure of the dataset: outliers, clusters and their relationships. Comprehending the structure of the dataset can facilitate the choice and understanding the results of different clustering methods. Although the FFHP uses an approach to mapping multi-dimensional data that is distinctively different from Sammon’s mapping, it yields results that are consistently comparable. However, when the dataset contains a large number of patterns, it becomes difficult to separate different patterns and the visualization may be difficult to interpret. 4 DISCUSSION Visualization of microarray data is a challenge because of its high dimensionality. In this paper, we have explored use of the first Fourier harmonic projection for visualizing multi-dimensional time course array datasets. Unlike parallel coordinate or heat plot approach which display all dimensional information, FFHP uses a two dimensional point to represent each data point (gene in this case). Our results indicated that temporal patterns were well reflected by spatial relationships: genes with similar pattern were aggregated and relative locations of gene groups can be predicted. Moreover, the first Fourier harmonic projection was shown to yield results that were similar to those from Sammon’s mapping. Achieving two dimensional mapping requires a tradeoff. The mapping is lossy for detailed dimensional information. Our approach attempts to preserve to the maximum semantics of the data points via Fourier harmonic aspect. In particular, characterizing the data using two descriptive measurements: the real and imaginary portions of the first discrete Fourier harmonic. A similar approach uses principal component analysis (PCA) [11]. This visualization deploys another two descriptive measurement: the first and second principal component. In addition to the first Fourier harmonic projection, higher Fourier harmonics can also be used as mappings. It can be shown that for any harmonic (> 1), there exits an equivalent first harmonic of the original discrete signal whose time indices are systematically rearranged. At certain conditions (such as temporal patterns with high frequency variations, i.e. multiple cycles), higher harmonic projections enhance substructure separation in the visualization. Detailed discussion is beyond the scope of this paper due to a length constraint. The FFHP mapping results in two-dimensional visualizations that are identical to those of radial coordinate visualization techniques, e.g., RadViz [7]. However, rather than vector notation and the spring paradigm of RadViz, we have used a complex number notation. This substantive reformulation of the mapping provides valuable theoretical insights and allows important properties of mapping, including its relationship to the DFT, to be easily derived. It can also create possible extensions such as higher Fourier harmonic projections. The first Fourier harmonic projection requires data without missing values. This requirement can be easily met because filling in missing values is a mature research field [17]. Gene expression data can be studied in either sample space or gene space. Here, we have reported only the visualization on gene space. In a separate report [19], we applied the first Fourier harmonic projection on the sample space and performed visualization-driven classifications. Our experiments demonstrated that FFHP offered an alterative format of visualization. We believe that using FFHP alone or in combination with heat plot or parallel coordinates would give a biologist additional powerful tools for analyzing and visualizing microarray data sets. ACKNOWLEDGEMENTS This work was supported by grants from the National Science Foundation. We also thank the anonymous reviewers for their constructive comments on the manuscript. References [1] Cadzow, J. A., Landingham, H. F. Signals, Systems, and Transforms. Prentice-Hall, Inc., Englewood Cliffs, NJ, 1985. [2] Cook, C., Buja, A., Cabrera, J., and Hurley, C. Grand Tour and Projection Pursuit. Journal of Computational and Graphical Statistics, 2(3):225–250, 1995. [3] Davision, M. L. Multidimensional Scaling. Krieger Publishing, Inc., Malabar, FL, 1992. [4] Diggle, P. J. Time Series: A Biostatistical Introduction. Oxford University Press, Oxford OX2 6DP, 1990. [5] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D. Cluster Analysis and Display of Genome-wide Expression Patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868, December 1998. [6] Hautaniemi, S., Yli-Harja, O., Astola, J., Kauraniemi, P., Kallioniemi, A., Wolf, M., Ruiz, J., Mousses, S., and Kallioniemi, O. Analysis and Visualization of Gene Expression Microarray Data in Human Cancer Using SelfOrganizing Maps, 2003. To appear in Machine Learning: Special Issue on Methods in Functional Genomics. [7] Hoffman, P. E., Grinstein, G. G., Marx, K., Grosse, I., and Stanley, E. DNA Visual and Analytic Data Mining. In IEEE Visualization ’97, pages 437–441, Phoenix, AZ, 1997. [8] Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J. R., and Fedoroff, N. V. Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity. Proc. Natl. Acad. Sci. USA, Vol. 97(15):8409–8414, July 2000. [9] Ideker, T., Galitski, T., and Hood, L. A New Approach to Decoding Life: Systems Biology. Annu. Rev. Genomics Hum. Genet., 2:343–372, July 2001. [10] Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore, T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J. Jr., Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and Brown, P. O. The Transcriptional Program in the Response of Human Fibroblasts to Serum. Science, Vol. 283(1):83–87, January 1999. [11] K. Y. Yeung and W. L. Ruzzo. Principal Component Analysis for Clustering Gene Expression Data. Bioinformatics, Vol. 17(9):763–774, 2001. [12] Kohonen, T. Self-Organizing Maps, Springer Series in Information Sciences, volume Vol. 30. Springer, Berlin, Heidelberg, New York, 1995. [13] Morrison, N., editor. Introduction to Fourier Analysis. John Wiley & Sons, Inc., New York, NY, 1994. [14] Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S. L., editor. The Analysis of Gene Expression Data: Methods and Software. Springer-Verlag New York, Inc, New York, NY, 2003. [15] Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401–409, 1969. [16] Stuart, R. O., Bush, K. T., and Nigam, S. K. Changes in Global Gene Expression Patterns During Development and Maturation of the Rat Kidney. Proc. Natl. Acad. Sci. USA, Vol. 98(10):5649–5654, May 2001. [17] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics, Vol.17(6):520–525, 2001. [18] Ward, M. O. XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data. In IEEE Visualization 1994, pages 326–336, 1994. [19] Zhang, L., Zhang, A., Ramanathan, M. et al. VizCluster and Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Databases, 13(1):79–97, January 2003. Appendix: Proof for Propositions Lemma 1 ÃN −1 X !2 an = N −1 X n=0 a2n + 2 n=0 −k−1 N −1 N X X at at+k . t=0 k=1 Lemma 2 For any two complex numbers z and w, (1) z + w = z + w, (2) z w = z w, (3) z = z. Lemma 3 Let j ∈ N, then = N −1 X cos(2πjn/N ) = n=0 N −1 X e−i2πjn/N n=0 N −1 X sin(2πjn/N ) = 0. n=0 Lemma 4 FFHP is homomorphic: F1 (a x[n] + b y[n]) = a F1 (x[n]) + b F1 (y[n]). Proposition 1 (Cancellation) If x[n] F1 (x[n]) = 0. = [a, . . . , a], then Proof : From the formula in Eq. (2), we get F1 (x[n]) = N −1 X a e−i2πn/N = a n=0 N −1 X e−i2πn/N . n=0 By Lemma 3, let j = 1. F1 (x[n]) = 0. ¤ Proposition 2 (Amplitude Shifting) If y[n] = x[n] + a, then F1 (y[n]) = F1 (x[n]) From the formula in Eq. (2), we get F1 (y[n]) = N −1 X (x[n] + a)e−i2πn/N = n=0 + N −1 X N −1 X x[n]e−i2πn/N n=0 a e−i2πn/N = F1 (x[n]) + 0 n=0 The second summation is 0 by Proposition 1. ¤ Proposition 3 (Amplitude Multiplying) If y[n] = a x[n], then F1 (y[n]) = a F1 (x[n]). From the formula in Eq. (2), we get F1 (y[n]) = N −1 X a x[n]e−i2πn/N n=0 = a N −1 X x[n]e−i2πn/N = a F1 (x[n]) n=0 ¤ Proposition 4 (Transposing) y[n] = x[N − n − 1], then F1 (y[n]) = F1 (x[n]). Proof : By Lemma 4, the distance between F1 (y[n]) and F1 (x[n]) is kF1 (w[n])k. From Eq. (2), we get By Lemma 2, we have F1 (x[n]) = N −1 X x[n]e−i2πn/N = n=0 = N −1 X N −1 X x[n]ei2πn/N n=0 kF1 (w[n])k x[n]ei2πn/N = k = F1 (x[−n]) N −1 X w[n]e−i2πn/N k n=0 n=0 = However, when x[n] is a real signal, x[n] = x[n]. Then we have k N −1 X w[n] cos(2πn/N ) − iw[n] sin(2πn/N )k n=0 F1 (x[n]) = F1 (x[−n]). i.e., F1 (x[n]) = F1 (x[−n]). Since x[N − n − 1] = x[−n] then F1 (y[n]) = F1 (x[n]) ¤ Let ω = 2π/N , by Lemma 3, we have N −1 X cos(nω) = n=0 Proposition 5 (Time Shifting) If y[n] d F1 (y[n]) = WN F1 (x[n]). = x[n − d], then N −1 X sin(nω) = 0. Now add a term w, b the mean of w[n], n=0 Proof : Assume 0 ≤ n < N , let l = n − d, then n = l + d. When n = 0, l = −d and when n = N − 1, l = N − 1 − d. From the formula in Eq. (2), we get NX −1−d F1 (y[n]) = F1 (x[n − d]) = kF1 (w[n])k 2 = NX −1−d d x[l]e−i2πl/N e−i2πd/N = WN l=−d x[l]e−i2π(l+d)/N = NX −1−d x[l]e−i2πl/N + = l=−d −1 X N −1 X l=−d + x[l] e−i2πl/N = l=−d N −1 X x[t]e−i2πt/N 2 N −1 N X −1−k X + = (w[n] − w) b sin(nω) kF1 (w[n])k2 x[t]e−i2πt/N [(w[t] − w)(w[t b + k] − w)Ω] b where Ω = cos(tω) cos((t + k)ω) + sin(tω) sin((t + k)ω). By trigonometry identity cos θ cos φ + sin θ sin φ = cos(φ − θ), we have Ω = cos(kω). Now = N −1 X (w[n] − w) b 2 n=0 t=0 N −1 X !2 t=0 k=1 t=N −d NX −1−d ÃN −1 X (w[n] − w) b 2 (cos2 (nω) + sin2 (nω)) + Let t = l + N for the first summation and t = l for the second summation, we get x[l]e−i2πl/N (w[n] − w) b cos(nω) n=0 l=0 NX −1−d ÃN −1 X w[n] sin(nω) n=0 !2 Expending each squaring term by Lemma 1, we get x[l + N ] e−i2π(l+N )/N NX −1−d + !2 n=0 However, ei2πn/N = ei2π(n+N )/N and x[n] = x[n + N ], x[l]e−i2πl/N w[n] cos(nω) ÃN −1 X n=0 l=−d NX −1−d !2 n=0 l=−d = ÃN −1 X + x[t]e−i2πt/N = F1 (x[n]) 2 N −1 N X −1−k X t=0 d Therefore, F1 (y[n]) = F1 (x[n])WN . ¤ Definition 1 The mean of a signal x[n] is defined as x b = PN −1 autocovariance coefficient of a n=0 x[n]/N . The k-th sample P −1−k signal x[n] is defined as gk = N (x[n] − x b)(x[n + k] − n=0 x b)/N . g0 is called the variance of x[n]. The k-th sample autocorrelation coefficient is defined as rk = gk /g0 . Proposition 6 (General Distance) Let w[n] = x[n]−y[n] be the difference between x[n] and y[n]. The distance between F1 (x[n]) and F1 (y[n]) is à ! N −1 X kF1 (w[n])k2 = g0 N 1 + 2 rk cos(2πk/N ) . k=1 = N (g0 + 2 à = g0 N [(w[t] − w)(w[t b + k] − w) b cos(kω)] t=0 k=1 N −1 X gk cos(kω)) k=1 1+2 N −1 X k=1 ! rk cos(2πk/N ) . ¤