Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene Expression Data

advertisement
Fourier Harmonic Approach for Visualizing Temporal Patterns of Gene
Expression Data
Li Zhang and Aidong Zhang
Department of Computer Science and Engineering
State University of New York at Buffalo
Buffalo, NY 14260
{lizhang,azhang}@cse.buffalo.edu
Abstract
DNA microarray technology provides a broad snapshot
of the state of the cell by measuring the expression levels
of thousands of genes simultaneously. Visualization techniques can enable the exploration and detection of patterns
and relationships in a complex dataset by presenting the
data in a graphical format in which the key characteristics become more apparent. The purpose of this study is to
present an interactive visualization technique conveying the
temporal patterns of gene expression data in a form intuitive
for non-specialized end-users.
The first Fourier harmonic projection (FFHP) was introduced to translate the multi-dimensional time series data
into a two dimensional scatter plot. The spatial relationship
of the points reflect the structure of the original dataset and
relationships among clusters become two dimensional. The
proposed method was tested using two published, arrayderived gene expression datasets. Our results demonstrate
the effectiveness of the approach.
Keywords: visualization, gene expression, time series,
Fourier harmonic projection
1 INTRODUCTION
Knowledge of the spectrum of genes expressed at a certain time or under given conditions proves instrumental to
understand the working of a living cell. DNA microarray
technology allows measurements of expression levels for
thousands of genes simultaneously [9]. Extensive research
has been conducted on the study of temporal patterns of
gene expressions [10, 16]. Clustering methods which group
genes or samples with similar patterns have become mainstream analysis tool [14]. Visualization can facilitate the
discovery of structures, features, patterns, and relationships
in data and may provide more insightful information than
Murali Ramanathan
Department of Pharmaceutical Sciences
State University of New York at Buffalo
Buffalo, NY 14260
murali@acsu.buffalo.edu
traditional numerical methods. By visualization, we hope
to gain some intuition regarding the data, but more importantly, we would like to understand the relationships between data points and detect the intrinsic structure, or possible cluster tendencies. Visualization is especially important
in the early stages of data analysis in which qualitative analysis is primary to quantitative. Early success will enhance
the users’ performance in the remaining stages of analysis. Array-derived gene expression datasets present analysis
and visualization challenges because of their dimensionality, noisy environment, and pattern varieties.
The parallel coordinates approach [18] is perhaps the
simplest method to display patterns of gene expression profiles. Here the data in each dimension are plotted along a
separate axis. Holter et al. [8] have used parallel coordinates to visualize the temporal progression in the yeast cell
cycle data. Self-organizing maps (SOM) [12] is another approach. More recently, Hautaniemi et al. [6] have presented
a heat map-based strategy for visualizing the U-matrix from
SOM. The most prominent visualization-enhanced analysis tool for gene expression data is TreeView [5], which
provides a user-friendly computational and graphical environment for assessing the results from hierarchical clustering. The graphical presentations from TreeView include
a dendrogram to reflect the distance relationships between
clusters and a heat plot to visually convey gene expression
changes between samples. The heat plot can be viewed as
variation of the parallel coordinates plot in which color is
used to convey dimension values.
Here, we present an alternative mapping for multidimensional data that is based on the first harmonic of the
discrete Fourier transform. The mapping has interesting
properties and preserves certain key characteristics of a variety of data sets, especially time series data. Unlike parallel coordinates and heat plot which display all individual
dimensional information, our approach uses a two dimensional point to represent each gene over the time at the com-
putational cost of O(N log N ). It focuses on one very important aspect of the visualization: revealing the structure of
the entire dataset. Tested using two published, array-derived
gene expression time series datasets, our results indicated
that temporal patterns were well reflected in the visualization: cluster relationship became two dimensional, clusters
were apparent, and outliers were clear. An interactive visualization tool, VizStruct, was implemented to perform the
visualization.
The remainder of this paper is organized as follows. Section 2 presents the model of visualization. In section 3, we
show our analysis results. The final section discusses other
issues in this approach. Proofs for all mapping propositions
are included as appendix.
2 METHODS AND SYSTEM
The Mapping
Mapping converts multi-dimensional data to twodimensions for visualization. An array-derived profile
for M genes with N measurements results in M N dimensional data point containing real valued numerical
data. Time series data in its simplest form is merely a set
of data {yt , t = 0, . . . , N − 1} where the subscript t indicates the time at which the datum yt was observed [4]. On
the other hand, a discrete-time real signal on N evenly distributed time points is represented as an indexed sequence
of N real numbers 0, . . . , N − 1 denoted by x[n] [1]. Each
term of x[n] is denoted by x[n]. The denotation similarity
between time series and digital signal suggests that we can
view each data point in a time series as a discrete-time real
signal (It is not necessary for the signal’s time index to comply with the actual time points). In this scenario, the problem of a two dimensional visualization of the time series is
transformed into the problem of finding a two-dimensional
point estimation for signals (data points).
The frequency domain representation of discrete-time
signals is through discrete-time Fourier transform, or DFT
[13]. The DFT of a N -point signal x[n] is a frequency sequence with N complex values: F(x[n]) = [Fk (x[n])],
where each
Fk (x[n]) =
N
−1
X
nk
x[n]WN
,
k = 0, . . . , N −1.
(1)
n=0
WN = e−i2π/N is called twiddle factor.
Each harmonic Fk in the DFT is a measure of the kth
sinusoidal frequency component in the signal. For example, the zero harmonic, F0 , is the mean value, the first harmonic F1 measures the base frequency component, the second harmonic F2 measures the component in the signal that
is twice the base frequency, and so forth. Because Fourier
harmonics, in general, are complex numbers, they provide
the two-dimensional point estimate for mapping a multidimensional signal. For this reason, we refer to the mapping
as the Fourier harmonic projections. In particular, we are
interested in the first Fourier hamonic projection (FFHP):
F1 (x[n]) =
N
−1
X
n
x[n]WN
=
n=0
N
−1
X
x[n]e−i2πn/N .
(2)
n=0
The relationship between the DFT and the mapping allows the fast Fourier transform algorithm (FFT), originally
discovered by Cooley and Tukey [13], to be used for computation. The FFT is a computationally efficient algorithm
and has a complexity of O(N log N ).
The complex number of F1 (x[n]) in Equation (2) can be
expressed in terms of magnitude r and phase θ to provide a
useful geometric interpretation of the mapping. The data set
was normalized so that the range of values of each dimension across the dataset was 0 to 1. For a data point with N
dimensions, the complex exponential divides a unit circle
centered at the origin of the complex plane into N equally
spaced angles. The value of the first dimension is projected
on the radial line corresponding to θ = 0 and, similarly, the
value of the kth dimension is projected on to the radial line
corresponding to the θ = 2π(1 − k)/N radians. The overall
two-dimensional FFHP mapping is the complex sum of all
N projections from a data point. Figure 1 illustrates the geometric interpretation for a point containing 6 dimensions.
W46
W56
unit circle
F1(x[n])
W36
W06
x[n]
2
W6
1
W6
Figure 1. A geometric interpretation of the
first Fourier harmonic projection. A normalized 6-dimensional data point is shown on the
right by the stem plot. The powers of twiddle factor W6 divide the unit circle centered
at the origin into 6 equal angles and each dimension of the data point is projected onto a
different radial angle (open circle). The projections are taken complex number sum to
give a 2-dimensional image (indicated by a
filled circle).
+a
xa
origin
origin
amplitude scaling
amplitude shifting
time shifting
(A)
time shifting
(B)
Figure 2. Illustration of the effect of (A) amplitude shifting and multiplying, and (B) time shifting of
the first Fourier harmonic projection.
Properties of FFHP
Theory in Practice
The first Fourier harmonic projection has useful properties that preserve the correlation between dimensions in
the multi-dimensional data point. We summarize them as
propositions below. Detailed proofs for propositions are
provided as the appendix.
1. Data points with equal values for all the dimensions
are mapped to the origin. If x[n] = [a, . . . , a], then
F1 (x[n]) = 0.
2. Two data points whose dimension values differ due to the
amplitude shifting of a constant are mapped to the same
point. If y[n] = x[n] + a, then F1 (y[n]) = F1 (x[n]).
3. Two data points whose dimension values differ due to the
amplitude multiplying a constant are mapped to the two
points on a line through the origin. If y[n] = a x[n], then
F1 (y[n]) = a F1 (x[n]). See Figure 2A.
4. Two data points whose dimension values are transposing
each other, i.e. symmetric regarding the middle time point,
are mapped to the points symmetric to the real axis. If
y[n] = x[N − n − 1], then F1 (y[n]) = F1 (x[n]).
5. Data points that differ only because they are “time-shifted”
by d dimensions relative to each other are mapped to the
circumference of the circle that is concentric with the unit
circle and the angle between the points in the visualization
is φ = 2πd/N . If y[n] = x[n − d], then F1 (y[n]) =
d
F1 (x[n])WN
. This property is illustrated in Figure 2B.
6. Let w[n] = x[n] − y[n] be the difference between the two
N -dimensional points, x[n] and y[n]. The distance between these two points in the visualization is:
Ã
!
N
−1
X
2
kF1 (w[n])k = g0 N 1 + 2
rk cos(2πk/N ) (3)
k=1
We will demonstrate in the next section that the relative
locations of temporal profiles’ mapping images can be predicted by Propositions 1–5. For example, genes with relatively low levels of expression throughout the time line are
mapped close to the origin. Genes that steadily increase
over the time and genes that steadily decrease over the time
line are likely mapped symmetric to the real axis.
In Equation (3), the g0 is the variance of w[n], rk is
the kth-sample autocorrelation coefficient of w[n]. It can
be shown [4] that −1 ≤ rk ≤ 1 and for mutually independent random sequences or white noise, rk = 0. Equation (3) provides insight into the cluster delineation capabilities of the first Fourier harmonic projection. The variance and kth-sample autocorrelation coefficients of the difference between two points within a given cluster are likely
to be small because they will share similarities across many
dimensions. Thus, points within cluster are likely to map
close to each other in the visualization.
The first Fourier harmonic projection is well suited for
visualizing temporal patterns. FFHP has a better class separation capability when the first harmonic is more dominant
among all harmonics. Since the first harmonic is a measure
of the base sinusoidal frequency component in the signal, a
lower frequency signal tends to have a larger first harmonic.
This is the case for a large number of time series data where
most trend patterns show low frequency variations (not having multiple cycles). The first Fourier harmonic projection
is sensitive to dimension order. For time series data, the
time points (dimensions) are naturally ordered.
VizStruct System
The first Fourier harmonic projection approach is implemented as a visualization tool called VizStruct, which
is written in Java and is available on request from the first
author (lizhang@cse.buffalo.edu). The name VizStruct em-
(A)
(B)
(C)
Figure 3. Three snapshots from a dimension tour through a synthetic three-dimensional data set containing 25, 000 points. The parameter settings for (A)–(C) were h0.5, 0.5, 0.5i, h0.3, 0.5, 1i, and h0.05, 1, 1i,
respectively.
phasizes its capability of visualizing the structure of the
dataset.
3 RESULTS
Data Sets for Visualization
Dimension Tour
The dimension tour is a feature of VizStruct that allows
the user to interact with the data via dynamic animations.
It is analogous to the grand tour [2] and the user interacts
with the visualization by changing the dimension parameter associated with each dimension. The default value for
each dimension parameter is 0.5 and the individual parameters can be changed over the range −1 to +1 either manually or systematically using the program. Because no twodimensional mapping can capture all the interesting properties of the original multi-dimensional space, two points
that are close in the visualization can theoretically be far
apart in the multi-dimensional space. The dimension tour,
which creates animations that explore dimension parameter
space, can reveal structures in the multi-dimensional input
that were hidden due to overlap with other points in the visualization.
Figure 3 illustrates the capabilities of the dimension tour
using a synthetic dataset containing 25, 000 points in three
dimensions. At the default settings of dimensional parameters (Figure 3A), 5 clusters are apparent. However, during the course of the animations (Figures 3B and 3C), the
multi-layered structures of the original 5 clusters become
increasingly apparent.
Our approach was tested using two published arrayderived data sets. The rat kidney array dataset of Stuart et
al. [16] containes measurements of gene expressions during
rat kidney organogenesis. The data were downloaded from
http://organogenesis.ucsd.edu/data.html. It consists of 873
genes which vary significantly during kidney development
at 7 different time points: gestational day 13, 15, 17, 19;
newborn (N); 1 week (W); and nonpregnant adult (A).
The fibroblasts dataset of Iyer et al. [10] is the result of a
study of the response of human fibroblasts to serum. It consists of gene expressions measuring the temporal changes
in mRNA levels of 517 human genes at 13 time points,
ranging from 15 minutes to 24 hours after serum stimulation. The data were downloaded from http://genomewww.stanford.edu/serum.
Rat Kidney Dataset
In the rat kidney dataset, there are 5 discrete patterns or
groups of gene expression during nephrogenesis. Figure 4
illustrates these temporal profiles characterized by the idealized gene expressions.
Figure 5A shows how genes are classified by a hierarchical clustering algorithm. It copies the Figure 3 in [16].
Figure 5B shows the parallel coordinates of the dataset. Patterns of genes in each group comply to the profiles depicted
in Figure 4 (with some noise).
3 1
2.5
2
1.5
1
0.5
0
13151719 N W A
2
13151719 N W A
3
5
4
13151719 N W A
13151719 N W A
13151719 N W A
Figure 4. Idealized temporal gene expression profiles during kidney development. The groups were
named 1 through 5 based on the timing of their peak expression during development. Seven time
points were 13, 15, 17, 19 embryonic days; N , newborn; W , 1 week old; A, adult.
3
3
3
2
2
2
1
1
1
0
All Data
0
Group 1
0
3
3
3
2
2
2
1
1
1
0
Group 3
0
(A)
Group 4
0
Group 2
Group 5
(B)
Figure 5. Visualization of the rat kidney dataset in heat plot and parallel coordinates. (A) Dendrogram
and a heat plot from hierarchical clustering algorithm. (B) Parallel coordinates for the entire datset
and each of the gene groups.
Figures 6A-B show the visualization of the rat kidney
dataset in VizStruct under the first Fourier harmonic projection for two dimension parameter settings. There are 5
sets of colored symbols for each of the 5 gene groups. Each
symbol represents one gene across 7 time points. In Figure
6A, two big clusters are clearly apparent from the visualization. The top cluster consists of genes from groups 3, 4, and
5. The bottom cluster is comprised of genes from groups 1
and 2. Furthermore, genes from each group are aggregated.
The formation of two large clusters can be interpreted by
the temporal profiles. Groups 1 and 2 with genes which
have very high relative levels of expression in early development are quite different from groups 3, 4, and 5 for
genes that have a relatively steady increase in expression
throughout development. The visualization also indicates
that points in the upper cluster are symmetric to the points
in the lower cluster. Properties of FFHP may suggest the
reason. Temporal profiles of groups 1 and 4 suggest that
they are somewhat symmetric to the middle time point (gestational day 19). By Proposition 4, they would be mapped
to points symmetric to the real axis. On the other hand,
groups 4 and group 5 are mapped closely since they have
similar profiles except for the significantly up-regulated in
the last time point. The same arguments can be applied in
the case of group 1 vs. group 2, or group 3 vs. group 4.
Due to the noise, most boundaries between groups are
not very clear. However, the separation between group 4
and group 5 improves in Figure 6B compared to Figure 6A.
Fibroblast Dataset
Temporal patterns are slightly complicated in the fibroblast dataset. Data has been classified into 10 groups using
the hierarchical clustering algorithm by the original author
[10]. Figure 7A shows the result of the hierarchical clustering. It is a copy of Figure 3 from [10]. Figure 7B gives the
6
12
G2
G4
10
G5
4
G3
8
G3
2
6
Im[F1]
Im[F1]
G4
0
4
2
G1
0
−2
−2
G2
−4
G5
−4
G1
−6
−6
−4
−3
−2
−1
0
1
Re[F1]
2
3
4
5
−6
−4
−2
0
(A)
2
Re[F1]
4
6
8
10
(B)
Figure 6. Visualization of the rat kidney dataset in VizStruct. Five gene groups were represented
by blue plus symbols, red circles, green triangles, magenta stars, and black cross symbols. The
dimension parameters for (A) and (B) were h0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5i and h1, 0.5, −1, −0.5, 0.5, 1, −1i,
respectively.
15
10
5
0
All
Group A
Group B
Group C
Group D
Group E
Group F
Group G
Group H
Group I
Group J
Group U
15
10
5
0
(A)
(B)
Figure 7. Visualization of the fibroblast dataset in heat plot and parallel coordinates. (A) Dendrogram
and a heat plot from hierarchical clustering algorithm. (B) Parallel coordinates for the entire datset and
each of the gene groups. The gene group labelled U consists genes without label after hierarchical
clustering.
8
8
8
4
4
4
8
8
8
8
10
30
9
8
8
8
6
44
8 7 4 4
444 4
8
888 87 7 44 4
8 988 8 7 4 44
44
744
8
8888 78 777
444444
8 88
2
88
8877774
87
6 8
888
888811
888884424424
9811
8
88
6
4
6
8
4
8
2
11
4
88
4
8
9
66
22
66
2
8611
10
4
11
2
11
2
811
9
6
4
11
9
2
11
2
2
4
9
2
2
6
2
6
4
2
9911
6 96 66
11
11
11
99
24
2
2
2
1
12
2
11
11
2
2
11
2
1
1
1
1
6 9969
669
111
66
2
11
3
2
12
2
1
10
1
6
1
10
1
2
210
6
1
3
6
11
1
110
2
1
11
3
6 6 66610
1
11
10
31
1333
3
333
3
6
10
10
10
3
11
6 611
11
3
10
110
0
1
03
10
5
10
3
33310
10 55 5
10
10
0
−10
3
10
10 10
5
9
−20
−100
−80
9
−60
Re[F1]
−40
2
0
6
−2
6
6
6
−4
6
6
−6
5
10
6
−120
4
Im[F1]
Im[F1]
8
−8
−20
0
−15
20
44
4
7
4
47
44 4
7
8
74
44
8
7 7 44
42
8 88 888888 8 88
4
7
4 24
7 4
887
8
8 8 78 8
4 4
88
8 811
2
98
88
8
4
8 6 8 11 88 868
2 444
118
86 9 8
4
6
2
11
2 10
411 2
68
2
2
8 11211
9
6
11
2442
22222 4
222222
11
222
699
9
22
2
66
11
11
611
2
4
2222222
2222
11
9
2
11
11
2222
2
11
96 99999
242
212
222
66
222
2
11
1
2
2
11
21
1
2
22222
11
6 6
111111
1
111
211
6
22
222
12
2
22
2
2
11
2
1
1
1
9
2
1
2
1
1
1
2
6
1
2
1
1
1
1
1
9
11
1113111
1
11111221
1112
2
2
10
1
96 6
1
2
910
6
1
1
6
10
11211111 2 10
6
11 1333
111111
11
2
111111
66
333
1
1
11 11
10
10 3313
33
3333313 3
3 3 333 1
6
6
10
10
1010 3 3 3 3
11
1110
10
11
10 5
10
10 3
3
3
10
3
5
10
55
10 10
6
8
8
4
4 4
7
8
88
8
20
7
8
8
8
8
8
4
7
78
8
−10
−5
0
5
5
10
Re[F1]
(A)
(B)
10
10
5
5 were used for each
Figure 8. Visualization of the fibroblast dataset in VizStruct. 11 colored numbers
of the gene groups. (A) The mapping of the entire dataset. (B) Enlarged portion of the visualization
10
in (A).
11
7
2
7
2
10
30
2
7
8
2
0
−10
4
−20
−100
−80
−60
Re[F1]
−40
2
0
10
−2
10
10
10
−4
10
−6
5
1
−120
4
Im[F1]
Im[F1]
10
−8
−20
0
20
(A)
−15
11
11
1111
9
1111
9 9
99 9
9
11
9
99
3
99
11
99
39
9
9 9
3
9
8
22
8
99
9
2
2
2
2
11
2
11
9 9 11 11
11
11
1111
2
2
2 9 99 9 11
1111
2 222 8 9 9 11
11
99999
2
2228 9 9911
2 88
3
9999999399911
8
8
8
8
8
8
8
2 2
888 999
2 8889
8
8
8
3
8
8
83333
810
888888
3
9
88
33
10
3933
8888810
38
38
8
8
3
33333
33
10
10
3
2 10
88383
10 10
3
3
3
810
810
33
33
8
8
33
3
33
10
1010
88
10
3
833
3
3
33
10
33
10
3
88
10
3
38
33
10
10
3
3
883
101010
10
3
3
3
3
33333
10
6
336836
10 1010
10
3
3
160
63
6336
66 66 6
6
6
6 6
6
6
9
9
9 9
88
88 8888 8 88
9
8
9
8
8
8 8 98 8
8 888 8 8
88
88
8
8
8
3 39
10
8 8
88
3
88
8 8 3 8 33 33
10
9
88 8 8
3333 3 3
3 8 38
8
10
3
3333
3333
333333
333
333
888
10
10
3
3
10
3
3
10
333333
33333 33
2
3
888
333 33
3333
88 10
108
3
33
3333
3333333
8 3 8 33
3333
3
8 8
3
10
333333333
3
333333
3
333
333
3333
33
8
3
3
3
3
3
10
10 10 10
3
8
3
3
3
3333333333333
3
8810 3
10 8
33
333333
3333 33
10
3333
3
10
3
8
3
3
3
3
3
33
3
10
3
3
10
3
33
810 8
3 3333
33333333333
3
3 3 333 3
10
10
6
8
63
3
3
1010
33 3
10
10
6
6
6 33
3
6
6
3
6
6 6
66
6
2
6
2
7
9
2
2
2
20
9
9
−10
−5
0
6
5
10
Re[F1]
(B)
Figure 9. Visualization of the fibroblast dataset in VizStruct. Genes were grouped by the k-means
clustering method. 11 colored numbers were used for each gene groups. (A) The mapping of the
entire dataset. (B) Enlarged portion of the visualization in (A).
parallel coordinates the entire dataset and each gene group.
Figure 8A shows the visualization of the fibroblasts
dataset in VizStruct under the first Fourier harmonic projection. Colored numbers 1 through 11 were used for each
gene group. The visualization reveals several outliers and
no distinct clusters. The majority of data are too dense to be
seen. By zooming technology, the enlarged detail is shown
in Figure 8B. More numbers spread out, but most blue numbers (1) were still covered by numbers 2 and 3.
Two observations can be made: (1) a large number of
genes are close to the center. (2) most clusters have a radial
shape emitting from the center. They can be interpreted by
the FFHP properties. (1) Temporal patterns of group A and
group B have very flat shape and relative smaller values. By
Propositions 1 and 2, they tend to be mapped closed to the
origin. (2) In hierarchical clustering, Pearson’s correlation
coefficient was used to measure the similarity. Genes whose
time values differ due to the amplitude shifting or multiply-
4
6
4
2
2
Im[F1]
y
1
0
0
−1
−2
−2
−4
−3
3
2
1
801
y
3
4
516 444
629
480451
536528
671
489
519
511
572
582
600
601
452
690
471
670
566
676
420503 574
466 687
520
672
605
795 775
669 529661
611
571
443
649
424 531
630
607
624
680
620
713
625
465
660
657
513
662
825
486
652
614
645
493
619
639
506
527
656
525
866
653
450
742
508
457
554
655
689
558
683
488
779
793
704
569
398
469
651
856
504
562
470
754
524
477
584
646
460
515
635
534
468
628
502
637
449
664
426
710
402
583
564
602
563
650
603
799
467
741
567
463
733
715
474
850
458
552
834
764
705
831
618
706
707
640
634
440 411
654
717
753
535
851
540
559
612
530
432
835
577
638
827
523
517
538
668
697
785
716
514
501
838
446
833
702
695418 596
868
410
678
857
500
553
461
748
841
8712
60
859
573
755
696
867
475
632
555
568
814
401
633
855
580
666
485
512
586
663
539
824
576
826
522
863
759
597
729
773
693
585
399
836
497
476
575
798
613
479
865
473
578
721
726
641
849
509
679
423
848
487
694 482
803
718
692
763
526
551
720
811
409
546
828
734
598
459
760
483
544
783
421
631
595
581
491
788
556
708
862
434
691
846
615
809
853
771
818
496
806
543
782
472
454
667
561
688
762
845
821
644
507
832
819
744
787
681
797
790
665
864
709
725
430
684
394
647
840
686
738
839
484
490
740
429
636
842
626
495
415
453
604
627
419
404
858
756
768
510
673
776
549
565
761
587
579
730
658
412
675
455
784
804
674
570
621
752
617
428
822
829
810
659
403
532
478
700
541
414
789
816
808
791
677
802
591
438
719
777
609
847
743
594
616
442
550
815
745
844
518
781
746
547
439
852
437
778
682
758
462
750
807
643
794
820
545
610
492
505
648
770
727
698
622
642
796
416
405
751
448
736737
456
701
589
703
711
435
800
830
422
823
780
735
431
521
724
805
427 623
817
837
557
722
739
699
749
685
854
408407
723
772
537
732
747
766
731
396
786
792
395
397
714
341 599 588 494
606
400
533
447
590
433
608
813
767
481 498 464
441 765
593
560
812
757
386
365 357
380
774 843
406499
370
436
861
385
413
391 417 387 361
346
445
362
769
358
372
592 548
247
342
49
97309
327 324373 355
728139
231
184
376 326368
425
352367
266
359
7
331
329 332 34
252
129
171
369
351
383
542
299 215
350 334
112
183
221
65
304
313
238
98
356
328
113
33
185
390
389
330
348 333
381
197 188
382
80 224 13
255
293
250
392337
288
354
220
152
375
107
364
343
11320
321
344
287
101
39
2 369
677
338379
310
349
345
377
339
256
353
23204
46
41208
210
38 147
384
8
292
92296
213
162
1245
295
173
5048
393 388 218
159
156
164227
322
106
66
118
78
219
336 335340
306
40 241
347 378
300
109
264
189
30
285
194
61
94145
72
168
140
86
125
317
91
1020
26
245
100
96
120
294
57
52
206
363
283
277
73
27
174
36
60
75
366
311
137
28
274
284
68314
232
126
32
325
298265
226
53
18132
144
212
281
35
315
170
151
222
119
192
286
44
318
127
58
172
85
117
160
279
374
289
81
43
229
248
63
95
149
240
74
205
190
1154
133
180
239
166
360 8831
54
235
143
108
4186
225
263
233
55
278
51
24
25
163
267
134
253
181
47
146
223
179
157
167
102
116
130
254
202
19
29200
191 262
104
280
136
211
141
70
216
138
243
79
105
148
150
175
214
62
249
76
128
42
83
209
242
307
269
178
319
273
237
371
9217
228
90
22
260
165
114
89169 21
282
272
308
203
99
234
207182
268
193
230
323
67
246
123
124
7137
261
201
131
259
303
176
270
158
199
316
177
8784 122
305
5121
115
111
271
275
17
103
257
15
14 64
251
244
142
195
16
110
155
56
82
135
236
198
297
93
161
153
276196
312
59290 258
302187
0
−1
−2
−3
291
301
−4
−3
−2
−1
0
x
1
2
3
−6
−4
−3
−2
−1
(A)
0
1
Re[F1]
2
3
4
857
717
707
848762
835846
742
777
839
804
715826
756
845
746
748
779
840
758
815
582 451 775795
825
754
782
770
730
725
781
716
866
713
741
760
807
705798
745
511 571
778
763
799
796
704
793
727
853
744
831
771
709
856
624
734
720
830
833
465
649
834
452
827
790
824
721
850
764
787
719
718
768
712
735
785
789
867
844
635
676
863
761
800
751
471 443
829
516
862
706
729
723
759
818
740
524
655
836
605
536 601
838
868
652
733
743
672
529
788
466
842
726
859
784
474
776
783
806
851
816
819
860
753
558463
849
480489 687619
797
737 786
708
773
660
852
821
467
671
444
710
520620
855
752
832
780
614
847
470
618
457
865
517
724
506
639
858
567
750
792
535
569
772
528
562
703
553
808
854
803
811
828
822732
486
755
476
634
629
578
664
523
661525
487
650
458
656
459
731765
767
794
502
539
814
531
563
739
736
491
508
514
810
693
572
820
597
515
864
841
526
581
738
645 577
823
640
666
423
791
654
670
689
510
401
749
632
575
453
564
747
584
552
534
659
667
714 722
625
507
837
690
658
503
662
668
817
805
541
711
555
801
477
678
700
497
475
688
468
701
702
607651 583
456
679
766757
455
449
540
646
696
576
512
538
446
697
460
580
428
469
532
695 566
802
568
462
500
545
653
809
573
680
615
504
630
513
579
546
501
521
549
600 574657450 473
394 420
843
519
585
604
698537
813
686
621
565
530692
627
544
647
570
533774
559 613
589
557608
496
442
424 554
505
419
488
493
812
663
590
683
492
611 669
561
454
637 665
441
681
550 448
399602
641
596 461
691
495642 606 769 728
479
398
628587631
643
426
402
414
447 861
411
610
494560
603612
410527
633
522595
648
542
412
440
622
547
638
431
464
472623
598
636
675
432
616
418
684
434
309
685
694 586
617
644
490
599 588
593
139
551
409437 413699416
415
421
591
556
184
485
677
509
397
435
478
482
425
518
673
345
483
405
422
427
429
439
147
543
594
430 484
404
499
49
407403682
400
438
626
97
548381
481
38
370
674
255
12
408395
386
433 396380
39
150
159
445
371 112
200 310
304171
436
129 185
362
32
592
365417
262
188
220
227
113
238 197
391357
609
164
327
297
375
406
237
78132
65 33
389
385
231
346 329368 334
45
6
81 315
13 204
253
342333
2205 252
20
336 325
285 498341 355
28
162
25
10
247224
226
210211
349
101
229
215
277
324 331390
105
183
154
240
87
51
358
283
212
55
344
59
146
254
30
18
242
125
191 21122
300
134235
364 356 328
77208
2622
151
189
6973
80374
60
165
248
340
281
15
387 330
52
260
424
44170
332
61
23
286
388
263
19
95
120219
41
124
383
294
156
282 64
100
393
270
256
299
382 109
192
46
209
264
36
222
74
54
72
99 103
169
379
127
366
243
392
68 174
7 35934 384168
343
232
126
108
303
354
27
216
373
172
40157
43
123
90284
136
163
89
86106
14
138
319
1
351
58
83
372
367 363 133
376
130
117
63
142
361
338
118
128
153
308
8 249
5
71
279
239
236
32657 339
233
131
140
33511
160
116
257
42
312
273
84
119
76
137
241
114
56
234
144
70
311
186
275
258
337
274
259
3
360
177
251
67
246
190
98
316
47 203
75
305
352350
217
182
228
35
152
272
107
158
225
261
195
317
149
16
121
265
193
377
91278
214
347 369 353
161
53
135
88
145
115
267
298
175
230
199
181
37244
104
148
85
296
82187
223
268
111
321
280
293194
290
302291
178
102
266
289
269
201
318
166
31
93
378
323
348179
306
202
92
9
29
94
295
167 176 196
207
276
180
221
322
143
62
79
250
48
110
218
314
198 155
320 17
66
271
245
287
50
292
301
173
288
313
96 307
141
213
206
−4
5
(B)
−3
−2
−1
0
x
1
2
3
(C)
Figure 10. Comparison of Sammon’s mapping and the first Fourier harmonic projection. (A) The
result from Sammon’s mapping of the rat kidney dataset. Five gene groups were represented by 5
colored symbols the same as were used in Figure 6. (B)–(C) Comparison of relative gene locations
between (B) VizStruct and (C) Sammon’s mapping. Five colors were used for each of the gene
groups. The color schema is the same as in (A). Notice that the visualization layout of (A) and (C) is
identical. The purpose of panel (C) is to compare the gene by gene location to the panel (B).
ing a constant have very high coefficient values. By Propositions 2 and 3, they tend to be mapped closely along a line
though the origin.
Some relative gene group locations can be predicted. For
example, by Proposition 4, groups 4 and 5 are mapped symmetric to the real axis. Close inspection suggests that temporal pattern of group 8 is somewhat a 2 or 3 right time-shift
from the pattern of group 6. By Proposition 5, each time
point shift correspondents to roughly 2/13π ≈ 30◦ rotation. In Figure 8B, genes in group 8 are mapped about 60◦
clockwise to genes in group 6.
Lacking clear clusters in the visualization indicates that
different clustering methods may yield different results.
Figure 9 shows the grouping aspect of k-means clustering. Euclidean distance was used in k-means as the distance
measure. The visualization reveals that genes closed to the
origin are grouped as one.
Comparison of FFHP to Other Visualizations
Heat plot and parallel coordinates display all individual
dimensional information. In parallel coordinates, a polyline
represents a gene over time. This format is a concise and
intuitive for displaying temporal profiles. However, the parallel coordinate has obvious drawbacks: when the data size
becomes larger it becomes increasingly unreadable as indicated in the first panel of Figure 5 and Figure 7. Heat plot
uses a color mosaic instead of a polyline to overcome over-
lapping. Combined with the dendrogram from hierarchical
clustering, it gives a global view as well as individual clusters of the dataset by grouping genes with similar patterns
together. Cluster relationships in heat plot is one dimensional: clusters of genes are listed one by one as shown in
Figure 5A and Figure 7A.
The first Fourier harmonic projection takes a different
approach. It uses a two dimensional point to represent each
gene over time. By doing so, it takes advantage of the spatial relationship of the points to reflect the structure of the
original dataset. Thus the cluster relationships become two
dimensional and FFHP has a better capability of displaying
outliers than heat plot. Unlike heat plot, which requires algorithms to group similar genes to make a meaningful visualization, FFHP directly mapped the multi-dimensional data
onto a two dimensional space without any prior knowledge
of the dataset.
Multidimensional scaling (MDS) [3] is a competing approach for visualizing multi-dimensional data. We compared FFHP to Sammon’s mapping [15], a variant of MDS
that optimizes the following stress function E:
1
E = XX
i
X X (d∗ij − dij )2
,
d∗ij
d∗ij i j<i
(4)
j<i
where d∗ij is the distance between points i and j in the N dimensional space and dij is the distance between i and j
in the visualization.
Figure 10 shows Sammon’s mapping of the rat kidney
dataset. A comparison of Figure 10A to Figure 6A reveals the extensive similarities between FFHP and Sammon’s mapping. The relative locations of individual samples are also remarkably similar. This is indicated in Figure
10B and Figure 10C.
Sammon’s mapping has some drawbacks: (1) it provides
a single final result and the user cannot intervene interactively during visualization, (2) the incremental addition of
even a single point requires a complete repetition of the optimization procedure and possible extensive reorganization
of all the previously mapped points to new locations, and
(3) it requires time-consuming optimization procedures of
time complexity O(N 2 ) or greater.
Our results illustrate some of the strengths and weaknesses of FFHP. The visualization reflects the structure of
the dataset: outliers, clusters and their relationships. Comprehending the structure of the dataset can facilitate the
choice and understanding the results of different clustering
methods. Although the FFHP uses an approach to mapping
multi-dimensional data that is distinctively different from
Sammon’s mapping, it yields results that are consistently
comparable. However, when the dataset contains a large
number of patterns, it becomes difficult to separate different
patterns and the visualization may be difficult to interpret.
4 DISCUSSION
Visualization of microarray data is a challenge because
of its high dimensionality. In this paper, we have explored
use of the first Fourier harmonic projection for visualizing
multi-dimensional time course array datasets. Unlike parallel coordinate or heat plot approach which display all dimensional information, FFHP uses a two dimensional point
to represent each data point (gene in this case). Our results indicated that temporal patterns were well reflected by
spatial relationships: genes with similar pattern were aggregated and relative locations of gene groups can be predicted.
Moreover, the first Fourier harmonic projection was shown
to yield results that were similar to those from Sammon’s
mapping.
Achieving two dimensional mapping requires a tradeoff. The mapping is lossy for detailed dimensional information. Our approach attempts to preserve to the maximum
semantics of the data points via Fourier harmonic aspect.
In particular, characterizing the data using two descriptive
measurements: the real and imaginary portions of the first
discrete Fourier harmonic. A similar approach uses principal component analysis (PCA) [11]. This visualization
deploys another two descriptive measurement: the first and
second principal component.
In addition to the first Fourier harmonic projection,
higher Fourier harmonics can also be used as mappings.
It can be shown that for any harmonic (> 1), there exits
an equivalent first harmonic of the original discrete signal
whose time indices are systematically rearranged. At certain conditions (such as temporal patterns with high frequency variations, i.e. multiple cycles), higher harmonic
projections enhance substructure separation in the visualization. Detailed discussion is beyond the scope of this paper due to a length constraint.
The FFHP mapping results in two-dimensional visualizations that are identical to those of radial coordinate visualization techniques, e.g., RadViz [7]. However, rather than
vector notation and the spring paradigm of RadViz, we have
used a complex number notation. This substantive reformulation of the mapping provides valuable theoretical insights
and allows important properties of mapping, including its
relationship to the DFT, to be easily derived. It can also
create possible extensions such as higher Fourier harmonic
projections.
The first Fourier harmonic projection requires data without missing values. This requirement can be easily met
because filling in missing values is a mature research field
[17].
Gene expression data can be studied in either sample
space or gene space. Here, we have reported only the visualization on gene space. In a separate report [19], we applied
the first Fourier harmonic projection on the sample space
and performed visualization-driven classifications. Our experiments demonstrated that FFHP offered an alterative format of visualization. We believe that using FFHP alone or
in combination with heat plot or parallel coordinates would
give a biologist additional powerful tools for analyzing and
visualizing microarray data sets.
ACKNOWLEDGEMENTS
This work was supported by grants from the National
Science Foundation. We also thank the anonymous reviewers for their constructive comments on the manuscript.
References
[1] Cadzow, J. A., Landingham, H. F. Signals, Systems, and
Transforms. Prentice-Hall, Inc., Englewood Cliffs, NJ,
1985.
[2] Cook, C., Buja, A., Cabrera, J., and Hurley, C. Grand
Tour and Projection Pursuit. Journal of Computational and
Graphical Statistics, 2(3):225–250, 1995.
[3] Davision, M. L. Multidimensional Scaling. Krieger Publishing, Inc., Malabar, FL, 1992.
[4] Diggle, P. J. Time Series: A Biostatistical Introduction. Oxford University Press, Oxford OX2 6DP, 1990.
[5] Eisen, M. B., Spellman, P. T., Brown, P. O., and Botstein, D.
Cluster Analysis and Display of Genome-wide Expression
Patterns. Proc. Natl. Acad. Sci. USA, Vol. 95:14863–14868,
December 1998.
[6] Hautaniemi, S., Yli-Harja, O., Astola, J., Kauraniemi, P.,
Kallioniemi, A., Wolf, M., Ruiz, J., Mousses, S., and
Kallioniemi, O. Analysis and Visualization of Gene Expression Microarray Data in Human Cancer Using SelfOrganizing Maps, 2003. To appear in Machine Learning:
Special Issue on Methods in Functional Genomics.
[7] Hoffman, P. E., Grinstein, G. G., Marx, K., Grosse, I., and
Stanley, E. DNA Visual and Analytic Data Mining. In IEEE
Visualization ’97, pages 437–441, Phoenix, AZ, 1997.
[8] Holter, N. S., Mitra, M., Maritan, A., Cieplak, M., Banavar,
J. R., and Fedoroff, N. V. Fundamental Patterns Underlying Gene Expression Profiles: Simplicity from Complexity. Proc. Natl. Acad. Sci. USA, Vol. 97(15):8409–8414, July
2000.
[9] Ideker, T., Galitski, T., and Hood, L. A New Approach
to Decoding Life: Systems Biology. Annu. Rev. Genomics
Hum. Genet., 2:343–372, July 2001.
[10] Iyer, V. R., Eisen, M. B., Ross, D. T., Schuler, G., Moore,
T., Lee, J. C. F., Trent, J. M., Staudt, L. M., Hudson, J. Jr.,
Boguski, M. S., Lashkari, D., Shalon, D., Botstein, D., and
Brown, P. O. The Transcriptional Program in the Response
of Human Fibroblasts to Serum. Science, Vol. 283(1):83–87,
January 1999.
[11] K. Y. Yeung and W. L. Ruzzo. Principal Component Analysis for Clustering Gene Expression Data. Bioinformatics,
Vol. 17(9):763–774, 2001.
[12] Kohonen, T. Self-Organizing Maps, Springer Series in Information Sciences, volume Vol. 30. Springer, Berlin, Heidelberg, New York, 1995.
[13] Morrison, N., editor. Introduction to Fourier Analysis. John
Wiley & Sons, Inc., New York, NY, 1994.
[14] Parmigiani, G., Garrett, E. S., Irizarry, R. A., and Zeger, S.
L., editor. The Analysis of Gene Expression Data: Methods
and Software. Springer-Verlag New York, Inc, New York,
NY, 2003.
[15] Sammon, J. W. A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5):401–409,
1969.
[16] Stuart, R. O., Bush, K. T., and Nigam, S. K. Changes in
Global Gene Expression Patterns During Development and
Maturation of the Rat Kidney. Proc. Natl. Acad. Sci. USA,
Vol. 98(10):5649–5654, May 2001.
[17] Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P.,
Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics, Vol.17(6):520–525, 2001.
[18] Ward, M. O. XmdvTool: Integrating Multiple Methods for
Visualizing Multivariate Data. In IEEE Visualization 1994,
pages 326–336, 1994.
[19] Zhang, L., Zhang, A., Ramanathan, M. et al. VizCluster and
Its Application on Clustering Gene Expression Data. International Journal of Distributed and Parallel Databases,
13(1):79–97, January 2003.
Appendix: Proof for Propositions
Lemma 1
ÃN −1
X
!2
an
=
N
−1
X
n=0
a2n + 2
n=0
−k−1
N
−1 N X
X
at at+k .
t=0
k=1
Lemma 2 For any two complex numbers z and w, (1) z + w =
z + w, (2) z w = z w, (3) z = z.
Lemma 3 Let j ∈ N, then
=
N
−1
X
cos(2πjn/N ) =
n=0
N
−1
X
e−i2πjn/N
n=0
N
−1
X
sin(2πjn/N ) = 0.
n=0
Lemma 4 FFHP is homomorphic: F1 (a x[n] + b y[n]) =
a F1 (x[n]) + b F1 (y[n]).
Proposition 1 (Cancellation) If x[n]
F1 (x[n]) = 0.
=
[a, . . . , a], then
Proof : From the formula in Eq. (2), we get
F1 (x[n]) =
N
−1
X
a e−i2πn/N = a
n=0
N
−1
X
e−i2πn/N .
n=0
By Lemma 3, let j = 1. F1 (x[n]) = 0.
¤
Proposition 2 (Amplitude Shifting) If y[n] = x[n] + a, then
F1 (y[n]) = F1 (x[n])
From the formula in Eq. (2), we get
F1 (y[n])
=
N
−1
X
(x[n] + a)e−i2πn/N =
n=0
+
N
−1
X
N
−1
X
x[n]e−i2πn/N
n=0
a e−i2πn/N = F1 (x[n]) + 0
n=0
The second summation is 0 by Proposition 1.
¤
Proposition 3 (Amplitude Multiplying) If y[n] = a x[n], then
F1 (y[n]) = a F1 (x[n]).
From the formula in Eq. (2), we get
F1 (y[n])
=
N
−1
X
a x[n]e−i2πn/N
n=0
=
a
N
−1
X
x[n]e−i2πn/N = a F1 (x[n])
n=0
¤
Proposition 4 (Transposing) y[n] = x[N − n − 1], then
F1 (y[n]) = F1 (x[n]).
Proof : By Lemma 4, the distance between F1 (y[n]) and
F1 (x[n]) is kF1 (w[n])k. From Eq. (2), we get
By Lemma 2, we have
F1 (x[n])
=
N
−1
X
x[n]e−i2πn/N =
n=0
=
N
−1
X
N
−1
X
x[n]ei2πn/N
n=0
kF1 (w[n])k
x[n]ei2πn/N
=
k
= F1 (x[−n])
N
−1
X
w[n]e−i2πn/N k
n=0
n=0
=
However, when x[n] is a real signal, x[n] = x[n]. Then we have
k
N
−1
X
w[n] cos(2πn/N ) − iw[n] sin(2πn/N )k
n=0
F1 (x[n]) = F1 (x[−n]). i.e., F1 (x[n]) = F1 (x[−n]).
Since x[N − n − 1] = x[−n] then F1 (y[n]) = F1 (x[n])
¤
Let ω = 2π/N , by Lemma 3, we have
N
−1
X
cos(nω) =
n=0
Proposition 5 (Time Shifting) If y[n]
d
F1 (y[n]) = WN
F1 (x[n]).
=
x[n − d], then
N
−1
X
sin(nω) = 0. Now add a term w,
b the mean of w[n],
n=0
Proof : Assume 0 ≤ n < N , let l = n − d, then n = l + d. When
n = 0, l = −d and when n = N − 1, l = N − 1 − d. From the
formula in Eq. (2), we get
NX
−1−d
F1 (y[n]) = F1 (x[n − d]) =
kF1 (w[n])k
2
=
NX
−1−d
d
x[l]e−i2πl/N e−i2πd/N = WN
l=−d
x[l]e−i2π(l+d)/N
=
NX
−1−d
x[l]e−i2πl/N
+
=
l=−d
−1
X
N
−1
X
l=−d
+
x[l] e−i2πl/N
=
l=−d
N
−1
X
x[t]e−i2πt/N
2
N
−1 N X
−1−k
X
+
=
(w[n] − w)
b sin(nω)
kF1 (w[n])k2
x[t]e−i2πt/N
[(w[t] − w)(w[t
b
+ k] − w)Ω]
b
where Ω = cos(tω) cos((t + k)ω) + sin(tω) sin((t + k)ω).
By trigonometry identity cos θ cos φ + sin θ sin φ = cos(φ −
θ), we have Ω = cos(kω). Now
=
N
−1
X
(w[n] − w)
b 2
n=0
t=0
N
−1
X
!2
t=0
k=1
t=N −d
NX
−1−d
ÃN −1
X
(w[n] − w)
b 2 (cos2 (nω) + sin2 (nω))
+
Let t = l + N for the first summation and t = l for the second
summation, we get
x[l]e−i2πl/N
(w[n] − w)
b cos(nω)
n=0
l=0
NX
−1−d
ÃN −1
X
w[n] sin(nω)
n=0
!2
Expending each squaring term by Lemma 1, we get
x[l + N ] e−i2π(l+N )/N
NX
−1−d
+
!2
n=0
However, ei2πn/N = ei2π(n+N )/N and x[n] = x[n + N ],
x[l]e−i2πl/N
w[n] cos(nω)
ÃN −1
X
n=0
l=−d
NX
−1−d
!2
n=0
l=−d
=
ÃN −1
X
+
x[t]e−i2πt/N = F1 (x[n])
2
N
−1 N X
−1−k
X
t=0
d
Therefore, F1 (y[n]) = F1 (x[n])WN
.
¤
Definition
1 The mean of a signal x[n] is defined as x
b =
PN −1
autocovariance coefficient of a
n=0 x[n]/N . The k-th sample
P −1−k
signal x[n] is defined as gk = N
(x[n] − x
b)(x[n + k] −
n=0
x
b)/N . g0 is called the variance of x[n]. The k-th sample autocorrelation coefficient is defined as rk = gk /g0 .
Proposition 6 (General Distance) Let w[n] = x[n]−y[n] be the
difference between x[n] and y[n]. The distance between F1 (x[n])
and F1 (y[n]) is
Ã
!
N
−1
X
kF1 (w[n])k2 = g0 N 1 + 2
rk cos(2πk/N ) .
k=1
=
N (g0 + 2
Ã
=
g0 N
[(w[t] − w)(w[t
b
+ k] − w)
b cos(kω)]
t=0
k=1
N
−1
X
gk cos(kω))
k=1
1+2
N
−1
X
k=1
!
rk cos(2πk/N ) .
¤
Download