Spatial Ordering and Encoding for Geographic Data Mining and Visualization
Diansheng Guo¹ and Mark Gahegan²
¹ Department of Geography, University of South Carolina, 709 Bull Street, Columbia SC 29208, Email: guod@sc.edu
² GeoVISTA Center, Department of Geography, Pennsylvania State University, 302 Walker, University Park, PA 16802, Email: mng1@psu.edu
Abstract:
Geographic information (e.g., locations, networks, and nearest neighbors) is unique and
different from aspatial attributes (e.g., population, sales, or income). It is a challenging
problem in spatial data mining to take into account both the geographic information and
multiple aspatial variables in the detection of patterns. To tackle this problem, we present and
evaluate a variety of spatial ordering methods that can transform spatial relations into a
one-dimensional ordering and encoding which preserves spatial locality as much as possible.
The ordering can then be used to spatially sort temporal or multivariate data series and thus
help reveal patterns across different geographical spaces. The encoding, as a materialization
of spatial clusters and neighboring relations, is then amenable to processing together with
aspatial variables by any existing (non-spatial) data mining methods. We design a set of
measures to evaluate nine different ordering/encoding methods, including two space-filling
curves, six hierarchical clustering based methods, and a one-dimensional Sammon mapping
(a multidimensional scaling approach). Evaluation results with various data distributions
show that the optimal ordering/encoding with the complete-linkage clustering consistently
gives the best overall performance, surpassing well-known space-filling curves in preserving
spatial locality. Moreover, clustering-based methods can encode not only simple geographic
locations, e.g., x and y coordinates, but also a wide range of other spatial relations, e.g.,
network distances or arbitrarily weighted graphs. We briefly introduce two example
applications of the spatial ordering and encoding in spatial-temporal data mining.
Keywords:
Spatial data mining, spatio-temporal visualization, space-filling curve, hierarchical clustering,
linear ordering
1 Introduction
To engage in spatial data mining is to explore large spatial databases for patterns that involve
geographic properties and relations (Han and Kamber 2001; Miller and Han 2001; Shekhar et
al. 2004). In addition to common aspatial variables, spatial information can add orders of
magnitude to the complexity of potential patterns. A challenging problem for spatial data
mining is to consider both spatial and aspatial aspects together, for example, detecting
multivariate spatial clusters (Murray and Shyy 2000), deriving classification rules that
involve spatial relations (Koperski et al. 1998), or discovering spatial association rules
(Koperski and Han 1995). There are three options by which spatial information can be
integrated with non-spatial attributes: (1) materialize implicit spatial relations into data
variables and then apply general-purpose data mining techniques; (2) develop specialized
spatial data mining techniques (Shekhar et al. 2004); or (3) use multiple views to visually
explore patterns across different spaces (Andrienko and Andrienko 1999; Guo et al. 2003;
Guo et al. 2005). In this paper, we focus on the first option, i.e., transforming spatial relations
into a one-dimensional encoding, which can then be processed (together with other aspatial
variables) by any traditional data mining methods.
We distinguish two different types of spatial relations that can be materialized: external
spatial relations and internal spatial relations. An external relation is defined between two
different types of spatial entities while an internal relation is defined within the same type of
spatial entities. Suppose we analyze a dataset of traffic accidents on a road network. An
example of an external relation that can be extracted for each accident is its distance to the
nearest hospital or road intersection. On the other hand, the distance between each accident to
every other accident is an internal relation. The materialization of external relations is
straightforward, although it can be overwhelming if there are many such relations to extract.
Internal relations, however, require complicated strategies to materialize. Although we can
use various spatial clustering techniques to examine the internal relations of traffic accidents,
it remains a difficult research problem to materialize the spatial cluster structure inherent in
the data, to integrate it with other attributes, and then process them together with a
general-purpose data mining method, e.g., a decision tree.
Figure 1 shows a simple example that illustrates the above problem. The task of
delineating a cluster in a geographic space, or a spatio-temporal space, requires the decision
tree to learn at least two decision rules or hyperplanes for each spatio-temporal dimension
included (i.e., 4 rules to describe a cluster in x, y; 6 for x, y, t; 8 for x, y, z, t). This complicates
the learning task even before we consider other non-spatial attributes. The more decisions
needed to describe a cluster, the harder it will be to ‘discover’; if the tools employed cannot
increase information gain or receive some other form of positive feedback at each stage in the
delineation of a cluster, then the cluster may never be uncovered. In addition, the fact that
spatial dimensions are highly interdependent often leads to a highly fragmented space,
possibly containing an assortment of shapes that can be difficult to describe with any fidelity.
Figure 1: Suppose that four clusters are contained in a spatial dataset, represented by ellipses
A-D. Note each ellipse is approximated with 4 values—decision rules in this instance. Ellipses A
and C are represented with reasonable fidelity, whereas B and D are not, since they are not
orthogonally oriented.
One way to address the above difficulties is to force the spatial data into a
one-dimensional ordering or encoding that, as far as possible, preserves the spatial clusters
inherent in the data. Since the early work of Morton (1966), researchers have developed a
number of methods to achieve spatial orderings that preserve ‘locality of reference’ within the
data. These orderings are usually employed to speed up data retrieval by storing similar data
on sectors of the storage medium that are physically close to each other and thus minimizing
time-consuming movements of a read-write head. We can also use such methods to simplify
the task of detecting clusters since points that are close to each other in space will ideally
appear in an approximately linear sequence. In other words, if points are numbered according
to their position in the sequence, points in a cluster will have similar values. Thus, clusters
can (ideally) be delineated using only two decision rules (i.e., the start and end points of the
sequence), simplifying their discovery and description. Space-filling curves have been used
to transform multidimensional data into a unit interval and thus reduce the complexity of
the pattern recognition problem (Skubalska-Rafajlowicz 2001). Other related research transforms
images into signals using a space-filling curve and then analyzes the obtained signal using
one-dimensional wavelet orthogonal bases (Lamarque and Robert 1996).
However, space-filling curves first divide the data space into regular grid cells and then
traverse the grid cells along a predefined route, and therefore have limited ability to adapt to
the specifics of different data patterns. Moreover, space-filling curves are not able to preserve
patterns in a non-Euclidean space, e.g., arbitrarily weighted networks or graphs.
In this paper, we introduce a family of spatial ordering and encoding approaches based on
hierarchical clustering. A cluster-based method is able to adapt to various data distributions.
Moreover, cluster-based methods can encode not only simple geographic locations, e.g., x
and y coordinates, but also a wide range of non-Euclidean relations, e.g., network distances or
arbitrarily weighted graphs.
We systematically evaluate and compare traditional space-filling curves (including the
Morton curve and the Hilbert curve), a one-dimensional Sammon mapping (which is a
multidimensional scaling method), and six hierarchical clustering based ordering methods.
The remainder of the paper is organized as follows. Section 2 gives a review of related
research. In section 3, we introduce a set of spatial ordering and encoding approaches,
including space-filling curves, the Sammon mapping, and hierarchical clustering based
ordering methods. Section 4 presents a systematic evaluation of different ordering methods
using a set of synthetic datasets of various data distributions, a real data set, and a series of
measures. We then in section 5 briefly introduce two applications of the spatial ordering and
encoding in multivariate spatial clustering and spatio-temporal visualization. We conclude the
paper with a summary and discussions.
2 Related Work
Research efforts addressing spatial data mining problems have taken a variety of directions,
for example, spatial clustering (Ng and Han 1994; Openshaw 1994; Wang et al. 1997; Murray
and Shyy 2000; Han et al. 2001), spatial classification (Gahegan 2000), spatial association
rule mining (Koperski and Han 1995), operators in spatial databases (Ester et al. 1997), visual
approaches (Andrienko and Andrienko 1999; Andrienko et al. 2003; Keim et al. 2004), and
integrated systems (Han et al. 1997; Guo et al. 2005).
A common problem often encountered in spatial data mining research is the integration of
aspatial attributes with spatial information. As mentioned earlier, there are three options to
address this problem: materializing implicit spatial relations into data variables, developing
specialized spatial data mining techniques, or using multiple views to visually explore
patterns. In this paper, we specifically focus on the first option and evaluate approaches that
can transform spatial relations into a one-dimensional ordering and encoding. To materialize
internal spatial relations (see previous section for the definition), e.g., spatial cluster
structures or neighboring relations, there are several research areas that can provide candidate
methodologies, including space-filling curves, multidimensional scaling (MDS) methods, and
hierarchical clustering methods.
A space-filling curve (SFC) traverses a two-dimensional (or multidimensional) area with
a one-dimensional curve (Hilbert 1891; Morton 1966; Mark 1990; Mokbel and Aref 2003). A
space-filling curve can be obtained by recursively dividing the space into quadrants and then
visiting each quadrant in a predefined (and often recursive) order (Wirth 1976). Figure 2
shows two space-filling curves in a 2D space: the Morton curve (which is also known as the
Z-order, proposed by Morton (1966) as the means for tiling adjacent map sheets), and the
Hilbert curve, which was first introduced by the German mathematician David Hilbert in 1891.
Space-filling curves have been extensively studied for the purpose of image compression and
spatial or multidimensional access methods in databases (Goodchild and Grandfield 1983;
Lawder and King 2001; Moon et al. 2001). Space-filling curves have recently been used to
reduce the complexity of the pattern recognition problem by converting multidimensional data
into a one-dimensional signal (Lamarque and Robert 1996; Skubalska-Rafajlowicz 2001).
Methods from the area of multidimensional scaling (MDS) attempt to preserve
inter-point distances in the data in a low-dimensional projection (Young 1987). As a
representative MDS method, the Sammon mapping (Sammon 1969) is included as one of the
ordering methods in our evaluation. Given n points in an m-dimensional space, a Sammon
mapping tries to find n points in a d-dimensional space (with d < m), in such a way that the
corresponding distances approximate the original ones as closely as possible. In this paper,
the Sammon mapping always maps data to a one-dimensional space (i.e., d = 1) to achieve an
ordering and encoding of data points (see section 3.2 for details).
An ordering can also be obtained from a hierarchical clustering result. A hierarchical
clustering method builds a hierarchy of clusters by decomposing a data set with a sequence of
nested partitions. A cluster hierarchy, represented by a dendrogram (see Figure 3), is a binary
tree with each data point as a leaf node. Traditional hierarchical clustering methods include,
for example, the single-link and complete-linkage clustering methods (Gordon 1987; Jain and
Dubes 1988; Gordon 1996; Duda et al. 2000). New hierarchical clustering methods have also
been developed, for example, density-based algorithms such as OPTICS (Ankerst et al. 1999),
extended from the DBSCAN partitioning method (Ester et al. 1996). To derive a
one-dimensional ordering from a cluster hierarchy, several different methods are available
(Bar-Joseph et al. 2001; Bar-Joseph et al. 2003; Guo et al. 2003).
Ordering techniques in general are widely used in information visualization systems to
accentuate patterns. For example, in the visualization of bacterial genomes, pixel arrangement
is used to place adjacent nucleotides as close to each other as possible and thus to help bring
out data patterns that otherwise would be difficult to perceive (Wong et al. 2003). Friendly
and Kwan present a general framework for ordering information in visual displays (tables and
graphs) according to the effects or trends (Friendly and Kwan 2003). Their framework can be
applied to the arrangement of unordered factors for quantitative data and frequency data, and
to the arrangement of variables and observations in multivariate displays (e.g., star plots,
parallel coordinate plots). The concept of a reorderable matrix (Wilkinson 1979; Bertin 1983;
Bertin 2001), as a data table visualization method, has been the focus of several recent
research efforts from different perspectives, e.g. testing ordering heuristics for an interactive
tool (Siirtola and Makinen 2005), and visualizing time-varying data (Qeli et al. 2004).
3 Spatial Ordering and Encoding Methods
In this section, we introduce nine ordering/encoding methods, including two space-filling
curves, a one-dimensional Sammon mapping, and six hierarchical clustering based methods.
These ordering and encoding methods are not limited to 2D spaces and can be applied to
multidimensional data (Breinholt and Schierz 1998; Mokbel and Aref 2003). Moreover, both
the Sammon mapping and the hierarchical clustering based methods can process a
non-Euclidean data space defined by a dissimilarity matrix or a weighted graph. For ease of
presentation and without loss of generality, we use 2D datasets to present and evaluate
ordering/encoding methods. In the discussion that follows, spatial ordering is a one-to-one
assignment of consecutive integers to spatial entities in the input space, where entities are
equally spaced in an ordering. By contrast, spatial encoding is a mapping of spatial entities to
a continuous one-dimensional interval, where entities are often not equally distributed. An
encoding is often derived from an ordering by rearranging the distances between neighboring
entities.
3.1 Ordering and Encoding with Space-Filling Curves
Given a data set of n points of any distribution in a 2D space, a minimum bounding square
box that covers all data points is constructed first. The square box (or quadrant) is recursively
divided into four subquadrants until each subquadrant covers only one point. An ordering of
all points can then be derived by following a space-filling curve to visit each subquadrant
(and thus each point) (Figure 2). Mark (1990), working with regularly spaced grid cells,
termed such orderings as “quadrant-recursive orderings”. Here we extend some of those ideas
to arbitrary distributions of data points.
Each data point is assigned an integer value as the ordering key (or ID). For example, let
the left-most point in the ordering be 0 (zero), then its right neighbor will be 1, and so on. The
right-most point in the ordering is then n-1, where n is the total number of spatial points.
An encoding is constructed by adjusting distances between neighboring points in the
ordering so that it is proportional to their distance in the 2D space (Figure 2). To derive such
an encoding, each point in the ordering is assigned a real value (instead of an integer value)
and the numeric difference between two neighboring points in the encoding proportionally
represents the distance between the two immediate neighboring points. For example, as
shown in Figure 2, let Encoding(G) be the encoding value of point G—the left-most point in
the ordering, then Encoding(F) = Encoding(G) + GF, where point F is the right neighbor of
G, and GF is the distance between G and F in the original space. Similarly, the encoding
values of other points can be obtained.
Figure 2: Spatial ordering and encoding based on the Morton curve (left) and the Hilbert curve
(right). In the ordering, points are equally spaced, while in the encoding, the distance between
two neighboring points is proportional to their distance in the two-dimensional space.
Space-filling curves can vary with the orientation (or direction) of the data. For example,
if we rotate the data in Figure 2 by 90 degrees, both orderings will be changed. Space-filling
curves also have limited ability in recognizing clusters and adapting to data distribution. For
example, a cluster at the center of a quadrangle will be cut into four pieces and separated in
the ordering by both the Morton and Hilbert methods. Outliers can also dramatically change
the ordering.
3.2 Ordering and Encoding with the Sammon Mapping (SAM)
Methods from the area of multidimensional scaling (MDS) try to preserve inter-point
distances in the data in a low-dimensional projection. One such algorithm is the Sammon
mapping (Sammon 1969). Given n points in an m-dimensional space, a Sammon mapping
tries to find n points in a d-dimensional space (with d < m), in such a way that the
corresponding distances approximate the original ones as closely as possible. To evaluate the
goodness-of-fit for each configuration, the Sammon mapping uses an error (or stress)
function, which measures the difference between the present configuration of n points in the
d-dimensional space and the configuration of n points in the original m-dimensional space.
The stress function E is defined as:
E = \frac{1}{\sum_{i=1}^{n-1} \sum_{j=i+1}^{n} d_{ij}} \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} \frac{(d_{ij} - \delta_{ij})^2}{d_{ij}} ,    (1)
where d_{ij} is the distance between two points in the m-dimensional space and \delta_{ij} is the
distance between the same two points in the d-dimensional space.
The Sammon mapping starts from an initial configuration of vectors and calculates the
stress. It then adjusts the vectors in order to decrease the stress using a steepest descent
algorithm to search for the minimum of the stress function. It iterates the above step for a
specified number of repetitions. In this paper, we use the Sammon mapping to transform the
input data to a one-dimensional space (i.e., d = 1). Unlike the space-filling curves, a
Sammon mapping result is already an encoding, where each point has a position in the
one-dimensional space. We directly use the Sammon mapping result as the encoding and
simply extract an ordering from this encoding.
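The sketch below illustrates a one-dimensional Sammon mapping by minimizing the stress of Equation (1) with a general-purpose optimizer (rather than the steepest-descent scheme described above); the optimized 1D configuration serves as the encoding and the ordering is read off by ranking it. The function name and the warm start on the x coordinate are our own choices, not part of any published implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist

def sammon_1d(points, n_iter=500, seed=0):
    """One-dimensional Sammon mapping (a sketch): minimize the stress of Eq. (1)
    with a general-purpose optimizer instead of steepest descent."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    d = pdist(pts)                                 # original distances d_ij (i < j)
    d = np.where(d == 0, 1e-12, d)                 # guard against coincident points
    scale = d.sum()

    def stress(y):                                 # y: candidate 1D coordinates
        delta = pdist(y.reshape(-1, 1))            # 1D distances delta_ij
        return np.sum((d - delta) ** 2 / d) / scale

    rng = np.random.default_rng(seed)
    y0 = pts[:, 0] + 0.01 * rng.standard_normal(n) # warm start on the x coordinate
    res = minimize(stress, y0, method="L-BFGS-B", options={"maxiter": n_iter})
    encoding = res.x                               # 1D position (encoding) of each point
    ordering = np.argsort(np.argsort(encoding))    # ordering key (0..n-1) for each point
    return encoding, ordering
```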
3.3 Ordering and Encoding with Hierarchical Clusters
In this section, we introduce another family of ordering methods that are based on
hierarchical clustering approaches. Let A = {a1, a2, …, an} be a set of spatial entities. A
dissimilarity measure between two spatial entities can be defined as their Euclidean distance
or any other spatial proximity measure. All pair-wise dissimilarity values within A form a
symmetric matrix (hereafter dissimilarity matrix), which is used to derive a hierarchy of
clusters. An important advantage of hierarchical clustering-based methods is that they can
accommodate various metrics in defining similarity or dissimilarity, including network
distances, spaces with obstacles, or arbitrarily defined graphs.
3.3.1 Two Ordering Strategies with Hierarchical Clusters: Simple vs. Optimal
As seen in Figure 3, a cluster hierarchy cannot determine a unique ordering of leaf nodes (i.e.,
data points). There are 2^(n-1) (n is the number of data points) unique orderings that are
consistent with the same cluster hierarchy (Bar-Joseph et al. 2001). To derive a
one-dimensional ordering from a cluster hierarchy, we adopt two strategies: the simple
ordering, introduced in Guo et al. (2003), and the optimal ordering, introduced in Bar-Joseph
et al. (2003).
Figure 3: Hierarchical clustering and ordering. The two dendrograms (middle and right)
represent the same cluster hierarchy but the one to the right has a better ordering of points.
The simple ordering strategy is fast and of O(n) complexity (n is the number of data
points). It processes the hierarchy from the bottom up. At the beginning, each cluster contains
a single data point. When two clusters are merged into one, the closest (i.e., most similar)
ends of the two clusters are connected. For example, as shown in Figure 3, when B is merged
with cluster {C, D}, B will be next to D because it is closer to D than to C. When the cluster
{A, E} is merged with the cluster {B, D, C}, C and E will be next to each other since CE is
the closest among the four connection options: AB, AC, BE, and CE. Once all data items are
in the same cluster, an ordering is achieved (Figure 3, right).
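The following sketch implements the simple ordering strategy on top of SciPy's agglomerative clustering, assuming Euclidean distances: whenever two clusters merge, the closest ends of their current orderings are joined. The function name is ours and the hierarchy is rebuilt with SciPy's linkage rather than taken from an existing clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist, squareform

def simple_leaf_order(points, method="complete"):
    """The simple O(n) ordering strategy (a sketch): when two clusters merge,
    the closest ends of their current orderings are joined."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    D = squareform(pdist(pts))
    Z = linkage(pdist(pts), method=method)          # the cluster hierarchy
    clusters = {i: [i] for i in range(n)}           # cluster id -> ordered leaf list
    for step, row in enumerate(Z):
        ca, cb = clusters.pop(int(row[0])), clusters.pop(int(row[1]))
        # distance between each pair of cluster ends (first or last leaf of each list)
        ends = {(ia, ib): D[ca[ia], cb[ib]] for ia in (0, -1) for ib in (0, -1)}
        ia, ib = min(ends, key=ends.get)
        if ia == 0:                                  # flip ca so its chosen end is on the right
            ca = ca[::-1]
        if ib == -1:                                 # flip cb so its chosen end is on the left
            cb = cb[::-1]
        clusters[n + step] = ca + cb
    return list(clusters.values())[0]                # leaf indices in ordering sequence
```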
The second strategy provides an optimal ordering, but at the expense of being more
computationally intensive. Bar-Joseph et al. (2001) propose a method, which is of O(n^4)
complexity, to find the shortest ordering among the 2^(n-1) different orderings that are consistent
with the cluster hierarchy. The length of an ordering is the total length of the curve that
connects all points by following the ordering (in the original 2D space). For example, the
length of the ordering shown in Figure 3 (right) is the length sum of edges BD, DC, CE, and
EA in the original space. Bar-Joseph et al. (2003) improve the optimal ordering algorithm to
achieve an O(n^3) complexity. Readers are referred to the above references for details. The
optimal ordering discussed here is similar to the Traveling Salesman Problem (TSP) (Reinelt
1994) except for two differences. First, the last point in the ordering need not connect back to the
first point to form a loop. Second, the ordering here is consistent with a cluster hierarchy and
thus reduces the search space for an optimal solution.
Each of these two ordering strategies (the simple one and the optimal one) can be
combined with any hierarchical clustering method to construct a one-dimensional ordering
and encoding. Below, we briefly introduce three hierarchical clustering methods used in this
paper: (1) the single-linkage hierarchical clustering, (2) the complete-linkage clustering, and
(3) a complete-linkage clustering that uses the number of shared nearest neighbors as the
similarity measure.
3.3.2 Single-Linkage Clustering with Minimum Spanning Tree (MST)
The single-linkage clustering defines the distance between two clusters as the distance
between the nearest pair of points, with one from each cluster (Jain and Dubes 1988). A
single-linkage clustering can be obtained by constructing a minimum spanning tree (MST).
Given a dissimilarity matrix as explained earlier, which is a complete graph G of all data
points, Kruskal's algorithm (Baase and Gelder 2000) is used to construct an MST. At the
beginning, each point itself is a graph (altogether n graphs). The algorithm first sorts all edges
in G in an increasing order. Following this order, each edge is selected in turn, starting with
the shortest. If an edge connects two points in two different connected graphs, the algorithm
adds the edge to the MST and the two graphs are merged into one. If an edge connects two
points in the same graph, the edge is discarded. When all the points are in a single graph, the
spanning tree is complete (Figure 4: left). The hierarchy of clusters is represented with a
dendrogram (Figure 4: middle).
Figure 4: Single-linkage clustering, ordering, and encoding. Each edge in the MST (left) has an
ID number, which indicates the order of merges in building the cluster hierarchy. In the encoding,
the distance between points D and G is the length of edge CE, which connects the two clusters in
the MST (left). Therefore, D and G are closer in the encoding than in the original space.
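The sketch below builds the minimum spanning tree that underlies single-linkage clustering. SciPy's built-in routine stands in for the explicit Kruskal loop described above; sorting the returned edges by length reproduces the order of merges (the edge IDs in Figure 4). The function name is ours.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def mst_edges(points):
    """Build the MST underlying single-linkage clustering (a sketch).
    Assumes no two points coincide (zero-length edges would be treated as absent)."""
    D = squareform(pdist(np.asarray(points, dtype=float)))   # complete graph of all points
    T = minimum_spanning_tree(D).tocoo()                     # the n-1 MST edges
    return sorted(zip(T.data, T.row, T.col))                 # (length, i, j), shortest first
```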
We then use both the simple ordering strategy and the optimal ordering strategy to derive
two different orderings (namely, MST and MST_OPT) from the cluster hierarchy. Similar to
the ordering derived with space-filling curves, each data point in the ordering is assigned an
integer value as the ordering key (or ID), ranging from 0 (left-most) to n-1 (right-most),
where n is the total number of spatial points. However, the assignment of encoding values is
different from that of a space-filling curve. Here, the distance between two neighboring
points is proportional to the distance between the two clusters that they belonged to before
they were merged together. For example, as shown in Figure 4, Encoding(G) = Encoding(D)
+ CE, because CE is the distance between cluster {E, D} and cluster {G, F, C, B, A}. Note the
difference between the encoding and the total length of an ordering, which remains as the
total length of the curve that connects all points (in the original space) following the ordering.
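A compact way to compute such an encoding, sketched below, is to note that the gap between two consecutive points in the ordering equals their cophenetic distance in the hierarchy, i.e., the inter-cluster distance at the merge that first places them in the same cluster (assuming the ordering is consistent with that hierarchy). The helper name and its parameters are our own illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist, squareform

def encoding_from_ordering(points, order, method="single"):
    """Turn a hierarchy-consistent leaf ordering into an encoding (a sketch):
    the gap between consecutive points is their cophenetic distance, i.e. the
    distance at which the clusters containing them were merged."""
    Z = linkage(pdist(np.asarray(points, dtype=float)), method=method)
    C = squareform(cophenet(Z))                    # pairwise merge (cophenetic) distances
    enc = np.zeros(len(order))
    for k in range(1, len(order)):
        enc[order[k]] = enc[order[k - 1]] + C[order[k - 1], order[k]]
    return enc                                      # encoding value for each point
```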
3.3.3 Complete-Linkage Clustering and Ordering (CLO)
By contrast, the complete-linkage clustering defines the distance between two clusters as the
distance between the farthest pair of points, one from each cluster. Figure 5 shows the
complete-linkage clustering result of the same set of data points as shown in Figure 4. At the
beginning, each point constitutes a cluster. Then the closest pair of clusters is merged into one.
After each merge, the distance between the new cluster and every other cluster is updated
with the length of the longest edge that links the two clusters. Following this, the closest pair
of clusters is chosen to merge, and so on, until all points are in the same cluster. The
complete-linkage clustering tends to build a more balanced cluster hierarchy than does the
single-link clustering method.
Figure 5: The complete-linkage clustering, ordering, and encoding. Each edge in the final
connected graph (left) has an ID, which indicates the order of merges in building the cluster
hierarchy. Note that clusters are merged with the longest edge that connects the two clusters.
Similar to the ordering derived with the single-linkage clustering hierarchy, each data
point in the ordering is assigned an integer value as the ordering key, and a real number as the
encoding value. Both the simple ordering strategy and the optimal ordering strategy are used
to derive two different orderings (namely, CLO and CLO_OPT) from the cluster hierarchy.
3.3.4 Shared Nearest Neighbors Clustering and Ordering
Instead of using Euclidean distances, alternative methods have been proposed that use the
number of shared nearest neighbors (SNN) as the similarity measure between two points
(Jarvis and Patrick 1973; Ertoz et al. 2003). Clustering methods based on an SNN similarity
measure have some unique advantages; for example, they can identify clusters of different
densities and avoid the single-link effect. Figure 6 shows a simple scenario to demonstrate
the SNN similarity measure. First, k nearest neighbors (kNN) are identified for each point (k =
7 in Figure 6). The SNN similarity measure between a pair of points is the number of points
that belong to the kNN neighborhood for both points. If the two points are in each other’s
kNN neighborhood, their SNN similarity will be increased by one. Let SNN(A, B) represent
the SNN value between point A and B. As shown in Figure 6, points A and B share two
neighbors and they are in each other’s circle, therefore SNN(A, B) = 3. Similarly, SNN(A, C) =
4 since points A and C share three neighbors and they also belong to each other’s seven
nearest neighbors.
Figure 6: Shared nearest neighbors as the similarity measure (here k = 7). Each circle covers
seven nearest neighbors for the point at the center of the circle.
Since SNN is only a similarity measure and is not a clustering method by itself, we
combine the complete-linkage clustering (CLO) with SNN by using 1/(SNN+1) as the
dissimilarity value between two points. We name this clustering method as SNN_CLO. In
cases where two pairs of points have the same SNN value, the Euclidean distance between
each pair of points is used to break the tie. Except for the use of a different dissimilarity
measure, the clustering, ordering, and encoding procedure of SNN_CLO is the same as the
CLO method introduced above. Similarly, we use both the simple and optimal ordering
approaches to derive two different orderings from the SNN_CLO result, namely, SNN_CLO
and SNN_CLO_OPT.
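The sketch below computes the SNN-based dissimilarity matrix described above (the tie-breaking by Euclidean distance is omitted for brevity); the resulting matrix can be fed to a complete-linkage clustering to obtain SNN_CLO. The function name is ours.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def snn_dissimilarity(points, k=16):
    """SNN-based dissimilarity (a sketch): d(i, j) = 1 / (SNN(i, j) + 1), where
    SNN(i, j) is the number of shared k nearest neighbours, plus one if i and j
    lie in each other's kNN neighbourhood."""
    D = squareform(pdist(np.asarray(points, dtype=float)))
    n = len(D)
    knn = [set(np.argsort(D[i])[1:k + 1]) for i in range(n)]   # kNN of each point
    S = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            snn = len(knn[i] & knn[j])
            if j in knn[i] and i in knn[j]:                    # mutual kNN membership bonus
                snn += 1
            S[i, j] = S[j, i] = 1.0 / (snn + 1)
    return S          # feed to complete-linkage clustering to obtain SNN_CLO
```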
To summarize, in this section we have introduced nine different ordering/encoding
methods as listed below. The abbreviations to the left are the short names for each method. To
simplify presentation, from now on we use the short name to refer to each ordering method.

MOR: Morton curve
HIL: Hilbert curve
SAM: Sammon mapping
MST: single-linkage clustering, with the simple ordering strategy
MST_OPT: single-linkage clustering, with the optimal ordering strategy
CLO: complete-linkage clustering, simple ordering
CLO_OPT: complete-linkage clustering, optimal ordering
SNN_CLO: SNN measure and complete-linkage clustering, simple ordering
SNN_CLO_OPT: SNN measure and complete-linkage clustering, optimal ordering
4 Systematic Evaluation of Orderings and Encodings
Orderings and encodings should preserve “locality” as much as possible. Points that are close
in the original space should also be close in the ordering, and vice versa (Gotsman and
Lindenbaum 1996). However, most evaluation work so far has focused only on one direction
or the other. Moreover, existing evaluation analyses focus only on orderings of regularly
spaced grid cells instead of data points of an arbitrary distribution (Mark 1990; Gotsman and
Lindenbaum 1996; Mokbel and Aref 2003). We design a comprehensive set of measures to
evaluate the nine ordering/encoding approaches with synthetic data sets of various data
distributions and a real data set of 3128 US cities.
4.1 Two Groups of Measures
Specifically, we develop two groups of measures, namely, the key similarity measures
(hereafter KS measures) and the spatial similarity measures (hereafter SS measures). The KS
measures are designed to assess the similarity of ordering keys (or encoding values) assigned
to points that are close in the original space. On the other hand, the SS measures are designed to
evaluate the reverse mapping, i.e., the similarity of original data values for points that are
close in the ordering or encoding. Very often, an ordering or encoding method works well on
one group of measures but performs poorly against the other. Ideally, the best ordering or
encoding should excel with both groups of measures.
4.1.1 KS Measures: Measuring Ordering Similarity in Spatial Neighborhoods
As noted above, the KS measures evaluate how a set of spatial objects in a spatial
neighborhood (e.g., k nearest neighbors) are distributed in the ordering or encoding. Ideally,
we would like to see that all k nearest neighbors in the original space are also very close in
the ordering or encoding. Obviously, this is not usually possible for any realistic dataset since
the projection of two-dimensional data points to a one-dimensional ordering can only
preserve some spatial relations, but not all of them. This group of measures is similar to the
measures used by Mark (1990). However, since our data points are not regularly spaced (as
assumed by Mark’s measures), we use k nearest neighbors to represent the neighborhood for
each point. The KS measures are also similar to the stress function optimized in the Sammon
mapping, but here we emphasize the preservation of relations within a kNN neighborhood.
To calculate KS measures, we first connect each point in the original space to its k nearest
neighbors and thus create a kNN graph. Let E = {eij} be all edges in the kNN graph, eij be an
edge with point i as the origin point and point j as the destination point. If points i and j are
within each other’s kNN neighborhood, there are two edges connecting them: eij and eji. Let
Rij represent the nearest neighbor rank of point j (the destination point) in relation to point i
(the origin point) in the original data space. For example, if point i is the fourth nearest
neighbor of point j in the original data space, then Rij has a value of 4. We use rij to denote the
nearest neighbor rank of point j in relation to point i in the ordering or encoding. Note: rij can
be greater than k while Rij ranges from 1 to k. The distance between points i and j in the
original data space is denoted Lij.
A generic KS (Keys Similarity) measure is defined as:
KS = \sum_{e_{ij} \in E} (w_{ij} r_{ij}) / \sum_{e_{ij} \in E} w_{ij} ,    (2)
where w_{ij} is the weight for each edge and |E| is the total number of edges in the kNN graph.
Depending on the configuration of the weight, we have three different KS measures:
KS_{nd} = \sum (r_{ij} / L_{ij}) / \sum (1 / L_{ij}) ,    (3)
where the weight is the inverse of the distance (L_{ij}) in the original space;
KS_{nn} = \sum (r_{ij} / R_{ij}) / \sum (1 / R_{ij}) ,    (4)
where the weight is the inverse of the neighbor ranking (R_{ij}) in the original space;
KS_{n} = \sum r_{ij} / |E| ,    (5)
where the weight is a constant.
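The sketch below computes the three KS measures for a given ordering or encoding, following Equations (2)-(5); the kNN graph is built in the original space. The function name and signature are our own illustration, not part of any published implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ks_measures(points, keys, k=16):
    """The KS measures of Eqs. (2)-(5), a sketch. `keys` holds the ordering IDs
    or encoding values; the kNN graph is built in the original space."""
    keys = np.asarray(keys, dtype=float)
    D = squareform(pdist(np.asarray(points, dtype=float)))    # distances L_ij
    n = len(D)
    r_all, R_all, L_all = [], [], []
    for i in range(n):
        nbrs = np.argsort(D[i])[1:k + 1]                      # the k nearest neighbours of i
        rank_1d = np.argsort(np.argsort(np.abs(keys - keys[i])))  # r_ij (0 = i itself)
        for R, j in enumerate(nbrs, start=1):                 # R_ij runs from 1 to k
            r_all.append(rank_1d[j]); R_all.append(R); L_all.append(D[i, j])
    r, R, L = (np.asarray(a, dtype=float) for a in (r_all, R_all, L_all))
    return {"KSnd": np.sum(r / L) / np.sum(1.0 / L),          # Eq. (3)
            "KSnn": np.sum(r / R) / np.sum(1.0 / R),          # Eq. (4)
            "KSn": r.mean()}                                  # Eq. (5)
```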
4.1.2 SS Measures: Measuring Spatial Similarities in Ordering Neighborhoods
The KS measures introduced above emphasize that spatial neighbors should be close in the
ordering. However, KS measures do not penalize situations where two points that are far away from
each other in the original space are next to each other in the ordering. In this subsection,
we introduce the SS measures, which evaluate how a set of points in an ordering or encoding
neighborhood are distributed in the original data space. Ideally, we would like to see that k
nearest neighbors in an ordering or encoding are also close in the original space. Orderings
that have good performance on these measures normally have fewer long jumps, i.e., the
ordering path typically moves from a point to one of its close neighbors.
Similar to the calculation of KS measures, we construct a kNN graph with the ordering or
encoding (instead of the original data space). Let E = {eij} be all kNN edges, eij be an edge
with an origin point i and a destination point j. If points i and j are within each other’s kNN
neighborhood, there are two edges connecting them: eij and eji. Rij denotes the nearest
neighbor rank of point j in relation to point i in the original data space, while rij is the nearest
neighbor rank in the ordering or encoding. Here Rij can be greater than k while rij ranges from
1 to k. The distance between points i and j in the original space is denoted Lij.
A generic spatial similarity (SS) measure is defined as:
SS = \sum_{e_{ij} \in E} (w_{ij} d_{ij}) / \sum_{e_{ij} \in E} w_{ij} ,    (6)
where w_{ij} is the weight for a pair of points (i.e., a point and one of its k nearest neighbors) and
d_{ij} is the distance between the two points in the 2D geographic space. |E| is the total number
of edges in the kNN graph. Depending on the configuration of the weight and the distance,
we have four different SS measures:
SS_{dn} = \sum (L_{ij} / r_{ij}) / \sum (1 / r_{ij}) ,    (7)
where the weight is defined as the inverse of the k-nearest-neighbor ranking (r_{ij}) and the
distance is the distance (L_{ij}) in the original space;
SS_{d} = \sum L_{ij} / |E| ,    (8)
where the weight is a constant and the distance is the distance (L_{ij}) in the original space;
SS_{nn} = \sum (R_{ij} / r_{ij}) / \sum (1 / r_{ij}) ,    (9)
where the weight is defined as the inverse of the neighbor ranking (r_{ij}) in the ordering (or
encoding) and the distance is the nearest-neighbor ranking (R_{ij}) in the original space;
SS_{n} = \sum R_{ij} / |E| ,    (10)
where the weight is a constant and the distance is defined as the nearest-neighbor ranking (R_{ij})
in the original space.
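A corresponding sketch for the SS measures of Equations (6)-(10) follows; here the kNN graph is built in the ordering or encoding, while distances and neighbor ranks are taken in the original space. As before, the function name is our own illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def ss_measures(points, keys, k=16):
    """The SS measures of Eqs. (6)-(10), a sketch. The kNN graph is built in the
    ordering/encoding; distances (L_ij) and ranks (R_ij) come from the original space."""
    keys = np.asarray(keys, dtype=float)
    D = squareform(pdist(np.asarray(points, dtype=float)))
    n = len(D)
    r_all, R_all, L_all = [], [], []
    for i in range(n):
        nbrs = np.argsort(np.abs(keys - keys[i]))[1:k + 1]    # kNN of i in the encoding
        rank_2d = np.argsort(np.argsort(D[i]))                # R_ij (0 = i itself)
        for r, j in enumerate(nbrs, start=1):                 # r_ij runs from 1 to k
            r_all.append(r); R_all.append(rank_2d[j]); L_all.append(D[i, j])
    r, R, L = (np.asarray(a, dtype=float) for a in (r_all, R_all, L_all))
    return {"SSdn": np.sum(L / r) / np.sum(1.0 / r),          # Eq. (7)
            "SSd": L.mean(),                                  # Eq. (8)
            "SSnn": np.sum(R / r) / np.sum(1.0 / r),          # Eq. (9)
            "SSn": R.mean()}                                  # Eq. (10)
```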
4.2 Results and Evaluation
We evaluate the nine ordering and encoding methods using two types of synthetic data:
clustered data and random data. We also carry out an evaluation with a real data set that
contains the locations for 3128 US cities.
4.2.1 Evaluation with Clustered Data
Each clustered dataset contains 500 points, which form four Gaussian clusters: A, B, C and D.
Clusters A-C have 100 points each while cluster D has 200 points. The center of each cluster
is randomly chosen between (0, 100) for both dimensions. The cluster shape and size (in
terms of standard deviation on each dimension) are also randomly chosen. Specifically, the
standard deviation on each dimension ranges from 2 to 20 for clusters A and D. Cluster B is
always elongated, with a small deviation (ranging from 2 to 10) on one dimension and a
much larger deviation (ranging from 10 to 40) on the other dimension. Cluster C is always
circular, having the same standard deviation (ranging from 5 to 10) on both dimensions.
Since the centers are randomly chosen, these four clusters may overlap. Once all data points
are generated, each cluster is rotated by a random amount, ranging from zero to 90 degrees.
Thus, the orientation of each cluster is also random. We generate 100 such clustered data sets.
Figure 7 shows the results of the nine ordering/encoding methods for the same clustered
data set. Table 1 shows the average measure scores for each method obtained across all 100
synthetic clustered datasets. Measures are calculated using 16 nearest neighbors for each
point (i.e., k = 16). The SNN_CLO and SNN_CLO_OPT ordering also use 16 nearest
neighbors in calculating the SNN value. For all scores, the lower the value, the better the
ordering. Figure 8 shows the average ranks of the nine ordering methods for each measure.
For example, if SAM performs the best for the KSnd measure for a specific data set, then its
rank for KSnd is set to 1. Thus, we convert measure values to ranks, ranging from 1 to 9.
The ranks are averaged for each ordering method and each specific measure.¹

¹ One could argue that since ranks are ordinal numbers only, taking an average of a set of ranks is statistically
invalid in the strictest sense. However, we do need a metric to summarize ranking performance and so beg the
reader's indulgence in overlooking this small faux pas.
Figure 7: The ordering results of a clustered data set with the nine ordering methods. Each
ordering starts with the green point and ends at the yellow point.
In Figure 8, we can see that the Sammon mapping (SAM) is the best among the nine
ordering methods for the KSnd and KSn measures and ranks second for the KSnn measure.
However, SAM is the worst against all SS measures (see Table 1 and Figure 8). This is
because the stress function that SAM minimizes is similar to the KS measures, which
emphasizes that spatial neighbors should be close in the ordering. However, the stress
function does not penalize situations when two points far away from each other in the 2D
space are next to each other in the ordering. Therefore, we notice in Figure 7 that the SAM
ordering has many more long jumps than other methods. This finding is similar to Mark’s
(1990) findings, where he reported that a simple row-based ordering performs the best on his
three measures, which are similar to the KS measures used here.
Table 1: Average measure scores for the nine orderings with 100 synthetic clustered data sets. The
lowest score for each measure is highlighted in a bold font.
Ordering        KSnd    KSnn    KSn     SSdn    SSd     SSnn    SSn
MOR             35.66   35.59   50.27   8.76    11.13   28.66   39.95
HIL             43.84   43.14   60.64   6.83    8.91    16.43   24.82
SAM             23.31   26.73   37.11   15.65   16.07   72.78   74.40
MST             38.56   48.92   75.74   9.86    12.76   39.02   53.41
MST_OPT         38.34   48.96   75.94   9.40    12.43   35.92   50.89
SNN_CLO         31.10   28.17   46.79   6.69    8.88    17.95   26.91
SNN_CLO_OPT     31.28   28.54   45.49   6.83    8.84    17.52   25.82
CLO             30.83   28.05   47.30   6.75    9.01    18.37   27.70
CLO_OPT         28.20   25.35   42.84   5.93    7.98    13.10   20.76
[Chart: Ordering Ranks (with clustered data); vertical axis: Rank (1 = best), horizontal axis: Measures.]
Figure 8: Average ranks of the nine ordering methods with 100 synthetic datasets.
Across all seven measures, CLO_OPT is clearly the best ordering method among the nine
evaluated. This can be observed in Figure 7, where the CLO_OPT ordering has fewer long
jumps and better preserves clusters. The CLO_OPT ordering ranks the best on all SS
measures and the KSnn measure. It ranks the second on the KSnd and KSn measures, on
which the SAM ranks the best (see Figure 8). Since the KSnn measure is the average of
nearest neighbor rankings weighted by their 1D nearest neighbor rankings, it indicates that
the SAM ordering does not preserve small neighborhoods well compared to the CLO_OPT
ordering. In another evaluation result (which is not shown) that uses four nearest neighbors
instead of 16 (i.e., k = 4) in calculating the measures, the SAM ordering is much worse than
the CLO_OPT on all KS measures.
The two space-filling curves, i.e., the Hilbert curve (HIL) and the Morton curve (MOR),
show an interesting difference in performance. HIL is better than MOR on all SS measures
while MOR is better than HIL on all KS measures. The Hilbert curve’s good performance on
SS measures is because it avoids long jumps as it always moves to a close neighbor instead of
jumping to another region. However, the Hilbert curve often divides a cluster of points into
several pieces and those pieces are scattered around in the ordering. Compared to the Hilbert
curve, the Morton curve is more balanced against all measures; since it is based on a
hierarchical traversal it tends to fracture clusters to a lesser degree.
Orderings based on the single-linkage clustering, i.e., the MST and the MST_OPT
orderings, perform badly on all measures. Other clustering-based methods, i.e., CLO and
SNN_CLO, perform better than the Morton ordering on all measures and better than the
Hilbert ordering on most measures (except on the SSnn and SSn measures). The optimal
ordering makes a significant improvement for the CLO ordering. In general, clustering-based
orderings preserve dense neighborhoods better at the expense of “outliers”. If a certain
percentage of points are first filtered out as “outliers” and excluded during the calculation of
measures, the performance of those clustering-based ordering methods will be better.
[Chart: Standard Deviation of Ordering Measures (clustered data); vertical axis: Standard Deviation, horizontal axis: Measures.]
Figure 9: The standard deviation of measure scores for each ordering method.
Figure 9 shows the standard deviation of measure scores for each ordering method. We
can see that CLO_OPT, which gives the best overall performance, is also very stable. On the
other hand, although SAM performs well on the KS measures, its standard deviation on the
KS measures is large, indicating that SAM is not stable in performance. The standard
deviations of the MST and MST_OPT orderings are much higher than those of other methods
and are thus excluded from Figure 9.
Table 2 and Figure 10 show the evaluation results of the nine encodings. As defined in
section 3, points in the ordering are equally spaced while the distance between neighboring
points varies in the encoding. Therefore, although points in the encoding and the ordering are
in the same order, the k nearest neighbors of a point in the encoding may not be the same as
in the ordering. The CLO_OPT encoding remains the best. The most significant change in
performance occurs for the Hilbert curve, which is the second best in ordering but only ranks
fifth in encoding on the SS measures. This is because the Hilbert curve typically cannot
recognize clusters and thus has limited ability in adjusting the distances in the encoding.
Table 2: Nine encoding results with 100 synthetic data sets of clustered data. The best average
score for each measure is highlighted with a bold font.
Encoding        KSnd    KSnn    KSn     SSdn    SSd     SSnn    SSn
MOR             31.87   31.23   43.95   6.31    9.20    15.22   27.50
HIL             35.13   33.97   47.73   5.65    8.17    11.44   20.57
SAM             21.76   23.54   32.88   15.39   15.90   71.02   73.57
MST             30.66   35.78   56.39   8.61    12.26   32.57   50.46
MST_OPT         30.38   35.48   56.11   8.22    11.94   29.90   48.04
SNN_CLO         25.59   22.91   38.10   5.24    7.58    10.11   18.37
SNN_CLO_OPT     25.72   23.12   37.25   5.58    7.73    11.31   19.10
CLO             25.34   22.74   38.36   5.08    7.66    9.81    18.89
CLO_OPT         23.68   20.97   35.44   4.92    7.32    8.90    16.77
[Chart: Encoding Ranks (with clustered data); vertical axis: Rank, horizontal axis: Measures.]
Figure 10: Average ranks of the nine encoding methods with 100 clustered data sets.
To examine the improvement that the encoding can achieve over the ordering, we plot
both the ordering measures and the encoding measures in the same view (see Figure 11) for
the CLO and CLO_OPT methods. The encodings of both the CLO and the CLO_OPT method are consistently better than
their corresponding orderings across all measures. This finding suggests that the CLO or
CLO_OPT encoding should be chosen over their orderings to encode spatial relations.
However, their orderings remain important for visualization techniques where entities are
often equally spaced.
[Chart: Ordering vs. Encoding (with clustered data); vertical axis: Measure Value, horizontal axis: Measures; series: CLO (ORD), CLO_OPT (ORD), CLO (ENC), CLO_OPT (ENC).]
Figure 11: Improvement of measure values by using encoding instead of ordering.
4.2.2 Evaluation with Random Data
To further examine the performance of the nine ordering/encoding methods with different
data distributions, we present here the evaluation result with 100 synthetic sets of random
data points. A random data set contains 500 points that are randomly distributed, with values
ranging between (0, 100) for each dimension. With this evaluation, we specifically want to
test the performance of cluster-based methods with data that have no obvious clusters.
Figure 12 shows the results of the nine ordering/encoding methods for the same random
data set. Table 3 shows the average results obtained across all 100 synthetic random datasets.
Measures are calculated using 16 nearest neighbors for each point (i.e., k = 16). The
SNN_CLO and SNN_CLO_OPT ordering also use 16 nearest neighbors in calculating the
SNN value. Figure 13 shows the average ranks of the nine ordering results for each measure.
We notice that, even with a completely random data distribution, the CLO_OPT ordering
remains the best among the nine. SAM, however, gets worse on KS measures while MOR
ordering emerges as the best on KSnd and KSn measures (Table 3 and Figure 13).
The two space-filling curves, i.e., HIL and MOR, show a dramatic difference in
performance. HIL is much better than MOR on all SS measures while MOR is much better
than HIL on all KS measures. Actually, for random data sets, the Hilbert curve is very close to
the CLO_OPT ordering on all SS measures, while the Morton curve is the best on two KS
measures. However, neither HIL nor MOR performs well on both types of measures. In this
regard, CLO_OPT still gives the best overall result.
Table 4 and Figure 14 show the evaluation results of the nine encodings with the 100
synthetic random data sets used above. Two important changes can be observed. First, the
CLO_OPT encoding emerges as the best on all measures. Second, CLO and SNN_CLO
encodings become better than the HIL and MOR encodings. The clustering-based methods
(except MST and MST_OPT) can construct better encodings since they are able to recognize
clusters and better adjust distances between neighbors.
Figure 12: Ordering results of the nine methods with the same random data set.
Table 3: Average measure values of the nine ordering results for 100 synthetic random data sets.
Ordering        KSnd    KSnn    KSn     SSdn    SSd     SSnn    SSn
MOR             31.86   29.90   42.69   10.15   12.90   24.12   33.68
HIL             36.61   34.61   49.00   7.50    9.70    11.16   16.70
SAM             32.66   31.11   45.45   31.53   31.99   132.83  134.78
MST             44.51   43.35   74.11   10.38   13.86   27.47   39.52
MST_OPT         43.64   42.20   72.28   9.71    13.16   24.04   35.81
SNN_CLO         35.43   31.02   52.37   8.05    10.74   14.99   22.64
SNN_CLO_OPT     35.60   31.50   51.64   8.19    10.66   14.14   20.99
CLO             34.02   29.87   50.71   8.03    10.71   14.99   22.52
CLO_OPT         32.12   28.08   47.77   7.04    9.48    10.43   16.43
[Chart: Ordering Ranks (with random data); vertical axis: Rank, horizontal axis: Measures.]
Figure 13: Ranks of the nine ordering methods with 100 synthetic random data sets.
Table 4: Average measure values of the nine encoding results for 100 random data sets.
Encoding        KSnd    KSnn    KSn     SSdn    SSd     SSnn    SSn
MOR             29.82   27.97   39.86   7.56    11.25   13.17   24.29
HIL             29.89   28.16   39.79   6.62    9.57    9.20    16.19
SAM             31.59   30.01   43.58   31.14   32.00   130.56  134.75
MST             34.46   32.57   56.08   8.93    13.40   22.34   37.24
MST_OPT         34.04   31.93   55.19   8.38    12.69   19.51   33.51
SNN_CLO         28.15   24.58   41.52   6.43    9.46    8.96    16.21
SNN_CLO_OPT     27.72   24.46   40.31   6.81    9.64    9.82    16.81
CLO             27.13   23.75   40.31   6.26    9.57    8.85    16.88
CLO_OPT         25.47   22.25   37.78   6.08    9.16    8.11    15.10
[Chart: Encoding Ranks (with random data); vertical axis: Rank, horizontal axis: Measures.]
Figure 14: Ranks of the nine encoding methods with 100 synthetic random data sets.
4.2.3 Evaluation with a Real Data Set
We also apply the eight ordering methods (excluding SAM due to its poor performance on SS
measures) to a real data set that contains locations of 3128 US cities. Figure 15 shows the
CLO_OPT ordering of the cities. Table 5 and Figure 16 show the evaluation results of the
eight encodings of US cities. The CLO_OPT encoding gives the best result for all seven
measures. From the map (Figure 15) we can see that clusters are well preserved at different
scales. CLO and SNN_CLO are very close to the CLO_OPT in measure scores (Table 5).
Figure 15: The CLO_OPT ordering of 3128 US cities.
Table 5: Measure scores of the eight encoding results (excluding SAM) for 3128 US cities.
Encoding        KSnd    KSnn    KSn     SSdn    SSd     SSnn    SSn
MOR             34.83   60.58   89.49   0.65    1.00    23.93   44.14
HIL             37.91   63.31   90.79   0.53    0.78    15.78   27.70
MST             24.99   58.18   99.72   0.75    1.12    43.01   69.39
MST_OPT         23.77   53.00   90.86   0.66    0.98    34.72   56.86
SNN_CLO         20.92   30.09   51.40   0.46    0.68    12.33   21.65
SNN_CLO_OPT     25.49   34.22   54.08   0.52    0.71    14.71   23.66
CLO             21.38   30.74   52.79   0.42    0.64    9.87    18.64
CLO_OPT         19.84   29.01   50.02   0.41    0.62    9.04    16.77
[Chart: Encoding of 3128 US Cities (Ranking); vertical axis: Rank, horizontal axis: Measures.]
Figure 16: Ranks of the eight encodings of 3128 US cities. CLO_OPT is the best for all measures.
5 Example Applications of Ordering and Encoding
The spatial ordering and encoding methods introduced and evaluated above are useful in
many spatial (or spatio-temporal) data mining tasks. Below we briefly introduce two
applications of the spatial ordering or encoding: (1) multivariate spatial clustering, and (2)
spatio-temporal visualization. To illustrate each application, we use the analysis of traffic
accidents on a road network as an example (Steenberghen et al. 2004; Yamada and Thill
2004). Due to space limitations, we present only the conceptual ideas without referring to a
specific data set or implementation details.
Multivariate Spatial Clustering with Encoding
Suppose we want to identify clusters of accidents using both spatial similarities (e.g.,
distances between accidents on the road network) and attribute similarities (e.g., accident
variables derived from police records and local traffic characteristics). Specifically, we want
to use the well-known self-organizing map (SOM) to derive clusters. However, the SOM can
only process vector-based data with a Euclidean metric. In other words, each accident should
have a vector of values (including both spatial information and aspatial attributes) and the
dissimilarity between two accidents should be the Euclidean distance between their vectors.
Since the accidents are distributed on a road network, the spatial distance between accidents
is non-Euclidean (i.e., the network distance cannot be computed from the coordinates of the
accidents alone). Thus, the SOM cannot combine the spatial similarity and attribute similarity.
To solve this problem, we can first derive a spatial encoding for all accidents using a
dissimilarity matrix of all pair-wise network distances between accidents. Accidents that are
close on the road network will also be close in the encoding. Then a SOM can take this
encoding as an additional “variable”, together with other aspatial variables, to derive clusters
that take into account both spatial and aspatial information. We can also set different weights
for the spatial encoding and other variables, or even normalize the encoding without changing
the spatial patterns it represents.
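A minimal sketch of this idea is given below. It assumes a matrix of pairwise network distances and an array of aspatial attributes as hypothetical inputs; for simplicity, SciPy's default leaf order stands in for the CLO_OPT ordering and k-means stands in for the SOM, so this illustrates the data preparation rather than the exact pipeline described above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, cophenet
from scipy.spatial.distance import squareform
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans

def cluster_accidents(network_dist, attributes, n_clusters=8, spatial_weight=2.0):
    """A sketch: derive a 1D spatial encoding from a network-distance matrix,
    weight it, stack it with the aspatial attributes, and cluster the result.
    `network_dist` is an n x n matrix of pairwise network distances (hypothetical
    input); the leaf order and k-means stand in for CLO_OPT and the SOM."""
    Z = linkage(squareform(network_dist, checks=False), method="complete")
    order = leaves_list(Z)                           # a hierarchy-consistent ordering
    C = squareform(cophenet(Z))                      # merge distances between points
    enc = np.zeros(len(order))
    for k in range(1, len(order)):                   # encoding: cumulative merge gaps
        enc[order[k]] = enc[order[k - 1]] + C[order[k - 1], order[k]]
    features = np.column_stack([spatial_weight * scale(enc), scale(attributes)])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```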
Spatio-Temporal Visualization with Ordering
Suppose traffic accidents are grouped for each week, for a 10-year period (i.e., 522
weeks). A common approach to visualize such a spatio-temporal dataset is to make a map of
accidents for each week and then animate the maps across time. An alternative approach may
be to visualize the data using a 2D matrix, with the horizontal dimension representing time
(week and year) and rows representing accidents, which are ordered with a spatial ordering.
We can also aggregate nearby accidents in the ordering into groups to reduce the number of
matrix rows. Such an aggregation is possible because accidents that are close in the ordering
are also spatially close. Thus, we have a spatio-temporal matrix view of all accidents, with
accidents squeezed and aggregated into one dimension. From such a spatio-temporal view,
we will be able to perceive hotspots (or trouble zones/periods) of accidents across space and
time.
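The sketch below illustrates this idea: accidents are aggregated into groups of consecutive positions in a spatial ordering and counted per week, and the resulting space-time matrix is displayed as a heat map. The inputs (an array of ordering positions and an array of week indices) and the function name are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def space_time_matrix(order_ids, week_ids, n_groups=50, n_weeks=522):
    """A sketch: aggregate accidents into groups of consecutive positions in a
    spatial ordering and count them per week. `order_ids` holds each accident's
    position in the spatial ordering, `week_ids` its week index (0..n_weeks-1)."""
    order_ids = np.asarray(order_ids)
    groups = (order_ids * n_groups) // (order_ids.max() + 1)   # spatial group per accident
    matrix = np.zeros((n_groups, n_weeks))
    for g, w in zip(groups, week_ids):
        matrix[g, w] += 1                                      # accident count per cell
    plt.imshow(matrix, aspect="auto", cmap="hot")              # the space-time matrix view
    plt.xlabel("week")
    plt.ylabel("spatial group (ordering)")
    plt.show()
    return matrix
```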
6 Conclusion and Discussion
We have presented and evaluated a set of spatial ordering/encoding methods to transform
spatial relations into a one-dimensional ordering and encoding, which preserves spatial
patterns as much as possible. Such an ordering and encoding can then be used in a variety of spatial
or spatio-temporal data mining tasks. We designed a comprehensive set of measures to
evaluate different orderings/encodings. The results revealed a number of important
characteristics and unique behaviors for each ordering/encoding method. We showed that the
optimal ordering/encoding with the complete-linkage clustering consistently gives the best
overall performance, with various data distributions tested. We also briefly introduced two
possible applications (out of many) that make use of the spatial ordering and encoding
methods we describe.
Evaluation results with various data distributions show that the optimal ordering/encoding
based on the complete-linkage clustering gives the best overall performance, surpassing
well-known space-filling curves in preserving spatial patterns. It can preserve spatial locality
in both directions, i.e., on one hand spatial neighbors are close in the ordering and on the
other hand neighbors in the ordering are also spatially close. The spatial ordering and
encoding can then help in a variety of geographic data mining problems.
Although the optimal ordering strategy generally produces a better ordering/encoding
than the simple ordering strategy for a given hierarchical clustering method, the primary
factor that controls the ordering/encoding quality is the clustering method. For example, the
single-linkage clustering gives very poor results in all tests, no matter which ordering strategy
is used. The two space-filling curves (i.e., the Hilbert curve and the Morton curve) have very
different characteristics. They generally work better with random data than clustered data but
neither of them gives a good overall performance—they are good on one type of measure
and perform badly on the other. Another advantage of the cluster-based
ordering methods is that they can work with non-Euclidean data spaces.
Lastly, we would like to briefly compare the computational scalability of each ordering
method. Space-filling curves are of O(n log n) complexity and thus can process very large data
sets. For each clustering-based method, the computational complexity involves two parts: the
clustering procedure and the ordering procedure. The simple ordering strategy is of O(n)
complexity while the optimal ordering strategy is of O(n3). The single-linkage clustering is of
O(n2logn) complexity and the complete-linkage clustering is of O(n3). Therefore, the
CLO_OPT and SNN_CLO_OPT ordering methods are the most time-consuming ones among
all clustering-based methods. To derive an ordering of the 3128 US cities, the CLO_OPT
method takes about 5 minutes on a desktop computer with 2.0GB of RAM memory and a
3.60GHz Pentium 4 CPU. Where efficiency is an issue, perhaps due to dataset size or time
criticality of the application, then the simple ordering strategy might provide a more viable
alternative.
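For readers who wish to experiment, the sketch below shows one way to derive a clustering-based ordering with off-the-shelf tools; it relies on SciPy's complete-linkage clustering and its optimal_leaf_ordering routine as stand-ins and is not the CLO_OPT implementation evaluated here.

import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list, optimal_leaf_ordering
from scipy.spatial.distance import pdist

def spatial_ordering(points, optimal=True):
    """Return a permutation of row indices ordering 2-D points."""
    d = pdist(points)                    # pairwise Euclidean distances, O(n^2)
    Z = linkage(d, method="complete")    # complete-linkage clustering
    if optimal:
        Z = optimal_leaf_ordering(Z, d)  # costly leaf reordering; skip for speed
    return leaves_list(Z)                # left-to-right leaf sequence

# Example: order 1,000 random points; each point's rank in the ordering can
# serve as its one-dimensional spatial encoding.
pts = np.random.default_rng(1).random((1000, 2))
order = spatial_ordering(pts)
ranks = np.empty(len(order), dtype=int)
ranks[order] = np.arange(len(order))

The Euclidean distance matrix can of course be replaced by network distances or any other dissimilarity, which is the non-Euclidean flexibility noted above.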
Acknowledgement
This research was partially supported by grant CA95949 from the National Cancer Institute.