2D, 3D and High-Dimensional Data and Information Visualization

advertisement
University of Hannover
Institut für Wirtschaftsinformatik (IWI)
2D, 3D and High-Dimensional Data
and Information Visualization
Kim Bartke
Email: kimbartke@yahoo.co.uk
Tutor: Prof. Michael H. Breitner
Seminar on Data and Information Management
SS 05
CONTENTS
1
Contents
1 Introduction
3
2 Data visualization in knowledge discovery
2.1 Data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4
4
3 Data visualization techniques
3.1 Two-dimensional data . . .
3.2 Three-dimensional data . . .
3.3 High-dimensional data . . .
3.3.1 Icon-based methods .
3.3.2 Hierarchical methods
3.3.3 Geometrical methods
5
5
5
6
7
8
9
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
4 Projection techniques
15
4.1 Multi-dimensional scaling (MDS) . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Self-organizing maps (SOMs) . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Interaction
5.1 Filtering . . . . . . .
5.2 Linking and brushing
5.3 Zooming . . . . . . .
5.4 Manipulation of data
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
19
19
20
20
20
6 Software and Applications
21
7 Outlook and Conclusion
22
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
LIST OF FIGURES
2
List of Figures
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Scatterplot of car data set [Hoffmann99] . . . . . . .
Total crimes . . . . . . . . . . . . . . . . . . . . . . .
3D linegraph (surface) [generated with Matlab] . . .
Chernoff faces [Ward99] . . . . . . . . . . . . . . . .
Star glyphs [Oellien03] . . . . . . . . . . . . . . . . .
Dimensional stacking . . . . . . . . . . . . . . . . . .
Fractal foam [Hoffmann99] . . . . . . . . . . . . . . .
Parallel coordinates for points A, B, C . . . . . . . .
Parallel coordinates (Iris data set) [Grinstein01] . . .
Parallel coordinates (Iris data set) [Hoffmann99] . . .
Andrew’s curves for data points A, B, C . . . . . . .
Andrews curves (Iris data set) [Hoffmann99] . . . . .
Scatter plot matrix (Car data set) [Hoffmann99] . . .
RadViz (Iris data set) [Grinstein01] . . . . . . . . . .
PolyViz (Iris data set) [Grinstein01] . . . . . . . . . .
Hyperbolic patch embedded into <3 . . . . . . . . . .
”Circle Limit III” (1958) . . . . . . . . . . . . . . . .
Self organizing map of the Iris data set [Grinstein01]
Tramnetwork F+C technique [Keahey] . . . . . . . .
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
5
6
6
7
8
9
10
10
11
12
12
13
14
15
16
17
18
19
21
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
1 INTRODUCTION
1
3
Introduction
We do it in the city, in the country side and on mountains; we even do it in space. You
are asking yourself what I am talking about. We collect data. Since the beginning of
mankind, people have gathered data. They used to do it by hand, counting mammoths
or cattle, watching the sun and the clouds. Nowadays we use electronic devices to inspect
and monitor our environment.
In police departments you can find huge databases about criminals and crime. Everywhere on the planet we see weather stations that collect meteorological data, and in space
large space shuttles examine planets, their atmosphere and the Earth.
These electronic devices enable us to collect a vast amount of data and todays computers
make it possible to store and process it. The reason we make the effort to accumulate data
in such quantities is because of the information it can yield. For example police officers
endeavour to find patterns in criminals’ behaviour to enable them to react on crimes more
quickly and more effectively, while meteorology stations are used to predict the weather
and, who knows, we may even find signs of life in outer space.
So we have two main goals in our data analysis: the generation of hypotheses and their
verification. This means that we use the gathered information to either describe the actual situation or make predictions for the behaviour of future data.
As mentioned before computers do not only allow us to store data but also to process
large amounts of it. In order to discover knowledge we often use numerical solutions such
as data mining algorithms. But with the increase of both the number of variables (or
dimensions) and the number of cases it is very demanding to recognize patterns in data.
It is also difficult to understand the data structure itself and the exploration process.
The numerical solutions are also very susceptible to noise and incorrect data. Remedial
measures can be taken by presenting the data visually.
Data visualization is the mapping of data into a Cartesian space. This method integrates
a person’s creativity and expertise into the knowledge discovery process and therefore allows a symbiosis of the computational power with our visual potentials. The visualization
does also give the user the opportunity to interact with the computer, e. g. change the
data and observe the reactions to gain a better insight into the data structure.
The greatest challenge for visualizing data is to find a good spatial representation. Most
of the time, the data that needs to be presented is high dimensional. That means in
order to communicate the information we have to scale the data so that we can display
it in two-dimensional (2D) or three-dimensional (3D) coordinate systems while losing the
least amount of information. This is fairly easy for scientific data. Problems arise if data
depends on variables such as customer satisfaction or purchase channels which are nonspatial.
In the following paper I will present the uses of data and information visualization in the
knowledge discovery process by explaining and evaluating data visualization techniques
for both 2D and 3D representation. I will then deal with human-machine interaction
methods and will also provide some examples of software packages and their application.
I will conclude with giving you a preview on the direction of further research, e. g. the
further development of the group of visualization tools.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
2 DATA VISUALIZATION IN KNOWLEDGE DISCOVERY
2
4
Data visualization in knowledge discovery
As mentioned before, knowledge discovery in databases (KDD) is the way of gaining
interpretable information out of a set of raw data. Fayyad et al. [Rhodes02] make the
distinction between KDD as the process of discovering knowledge whereas data mining
is the method of extracting patterns from the data. Bearing this in mind the following
steps in the KDD process can be identified [Oellien03, p. 2].
The first step is the pre-processing and data preparation. This includes the selection of
a particular part of the data set that seems to be suitable for the following process. It is
also very important to carry out noise reduction algorithms and exclude false data points
to guarantee a result that reflects its nature as well as possible. The data set might also
need to be adjusted. Normalization is a very useful and often performed transformation.
Another key technique is the projection of high-dimensional data into the two or three
dimensional space.
The second step is the actual use of data mining algorithms. The goal of this step is the
classification of the data as well as the clustering and summarization of it [Fayyad02]. It
concludes in the pattern recognition. Finally the information gained from the data needs
to be communicated to the user(s).
All these elements of the analytic process are often done with the help of visualization.
It is this visualization that this paper will focus on. Since the techniques always depend
on the data type, this chapter will start with an overview of data types.
2.1
Data types
Every data set has a general structure. It is always characterised by a group of variables
(also called dimensions) and the records the database contains. One way of categorizing
data is to differentiate between sets that can be described by dimensionality and sets that
cannot [Keim02].
The first group consists of one-dimensional, two-dimensional, three-dimensional and highdimensional data sets. The variable in one-dimensional data is usually time. An example is the log of interrupts in a processor. Two-dimensional data can often be found
in statistics like the number of financial transactions in a certain period of time. Threedimensional data can be positions in three-dimensional space or points on a surface
whereas time (the third dimension) varies. High-dimensional data contains all those
sets of data that have more than three considered variables. Examples are locations in
space that vary with time (here: time is the fourth dimension) or any other combination
of more than three variables, e. g. product - channel - territory - period - customer’s
income.
In the second group we distinguish between text and graphs. Especially since the birth
of the World Wide Web, analysing text (or in this case hypertext) becomes more and
more important. Text itself is not easily analysed by data visualization techniques, so
a transformation into numbers is necessary. A technique for this change from textual
to numerical data could be word counting for instance. A graph is a ”set of objects,
called nodes, and connections between these objects, called edges” [Keim02]. Any relational databases are examples for this type of data sets.
In the following chapters I will present several data visualization techniques for two-,
three- and high-dimensional data sets. I will also give an overview of two non-linear pro-
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
5
Figure 1: Scatterplot of car data set [Hoffmann99]
jection techniques to reduce the size of high dimensional data, namely multi-dimensional
scaling (MDS) and Kohonens Self-Organizing Maps (SOM) as an example for a neural
network algorithm.
3
3.1
Data visualization techniques
Two-dimensional data
Two-dimensional data can be visualized in different ways. A very common visualization
form is the scatterplot.
In a scatterplot the frame for the data presentation is a Cartesian coordinate system,
in which the axes correspond to the two dimensions. The data is usually represented by
points in the coordinate systems first quadrant (assuming the data point values are not
negative). In case of two or more data sets being displayed in the same coordinate system
different colours can be used to distinguish between the distinct plots. A problem with
this way of displaying data arises when the amount of data points gets very high as the
points become too dense. In order to avoid this Becker suggests binning of the data set
[Sahling03]. The quality of the visualization now depends on the number of bins and
their sizes. Figure 1 shows the distribution of miles per gallon (MPG) vs. horsepower for
American (red), European (blue) and Japanese (green) cars.
Another important visualization technique for two-dimensional data is the linegraph.
The difference to scatterplots is that this time the relation between the dimension on the
horizontal axis and the one on the vertical axis is definite. Figure 2 shows an example for
a linegraph displaying the number of crimes in Niedersachsen in the years 1993 to 2002.
Extensions of linegraphs are survey plots. They can be obtained by turning the plot
90 degrees clockwise and then halve the length of the rays and add this half on the other
side of the now vertical axis.
The last technique I would like to mention here is the visualization of data as barcharts.
Considering the last figure a barchart representation would be the same as above but
with the area under the graph filled in. Histograms are particular barcharts with the bar
standing for the sum of the data point class [Hoffmann02].
3.2
Three-dimensional data
The two-dimensional techniques can easily be extended to three dimensions. The third
dimension is achieved in scatterplots and barcharts by adding a further axis, orthogonal
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
6
Figure 2: Total crimes
Figure 3: 3D linegraph (surface) [generated with Matlab]
to the other two. The additional dimension in a linegraph representation has the effect
that the resulting plot is a surface. Figure 3 shows an example that has been generated
with Matlab.
A very widespread technique for visualizing the third dimension in a two-dimensional
coordinate system is the use of colour or a variation of the data point size.
Another very interesting visualization technique is animation to show the variation of
the plot with time for instance.
3.3
High-dimensional data
The visualization of high-dimensional data raises a very severe problem: the visualization
space is limited to three dimensions or even to only two since data is usually displayed
on screens or paper. One of the obstacles in the discovery of high-dimensional data
sets information Mihalisin [Mihalisin02] points out is that techniques of extracting lowdimensional information and displaying it cannot automatically be employed for highdimensional data as the data set size is too large. Next we have to study the effect on the
possible resulting data sets if we increase the number of variables or values they can hold.
In order to do this, consider the following example [Mihalisin02]: We have a data set
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
7
Figure 4: Chernoff faces [Ward99]
consisting of six columns which represent the attributes product, territory, sales channel,
method of payment, time of payment and a unique identifier. Furthermore we have
100,000 rows representing the records. Our company sells five products in five different
territories via two sales channels. We also offer the opportunity of two distinct methods
of payment, all divided into five quarters. This means there are 5 ∗ 5 ∗ 2 ∗ 2 ∗ 5 = 500
possible cell results. 100,000 records, each having one of the 500 cell results, leads to
100, 499!
= 101350
100, 000 ∗ 499
as the amount of different data sets. This is a huge number (larger than the number
quantity of atoms in the universe!) and it is only a very small database.
Coming now to the different visualization techniques, we distinguish between icon-based,
hierarchical and geometrical methods.
3.3.1
Icon-based methods
Icon-based methods are approaches that use icons (or glyphs) to represent high-dimensional
data. They map data components to graphical attributes.
The most famous technique is the use of Chernoff faces [Hoffmann02]. In this case a
data point is represented by an individual face whereas the features map the data dimensions. Five different sizes of the eyes could correspond to the five products of the example
above and the mouth might symbolize the two methods of payment. This scheme uses a
person’s ability of recognizing faces. Examples for Chernoff faces shows figure 4.
The probably most common icon-based technique is the use of star glyphs to denote
data points. A star glyph consists of a centre point with equally angled rays. These
branches correspond to the different dimensions and the length of the limbs mark the
value of this particular dimension for the studied data point. A polygon line connects the
outer ends of the spokes [Oellien03]. An illustration of the star glyphs approach is figure 5.
These icon-based techniques are very vivid but have several disadvantages. A very severe
problem is the organisation of the glyphs on the screen as no coordinate system representing two of the dimensions is provided. Even if you decided to use a Cartesian system it
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
8
Figure 5: Star glyphs [Oellien03]
would put more weight on these two dimensions and so probably distort the data pattern.
Another obstacle is the amount of variables and the size of the data set itself. If the
number of rays become too high a distinction between the different spokes and the values
they represent is not possible anymore. A similar unclear map emerges if the number of
data points exceeds a certain amount.
3.3.2
Hierarchical methods
The most important representative of the group of hierarchical visualization techniques
is dimensional stacking. It is a method of embedding coordinate systems recursively
into each other [Grinstein02a]. Consider again the example with the five products, five
territories, two sales channels, two methods of payment and five quarters [Mihalisin02].
First of all you have to select the two outermost dimensions. We choose the quarters and
the pay types. Our horizontal axis is now divided into five parts while the vertical axis
becomes halved. We now decide that we would like the sales channel to be embedded
into the method of payment, so each part of the pay type axis gets further divided into
two parts that represent the different channels. The axis corresponding to the quarters
will embed the products so these elements become subdivided as well. Finally the upright
axis lodges the five territories. The resulting coordinate axes combination system can be
obtained in figure 6. It shows that goods of product type four, sold in quarter one in
territory four, via the first sales channel and the first type of payment can be represented
by the coloured rectangle. In order to visualize the amount of data points you can use a
colour/grey scale. Considering the colour scale drawn next to the plot the filled rectangle
would represent an amount of less than 40,000 items. This value is binned since otherwise
a clear visualization would not be possible.
The common depiction of the dimensional stacking technique is a bit more compact and
not as nicely presented as the one above. Usually the rectangles, which are now spaced
to make the distinction between the different attribute combinations easier, are close to
each other, only separated by a thicker line.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
9
Figure 6: Dimensional stacking
This method is very useful for hierarchical data sets that only have a small number of
dimensions as otherwise the embedding process will make the resulting plot too crowded.
A great challenge is the question of labelling. The way chosen in the example is one
possibility of naming the different variables in the plot.
A technique that displays the correlation between dimensions (not the data itself!) recursively [Hoffmann02] is the fractal foam. The starting point is a chosen dimension that is
depicted by a coloured circle. Attached to this circle are further circles, which symbolize
the other dimensions. The size of these rings corresponds to the correlation between the
inner circle and the fastened ones. A high correlation requires a large circle. Fixed to the
second layer of circles is a third layer which describes the correlation of these dimensions
and so on. An example of fractal foam can be found in figure 7.
3.3.3
Geometrical methods
Geometrical methods are a very large group of visualization techniques.
Probably the easiest and most commonly used one is the method of parallel coordinates. Here the dimensions are represented by parallel lines, which are equally spaced.
They are linearly scaled so that the bottom of the axis stands for the lowest possible value
whereas the top corresponds to the highest value. A data point is now drawn into this
system of axes with a polygonal line, which crosses the variable lines at the locations the
data point holds for the examined dimension. A simple example with three points and four
dimensions is shown in figure 8. The points displayed are A = (1; 3; 2; 5), B = (2; 4; 1; 6)
and C = (1; 4; 3; 5).
This method is not exclusively applicable to data sets that are as simple as the last one.
One of the familiar high-dimensional data set examples used to explain data visualization
techniques is the Iris data set. It consists of three different Iris types, namely Iris Setosa,
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
10
Figure 7: Fractal foam (sepal length - centre (white), petal length - right (red), petal
width - top (yellow), sepal width - bottom (green)) [Hoffmann99]
Figure 8: Parallel coordinates for points A, B, C
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
11
Figure 9: Parallel coordinates (Iris data set) [Grinstein01]
Iris Versicolor and Iris Virginica. The variables of this data set are the sepal length, the
sepal width, the petal length and the petal width, all measured in millimetres. As you
can see in plot 9 the parallel coordinate technique is a tool which enables you to find out
attributes that allow a categorization of the different flower types. In the diagram the
petal width seems to be a good classifier for the red Iris type. It is also a fairly good
attribute to distinguish between the violet and green flower category.
A very significant feature of this visualization technique is that the dimensions are treated
equally. This characteristic permits a rearrangement of the displayed dimensions, which
gives another view on the data and therefore might lead to the recognition of certain
patterns (or classification attributes) that would otherwise be hidden in the actual visualization arrangement.
Figure 10 shows the same Iris data set but this time normalized and with the dimensions
sepal width and sepal length swapped. The resultant graph looks very different and much
clearer.
Another interesting geometrical visualization technique is the use of Andrew’s curves
[Hoffmann02]. This method plots each data point as a function of the data values using a
specific equation. The data point curves are usually sketched in the interval −π < t < π.
The function which draws these curves is shown as:
x1
f (t) = √ + x2 · sin(t) + x3 · cos(t) + x4 · sin(2 · t) + x5 · cos(2 · t) + . . . ,
2
where x = (x1 , x2 , . . . , xn ) and xn are the values of the data points for the particular
dimension. Consider the example of the three data points already used to explain the
parallel coordinates technique (A = (1; 3; 2; 5), B = (2; 4; 1; 6) and C = (1; 4; 3; 5)).
For data point A, the function fA (t) =
For data point B, the function fB (t) =
For data point C, the function fC (t) =
√1
2
√2
2
√1
2
+ 3 · sin(t) + 2 · cos(t) + 5 · sin(2 · t).
+ 4 · sin(t) + 1 · cos(t) + 6 · sin(2 · t).
+ 4 · sin(t) + 3 · cos(t) + 5 · sin(2 · t).
If you plot these three data points into one coordinate system using Matlab you obtain the result depicted in figure 11. Applying this algorithm now on the Iris data set
mentioned before results in a graph (Figure 12) that looks slightly more complex.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
12
Figure 10: Parallel coordinates (Iris data set) [Hoffmann99]
Figure 11: Andrew’s curves for data points A, B, C
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
13
Figure 12: Andrews curves (Iris data set) [Hoffmann99]
The advantage of this algorithm is that it is easily applied to data with a large amount of
dimensions. The disadvantage is the long computational time as every data point requires
the calculation of a trigonometric function [Hoffmann02].
A very basic technique to visualize high-dimensional data is the application of multiple views. They are often used with scatterplots or barcharts leading to an n × n cell
matrix, where n is the number of dimensions. Each cell of this matrix is then a scatterplot or a barchart respectively. This method is widely employed for data sets that
contain diverse attributes. It reveals correlations and disparities between variables since
the representation of the different component combinations next to each other allows a
visual comparison of the possible connections.
In the next example the method has been applied to the car data set, another widely
employed set for visualization techniques. This table contains the combinations of miles
per gallon (MPG), year of manufacture, cylinders, acceleration, horsepower and weight
for three different car types. The red spots in figure 13 symbolize American cars, the
green ones Japanese cars and the blue ones European cars. This figure clearly identifies
a positive correlation between horsepower and weight, whereas the combination of MPG
and weight reveals a negative correlation [Hoffmann02].
Even though this method is a very functional tool in the visualization of data it does have
several disadvantages. A very problematic one is the fact that the user becomes overwhelmed by the number of charts they have to evaluate and keep in mind while doing so.
The usage of space is a more practical aspect that needs consideration. The car example
produces a matrix, which is not only a manageable quantity to work with but also to
display. If the data set was extended to ten dimensions for instance the presentation of
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
3 DATA VISUALIZATION TECHNIQUES
14
Figure 13: Scatter plot matrix (Car data set) [Hoffmann99]
the corresponding graph in a clear way would no longer be possible.
The last two techniques I would like to present in this paper belong to the division of
anchor visualization methods. They are both fairly new approaches to the problem, the
second being the further development of the first one.
Radial Coordinate Visualization (RadViz) uses the spring paradigm [Hoffmann02].
From a centre point n equally spaced limbs of the same length spread out, each representing one dimension. The ends of the lines mark the dimensional anchor (DA) of the
respective variable, which are connected forming a circle. Before the data points can be
visualized by this technique they need to be normalized. After that one end of a spring
is fastened to each dimensional anchor, the other end to the data point. The spring constant of each spring is the value of the data point of the respective dimension. In order
to determine the location of the data point the sum of the spring forces needs to equal
zero. If you apply this method to the well known Iris data set you can obtain figure 14.
An advantage of RadViz is the fact that it preserves certain symmetries of the data set
[Hoffmann02]. The major disadvantage is the overlap of points.
The second dimensional anchor technique, which has been named PolyViz, takes remedial measures. The emerging plot is a combination of RadViz and the application of
the barchart technique. It illustrates the DAs not as points as in RadViz but as lines so
that the graph becomes a polygon. This technique nevertheless shows the clustering of
the data points in the middle of the polygon as it uses the same spring paradigm. But
it also makes a study of the distribution along the different dimensions possible since it
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
4 PROJECTION TECHNIQUES
15
Figure 14: RadViz (Iris data set) [Grinstein01]
plots this scattering along the axes using the barchart technique [Hoffmann02] (Figure
15).
All the techniques explained above visualize data sets without trying to change them
in order to simplify the visualization. In the following chapter I will introduce non-linear
projection methods that reduce the size of the dimension vector so that the display of the
data sets becomes facilitated.
4
Projection techniques
The general goal of projection techniques is the reduction of the dimensionality of data
to obtain a spatial mapping of the particular data set in the available space. I will
distinguish between two methods that differ in their side condition. The first method,
multi-dimensional scaling, tries to preserve distances between data points whereas neural
networks focus on the maintenance of structure.
4.1
Multi-dimensional scaling (MDS)
As stated above multi-dimensional scaling has its focus on the preservation of distance.
The distance dij is the Euclidean distance between the data points xi and xj in the
n-dimensional space, which is dij = kxi − xj k, with xi ∈ <n , i, j ∈ 1, 2, . . . , N . Multidimensional scaling attempts to reconstruct the distances between the data points in
the n-dimensional space by the determination of dissimilarity vectors δij for the subvector space. With non-linear MDS the relationships between the distance vectors and
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
4 PROJECTION TECHNIQUES
16
Figure 15: PolyViz (Iris data set) [Grinstein01]
dissimilarity vectors are not proportional. In order to determine the dissimilarity vector
δij we need to apply a monotone transformation D(.), which results in the disparity matrix
Dij = D(δij ) [Walter02].
An extensively used algorithm of multi-dimensional scaling is the Sammon’s Algorithm
[Walter02]. Sammon came up with the following equation:
E({xi }) =
N X
X
wij (dij − Dij )2
i=0 j>i
This formula describes a minimization problem as a sum over the weighted squares of the
differences of distance and disparity vectors. The values of wij are a means to normalize
the cost function, as well as weigh the different disparities; they depend on the normalization technique (local, intermediate or global normalization). In order to calculate the
cost or stress - E, Sammon recommends the employment of iterative methods such as the
Newton method to recursively calculate the minimum.
A different approach to the problem of preserving distance is the hyperbolic multidimensional scaling (H-MDS). Before explaining the basic idea of H-MDS I would like
to give an overview of the hyperbolic space.
If people think of geometry and spaces they usually assume spherical geometry. This
geometry’s property is the positive curvature which results in spherical surfaces like the
moon. As there is geometry with positive curvature, there also exists geometry with negative curvature, which is called the hyperbolic plane - H2. The hyperbolic plane can be
represented by two important equations [Walter02]. The area ’a’ and the circumference
’c’ of radius ’r’ in H2 can be defined by:
a(r) = 4πsinh2 ( 2r ) and c(r) = 2πsinh(r)
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
4 PROJECTION TECHNIQUES
17
Figure 16: Hyperbolic patch embedded into <3 . The circumference and area grow exponentially in the drawn circle. The sum of angles in the triangle is smaller than 180
[Walter].
These two equations hold an amazing feature of the hyperbolic plane. For small values
2
of r sinh2 ( 2r ) ≈ r4 and sinh(r) ≈ r, so that a(r) ≈ πr2 and c(r) ≈ 2πr. This is identical to the functional description of circles in spherical geometry. For larger values of
r though both the area and the circumference grow exponentially. In order to imagine
this exponential growth you can think of the H2 as a ball of crumpled paper. If you now
draw a circle on this wrinkled sheet of paper and unfold it afterwards the area and the
circumference of the resulting object will be much larger than it appeared when drawn.
Figure 16 shows the embedding of an extract of the H2 into the <3 .
There are several approaches towards mapping the hyperbolic plane into the Euclidean
surface. I will provide you with a rough overview of the Poincaré model as it is the most
widely used method for this task.
Basic features of this projection are the display compatibility, which means that the entire
H2 space fits into the Poincaré disk (PD), and the infinite size of the circle rim. The first
mentioned characteristic inspired M. Escher to create the picture in figure 17. Note that
in the drawing the white lines are perpendicular to the rim at any time and represent
straight lines in the H2. It also clearly illustrates the ”fish-eye” effect, which is the larger
appearance of the images in the middle in comparison to the ones in the outer parts of
the circle even though they are all the same size.
Coming back to hyperbolic multi-dimensional scaling we now have to adapt the distance
vector dij to the new assumptions. The distance according to Riemann [Walter02, cf. 8]
is now given by:
|xi − xj |
), xi , xj ∈ P D
dij = arctanh(
|1 − xi xj |
I will not go into detail how to determine the cost function resulting from this type of
distance vector but would like to mention that the non-linearity of the distance vector
dij influences the transformation matrix in H2 more than it did in the spherical geometry
[Walter02]. This suggests that the choice of D(.) needs a lot more consideration.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
4 PROJECTION TECHNIQUES
18
Figure 17: ”Circle Limit III” (1958)
4.2
Self-organizing maps (SOMs)
The second method I would like to familiarise you with, is the self-organizing maps, a
method of artificial neural networks.
Neural networks are adapted from neurobiological models [Oellien03]. In order to reduce
dimensionality they use a combination of analytic and graphical techniques to group data
while preserving the data structure [Grinstein01]. A neural network consists of several
layers that are made up of neurons. The data is presented to the input layer, processed
inside the network and then returned from the output layer. During the phase of processing various rules are applied to the given data set, constituting the learning algorithm.
Neural networks can further be divided into supervised and unsupervised nets. Selforganizing maps (SOMs) are the most famous example of unsupervised learning. Teuvo
Kohonen developed this method; for this reason they are often referred to as Kohonen’s
SOMs.
The term ”unsupervised” corresponds to the fact that only the input data set is presented to the neural network. The algorithm then determines similarities automatically
and returns a map as the output of the process. This output map is characterized by
numerous clusters, in which similar objects lie close to each other therefore showing the
relationship between the input variables. Since there is no need to know anything about
the underlying relationships of the data points, this technique is ideal for data sets for
which the structure or system is unknown. Figure 18 shows the self-organized map of the
Iris data set.
In the next chapter I will deal with the need of interacting with the visualized data
and will present the most essential techniques.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
5 INTERACTION
19
Figure 18: Self organizing map of the Iris data set [Grinstein01]
5
Interaction
Interaction plays an important role in the understanding of the knowledge discovery process and the data itself. The methods I present cover the most basic but crucial techniques
that are necessary to become an active data analyst. The distinction between an active
and a passive user is the ability to discuss problems that arise [Thearling02] and by working on the data set (in the visualization mode) find solutions for them.
The order in which I present the techniques and the summarization I chose is not the only
possible solution but it reflects the normal flow of interactive applications.
5.1
Filtering
The interaction technique that does the central procedure of selecting data points is
filtering. The use of selection is either the elimination of impossible/false data points
or the focus on a particular cluster that needs further consideration. As an example for
impossible data points consider the following case: The data set is the number of crimes
over the age of the population. If the plot indicates crimes committed by three-years-old
children or people that are 150 years old we can assume that these points have been
entered incorrectly into the system and should be eliminated.
In the category of filtering techniques we distinguish between browsing and querying.
Browsing is the direct selection of data points. This is a suitable method for identifying
clusters and, for instance, choosing them for further investigation. Querying however
offers the opportunity of directly entering specifications the data points are required to
meet. This could possibly eradicate the unrealistic data points mentioned in the example
above.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
5 INTERACTION
5.2
20
Linking and brushing
Other selecting tools are linking and brushing. Even though these procedures can also be
used to completely delete data points, this is not their primary goal.
Linking refers to the connection between different plots. In multiple views (refer to 3.3.3
Geometrical methods/multiple views) the manipulation of data in one plot automatically
affects the respective data set in the other (linked) graphs.
Highlighting data points is also called brushing. It is often used in connection with
linking so the user can observe the effect of highlighting data in one graph on the other
views. Consider the following example.
The data set is three dimensional with the dimensions product type, territory and number
of sales. As a visualization technique we use scatterplots in multiple views. The first
scatterplot shows the distribution of the number of sales over the different product types,
whereas the second one presents the number of sales over the territories. Highlighting the
data points for the first product in the first graph leads to (since the graphs are linked)
highlighting the respective data points in the second plot. The highlighted cluster in the
linked graph now indicates (e. g.) that the chosen product is mostly sold in a particular
territory. This connection between the two attributes might not have been seen if the
graphs had not been linked. This is a very simple example but the principle can easily
be applied to higher dimensional data sets. It is also not limited to multiple views of the
same visualization technique but is also very useful for multiple views of different plot
types.
5.3
Zooming
Zooming is the method of showing a particular part of the data set in detail. The problem
that arises with this way of focussing on portions of the data is that you might lose
the big picture. A remedial measure takes the approach of non-linear magnification.
Methods that fall within this category are also known as focus+context (F+C) techniques
[Sahling03]. This progressive form of zooming tries to expand the selected area while
showing the original context. Examples for non-linear magnification are the fish-eye lens
or the hyperbolic space. If you view data through a fish-eye lens it seems to be seen with
a ”wide-angle camera lens” [Sahling03]. The example in figure 19 shows the tram network
in Washington D.C., USA. Due to its properties (see also 4.1 Multi-dimensional scaling
(MDS)/HMDS) the use of the hyperbolic space is predestined for zooming interactions.
5.4
Manipulation of data
Manipulation of data, mainly input data, or the removal of outliers is a very important
part of the whole knowledge discovery process. The observation of the consequences in
the output data (or neural network) due to a change in the input data helps to get a
feel for the basics. It also allows the study of ”what if”-questions on the particular data
set. All these methods cannot be related to a particular location in the visualization or
interaction process in KDD but are applicable at any time and therefore indispensable.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
6 SOFTWARE AND APPLICATIONS
21
Figure 19: Tramnetwork F+C technique [Keahey]
6
Software and Applications
Several software packets are available at the moment, most of them are commercial but
there are a couple of public-domain software tools as well. I would like to give an overview
of two packages as examples.
XGobi is a data visualization tool that has been developed by Deborah F. Swayne, Di
Cook and Andreas Buja. It handles multivariate data presentation using scatterplots.
This software also offers projection techniques to reduce the dimensionality of the data
sets and also supports the basic interaction methods like brushing and zooming. A very
important feature of this tool is the handling of missing values. An extension and further
development of XGobi is GGobi. It provides the user with a new and clearer interface
and supports new technology or software such as XML and database systems.
Xmdv is a public-domain software. The main focus of this tool is the interaction of
user and machine. It handles multidimensional data sets by applying the visualization
tools scatterplots, star glyphs, parallel coordinates and dimensional stacking (all presented
in this paper). Since this software concentrates on interaction it supports the methods I
dealt with in the last chapter. Furthermore it offers the opportunity to mask dimensions
in order to study their impact on the clustering in the data set. A unique feature of this
tool is its use of clustering data sets to visualize them without overloading the screen and
therefore the user. This is a good aspect regarding clarity but it excludes the user from
the process.
Both of the above mentioned software packages are available for many industry branches.
There often exist certain modules that adapt the general idea of the tool to those different areas of interest. Xmdv for instance offers modules for fields such as finance and
geochemistry (cf. Xmdv). A very interesting application for such tools is investigative
visualization in the medical field. On the one hand graphical representations of statistics
such as the number of breast cancer patients for the last twenty years are essential to
reveal trends. On the other hand, especially useful for research, the graphs can be used to
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
REFERENCES
22
do studies on causes of diseases. The occurrence of certain illnesses such as breast cancer
depends on various variables. In order to find out which combination of these variables
leads to the actual disease a graphical representation is crucial.
I would like to conclude my paper with a glance at future research.
7
Outlook and Conclusion
Even though there has been a large amount of research on the human-machine integration
it is still an area which needs further improvement and development to more effectively
and efficiently use the power that results of this symbiosis. Keim [Keim02] is of the opinion that not only computers in general need to be integrated with the human. He states
that the two components, namely visualization techniques and the well-known methods
applied in different areas such as statistics and operations research for instance have to
be brought together more extensively. From this point of view research on the human
perceptibility should also be increased [Keahey99]. Especially methods that do not reduce
dimensionality can overload the user and therefore have opposite effects as the user might
not be able to see the overall picture.
Another question that arises [Brodbeck97] is whether to mainly present data in the 2D
space or to use three-dimensional visualization techniques or perhaps even animations.
Animation is a very functional visualization technique. It is widely applicable to many
kinds of data sets as the last dimension allows the visualization of a flow in a variable,
usually time. An advantage of 3D is that the human world is three dimensional and
therefore this visualization seems to be more natural to the human user than any of the
others. The problem though is again the overload 3D data produces. Humans do still
have to break down the visualized data into two dimensions and compare it in their minds.
Intensive interaction could be helpful, such as rotation of the data set as well as the already presented tools. Until now 2D representation is the most extensively used form of
visualization.
Hyperbolic planes are also an area that visualization tools should employ further. The
advantages of its infinite representation space and focus+context capabilities are very
important features that will be greatly needed in the future. The hindrance could be the
fact that hyperbolic planes are geometries we are not familiar with. This might incur the
users displeasure but I think that professional education and information about this view
of the world will improve the chance of a smooth introduction of hyperbolic planes as the
base of data representation.
In order to conclude this paper I can say that the existing techniques use very distinct
approaches to the problems. Each offers a selection opportunity, since, as mentioned before, different data types need diverse graphical representations. As I pointed out above
there is still a lot of research that needs to be done but I think the requirement has been
identified and we can therefore look forward to a large amount of new and innovative
techniques for the visualization of data and information in the future.
References
[Brodbeck97]
Brodbeck, D. et al.:
Domesticating Bead:
Adapting an Information Visualization System to a Financial Institution.
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
REFERENCES
23
http://www.dcs.gla.ac.uk/˜matthew/papers/infovis97.pdf,
printed 20.04.2005
1997,
[Docherty02]
Docherty, P., Beck, A. (edt): A Visual Metaphor for Knowledge Discovery. In: Information Visualization in Data Mining and Knowledge
Discovery. Morgan Kaufmann, San Francisco (CA) 2002
[Fayyad02]
Fayyad, U., Grinstein, G. (editors): Introduction. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann,
San Francisco (CA) 2002
[Grinstein01]
Grinstein, G., Trutschi, M., Cvek, U.: High-Dimensional Visualizations.
http://www.cs.uml.edu/˜mtrutsch/research/High2001,
printed
Dimensional Visualizations-KDD2001-color.pdf,
20.04.2005
[Grinstein02a]
Grinstein, G., Hoffmann, P., Pickett, R. (edt): Benchmark Development
for the Evaluation of Visualization for Data Mining. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan
Kaufmann, San Francisco (CA) 2002
[Grinstein02b] Grinstein, G., Ward, M. (edt): Introduction to Data Visualization. In:
Information Visualization in Data Mining and Knowledge Discovery.
Morgan Kaufmann, San Francisco (CA) 2002
[Hoffmann99]
Hoffmann, P. E.: Table Visualizations: A Formal Model and Its Applications, http://home.comcast.net/˜peh2.hoffman/tablevizx.pdf, 1999,
printed 07.05.2005
[Hoffmann02]
Hoffmann, P., Grinstein, G. (edt): A Survey of Visualizations for HighDimensional Data Mining. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco (CA)
2002
[Kaidi00]
Kaidi, Z.: Data visualization. http://www.cs.uic.edu/˜kzhao/Papers/
00 course Data visualization.pdf, 2000, printed 20.05.2005
[Keahey99]
Keahey, T.A.: Visualization of High-Dimensional Clusters Using Nonlinear Magnification. http://www.ccs.lanl.gov/ccs1/projects/Viz/pdfs/
99-spie.pdf, 1999, printed 20.04.2005
[Keahey]
Keahey, T.A.:
A Brief Tour of Nonlinear Magnification.
http://www.cs.indiana.edu/˜tkeahey/research/nlm/nlmTour.html
[Keim02]
Keim, D. A.: Information Visualization and Visual Data Mining. http://ieeexplore.ieee.org/iel5/2945/21152/00981847.pdf?tp=&
arnumber=981847&isnumber=21152, 2002, printed 20.04.2005
[Kontkanen99] Kontkanen, P. et al.: Supervised model-based visualization of high- dimensional data. http://cosco.hiit.fi/Articles/ida00.pdf, 1999, printed
20.04.2005
[Koua03]
Koua, E.L: Using Self-Organizing Maps for Information Visualization and Knowledge Discovery in Complex Geospatial Datasets.
2003,
http://www.itc.nl/library/Papers 2003/art proc/koua.pdf,
printed 20.04.2005
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
REFERENCES
24
[Rhodes02]
Rhodes, P. (edt): Discovering New Relationships. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann,
San Francisco (CA) 2002
[Mihalisin02]
Mihalisin, T. (edt): Data Warfare and Multidimensional Education. In:
Information Visualization in Data Mining and Knowledge Discovery.
Morgan Kaufmann, San Francisco (CA) 2002
[Oellien03]
Oellien, F.:
Data Mining und Datenvisualisierung (ch 5 of
Algorithmen und Applikationen zur interaktiven Visualisierung
und Analyse chemiespezifischer Datensätze). http://www2.ccc.unierlangen.de/people/Frank Oellien/diss/index.html, 2003, printed
08.05.2005
[Sahling03]
Sahling, G. N.:
Interactive 3D Scatterplots
From High Dimensional Data to Insight. http://www.vrvis.at/vis/resources/DANSahling/masterthesis.html, 2003
[Thearling02]
Thearling, K. et al. (edt): Visualizing Data Mining Models. In: Information Visualization in Data Mining and Knowledge Discovery. Morgan
Kaufmann, San Francisco (CA) 2002
[Walter]
Walter, J. A.: Interactive Visualization and Navigation using the Hyperbolic Space. http://www.techfak.uni-bielefeld.de/˜walter/h2vis/
[Walter02]
Walter, J. A., Ritter, H.:
On Interactive Visualization
of High-dimensional Data using the Hyperbolic Plane.
http://www.techfak.uni-bielefeld.de/˜walter/pub/Walter02-kdd.pdf,
2002, printed 07.05.2005
[Ward99]
Ward,
M.
O.:
A
Taxonomy
of
Glyph
Placement
Strategies
for
Multidimensional
Data
Visualization.
http://davis.wpi.edu/˜matt/courses/glyphs/, 1999
[Wezel03]
Wezel, M. C. van, Kosters, W.A.:
Nonmetric multidimensional scaling: Neural networks versus traditional techniques.
http://www.liacs.nl/˜kosters/ida00191.PDF,
2003,
printed
07.05.2005
c 2005 Kim Bartke: 2D, 3D and High-Dimensional Data and Information Visualization
Download