Representing Data Graphically

Data Visualization
By: Taggert J. Brooks
Representing Data Graphically
Data visualization, sometimes called information visualization (or infovis1 for short),
comes from the convergence of computer science, statistics and design. It is a marriage
between science and art, between the left and right halves of the brain. The goal is to
make data presentation interesting, aesthetically pleasing and hopefully informative.
Good data visualization goes further by revealing relationships in the data that might
otherwise have gone unnoticed. In the absence of hypothesis tests it is easy to
discount visualization as unscientific, but that would be a mistake. There are many uses
of data visualization, and the reality is that hypothesis testing can bore the audience, if
not completely exceed their level of understanding. Data visualization, then, is a means to
an end for statisticians who want to be better communicators. And it is a pathway to a
better understanding of the data for the designers amongst us.
"In our excitement to produce what we could only make before with great effort, many
of us have lost sight of the real purpose of quantitative displays — to provide the
reader with important, meaningful, and useful insight."
— Stephen Few
I would add that good visualization techniques will not only help the reader, but also
help the producer of the visualization to discover meaningful insights.
This document is meant to be an introduction to different visualization techniques, and
though I provide some practical how-to, I do not provide everything. Where I fail,
Google and the internet can fill in the gaps.
Too Much Data
The internet has led to an explosion in the amount of data that is collected, stored
and easily accessible. It has done this by dramatically lowering the costs of those
activities. The problem we now face is filtering the valuable data from the worthless data
and determining how to use it to inform business decisions or research. A recent
example of the ubiquity of new data can be taken from the 2008 presidential election. We
have data on the frequency of word searches in Google by each minute of the Vice
Presidential debate between Senator Joe Biden and Governor Sarah Palin.2 Apparently
people were trying to figure out exactly what a “Maverick” actually is.
What type of media will you use to make your presentation? How long does your
audience have to take in the data? The longer the audience has, the more data dense the
visualization can and should be. The less time and autonomy your audience has to
peruse the data, the more simplified the visualization should be.
1
A wiki dedicated to Infovis: http://www.infovis-wiki.net/index.php?title=Main_Page
2
A graph of the searches can be found here
http://www.readwriteweb.com/archives/google_has_changed_political_d.php
Will it be a written report, a PowerPoint presentation, or is the data going to be
rendered on the web? In other words, will the visualization be static or dynamic? These
questions are some of the first you should answer when selecting a visualization
method.
Visualization is about Discovery, Discerning Patterns, and Disseminating
Information.
Below we have a nice infographic describing the “data collection” to “data use”
continuum.
A good example of the effectiveness of visualization for identifying outliers or data
errors can be found below. This is derived from 3
3
http://www.visualizingeconomics.com/2009/07/12/data-scienist-data-geek-designer/
The picture above is a great way of using visualization to identify errant data. The
underlying data in this case must be no more than 100%, yet we can see one mistaken
observation.4
Another example comes from an entry on the R-bloggers blog:
When the data was plotted, the differences between the perceptions and the
realities were immediately visible – and the reporters knew they were on the
right track. “It’s not just about producing graphics for publication,” Aldhous
explains. “It’s about playing around and making a bunch of graphics that help you
explore your data. This kind of graphical analysis is a really useful way to help
you understand what you’re dealing with, because if you can’t see it, you can’t
really understand it. But when you start graphing it out, you can really see what
you’ve got.”5
Selecting the Right Graph
Design is choice. The theory of the visual display of quantitative information consists of
principles that generate design options and that guide choices among options. The
principles should not be applied rigidly or in a peevish spirit; they are not logically or
mathematically certain; and it is better to violate any principle than to place graceless
or inelegant marks on paper.
— Edward Tufte, The Visual Display of Quantitative Information
Selecting the appropriate display can be difficult because it requires a good
understanding of the nature of your data and of statistics, as well as a good understanding
of design principles. There are many possibilities for a given variable or dataset, but you
need a place to start. There are a few web pages that try to help, but none addresses
both the statistics and the design issues simultaneously.6 As the quote by Tufte suggests,
the choice of design does not easily fit into a simple algorithm.
4
This is from the higher ed weblog http://blog.une.edu.au/robbi/2009/08/06/data-testing-usingvisualisation/
5
http://www.r-bloggers.com/r-is-hot-part-4/
6
This webpage http://interface.fh-potsdam.de/infodesignpatterns/news.php is closer to the visual end,
while this webpage http://www.ncsu.edu/labwrite/res/gh/gh-graphtype.html does a better job of helping
select the appropriate graph from a statistics perspective, and this one helps choose the right statistical test:
http://www.ats.ucla.edu/stat/stata/whatstat/default.htm
Some other examples of websites which try to provide guidance in the choice of
appropriate representations can be found in the blog entry titled “Things should be made
as simple as possible, but not any simpler” 7, a quote commonly attributed to Einstein.
The selection process can be broken into three steps:
1. Determine the relationship you want to display
2. Determine if you want to emphasize individual values or the overall pattern
3. Determine the chart type
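The three steps above can be sketched as a small lookup in Python. This is only an illustration: the relationship names, the `suggest_charts` function and the mapping itself are assumptions made for this sketch, loosely based on the chart types described later in this document, and they are not an authoritative rule.

```python
# Hypothetical mapping from the relationship you want to display (step 1)
# to chart types discussed in this document (step 3).
CHART_BY_RELATIONSHIP = {
    "part_to_whole": ["pie chart"],
    "categorical_comparison": ["bar chart", "column chart"],
    "change_over_time": ["line graph", "area chart"],
    "two_continuous_variables": ["scatter plot"],
    "three_variables": ["bubble chart"],
    "hierarchy": ["treemap"],
    "geography": ["choropleth map", "cartogram"],
    "rank_over_time": ["bump chart"],
}


def suggest_charts(relationship, emphasis="pattern"):
    """Return candidate chart types for a relationship and an emphasis.

    emphasis is "pattern" (overall shape) or "values" (individual numbers).
    When individual values matter, a table is added as a candidate (step 2).
    """
    if relationship not in CHART_BY_RELATIONSHIP:
        raise ValueError(f"unknown relationship: {relationship!r}")
    if emphasis not in ("pattern", "values"):
        raise ValueError(f"unknown emphasis: {emphasis!r}")
    suggestions = list(CHART_BY_RELATIONSHIP[relationship])
    if emphasis == "values":
        suggestions.append("table")
    return suggestions


print(suggest_charts("change_over_time"))
print(suggest_charts("part_to_whole", emphasis="values"))
```

A lookup like this is only a place to start. As the Tufte quote above suggests, the final choice remains a matter of judgment.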
Bad charts
Before we begin discussing some of the common, and not so common, visualizations it
might be better to provide some links to bad charts and improvements. Stephen Few
provides some excellent examples of bad charts and then provides recommendations
for fixing the problems.8 Another set of examples is provided here.9
Many of these criticisms and corrections are based on the rules and suggestions from
the work of Edward Tufte. His rules can be found at his website.10
7
http://blog.xlcubed.com/chart-rules-as-simple-as-possible-but-not-any-simpler/ A follow up can be found
here as well. http://blog.xlcubed.com/household-income-distribution-1967-2005-as-small-multiples-chart/.
Still another example of a chart chooser can be found here: http://chartchooser.juiceanalytics.com/, which
also produces Excel templates from your choices
8
http://www.perceptualedge.com/examples.php
9
http://lilt.ilstu.edu/jpda/charts/bad_charts1.htm
10
http://www.washington.edu/computing/training/560/zz-tufte.html
Seth Godin, the famed marketer, also has rules for making good graphs11.
Graph Types
Microsoft Excel is a common tool for creating graphic representations, but sadly its
default choices are often not good design choices, and many of the default graphs it
provides should never be used. While the defaults in Excel 2007 are much better than the
horrible defaults in Excel 2003, both can benefit from some alterations. For some details
on altering the charts after Excel has created one using the default templates, see the
links below.12, 13
Some traditional graphical means of data representation, which can be found under the
INSERT ribbon in Excel 2007:
Pie chart
The pie chart is useful for representing the relative proportions of a few categories. The
more categories, the greater the number of “slices”, the more difficult the chart is to
read.
The field of information visualization is rather new, and like any new field it has very
impassioned people with starkly different opinions. For some, their beliefs are
almost religious, and the rules they profess are delivered with the same vigor as a Baptist
minister delivering a sermon from the pulpit. An example of this occurred in the
blogosphere when marketing guru Seth Godin suggested there should be no more bar
charts, only pie charts. This led to a swift reply from the community of InfoVis folks,
many of whom countered with the exact opposite advice. Remember the quote from
Tufte above: the reality is always somewhere in between, born of the exercise of good
judgment14.
The problem with pie charts, as infovis people will tell you, is that consumers of
visualizations have a hard time estimating angles. In fact, they often get them wrong, thus
drawing the wrong inference from the slices of a pie chart. People are better at visually
judging height, which is why many infovis people prefer the column chart.15 The visual
hierarchy of Cleveland is provided at this website.16
11
http://sethgodin.typepad.com/seths_blog/2009/07/how-to-make-graphs-that-work.html
12
How to alter the defaults in Excel: http://blog.xlcubed.com/defaults-in-excel-charting/
13
http://www.juiceanalytics.com/writing/fixing-excel-charts/
14
Much of the debate is captured here http://peltiertech.com/WordPress/peltier-loves-pie/
15
http://seedmagazine.com/content/article/getting_past_the_pie_chart/
16
http://www.processtrends.com/TOC_data_visualization.htm
[Figure: pie chart example]17
Bar and Column Charts
Bar charts are often good for representing categorical data. You can present the
frequency of responses in each category, or the relative frequency.18 You can also
present the frequency or relative frequency of one variable over the groups or
categories of another variable, making it an excellent choice when you have two
categorical variables.
17
http://peltiertech.com/WordPress/pie-chart-for-pi-day/
18
Most of the charts in this article were produced in Microsoft Excel 2007, unless otherwise noted. They
were copied into Word 2007 using the Paste > Paste Special > Microsoft Excel Object function.
[Figure: three example charts titled Column chart, Bar Chart and Stacked Bar Chart]
Here is a recent bar chart I used to highlight the US debt to GDP ratio. Notice the use of
the single red bar to draw attention to the US relative to the rest of the OECD. Imagine
how ugly and confusing this would look if I used a different color for every
country. How would this look if I used the same color for every country? Obviously this
works in color, but would it work in grayscale?
[Figure: 2008 Debt to GDP Ratio for OECD. Bar chart with an axis from 0 to 180. Countries in
the order shown: Japan, Greece, Italy, Belgium, Portugal, Hungary, United Kingdom, Austria,
France, Netherlands, Poland, Iceland, United States, Turkey, Germany, Sweden, Spain, Denmark,
Finland, Korea, Canada, Ireland, Czech Republic, Slovak Republic, Mexico, Switzerland,
New Zealand, Norway, Luxembourg, Australia]
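The single highlighted bar can be sketched as a list of colors, one per bar. The following is a minimal illustration, assuming a hypothetical helper named `bar_colors` and a gray base color (the original chart's base color is not stated). Only a handful of the country labels from the chart are used.

```python
def bar_colors(labels, highlight, highlight_color="red", base_color="gray"):
    """Return one color per label: the highlight color for the chosen
    label and the base color for every other label."""
    if highlight not in labels:
        raise ValueError(f"{highlight!r} is not one of the labels")
    return [highlight_color if label == highlight else base_color
            for label in labels]


# A few of the country labels from the chart (no values are needed here).
countries = ["Japan", "Greece", "Italy", "United States", "Germany"]
colors = bar_colors(countries, "United States")
print(list(zip(countries, colors)))
```

A list like this can typically be handed to a charting tool's per-bar color setting. Swapping "red" and "gray" for a dark and a light shade is one way to test whether the emphasis survives in grayscale.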
Line Graph
The traditional line graph is generally used to show a single variable (usually
continuous) over time, with time represented on the horizontal axis, though it
could also be used to show the relative frequency of a single response category over
time.
[Figure: example line graph]
[Figure: U.S. Payroll Employment: Total Nonagricultural: SA, Thousands of Persons. Line graph
with a vertical axis from 107.0 to 142.0 and a horizontal axis from Jan-90 to Jan-08, labeled
every 24 months]
A few quick notes about the above graph. I’ve removed the horizontal gridlines as they
were an example of ink with no purpose, since we do not care about the precise level
of employment, but rather the changes over time. The background fill of the chart area
has been changed to white. I added shaded bars to denote the periods of recession. If I
were to improve this further, I would probably reduce the number of labels on the
horizontal axis, say maybe every 36 months, rather than 24. I’d probably also reduce the
number of labels on the vertical axis as it currently feels a bit cluttered. Finally I might
eliminate the title altogether and make a very small footnote that contained the same
information. Or maybe just title the chart Employment and relegate the details to the
footnote.
Area Chart
An area chart is a line chart with the area below the line shaded. This can be useful
when you have two lines over time and one represents a subset of the other. For
example, you could have retail sales over time broken into two categories, durable and
non-durable goods.
[Figure: two example area charts; the second covers 1995 to 2004 with a vertical axis from 0 to 200]
Scatter Plot
Scatter plots are useful when you have two continuous variables with one represented
by the X axis and the other on the Y axis. A third variable can be used to measure
another attribute of the points, yielding a bubble chart, which will be discussed later.
[Figure: example scatter plot, both axes from 0 to 100]
Tables
We should not always rush to make a chart. Sometimes just presenting the numbers in
tabular form is sufficient to get your point across, or you can blend both. Below are
two examples using the conditional formatting in Excel 2007, which blends the graphic
design of a chart with the data in tabular form.19
Leisure Time Spent
biking      125
hiking       40
reading      30
singing      25
dancing      10
cleaning      5
[Figure: the table above shown twice, side by side, each with a different Excel conditional format]
Whenever presenting data like this it is useful to rank order the data from largest to
smallest. Failure to do so makes it a bit harder for the reader to sift through the data as
you can see from the example below.
Leisure Time Spent
biking      125
hiking        5
reading      50
singing      75
dancing      10
cleaning     80
[Figure: the table above shown twice, side by side, each with a different Excel conditional format]
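Rank ordering is a one-line operation in most tools. As a minimal sketch in Python, using the unsorted leisure values from the example above (the variable names are chosen for this illustration):

```python
# Values from the unsorted example table above.
leisure = {
    "biking": 125,
    "hiking": 5,
    "reading": 50,
    "singing": 75,
    "dancing": 10,
    "cleaning": 80,
}

# Sort the (activity, value) pairs from largest to smallest value.
ranked = sorted(leisure.items(), key=lambda item: item[1], reverse=True)

for activity, value in ranked:
    print(f"{activity:<10}{value:>5}")
```

After sorting, the reader's eye can run straight down the column from the largest value to the smallest.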
A simple way to quickly deemphasize the numbers is to change the font of the numbers
to white.
Leisure Time Spent
biking
hiking
reading
singing
dancing
cleaning
[Figure: the data bars alone; the values 125, 40, 30, 25, 10 and 5 are still present but set in white]
19
In the Home ribbon, select Conditional Formatting > Data Bars.
The one very unfortunate issue with this technique is that Microsoft Excel violates an
important statistical and visualization principle with its bars. Zero values should be
represented by the absence of any color, bar or indicator. Yet no matter how small the
lowest quantity in the range of cells, its bar appears to be about 5%, even if the value is
zero, as can be seen in the example below.20
Leisure Time Spent
biking      125
hiking       40
reading      30
singing      25
dancing      10
cleaning      0
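The principle that a zero value should draw no bar at all can be illustrated with a plain-text data bar. This is a sketch only: the `data_bar` helper, the 25-character width and the `#` mark are assumptions made for this illustration, and it does not reproduce how Excel draws its bars.

```python
def data_bar(value, max_value, width=25, mark="#"):
    """Return a text bar whose length is proportional to value / max_value.

    A value of zero yields an empty string, so zero is shown as no bar.
    Note that a very small positive value can also round down to an
    empty bar at this character resolution.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value < 0:
        raise ValueError("value must not be negative")
    return mark * round(width * value / max_value)


# Values from the table above, including the zero for cleaning.
leisure = [("biking", 125), ("hiking", 40), ("reading", 30),
           ("singing", 25), ("dancing", 10), ("cleaning", 0)]
largest = max(value for _, value in leisure)
bars = {activity: data_bar(value, largest) for activity, value in leisure}

for activity, value in leisure:
    print(f"{activity:<10}{value:>5} {bars[activity]}")
```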
Spark Lines
Sparklines are small inline line graphs developed by Edward Tufte21.
GDP [5.8%]22 [Figure: inline sparkline]
Notice how simple the sparkline is. We have removed the clutter of the Y and X axis
labels, yet the important information is still there. You see the relative values: clearly
the series is not currently at its highest value, yet it is higher than it was previously.
Compare that to the more traditional graph below:
[Figure: traditional line chart titled GDP, vertical axis from 0 to 10, horizontal axis from 1995 to 2004]
20
Thanks to the excellent juice analytics for making this point.
http://www.juiceanalytics.com/writing/excel-2007-and-lie-factor/
21
Edward Tufte’s explanation of the theory and practice of sparklines
http://www.edwardtufte.com/bboard/q-and-a-fetch-msg?msg_id=0001OR&topic_id=1
22
The sparkline was created with the free open source add in for Microsoft Excel, called TinyGraphs. It
can be found here: http://www.spreadsheetml.com/products.html.
This representation clearly consumes more space, and invites the reader to linger on
the chart, rather than the point you are trying to make about the data. However, this
type of chart has its place. For example it might be a better representation if it is
important for the reader to see that the highest value occurred in 1996, or that the
lowest value was in 1995, or if you want them to easily see that GDP fluctuates between
2% and 6%.
It is important to note that sparklines can be more than just line charts. They can be bar
charts, pie charts, etc. The term sparklines merely refers to what Edward Tufte calls “Intense,
Simple, Word-Sized Graphics”. Sparklines are obviously not well suited for PowerPoint-style
presentation graphics, but are well suited for written reports, or the currently
in-vogue, data-dense business intelligence reports referred to as “Dashboards”.23
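A word-sized graphic can even be built from plain text characters. The sketch below is an illustration rather than the TinyGraphs method: the `sparkline` function is a hypothetical helper, and the series is made up for the example. It maps each value to one of eight Unicode block characters, scaled between the minimum and maximum of the series (not from zero).

```python
BLOCKS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Return a one-line text sparkline for a sequence of numbers.

    The smallest value maps to the lowest block and the largest to the
    highest block; a constant series is drawn with the lowest block.
    """
    values = list(values)
    if not values:
        return ""
    lo, hi = min(values), max(values)
    if hi == lo:
        return BLOCKS[0] * len(values)
    scale = (len(BLOCKS) - 1) / (hi - lo)
    return "".join(BLOCKS[round((v - lo) * scale)] for v in values)


# Hypothetical series, used only to show the output format.
series = [0, 1, 2, 3, 4, 5, 6, 7]
print("Example", sparkline(series), series[-1])
```

Because the result is an ordinary string, it can sit inside a sentence or a table cell, which is the "word-sized" idea described above.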
Bullet Graph
The bullet graph, due to Stephen Few, is another graph designed for dashboards.
[Figure: bullet graph example]24
There is also a Google gadget API for use in Google Docs that will produce this25.
Spine Plots / Mosaic Plots / Matrix Charts
These are best used for categorical data. Notice that we have added another dimension
to the data by making the width of the bar proportional to the fraction of cars in that
category (domestic versus foreign), thus taking the traditional bar chart and adding
another level of data.
Made with Stata’s ado file -spineplot-. Jon Peltier has a solution for Excel which he calls
a Matrix Chart 26. It is available in the statistical language R as well.27
23
For some examples see http://www.ozgrid.com/excel-add-ins/spark-maker-explained.htm
24
The picture comes from Stephen Few’s Perceptual Edge here
http://www.perceptualedge.com/blog/?p=375
25
http://dealerdiagnostics.com/blog/2008/09/the-ddr-bullet-graph-gadget/
26
http://pubs.logicalexpressions.com/Pub0009/LPMArticle.asp?ID=508
Heat maps
Heatmaps are two-dimensional maps where the color intensity represents the underlying
data. The right-hand table in the Tables section above can be thought of as a heatmap. The
darker orange colors represent larger values. When choosing the different colors to use,
designers rely on color theory. Colorbrewer is a useful website for choosing colors that let
viewers clearly distinguish differences in your data.28
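The idea of color intensity tracking the data can be sketched as a linear blend between a light and a dark color. This is a minimal illustration: the `value_to_hex` function and the two orange RGB endpoints are arbitrary choices made for the sketch, not a palette taken from Colorbrewer or from Excel.

```python
LIGHT = (255, 245, 235)  # pale orange, an arbitrary choice for small values
DARK = (217, 72, 1)      # dark orange, an arbitrary choice for large values


def value_to_hex(value, vmin, vmax, light=LIGHT, dark=DARK):
    """Map a value in [vmin, vmax] to a hex color between light and dark.

    Values outside the range are clamped to the nearest endpoint.
    """
    if vmax <= vmin:
        raise ValueError("vmax must be greater than vmin")
    t = (value - vmin) / (vmax - vmin)
    t = min(1.0, max(0.0, t))
    r, g, b = (round(lo + (hi - lo) * t) for lo, hi in zip(light, dark))
    return f"#{r:02x}{g:02x}{b:02x}"


# Leisure values from the Tables section, colored on a 0 to 125 scale.
for activity, value in [("biking", 125), ("hiking", 40), ("cleaning", 5)]:
    print(f"{activity:<10}{value:>5} {value_to_hex(value, 0, 125)}")
```

A straight blend in RGB is the simplest possible scheme. Carefully designed palettes are usually a better choice when viewers need to tell neighboring values apart.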
Choropleth Maps (Color Maps)
Choropleth maps are a specific type of heat map where the two dimensional object is a
geographical map. The map is then painted with color based upon the intensity of the
underlying variable. Often darker colors represent larger values of the underlying
variable. This is a great way to visually represent data that varies geographically. The
example below was produced with Stata and comes from some foreclosure data I have
by county. The data represents the number of foreclosure filings as a percentage of
housing units in each county for 2007, and the darker the shading of the county, the
higher the rate of foreclosure filings in that county. Juneau County sticks out as the
obvious county with the highest rate of foreclosure filings.
A similar graph for the state of Wisconsin is below. Note that the shading has changed
relative to the previous graph and is now based upon different intervals.
27
http://ideas.repec.org/a/tsj/stataj/v8y2008i1p105-121.html
28
The website can be found here:
http://www.personal.psu.edu/cab38/ColorBrewer/ColorBrewer_intro.html
While I used a statistics program (Stata) to generate these graphics, there are many
opportunities for producing your own choropleths on the web. Google Documents has
added its own visualization tools, which include the ability to create choropleths for
different countries.29 These maps, and the geographic presentation of this data,
intersect with the rapidly growing use of Geographic Information Systems (GIS)
in economic geography. Can you imagine the marketing uses for this type of
information?
There are of course problems with these types of maps as well. They can mislead a
viewer. The geographic area may be completely unrelated to the “area at risk”. For
example, if the map represents foreclosure rates – as these do – you might think Juneau
County represents a large economic problem for the region. However, the reality is
that the population of Juneau is quite small relative to La Crosse, and while the
foreclosure rate might be high, the total number of foreclosures is still quite small,
because there are fewer houses in that county relative to some of the other counties.
The fundamental problem is that the graphic invites you to infer economic importance
in proportion to geographic size, when this is not true. One solution is to distort the
geographic area based instead on the metric of interest.
Cartograms (Distorted Maps)
Another example of using colors and maps comes from the following distorted maps,
where the distortion is based upon some underlying variable, in this case alcohol
consumption. Here the color only serves to demarcate the different countries. Rather
than color intensity conveying the values of the underlying variable, we the creators have
distorted the size of each country in proportion to its alcohol consumption. There are
some people who feel cartograms hide more than they reveal.30
29
Details on producing these maps can be found here
http://documents.google.com/support/spreadsheets/bin/answer.py?answer=91599
And here http://googlesystem.blogspot.com/2008/02/data-visualization-google-gadgets.html
Alcohol Consumption (2001)31
Another example of a cartogram comes from the recent election.32 Below is a
reinterpretation of the simplistic red/blue map you might have seen on TV or in the
newspaper. Now the colors are shaded based upon the vote, rather than simply one
color for each party based upon the majority vote in that state. The states are also
“distorted” by the number of votes cast in that state.
Compare that to the traditional depiction:
30
http://flowingdata.com/2008/11/13/alternative-to-cartograms-using-transparency/
31
The distorted maps presented here come from the following article
http://www.dailymail.co.uk/news/article-439315/How-world-really-shapes-up.html. Producing the distorted
cartograms involves a substantial knowledge of programming and graph theory.
32
http://www-personal.umich.edu/~mejn/election/2008/
Treemaps
Treemaps are another type of heat map, well suited for hierarchical data. The classic
example on the internet is the smartmoney.com map of the market33. Here the
hierarchy from the bottom up is as follows: start with individual stocks, which are grouped
by company; each company is represented by its market capitalization (outstanding shares
of that company times share price). A higher market capitalization for the firm means a
larger area for its box. This would be the initial box. Then companies are further grouped
together into a larger box by industry. The small boxes are then colored based upon the
percentage gain or loss on the day, with green representing gains and red representing
losses. Visually it is very important to distinguish gains from losses by different colors.
That was the major shortcoming of a recent NY Times34 heatmap.
33
Smartmoney’s map of the market is updated with a 15 minute delay. The site is here:
http://www.smartmoney.com/map-of-the-market/
34
The graphic concerns the performance of the economy under different Presidents and it can be seen
here http://www.nytimes.com/interactive/2008/10/18/business/20081019-metrics-graphic.html
A recent bad day on Wall Street is captured by the following35.
It is possible to produce tree maps of your own, whether through Microsoft Research’s
Excel add-in36 or the use of IBM’s web software ManyEyes.37
35
These data come from http://www.uie.com/brainsparks/2008/09/30/seeing-red-smartmoneycoms-mapof-the-market/
36
Microsoft provides an AddIn for Treemaps. http://www.gilsmethod.com/node/81
There are several examples of data you may have which could be represented by a treemap.
Let’s say you are
working on a project which is looking at students’ choice of major. The hierarchy from
top down could be:
College > Major > number of students
So the number of students determines the size of the box for each major. Then the
majors are collected within the larger box of the college within which they are offered.
The boxes could be colored by many different things. For example, let’s say you were
trying to get a sense of how many students change their major and what they change it
to. You could then color the boxes by the percentage of the people in that major who
have always had that major, or by the percentage that changed to that major within the
last year.
Another example could be looking at the time students spend in different activities. Let’s
say you ask them the average number of hours per week they spend doing several
things, such as studying, going to class, reading, writing, etc. Again it would be possible
for you to break these down. You could make the first level of boxes equal in size to
the average percentage of time spent in the particular activity. The next level of boxes
would involve grouping the activities into broader areas, say academic versus
non-academic. Basically any data that can be grouped through some sort of hierarchy will
make a good treemap.
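The sizing logic behind a treemap can be sketched without drawing anything. The example below uses the College > Major > number of students hierarchy. The enrollment counts and the `box_shares` helper are hypothetical, invented only to show the arithmetic. It computes the share of the total area each box should receive. Placing the rectangles is left to a treemap tool and is not shown.

```python
def box_shares(hierarchy):
    """Compute each box's share of the total treemap area.

    hierarchy maps college -> {major: number of students}. A college's
    share is the sum of its majors' shares, so majors nest inside colleges.
    """
    total = sum(sum(majors.values()) for majors in hierarchy.values())
    if total <= 0:
        raise ValueError("the hierarchy must contain at least one student")
    shares = {}
    for college, majors in hierarchy.items():
        shares[college] = {
            "share": sum(majors.values()) / total,
            "majors": {major: count / total for major, count in majors.items()},
        }
    return shares


# Hypothetical enrollment counts, for illustration only.
enrollment = {
    "Business": {"Economics": 120, "Finance": 180},
    "Science": {"Biology": 150, "Physics": 50},
}
shares = box_shares(enrollment)

for college, info in shares.items():
    print(f"{college}: {info['share']:.0%}")
    for major, share in info["majors"].items():
        print(f"  {major}: {share:.0%}")
```

The color of each box would come from a second variable, such as the percentage of students who changed into that major, handled separately from the sizes.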
Some examples of brilliant dynamic web treemaps are provided by the New York Times
article on changes in inflation38. The New York Times also uses treemaps in a recent
graphic depicting the year of heavy losses on Wall Street39.
Bump Charts
Bump charts are a good way of showing changes in rank order and relative position over
time. Below, The New York Times talks about the challenges which face the US and
other countries on infant mortality.40 Where would you rather have an infant born, the
US or Singapore? According to the chart, Singapore. However, remember that this is
measuring the number of deaths of infants (one year of age or younger) per 1000 live
births. The US is more likely than other countries to have successful preterm births, but
this group is very much at risk for early death.
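The data preparation behind a bump chart is a ranking within each time period. The sketch below assumes a hypothetical `rank_by_period` helper and uses made-up rates for three placeholder countries in two placeholder periods. None of the numbers come from the New York Times graphic.

```python
def rank_by_period(values_by_period, lowest_is_best=True):
    """Rank the entries within each period, starting at rank 1.

    With lowest_is_best=True the smallest value gets rank 1, which suits
    a measure such as a mortality rate. Ties get consecutive ranks in
    their original order.
    """
    ranks = {}
    for period, values in values_by_period.items():
        ordered = sorted(values, key=values.get, reverse=not lowest_is_best)
        ranks[period] = {name: position
                         for position, name in enumerate(ordered, start=1)}
    return ranks


# Hypothetical rates for placeholder countries and periods.
rates = {
    "earlier": {"Country A": 26.0, "Country B": 35.0, "Country C": 22.0},
    "later": {"Country A": 6.8, "Country B": 2.0, "Country C": 3.1},
}
ranks = rank_by_period(rates)

for period, period_ranks in ranks.items():
    print(period, period_ranks)
```

A bump chart then plots one line per country through its rank in each period, so crossings show where one country overtakes another.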
37
The service is available here
http://services.alphaworks.ibm.com/manyeyes/page/Treemap_for_Comparisons.html
38
A look at recent inflation
http://www.nytimes.com/interactive/2008/05/03/business/20080403_SPENDING_GRAPHIC.html?scp=1&sq=inflation%20chart&st=cse
39
http://www.nytimes.com/interactive/2008/09/15/business/20080916-treemap-graphic.html
40
http://www.nytimes.com/2009/04/07/health/07stat.html?ref=science
Word Clouds
Word clouds are good for representing responses to open ended questions41.
This is from the following question:
Looking ahead, which would you say is more likely - that in the country as a whole we'll
have continuous good times during the next five years or so, or that we will have
periods of widespread unemployment or depression?
A. Good times
B. Widespread unemployment or depression
C. Other, please specify
41
An easy to use web site, http://wordle.net/, allows you to produce your own word clouds.
Another one that provides more control is here http://www.tagxedo.com/
The word cloud is composed of the responses to the “C. Other, please specify” answer; I
have removed the first two.
There are problems with this type of presentation. First, since the responses to the
”other” answer were actually short phrases, we don’t really capture the full phrase, but
rather the frequency of the individual words. As a demonstration of this problem, let’s say 10
people said “good times” and 10 said “bad times”. Since the word “times” appears in
both, it will be the most frequent response (appearing 20 times) and therefore the
largest. But that doesn’t tell us much about the sentiment being conveyed by the
respondents.
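The "good times" and "bad times" demonstration can be reproduced in a few lines of Python. This is a sketch of word counting in general, using the standard library's `Counter`. It is not a description of how Wordle itself counts words.

```python
from collections import Counter

# Ten respondents said "good times" and ten said "bad times".
responses = ["good times"] * 10 + ["bad times"] * 10

# Count individual words, as a word cloud built on single words would.
word_counts = Counter(word for response in responses
                      for word in response.split())

for word, count in word_counts.most_common():
    print(f"{word:<6}{count:>3}")
```

The word "times" comes out on top with 20 occurrences, even though it says nothing about whether respondents were optimistic or pessimistic.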
This is solved below by tying all the words of a single response together with the tilde
(~). Joining the words with a ~ like this (joined~words) allows Wordle to produce a
phrase cloud, which is a great way of visualizing responses to questions with 5 or so
categories, where a phrase represents each category. This is very easy to do in Excel:
just highlight the column and do a find and replace where you put a blank space in the find
box and a ~ (tilde) in the replace box. Then copy and paste the text into Wordle. Done.
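The same find and replace can be done in code before pasting the text into a word cloud tool. This is a minimal sketch: `join_phrase` is a hypothetical helper and the response counts are made up for the example. Unlike a literal find and replace, it also collapses repeated spaces so that stray double spaces do not produce doubled tildes.

```python
from collections import Counter


def join_phrase(response):
    """Join the words of one response with tildes, e.g. 'a b' -> 'a~b'."""
    return "~".join(response.split())


# Hypothetical responses using answer categories from the survey question.
responses = (["Much too high"] * 3
             + ["About right"] * 2
             + ["Somewhat too high"])

joined = [join_phrase(response) for response in responses]
phrase_counts = Counter(joined)

# One joined phrase per line, ready to copy and paste.
print("\n".join(joined))
```

Each joined phrase is now counted as a single unit, so the cloud sizes whole answers rather than individual words.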
The other problem with this presentation is that it doesn’t visually direct and steer the
eye toward the point being made. Your eye wanders all over the place.
Using the question:
When you think about the property taxes you or your landlord pay on the home in
which you live and the services you receive for those taxes would you say property taxes
in Wisconsin (or your state of residence) are much too high, somewhat too high, about
right, somewhat too low or much too low?
Answers that are joined are
a. Much too high
b. Somewhat too high
c. About right
d. Somewhat too low
e. Much too low
f. Other
One could easily list the words by frequency from greatest to least, but word clouds are
popular because they are more than just data; they are art. They invite the observer in,
even if they get a little lost in the presentation. Sometimes efficiently conveying
information is sacrificed for the visual aesthetic of good design. Below is an example
where the art matters more than some of the underlying data42
42
This graphic comes from the website http://www.pitchinteractive.com/election2008/. More artistic
visualizations can be found here: http://www.visualcomplexity.com/vc/ and Slate has an excellent collection
of artistic visualizations here http://www.slate.com/id/2197749/
The edge of the “doughnut” lists the names of donors to the 2008 presidential
campaigns. Clearly in this level of presentation you cannot read the names. However it
still gets some ideas across, like the disproportionate amount of funds raised by Obama,
relative to McCain.
Bubble Charts
Bubble charts allow you to present three variables in two dimensions. They are basically
traditional XY scatter plots, where the size of the bubble is proportional to a third
variable. In the case below the scatter plot represents the unemployment rate and
foreclosure rate for each of the Wisconsin counties in the 7 Rivers Region, and the size
of the bubble is proportional to the population of the county. It is a static presentation
for one year, 2007.
[Figure: bubble chart, 7 Rivers Region 2007. Vertical axis: Unemployment Rate (0 to 8);
horizontal axis: Foreclosure Rate (0 to 0.01); one bubble each for Jackson, Juneau, La Crosse,
Monroe, Trempealeau and Vernon counties]
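When sizing bubbles, it matters whether "size" means radius or area. The sketch below assumes the common convention that the bubble's area, rather than its radius, should be proportional to the third variable, so the radius grows with the square root of the value. The `bubble_radius` helper and the 20-unit maximum radius are assumptions made for this illustration, not a description of how Excel sizes its bubbles.

```python
import math


def bubble_radius(value, max_value, max_radius=20.0):
    """Return a radius such that bubble AREA is proportional to value.

    The largest value gets max_radius; a value one quarter as large gets
    half that radius, because area scales with the radius squared.
    """
    if max_value <= 0:
        raise ValueError("max_value must be positive")
    if value < 0:
        raise ValueError("value must not be negative")
    return max_radius * math.sqrt(value / max_value)


# Relative sizes only; these are not actual county populations.
for value in (100, 50, 25, 0):
    print(value, round(bubble_radius(value, 100), 2))
```

Scaling the radius directly would make a county with four times the population look sixteen times as large, which overstates the difference.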
Another example, which highlights the problem with too many colors competing for
attention, can be found below. In example A the mind gets lost, whereas example B does
a good job of highlighting, with context, the data of the orange circle.43
43
http://charts.jorgecamoes.com/is-data-visualization-useful/
Dynamic bubble charts allow you to plot the above, for different years, and then you can
watch the data change over the years. I’ve produced some examples of the foreclosure
data to give you another idea for presenting the data44.
One of the best examples of dynamic bubble charts can be found at Gapminder.45 How
would you insert them into presentations? In the past I have posted them to a webpage
and rendered them separately, or within PowerPoint. Obviously this type of
presentation is not possible (currently) in a written report. I imagine that the technology
is not far behind, as you could imagine Amazon’s Kindle bridging the gap.
These are beautiful graphics from the New York Times46. They might be difficult for
you to re-create, but they should get you thinking about how data can be presented in a
way that is graphically pleasing and at the same time informative.
Presenting data in a written format requires different techniques than presenting the
same data orally. In a written piece the reader has more time to dig into the
data, so the graph or chart can be more complex, as the NY Times pieces are.
In the case of a PowerPoint, keep it simple and active. When science meets art, as in the
case of graphs and design, it is important to realize there will be differences. There is
less likely to be an objective standard. Some arguments will be over design, and some
over the content. Always ask yourself who your audience is, what the point of the graph
is, and whether your design is in fact conveying what you want it to47. The following represents
some important differences in preferences, but also important differences in terms of
information presented. Some other tips can be found at the links48.
Dynamic/Interactive Graphs
These graphs can be dynamic in the sense that they are constantly updated and changing
either due to the influx of new data or interactive manipulations by the viewer.
44
http://www.uwlax.edu/faculty/brooks/prof/charts/foreclosure.htm and
http://www.uwlax.edu/faculty/brooks/prof/charts/foreclosure-state.htm
45
http://googlegadgetsapi.blogspot.com/2008/06/spreadsheet-gadgets-free-dynamic-data.html
http://code.google.com/apis/visualization/documentation/gadgetgallery.html
46
Movies. http://www.nytimes.com/interactive/2008/02/23/movies/20080223_REVENUE_GRAPHIC.html
NY Times on spending http://www.nytimes.com/interactive/2008/09/04/business/20080907-metricsgraphic.html Drug admts
http://www.nytimes.com/2008/06/14/opinion/14blow.html?_r=3&oref=slogin&oref=slogin&oref=slogin
47
http://sethgodin.typepad.com/seths_blog/2008/07/the-three-laws.html
http://sethgodin.typepad.com/seths_blog/2008/07/bar-graphs-vs-p.html
http://peltiertech.com/WordPress/2008/07/12/bar-graphs-vs-pie-charts/
http://www.perceptualedge.com/blog/?p=247
http://blog.xlcubed.com/chart-rules-as-simple-as-possible-but-not-any-simpler/
48
http://www.macworld.com/article/134708/2008/07/chartsandgraphs.html?t=103
http://www.giantflightlessbirds.com/workshops/better_graphs.pdf
some excel tips http://charts.jorgecamoes.com/category/how-to-and-tips/
http://services.alphaworks.ibm.com/manyeyes/app and another link
http://www.decisionsciencenews.com/?p=475
Data Visualization in Seminars/Talks/Presentations.
When the audience is in front of you rather than at home in front of their computer,
you are responsible for grabbing their attention and keeping them awake.
Here is an example of the principle of simplicity in the presentation of data in a
lecture/talk/seminar. The chart below contains three values: The percentage of water in
the body, the brain and the blood. Put yourself in the shoes of the audience if you saw
this chart. Interesting? Mind numbing?
[Figure: bar chart of Percent Water (vertical axis, 0 to 90) for the body, the brain and the blood.]
Now what if I presented these same three pieces of data in three different PowerPoint
slides?
We could present the boring bar chart. It’s simple and easy to understand, but not visually
stimulating. It is more “data dense” than the three slides, yet I think you will agree the
three slides would have a bigger impact in a presentation. They engage the audience
visually in a way the bar chart does not, giving the data a bigger impact. The slides came
from the award-winning presentation entitled Thirst49.
Another must-see slide presentation, entitled Death by PowerPoint50, is available at
slideshare.com. Garr Reynolds also provides a good section of his book Presentation
Zen through his blog, where he details the four principles of design: Contrast, Repetition,
Alignment, and Proximity51.
Contrast
49
Thirst won the 2008 award for the World’s Best Presentation from Slideshare.com
http://www.slideshare.net/jbrenman/thirst
50
Slideshare has several good presentations on how to present. Death by PowerPoint
http://www.slideshare.net/thecroaker/death-by-powerpoint and Presenting With Text
http://www.slideshare.net/girba/presenting-with-text
51
Part of Chapter 6 can be downloaded here http://www.presentationzen.com/chapter6_pages.pdf
Repetition
Alignment and Proximity
When thinking about PowerPoint design, think about other technology. What do we
love about Apple? Simple design. What do we love about Facebook? The design and
interface are much cleaner than most MySpace pages, though sadly that is changing52.
Google redefined simple and clean, and I am convinced that it helped fuel their early
success. Did I mention I think simplicity is important? Avoid all of the visual crap that
Microsoft seems to think is important.
Good presentations are about more than just good slide design. They are also about
being a good speaker and telling a good story. How do you learn this? Watch a few
great presentations. Pay attention to how they interact with the audience, how they’ve
52
See this article http://www.readwriteweb.com/archives/is_facebook_becoming_myspace.php
organized their thoughts. A great presentation by Hans Rosling can be found in the link
below53. In fact most of the TED talks are useful examples of good succinct
presentations5455.
Some general principles of slide design by Garr Reynolds at Presentation Zen can be
found at the link56. He makes the important point that slides should have a high signal to
noise ratio57.
Nancy Duarte of Duarte Design, responsible for designing some of the best TED talks
and Al Gore’s An Inconvenient Truth, provides a wonderful webinar on using
PowerPoint58. Nancy also has an excellent book entitled Slide:ology.59
Some insights on the presentations of Steve Jobs can be found at the link60.
And please, no bullet points61.
53
http://www.youtube.com/watch?v=hVimVzgtD6w
54
http://www.ted.com/
55
Additional notes on good presentation organization can be found here:
http://www.extremepresentation.com/
56
http://www.presentationzen.com/presentationzen/2008/08/learning-from-the-design-around-youikea.html
57
http://www.presentationzen.com/presentationzen/2007/03/a_few_weeks_ago.html
58
http://www.vizthink.com/blog/2008/06/18/webinar-creating-powerful-presentations-with-nancy-duarte/
59
http://www.amazon.com/slide-ology-Science-CreatingPresentations/dp/0596522347/ref=pd_bbs_sr_1?ie=UTF8&s=books&qid=1238982954&sr=8-1
60
http://images.businessweek.com/ss/09/09/0929_jobs_presentations/1.htm
61
http://aralbalkan.com/1286
Finally, lest you think there is no fun in data visualization, here are some funny graphs62.
Some Dos and Don’ts
I hate to give you a list of things to do and things not to do because, as with any rules,
there are times when they should be broken. However, by giving you some rules, I hope
you will break them only when you have good reason to.
Don’t
Use 3-D graphics in Excel
Use Microsoft clip art
Use a PowerPoint design template
Read your presentation
Use bullet points
Do
Use Pictures
Use repetition in your design
Practice/rehearse presentation
Keep each slide to one idea
References and Endnotes
Some useful links to data visualization blogs and leading thinkers in the infoviz world:
http://junkcharts.typepad.com/
http://www.visualcomplexity.com/vc/
http://www.edwardtufte.com/tufte/
http://www.perceptualedge.com/
http://infoclarity.blogspot.com/
http://eagereyes.org/
http://charts.jorgecamoes.com/
http://visualizeit.wordpress.com/
http://www.visualizingeconomics.com
http://www.juiceanalytics.com/writing/
Presentation Related Blogs
http://blog.duarte.com/
http://www.presentationzen.com/presentationzen/
62
http://graphjam.com/
Duarte, N. (2008). Slide:ology: The Art and Science of Creating Great Presentations: O'Reilly.
Few, S. (2004). Show Me the Numbers: Designing Tables and Graphs to Enlighten (1st ed.). Oakland, CA:
Analytics Press.
Few, S. (2006). Information Dashboard Design: The Effective Visual Communication of Data (1st ed.). Beijing ;
Cambridge [MA]: O'Reilly.
Reynolds, G. (2008). Presentation Zen: Simple Ideas on Presentation Design and Delivery. Berkeley, CA: New
Riders.
Tufte, E. R. (2001). The Visual Display of Quantitative Information (2nd ed.). Cheshire, Conn.: Graphics Press.
Tufte, E. R. (2003). The Cognitive Style of PowerPoint. Cheshire, Conn.: Graphics Press.
Tufte, E. R. (2003). Envisioning Information (9th printing, Aug. 2003. ed.). Cheshire, Conn.: Graphics Press.
Tufte, E. R. (2006). Beautiful Evidence. Cheshire, Conn.: Graphics Press.
Tufte, E. R. (2007). Visual Explanations: Images and Quantities, Evidence and Narrative (8th printing, with
revisions, June. 2007. ed.). Cheshire, Conn.: Graphics Press.
Appendix: TIPS for Excel 2007
How to change the axis of a chart to the logarithmic scale.
From http://office.microsoft.com/en-us/excel/HP030656791033.aspx
Make changes to the scales of value axes
1. On a chart sheet or in an embedded chart, click the value (y) axis that you want to
change.
2. On the Format menu, click Selected Axis.
3. On the Scale tab, do one of the following:
• To change the number at which the value (y) axis starts and ends, type a
different number in the Minimum box or the Maximum box.
• To change the interval of tick marks and gridlines, type a different number in
the Major unit box or Minor unit box.
• To change the units displayed on the value (y) axis, click the units that you
want or type a numeric value in the Display units list.
To show a label that describes the units expressed, select the Show display
units label on chart check box.
Tip: If your chart values consist of large numbers, you can make the axis text
shorter and more readable by changing the display unit of the axis. For
example, if the chart values range from 1,000,000 to 50,000,000, you can
display the numbers as 1 to 50 on the axis and show a label that indicates that
the units express millions.
• To change the value (y) axis to logarithmic, select the Logarithmic
scale check box.
• To reverse values so that you can flip bars or columns or other data markers,
select the Values in reverse order check box.
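To see what the display-unit and logarithmic options do to the numbers, here is a minimal Python sketch. It is not how Excel implements these options, only the arithmetic behind them. The function names to_display_units and log_axis_position are hypothetical and chosen for illustration.

```python
import math


def to_display_units(values, unit=1_000_000):
    """Divide each value by the display unit (default: millions)."""
    return [v / unit for v in values]


def log_axis_position(value, axis_min, axis_max):
    """Return where value sits on a base-10 log axis, as a fraction from 0 to 1."""
    if min(value, axis_min, axis_max) <= 0:
        raise ValueError("a logarithmic axis needs positive numbers")
    if axis_min >= axis_max:
        raise ValueError("axis_min must be smaller than axis_max")
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span


# The range from the tip above, shown in millions.
print(to_display_units([1_000_000, 50_000_000]))  # [1.0, 50.0]

# On a log axis from 1 to 100, the value 10 sits halfway up the axis.
print(log_axis_position(10, 1, 100))
```

On a linear axis from 1 to 100 the value 10 would sit about a tenth of the way up. The logarithmic scale spreads out the small values, which is why it helps when the data span several orders of magnitude.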
How to use the Histogram add-in in Excel
http://support.microsoft.com/kb/214269
SUMMARY
This step-by-step article describes how to create a histogram with a chart from a sample set of data. The Analysis ToolPak that is included with Microsoft Excel includes a Histogram tool.
Verify Installation of the Analysis ToolPak
Before you use the Histogram tool, you need to make sure the Analysis ToolPak Add-in is installed. To verify whether the Analysis ToolPak is installed, follow these steps:
1. In Microsoft Office Excel 2003 and in earlier versions of Excel, click Add-Ins on the Tools menu.
In Microsoft Office Excel 2007, follow these steps:
a. Click the Microsoft Office Button, and then click Excel Options.
b. Click the Add-Ins category.
c. In the Manage list, select Excel Add-ins, and then click Go.
2. In the Add-Ins dialog box, make sure that the Analysis ToolPak check box under Add-Ins available is selected. Click OK.
NOTE: In order for the Analysis ToolPak to be shown in the Add-Ins dialog box, it must be installed on your computer. If you do not see Analysis ToolPak in the Add-Ins dialog box, run Microsoft Excel
Setup and add this component to the list of installed items.
Create a Histogram
1. Type the following in a new worksheet:
A1: 87     B1: 20
A2: 27     B2: 40
A3: 45     B3: 60
A4: 62     B4: 80
A5: 3      B5:
A6: 52     B6:
A7: 20     B7:
A8: 43     B8:
A9: 74     B9:
A10: 61    B10:
2. In Excel 2003 and in earlier versions of Excel, click Data Analysis on the Tools menu.
In Excel 2007, click Data Analysis in the Analysis group on the Data tab.
3. In the Data Analysis dialog box, click Histogram, and then click OK.
4. In the Input Range box, type A1:A10.
5. In the Bin Range box, type B1:B4.
6. Under Output Options, click New Workbook, select the Chart Output check box, and then click OK.
A new workbook with a Histogram table and an embedded chart is generated.
Based on the sample data from step 1, the Histogram table will look like the following table:
A1: Bin     B1: Frequency
A2: 20      B2: 2
A3: 40      B3: 1
A4: 60      B4: 3
A5: 80      B5: 3
A6: More    B6: 1
And, your chart will be a column chart that reflects the data in this Histogram table.
Excel counts the number of data points in each data bin. A data point is included in a particular data bin if the number is greater than the bin's lower bound and less than or equal to its upper bound.
In the example here, the bin that corresponds to data values from 0 to 20 contains two data points, 3 and 20.
If you omit the bin range, Excel creates a set of evenly distributed bins between the data's minimum and maximum values.
NOTE: You will not be able to create the Histogram chart if you specify the options (Output range or New worksheet ply) that create the Histogram table in the same workbook as your data.
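The binning rule described above can be checked outside Excel. The following is a minimal Python sketch of that rule, not the Analysis ToolPak's own code. The function name histogram_counts is hypothetical, and the data and bins are the sample values from step 1.

```python
from bisect import bisect_left


def histogram_counts(data, bins):
    """Count data points per bin using the rule described above.

    A value x falls in the first bin whose upper bound is >= x, so each
    bin covers (previous bound, this bound]. Values above the last bound
    are counted in a final "More" bucket.
    """
    bounds = sorted(bins)
    counts = [0] * (len(bounds) + 1)  # the last slot is "More"
    for x in data:
        counts[bisect_left(bounds, x)] += 1
    labels = [str(b) for b in bounds] + ["More"]
    return list(zip(labels, counts))


data = [87, 27, 45, 62, 3, 52, 20, 43, 74, 61]  # cells A1:A10
bins = [20, 40, 60, 80]                         # cells B1:B4
table = histogram_counts(data, bins)
for label, count in table:
    print(label, count)
```

This prints the frequencies 2, 1, 3, 3 and 1, matching the Histogram table above. Note that the value 20 lands in the first bin because the upper bound is inclusive.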