Data Visualization in R
Section 0: Introduction
Section 1: Variables and Assignment
Section 2: Installing New Packages
Section 3: Importing Data
Section 4: Basic Visualization
Section 5: Why is Data Visualization Important?
Section 6: Principles of Data Visualization
Section 7: Data Visualization Choices
Section 8: Mapping
Section 9: Shiny
Section 10: Data Visualization in Practice
Additional Resources and Tools for Next Steps
Acknowledgments and Questions
Data Visualization in R
Quantitative Methods in Global Health: Mini-Conference
McGoldrick Professional Development Program in Public Health
Amy Zhou
08-16-2022
Section 0: Introduction
During the course of your work, you come up with interesting research questions, discuss ways of gathering data with your collaborators to answer your question, and
want to use numbers and charts to communicate your results and simplify complex ideas.
Now you might wonder: how do I actually do this myself? When I gather information, how do I work with it to make sense of things, and how can I produce visuals to
communicate my results with others?
You can do this with programming and data visualization. The programming language we will be using is called R.
By the end of this workshop, you’ll be able to:
Import a COVID-19 dataset
Understand the principles of data visualization
And have the resources and tools to do your own data visualization later!
What is R? What can it do?
R is a programming language designed for statistical computing. Like any other language, it has a particular set of rules, words, and symbols that we can use to
write instructions that our computer will understand. The R language has several notable strengths, including:
Vast capabilities, with a wide range of statistical and graphical techniques
Free of charge and open source
Excellent community support: mailing lists, blogs, tutorials
Easy to extend by writing new functions
You can download R from this site, and install it the same way you install any other program:
https://cloud.r-project.org
Choose a mirror (webserver) that is close to you geographically.
How do I use R and start coding? What is RStudio?
RStudio is the program we use to write code in R. You can download RStudio from this site, and install it after you’ve downloaded and installed R:
https://www.rstudio.com/products/rstudio/download/
(Select the free version, “RStudio Desktop,” and follow the installation instructions on that page.)
If R is the language, RStudio is like the messaging app we use to send instructions written in that language to the computer, and to receive back the results.
RStudio shows us our code, outputs, and plots. It also makes it easy to get help information about how an R function works. Let’s start by looking at the four parts of
the RStudio interface.
Top Left: This is the editor. It is where you write and save your code. If you highlight a portion of code and press Control + Enter (or, if you’re on a Mac,
Command + Enter), then the highlighted code will run in the…
Bottom Left: This is the console. It shows you the code you’ve just run and what the result is.
Top Right: This is the environment. If you create or store any objects in your code, they appear here. The History tab contains all of the commands you sent to the
console.
Bottom Right: This is the plot window. If you create plots, they appear here. It is also where help information appears. The Files tab shows the directory where R thinks
your workspace is, and the Help tab shows you the help files and documentation.
Typical RStudio Window
How do I write and run code?
Open RStudio and create a new ‘script’ by going to File > New File > R Script. This will open what looks like a blank document in the top left portion of your
RStudio screen. R scripts are just text files that store your code. When you save them, they’ll get a .R extension, which lets your computer know to open them as R
scripts.
We call it a ‘script’ because, like a script in a play, this file will become a set of ‘instructions’ you give to the computer to tell it what to do. It is best to write and run
code from a script in the editor, so that you keep track of what you’ve done.
Type the following header into the new R file, and then save the file somewhere you’ll remember.
# Data Visualization Workshop
# [Your Name]
# [The Date]
Notice that typing the hash symbol (#) changes the color of the text. This turns the line into a comment, so it will not run like other code. We use comments to organize
and write helpful notes to ourselves in our code.
Below your header, type 2+2, highlight it, and press Ctrl + Enter. This runs the code, and you should see the result in the console. You can do more calculations
such as the ones below:
2+3*4
2+(3*4)
(2+3)*4
(15/3)*5-2
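As a quick check on the order of operations, here are the same four expressions with the value each one should produce noted in a comment. This is only a sketch to show that multiplication and division are applied before addition and subtraction unless parentheses say otherwise.

```r
2 + 3 * 4          # multiplication first: 2 + 12 = 14
2 + (3 * 4)        # same result, with the order made explicit: 14
(2 + 3) * 4        # parentheses first: 5 * 4 = 20
(15 / 3) * 5 - 2   # 5 * 5 - 2 = 23
```

Adding parentheses, as in the second line, costs nothing and makes the intended order obvious to anyone reading the script later.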
Section 1: Variables and Assignment
As we’ve just seen, R has the capabilities of a calculator: +, -, *, and / are the symbols used for addition, subtraction, multiplication, and division, respectively.
What if you want to store the results of a calculation, and use them in another calculation? For that, we create what are called variables using either <- or =.
#Assign a number
x <- 5
y <- 3
You should immediately notice that in the top right panel, the variable x has been stored in your environment and has the value 5, and y has been stored with the
value 3. Use this panel to keep track of the variables you’ve created!
Now, you can use the stored values to perform additional calculations. Let’s try a few. In this handout, each line of code we want to run is followed by the output
resulting from that code, which shows up on the following line starting with [1]. So, if you’re following along with the examples, you’ll only
be typing the code into the computer; you do not need to type in any line that starts with [1], because that’s what the computer is going to give to
you!
x+y
[1] 8
x-y
[1] 2
x*y
[1] 15
x/y
[1] 1.666667
x
[1] 5
y
[1] 3
Notice that none of these calculations changed the value of x or y, because we did not reassign a value to either of them. x is still 5, and y is still 3.
However, if we do reassign a different value to x or y , the old value is forgotten and replaced by the new value. It’s important to keep track of changes you make to
variables!
x <- x + 4
x
[1] 9
x + y
[1] 12
y <- 8
x + y
[1] 17
In addition to numbers, R can also work with and store letters and other characters. We call these strings, and they can be a wide range of values, such as words,
names, sentences, paragraphs, passwords, etc. We do this by putting the string we want to store inside quotation marks like this:
z <- "z is a string variable."
z
[1] "z is a string variable."
To be clear: the variable is z without quotation marks, and the value that z stores is the string "z is a string variable." which has quotation marks.
Lastly, variables don’t just have to be called x or y; they can be named with any unbroken sequence of letters, numbers, and underscores that starts with a letter, like height, var3, or final_value. It’s best
to call them something meaningful so you remember what they represent.
my_location <- "Boston, MA"
my_location
[1] "Boston, MA"
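To make the difference between a variable name and a string value concrete, here is a small sketch using base R functions (class and nchar are standard R functions that the workshop text itself does not cover).

```r
my_location <- "Boston, MA"

class(my_location)   # "character": the stored value is a string
nchar(my_location)   # number of characters in the string: 10
class(5)             # "numeric": numbers are written without quotes

# Quotation marks matter: "my_location" is a new string,
# while my_location (no quotes) looks up the stored value.
"my_location" == my_location   # FALSE
```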
Section 2: Installing New Packages
One of the benefits of R is that users from around the world write and publish new functions for us to use. We use them by downloading and loading packages of new
functions.
Using a new package takes two steps:
1. Install the package using install.packages("[name of package]"). You only need to do this the very first time you ever use a new package.
2. Load the package using library("[name of package]"). You need to do this every time you close and reopen RStudio. You can load a package only after
you have installed it.
If you want to follow along with the workshop, you should install the following packages with this line of code:
install.packages(c('dplyr', 'tidyverse', 'ggplot2', 'dslabs', 'gapminder', 'ggthemes', 'ggrepel', 'gridExtra', 'RColorBrewer',
'rnaturalearth', 'rnaturalearthdata', 'sf'))
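If you are not sure which of these packages you already have, the following sketch checks each one before installing anything. The variable names (workshop_pkgs, is_installed, missing_pkgs) are our own, and the install line is left commented out so that nothing is downloaded until you decide to run it.

```r
# Packages used in this workshop (the same list as in the install line).
workshop_pkgs <- c("dplyr", "tidyverse", "ggplot2", "dslabs", "gapminder",
                   "ggthemes", "ggrepel", "gridExtra", "RColorBrewer",
                   "rnaturalearth", "rnaturalearthdata", "sf")

# requireNamespace() returns TRUE if a package can be loaded, without attaching it.
is_installed <- vapply(workshop_pkgs, requireNamespace,
                       logical(1), quietly = TRUE)
missing_pkgs <- workshop_pkgs[!is_installed]
print(missing_pkgs)

# Uncomment to install only what is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```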
For example, let’s load a new package to help with graphics, called ggplot2 .
#install.packages("ggplot2") #We've commented out this line because we already ran it before.
library("ggplot2")
Next, use ?ggplot to open the help file for the ggplot function, and click the link that says “Index” at the bottom to look up all of the functions contained in the ggplot2
package.
For example, ggplot2 has a function for a quick plot:
x <- c(1,2,3,4)
y <- c(2,4,6,8)
qplot(x, y)
qplot(1:10, letters[1:10])
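In ggplot2 version 3.4.0 and later, qplot() is flagged as deprecated, so you may see a warning when running the lines above. A minimal sketch of the same scatter plot written with ggplot() follows; the data frame name points_df is our own choice.

```r
library(ggplot2)

# ggplot() works with data frames, so put the two vectors into one.
points_df <- data.frame(x = c(1, 2, 3, 4), y = c(2, 4, 6, 8))

# Equivalent of qplot(x, y): a scatter plot built layer by layer.
p <- ggplot(points_df, aes(x = x, y = y)) +
  geom_point()
p
```

This is the style used in the rest of the workshop: a call to ggplot() that names the data and the aesthetic mapping, followed by one or more geom layers.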
Some packages include datasets. For example, the dslabs package contains datasets that we will be using for data visualization.
#install.packages("dslabs") #We've commented out this line because we already ran it before.
#Note: you only need to run the 'install.packages'
#command the very first time you want to use a new package
#then you can start with the 'library' command every time after
library("dslabs")
data(gapminder) # load specified dataset
head(gapminder) #'head' is a function that shows the first few rows of data.
              country year infant_mortality life_expectancy fertility
1             Albania 1960           115.40           62.87      6.19
2             Algeria 1960           148.20           47.50      7.65
3              Angola 1960           208.00           35.98      7.32
4 Antigua and Barbuda 1960               NA           62.97      4.43
5           Argentina 1960            59.87           65.39      3.11
6             Armenia 1960               NA           66.86      4.55
  population          gdp continent          region
1    1636054           NA    Europe Southern Europe
2   11124892  13828152297    Africa Northern Africa
3    5270844           NA    Africa   Middle Africa
4      54681           NA  Americas       Caribbean
5   20619075 108322326649  Americas   South America
6    1867396           NA      Asia    Western Asia
tail(gapminder) #'tail' is a function that shows the last few rows of data.
                 country year infant_mortality life_expectancy fertility
10540          Venezuela 2016               NA           74.80        NA
10541 West Bank and Gaza 2016               NA           74.70        NA
10542            Vietnam 2016               NA           75.60        NA
10543              Yemen 2016               NA           64.92        NA
10544             Zambia 2016               NA           57.10        NA
10545           Zimbabwe 2016               NA           61.69        NA
      population gdp continent             region
10540         NA  NA  Americas      South America
10541         NA  NA      Asia       Western Asia
10542         NA  NA      Asia South-Eastern Asia
10543         NA  NA      Asia       Western Asia
10544         NA  NA    Africa     Eastern Africa
10545         NA  NA    Africa     Eastern Africa
#?gapminder
We can load an existing dataset gapminder from the dslabs package and look at the first few rows of data using head() or the last few rows of data using
tail() . We can use ?gapminder to open the help file for the gapminder dataset to see what variables the dataset contains.
One final note: packages can be written and uploaded by anyone, so if you are using a new package to do something important, make sure you trust where the
package is coming from. One way to do this is to search for the package name online and find out about the people who wrote it.
Section 3: Importing Data
Thus far we have used a dataset already stored in an R object. You will rarely have such luck and will have to import data into R from either a file, a database, or some
other source. But because it is so common to read data from a file, we will briefly describe the key approach and function, in case you want to use your new
knowledge on one of your own datasets.
Small datasets are commonly stored as Excel files. Although there are R packages designed to read Excel (xls) format, you generally want to avoid this format and
save files as comma delimited (Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files. These plain-text formats make it easier to share
data since commercial software is not required for working with the data.
The first step is to find the file containing your data and know its path. When you are working in R it is useful to know your working directory. This is the folder in which
R will save or look for files by default. You can see your working directory by typing:
getwd()
You can also change your working directory using the function setwd, or through RStudio by clicking on “Session” and then “Set Working Directory”.
The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for
beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.
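The two ways of pointing R at a file can be compared side by side. The sketch below writes a tiny CSV into a temporary folder (a stand-in for your own data folder; the file name example.csv is made up) and then reads it back once with a full path and once relative to the working directory.

```r
# Create a small CSV in a temporary folder.
data_dir <- tempdir()
example_path <- file.path(data_dir, "example.csv")   # hypothetical file name
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)),
          example_path, row.names = FALSE)

# Option 1: read using the full path; this works from any working directory.
from_full_path <- read.csv(example_path)

# Option 2: make that folder the working directory, then use just the file name.
old_wd <- setwd(data_dir)            # setwd() returns the previous directory
from_working_dir <- read.csv("example.csv")
setwd(old_wd)                        # restore the original working directory

identical(from_full_path, from_working_dir)   # TRUE
```

Building paths with file.path() rather than typing slashes by hand keeps the code portable between Windows, macOS, and Linux.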
We have included Covid-19 data from the New York Times (https://github.com/nytimes/covid-19-data) in a CSV file. We recommend
placing your data in your working directory.
You should be able to see the file in your working directory and can check using:
list.files()
 [1] "covid-shiny"
 [2] "Data-Viz-Figures"
 [3] "Data_Viz_Examples.R"
 [4] "Data_Viz_Workshop_2022.html"
 [5] "Data_Viz_Workshop_2022.rmd"
 [6] "Data_Viz_Workshop_2022_files"
 [7] "Data_Viz_Workshop_2022_Intro.html"
 [8] "Data_Viz_Workshop_2022_Intro.rmd"
 [9] "Data_Viz_Workshop_2022_Slides.html"
[10] "Data_Viz_Workshop_2022_Slides.pdf"
[11] "Data_Viz_Workshop_2022_Slides.rmd"
[12] "HarvardChan_logo_center_RGB_Large.png"
[13] "Rstudio_screenshot.png"
[14] "us-states.csv"
[15] "us.csv"
Reading in csv files
We are ready to read in the file. There are several functions for reading in tables. Here we introduce one included in base R:
covid_states <- read.csv("us-states.csv")
head(covid_states)
        date      state fips cases deaths
1 2020-01-21 Washington   53     1      0
2 2020-01-22 Washington   53     1      0
3 2020-01-23 Washington   53     1      0
4 2020-01-24   Illinois   17     1      0
5 2020-01-24 Washington   53     1      0
6 2020-01-25 California    6     1      0
This table shows the cumulative number of cases and deaths reported as of each date for each U.S. state and territory and the District of Columbia, from January 21, 2020 to August 10, 2022.
Cases represents the total number of cases of Covid-19, including both confirmed and probable. Deaths represents the total number of deaths from Covid-19, including
both confirmed and probable. FIPS codes are a standard geographic identifier that allows you to combine this data with other data sets, like a map file or population
data.
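To illustrate how a FIPS code lets you combine tables, here is a minimal sketch using base R’s merge(). The first data frame retypes three rows of the table above by hand; the second, state_lookup, is a hypothetical lookup table we made up for this example, holding only state abbreviations.

```r
# Three rows from the Covid-19 table shown above, typed in by hand.
covid_example <- data.frame(
  date  = c("2020-01-21", "2020-01-24", "2020-01-25"),
  state = c("Washington", "Illinois", "California"),
  fips  = c(53, 17, 6),
  cases = c(1, 1, 1)
)

# Hypothetical lookup table keyed by the same FIPS codes.
state_lookup <- data.frame(
  fips = c(6, 17, 53),
  abb  = c("CA", "IL", "WA")
)

# merge() matches rows whose fips values agree and returns one combined table.
combined <- merge(covid_example, state_lookup, by = "fips")
combined
```

The same pattern applies when the second table holds population counts or map shapes: as long as both tables share the fips column, merge() can line them up.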
Section 4: Basic Visualization
Just reading the numbers in the table makes it difficult to get a sense of the growth of the Covid-19 outbreak. Let’s draw a line plot to visualize the total number of
cases in Massachusetts.
library("ggplot2")
library("dplyr")
covid_states$date <- as.Date(covid_states$date)
mass_covid <- covid_states %>% filter(state=="Massachusetts")
ggplot(data=mass_covid, aes(x=date, y=cases)) +
geom_line() +
ylab("Cumulative confirmed and probable cases")
On March 11, 2020, the WHO declared Covid-19 a global pandemic. Let’s add an annotation for this landmark event:
ggplot(data=mass_covid, aes(x=date, y=cases)) +
  geom_line() +
  ylab("Cumulative confirmed and probable cases") +
  geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed") +
  # annotate() draws the label once; geom_text() here would redraw it for every row of the data
  annotate("text", x=as.Date("2020-03-11"), y=1500000, label="Pandemic\ndeclared")
Which states or territories in the United States have been hit the hardest?
top_states <- covid_states %>%
group_by(state) %>%
summarize(total_cases = max(cases)) %>%
top_n(5)
Selecting by total_cases
top_states
# A tibble: 5 x 2
  state      total_cases
  <chr>            <int>
1 California    10858351
2 Florida        6892701
3 Illinois       3617765
4 New York       5874882
5 Texas          7563617
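The top_n() function still works, but since dplyr 1.0.0 it has been superseded by slice_max(), which also returns the rows sorted from largest to smallest. The sketch below shows the same group, summarize, and select pattern on a toy data frame; the state names and counts are made up for illustration only.

```r
library(dplyr)

# Toy cumulative counts (made-up numbers, for illustration only).
toy_covid <- data.frame(
  state = c("A", "A", "B", "B", "C", "C"),
  cases = c(10, 50, 5, 20, 30, 80)
)

top_toy <- toy_covid %>%
  group_by(state) %>%
  summarize(total_cases = max(cases)) %>%   # largest (latest) cumulative value per state
  slice_max(total_cases, n = 2)             # keep the 2 states with the largest totals

top_toy
```

Because the counts are cumulative, the maximum for each state is also its most recent value, which is why max(cases) gives the total.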
Let’s plot these states’ confirmed and probable cases over time compared to Massachusetts.
top5_mass <- covid_states %>% filter(state %in% c("Massachusetts", "California", "Florida", "Illinois", "New York", "Texas"))
ggplot(data=top5_mass, aes(x=date, y=cases, color=state)) +
geom_line() +
ylab("Cumulative confirmed and probable cases")
While the legend is helpful, it’s sometimes easier to label the plot directly.
labels <- data.frame(state = c("California", "Texas", "Florida", "New York", "Illinois", "Massachusetts"),
                     x = as.Date(c("2022-03-01", "2022-03-10", "2022-03-20", "2022-03-25", "2022-03-10", "2022-03-15")),
                     y = c(9700000, 7200000, 6100000, 4000000, 2500000, 700000))
ggplot(data=top5_mass, aes(x=date, y=cases, color=state)) +
geom_line() +
theme(legend.position = "none") +
ylab("Cumulative confirmed and probable cases") +
geom_text(data=labels, aes(x, y, label=state), size=5)
Section 5: Why is Data Visualization Important?
Looking at raw numbers is hard:
Most people don’t want to dig through data themselves. Plus, looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself,
print and stare at this data table on US murders from the dslabs package. Each row represents a state with the following variables: name, abbreviation, region,
population, and total murders.
library(tidyverse)
library(dslabs)
data(murders)
head(murders)
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
What do you learn from staring at this table?
How quickly can you determine which states have the largest populations?
Which states have the smallest?
How large is a typical state?
Is there a relationship between population size and total murders?
How do murder rates vary across regions of the country?
Looking at a plot is more informative:
For most people, it is quite difficult to extract this information just from looking at the numbers.
In contrast, the answers to all the questions above are readily available from examining this plot:
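The figure itself is not reproduced in this text. A minimal sketch of a plot in the same spirit is given below, assuming the dslabs and ggplot2 packages are installed; the log scales, label placement, and axis titles are our own choices and may differ from the original figure.

```r
library(ggplot2)
library(dslabs)
data(murders)

# Population (in millions) against total murders, one labeled point per state.
murders_plot <- ggplot(murders, aes(x = population / 10^6, y = total, label = abb)) +
  geom_point(aes(color = region), size = 3) +
  geom_text(nudge_x = 0.05, size = 3) +   # state abbreviations next to the points
  scale_x_log10() +
  scale_y_log10() +
  xlab("Population in millions (log scale)") +
  ylab("Total number of murders (log scale)")
murders_plot
```

With population on one axis, totals on the other, and color for region, the largest and smallest states, the typical state size, and the relationship between population and murders can all be read from a single picture.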
We are reminded of the saying “a picture is worth a thousand words”. Data visualization provides a powerful way to communicate a data-driven finding. In some
cases, the visualization is so convincing that no follow-up analysis is required. People trust what they see. We also note that many widely used data analysis tools
were initiated by discoveries made via exploratory data analysis (EDA). EDA is perhaps the most important part of data analysis, yet is often overlooked.
Section 6: Principles of Data Visualization
Here we aim to provide some general principles one can use as a guide for effective data visualization. We will show some examples of plot styles we should avoid,
explain how to improve them, and use these as motivation for a list of principles. We will compare and contrast plots that follow these principles to those that don’t.
The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the
way our brains process visual information.
When deciding on a visualization approach, it is also important to keep our goal in mind, since the goal should guide what type of visualization we create. Our goals may
vary: we may be comparing a viewable number of quantities, describing a distribution for categories or numeric values, comparing the data from two groups, or
describing the relationship between two variables.
No matter our goal, we must always present the data truthfully. The best visualizations are truthful, intuitive, and aesthetically pleasing.
Encoding data using visual cues
We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness,
and color hue.
Internet browser usage
To illustrate how some of these strategies compare:
library(tidyverse)
library(gridExtra)
library(dslabs)
ds_theme_set()
browsers <- data.frame(Browser = rep(c("Opera","Safari","Firefox","IE","Chrome"),2),
Year = rep(c(2000, 2015), each = 5),
Percentage = c(3,21,23,28,26, 2,22,21,27,29)) %>%
mutate(Browser = reorder(Browser, Percentage))
library(ggthemes)
p1 <- browsers %>% ggplot(aes(x = "", y = Percentage, fill = Browser)) +
geom_bar(width = 1, stat = "identity", col = "black") + coord_polar(theta = "y") +
theme_excel() + xlab("") + ylab("") +
theme(axis.text=element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()) +
facet_grid(.~Year)
p1
This is a widely used graphical representation of percentages called the pie chart. It’s very popular in Microsoft Excel. The goal of this pie chart is to report the results
from two hypothetical polls regarding browser preference taken in 2000 and then 2015 using percentages.
Here we are representing quantities with both areas and angles, since both the angle and area of each pie slice are proportional to the quantity it represents. This turns
out to be a suboptimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when only area
is available.
Can you determine the actual percentages and rank the browsers’ popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot.
Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart is a preferable way of
displaying this type of data.
The preferred way to plot quantities is to use length and position, since humans are much better at judging linear measures. The bar plot uses bars of length proportional
to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we make it easier to read off each quantity from the position
of the top of its bar.
p2 <- browsers %>%
ggplot(aes(Browser, Percentage)) +
geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
ylab("Percent using the Browser") +
facet_grid(.~Year)
grid.arrange(p1, p2, nrow = 2)
Notice how much easier it is to see the differences in the barplot. We used the grid.arrange function from the gridExtra package to stack the two plots, with the pie charts on top and the barplots
below. The function arranges multiple plots in a grid with the number of rows and/or columns you specify (here nrow = 2). We can now determine the actual percentages by following a
horizontal line to the y-axis.
In general, position and length are preferred over angles, which in turn are preferred over area, for displaying quantities.
Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being
displayed.
Southwest border apprehensions
Here is an illustrative example of more barplots. The goal is to show the number of Southwest border apprehensions in 3 consecutive years.
When using barplots, it is dishonest not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being
displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media
organizations trying to exaggerate a difference. Do not distort quantities.
From the Fox News plot, it appears that apprehensions have almost tripled, when in fact they have only increased by about 16%. Starting the graph at 0 illustrates this
clearly:
data.frame(Year = as.character(c(2011, 2012, 2013)),Southwest_Border_Apprehensions = c(165244,170223,192298)) %>%
ggplot(aes(Year, Southwest_Border_Apprehensions )) +
geom_bar(stat = "identity", fill = "yellow", col = "black", width = 0.65)
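The figure of about 16% can be checked directly from the three values used in the plot. This short sketch computes the percent change from 2011 to 2013; the variable names are our own.

```r
apprehensions <- c("2011" = 165244, "2012" = 170223, "2013" = 192298)

# Percent change from 2011 to 2013.
pct_increase <- (apprehensions["2013"] - apprehensions["2011"]) /
  apprehensions["2011"] * 100
round(unname(pct_increase), 1)   # 16.4
```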
State murder rates
data(murders)
p1 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("")
p2 <- murders %>% mutate(murder_rate = total / population * 100000) %>%
mutate(state = reorder(state, murder_rate)) %>%
ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("")
grid.arrange(p1, p2, ncol = 2)
Here are more barplots. The goal is to show murder rates by state. As we can see, the bars can be ordered in different ways, such as alphabetically or numerically. We
rarely want to use alphabetical order. Instead, we should order by a meaningful value. If our goal is to compare the murder rates across states, we’re probably
interested in the most dangerous and safest states. It makes more sense to order by the actual rate than alphabetically.
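The reordering in the second plot comes from the reorder function, which changes the order of a factor’s levels according to another variable. A small sketch with made-up state names and rates shows the effect on the level order, which is what ggplot uses to arrange the bars.

```r
# Hypothetical rates for three made-up states.
state <- c("Alpha", "Beta", "Gamma")
rate  <- c(3.2, 1.1, 5.4)

levels(factor(state))          # default, alphabetical: "Alpha" "Beta" "Gamma"
levels(reorder(state, rate))   # ordered by rate:       "Beta" "Alpha" "Gamma"
```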
Gun deaths in Florida
Here is a line graph of gun deaths in Florida over time from Reuters (graphics.thomsonreuters.com/14/02/US-FLORIDA0214.gif). The goal is to show the dramatic
spike in murders by firearm after the “stand your ground” law was enacted in 2005.
However, notice that the y-axis is flipped. If you don’t pay close attention, you could draw the erroneous conclusion that the number of murders decreased after
the law was enacted. Make your axes intuitive.
Flipping the y-axis makes the graph less misleading and illustrates the increase clearly:
Average height
We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups.
Our next plot includes a barplot where the goal is to compare height between females and males. A commonly seen plot used for comparisons between groups,
popularized by software such as Microsoft Excel, shows the average and standard errors (standard errors are defined in a later lecture, but don’t confuse them with the
standard deviation of the data).
The plot looks like this:
data(heights)
p1 <- heights %>% group_by(sex) %>% summarize(average = mean(height), se=sd(height)/sqrt(n())) %>%
ggplot(aes(sex, average)) + theme_excel() +
geom_errorbar(aes(ymin = average - 2*se, ymax = average+2*se), width = 0.25)+
geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
ylab("Height in inches")
p1
The average of each group is represented by the top of each bar, and the antennae extend to the average plus two standard errors. If all someone receives is this plot,
they will have little information on what to expect if they meet a group of human males and females. The bars go to 0: does this mean there are tiny humans measuring
less than one foot? Are all males taller than the tallest females? Is there a range of heights? A viewer can’t answer these questions since we have provided almost no
information on the height distribution.
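To see how little the bar and its antenna convey, compare the interval they show with the actual spread of the data. The sketch below uses ten made-up heights (for illustration only) and the same formula for the standard error used in the plotting code, sd divided by the square root of the sample size.

```r
# Made-up heights in inches, for illustration only.
toy_heights <- c(60, 62, 64, 65, 66, 67, 68, 70, 72, 76)

avg <- mean(toy_heights)
se  <- sd(toy_heights) / sqrt(length(toy_heights))   # standard error of the mean

c(lower = avg - 2 * se, upper = avg + 2 * se)   # the interval the error bar shows
range(toy_heights)                              # what the data actually span
```

The interval around the average is much narrower than the range of the observations, because the standard error describes uncertainty about the average, not the variability of individual heights.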
This brings us to our next principle: show the data. This simple ggplot code already generates a more informative plot than the barplot by simply showing all the data
points:
heights %>% ggplot(aes(sex, height)) + geom_point()
For example, we get an idea of the range of the data. However, this plot has limitations as well, since we can’t really see all the 238 and 812 points plotted for females
and males respectively, and many points are plotted on top of each other. Visualizing the distribution is more informative still, but two changes already improve a plot that shows all the points.
The first is to add jitter: a small random shift applied to each point. In this case, adding horizontal jitter does not alter the interpretation, since the height of the points
does not change, but we minimize the number of points that fall on top of each other and therefore get a better sense of how the data is distributed.
A second improvement comes from using alpha blending: making the points somewhat transparent. The more points fall on top of each other, the darker the plot
which also helps us get a sense of how the points are distributed.
Here is the same plot with jitter and alpha blending:
heights %>% ggplot(aes(sex, height)) +
  geom_jitter(width = 0.1, height = 0, alpha = 0.2)  # height = 0 keeps the jitter purely horizontal
Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal lines demonstrating that many reported values are rounded
to the nearest integer. Since there are so many points it is more effective to show distributions, rather than show individual points. In our next example we show the
improvements provided by distributions and suggest further principles.
Height distributions
The jittered plot gives a better view of the data, but with this many points it is more effective to show distributions than individual points. We therefore show a
histogram for each group, with the goal of comparing the distribution of heights between females
and males:
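heights %>%
  ggplot(aes(height)) +
  # after_stat(density) replaces the older ..density.. notation (ggplot2 3.3.0 or later)
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, color="black") +
  facet_grid(.~sex)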
From this plot, it is immediately obvious that males are, on average, taller than females. An important principle here is to keep the axes the same when comparing
data across two plots. Ease comparisons by using common axes.
Align plots vertically to see horizontal changes and horizontally to see vertical changes. In these histograms, the visual cues for decreases or increases in
height are shifts to the left or right, respectively: horizontal changes. Aligning the plots vertically helps us see this change when the axes are fixed:
p2 <- heights %>%
  ggplot(aes(height)) +
  # after_stat(density) replaces the older ..density.. notation (ggplot2 3.3.0 or later)
  geom_histogram(aes(y = after_stat(density)), binwidth = 1, color="black") +
  facet_grid(sex~.)
p2
This plot makes it much easier to notice that men are, on average, taller. If instead of histograms we want the more compact summary provided by boxplots, then we
align them horizontally, since, by default, boxplots move up and down with changes in height.
Country income
For our last principle, we look at boxplots comparing country income between two years within each continent.
When comparing income data between 1972 and 2002 across continents, we might start with a figure like the one below.
library(gapminder)
# Load the table from the gapminder package; it has different columns from the dslabs table of the same name
data("gapminder", package = "gapminder")
gapminder %>% filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
  mutate(dollars_per_day = gdpPercap/365) %>%  # gdpPercap is already per person, so no division by pop
  mutate(labels = paste(year, continent)) %>%
  ggplot(aes(labels, dollars_per_day)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")
Note that, for each continent, we want to compare the distributions from 1972 to 2002. The default is to order alphabetically so the labels with 1972 come before the
labels with 2002, making the comparisons challenging.
Comparison is easier when boxplots are next to each other, with adjacent visual cues:
# gdpPercap is already per person per year, so divide by 365 only
gapminder %>%
  filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
  mutate(dollars_per_day = gdpPercap / 365) %>%
  mutate(labels = paste(continent, year)) %>%
  ggplot(aes(labels, dollars_per_day)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")
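The effect of how the label is built can be checked without plotting. The sketch below uses a small made-up set of years and continents (not the full gapminder data) and sorts the pasted labels alphabetically, which is the order ggplot2 uses by default for character labels.

```r
years <- c(1972, 2002)
continents <- c("Africa", "Americas", "Asia")

# All year/continent combinations (illustrative subset only)
combos <- expand.grid(year = years, continent = continents,
                      stringsAsFactors = FALSE)

# Year first: all 1972 labels sort before all 2002 labels
year_first <- sort(paste(combos$year, combos$continent))

# Continent first: the two years of a continent sort next to each other
continent_first <- sort(paste(combos$continent, combos$year))

year_first
continent_first
```

Putting the continent first keeps each continent's two years adjacent, which is what makes the second figure easier to read.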
Comparison is even easier when color is used to denote the two things compared. Ease comparison by using color.
# gdpPercap is already per person per year, so divide by 365 only
gapminder %>%
  filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
  mutate(dollars_per_day = gdpPercap / 365, year = factor(year)) %>%
  ggplot(aes(continent, dollars_per_day, fill = year)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")
Use labels instead of legends
For trend plots, we recommend labeling the lines rather than using legends as the viewer can quickly see which line is which country. This suggestion actually applies
to most plots: labeling is usually preferred over legends.
We demonstrate how we can do this using the life expectancy data. We define a data table with the label locations and then use a second mapping just for these
labels:
countries <- c("China","Germany")
labels <- data.frame(country = countries, x = c(1972,1962), y = c(60,72))
gapminder %>%
filter(country %in% countries) %>%
ggplot(aes(year, lifeExp, col = country)) +
geom_line() +
geom_text(data = labels, aes(x, y, label = country), size = 5) +
theme(legend.position = "none")
Think of the color blind
Color blindness is common, affecting roughly 8% of men and 0.5% of women. Unfortunately, the default colors in the ggplot2 package are not optimal for color-blind viewers. However, ggplot2 does make it easy to change the
color palette used in plots. Here is an example of a color blind friendly palette (http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette). Another resource is the
package RColorBrewer, which has several color palettes, including color blind friendly palettes.
library(RColorBrewer)
par(mar=c(3,4,2,2))
display.brewer.all(colorblindFriendly = TRUE)
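To apply such a palette to a plot rather than only display it, one option is a manual color scale. The sketch below is an assumption about how this could be done, not the workshop's own code: it copies the eight hex codes of the color blind friendly palette from the cookbook page linked above, and reuses the China and Germany life expectancy lines from the labeling example. The names `cb_palette` and `p_cb` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

# Color blind friendly palette (grey, orange, sky blue, green,
# yellow, blue, vermillion, purple)
cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
                "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

p_cb <- gapminder %>%
  filter(country %in% c("China", "Germany")) %>%
  ggplot(aes(year, lifeExp, col = country)) +
  geom_line() +
  # Named values tie each country to one palette color
  scale_color_manual(values = c(China = cb_palette[2],
                                Germany = cb_palette[6]))
p_cb
```

For RColorBrewer palettes, the name of any palette shown by `display.brewer.all(colorblindFriendly = TRUE)` can instead be passed to `scale_color_brewer(palette = ...)`.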
Main Takeaways
Use position and length, rather than angles or area
In general, don’t use pie charts
Do not distort quantities
Order categories in a meaningful way
Make axes intuitive
Show the data
Keep axes the same
Ease comparisons
Use labels instead of legends
Think of the color blind
Section 7: Data Visualization Choices
There are many visualization techniques you can use and many options available in the ggplot2 package. You can look at the geometries (geoms) listed here
(https://ggplot2.tidyverse.org/reference/)
to see all the options available. While we’ve covered some here and review them below, due to time and space constraints, we won’t be able to cover them all. A
helpful resource if you’re trying to decide on an appropriate data visualization for your data is the Data to Viz (https://www.data-to-viz.com/) website.
Tables
head(murders)
       state abb region population total
1    Alabama  AL  South    4779736   135
2     Alaska  AK   West     710231    19
3    Arizona  AZ   West    6392017   232
4   Arkansas  AR  South    2915918    93
5 California  CA   West   37253956  1257
6   Colorado  CO   West    5029196    65
We showed this table earlier. While a table is a helpful reference if you want to look up individual values, it is difficult to draw any comparisons or look at trends.
Scatter plot
murders %>% mutate(murder_rate = total / population * 100000) %>%
ggplot(aes(state, murder_rate)) +
geom_point() +
coord_flip() +
xlab("")
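A scatter plot allows comparison along a common scale. However, the order should be meaningful: here the states appear alphabetically, whereas ordering them by murder rate would make the comparison easier.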
Histogram
Earlier we saw this plot used to compare male and female heights. We used histograms for each group, with the goal to show the distribution of heights between
females and males:
heights %>%
  ggplot(aes(height, after_stat(density))) +
  geom_histogram(binwidth = 1, color = "black") +
  facet_grid(. ~ sex)
A histogram visualizes the distribution of data. Data are grouped into bins or intervals. Histograms show the shape of the data and can help identify extreme values
or gaps in the data, but they are less suited to comparing many groups.
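To make the binning concrete, the sketch below groups a few made-up height values (not the workshop data) into bins of width 2 with base R's `hist()` and shows how the density used on the y-axis above relates to the counts.

```r
# Six illustrative heights in inches (made-up values)
x <- c(61, 62.5, 63, 64.2, 64.8, 66)
breaks <- seq(60, 68, by = 2)   # bin edges: 60, 62, 64, 66, 68

# plot = FALSE returns the bin summary without drawing
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts    # number of values in each bin
h$density   # counts / (n * bin width), so bar areas sum to 1
```

Using density instead of raw counts puts groups of different sizes on a comparable scale, which is why the faceted histograms above plot density.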
Box (Whisker) Plot
# gdpPercap is already per person per year, so divide by 365 only
gapminder %>%
  filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
  mutate(dollars_per_day = gdpPercap / 365, year = factor(year)) %>%
  ggplot(aes(continent, dollars_per_day, fill = year)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(trans = "log2") +
  ylab("Income in dollars per day")
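A box plot shows the median, the first quartile, and the third quartile of the data. By default in ggplot2, the whiskers extend to the most extreme values within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers.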
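The numbers behind a box plot can be computed by hand. The sketch below uses a small made-up vector and the common 1.5 × IQR rule for whiskers and outliers; variable names are illustrative.

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)   # made-up data with one extreme value

# Box: first quartile, median, third quartile
q <- quantile(x, probs = c(0.25, 0.5, 0.75))
iqr <- q[["75%"]] - q[["25%"]]

# Fences: 1.5 * IQR beyond the box
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

# Whiskers end at the most extreme observations inside the fences
whisker_low  <- min(x[x >= lower_fence])
whisker_high <- max(x[x <= upper_fence])

# Anything outside the fences is plotted as an individual point
outliers <- x[x < lower_fence | x > upper_fence]
```

Here the value 50 falls outside the upper fence, so it would appear as a separate point rather than stretching the whisker.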
Bar chart
murders %>% mutate(murder_rate = total / population * 100000) %>%
mutate(state = reorder(state, murder_rate)) %>%
ggplot(aes(state, murder_rate)) +
geom_bar(stat="identity") +
coord_flip() +
xlab("")
A bar chart is helpful for showing changes over time or comparing different groups. When there are a large number of categories or the category names are long, as in the
example above that we used earlier, you can switch to a horizontal bar chart.
Line graph
A line graph is useful for showing trends, or how data changes over time. Line graphs are used for quantitative data over a continuous interval or time period, where
the x-axis is often a timescale.
For example, you may want to look at a trend over time by using a time series plot, with time on the x-axis and the outcome or measurement of interest on the y-axis. The
example below shows United States life expectancy in years over time:
library(gapminder)
data(gapminder)
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, lifeExp)) +
geom_point()
We see that the trend is relatively linear, though there is a sharp increase in the early 70s.
When the points are regularly and densely spaced, as they are here, we can connect the points to create a curve using the geom_line function, showing that the
data are from a single country.
gapminder %>%
filter(country == "United States") %>%
ggplot(aes(year, lifeExp)) +
geom_line()
This is particularly helpful when we compare two countries. We can subset the data to include two countries, one from Europe
and one from Asia, and assign a color to each country:
countries <- c("China","Germany")
gapminder %>%
filter(country %in% countries & !is.na(lifeExp)) %>%
ggplot(aes(year, lifeExp, col = country)) +
geom_line()
Heatmaps
Sometimes, you may want to use variations in color, such as hue or intensity, to show clusters or how a variable changes over a space.
Heatmaps use color as a visual cue.
library(RColorBrewer)
# Keep the most recent year and the three numeric variables
world.data <- filter(gapminder, year == 2007)
world.numeric.data <- world.data %>% select(lifeExp, pop, gdpPercap)
# Pairwise correlations between the numeric variables
world.cor <- cor(world.numeric.data, use = "pairwise.complete.obs")
heatmap(x = world.cor, col = brewer.pal(11, "PiYG"),
        margins = c(6, 6), symm = TRUE,
        cexRow = 0.8,
        cexCol = 0.8)
legend(x = "bottomright", xpd = TRUE,
       legend = c("Negative", "None", "Positive"), cex = 0.8,
       fill = colorRampPalette(brewer.pal(11, "PiYG"))(3))
Section 8: Mapping
For geographic data visualization with geospatial data, you may want to do some mapping. The sf package provides classes for spatial objects, and the ggplot2 package
can plot sf objects directly with geom_sf. One resource for maps is the rnaturalearth package, which provides a map of the countries of the entire world.
library("ggplot2")
theme_set(theme_bw())
library("sf")
library("rnaturalearth")
library("rnaturalearthdata")
world <- ne_countries(scale = "medium", returnclass = "sf")
class(world)
[1] "sf"         "data.frame"
This is the world map colored by estimated population:
ggplot(data = world) +
geom_sf(aes(fill = pop_est)) +
scale_fill_viridis_c(option = "plasma", trans = "sqrt")
We can also use other mapping packages such as sp, tmap, and leaflet. Beyond static maps, there are animated maps and interactive maps, which we won’t cover
in detail here.
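For a flavor of what an interactive map looks like, here is a minimal sketch with the leaflet package. It is not part of the original workshop; the center coordinates and zoom level are arbitrary choices for a world view, and the default map tiles need an internet connection to display.

```r
library(leaflet)

# An empty interactive map with the default OpenStreetMap tiles,
# centered for a rough world view (arbitrary coordinates and zoom)
m <- leaflet() %>%
  addTiles() %>%
  setView(lng = 0, lat = 20, zoom = 2)
m
```

In RStudio, printing `m` opens a map in the Viewer pane that can be panned and zoomed.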
Section 9: Shiny
Now that we understand how to create beautiful and informative plots with ggplot2, we can go one step further and create interactive visualization applications with
Shiny (https://shiny.rstudio.com/gallery/).
Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage, embed them in R Markdown
documents, or build dashboards. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions. You can create fairly complicated
Shiny apps with no knowledge of HTML, CSS, or JavaScript. On the other hand, Shiny doesn’t limit you to creating trivial or prefabricated apps: its user interface
components can be easily customized or extended, and its server uses reactive programming to let you create any type of backend logic you want. You can even
share Shiny apps publicly on the web for free with shinyapps.io (https://www.shinyapps.io/).
Shiny Application Examples
Shiny allows your plots to be interactive, so a viewer can decide what variables, subset of data, or types of plots they look at.
Simple Example
Here is an example using the Covid-19 dataset. We can choose which state to examine, decide what variable to compare by, and subset the range of dates we’d like
to consider.
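The structure of such an app can be sketched in a few lines. The Covid-19 dataset used in the workshop is not reproduced here, so this hypothetical example uses the gapminder data as a stand-in: the viewer picks a country and a range of years, and the plot updates accordingly. The input names (`country`, `years`) and the output name (`trend`) are illustrative.

```r
library(shiny)
library(dplyr)
library(ggplot2)
library(gapminder)

# User interface: a drop-down, a range slider, and a plot
ui <- fluidPage(
  titlePanel("Life expectancy over time"),
  selectInput("country", "Country:",
              choices = levels(gapminder$country),
              selected = "United States"),
  sliderInput("years", "Year range:",
              min = 1952, max = 2007, value = c(1952, 2007),
              step = 5, sep = ""),
  plotOutput("trend")
)

# Server: redraw the plot whenever an input changes
server <- function(input, output) {
  output$trend <- renderPlot({
    gapminder %>%
      filter(country == input$country,
             year >= input$years[1], year <= input$years[2]) %>%
      ggplot(aes(year, lifeExp)) +
      geom_line() +
      ylab("Life expectancy (years)")
  })
}

app <- shinyApp(ui = ui, server = server)

# Launch the app only when running interactively
if (interactive()) runApp(app)
```

The `ui` object describes what the viewer sees, and the `server` function holds the reactive logic: any code inside `renderPlot` that reads `input$...` reruns automatically when that input changes.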
Full Shiny Application Examples
Shiny applications can be incredibly complex, making them useful across many industries. Shiny is used in academia as a teaching tool for statistical concepts, by big
pharma companies to speed collaboration between scientists and analysts during drug development, and by Silicon Valley tech companies to set up realtime metrics
dashboards that incorporate advanced analytics.
Here’s a screenshot of an app “created to help people living and working in Scotland explore how geographical areas have changed over time or how they compare to
other areas, across a range of indicators of health and wider determinants of health” (view it here
(https://shiny.rstudio.com/gallery/scotpho-profiles.html)):
Here’s a screenshot of an app that gives “a full picture of New Zealand’s trading profile through intuitive and interactive graphs and tables” (view it here
(https://shiny.rstudio.com/gallery/nz-trade-dash.html)):
Section 10: Data Visualization in Practice
We’ve shown you how we used ggplot2 code to create insightful and aesthetically pleasing plots, but there is a growing availability of data and tools out there for
you to explore. Data visualization is useful across industries, academia, and government.
A particularly effective example is a Wall Street Journal article (http://graphics.wsj.com/infectious-diseases-and-vaccines/) showing data related to
the impact of vaccines on battling infectious diseases. One of the graphs shows measles cases by US state through the years with a vertical line demonstrating when
the vaccine was introduced.
The University of the Witwatersrand has a great Covid-19 dashboard
(https://www.covid19sa.org/southafricavaccination)
that combines various types of data visualization techniques such as mapping, bar charts, and line graphs.
Another striking example comes from the New York Times (graphics8.nytimes.com/images/2011/02/19/nyregion/19schoolsch/19schoolsch-popup.gif), which summarizes
scores from the NYC Regents Exams. As described in the article, these scores are collected for several reasons, including to determine if a student graduates from
high school. In New York City you need a 65 to pass.
Another good example comes from the Department of Statistics South Africa (https://www.statssa.gov.za/?p=15583), which tracks consumer
inflation surges over the past 13 years. The annual rate for the Consumer Price Index (CPI) shows an upward trajectory in the first half of 2022, and the chart highlights the
highest rate since May 2009.
An interactive plot by Statistics Botswana
(https://www.statsbots.org.bw/)
allows you to examine each individual data point on the total merchandise trade plot, which includes imports, exports, and trade balance.
A striking population pyramid comes from the Zimbabwe National Statistics Agency
(https://www.zimstat.co.zw/).
Data visualization is the strongest tool of exploratory data analysis. You can use programming to bridge the gap between an idea and an interesting visual to
communicate it.
“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John Tukey, father of EDA
Additional Resources and Tools for Next Steps
Many of the figures we showed in the previous sections were produced with the ggplot2 package. While we don’t have time to cover the basics of R or its
packages, we provide some helpful links below.
R for Data Science (https://r4ds.had.co.nz/index.html): a great book to teach you the basics of R and data visualization
RStudio Cheatsheets (https://www.rstudio.com/resources/cheatsheets/): helpful cheatsheets for commonly used packages such as ggplot2 for data visualization and dplyr for data transformation
ggplot2: Elegant Graphics for Data Analysis (https://ggplot2-book.org/): book on the theory and grammar of graphics for ggplot2
ggplot2 gallery (https://r-graph-gallery.com/ggplot2-package.html): examples of graphics made with ggplot2
How Charts Lie (https://albertocairo.com/): book by Alberto Cairo about data visualization
Stack Overflow (https://stackoverflow.com/): a website that most of your Google searches will lead to; users answer each other’s questions with really useful bits of code
RStudio online learning (https://www.rstudio.com/online-learning/): a list by the creators of RStudio filled with additional resources for students interested in learning more about programming in R
Acknowledgments and Questions
This workshop had help from lots of other people, including Patrick Emedom-Nnamdi, Intekhab Hossain, Jenna Landy, Shirley Liao, Jonathan Luu, Heather Mattie,
Marcello Pagano, Harrison Reeder, and Luli Zou.
If you have any questions, feel free to reach out to me at amyzhou@g.harvard.edu.