9/6/22, 7:12 PM Data Visualization in R

Section 0: Introduction
Section 1: Variables and Assignment
Section 2: Installing New Packages
Section 3: Importing Data
Section 4: Basic Visualization
Section 5: Why is Data Visualization Important?
Section 6: Principles of Data Visualization
Section 7: Data Visualization Choices
Section 8: Mapping
Section 9: Shiny
Section 10: Data Visualization in Practice
Additional Resources and Tools for Next Steps
Acknowledgments and Questions

Data Visualization in R
Quantitative Methods in Global Health: Mini-Conference
McGoldrick Professional Development Program in Public Health
Amy Zhou
08-16-2022

Section 0: Introduction

During the course of your work, you come up with interesting research questions, discuss ways of gathering data with your collaborators to answer your question, and want to use numbers and charts to communicate your results and simplify complex ideas. Now you might wonder: how do I actually do this myself? When I gather information, how do I work with it to make sense of things, and how can I produce visuals to communicate my results with others?

You can do this with programming and data visualization. A common programming language that we will be using is called R. By the end of this workshop, you'll be able to:

- Import a COVID-19 dataset
- Understand the principles of data visualization
- And have the resources and tools to do your own data visualization later!

What is R? What can it do?

R is a programming language designed for statistical computing. Like any other language, it has a particular set of rules and words and symbols that we can use to write instructions that our computer will understand.
There are many special things about the R language in particular, including:

- Vast capabilities, wide range of statistical and graphical techniques
- FREE (no cost, "open source")
- Excellent community support: mailing list, blogs, tutorials
- Easy to extend by writing new functions

You can download R from this site, and install it the same way you install any other program: https://cloud.r-project.org
Choose a mirror (webserver) that is close to you geographically.

How do I use R and start coding? What is RStudio?

RStudio is the program we use to write code in R. You can download RStudio from this site, and install it after you've downloaded and installed R: https://www.rstudio.com/products/rstudio/download/ (select the free version, "RStudio Desktop," and follow the installation instructions on this page.)

If R is the language, RStudio is like the messaging app we use to send instructions to the computer written in that language, and to receive back the results. RStudio shows us our code, outputs and plots. It also makes it easy to get help information about how an R function works.

Let's start by looking at the four parts of the RStudio interface.

Top Left: This is the editor. It is where you write and save your code. If you highlight a portion of code and press Control + Enter (or, if you're on a Mac, Command + Enter) then the highlighted code will run in the…
Bottom Left: This is the console. It shows you the code you've just run and what the result is.
Top Right: This is the environment. If you create or store any objects in your code, they appear here. The history contains all of the commands you sent to the console.
Bottom Right: This is the plot window. If you create plots, they appear here. It is also where help information appears. Files shows the directory where R thinks your workspace is, and Help shows you the help files and documentation.

Typical RStudio Window

How do I write and run code?

Open RStudio and create a new 'script' by going to File > New File > R Script. This will open what looks like a blank document in the top left portion of your RStudio screen. R scripts are just text files that store your code. When you save them, they'll get a .R extension, which lets your computer know to open them as R scripts. We call it a 'script' because, like a script in a play, this file will become a set of 'instructions' you give to the computer to tell it what to do. It is best to write and run code from a script in the editor, so that you keep track of what you've done.

Type the following header into the new R file, and then save the file somewhere you'll remember.

    # Data Visualization Workshop
    # [Your Name]
    # [The Date]

Notice that typing the hash symbol (#) changes the color of the text. This turns the line into a comment, so it will not run like other code. We use comments to organize and write helpful notes to ourselves in our code.

Below your header, type 2+2, highlight it, and press Ctrl + Enter (Command + Enter on a Mac). This runs the code, and you should see the result in the console. You can do more calculations such as the ones below:

    2+3*4
    2+(3*4)
    (2+3)*4
    (15/3)*5-2

Section 1: Variables and Assignment

As we've just seen, R has the capabilities of a calculator: +, -, * and / are symbols used for addition, subtraction, multiplication and division, respectively. What if you want to store the results of a calculation, and use them in another calculation? For that, we create what are called variables using either <- or = .
    #Assign a number
    x <- 5
    y <- 3

You should immediately notice that in the top right panel, the variable x has been stored in your environment and has the value 5, and y has been stored with the value 3. Use this panel to keep track of the variables you've created!

Now, you can use the stored value to perform additional calculations. Let's try a few. In this handout, the code we want to run is surrounded by a little grey box, and the output resulting from that code shows up on the following line starting with [1]. So, if you're following along with the examples, you'll only be typing into the computer what's in the grey boxes. You do not need to type in any line that starts with [1], because that's what the computer is going to give to you!

    x+y
    [1] 8
    x-y
    [1] 2
    x*y
    [1] 15
    x/y
    [1] 1.666667
    x
    [1] 5
    y
    [1] 3

Notice none of these calculations changed the value of x or y, because we did not reassign a value to either of them. x is still 5, and y is still 3. However, if we do reassign a different value to x or y, the old value is forgotten and replaced by the new value. It's important to keep track of changes you make to variables!

    x <- x + 4
    x
    [1] 9
    x + y
    [1] 12
    y <- 8
    x + y
    [1] 17

In addition to numbers, R can also work with and store letters and other characters. We call these strings, and they can be a wide range of values, such as words, names, sentences, paragraphs, passwords, etc. We do this by putting the string we want to store inside quotation marks like this:

    z <- "z is a string variable."
    z
    [1] "z is a string variable."

To be clear: the variable is z without quotation marks, and the value that z stores is the string "z is a string variable." which has quotation marks.

Lastly, variables don't just have to be called x or y. They can be named by any unbroken set of letters, numbers, and underscores that starts with a letter, like height, var3, or final_value.
It's best to call them something meaningful so you remember what they represent.

    my_location <- "Boston, MA"
    my_location
    [1] "Boston, MA"

Section 2: Installing New Packages

One of the benefits of R is that users from around the world write and publish new functions for us to use. We use them by downloading and loading packages of new functions. Using a new package takes two steps:

1. Install the package using install.packages("[name of package]"). You only need to do this the very first time you ever use a new package.
2. Load the package using library("[name of package]"). You need to do this every time you close and reopen RStudio. You can load the package only after you have installed the package.

If you want to follow along with the workshop, you should install the following packages with this line of code:

    install.packages(c('dplyr', 'tidyverse', 'ggplot2', 'dslabs', 'gapminder', 'ggthemes',
                       'ggrepel', 'gridExtra', 'RColorBrewer', 'rnaturalearth',
                       'rnaturalearthdata', 'sf'))

For example, let's load a new package to help with graphics, called ggplot2.

    #install.packages("ggplot2") #We've commented out this line because we already ran it before.
    library("ggplot2")

Next, use ?ggplot to open the help file for the ggplot() function, and click the link that says "Index" at the bottom to look up all of the functions contained in the new package. For example, ggplot2 has a function for a quick plot. (In ggplot2 3.4.0 and later, qplot() is deprecated and prints a warning, but it still works.)

    x <- c(1,2,3,4)
    y <- c(2,4,6,8)
    qplot(x, y)
    qplot(1:10, letters[1:10])

Some packages include datasets. For example, the dslabs package contains datasets that we will be using for data visualization.

    #install.packages("dslabs") #We've commented out this line because we already ran it before.
    #Note: you only need to run the 'install.packages'
    #command the very first time you want to use a new package
    #then you can start with the 'library' command every time after
    library("dslabs")
    data(gapminder) # load specified dataset
    head(gapminder) #'head' is a function that shows the first few rows of data.
                  country year infant_mortality life_expectancy fertility
    1             Albania 1960           115.40           62.87      6.19
    2             Algeria 1960           148.20           47.50      7.65
    3              Angola 1960           208.00           35.98      7.32
    4 Antigua and Barbuda 1960               NA           62.97      4.43
    5           Argentina 1960            59.87           65.39      3.11
    6             Armenia 1960               NA           66.86      4.55
      population          gdp continent          region
    1    1636054           NA    Europe Southern Europe
    2   11124892  13828152297    Africa Northern Africa
    3    5270844           NA    Africa   Middle Africa
    4      54681           NA  Americas       Caribbean
    5   20619075 108322326649  Americas   South America
    6    1867396           NA      Asia    Western Asia
    tail(gapminder) #'tail' is a function that shows the last few rows of data.
                     country year infant_mortality life_expectancy fertility
    10540          Venezuela 2016               NA           74.80        NA
    10541 West Bank and Gaza 2016               NA           74.70        NA
    10542            Vietnam 2016               NA           75.60        NA
    10543              Yemen 2016               NA           64.92        NA
    10544             Zambia 2016               NA           57.10        NA
    10545           Zimbabwe 2016               NA           61.69        NA
          population gdp continent             region
    10540         NA  NA  Americas      South America
    10541         NA  NA      Asia       Western Asia
    10542         NA  NA      Asia South-Eastern Asia
    10543         NA  NA      Asia       Western Asia
    10544         NA  NA    Africa     Eastern Africa
    10545         NA  NA    Africa     Eastern Africa
    #?gapminder

We can load an existing dataset, gapminder, from the dslabs package and look at the first few rows of data using head() or the last few rows of data using tail(). We can use ?gapminder to open the help file for the gapminder dataset to see what variables the dataset contains.

One final note: packages can be written and uploaded by anyone, so if you are using a new package to do something important, make sure you trust where the package is coming from. One way you can do this is by searching the package name online and finding out about the people who wrote it.
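The install-once, load-every-session pattern and the advice to check who is behind a package can also be sketched in code. This is a minimal illustration rather than part of the original workshop: it inspects the built-in stats package so that it runs without downloading anything, and pkg is just a placeholder name you would replace with the package you care about.

```r
# Placeholder: swap in the package you want to check, e.g. "dslabs".
pkg <- "stats"

# requireNamespace() returns TRUE if the package is installed, without attaching it.
is_installed <- requireNamespace(pkg, quietly = TRUE)

if (!is_installed) {
  # Only needed the first time (requires an internet connection):
  # install.packages(pkg)
  message("Package '", pkg, "' is not installed yet.")
} else {
  # The DESCRIPTION metadata records the version and who maintains the package.
  info <- packageDescription(pkg)
  cat("Package:   ", info$Package, "\n")
  cat("Version:   ", info$Version, "\n")
  cat("Maintainer:", info$Maintainer, "\n")
}
```

Reading the maintainer field is only a starting point; it tells you whom to look up, not whether the package is trustworthy.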
Section 3: Importing Data

Thus far we have used a dataset already stored in an R object. You will rarely have such luck and will have to import data into R from either a file, a database, or some other source. Because it is so common to read data from a file, we will briefly describe the key approach and function, in case you want to use your new knowledge on one of your own datasets.

Small datasets are commonly stored as Excel files. Although there are R packages designed to read Excel (xls) format, you generally want to avoid this format and save files as comma delimited (Comma-Separated Value/CSV) or tab delimited (Tab-Separated Value/TSV/TXT) files. These plain-text formats make it easier to share data since commercial software is not required for working with the data.

The first step is to find the file containing your data and know its path. When you are working in R it is useful to know your working directory. This is the folder in which R will save or look for files by default. You can see your working directory by typing:

    getwd()

You can also change your working directory using the function setwd. Or you can change it through RStudio by clicking on "Session" and then "Set Working Directory".

The functions that read and write files (there are several in R) assume you mean to look for files or write files in the working directory. Our recommended approach for beginners will have you reading and writing to the working directory. However, you can also type the full path, which will work independently of the working directory.

We have included Covid-19 data from the New York Times (https://github.com/nytimes/covid-19-data) in a CSV file. We recommend placing your data in your working directory.
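The difference between relying on the working directory and typing a full path can be tried out safely with a throwaway file. The sketch below is an illustration under stated assumptions: it writes a tiny made-up table to R's temporary folder, and the file name example.csv and its contents are hypothetical, not part of the workshop data.

```r
# Where R currently reads and writes files by default.
print(getwd())

# Build a full path to a file in a temporary folder
# (a stand-in for your own data folder).
data_dir <- tempdir()
csv_path <- file.path(data_dir, "example.csv")  # hypothetical file name

# Write a tiny made-up table so there is something to read back.
write.csv(data.frame(state = c("A", "B"), cases = c(10, 20)),
          csv_path, row.names = FALSE)

# A full path works no matter what the working directory is.
print(file.exists(csv_path))
example <- read.csv(csv_path)
print(example)

# list.files() shows what is in a folder; with no arguments it
# lists the working directory instead.
print("example.csv" %in% list.files(data_dir))
```

Using file.path() to build the path avoids having to type the slashes yourself, which differ between Windows and other systems.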
You should be able to see the file in your working directory and can check using:

    list.files()
     [1] "covid-shiny"
     [2] "Data-Viz-Figures"
     [3] "Data_Viz_Examples.R"
     [4] "Data_Viz_Workshop_2022.html"
     [5] "Data_Viz_Workshop_2022.rmd"
     [6] "Data_Viz_Workshop_2022_files"
     [7] "Data_Viz_Workshop_2022_Intro.html"
     [8] "Data_Viz_Workshop_2022_Intro.rmd"
     [9] "Data_Viz_Workshop_2022_Slides.html"
    [10] "Data_Viz_Workshop_2022_Slides.pdf"
    [11] "Data_Viz_Workshop_2022_Slides.rmd"
    [12] "HarvardChan_logo_center_RGB_Large.png"
    [13] "Rstudio_screenshot.png"
    [14] "us-states.csv"
    [15] "us.csv"

Reading in csv files

We are ready to read in the file. There are several functions for reading in tables. Here we introduce one included in base R:

    covid_states <- read.csv("us-states.csv")
    head(covid_states)
            date      state fips cases deaths
    1 2020-01-21 Washington   53     1      0
    2 2020-01-22 Washington   53     1      0
    3 2020-01-23 Washington   53     1      0
    4 2020-01-24   Illinois   17     1      0
    5 2020-01-24 Washington   53     1      0
    6 2020-01-25 California    6     1      0

This table shows the daily number of cases and deaths for each U.S. state and territory and the District of Columbia from January 21, 2020 to August 10, 2022. Cases represent the total (cumulative) number of cases of Covid-19, including both confirmed and probable. Deaths represent the total number of deaths from Covid-19, including both confirmed and probable. FIPS codes are a standard geographic identifier that allows you to combine this data with other data sets like a map file or population data.

Section 4: Basic Visualization

Just reading the numbers in the table makes it difficult to get a sense of the growth of the Covid-19 outbreak. Let's draw a line plot to visualize the total number of cases in Massachusetts.
    library("ggplot2")
    library("dplyr")
    covid_states$date <- as.Date(covid_states$date)
    mass_covid <- covid_states %>%
      filter(state=="Massachusetts")
    ggplot(data=mass_covid, aes(x=date, y=cases)) +
      geom_line() +
      ylab("Cumulative confirmed and probable cases")

On March 11, 2020, the WHO declared Covid-19 a global pandemic. Let's add some annotations for this landmark event:

    ggplot(data=mass_covid, aes(x=date, y=cases)) +
      geom_line() +
      ylab("Cumulative confirmed and probable cases") +
      geom_vline(aes(xintercept=as.Date("2020-03-11")), linetype="dashed") +
      # annotate() draws the label once; geom_text() with aes() would redraw it for every row
      annotate("text", x=as.Date("2020-03-11"), y=1500000, label="Pandemic\ndeclared")

Which states or territories in the United States have been hit the hardest?

    top_states <- covid_states %>%
      group_by(state) %>%
      summarize(total_cases = max(cases)) %>%
      top_n(5)
    Selecting by total_cases
    top_states
    # A tibble: 5 x 2
      state      total_cases
      <chr>            <int>
    1 California    10858351
    2 Florida        6892701
    3 Illinois       3617765
    4 New York       5874882
    5 Texas          7563617

Let's plot these states' confirmed and probable cases over time compared to Massachusetts.

    top5_mass <- covid_states %>%
      filter(state %in% c("Massachusetts", "California", "Florida", "Illinois", "New York", "Texas"))
    ggplot(data=top5_mass, aes(x=date, y=cases, color=state)) +
      geom_line() +
      ylab("Cumulative confirmed and probable cases")

While the legend is helpful, it's sometimes easier to label the plot directly.
    labels <- data.frame(state = c("California", "Texas", "Florida", "New York", "Illinois", "Massachusetts"),
                         x = c(as.Date("2022-03-01"), as.Date("2022-03-10"), as.Date("2022-03-20"),
                               as.Date("2022-03-25"), as.Date("2022-03-10"), as.Date("2022-03-15")),
                         y = c(9700000, 7200000, 6100000, 4000000, 2500000, 700000))
    ggplot(data=top5_mass, aes(x=date, y=cases, color=state)) +
      geom_line() +
      theme(legend.position = "none") +
      ylab("Cumulative confirmed and probable cases") +
      geom_text(data=labels, aes(x, y, label=state), size=5)

Section 5: Why is Data Visualization Important?

Looking at raw numbers is hard: Most people don't want to dig through data themselves. Plus, looking at the numbers and character strings that define a dataset is rarely useful. To convince yourself, print and stare at this data table on US murders from the dslabs package. Each row represents a state with the following variables: name, abbreviation, region, population, and total murders.

    library(tidyverse)
    library(dslabs)
    data(murders)
    head(murders)
           state abb region population total
    1    Alabama  AL  South    4779736   135
    2     Alaska  AK   West     710231    19
    3    Arizona  AZ   West    6392017   232
    4   Arkansas  AR  South    2915918    93
    5 California  CA   West   37253956  1257
    6   Colorado  CO   West    5029196    65

What do you learn from staring at this table? How quickly can you determine which states have the largest populations? Which states have the smallest? How large is a typical state? Is there a relationship between population size and total murders? How do murder rates vary across regions of the country?

Looking at a plot is more informative: For most people, looking for trends in tables is difficult! It is quite difficult to extract this information just from looking at the numbers.
In contrast, the answers to all the questions above are readily available from examining this plot:

We are reminded of the saying "a picture is worth a thousand words". Data visualization provides a powerful way to communicate a data-driven finding. In some cases, the visualization is so convincing that no follow-up analysis is required. People trust what they see. We also note that many widely used data analysis tools were initiated by discoveries made via exploratory data analysis (EDA). EDA is perhaps the most important part of data analysis, yet is often overlooked.

Section 6: Principles of Data Visualization

Here we aim to provide some general principles one can use as a guide for effective data visualization. We will show some examples of plot styles we should avoid, explain how to improve them, and use these as motivation for a list of principles. We will compare and contrast plots that follow these principles to those that don't.

The principles are mostly based on research related to how humans detect patterns and make visual comparisons. The preferred approaches are those that best fit the way our brains process visual information. When deciding on a visualization approach, it is also important to keep our goal in mind. Our goal should guide what type of visualization we create. Our goals may vary: we may be comparing a viewable number of quantities, describing a distribution for categories or numeric values, comparing the data from two groups, or describing the relationship between two variables.

No matter our goal, we must always present the data truthfully. The best visualizations are truthful, intuitive, and aesthetically pleasing.

Encoding data using visual cues

We start by describing some principles for encoding data. There are several approaches at our disposal including position, aligned lengths, angles, area, brightness, and color hue.
Internet browser usage

To illustrate how some of these strategies compare, consider the following example:

    library(tidyverse)
    library(gridExtra)
    library(dslabs)
    ds_theme_set()
    browsers <- data.frame(Browser = rep(c("Opera","Safari","Firefox","IE","Chrome"),2),
                           Year = rep(c(2000, 2015), each = 5),
                           Percentage = c(3,21,23,28,26, 2,22,21,27,29)) %>%
      mutate(Browser = reorder(Browser, Percentage))
    library(ggthemes)
    p1 <- browsers %>%
      ggplot(aes(x = "", y = Percentage, fill = Browser)) +
      geom_bar(width = 1, stat = "identity", col = "black") +
      coord_polar(theta = "y") +
      theme_excel() +
      xlab("") + ylab("") +
      theme(axis.text=element_blank(), axis.ticks = element_blank(), panel.grid = element_blank()) +
      facet_grid(.~Year)
    p1

This is a widely used graphical representation of percentages called the pie chart. It's very popular in Microsoft Excel. The goal of this pie chart is to report the results from two hypothetical polls regarding browser preference taken in 2000 and then 2015 using percentages.

Here we are representing quantities with both areas and angles, since both the angle and area of each pie slice are proportional to the quantity it represents. This turns out to be a suboptimal choice since, as demonstrated by perception studies, humans are not good at precisely quantifying angles and are even worse when only area is available.

Can you determine the actual percentages and rank the browsers' popularity? Can you see how the percentages changed from 2000 to 2015? It is not easy to tell from the plot.

Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart is a preferable way of displaying this type of data.
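To check how hard the pie charts are to read, it can help to look at the numbers behind them. The sketch below is an added illustration in base R, using the hypothetical poll percentages from the browser example; the column names pct_2000, pct_2015, and change are made up for this sketch.

```r
# Hypothetical poll percentages, copied from the browser example.
browser  <- c("Opera", "Safari", "Firefox", "IE", "Chrome")
pct_2000 <- c(3, 21, 23, 28, 26)
pct_2015 <- c(2, 22, 21, 27, 29)

# One row per browser, with the change in percentage points.
polls <- data.frame(browser, pct_2000, pct_2015,
                    change = pct_2015 - pct_2000)

# Sort by the 2015 share so the ranking is explicit.
polls <- polls[order(polls$pct_2015, decreasing = TRUE), ]
print(polls, row.names = FALSE)
```

Every change is three percentage points or less, which is exactly the kind of small difference that slice angles and areas hide.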
The preferred way to plot quantities is to use length and position, since humans are much better at judging linear measures. The bar plot uses bars of length proportional to the quantities of interest. By adding horizontal lines at strategically chosen values, in this case at every multiple of 10, we make the values easier to read off from the position of the top of each bar.

    p2 <- browsers %>%
      ggplot(aes(Browser, Percentage)) +
      geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
      ylab("Percent using the Browser") +
      facet_grid(.~Year)
    grid.arrange(p1, p2, nrow = 2)

Notice how much easier it is to see the differences in the barplot. We used the grid.arrange function from the gridExtra package to place the two plots one above the other (nrow = 2). The gridExtra package arranges multiple plots by specifying the number of columns and/or rows. We can now determine the actual percentages by following a horizontal line to the y-axis.

In general, position and length are preferred over angles, which in turn are preferred over area. Brightness and color are even harder to quantify than angles and area but, as we will see later, they are sometimes useful when more than two dimensions are being displayed.

Southwest border apprehensions

Here is an illustrative example of more barplots. The goal is to show the number of Southwest border apprehensions in 3 consecutive years.

When using barplots, it is dishonest not to start the bars at 0. This is because, by using a barplot, we are implying the length is proportional to the quantities being displayed. By avoiding 0, relatively small differences can be made to look much bigger than they actually are. This approach is often used by politicians or media organizations trying to exaggerate a difference. Do not distort quantities.
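The size of the distortion can be worked out directly. The sketch below is an added illustration: it uses the 2011 and 2013 apprehension counts from this section's bar plot, while the truncated baseline of 160,000 is a hypothetical value chosen for the example, not a figure read off any published chart.

```r
# Apprehension counts used in this section's bar plot example.
apprehensions <- c("2011" = 165244, "2013" = 192298)

# True relative increase from 2011 to 2013.
true_increase <- apprehensions[["2013"]] / apprehensions[["2011"]] - 1

# Hypothetical truncated axis: suppose the bars start at 160,000 instead of 0.
baseline <- 160000
bar_lengths <- apprehensions - baseline

# How many times longer the 2013 bar looks than the 2011 bar.
apparent_ratio <- bar_lengths[["2013"]] / bar_lengths[["2011"]]

cat(sprintf("True increase: %.1f%%\n", 100 * true_increase))
cat(sprintf("Ratio of bar lengths on the truncated axis: %.1f\n", apparent_ratio))
```

The closer the baseline sits to the smallest value, the larger the apparent ratio becomes, even though the underlying counts never change.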
From the Fox News plot, it appears that apprehensions have almost tripled when in fact they have only increased by about 16%. Starting the graph at 0 illustrates this clearly:

    data.frame(Year = as.character(c(2011, 2012, 2013)),
               Southwest_Border_Apprehensions = c(165244,170223,192298)) %>%
      ggplot(aes(Year, Southwest_Border_Apprehensions)) +
      geom_bar(stat = "identity", fill = "yellow", col = "black", width = 0.65)

State murder rates

    data(murders)
    p1 <- murders %>%
      mutate(murder_rate = total / population * 100000) %>%
      ggplot(aes(state, murder_rate)) +
      geom_bar(stat="identity") +
      coord_flip() +
      xlab("")
    p2 <- murders %>%
      mutate(murder_rate = total / population * 100000) %>%
      mutate(state = reorder(state, murder_rate)) %>%
      ggplot(aes(state, murder_rate)) +
      geom_bar(stat="identity") +
      coord_flip() +
      xlab("")
    grid.arrange(p1, p2, ncol = 2)

Here are more barplots. The goal is to show murder rates by state. As we can see, the bars can be ordered in different ways, such as alphabetically or numerically.

We rarely want to use alphabetical order. Instead we should order by a meaningful value. If our goal is to compare the murder rates across states, we're probably interested in the most dangerous and safest states. It makes more sense to order by the actual rate rather than alphabetically.

Gun deaths in Florida

Here is a line graph of gun deaths in Florida over time from Reuters (graphics.thomsonreuters.com/14/02/US-FLORIDA0214.gif). The goal is to show the dramatic spike in murders by firearm after the "stand your ground" law was enacted in 2005. However, notice that the y-axis is flipped, and if you didn't pay close attention, you could draw the erroneous conclusion that the number of murders decreased after the law. Make your axes intuitive.
Flipping the y-axis back to its usual orientation makes the graph less misleading and illustrates the increase clearly:

Average height

We have focused on displaying single quantities across categories. We now shift our attention to displaying data, with a focus on comparing groups. Our next plot is a barplot where the goal is to compare height between females and males.

A commonly seen plot used for comparisons between groups, popularized by software such as Microsoft Excel, shows the average and standard errors (standard errors are defined in a later lecture, but don't confuse them with the standard deviation of the data). The plot looks like this:

    data(heights)
    p1 <- heights %>%
      group_by(sex) %>%
      summarize(average = mean(height), se=sd(height)/sqrt(n())) %>%
      ggplot(aes(sex, average)) +
      theme_excel() +
      geom_errorbar(aes(ymin = average - 2*se, ymax = average+2*se), width = 0.25)+
      geom_bar(stat = "identity", width=0.5, fill=4, col = 1) +
      ylab("Height in inches")
    p1

The average of each group is represented by the top of each bar, and the antennae extend from the average minus two standard errors to the average plus two standard errors. If all someone receives is this plot, they will have little information on what to expect if they meet a group of human males and females. The bars go to 0: does this mean there are tiny humans measuring less than one foot? Are all males taller than the tallest females? Is there a range of heights? Someone can't answer these questions since we have provided almost no information on the height distribution. This brings us to our next principle: show the data.

This simple ggplot code already generates a more informative plot than the barplot by simply showing all the data points:

    heights %>% ggplot(aes(sex, height)) + geom_point()

For example, we get an idea of the range of the data.
However, this plot has limitations as well, since we can't really see all the 238 and 812 points plotted for females and males respectively, and many points are plotted on top of each other. We can improve the plot in two ways. The first is to add jitter: adding a small random shift to each point. In this case, adding horizontal jitter does not alter the interpretation, since the height of the points does not change, but we minimize the number of points that fall on top of each other and therefore get a better sense of how the data is distributed. A second improvement comes from using alpha blending: making the points somewhat transparent. The more points fall on top of each other, the darker the plot, which also helps us get a sense of how the points are distributed. Here is the same plot with jitter and alpha blending:

    heights %>% ggplot(aes(sex, height)) + geom_jitter(width = 0.1, alpha = 0.2)

Now we start getting a sense that, on average, males are taller than females. We also note dark horizontal lines demonstrating that many reported values are rounded to the nearest integer. In our next example we show the improvements provided by distributions and suggest further principles.

Height distributions

Earlier we saw this plot used to compare male and female heights. However, what if we have too many points? Since there are so many points, it is more effective to show distributions rather than individual points.
We therefore show histograms for each group, with the goal of showing the distribution of heights for females and males:

    heights %>%
      ggplot(aes(height, after_stat(density))) +  # after_stat(density) replaces the deprecated ..density..
      geom_histogram(binwidth = 1, color="black") +
      facet_grid(.~sex)

From this plot, it is immediately obvious that males are, on average, taller than females. An important principle here is to keep the axes the same when comparing data across two plots. Ease comparisons by using common axes.

Align plots vertically to see horizontal changes and horizontally to see vertical changes. In these histograms, the visual cue related to decreases or increases in height is a shift to the left or right, respectively: a horizontal change. Aligning the plots vertically helps us see this change when the axes are fixed:

    p2 <- heights %>%
      ggplot(aes(height, after_stat(density))) +  # after_stat(density) replaces the deprecated ..density..
      geom_histogram(binwidth = 1, color="black") +
      facet_grid(sex~.)
    p2

This plot makes it much easier to notice that men are, on average, taller. If instead of histograms we want the more compact summary provided by boxplots, then we align them horizontally, since, by default, boxplots move up and down with changes in height.

Country income

For our last principle, we look at boxplots comparing country income between years in each continent. When comparing income data between 1972 and 2002 across continents, we made a figure similar to the one below.
    library(gapminder)
    data(gapminder)
    gapminder %>%
      filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
      # gdpPercap is already GDP per person, so divide only by 365 days
      mutate(dollars_per_day = gdpPercap/365) %>%
      mutate(labels = paste(year, continent)) %>%
      ggplot(aes(labels, dollars_per_day)) +
      geom_boxplot() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      scale_y_continuous(trans = "log2") +
      ylab("Income in dollars per day")

Note that, for each continent, we want to compare the distributions from 1972 to 2002. The default is to order alphabetically, so the labels with 1972 come before the labels with 2002, making the comparisons challenging.

Comparison is easier when boxplots are next to each other, with adjacent visual cues:

    gapminder %>%
      filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
      # gdpPercap is already GDP per person, so divide only by 365 days
      mutate(dollars_per_day = gdpPercap/365) %>%
      mutate(labels = paste(continent, year)) %>%
      ggplot(aes(labels, dollars_per_day)) +
      geom_boxplot() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      scale_y_continuous(trans = "log2") +
      ylab("Income in dollars per day")

Comparison is even easier when color is used to denote the two things compared. Ease comparison by using color.

    gapminder %>%
      filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
      # gdpPercap is already GDP per person, so divide only by 365 days
      mutate(dollars_per_day = gdpPercap/365, year = factor(year)) %>%
      ggplot(aes(continent, dollars_per_day, fill = year)) +
      geom_boxplot() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      scale_y_continuous(trans = "log2") +
      ylab("Income in dollars per day")

Use labels instead of legends

For trend plots, we recommend labeling the lines rather than using legends, as the viewer can quickly see which line is which country. This suggestion actually applies to most plots: labeling is usually preferred over legends.
We demonstrate how we can do this using the life expectancy data. We define a data table with the label locations and then use a second mapping just for these labels:

    countries <- c("China", "Germany")
    # x and y give the position of each country's label on the plot
    labels <- data.frame(country = countries, x = c(1972, 1962), y = c(60, 72))
    gapminder %>%
      filter(country %in% countries) %>%
      ggplot(aes(year, lifeExp, col = country)) +
      geom_line() +
      geom_text(data = labels, aes(x, y, label = country), size = 5) +
      theme(legend.position = "none")

Think of the color blind

Color blindness is common: by widely cited estimates, roughly 1 in 12 men and 1 in 200 women are affected. Unfortunately, the default colors in the ggplot2 package are not optimal for these viewers. However, ggplot2 does make it easy to change the color palette used in plots. Here is an example of how we can use a color blind friendly palette (http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/#a-colorblind-friendly-palette). Another resource is the package RColorBrewer, which has several color palettes, including color blind friendly ones.

    library(RColorBrewer)
    par(mar = c(3, 4, 2, 2))
    # show only the palettes that are safe for color blind viewers
    display.brewer.all(colorblindFriendly = TRUE)

Main Takeaways

- Use position and length, rather than angles or area
- In general, don’t use pie charts
- Do not distort quantities
- Order categories in a meaningful way
- Make axes intuitive
- Show the data
- Keep axes the same
- Ease comparisons
- Use labels instead of legends
- Think of the color blind

Section 7: Data Visualization Choices

There are many visualization techniques you can use and many options available in the ggplot2 package.
You can look at the geometries (geoms) listed here (https://ggplot2.tidyverse.org/reference/) to see all the options available. We review some of them below, but due to time and space constraints we won’t be able to cover them all. A helpful resource if you’re trying to decide on an appropriate visualization for your data is the Data to Viz (https://www.data-to-viz.com/) website.

Tables

    head(murders)

           state abb region population total
    1    Alabama  AL  South    4779736   135
    2     Alaska  AK   West     710231    19
    3    Arizona  AZ   West    6392017   232
    4   Arkansas  AR  South    2915918    93
    5 California  CA   West   37253956  1257
    6   Colorado  CO   West    5029196    65

We showed this table earlier. While a table is a helpful reference if you want to look up individual values, it is difficult to draw comparisons or see trends from it.

Scatter plot

    murders %>%
      mutate(murder_rate = total / population * 100000) %>%
      ggplot(aes(state, murder_rate)) +
      geom_point() +
      coord_flip() +
      xlab("")

A scatter plot allows comparison along a common scale. However, the order of the categories should be meaningful.

Histogram

Earlier we saw this plot used to compare male and female heights. We used a histogram for each group, with the goal of showing how the distribution of heights differs between females and males:

    heights %>%
      ggplot(aes(height, after_stat(density))) +
      geom_histogram(binwidth = 1, color = "black") +
      facet_grid(. ~ sex)

A histogram visualizes the distribution of data. Data are grouped into bins or intervals. Histograms show the shape of the data and can help identify extreme values or gaps in the data, but they are not well suited to comparisons across many groups.
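To make the binning behind a histogram concrete, here is a minimal sketch in base R. The ten height values are made up purely for illustration, and base R's hist() stands in for geom_histogram(), whose default bin boundaries can differ. It shows how counts are turned into the density scale used on the y-axis of the plots above: each count is divided by the number of observations times the bin width, so the bar areas add up to 1.

```r
# Hypothetical heights in inches, invented for this illustration
x <- c(61, 62, 62, 63, 64, 64, 64, 65, 66, 68)

binwidth <- 2
breaks <- seq(60, 68, by = binwidth)   # bin edges: 60, 62, 64, 66, 68

# plot = FALSE returns the binned data instead of drawing the histogram
h <- hist(x, breaks = breaks, plot = FALSE)

h$counts    # number of observations per bin: 3 4 2 1

# density = count / (number of observations * bin width)
manual_density <- h$counts / (length(x) * binwidth)
manual_density                         # same values as h$density

# the bar areas (density * width) sum to 1
sum(h$density * binwidth)
```

Because densities are relative to group size, two groups with different numbers of observations can be compared on the same y-axis, which is why the faceted height histograms use density rather than raw counts.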

Box (Whisker) Plot

    gapminder %>%
      filter(year %in% c(1972, 2002) & !is.na(gdpPercap)) %>%
      # gdpPercap is already GDP per person, so divide only by days per year
      mutate(dollars_per_day = gdpPercap / 365, year = factor(year)) %>%
      ggplot(aes(continent, dollars_per_day, fill = year)) +
      geom_boxplot() +
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
      scale_y_continuous(trans = "log2") +
      ylab("Income in dollars per day")

A box plot shows the median, the first quartile, and the third quartile of the data. By default in ggplot2, the whiskers reach the smallest and largest values within 1.5 times the interquartile range of the box, and points beyond the whiskers are drawn individually as outliers.

Bar chart

    murders %>%
      mutate(murder_rate = total / population * 100000) %>%
      mutate(state = reorder(state, murder_rate)) %>%
      ggplot(aes(state, murder_rate)) +
      geom_col() +   # same as geom_bar(stat = "identity")
      coord_flip() +
      xlab("")

A bar chart is helpful for showing changes over time or comparing different groups. When there is a large number of categories or the category names are long, as in the example above that we used earlier, you can switch to a horizontal bar chart.

Line graph

A line graph is useful for showing trends, or how data change over time. Line graphs are used for quantitative data over a continuous interval or time period, where the x-axis is often a timescale. For example, you may want to look at a trend over time by using a time series plot with time on the x-axis and the outcome or measurement of interest on the y-axis. The example below shows life expectancy in the United States, in years, over time:

    library(gapminder)
    data(gapminder)
    gapminder %>%
      filter(country == "United States") %>%
      ggplot(aes(year, lifeExp)) +
      geom_point()

We see that the trend is relatively linear, though there is a sharper increase during the 1970s.
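One way to check a visual impression like this is to compute the change between consecutive observations. The following is a minimal sketch, assuming the gapminder package is installed; the name us_change is ours and not part of the workshop code. Note that gapminder records one observation every five years, so each row describes a five-year interval.

```r
library(gapminder)

# keep the United States and make sure the rows are in time order
us <- subset(gapminder, country == "United States")
us <- us[order(us$year), ]

# gain in life expectancy between each pair of consecutive observations
us_change <- data.frame(
  from = head(us$year, -1),
  to   = tail(us$year, -1),
  gain = diff(us$lifeExp)
)

us_change                                 # one row per five-year interval
us_change[which.max(us_change$gain), ]    # interval with the largest gain
```

If the trend were perfectly linear, every value in the gain column would be the same; an interval with a clearly larger gain corresponds to a steeper stretch in the plot.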
When the points are regularly and densely spaced, as they are here, we can connect them into a curve using the geom_line function, which shows that the data come from a single country.

    gapminder %>%
      filter(country == "United States") %>%
      ggplot(aes(year, lifeExp)) +
      geom_line()

This is particularly helpful when we compare the trends in two countries. We can subset the data to include two countries, one from Europe and one from Asia, and assign a different color to each country:

    countries <- c("China", "Germany")
    gapminder %>%
      filter(country %in% countries & !is.na(lifeExp)) %>%
      ggplot(aes(year, lifeExp, col = country)) +
      geom_line()

Heatmaps

Sometimes you may want to use different magnitudes of color, such as variations in hue or intensity, to show clusters or how a variable changes across space. Heatmaps use color as the visual cue.

    library(RColorBrewer)
    # correlations between the numeric variables in the 2007 data
    world.data <- filter(gapminder, year == 2007)
    world.numeric.data <- world.data %>% select(lifeExp, pop, gdpPercap)
    world.cor <- cor(world.numeric.data, use = "pairwise.complete.obs")
    heatmap(x = world.cor, col = brewer.pal(11, "PiYG"), margins = c(6, 6),
            symm = TRUE, cexRow = 0.8, cexCol = 0.8)
    legend(x = "bottomright", xpd = TRUE,
           legend = c("Negative", "None", "Positive"), cex = 0.8,
           fill = colorRampPalette(brewer.pal(11, "PiYG"))(3))

Section 8: Mapping

To visualize geospatial data, you may want to draw maps. The sf package provides classes for spatial objects, and the ggplot2 package can plot sf objects directly. One resource for maps is the rnaturalearth package, which provides a map of the countries of the entire world.

    library("ggplot2")
    theme_set(theme_bw())
    library("sf")
    library("rnaturalearth")
    library("rnaturalearthdata")

    # country polygons for the whole world, returned as an sf object
    world <- ne_countries(scale = "medium", returnclass = "sf")
    class(world)

    [1] "sf"         "data.frame"

This is the world map with countries colored by estimated population:

    ggplot(data = world) +
      geom_sf(aes(fill = pop_est)) +
      scale_fill_viridis_c(option = "plasma", trans = "sqrt")

We can also use other mapping packages such as sp, tmap, and leaflet. Beyond static maps, there are animated and interactive maps, which we won’t cover in detail here.

Section 9: Shiny

Now that we understand how to create beautiful and informative plots with ggplot2, we can go one step further and create interactive visualization applications with Shiny (https://shiny.rstudio.com/gallery/).

Shiny is an R package that makes it easy to build interactive web apps straight from R. You can host standalone apps on a webpage, embed them in R Markdown documents, or build dashboards. You can also extend your Shiny apps with CSS themes, htmlwidgets, and JavaScript actions.

You can create pretty complicated Shiny apps with no knowledge of HTML, CSS, or JavaScript. On the other hand, Shiny doesn’t limit you to creating trivial or prefabricated apps: its user interface components can be easily customized or extended, and its server uses reactive programming to let you create any type of backend logic you want. You can even share Shiny apps publicly on the web for free with shinyapps.io (https://www.shinyapps.io/).

Shiny Application Examples

Shiny allows your plots to be interactive, so a viewer can decide which variables, subsets of data, or types of plots they look at.
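The kind of interactivity described above can be sketched in a few lines. This is only an illustrative sketch, not one of the workshop's apps: it assumes the shiny, ggplot2, and gapminder packages are installed, it uses the gapminder life expectancy data rather than the Covid-19 dataset, and the input names country and years and the output name trend are our own choices.

```r
library(shiny)
library(ggplot2)
library(gapminder)

# User interface: the viewer picks a country and a range of years
ui <- fluidPage(
  titlePanel("Life expectancy over time"),
  sidebarLayout(
    sidebarPanel(
      selectInput("country", "Country:",
                  choices = levels(gapminder$country),
                  selected = "United States"),
      sliderInput("years", "Years:",
                  min = 1952, max = 2007, value = c(1952, 2007),
                  step = 5, sep = "")
    ),
    mainPanel(plotOutput("trend"))
  )
)

# Server: the plot is redrawn whenever an input changes
server <- function(input, output, session) {
  output$trend <- renderPlot({
    d <- subset(gapminder,
                country == input$country &
                  year >= input$years[1] & year <= input$years[2])
    ggplot(d, aes(year, lifeExp)) +
      geom_line() +
      geom_point() +
      ylab("Life expectancy (years)")
  })
}

# Build the app object; it is not launched until it is printed or run
app <- shinyApp(ui, server)
```

Typing app at the console, or calling runApp(app), launches the app in a browser. The ui object only describes the controls and where the plot goes, while the code inside renderPlot() reads input$country and input$years, so Shiny knows to rerun it when the viewer changes either one.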

Simple Example

Here is an example using the Covid-19 dataset. We can choose which state to examine, decide what variable to compare by, and subset the range of dates we’d like to consider.

Full Shiny Application Examples

Shiny applications can be incredibly complex, making them useful across many industries. Shiny is used in academia as a teaching tool for statistical concepts, by big pharma companies to speed collaboration between scientists and analysts during drug development, and by Silicon Valley tech companies to set up realtime metrics dashboards that incorporate advanced analytics.

Here’s a screenshot of an app “created to help people living and working in Scotland explore how geographical areas have changed over time or how they compare to other areas, across a range of indicators of health and wider determinants of health” (view it here (https://shiny.rstudio.com/gallery/scotpho-profiles.html)):

Here’s a screenshot of an app that gives “a full picture of New Zealand’s trading profile through intuitive and interactive graphs and tables” (view it here (https://shiny.rstudio.com/gallery/nz-trade-dash.html)):

Section 10: Data Visualization in Practice

We’ve shown you how we used ggplot2 code to create insightful and aesthetically pleasing plots, but there is a growing availability of data and tools out there for you to explore.
Data visualization is useful across industries, academia, and government. A particularly effective example is a Wall Street Journal article (http://graphics.wsj.com/infectious-diseases-and-vaccines/) showing data related to the impact of vaccines on battling infectious diseases. One of the graphs shows measles cases by US state through the years, with a vertical line marking when the vaccine was introduced.

The University of the Witwatersrand has a great Covid-19 dashboard (https://www.covid19sa.org/southafricavaccination) that combines various types of data visualization techniques such as mapping, bar charts, and line graphs.

Another striking example comes from the New York Times (graphics8.nytimes.com/images/2011/02/19/nyregion/19schoolsch/19schoolsch-popup.gif), which summarizes scores from the NYC Regents Exams. As described in the article, these scores are collected for several reasons, including to determine if a student graduates from high school. In New York City you need a 65 to pass.

Another good example comes from the Department of Statistics South Africa (https://www.statssa.gov.za/?p=15583), which tracks consumer inflation over the past 13 years. The annual rate for the Consumer Price Index (CPI) shows an upward trajectory in the first half of 2022, and the plot highlights the highest rate since May 2009.

An interactive plot by Statistics Botswana (https://www.statsbots.org.bw/) allows you to examine each individual data point on the total merchandise trade plot, which includes imports, exports, and trade balance.

A striking population pyramid comes from the Zimbabwe National Statistics Agency (https://www.zimstat.co.zw/).

Data visualization is the strongest tool of exploratory data analysis. You can use programming to bridge the gap between an idea and an interesting visual to communicate it.

“The greatest value of a picture is when it forces us to notice what we never expected to see.” - John Tukey, father of EDA

Additional Resources and Tools for Next Steps

Many of the figures we showed in the previous sections were produced with the ggplot2 package. While we don’t have time to learn the basics of R or the packages, we will provide some helpful links.

- R for Data Science (https://r4ds.had.co.nz/index.html): a great book to teach you the basics of R and data visualization
- RStudio Cheatsheets (https://www.rstudio.com/resources/cheatsheets/): helpful cheatsheets for commonly used packages such as ggplot2 for data visualization and dplyr for data transformation
- ggplot2: Elegant Graphics for Data Analysis (https://ggplot2-book.org/): book on the theory and grammar of graphics for ggplot2
- ggplot2 gallery (https://r-graph-gallery.com/ggplot2-package.html): examples of graphics made with ggplot2
- How Charts Lie (https://albertocairo.com/): book by Alberto Cairo about data visualization
- Stack Overflow (https://stackoverflow.com/): a website that most of your Google searches will direct you to, where users answer each other’s questions with really useful bits of code
- RStudio online learning (https://www.rstudio.com/online-learning/): a list by the creators of RStudio filled with additional resources for students interested in learning more about programming in R

Acknowledgments and Questions

This workshop had help from lots of other people, including Patrick Emedom-Nnamdi, Intekhab Hossain, Jenna Landy, Shirley Liao, Jonathan Luu, Heather Mattie, Marcello Pagano, Harrison Reeder, and Luli Zou.

If you have any questions, feel free to reach out to me at amyzhou@g.harvard.edu.