ggplot2 Package (An implementation of the grammar of graphics)
Sandra P. Galeano (sgalea1@tigers.lsu.edu)
05/12/13
ggplot2 (created by Hadley Wickham and Winston Chang) is a data visualization package for R
that is built on an underlying grammar of graphics. It is flexible and makes it easy to produce
visually appealing graphs. Compared with base R graphics, ggplot2 may be more verbose for
simple graphics but is less verbose for complex ones. ggplot2 also makes it easy to add, remove,
or alter components of a plot, or to build up a plot from multiple data sources. It can be used to
construct basic figures such as histograms, box plots, and scatterplots, or more complex figures
such as maps and heat maps. ggplot2 offers two main functions that can be used to construct
the same kinds of figures: qplot() and ggplot(). qplot() mimics R's traditional plot() function
and provides a simpler syntax than ggplot(). However, ggplot() offers the most features and
flexibility and is therefore recommended for constructing complex plots.
This self-tutorial goes step by step to help you start using ggplot2 to create basic figures
with the ggplot() function. Although the examples provided here show how to create and
modify scatterplots, they also explain the basic grammar that ggplot2 uses to construct any
kind of figure. In this self-tutorial I used a modified version of Fisher's Iris dataset (Fisher 1936),
which I have called “irismod”. The original dataset represents the morphologic variation of
three species of Iris (Iris setosa, Iris virginica, and Iris versicolor) based on measurements of the
length and width of petals and sepals. In “irismod”, I modified the original dataset by adding a
“site” column (north and south) only for the purpose of this exercise.
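The irismod.txt file itself is not reproduced in this text. For readers who want to experiment with the commands that follow, a rough stand-in can be built from R's built-in iris data. This is only a sketch: the name irismod_demo and the alternating site labels are assumptions made for illustration, and the values will not match the author's file.

```r
# Stand-in for irismod, built from R's built-in iris data.
# Assumption: the alternating site labels are illustrative only.
irismod_demo <- iris[, c("Petal.Width", "Petal.Length",
                         "Sepal.Width", "Sepal.Length", "Species")]
irismod_demo$Site <- factor(rep(c("South", "North"),
                                length.out = nrow(irismod_demo)))

# Inspect the structure and the first rows
str(irismod_demo)
head(irismod_demo, 5)
```

Substituting irismod_demo for irismod in the later commands should produce comparable, though not identical, figures.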
Documentation for the ggplot2 package is available from CRAN at the following websites:
http://cran.r-project.org/web/packages/ggplot2/index.html
http://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf
I made this self-tutorial based on the following three resources:
- Chang, Winston. 2013. R Graphics Cookbook. O'Reilly Media, Inc. USA.
- Ram, Karthik. 2013. A quick introduction to ggplot2. Available at: http://www.rbloggers.com/a-quick-introduction-to-ggplot2/
- Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer
Science+Business Media. USA.
Getting started:
Note: Installing R version 3.0.0 may be required to run some functions.
Installing and loading ggplot2 package and dependencies:
> install.packages("ggplot2", dependencies = TRUE)
Install ggplot2 and automatically install its dependencies (other packages needed by ggplot2)
> library("ggplot2")
Load ggplot2
Installing and loading reshape2 and plyr packages:
> install.packages("reshape2")
Install reshape2. This package allows for restructuring and aggregating data
> install.packages("plyr")
Install plyr. This package allows for splitting, applying and combining data
> library("reshape2")
Load reshape2
> library("plyr")
Load plyr
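Before installing, it can be useful to check which of these packages are already present. The following is a minimal sketch; missing_packages is a hypothetical helper written for this example and is not part of any package.

```r
# Packages used in this tutorial
pkgs <- c("ggplot2", "reshape2", "plyr")

# Hypothetical helper: returns the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing ones
}
```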
Reading the data (iris database modified):
> irismod <- read.table(file = "c:/Users/Sandra/Desktop/R class/ggplot2/irismod.txt", header = TRUE)
Read the content of the irismod.txt dataset (header = TRUE uses the first line as column names)
> irismod[1:5,]
Display the first 5 rows of the irismod.txt dataset
  Petal.Width Petal.Length Sepal.Width Sepal.Length Species  Site
1         0.2          1.4         3.3          5.0  Setosa South
2         0.2          1.0         3.6          4.6  Setosa North
3         0.2          1.6         3.1          4.8  Setosa South
4         0.1          1.4         3.6          4.9  Setosa North
5         0.2          1.3         3.2          4.4  Setosa South
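The reading step can be tried without the author's file by writing a small table to a temporary file and reading it back. This sketch uses a few rows of the built-in iris data as a stand-in; tmp and demo_data are names chosen for this example only.

```r
# Write a few rows of the built-in iris data to a temporary text file
tmp <- tempfile(fileext = ".txt")
write.table(head(iris, 5), file = tmp, row.names = FALSE, quote = FALSE)

# header = TRUE tells read.table() that the first line holds column names
demo_data <- read.table(file = tmp, header = TRUE)

# Check column names and types, then show the rows
str(demo_data)
demo_data[1:5, ]
```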
Exploring the data using a scatterplot:
> ggplot(data = irismod, aes(x = Sepal.Length, y = Sepal.Width)) + geom_point()
The command “aes” stands for “aesthetics” and refers to “something you can see”, such as position,
shape, color, etc. The command “geom” stands for geometric and refers to “marks to put in a plot”,
such as line, point, bar, etc. Plots in ggplot are constructed by adding multiple "aes" and "geom".
There is no limit to the number of “geom” you can add to a line of code to make a plot. In this case the
position for each variable is provided under the aes command (x-axis and y-axis), and “geom_point" is
selected to represent the data in the scatterplot.
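The layering idea described above can be sketched by storing a plot in an object and adding one “geom” at a time. The example uses the built-in iris data as a stand-in for irismod, and the object names are chosen for illustration only.

```r
library(ggplot2)

# Base plot: data and aesthetic mappings only, no marks yet
p_base <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Each "+" adds one layer to the plot
p_points <- p_base + geom_point()
p_lines  <- p_points + geom_smooth(method = "lm")

# Count the layers stored in each plot object
length(p_base$layers)
length(p_points$layers)
length(p_lines$layers)
```

Typing the name of a plot object (for example p_lines) at the console draws it.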
Adding color, trendlines, and error shades:
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5)
“color=Species” adds colors automatically by species and “size=3.5” changes point size.
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(aes(shape =
Species), size = 3.5)
“shape=Species” changes shape of points per species
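The commands above mix two different uses of an argument: inside aes(), a property such as color is mapped to a variable, while outside aes() it is set to one fixed value (as with size = 3.5). A minimal sketch of the difference, using the built-in iris data as a stand-in for irismod:

```r
library(ggplot2)

# Mapping: color varies with the Species column, so a legend is created
p_mapped <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3.5)

# Setting: every point gets the same fixed color, and no legend is needed
p_fixed <- ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_point(color = "blue", size = 3.5)

# The mapped plot carries a color aesthetic; the fixed one does not
names(p_mapped$mapping)
names(p_fixed$mapping)
```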
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm", se = FALSE )
“geom_smooth” adds a trendline to the data per species (method = "lm" fits a straight line), and
“se=FALSE” is added to suppress the shaded confidence band around each trendline
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm")
To show the shaded confidence band around each trendline, delete “se=FALSE” from geom_smooth (the default is se=TRUE)
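To see roughly what the trendlines represent, the same kind of straight-line fit can be computed per species with base R's lm(). This is a sketch in the spirit of what geom_smooth(method = "lm") draws, not a description of its internals; it uses the built-in iris data as a stand-in for irismod.

```r
# Fit one linear model per species, similar in spirit to the line that
# geom_smooth(method = "lm") draws for each color group
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Sepal.Width ~ Sepal.Length, data = d)
})

# Intercept and slope of each trendline, one row per species
coefs <- t(sapply(fits, coef))
coefs

# Confidence interval for the fitted mean at Sepal.Length = 5 (first species)
predict(fits[[1]], newdata = data.frame(Sepal.Length = 5),
        interval = "confidence")
```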
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3) +
geom_smooth(method = "lm") + geom_text(aes(label = rownames(irismod)))
“label=rownames(irismod)” is used to add the row name of each observation as a label on its point
Selecting colors:
Selecting colors manually (useful when only few colors are needed):
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_color_manual(values = c("red", "green", "blue"))
“scale_color_manual” allows selecting colors manually for each species
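When colors are given as an unnamed vector, they are assigned to the species in the order of the factor levels. A named vector makes the assignment explicit. The sketch below uses the built-in iris data, whose species levels are lowercase; the names would need to match the levels in irismod, which are assumed here rather than known.

```r
library(ggplot2)

# Named vector: each name must match a level of the mapped variable
species_colors <- c(setosa = "red", versicolor = "green", virginica = "blue")

p_manual <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3.5) +
  scale_color_manual(values = species_colors)

# Check that every species level has a color assigned
all(levels(iris$Species) %in% names(species_colors))
```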
Selecting colors from library with palettes of colors (useful when more than three colors are needed):
> library(RColorBrewer)
Load RColorBrewer to use palettes of colors
> display.brewer.all()
Display palettes of colors
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_color_brewer(palette = "Dark2")
“scale_color_brewer” selects the preferred palette from RColorBrewer
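The colors behind a palette name can also be inspected directly with RColorBrewer. A short sketch, assuming the RColorBrewer package is installed:

```r
library(RColorBrewer)

# Three colors from the "Dark2" palette, returned as hexadecimal strings
dark2_colors <- brewer.pal(n = 3, name = "Dark2")
dark2_colors

# Table of available palettes: maximum number of colors and palette type
head(brewer.pal.info)
```

The resulting vector can be passed to scale_color_manual(values = dark2_colors) when more control is needed than scale_color_brewer provides.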
Selecting colors from color blind friendly palettes:
Colors in palettes are specified as a hexadecimal RGB triplet, such as "#0066CC". The first two digits are
the level of red, the next two green, and the last two blue. The value for each ranges from 00 to FF in
hexadecimal (base-16) notation. At least two colorblind friendly palettes can be chosen to add in figures
using ggplot2: The cbPalette and the cbbPalette.
> cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
Defines the color-blind-friendly palette with gray (cbPalette)
> cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
Defines the color-blind-friendly palette with black (cbbPalette)
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_colour_manual(values=cbbPalette)
“values=cbbPalette” selects the color-blind-friendly palette with black (cbbPalette); use
“values=cbPalette” to select the palette with gray instead
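The hexadecimal notation described above can be checked in base R by decoding the example color "#0066CC" into its red, green, and blue levels and then rebuilding it. This is a small illustration only and is not needed to use the palettes.

```r
# Decode a hexadecimal color into its red, green, and blue levels (0-255)
col2rgb("#0066CC")

# Rebuild the hexadecimal string from the three levels
rgb(0, 102, 204, maxColorValue = 255)

# Convert a single two-digit hexadecimal pair to decimal
strtoi("CC", base = 16L)
```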
Faceting:
Faceting in ggplot2 is used to generate small multiple plots. Each small plot shows a different
subset of the data, allowing comparison in one figure. To generate small multiples, two
commands can be used: facet_wrap and facet_grid.
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_colour_manual(values=cbPalette) + facet_wrap(~
Species)
“facet_wrap” defines subsets by the levels of a single grouping variable, in this case the species variable.
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_colour_manual(values=cbPalette) + facet_grid(Species ~
Site)
“facet_grid” defines subsets by the levels of two grouping variables, in this case the species (rows) and
the site (columns) variables.
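The subsets that each facet shows can be previewed with a simple cross-tabulation before plotting. The sketch below uses a stand-in built from the built-in iris data; facet_demo and its alternating site labels are assumptions for illustration, not the author's data.

```r
library(ggplot2)

# Stand-in data (assumption: alternating site labels are illustrative only)
facet_demo <- iris
facet_demo$Site <- factor(rep(c("South", "North"),
                              length.out = nrow(facet_demo)))

# Number of rows falling in each Species x Site combination
panel_counts <- table(facet_demo$Species, facet_demo$Site)
panel_counts

# One panel per species (3 panels)
p_wrap <- ggplot(facet_demo, aes(Sepal.Length, Sepal.Width)) +
  geom_point() + facet_wrap(~ Species)

# One panel per species-by-site combination (3 x 2 = 6 panels)
p_grid <- ggplot(facet_demo, aes(Sepal.Length, Sepal.Width)) +
  geom_point() + facet_grid(Species ~ Site)
```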
Editing plots:
The command “theme” is used to edit backgrounds, legends, labels, and fonts in ggplot2. A list of the
possible elements can be found under:
> ?theme
Opens the help page that lists all theme elements
> ggplot(irismod, aes(Sepal.Length, Sepal.Width, color = Species)) + geom_point(size = 3.5) +
geom_smooth(method = "lm") + scale_colour_manual(values = cbPalette) +
facet_grid(Species ~ Site) + theme_bw() +
theme(legend.position = "none", axis.text = element_text(size = 11),
axis.title = element_text(size = 14, face = "bold"), strip.text = element_text(size = 12))
“theme_bw()” is used to change the gray background of the entire figure into a white background
while maintaining the grid. “legend.position="none"” removes the legend from the figure,
“axis.text=element_text()” is used to modify the axes text, “axis.title=element_text()” modifies the axes
titles, and “strip.text=element_text()” modifies the facet text, which in this case is the name of each
species and of each site.
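When the same settings are wanted in several figures, the theme calls can be stored in an object and added to any plot. This is a sketch only: tutorial_theme is a hypothetical name, and the built-in iris data stands in for irismod.

```r
library(ggplot2)

# Hypothetical name: a bundle of theme settings that can be reused
tutorial_theme <- theme_bw() +
  theme(legend.position = "none",
        axis.text = element_text(size = 11),
        axis.title = element_text(size = 14, face = "bold"),
        strip.text = element_text(size = 12))

# Add the stored theme to a plot like any other component
p_themed <- ggplot(iris, aes(Sepal.Length, Sepal.Width, color = Species)) +
  geom_point(size = 3.5) +
  facet_wrap(~ Species) +
  tutorial_theme
```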