Stat 579: Graphical Displays Ranjan Maitra

2220 Snedecor Hall
Department of Statistics
Iowa State University.
Phone: 515-294-7757
maitra@iastate.edu
Plots and Graphical Parameters
We start with a simple example.
illit <- state.x77[ ,3]
murder <- state.x77[, 5]
plot(illit, murder)
The R function plot() is the most frequently used
high-level plotting function. By default, this results in a plot
of points. Other options are “lines” (l) or both “points and
lines” (b) or “none” (n).
x <- seq(from = -3, to = 3, length.out = 100)
plot(x, dnorm(x), type = "l")
provides a plot of the standard normal density function.
The axes, scaling, titles, labels, plotting symbols, colors,
etc., are the results of values set by default.
Other graphical parameters, such as pch=, col=, lty=, and
ylab=, can be set as additional arguments in the
high-level plotting function call. Most of these
parameters may also be set in a separate call to the
par() function; calling par() with no arguments lists the
current values of all graphical parameters.
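As a minimal sketch of the two ways of setting parameters described above (the particular symbols and colors are arbitrary choices for illustration), the first call sets parameters for one plot only, while par() changes them for all later plots until the old values are restored:

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]

# Parameters given in the call affect this plot only.
plot(illit, murder, pch = 16, col = "blue",
     xlab = "Illiteracy", ylab = "Murder Rate")

# par() changes the settings for subsequent plots and returns the
# previous values, which makes it easy to restore them afterwards.
old <- par(pch = 3, col.lab = "red")
plot(illit, murder)
par(old)
```

Saving the value returned by par() and passing it back is a common way to avoid leaving changed settings behind.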
Adding Features to Existing Graphs
Suppose we want to add a fitted regression line in purple
to the scatterplot.
> plot(illit, murder, pch=3, xlab="Illiteracy", ylab="Murder Rate")
> regout <- lsfit(illit,murder)
> yhat <- regout$coef[1]+regout$coef[2]*illit
> lines(illit, yhat, col=6)
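An alternative sketch, not taken from the slides: since the fitted line is straight, abline() can draw it directly from the intercept and slope returned by lsfit(), without computing fitted values first.

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
regout <- lsfit(illit, murder)

plot(illit, murder, pch = 3, xlab = "Illiteracy", ylab = "Murder Rate")
# abline() takes an intercept (a) and a slope (b) and draws the
# line across the whole plotting region.
abline(a = regout$coefficients[1], b = regout$coefficients[2], col = 6)
```

Unlike lines(), abline() extends the line to the edges of the plot rather than stopping at the smallest and largest x values.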
If it is desired that different colors be used for the plotting
symbol, the lines, etc., it is preferable to build the graph
from scratch.
For example, an empty graph may be produced using
type="n" as an argument to the plot() function. The graph
will have the axes properly scaled but no symbols or lines
will be plotted to represent the points. The points() and
the lines() functions may then be used to add points
and/or lines with the desired parameter settings, as
mentioned earlier.
Adding Features to Existing Graphs (continued)
> plot(illit, murder, type="n", xlab="Illiteracy", ylab="Murder Rate")
> points(illit, murder, col=4, pch=8)
> lines(illit, regout$coef[1]+regout$coef[2]*illit, col=6)
The above R expressions add points and the regression
line in different colors to the blank graph. Note that the
graphical parameters col.axis=, col.lab=, and col.main= may
be used to specify colors for axis annotation, labels, and
titles, respectively. Besides lines() and points(), abline()
and text() are two other R functions that allow the user to
add features to an existing plot. These are called
low-level plotting functions.
> text(illit, murder, state.abb, cex=.7, adj=-.5, col=6)
> text(.5, 14, cex=1.5, adj=0, "Plot of Murder Rate vs. Illiteracy")
> label <- paste("Mean Murder Rate =", mean(murder))
> text(2, 3, adj=0, label)
Next, the residuals are plotted against illit, and a
horizontal reference line is added at zero.
> plot(x = illit, y = regout$resid, col = 4, pch = 5, ylab = "Residuals")
> abline(h = 0, lty = 3, col = 2)
Generic Graphical Functions
A powerful feature of several high-level plotting functions is
that they perform generic plotting of R objects, i.e., the
function recognizes an R object that is used as its
argument and employs a different method (i.e., code) for
producing a predefined plot appropriate for representing
the object.
These objects are typically produced by a statistical
function in R and are categorized as belonging to a class
of objects.
For example, if the x-variable is a factor object, the plot()
function produces side-by-side boxplots of the y-variable.
This feature is an attribute of the object-oriented
programming capability of R.
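The dispatch described above can be seen in a small sketch. The groups and responses here are made up for illustration; getS3method() is used only to show that a separate plotting method exists for factors.

```r
grp <- factor(rep(c("a", "b", "c"), each = 5))   # made-up groups
y   <- c(1:5, 3:7, 6:10)                         # made-up responses

class(grp)                  # the class that plot() inspects
plot(grp, y)                # factor x: side-by-side boxplots
plot(as.integer(grp), y)    # numeric x: ordinary scatter plot

# The method that plot() selects for a factor can be looked up explicitly.
m <- getS3method("plot", "factor")
is.function(m)
```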
Generic Graphical Functions – Example
> diet <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
  4, 4, 4, 4, 4, 4, 4, 4)
> time <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66,
  71, 67, 68, 68, 56, 62, 60, 61, 63, 64, 63, 59)
> blood2 <- data.frame(diet = diet, time = time)
> plot(blood2$diet, blood2$time)
> diet2 <- factor(blood2$diet)
> plot(diet2, blood2$time)
In the first case, a scatter plot of the data is obtained,
whereas in the second case, a set of boxplots is produced.
This is simply because diet2 is a factor object. If data of
this type are available in a data frame, then the same plot
can be obtained directly by using a formula as the first
argument to the plot() function.
Note that a formula is an R object.
> plot(weight ~ feed, data = chickwts)
> states <- data.frame(murder, illit)
> plot(murder ~ illit, data = states)
Generic Graphical Functions – Output of lm
If the object created by the lm() function is used as the
first argument of the plot() function, it will generate a
series of diagnostic plots, each displayed in response to a
prompt.
The lm() function is used here to fit the same straight line
model as before and the result is assigned to an object
named lmout.
The lmout object thus created may be used to create
diagnostic plots, such as the residual plot shown next.
> lmout <- lm(murder ~ illit)
> plot(lmout)
Hit <Return> to see next plot:
.....
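As a sketch of one way to avoid the prompts (assuming the default set of four diagnostic plots), the plots can be placed on a single page with par(mfrow=), or a single plot can be requested through the which= argument of the lm plotting method:

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
lmout  <- lm(murder ~ illit)

old <- par(mfrow = c(2, 2))   # a 2-by-2 page holds the four default plots
plot(lmout)
par(old)                      # restore the single-plot layout

plot(lmout, which = 2)        # only the normal Q-Q plot of the residuals
```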
Other High-level Graphics Functions
Although methods for diagnostics are available in generic
functions such as plot(), it may be desirable to obtain
them directly using high-level graphics functions. E.g., a
normal probability plot of the residuals from the straight
line model fitted earlier to the murder and illiteracy
variables can be drawn using the qqnorm() function:
> qqnorm(regout$resid, xlab="Standard Normal Quantiles",
  ylab="Residuals", main="Normal Probability Plot of Residuals")
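A small addition that is often useful here, shown as a sketch rather than as part of the original example: qqline() is a low-level function that adds a reference line through the first and third quartiles, which makes departures from normality easier to judge.

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
regout <- lsfit(illit, murder)
res    <- regout$residuals

qqnorm(res, xlab = "Standard Normal Quantiles", ylab = "Residuals",
       main = "Normal Probability Plot of Residuals")
qqline(res, lty = 2, col = 2)   # reference line through the quartiles
```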
The blood2 data are converted to a data frame suitable for
use with the function aov(), which performs a one-way
ANOVA when the variable on the right-hand side of the
formula is a factor. Here are the boxplots of the
original data and of the residuals for verifying assumptions:
> blood3 <- data.frame(time=blood2$time, diet=factor(blood2$diet))
> blood.aov <- aov(time ~ diet, data=blood3)
> summary(blood.aov)
> boxplot(time ~ diet, data=blood3)
> boxplot(blood.aov$res ~ diet, data=blood3)
More High-level Graphics Functions
E.g., a barplot of the means for each feed type in the chickwts data:
> barplot(with(chickwts, tapply(X = weight, INDEX = feed, FUN = mean)),
  ylab="Mean Weight (in gms)", xlab="Feed Type",
  main="Bargraph of Mean Weight")
The cut() function creates categorical variables from
continuous ones by creating non-overlapping intervals
according to breaks or cut-off values provided by the
user. This results in a factor unless labels=FALSE is
used, in which case we get an integer vector of interval
codes, which can be converted to a factor if so desired.
> data(mtcars)
> cut(x = mtcars[,"wt"], breaks = c(0, 2.5, 3.5, 5.5))
> carsize <- cut(mtcars[,"wt"], c(0,2.5,3.5,5.5), labels=FALSE)
> is.factor(carsize)
> carsize <- cut(mtcars[,"wt"], c(0,2.5,3.5,5.5), labels=c("1","2","3"))
> is.factor(carsize)
> carsize <- cut(x = mtcars[,"wt"], breaks = c(0,2.5,3.5,5.5),
  labels=c("Compact", "Midsize", "Large"))
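The following sketch illustrates two details of cut() that the calls above rely on: with labels=FALSE the result is a vector of integer codes that factor() can turn into a labelled factor, and the intervals are closed on the right by default. The names brk and codes are introduced here only for illustration.

```r
wt  <- mtcars[, "wt"]
brk <- c(0, 2.5, 3.5, 5.5)

# Integer codes 1, 2, 3 instead of a factor.
codes <- cut(wt, breaks = brk, labels = FALSE)

# Convert the codes to a factor with descriptive labels.
carsize <- factor(codes, levels = 1:3,
                  labels = c("Compact", "Midsize", "Large"))
table(carsize)

# Intervals are of the form (a, b], so a weight of exactly 2.5
# belongs to the first interval.
cut(2.5, breaks = brk, labels = FALSE)
```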
The cut() function (continued)
Categorical variables are useful for creating contingency
tables, testing hypotheses using analysis of variance
models, or simply plotting bar charts, as shown below.
First, a statistic of the variable that is to be displayed in the
barchart needs to be computed for observations grouped
by the values of the category variable(s).
The sample medians of gas mileage for cars of each size
are computed in this example using the factor carsize
and the tapply() function. These medians are then
plotted in a bar chart using the barplot() function:
> carmeds <- tapply(mtcars[,"mpg"], carsize, median)
> barplot(carmeds, names = levels(carsize), xlab =
"Size of Car", ylab = "Median Gas Mileage", main =
"Gas Mileage by Size")
The split() function
A similar approach can be used to obtain side-by-side
boxplots of gas mileage categorized by car size. The
split() function is first used to partition the gas mileage
variable values into the three groups, as components of a
list object. If this object is used as the first argument to the
boxplot() function, the desired boxplots will be drawn:
> split(mtcars[,"mpg"], carsize)
> boxplot(split(mtcars[,"mpg"], carsize), xlab="Size of Car",
  ylab="Gas Mileage in Miles/Gallon",
  main="Gas Mileage Distribution by Size")
The factor carsize may be used to plot characters (C, M,
or L, say) instead of symbols to identify points, e.g., in a
scatter plot of mileage against weight:
> plot(mtcars[,"wt"], mtcars[,"mpg"], type="n",
xlab="Weight",ylab="Mileage", main="Fuel Use vs. Weight")
> char <- substring(carsize,1,1)
> text(mtcars[,"wt"], mtcars[,"mpg"], char, cex=.7, font=2)
> legend(4,30, levels(carsize), pch="CML")
The legend() function describes what each symbol
represents.
Some low-level graphics functions
The low-level graphics functions segments() and arrows()
are useful for adding line segments to a plot.
As an example, the plot of residuals of the regression of
murder variable against the illit variable can be
enhanced in the following way:
> plot(illit, regout$resid, type="n", ylab="Residuals")
> abline(h=0, lty=3, col=2)
> segments(illit, 0, illit, regout$resid, col=3)
The segments() function draws line segments from the
points whose (x, y) co-ordinates are given by the first two
arguments to the points whose (x, y) co-ordinates are given
by the next two arguments. Another example:
> plot(illit, murder)
> lines(illit, yhat)
> segments(illit,murder,illit,yhat)
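The arrows() function takes the same four co-ordinate arguments as segments() and adds an arrowhead. As a sketch, with the offsets 0.4 and 0.05 chosen arbitrarily for illustration, an arrow can be used to point at the observation with the largest absolute residual:

```r
# Uses the state.x77 data set and state.abb, both shipped with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
res    <- lsfit(illit, murder)$residuals

plot(illit, res, ylab = "Residuals")
abline(h = 0, lty = 3, col = 2)

i <- which.max(abs(res))        # index of the largest absolute residual
# Horizontal arrow starting to the left of the point and ending near it;
# length= is the size of the arrowhead in inches.
arrows(x0 = illit[i] - 0.4, y0 = res[i],
       x1 = illit[i] - 0.05, y1 = res[i], length = 0.1, col = 4)
text(illit[i] - 0.4, res[i], state.abb[i], pos = 2, cex = 0.8)
```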
Interacting with Plots in R – the locator() function
Several R functions allow the user to dynamically interact
with an existing plot. Suppose that we want to insert a
character string at a specific position on the plot. Instead of
calculating the position co-ordinates, locator() may be
used to dynamically determine the co-ordinates. First a
plot is created and then locator() is invoked:
> plot(illit,murder)
> locator(1)
$x
[1] 1.613948

$y
[1] 4.501447
With the locator() function, the cursor in the graphics
window turns into a crosshair, and mouse-button-1
(left button) is clicked at a location on the plot. This action
selects a point in the plot area. locator() returns the
(x, y)-coordinates of the selected point and prints them in
the R window. These two values (x*, y*) may be used as
the first two arguments of the text() function to place a text
string at or near the selected location on the plot.
The locator() and identify() functions
> text(1.6, 4.5, "Lower Murder Rates Down Here!", cex=.5, adj=0)
The locator() and text() functions can be combined
directly, as illustrated below.
> text(locator(1), "An Outlier ?",cex=1,adj=0,col=2)
The identify() function is another R graphics function
that can be used to interact with a plot.
> identify(illit,murder,state.abb)
After executing the identify() function, move to the
graphics window; again the cursor turns into a crosshair.
Use mouse-button-1 to click near a plotted point; this will
cause the name of the observation identified by that point
to be plotted on the graph. This can be repeated for any
number of points desired. To exit the identify mode, click
mouse-button-2 (middle button) while in the graphics
window; on devices other than X11, a right-click or the Esc
key is typically used instead.
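Because locator() and identify() wait for mouse input, they are only meaningful in an interactive session. The sketch below wraps the locator()/text() combination in a hypothetical helper, label_click(), which is not part of R, so that a script containing it can also be run non-interactively without stalling.

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]

# label_click() is a hypothetical helper written for this sketch.
label_click <- function(label, ...) {
  if (!interactive()) {
    message("locator() needs an interactive graphics device; skipping.")
    return(invisible(NULL))
  }
  pt <- locator(1)                           # wait for one mouse click
  if (is.null(pt)) return(invisible(NULL))   # no point was selected
  text(pt, labels = label, adj = 0, ...)
  invisible(pt)                              # return the clicked position
}

plot(illit, murder)
label_click("An Outlier ?", col = 2)
```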
Multiple Plots on a Single Page – the par() function
To create an n by m array of plots on a single page, i.e., in a
single graphics window, one can use the graphical
parameter mfrow= or mfcol=.
The parameter is specified as a vector of the form c(nr,
nc), and the subsequent plots will be arranged in an
nr-by-nc array on the graphical device.
Plots are drawn by column (mfcol=) or by row (mfrow=).
The par() function is invoked to set parameters.
The mar= or mai= parameters may be used to reserve
space in the margins of the individual plots, in units of
lines or inches, respectively. Similarly, oma= or omi= may
be used to reserve space in the outer margins of the
entire page.
> par(mfrow=c(2,2), mar=rep(4,4))
> plot(illit, murder, col = 2, pch = 4, main = "Murder vs. Illiteracy")
> lines(illit, lmout$fitted, col=5, lty = 4)
> qqnorm(lmout$res, col=6, pch=16, main="Normal Plot of Residuals")
> boxplot(lmout$res, ylab="Residuals", main="Boxplot")
> hist(lmout$res, xlab="Residuals", main="Histogram of Residuals")
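As a sketch of how the outer margins can be used (the layout and margin sizes here are arbitrary choices), oma= reserves two lines at the top of the page, and mtext() with outer=TRUE writes a title for the whole page into that space:

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
lmout  <- lm(murder ~ illit)

# One row, two columns; two lines of outer margin at the top.
old <- par(mfrow = c(1, 2), mar = c(4, 4, 2, 1), oma = c(0, 0, 2, 0))
plot(illit, murder, main = "Data")
hist(residuals(lmout), xlab = "Residuals", main = "Residuals")
mtext("Murder Rate vs. Illiteracy", outer = TRUE, cex = 1.2)
par(old)   # restore the previous layout and margins
```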
Displaying Multivariate Data – I
Many plotting functions are available for displaying
multivariate data. Some examples are those that plot
scatterplot matrices, star plots, or parallel coordinate plots.
The pairs() function creates a graph that consists of a
scatterplot for each combination of variables that is
supplied through an R object argument. A panel=
argument can be used to add overlays on each scatterplot.
> pairs(state.x77)
> pairs(state.x77[,2:6],cex=.5,pch=3)
> pairs(state.x77[,2:6],panel=function(x,y){points(x,y);
lines(lowess(x,y))})
> stars(mtcars,cex=.5)
> library(MASS)   # parcoord() is provided by the MASS package
> parcoord(state.x77)
> parcoord(state.x77[, c(4, 6, 2, 5, 3)])
Displaying Multivariate Data – II
The contour() and persp() functions create contour plots
and three-dimensional surface plots. Here, these
functions are used to plot the contours and a 3-D graph of a
surface fitted to a set of topographic measurements in topo
(from the MASS package). First, a surface is fitted to the
data using the loess() function in R. The fitted surface is
used to obtain predicted values z over a 2-dimensional grid
(named topo.grid below). The R generic function
predict() is useful for this purpose. The function
expand.grid(), used below, creates a data frame from
all combinations of the two vectors x and y.
> library(MASS)
> data(topo)
> topo.surf <- loess(z ~ x*y, topo, span=0.25)
> topo.surf
> topo.grid <- list(x=seq(0,6.5,0.2), y=seq(0,6.5,0.2))
> topo.z <- predict(topo.surf, expand.grid(topo.grid))
> contour(topo.grid$x, topo.grid$y, topo.z)
> points(topo)
> persp(topo.grid$x, topo.grid$y, topo.z)
> persp(topo.grid$x, topo.grid$y, topo.z, theta=40, phi=40)
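To see how expand.grid() lays out a grid for contour() and persp(), here is a minimal sketch on a made-up 3-by-2 grid with a made-up surface z = x + 10y. The first variable varies fastest, so the values can be arranged into a matrix whose rows correspond to x and whose columns correspond to y, which is the layout these functions expect.

```r
g  <- list(x = 1:3, y = 1:2)      # a tiny made-up grid
eg <- expand.grid(g)              # one row per (x, y) combination
eg

# Made-up surface values, arranged with rows indexed by x, columns by y.
z <- matrix(eg$x + 10 * eg$y, nrow = length(g$x))
z

contour(g$x, g$y, z)
```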
Displaying Multivariate Data – III
Conditioning plots, or coplots, are produced by the R
function coplot(). Two variables are plotted against
each other in a series of plots conditioned on the values of
a (continuous-valued) third variable. This makes it
possible to examine how the relationship between the first
two variables depends on the third variable.
The third variable is allowed to take values in a set of
overlapping ranges. These intervals may be determined by
the use of the co.intervals() function.
If they are not provided, R determines them using
co.intervals(x, number=6, overlap=0.5), where x
is the conditioning variable. The following produces a
coplot of the murder variable against the illit variable
conditioned on the income variable in the state.x77
data set:
> income <- state.x77[,2]
> coplot(murder ~ illit | income)
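As a sketch of supplying the intervals explicitly, co.intervals() can be called first and its result passed to coplot() through the given.values= argument. The choice of four intervals with overlap 0.25 is purely illustrative.

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]
income <- state.x77[, "Income"]

# Four overlapping income ranges instead of the default six.
iv <- co.intervals(income, number = 4, overlap = 0.25)
iv    # one row per interval: lower and upper limits

coplot(murder ~ illit | income, given.values = iv)
```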
Printing Plots to a File
R provides a number of ways to print plots to a file.
dev.copy2pdf() is perhaps what you will use most
frequently in this class.
However, for many journal submissions, you may need an
encapsulated PostScript file of the plot. In this
case, we use dev.copy2eps() to get a .eps file.
In either case, we get a file of the plot in the (last) graphics
window.
It is preferable to create a .eps file and then convert it
into a .pdf file, but on operating systems that one pays
for, this may not in general be easy.
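dev.copy2pdf() and dev.copy2eps() copy a plot that is already displayed on screen. A different route, sketched here because it also works in scripts run without a screen device, is to open a file device with pdf(), draw the plot, and close the device with dev.off(). The file name below is an arbitrary choice.

```r
# Uses the state.x77 data set that ships with R.
illit  <- state.x77[, "Illiteracy"]
murder <- state.x77[, "Murder"]

# Write to a temporary directory; any writable path would do.
out <- file.path(tempdir(), "murder-vs-illiteracy.pdf")

pdf(out, width = 6, height = 5)   # open a PDF device (size in inches)
plot(illit, murder, xlab = "Illiteracy", ylab = "Murder Rate")
dev.off()                         # close the device so the file is written

file.exists(out)
```

In an interactive session with the plot on screen, dev.copy2pdf(file = out) would produce a similar file from the displayed plot.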