Stat 579: Graphical Displays
Ranjan Maitra
2220 Snedecor Hall, Department of Statistics, Iowa State University
Phone: 515-294-7757, maitra@iastate.edu

Plots and Graphical Parameters

We start with a simple example.

> illit <- state.x77[, 3]
> murder <- state.x77[, 5]
> plot(illit, murder)

The R function plot() is the most frequently used high-level plotting function. By default, it produces a plot of points. Other options, set through the type= argument, are lines ("l"), both points and lines ("b"), or none ("n").

> x <- seq(from = -3, to = 3, length = 100)
> plot(x, dnorm(x), type = "l")

provides a plot of the standard normal density function. The axes, scaling, titles, labels, plotting symbols, colors, etc., are the results of values set by default. Other graphical parameter values such as pch=, col=, lty=, ylab=, etc., can be set as additional arguments in the high-level plotting function call. Many of these parameters may also be specified in a separate call to the par() function; calling par() with no arguments lists all available graphical parameters.

Adding Features to Existing Graphs

Suppose we want to add a fitted regression line in purple to the plot of the data.

> plot(illit, murder, pch=3, xlab="Illiteracy", ylab="Murder Rate")
> regout <- lsfit(illit, murder)
> yhat <- regout$coef[1] + regout$coef[2]*illit
> lines(illit, yhat, col=6)

If different colors are to be used for the plotting symbols, the lines, etc., it is preferable to build the graph from scratch. For example, an empty graph may be produced using type="n" as an argument to the plot() function. The graph will have the axes properly scaled, but no symbols or lines will be plotted to represent the points. The points() and lines() functions may then be used to add points and/or lines with the desired parameter settings.
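The fitted line above is computed by hand from the lsfit() coefficients. As a minimal sketch, the same line can be cross-checked against lm() and drawn with abline() from its intercept and slope. The use of lm() and abline() here is an illustrative assumption and not part of the example above.

```r
# Same data as in the example above
illit  <- state.x77[, 3]
murder <- state.x77[, 5]

# Least-squares fit with lsfit(), with fitted values computed by hand
regout <- lsfit(illit, murder)
yhat   <- regout$coef[1] + regout$coef[2] * illit

# Assumption for illustration: lm() is used only as a cross-check
lmfit    <- lm(murder ~ illit)
same_fit <- isTRUE(all.equal(unname(yhat), unname(fitted(lmfit))))

# Draw the data and add the line from its intercept (a) and slope (b)
plot(illit, murder, pch = 3, xlab = "Illiteracy", ylab = "Murder Rate")
abline(a = regout$coef[1], b = regout$coef[2], col = 6)
```

Because abline() draws across the whole plot region, it does not depend on the order of the x-values, which lines() does for anything other than a straight line.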
Adding Features to Existing Graphs (continued)

> plot(illit, murder, type="n", xlab="Illiteracy", ylab="Murder Rate")
> points(illit, murder, col=4, pch=8)
> lines(illit, regout$coef[1]+regout$coef[2]*illit, col=6)

The above R expressions add the points and the regression line, in different colors, to the blank graph. Note that the graphical parameters col.axis=, col.lab=, and col.main= may be used to specify colors for the axis annotation, labels, and titles, respectively.

Besides lines() and points(), abline() and text() are two other R functions that allow the user to add features to an existing plot. These are called low-level plotting functions.

> text(illit, murder, state.abb, cex=.7, adj=-.5, col=6)
> text(.5, 14, cex=1.5, adj=0, "Plot of Murder Rate vs. Illiteracy")
> label <- paste(" Mean Murder Rate = ", as.character(mean(murder)))
> text(2, 3, adj=0, label)

Next, the residuals are plotted against illit, and a horizontal reference line is added at zero.

> plot(x = illit, y = regout$resid, col = 4, pch = 5, ylab = "Residuals")
> abline(h = 0, lty = 3, col = 2)

Generic Graphical Functions

A powerful feature of several high-level plotting functions is that they perform generic plotting of R objects: the function recognizes the class of the R object used as its argument and employs a different method (i.e., code) to produce a predefined plot appropriate for that object. These objects are typically produced by a statistical function in R and are categorized as belonging to a class of objects. For example, if the x-variable is a factor object, the plot() function produces side-by-side boxplots of the y-variable. This feature is part of the object-oriented programming capability of R.
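The dispatch mechanism described above can be sketched with a toy example. The class name "toycurve" and its plot method are hypothetical and invented purely for illustration; they are not part of R or of these notes.

```r
# Hypothetical class "toycurve", invented for illustration only
curve_obj <- structure(list(x = 1:10, y = (1:10)^2), class = "toycurve")

# A plot() method for the hypothetical class; plot() will find it
# because its name has the form plot.<class>
plot.toycurve <- function(x, ...) {
  plot(x$x, x$y, type = "b", ...)       # falls through to the default method
  invisible("plot.toycurve was called")
}

# The generic plot() dispatches on the class of its first argument
dispatched <- plot(curve_obj, xlab = "x", ylab = "y")

# Built-in objects carry a class in the same way
diet_factor <- factor(c(1, 1, 2, 2, 3, 3))
class(diet_factor)
```

The same call, plot(), therefore produces different displays for a numeric vector, a factor, or a fitted-model object, depending only on the class attribute.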
Generic Graphical Functions – Example

> diet <- c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3,
            4, 4, 4, 4, 4, 4, 4, 4)
> time <- c(62, 60, 63, 59, 63, 67, 71, 64, 65, 66, 68, 66, 71, 67, 68, 68,
            56, 62, 60, 61, 63, 64, 63, 59)
> blood2 <- data.frame(diet = diet, time = time)
> plot(blood2$diet, blood2$time)
> diet2 <- factor(blood2$diet)
> plot(diet2, blood2$time)

In the first case, a scatter plot of the data is obtained, whereas in the second case, a set of boxplots is produced, simply because diet2 is a factor object. If data of this type are available in a data frame object, the same plot can be obtained directly by using a formula as the first argument to the plot() function. Note that a formula is itself an R object.

> plot(weight ~ feed, data = chickwts)
> states <- data.frame(murder, illit)
> plot(murder ~ illit, data = states)

Generic Graphical Functions – Output of lm

If the object created by the lm() function is used as the first argument of the plot() function, a series of diagnostic plots is generated, each displayed in response to a prompt. The lm() function is used here to fit the same straight-line model as before, and the result is assigned to an object named lmout. The lmout object may then be used to create diagnostic plots, such as a residual plot.

> lmout <- lm(murder ~ illit)
> plot(lmout)
Hit <Return> to see next plot: .....

Other High-level Graphics Functions

Although methods for diagnostics are available through generic functions such as plot(), it may be desirable to obtain them directly using high-level graphics functions.
E.g., a normal probability plot of the residuals from the straight-line model fitted earlier to the murder and illiteracy variables can be drawn using the qqnorm() function:

> qqnorm(regout$resid, xlab="Standard Normal Quantiles", ylab="Residuals",
         main="Normal Probability Plot of Residuals")

The blood2 data are converted to a data frame suitable for use with the function aov(), which performs a one-way ANOVA when the variable on the right-hand side of the formula is a factor. Boxplots of the original data and of the residuals are then drawn to check the model assumptions:

> blood3 <- data.frame(time=blood2$time, diet=factor(blood2$diet))
> blood.aov <- aov(time ~ diet, data=blood3)
> summary(blood.aov)
> boxplot(time ~ diet, data=blood3)
> boxplot(blood.aov$res ~ diet, data=blood3)

More High-level Graphics Functions

E.g., a barplot of the mean weight for each feed type in the chickwts data:

> barplot(with(chickwts, tapply(X = weight, INDEX = feed, FUN = mean)),
          ylab="Mean Weight (in gms)", xlab="Feed Type", main="Bargraph of Mean Weight")

The cut() function creates categorical variables from continuous ones by creating non-overlapping intervals according to breaks or cut-off values provided by the user. The result is a factor unless labels=FALSE is used, in which case we get a vector of integer codes, which can be converted to a factor if so desired.

> data(mtcars)
> cut(x = mtcars[,"wt"], breaks = c(0, 2.5, 3.5, 5.5))
> carsize <- cut(mtcars[,"wt"], c(0,2.5,3.5,5.5), labels=FALSE)
> is.factor(carsize)
> carsize <- cut(mtcars[,"wt"], c(0,2.5,3.5,5.5), labels=c("1","2","3"))
> is.factor(carsize)
> carsize <- cut(x = mtcars[,"wt"], breaks = c(0,2.5,3.5,5.5),
                 labels=c("Compact", "Midsize", "Large"))

The cut() function (continued)

Category variables are useful for creating contingency tables, for testing hypotheses using analysis of variance models, or simply for plotting bar charts, as shown below.
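As a quick check on the categories produced by cut(), a one-way contingency table of the resulting factor can be tabulated. The sketch below also illustrates two details that are easy to miss: intervals are closed on the right by default, and the integer codes from labels=FALSE can be turned back into a factor. The variable names are chosen here for illustration.

```r
wt     <- mtcars[, "wt"]
breaks <- c(0, 2.5, 3.5, 5.5)

# Factor with descriptive labels, as in the example above
carsize <- cut(wt, breaks = breaks, labels = c("Compact", "Midsize", "Large"))

# One-way contingency table: number of cars in each weight class
size_counts <- table(carsize)
print(size_counts)

# Intervals are right-closed by default: 2.5 falls in (0, 2.5]
boundary_case <- as.character(cut(2.5, breaks = breaks))

# labels = FALSE returns integer codes, which can be converted to a factor
codes     <- cut(wt, breaks = breaks, labels = FALSE)
recovered <- factor(codes, levels = 1:3, labels = levels(carsize))
```

Passing right = FALSE to cut() would make the intervals closed on the left instead.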
First, a statistic of the variable to be displayed in the bar chart needs to be computed for observations grouped by the values of the category variable(s). In this example, the sample medians of gas mileage for cars of each size are computed using the factor carsize and the tapply() function. These medians are then plotted in a bar chart using the barplot() function:

> carmeds <- tapply(mtcars[,"mpg"], carsize, median)
> barplot(carmeds, names = levels(carsize), xlab = "Size of Car",
          ylab = "Median Gas Mileage", main = "Gas Mileage by Size")

The split() function

A similar approach can be used to obtain side-by-side boxplots of gas mileage categorized by car size. The split() function is first used to partition the gas mileage values into the three groups, as components of a list object. If this object is used as the first argument to the boxplot() function, the desired boxplots will be drawn:

> split(mtcars[,"mpg"], carsize)
> boxplot(split(mtcars[,"mpg"], carsize), xlab="Size of Car",
          ylab="Gas Mileage in Miles/Gallon", main="Gas Mileage Distribution by Size")

The factor carsize may also be used to plot characters (C, M, or L, say) instead of symbols to identify points, e.g., in a scatter plot of mileage against weight:

> plot(mtcars[,"wt"], mtcars[,"mpg"], type="n", xlab="Weight", ylab="Mileage",
       main="Fuel Use vs. Weight")
> char <- substring(carsize, 1, 1)
> text(mtcars[,"wt"], mtcars[,"mpg"], char, cex=.7, font=2)
> legend(4, 30, levels(carsize), pch="CML")

The legend() function describes what each symbol represents.

Some low-level graphics functions

The low-level graphics functions segments() and arrows() are useful for adding line segments to a plot.
As an example, the plot of the residuals from the regression of the murder variable on the illit variable can be enhanced in the following way:

> plot(illit, regout$resid, type="n", ylab="Residuals")
> abline(h=0, lty=3, col=2)
> segments(illit, 0, illit, regout$resid, col=3)

The segments() function draws line segments from the (x, y) coordinates given by its first two arguments to the (x, y) coordinates given by its next two arguments. Another example:

> plot(illit, murder)
> lines(illit, yhat)
> segments(illit, murder, illit, yhat)

Interacting with Plots in R – the locator() function

Several R functions allow the user to interact dynamically with an existing plot. Suppose that we want to insert a character string at a specific position on the plot. Instead of calculating the coordinates of the position, locator() may be used to determine them interactively. First a plot is created and then locator() is invoked:

> plot(illit, murder)
> locator(1)
$x
[1] 1.613948

$y
[1] 4.501447

With the locator() function, the cursor in the graphics window turns into a crosshair, and mouse-button-1 (the left button) is clicked at a location on the plot. This action selects a point in the plot area. locator() returns the (x, y) coordinates of the selected point and prints them in the R window. These two values (x*, y*) may be used as the first two arguments of the text() function to place a text string at or near the selected location on the plot.

The locator() and identify() functions

> text(1.6, 4.5, "Lower Murder Rates Down Here!", cex=.5, adj=0)

The locator() and text() functions can be combined in an obvious way, as illustrated below.

> text(locator(1), "An Outlier ?", cex=1, adj=0, col=2)

The identify() function is another R graphics function that can be used to interact with a plot.

> identify(illit, murder, state.abb)

After executing the identify() function, move to the graphics window; again the cursor turns into a crosshair.
Use mouse-button-1 to click near a plotted point; this causes the label of the observation identified by that point to be plotted on the graph. This can be repeated for any number of points. To exit the identify mode, click mouse-button-2 (middle button) while in the graphics window.

Multiple Plots on a Single Page – the par() function

To create an nr-by-nc array of plots on a single page, i.e., in a single graphics window, one can use the graphical parameter mfrow= or mfcol=. The parameter is specified as a vector of the form c(nr, nc), and subsequent plots are arranged in an nr-by-nc array on the graphics device. Plots are drawn by row (mfrow=) or by column (mfcol=). The par() function is invoked to set these parameters. The mar= or mai= parameters may be used to reserve space in the margins of the individual plots, in units of lines or inches, respectively. Similarly, oma= or omi= may be used to reserve space in the outer margins of the entire page.

> par(mfrow=c(2,2), mar=rep(4,4))
> plot(illit, murder, col = 2, pch = 4, main = "Murder vs. Illiteracy")
> lines(illit, lmout$fitted, col=5, lty = 4)
> qqnorm(lmout$res, col=6, pch=16, main="Normal Plot of Residuals")
> boxplot(lmout$res, ylab="Residuals", main="Boxplot")
> hist(lmout$res, xlab="Residuals", main="Histogram of Residuals")

Displaying Multivariate Data – I

Many plotting functions are available for displaying multivariate data. Some examples are those that plot scatterplot matrices, star plots, or parallel coordinate plots. The pairs() function creates a graph consisting of a scatterplot for each pair of variables in the R object supplied as its argument. A panel= argument can be used to add overlays to each scatterplot.
> pairs(state.x77)
> pairs(state.x77[,2:6], cex=.5, pch=3)
> pairs(state.x77[,2:6], panel=function(x,y){points(x,y); lines(lowess(x,y))})
> stars(mtcars, cex=.5)
> library(MASS)   # parcoord() and the topo data are in the MASS package
> parcoord(state.x77)
> parcoord(state.x77[, c(4, 6, 2, 5, 3)])

Displaying Multivariate Data – II

The contour() and persp() functions create contour plots and three-dimensional surface plots. Here, these functions are used to plot contours and a 3-D graph of a surface fitted to a set of topographic measurements in topo (from the MASS package). First, a surface is fitted to the data using the loess() function in R. The fitted surface is used to obtain predicted values z over a 2-dimensional grid (named topo.grid below). The R generic function predict() is useful for this purpose. The function expand.grid(), used below, creates a data frame from all combinations of the two vectors x and y.

> data(topo)
> topo.surf <- loess(z ~ x*y, topo, span=0.25)
> topo.surf
> topo.grid <- list(x=seq(0,6.5,0.2), y=seq(0,6.5,0.2))
> topo.z <- predict(topo.surf, expand.grid(topo.grid))
> contour(topo.grid$x, topo.grid$y, topo.z)
> points(topo)
> persp(topo.grid$x, topo.grid$y, topo.z)
> persp(topo.grid$x, topo.grid$y, topo.z, theta=40, phi=40)

Displaying Multivariate Data – III

Conditioning plots, or coplots, are produced by the R function coplot(). Two variables are plotted against each other in a series of plots conditioned on the values of a (continuous-valued) third variable. This makes it possible to see how the relationship between the first two variables depends on the third. The third variable is allowed to take values in a set of overlapping ranges. These intervals may be determined by the use of the co.intervals() function. If they are not provided, R determines them using co.intervals(x, number=6, overlap=0.5), where x is the conditioning variable.
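To see what these conditioning intervals look like, the default call can be run directly. The sketch below, with variable names chosen for illustration, computes the six overlapping income intervals, counts how many states fall in each, and passes the intervals to coplot() explicitly through its given.values= argument.

```r
illit  <- state.x77[, 3]
murder <- state.x77[, 5]
income <- state.x77[, 2]

# Six overlapping intervals; each row holds (lower, upper) limits
ivals <- co.intervals(income, number = 6, overlap = 0.5)
print(ivals)

# Number of states in each interval; because the intervals overlap,
# a state can be counted in more than one of them
n_per_panel <- apply(ivals, 1,
                     function(r) sum(income >= r[1] & income <= r[2]))

# Supply the intervals explicitly instead of relying on the default
coplot(murder ~ illit | income, given.values = ivals)
```

Changing number= or overlap= in the co.intervals() call gives more or fewer panels, or panels that share fewer observations.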
The following produces a coplot of the murder variable against the illit variable, conditioned on the income variable in the state.x77 data set:

> income <- state.x77[, 2]
> coplot(murder ~ illit | income)

Printing Plots to a File

R provides a number of ways to print plots to a file. dev.copy2pdf() is perhaps what you will use most frequently in this class. However, for many journal submissions, you may need an encapsulated PostScript file of the plot; in that case, use dev.copy2eps() to get a .eps file. In either case, the file contains the plot in the (last) active graphics window. It is preferable to create a .eps file and then convert it into a .pdf file, but on OS's that one throws money after, this may not in general be an easy proposition.
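dev.copy2pdf() and dev.copy2eps() copy from a graphics window, so they are meant for interactive sessions. A non-interactive alternative, not covered above, is to open a file device with pdf(), draw the plot, and close the device with dev.off(). The sketch below assumes a temporary output location; the file name is hypothetical.

```r
illit  <- state.x77[, 3]
murder <- state.x77[, 5]

# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "murder_vs_illit.pdf")

pdf(out_file, width = 6, height = 5)   # open a PDF file device
plot(illit, murder, xlab = "Illiteracy", ylab = "Murder Rate")
dev.off()                              # close the device to finish writing the file

# In an interactive session with a plot on screen, the functions named
# above would be used instead, e.g.:
#   dev.copy2pdf(file = "murder_vs_illit.pdf")
#   dev.copy2eps(file = "murder_vs_illit.eps")
```

Nothing is written to disk until dev.off() is called, so forgetting it is the usual reason for an empty or unreadable file.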