Graphics and Visualization

This is an overview of some of the standard methods available in R for visualizing data with statistical graphics. Examining your data graphically is an important early step of any data analysis. In general, you should start with univariate methods, such as histograms, examining each variable in isolation. Then look at pairs of variables with scatterplots, and work your way up to high-dimensional methods. Only a small number of examples of each method are provided. Remember that you can always use the help function to get more details and options for any of these functions.

1. Categorical Data

    data(Titanic)
    Titanic
    , , Age = Child, Survived = No

          Sex
    Class  Male Female
      1st     0      0
      2nd     0      0
      3rd    35     17
      Crew    0      0

    , , Age = Adult, Survived = No

          Sex
    Class  Male Female
      1st   118      4
      2nd   154     13
      3rd   387     89
      Crew  670      3

    , , Age = Child, Survived = Yes

          Sex
    Class  Male Female
      1st     5      1
      2nd    11     13
      3rd    13     14
      Crew    0      0

    , , Age = Adult, Survived = Yes

          Sex
    Class  Male Female
      1st    57    140
      2nd    14     80
      3rd    75     76
      Crew  192     20

    ftable(Titanic)
                       Survived  No Yes
    Class Sex    Age
    1st   Male   Child            0   5
                 Adult          118  57
          Female Child            0   1
                 Adult            4 140
    2nd   Male   Child            0  11
                 Adult          154  14
          Female Child            0  13
                 Adult           13  80
    3rd   Male   Child           35  13
                 Adult          387  75
          Female Child           17  14
                 Adult           89  76
    Crew  Male   Child            0   0
                 Adult          670 192
          Female Child            0   0
                 Adult            3  20

    Titanic1 <- margin.table(Titanic, 1)
    Titanic1
    Class
     1st  2nd  3rd Crew
     325  285  706  885

1a. Bar Chart

    barplot(Titanic1)
    barplot(Titanic1, main="Individuals on the Titanic")

1b. Pie Chart (usually, a bar chart is better!)

    pie(Titanic1, main="Individuals on the Titanic")

1c. Stacked Bar Chart

    Titanic2 <- margin.table(Titanic, c(4,1))
    Titanic2
            Class
    Survived 1st 2nd 3rd Crew
         No  122 167 528  673
         Yes 203 118 178  212

    barplot(Titanic2, legend.text=TRUE, main="Survival on the Titanic, By Class")

The ylim argument lets us stretch the y-axis to make room for the legend.

    barplot(Titanic2, ylim=c(0,1100), legend.text=TRUE, main="Survival on the Titanic, By Class")

1d. Grouped Bar Chart

    barplot(Titanic2, beside=TRUE, ylim=c(0,800), legend.text=TRUE, main="Survival on the Titanic, By Class")

2. Univariate Data

    data(iris)
    iris
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    1            5.1         3.5          1.4         0.2    setosa
    2            4.9         3.0          1.4         0.2    setosa
    :              :           :            :           :         :
    150          5.9         3.0          5.1         1.8 virginica

    plength <- iris[,3]
    species <- iris[,5]

2a. Strip Chart (Univariate Scatterplot)

    stripchart(plength)
    stripchart(plength, method="jitter")
    stripchart(plength, method="stack")
    stripchart(plength~species, method="stack")

2b. Box Plot

    boxplot(plength)
    boxplot(plength~species)

2c. Histogram

    hist(plength)

Using freq=FALSE puts the histogram on a density scale (total area = 1).

    hist(plength, breaks="Scott", freq=FALSE)

You can select your bin edges explicitly. Looking at multiple bin widths and/or starting points is a good idea.

    t <- seq(0.5, 7, by=0.5)
    t
     [1] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
    hist(plength, breaks=t)
    hist(plength, breaks=t+.25)

2d. Kernel Density Estimates

Kernel density estimates are similar to histograms, but they are smoother and generally give more accurate estimates of the underlying density. A "kernel" (by default, a normal curve) is centered on each observation, scaled to have area 1/n, with a standard deviation equal to a smoothing parameter called the bandwidth ("bw"). The (vertical) sum of all of the kernels is the final estimate.

Here's an example showing how a KDE is built up. Don't worry too much about the code. (No seed is set, so your values will differ.)

    X <- rnorm(10)
    X
     [1]  0.05395434 -0.86492411 -1.30461335  1.81755076 -0.55122585  0.48825963
     [7] -1.72901353 -0.38337110  0.55755471 -1.14514124
    plot(density(X, bw=.4))
    points(X, rep(.02,10), pch=16)
    t <- seq(-3, 3, length=101)
    for (i in 1:10) {lines(t, dnorm(t, X[i], .4)/10, col="red")}

    plot(density(plength))

The "adjust" argument is a multiplier on the default bandwidth. As with histograms, looking at more than one is a good idea.
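The two ideas in this section can be checked numerically. The following is a minimal sketch using only base R: it confirms that the bandwidth reported by density() scales with adjust, and that the estimate is the sum of normal kernels of area 1/n. The helper kde_by_hand and the evaluation points x0 are illustrative assumptions, not part of R or of the original handout.

```r
# Petal lengths from the built-in iris data, as in the text
plength <- iris[, 3]

# 1. adjust is a multiplier on the default bandwidth
bw_default <- density(plength)$bw              # bandwidth from the default rule
adjust_values <- c(1/4, 1/2, 1, 2)
bw_used <- sapply(adjust_values,
                  function(a) density(plength, adjust = a)$bw)
print(cbind(adjust = adjust_values, bw = bw_used,
            ratio = bw_used / bw_default))     # ratio should match adjust

# 2. the estimate is the sum of normal kernels, each with area 1/n
#    (kde_by_hand is a hypothetical helper, not part of R)
kde_by_hand <- function(x0, data, bw) {
  sapply(x0, function(x) sum(dnorm(x, mean = data, sd = bw)) / length(data))
}
x0 <- c(1.5, 3, 4.5, 6)                        # a few evaluation points
by_hand <- kde_by_hand(x0, plength, bw_default)
d <- density(plength)
from_density <- approx(d$x, d$y, xout = x0)$y  # interpolate density()'s grid
print(rbind(by_hand, from_density))            # the two rows should agree closely
```

The agreement in the second part is close rather than exact, because density() computes its estimate on a grid using a fast approximation.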
    plot(density(plength, adjust=1/2))
    plot(density(plength, adjust=1/4))
    plot(density(plength, adjust=2))

Setting bw="SJ" chooses a good ("Sheather-Jones") data-driven bandwidth.

    plot(density(plength, bw="SJ"), main="Estimated PDF of Petal Length")

3. A) Bivariate Normal

    library(mixtools)
    library(mvtnorm)
    set.seed(17)
    p <- rmvnorm(1000, c(250000, 20000), matrix(c(100000^2, 22000^2, 22000^2, 6000^2),2,2))
    plot(p, pch=20, xlim=c(0,500000), ylim=c(0,50000), xlab="Packets", ylab="Flows")
    ellipse(mu=colMeans(p), sigma=cov(p), alpha=.05, npoints=250, col="red")

Plotting the estimated density of a bivariate normal sample in 3-D, and its contours:

    # let's first simulate a bivariate normal sample
    library(MASS)
    bivn <- mvrnorm(1000, mu = c(0, 0), Sigma = matrix(c(1, .5, .5, 1), 2))
    # now we do a kernel density estimate
    bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)
    # fancy perspective
    persp(bivn.kde, phi = 45, theta = 30, shade = .1, border = NA)
    # fancy contour with image
    image(bivn.kde); contour(bivn.kde, add = TRUE)
    library(rgl)
    col1 <- rainbow(length(bivn.kde$z))[rank(bivn.kde$z)]
    persp3d(x=bivn.kde, col = col1)

B) Bivariate Data

    data(state)
    dimnames(state.x77)
    [[1]]
     [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"
     [5] "California"     "Colorado"       "Connecticut"    "Delaware"
     [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"
    [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"
    [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"
    [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"
    [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"
    [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"
    [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"
    [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
    [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"
    [45] "Vermont"        "Virginia"       "Washington"     "West Virginia"
    [49] "Wisconsin"      "Wyoming"

    [[2]]
    [1] "Population" "Income"     "Illiteracy" "Life Exp"   "Murder"
    [6] "HS Grad"    "Frost"      "Area"

    illiteracy <- state.x77[,3]
    murder <- state.x77[,5]

3a. Scatterplots

The simplest way to call a scatterplot is plot(x, y). Formula notation, plot(y~x), is also useful, since many R modeling functions use the same notation.

    plot(illiteracy, murder)
    plot(Murder~Illiteracy, data=state.x77)

You can change the plotting symbol (glyph) with the "pch" argument and the color of the points with the "col" argument. Labels can be added to an existing plot with the text function.

    plot(illiteracy, murder, col="red", pch=16, xlim=c(0.2,3))
    text(illiteracy, murder, labels=state.name)

You can pass a vector of plotting characters or colors, and they will be applied to the individual points. This can be useful for separating points by category. The unclass function can be helpful in turning a factor vector into a numeric one for this purpose.

    state.region
     [1] South         West          West          South         West
     [6] West          Northeast     South         South         South
    [11] West          West          North Central North Central North Central
    [16] North Central South         South         Northeast     South
    [21] Northeast     North Central North Central South         North Central
    [26] West          North Central West          Northeast     Northeast
    [31] West          Northeast     South         North Central North Central
    [36] South         West          Northeast     Northeast     South
    [41] North Central South         South         West          Northeast
    [46] South         West          South         North Central West
    Levels: Northeast South North Central West

    unclass(state.region)
     [1] 2 4 4 2 4 4 1 2 2 2 4 4 3 3 3 3 2 2 1 2 1 3 3 2 3 4 3 4 1 1 4 1 2 3 3 2 4 1
    [39] 1 2 3 2 2 4 1 2 4 2 3 4
    attr(,"levels")
    [1] "Northeast"     "South"         "North Central" "West"

    plot(illiteracy, murder, pch=unclass(state.region), xlim=c(0.2,3))
    plot(illiteracy, murder, col=unclass(state.region), xlim=c(0.2,3), pch=16)
    plot(illiteracy, murder, col=unclass(state.region), pch=unclass(state.region), xlim=c(0.2,3), main="Murder vs. Illiteracy Rates - U.S. States")
    legend("bottomright", levels(state.region), pch=1:4, col=1:4)

The identify function lets you label points interactively: click on points in the plot, and it returns their indices.

    identify(illiteracy, murder, state.name)
    [1]  1 10 11 18 28 31 34 41

3b. Time Series and Line Plots

You can draw line plots (as for time series) with the plot command as well. Set the "type" argument to "l" (for lines), "b" (for both points and lines) or "o" (for overstrike).

    library(lattice)
    data(melanoma)
    melanoma
       year incidence
    1  1936       0.9
    2  1937       0.8
    :     :         :
    37 1972       4.8

    plot(incidence~year, data=melanoma, type='l')
    plot(incidence~year, data=melanoma, type='b')
    plot(incidence~year, data=melanoma, type='o', main="Melanoma Incidence by Year", ylab="melanoma incidence")

4. Trivariate Data – 3-D Scatterplots

    frost <- state.x77[,7]

A 3-D scatterplot function is available in the package scatterplot3d. Remember, you'll need to set the proxy server to install this from campus.

    library(scatterplot3d)
    scatterplot3d(cbind(illiteracy, murder, frost))
    scatterplot3d(cbind(illiteracy, murder, frost), type='h', highlight.3d=TRUE)

[Figure: 3-D scatterplot with axes illiteracy, murder, and frost]

The rgl package produces nice 3-D images that can be rotated with your mouse, which makes them easier to interpret. You'll also need to install this package.

    library(rgl)
    plot3d(illiteracy, frost, murder)
    rgl.snapshot("C:\\...\\triscat1.png")
    plot3d(illiteracy, frost, murder, type="s")
    plot3d(illiteracy, frost, murder, type="s", size=.25)
    rgl.snapshot("C:\\...\\triscat2.png")

Note that the rgl window cannot be cut-and-pasted to Word as easily as regular R graphics. You'll need to call rgl.snapshot, as above, to create a .png (image) file. Then use Insert -> Picture -> From File to import the image into Word.

    plot3d(illiteracy, frost, murder, type="s", size=.25, col="red")
    rgl.snapshot("C:\\...\\triscat3.png")
    plot3d(illiteracy, frost, murder, type="s", size=.25, col=c("red","yellow","blue","green")[unclass(state.region)])
    text3d(illiteracy, frost, murder+.25, texts=state.name)
    rgl.snapshot("C:\\...\\triscat4.png")

5. "Hypervariate" Data

William Cleveland noted that just as most of statistics changes fundamentally once we increase the number of dimensions to 2 or more ("multivariate"), graphics must be done in a fundamentally different way once we have 4 or more variables. He suggested the term "hypervariate" for use in this context.

5a. Pairwise Scatterplot Matrices

    pairs(iris[,1:4])
    pairs(iris[,1:4], pch=16, col=unclass(species))
    state <- state.x77[,2:7]
    pairs(state)
    pairs(state, pch=16, col=unclass(state.region))

5b. Star Plots

    stars(state, key.loc=c(15,1.5))
    stars(state, key.loc=c(15,1.5), col.stars=unclass(state.region)+1)

5c. Parallel Coordinate Plots

    library(MASS)
    parcoord(state, col=unclass(state.region))
    legend("topleft", levels(state.region), lty=1, col=1:4)
    parcoord(iris[,1:4], col=c(1,2,4)[unclass(species)])
    legend("topleft", levels(species), lty=1, col=c(1,2,4))

6. Further (Built-In) Graphics Demos

R has a number of built-in demos of various features and packages. These not only demonstrate what R can do but also show you the commands used to create each plot. You can see a list with demo(). Two of the most useful are the following.

    demo(graphics)
    demo(plotmath)
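If you would rather have the demo names as a character vector than read them in a pager, one option is to pull them out of the object that demo() returns. This is a small sketch that assumes the returned object keeps its usual "results" matrix with an "Item" column; the variable names are illustrative.

```r
# Query the demos that ship with the graphics package (nothing is run)
demo_info <- demo(package = "graphics")

# The "Item" column of the results matrix holds the demo names
demo_names <- demo_info$results[, "Item"]
print(demo_names)
```

Any name in this vector can then be passed to demo(), as with demo(graphics) and demo(plotmath).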