CSSS 508: Intro R
2/22/06
Lab 7: More Plotting Practice
Rebecca Nugent, Department of Statistics, U. of Washington

Looking more closely at the example at the end of last class’s lecture:

The CDC released the new average heights and weights by gender in 2002.

Men: Average Height 69.5 in; Average Weight 191 lbs
Women: Average Height 64 in; Average Weight 164 lbs

(We don’t have the standard deviations, so we’ll just estimate them as 2.5 in and 15 lbs.)

Height and weight are often approximately normally distributed; what would these individual distributions look like? We simulate 50 men and 50 women and evaluate the normal density at each simulated value.

# simulated heights (in) and weights (lbs): m = men, w = women
mh<-rnorm(50,69.5,2.5)
mw<-rnorm(50,191,15)
wh<-rnorm(50,64,2.5)
ww<-rnorm(50,164,15)

# normal density evaluated at each simulated value
mh.dens<-dnorm(mh,69.5,2.5)
mw.dens<-dnorm(mw,191,15)
wh.dens<-dnorm(wh,64,2.5)
ww.dens<-dnorm(ww,164,15)

# men first, then women
heights<-c(mh,wh)
weights<-c(mw,ww)

plot(mh,mh.dens,xlim=c(min(heights),max(heights)),xlab="Height (in)",ylab="Density",col=2,pch=16)
points(wh,wh.dens,col=4,pch=16)
title("Height by Gender")

Now we can do either of the following. (locator(1) waits for you to click on the plot and places the label or legend where you click.)

text(locator(1),"Men")
text(locator(1),"Women")

Or:

legend(locator(1),c("Men","Women"),col=c(2,4),pch=16)

So we have some overlap in the distributions. What about weight?

plot(mw,mw.dens,xlim=c(min(weights),max(weights)),xlab="Weight (lbs)",ylab="Density",col=2,pch=16)
points(ww,ww.dens,col=4,pch=16)
title("Weight by Gender")
text(locator(1),"Men")
text(locator(1),"Women")

We have been using histograms, bar charts, etc. just to look at a picture of our data. However, these plot commands can give us back a lot of information. Assign the plot command to a variable and then type the variable name to see what info you get back.

men.h<-hist(mh,xlab="Men's Height (in)")
men.h

We get back a list of all the breaks/cut points the histogram used to create the graph, the counts (heights) of each bar, the corresponding density values, the midpoints of each bin, etc. We can access all this information with men.h$breaks, men.h$counts, etc.
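As a small illustration of what can be done with the returned object, the sketch below pulls a few components out of it and runs two sanity checks. It is only a sketch: the seed is an assumption added so the numbers are reproducible, and plot=FALSE is used so that hist() returns the object without drawing anything.

```r
set.seed(508)                      # assumed seed, only for reproducibility
mh <- rnorm(50, 69.5, 2.5)         # simulated men's heights, as in the lab

# plot = FALSE returns the same histogram object without drawing it
men.h <- hist(mh, plot = FALSE)

names(men.h)                       # components stored in the object

# every observation falls into exactly one bin
total.count <- sum(men.h$counts)

# density bars times bin widths add up to 1
bin.width <- diff(men.h$breaks)
total.area <- sum(men.h$density * bin.width)

# midpoint of the tallest bar
modal.mid <- men.h$mids[which.max(men.h$counts)]

total.count
total.area
modal.mid
```

The same components can be reused in later plots, for example by passing men.h$breaks as the breaks argument of another hist() call so that two histograms share the same bins.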
Let’s combine the gender data and look at the overall picture:

# one colour per point: men (red) first, then women (blue)
plot(heights,weights,col=c(rep(2,length(mh)),rep(4,length(wh))),pch=16)
title("Height vs Weight: 2002")

Identify 4 people you think might be interesting in the data set. How are they different?

identify(heights,weights,n=4)

Now we’ll plot the density when we combine the data; you can see two slightly different peaks that correspond to the two gender groups. The individual gender groups are plotted as lines.

First heights:

# combined density: equal-weight mixture of the two gender densities
heights.dens<-0.5*dnorm(heights,69.5,2.5)+0.5*dnorm(heights,64,2.5)
plot(heights,heights.dens,xlab="Height (in)",ylab="Density",pch=16,ylim=c(0,max(c(mh.dens,wh.dens))))
lines(sort(mh),mh.dens[order(mh)],col=2)
lines(sort(wh),wh.dens[order(wh)],col=4)
text(locator(1),"Men")
text(locator(1),"Women")
text(locator(1),"Combined")
title("Height Distribution in 2002")

Then weights:

weights.dens<-0.5*dnorm(weights,191,15)+0.5*dnorm(weights,164,15)
plot(weights,weights.dens,xlab="Weight (lbs)",ylab="Density",pch=16,ylim=c(0,max(c(mw.dens,ww.dens))))
lines(sort(mw),mw.dens[order(mw)],col=2)
lines(sort(ww),ww.dens[order(ww)],col=4)
text(locator(1),"Men")
text(locator(1),"Women")
text(locator(1),"Combined")
title("Weight Distribution in 2002")

These plots show how the density curves change when we combine our gender groups. In the combined curve each gender contributes only half of the population, so the two peaks are lower than in the single-gender curves, while values in the overlap area pick up density from both groups.

Now let’s combine the height and weight data as pairs of measurements on subjects. Each subject has a height, weight, and gender (1=M, 2=F).

data<-rbind(cbind(mh,mw),cbind(wh,ww))
# gender column: one entry per subject, men first
data<-cbind(data,c(rep(1,length(mh)),rep(2,length(wh))))
data

Within each gender we have a bivariate normal distribution (with independent height and weight, since we simulated them separately); the combined data come from a mixture of the two. Think of height as being on the x-axis and weight as the y-axis. What would this density look like? Since with real data we would not know the true density, let’s look at pictures of a few density estimates.
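Because the data in this lab are simulated, the density that generated them is actually known, which gives a reference picture to compare the estimates against. The sketch below is a minimal version under stated assumptions: the same means and standard deviations as above, independent height and weight within each gender, and equal numbers of men and women. The grid limits and the helper name group.dens are illustrative choices, not part of the lab.

```r
# grid over the plausible range of the data (limits are arbitrary choices)
h.grid <- seq(55, 80, length.out = 60)     # heights (in)
w.grid <- seq(110, 245, length.out = 60)   # weights (lbs)

# density for one gender: product of independent normals (hypothetical helper)
group.dens <- function(h, w, h.mean, w.mean, h.sd = 2.5, w.sd = 15) {
  dnorm(h, h.mean, h.sd) * dnorm(w, w.mean, w.sd)
}

# equal-weight mixture of the men's and women's densities on the grid
true.dens <- 0.5 * outer(h.grid, w.grid, group.dens, h.mean = 69.5, w.mean = 191) +
             0.5 * outer(h.grid, w.grid, group.dens, h.mean = 64,   w.mean = 164)

# rough check: the density should sum to about 1 over the grid
grid.mass <- sum(true.dens) * diff(h.grid)[1] * diff(w.grid)[1]
grid.mass

persp(h.grid, w.grid, true.dens, theta = 30, phi = 30, col = 3,
      xlab = "Height", ylab = "Weight", zlab = "Density")
contour(h.grid, w.grid, true.dens, xlab = "Height (in)", ylab = "Weight (lbs)")
```

Keeping this picture in mind makes it easier to judge whether a given density estimate is too bumpy or too smooth.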
Binned Kernel Density Estimate: think of it like a 2-D histogram with lots of small bins.

library(KernSmooth)
# bkde2D expects a two-column matrix, so use only height and weight
z<-bkde2D(data[,1:2],bandwidth=0.5)
persp(z$x1,z$x2,z$fhat,col=3,xlab="Height",ylab="Weight",zlab="Density")
image(z$x1,z$x2,z$fhat)
contour(z$x1,z$x2,z$fhat)

That bandwidth choice gives us a very peaky density estimate. Let’s vary the bandwidth and see what it does to the graph. At which bandwidths can you see the two gender peaks? (Don’t forget to turn on the History: Recording so you can page back through the plots!)

bw<-seq(0.25,3,by=.25)
for(i in 1:length(bw)){
  z<-bkde2D(data[,1:2],bandwidth=bw[i])
  persp(z$x1,z$x2,z$fhat,col=2,xlab="Height",ylab="Weight",zlab="Density")
}

Can you see anything better if you switch the dimensions?

data2<-cbind(data[,2],data[,1])
bw<-seq(0.25,3,by=.25)
for(i in 1:length(bw)){
  z<-bkde2D(data2,bandwidth=bw[i])
  persp(z$x1,z$x2,z$fhat,col=2,xlab="Weight",ylab="Height",zlab="Density")
}

Another binned density estimate is the Averaged Shifted Histogram (D. Scott, 1992).

library(ash)
hw.bin<-bin2(cbind(x=heights,y=weights))   # bins the data
hw.z<-ash2(hw.bin)                         # creates the histogram
image(hw.z)                                # plots it

We can vary the grid size we use:

grid.size<-seq(20,80,by=10)
for(i in 1:length(grid.size)){
  hw.bin<-bin2(cbind(x=heights,y=weights),nbin=c(grid.size[i],grid.size[i]))
  hw.z<-ash2(hw.bin)
  image(hw.z)
}
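The kernel estimates above use the same bandwidth for height and weight, even though weight is spread over a much wider range. One possible variation, sketched below, gives bkde2D a separate bandwidth for each dimension. Choosing each bandwidth with dpik (the one-dimensional plug-in selector in KernSmooth) is a heuristic assumed here for illustration, not something the lab prescribes; the seed and the name hw are also assumptions.

```r
library(KernSmooth)

set.seed(508)                                   # assumed seed, for reproducibility
heights <- c(rnorm(50, 69.5, 2.5), rnorm(50, 64, 2.5))
weights <- c(rnorm(50, 191, 15),   rnorm(50, 164, 15))
hw <- cbind(heights, weights)                   # two-column matrix for bkde2D

# one bandwidth per dimension, each picked by the 1-D plug-in selector
bw.hw <- c(dpik(heights), dpik(weights))
bw.hw

# 2-D binned kernel estimate with dimension-specific bandwidths
z <- bkde2D(hw, bandwidth = bw.hw)

persp(z$x1, z$x2, z$fhat, theta = 30, phi = 30, col = 2,
      xlab = "Height", ylab = "Weight", zlab = "Density")
contour(z$x1, z$x2, z$fhat, xlab = "Height (in)", ylab = "Weight (lbs)")
```

Comparing this picture with the loop over a single shared bandwidth shows how much of the peakiness comes from smoothing too little along the weight axis.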