CSSS 508: Intro to R
3/10/06
Review

Rebecca Nugent, Department of Statistics, U. of Washington

Data: Nutritional and Marketing Information on US Cereals

The UScereal data come from the 1993 ASA Statistical Graphics Exposition. The measurements are taken from the FDA food label and have been normalized to a portion of one American cup.

    library(MASS)
    attach(UScereal)

First we take a look at the structure of our data: its dimensions, the types of variables we have, whether anything is missing, etc.

    dim(UScereal)
    [1] 65 11

    help(UScereal)

    mfr:        Manufacturer (no order in categories)
                G=General Mills, K=Kelloggs, N=Nabisco, P=Post,
                Q=Quaker Oats, R=Ralston Purina.
    calories:   number of calories in one portion
    protein:    grams of protein in one portion
    fat:        grams of fat in one portion
    sodium:     milligrams of sodium in one portion
    fibre:      grams of dietary fibre in one portion
    carbo:      grams of complex carbohydrates in one portion
    sugars:     grams of sugars in one portion
    shelf:      display shelf (1, 2, or 3, counting from the floor)
                (order in categories)
    potassium:  grams of potassium
    vitamins:   vitamins and minerals (none, enriched, or 100%)
                (order in categories)

    summary(UScereal)

     mfr       calories        protein             fat            sodium
     G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0
     K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0
     N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0
     P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8
     Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0
     R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9

         fibre            carbo           sugars          shelf
     Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000
     1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000
     Median : 2.000   Median :18.67   Median :12.00   Median :2.000
     Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169
     3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000
     Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000

       potassium         vitamins
     Min.   : 15.0   100%    : 5
     1st Qu.: 45.0   enriched:57
     Median : 96.6   none    : 3
     Mean   :159.1
     3rd Qu.:220.0
     Max.   :969.7

Note that shelf, although it is stored as a numeric variable, is better summarized by a table. It takes only three values, so it can almost be treated as categorical.

    table(shelf)
    shelf
     1  2  3
    18 18 29

We have no missing data. We do have a wide variety of cereal compositions. Every cereal has some protein, carbohydrates, and potassium (all three minimums are above zero). Some cereals have no sodium; others have a lot. The same is true of sugar, fiber, and fat.

Let's compare the composition of the General Mills and Kellogg cereals.

    par(mfrow=c(2,4))    # 2 x 4 grid, one panel per variable
    gr.label<-c("General Mills","Kellogg")
    boxplot(calories[mfr=="G"],calories[mfr=="K"],names=gr.label)
    title("Calories")
    boxplot(protein[mfr=="G"],protein[mfr=="K"],names=gr.label)
    title("Protein(grams)")
    boxplot(fat[mfr=="G"],fat[mfr=="K"],names=gr.label)
    title("Fat(grams)")
    boxplot(sodium[mfr=="G"],sodium[mfr=="K"],names=gr.label)
    title("Sodium(milligrams)")
    boxplot(fibre[mfr=="G"],fibre[mfr=="K"],names=gr.label)
    title("Fiber(grams)")
    boxplot(carbo[mfr=="G"],carbo[mfr=="K"],names=gr.label)
    title("Carbohydrates(grams)")
    boxplot(sugars[mfr=="G"],sugars[mfr=="K"],names=gr.label)
    title("Sugars(grams)")
    boxplot(potassium[mfr=="G"],potassium[mfr=="K"],names=gr.label)
    title("Potassium(grams)")

Do you think General Mills cereals or Kellogg cereals are better for you? What are some other comparison methods we could use?

Now we'll write a function that takes in a categorical variable and a matrix of continuous variables (number of rows = length of categorical variable). Inside the function, we want to build matrices with a row for each unique category and a column for each continuous variable. The (i, j) entry of the matrix should be the mean of the jth continuous variable for the ith unique category.
For example, the 1st row, 1st column could be the mean number of calories for General Mills cereals. The 2nd row, 1st column could be the mean number of calories for Kellogg cereals. Each row consists of the means of the continuous variables for one category of cereal.

Create similar matrices for the 1st quartile, 2nd quartile (median), and 3rd quartile. Return all 4 matrices in a list. This function should be as general as possible. We'll use it again later.

    gen.sum<-function(cat.vec,cont.var.matrix){
      categ<-unique(cat.vec)
      n.categ<-length(categ)
      n.col<-ncol(cont.var.matrix)
      mean.m<-quar1.m<-quar2.m<-quar3.m<-matrix(0,n.categ,n.col)
      colnames(mean.m)<-colnames(quar1.m)<-colnames(quar2.m)<-colnames(quar3.m)<-colnames(cont.var.matrix)
      # label each row with its category so the output is readable
      rownames(mean.m)<-rownames(quar1.m)<-rownames(quar2.m)<-rownames(quar3.m)<-as.character(categ)
      for(i in 1:n.categ){
        for(j in 1:n.col){
          mean.m[i,j]<-mean(cont.var.matrix[cat.vec==categ[i],j])
          # summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max
          sum.vec<-summary(cont.var.matrix[cat.vec==categ[i],j])
          quar1.m[i,j]<-sum.vec[2]
          quar2.m[i,j]<-sum.vec[3]
          quar3.m[i,j]<-sum.vec[5]
        }
      }
      # return() accepts a single object, so the four matrices go in a list
      return(list(mean=mean.m,quar1=quar1.m,quar2=quar2.m,quar3=quar3.m))
    }

Run the function on the categorical variable mfr and a matrix of calories, protein, fat, sodium, fibre, carbo, sugars, and potassium.

    m<-cbind(calories,protein,fat,sodium,fibre,carbo,sugars,potassium)
    gen.sum(mfr,m)

How could we get rid of the double for loop?

Use scatterplots to look at the relationship between calories and the following variables: protein, fat, fibre, carbo, and sugars (5 plots total). Label the plots appropriately. Find the linear relationship between calories and each of the variables. Plot each line on its respective scatterplot. Put the linear equation (in text: cal = Int + Slope*Var) on the plot as well.
    par(mfrow=c(3,2))
    # locator(1) waits for a mouse click and places the text at that point

    plot(protein,calories,xlab="protein (grams)",ylab="calories",pch=16)
    pro.fit<-lm(calories~protein)
    abline(pro.fit,col=3,lwd=2)
    text(locator(1),paste("Calories = ",round(pro.fit$coef[1],3),"+ ",round(pro.fit$coef[2],3),"* Protein"))

    plot(fat,calories,xlab="fat (grams)",ylab="calories",pch=16)
    fat.fit<-lm(calories~fat)
    abline(fat.fit,col=3,lwd=2)
    text(locator(1),paste("Calories = ",round(fat.fit$coef[1],3),"+ ",round(fat.fit$coef[2],3),"* Fat"))

    plot(fibre,calories,xlab="fibre (grams)",ylab="calories",pch=16)
    fibre.fit<-lm(calories~fibre)
    abline(fibre.fit,col=3,lwd=2)
    text(locator(1),paste("Calories = ",round(fibre.fit$coef[1],3),"+ ",round(fibre.fit$coef[2],3),"* Fibre"))

    plot(carbo,calories,xlab="carbohydrates (grams)",ylab="calories",pch=16)
    carb.fit<-lm(calories~carbo)
    abline(carb.fit,col=3,lwd=2)
    text(locator(1),paste("Calories = ",round(carb.fit$coef[1],3),"+ ",round(carb.fit$coef[2],3),"* Carbs"))

    plot(sugars,calories,xlab="sugars (grams)",ylab="calories",pch=16)
    sugar.fit<-lm(calories~sugars)
    abline(sugar.fit,col=3,lwd=2)
    text(locator(1),paste("Calories = ",round(sugar.fit$coef[1],3),"+ ",round(sugar.fit$coef[2],3),"* Sugars"))

How could we find the cereals that are largely driving/influencing the lines?

Choose a multiple linear regression model that uses any combination of the above five variables to predict calories. Why did you choose this model?

    full.fit<-lm(calories~protein+fat+fibre+carbo+sugars)
    summary(full.fit)

Would you remove any of these variables?

    step(full.fit)

Let's look at what types of cereals get put on each shelf. Run the gen.sum function on the categorical variable shelf and the matrix of calories, protein, fat, sodium, fibre, carbo, sugars, and potassium.

    gen.sum(shelf,m)

Rerun your scatterplots (without the linear regressions) and label the observations from each shelf with a different symbol and a different color.
Use a legend to indicate which symbols you've chosen.

    par(mfrow=c(3,2))
    shelf.vec<-c("Shelf 1","Shelf 2","Shelf 3")
    col.vec<-c(2,3,4)
    pch.vec<-c(16,17,18)

    # type="n" sets up empty axes; points() then adds one shelf at a time
    plot(protein,calories,xlab="protein (grams)",ylab="calories",type="n")
    points(protein[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
    points(protein[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
    points(protein[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
    legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)

    plot(fat,calories,xlab="fat (grams)",ylab="calories",type="n")
    points(fat[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
    points(fat[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
    points(fat[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
    legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)

    plot(fibre,calories,xlab="fibre (grams)",ylab="calories",type="n")
    points(fibre[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
    points(fibre[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
    points(fibre[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
    legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)

    plot(carbo,calories,xlab="carbohydrates (grams)",ylab="calories",type="n")
    points(carbo[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
    points(carbo[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
    points(carbo[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
    legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)

    plot(sugars,calories,xlab="sugars (grams)",ylab="calories",type="n")
    points(sugars[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
    points(sugars[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
    points(sugars[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
    legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)

Use logistic regression to predict whether or not a cereal will be on the bottom shelf. Describe how you chose your model. Find and interpret the odds ratios.

Our response variable is binary: bottom shelf? Yes or No.

    bottom.shelf<-ifelse(shelf==1,1,0)

Let's look at a full fit.

    full.fit<-glm(bottom.shelf~mfr+calories+protein+fat+sodium+fibre+carbo+sugars+potassium+vitamins,family=binomial)
    logodds<-full.fit$coef
    # exponentiating the log odds gives the odds ratios
    or<-round(exp(logodds),3)

First, note that General Mills has been chosen as the reference group for manufacturer; all manufacturer odds ratios are comparisons against General Mills. How would we change this if we wanted a different reference group?

What's going on with those huge odds ratios in the vitamin variables? The reference group is 100%, which contains only 5 cereals. That is too small to be a good baseline for a stable model.

    table(vitamins)

We can change it to a reference group of enriched.

    vit.100<-ifelse(vitamins=="100%",1,0)
    vit.none<-ifelse(vitamins=="none",1,0)
    fit2<-glm(bottom.shelf~mfr+calories+protein+fat+sodium+fibre+carbo+sugars+potassium+vit.100+vit.none,family=binomial)

Did this help? What would you keep? What would you remove?

    step(fit2)
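The question above about choosing a different reference group can also be approached without hand-built indicator variables. The following is a minimal sketch, assuming the base R function relevel() and a working copy of the data. The names cereal and fit.relevel and the reduced two-predictor model are illustrative only and are not part of the original exercise.

```r
library(MASS)

# Work on a copy so the original UScereal data are left unchanged
# (cereal is a hypothetical name used only for this sketch)
cereal <- UScereal
cereal$bottom.shelf <- ifelse(cereal$shelf == 1, 1, 0)

# Make "enriched" the first (baseline) level of the vitamins factor
cereal$vitamins <- relevel(cereal$vitamins, ref = "enriched")
levels(cereal$vitamins)

# Illustrative reduced model: the vitamins odds ratios are now
# comparisons against the "enriched" group
fit.relevel <- glm(bottom.shelf ~ sugars + vitamins,
                   family = binomial, data = cereal)
round(exp(coef(fit.relevel)), 3)
```

The same call applied to cereal$mfr with ref = "K" would make Kellogg the manufacturer baseline instead of General Mills. With groups as small as "none" (3 cereals), glm() may still warn about fitted probabilities of 0 or 1, so the resulting odds ratios should be read with care.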