CSSS 508: Intro to R - Department of Statistics

CSSS 508: Intro to R
3/10/06
Review
Data: Nutritional and Marketing Information on US Cereals
The UScereal data come from the 1993 ASA Statistical Graphics Exposition.
The measurements are taken from the FDA food label and have been normalized to a
portion of one American cup.
library(MASS)
attach(UScereal)
First we take a look at the structure of our data, assess the missingness, take a look at the
type of variables we have, etc.
dim(UScereal)
[1] 65 11
help(UScereal)
mfr: Manufacturer (no order in categories)
  G=General Mills, K=Kelloggs, N=Nabisco, P=Post, Q=Quaker Oats, R=Ralston Purina.
calories: number of calories in one portion
protein: grams of protein in one portion
fat: grams of fat in one portion
sodium: milligrams of sodium in one portion
fibre: grams of dietary fibre in one portion
carbo: grams of complex carbohydrates in one portion
sugars: grams of sugars in one portion
shelf: display shelf (1, 2, or 3, counting from the floor) (order in categories)
potassium: grams of potassium
vitamins: vitamins and minerals (none, enriched, or 100%) (order in categories)
Rebecca Nugent, Department of Statistics, U. of Washington
summary(UScereal)
 mfr       calories        protein             fat            sodium
 G:22   Min.   : 50.0   Min.   : 0.7519   Min.   :0.000   Min.   :  0.0
 K:21   1st Qu.:110.0   1st Qu.: 2.0000   1st Qu.:0.000   1st Qu.:180.0
 N: 3   Median :134.3   Median : 3.0000   Median :1.000   Median :232.0
 P: 9   Mean   :149.4   Mean   : 3.6837   Mean   :1.423   Mean   :237.8
 Q: 5   3rd Qu.:179.1   3rd Qu.: 4.4776   3rd Qu.:2.000   3rd Qu.:290.0
 R: 5   Max.   :440.0   Max.   :12.1212   Max.   :9.091   Max.   :787.9
     fibre            carbo           sugars          shelf
 Min.   : 0.000   Min.   :10.53   Min.   : 0.00   Min.   :1.000
 1st Qu.: 0.000   1st Qu.:15.00   1st Qu.: 4.00   1st Qu.:1.000
 Median : 2.000   Median :18.67   Median :12.00   Median :2.000
 Mean   : 3.871   Mean   :19.97   Mean   :10.05   Mean   :2.169
 3rd Qu.: 4.478   3rd Qu.:22.39   3rd Qu.:14.00   3rd Qu.:3.000
 Max.   :30.303   Max.   :68.00   Max.   :20.90   Max.   :3.000
   potassium         vitamins
 Min.   : 15.0   100%    : 5
 1st Qu.: 45.0   enriched:57
 Median : 96.6   none    : 3
 Mean   :159.1
 3rd Qu.:220.0
 Max.   :969.7
Note that shelf, although it’s a numeric variable, would be better summarized by a table.
There are only three location values; it can almost be viewed as categorical.
table(shelf)
shelf
1 2 3
18 18 29
We have no missing data.
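One way to confirm the absence of missing values is to count the NAs in each column. This is a minimal sketch; the names na.counts and any.missing are ours, not part of the handout.

```r
library(MASS)  # provides the UScereal data frame

# is.na() on a data frame gives a logical matrix; colSums() counts the TRUEs
na.counts <- colSums(is.na(UScereal))
na.counts

# A single overall check: TRUE if any cell in the data frame is missing
any.missing <- any(is.na(UScereal))
any.missing
```

A vector of zeros from colSums() agrees with the summary() output, which would otherwise report an NA's row for any column with missing values.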
We do have a wide variety of cereal compositions. Every cereal has some protein,
carbohydrates, and potassium. Some cereals have no sodium; others have lots of
sodium. Similarly for sugar, fiber, and fat.
Let’s compare the composition of the General Mills and Kellogg cereals.
par(mfrow=c(2,4))
gr.label<-c("General Mills", "Kellogg")
boxplot(calories[mfr=="G"],calories[mfr=="K"],names=gr.label)
title("Calories")
boxplot(protein[mfr=="G"],protein[mfr=="K"],names=gr.label)
title("Protein(grams)")
boxplot(fat[mfr=="G"],fat[mfr=="K"],names=gr.label)
title("Fat(grams)")
boxplot(sodium[mfr=="G"],sodium[mfr=="K"],names=gr.label)
title("Sodium(milligrams)")
boxplot(fibre[mfr=="G"],fibre[mfr=="K"],names=gr.label)
title("Fiber(grams)")
boxplot(carbo[mfr=="G"],carbo[mfr=="K"],names=gr.label)
title("Carbohydrates(grams)")
boxplot(sugars[mfr=="G"],sugars[mfr=="K"],names=gr.label)
title("Sugars(grams)")
boxplot(potassium[mfr=="G"],potassium[mfr=="K"],names=gr.label)
title("Potassium(grams)")
Do you think General Mills cereals or Kellogg cereals are better for you?
What are some other comparison methods we could use?
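As one possible answer, a two-sample test puts a number on the difference that the boxplots show. The sketch below compares calories only and assumes the Welch t-test and the Wilcoxon rank-sum test are acceptable choices here; the names gk, cal.t, and cal.w are ours.

```r
library(MASS)  # provides the UScereal data frame

# Keep only General Mills and Kellogg cereals and drop the unused levels
gk <- droplevels(subset(UScereal, mfr %in% c("G", "K")))

# Welch two-sample t-test (does not assume equal variances)
cal.t <- t.test(calories ~ mfr, data = gk)
cal.t

# Rank-based alternative; exact = FALSE because tied values rule out
# an exact p-value
cal.w <- wilcox.test(calories ~ mfr, data = gk, exact = FALSE)
cal.w
```

The same two calls can be repeated for any of the other composition variables by changing the left-hand side of the formula.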
Now we’ll write a function that takes in a categorical variable and a matrix of continuous
variables (number of rows = length of categorical variable). Inside the function, we
want to build matrices with a row for each unique category and a column for each
continuous variable.
The (i, j) entry of the first matrix should be the mean of the jth continuous variable for
the ith unique category. For example, the 1st row, 1st column could be the mean number
of calories for General Mills cereals. The 2nd row, 1st column could be the mean number
of calories for Kellogg cereals. Each row consists of the means of the continuous
variables for one category of cereal.
Create similar matrices for the 1st quartile, 2nd quartile, and 3rd quartile.
Return all 4 matrices.
This function should be as general as possible. We’ll use it again later.
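gen.sum<-function(cat.vec,cont.var.matrix){
  # one row per unique category, one column per continuous variable
  categ<-unique(cat.vec)
  n.categ<-length(categ)
  n.col<-ncol(cont.var.matrix)
  mean.m<-quar1.m<-quar2.m<-quar3.m<-matrix(0,n.categ,n.col)
  colnames(mean.m)<-colnames(quar1.m)<-colnames(quar2.m)<-colnames(quar3.m)<-colnames(cont.var.matrix)
  # label the rows so each category can be identified in the output
  rownames(mean.m)<-rownames(quar1.m)<-rownames(quar2.m)<-rownames(quar3.m)<-as.character(categ)
  for(i in 1:n.categ){
    for(j in 1:n.col){
      mean.m[i,j]<-mean(cont.var.matrix[cat.vec==categ[i],j])
      # summary() returns Min, 1st Qu., Median, Mean, 3rd Qu., Max
      sum.vec<-summary(cont.var.matrix[cat.vec==categ[i],j])
      quar1.m[i,j]<-sum.vec[2]
      quar2.m[i,j]<-sum.vec[3]
      quar3.m[i,j]<-sum.vec[5]
    }
  }
  # return() accepts a single object, so bundle the four matrices in a list
  return(list(mean=mean.m,quar1=quar1.m,quar2=quar2.m,quar3=quar3.m))
}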
Run the function on the categorical variable mfr and a matrix of calories, protein,
fat, sodium, fibre, carbo, sugars, and potassium.
m<-cbind(calories,protein,fat,sodium,fibre,carbo,sugars,potassium)
gen.sum(mfr,m)
How could we get rid of the double for loop?
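One possible answer is to let apply() walk over the columns and tapply() split each column by category. The sketch below is an assumption about what a loop-free version could look like, not the handout's solution; gen.sum2 and by.cat are hypothetical names.

```r
library(MASS)  # provides the UScereal data frame

vars <- c("calories", "protein", "fat", "sodium",
          "fibre", "carbo", "sugars", "potassium")
m <- as.matrix(UScereal[, vars])

# Hypothetical loop-free variant of gen.sum
gen.sum2 <- function(cat.vec, cont.var.matrix) {
  # For one summary function f: apply() visits each column and tapply()
  # evaluates f within each category, giving a categories x variables matrix
  by.cat <- function(f) {
    apply(cont.var.matrix, 2, function(x) tapply(x, cat.vec, f))
  }
  list(mean  = by.cat(mean),
       quar1 = by.cat(function(x) quantile(x, 0.25, names = FALSE)),
       quar2 = by.cat(median),
       quar3 = by.cat(function(x) quantile(x, 0.75, names = FALSE)))
}

res <- gen.sum2(UScereal$mfr, m)
round(res$mean, 2)
```

Note that tapply() orders the rows by factor level, whereas unique() orders them by first appearance in the data, so the rows may come out in a different order than with the double loop.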
Use scatterplots to look at the relationship between calories and the following variables:
protein, fat, fibre, carbo, and sugars. (5 plots total). Label the plots appropriately.
Find the linear relationship between calories and each of the variables. Plot each line on
its respective scatterplot. Put the linear equation (in text: cal = Int + Slope*Var) on the
plot as well.
par(mfrow=c(3,2))
plot(protein, calories,xlab="protein (grams)",ylab="calories",pch=16)
pro.fit<-lm(calories~protein)
abline(pro.fit,col=3,lwd=2)
text(locator(1),paste("Calories = ",round(pro.fit$coef[1],3),"+",round(pro.fit$coef[2],3),"* Protein"))
plot(fat, calories,xlab="fat (grams)",ylab="calories",pch=16)
fat.fit<-lm(calories~fat)
abline(fat.fit,col=3,lwd=2)
text(locator(1),paste("Calories = ",round(fat.fit$coef[1],3),"+",round(fat.fit$coef[2],3),"* Fat"))
plot(fibre, calories,xlab="fibre (grams)",ylab="calories",pch=16)
fibre.fit<-lm(calories~fibre)
abline(fibre.fit,col=3,lwd=2)
text(locator(1),paste("Calories = ",round(fibre.fit$coef[1],3),"+",round(fibre.fit$coef[2],3),"* Fibre"))
plot(carbo, calories,xlab="carbohydrates (grams)",ylab="calories",pch=16)
carb.fit<-lm(calories~carbo)
abline(carb.fit,col=3,lwd=2)
text(locator(1),paste("Calories = ",round(carb.fit$coef[1],3),"+",round(carb.fit$coef[2],3),"* Carbs"))
plot(sugars, calories,xlab="sugars (grams)",ylab="calories",pch=16)
sugar.fit<-lm(calories~sugars)
abline(sugar.fit,col=3,lwd=2)
text(locator(1),paste("Calories = ",round(sugar.fit$coef[1],3),"+",round(sugar.fit$coef[2],3),"* Sugars"))
How could we find the cereals that are largely driving/influencing the lines?
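One common diagnostic for this question is Cook's distance, which measures how much the fitted values change when a single observation is left out. The sketch below uses the protein fit as an example; the 4/n cutoff is a rule of thumb we are assuming, not a formal test, and the names cd and flagged are ours.

```r
library(MASS)  # provides the UScereal data frame

# Refit with data= so the example does not depend on attach()
pro.fit <- lm(calories ~ protein, data = UScereal)

# One Cook's distance per cereal, named by the row names of UScereal
cd <- cooks.distance(pro.fit)

# Cereals above the rule-of-thumb cutoff of 4/n
flagged <- names(cd)[cd > 4 / nrow(UScereal)]
flagged

# The five largest distances, largest first
head(sort(cd, decreasing = TRUE), 5)
```

Calling plot(pro.fit, which = 4) draws the same distances as a plot, and influence.measures(pro.fit) reports several related diagnostics at once.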
Choose a multiple linear regression model that uses any combination of the above
five variables to predict calories. Why did you choose this model?
full.fit<-lm(calories~protein+fat+fibre+carbo+sugars)
summary(full.fit)
Would you remove any of these variables?
step(full.fit)
Let’s look at what types of cereals get put on each shelf.
Run the gen.sum function on the categorical variable shelf and the matrix of calories,
protein, fat, sodium, fibre, carbo, sugars, and potassium.
gen.sum(shelf,m)
Rerun your scatterplots (without the linear regressions) and label the observations from
each shelf with a different symbol and a different color.
Use a legend to indicate which symbols you’ve chosen.
par(mfrow=c(3,2))
shelf.vec<-c("Shelf 1","Shelf 2","Shelf 3")
col.vec<-c(2,3,4)
pch.vec<-c(16,17,18)
plot(protein, calories,xlab="protein (grams)",ylab="calories",type="n")
points(protein[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
points(protein[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
points(protein[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)
plot(fat, calories,xlab="fat (grams)",ylab="calories",type="n")
points(fat[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
points(fat[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
points(fat[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)
plot(fibre, calories,xlab="fibre (grams)",ylab="calories",type="n")
points(fibre[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
points(fibre[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
points(fibre[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)
plot(carbo, calories,xlab="carbohydrates (grams)",ylab="calories",type="n")
points(carbo[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
points(carbo[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
points(carbo[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)
plot(sugars, calories,xlab="sugars (grams)",ylab="calories",type="n")
points(sugars[shelf==1],calories[shelf==1],col=col.vec[1],pch=pch.vec[1])
points(sugars[shelf==2],calories[shelf==2],col=col.vec[2],pch=pch.vec[2])
points(sugars[shelf==3],calories[shelf==3],col=col.vec[3],pch=pch.vec[3])
legend(locator(1),legend=shelf.vec,col=col.vec,pch=pch.vec,cex=.8)
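Because shelf takes the values 1, 2, and 3, it can index the color and symbol vectors directly, which shortens the five plots to one loop. This is a sketch of that alternative; it places each legend at "topleft" instead of using locator(1) so that it runs without mouse input, and the name vars is ours.

```r
library(MASS)  # provides the UScereal data frame

shelf.vec <- c("Shelf 1", "Shelf 2", "Shelf 3")
col.vec <- c(2, 3, 4)
pch.vec <- c(16, 17, 18)
vars <- c("protein", "fat", "fibre", "carbo", "sugars")

par(mfrow = c(3, 2))
for (v in vars) {
  # col.vec[shelf] picks the color for each cereal from its shelf number;
  # pch.vec[shelf] does the same for the plotting symbol
  plot(UScereal[[v]], UScereal$calories,
       xlab = paste(v, "(grams)"), ylab = "calories",
       col = col.vec[UScereal$shelf], pch = pch.vec[UScereal$shelf])
  legend("topleft", legend = shelf.vec, col = col.vec, pch = pch.vec, cex = 0.8)
}
```

A fixed legend position can overlap points in some panels; locator(1) remains the better choice when working interactively.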
Use logistic regression to predict whether or not a cereal will be on the bottom shelf.
Describe how you chose your model. Find and interpret the odds ratios.
Our response variable is binary: bottom shelf? Yes or No.
bottom.shelf<-ifelse(shelf==1,1,0)
Let’s look at a full fit.
full.fit<-glm(bottom.shelf~mfr+calories+protein+fat+sodium+fibre+carbo+sugars+potassium+vitamins,family=binomial)
logodds<-full.fit$coef
or<-round(exp(logodds),3)
First, note that the General Mills manufacturer has been chosen as the reference group;
all manufacturer odds ratios correspond to comparisons against General Mills.
How would we change this if we wanted a different reference group?
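One way to change the reference group is relevel(), which moves a chosen level to the front of a factor so that R treats it as the baseline. The sketch below makes Kellogg the reference in a deliberately small model; the data frame cereal and the fit fit.k are hypothetical names, and the reduced formula is only for illustration.

```r
library(MASS)  # provides the UScereal data frame

# Copy the data, add the binary response, and make "K" the first level of mfr
cereal <- transform(UScereal,
                    bottom.shelf = ifelse(shelf == 1, 1, 0),
                    mfr = relevel(mfr, ref = "K"))
levels(cereal$mfr)

# Manufacturer coefficients now compare each manufacturer against Kellogg
fit.k <- glm(bottom.shelf ~ mfr + calories, family = binomial, data = cereal)
round(exp(coef(fit.k)), 3)
```

With few cereals for some manufacturers, glm() may warn about fitted probabilities of 0 or 1; that is a sign of sparse groups rather than a problem with relevel().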
What’s going on with those huge odds ratios in the vitamin variables?
The reference group is 100%, which holds only 5 of the 65 cereals. A group that small is a poor baseline for a stable model.
table(vitamins)
We can change it to a reference group of enriched.
vit.100<-ifelse(vitamins=="100%",1,0)
vit.none<-ifelse(vitamins=="none",1,0)
fit2<-glm(bottom.shelf~mfr+calories+protein+fat+sodium+fibre+carbo+sugars+potassium+vit.100+vit.none,family=binomial)
Did this help?
What would you keep? What would you remove?
step(fit2)