GDA of zagat (from smss) - Graphical Data Analysis with R

advertisement
GDA of zagat (from smss)
Background Zagat ratings of Italian restaurants in Boston, London, and New York.
Aims Are ratings in the three cities different? What patterns of ratings are there? How are ratings on
different criteria related?
Source Agresti and Finlay’s “Statistical Methods for the Social Sciences” (4th edition)
Structure 193 observations on 6 variables (4 numeric, 1 name, 1 factor)
Zagat ratings use four criteria (Food, Decor, Service, all reported on a scale of 0 to 30, and the Cost of an
average meal in dollars). The numbers of restaurants in each city and the Cost distributions are shown below:
library(gridExtra)
data(zagat, package="smss")
a1 <- ggplot(zagat, aes(City)) + geom_bar() + ylab("") + xlab("") +
ggtitle("Numbers of restaurants rated")
a2 <- ggplot(zagat, aes(City, Cost)) + geom_boxplot() + ylab("") + xlab("") +
ggtitle("Average cost of a meal") + ylim(0, 150)
grid.arrange(a1, a2, nrow=1)
Numbers of restaurants rated
Average cost of a meal
150
80
60
100
40
50
20
0
0
Boston
London
NY
Boston
London
NY
Fig 1: More Italian restaurants were rated in London than in either of New York or Boston. The meals were
most expensive in London. (The dates the ratings were made and the exchange rates used are not given.)
The distributions of the three other criteria are as follows:
boxplot(zagat[, 3:5], pch=16, horizontal = TRUE, ylim=c(0,30), las=1)
Service
Decor
Food
0
5
10
15
20
25
30
Fig 2: Boxplots of the ratings on the three criteria other than Cost. The low outlier on Food looks suspicious
and that restaurant will be left out of further analyses.
1
library(dplyr)
zagatX <- zagat %>% filter(Food > 2)
For a small dataset like this a lot can be shown in a scatterplot matrix display.
library(GGally)
ggpairs(zagatX, columns=c(1, 3:6), upper = list(continuous = "cor",
combo = "facetdensity"), lower = list(continuous = "points", combo = "box"),
diag=list(continuous = "density", discrete="bar"))
Boston
80
London
City
60
40
20
NY
0
Food
24
Corr:
0.436
20
Corr:
0.687
Corr:
0.0869
Corr:
0.64
Corr:
0.373
16
25
Decor
20
15
10
5
Service
24
Corr:
0.282
20
16
12
Cost
100
50
16
City
20
Food
24
5
10
15
Decor
20
25 12
16
20
Service
24
50
100
Cost
Fig 3: Food, Decor, and Service are fairly positively correlated. The distributions of these three criteria do
not appear to vary much by city, but there are significant differences between some of the means as the next
figure shows.
2
One way to get confidence intervals for each of the means is to fit linear models without an intercept and
plot the resulting coefficients with intervals depending on their standard errors. This does assume that the
variability is the same for each group.
library(coefplot)
library(gridExtra)
m1 <- lm(data=zagatX, Food~0 + City)
m2 <- lm(data=zagatX, Decor~0 + City)
m3 <- lm(data=zagatX, Service~0 + City)
g1 <- coefplot(m1, predictors="City", lwdOuter=0.5, title=
"CIs for average Food scores") + xlab("") + ylab("") + xlim(13,23) + coord_flip()
g2 <- coefplot(m2, predictors="City", lwdOuter=0.5, title=
"CIs for average Decor scores") + xlab("") + ylab("") + xlim(13,23) + coord_flip()
g3 <- coefplot(m3, predictors="City", lwdOuter=0.5, title=
"CIs for average Service scores") + xlab("") + ylab("") + xlim(13,23) + coord_flip()
grid.arrange(g1, g2, g3, nrow=1)
CIs for average Food scores
CIs for average Decor scores
CIs for average Service scores
22.5
22.5
22.5
20.0
20.0
20.0
17.5
17.5
17.5
15.0
15.0
15.0
12.5
CityBoston
CityLondon
CityNY
12.5
CityBoston
CityLondon
CityNY
12.5
CityBoston
CityLondon
CityNY
Fig 4: Confidence intervals for the mean ratings with widths equal to two standard errors. The average
ratings are highest on Food and lowest on Decor. Boston has the highest average scores on all three criteria
and is significantly better than New York on all and significantly better than London on Food.
These statements are apparent from the graphics (especially when you zoom in) and they can be confirmed
by looking at summaries of the linear models.
3
For many people the relationship of Food to Cost is of most interest. Since London is much more expensive
than the other two cities, separate highlighted scatterplots have been drawn.
library(extracat)
library(scales)
facetshade(data = zagatX, aes(x = Cost, y = Food), f = .~City) +
geom_point(colour = alpha("black", 0.05)) + geom_point(data = zagatX,
colour = "red") + facet_wrap(f=~City, nrow=1) + theme(legend.position="none")
Boston
London
NY
Food
24
20
16
50
100
50
100
50
100
Cost
Fig 5: Scatterplots of Food against Cost for the three cities. There is some evidence of increasing food scores
with increasing cost, especially for London, but there is considerable variability.
Fitting spline models to each City’s data gives the following display:
ggplot(zagatX, aes(Cost, Food)) + geom_point() + facet_wrap(~City) + geom_smooth(method="gam")
Boston
London
NY
Food
24
20
16
50
100
50
100
50
100
Cost
Fig 6: Scatterplots of Food against Cost for the three cities with spline smooths added. Nonlinear models do
not seem to be necessary, but none of the three fits is very good.
The dataset is used in three exercises in Agresti and Finlay’s book.
4
Download