Project #1

STAT 873
Fall 2015
There are many types of whisky. Some are more “smoky” in flavor while others may exhibit more floral
taste notes.
Inspired by a recent discussion of whisky categorizations (see
http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html),
the purpose of this and the next project is to apply the multivariate methods discussed in our class to
whisky taste profiles for whiskies produced by distilleries in Scotland. The taste profiles are in the file
whisky.csv which can be obtained from the graded materials web page. Below is a portion of the data:
> whisky <- read.csv(file = "whisky.csv")
> options(width = 60)
> head(whisky, n = 3)
  Dist.Numb Distillery Body Sweetness Smoky Medicinal
1         1  Aberfeldy    2         2     2         0
2         2   Aberlour    3         3     1         0
3         3     AnCnoc    1         3     2         0
  Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
1       0     2     1     2     2     2      2      2
2       0     4     3     2     2     3      3      2
3       0     2     0     0     2     2      3      2
Each of the taste variables (all variables in the data set except for Dist.Numb and Distillery) is
measured on a scale of 0, 1, 2, 3, and 4 to describe the amount of that particular quality that a
whisky exhibits. Additional websites discussing the data are:
http://wonkviz.tumblr.com/post/72400253092/whiskey-data-sleuthing-with-help-from-reddit
https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html
http://blog.revolutionanalytics.com/2014/01/where-the-whisky-flavor-profile-data-camefrom.html
Complete the problems below. Within each part, include your R output, with the corresponding code
shown inside it, and any additional information needed to explain your answer. Your R code and output
should be formatted in the same manner as in the lecture notes.
1) (3 points) Use summary(whisky) to obtain initial numerical summaries of the taste variables.
While the variables are measured on the same scale, describe the differences among them as
shown in these numerical summaries.
2) (5 points) Construct a stars plot of the taste variables using 10 rows of stars and using Dist.Numb
to identify each star. Comment on the following:
a) Are there any unusual whiskies? Explain.
b) Are there any possible groupings of whiskies? Explain.
3) (3 points, extra credit) Create a star plot in R that is exactly the same as what is given at
http://i.imgur.com/1fh6eyc.png.
4) (14 total points) Complete the following relative to parallel coordinate plots.
a) (3 points) Construct a parallel coordinate plot of the taste variables using the original data values
and parcoord(). Identify the main problem with using this type of plot with the data.
b) (3 points) Run the following code and comment on what each line does.
total.values <- nrow(whisky) * (ncol(whisky) - 2)
set.seed(8910)
x <- matrix(data = runif(n = total.values, min = -0.3, max = 0.3),
  nrow = nrow(whisky), ncol = ncol(whisky) - 2)
whisky2 <- whisky[,-c(1:2)] + x
whisky2$Dist.Numb <- whisky$Dist.Numb
head(whisky2)
c) (5 points) Construct a new parallel coordinate plot of the taste variables using parcoord() and
the modified data from b). Are there any unusual whiskies? Are there any possible groupings of
whiskies? Explain.
d) (3 points) Construct another parallel coordinate plot of the taste variables using ipcp() in the
iplots package. Depending on the version of Java you have on your computer, this package may
not work initially in R due to 32-bit and 64-bit issues. If you encounter problems, I recommend
using the other “bit-version” of R (32-bit rather than 64-bit, or vice versa) instead of the one you
normally use. In your plot, highlight the whiskies that appear to have the 4 largest smoky taste
scores. Do these whiskies share any characteristics with respect to the other variables? Explain and
state their observation numbers (or whisky names).
It can be difficult to determine an exact observation number from a parallel coordinate plot itself.
One way outside of the plot to do this is through sorting the data by a particular variable and then
displaying the extreme values. For example, the four highest smoky whiskies are found with
tail(whisky2[order(whisky2$Smoky),], n = 4)
tail(whisky[order(whisky$Smoky),], n = 4)
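The same sorting idea can be sketched on a small made-up data set. The data frame toy and its columns below are hypothetical and used only for illustration; they are not part of whisky.csv.

```r
# Hypothetical toy data (NOT the whisky data): five items with one score each
toy <- data.frame(id = 1:5, score = c(2, 4, 0, 3, 1))

# order() returns the row indices that arrange score from smallest to largest
sorted <- toy[order(toy$score), ]

# tail() keeps the last n rows, which are the n largest scores after sorting
top2 <- tail(sorted, n = 2)
top2
```

Note that order() leaves tied values in their original row order. Because the taste scores are integers from 0 to 4, ties are common, so which tied whiskies land in the last rows may depend on row order alone.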
5) (28 total points) Complete the following using PCA for the taste variables and their original values.
a) (2 points) What is the main reason why someone would want to use the covariance matrix rather
than the correlation matrix here?
b) (2 points) What is the main reason why someone would want to use the correlation matrix rather
than the covariance matrix here?
For the remainder of this problem, use the correlation matrix.
c) (5 points) Determine an appropriate number of PCs using the tools discussed in class. Fully
justify your answer.
d) (4 points) State in equation form the first two PCs chosen from part c). Interpret these PCs.
e) (4 points) Show how the first PC score for the first observation is found using your own matrix
algebra in R. Use predict() or the scores component from princomp() to check your
answer.
f) (4 points) Construct a scatter plot of the first three PCs. Are there any unusual whiskies? Are
there any possible groupings of whiskies? Explain and state these observation numbers (or
whisky names). Note that the text3d() function can help you include observation numbers on
the plot.
g) (4 points) Plot the scores for the first 5 PCs using a parallel coordinate plot. Are there any unusual
whiskies? Are there any possible groupings of whiskies? Explain and state these observation
numbers (or whisky names).
h) (3 points) Luba Gloukhov wrote the original blog post cited at the beginning of this project. She
indicated that she likes full bodied, smoky, and medicinal types of whisky (which means “high”
scores for the corresponding taste variables). What portions of the plots in parts f) and g)
would Luba want to look for whiskies that she would like? Explain.
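For part e), the matrix-algebra calculation of a PC score can be sketched on a different data set. The example below is only an illustration under stated assumptions: it uses the built-in USArrests data rather than whisky.csv, and the object names (X, a1, z1, score1) are arbitrary. It assumes a correlation-matrix PCA, where princomp() standardizes each variable with a standard deviation that uses divisor n rather than n - 1.

```r
# Illustration with the built-in USArrests data (NOT the whisky data)
X <- as.matrix(USArrests)
n <- nrow(X)

# First eigenvector of the correlation matrix = coefficients of PC 1
a1 <- eigen(cor(X))$vectors[, 1]

# Standardize the first observation; princomp() uses divisor n for the variance
xbar <- colMeans(X)
s.n <- sqrt(colSums(sweep(X, 2, xbar)^2) / n)
z1 <- (X[1, ] - xbar) / s.n

# PC 1 score for observation 1 is the linear combination a1'z1
score1 <- sum(a1 * z1)

# Check against princomp(); eigenvector signs are arbitrary, so compare magnitudes
pca <- princomp(x = X, cor = TRUE)
all.equal(abs(score1), abs(unname(pca$scores[1, 1])))
```

The comparison uses absolute values because an eigenvector multiplied by -1 is still an eigenvector, so eigen() and princomp() may report opposite signs for the same PC.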
I recommend that you explore this data further with other types of plots and PCA methods to help
prepare for Test #2.