Project #1
STAT 873
Fall 2015

There are many types of whisky. Some are more “smoky” in flavor, while others exhibit more floral taste notes. Inspired by a recent discussion of whisky categorizations (see http://blog.revolutionanalytics.com/2013/12/k-means-clustering-86-single-malt-scotch-whiskies.html), the purpose of this and the next project is to apply the multivariate methods discussed in our class to taste profiles for whiskies produced by distilleries in Scotland. The taste profiles are in the file whisky.csv, which can be obtained from the graded materials web page. Below is a portion of the data:

    > whisky <- read.csv(file = "whisky.csv")
    > options(width = 60)
    > head(whisky, n = 3)
      Dist.Numb Distillery Body Sweetness Smoky Medicinal
    1         1  Aberfeldy    2         2     2         0
    2         2   Aberlour    3         3     1         0
    3         3     AnCnoc    1         3     2         0
      Tobacco Honey Spicy Winey Nutty Malty Fruity Floral
    1       0     2     1     2     2     2      2      2
    2       0     4     3     2     2     3      3      2
    3       0     2     0     0     2     2      3      2

Each of the taste variables (all variables in the data set except for Dist.Numb and Distillery) is measured on a scale of 0, 1, 2, 3, and 4 to describe the amount of that particular quality that a whisky exhibits. Additional websites discussing the data are

    http://wonkviz.tumblr.com/post/72400253092/whiskey-data-sleuthing-with-help-from-reddit
    https://www.mathstat.strath.ac.uk/outreach/nessie/nessie_whisky.html
    http://blog.revolutionanalytics.com/2014/01/where-the-whisky-flavor-profile-data-came-from.html

Complete the problems below. Within each part, include your R output with the code inside of it and any additional information needed to explain your answer. Your R code and output should be formatted in the same manner as in the lecture notes.

1) (3 points) Use summary(whisky) to obtain initial numerical summaries of the taste variables. While the variables are measured on the same scale, describe the differences among them as shown in these numerical summaries.

2) (5 points) Construct a stars plot of the taste variables using 10 rows of stars and using Dist.Numb to identify each star. Comment on the following:
   a) Are there any unusual whiskies? Explain.
   b) Are there any possible groupings of whiskies? Explain.

3) (3 points, extra credit) Create a star plot in R that is exactly the same as what is given at http://i.imgur.com/1fh6eyc.png.

4) (14 total points) Complete the following relative to parallel coordinate plots.
   a) (3 points) Construct a parallel coordinate plot of the taste variables using the original data values and parcoord(). Identify the main problem with using this type of plot with the data.
   b) (3 points) Run the following code and comment on what each line does.

        total.values <- nrow(whisky) * (ncol(whisky) - 2)
        set.seed(8910)
        x <- matrix(data = runif(n = total.values, min = -0.3, max = 0.3),
          nrow = nrow(whisky), ncol = (ncol(whisky) - 2))
        whisky2 <- whisky[,-c(1:2)] + x
        whisky2$Dist.Numb <- whisky$Dist.Numb
        head(whisky2)

   c) (5 points) Construct a new parallel coordinate plot of the taste variables using parcoord() and the modified data from b). Are there any unusual whiskies? Are there any possible groupings of whiskies? Explain.
   d) (3 points) Construct another parallel coordinate plot of the taste variables using ipcp() in the iplots package. Depending on the version of Java you have on your computer, this package may not work initially in R due to 32-bit and 64-bit issues. If you encounter problems, I recommend that you use the alternative “bit-version” of R (32-bit rather than 64-bit, or vice versa) instead of the one you normally use. In your plot, highlight the whiskies that appear to have the 4 largest smoky taste scores. Are there any similar characteristics among these whiskies with respect to the other variables? Explain and state these observation numbers (or whisky names). It can be difficult to determine an exact observation number from a parallel coordinate plot itself.
      One way to do this outside of the plot is to sort the data by a particular variable and then display the extreme values. For example, the four highest smoky whiskies are found with

        tail(whisky2[order(whisky2$Smoky),], n = 4)
        tail(whisky[order(whisky$Smoky),], n = 4)

5) (28 total points) Complete the following using PCA for the taste variables and their original values.
   a) (2 points) What is the main reason why someone would want to use the covariance matrix rather than the correlation matrix here?
   b) (2 points) What is the main reason why someone would want to use the correlation matrix rather than the covariance matrix here?

   For the remainder of this problem, use the correlation matrix.

   c) (5 points) Determine an appropriate number of PCs using the tools discussed in class. Fully justify your answer.
   d) (4 points) State in equation form the first two PCs chosen from part c). Interpret these PCs.
   e) (4 points) Show how the first PC score for the first observation is found using your own matrix algebra in R. Use predict() or the scores component from princomp() to check your answer.
   f) (4 points) Construct a scatter plot of the first three PCs. Are there any unusual whiskies? Are there any possible groupings of whiskies? Explain and state these observation numbers (or whisky names). Note that the text3d() function can help you include observation numbers on the plot.
   g) (4 points) Plot the scores for the first 5 PCs using a parallel coordinate plot. Are there any unusual whiskies? Are there any possible groupings of whiskies? Explain and state these observation numbers (or whisky names).
   h) (3 points) Luba Gloukhov wrote the original blog post cited at the beginning of this project. She indicated that she likes full-bodied, smoky, and medicinal types of whisky (which means “high” scores for the corresponding taste variables). In which portions of the plots in parts f) and g) would Luba want to look to find whiskies that she would like? Explain.
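
As an illustration of the kind of check described in part 5e), the following is a minimal sketch that uses simulated 0 to 4 scores rather than whisky.csv. The simulated data, the choice of four variables, and object names such as taste, z1, and score1.manual are assumptions made for this example only; it is not a solution for the whisky data.

```r
# Simulated stand-in for taste variables: 20 observations, 4 variables,
# each scored 0 to 4 (hypothetical data; whisky.csv is not used here)
set.seed(1234)
taste <- as.data.frame(matrix(data = sample(x = 0:4, size = 20 * 4,
  replace = TRUE), nrow = 20, ncol = 4))
names(taste) <- c("Body", "Sweetness", "Smoky", "Medicinal")

# PCA using the correlation matrix
pca.cor <- princomp(x = taste, cor = TRUE)

# Standardize observation 1 the same way princomp() does: subtract the
# mean and divide by the standard deviation computed with divisor n
n <- nrow(taste)
x.bar <- colMeans(taste)
s.n <- sqrt(apply(X = taste, MARGIN = 2, FUN = var) * (n - 1) / n)
z1 <- (unlist(taste[1, ]) - x.bar) / s.n

# First PC score for observation 1: a1' z1, where a1 is the first
# eigenvector (first column of the loadings)
a1 <- unclass(pca.cor$loadings)[, 1]
score1.manual <- as.numeric(t(a1) %*% z1)

# Compare with the scores component and with predict()
score1.manual
pca.cor$scores[1, 1]
predict(pca.cor)[1, 1]
```

Because princomp() standardizes with divisor n, replacing s.n with the usual sd() values (divisor n - 1) would give a slightly different score. The eigenvector is taken from the princomp() output here so that its sign matches the reported scores; an eigenvector obtained separately from eigen() may differ by a factor of -1.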

I recommend that you explore this data further with other types of plots and PCA methods to help prepare for Test #2.
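
One possible starting point for further exploration is to tabulate the eigenvalues of the correlation matrix alongside the proportion of total variance that each PC explains. The sketch below assumes simulated 0 to 4 scores and hypothetical names (x, lambda, prop); it only shows the mechanics and may not match every tool covered in class.

```r
# Simulated 0 to 4 scores for 30 observations on 5 variables
# (hypothetical data; not the whisky data)
set.seed(5678)
x <- matrix(data = sample(x = 0:4, size = 30 * 5, replace = TRUE),
  nrow = 30, ncol = 5)
colnames(x) <- paste0("V", 1:5)

# PCA using the correlation matrix
pca.cor <- princomp(x = x, cor = TRUE)

# The squared standard deviations of the PCs are the eigenvalues of
# the correlation matrix
lambda <- pca.cor$sdev^2
prop <- lambda / sum(lambda)
data.frame(eigenvalue = lambda, proportion = prop,
  cumulative = cumsum(prop))

# Scree plot of the eigenvalues
screeplot(pca.cor, type = "lines", main = "Scree plot (simulated data)")
```

With the correlation matrix, the eigenvalues sum to the number of variables (5 here), so each proportion is the eigenvalue divided by 5.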