Wine Recognition Data - Winona State University

Wine Recognition Data
Data Source:
Forina, M. et al., PARVUS - An Extendible Package for Data Exploration, Classification
and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via
Brigata Salerno, 16147 Genoa, Italy.
Past Usage:
(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data were used along with many other data sets to compare various classifiers. The
classes are separable, though only RDA achieved 100% correct classification (RDA 100%,
QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data)). All results were obtained using
the leave-one-out technique.
(2) S. Aeberhard, D. Coomans and O. de Vel,
The Classification Performance of RDA,
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
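The leave-one-out technique behind the figures in (1) can be illustrated for the 1NN case. The sketch below is only an illustration under stated assumptions: the helper name loo_1nn_accuracy is hypothetical, R's built-in iris data stand in for the wine data, and the z-transform is computed once from all rows rather than refit for each held-out row.

```r
# Leave-one-out accuracy of a 1-nearest-neighbour (1NN) classifier on
# z-transformed predictors. Base R only; the function name is hypothetical.
loo_1nn_accuracy <- function(X, y) {
  Z <- scale(as.matrix(X))            # z-transform each column: (x - mean) / sd
  D <- as.matrix(dist(Z))             # pairwise Euclidean distances between rows
  diag(D) <- Inf                      # a held-out row cannot be its own neighbour
  nearest <- apply(D, 1, which.min)   # nearest remaining row for each held-out row
  mean(y[nearest] == y)               # proportion classified correctly
}

# Stand-in example with R's built-in iris data (NOT the wine data).
acc <- loo_1nn_accuracy(iris[, 1:4], iris$Species)
print(acc)

# With the course data the call would presumably look like this
# (assumes columns 1-13 are the attributes and column 14 is the cultivar):
# loo_1nn_accuracy(Wine[, 1:13], Wine[, 14])
```

Because each row is classified by its nearest neighbour among the other rows, every observation is predicted without having been used as its own reference point, which is the essence of leave-one-out.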
Relevant Information:
These data are the results of a chemical analysis of wines grown in the same region in
Italy but derived from three different cultivars. The analysis determined the quantities of
13 constituents found in each of the three types of wines. The units are not important for
the purposes of this problem as I recommend using the scaled data in your analysis
anyway.
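As a small illustration of why the units drop out once the data are scaled, the sketch below standardizes two made-up variables with R's scale() function and shows that a change of units leaves the z-scores unchanged. The values and the column names a and b are invented for the example and are not taken from the wine data.

```r
# Made-up values for illustration only (not taken from the wine data):
# two variables measured on very different scales.
x <- data.frame(a = c(12.1, 13.4, 14.0, 12.8),
                b = c(520, 1050, 845, 680))

z <- scale(x)                 # subtract column means, divide by column SDs
print(colMeans(z))            # approximately 0 for every column
print(apply(z, 2, sd))        # 1 for every column

# Changing the units of a column (here b / 1000) leaves its z-scores unchanged.
x_rescaled <- transform(x, b = b / 1000)
z_rescaled <- scale(x_rescaled)
print(all.equal(as.vector(z), as.vector(z_rescaled)))
```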
The 13 attributes, followed by the grouping variable, are:
1) Alcohol
2) Malic acid
3) Ash
4) Alcalinity of ash
5) Magnesium
6) Total phenols
7) Flavanoids
8) Nonflavanoid phenols
9) Proanthocyanins
10) Color intensity
11) Hue
12) OD280/OD315 of diluted wines
13) Proline
14) Cultivar/Wine Type (1, 2, or 3): grouping variable
QUESTIONS
a) Assess the normality of these data. You can use JMP (Wine.JMP), R (Wine), and
Arc (Wine.lsp). (3 pts.)
b) Compare the variance-covariance matrices of the three cultivars. Is it appropriate to
assume 1   2   3 for these data? You can use the TestVar(X,grp) command to
test equality of the variance-covariance matrices two cultivars at a time. (2 pts.)
c) Using the data frame Wine in R, perform both linear and quadratic discriminant
analysis to classify cultivar. Reproduce the holdout classification rates reported under
Past Usage above (LDA 98.9%, QDA 99.4%). Also perform cross-validation to examine the
classification performance. Which method, lda or qda, would you recommend, and why? (6 pts.)
d) Use the Discrim function in R to examine the loadings of the linear discriminant
functions and group separation achieved. Which variables appear to have the best
discriminatory ability? Be sure to keep in mind the fact that the variables are on very
different scales. (4 pts.)
e) Use JMP to perform MANOVA and a discriminant analysis for these data. Discuss
the canonical centroid plot obtained from the discriminant analysis. Does it agree with
your conclusions from part (d)? (3 pts.)
f) Perform forward selection (stepwise) discriminant analysis to find a subset of the 13
variables that significantly discriminate the cultivars. Which variables are not included?
(4 pts.)
g) Using only the variables from your forward stepwise selection, use R to conduct and
cross-validate an appropriate discriminant analysis for this subset. Does the
discrimination model with fewer variables cross-validate better? Compare your results to
part (c) above. (4 pts.)
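For the lda and qda fits asked for in part (c), the general workflow can be sketched with the MASS package, which provides lda() and qda() and a CV = TRUE argument for leave-one-out predictions. This is a sketch under assumptions, not a worked solution: the built-in iris data stand in for the Wine data frame, whose column names are not given here, and the accuracy variable names are hypothetical.

```r
library(MASS)  # provides lda() and qda()

# Stand-in data: iris (3 groups, 4 predictors), NOT the wine data.
# With the course data one would presumably swap in the Wine data frame
# and its cultivar column (column names are an assumption).
dat <- iris

# Resubstitution: fit on all rows, then classify those same rows.
fit_lda <- lda(Species ~ ., data = dat)
fit_qda <- qda(Species ~ ., data = dat)
resub_lda <- mean(predict(fit_lda, dat)$class == dat$Species)
resub_qda <- mean(predict(fit_qda, dat)$class == dat$Species)

# Leave-one-out: with CV = TRUE, $class holds the class predicted for each
# row when that row is left out of the fit.
cv_lda <- lda(Species ~ ., data = dat, CV = TRUE)
cv_qda <- qda(Species ~ ., data = dat, CV = TRUE)
loo_lda <- mean(cv_lda$class == dat$Species)
loo_qda <- mean(cv_qda$class == dat$Species)

# Proportion correctly classified under each scheme.
print(round(c(resub_lda = resub_lda, resub_qda = resub_qda,
              loo_lda = loo_lda, loo_qda = loo_qda), 3))

# Leave-one-out confusion matrix for LDA.
print(table(actual = dat$Species, predicted = cv_lda$class))
```

Comparing the resubstitution and leave-one-out proportions side by side shows how optimistic the resubstitution figure is for each method, which is one input to the lda versus qda recommendation.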