Stevens Institute of Technology


Knowledge Discovery in Databases

MIS 637

Professor Mahmoud Daneshmand

Fall 2012

Final Project: Red Wine Recipe Data Mining

By Jorge Madrazo

Profound Questions

• Which basic properties make up the formula for a good wine?

– Wine making is believed to be an art. But is there a formula for a quality wine?

– The provider of the data set published a paper, “Modeling wine preferences by data mining from physicochemical properties.” How do my results compare with the paper’s?

Procedure

• Follow a data mining process

• Use SAS and SAS Enterprise Miner to execute the process

• The SAS Enterprise Miner tool is modeled on SEMMA, the data mining process defined by SAS Institute: Sample, Explore, Modify, Model, Assess

• SEMMA is similar to the CRISP-DM process

Sample

• 1,599 records

• Set up a data partition

– Training 40%

– Validation 30%

– Test 30%
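The 40/30/30 partition can also be sketched outside SAS Enterprise Miner. The following is a minimal Python sketch, assuming simple random (non-stratified) sampling; the seed and the rounding rule are assumptions for illustration, and the Data Partition node may split the records differently.

```python
import random

def partition_indices(n_records, train_frac=0.4, valid_frac=0.3, seed=637):
    """Shuffle record indices and split them into train/validation/test.

    The seed and the rounding rule are assumptions for illustration,
    not settings taken from the SAS Enterprise Miner project.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = round(n_records * train_frac)
    n_valid = round(n_records * valid_frac)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

train, valid, test = partition_indices(1599)
print(len(train), len(valid), len(test))
```

Every record lands in exactly one of the three sets, so the test set stays unseen while the model is trained and tuned.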

Explore: Data Background

• Data source

– UCI Machine Learning Repository.

• Wine Quality Data Set.

– There are red and white wine data sets. I focused on the red wine set only.

– There are 11 input variables and one target variable.

» fixed acidity

» volatile acidity

» citric acid

» residual sugar

» chlorides

» free sulfur dioxide

» total sulfur dioxide

» density

» pH

» sulphates

» alcohol

» Output variable (based on sensory data): quality (score between 0 and 10)

Explore: Target=Quality

• Quality

– People gave a quality assessment of different wines on a scale of 0-10. Actual range 3-8.

– An ordinal target

Explore: Inputs

• Correlation Analysis

– Some correlation, but not enough to discard inputs

ods graphics on;

ods select MatrixPlot;

proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);

var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;

run;
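For readers without SAS, the statistic behind each cell of the correlation matrix is the Pearson coefficient. A minimal Python sketch follows; the sample values are hypothetical toy numbers, not rows from the wine data set.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not rows from the wine data set.
alcohol = [9.4, 9.8, 10.5, 11.2, 12.0]
quality = [5, 5, 6, 6, 7]
print(round(pearson_r(alcohol, quality), 3))
```

A coefficient near +1 or -1 between two inputs would suggest that one of them is redundant; values of moderate size, as reported here, do not justify discarding an input.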

Explore: Correlation Graphs

Explore: Chi-Square Statistics of Inputs

Explore: Worth of Inputs

Explore: Worth Graph

• The Worth tracks closely with the Chi-Square statistic

Modify

• At this stage, no modifications are done

Model: Selection

• Because I want to list the important elements in what is considered a quality wine, I choose a Decision Tree

• Configuration

– The Splitting Rule is Entropy

– Maximum Branch is set to 5

• With entropy-based splitting and multiway branches, this amounts to a C4.5 type of algorithm
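To make the entropy splitting rule concrete, here is a minimal Python sketch of how a candidate split is scored. It is an illustration under assumptions, not the SAS Enterprise Miner implementation: C4.5 ranks splits by gain ratio, while the tool's entropy criterion may differ in its details, and the labels below are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(parent, branches):
    """Entropy reduction from splitting `parent` into `branches`."""
    total = len(parent)
    weighted = sum(len(b) / total * entropy(b) for b in branches)
    return entropy(parent) - weighted

def gain_ratio(parent, branches):
    """C4.5-style gain ratio: information gain divided by split information."""
    total = len(parent)
    split_info = -sum(len(b) / total * math.log2(len(b) / total)
                      for b in branches if b)
    return information_gain(parent, branches) / split_info if split_info else 0.0

# Hypothetical quality scores split into two branches by some input threshold.
parent = [5, 5, 5, 6, 6, 6, 7, 7]
branches = [[5, 5, 5, 6], [6, 6, 7, 7]]
print(information_gain(parent, branches), gain_ratio(parent, branches))
```

With up to 5 branches per split and six observed quality levels, many candidate splits yield some entropy reduction, which is one reason a multiclass tree can grow bushy.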

Assess: Initial Results

• A bushy tree results. The resulting tree is too intricate for a simple recommendation.

– Over 20 Leaf nodes.

Modify: Target

• Change the target so that it becomes binary.

• Add a new variable to the model called isGood. Any rating over 6 is categorized as isGood.

– SAS Code:

data wino.xx;
  set wino.red;
  /* flag wines rated above 6 as good */
  if (quality > 6) then isgood = 1;
  else isgood = 0;
run;

/* list the new data set to check the derived target */
proc print data=wino.xx;
  title 'xx';
run;

Explore: Target = isGood

Model Strategy for isGood

• Model with a Decision Tree in the hope of more descriptive results.

• Also model with a Neural Network to aid in assessment and for comparison

Model: Decision Tree

• ProbF splitting criterion at a significance level of 0.2

• Maximum Branch size = 5

Assess: Decision Tree Results

• Much simpler Tree

Assess: Decision Tree Results 2

• Leaf Statistics

Assess: Variable Importance

Variable Name         Number of Splitting Rules   Number of Surrogate Rules   Importance    Validation Importance   Ratio of Validation to Training Importance
alcohol               1                           0                           1             1                       1
density               0                           1                           0.77055175    0.77055175              1
volatile_acidity      0                           1                           0.728868987   0.728868987             1
sulphates             1                           0                           0.671675628   0.477710505             0.711222032
fixed_acidity         0                           1                           0.553719729   0.393817671             0.711222032
citric_acid           0                           1                           0.549750361   0.390994569             0.711222032
free_sulfur_dioxide   0                           0                           0             0                       NaN
pH                    0                           0                           0             0                       NaN
chlorides             0                           0                           0             0                       NaN
total_sulfur_dioxide  0                           0                           0             0                       NaN
residual_sugar        0                           0                           0             0                       NaN

Event Classification Table

Target=isgood

Data Role   False Negative   True Negative   False Positive   True Positive
TRAIN       53               539             14               34
VALIDATE    43               403             12               21
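The counts in the Event Classification Table can be turned into the usual summary rates. The following Python sketch uses only the counts reported above; the function and variable names are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive standard rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),          # sensitivity, true positive rate
        "specificity": tn / (tn + fp),     # true negative rate
    }

# Counts copied from the Event Classification Table.
train = classification_metrics(tp=34, fp=14, tn=539, fn=53)
validate = classification_metrics(tp=21, fp=12, tn=403, fn=43)

for role, metrics in (("TRAIN", train), ("VALIDATE", validate)):
    print(role, {name: round(value, 3) for name, value in metrics.items()})
```

On the validation data the tree classifies 424 of 479 wines correctly (about 88.5%), but it finds only 21 of the 64 good wines (recall of about 32.8%). Because good wines are rare, accuracy alone overstates how well the tree identifies them.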

Model: Neural Network

• Positive – better at predicting

• Negative – hard to interpret the model

• Configured with 3 Hidden Nodes

Modify: Input Variables to NN

• Because of the complexity of the NN, it is recommended to prune variables prior to running the network.

Modify: R-Square Filter

Variable Name         Role       Measurement Level   Reason for Rejection
alcohol               INPUT      INTERVAL
chlorides             INPUT      INTERVAL
citric_acid           REJECTED   INTERVAL            Varsel:Small R-square value
density               INPUT      INTERVAL
fixed_acidity         INPUT      INTERVAL
free_sulfur_dioxide   INPUT      INTERVAL
pH                    REJECTED   INTERVAL            Varsel:Small R-square value
residual_sugar        REJECTED   INTERVAL            Varsel:Small R-square value
sulphates             INPUT      INTERVAL
total_sulfur_dioxide  REJECTED   INTERVAL            Varsel:Small R-square value
volatile_acidity      INPUT      INTERVAL

Model: NN

• Specify 3 Hidden Units in the Hidden Layer

Assess: NN Results

• Hard to interpret results to formulate a recipe

The NEURAL Procedure

Optimization Results

Parameter Estimates

N   Parameter                 Estimate    Gradient Objective Function
1   alcohol_H11                3.679818   -0.001411
2   chlorides_H11              0.520190   -0.000479
3   density_H11               -2.171623    0.000883
4   fixed_acidity_H11         -0.055929    0.000179
5   free_sulfur_dioxide_H11    0.403412    0.000139
6   sulphates_H11             -4.954290   -0.000224
7   volatile_acidity_H11       2.686209    0.000205
8   alcohol_H12               -0.313005    0.001209
9   chlorides_H12              0.200973    0.000759
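A parameter such as alcohol_H11 is the weight on the connection from the alcohol input to hidden unit H11. The Python sketch below shows how the seven listed H11 weights would combine into one hidden-unit value. It rests on assumptions that the slides do not state: a tanh hidden activation, standardized inputs, and a bias term that does not appear in the excerpt and is set to a placeholder of 0.0.

```python
import math

# Input-to-H11 weights copied from the Parameter Estimates listing.
H11_WEIGHTS = {
    "alcohol": 3.679818,
    "chlorides": 0.520190,
    "density": -2.171623,
    "fixed_acidity": -0.055929,
    "free_sulfur_dioxide": 0.403412,
    "sulphates": -4.954290,
    "volatile_acidity": 2.686209,
}

def hidden_unit_activation(inputs, weights=H11_WEIGHTS, bias=0.0):
    """tanh of the weighted input sum; the bias value is a placeholder."""
    net = bias + sum(weights[name] * inputs[name] for name in weights)
    return math.tanh(net)

# Hypothetical standardized inputs (z-scores), not a real wine record.
sample = {name: 0.0 for name in H11_WEIGHTS}
sample["alcohol"] = 1.0  # one standard deviation above the mean
print(hidden_unit_activation(sample))
```

Each hidden unit mixes all seven inputs, and the prediction in turn mixes the three hidden units, which is why no single weight reads as a recipe instruction.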

Assess: Comparative Results

• Receiver Operating Characteristics (ROC) Chart for NN vs Decision Tree

Assess: Comparative Results

• Cumulative Lift for NN vs Decision Tree
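Cumulative lift compares the share of good wines among the top-scored cases with the share in the whole sample. A minimal Python sketch of that calculation follows; the scores and labels are hypothetical, not output from either model.

```python
def cumulative_lift(scores, labels, fraction):
    """Lift of the top `fraction` of cases ranked by predicted score.

    Lift = (event rate among the top-ranked cases) / (overall event rate).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    overall_rate = sum(labels) / len(labels)
    return top_rate / overall_rate

# Hypothetical scores and isGood labels for ten wines.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cumulative_lift(scores, labels, 0.2))
```

A lift above 1 at a given depth means the model ranks good wines ahead of a random ordering; at a depth of 100% the lift is always 1.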

Assess: Comparison with Reference Paper

• Used R-Miner

• Support Vector Machine (SVM) and Neural Network used

• The authors applied techniques to extract the relative importance of variables

• They attempted to predict every quality level

• They noted the importance of alcohol and sulphates. “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”

Assess: Paper Variable Importance

Overall Project in SAS EM

References

• UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Wine

• P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

• Modeling wine preferences by data mining from physicochemical properties, Paulo Cortez et al. http://www3.dsi.uminho.pt/pcortez/wine5.pdf
