Model Answers
Assignment 4
STATS 747, 2nd Semester 2002
1) Your client owns a chain of cafés and wants to understand what is important to their customers. They have
carried out a survey using the questionnaire available online at
http://www.stat.auckland.ac.nz/~reilly/cafeqns.pdf. The resulting data is available online at
http://www.stat.auckland.ac.nz/~reilly/cafe.csv. A wide range of cross-tabulations have already been carried out,
and now the client wants you to investigate whether there are segments of their customers that have different
feelings about what is important.
i) Explore the data, and in particular the questions about how important various service attributes are. Describe
any modifications to the data that would be appropriate before carrying out the cluster analyses outlined below,
and modify the data accordingly.
Answer:
[Details of data exploration omitted.]
All the importance questions (from Q6) are measured on the same five-point scale, so standardisation is not
required. However, there is a large proportion of missing data (code 8 = “Don’t Know”), which would bias the
results if list-wise deletion were used (or if the “Don’t Know”s were left coded as 8).
For example, more than 15% of customers said “Don’t Know” about the importance of convenient hours, as
shown below. All of the items in question 6 had more than 10% “Don’t Know”s.
6(a) Convenient Hours

q6a_1    Frequency    Percent    Cumulative Frequency    Cumulative Percent
1            14          1.72              14                    1.72
2            44          5.41              58                    7.13
3           165         20.27             223                   27.40
4           261         32.06             484                   59.46
5           193         23.71             677                   83.17
8           137         16.83             814                  100.00
Imputation should therefore be carried out before the segmentation analyses. Mean or median imputation is a
possibility, but it would distort the distribution of responses and could bias the results. Hot-deck imputation
would be a better choice, as would imputing the mean plus a random error; the latter approach is used here (see
the sketch below and the SAS code at the end of this document).
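For illustration, the mean-plus-random-error scheme can be sketched in a few lines of R (the imputation for this
assignment was actually done in SAS; the file name and random seed below are borrowed from that SAS code,
and everything else is illustrative):

# Mean-plus-random-error imputation for the Q6 items.
# A missing value is replaced by the item mean plus a deviation drawn
# at random from the observed deviations for that item.
impute_mean_plus_error <- function(x) {
  obs  <- x[!is.na(x)]
  dev  <- obs - mean(obs)          # observed deviations from the item mean
  miss <- is.na(x)
  x[miss] <- mean(obs) + sample(dev, sum(miss), replace = TRUE)
  x
}

cafe <- read.csv("cafe.csv")
q6 <- cafe[, paste0("q6a_", 1:10)]
q6[q6 == 8] <- NA                  # code 8 = "Don't Know" -> missing
set.seed(1394881787)               # the seed used in the SAS version
q6_imp <- as.data.frame(lapply(q6, impute_mean_plus_error))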
A factor analysis might also be a useful preparatory step if a small number of factors could be identified that
summarised the data well. Three factors have eigenvalues above or near 1, but these factors account for less than
half the variation amongst the Q6 variables, so much of the information would be lost by using just these
factors. If all ten factors were used as input to the cluster analyses, factor 1 (the most important component)
would be given equal weight to factor 10 (which probably represents random noise), and this could drown out
important patterns in the data. For these reasons, using factors does not seem wise in this situation. [However I
have not taken marks off for doing this.] The eigenvalue check is sketched below.
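The eigenvalue check can be reproduced directly from the correlation matrix of the imputed items; a minimal
sketch, reusing q6_imp from the previous sketch:

# Eigenvalues of the correlation matrix of the ten importance items.
ev <- eigen(cor(q6_imp))$values
ev                     # how many eigenvalues are above or near 1?
cumsum(ev) / sum(ev)   # cumulative proportion of variance explained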
In case response effects were present, row means were subtracted from the imputed data (a one-line operation,
sketched below), and the k-means clusters for this dataset were compared with those based on the original
imputed data. However, there was only a small gain in relative R-squared from doing this, so for ease of
interpretation the final segmentations are based on the original imputed data.
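Row-mean centring of this kind is a one-line operation; a sketch, again reusing q6_imp:

# Subtract each respondent's own mean rating, so that clusters reflect the
# pattern of importance ratings rather than overall scale usage.
q6_centred <- sweep(as.matrix(q6_imp), 1, rowMeans(q6_imp))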
ii) Conduct a k-means cluster analysis of the attribute importance questions. Obtain a 3 cluster solution.
Answer:
Twenty k-means analyses producing three clusters were run, each with different randomly chosen cluster seeds,
to avoid local minima. The cluster solution with the lowest criterion value was chosen; an R equivalent is
sketched below, and the SAS code used appears at the end of this document.
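In R the same multiple-random-starts strategy is built into kmeans() via the nstart argument, which keeps the
solution with the lowest total within-cluster sum of squares; a sketch (the analysis here used PROC FASTCLUS):

set.seed(116298123)      # the winning FASTCLUS seed, reused here arbitrarily
km <- kmeans(q6_imp, centers = 3, nstart = 20)   # 20 random starts, best kept
km$centers               # cluster means on the ten items
km$size / nrow(q6_imp)   # segment shares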
iii) Fit a normal mixture model with 3 components to the same data, using Weka’s EM clustering algorithm.
Answer:
The EM algorithm was run 20 times on the imputed data, looking for three latent classes, and using a different
seed each time. The likelihood was maximised when the seed was 1721.
The following command was used to produce cluster membership probabilities:
java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -S 1721
which were then read into SAS, and used to produce demographic profiles.
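For comparison, a three-component normal mixture with diagonal covariance matrices (matching the
within-cluster independence assumed by Weka's EM clusterer) could also be fitted in R with the mclust package;
this is an alternative sketch, not what was used here:

library(mclust)
em_fit <- Mclust(q6_imp, G = 3, modelNames = "VVI")  # 3 components, diagonal covariances
summary(em_fit)
head(em_fit$z)   # posterior membership probabilities, analogous to segprob1-segprob3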
iv) Describe the market segments resulting from k-means in terms of the input variables and demographic
characteristics. Describe the market segments from the EM algorithm in terms of the input variables. How do
the results differ?
For extra credit: also describe the EM segments in terms of demographic characteristics, and compare with the
k-means results.
Show your work, including relevant program code and output. Be sure to avoid local minima and maxima in
your analyses.
Answer:
The three clusters from the k-means algorithm comprise 43%, 33% and 24% of the customers respectively.
People in k-means cluster 1 feel that “Value for Money” is important, along with “Availability of Nutritional
Information” and “Convenient Hours”. “Employee Friendliness” and “Appearance of Food” are less important
to them.
Members of k-means cluster 2 feel that “Value for Money” is even more important (on average), as well as
“Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Speed of Service” as important.
Cluster 3 holds customers who believe “Employee Friendliness” and “Appearance of Food” are important, while
“Value for Money” and “Cleanliness of the Facility” are not important to them.
k-means Cluster Means

Cluster   q6a_1   q6a_2   q6a_3   q6a_4   q6a_5   q6a_6   q6a_7   q6a_8   q6a_9   q6a_10
1          3.61    3.83    3.30    4.56    3.88    4.32    4.63    4.56    4.66     3.55
2          4.16    4.37    2.76    3.32    3.66    4.51    3.82    3.74    3.70     4.09
3          3.87    3.93    4.58    3.18    4.67    4.49    3.34    3.68    3.88     4.47
Customers in k-means cluster 2 appear to be younger on average than other customers, while members of
cluster 1 appear to be older and members of cluster 3 are more likely to be middle-aged. The gender differences
are not drastic, but cluster 2 has a larger proportion of female customers than the other clusters.
Table of CLUSTER by q7
Frequency
Row Pct

CLUSTER      q7(Age)
(Cluster)        1        2        3        4        5        6    Total
1               14       58       49       98       68       65      352
              3.98    16.48    13.92    27.84    19.32    18.47
2               30       56       77       45       37       21      266
             11.28    21.05    28.95    16.92    13.91     7.89
3               14       32       54       53       32       11      196
              7.14    16.33    27.55    27.04    16.33     5.61
Total           58      146      180      196      137       97      814
Table of CLUSTER by q8
Frequency
Row Pct

CLUSTER      q8(Gender)
(Cluster)        1        2    Total
1              144      208      352
             40.91    59.09
2              118      148      266
             44.36    55.64
3               80      116      196
             40.82    59.18
Total          342      472      814
The three segments from the EM algorithm comprise 34%, 40% and 25% of the customers respectively.
Members of EM segment 0 feel that “Employee Friendliness” and “Convenient Hours” are important.
“Appearance of Food” and especially “Cleanliness of the Facility” are less important to them.
People in EM segment 1 feel that “Value for Money” is important, as well as “Employee Friendliness” and
“Cleanliness of the Facility”. They do not regard “Selection of Food” or “Speed of Service” as important. This
is similar to k-means cluster 2, although the EM cluster is noticeably larger.
EM segment 2 holds customers who believe “Value for Money” is very important, and “Cleanliness of the
Facility”, “Convenient Hours” and “Availability of Nutritional Information” are also important. “Healthy
Choices”, “Appearance of Food”, “Freshness of Food”, and “Employee Friendliness” are not important to
them.
EM Segment Means

Segment   q6a_1   q6a_2   q6a_3   q6a_4   q6a_5   q6a_6   q6a_7   q6a_8   q6a_9   q6a_10
0          3.76    3.96    3.93    3.64    5.00    4.47    3.90    3.94    4.07     4.16
1          4.15    4.24    3.24    3.49    3.51    4.48    3.75    3.82    3.84     4.04
2          3.50    3.78    3.07    4.62    3.43    4.27    4.76    4.69    4.81     3.52
EM segments 0 and 2 have a larger proportion of female customers than segment 1. There do not appear to be
any consistent patterns of age differences between the segments.
Mean EM segment membership probabilities by Gender (q8):

Gender   N Obs    segprob1    segprob2    segprob3
1          342   0.3654961   0.3645272   0.2699767
2          472   0.3290861   0.4372480   0.2336660

Mean EM segment membership probabilities by Age (q7):

Age      N Obs    segprob1    segprob2    segprob3
1           58   0.3620683   0.3211545   0.3167771
2          146   0.3493146   0.4212077   0.2294777
3          180   0.3222212   0.4048728   0.2729059
4          196   0.3418353   0.4241338   0.2340309
5          137   0.3649632   0.4110561   0.2239807
6           97   0.3437480   0.3975708   0.2586811
2) A lending institution has gathered information on loan applicants, including whether or not they are judged to
be a good credit risk. They wish to understand the factors that may influence credit risk, so that they can make
their approval procedure more efficient. The data is the German credit dataset discussed in lectures (available at
http://www.stat.auckland.ac.nz/~reilly/credit-g.arff).
i) Conduct an exploratory analysis of the data, including univariate graphical summaries and bivariate measures
of association between credit risk and the available predictor variables. Which predictors appear to be most
strongly associated with the credit risk?
Answer (restricted to the main points – I will add more detail if time permits):
Graphical summaries were generally well done. Measuring bivariate associations between credit risk and the
predictor variables can be based on the measures of association available in PROC FREQ, correlation
coefficients (after dummy variables have been created) or a series of bivariate logistic regression analyses.
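As one concrete possibility, Cramér's V for each nominal predictor against class can be computed from the
chi-squared statistic; a sketch in R (reading the same CSV extracted from credit-g.arff as in the SAS code
below):

credit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)

# Cramér's V: an association measure on [0, 1] for a two-way table.
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab)$statistic)
  as.numeric(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

nominal <- setdiff(names(credit)[sapply(credit, is.factor)], "class")
v <- sapply(credit[nominal], cramers_v, y = credit$class)
sort(v, decreasing = TRUE)   # strongest associations with credit risk first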
ii) Carry out an appropriate canonical discriminant analysis of this data. You may need to create dummy or
indicator variables for the values of nominal predictor variables.
Answer – main points:
Dummy variables should be created for all values of the nominal variables, and perhaps also for ordinal
variables. (See attached SAS code for details.) The important output from the canonical discriminant analysis is
the (total-sample) standardised canonical coefficients. These show which variables load strongly on the
canonical variables. In particular, the variables that load strongly on the first canonical variable are the ones that
best explain credit risk.
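A rough R analogue of this (the analysis here used PROC CANDISC; see the SAS code below) is to dummy-code
the predictors with model.matrix(), fit a linear discriminant with MASS::lda(), and standardise the raw
coefficients by the total-sample standard deviations; a sketch under those assumptions:

library(MASS)

# Dummy-code the nominal predictors (treatment contrasts; intercept dropped).
X <- model.matrix(class ~ ., data = credit)[, -1]
fit <- lda(X, grouping = credit$class)

raw <- fit$scaling[, 1]       # raw canonical coefficients
std <- raw * apply(X, 2, sd)  # total-sample standardised coefficients
head(sort(abs(std), decreasing = TRUE), 10)   # strongest loadings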
iii) Produce a pruned classification tree for this dataset using the rpart and prune functions in R.
Answer – main points:
Most answers got the full tree without any problems. This tree should be pruned back to the simplest tree with a
cross-validated error lower than the minimum cross-validated error plus one standard error, which corresponds
here to a cp value of approximately 0.03.
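This one-SE selection can be read off the cptable programmatically; a sketch using credit.fulltree from the R
code at the end of this document:

# One-SE rule: simplest tree whose cross-validated error is within one
# standard error of the minimum cross-validated error.
cp_tab  <- credit.fulltree$cptable
best    <- which.min(cp_tab[, "xerror"])
cutoff  <- cp_tab[best, "xerror"] + cp_tab[best, "xstd"]
one_se  <- which(cp_tab[, "xerror"] <= cutoff)[1]   # first (simplest) qualifying row
credit.prunedtree <- prune(credit.fulltree, cp = cp_tab[one_se, "CP"])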
iv) Summarise the main results from the above analyses, highlighting their practical implications. Describe any
differences between these results that would be of practical significance, and suggest possible reasons for these
differences. Which results do you believe would be most useful (and why)?
Q1 SAS code:
PROC IMPORT OUT= WORK.cafe
DATAFILE= "C:\jr\cafe.csv"
DBMS=CSV REPLACE;
GETNAMES=YES;
DATAROW=2;
RUN;
ods html body="temp.html" style=minimal;
proc freq data=cafe;
table q6a_1-q6a_10;
run;
ods html close;
data cafe2;
set cafe;
label
q6a_1="6(a) Convenient Hours"
q6a_2="6(b) Speed of Service"
q6a_3="6(c) Value for Money"
q6a_4="6(d) Employee Friendliness"
q6a_5="6(e) Cleanliness of the facility"
q6a_6="6(f) Selection of Food"
q6a_7="6(g) Appearance of Food"
q6a_8="6(h) Freshness of Food"
q6a_9="6(i) Healthy Choices"
q6a_10="6(j) Availability of Nutritional Information"
q7 = "Age"
q8 = "Gender"
;
array q6 q6a_1-q6a_10;
do over q6;
if q6=8 then q6=.;
end;
run;
proc summary data=cafe2;
var q6a_1-q6a_10;
output out=cafemeans mean=m6a_1-m6a_10;
run;
proc sql;
create table cafe2a as
select * from cafe2, cafemeans;
quit;
data cafe3;
set cafe2a;
array q6 q6a_1-q6a_10;
array m6 m6a_1-m6a_10;
array e6 e6a_1-e6a_10;
do over q6;
if q6 ne . then e6=q6-m6;
end;
random=ranuni(1394881787);
idn=id+0;
run;
proc sort data=cafe3 out=cafe3a;
by random;
run;
%macro impute;
%do i=1 %to 10;
  * Stack two randomly sorted copies and keep the non-missing rows, giving a
    donor pool of observed deviations for this item.;
  data temp; set cafe3a cafe3a; if q6a_&i ne .; run;
  * Positional merge: each respondent is paired with a random donor, and a
    missing value is imputed as the item mean plus the donor deviation.;
  data cafe4a&i;
    merge cafe3 (keep=idn q6a_&i q7 q8 in=c) temp (keep=m6a_&i e6a_&i);
    if c;
    if q6a_&i=. then q6a_&i=m6a_&i+e6a_&i;
  run;
%end;
data cafe4;
  merge
  %do i=1 %to 10; cafe4a&i %end;
  ;
  by idn;
  keep idn q6a_1-q6a_10 q7 q8;
run;
%mend;
%impute;
* Try factor analysis to see whether this provides useful simplification.;
/*
proc factor data=cafe4 n=6 rotate=varimax round reorder flag=.54 scree out=scores;
  var q6a_1-q6a_10;
run;
*/

* Adjust for possible response effect.;
/*
proc transpose data=cafe4 out=cafe4t;
  var q6a_1-q6a_10;
  id idn;
run;

proc standard data=cafe4t out=cafe5t m=0;
  var _1-_814;
run;

proc transpose data=cafe5t out=cafe5;
  var _1-_814;
  id _name_;
  idlabel _label_;
run;

proc fastclus data=cafe5 maxc=3 replace=random random=109162319;
  var q6a_1-q6a_10;
run;
*/
%macro kmclust(seed);
ods output Criterion=kmcriterion;
ods html body="temp.html" style=minimal;
proc fastclus data=cafe4 maxc=3 replace=random random=&seed out=clusters;
var q6a_1-q6a_10;
run;
ods output close;
ods html close;
data _null_; set kmcriterion; put _all_; run;
%mend;
%kmclust(839209472);
%kmclust(726230843);
%kmclust(173049203);
%kmclust(320205828);
%kmclust(929017829);
%kmclust(109162319);
%kmclust(619283921);
%kmclust(463892012);
%kmclust(561092881);
%kmclust(718923491);
%kmclust(429729127);
%kmclust(379457285);
%kmclust(178916387);
%kmclust(739876219);
%kmclust(627881618);
%kmclust(821098721);
%kmclust(897638196);
%kmclust(862341233);
%kmclust(327632882);
%kmclust(116298123); * This seed gives the lowest criterion of 0.8027.;
* Cluster profiles by demographics.;
ods html body="temp.html" style=minimal;
proc freq data=clusters;
table cluster*(q7 q8) / nocol nopercent;
run;
ods html close;
* Export imputed data for fitting latent class model using EM algorithm.;
PROC EXPORT DATA= WORK.CAFE4
OUTFILE= "C:\cafe4.csv"
DBMS=CSV REPLACE;
RUN;
* Read in segment membership probabilities from EM output.;
data EMsegments;
infile "c:\jr\cafeEM.txt" missover pad;
input idn 8-10 modeseg 18 segprob1 20-26 segprob2 29-35 segprob3 38-44;
run;
data EMsegments2;
merge EMsegments cafe4;
by idn;
run;
* EM segment profiles by demographics.;
ods html body="temp.html" style=minimal;
proc means data=EMsegments2 mean print;
class q7 q8;
var segprob1-segprob3;
ways 1;
run;
ods html close;
EM code and output:
> java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -d C:\jr\cafe.out -S 1721
Seed: 1721
Number of instances: 814
Number of atts: 10
======================================
[Verbose pre-training dump of the initial cluster parameters omitted: every mean, standard deviation and
weight sum was printed as 0 for all three clusters and all ten attributes.]
Inst   0 Class 2 0.36    0.11808 0.52192
Inst   1 Class 2 0.44115 0.08651 0.47235
Inst   2 Class 1 0.3127  0.41949 0.26782
Inst   3 Class 0 0.53905 0.23162 0.22934
Inst   4 Class 1 0.3709  0.42352 0.20558
Inst   5 Class 1 0.02279 0.69282 0.2844
Inst   6 Class 0 0.48523 0.41443 0.10033
Inst   7 Class 2 0.11324 0.37974 0.50702
Inst   8 Class 0 0.70994 0.21248 0.07758
Inst   9 Class 2 0.36237 0.13268 0.50495
Inst  10 Class 0 0.59445 0.16679 0.23876
.
.
.
Inst 810 Class 0 0.53874 0.03339 0.42786
Inst 811 Class 2 0.3562  0.10449 0.53931
Inst 812 Class 2 0.3153  0.29604 0.38865
Inst 813 Class 0 0.37679 0.25703 0.36617
Loglikely: -12.947281732790431
Loglikely: -12.945401366274425
Loglikely: -12.938726747977523
Loglikely: -12.901977150615119
Loglikely: -12.75012451077629
Loglikely: -12.493136950893648
Loglikely: -12.365027354312812
Loglikely: -12.304789204789568
Loglikely: -12.244897621583464
Loglikely: -12.193376754473698
Loglikely: -12.167358555788184
Loglikely: -12.157754243677418
Loglikely: -12.15417345138127
Loglikely: -12.152423619369937
Loglikely: -12.151152790252866
Loglikely: -12.149841301667593
Loglikely: -12.148118219983392
Loglikely: -12.145561523116212
Loglikely: -12.141839465588516
Loglikely: -12.137677594019792
Loglikely: -12.134179767406906
Loglikely: -12.131434990910169
Loglikely: -12.129153188213081
Loglikely: -12.12710171492302
Loglikely: -12.125226260553502
Loglikely: -12.12349702882937
Loglikely: -12.121828686952819
Loglikely: -12.12013959946356
Loglikely: -12.118388891732248
Loglikely: -12.116555091413234
Loglikely: -12.1145537246838
Loglikely: -12.112084376404072
Loglikely: -12.10816598293722
Loglikely: -12.098749169963604
Loglikely: -12.055958366015664
Loglikely: -11.741061138423825
Loglikely: -8.27125123227167
Loglikely: -7.964040764111144
Loglikely: -7.962932367664369
Loglikely: -7.962702860111277
Loglikely: -7.962619911759608
Loglikely: -7.962588379101999
Loglikely: -7.962575791799965
Loglikely: -7.962570581434721
Loglikely: -7.962568372746465
Loglikely: -7.962567422597268
======================================
[Verbose dump of the fitted cluster parameters and per-instance hard class assignments omitted; the fitted
model is summarised below.]
==
Number of clusters: 3
Cluster: 0 Prior probability: 0.344
Attribute: q6a_1
Normal Distribution.
Attribute: q6a_2
Normal Distribution.
Attribute: q6a_3
Normal Distribution.
Attribute: q6a_4
Normal Distribution.
Attribute: q6a_5
Normal Distribution.
Attribute: q6a_6
Normal Distribution.
Attribute: q6a_7
Normal Distribution.
Attribute: q6a_8
Normal Distribution.
Attribute: q6a_9
Normal Distribution.
Attribute: q6a_10
Normal Distribution.
Mean = 3.7643 StdDev = 1.0008
Mean = 3.9571 StdDev = 0.8691
Mean = 3.925 StdDev = 1.0376
Mean = 3.6429 StdDev = 1.0252
Mean = 5 StdDev = 0
Mean = 4.4679 StdDev = 0.5785
Mean = 3.9036 StdDev = 0.9494
Mean = 3.9429 StdDev = 0.8347
Mean = 4.0679 StdDev = 0.8695
Mean = 4.1571 StdDev = 0.8086
Cluster: 1 Prior probability: 0.4062
Attribute: q6a_1
Normal Distribution. Mean = 4.1496 StdDev = 0.8437
Attribute: q6a_2
Normal Distribution. Mean = 4.2428 StdDev = 0.7293
Attribute: q6a_3
Normal Distribution. Mean = 3.235 StdDev = 1.0697
Attribute: q6a_4
Normal Distribution.
Attribute: q6a_5
Normal Distribution.
Attribute: q6a_6
Normal Distribution.
Attribute: q6a_7
Normal Distribution.
Attribute: q6a_8
Normal Distribution.
Attribute: q6a_9
Normal Distribution.
Attribute: q6a_10
Normal Distribution.
Mean = 3.4858 StdDev = 1.0646
Mean = 3.5058 StdDev = 0.688
Mean = 4.4793 StdDev = 0.6006
Mean = 3.7508 StdDev = 0.9563
Mean = 3.8212 StdDev = 0.8231
Mean = 3.8357 StdDev = 0.8844
Mean = 4.0428 StdDev = 0.7403
Cluster: 2 Prior probability: 0.2498
Attribute: q6a_1
Normal Distribution.
Attribute: q6a_2
Normal Distribution.
Attribute: q6a_3
Normal Distribution.
Attribute: q6a_4
Normal Distribution.
Attribute: q6a_5
Normal Distribution.
Attribute: q6a_6
Normal Distribution.
Attribute: q6a_7
Normal Distribution.
Attribute: q6a_8
Normal Distribution.
Attribute: q6a_9
Normal Distribution.
Attribute: q6a_10
Normal Distribution.
Mean = 3.4965 StdDev = 0.9632
Mean = 3.7824 StdDev = 0.7585
Mean = 3.0703 StdDev = 0.9877
Mean = 4.6241 StdDev = 0.5041
Mean = 3.4317 StdDev = 0.7422
Mean = 4.2732 StdDev = 0.6017
Mean = 4.7589 StdDev = 0.4323
Mean = 4.6887 StdDev = 0.4754
Mean = 4.8076 StdDev = 0.3957
Mean = 3.5176 StdDev = 0.8439
=== Clustering stats for training data ===

Clustered Instances
0    280 ( 34%)
1    329 ( 40%)
2    205 ( 25%)
Log likelihood: -7.96257
Q2 SAS code.
libname asst4 'C:\My Documents\747';
* @data portion of credit-g.arff file extracted, and header line containing
  variable names added, before reading data into SAS.;
PROC IMPORT DATAFILE = 'C:\My Documents\747\credit-g.csv'
OUT = asst4.credit DBMS = CSV REPLACE;
run;
ods html body="C:\My Documents\747\credit.html" style=minimal;
proc contents data=asst4.credit;
run;
proc freq data=asst4.credit;
  tables checking_status duration credit_history purpose credit_amount
    savings_status employment installment_commitment personal_status
    other_parties residence_since property_magnitude age other_payment_plans
    housing existing_credits job num_dependents own_telephone
    foreign_worker class;
run;
proc freq data=asst4.credit;
  tables (checking_status duration credit_history purpose credit_amount
    savings_status employment installment_commitment personal_status
    other_parties residence_since property_magnitude age other_payment_plans
    housing existing_credits job num_dependents own_telephone
    foreign_worker)*class / measures;
run;
* proc glmmod provides a quick way to create dummy variables.
  This could also be done in a data step using if-then statements.;
proc glmmod data=asst4.credit outdesign=asst4.credit2;
  * checking_status, savings_status, property_magnitude and job could arguably
    be regarded as ordinal, and coded as appropriate numeric values. However
    here they are treated as nominal variables.;
  class checking_status credit_history purpose savings_status employment
    personal_status other_parties property_magnitude other_payment_plans
    housing job own_telephone foreign_worker class;
  model duration = checking_status duration credit_history purpose
    credit_amount savings_status employment installment_commitment
    personal_status other_parties residence_since property_magnitude age
    other_payment_plans housing existing_credits job num_dependents
    own_telephone foreign_worker class;
run;
proc contents data=asst4.credit2;
run;
proc candisc data=asst4.credit2;
  class col64;
  var col1-col62;
run;
ods html close;
Q2 R code.
credit <- read.csv("C:\\My Documents\\747\\credit-g.csv")
library(rpart)
credit.fulltree <- rpart(class ~ .,data=credit)
plot(credit.fulltree)
text(credit.fulltree)
printcp(credit.fulltree)
plotcp(credit.fulltree)
credit.prunedtree <- prune(credit.fulltree,cp=0.03)
plot(credit.prunedtree)
text(credit.prunedtree)