Model Answers
Assignment 4, STATS 747, 2nd Semester 2002

1) Your client owns a chain of cafés and wants to understand what is important to their customers. They have carried out a survey using the questionnaire available online at http://www.stat.auckland.ac.nz/~reilly/cafeqns.pdf. The resulting data is available online at http://www.stat.auckland.ac.nz/~reilly/cafe.csv. A wide range of cross-tabulations have already been carried out, and now the client wants you to investigate whether there are segments of their customers that have different feelings about what is important.

i) Explore the data, and in particular the questions about how important various service attributes are. Describe any modifications to the data that would be appropriate before carrying out the cluster analyses outlined below, and modify the data accordingly.

Answer: [Details of data exploration omitted.] All the importance questions (from Q6) are measured on the same five-point scale, so standardisation is not required. However, a large proportion of responses are missing (code 8 = "Don't Know"), which would bias the results if listwise deletion were used (or if the "Don't Know"s were left coded as 8). For example, more than 15% of customers said "Don't Know" about the importance of convenient hours, as shown below. All of the items in question 6 had more than 10% "Don't Know"s.

6(a) Convenient Hours

q6a_1   Frequency   Percent   Cumulative Frequency   Cumulative Percent
  1          14       1.72            14                     1.72
  2          44       5.41            58                     7.13
  3         165      20.27           223                    27.40
  4         261      32.06           484                    59.46
  5         193      23.71           677                    83.17
  8         137      16.83           814                   100.00

Imputation should therefore be carried out before the segmentation analyses. Mean or median imputation is a possibility, but it would distort the distribution of responses and could lead to biased results. Hot-deck imputation would be a better choice, or imputation of the mean plus a random error. The latter approach is used here.

A factor analysis might also be a useful preparatory step if a small number of factors could be identified that summarised the data well. Three factors have eigenvalues above or near 1, but these factors account for less than half the variation among the Q6 variables, so much of the information would be lost by using just these factors. If all ten factors were used as input to the cluster analyses, factor 1 (the most important component) would be given equal weight to factor 10 (which probably represents random noise), and this could drown out important patterns in the data. For these reasons, using factors does not seem wise in this situation. [However, I have not taken marks off for doing this.]

In case response effects were present, row means were subtracted from the imputed data, and the k-means clusters for this dataset were compared with those based on the original imputed data. There was only a small gain in relative R² from doing this, so for ease of interpretation the final segmentations have been based on the original imputed data.

ii) Conduct a k-means cluster analysis of the attribute importance questions. Obtain a 3 cluster solution.

Answer: Twenty k-means analyses producing three clusters were run, each with different randomly chosen cluster seeds, to avoid local minima. The cluster solution with the lowest criterion was chosen.

iii) Fit a normal mixture model with 3 components to the same data, using Weka's EM clustering algorithm.

Answer: The EM algorithm was run 20 times on the imputed data, looking for three latent classes, and using a different seed each time.
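As an illustration of this multiple-random-starts strategy (the analyses themselves were run in SAS PROC FASTCLUS and Weka, as shown in the code at the end of this document), here is a minimal R sketch, assuming the imputed Q6 items are held in a data frame called cafe_imp:

# Run k-means from 20 random starts and keep the solution with the smallest
# total within-cluster sum of squares, reducing the risk of a poor local minimum.
# 'cafe_imp' is an assumed name for the imputed Q6 data.
set.seed(2002)
fits <- lapply(1:20, function(i) kmeans(cafe_imp, centers = 3, nstart = 1))
best <- fits[[which.min(sapply(fits, function(f) f$tot.withinss))]]
table(best$cluster)   # cluster sizes for the best of the 20 runs

A single call with nstart = 20 would achieve the same thing; the explicit loop mirrors the twenty separately seeded runs described above. The twenty Weka EM runs were handled in the same spirit, by varying the -S seed option.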
The likelihood was maximised when the seed was 1721. The following command was used to produce cluster membership probabilities:

java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -S 1721

These probabilities were then read into SAS and used to produce demographic profiles.

iv) Describe the market segments resulting from k-means in terms of the input variables and demographic characteristics. Describe the market segments from the EM algorithm in terms of the input variables. How do the results differ? For extra credit: also describe the EM segments in terms of demographic characteristics, and compare with the k-means results. Show your work, including relevant program code and output. Be sure to avoid local minima and maxima in your analyses.

Answer: The three clusters from the k-means algorithm comprise 43%, 33% and 24% of the customers respectively.

People in k-means cluster 1 feel that “Value for Money” is important, along with “Availability of Nutritional Information” and “Convenient Hours”. “Employee Friendliness” and “Appearance of Food” are less important to them. Members of k-means cluster 2 feel that “Value for Money” is even more important (on average), as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Speed of Service” as important. Cluster 3 holds customers who believe “Employee Friendliness” and “Appearance of Food” are important, while “Value for Money” and “Cleanliness of the Facility” are not important to them.

k-means Cluster Means

Cluster  q6a_1  q6a_2  q6a_3  q6a_4  q6a_5  q6a_6  q6a_7  q6a_8  q6a_9  q6a_10
   1      3.61   3.83   3.30   4.56   3.88   4.32   4.63   4.56   4.66   3.55
   2      4.16   4.37   2.76   3.32   3.66   4.51   3.82   3.74   3.70   4.09
   3      3.87   3.93   4.58   3.18   4.67   4.49   3.34   3.68   3.88   4.47

Customers in k-means cluster 2 appear to be younger on average than other customers, while cluster 1 appears to be older and members of cluster 3 are more likely to be middle-aged. The gender differences are not drastic, but cluster 2 has a larger proportion of female customers than the other clusters.

Table of CLUSTER by q7 (Age)
(cell entries: Frequency and Row Pct)

Cluster        1           2           3           4           5           6       Total
   1       14  3.98    58 16.48    49 13.92    98 27.84    68 19.32    65 18.47      352
   2       30 11.28    56 21.05    77 28.95    45 16.92    37 13.91    21  7.89      266
   3       14  7.14    32 16.33    54 27.55    53 27.04    32 16.33    11  5.61      196
Total      58         146         180         196         137          97            814

Table of CLUSTER by q8 (Gender)
(cell entries: Frequency and Row Pct)

Cluster         1            2       Total
   1       144 40.91    208 59.09      352
   2       118 44.36    148 55.64      266
   3        80 40.82    116 59.18      196
Total      342          472            814

The three segments from the EM algorithm comprise 34%, 40% and 25% of the customers respectively.

Members of EM segment 0 feel that “Employee Friendliness” and “Convenient Hours” are important. “Appearance of Food” and especially “Cleanliness of the Facility” are less important to them. People in EM segment 1 feel that “Value for Money” is important, as well as “Employee Friendliness” and “Cleanliness of the Facility”. They do not regard “Selection of Food” or “Speed of Service” as important. This is similar to k-means cluster 2, although the EM cluster is noticeably larger. EM segment 2 holds customers who believe “Value for Money” is very important, and “Cleanliness of the Facility”, “Convenient Hours” and “Availability of Nutritional Information” are also important. “Healthy Choices”, “Appearance of Food”, “Freshness of Food” and “Employee Friendliness” are not important to them.
EM Segment Means

Segment  q6a_1  q6a_2  q6a_3  q6a_4  q6a_5  q6a_6  q6a_7  q6a_8  q6a_9  q6a_10
   0      3.76   3.96   3.93   3.64   5.00   4.47   3.90   3.94   4.07   4.16
   1      4.15   4.24   3.24   3.49   3.51   4.48   3.75   3.82   3.84   4.04
   2      3.50   3.78   3.07   4.62   3.43   4.27   4.76   4.69   4.81   3.52

EM segments 0 and 2 have a larger proportion of female customers than segment 1. There do not appear to be any consistent patterns of age differences between the segments.

Mean segment membership probabilities by Gender

Gender   N Obs   segprob1    segprob2    segprob3
  1        342   0.3654961   0.3645272   0.2699767
  2        472   0.3290861   0.4372480   0.2336660

Mean segment membership probabilities by Age

Age      N Obs   segprob1    segprob2    segprob3
  1         58   0.3620683   0.3211545   0.3167771
  2        146   0.3493146   0.4212077   0.2294777
  3        180   0.3222212   0.4048728   0.2729059
  4        196   0.3418353   0.4241338   0.2340309
  5        137   0.3649632   0.4110561   0.2239807
  6         97   0.3437480   0.3975708   0.2586811

2) A lending institution has gathered information on loan applicants, including whether or not they are judged to be a good credit risk. They wish to understand the factors that may influence credit risk, so that they can make their approval procedure more efficient. The data is the German credit dataset discussed in lectures (available at http://www.stat.auckland.ac.nz/~reilly/credit-g.arff).

i) Conduct an exploratory analysis of the data, including univariate graphical summaries and bivariate measures of association between credit risk and the available predictor variables. Which predictors appear to be most strongly associated with the credit risk?

Answer (restricted to the main points – I will add more detail if time permits): Graphical summaries were generally well done. Bivariate associations between credit risk and the predictor variables can be measured using the measures of association available in PROC FREQ, correlation coefficients (after dummy variables have been created), or a series of bivariate logistic regression analyses.

ii) Carry out an appropriate canonical discriminant analysis of this data. You may need to create dummy or indicator variables for the values of nominal predictor variables.

Answer – main points: Dummy variables should be created for all values of the nominal variables, and perhaps also for the ordinal variables. (See the attached SAS code for details.) The important output from the canonical discriminant analysis is the (total-sample) standardised canonical coefficients. These show which variables load strongly on the canonical variables; in particular, the variables that load strongly on the first canonical variable are the ones that best explain credit risk.

iii) Produce a pruned classification tree for this dataset using the rpart and prune functions in R.

Answer – main points: Most answers obtained the full tree without any problems. This tree should be pruned back to the simplest tree with a cross-validated error lower than the minimum cross-validated error plus one standard error, which corresponds here to a cp value of approximately 0.03.

iv) Summarise the main results from the above analyses, highlighting their practical implications. Describe any differences between these results that would be of practical significance, and suggest possible reasons for these differences. Which results do you believe would be most useful (and why)?
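Before turning to the code, here is an illustrative R sketch of the bivariate screening described in Q2 part (i). It is not the approach used in the attached SAS code (which relies on the PROC FREQ measures of association); it simply screens each predictor against credit risk with a chi-squared test for categorical predictors and a two-sample t-test for numeric ones (the file path and working directory are assumptions):

# Screen each predictor for association with the class (good/bad credit risk).
credit <- read.csv("credit-g.csv", stringsAsFactors = TRUE)
screen <- function(x, y) {
   if (is.numeric(x)) t.test(x ~ y)$p.value       # numeric predictor
   else chisq.test(table(x, y))$p.value           # categorical predictor
}
pvals <- sapply(credit[, names(credit) != "class"], screen, y = credit$class)
sort(pvals)[1:5]   # the predictors most strongly associated with credit risk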
Q1 SAS code:

PROC IMPORT OUT= WORK.cafe
            DATAFILE= "C:\jr\cafe.csv"
            DBMS=CSV REPLACE;
   GETNAMES=YES;
   DATAROW=2;
RUN;

ods html body="temp.html" style=minimal;
proc freq data=cafe;
   table q6a_1-q6a_10;
run;
ods html close;

data cafe2;
   set cafe;
   label q6a_1="6(a) Convenient Hours"
         q6a_2="6(b) Speed of Service"
         q6a_3="6(c) Value for Money"
         q6a_4="6(d) Employee Friendliness"
         q6a_5="6(e) Cleanliness of the facility"
         q6a_6="6(f) Selection of Food"
         q6a_7="6(g) Appearance of Food"
         q6a_8="6(h) Freshness of Food"
         q6a_9="6(i) Healthy Choices"
         q6a_10="6(j) Availability of Nutritional Information"
         q7="Age"
         q8="Gender";
   array q6 q6a_1-q6a_10;
   do over q6;
      if q6=8 then q6=.;
   end;
run;

proc summary data=cafe2;
   var q6a_1-q6a_10;
   output out=cafemeans mean=m6a_1-m6a_10;
run;

proc sql;
   create table cafe2a as
   select * from cafe2, cafemeans;
quit;

data cafe3;
   set cafe2a;
   array q6 q6a_1-q6a_10;
   array m6 m6a_1-m6a_10;
   array e6 e6a_1-e6a_10;
   do over q6;
      if q6 ne . then e6=q6-m6;
   end;
   random=ranuni(1394881787);
   idn=id+0;
run;

proc sort data=cafe3 out=cafe3a;
   by random;
run;

%macro impute;
   %do i=1 %to 10;
      data temp;
         set cafe3a cafe3a;
         if q6a_&i ne .;
      run;

      data cafe4a&i;
         merge cafe3 (keep=idn q6a_&i q7 q8 in=c)
               temp (keep=m6a_&i e6a_&i);
         if c;
         if q6a_&i=. then q6a_&i=m6a_&i+e6a_&i;
      run;
   %end;

   data cafe4;
      merge %do i=1 %to 10; cafe4a&i %end; ;
      by idn;
      keep idn q6a_1-q6a_10 q7 q8;
   run;
%mend;
%impute;

/* Try factor analysis to see whether this provides useful simplification. */
/*proc factor data=cafe4 n=6 rotate=varimax round reorder flag=.54 scree out=scores;*/
/*   var q6a_1-q6a_10;*/
/*run;*/

/* Adjust for possible response effect. */
/*proc transpose data=cafe4 out=cafe4t;*/
/*   var q6a_1-q6a_10;*/
/*   id idn;*/
/*run;*/

/*proc standard data=cafe4t out=cafe5t m=0;*/
/*   var _1-_814;*/
/*run;*/

/*proc transpose data=cafe5t out=cafe5;*/
/*   var _1-_814;*/
/*   id _name_;*/
/*   idlabel _label_;*/
/*run;*/

/*proc fastclus data=cafe5 maxc=3 replace=random random=109162319;*/
/*   var q6a_1-q6a_10;*/
/*run;*/

%macro kmclust(seed);
   ods output Criterion=kmcriterion;
   ods html body="temp.html" style=minimal;
   proc fastclus data=cafe4 maxc=3 replace=random random=&seed out=clusters;
      var q6a_1-q6a_10;
   run;
   ods output close;
   ods html close;
   data _null_;
      set kmcriterion;
      put _all_;
   run;
%mend;

%kmclust(839209472);
%kmclust(726230843);
%kmclust(173049203);
%kmclust(320205828);
%kmclust(929017829);
%kmclust(109162319);
%kmclust(619283921);
%kmclust(463892012);
%kmclust(561092881);
%kmclust(718923491);
%kmclust(429729127);
%kmclust(379457285);
%kmclust(178916387);
%kmclust(739876219);
%kmclust(627881618);
%kmclust(821098721);
%kmclust(897638196);
%kmclust(862341233);
%kmclust(327632882);
%kmclust(116298123);   * This seed gives the lowest criterion of 0.8027.;

* Cluster profiles by demographics.;
ods html body="temp.html" style=minimal;
proc freq data=clusters;
   table cluster*(q7 q8) / nocol nopercent;
run;
ods html close;

* Export imputed data for fitting latent class model using EM algorithm.;
PROC EXPORT DATA= WORK.CAFE4
            OUTFILE= "C:\cafe4.csv"
            DBMS=CSV REPLACE;
RUN;

* Read in segment membership probabilities from EM output.;
data EMsegments;
   infile "c:\jr\cafeEM.txt" missover pad;
   input idn 8-10 modeseg 18 segprob1 20-26 segprob2 29-35 segprob3 38-44;
run;

data EMsegments2;
   merge EMsegments cafe4;
   by idn;
run;

* EM segment profiles by demographics.;
ods html body="temp.html" style=minimal;
proc means data=EMsegments2 mean print;
   class q7 q8;
   var segprob1-segprob3;
   ways 1;
run;
ods html close;
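The %impute macro above fills each missing Q6 response with the item mean plus a deviation taken from a randomly ordered list of observed deviations. Purely for illustration, a minimal R sketch of the same idea follows (the assignment itself used the SAS code above; here the deviations are simply resampled with replacement, and the data frame and column names are assumed from cafe.csv):

# Illustrative R version of the "mean plus random error" imputation.
# 'cafe' and the q6a_ columns are assumed to match cafe.csv, with 8 = Don't Know.
cafe <- read.csv("cafe.csv")
set.seed(1394881787)
for (v in paste0("q6a_", 1:10)) {
   x <- cafe[[v]]
   x[x == 8] <- NA                                  # recode Don't Know to missing
   dev <- x[!is.na(x)] - mean(x, na.rm = TRUE)      # observed deviations from the item mean
   miss <- is.na(x)
   x[miss] <- mean(x, na.rm = TRUE) + sample(dev, sum(miss), replace = TRUE)
   cafe[[v]] <- x
}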
EM code and output:

> java weka.clusterers.EM -N 3 -V -t C:\jr\cafe4b.arff -d C:\jr\cafe.out -S 1721

Seed: 1721
Number of instances: 814
Number of atts: 10

======================================
[Initial cluster parameters: every attribute in each of the three clusters starts
with Mean = 0, StandardDev = 0, WeightSum = 0 before the first EM iteration.]

Inst 0   Class 2   0.36     0.11808  0.52192
Inst 1   Class 2   0.44115  0.08651  0.47235
Inst 2   Class 1   0.3127   0.41949  0.26782
Inst 3   Class 0   0.53905  0.23162  0.22934
Inst 4   Class 1   0.3709   0.42352  0.20558
Inst 5   Class 1   0.02279  0.69282  0.2844
Inst 6   Class 0   0.48523  0.41443  0.10033
Inst 7   Class 2   0.11324  0.37974  0.50702
Inst 8   Class 0   0.70994  0.21248  0.07758
Inst 9   Class 2   0.36237  0.13268  0.50495
Inst 10  Class 0   0.59445  0.16679  0.23876
. . .
Inst 810 Class 0   0.53874  0.03339  0.42786
Inst 811 Class 2   0.3562   0.10449  0.53931
Inst 812 Class 2   0.3153   0.29604  0.38865
Inst 813 Class 0   0.37679  0.25703  0.36617

Loglikely: -12.947281732790431
Loglikely: -12.945401366274425
Loglikely: -12.938726747977523
Loglikely: -12.901977150615119
Loglikely: -12.75012451077629
Loglikely: -12.493136950893648
Loglikely: -12.365027354312812
Loglikely: -12.304789204789568
Loglikely: -12.244897621583464
Loglikely: -12.193376754473698
Loglikely: -12.167358555788184
Loglikely: -12.157754243677418
Loglikely: -12.15417345138127
Loglikely: -12.152423619369937
Loglikely: -12.151152790252866
Loglikely: -12.149841301667593
Loglikely: -12.148118219983392
Loglikely: -12.145561523116212
Loglikely: -12.141839465588516
Loglikely: -12.137677594019792
Loglikely: -12.134179767406906
Loglikely: -12.131434990910169
Loglikely: -12.129153188213081
Loglikely: -12.12710171492302
Loglikely: -12.125226260553502
Loglikely: -12.12349702882937
Loglikely: -12.121828686952819
Loglikely: -12.12013959946356
Loglikely: -12.118388891732248
Loglikely: -12.116555091413234
Loglikely: -12.1145537246838
Loglikely: -12.112084376404072
Loglikely: -12.10816598293722
Loglikely: -12.098749169963604
Loglikely: -12.055958366015664
Loglikely: -11.741061138423825
Loglikely: -8.27125123227167
Loglikely: -7.964040764111144
Loglikely: -7.962932367664369
Loglikely: -7.962702860111277
Loglikely: -7.962619911759608
Loglikely: -7.962588379101999
Loglikely: -7.962575791799965
Loglikely: -7.962570581434721
Loglikely: -7.962568372746465
Loglikely: -7.962567422597268

======================================
[Verbose per-attribute output for the final clusters: the normal-distribution means
and standard deviations are those shown in the EM summary below; the cluster weight
sums are 279.9992, 330.5551 and 203.4457 for clusters 0, 1 and 2 respectively.]

Inst 0   Class 2   0        0.02973  0.97027
Inst 1   Class 2   0        0.01223  0.98777
Inst 2   Class 0   1        0        0
Inst 3   Class 1   0        1        0
Inst 4   Class 1   0        0.91439  0.08561
Inst 5   Class 0   1        0        0
Inst 6   Class 2   0        0.11805  0.88195
Inst 7   Class 2   0        0.14931  0.85069
Inst 8   Class 1   0        1        0
Inst 9   Class 1   0        0.99892  0.00108
Inst 10  Class 0   1        0        0
. . .
Inst 810 Class 2   0        0.00531  0.99469
Inst 811 Class 0   0.99991  0        0.00009
Inst 812 Class 1   0        0.99076  0.00924
Inst 813 Class 0   1        0        0

EM
==

Number of clusters: 3

Cluster: 0  Prior probability: 0.344
Attribute: q6a_1    Normal Distribution. Mean = 3.7643  StdDev = 1.0008
Attribute: q6a_2    Normal Distribution. Mean = 3.9571  StdDev = 0.8691
Attribute: q6a_3    Normal Distribution. Mean = 3.925   StdDev = 1.0376
Attribute: q6a_4    Normal Distribution. Mean = 3.6429  StdDev = 1.0252
Attribute: q6a_5    Normal Distribution. Mean = 5       StdDev = 0
Attribute: q6a_6    Normal Distribution. Mean = 4.4679  StdDev = 0.5785
Attribute: q6a_7    Normal Distribution. Mean = 3.9036  StdDev = 0.9494
Attribute: q6a_8    Normal Distribution. Mean = 3.9429  StdDev = 0.8347
Attribute: q6a_9    Normal Distribution. Mean = 4.0679  StdDev = 0.8695
Attribute: q6a_10   Normal Distribution. Mean = 4.1571  StdDev = 0.8086

Cluster: 1  Prior probability: 0.4062
Attribute: q6a_1    Normal Distribution. Mean = 4.1496  StdDev = 0.8437
Attribute: q6a_2    Normal Distribution. Mean = 4.2428  StdDev = 0.7293
Attribute: q6a_3    Normal Distribution. Mean = 3.235   StdDev = 1.0697
Attribute: q6a_4    Normal Distribution. Mean = 3.4858  StdDev = 1.0646
Attribute: q6a_5    Normal Distribution. Mean = 3.5058  StdDev = 0.688
Attribute: q6a_6    Normal Distribution. Mean = 4.4793  StdDev = 0.6006
Attribute: q6a_7    Normal Distribution. Mean = 3.7508  StdDev = 0.9563
Attribute: q6a_8    Normal Distribution. Mean = 3.8212  StdDev = 0.8231
Attribute: q6a_9    Normal Distribution. Mean = 3.8357  StdDev = 0.8844
Attribute: q6a_10   Normal Distribution. Mean = 4.0428  StdDev = 0.7403

Cluster: 2  Prior probability: 0.2498
Attribute: q6a_1    Normal Distribution. Mean = 3.4965  StdDev = 0.9632
Attribute: q6a_2    Normal Distribution. Mean = 3.7824  StdDev = 0.7585
Attribute: q6a_3    Normal Distribution. Mean = 3.0703  StdDev = 0.9877
Attribute: q6a_4    Normal Distribution. Mean = 4.6241  StdDev = 0.5041
Attribute: q6a_5    Normal Distribution. Mean = 3.4317  StdDev = 0.7422
Attribute: q6a_6    Normal Distribution. Mean = 4.2732  StdDev = 0.6017
Attribute: q6a_7    Normal Distribution. Mean = 4.7589  StdDev = 0.4323
Attribute: q6a_8    Normal Distribution. Mean = 4.6887  StdDev = 0.4754
Attribute: q6a_9    Normal Distribution. Mean = 4.8076  StdDev = 0.3957
Attribute: q6a_10   Normal Distribution. Mean = 3.5176  StdDev = 0.8439

=== Clustering stats for training data ===

Clustered Instances
0      280 ( 34%)
1      329 ( 40%)
2      205 ( 25%)

Log likelihood: -7.96257

Q2 SAS code:

libname asst4 'C:\My Documents\747';

* @data portion of credit-g.arff file extracted, and header line containing
  variable names added, before reading data into SAS.;
PROC IMPORT DATAFILE = 'C:\My Documents\747\credit-g.csv'
            OUT = asst4.credit
            DBMS = CSV REPLACE;
run;

ods html body="C:\My Documents\747\credit.html" style=minimal;

proc contents data=asst4.credit;
run;

proc freq data=asst4.credit;
   tables checking_status duration credit_history purpose credit_amount
          savings_status employment installment_commitment personal_status
          other_parties residence_since property_magnitude age
          other_payment_plans housing existing_credits job num_dependents
          own_telephone foreign_worker class;
run;

proc freq data=asst4.credit;
   tables (checking_status duration credit_history purpose credit_amount
           savings_status employment installment_commitment personal_status
           other_parties residence_since property_magnitude age
           other_payment_plans housing existing_credits job num_dependents
           own_telephone foreign_worker)*class / measures;
run;

* proc glmmod provides a quick way to create dummy variables. This could also
  be done in a data step using if-then statements.;
proc glmmod data=asst4.credit outdesign=asst4.credit2;
   * checking_status, savings_status, property_magnitude and job could arguably
     be regarded as ordinal, and coded as appropriate numeric values. However
     here they are treated as nominal variables.;
   class checking_status credit_history purpose savings_status employment
         personal_status other_parties property_magnitude other_payment_plans
         housing job own_telephone foreign_worker class;
   model duration = checking_status duration credit_history purpose
         credit_amount savings_status employment installment_commitment
         personal_status other_parties residence_since property_magnitude age
         other_payment_plans housing existing_credits job num_dependents
         own_telephone foreign_worker class;
run;

proc contents data=asst4.credit2;
run;

proc candisc data=asst4.credit2;
   class col64;
   var col1-col62;
run;

ods html close;

Q2 R code:

credit <- read.csv("C:\\My Documents\\747\\credit-g.csv")
library(rpart)

credit.fulltree <- rpart(class ~ ., data = credit)
plot(credit.fulltree)
text(credit.fulltree)
printcp(credit.fulltree)
plotcp(credit.fulltree)

credit.prunedtree <- prune(credit.fulltree, cp = 0.03)
plot(credit.prunedtree)
text(credit.prunedtree)
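The cp value of roughly 0.03 used above was read off the printcp()/plotcp() output. The one-standard-error rule quoted in the answer to Q2 (iii) can also be applied programmatically; a brief sketch, continuing from the credit.fulltree object fitted above:

# 1-SE rule: choose the simplest subtree whose cross-validated error is within
# one standard error of the minimum, then prune at that complexity parameter.
cp.tab <- credit.fulltree$cptable
thresh <- min(cp.tab[, "xerror"]) + cp.tab[which.min(cp.tab[, "xerror"]), "xstd"]
best   <- min(which(cp.tab[, "xerror"] <= thresh))   # first (smallest) tree under the threshold
credit.prunedtree <- prune(credit.fulltree, cp = cp.tab[best, "CP"])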
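For Q2 part (ii), a rough R analogue of the canonical discriminant analysis can be sketched with MASS::lda on a dummy-coded design matrix. This is only an approximation to PROC CANDISC's total-sample standardised canonical coefficients (the scaling conventions differ), but it identifies the variables loading most strongly on the first canonical variable:

# Approximate canonical discriminant analysis in R (illustrative only).
library(MASS)
X   <- model.matrix(~ . - class, data = credit)[, -1]   # dummy-code the predictors
fit <- lda(X, grouping = credit$class)
std.coef <- fit$scaling[, 1] * apply(X, 2, sd)           # scale by total-sample SDs
head(sort(abs(std.coef), decreasing = TRUE), 10)         # strongest loadings on the first canonical variable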