Group Task M1L7
Data Mining & Visualization
Model 1, Class #7
Professor: Lee

Create a JMP Journal (or Word file) to hold copies of your output (screenshots) and your answers to the questions.

1. Hierarchical Clustering: Use the BirthDeathSubset.jmp file with mortality (birth and death) rates for several countries. Determine which countries share similar mortality characteristics by examining two hierarchical cluster models (use Ward and Complete clustering). Note the label has been assigned to the Country column. Color the clusters (click Red Triangle > Color Clusters). What differences do you see in the two dendrograms as far as how they group countries as you define larger numbers of clusters? Move the diamond on the dendrograms until you have clusters you can explain/that make sense. Put one dendrogram into your journal (or Word report) and comment on the similarities of the countries that group together.

The largest cluster under the Ward method was the blue cluster. Under the Complete method, that cluster was still the largest and contained the same number of countries. One noticeable difference involved South Africa and Russia: in the Ward version they are isolated from the red cluster, whereas in the Complete version they are connected to the red cluster (Afghanistan, Zaire).

2. K-Means (Non-Hierarchical clustering): Suppose that we are examining financial data from companies. The objective is to understand the structure of the different types of firms based on these basic financial measures for these 97 companies. Data can be found in Financial.jmp (Sample Data, Business and Demographic). The columns are shown: [screenshot of column list not included]

a) How many clusters do you think are a reasonable choice based on this data (conceptually)?

I believe about 6-7 clusters would be a reasonable choice for segmenting this data, based on the subject matter and variables.

b) Use cluster analysis based on the k-means algorithm.

i) Try 2, 3, and 4 initial clusters.
ii) Which do you think is the “best” # of clusters?

The best # of clusters is 3. In the cluster comparison, the highest CCC (Cubic Clustering Criterion) value was NCluster[3] = -1.9318, so by that criterion three clusters is the best of the solutions tried.

iii) Examine the cluster means (averages) and the Counts to make your decision.

According to the cluster counts, cluster #2 has the highest count at 78 and cluster #1 has the lowest at 2 (our most unique cluster). Although few companies fall in cluster #1, it is high performing: it leads in average sales, profit, and assets. Because it contains only two companies, however, it is less useful as a general segment. Cluster #2 contains most of the companies, but they do not perform as well as those in clusters #1 and #3.

MEANS (H = high, M = medium, L = low)
Cluster   Sales   Profit   Emp   Profits/emp   Assets   Sales/emp   Stockholder’s eq
1         H       H        L     M             H        M           H
2         L       L        H     L             L        L           L
3         M       M        M     H             M        H           M

STANDARD DEVIATIONS (H = high, M = medium, L = low)
Cluster   Sales   Profit   Emp   Profits/emp   Assets   Sales/emp   Stockholder’s eq
1         M       L        H     L             L        H           M
2         L       M        M     M             M        L           L
3         H       H        L     H             H        M           H

iv) Also examine the scatterplot matrices and the Parallel Coordinate Plots.

c) Include a copy of your three Cluster Summary, Means, and Standard Deviations outputs, along with your scatterplots and Parallel Coordinate Plots, for each number of cluster choices.

d) Interpret and “label” each cluster for the analysis with your final cluster choices.

e) Save the Clusters to the JMP table.

f) Finally, redo the clusters without Sales and Sales/emp and save the new clusters to your table.
g) Now use Analyze, Fit Model (regression) to predict Sales with the variables #emp, assets, stockholder’s equity, and your Cluster variables.

Of all the predictors in our regression model, only two were significantly related to Sales: assets and stockholder’s equity. I concluded this because their p-values were below 0.05. The p-values for #emp and all the cluster variables were above 0.05, which tells me that they are not significantly related to Sales.

h) Put a copy of your output into your journal (or Word doc).

i) Comment on each variable’s significance, including Cluster, and its strength in predicting Sales.
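
Supplementary sketch for question 2 b): the comparison of 2, 3, and 4 clusters can also be illustrated outside JMP. The Python below is only a minimal sketch of the k-means idea, not JMP's implementation. The data points are made up for illustration (they are not from Financial.jmp), the function name kmeans and the farthest-point starting rule are assumptions of this sketch, and the within-cluster sum of squares is used as a simple stand-in for fit because JMP's CCC is not reproduced here.

```python
import math

def kmeans(points, k, iters=100):
    """Basic k-means (Lloyd's algorithm) on a list of numeric tuples.

    Returns (labels, centroids, within-cluster sum of squares).
    """
    # Assumption of this sketch: deterministic farthest-point start,
    # so repeated runs give the same answer.
    centroids = [points[0]]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(far)

    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # no point changed cluster, so the solution is stable
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))

    wss = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, wss

# Hypothetical toy data: three visibly separate groups of four points each.
points = [
    (1, 1), (1, 2), (2, 1), (2, 2),
    (10, 10), (10, 11), (11, 10), (11, 11),
    (20, 1), (20, 2), (21, 1), (21, 2),
]

results = {}
for k in (2, 3, 4):
    labels, centroids, wss = kmeans(points, k)
    counts = [labels.count(j) for j in range(k)]
    results[k] = {"counts": counts, "wss": wss}
    print(f"k={k}  counts={counts}  within-cluster SS={wss:.3f}")
```

The within-cluster sum of squares generally shrinks as more clusters are allowed, so it should be read together with the cluster counts and means, as in part iii), rather than used alone to pick the number of clusters. The toy columns share one scale; real financial columns measured in very different units would likely need to be standardized first, since k-means is sensitive to scale.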