Uploaded by yungfeelx

Group Task M1L7
Data Mining & Visualization
Model 1, Class #7
Professor: Lee
Create a JMP Journal (or Word file) to hold copies of your output (screenshots) and your
answers to the questions.
1. Hierarchical Clustering:
• Use the BirthDeathSubset.jmp file with mortality (birth and death) rates for several countries.
• Determine which countries share similar mortality characteristics by examining two hierarchical
cluster models (use Ward and Complete clustering).
Note the label has been assigned to the Country column.
• Color the Clusters (click Red Triangle > Color Clusters).
What differences do you see in the two dendrograms as far as how they group countries as you
define larger numbers of clusters?
• Move the diamond on the dendrograms until you have clusters you can explain/that make sense.
• Put one dendrogram into your journal (or Word report) and comment on the similarities of the
countries that group together.
The largest cluster using the Ward method was the blue cluster. With the Complete method, that
cluster was still the largest, with the same number of countries. However, one noticeable
difference was that South Africa and Russia, which were isolated from the red cluster in the
Ward version, are connected to the red cluster (Afghanistan, Zaire) in the Complete version.
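The kind of difference described above, where an in-between country attaches to different neighbors depending on the method, can be reproduced on a toy data set. The sketch below is only an illustration under stated assumptions: the labels A to G and their rate values are made up (they are not the BirthDeathSubset.jmp values), the function name `agglomerate` is hypothetical, the values are used unstandardized, and the merging uses the textbook Lance-Williams update formulas on squared Euclidean distances rather than JMP's own implementation.

```python
from itertools import combinations

# Made-up (birth rate, death rate) pairs per 1,000 people; labels are hypothetical.
rates = {
    "A": (10, 9), "B": (11, 9), "C": (12, 10), "D": (11, 10), "E": (10, 10),
    "F": (30, 10),  # moderately different from the tight group A-E
    "G": (52, 12),  # far from everything else
}

def agglomerate(points, n_clusters, method):
    """Merge the closest pair of clusters until n_clusters remain."""
    labels = list(points)
    clusters = {i: [label] for i, label in enumerate(labels)}
    # Start from squared Euclidean distances between single points.
    dist = {
        frozenset((i, j)): sum(
            (a - b) ** 2 for a, b in zip(points[labels[i]], points[labels[j]])
        )
        for i, j in combinations(range(len(labels)), 2)
    }
    next_id = len(labels)
    while len(clusters) > n_clusters:
        # Pick the pair of active clusters with the smallest linkage distance.
        i, j = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        d_ij = dist[frozenset((i, j))]
        n_i, n_j = len(clusters[i]), len(clusters[j])
        # Lance-Williams update: distance from the merged cluster to each other one.
        for m in clusters:
            if m in (i, j):
                continue
            d_im, d_jm = dist[frozenset((i, m))], dist[frozenset((j, m))]
            n_m = len(clusters[m])
            if method == "complete":
                d_new = max(d_im, d_jm)
            elif method == "ward":
                d_new = (
                    (n_i + n_m) * d_im + (n_j + n_m) * d_jm - n_m * d_ij
                ) / (n_i + n_j + n_m)
            else:
                raise ValueError(f"unknown method: {method}")
            dist[frozenset((next_id, m))] = d_new
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

for method in ("ward", "complete"):
    print(method, agglomerate(rates, 2, method))
# ward     [['A', 'B', 'C', 'D', 'E'], ['F', 'G']]
# complete [['A', 'B', 'C', 'D', 'E', 'F'], ['G']]
```

With three clusters both methods agree (A to E, F, and G each stand apart); they only diverge at the two-cluster cut. Ward weighs the cost of a merge by cluster size, so it prefers pairing the two singletons, while Complete looks only at the farthest pair of points and attaches F to the large group.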
2. K-Means (Non-Hierarchical clustering):
• Suppose that we are examining financial data from companies. The objective is to understand the
structure of the different types of firms based on these basic financial measures for these 97
companies. Data can be found in Financial.jmp (Sample Data, Business and Demographic). The
columns are shown:
a) How many clusters do you think are a reasonable choice based on this data (conceptually)?
I believe about 6-7 clusters would be a reasonable choice to segment this data based on the subject
matter and variables.
b) Use cluster analysis based on the k-means algorithm.
i) Try 2, 3, and 4 initial clusters.
ii) Which do you think is the “best” # of clusters?
The best # of clusters is 3. In the cluster comparison, NCluster = 3 had the highest CCC (Cubic
Clustering Criterion, a measure of compactness and separation) value, -1.9318, so 3 is the best
of the cluster counts we tried.
iii) Examine the cluster means (averages) and the Counts to make your decision.
According to the cluster counts, cluster #2 has the highest count (78) and cluster #1 has the
lowest (2), which makes cluster #1 our most distinctive cluster. Even though few companies fall
into it, cluster #1 is high performing: it leads in average sales, profit, and assets. Because so
few companies belong to it, however, it is less useful as a segment. Cluster #2 holds most of
the companies, but they do not perform as well as those in clusters #1 and #3.
MEANS
CLUSTER 1: H sales, H profit, L emp, M profits/emp, H assets, M sales/emp, H stockholder’s eq
CLUSTER 2: L sales, L profit, H emp, L profits/emp, L assets, L sales/emp, L stockholder’s eq
CLUSTER 3: M sales, M profit, M emp, H profits/emp, M assets, H sales/emp, M stockholder’s eq
STANDARD DEVIATIONS
CLUSTER 1: M sales, L profit, H emp, L profits/emp, L assets, H sales/emp, M stockholder’s eq
CLUSTER 2: L sales, M profit, M emp, M profits/emp, M assets, L sales/emp, L stockholder’s eq
CLUSTER 3: H sales, H profit, L emp, H profits/emp, H assets, M sales/emp, H stockholder’s eq
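The pattern in the counts and means above, a very small high-performing cluster next to a large low-performing one, can be reproduced with a minimal k-means sketch. Everything here is an assumption made for illustration: the `firms` values are invented (sales, profit) pairs and not the Financial.jmp data, the function names are hypothetical, the values are not standardized, and the deterministic farthest-first seeding is a simplification that may differ from how JMP chooses its starting clusters. The sketch does not compute CCC.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_row(rows):
    """Column-wise mean of a non-empty list of rows."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def kmeans(rows, k, max_iter=100):
    # Deterministic farthest-first seeding (an assumption made for reproducibility).
    centers = [rows[0]]
    while len(centers) < k:
        centers.append(max(rows, key=lambda r: min(sqdist(r, c) for c in centers)))
    for _ in range(max_iter):
        # Assignment step: each row joins its nearest center.
        groups = [[] for _ in range(k)]
        for r in rows:
            nearest = min(range(k), key=lambda c: sqdist(r, centers[c]))
            groups[nearest].append(r)
        # Update step: move each center to the mean of its rows.
        new_centers = [mean_row(g) if g else centers[i] for i, g in enumerate(groups)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups

# Invented (sales, profit) values in $M: many small firms, a few mid-size, two large.
firms = [
    (10, 1), (12, 1), (11, 2), (9, 1), (13, 2), (11, 1),
    (50, 8), (55, 9), (52, 10),
    (200, 30), (210, 35),
]

for k in (2, 3, 4):
    centers, groups = kmeans(firms, k)
    print(k, "clusters -> counts", [len(g) for g in groups])
```

For k = 3 on this toy data the counts come out as 6, 2, and 3, and the two-firm cluster has by far the highest mean sales and profit, which mirrors the trade-off between cluster size and performance discussed above.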
iv) Also examine the scatterplot matrices and the Parallel Coordinate Plots.
c) Put a copy of your three Cluster Summary, Means, and Standard Deviations outputs, along with
your scatterplots and Parallel Coordinate Plots, into your journal for each number of clusters.
d) Interpret and “label” each cluster for the analysis with your final cluster choices.
e) Save the Clusters to the JMP table.
f) Finally, redo the clusters without Sales and Sales/emp and save the new clusters to your
table.
g) Now use Analyze, Fit Model (regression) to predict Sales with the variables #emp, assets,
stockholder’s equity, and your Cluster variables.
Out of all the predictors in our regression model, only two variables were strongly significant
predictors of sales: assets and stockholder’s equity. I concluded this because their p-values
were below 0.05.
The variables #emp, stockholder’s eq, and all the cluster variables had p-values above 0.05,
which tells me that they are not significantly related to sales.
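To make the "p-value below 0.05" rule concrete, here is a small least-squares sketch that computes a t statistic for each coefficient and compares it with the two-sided 5% critical value. It is a sketch under loud assumptions: the `assets`, `in_cluster`, and `sales` values are invented, a single 0/1 indicator stands in for the Cluster variable (JMP may code nominal effects differently), the function names are hypothetical, and the critical value 2.571 is the standard t-table entry for 5 degrees of freedom (8 rows minus 3 coefficients). Comparing |t| with that value is equivalent to checking p < 0.05.

```python
import math

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ols_t_stats(X, y):
    """Return (coefficients, t statistics) for y ~ X; X must include an intercept column."""
    n, p = len(X), len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [yi - sum(b * x for b, x in zip(beta, row)) for row, yi in zip(X, y)]
    s2 = sum(e * e for e in resid) / (n - p)  # residual variance
    t = [b / math.sqrt(s2 * inv[a][a]) for a, b in enumerate(beta)]
    return beta, t

# Invented data: sales depend on assets (about 2 per unit) but not on the cluster flag.
assets = [1, 2, 3, 4, 5, 6, 7, 8]
in_cluster = [1, 0, 0, 1, 1, 0, 0, 1]  # hypothetical 0/1 cluster indicator
sales = [2.3, 3.8, 6.1, 7.6, 10.2, 11.9, 14.4, 15.7]

X = [[1.0, float(a), float(c)] for a, c in zip(assets, in_cluster)]
T_CRIT = 2.571  # two-sided 5% critical value of Student's t with 5 degrees of freedom

beta, t = ols_t_stats(X, sales)
for name, b, ti in zip(("intercept", "assets", "in_cluster"), beta, t):
    print(f"{name:>10}: coef={b:.4f}  t={ti:.2f}  significant={abs(ti) > T_CRIT}")
```

On this toy data the assets coefficient is far beyond the critical value while the cluster indicator is well inside it, the same kind of split between significant and non-significant predictors reported above.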
h) Put a copy of your output into your journal (or Word doc).
i) Comment on each variable’s significance, including Cluster, and its strength in predicting Sales.