Uploaded by ANG ZHI NUO

Cluster analysis of USArrests using R

ANG ZHI NUO SIT3010 T3
Background of data
The "USArrests" dataset, which ships with R, records arrests per 100,000 residents for
murder, assault, and rape in each of the 50 U.S. states in 1973, together with the percentage
of each state's population living in urban areas. R's documentation cites the World Almanac
and Book of Facts 1975 as the source of the crime figures. The dataset has 50 rows, one per
state, and 4 columns: Murder, Assault, UrbanPop, and Rape. The three crime variables are
rates per 100,000 people, while UrbanPop is a percentage, so the variables sit on different
scales. The dataset supports analysis of regional differences in violent crime and of how the
arrest rates relate to one another and to urbanization. It contains no other demographic or
socio-economic variables.
Result
Figure 1. Plot of Euclidean distance matrix
The distance matrix plot shows how similar or dissimilar each pair of states is. Darker
squares mark larger distances (greater dissimilarity), while lighter squares mark smaller
distances (greater similarity). The plot already suggests groupings in the data. For
instance, Vermont and North Dakota are both far from Nevada, which hints at the formation of
at least two distinct clusters. Wisconsin, on the other hand, is close to Maine, so those two
states are likely to fall in the same cluster.
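The pairwise distances described above can be sketched in a few lines. The report's own analysis was done in R; the plain-Python sketch below is only an illustration. It assumes the variables were standardized before the distances were computed, and the raw rows, column means, and standard deviations are quoted from R's built-in dataset and should be re-checked there. The helper names are hypothetical.

```python
import math

# Raw rows (Murder, Assault, UrbanPop, Rape) for the states named in the text.
# ASSUMPTION: quoted from R's built-in USArrests; verify against the dataset.
ROWS = {
    "Vermont": (2.2, 48, 32, 11.2),
    "North Dakota": (0.8, 45, 44, 7.3),
    "Nevada": (12.2, 252, 81, 46.0),
    "Wisconsin": (2.6, 53, 66, 10.8),
    "Maine": (2.1, 83, 51, 7.8),
}

# ASSUMPTION: column means and sample standard deviations over all 50 states,
# rounded; in R these would come from scale(USArrests).
MEANS = (7.788, 170.76, 65.54, 21.232)
SDS = (4.3555, 83.3377, 14.4748, 9.3664)


def standardize(row):
    """Convert a raw row to z-scores so that no single variable dominates."""
    return tuple((x - m) / s for x, m, s in zip(row, MEANS, SDS))


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


Z = {state: standardize(row) for state, row in ROWS.items()}


def state_distance(first, second):
    """Distance between two states in standardized units."""
    return euclidean(Z[first], Z[second])


for first, second in [("Vermont", "North Dakota"), ("Vermont", "Nevada"),
                      ("Wisconsin", "Maine")]:
    print(first, "-", second, round(state_distance(first, second), 2))
```

Under these assumptions, Vermont and North Dakota come out much closer to each other than either is to Nevada, which matches the pattern read off Figure 1.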
Figure 2. Cluster plots for different numbers of clusters
Figure 3. Elbow plot for different numbers of clusters
Cluster plots were generated for several values of k to determine the optimal number of
clusters. For k=4, the data are partitioned into four distinct, non-overlapping clusters.
From k=5 onward the clusters begin to overlap, which indicates a poorer partition. The elbow
method supports this choice: increasing k beyond 4 does not lead to a substantial reduction
in the within-cluster sum of squares (WSS), so a 4-cluster solution is appropriate.
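The elbow method can be illustrated with a minimal sketch. The code below is plain Python, not the report's R code, and it runs Lloyd's algorithm with a deterministic start on made-up toy points rather than on USArrests. R's `kmeans` uses a different default algorithm and random starts, so treat this only as a picture of how the WSS curve is built. All names and data here are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm, started from k evenly spaced input points."""
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = mean_point(members)
    return centers, labels


def total_wss(points, centers, labels):
    """Total within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, centers[lab]) for p, lab in zip(points, labels))


def elbow_curve(points, k_values):
    """Map each k to the WSS of the clustering found for that k."""
    curve = {}
    for k in k_values:
        centers, labels = kmeans(points, k)
        curve[k] = total_wss(points, centers, labels)
    return curve


# Made-up toy data with two obvious groups (NOT the USArrests data).
TOY_POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
curve = elbow_curve(TOY_POINTS, [1, 2, 3])
for k, wss in curve.items():
    print(k, round(wss, 3))
```

On the toy points, WSS falls sharply from k=1 to k=2 and only slightly from k=2 to k=3, so the elbow sits at k=2. The report applies the same reasoning to conclude that the elbow for USArrests is at k=4.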
Table 1. Summary table of k-means clustering with k=4 (cluster means)
Cluster   Murder       Assault      UrbanPop     Rape
1          1.4118898    0.8743346   -0.8145211    0.01927104
2         -0.9615407   -1.106601    -0.9301069   -0.9667633
3         -0.4894375   -0.3826001    0.5758298   -0.2616538
4          0.6950701    1.0394414    0.722637     1.27693964
Table 2. States in each cluster
Group 1          Group 2          Group 3          Group 4
Alabama          Idaho            Connecticut      Alaska
Arkansas         Iowa             Delaware         Arizona
Georgia          Kentucky         Hawaii           California
Louisiana        Maine            Indiana          Colorado
Mississippi      Minnesota        Kansas           Florida
North Carolina   Montana          Massachusetts    Illinois
South Carolina   Nebraska         New Jersey       Maryland
Tennessee        New Hampshire    Ohio             Michigan
                 North Dakota     Oklahoma         Missouri
                 South Dakota     Oregon           Nevada
                 Vermont          Pennsylvania     New Mexico
                 West Virginia    Rhode Island     New York
                 Wisconsin        Utah             Texas
                                  Virginia
                                  Washington
                                  Wyoming
For the K-means solution with four clusters, Tables 1 and 2 show the composition of each
cluster and the cluster means for Murder, Assault, UrbanPop, and Rape. Positive means are
read as above average and negative means as below average. Cluster 1, with 8 states, has
above-average values for Murder and Assault, a below-average value for UrbanPop, and a Rape
value close to the average. Cluster 2, with 13 states, is below average on all four
variables. Cluster 3, the largest with 16 states, has below-average Murder, Assault, and Rape
values but an above-average UrbanPop value. Finally, Cluster 4, with 13 states, is above
average on all four variables, indicating a higher prevalence of violent crime in this
cluster.
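The reading of the cluster means can be checked mechanically. The sketch below is plain Python, for illustration only. It takes the cluster means from Table 1 and assumes they are standardized values, so that 0 corresponds to the overall mean. The `tolerance` cut-off for "near average" is a hypothetical choice, not something taken from the report.

```python
VARIABLES = ("Murder", "Assault", "UrbanPop", "Rape")

# Cluster means as reported in Table 1.
# ASSUMPTION: these are z-scores, so 0 is the overall mean of each variable.
CENTERS = {
    1: (1.4118898, 0.8743346, -0.8145211, 0.01927104),
    2: (-0.9615407, -1.106601, -0.9301069, -0.9667633),
    3: (-0.4894375, -0.3826001, 0.5758298, -0.2616538),
    4: (0.6950701, 1.0394414, 0.722637, 1.27693964),
}


def describe(center, tolerance=0.1):
    """Label each variable as above, below, or near the overall mean."""
    labels = {}
    for name, value in zip(VARIABLES, center):
        if abs(value) <= tolerance:  # hypothetical cut-off for "near average"
            labels[name] = "near average"
        elif value > 0:
            labels[name] = "above average"
        else:
            labels[name] = "below average"
    return labels


for cluster, center in CENTERS.items():
    print(cluster, describe(center))
```

With this cut-off, only the Rape value of Cluster 1 is labeled near average; every other entry falls clearly above or below the mean.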
Discussion
The K-means analysis partitions the states into four clusters with distinct profiles.
Cluster 1 combines low urbanization (low UrbanPop) with a high murder rate and an
above-average assault rate. One possible explanation is that law enforcement and social
services are less accessible in less urban settings, which could contribute to higher rates
of violent crime. Cluster 2 is also less urban, but has markedly lower rates of murder,
assault, and rape. Less crowded living conditions and possibly stronger community ties may
play a role here. Cluster 3 pairs high urbanization (high UrbanPop) with below-average crime
rates. In these states, urbanization might go along with more efficient law enforcement and
social programs, contributing to the lower crime rates. Lastly, Cluster 4 is highly urban as
well, but has high rates of murder, assault, and rape. The contrast between Clusters 3 and 4
shows that urbanization alone does not determine crime levels. Factors such as population
density, economic disparities, and social dynamics, none of which are measured in this
dataset, may create conditions conducive to higher crime rates. Overall, the analysis shows
that the relationship between urbanization and violent crime differs considerably across
groups of states.