Seven (plus or minus two) Clusters, A Monte Carlo Study
Larry Hoyle, Policy Research Institute, The University of Kansas

1972 Kansas Statistical Abstract
[Map image]

Shading by Overprinting
[Map image]

Shading by Line Spacing
[Map image]

Line Shading Detail
[Map image]

What did they have in common?
• Neither method is “continuous”
• So both methods required grouping or classes
• Fixed number of combinations
• Characters on a fixed grid
• Integer number of lines in the polygon
• Lines are relatively coarse

How to Group for Shading
• Equal Intervals
• Equal numbers (quantiles)
• By clusters
• Don’t group (unclassed)

Population Density – 7 Equal Intervals
100 counties fall into the bottom class

Population Density - Equal Numbers
15 counties in each class - a very different picture

Population Density - Cluster Means
Group around the 7 values that “best” represent the data

Population Density - Unclassed
No classes, just shade in proportion to value

Clustering
• Tries for “Best” grouping
• Each member of a cluster can be represented by the mean of the group

Proc Fastclus
• You specify the number of clusters
• Minimizes the within-cluster sum of squared distances (i.e. minimum within-cluster variance)
• Inspired by:
  – k-means (MacQueen)
  – leader algorithm (Hartigan)

Example clustering - data
[Plot: data values along x, 0 to 90; legend: data, cluster]

4 clusters
[Plot: data values and cluster means along x, 0 to 90; R-squared=.9912]

4 clusters
Cluster Number   Original Value   Cluster Mean
1                2                6.9
1                3                6.9
1                5                6.9
1                8                6.9
1                9                6.9
1                10               6.9
1                11               6.9
2                18               21.3
2                20               21.3
2                22               21.3
2                25               21.3
3                40               42.7
3                42               42.7
3                46               42.7
4                73               77.6
4                75               77.6
4                77               77.6
4                78               77.6
4                79               77.6
4                80               77.6
4                81               77.6
Correlation .9956, R-squared=.9912

3 clusters
[Plot: data values and cluster means along x, 0 to 90; R-squared=.9609]

How many clusters is enough?
Plot R-squared by number of clusters
[Plot: RSQ by # of Clusters; sample of 300 observations, uniform distribution, 11 cluster analyses]

What happens if there really aren’t any clusters?
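Before taking up that question, the R-squared reported for the 4-cluster example can be checked by hand. The sketch below is not part of the original slides; it assumes R-squared means one minus the ratio of within-cluster to total sum of squares, and the function name r_squared is made up for illustration. The values and cluster numbers are copied from the 4-cluster table.

```python
def r_squared(values, labels):
    """Share of total variation kept when each value is replaced by its cluster mean."""
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)

    # Group the values by cluster number.
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)

    # Sum of squared distances from each value to its own cluster mean.
    within_ss = 0.0
    for members in groups.values():
        mean = sum(members) / len(members)
        within_ss += sum((v - mean) ** 2 for v in members)

    return 1.0 - within_ss / total_ss


# Original values and cluster numbers from the 4-cluster table.
values = [2, 3, 5, 8, 9, 10, 11, 18, 20, 22, 25,
          40, 42, 46, 73, 75, 77, 78, 79, 80, 81]
labels = [1] * 7 + [2] * 4 + [3] * 3 + [4] * 7

rsq = r_squared(values, labels)
print(round(rsq, 4))         # 0.9912
print(round(rsq ** 0.5, 4))  # 0.9956
```

Under that assumed definition, both numbers agree with the R-squared and correlation shown on the slide.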
Let’s try 500 samples

Uniform, 300 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Uniform, 1000 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Normal, 300 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Normal, 1000 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Exponential, 300 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]
Distribution of worst sample

Exponential, 1000 obs. per sample
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

So What’s with 72?
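Before looking at samples of 72, here is a rough sketch of the simulation design behind the plots above: draw many samples with no real clusters, cluster each one for a range of cluster counts, and keep the worst R-squared per count. It is not the code used for the slides. It swaps Proc Fastclus for a plain Lloyd-style k-means on one variable with evenly spaced starting centers, assumes the 11 clusterings are 2 through 12 clusters, uses unit-scale distributions, and runs far fewer than 500 samples to stay quick. All function names are hypothetical.

```python
import bisect
import random


def split(data, centers):
    """Cut sorted data at the midpoints between neighboring centers."""
    cuts = [(a + b) / 2 for a, b in zip(centers, centers[1:])]
    edges = [0] + [bisect.bisect_right(data, c) for c in cuts] + [len(data)]
    return [data[edges[j]:edges[j + 1]] for j in range(len(centers))]


def kmeans_1d(values, k, iterations=100):
    """Lloyd-style k-means on one variable; returns a list of k clusters."""
    data = sorted(values)
    lo, hi = data[0], data[-1]
    # Assumption: start from k evenly spaced centers across the data range.
    centers = [lo + (j + 0.5) * (hi - lo) / k for j in range(k)]
    for _ in range(iterations):
        clusters = split(data, centers)
        # Move each center to its cluster mean; an empty cluster keeps its center.
        updated = [sum(c) / len(c) if c else old
                   for c, old in zip(clusters, centers)]
        if updated == centers:
            break
        centers = updated
    return split(data, centers)


def r_squared(clusters):
    """One minus within-cluster sum of squares over total sum of squares."""
    values = [v for c in clusters for v in c]
    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    within_ss = 0.0
    for c in clusters:
        if c:
            mean = sum(c) / len(c)
            within_ss += sum((v - mean) ** 2 for v in c)
    return 1.0 - within_ss / total_ss


# Unit-scale generators for the three distributions in the study (scale assumed).
DRAWS = {
    "uniform": lambda rng: rng.random(),
    "normal": lambda rng: rng.gauss(0.0, 1.0),
    "exponential": lambda rng: rng.expovariate(1.0),
}


def min_r_squared(dist, n_obs, n_samples, ks=range(2, 13), seed=0):
    """Worst R-squared seen across samples, for each number of clusters."""
    rng = random.Random(seed)
    worst = {k: 1.0 for k in ks}
    for _ in range(n_samples):
        sample = [DRAWS[dist](rng) for _ in range(n_obs)]
        for k in ks:
            worst[k] = min(worst[k], r_squared(kmeans_1d(sample, k)))
    return worst


# Small run for illustration; the study used 500 samples per setting.
result = min_r_squared("uniform", n_obs=300, n_samples=20)
for k in sorted(result):
    print(k, round(result[k], 3))
```

Because the sample size is a parameter, the same loop covers 300, 1000, or 72 observations. Results from this sketch need not match the slides exactly, since the clustering procedure and starting centers differ.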
Uniform, 72
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Normal, 72
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Exponential, 72
[Plot: RSQ by # of Clusters; 500 samples, 11 clusterings each]

Minimum R squared by sample size and distribution
            Exponential       Normal            Uniform
Clusters    300      1000     300      1000     300      1000
5           0.883    0.865    0.877    0.887    0.951    0.956
6           0.92     0.899    0.908    0.922    0.966    0.969
7           0.938    0.926    0.933    0.94     0.975    0.978
8           0.953    0.934    0.945    0.947    0.982    0.982
9           0.966    0.951    0.957    0.957    0.985    0.986
At least 95% of the variance for all

Histograms
• Equal intervals
• Number of observations in each interval
[Histogram: FREQUENCY by v MIDPOINT]

Needle Plot of Cluster Means
[Plots: needle plots of cluster means (Count by v) shown with histograms (FREQUENCY by v MIDPOINT); one panel labeled “dither”]

Bar chart needs more bars
[Histogram: Count by v MIDPOINT]

The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information
George Miller, The Psychological Review 1956, vol. 63, pp. 81-97

Limits on Categories for Absolute Judgments
• Pitch 6
• Loudness 5
• Visual position 9
• Size of a square 5
• Hue 8

Name the colors in this slide

“And finally, what about the magical number seven?”
George A. Miller

Miller – Quote 1
“What about the
• seven wonders of the world
• seven seas
• seven deadly sins
• seven daughters of Atlas in the Pleiades
• seven ages of man
• seven levels of hell
• seven primary colors
• seven notes of the musical scale
• seven days of the week”

Miller – Quote 2
“What about the
• seven-point rating scale
• seven categories for absolute judgment
• seven objects in the span of attention
• seven digits in the span of immediate memory”

Miller – Quote 3
“…Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it.”

Miller - close
“But I suspect that it is only a pernicious, Pythagorean coincidence.”

Coincidence or Nature’s Parsimony?
Does our capacity match what’s needed for 95% of the variance?

95%? Hmmmm…….
• confidence intervals
• 19 fingers and toes
• an A
• 970,000 web pages

Larry Hoyle
Policy Research Institute
University of Kansas
LarryHoyle@ku.edu