Seven (plus or minus two) Clusters, A Monte Carlo Study

Larry Hoyle,
Policy Research Institute,
The University of Kansas
1972 Kansas Statistical Abstract
Shading by Overprinting
Shading by Line Spacing
Line Shading Detail
What did they have in common?
• Neither method is “continuous”
• So both methods required grouping or classes
Fixed number of combinations
Characters on a fixed grid
Integer number of lines in the polygon
Lines are relatively coarse
How to Group for Shading
• Equal Intervals
• Equal numbers (quantiles)
• By clusters
• Don’t group (unclassed)
Population Density – 7 Equal Intervals
100 counties fall into the bottom class
Population Density - Equal Numbers
15 counties in each class - a very different picture
Population Density - Cluster Means
Group around the 7 values that “best” represent the data
Population Density - Unclassed
No classes, just shade in proportion to value
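The first two grouping rules above can be sketched in a few lines. This is only an illustration: the function names and the skewed sample values are hypothetical, not the Kansas county data, and NumPy is assumed to be available.

```python
import numpy as np

def equal_interval_classes(values, n_classes):
    """Assign each value to one of n_classes equal-width intervals (0-based)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), n_classes + 1)
    # Use interior edges only, so the maximum lands in the top class.
    return np.digitize(values, edges[1:-1])

def equal_number_classes(values, n_classes):
    """Assign values to classes holding (as nearly as possible) equal counts."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values, kind="stable"), kind="stable")
    return ranks * n_classes // len(values)

# Hypothetical, strongly skewed values for 105 areas (NOT the real densities).
rng = np.random.default_rng(0)
density = rng.exponential(scale=20.0, size=105)

equal_width = np.bincount(equal_interval_classes(density, 7), minlength=7)
equal_count = np.bincount(equal_number_classes(density, 7), minlength=7)
```

With skewed data, `equal_width` piles most areas into the bottom class, while `equal_count` holds 15 in every class.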
Clustering
• Tries for “Best” grouping
• Each member of cluster can be represented by the mean of the group
Proc Fastclus
• You specify the number of clusters
• Minimizes the within-cluster sum of squared distances (i.e., minimum within-cluster variance)
• Inspired by:
  – k-means (MacQueen)
  – leader algorithm (Hartigan)
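The idea of minimizing within-cluster squared distance can be sketched for a single variable as follows. This is a plain Lloyd-style k-means, not PROC FASTCLUS itself: the function name `kmeans_1d`, the quantile seeding, and the demo values are all assumptions made for illustration.

```python
import numpy as np

def kmeans_1d(values, k, n_iter=100):
    """Plain Lloyd-style k-means for one variable.

    Assumption: seeds sit at evenly spaced quantiles, which is NOT
    how PROC FASTCLUS chooses its initial seeds.
    """
    x = np.asarray(values, dtype=float)
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)
    for _ in range(n_iter):
        # Assign each observation to its nearest center.
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new_centers = centers.copy()
        for j in range(k):
            members = x[labels == j]
            if members.size:          # keep the old center if a cluster empties
                new_centers[j] = members.mean()
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Hypothetical, well-separated demo values.
demo = [1, 2, 3, 101, 102, 103, 201, 202, 203]
labels, centers = kmeans_1d(demo, k=3)
```

Each observation ends up labeled with the cluster whose mean is closest, which is the property the shading classes rely on.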
Example clustering - data
[Figure: data values plotted along x (0 to 90); rows labeled data and cluster]
4 clusters
[Figure: data and cluster rows plotted along x (0 to 90); R-squared=.9912]
4 clusters data

Cluster   Original   Cluster
Number    Value      Mean
1         2          6.9
1         3          6.9
1         5          6.9
1         8          6.9
1         9          6.9
1         10         6.9
1         11         6.9
2         18         21.3
2         20         21.3
2         22         21.3
2         25         21.3
3         40         42.7
3         42         42.7
3         46         42.7
4         73         77.6
4         75         77.6
4         77         77.6
4         78         77.6
4         79         77.6
4         80         77.6
4         81         77.6

Correlation .9956
R-squared=.9912
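The correlation and R-squared quoted for the 4-cluster example can be checked directly from the values and cluster numbers listed above. A minimal sketch, assuming NumPy; the variable names are illustrative, and the cluster means are recomputed rather than taken from the rounded column.

```python
import numpy as np

value = np.array([2, 3, 5, 8, 9, 10, 11,
                  18, 20, 22, 25,
                  40, 42, 46,
                  73, 75, 77, 78, 79, 80, 81], dtype=float)
cluster = np.array([1] * 7 + [2] * 4 + [3] * 3 + [4] * 7)

# Replace every value by the (unrounded) mean of its cluster.
means = {c: value[cluster == c].mean() for c in np.unique(cluster)}
fitted = np.array([means[c] for c in cluster])

within = np.sum((value - fitted) ** 2)        # within-cluster sum of squares
total = np.sum((value - value.mean()) ** 2)   # total sum of squares
r_squared = 1.0 - within / total
correlation = np.corrcoef(value, fitted)[0, 1]

print(round(correlation, 4), round(r_squared, 4))
```

R-squared here is the share of the total variance retained when each value is replaced by its cluster mean, and it equals the squared correlation between the two columns.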
3 clusters
[Figure: data and cluster rows plotted along x (0 to 90); R-squared=.9609]
How many clusters is enough?
Plot R-squared by number of clusters
[Figure: RSQ (0.0 to 1.0) by # of Clusters (axis 2 to 13). Sample of 300 observations, uniform distribution, 11 cluster analyses]
What happens if there really
aren’t any clusters?
Let’s try 500 samples
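The experiment can be sketched as below: draw many samples from a distribution with no real clusters, cluster each one for a range of cluster counts, and record R-squared. This is a rough sketch under stated assumptions, not the study's code: it uses a simple k-means with quantile seeding instead of PROC FASTCLUS, it assumes the 11 clusterings run from 2 to 12 clusters, and the names `kmeans_r2`, `monte_carlo`, and the three draw functions are hypothetical.

```python
import numpy as np

def kmeans_r2(x, k, n_iter=50):
    """R-squared (1 - within SS / total SS) from a simple 1-D k-means."""
    centers = np.quantile(x, (np.arange(k) + 0.5) / k)   # assumed seeding
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers), axis=1)
        sums = np.bincount(labels, weights=x, minlength=k)
        counts = np.bincount(labels, minlength=k)
        # Keep the previous center wherever a cluster lost all its members.
        centers = np.where(counts > 0, sums / np.maximum(counts, 1), centers)
    within = np.sum((x - centers[labels]) ** 2)
    return 1.0 - within / np.sum((x - x.mean()) ** 2)

def monte_carlo(draw, n_obs=300, n_samples=500, ks=range(2, 13), seed=0):
    """Return an (n_samples, len(ks)) array of R-squared values."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_samples, len(ks)))
    for i in range(n_samples):
        x = draw(rng, n_obs)
        out[i] = [kmeans_r2(x, k) for k in ks]
    return out

def uniform(rng, n):
    return rng.uniform(size=n)

def normal(rng, n):
    return rng.normal(size=n)

def exponential(rng, n):
    return rng.exponential(size=n)

rsq = monte_carlo(uniform, n_samples=20)   # small run for illustration
worst = rsq.min(axis=0)                    # minimum R-squared per cluster count
```

Passing `normal` or `exponential` and `n_obs=1000` gives the other cases; `worst` corresponds to the lowest R-squared seen across samples at each cluster count.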
Uniform, 300 obs. per sample
[Figure: RSQ (0.69 to 1.00) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
Uniform, 1000 obs. per sample
[Figure: RSQ (0.72 to 1.00) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
Normal, 300 obs. per sample
[Figure: RSQ (0.56 to 1.00) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
Normal, 1000 obs. per sample
[Figure: RSQ (0.60 to 0.99) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
Exponential, 300 obs. per sample
[Figure: RSQ (0.2 to 1.0) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
Distribution of worst sample
Exponential, 1000 obs. per sample
[Figure: RSQ (0.1 to 1.0) by # of Clusters (axis 2 to 13); 500 samples, 11 clusterings each]
So What’s with 72?
Uniform, 72
RSQ
1.00
0.99
0.98
0.97
0.96
0.95
0.94
0.93
0.92
0.91
0.90
0.89
0.88
0.87
0.86
0.85
0.84
0.83
0.82
0.81
0.80
0.79
0.78
0.77
0.76
0.75
0.74
0.73
0.72
0.71
0.70
0.69
500 samples,
11 clusterings
each
2
3
4
5
6
7
8
# of Clusters
9
10
11
12
13
Normal, 72
RSQ
1.00
0.98
0.96
0.94
0.92
0.90
0.88
0.86
0.84
0.82
0.80
0.78
0.76
0.74
0.72
0.70
0.68
0.66
0.64
0.62
0.60
0.58
0.56
500 samples,
11 clusterings
each
2
3
4
5
6
7
8
# of Clusters
9
10
11
12
13
Exponential, 72
RSQ
1.0
0.9
0.8
0.7
0.6
500 samples,
11 clusterings
each
0.5
0.4
0.3
0.2
2
3
4
5
6
7
8
# of Clusters
9
10
11
12
13
Minimum R squared by sample size and distribution

            Exponential       Normal            Uniform
Clusters    300      1000     300      1000     300      1000
5           0.883    0.865    0.877    0.887    0.951    0.956
6           0.920    0.899    0.908    0.922    0.966    0.969
7           0.938    0.926    0.933    0.940    0.975    0.978
8           0.953    0.934    0.945    0.947    0.982    0.982
9           0.966    0.951    0.957    0.957    0.985    0.986

At least 95% of the variance for all
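The table above can be scanned for the smallest number of clusters at which every distribution and sample size reaches the 95% mark. A small sketch, assuming NumPy; the array and variable names are illustrative, and the numbers are copied from the table.

```python
import numpy as np

clusters = np.array([5, 6, 7, 8, 9])
# Columns: Exponential 300, 1000; Normal 300, 1000; Uniform 300, 1000.
min_rsq = np.array([
    [0.883, 0.865, 0.877, 0.887, 0.951, 0.956],
    [0.920, 0.899, 0.908, 0.922, 0.966, 0.969],
    [0.938, 0.926, 0.933, 0.940, 0.975, 0.978],
    [0.953, 0.934, 0.945, 0.947, 0.982, 0.982],
    [0.966, 0.951, 0.957, 0.957, 0.985, 0.986],
])

# A row qualifies only if all six minimum R-squared values reach 0.95.
meets_target = (min_rsq >= 0.95).all(axis=1)
smallest = clusters[meets_target][0]
print(smallest)
```

On these figures the first row to qualify is 9 clusters; the uniform columns alone already pass at 5.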
Histograms
• Equal intervals
• Number of observations in each interval
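The two bullets above describe an ordinary equal-width histogram. A minimal sketch, assuming NumPy; the sample is hypothetical (standard normal draws, not the data behind the charts) and the choice of 9 intervals is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.normal(size=1000)                  # hypothetical sample

# 9 equal-width intervals spanning the range of the data.
counts, edges = np.histogram(v, bins=9)
midpoints = (edges[:-1] + edges[1:]) / 2   # one midpoint per interval
```

`counts` holds the number of observations in each interval and `midpoints` gives the values a chart would label on its axis.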
[Figure: histogram, FREQUENCY (0 to 120) by v MIDPOINT (-3.0 to 1.8)]
dither
Needle Plot of Cluster Means
[Figure: needle plot of cluster means along v]
[Figure: histogram (FREQUENCY by v MIDPOINT) shown with a needle plot (Count by v)]
Bar chart needs more bars
[Figure: bar chart, Count (0 to 80) by v MIDPOINT (-3.0 to 1.8)]
The Magical Number Seven, Plus
or Minus Two: Some Limits on
Our Capacity for Processing
Information
George Miller,
The Psychological Review
1956, vol. 63, pp. 81-97
Limits on Categories for
Absolute Judgments
• Pitch 6
• Loudness 5
• Visual position 9
• Size of a square 5
• Hue 8
Name the colors in this slide
“And finally, what about the
magical number seven?”
George A. Miller
Miller – Quote 1
“What about the
• seven wonders of the world
• seven seas
• seven deadly sins
• seven daughters of Atlas in the Pleiades
• seven ages of man
• seven levels of hell
• seven primary colors
• seven notes of the musical scale
• seven days of the week”
Miller – Quote 2
“What about the
• seven-point rating scale
• seven categories for absolute judgment
• seven objects in the span of attention
• seven digits in the span of immediate memory”
Miller – Quote 3
“…Perhaps there is something deep and profound behind all these sevens, something just calling out for us to discover it.”
Miller - close
“But I suspect that it is only a pernicious, Pythagorean
coincidence.”
Coincidence or Nature’s Parsimony?
Does our capacity match what’s needed for 95% of the variance?
95%? Hmmmm…
• confidence intervals
• 19 fingers and toes
• an A
• 970,000 web pages
Larry Hoyle
Policy Research Institute
University of Kansas
LarryHoyle@ku.edu