EDA

Exploratory Data Analysis
John Tukey
• Developed these procedures to help one
get a first look at distributions of scores.
• What is the shape of the distribution?
• Are there any suspicious scores?
• Stem and Leaf Display
• Box and Whiskers Plot
Stem and Leaf Display
• See the pulse rate data at Exploratory
Data Analysis (EDA).
• The scores range from 48 to 104.
• We probably want to group them into 5 to
15 intervals.
• I’ll use two intervals for the 40’s, two for
the 50’s, etc.
The Stem
Consists of a column of
leading (aka “most
significant”) digits, the
leftmost digits in the
scores. I’ll add to the
stem the leaves, the
trailing (rightmost, least
significant) digits of each
score.
The Stem With Leaves
Next, I’ll
arrange the
leaves (within
each row)
from lowest to
highest and
add a “depth”
column.
Leaves Arranged in Order
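As a rough illustration, here is a minimal Python sketch of a display with two rows per stem and the leaves in order. The function name `stem_and_leaf` and the short list of scores are made up for the example (they are not the pulse rate data), and the sketch assumes non-negative whole-number scores.

```python
def stem_and_leaf(scores):
    """Return (stem, ordered leaves) rows, two rows per stem (leaves 0-4, then 5-9)."""
    ordered = sorted(scores)
    # score // 5 numbers the half-decade rows: 45-49 -> 9, 50-54 -> 10, 55-59 -> 11, ...
    first, last = ordered[0] // 5, ordered[-1] // 5
    rows = {key: [] for key in range(first, last + 1)}  # keeps empty rows too
    for score in ordered:
        rows[score // 5].append(score % 10)  # the leaf is the trailing digit
    # key // 2 recovers the leading digit(s): row 9 -> 4, rows 10 and 11 -> 5, ...
    return [(key // 2, leaves) for key, leaves in rows.items()]


# Hypothetical scores, for illustration only.
scores = [48, 52, 56, 56, 61, 64, 64, 70, 70, 78, 80, 99, 104]
for stem, leaves in stem_and_leaf(scores):
    print(f"{stem:>2} | {''.join(str(leaf) for leaf in leaves)}")
```

Because the scores are sorted before they are placed, the leaves in each row come out already ordered from lowest to highest.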
The Depth Column
• This column tells you how many scores
there are in that row and all rows between
it and the closer tail of the distribution.
• The row that contains the median has the
row frequency in parentheses.
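The depth rule can be sketched in a few lines of Python. This is only one way to do it: `depth_column` is a hypothetical helper, the row frequencies are invented, and a row is marked as the median row only when both middle scores fall in it.

```python
import math


def depth_column(row_counts):
    """Depths for the rows of a stem-and-leaf display, given row frequencies low to high."""
    n = sum(row_counts)
    median_location = (n + 1) / 2
    # Positions (counting from 1) of the score(s) that define the median.
    lo_mid, hi_mid = math.floor(median_location), math.ceil(median_location)
    depths = []
    below = 0  # number of scores in all lower rows
    for count in row_counts:
        through = below + count  # scores from the low tail through this row
        if below < lo_mid and hi_mid <= through:
            depths.append(f"({count})")  # median row: show the row frequency
        else:
            # Count in from whichever tail is closer.
            depths.append(str(min(through, n - below)))
        below = through
    return depths


# Hypothetical row frequencies, for illustration only.
print(depth_column([1, 3, 5, 4, 2, 1]))
```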
Rotated Display
It looks like a
histogram, but the bars are
made up of the scores.
From this display, can
you identify any scores
that are odd, compared
to most of the other
scores?
Box and Whisker Plot
• Median Location = (N + 1)/2 = 97/2 = 48.5.
• The median will be located between the
48th and the 49th scores from either tail.
There are 40 scores from 48 to 68. Count up 8 more scores,
starting with the first 70. The 48th score is a 70, the 49th
score is a 70, so the median is 70.
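The same counting can be sketched in Python. The function name `median_by_location` and the four-score sample are assumptions made for the example, not part of the pulse rate data.

```python
import math


def median_by_location(scores):
    """Return (median location, median) by counting in from the low end."""
    ordered = sorted(scores)
    n = len(ordered)
    location = (n + 1) / 2  # for N = 96 this is 97 / 2 = 48.5
    # A location ending in .5 falls between two scores, so average them.
    lower = ordered[math.floor(location) - 1]  # positions count from 1
    upper = ordered[math.ceil(location) - 1]
    return location, (lower + upper) / 2


# Hypothetical scores, for illustration only.
print(median_by_location([70, 48, 104, 64]))
```

When N is odd the location is a whole number, both lookups hit the same score, and the average is just that score.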
The Hinge Location
• = (Median Location + 1)/2.
• Drop any decimal on the median location first.
• For our data, hinge location = (48 + 1)/2 =
24.5.
• Now, the upper hinge is the 24.5th score
from the upper end of the distribution.
There are 24 scores from 80 up to 104, so the 24th score from
the highest is an 80. Go in toward the median one more score.
The 25th score from the highest is a 78. The upper hinge is
(80 + 78)/2 = 79.
The 26th score from the lowest score is a 64. Move
towards the lower tail by one score and you see the 25th
score is also a 64. One more, the 24th score is also a 64.
The lower hinge is 64.
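Here is a minimal Python sketch of the hinge procedure, assuming the rule above (drop the decimal on the median location, then count in from each end). The names `hinge_location` and `hinges` and the sample scores are hypothetical.

```python
import math


def hinge_location(n):
    """(Median location with any decimal dropped, plus 1) / 2."""
    median_location = (n + 1) / 2
    return (math.floor(median_location) + 1) / 2


def hinges(scores):
    """Return (lower hinge, upper hinge) by counting in from each tail."""
    ordered = sorted(scores)
    n = len(ordered)
    location = hinge_location(n)
    near, far = math.floor(location), math.ceil(location)  # positions from 1
    # Lower hinge: count in from the lowest score.
    lower = (ordered[near - 1] + ordered[far - 1]) / 2
    # Upper hinge: count in from the highest score (the kth from the top is ordered[n - k]).
    upper = (ordered[n - near] + ordered[n - far]) / 2
    return lower, upper


print(hinge_location(96))         # 24.5, as in the pulse rate example
print(hinges([1, 2, 3, 4, 5, 6, 7, 8]))  # hypothetical scores
```

Note that hinges found this way need not match the quartiles reported by a statistics package, since packages use a variety of quartile definitions.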
The H-Spread
• = the difference between the upper hinge
and the lower hinge. For our data, 79 - 64
= 15.
• This is the range of the middle 50% of the
scores.
• You also know this as the interquartile
range.
The Inner Fences
• The upper inner fence = the upper hinge
plus 1.5 H-spreads. For our data, 79 +
1.5(15) = 101.5.
• The lower inner fence is the lower hinge
minus 1.5 H-spreads, 64 - 1.5(15) = 41.5.
• These are invisible fences; they are not
plotted.
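The arithmetic is easy to check in Python. This sketch plugs in the hinges from our data (64 and 79); the function name `inner_fences` is just a label chosen for the example.

```python
def inner_fences(lower_hinge, upper_hinge):
    """Return (H-spread, lower inner fence, upper inner fence)."""
    h_spread = upper_hinge - lower_hinge
    lower_fence = lower_hinge - 1.5 * h_spread
    upper_fence = upper_hinge + 1.5 * h_spread
    return h_spread, lower_fence, upper_fence


# Hinges taken from the pulse rate example.
print(inner_fences(64, 79))  # (15, 41.5, 101.5)
```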
Adjacent Values
• These are scores that are outside of the
middle 50% of the scores but within the
inner fences.
• For our data, these will be scores that fall
– between 79 and 101.5 or
– between 41.5 and 64
Outliers
• These are scores that are beyond the
inner fences.
• For our data, these are scores that are
– Less than 41.5 or
– Greater than 101.5
Outer Fences
• These invisible fences are 3 H-spreads
beyond the hinges.
• For our data the lower outer fence is at
64 - 3(15) = 19
• and the upper outer fence is at 79 + 3(15)
= 124
• Scores that are beyond the outer fences
are called way-outliers.
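To tie the two sets of fences together, here is a small Python sketch that labels a score as an outlier or a way-outlier. The helper names `fences` and `classify` are made up, the hinges are the 64 and 79 from our data, and the score of 130 is invented to show a way-outlier.

```python
def fences(lower_hinge, upper_hinge, multiplier):
    """Fences placed `multiplier` H-spreads beyond each hinge."""
    h_spread = upper_hinge - lower_hinge
    return lower_hinge - multiplier * h_spread, upper_hinge + multiplier * h_spread


def classify(score, lower_hinge, upper_hinge):
    """Label a score relative to the inner (1.5) and outer (3) fences."""
    inner_low, inner_high = fences(lower_hinge, upper_hinge, 1.5)
    outer_low, outer_high = fences(lower_hinge, upper_hinge, 3)
    if score < outer_low or score > outer_high:
        return "way-outlier"
    if score < inner_low or score > inner_high:
        return "outlier"
    return "not an outlier"


print(fences(64, 79, 3))  # outer fences for hinges 64 and 79
for score in (48, 104, 130):  # 130 is a hypothetical score
    print(score, classify(score, 64, 79))
```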
Drawing the Plot
• Prepare a numerical scale.
• Draw a box that extends from the lower
hinge to the upper hinge.
• Draw a line through the box at the median.
• May also insert a symbol at the mean.
• Draw whiskers out to the most extreme
adjacent values on each side.
Whiskers
• For our data, the lowest adjacent value is
the 48, so we draw the whiskers on the
lower end out to 48.
• We do not go all the way out to the inner
fence unless there is a score there.
• The highest adjacent value is a 99, so we
draw whiskers on the upper end out to 99.
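A short Python sketch of the whisker rule follows. The function name `whisker_ends` and the ten scores are hypothetical (the full pulse rate data are not reproduced here), and the hinges of 64 and 79 are supplied directly rather than computed from the sample.

```python
def whisker_ends(scores, lower_hinge, upper_hinge):
    """Return the most extreme scores that are still inside the inner fences."""
    h_spread = upper_hinge - lower_hinge
    lower_fence = lower_hinge - 1.5 * h_spread
    upper_fence = upper_hinge + 1.5 * h_spread
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    # Whiskers stop at actual scores, not at the fences themselves.
    return min(inside), max(inside)


# Hypothetical sample that includes the extreme scores mentioned above.
sample = [48, 56, 64, 64, 70, 70, 78, 80, 99, 104]
print(whisker_ends(sample, 64, 79))  # (48, 99); the 104 lies beyond 101.5
```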
Outliers
• Every outlier is plotted with a special
symbol, often an O for a regular outlier and
an * for a way-outlier.
• Some programs will also print the
identification number next to every outlier.
• These days, we use statistical software to
make these displays and plots rather than
doing them by hand.
Plots Produced by SAS
How tall, in inches, is your ideal mate?
Eight-Foot-Tall Mate!
• That is a WAY-OUTLIER for sure!
• Investigation of the original data sheets
revealed that the actual response was 69
inches, not 96 inches.
Exploratory Data Analysis (EDA)
• It is highly recommended that you read the
document linked above.
• It includes additional examples and a bit of
silliness that might help you remember key
concepts.
• Do watch the video clip of the Id
attempting to cross an outer fence on the
Forbidden Planet.