Lecture Power Point Slides

Describing Distributions Numerically
Definitions
 Mean: the simple average you learned how to do in 6th
grade
 Median: using a list arranged in order from low to
high, the middle number
 Mode: the value that occurs most often
 Midrange: the average of the minimum and maximum
values
 Very sensitive to skewed distributions and outliers.
Mode
 The most frequent value in the data set
 Ex: Here is a list of the number of years each
employee has worked at a company. What is the mode?
1 2 2 4 4 4 5 5 5 5 5 5 5 7 10 11
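As a quick check of the example above, here is a minimal Python sketch that counts how often each value occurs. The variable names are illustrative only.

```python
from collections import Counter

# Years worked, copied from the example list above.
years = [1, 2, 2, 4, 4, 4, 5, 5, 5, 5, 5, 5, 5, 7, 10, 11]

# Counter maps each value to the number of times it occurs.
counts = Counter(years)

# most_common(1) returns the single most frequent (value, count) pair.
mode_value, mode_count = counts.most_common(1)[0]

print("mode:", mode_value, "occurs", mode_count, "times")
```

If two values tie for most frequent, this sketch reports only one of them, so check the counts when the data might have more than one mode.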
Median: The Middle of Everything
 The median is the value with exactly half the data
values below it and half above it.
 With an odd number of values, it is the middle value
of the sorted list
 With an even number of values, it is the average of
the two middle values
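The odd and even cases can be sketched in Python as follows. The function name `median` and the two small data lists are made up for illustration.

```python
def median(values):
    """Return the middle value of the sorted data (average of two middles if even)."""
    ordered = sorted(values)          # the median requires sorted data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # odd count: the single middle value
    # even count: average the two values in the middle
    return (ordered[mid - 1] + ordered[mid]) / 2

# Hypothetical data, not from the slides.
odd_example = median([7, 1, 4])        # sorted: 1, 4, 7
even_example = median([7, 1, 4, 10])   # sorted: 1, 4, 7, 10

print(odd_example, even_example)
```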
Mean
 To find the mean add up all the values and divide by the
total number of cases
\[ \bar{y} = \frac{\text{Total}}{n} = \frac{\sum y}{n} \]
 Ex: The numbers of boxes sold each month from January through
June are: 249, 337, 163, 289, 298, and 104. How many boxes
were sold per month on average?
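The example can be worked directly from the definition: add the values, then divide by how many there are. A minimal Python sketch, with illustrative variable names:

```python
# Monthly box sales, January through June, from the example above.
boxes = [249, 337, 163, 289, 298, 104]

total = sum(boxes)      # add up all the values
n = len(boxes)          # number of cases
mean = total / n        # divide the total by the number of cases

print("total:", total, "n:", n, "mean:", mean)
```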
Mean or Median?
 In symmetric distributions, the mean and median are
approximately the same in value, so either measure of
center may be used.
 For skewed data, though, it’s better to report the
median than the mean as a measure of center.
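To see why the median is preferred for skewed data, the sketch below compares two small data sets that differ only in their largest value. Both lists are hypothetical and chosen just to make the effect obvious.

```python
import statistics

# Hypothetical data: identical except for the largest value.
symmetric = [2, 4, 6, 8, 10]
skewed = [2, 4, 6, 8, 100]   # one very large value creates a long right tail

sym_mean = statistics.mean(symmetric)
sym_median = statistics.median(symmetric)
skew_mean = statistics.mean(skewed)
skew_median = statistics.median(skewed)

print("symmetric: mean", sym_mean, "median", sym_median)
print("skewed:    mean", skew_mean, "median", skew_median)
```

In the symmetric list the two measures agree. In the skewed list the mean is pulled toward the large value while the median does not move.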
Data Set: A list of ages of a group of 16 retired people:
52, 58, 63, 64, 65, 66, 67, 67, 68, 69, 69, 71, 71, 71, 71, 72
Mean: 1064/16 = 66.5
Median: (67 + 68)/2 = 67.5
Mode: 71
Range
 Always report a measure of spread along with a measure of
center when describing a distribution numerically.
 The range of the data is the difference between the
maximum and minimum values:
Range = max – min
 A disadvantage of the range is that a single extreme value
can make it very large and, thus, not representative of the
data overall.
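The sensitivity of the range to a single extreme value can be sketched in Python. The first list reuses the monthly box sales from the mean example. The extra value of 2000 is hypothetical and added only to show the effect.

```python
# Monthly box sales from the earlier mean example.
boxes = [249, 337, 163, 289, 298, 104]

def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

original_range = data_range(boxes)

# Hypothetical: one unusually large month is added to the data.
with_extreme = boxes + [2000]
extreme_range = data_range(with_extreme)

print("range:", original_range, "range with extreme value:", extreme_range)
```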
Spread: The Interquartile Range
 The interquartile range (IQR) lets us ignore extreme
data values and concentrate on the middle of the data.
 To find the IQR, we first need to know what quartiles
are…
Spread: The Interquartile Range (cont.)
 Quartiles divide the data into four equal sections.
 The lower quartile is the point that one quarter of the data
lies below
 The upper quartile is the point that one quarter of the data
lies above
 The difference between the quartiles is the IQR, so
IQR = upper quartile – lower quartile
Spread: The Interquartile Range (cont.)
 The lower and upper quartiles are the 25th and 75th
percentiles of the data, so…
 The IQR (purple) contains
50% of the values of the
distribution
The Five-Number Summary
 The five-number summary
of a distribution reports its
median, quartiles, and
extremes (maximum and
minimum).
 Example: The five-number summary for the ages at death
of rock concert-goers who died from being crushed is:
Max      47 years
Q3       22
Median   19
Q1       17
Min      13
Finding the IQR by hand
 Sort the values from smallest to largest
 Split the data into 2 halves at the median
 When N is odd, include the median in both halves
 Q1: the median of the lower half of the data
 Find the median of just the lower half of the data
 Q3: the median of the upper half of the data
 Find the median of just the upper half of the data
 Remember:
 Lower Quartile = 25th percentile = Q1
 Median = 50th percentile = Q2
 Upper Quartile = 75th percentile = Q3
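The by-hand steps above can be sketched in Python. This follows the rule stated on this slide (when N is odd, the median is included in both halves). Other textbooks and software use slightly different quartile rules, so results may differ elsewhere. The function names and the small data list are illustrative only.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the split-at-the-median rule."""
    ordered = sorted(values)                 # step 1: sort
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[:half + 1]           # odd N: median goes in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]               # even N: clean split
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

# Hypothetical data, not from the slides.
q1, q2, q3 = quartiles([9, 1, 5, 3, 7])      # sorted: 1, 3, 5, 7, 9
iqr = q3 - q1

print("Q1:", q1, "median:", q2, "Q3:", q3, "IQR:", iqr)
```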
Example Chapter 5 #4
 Here are the annual numbers of deaths from tornados in
the United States from 1990 through 2000:
53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40
Find the mean, median, quartiles, range, and IQR.
Sorted: 25, 30, 33, 39, 39, 40, 53, 67, 69, 94, 130
Mean: (53 + 39 + … + 40)/11 = 619/11 ≈ 56.3
Median: 40
Q1 (median of lower half, including the median): (33 + 39)/2 = 36
Q3 (median of upper half, including the median): (67 + 69)/2 = 68
Range: 130 – 25 = 105
IQR (Q3 – Q1): 68 – 36 = 32
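These answers can be cross-checked with Python's standard `statistics` module. For this particular data set, `statistics.quantiles` with `method="inclusive"` gives the same quartiles as the by-hand rule. That agreement is not guaranteed for every data set, since quartile conventions vary.

```python
import statistics

# Annual tornado deaths, 1990 through 2000, from the example above.
deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]

mean = statistics.mean(deaths)
median = statistics.median(deaths)

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(deaths, n=4, method="inclusive")

data_range = max(deaths) - min(deaths)
iqr = q3 - q1

print("mean:", round(mean, 1))
print("median:", median)
print("Q1:", q1, "Q3:", q3)
print("range:", data_range, "IQR:", iqr)
```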
Boxplots
 A boxplot is a graphical display of the five-number
summary.
 Boxplots are particularly useful when comparing
groups.
Constructing Boxplots
 The Crowdsafe Database lists the ages, names, causes, and
locations of these unfortunate concert-goers. During 1999
and 2000, 66 people died from “crowd crush.” Here is a
five-number summary of their ages:
Max      47 years
Q3       22
Median   19
Q1       17
Min      13
Constructing Boxplots
Draw a single vertical axis
spanning the range of
the data (10 to 50).
Draw short horizontal lines
at the lower (17) and
upper quartiles (22) and
at the median (19).
Then connect them with
vertical lines to form a
box.
Constructing Boxplots (cont.)
Erect “fences” around the main part of the data.
The IQR is 22 – 17 = 5.
The upper fence is 1.5 IQRs above the upper quartile:
22 + 1.5(5) = 29.5
The lower fence is 1.5 IQRs below the
lower quartile: 17 – 1.5(5) = 9.5
Note: the fences only help with
constructing the boxplot and should
not appear in the final display.
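The fence arithmetic can be written out as a short Python sketch, using the quartiles from the concert five-number summary. Variable names are illustrative.

```python
# Quartiles from the five-number summary of ages.
q1 = 17
q3 = 22

iqr = q3 - q1                    # interquartile range

# Fences sit 1.5 IQRs beyond each quartile.
upper_fence = q3 + 1.5 * iqr
lower_fence = q1 - 1.5 * iqr

print("IQR:", iqr, "lower fence:", lower_fence, "upper fence:", upper_fence)
```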
Constructing Boxplots (cont.)
Use the fences to grow “whiskers.”
Draw lines from the ends of the
box up and down to the most
extreme data values found
within the fences.
We don’t have all the data
points, so let’s say the whiskers
end at 28 and 13.
If a data value falls outside one
of the fences, we do not
connect it with a whisker.
Constructing Boxplots (cont.)
Add the outliers by displaying any
data values beyond the fences
with special symbols.
Our outliers are 37 and 47
We often use a different symbol
for “far outliers” that are
farther than 3 IQRs from the
quartiles (3 x 5).
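The outlier rules can be sketched as a small classifier. The full list of 66 ages is not given on the slides, so the sketch only checks the four ages that are mentioned (13, 28, 37, and 47). The function name `classify` is made up for illustration, and "farther than 3 IQRs" is read strictly, as an assumption.

```python
# Quartiles from the five-number summary of ages.
q1 = 17
q3 = 22
iqr = q3 - q1

def classify(age):
    """Label a value by how far it lies outside the box."""
    # Distance beyond the nearer quartile (0 if the value is inside the box).
    distance = max(q1 - age, age - q3, 0)
    if distance > 3 * iqr:
        return "far outlier"
    if distance > 1.5 * iqr:
        return "outlier"
    return "within fences"

# Only the ages named on the slides; the rest of the data is not available.
labels = {age: classify(age) for age in [13, 28, 37, 47]}

for age, label in labels.items():
    print(age, label)
```

Under this reading, 37 sits exactly 3 IQRs above the upper quartile, so it is an ordinary outlier, while 47 is a far outlier.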
Constructing Boxplots (cont.)
 Compare the histogram and boxplot for rock concert
deaths:
Comparing Groups With Boxplots
 The following set of boxplots compares the effectiveness of
various coffee containers:
 What does this graphical display tell you?
Summarizing Symmetric Distributions
 The distribution of pulse rates for 52 adults is generally
symmetric, with a mean of 72.7 beats per minute
(bpm) and a median of 73 bpm:
What About Spread? The Standard Deviation
 A more powerful measure of spread than the IQR is
the standard deviation, which takes into account how
far each data value is from the mean.
 A deviation is the distance that a data value is from the
mean.
 Since adding all deviations together would total zero, we
square each deviation and find an average of sorts for
the deviations.
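A short Python sketch shows why the deviations are squared first: the raw deviations always cancel out. The three data values are hypothetical and chosen so the arithmetic is easy to follow.

```python
# Hypothetical data, not from the slides.
values = [2, 4, 9]

mean = sum(values) / len(values)              # 15 / 3 = 5

# A deviation is the distance of each value from the mean (with sign).
deviations = [y - mean for y in values]       # -3, -1, 4

# Positive and negative deviations cancel, so their sum is zero.
sum_of_deviations = sum(deviations)

# Squaring makes every term non-negative, so nothing cancels.
squared = [d ** 2 for d in deviations]        # 9, 1, 16
sum_of_squares = sum(squared)

print("sum of deviations:", sum_of_deviations)
print("sum of squared deviations:", sum_of_squares)
```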
What About Spread? The Standard Deviation (cont.)
 The variance, notated by $s^2$, is found by summing the
squared deviations and (almost) averaging them:
\[ s^2 = \frac{\sum (y - \bar{y})^2}{n - 1} \]
 The variance will play a role later in our study, but it is
problematic as a measure of spread—it is measured in
squared units!
What About Spread? The Standard Deviation (cont.)
 The standard deviation, s, is just the square root of the
variance and is measured in the same units as the
original data.
\[ s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}} \]
Standard Deviation Ex. (81)
 Suppose we are given the values: 4, 3, 10, 12, 8, 9, and 3
 The mean, $\bar{y} = 7$
Original Value   Deviation       Squared Deviation
4                4 – 7 = -3      (-3)^2 = 9
3                3 – 7 = -4      (-4)^2 = 16
10               10 – 7 = 3      3^2 = 9
12               12 – 7 = 5      5^2 = 25
8                8 – 7 = 1       1^2 = 1
9                9 – 7 = 2       2^2 = 4
3                3 – 7 = -4      (-4)^2 = 16
Standard Deviation Example (cont.)
 Add up the squared deviations:
∑: 9 + 16 + 9 + 25 + 1 + 4 + 16 = 80
 Divide by n – 1: 80/6 = 13.33
 Finally take the square root: √13.33 = 3.65
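The same three steps can be followed in Python and compared against the standard library's `statistics.stdev`, which also divides by n – 1. Variable names are illustrative.

```python
import math
import statistics

# Values from the worked example.
values = [4, 3, 10, 12, 8, 9, 3]

n = len(values)
mean = sum(values) / n                                  # 49 / 7 = 7

# Step 1: square each deviation from the mean and add them up.
squared_deviations = [(y - mean) ** 2 for y in values]
sum_of_squares = sum(squared_deviations)

# Step 2: divide by n - 1 to get the variance.
variance = sum_of_squares / (n - 1)

# Step 3: take the square root to get the standard deviation.
std_dev = math.sqrt(variance)

print("sum of squares:", sum_of_squares)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 2))
print("statistics.stdev:", round(statistics.stdev(values), 2))
```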
Using your calculator
TI 83/84
 STAT
 1 or Enter
 Lists functions L1 to L6
 Enter your data values
 STAT
 CALC
 Enter
 1 or Enter (1-Var STATS)
 Press 2nd button and L1
 Enter
TI 89
 APPS
 6: Data Matrix Editor
 1 and Enter
 C1: list your data values
 F5 Calc: OneVar Enter
 X: type C1
 Enter
 Enter
Using your calculator (cont.)
TI 83/84 Printout
TI 89 Printout
Thinking About Variation
 Since Statistics is about variation, spread is an
important fundamental concept of Statistics.
 Measures of spread help us talk about what we don’t
know.
 When the data values are tightly clustered around the
center of the distribution, the IQR and standard
deviation will be small.
 When the data values are scattered far from the center,
the IQR and standard deviation will be large.
Shape, Center, and Spread
 When telling about a quantitative variable, always
report the shape of its distribution, along with a center
and a spread.
 If the shape is skewed, report the median and IQR.
 If the shape is symmetric, report the mean and standard
deviation and possibly the median and IQR as well.
What About Outliers?
 If there are any clear outliers and you are reporting the
mean and standard deviation, report them with the
outliers present and with the outliers removed. The
differences may be quite revealing.
 Note: The median and IQR are not likely to be affected
by the outliers.
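As an illustration, the sketch below reuses the tornado data from the Chapter 5 #4 example. The value 130 lies above that data set's upper fence (68 + 1.5 × 32 = 116), so it is treated here as the outlier. The function name `summarize` is made up for this sketch.

```python
import statistics

# Annual tornado deaths, 1990 through 2000, from the earlier example.
deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]

# Same data with the one value beyond the upper fence removed.
without_outlier = [d for d in deaths if d != 130]

def summarize(data):
    """Collect the summaries that are being compared."""
    return {
        "mean": statistics.mean(data),
        "sd": statistics.stdev(data),
        "median": statistics.median(data),
    }

with_stats = summarize(deaths)
without_stats = summarize(without_outlier)

for name in ("mean", "sd", "median"):
    print(name, round(with_stats[name], 1), "->", round(without_stats[name], 1))
```

The mean and standard deviation both drop noticeably once 130 is removed, while the median barely changes.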
What Can Go Wrong?
 Don’t forget to do a reality check—don’t let technology do
your thinking for you.
 Sort the values before finding the median or percentiles.
 Don’t compute numerical summaries of a categorical
variable.
 Beware of outliers
 Make a picture (make a picture and make a picture)
What Can Go Wrong? (cont.)
 Be careful when
comparing groups that
have very different
spreads.
 Consider these side-
by-side boxplots of
cotinine levels:
*Re-expressing to Equalize the
Spread of Groups
 Here are the side-by-
side boxplots of the log
(cotinine) values:
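The effect of taking logs can be sketched with made-up numbers. The two groups below are hypothetical, not the cotinine data. They are chosen so that one group is 100 times the other, which makes the spreads very different on the raw scale.

```python
import math

# Hypothetical groups: group_b is group_a multiplied by 100.
group_a = [1, 2, 4, 8]
group_b = [100, 200, 400, 800]

def data_range(values):
    """Range = max - min, used here as a simple measure of spread."""
    return max(values) - min(values)

raw_range_a = data_range(group_a)
raw_range_b = data_range(group_b)

# Re-express each value as its base-10 logarithm.
log_a = [math.log10(v) for v in group_a]
log_b = [math.log10(v) for v in group_b]

log_range_a = data_range(log_a)
log_range_b = data_range(log_b)

print("raw ranges:", raw_range_a, raw_range_b)
print("log ranges:", round(log_range_a, 3), round(log_range_b, 3))
```

On the raw scale the second group is 100 times as spread out. On the log scale the two spreads match, which makes side-by-side comparison much easier.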
For Next week
 Homework Chapter 5: 3, 7, 9, 12, 27 a-c, 28
 Extra Credit is due (5 points)
 Quiz 1: Closed Notes:
 Know the difference between categorical and
quantitative variables
 Calculate the mean, median, and mode
 Find percents from a contingency table
 Sketch a stem and leaf plot