Chapter 1 DISTRIBUTIONS

When describing a distribution, one should, at a minimum, describe the center, spread, shape and
outliers. So far we have done this with words. Now it is time to introduce numbers to aid in
the description.
The center of a distribution can be described by its mean or median.
The mean or average of a set of observations is found by adding their values and dividing by
the number of observations:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
OR
$$\bar{x} = \frac{1}{n} \sum x_i$$
Barry Bonds Home Runs for 1986-2001
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73
Calculate mean using STAT/CALC/1-Var Stats
The 73 homeruns may be an outlier. Change the 73 to something more consistent with the
other values (e.g., 35). Now recalculate the mean. What happened?
The mean is NOT a resistant measure because it is affected by extreme measurements
(outliers). In fact, the mean is drawn toward them.
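The calculator exercise above can also be checked with a few lines of Python. This is only a sketch; the variable names are illustrative, and the replacement value 35 is the one suggested in the text.

```python
from statistics import mean

# Barry Bonds home runs per season, as listed above
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]

# Same data with the suspected outlier 73 replaced by 35
bonds_adjusted = [35 if x == 73 else x for x in bonds]

mean_with_73 = mean(bonds)            # sum is 567, divided by 16
mean_with_35 = mean(bonds_adjusted)   # sum is 529, divided by 16

print(mean_with_73)   # 35.4375
print(mean_with_35)   # 33.0625
```

Changing a single observation moves the mean by more than two home runs, which is what "not resistant" means in practice.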
The median M is the midpoint of a distribution. It is a number such that half the observations
are smaller and the other half are larger. To find the median
1. Arrange all observations in order of size, from smallest to largest
2. If the number of observations n is odd, the median M is the center observation in the
ordered list
3. If the number of observations n is even, the median M is the mean of the two center
observations in the ordered list
1 2 3 4 5 6
0 1 2 3 4 5 6
Now change the 6 to 100. Does this change the median?
The median is a resistant measure because it is not affected by extreme observations.
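The three-step rule for the median can be sketched in Python as follows. The function name is illustrative, and the two small data sets are read as 1 through 6 and 0 through 6.

```python
def median(values):
    """Median by the rule above: sort, then take the center observation
    (odd n) or the mean of the two center observations (even n)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

even_case = median([1, 2, 3, 4, 5, 6])        # mean of 3 and 4
odd_case = median([0, 1, 2, 3, 4, 5, 6])      # center observation
even_changed = median([1, 2, 3, 4, 5, 100])   # 6 replaced by 100
odd_changed = median([0, 1, 2, 3, 4, 5, 100])

print(even_case, even_changed)   # 3.5 3.5
print(odd_case, odd_changed)     # 3 3
```

Replacing 6 with 100 leaves the median unchanged in both data sets, because only the position of the center observations matters.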
The closer a distribution is to a symmetrical shape, the closer together the values of the
median and the mean are. The mean and median are identical for perfectly symmetrical
distributions.
Since the median is a resistant measure and the mean is not, the mean will be drawn toward
extreme observations. For distributions that are skewed to the left, the mean will be less than
the median. For distributions that are skewed to the right, the mean will be greater than the
median.
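A small illustration of how skewness pulls the mean away from the median. The three data sets below are made up for this sketch and do not come from the chapter.

```python
from statistics import mean, median

# Hypothetical data sets, invented only to illustrate the three shapes
symmetric = [1, 2, 3, 4, 5]
skewed_right = [1, 2, 3, 4, 20]    # long tail toward large values
skewed_left = [-14, 2, 3, 4, 5]    # long tail toward small values

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    print(name, "mean =", mean(data), "median =", median(data))
```

All three sets have median 3. Only the mean moves, toward whichever side has the long tail.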
The spread of a distribution can be described by the Interquartile Range (IQR) or the standard
deviation (s)
The Quartiles Q1 and Q3 are calculated as follows
1. Arrange the observations in increasing order and locate the median M in the ordered list
of observations
2. The first quartile Q1 is the median of the observations whose position in the ordered list
is to the left of the location of the overall median
3. The third quartile Q3 is the median of the observations whose position in the ordered
list is to the right of the location of the overall median
The INTERQUARTILE RANGE (IQR) is the distance between the first and third quartile and is
calculated as Q3 – Q1
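The middle 50% of the observations lie between Q1 and Q3. The IQR can be used to identify
outliers. By definition, an outlier is an observation that is
Greater than Q3 + 1.5(IQR) or
Less than Q1 – 1.5(IQR)
Barry Bonds home runs in increasing order, with the quartiles and median marked:
16 19 24 25 | 25 33 33 34 | 34 37 37 40 | 42 46 49 73
Q1 = 25 (between the two 25s), M = 34 (between the two 34s), Q3 = 41 (between 40 and 42)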
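The quartile rule and the 1.5(IQR) outlier criterion can be sketched in Python as below. The function names are illustrative, and the sketch assumes at least two observations. Library quantile routines often use different conventions and may give slightly different quartiles, so the rule from the text is coded by hand.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1 and Q3 as the medians of the observations to the left and to the
    right of the location of the overall median (assumes n >= 2)."""
    ordered = sorted(values)
    half = len(ordered) // 2      # the middle observation is left out when n is odd
    return median(ordered[:half]), median(ordered[-half:])

bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]

q1, q3 = quartiles(bonds)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in bonds if x < low_fence or x > high_fence]

print(q1, q3, iqr)             # 25.0 41.0 16.0
print(low_fence, high_fence)   # 1.0 65.0
print(outliers)                # [73]
```

By this criterion the 73 home run season is an outlier, since 73 is greater than 65.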
A quick summary/description of the center and spread of a distribution can be given by the
five-number summary. The five-number summary is
Minimum   Q1   M   Q3   Maximum
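As a sketch, the five-number summary of the Barry Bonds data can be assembled as follows. The helper names are illustrative, and the quartiles follow the rule stated earlier in the chapter (medians of the lower and upper halves of the ordered list).

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (Minimum, Q1, M, Q3, Maximum); assumes at least two observations."""
    ordered = sorted(values)
    half = len(ordered) // 2
    q1 = median(ordered[:half])     # median of the lower half
    q3 = median(ordered[-half:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]
summary = five_number_summary(bonds)
print(summary)   # (16, 25.0, 34.0, 41.0, 73)
```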
The five-number summary is shown graphically in a box-and-whisker plot, more typically
referred to simply as a boxplot.
The boxplot can be found under 2nd Y=/plot#/ Icon 5
Side-by-side boxplots comparing the number of homeruns per year by Barry Bonds and Hank
Aaron
A modified boxplot is a graph of the five-number summary, with outliers identified using the
IQR. In a modified boxplot
- The central box still spans the quartiles
- A line in the box still identifies the median
- Observations more than 1.5(IQR) outside the box are plotted individually
- The lines now extend from the box out to the smallest and largest observations THAT ARE NOT OUTLIERS
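To see where the lines of a modified boxplot end, the sketch below works out the pieces for the Barry Bonds data. The quartiles Q1 = 25 and Q3 = 41 are entered directly; they come from applying the quartile rule described earlier to these 16 observations. Variable names are illustrative.

```python
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]

# Quartiles for these data, from the rule given earlier (entered by hand here)
q1, q3 = 25, 41
iqr = q3 - q1

# Observations more than 1.5(IQR) outside the box are plotted individually
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [x for x in bonds if x < low_fence or x > high_fence]

# The lines extend to the smallest and largest observations that are not outliers
not_outliers = [x for x in bonds if low_fence <= x <= high_fence]
line_low, line_high = min(not_outliers), max(not_outliers)

print(outliers)              # [73]
print(line_low, line_high)   # 16 49
```

In a regular boxplot the upper line would run all the way to 73. In the modified boxplot it stops at 49 and 73 is drawn as a separate point.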
To obtain a modified boxplot go to ICON 4 under StatPlot
Regular (a) and modified (b) boxplots comparing the home run production of Barry Bonds and
Hank Aaron
When the mean is used to describe the center of a distribution, the standard deviation is a
more appropriate measure of spread than the IQR.
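The variance s^2 of a set of observations is the average of the squares of the deviations of the
observations from their mean (with the sum divided by n - 1 rather than n). In symbols, the
variance of n observations becomes
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
OR
$$s^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
The Standard Deviation, s, becomes:
$$s = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$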
s measures spread about the mean and should be used only when the mean is chosen as the
measure of center.
s = 0 only when there is no spread, that is, when all observations have the same value.
Otherwise, s > 0. As observations become more spread out about their mean, s gets larger.
s, like the mean x-bar, is not resistant. Strong skewness or a few outliers can make s very
large.
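A minimal sketch of the variance and standard deviation calculations, assuming the usual sample convention of dividing the sum of squared deviations by n - 1 (the same convention used by Python's statistics.stdev). Function names are illustrative, and the Barry Bonds data from earlier in the chapter are reused to show that s is not resistant.

```python
from math import sqrt

def variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def std_dev(values):
    return sqrt(variance(values))

bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42, 40, 37, 34, 49, 73]
bonds_adjusted = [35 if x == 73 else x for x in bonds]   # 73 replaced by 35

s_with_73 = std_dev(bonds)
s_with_35 = std_dev(bonds_adjusted)
s_no_spread = std_dev([5, 5, 5, 5])   # all observations equal

print(round(s_with_73, 2), round(s_with_35, 2), s_no_spread)
```

Replacing the single value 73 with 35 reduces s from roughly 13.6 to roughly 9.3, and s is exactly 0 when every observation is the same.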
New York Yankee Roger Maris held the single-season home run record from 1961 until 1998.
Here are his home run counts for his 10 years in the American League.
15 28 16 39 61 33 23 26 8 13
Describe the distribution
Compare the following two data sets. The first is the number of cesarean sections performed
by 15 male doctors in Switzerland during one year. The second is the number of cesarean
sections performed by 10 female doctors in Switzerland during the same year.
Male Doctors
27 50 33 25 86 25 85 31 37 44 20 36 59 34 28
Female Doctors
5 7 10 14 18 19 25 29 31 33
Back-to-back stemplot of the number of cesarean sections performed by male and female
Swiss doctors
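The stemplot figure itself is not reproduced here. As a sketch of how such a plot could be built, the Python below uses tens digits as stems and ones digits as leaves. Placing the male doctors on the left and the female doctors on the right is an assumption, as are the function names; the code expects nonnegative whole numbers.

```python
from collections import defaultdict

male = [27, 50, 33, 25, 86, 25, 85, 31, 37, 44, 20, 36, 59, 34, 28]
female = [5, 7, 10, 14, 18, 19, 25, 29, 31, 33]

def leaves_by_stem(values):
    """Group the ones digits (leaves) under their tens digits (stems)."""
    groups = defaultdict(list)
    for v in sorted(values):
        groups[v // 10].append(v % 10)
    return groups

def back_to_back_stemplot(left, right):
    """Return text lines of the form: left leaves | stem | right leaves.
    Left leaves are reversed so that they increase outward from the stem."""
    left_groups = leaves_by_stem(left)
    right_groups = leaves_by_stem(right)
    combined = left + right
    width = max(len(g) for g in left_groups.values())
    lines = []
    for stem in range(min(combined) // 10, max(combined) // 10 + 1):
        left_leaves = "".join(str(d) for d in reversed(left_groups.get(stem, [])))
        right_leaves = "".join(str(d) for d in right_groups.get(stem, []))
        lines.append(f"{left_leaves:>{width}} | {stem} | {right_leaves}".rstrip())
    return lines

plot = back_to_back_stemplot(male, female)
print("\n".join(plot))
```

Read a row such as `87550 | 2 | 59` as male counts 20, 25, 25, 27, 28 and female counts 25, 29. Stems with no leaves (6 and 7) are kept so that the gap before the two large male values 85 and 86 stays visible.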
Which AP Exam is Easier: Calculus AB or Statistics? The table below gives the distribution of
grades earned by students taking the Calculus AB and Statistics AP exams in 2000.
Grade   CalcAB   Stat
5       16.8%     9.8%
4       23.2%    21.5%
3       23.5%    22.4%
2       19.6%    20.5%
1       16.8%    25.8%
The two distributions are roughly similar for grades 2, 3, and 4. A smaller proportion of
Statistics students received a grade of 5 (9.8% versus 16.8%), and a larger proportion received
a grade of 1 (25.8% versus 16.8%). This suggests that the Statistics exam is harder. At the very
least, it indicates that students who take the Statistics exam get poorer grades than students
who take the Calculus exam.
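The grade-by-grade comparison can be checked with a short sketch. The percentages are copied from the table; the variable names are illustrative.

```python
# Percent of students earning each grade in 2000, from the table
calc_ab = {5: 16.8, 4: 23.2, 3: 23.5, 2: 19.6, 1: 16.8}
stat = {5: 9.8, 4: 21.5, 3: 22.4, 2: 20.5, 1: 25.8}

# Statistics minus Calculus AB, in percentage points, for each grade
difference = {g: round(stat[g] - calc_ab[g], 1) for g in calc_ab}

for grade in sorted(difference, reverse=True):
    print(grade, calc_ab[grade], stat[grade], difference[grade])

# The two grades where the exams differ most
largest_gaps = sorted(difference, key=lambda g: abs(difference[g]), reverse=True)[:2]
print(largest_gaps)   # [1, 5]
```

The differences for grades 2, 3 and 4 are all under 2 percentage points. The large gaps are at grade 1 (9.0 points more for Statistics) and grade 5 (7.0 points fewer for Statistics).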