Chapter 3.1 Measures of Central Tendency

Objective A : Mean, Median and Mode
The three measures of central tendency are the mean, the median, and the mode.
A1. Mean
The mean of a variable is the sum of all data values divided by the number of observations.
Population mean: $\mu = \frac{\sum x_i}{N}$, where $x_i$ is each data value and $N$ is the population size (the number of
observations in the population).
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$, where $x_i$ is each data value and $n$ is the sample size (the number of
observations in the sample).
Example 1: Population: 12 16 23 17 32 27 14 16
Compute the population mean and the sample mean from a simple random sample of size 4.
Is the sample mean equal to the population mean? Does the population mean or the sample
mean stay the same? Explain.
(a) Population mean: (Round the mean to one more decimal place than that in the raw
data)
(b) Sample mean:
From a lottery method, 23 16 14 17 were selected.
(c) Is the sample mean equal to the population mean?
(d) Does the population mean or sample mean stay the same? Explain.
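The arithmetic in Example 1 can be checked with a few lines of Python. This is a minimal sketch, not part of the course materials: the helper name `mean` is illustrative, and the sample is the lottery draw listed in part (b).

```python
# Population from Example 1 and the lottery sample listed in part (b).
population = [12, 16, 23, 17, 32, 27, 14, 16]
sample = [23, 16, 14, 17]

def mean(values):
    """Sum of all data values divided by the number of observations."""
    return sum(values) / len(values)

population_mean = mean(population)  # mu: fixed for a given population
sample_mean = mean(sample)          # x-bar: depends on which sample is drawn

# Report one more decimal place than the raw data, as the example asks.
print(f"population mean = {population_mean:.1f}")
print(f"sample mean     = {sample_mean:.1f}")
```

Drawing a different sample of size 4 would generally change `sample_mean`, while `population_mean` stays the same.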
A2. Median
The median, M, is the value that lies in the middle of the data when arranged in ascending order.
If $n$ is odd, the median is the data value in the middle of the data set; the location of the median is the
$\frac{n+1}{2}$ position.
If $n$ is even, the median is the mean of the two middle observations in the data set, which lie in the
$\frac{n}{2}$ and $\frac{n}{2} + 1$ positions, respectively.
Example 1: Find the median of the data given below.
4 12 32 24 9 18 28 10 36
Example 2: Find the median of the data given below.
$35.34 $42.09 $38.72 $43.28 $39.45 $49.36 $30.15 $40.88
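The position rules for the median can be sketched in Python as follows. The function name `median` and the variable names are illustrative assumptions; the data are taken from Examples 1 and 2, with the dollar signs dropped.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # Position (n + 1) / 2 in 1-based counting is index (n - 1) // 2 in Python.
        return ordered[(n - 1) // 2]
    # Positions n/2 and n/2 + 1 (1-based) are indices n//2 - 1 and n//2.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

example_1 = [4, 12, 32, 24, 9, 18, 28, 10, 36]                        # n = 9 (odd)
example_2 = [35.34, 42.09, 38.72, 43.28, 39.45, 49.36, 30.15, 40.88]  # n = 8 (even)

print(median(example_1))
print(round(median(example_2), 3))
```

Sorting first matters: the position formulas apply to the data arranged in ascending order, not to the order in which the values were recorded.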
A3. Mode
The mode is the most frequent observation in the data set.
Example 1: Find the mode of the data given below.
76 60 81 72 60 80 68 73 80 67
Example 2: Find the mode of the data given below.
A C D C B C A B B F B W F D B W D A D C D
Example 3: The following data represent the G.P.A. of 12 students.
2.56 3.21 3.88 2.44 1.96 2.85 2.32 3.38 1.86 3.04 2.75 2.23
Find the mean, median, and mode G.P.A.
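One way to find the mode for the three examples above is sketched below. The helper `modes` is hypothetical, and it assumes the common textbook convention that a data set in which no value repeats has no mode, and that ties for the highest frequency are all reported.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest frequency, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []  # every observation occurs once, so there is no mode
    return sorted(v for v, c in counts.items() if c == top)

scores = [76, 60, 81, 72, 60, 80, 68, 73, 80, 67]                                # Example 1
grades = "A C D C B C A B B F B W F D B W D A D C D".split()                     # Example 2
gpas = [2.56, 3.21, 3.88, 2.44, 1.96, 2.85, 2.32, 3.38, 1.86, 3.04, 2.75, 2.23]  # Example 3

print(modes(scores))
print(modes(grades))
print(modes(gpas))
```

Note that the mode works for the letter grades in Example 2, where a mean or median would not make sense.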
Objective B : Relation Between the Mean, Median and Distribution Shape
- The mean is sensitive to extreme data. For continuous data, if the distribution shape is
a bell-shaped curve, the mean is a better measure of central tendency because it
includes all data values in a data set.
- The median is resistant to extreme data. For continuous data, if the distribution shape is
skewed to the right or left, the median is a better measure of central tendency.
- The mode is used to represent the measure of central tendency for qualitative data.
Mean or Median versus Skewness
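A small numerical experiment can illustrate why the mean is sensitive to extreme data while the median is resistant. The data below are hypothetical, invented only for this sketch.

```python
from statistics import mean, median

# Hypothetical data, invented for illustration only.
values = [32, 35, 38, 40, 41, 44, 47]
with_extreme = values + [400]  # add one extreme value on the right

print(f"without extreme value: mean = {mean(values):.1f}, median = {median(values)}")
print(f"with extreme value:    mean = {mean(with_extreme):.1f}, median = {median(with_extreme)}")
```

A single large value pulls the mean far to the right, in the direction of the skew, while the median moves only slightly.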
Chapter 3.2 Measures of Dispersion
A measure of dispersion is a numerical measure that quantifies the spread of data.
In this section, the three numerical measures of dispersion that we will discuss are the range, variance, and
standard deviation. In a later section, we will discuss another measure of dispersion called the interquartile
range (IQR).
Objective A : Range, Variance and Standard Deviation
A1. Range
Range = R = largest data value - smallest data value
The range is not resistant because it is affected by extreme values in the data set.
A2. Variance and Standard Deviation
Variance is based on the deviations about the mean. Since the sum of the deviations about the mean is zero, we
cannot use the average deviation about the mean as a measure of spread.
We use the average squared deviation instead.
The population variance, $\sigma^2$, of a variable is the sum of the squared deviations about the population
mean, $\mu$, divided by the number of observations in the population, $N$.
Definition Formula: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Computational Formula: $\sigma^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N}$
The sample variance, $s^2$, of a variable is the sum of the squared deviations about the sample mean, $\bar{x}$,
divided by the number of observations in the sample minus 1, $n - 1$.
Definition Formula: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Computational Formula: $s^2 = \frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}}{n - 1}$
In order to use the sample variance to obtain an unbiased estimate of the population variance, we divide
the sum of the squared deviations about the sample mean by $n - 1$. We call $n - 1$ the degrees of freedom
because the first $n - 1$ observations have freedom to be whatever value they wish, but the $n$th value has
no freedom: it must be the value that forces $\sum (x_i - \bar{x})$ to be zero.
The population standard deviation, $\sigma$, is the square root of the population variance: $\sigma = \sqrt{\sigma^2}$.
The sample standard deviation, $s$, is the square root of the sample variance: $s = \sqrt{s^2}$.
To avoid round-off error, never use the rounded value of the variance to compute the standard deviation.
Keep a few extra decimal places in intermediate calculations.
Example 1: Use the definition formula to find the population variance and standard deviation.
Population: 4, 10, 12, 13, 21
Example 2: Use the definition formula to find the sample variance and standard deviation.
Sample: 83, 65, 91, 84
Example 3: Use the computational formula to find the sample variance and standard deviation.
Sample: 83, 65, 91, 84 (same data set as Example 2)
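The definition and computational formulas can be compared side by side in Python. This is a sketch for checking hand calculations, using the data from Examples 1 to 3; the function names and the `sample` flag are illustrative choices, not notation from the course.

```python
from math import sqrt, isclose

def variance_definition(values, sample=True):
    """Sum of squared deviations about the mean, divided by n - 1 (sample) or N (population)."""
    n = len(values)
    m = sum(values) / n
    ss = sum((x - m) ** 2 for x in values)
    return ss / (n - 1 if sample else n)

def variance_computational(values, sample=True):
    """Same quantity from the sum of squares and the squared sum, without computing deviations."""
    n = len(values)
    ss = sum(x * x for x in values) - sum(values) ** 2 / n
    return ss / (n - 1 if sample else n)

population = [4, 10, 12, 13, 21]  # Example 1
sample_data = [83, 65, 91, 84]    # Examples 2 and 3

sigma_sq = variance_definition(population, sample=False)
s_sq = variance_definition(sample_data)
s_sq_alt = variance_computational(sample_data)

# Take the square root of the unrounded variance, then round only for display.
print(f"sigma^2 = {sigma_sq:.1f}, sigma = {sqrt(sigma_sq):.1f}")
print(f"s^2 = {s_sq:.1f}, s = {sqrt(s_sq):.1f}")
print("formulas agree:", isclose(s_sq, s_sq_alt))
```

The two formulas are algebraically equivalent, so any difference between `s_sq` and `s_sq_alt` is floating-point rounding only.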
Example 4: Use StatCrunch to find the sample variance and standard deviation.
Sample: 83, 65, 91, 84 (same data set as Example 2)
Step 1:
Click StatCrunch navigation button under the Course Home page →
Click StatCrunch website → Click Open StatCrunch →
Input the raw data in Var 1 column → Click Stat → Click Summary Stats → Columns
Step 2:
Click var1 under Select column(s): → Under Statistics:, choose Variance and Std. dev. (click them while
holding Ctrl key on the keyboard) → Click Compute!
Variance and standard deviation are computed.
$s^2 \approx 122.9$
$s \approx 11.1$
For more detailed instructions, please download “Q3.2.20” by clicking the StatCrunch Handout navigation
button on the course homepage.
Note : For a small data set, students are expected to calculate the standard deviation by hand.
Objective C : Empirical Rule
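The Empirical Rule applies to bell-shaped distributions: approximately 68% of the data lie within 1 standard
deviation of the mean, approximately 95% lie within 2 standard deviations, and approximately 99.7% lie
within 3 standard deviations.
[Figure: the Empirical Rule]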
Example 1:
SAT Math scores have a bell-shaped distribution with a mean of 515 and a standard
deviation of 114. (Source: College Board, 2007)
(a) What percentage of SAT scores is between 401 and 629?
(b) What percentage of SAT scores is between 287 and 743?
(c) What percentage of SAT scores is less than 401 or greater than 629?
(d) What percentage of SAT scores is between 515 and 743?
(e) About 99.7% of SAT scores will be between what scores?
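The cutoffs in this example are all whole numbers of standard deviations from the mean, which is what makes the Empirical Rule (about 68%, 95%, and 99.7% of the data within 1, 2, and 3 standard deviations for a bell-shaped distribution) applicable. The sketch below tabulates those intervals; the names `EMPIRICAL` and `interval` are illustrative.

```python
mean, sd = 515, 114  # SAT Math values from Example 1

# Approximate percentage of data within k standard deviations of the mean.
EMPIRICAL = {1: 68.0, 2: 95.0, 3: 99.7}

def interval(k):
    """Scores lying within k standard deviations of the mean."""
    return (mean - k * sd, mean + k * sd)

for k, pct in EMPIRICAL.items():
    low, high = interval(k)
    print(f"about {pct}% of scores fall between {low} and {high}")

# By symmetry, the share outside 1 standard deviation is what is left over,
# and the share between the mean and +2 standard deviations is half of 95%.
outside_1sd = 100 - EMPIRICAL[1]
mean_to_plus_2sd = EMPIRICAL[2] / 2
print(f"outside 1 sd: about {outside_1sd}%; mean to +2 sd: about {mean_to_plus_2sd}%")
```

These percentages are approximations that hold only when the distribution is roughly bell shaped.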
Objective D : Chebyshev’s Inequality
Example 1: According to the U.S. Census Bureau, the mean commute time for a
resident of Boston, Massachusetts, is 27.3 minutes. Assume that the standard
deviation of the commute time is 8.1 minutes to answer the following:
(a) What minimum percentage of commuters in Boston has a commute time
within 2 standard deviations of the mean?
(b) (i) What minimum percentage of commuters in Boston has a commute time
within 1.5 standard deviations of the mean?
(ii) What are the commute times within 1.5 standard deviations of the mean?
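Chebyshev's Inequality states that, for any distribution, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations lie within $k$ standard deviations of the mean, for $k > 1$. A minimal Python sketch applying it to the commute-time example follows; the function name is an illustrative assumption.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's inequality is informative only for k > 1")
    return (1 - 1 / k**2) * 100

mean, sd = 27.3, 8.1  # Boston commute time, from Example 1

print(f"k = 2:   at least {chebyshev_min_percent(2):.1f}%")
print(f"k = 1.5: at least {chebyshev_min_percent(1.5):.1f}%")

# Commute times within 1.5 standard deviations of the mean.
k = 1.5
low, high = mean - k * sd, mean + k * sd
print(f"{low:.2f} to {high:.2f} minutes")
```

Unlike the Empirical Rule, this bound does not require a bell-shaped distribution, which is why it gives a minimum rather than an approximate percentage.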
Chapter 3.4 Measures of Position and Outliers
Measures of position determine the relative position of a certain data value within the entire set of data.
Objective A : z-scores
The z-score represents the distance that a data value is from the mean in terms of the number
of standard deviations.
Population z-score: $z = \frac{x - \mu}{\sigma}$
Sample z-score: $z = \frac{x - \bar{x}}{s}$
Example 1: The average 20- to 29-year-old man is 69.6 inches tall, with a standard deviation of
3.0 inches, while the average 20- to 29-year-old woman is 64.1 inches tall, with a
standard deviation of 3.8 inches. Who is relatively taller, a 67-inch man or 62-inch
woman?
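Comparing relative standing across two groups comes down to comparing z-scores. The sketch below applies the z-score formula to the heights in Example 1; `z_score` is an illustrative helper name.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

z_man = z_score(67, 69.6, 3.0)    # 67-inch man
z_woman = z_score(62, 64.1, 3.8)  # 62-inch woman

print(f"man:   z = {z_man:.2f}")
print(f"woman: z = {z_woman:.2f}")

# The larger z-score marks the person who is taller relative to their own group.
relatively_taller = "woman" if z_woman > z_man else "man"
print("relatively taller:", relatively_taller)
```

Both z-scores are negative because both heights are below their group means; the one closer to zero is relatively taller.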
Objective B : Percentiles and Quartiles
B1. Percentiles
The $k$th percentile, $P_k$, of a set of data is a value such that $k$ percent of the observations are less
than or equal to the value.
Example 1: Explain the meaning of the following statement: the 5th percentile of the weight of males
36 months of age is 12.0 kg.
The most common percentiles are quartiles.
The first quartile, $Q_1$, is equivalent to $P_{25}$.
The second quartile, $Q_2$, is equivalent to $P_{50}$.
The third quartile, $Q_3$, is equivalent to $P_{75}$.
Example 2: Determine the quartiles of the following data.
46 45 58 71 42 66 72 42 61 49 80
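Quartiles for a small data set can be sketched in Python as below. This assumes the median-of-halves method: $Q_2$ is the median, $Q_1$ is the median of the lower half, and $Q_3$ is the median of the upper half, with the overall median left out of both halves when $n$ is odd. Other software may use a different interpolation rule and give slightly different values.

```python
def median(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]        # values below the median position
    upper = ordered[(n + 1) // 2 :]  # values above the median position
    return median(lower), median(ordered), median(upper)

data = [46, 45, 58, 71, 42, 66, 72, 42, 61, 49, 80]  # Example 2
q1, q2, q3 = quartiles(data)
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}")
```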
B2. Interquartile Range
The interquartile range, IQR, is the measure of dispersion that is based on quartiles: IQR = $Q_3 - Q_1$. The range and
standard deviation are affected by extreme values. The IQR is resistant to extreme values.
Example 1: One variable that is measured by online homework systems is the amount of time a student
spends on homework for each section of the text. The following is a summary of the number
of minutes a student spends for each section of the text for the fall 2007 semester in a
College Algebra class at Joliet Junior College.
$Q_1 = 42$
$Q_2 = 51.5$
$Q_3 = 72.5$
(a) Provide an interpretation of these results.
(b) Determine and interpret the interquartile range.
(c) Do you believe that the distribution of time spent doing homework is skewed or
symmetric? Why?
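The interquartile range and a rough check of shape can be computed directly from the three quartiles given in this example. The comparison of the two gaps around the median is a common heuristic, used here as an assumption; it suggests a shape rather than proving one.

```python
q1, q2, q3 = 42, 51.5, 72.5  # summary values given in Example 1 (minutes)

iqr = q3 - q1        # spread of the middle 50% of the data
lower_gap = q2 - q1  # distance from Q1 up to the median
upper_gap = q3 - q2  # distance from the median up to Q3

if upper_gap > lower_gap:
    shape = "skewed right"
elif upper_gap < lower_gap:
    shape = "skewed left"
else:
    shape = "roughly symmetric"

print(f"IQR = {iqr} minutes")
print(f"Q2 - Q1 = {lower_gap}, Q3 - Q2 = {upper_gap}: suggests {shape}")
```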
Objective C : Outliers
Extreme observations are called outliers; they may occur by error in the measurement or
during data entry or from errors in sampling.
Example 1: The following data represent the hemoglobin (in g/dL) for 20 randomly
selected cats. (Source: Joliet Junior College Veterinarian Technology Program)
5.7 8.9 9.6 10.6 11.7
7.7 9.4 9.9 10.7 12.9
7.8 9.5 10.0 11.0 13.0
8.7 9.6 10.3 11.2 13.4
(a) Determine the quartiles.
(b) Compute and interpret the interquartile range, IQR.
(c) Determine the lower and upper fences. Are there any outliers, according to this
criterion?
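A sketch of the fence calculation for the hemoglobin data follows. It assumes the usual fence criterion, lower fence $= Q_1 - 1.5 \cdot \text{IQR}$ and upper fence $= Q_3 + 1.5 \cdot \text{IQR}$, and the median-of-halves method for quartiles; neither formula is spelled out in the text above, so treat both as assumptions.

```python
from statistics import median

def quartiles(values):
    """Q1, Q2, Q3 by the median-of-halves method (assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    return median(ordered[: n // 2]), median(ordered), median(ordered[(n + 1) // 2 :])

hemoglobin = [5.7, 8.9, 9.6, 10.6, 11.7, 7.7, 9.4, 9.9, 10.7, 12.9,
              7.8, 9.5, 10.0, 11.0, 13.0, 8.7, 9.6, 10.3, 11.2, 13.4]

q1, q2, q3 = quartiles(hemoglobin)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any observation outside the fences is flagged as an outlier.
outliers = [x for x in hemoglobin if x < lower_fence or x > upper_fence]

print(f"Q1 = {q1:.2f}, Q2 = {q2:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print(f"fences: {lower_fence:.3f} to {upper_fence:.3f}")
print("outliers:", outliers)
```

A flagged value is a candidate for closer inspection, not automatically an error in measurement or data entry.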
Chapter 3.5 The Five-Number Summary and Boxplots
Objective A : The Five-Number Summary
Example 1: The number of chocolate chips in each of 21 randomly selected name-brand cookies
was recorded. The results are shown below.
28 23 28 31 27 29 24 19 26 23 21 25 22 23
21 23 33 28 33 21 30
Find the Five-Number Summary.
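The five-number summary consists of the minimum, $Q_1$, the median, $Q_3$, and the maximum. The sketch below computes it for the cookie data, again assuming the median-of-halves method for the quartiles; the function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """(minimum, Q1, median, Q3, maximum), with quartiles by median of halves (assumed)."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])        # lower half, median excluded when n is odd
    q3 = median(ordered[(n + 1) // 2 :])  # upper half, median excluded when n is odd
    return ordered[0], q1, median(ordered), q3, ordered[-1]

chips = [28, 23, 28, 31, 27, 29, 24, 19, 26, 23, 21, 25, 22, 23,
         21, 23, 33, 28, 33, 21, 30]

summary = five_number_summary(chips)
print("min, Q1, median, Q3, max =", summary)
```

These five values are exactly what is needed to draw the box and, when there are no outliers, the whiskers of a boxplot.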
Objective B: Boxplots
The five-number summary can be used to construct a graph called the boxplot.
Example 1: A stockbroker recorded the number of clients she saw each day over an 11-day
period. The data are shown. Draw a boxplot.
32 39 41 30 31 43 48 27 42 20 34
Objective C : Using a Boxplot to Describe the Shape of a Distribution
Example 1: Use the side-by-side boxplots shown to answer the questions that follow.
(a) To the nearest integer, what is the median of variable x?
(b) To the nearest integer, what is the first quartile of variable y?
(c) Which variable has more dispersion? Why?
(d) Does the variable x have any outliers? If so, what is the value of the outlier?
(e) Describe the shape of the variable y. Support your position.
Example 2: The following data represent the carbon dioxide emissions per capita (total carbon
dioxide emissions, in tons, divided by total population) for the countries of Western
Europe in 2004.
(a) Find the five-number summary.
(b) Determine the lower and upper fences.
(c) Construct a boxplot.
(d) Use the boxplot and quartiles to describe the shape of the distribution.