Uploaded by xinwen hu

Statistics Chapter3-1

Chapter 3
Descriptive Statistics:
Numerical Methods
Section 3.1 Describing Central Tendency
Population Parameters(总体参数)
A population parameter is a number calculated from all
the population measurements that describes some
aspect of the population
The population mean, denoted $\mu$, is a population
parameter and is the average of the population
measurements
The Mean(均值)
Population: $X_1, X_2, \ldots, X_N$
Sample: $x_1, x_2, \ldots, x_n$
Population Mean:
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$
Sample Mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Point Estimates and Sample Statistics
A point estimate(点估计) is a one-number estimate of
the value of a population parameter
A sample statistic is a number calculated using sample
measurements that describes some aspect of the sample
• Use sample statistics as point estimates of the
population parameters
The sample mean, denoted $\bar{x}$, is a sample statistic and is
the average of the sample measurements
• The sample mean is a point estimate of the population
mean
Measures of Central Tendency
Mean, $\mu$: The average or expected value
Median, Md: The value of the middle point of
the ordered measurements
Mode, Mo: The most frequent value
The Sample Mean(样本均值)
For a sample of size n, the sample mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
and is a point estimate of the population mean $\mu$
• It is the value to expect, on average and in the long run
Example: Car Mileage Case
Sample mean for first five car mileages from Table 2.1
30.8, 31.7, 30.1, 31.6, 32.1
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26$$
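The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the variable names (`mileages`, `total`, `x_bar`) are illustrative and not part of the slides.

```python
# First five car mileages (mpg) quoted in the example
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

total = sum(mileages)           # x1 + x2 + ... + x5
x_bar = total / len(mileages)   # sample mean: the sum divided by n

# After rounding, these should agree with the slide's 156.3 and 31.26
print(round(total, 1), round(x_bar, 2))
```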
Example: Car Mileage Case Continued
Sample mean for all the car mileages from Table 2.1
$$\bar{x} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1546.1}{49} = 31.5531$$
Based on this calculated sample mean, the point
estimate of mean mileage of all cars is 31.5531 mpg
The Median(中位数)
The population or sample median Md is a value such that
50% of all measurements, after having been arranged in
numerical order, lie above (or below) it
The median Md is found as follows:
1. If the number of measurements is odd, the median
is the middlemost measurement in the ordered
values
2. If the number of measurements is even, the median
is the average of the two middlemost measurements
in the ordered values
Example: Sample Median
Example 2.3 Internist’s Yearly Salaries (in thousands of dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Because n = 13 (odd), the median is the middlemost,
or 7th, value of the ordered data, so
Md = 152
• An annual salary of $180,000 is at the high end, well
above the median salary of $152,000
  • In fact, $180,000 is a very high and competitive
  salary
The Mode(众数)
The mode Mo of a population or sample of
measurements is the measurement that occurs most
frequently
• Modes are the values that are observed “most
typically”
• Sometimes higher frequencies at two or more values
• If there are two modes, the data is bimodal
• If more than two modes, the data is multimodal
• When data are in classes, the class with the highest
frequency is the modal class
• The tallest box in the histogram
Example 2.4
DVD Recorder Satisfaction
Satisfaction rankings on a scale of 1 (not satisfied) to 10
(extremely satisfied), arranged in increasing order
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Because n = 20 (even), the median is the average of
the two middlemost ratings; these are the 10th and 11th
values. Both of these are 8, so
Md = 8
Because the rating 8 occurs with the highest frequency,
Mo = 8
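As a quick check of the two median rules and the mode, the following sketch uses Python's standard `statistics` module on the salary and rating data from the examples. The variable names are illustrative only.

```python
from statistics import median, multimode

# Internists' salaries (thousands of dollars), n = 13 (odd)
salaries = [127, 132, 138, 141, 144, 146, 152, 154, 165, 171, 177, 192, 241]
# DVD recorder satisfaction ratings, n = 20 (even)
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]

salary_median = median(salaries)    # odd n: the single middle value
rating_median = median(ratings)     # even n: average of the two middle values
rating_modes = multimode(ratings)   # list of the most frequent value(s)

print(salary_median, rating_median, rating_modes)
```

`multimode` returns a list, so bimodal or multimodal data would show up as more than one entry.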
Comparing Mean, Median & Mode
• The median is not affected by extreme values
  • “Extreme values” are values much larger or much
  smaller than most of the data
  • The median is resistant to extreme values
• The mean is strongly affected by extreme values
  • The mean is sensitive to extreme values
Payment Time Case
Mean=18.108 days
Median=17.000 days
Mode=16.000 days
So:
Expect the mean payment time to be 18.108
days
A long payment time would be > 17 days and a
short payment time would be < 17 days
The typical payment time is 16 days
Section 3.2 Measures of Variation
Figure 2.31 indicates that we need measures of
variation to express how the two distributions differ.
Figure 2.31 20 Repair Times for Personal Computers at Two Service Centers
The Range
Range = largest measurement - smallest measurement
The range measures the interval spanned by all the data
Example 2.3: Internist’s Salaries (in thousands of
dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Range = 241 - 127 = 114 ($114,000)
The Population Variance $\sigma^2$ (pronounced sigma
squared) (总体方差)
The average of the squared deviations of all
the population measurements from the
population mean
Standard Deviation $\sigma$ (pronounced sigma) (标准差)
The square root of the variance
The Variance
Population: $X_1, X_2, \ldots, X_N$
Sample: $x_1, x_2, \ldots, x_n$
Population Variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Sample Variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$
The Variance
For a population of size N, the population variance $\sigma^2$ is defined
as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$
For a sample of size n, the sample variance $s^2$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n-1}$$
and is a point estimate for $\sigma^2$
The Standard Deviation(标准差)
Population Standard Deviation, $\sigma$: $\sigma = \sqrt{\sigma^2}$
Sample Standard Deviation, s: $s = \sqrt{s^2}$
Example 2.6 The Car Mileage Case
Sample variance and standard deviation for first five car
mileages from Table 2.1
30.8, 31.7, 30.1, 31.6, 32.1
$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5-1} = \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4} = \frac{2.572}{4} = 0.643$$
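The same calculation can be sketched in Python by following the definition step by step. The names `squared_devs`, `s2` and `s` are illustrative. The standard deviation is computed here as the square root of the variance, a step the slide's heading mentions but does not show.

```python
from statistics import variance

# First five car mileages (mpg)
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

n = len(mileages)
x_bar = sum(mileages) / n                          # sample mean, 31.26
squared_devs = [(x - x_bar) ** 2 for x in mileages]
s2 = sum(squared_devs) / (n - 1)                   # sample variance (divide by n - 1)
s = s2 ** 0.5                                      # sample standard deviation

# statistics.variance also uses the n - 1 divisor, so it should agree with s2
print(round(s2, 3), round(s, 4), round(variance(mileages), 3))
```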
Sample variance and standard deviation for all car mileages
from Table 2.1:
$$s^2 = \frac{\sum_{i=1}^{49} (x_i - \bar{x})^2}{49-1} = \frac{30.66204}{48} = 0.638793$$
$$s = \sqrt{s^2} = \sqrt{0.638793} = 0.7992$$
The point estimate of the variance of the mileages of all cars is 0.638793 mpg²
and the point estimate of the standard deviation is
0.7992 mpg.
The computational formula for the sample variance
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
The Payment Time Case
Example 2.7
Consider the sample of 65 payment times in Table 2.2.
$$\sum_{i=1}^{65} x_i = x_1 + x_2 + \cdots + x_{65} = 22 + 19 + \cdots + 21 = 1{,}177$$
$$\sum_{i=1}^{65} x_i^2 = x_1^2 + x_2^2 + \cdots + x_{65}^2 = (22)^2 + (19)^2 + \cdots + (21)^2 = 22{,}317$$
Therefore
$$s^2 = \frac{1}{65-1}\left[22{,}317 - \frac{(1{,}177)^2}{65}\right] = \frac{1{,}004.2464}{64} = 15.69135$$
and $s = \sqrt{s^2} = \sqrt{15.69135} = 3.9612$ days.
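The computational formula needs only n, the sum of the measurements, and the sum of their squares. The sketch below applies it to the payment time totals quoted in the example. The function name `sample_variance_from_sums` is hypothetical and chosen for illustration.

```python
import math

def sample_variance_from_sums(n, sum_x, sum_x2):
    """Computational formula: s^2 = [sum(x^2) - (sum(x))^2 / n] / (n - 1)."""
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)

# Totals quoted for the 65 payment times
s2 = sample_variance_from_sums(65, 1177, 22317)
s = math.sqrt(s2)
print(round(s2, 4), round(s, 4))

# Cross-check against the five car mileages, where s^2 = 0.643 by the definition
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
s2_mileage = sample_variance_from_sums(
    len(mileages), sum(mileages), sum(x * x for x in mileages)
)
```

Both formulas give the same result in exact arithmetic. The computational form avoids a separate pass to compute deviations, at the cost of subtracting two large numbers.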
Section 3.3 The Normal Curve(正态曲线)
The normal curve is a symmetrical, bell-shaped
curve for a normally distributed
population
The height of the normal curve over
any point represents the relative
proportion of values near that point
Example 2.1, The Car Mileages
Case
[Figure: Daily Return of ZSYH for Recent 5 Years. Vertical axis: Daily Return, -15,00% to 15,00%; horizontal axis: dates from 2006.4.21 to 2010.4.21]
The Empirical Rule(经验准则) for
Normal Populations
If a population has mean $\mu$ and standard deviation $\sigma$ and
is described by a normal curve, then
1. 68.26% of the population measurements lie within one
standard deviation of the mean: $[\mu - \sigma, \mu + \sigma]$
2. 95.44% of the population measurements lie within two
standard deviations of the mean: $[\mu - 2\sigma, \mu + 2\sigma]$
3. 99.73% of the population measurements lie within
three standard deviations of the mean: $[\mu - 3\sigma, \mu + 3\sigma]$
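The three percentages above can be reproduced from the normal distribution itself. The sketch below uses the identity that the proportion within k standard deviations of the mean equals erf(k/√2). The function name `normal_coverage` is illustrative. Direct computation gives about 68.27% and 95.45% for k = 1 and k = 2; the slide's 68.26% and 95.44% are the values obtained from a normal table rounded to four decimals.

```python
import math

def normal_coverage(k):
    """Proportion of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
for k, proportion in coverage.items():
    print(f"within {k} standard deviation(s): {100 * proportion:.2f}%")
```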
The Empirical Rule
• The Empirical Rule holds for normally distributed
populations.
• This rule also approximately holds for populations
having mound-shaped (single-peaked) distributions
that are not very skewed to the right or left.
• For example, recall the distribution of the 65
payment times, which indicates that the empirical rule
holds.
Example 2.8 The Car Mileage Case
Recall that for the 49 car mileages:
$$\bar{x} = \frac{\sum_{i=1}^{49} x_i}{49} = \frac{1546.1}{49} = 31.5531$$
$$s^2 = \frac{\sum_{i=1}^{49} (x_i - \bar{x})^2}{49-1} = \frac{30.66204}{48} = 0.638793$$
$$s = \sqrt{s^2} = \sqrt{0.638793} = 0.7992$$
Using the rounded values $\bar{x} \approx 31.6$ and $s \approx 0.8$:
• 68.26% of all individual cars will have mileages in
the range
$[\bar{x} \pm s] = [31.6 \pm 0.8] = [30.8, 32.4]$ mpg
• 95.44% of all individual cars will have mileages in
the range
$[\bar{x} \pm 2s] = [31.6 \pm 1.6] = [30.0, 33.2]$ mpg
• 99.73% of all individual cars will have mileages in
the range
$[\bar{x} \pm 3s] = [31.6 \pm 2.4] = [29.2, 34.0]$ mpg
Tolerance Intervals(容许区间)
An interval that contains a specified percentage of the
individual measurements in a population is called a
tolerance interval.
• The one, two, and three standard deviation intervals
around $\mu$ given in (1), (2) and (3) are tolerance
intervals containing, respectively, 68.26 percent, 95.44
percent and 99.73 percent of the measurements in a
normally distributed population.
• The three-sigma interval $[\mu \pm 3\sigma]$ is often taken to be a tolerance
interval that contains almost all of the measurements
in a normally distributed population.
Section 3.4 Percentiles, Quartiles(四分之一分位点)
and Box-and-Whiskers Display
For a set of measurements arranged in increasing order,
the pth percentile(百分位点) is a value such that p
percent of the measurements fall at or below the value
and (100-p) percent of the measurements fall at or above
the value
The first quartile Q1 is the 25th percentile
The second quartile (or median) Md is the 50th percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR(四分位距) is Q3 - Q1
Calculating pth percentile
• Calculate the index i=(p/100) ×n
• If i is not an integer, the next integer greater
than i denotes the position of the pth
percentile in the ordered arrangement.
• If i is an integer, then the pth percentile is
the average of the measurements in position
i and i+1 in the ordered arrangement.
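The index rule above can be sketched as a small Python function. The name `percentile` and the restriction to whole-number p from 1 to 99 are assumptions made for this illustration. The index i = (p/100) × n is held as a quotient and remainder so that the "is i an integer?" test is not affected by floating-point round-off.

```python
def percentile(values, p):
    """pth percentile by the index rule above; p is a whole number from 1 to 99."""
    if not (isinstance(p, int) and 0 < p < 100):
        raise ValueError("p must be an integer between 1 and 99")
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    # i = (p/100) * n, kept as whole part and remainder of p*n / 100
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: use the next position above i (1-based: whole + 1)
        return ordered[whole]
    # i is an integer: average the values in 1-based positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

# DVD recorder satisfaction ratings (n = 20)
ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
q1, md, q3 = percentile(ratings, 25), percentile(ratings, 50), percentile(ratings, 75)
print(q1, md, q3)
```

Statistical packages often use other interpolation conventions, so their quartiles may differ slightly from the values this rule gives.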
Percentile Example
• i = (10/100) × 12 = 1.2
• Not an integer, so round up to 2
• The 10th percentile is in the second position, so it is
11,070
• i = (25/100) × 12 = 3
• Integer, so average the values in positions 3 and 4
• The 25th percentile is (18,211 + 26,817)/2, or 22,514
Figure 2.33 Using stem-and-leaf displays to find percentiles.
(a) The 75th percentile of the 65 payment times, and a five-number summary
(b) The 5th percentile of the 60 bottle design ratings and a five-number summary
Example 2.10
DVD Recorder Satisfaction
20 customer satisfaction ratings:
1 3 5 5 7 8 8 8 8 8 8 9 9 9 9 9 10 10 10 10
Md = (8+8)/2 = 8
Q1 = (7+8)/2 = 7.5
Q3 = (9+9)/2 = 9
IQR = Q3 - Q1 = 9 - 7.5 = 1.5
The Box-and-Whiskers Plots(盒型图)
The box plots the:
• first quartile, Q1
• median, Md
• third quartile, Q3
• inner fences, located 1.5 × IQR away from the quartiles:
  • Q1 - (1.5 × IQR)
  • Q3 + (1.5 × IQR)
• outer fences, located 3 × IQR away from the quartiles:
  • Q1 - (3 × IQR)
  • Q3 + (3 × IQR)
The “whiskers” are dashed lines that plot the
range of the data
• A dashed line drawn from the box below Q1
down to the smallest measurement
• Another dashed line drawn from the box
above Q3 up to the largest measurement
Note: Q1, Md, Q3, the smallest value, and the
largest value are sometimes referred to as the
five-number summary
Outliers(异常值)
Outliers are measurements that are very different from
most of the other measurements
• They are either very much larger or very much
smaller than most of the other measurements
Outliers lie beyond the fences of the box-and-whiskers
plot
• Measurements between the inner and outer fences are
mild outliers
• Measurements beyond the outer fences are extreme
outliers
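The fence rules can be sketched in Python using the DVD recorder ratings and the quartiles Q1 = 7.5 and Q3 = 9 from Example 2.10. The function name `classify_outliers` is hypothetical. The slides do not say how to treat a value lying exactly on a fence; this sketch assumes a value exactly on an outer fence counts as mild and a value exactly on an inner fence is not an outlier.

```python
def classify_outliers(values, q1, q3):
    """Split values into mild and extreme outliers using inner and outer fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    # Assumption: a value exactly on an outer fence is treated as mild
    mild = [x for x in values
            if outer_low <= x < inner_low or inner_high < x <= outer_high]
    extreme = [x for x in values if x < outer_low or x > outer_high]
    return mild, extreme

ratings = [1, 3, 5, 5, 7, 8, 8, 8, 8, 8, 8, 9, 9, 9, 9, 9, 10, 10, 10, 10]
mild, extreme = classify_outliers(ratings, q1=7.5, q3=9)
print(mild, extreme)
```

With IQR = 1.5, the inner fences are 5.25 and 11.25 and the outer fences are 3.0 and 13.5, so the low ratings are the only candidates for outliers here.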
Weighted Means(加权均值)
Sometimes, some measurements are more important than
others
• Assign numerical “weights” to the data
• Weights measure the relative importance of each value
Calculate the weighted mean as
$$\frac{\sum w_i x_i}{\sum w_i}$$
where $w_i$ is the weight assigned to the ith measurement $x_i$
Example 2.12
June 2001 unemployment rates in the U.S. by region
Census Region    Civilian Labor Force (millions)    Unemployment Rate (%)
Northeast        26.9                               4.1
South            50.6                               4.7
Midwest          34.7                               4.4
West             32.5                               5.0
Want the mean unemployment rate for the U.S.
• Calculate it as a weighted mean
  • So that the bigger the region, the more heavily it counts
  in the mean
• The data values are the regional unemployment
rates
• The weights are the sizes of the regional labor
forces
26 .9  4.1  50 .6  4.7   34 .7  4.4  32 .5  5.0



26 .9  50 .6  34 .7  25 .5  32 .5
663 .29
 4.58 %
144 .7
Note that the unweigthed mean is 4.55%, which
underestimates the true rate by 0.03%

That is, 0.0003  144.7 million = 43,410 workers
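The weighted and unweighted means in this example can be compared with a short Python sketch. The dictionary names `labor_force` and `rate` are illustrative; the figures are the ones in the regional table.

```python
# Civilian labor force (millions) and unemployment rate (%) by census region
labor_force = {"Northeast": 26.9, "South": 50.6, "Midwest": 34.7, "West": 32.5}
rate = {"Northeast": 4.1, "South": 4.7, "Midwest": 4.4, "West": 5.0}

# Weighted mean: sum of w_i * x_i divided by the sum of the weights
weighted_sum = sum(labor_force[r] * rate[r] for r in labor_force)
total_weight = sum(labor_force.values())
weighted_mean = weighted_sum / total_weight

# Unweighted mean treats every region equally, regardless of size
unweighted_mean = sum(rate.values()) / len(rate)

print(round(weighted_mean, 2), round(unweighted_mean, 2))
```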