1.1 Descriptive Statistics

1. Descriptive Statistics and Basic Probability
1.1. Descriptive Statistics
Suppose $y_1, y_2, \ldots, y_N$ are all the elements in the population and
$x_1, x_2, \ldots, x_n$ are the sample drawn from $y_1, y_2, \ldots, y_N$,
where $N$ is referred to as the population size and $n$ is the sample size.
In this chapter, we introduce several numerical measures that summarize
important information about the population. Numerical measures computed from
a sample are called sample statistics, while those computed from a population
are called population parameters.
(I) Measure of Location:
1. Mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$.
2. Median:
Arrange the data in ascending (or descending) order. Then,
(i) if the sample size is odd, the median is the middle value;
(ii) if the sample size is even, the median is the mean of the two middle
values.
3. Mode: The data value that occurs with the greatest frequency (the data need
not be numerical).
4. Percentile: The pth percentile is a value such that at least p percent of the
data have this value or less and at least (100-p) percent of the data have
this value or more.
Note: the 50th percentile is the median.
The procedure to calculate the pth percentile:
(i) Arrange the data in ascending order.
(ii) Compute an index $i = \left(\dfrac{p}{100}\right) n$.
(iii) (a) If $i$ is not an integer, round up: the next integer greater than $i$
denotes the position of the pth percentile.
(b) If $i$ is an integer, the pth percentile is the average of the data values in
positions $i$ and $i+1$.
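The procedure above can be sketched in Python as follows. This is a minimal illustration rather than a library routine: the function name `percentile` is an assumption introduced here, and the sketch assumes p is an integer with 0 < p < 100 so that the test "is i an integer" can be done with exact integer arithmetic.

```python
def percentile(data, p):
    """pth percentile by the round-up / average rule (p an integer, 0 < p < 100)."""
    if not data or not 0 < p < 100:
        raise ValueError("need non-empty data and 0 < p < 100")
    values = sorted(data)                   # step (i): ascending order
    n = len(values)
    whole, remainder = divmod(p * n, 100)   # step (ii): i = (p/100) n, kept exact
    if remainder:                           # step (iii)(a): i is not an integer
        return values[whole]                # position ceil(i) is 0-based index floor(i)
    # step (iii)(b): i is an integer, so average positions i and i+1
    return (values[whole - 1] + values[whole]) / 2


scores = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]    # quiz scores used in Example 1
p40 = percentile(scores, 40)
p26 = percentile(scores, 26)
print(p40, p26)
```

Other conventions for percentiles exist (for instance, interpolating between neighbouring values), so results from statistical software may differ slightly from this rule.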
5. Quartiles: When the data are divided into four parts, the division points are
referred to as the quartiles. That is,
$Q_1$ = the first quartile, or 25th percentile
$Q_2$ = the second quartile, or 50th percentile (the median)
$Q_3$ = the third quartile, or 75th percentile
Example 1:
Suppose the following data are the scores of 10 students in a quiz,
1, 3, 5, 7, 9, 2, 4, 6, 8, 10.
Some measures are needed to summarize the performance of the 10 students in
this quiz.
1. mean: $\bar{x} = \dfrac{1 + 3 + \cdots + 10}{10} = 5.5$
2. median $= \dfrac{5 + 6}{2} = 5.5$
If instead the data are 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, then the median is 6.
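4. Find the 40th percentile and the 26th percentile for the previous data.
Step 1: The data in ascending order are
1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
Step 2:
For the 40th percentile, $i = \left(\dfrac{40}{100}\right) 10 = 4$.
For the 26th percentile, $i = \left(\dfrac{26}{100}\right) 10 = 2.6$.
Step 3: Since $i = 4$ is an integer, the 40th percentile is the average of the
values in positions 4 and 5, that is, $\dfrac{4 + 5}{2} = 4.5$. Since $i = 2.6$ is
not an integer, it is rounded up to 3, so the 26th percentile is the value in
position 3, which is 3.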
5. Find the first quartile and the third quartile for the previous example.
Step 2:
For the first quartile, $i = \left(\dfrac{25}{100}\right) 10 = 2.5$.
For the third quartile, $i = \left(\dfrac{75}{100}\right) 10 = 7.5$.
Step 3: Rounding up gives positions 3 and 8, so
$Q_1 = 3$ and $Q_3 = 8$.
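The mean and median found in this example can be cross-checked with Python's standard `statistics` module. The sketch below only verifies the hand calculation; the `grades` list is made-up data (not from the text), included to illustrate that a mode also exists for non-numerical values.

```python
import statistics

scores = [1, 3, 5, 7, 9, 2, 4, 6, 8, 10]

mean_score = statistics.mean(scores)                 # (1 + 3 + ... + 10) / 10
median_even = statistics.median(scores)              # mean of the two middle values
median_odd = statistics.median(list(range(1, 12)))   # middle value of 1, ..., 11

# Hypothetical non-numerical data, used only to illustrate the mode.
grades = ["A", "B", "B", "C", "B", "A"]
most_common_grade = statistics.mode(grades)

print(mean_score, median_even, median_odd, most_common_grade)
```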
(II) Measure of Dispersion:
Example 2:
Suppose there are two factories producing batteries. From each factory, 10
batteries are drawn and tested for their lifetime (in hours). The lifetimes are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The mean lifetimes for the two factories are both 10 hours. However, the data
show that the lifetimes from factory 1 vary far less, so the batteries produced
by factory 1 are much more reliable than the ones from factory 2. This implies
that other measures of the “dispersion” or “variation” of the data are required.
◆
1. Range: range = (largest value of the data) - (smallest value of the data).
2. Interquartile Range: The interquartile range is the difference between the
third and the first quartiles. That is,
$IQR = Q_3 - Q_1$.
3. Variance and Standard Deviation:
$s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \dfrac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}$, $\quad s = \sqrt{s^2}$.
4. Coefficient of Variation: The coefficient of variation is another useful
statistic for measuring the dispersion of the data. It is defined as
$C.V. = \dfrac{s}{\bar{x}} \times 100$.
The coefficient of variation is invariant with respect to the scale of the data.
The standard deviation, on the other hand, is not scale-invariant.
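The second form of the variance formula and the scale-invariance claim each follow in a few lines. The derivation below is a sketch that uses only the definitions of $\bar{x}$, $s$ and the coefficient of variation; the rescaling constant $c > 0$ and the symbols $y_i$, $s_x$, $s_y$ are notation introduced here for illustration.

```latex
\begin{align*}
\sum_{i=1}^{n} (x_i - \bar{x})^2
  &= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n\bar{x}^2 \\
  &= \sum_{i=1}^{n} x_i^2 - 2n\bar{x}^2 + n\bar{x}^2
   = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
  \qquad \text{since } \sum_{i=1}^{n} x_i = n\bar{x}.
\end{align*}
% Rescaling the data: y_i = c x_i with c > 0
\begin{align*}
\bar{y} = c\,\bar{x}, \qquad
s_y = \sqrt{\frac{\sum_{i=1}^{n} (c x_i - c\bar{x})^2}{n - 1}} = c\, s_x, \qquad
\frac{s_y}{\bar{y}} \times 100 = \frac{c\, s_x}{c\,\bar{x}} \times 100
  = \frac{s_x}{\bar{x}} \times 100 .
\end{align*}
```

So a change of unit multiplies the standard deviation by $c$ but leaves the coefficient of variation unchanged.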
Example 2 (continued):
1. Range of lifetime data for factory 1 = 10.1 - 9.9 = 0.2
Range of lifetime data for factory 2 = 16 - 3 = 13
$\Rightarrow$ The range of battery lifetimes for factory 1 is much smaller than the one
for factory 2.
2. The first and third quartiles are 9.9 and 10.1, respectively, for the data from
factory 1, and 6 and 14 for the data from factory 2. Therefore,
IQR (factory 1) = 10.1 - 9.9 = 0.2
IQR (factory 2) = 14 - 6 = 8.
$\Rightarrow$ The interquartile range of battery lifetimes for factory 1 is much smaller
than the one for factory 2.
3. $s^2(\text{factory 1}) = \dfrac{(10.1 - 10)^2 + (9.9 - 10)^2 + \cdots + (10.1 - 10)^2}{10 - 1} = 0.0111$
$s^2(\text{factory 2}) = \dfrac{(16 - 10)^2 + (5 - 10)^2 + \cdots + (12 - 10)^2}{10 - 1} = 21.1111$
$\Rightarrow$ The sample variance of battery lifetimes for factory 2 is 1900 times as large
as the one for factory 1.
The sample standard deviations for the data from factories 1 and 2 are
$\sqrt{0.01111} = 0.1054$ and $\sqrt{21.1111} = 4.5946$,
respectively.
4. In the battery data from factory 1, suppose the lifetimes are measured in
minutes rather than hours. Then the data are 606, 594, 606, 594, 594, 606, 594,
606, 594, 606, and the standard deviation becomes 6.3245, which is 60 times
the value 0.1054 based on the original data measured in hours.
However, whether the data are measured in hours or in minutes, the coefficient
of variation is
$C.V. = \dfrac{0.1054}{10} \times 100 = \dfrac{6.3245}{600} \times 100 = 1.054$.
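The dispersion measures for the two factories can be recomputed with a short Python sketch. It uses the standard `statistics` module, whose `variance` and `stdev` functions use the divisor n - 1 as in the formulas above; the helper names `data_range` and `coefficient_of_variation` are assumptions introduced here.

```python
import statistics

factory1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]


def data_range(data):
    """Largest value minus smallest value."""
    return max(data) - min(data)


def coefficient_of_variation(data):
    """C.V. = (s / mean) * 100, with s the sample standard deviation."""
    return statistics.stdev(data) / statistics.mean(data) * 100


for name, data in (("factory 1", factory1), ("factory 2", factory2)):
    print(name,
          "range:", round(data_range(data), 4),
          "variance:", round(statistics.variance(data), 4),  # divisor n - 1
          "std dev:", round(statistics.stdev(data), 4))

# Changing the unit from hours to minutes rescales s but not the C.V.
factory1_minutes = [60 * x for x in factory1]
cv_hours = coefficient_of_variation(factory1)
cv_minutes = coefficient_of_variation(factory1_minutes)
print(round(cv_hours, 3), round(cv_minutes, 3))
```

The two printed coefficients of variation agree up to floating-point rounding, while the standard deviation in minutes is 60 times the one in hours.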
Example 3:
The amounts of time (in minutes) that a sample of 15 students spend watching
television per day are given below.
40, 25, 35, 30, 20, 40, 30, 40, 10, 30, 20, 10, 20, 5, 20
(a) Compute the mean.
(b) Compute the standard deviation.
(c) Compute the coefficient of variation.
(d) Find the 40th percentile.
(e) Find the mode.
(f) Find the interquartile range.
(g) Construct a frequency distribution, a cumulative frequency distribution and
a relative frequency distribution. Let the first class be 1-10.
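[solution:]
(a) $\bar{x} = \dfrac{\sum_{i=1}^{15} x_i}{15} = \dfrac{40 + 25 + \cdots + 5 + 20}{15} = 25$
(b) $s = \sqrt{\dfrac{\sum_{i=1}^{15} (x_i - \bar{x})^2}{15 - 1}} = \sqrt{\dfrac{(40 - 25)^2 + (25 - 25)^2 + \cdots + (5 - 25)^2 + (20 - 25)^2}{14}} = 11.339$
(c) $C.V. = \dfrac{s}{\bar{x}} \times 100 = \dfrac{11.339}{25} \times 100 = 45.356$.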
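(d)
1. The data in ascending order are
5, 10, 10, 20, 20, 20, 20, 25, 30, 30, 30, 35, 40, 40, 40.
2. $i = \left(\dfrac{40}{100}\right) 15 = 6$.
Since $i$ is an integer, the 40th percentile is the average of the values in
positions 6 and 7. Thus,
$\dfrac{20 + 20}{2} = 20$
is the 40th percentile.
(e) The mode is 20.
(f) For $Q_1$, $i = \left(\dfrac{25}{100}\right) 15 = 3.75$, which rounds up to position 4;
for $Q_3$, $i = \left(\dfrac{75}{100}\right) 15 = 11.25$, which rounds up to position 12. Since
$Q_1 = 20$ and $Q_3 = 35$,
$IQR = Q_3 - Q_1 = 35 - 20 = 15$.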
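Part (g) can be approached with a short Python sketch. The exercise fixes only the first class, 1-10; the remaining classes 11-20, 21-30 and 31-40 (equal width 10) are an assumption made here, as are the variable names.

```python
from itertools import accumulate

times = [40, 25, 35, 30, 20, 40, 30, 40, 10, 30, 20, 10, 20, 5, 20]
n = len(times)
class_width = 10  # assumed equal width, matching the first class 1-10

# Lower class limits 1, 11, 21, 31 cover every observation up to the maximum.
lower_limits = list(range(1, max(times) + 1, class_width))
frequencies = [
    sum(lower <= t <= lower + class_width - 1 for t in times)
    for lower in lower_limits
]
cumulative = list(accumulate(frequencies))   # running totals
relative = [f / n for f in frequencies]      # share of the sample in each class

print("Class   Freq  Cum.   Rel.")
for lower, f, c, r in zip(lower_limits, frequencies, cumulative, relative):
    print(f"{lower:>2}-{lower + class_width - 1:<4} {f:>4} {c:>5} {r:>6.3f}")
```

The last cumulative frequency equals the sample size, and the relative frequencies sum to 1, which gives a quick consistency check on the table.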
(III) Other Descriptive Statistics:
1. Five-Number Summary:
The five-number summary provides important information about both the
location and the dispersion of the data. It consists of
- Smallest value
- First quartile
- Median
- Third quartile
- Largest value
2. Z-score: The z-score, referred to as the standardized value for observation $i$,
is defined as
$z_i = \dfrac{x_i - \bar{x}}{s}$.
3. Weighted Mean:
$\bar{x}_w = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$,
where $w_i$ is the weight assigned to observation $x_i$.
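4. Sample Mean for Grouped Data:
$\bar{x}_g = \dfrac{\sum_{k=1}^{m} f_k M_k}{\sum_{k=1}^{m} f_k} = \dfrac{\sum_{k=1}^{m} f_k M_k}{n}$,
where $m$ is the number of classes, $f_k$ is the frequency of class $k$, $M_k$ is the
midpoint of class $k$, and $n = \sum_{k=1}^{m} f_k$.
Sample Variance for Grouped Data:
$s_g^2 = \dfrac{\sum_{k=1}^{m} f_k (M_k - \bar{x}_g)^2}{n - 1} = \dfrac{\sum_{k=1}^{m} f_k M_k^2 - n\bar{x}_g^2}{n - 1}$.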
Example 2 (continued):
The original data (in hours) are:
Factory 1: 10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1
Factory 2: 16, 5, 7, 14, 6, 15, 3, 13, 9, 12.
The five-number summaries for the data from the two factories are

            Smallest   Q1     Median   Q3     Largest
Factory 1   9.9        9.9    10       10.1   10.1
Factory 2   3          6      10.5     14     16

Z-scores for the data:
Factory 1:
$x_i$   10.1     9.9      10.1     9.9      9.9      10.1     9.9      10.1     9.9      10.1
$z_i$   0.948    -0.948   0.948    -0.948   -0.948   0.948    -0.948   0.948    -0.948   0.948
Factory 2:
$x_i$   16       5        7        14       6        15       3        13       9        12
$z_i$   1.305    -1.088   -0.652   0.870    -0.870   1.088    -1.523   0.652    -0.217   0.435
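The summary table and z-scores above can be reproduced with a short Python sketch. The helper names `percentile`, `five_number_summary` and `z_scores` are assumptions introduced here; `percentile` follows the round-up / average rule from part (I) and assumes an integer p with 0 < p < 100.

```python
import statistics


def percentile(values, p):
    """pth percentile by the round-up / average rule (integer p, 0 < p < 100)."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:                      # index not an integer: round up
        return ordered[whole]
    return (ordered[whole - 1] + ordered[whole]) / 2


def five_number_summary(values):
    """(smallest, Q1, median, Q3, largest)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


def z_scores(values):
    """Standardized values (x_i - mean) / s, with s using the divisor n - 1."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


factory1 = [10.1, 9.9, 10.1, 9.9, 9.9, 10.1, 9.9, 10.1, 9.9, 10.1]
factory2 = [16, 5, 7, 14, 6, 15, 3, 13, 9, 12]

summary1 = five_number_summary(factory1)
summary2 = five_number_summary(factory2)
z1 = z_scores(factory1)
z2 = z_scores(factory2)
print(summary1, summary2)
print([round(z, 3) for z in z2])
```

Rounded to three decimals, a few z-scores may differ from the table in the last digit, since the tabulated values appear to be truncated rather than rounded.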
Example 4:
The following are 5 purchases of a raw material over the past 3 months.

Purchase   Cost per Pound ($)   Number of Pounds
1          3.00                 1200
2          3.40                 500
3          2.80                 2750
4          2.90                 1000
5          3.25                 800

Find the mean cost per pound.
[solutions:]
$w_1 = 1200, w_2 = 500, w_3 = 2750, w_4 = 1000, w_5 = 800$
and
$x_1 = 3.00, x_2 = 3.40, x_3 = 2.80, x_4 = 2.90, x_5 = 3.25$.
Then,
$\bar{x}_w = \dfrac{\sum_{i=1}^{5} w_i x_i}{\sum_{i=1}^{5} w_i} = \dfrac{1200 \times 3.00 + 500 \times 3.40 + 2750 \times 2.80 + 1000 \times 2.90 + 800 \times 3.25}{1200 + 500 + 2750 + 1000 + 800} = 2.96$

Example 5:
The following is the frequency distribution of the time (in days) required to
complete year-end audits:

Audit Time (days)   Frequency
10-14               4
15-19               8
20-24               5
25-29               2
30-34               1

What are the mean and the variance of the audit time?
[solutions:]
$f_1 = 4, f_2 = 8, f_3 = 5, f_4 = 2, f_5 = 1$,
$n = f_1 + f_2 + f_3 + f_4 + f_5 = 4 + 8 + 5 + 2 + 1 = 20$,
and the class midpoints are $M_1 = 12, M_2 = 17, M_3 = 22, M_4 = 27, M_5 = 32$.
Thus,
$\bar{x}_g = \dfrac{\sum_{i=1}^{5} f_i M_i}{\sum_{i=1}^{5} f_i} = \dfrac{4 \times 12 + 8 \times 17 + 5 \times 22 + 2 \times 27 + 1 \times 32}{4 + 8 + 5 + 2 + 1} = 19$
and
$s_g^2 = \dfrac{\sum_{i=1}^{5} f_i (M_i - \bar{x}_g)^2}{n - 1} = \dfrac{4(12 - 19)^2 + 8(17 - 19)^2 + 5(22 - 19)^2 + 2(27 - 19)^2 + 1(32 - 19)^2}{20 - 1} = 30$
(IV) Numerical Measures of Association: Covariance and Correlation
Coefficient:
Let sample 1 be $x_1, x_2, \ldots, x_n$ and sample 2 be $z_1, z_2, \ldots, z_n$.
The sample covariance is
$s_{xz} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{n - 1} = \dfrac{\sum_{i=1}^{n} x_i z_i - n\bar{x}\bar{z}}{n - 1}$,
while the sample correlation coefficient is
$r_{xz} = \dfrac{s_{xz}}{s_x s_z} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(z_i - \bar{z})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (z_i - \bar{z})^2}}$.
Note: $|r_{xz}| \le 1$.
Example 6:
Let $x_i$ be the total money spent on advertisement for some product and $z_i$ be
the sales volume (1 unit = 100 packs).

$x_i$                              2    5    1    3    4    1    5    3    4    2
$z_i$                              50   57   41   54   54   38   63   48   59   46
$(x_i - \bar{x})(z_i - \bar{z})$   1    12   20   0    3    26   24   0    8    5

Here $\bar{x} = 3$ and $\bar{z} = 51$, so
$s_{xz} = \dfrac{\sum_{i=1}^{10} (x_i - \bar{x})(z_i - \bar{z})}{10 - 1} = \dfrac{99}{10 - 1} = 11$,
$s_x^2 = \dfrac{\sum_{i=1}^{10} (x_i - \bar{x})^2}{10 - 1} = 1.4907^2$ and $s_z^2 = \dfrac{\sum_{i=1}^{10} (z_i - \bar{z})^2}{10 - 1} = 7.9303^2$.
Then, $r_{xz} = \dfrac{s_{xz}}{s_x s_z} = \dfrac{11}{1.4907 \times 7.9303} = 0.93$.
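The covariance and correlation in this example can be recomputed with a short Python sketch. The function names `sample_covariance` and `sample_correlation` and the variable names are assumptions introduced here; both functions use the divisor n - 1 as in the definitions above.

```python
import math


def sample_covariance(x, z):
    """s_xz = sum((x_i - x_bar)(z_i - z_bar)) / (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    z_bar = sum(z) / n
    return sum((a - x_bar) * (b - z_bar) for a, b in zip(x, z)) / (n - 1)


def sample_correlation(x, z):
    """r_xz = s_xz / (s_x * s_z); the variance is the covariance of a sample with itself."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_z = math.sqrt(sample_covariance(z, z))
    return sample_covariance(x, z) / (s_x * s_z)


advertising = [2, 5, 1, 3, 4, 1, 5, 3, 4, 2]      # x_i
sales = [50, 57, 41, 54, 54, 38, 63, 48, 59, 46]  # z_i

s_xz = sample_covariance(advertising, sales)
r_xz = sample_correlation(advertising, sales)
print(round(s_xz, 4), round(r_xz, 2))
```

Unlike the covariance, the correlation coefficient does not depend on the units of either variable, which is why it is the easier of the two to interpret.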
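Example 7:
Let $z_i = 2x_i$, $i = 1, 2, 3, 4, 5$.

$x_i$   1   2   3   4   5
$z_i$   2   4   6   8   10

Then,
$\bar{x} = 3$, $\bar{z} = 6$,
$s_x = \sqrt{\dfrac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}} = \sqrt{\dfrac{5}{2}}$, $\quad s_z = \sqrt{\dfrac{\sum_{i=1}^{5} (z_i - \bar{z})^2}{5 - 1}} = \sqrt{10}$,
$s_{xz} = \dfrac{\sum_{i=1}^{5} (x_i - \bar{x})(z_i - \bar{z})}{5 - 1} = 5$.
Thus, $r_{xz} = \dfrac{s_{xz}}{s_x s_z} = \dfrac{5}{\sqrt{\dfrac{5}{2}}\,\sqrt{10}} = 1$.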
Note: when there is a perfect positive linear relationship between the variables
x and z, $r_{xz} = 1$. A value of $r_{xz}$ close to 1 might indicate a positive linear
relationship.
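The first part of this note can be verified directly from the definitions. The derivation below is a sketch for a general linear relationship $z_i = a + b x_i$ with $b > 0$; the constants $a$ and $b$ are notation introduced here, and $s_x > 0$ is assumed so that the ratio is defined.

```latex
\begin{align*}
z_i = a + b x_i,\ b > 0
  &\;\Longrightarrow\; \bar{z} = a + b\bar{x}, \quad z_i - \bar{z} = b\,(x_i - \bar{x}), \\
s_{xz} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, b\,(x_i - \bar{x})}{n - 1} = b\, s_x^2, \\
s_z &= \sqrt{\frac{\sum_{i=1}^{n} b^2 (x_i - \bar{x})^2}{n - 1}} = b\, s_x, \\
r_{xz} &= \frac{s_{xz}}{s_x s_z} = \frac{b\, s_x^2}{s_x \cdot b\, s_x} = 1 .
\end{align*}
```

Example 7 is the case $a = 0$, $b = 2$. For $b < 0$ the same steps give $s_z = |b|\, s_x$ and hence $r_{xz} = -1$.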