Slides Prepared by
Juei-Chao Chen
Fu Jen Catholic University
© 2006 by Thomson Learning, a division of Thomson Asia Pte Ltd..
Chapter 3
Descriptive Statistics: Numerical Measures
Part B
3.3 Measures of Distribution Shape, Relative
Location, and Detecting Outliers
3.4 Exploratory Data Analysis
3.5 Measures of Association Between Two
Variables
3.6 The Weighted Mean and Working with
Grouped Data
3.3 Measures of Distribution Shape,
Relative Location, and Detecting Outliers
• Distribution Shape
• z-Scores
• Chebyshev’s Theorem
• Empirical Rule
• Detecting Outliers
Distribution Shape: Skewness
• An important measure of the shape of a
distribution is called skewness.
• The formula for computing skewness for a data
set is somewhat complex.
• Note: The formula for the skewness of sample data is
$\text{skewness} = \frac{n}{(n-1)(n-2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3$
• Skewness can be easily computed using
statistical software.
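As a rough illustration of the sample skewness formula, the following Python sketch computes it directly. The function name and both data lists are hypothetical and chosen only to show the sign of the result; this is not the textbook's own software.

```python
import math

def sample_skewness(data):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least three observations")
    mean = sum(data) / n
    # Sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in data) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in data)

symmetric = [1, 2, 3, 4, 5]            # hypothetical symmetric data
right_tail = [1, 2, 2, 3, 3, 3, 10]    # hypothetical data with a long right tail

print(sample_skewness(symmetric))      # essentially zero
print(sample_skewness(right_tail))     # positive
```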
Distribution Shape: Skewness
• Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram of a symmetric distribution; skewness = 0]
Distribution Shape: Skewness
• Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram of a moderately left-skewed distribution; skewness = -.31]
Distribution Shape: Skewness
• Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a moderately right-skewed distribution; skewness = .31]
Distribution Shape: Skewness
• Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: relative frequency histogram of a highly right-skewed distribution; skewness = 1.25]
Distribution Shape: Skewness
• Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a small college town.
The monthly rent prices for these apartments are listed in ascending order
on the next slide.
Distribution Shape: Skewness
425  430  430  435  435  435  435  435  440  440
440  440  440  445  445  445  445  445  450  450
450  450  450  450  450  460  460  460  465  465
465  470  470  472  475  475  475  480  480  480
480  485  490  490  490  500  500  500  500  510
510  515  525  525  525  535  549  550  570  570
575  575  580  590  600  600  600  600  615  615
Distribution Shape: Skewness
[Figure: relative frequency histogram of the apartment rents; skewness = .92]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value xi is from the mean.
$z_i = \frac{x_i - \bar{x}}{s}$
z-Scores
• An observation’s z-score is a measure of the
relative location of the observation in a data set.
• A data value less than the sample mean will
have a z-score less than zero.
• A data value greater than the sample mean
will have a z-score greater than zero.
• A data value equal to the sample mean will
have a z-score of zero.
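A minimal sketch of the three cases above, using the apartment rent sample statistics quoted in these slides (mean 490.80, standard deviation 54.74). The function name `z_score` is an illustrative choice.

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / s

MEAN, S = 490.80, 54.74  # apartment rent sample statistics from the slides

for rent in (425, 490.80, 615):
    print(rent, round(z_score(rent, MEAN, S), 2))
```

A rent below the mean gives a negative z-score, a rent equal to the mean gives zero, and a rent above the mean gives a positive z-score.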
z-Scores
• z-Score of Smallest Value (425)
$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$
Standardized Values for Apartment Rents
-1.20  -1.11  -1.11  -1.02  -1.02  -1.02  -1.02  -1.02  -0.93  -0.93
-0.93  -0.93  -0.93  -0.84  -0.84  -0.84  -0.84  -0.84  -0.75  -0.75
-0.75  -0.75  -0.75  -0.75  -0.75  -0.56  -0.56  -0.56  -0.47  -0.47
-0.47  -0.38  -0.38  -0.34  -0.29  -0.29  -0.29  -0.20  -0.20  -0.20
-0.20  -0.11  -0.01  -0.01  -0.01   0.17   0.17   0.17   0.17   0.35
 0.35   0.44   0.62   0.62   0.62   0.81   1.06   1.08   1.45   1.45
 1.54   1.54   1.63   1.81   1.99   1.99   1.99   1.99   2.27   2.27
Chebyshev’s Theorem
At least $(1 - 1/z^2)$ of the items in any data set will be
within z standard deviations of the mean, where z is
any value greater than 1.
Chebyshev’s Theorem
• At least 75% of the data values must be within z = 2 standard deviations of the mean.
• At least 89% of the data values must be within z = 3 standard deviations of the mean.
• At least 94% of the data values must be within z = 4 standard deviations of the mean.
Chebyshev’s Theorem
For example:
Let z = 1.5 with $\bar{x}$ = 490.80 and s = 54.74.
At least $(1 - 1/(1.5)^2)$ = 1 - 0.44 = 0.56 or 56%
of the rent values must be between
$\bar{x} - z(s)$ = 490.80 - 1.5(54.74) = 409
and
$\bar{x} + z(s)$ = 490.80 + 1.5(54.74) = 573
(Actually, 86% of the rent values
are between 409 and 573.)
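The comparison between the Chebyshev lower bound and the observed share can be sketched in Python as follows. The 70 rent values are the ones listed in the apartment rents example; the function names are illustrative assumptions.

```python
import statistics

# The 70 apartment rents from the example, in ascending order.
RENTS = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

def chebyshev_bound(z):
    """Minimum share of values within z standard deviations (z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2

def share_within(data, z):
    """Observed share of values within z sample standard deviations of the mean."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    low, high = mean - z * s, mean + z * s
    return sum(low <= x <= high for x in data) / len(data)

print(round(chebyshev_bound(1.5), 2))      # guaranteed minimum share
print(round(share_within(RENTS, 1.5), 2))  # observed share in the rent data
```

The bound holds for any data set, so the observed share is typically well above it.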
Empirical Rule
For data having a bell-shaped distribution:
• 68.27% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
• 95.45% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
• 99.73% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
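These percentages can be checked against the standard normal distribution, for which the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)). The sketch below uses only Python's standard library; the function name is an illustrative choice. The exact values round to 68.27%, 95.45%, and 99.73%; printed tables with fewer digits can differ in the last decimal place.

```python
import math

def normal_share_within(k):
    """P(|Z| < k) for a standard normal random variable Z."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, f"{100 * normal_share_within(k):.2f}%")
```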
Empirical Rule
[Figure: bell-shaped curve showing the areas within 1, 2, and 3 standard deviations of the mean; horizontal axis x marked at μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ, μ + 3σ]
Detecting Outliers
• An outlier is an unusually small or unusually
large value in a data set.
• A data value with a z-score less than -3 or
greater than +3 might be considered an outlier.
• It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included
in the data set
• a correctly recorded data value that belongs
in the data set
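A minimal sketch of the z-score screening rule, assuming the |z| > 3 criterion described above. The data set is hypothetical (19 values near 50 plus one extreme value) and the function name is illustrative.

```python
import statistics

def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation
    return [x for x in data if abs((x - mean) / s) > threshold]

# Hypothetical data: 19 ordinary values and one suspiciously large one.
data = [48, 49, 50, 51, 52] * 3 + [48, 49, 50, 51, 200]

print(z_score_outliers(data))
```

A flagged value is only a candidate: it still has to be checked to see whether it was recorded incorrectly, was included by mistake, or is a legitimate observation.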
Detecting Outliers
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier,
there are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20  -1.11  -1.11  -1.02  -1.02  -1.02  -1.02  -1.02  -0.93  -0.93
-0.93  -0.93  -0.93  -0.84  -0.84  -0.84  -0.84  -0.84  -0.75  -0.75
-0.75  -0.75  -0.75  -0.75  -0.75  -0.56  -0.56  -0.56  -0.47  -0.47
-0.47  -0.38  -0.38  -0.34  -0.29  -0.29  -0.29  -0.20  -0.20  -0.20
-0.20  -0.11  -0.01  -0.01  -0.01   0.17   0.17   0.17   0.17   0.35
 0.35   0.44   0.62   0.62   0.62   0.81   1.06   1.08   1.45   1.45
 1.54   1.54   1.63   1.81   1.99   1.99   1.99   1.99   2.27   2.27
3.4 Exploratory Data Analysis
• Five-Number Summary
• Box Plot
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary
• Example: Monthly Starting Salaries for a
sample of 12 Business School Graduates
• Five-Number Summary
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Q1 = 2865    Q2 = 2905 (Median)    Q3 = 3000
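The quartiles above can be reproduced with the following sketch. It assumes a common textbook location rule, which these slides do not state: compute i = (p/100)n; if i is an integer, average the values in positions i and i + 1, otherwise round i up and take the value in that position. Statistical packages often use other interpolation rules and may give slightly different quartiles. Function names are illustrative.

```python
def percentile(sorted_data, p):
    """p-th percentile (p an integer) using the assumed location rule."""
    n = len(sorted_data)
    if (p * n) % 100 == 0:
        i = p * n // 100  # i is an integer: average positions i and i + 1
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    i = p * n // 100 + 1  # otherwise round up to the next position
    return sorted_data[i - 1]

def five_number_summary(data):
    """Smallest value, Q1, median, Q3, largest value."""
    xs = sorted(data)
    return (xs[0], percentile(xs, 25), percentile(xs, 50), percentile(xs, 75), xs[-1])

salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]

print(five_number_summary(salaries))
```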
Five-Number Summary
• Example: Apartment Rents
Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
Box Plot
• A box is drawn with its ends located at the first
and third quartiles.
• A vertical line is drawn in the box at the location
of the median (second quartile).
[Figure: box plot on an axis from 375 to 625 in steps of 25, with Q1 = 445, Q2 = 475, Q3 = 525]
Box Plot
• Limits are located (not drawn) using the
interquartile range (IQR).
• Data outside these limits are considered
outliers.
• The location of each outlier is shown with the
symbol * .
Box Plot
• The lower limit is located 1.5(IQR) below Q1, where
IQR = Q3 - Q1 = 525 - 445 = 80.
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
• The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
• There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
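The limit calculation can be sketched in Python using the apartment rent quartiles from the five-number summary (Q1 = 445, Q3 = 525), which give IQR = 525 - 445 = 80. The function name is an illustrative choice.

```python
def box_plot_limits(q1, q3):
    """Lower and upper limits located 1.5 * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

lower, upper = box_plot_limits(445, 525)
print(lower, upper)

# The smallest and largest rents are 425 and 615, so both lie inside the limits.
smallest, largest = 425, 615
print(smallest < lower or largest > upper)  # False means no outliers
```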
Box Plot
• Whiskers (dashed lines) are drawn from the
ends of the box to the smallest and largest data
values inside the limits.
[Figure: box plot with whiskers on an axis from 375 to 625; smallest value inside limits = 425, largest value inside limits = 615]
Box Plot
• Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
• Box Plot
3.5 Measures of Association
Between Two Variables
• Covariance
• Correlation Coefficient
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$    for samples
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$    for populations
Covariance
• Example: Sample Data for the Stereo and
Sound Equipment Store
• Data
Covariance
• Scatter Diagram for the Stereo and Sound
Equipment Store
• Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$
Covariance
• Partitioned Scatter Diagram for the Stereo
and Sound Equipment Store
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
Correlation Coefficient
The correlation coefficient is computed as follows:
$r_{xy} = \frac{s_{xy}}{s_x s_y}$    for samples
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$    for populations
Correlation Coefficient
Correlation is a measure of linear association and not
necessarily causation.
Just because two variables are highly correlated, it
does not mean that one variable is the cause of the
other.
Covariance and Correlation Coefficient
A golfer is interested in investigating
the relationship, if any, between driving
distance and 18-hole score.
Average Driving      Average
Distance (yds.)      18-Hole Score
    277.6                69
    259.5                71
    269.1                70
    267.0                70
    255.6                71
    272.9                69
Covariance and Correlation Coefficient
              x        y     (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
            277.6     69       10.65       -1.0          -10.65
            259.5     71       -7.45        1.0           -7.45
            269.1     70        2.15        0              0
            267.0     70        0.05        0              0
            255.6     71      -11.35        1.0          -11.35
            272.9     69        5.95       -1.0           -5.95
Average     266.95    70.0                     Total     -35.40
Std. Dev.   8.2192    .8944
Covariance and Correlation Coefficient
• Sample Covariance
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$
• Sample Correlation Coefficient
$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$
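The golf calculation can be reproduced with a short Python sketch using the six pairs of driving distances and scores from the example. The function names are illustrative assumptions.

```python
import math

def sample_covariance(xs, ys):
    """Sum of cross-products of deviations divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))  # covariance of x with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                       # average 18-hole score

print(round(sample_covariance(distance, score), 2))
print(round(sample_correlation(distance, score), 4))
```

The negative sign indicates that longer average drives go with lower scores in this sample; it does not by itself show that one causes the other.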
The Weighted Mean and
Working with Grouped Data
• Weighted Mean
• Mean for Grouped Data
• Variance for Grouped Data
• Standard Deviation for Grouped Data
Weighted Mean
• When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
• In the computation of a grade point average
(GPA), the weights are the number of credit
hours earned for each grade.
• When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
Weighted Mean
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$
where:
xi = value of observation i
wi = weight for observation i
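A minimal sketch of the weighted mean, applied to a grade point average. The grade points and credit hours below are hypothetical values chosen for illustration, and the function name is an assumption.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

grade_points = [4.0, 3.0, 2.0]  # hypothetical grades (A, B, C)
credit_hours = [3, 4, 3]        # hypothetical credit hours used as weights

print(weighted_mean(grade_points, credit_hours))  # (12 + 12 + 6) / 10
```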
Grouped Data
• The weighted mean computation can be
used to obtain approximations of the mean,
variance, and standard deviation for the
grouped data.
• To compute the weighted mean, we treat the
midpoint of each class as though it were the
mean of all items in the class.
• We compute a weighted mean of the class
midpoints using the class frequencies as weights.
• Similarly, in computing the variance and
standard deviation, the class frequencies are
used as weights.
Mean for Grouped Data
• Sample Data
$\bar{x} = \frac{\sum f_i M_i}{n}$
• Population Data
$\mu = \frac{\sum f_i M_i}{N}$
where:
fi = frequency of class i
Mi = midpoint of class i
Sample Mean for Grouped Data
Given below is the previous sample of monthly rents for 70 efficiency
apartments, presented here as grouped data in the form of a frequency
distribution.
Rent ($)    Frequency
420-439         8
440-459        17
460-479        12
480-499         8
500-519         7
520-539         4
540-559         2
560-579         4
580-599         2
600-619         6
Sample Mean for Grouped Data
Rent ($)     f_i     M_i      f_i M_i
420-439       8     429.5     3436.0
440-459      17     449.5     7641.5
460-479      12     469.5     5634.0
480-499       8     489.5     3916.0
500-519       7     509.5     3566.5
520-539       4     529.5     2118.0
540-559       2     549.5     1099.0
560-579       4     569.5     2278.0
580-599       2     589.5     1179.0
600-619       6     609.5     3657.0
Total        70              34525.0
$\bar{x} = \frac{34{,}525}{70} = 493.21$
This approximation differs by $2.41 from the actual sample
mean of $490.80.
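The grouped-data mean can be sketched in Python from the frequency distribution alone. Each class is stored as (lower limit, upper limit, frequency), and the midpoint is taken as the average of the two class limits; the names used here are illustrative.

```python
# (lower limit, upper limit, frequency) for each rent class in the example.
CLASSES = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

def grouped_mean(classes):
    """Weighted mean of class midpoints, with class frequencies as weights."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (low + high) / 2 for low, high, f in classes)
    return total / n

print(round(grouped_mean(CLASSES), 2))
```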
Variance for Grouped Data
• For sample data
$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$
• For population data
$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$
Sample Variance for Grouped Data
Rent ($)     f_i     M_i     M_i - x̄    (M_i - x̄)^2    f_i (M_i - x̄)^2
420-439       8     429.5     -63.7       4058.96         32471.71
440-459      17     449.5     -43.7       1910.56         32479.59
460-479      12     469.5     -23.7        562.16          6745.97
480-499       8     489.5      -3.7         13.76           110.11
500-519       7     509.5      16.3        265.36          1857.55
520-539       4     529.5      36.3       1316.96          5267.86
540-559       2     549.5      56.3       3168.56          6337.13
560-579       4     569.5      76.3       5820.16         23280.66
580-599       2     589.5      96.3       9271.76         18543.53
600-619       6     609.5     116.3      13523.36         81140.18
Total        70                                          208234.29
Sample Variance for Grouped Data
• Sample Variance
s2 = 208,234.29/(70 – 1) = 3,017.89
• Sample Standard Deviation
$s = \sqrt{3{,}017.89} = 54.94$
This approximation differs by only $.20
from the actual standard deviation of $54.74.
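The grouped-data variance and standard deviation can be sketched the same way, using class midpoints and frequencies from the rent frequency distribution. The names are illustrative, and the sample formula with divisor n - 1 is used.

```python
import math

# (lower limit, upper limit, frequency) for each rent class in the example.
RENT_CLASSES = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]

def grouped_sample_variance(classes):
    """Sum of f * (midpoint - grouped mean)^2 divided by n - 1."""
    n = sum(f for _, _, f in classes)
    mids = [((low + high) / 2, f) for low, high, f in classes]
    mean = sum(f * m for m, f in mids) / n
    return sum(f * (m - mean) ** 2 for m, f in mids) / (n - 1)

variance = grouped_sample_variance(RENT_CLASSES)
print(round(variance, 2))
print(round(math.sqrt(variance), 2))
```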
End of Chapter 3, Part B