covariance

Descriptive Statistics: Numerical Measures
• Measures of association (between two or more variables)
• Weighted mean and grouped data
• Measures of shape of a distribution, relative location, and outliers
Measures of Association between Two Variables
• Covariance
• Correlation Coefficient
Covariance
The covariance is a measure of the direction of movement
and linear association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance between two random variables X and Y is computed as follows:

For populations:
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$

For samples:
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
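The two covariance formulas differ only in the divisor. A minimal Python sketch, assuming the paired observations are held in two equal-length lists; the function name `covariance`, its `sample` flag, and the small data set are illustrative and not taken from the slides:

```python
def covariance(xs, ys, sample=True):
    """Return the covariance of paired observations.

    Divides by n - 1 when sample=True (sample covariance s_xy)
    and by N otherwise (population covariance sigma_xy).
    """
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of the deviations from each mean
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1 if sample else n)


# Hypothetical data: y rises with x, so the covariance is positive
xs = [1, 2, 3, 4]
ys = [2, 4, 6, 8]
print(covariance(xs, ys))                # sample covariance
print(covariance(xs, ys, sample=False))  # population covariance
```

Reversing one of the lists flips the sign of the result, which is the "direction of movement" the covariance captures.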
Correlation Coefficient
Correlation is a measure of the degree of linear association
between two variables.
However, correlation does not indicate causation: two variables can be
highly correlated without one being the cause of the other.
Correlation Coefficient
The correlation coefficient is computed as follows:

For samples:
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

For populations:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
Example
A student is interested in investigating the relationship, if any,
between driving distance and 18-hole score on a golf course.

Driving Distance (yds.)   18-Hole Score
277.6                     69
259.5                     71
269.1                     70
267.0                     70
255.6                     71
272.9                     69

Compute the covariance and correlation between distance and score.
x        y     (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
277.6    69     10.65       -1.0        -10.65
259.5    71     -7.45        1.0         -7.45
269.1    70      2.15        0            0
267.0    70      0.05        0            0
255.6    71    -11.35        1.0        -11.35
272.9    69      5.95       -1.0         -5.95
Total                                   -35.40

Average     267.0    70.0
Std. Dev.   8.2192   .8944

The deviations in x are taken from the unrounded mean, 266.95; the table shows it rounded to 267.0.
Covariance and Correlation Coefficient
Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$
Sample Correlation Coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$
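The golf example can be checked with a few lines of Python. This is a sketch that applies the sample formulas directly to the six observations; the variable names are illustrative:

```python
from math import sqrt

# Driving distance (yds.) and 18-hole score from the example
distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
score = [69, 71, 70, 70, 71, 69]

n = len(distance)
mean_x = sum(distance) / n
mean_y = sum(score) / n

# Sample covariance: sum of cross-products of deviations over n - 1
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(distance, score)) / (n - 1)

# Sample standard deviations
s_x = sqrt(sum((x - mean_x) ** 2 for x in distance) / (n - 1))
s_y = sqrt(sum((y - mean_y) ** 2 for y in score) / (n - 1))

# Sample correlation coefficient
r_xy = s_xy / (s_x * s_y)

print(f"s_xy = {s_xy:.2f}")
print(f"r_xy = {r_xy:.4f}")
```

The negative sign says that longer drives go with lower scores, and a magnitude close to 1 says the linear relationship is strong.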
Descriptive Statistics for Grouped Data
(Mean, Variance and Standard Deviation)
Suppose a student has taken five courses during the last semester. The
following table shows the credit hours and grade for each course.
Compute the student's GPA for the semester.

Course       Credit Hours   Grade
Calculus     4              B
Psychology   3              A
Marketing    3              C
Economics    3              D
Stat         2              A
Weighted Mean
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
where:
x_i = value of observation i
w_i = weight for observation i
For the GPA example, the credit hours are the weights and the grade
values (A = 4, B = 3, C = 2, D = 1) are the observations:

Course       Credit Hours (w_i)   Grade (x_i)   Grade Points (w_i × x_i)
Calculus     4                    B (3)         12
Psychology   3                    A (4)         12
Marketing    3                    C (2)          6
Economics    3                    D (1)          3
Stat         2                    A (4)          8
Total        15                                 41
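Semester GPA
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i} = \frac{12 + 12 + 6 + 3 + 8}{4 + 3 + 3 + 3 + 2} = \frac{41}{15} \approx 2.73$$
where:
x_i = value of observation i (the grade value of course i)
w_i = weight for observation i (the credit hours of course i)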
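The GPA calculation is a direct application of the weighted mean formula. A minimal sketch, assuming the usual grade values A = 4, B = 3, C = 2, D = 1 and using the credit hours (4 + 3 + 3 + 3 + 2 = 15) as weights; the function name `weighted_mean` is illustrative:

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Calculus, Psychology, Marketing, Economics, Stat
grade_values = [3, 4, 2, 1, 4]   # B, A, C, D, A
credit_hours = [4, 3, 3, 3, 2]   # weights

gpa = weighted_mean(grade_values, credit_hours)
print(f"GPA = {gpa:.2f}")
```

An unweighted mean of the five grade values would give 14/5 = 2.8; weighting by credit hours gives the four-credit Calculus grade more influence than the two-credit Stat grade.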
Weighted Mean
• When the mean is computed by giving each data value a weight that
reflects its importance, it is referred to as a weighted mean.
• The weighted mean is often used when data values vary in importance.
The weights are chosen to reflect the importance of each value.
Working with Grouped Data
Given below is a sample of monthly rents ($) for 70 efficiency apartments.
Compute the mean and variance of the data.

425  440  450  465  480  510  575
430  440  450  470  485  515  575
430  440  450  470  490  525  580
435  445  450  472  490  525  590
435  445  450  475  490  525  600
435  445  460  475  500  535  600
435  445  460  475  500  549  600
435  445  460  480  500  550  600
440  450  465  480  500  570  615
440  450  465  480  510  570  615
What if the data came organized in this format? (Grouped Data)

Rent ($)   Frequency
420-439    8
440-459    17
460-479    12
480-499    8
500-519    7
520-539    4
540-559    2
560-579    4
580-599    2
600-619    6
Computing the mean and variance of grouped data
• To compute the mean from grouped data, we treat the midpoint of each
class as though it were the mean of all items in the class.
• We compute a weighted mean of the class midpoints, using the class
frequencies as weights.
• Similarly, in computing the variance and standard deviation, the class
frequencies are used as weights.
Mean for Grouped Data
• Sample data:
$$\bar{x} = \frac{\sum f_i M_i}{n}$$
• Population data:
$$\mu = \frac{\sum f_i M_i}{N}$$
where:
f_i = frequency of class i
M_i = midpoint of class i
Sample Mean for Grouped Data
The sample of monthly rents for 70 efficiency apartments is given as
grouped data, in the form of a frequency distribution.
Rent ($)   f_i   M_i     f_i M_i
420-439    8     429.5    3436.0
440-459    17    449.5    7641.5
460-479    12    469.5    5634.0
480-499    8     489.5    3916.0
500-519    7     509.5    3566.5
520-539    4     529.5    2118.0
540-559    2     549.5    1099.0
560-579    4     569.5    2278.0
580-599    2     589.5    1179.0
600-619    6     609.5    3657.0
Total      70            34525.0

$$\bar{x} = \frac{34{,}525}{70} = 493.21$$
This approximation differs by $2.41 from the actual sample mean of $490.80.
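The grouped-mean calculation can be sketched in Python as a frequency-weighted mean of class midpoints. The tuple layout and the function name `grouped_mean` are illustrative choices, not part of the slides:

```python
# (lower class limit, upper class limit, frequency) for each rent class
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]


def grouped_mean(classes):
    """Weighted mean of class midpoints, with class frequencies as weights."""
    n = sum(f for _, _, f in classes)
    # Each class is represented by its midpoint M_i = (lower + upper) / 2
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total / n


mean_rent = grouped_mean(classes)
print(f"grouped mean = {mean_rent:.2f}")
```

The result is an approximation because every rent in a class is treated as if it equalled the class midpoint.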
Variance for Grouped Data
• For sample data:
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$
• For population data:
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$
Sample Variance for Grouped Data

Rent ($)   f_i   M_i     M_i - x̄   (M_i - x̄)^2   f_i (M_i - x̄)^2
420-439    8     429.5   -63.7      4058.96        32471.71
440-459    17    449.5   -43.7      1910.56        32479.59
460-479    12    469.5   -23.7       562.16         6745.97
480-499    8     489.5    -3.7        13.76          110.11
500-519    7     509.5    16.3       265.36         1857.55
520-539    4     529.5    36.3      1316.96         5267.86
540-559    2     549.5    56.3      3168.56         6337.13
560-579    4     569.5    76.3      5820.16        23280.66
580-599    2     589.5    96.3      9271.76        18543.53
600-619    6     609.5   116.3     13523.36        81140.18
Total      70                                     208234.29
Sample Variance for Grouped Data
• Sample variance:
$$s^2 = \frac{208{,}234.29}{70 - 1} = 3{,}017.89$$
• Sample standard deviation:
$$s = \sqrt{3{,}017.89} = 54.94$$
This approximation differs by only $.20
from the actual standard deviation of $54.74.
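The grouped variance and standard deviation follow the same pattern as the grouped mean. A sketch under the same assumptions (class midpoints stand in for the individual rents; the function name is illustrative):

```python
from math import sqrt

# (lower class limit, upper class limit, frequency) for each rent class
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8),
    (500, 519, 7), (520, 539, 4), (540, 559, 2), (560, 579, 4),
    (580, 599, 2), (600, 619, 6),
]


def grouped_sample_variance(classes):
    """Sample variance from a frequency distribution, using class midpoints."""
    n = sum(f for _, _, f in classes)
    midpoints = [((lo + hi) / 2, f) for lo, hi, f in classes]
    mean = sum(f * m for m, f in midpoints) / n
    # Frequency-weighted sum of squared deviations of midpoints from the mean
    ss = sum(f * (m - mean) ** 2 for m, f in midpoints)
    return ss / (n - 1)


variance = grouped_sample_variance(classes)
std_dev = sqrt(variance)
print(f"s^2 = {variance:.2f}, s = {std_dev:.2f}")
```

For population data the only change would be dividing by N instead of n - 1.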
Other Minor Sub-Topics in this Chapter
• Shape of a Distribution
• z-Scores (Standardized Values)
• Chebyshev's Theorem
• Empirical Rule
• Detecting Outliers
Shape of a Distribution: Skewness
• An important measure of the shape of a distribution is called skewness.
• The formula for computing the skewness of a data set is somewhat complex:
$$S = \frac{E\left[(X - \mu)^3\right]}{\sigma_x^3}$$
Distribution Shape: Skewness
Skewness (S):
$$S = \frac{E\left[(X - \mu)^3\right]}{\sigma_x^3}$$
• Is a measure of the asymmetry of a probability distribution
• S = 0: the distribution is symmetrical
• S > 0: the distribution is right (positively) skewed
• S < 0: the distribution is left (negatively) skewed
Distribution Shape: Skewness
Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
[Figure: relative frequency histogram (vertical axis from 0 to .35) of a symmetric distribution; skewness = 0]
Distribution Shape: Skewness
Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
[Figure: relative frequency histogram (vertical axis from 0 to .35) of a moderately left-skewed distribution; skewness = -.31]
Distribution Shape: Skewness
Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
[Figure: relative frequency histogram (vertical axis from 0 to .35) of a highly right-skewed distribution; skewness = 1.25]
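The skewness formula S = E[(X - μ)^3] / σ^3 can be sketched in Python by treating a data set as the whole population, so that the expectations become plain averages. This is an assumption: statistical software often applies a sample correction, so its figures may differ slightly. The function name and the data sets are hypothetical:

```python
def skewness(data):
    """Population skewness: third central moment divided by sigma cubed."""
    n = len(data)
    mu = sum(data) / n
    variance = sum((x - mu) ** 2 for x in data) / n
    if variance == 0:
        raise ValueError("skewness is undefined when all values are equal")
    third_moment = sum((x - mu) ** 3 for x in data) / n
    return third_moment / variance ** 1.5


# Hypothetical data sets
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]        # long tail to the right
left_skewed = [-x for x in right_skewed]  # mirror image: long tail to the left

print(skewness(symmetric))
print(skewness(right_skewed))
print(skewness(left_skewed))
```

Cubing the deviations preserves their sign, so a few large values on one side of the mean dominate the sum and set the sign of S.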
Standardizing Values (z-Score)
$$z_i = \frac{x_i - \bar{x}}{s}$$
The z-score measures how many standard deviations a given data value
lies from the mean. For this reason, a z-score is also called a
standardized value.
z-Score (Standardized Value)
$$z_i = \frac{x_i - \bar{x}}{s}$$
• A data value less than the sample mean will always have a z-score
less than zero.
• A data value greater than the sample mean will always have a z-score
greater than zero.
• A data value equal to the sample mean will always have a z-score of zero.
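As an illustration of standardizing, the sketch below computes z-scores for the six driving distances from the golf example, using the sample standard deviation. The function name `z_scores` is illustrative:

```python
from statistics import mean, stdev


def z_scores(data):
    """Return (x_i - x_bar) / s for every value, with s the sample std. dev."""
    x_bar = mean(data)
    s = stdev(data)  # divides by n - 1
    if s == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - x_bar) / s for x in data]


distances = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]
z = z_scores(distances)
for x, zi in zip(distances, z):
    print(f"{x:6.1f}  z = {zi:+.2f}")
```

Values above the mean get positive z-scores, values below it get negative ones, and the z-scores of a data set always sum to zero.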
Chebyshev's Theorem
For any data set, once the values are standardized:
• At least 75% of the data values lie within 2 standard deviations
of the mean (|z| ≤ 2).
• At least 89% of the data values lie within 3 standard deviations
of the mean (|z| ≤ 3).
• At least 94% of the data values lie within 4 standard deviations
of the mean (|z| ≤ 4).
The theorem that guarantees these minimum proportions for any data set,
whatever the shape of its distribution, is known as Chebyshev's Theorem.
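The three percentages above come from the standard statement of Chebyshev's Theorem: at least 1 - 1/z^2 of the values lie within z standard deviations of the mean, for any z greater than 1. A short sketch (the function name is illustrative):

```python
def chebyshev_lower_bound(z):
    """Minimum share of values within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("the bound is only informative for z > 1")
    return 1 - 1 / z ** 2


for z in (2, 3, 4):
    print(f"within {z} standard deviations: at least {chebyshev_lower_bound(z):.2%}")
```

These are guaranteed minimums that hold for any distribution, so the actual share in a given data set is usually higher.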
Empirical Rule
For data with a bell-shaped distribution:
• 68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
• 95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
• 99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
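[Figure: Empirical Rule. Normal curve over the x axis with μ - 3σ, μ - 2σ, μ - 1σ, μ, μ + 1σ, μ + 2σ and μ + 3σ marked; 68.26% of values lie within μ ± 1σ, 95.44% within μ ± 2σ, and 99.72% within μ ± 3σ]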
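The Empirical Rule percentages can be reproduced from the standard normal distribution. A sketch using Python's `statistics.NormalDist`; the helper name `share_within` is illustrative. The computed values can differ from the slide figures in the last digit, most likely because the slide figures come from a rounded normal table:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def share_within(k):
    """Share of a normal distribution within k standard deviations of its mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within +/- {k} standard deviations: {share_within(k):.2%}")
```

Unlike Chebyshev's bounds, these shares apply only when the distribution is (approximately) normal.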
z-Scores Allow Us to Detect Outliers
• An outlier is an unusually small or unusually large value in a data set.
• It might be the result of:
   • an incorrectly recorded data value
   • a value that was incorrectly included in the data set
   • a correctly recorded data value that is simply an unusual occurrence
• A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
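The z-score rule for outliers can be sketched as a simple filter. The data set below is hypothetical (fifteen values near 50 plus one value of 120), chosen only to show the rule flagging a value; the function name and the `threshold` parameter are illustrative:

```python
from statistics import mean, stdev


def z_score_outliers(data, threshold=3.0):
    """Return the values whose z-score is below -threshold or above +threshold."""
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation
    if s == 0:
        return []  # all values equal: nothing can be an outlier
    return [x for x in data if abs((x - x_bar) / s) > threshold]


# Hypothetical data
values = [48, 50, 52, 49, 51, 50, 47, 53, 50, 49, 51, 50, 52, 48, 50, 120]
outliers = z_score_outliers(values)
print(outliers)
```

A flagged value should be checked rather than deleted automatically, since it may be a correctly recorded but unusual observation.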
Other Methods of Data Description
Five-Number Summary and Box Plot
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary
For the sample of monthly rents for 70 efficiency apartments:
Lowest Value = 425
First Quartile = 445
Median = 475
Third Quartile = 525
Largest Value = 615
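The five-number summary of the rent data can be reproduced in Python. The slides do not state how the quartiles are located, so this sketch assumes one common textbook rule: compute the position i = (p/100)n; if i is not a whole number, round up; if it is, average the values at positions i and i + 1. Other conventions can give slightly different quartiles. Function names are illustrative:

```python
import math

# Monthly rents ($) for the 70 efficiency apartments
rents = [
    425, 440, 450, 465, 480, 510, 575,
    430, 440, 450, 470, 485, 515, 575,
    430, 440, 450, 470, 490, 525, 580,
    435, 445, 450, 472, 490, 525, 590,
    435, 445, 450, 475, 490, 525, 600,
    435, 445, 460, 475, 500, 535, 600,
    435, 445, 460, 475, 500, 549, 600,
    435, 445, 460, 480, 500, 550, 600,
    440, 450, 465, 480, 500, 570, 615,
    440, 450, 465, 480, 510, 570, 615,
]


def percentile(sorted_data, p):
    """p-th percentile (0 < p < 100) under the assumed position rule."""
    n = len(sorted_data)
    i = p / 100 * n
    if i == int(i):
        # Whole-number position: average the i-th and (i + 1)-th values
        i = int(i)
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Otherwise round the position up (positions are 1-based)
    return sorted_data[math.ceil(i) - 1]


def five_number_summary(data):
    s = sorted(data)
    return (s[0], percentile(s, 25), percentile(s, 50), percentile(s, 75), s[-1])


summary = five_number_summary(rents)
print(summary)
```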
Box Plot
• A box is drawn with its ends located at the first and third quartiles.
• A vertical line is drawn in the box at the location of
the median (second quartile).
[Figure: box plot on an axis from 375 to 625; the box runs from Q1 = 445 to Q3 = 525, with a vertical line at Q2 = 475]
Box Plot
• Lower and upper limits are located (not drawn) using
the interquartile range (IQR = Q3 - Q1).
• Data outside these limits are considered outliers.
• The location of each outlier is shown with the symbol *.
Box Plot
For the apartment rent data, IQR = Q3 - Q1 = 525 - 445 = 80.
• The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
• The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
• There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
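The limit calculation can be sketched in a few lines of Python. With Q1 = 445 and Q3 = 525 the interquartile range is 525 - 445 = 80. The function name and the multiplier parameter `k` are illustrative:

```python
def box_plot_limits(q1, q3, k=1.5):
    """Return (lower limit, upper limit) located k * IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles and extremes of the apartment rent data
q1, q3 = 445, 525
smallest, largest = 425, 615

lower, upper = box_plot_limits(q1, q3)
print(f"IQR = {q3 - q1}, lower limit = {lower}, upper limit = {upper}")

# No value lies outside the limits if the extremes are inside them
has_outliers = smallest < lower or largest > upper
print("outliers present:", has_outliers)
```

Because both the smallest and the largest rent fall inside the limits, the whiskers run all the way to 425 and 615.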
Box Plot
• Whiskers (dashed lines) are drawn from the ends of the box to the
smallest and largest data values inside the limits.
[Figure: box plot on an axis from 375 to 625, with whiskers extending to the smallest value inside the limits (425) and the largest value inside the limits (615)]