Descriptive Statistics: Numerical Measures Measures of association (between two or more variables) Weighted mean and Grouped Data Measures of Shape of a Distribution, Relative Location and Outliers Measures of Association between Two Variables Covariance Correlation Coefficient Covariance Covariance The covariance is a measure of the direction of movement and linear association between two variables. Positive values indicate a positive relationship. Negative values indicate a negative relationship. Covariance Covariance between two random variables ( X and Y) is computed as follows: for populations for samples ( xi x )( yi y ) sxy n 1 xy ( x )( y ) i x N i y Correlation Coefficient Correlation Coefficient Correlation is a measure of the degree of linear association between two variables. However, it doesn’t indicate the causation. That is, just because two variables are highly correlated, it does not mean that one variable is the cause of the other. Correlation Coefficient The correlation coefficient is computed as follows: rxy sxy sx s y for samples xy xy x y for populations Correlation Coefficient The coefficient can take on values between -1 and +1. Values near -1 indicate a strong negative linear relationship. Values near +1 indicate a strong positive linear relationship. Example-- A student is interested in investigating the relationship, if any, between driving distance and the 18-hole scores on a golf course. Driving Distance (yds.) 277.6 259.5 269.1 267.0 255.6 272.9 18-Hole Score 69 71 70 70 71 69 Compute the covariance and correlation between distance and score x y 277.6 259.5 269.1 267.0 255.6 272.9 69 71 70 70 71 69 Average 267.0 70.0 Std. Dev. 8.2192 .8944 ( xi x ) ( yi y ) ( xi x )( yi y ) 10.65 -7.45 2.15 0.05 -11.35 5.95 -1.0 1.0 0 0 1.0 -1.0 -10.65 -7.45 0 0 -11.35 -5.95 Total -35.40 Covariance and Correlation Coefficient Sample Covariance sxy ( x x )( y y ) 35.40 i i n1 61 7.08 Sample Correlation Coefficient sxy 7.08 rxy -.9631 sx sy (8.2192)(.8944) Descriptive Statistics for Grouped Data (Mean, Variance and Standard Deviation) Suppose a student has taken five courses during the last semester. The following table depicts the credit hours associated with each course and the grades. Compute the student’s GPA for the semester. Courses Calculus Psychology Marketing Economics Stat Credit Hours 4 3 3 3 2 Grade B A C D A Weighted Mean wx x w i i i where: xi = value of observation i wi = weight for observation i Suppose a student has taken five courses during the last semester. The following table depicts the credit hours associated with each course and the grades. Compute the student’s GPA for the semester Courses Calculus Psychology Marketing Economics Stat Credit Hours (Wi) 4 3 3 3 2 13 Grade B(3) A(4) C(2) D(1) A(4) Grade Points (Wi X G) 12 12 6 3 8 41 Semester GPA wx x w i i i where: xi = value of observation i wi = weight for observation i Weighted Mean When the mean is computed after giving each data value a weight that reflects its importance, it is referred to as a weighted mean. Weighted mean is computed often when data values vary in importance. The weights are often chosen to best reflect the importance of each value. Working with Grouped Data Given below is a sample of monthly rents for 70 efficiency apartments. Compute the mean and variance of the data 425 440 450 465 480 510 575 430 440 450 470 485 515 575 430 440 450 470 490 525 580 435 445 450 472 490 525 590 435 445 450 475 490 525 600 435 445 460 475 500 535 600 435 445 460 475 500 549 600 435 445 460 480 500 550 600 440 450 465 480 500 570 615 440 450 465 480 510 570 615 What if the data came organized in this format? (Grouped Data) Rent ($) 420-439 440-459 460-479 480-499 500-519 520-539 540-559 560-579 580-599 600-619 Frequency 8 17 12 8 7 4 2 4 2 6 Computing the mean and variance of a grouped data To compute the weighted mean from a grouped data we treat the midpoint of each class as though it were the mean of all items in the class. We compute a weighted mean of the data using class midpoints and class frequencies as weights. Similarly, in computing the variance and standard deviation, the class frequencies are used as weights. Mean for Grouped Data Sample Data fM x i i n Population Data fM i i N where: fi = frequency of class i Mi = midpoint of class i Sample Mean for Grouped Data Given below is a sample of monthly rents for 70 efficiency apartments as grouped data--- in the form of a frequency distribution. Rent ($) 420-439 440-459 460-479 480-499 500-519 520-539 540-559 560-579 580-599 600-619 Frequency 8 17 12 8 7 4 2 4 2 6 Sample Mean for Grouped Data Rent ($) 420-439 440-459 460-479 480-499 500-519 520-539 540-559 560-579 580-599 600-619 Total fi 8 17 12 8 7 4 2 4 2 6 70 Mi 429.5 449.5 469.5 489.5 509.5 529.5 549.5 569.5 589.5 609.5 f iMi 3436.0 7641.5 5634.0 3916.0 3566.5 2118.0 1099.0 2278.0 1179.0 3657.0 34525.0 34,525 x 493.21 70 This approximation differs by $2.41 from the actual sample mean of $490.80. Variance for Grouped Data For sample data 2 f ( M x ) i i s2 n 1 For population data 2 f ( M ) i i 2 N 34,525 x 493.21 70 Sample Variance for Grouped Data Rent ($) 420-439 440-459 460-479 480-499 500-519 520-539 540-559 560-579 580-599 600-619 Total fi 8 17 12 8 7 4 2 4 2 6 70 Mi 429.5 449.5 469.5 489.5 509.5 529.5 549.5 569.5 589.5 609.5 Mi - x -63.7 -43.7 -23.7 -3.7 16.3 36.3 56.3 76.3 96.3 116.3 (M i - x )2 f i (M i - x )2 4058.96 32471.71 1910.56 32479.59 562.16 6745.97 13.76 110.11 265.36 1857.55 1316.96 5267.86 3168.56 6337.13 5820.16 23280.66 9271.76 18543.53 13523.36 81140.18 208234.29 Sample Variance for Grouped Data Sample Variance s2 = 208,234.29/(70 – 1) = 3,017.89 Sample Standard Deviation s 3,017.89 54.94 This approximation differs by only $.20 from the actual standard deviation of $54.74. Minor Other Sub-Topics in this Chapter Shape of a Distribution z-Scores (Standardized Values) Chebyshev’s Theorem Empirical Rule Detecting Outliers Shape of a Distribution: Skewness An important measure of the shape of a distribution is called skewness. The formula for computing the skewness of a data set is somewhat complex. S E( X - ) 3 x 3 Distribution Shape: Skewness Skewness (S): S E( X - ) 3 3 x Is a measure of the asymmetry of a probability distribution S=0: Symmetrical S>0: the distribution is right (positively) skewed S<0: the distribution is left (negatively) skewed Distribution Shape: Skewness Symmetric (not skewed) • Skewness is zero. • Mean and median are equal. .35 Relative Frequency .30 .25 .20 .15 .10 .05 0 Skewness = 0 Distribution Shape: Skewness Moderately Skewed Left Skewness is negative. Mean will usually be less than the median. Relative Frequency .35 .30 .25 .20 .15 .10 .05 0 Skewness = .31 Distribution Shape: Skewness Highly Skewed Right • Skewness is positive (often above 1.0). • Mean will usually be more than the median. Relative Frequency .35 .30 .25 .20 .15 .10 .05 0 Skewness = 1.25 Standardizing Values (Z-Score) xi x zi s Z-Score is a measure of the number of standard units (deviations) a given data value is located from the mean. As a result, z-score is called a standardized value. z-Score (Standardized Value) xi x zi s A data value less than the sample mean will always have a z-score less than zero A data value greater than the sample mean will always have a z-score greater than zero. A data value equal to the sample mean will always have a z-score of zero. For any data set: When standardized, At least 75% of the data values lie within z = 2 standard deviations of the mean. At least 89% of the data values lie within z = 3 standard deviations of the mean. At least 94% of the data values lie within z = 4 standard deviations of the mean. A theorem that describes the position of a certain proportion of observation in any data set with the above pattern of distribution (after the data values are standardized) is known as …… Chebyshev’s Theorem For a data Empirical with a bell-shaped Rule distribution: 68.26% of the values of a normal random variable are within +/- 1 standard deviation of its mean. 95.44% of the values of a normal random variable are within +/- 2 standard deviations of its mean. 99.72% of the values of a normal random variable are within +/- 3 standard deviations of its mean. Empirical Rule 99.72% 95.44% 68.26% – 3 – 1 – 2 + 3 + 1 + 2 x Z-Scores allow us to detect Outliers An outlier is an unusually small or unusually large value in a data set. It might be the result of: • an incorrect recording • an incorrectly included value in the data set • a correctly recorded data value but an unusual occurrence A data value with a z-score less than -3 or greater than +3 might be considered an outlier. Other methods of data Description Five-Number Summary and Box Plot Five-Number Summary 1 Smallest Value 2 First Quartile 3 Median 4 Third Quartile 5 Largest Value Five-Number Summary Lowest Value = 425 First Quartile = 445 Median = 475 Third Quartile = 525 425 440 450 465 480 510 575 430 440 450 470 485 515 575 430 440 450 470 490 525 580 435 445 450 472 490 525 590 435 445 450 475 490 525 600 Largest Value = 615 435 445 460 475 500 535 600 435 445 460 475 500 549 600 435 445 460 480 500 550 600 440 450 465 480 500 570 615 440 450 465 480 510 570 615 Box Plot A box is drawn with its ends located at the first and third quartiles. A vertical line is drawn in the box at the location of the median (second quartile). 375 400 425 450 475 500 525 550 575 600 625 Q1 = 445 Q3 = 525 Q2 = 475 Box Plot Lower and upper Limits are located (not drawn) using the interquartile range (IQR). Data outside these limits are considered outliers. The locations of each outlier is shown with the symbol * Box Plot The lower limit is located 1.5(IQR) below Q1. Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(75) = 332.5 The upper limit is located 1.5(IQR) above Q3. Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(75) = 637.5 There are no outliers (values less than 332.5 or greater than 637.5) in the apartment rent data. Box Plot Whiskers (dashed lines) are drawn from the ends of the box to the smallest and largest data values inside the limits. 375 400 425 450 475 500 525 550 575 600 625 Smallest value inside limits = 425 Largest value inside limits = 615