Introduction to Probability and Statistics I
Lecture 5: Chebyshev's Theorem and Exercises

Review Example
A cruise agency records the number of weekly specials to the Caribbean: 20, 73, 75, 80, 82. Compute the mean, median and mode, and interpret your results.

Review Example: Summary Statistics
Data: 20, 73, 75, 80, 82
Mean: x̄ = Σx_i / n = 330 / 5 = 66
Median: middlemost observation = 75
Mode: no unique mode exists
The median best describes the data because of the outlier 20, which skews the distribution to the left. The manager should first check whether the value 20 is correct.

Review Example: Stocks vs. Treasury Bills
Annual % returns over seven years:
Common stocks: 4.0, 14.3, 19.0, −14.7, −26.5, 37.2, 23.8
Treasury bills: 6.5, 4.4, 3.8, 6.9, 8.0, 5.8, 5.1
Means:
Stocks: μ = Σx_i / N = 57.1 / 7 ≈ 8.16
T-bills: μ = Σx_i / N = 40.5 / 7 ≈ 5.79
The mean annual % return on stocks is higher than the return on U.S. Treasury bills.

Review Example: Variability of the Returns
Stocks:
σ = √( Σ(x_i − μ)² / N )
  = √( [(4.0 − 8.16)² + (14.3 − 8.16)² + (19.0 − 8.16)² + (−14.7 − 8.16)² + (−26.5 − 8.16)² + (37.2 − 8.16)² + (23.8 − 8.16)²] / 7 )
  ≈ 20.65
T-bills:
σ = √( [(6.5 − 5.8)² + (4.4 − 5.8)² + (3.8 − 5.8)² + (6.9 − 5.8)² + (8.0 − 5.8)² + (5.8 − 5.8)² + (5.1 − 5.8)²] / 7 )
  ≈ 1.36
The variability of the U.S. Treasury bill returns is much smaller than that of the stock returns.

Chebyshev's Theorem
For any population with mean μ and standard deviation σ, and for any k > 1, the percentage of observations that fall within the interval μ ± kσ is at least 100[1 − (1/k²)]%.

Chebyshev's Theorem (continued)
Regardless of how the data are distributed, at least (1 − 1/k²) of the values will fall within k standard deviations of the mean (for k > 1).
Examples:
k = 1: at least (1 − 1/1²) = 0% of values lie within μ ± 1σ
k = 2: at least (1 − 1/2²) = 75% of values lie within μ ± 2σ
k = 3: at least (1 − 1/3²) ≈ 89% of values lie within μ ± 3σ

The Empirical Rule
If the data distribution is bell-shaped, then:
μ ± 1σ contains about 68% of the values in the population or the sample
μ ± 2σ contains about 95% of the values in the population or the sample
μ ± 3σ contains about 99.7% of the values in the population or the sample

Coefficient of Variation
Measures relative variation; always expressed as a percentage (%)
Shows variation relative to the mean
Can be used to compare two or more sets of data measured in different units
CV = (s / x̄) × 100%

Review Example
A random sample of data has mean 75 and variance 25.
(a) Use Chebyshev's theorem to determine the percentage of observations between 65 and 85.
(b) If the data are mounded (bell-shaped), use the empirical rule to find the approximate percentage of observations between 65 and 85.

Review Example: Solution
The standard deviation is s = √25 = 5, so 65 and 85 lie 2 standard deviations from the mean (75 ± 2·5).
Chebyshev's theorem: the proportion must be at least 100[1 − (1/k²)]% = 100[1 − (1/2²)]% = 75%.
Empirical rule: approximately 95% of the observations lie within 2 standard deviations of the mean.
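As a quick numerical companion to the review example above (not part of the original slides), the short Python sketch below computes the Chebyshev lower bound for the interval 65 to 85 and then checks the empirical-rule figure against a simulated bell-shaped sample; the sample size and random seed are arbitrary choices.

```python
import numpy as np

mean, variance = 75, 25
sd = variance ** 0.5                 # s = 5
lower, upper = 65, 85                # interval from the review example
k = (upper - mean) / sd              # k = 2 standard deviations

# Chebyshev: at least 100 * (1 - 1/k^2) % of observations lie within mean +/- k*sd
chebyshev_pct = 100 * (1 - 1 / k**2)
print(f"Chebyshev lower bound for k = {k:.0f}: at least {chebyshev_pct:.0f}%")

# Empirical-rule check: simulate a bell-shaped (normal) sample and count how
# many observations actually fall between 65 and 85 (expect roughly 95%).
rng = np.random.default_rng(0)                      # arbitrary seed
sample = rng.normal(mean, sd, size=100_000)         # arbitrary sample size
inside = np.mean((sample >= lower) & (sample <= upper))
print(f"Simulated share inside [{lower}, {upper}]: {100 * inside:.1f}%")
```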
Comparing Coefficients of Variation
Stock A: average price last year = $50, standard deviation = $5
CV_A = (s / x̄) × 100% = ($5 / $50) × 100% = 10%
Stock B: average price last year = $100, standard deviation = $5
CV_B = (s / x̄) × 100% = ($5 / $100) × 100% = 5%
Both stocks have the same standard deviation, but stock B is less variable relative to its price.

Weighted Mean
The weighted mean of a set of data is
x̄_w = Σ w_i x_i / Σ w_i = (w_1 x_1 + w_2 x_2 + … + w_n x_n) / Σ w_i
where w_i is the weight of the i-th observation.
Use when the data are already grouped into n classes, with weights w_i for the i-th class.

Approximations for Grouped Data
Suppose a data set contains values m_1, m_2, …, m_K occurring with frequencies f_1, f_2, …, f_K.
For a population of N observations, the mean is μ = Σ f_i m_i / N, where N = Σ f_i
For a sample of n observations, the mean is x̄ = Σ f_i m_i / n, where n = Σ f_i
For a population of N observations, the variance is σ² = Σ f_i (m_i − μ)² / N
For a sample of n observations, the variance is s² = Σ f_i (m_i − x̄)² / (n − 1)

The Sample Covariance
The covariance measures the strength of the linear relationship between two variables.
Population covariance: Cov(x, y) = σ_xy = Σ (x_i − μ_x)(y_i − μ_y) / N
Sample covariance: Cov(x, y) = s_xy = Σ (x_i − x̄)(y_i − ȳ) / (n − 1)
The covariance is concerned only with the strength of the relationship; no causal effect is implied.

Interpreting Covariance
Cov(x, y) > 0: x and y tend to move in the same direction
Cov(x, y) < 0: x and y tend to move in opposite directions
Cov(x, y) = 0: x and y have no linear relationship (zero covariance does not by itself imply independence)

Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables.
Population correlation coefficient: ρ = Cov(x, y) / (σ_X σ_Y)
Sample correlation coefficient: r = Cov(x, y) / (s_X s_Y)

Features of the Correlation Coefficient r
Unit free
Ranges between −1 and +1
The closer to −1, the stronger the negative linear relationship
The closer to +1, the stronger the positive linear relationship
The closer to 0, the weaker any linear relationship

Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots illustrating correlation coefficients r = −1, −0.6, 0, +0.3, +1, and 0]

Interpreting the Result
[Figure: scatter plot of Test #1 score (x-axis, 70–100) against Test #2 score (y-axis, 70–100), r = 0.733]
There is a relatively strong positive linear relationship between test score #1 and test score #2.
Students who scored high on the first test tended to score high on the second test.

Obtaining Linear Relationships
An equation can be fit to show the best linear relationship between two variables:
Y = β_0 + β_1 X
where Y is the dependent variable and X is the independent variable.

Least Squares Regression
Estimates for the coefficients β_0 and β_1 are found by minimizing the sum of the squared residuals.
The least-squares regression line, based on sample data, is
ŷ = b_0 + b_1 x
where b_1 is the slope of the line and b_0 is the y-intercept:
b_1 = Cov(x, y) / s_x² = r (s_y / s_x)
b_0 = ȳ − b_1 x̄
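Before turning to the regression review example below, here is a minimal Python sketch (added, not from the original slides) of the weighted-mean/grouped-data formulas and the coefficient of variation from earlier in this lecture; the frequency table is made up purely for illustration.

```python
from math import sqrt

# Hypothetical frequency table: class midpoints m_i with frequencies f_i.
# (The numbers are made up purely to illustrate the grouped-data formulas.)
midpoints   = [5, 15, 25, 35, 45]
frequencies = [4, 10, 16, 8, 2]

n = sum(frequencies)                                        # n = sum of f_i

# Grouped-data (and weighted) mean: x_bar = sum(f_i * m_i) / n
x_bar = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped-data sample variance: s^2 = sum(f_i * (m_i - x_bar)^2) / (n - 1)
s2 = sum(f * (m - x_bar) ** 2 for f, m in zip(frequencies, midpoints)) / (n - 1)
s = sqrt(s2)

# Coefficient of variation: CV = (s / x_bar) * 100%
cv = 100 * s / x_bar
print(f"n = {n}, mean = {x_bar:.2f}, s = {s:.2f}, CV = {cv:.1f}%")
```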
Review Example: Plywood Demand
The following data give X, the price charged per piece of plywood ($), and Y, the quantity sold (in thousands of pieces):
(6, 80) (7, 60) (8, 70) (9, 40) (10, 0)
(a) Compute the covariance.
(b) Compute the correlation coefficient.
(c) Compute and interpret the regression coefficients.
(d) What quantity of plywood is expected to be sold if the price were $7 per piece?

Review Example: Worked Calculations
 x_i   y_i   x_i − x̄   y_i − ȳ   (x_i − x̄)²   (y_i − ȳ)²   (x_i − x̄)(y_i − ȳ)
   6    80       −2        30            4          900                  −60
   7    60       −1        10            1          100                  −10
   8    70        0        20            0          400                    0
   9    40        1       −10            1          100                  −10
  10     0        2       −50            4         2500                 −100
 Sum    40   250      0         0           10         4000                 −180

From the table:
x̄ = 40/5 = 8.00, ȳ = 250/5 = 50.00
s_x² = 10/(5 − 1) = 2.5, s_x = 1.5811
s_y² = 4000/(5 − 1) = 1000, s_y = 31.623
(a) Covariance: Cov(x, y) = s_xy = −180/(5 − 1) = −45
(b) Correlation coefficient: r = Cov(x, y)/(s_x s_y) = −45/(1.5811 × 31.623) = −0.900, indicating a strong negative linear association between price and quantity sold.
(c) Regression coefficients:
b_1 = Cov(x, y)/s_x² = −45/2.5 = −18.0
For a one-dollar increase in the price per piece of plywood, the quantity sold is estimated to decrease by 18 thousand pieces.
b_0 = ȳ − b_1 x̄ = 50.0 − (−18)(8.0) = 194.00
(d) Expected quantity at a price of $7 per piece:
ŷ = b_0 + b_1 x = 194.00 − 18.0(7) = 68 (thousand pieces)

Summary
Described measures of central tendency: mean, median, mode
Illustrated the shape of the distribution: symmetric, skewed
Described measures of variation: range, interquartile range, variance and standard deviation, coefficient of variation
Discussed measures of grouped data
Calculated measures of relationships between variables: covariance and correlation coefficient
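As a closing check (added here, not part of the original slides), the following minimal Python sketch reproduces the plywood review example: sample covariance −45, correlation −0.900, slope −18, intercept 194, and a predicted quantity of 68 thousand pieces at a price of $7.

```python
from math import sqrt

# Plywood review example: x = price per piece ($), y = quantity sold (thousands)
x = [6, 7, 8, 9, 10]
y = [80, 60, 70, 40, 0]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n                                  # 8.0, 50.0

# Sample covariance and sample variances
s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)  # -45.0
s_x2 = sum((a - x_bar) ** 2 for a in x) / (n - 1)                      # 2.5
s_y2 = sum((b - y_bar) ** 2 for b in y) / (n - 1)                      # 1000.0

r = s_xy / sqrt(s_x2 * s_y2)            # correlation = -0.900
b1 = s_xy / s_x2                        # slope       = -18.0
b0 = y_bar - b1 * x_bar                 # intercept   = 194.0

print(f"Cov = {s_xy}, r = {r:.3f}, b1 = {b1}, b0 = {b0}")
print(f"Predicted quantity at $7: {b0 + b1 * 7:.0f} thousand pieces")
```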