Describing and Summarizing Data (Section 3)

I. One variable

A. Measures of Central Tendency (measures of central location)

Measure of Central Tendency- A measure that attempts to find the middle or "typical" value of the data
- There are many measures of central tendency. We will concentrate on 3 of them (mean, median, and mode)

1. Means

- Population Mean-
  \[ \mu_x = \frac{1}{N} \sum_{i=1}^{N} x_i \]
  Example

- Sample Mean-
  \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]
  Example

- Weighted Means- Sometimes the data you have found groups the observations into classes; then you will have to use a weighted mean. Here C is the number of classes, f_i is the number of observations in class i, and m_i is the class's middle value.

- Population Weighted Mean-
  \[ \mu_x = \frac{1}{N} \sum_{i=1}^{C} f_i m_i \]

- Sample Weighted Mean-
  \[ \bar{x} = \frac{1}{n} \sum_{i=1}^{C} f_i m_i \]

Example- Sample of years it takes to graduate from college

  Years    Number of people
  0-3       3
  3-6      80
  6-9      13
  9-12      4

2. Medians

Median- The middle point in the data.
Ex 1: 1, 2, 3, 4, 5, 6, 7       Median=
Ex 2: 1, 2, 3, 4, 5, 6, 7, 8    Median=

Why might the median be a better measure of "average" income than the mean?
Ex: Calculate the mean and median of the following incomes:
$25,000 / $30,000 / $30,000 / $35,000 / $35,000 / $40,000 / $2,000,000

Median Class- The first class to have a cumulative relative frequency of 50% or more.
Ex: Sample of years it takes to graduate from college

  Years    Number of People    Relative Frequency    Cumulative Relative Frequency
  0-3       3
  3-6      80
  6-9      13
  9-12      4

What is the median class?

3. Mode

Mode- Value that occurs most often
Ex: 1, 2, 2, 3, 4, 7, 7, 9, 11    Mode=

Modal Class- Class in a frequency distribution that has the most occurrences in it.
What is our modal class in the above example?

B. Other measures of location

1. Percentile- The pth percentile is a value such that p percent of the observations fall at or below the value and (100-p) percent fall at or above the value.
Exs:
90th percentile:
80th percentile on a test:

Calculating the pth percentile:
1. Arrange the data from smallest to largest.
2. Compute the index i:
   \[ i = \left( \frac{p}{100} \right) n \]
   where p is the percentile of interest and n is the number of observations.
3. a) If i is not an integer, round up.
   b) If i is an integer, add 1 to i.

Ex: 200, 190, 170, 210, 220, 120, 140, 250
What numbers are in the 60th percentile?

2. Quartiles- Dividing the data into four parts (using three lines)
Q1  First quartile (25th percentile)
Q2  Second quartile (50th percentile, also the median)
Q3  Third quartile (75th percentile)

Ex: 5 10 11 12 13 19 23 25 28 29 30 33
Use your eyes to divide the numbers into quartiles.
Use the percentile calculation to divide the above into quartiles.

Divide the histogram below into quartiles:
[Histogram: Frequency by Bin]

C. Skewness

- Skewness- A measure of the degree of asymmetry of a distribution.
  Positive (Right) Skewed: Mean > Median
  Negative (Left) Skewed: Mean < Median
  Symmetric Distribution

D. Measures of Dispersion

- Measures of Dispersion- Measures of the variability of a distribution (often used to measure risk)

1. Range- Difference between the smallest and largest numbers in a data set
Example-

2. Interquartile Range
Interquartile Range = Q3 - Q1
Ex: Use the previous example from quartiles to calculate the interquartile range.

3. Variance

- Population Variance:
  \[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

- Sample Variance:
  \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Ex- Sample of drinks per week of law students at IU: 0, 1, 2, 3, 4

- Why is sample variance divided by n-1 instead of n?

4. Standard Deviation

- Population Standard Deviation: \[ \sigma = \sqrt{\sigma^2} \]
- Sample Standard Deviation: \[ s = \sqrt{s^2} \]

Example- Find the standard deviation for our previous sample.

Why do we look at standard deviation instead of just looking at variance?
What does a standard deviation tell us?
Why would this be important if our data is returns of a particular stock?

                Stock 1    Stock 2
  Year 1        10%        -10%
  Year 2        10%        40%
  Year 3        10%        0%
  Avg Return

5. Coefficient of Variation

\[ \text{Coefficient of Variation} = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100 \quad \text{or} \quad \frac{s}{\bar{x}} \times 100 \quad \text{or} \quad \frac{\sigma}{\mu} \times 100 \]

Why?
It gets rid of units of measurement. The standard deviation gets larger as the magnitude of the values used in the calculation gets larger, so an unadjusted comparison of the standard deviations of stocks is not a good way to compare their "riskiness".

Ex 1: August 17, 1998 issue of Fortune Magazine
Mean return of Legg Mason Value Primary Fund is 29.4% and the standard deviation is 17.3%.
Mean return of the Reynolds Blue Chip Growth Fund is 23.7% and its standard deviation is 17.5%.
Calculate the coefficient of variation for both:
Interpret:
The coefficient of variation is the best measure of risk we have studied thus far.

Ex 2:
Stock 1 returns: 5%, 6%, 7%
Stock 2 returns: 7%, 10%, 13%
Calculate the mean & standard deviation of both (assume population):
If you measured risk using standard deviation (or variance or range), which one would you say is more risky?
Calculate the coefficient of variation for each stock:
What does the coefficient of variation say is the riskiest stock?
Why does this not make sense?

6. Variance and Standard Deviation for Grouped Data:

- Population Variance:
  \[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{C} f_i (m_i - \mu)^2 \]
  How would you write this if you had relative frequency?

- Sample Variance:
  \[ s^2 = \frac{1}{n-1} \sum_{i=1}^{C} f_i (m_i - \bar{x})^2 \]

- Population Standard Deviation: σ =
- Sample Standard Deviation: s =

Example- Sample of years it takes to graduate from college

  Years    Number of people
  0-3       3
  3-6      80
  6-9      13
  9-12      4

Calculate sample variance:

E. Measures of relative location & detecting outliers

1. Z-scores- Number of standard deviations our observation (x_i) is away from the mean.
\[ z_i = \frac{x_i - \bar{x}}{s} \]
The z-score can be interpreted as a measure of relative location in a data set.

Ex: The mean return of a stock is 15% and the standard deviation is 5%; we get a return of 10%. Observe how many standard deviations our return is from the mean. Confirm your answer by calculating a z-score.

2.
General Rules for Bell-shaped Distributions

Looking at this distribution, would you believe that you were 5 standard deviations away from the mean?

F. Descriptive (Summary) Statistics in Excel

Click: Tools → Data Analysis → Descriptive Statistics
Click the Summary Statistics box

  Column1
  Mean                  5.5
  Standard Error        0.507194614
  Median                5.5
  Mode                  3
  Standard Deviation    2.484736011
  Sample Variance       6.173913043
  Kurtosis              -0.938233756
  Skewness              0
  Range                 9
  Minimum               1
  Maximum               10
  Sum                   132
  Count                 24

These are the summary statistics for the histogram we used earlier:
[Histogram: Frequency by Bin]

Why does the mean equal the median?
Why is skewness = 0?
Draw 1 standard deviation above and below the mean on the histogram.
Draw 2 standard deviations above and below the mean on the histogram.
Would you believe someone who told you that you were 1 standard deviation below the mean?
Would you believe someone who told you that you were 2 standard deviations below the mean?

Other commands in Excel:
  Mean:                  =AVERAGE()
  Median:                =MEDIAN()
  Mode:                  =MODE()
  Population variance:   =VARP()
  Sample variance:       =VAR()
  Square root:           =SQRT()

II. Association between two variables

A. Covariance

Population Covariance-
\[ \sigma_{xy} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y) \]

Sample Covariance-
\[ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \]

Ex 1: Calculate the population covariance for the following two variables

  Day    Food Sales $    Drink Sales $
  1      2000            200
  2       500            800
  3      1000            500
  4       600            700
  5       900            300

Negative Covariance:
  Loose Interpretation:
  More Technical Interpretation:
Positive Covariance:
  Loose Interpretation:
  More Technical Interpretation:

Ex 2: Calculate the sample covariance for the following two variables

  Day    Drink Sales $    Ice Cream Sales $
  1      200               50
  2      800              200
  3      500              100
  4      700              150
  5      300                0

Interpret your result:

B.
Correlation Coefficient

Population correlation coefficient:
\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]

Sample correlation coefficient:
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]

• The value of the correlation coefficient is between -1 and 1 (including -1 and 1)
  +1:        Perfect positive linear relationship between the two variables
  -1:        Perfect negative linear relationship between the two variables
  + number:  Positive linear relationship between the two variables
  - number:  Negative linear relationship between the two variables

Ex 1: Calculate the population correlation coefficient for the following two variables

  Day    Food Sales $    Drink Sales $
  1      2000            200
  2       500            800
  3      1000            500
  4       600            700
  5       900            300

Ex 2: Calculate the sample correlation coefficient for the following two variables

  Day    Drink Sales $    Ice Cream Sales $
  1      200               50
  2      800              200
  3      500              100
  4      700              150
  5      300                0

C. Scatter Plots, Covariance, & Correlation Coefficient

[Scatter plots, five panels: Perfect Positive Linear Relationship; Positive Linear Relationship; Perfect Negative Linear Relationship; Negative Linear Relationship; Almost No Linear Relationship]

Additional Excel commands:
  Population Covariance:    =COVAR(,)
  Sample Covariance:        =(n/(n-1))*COVAR(,)   where n is the number of observations you have
  Correlation Coefficient:  =CORREL(,)

Formulas for the Section
Below, write down the formulas for this section in an organized manner that will help you remember them.

PRACTICE QUESTIONS

1. Assume the following data is a population: (Make sure that you can do the following calculations by hand)
1, 5, 6, 9, 200, 9, 6, 2, 5, 17, 25
a. Calculate the mean
b. Calculate the mode
c. Calculate the median
d. What numbers are in the 40th percentile?
e. What is the skewness of the data?
f. Calculate the range
g. Calculate the variance
h. Calculate the standard deviation
i. Calculate the coefficient of variation

2.
Assume the following data is a sample: (Make sure that you can do the following calculations by hand)
223, 699, 1222, 845, 111, 3
a. Calculate the mean
b. Calculate the mode
c. Calculate the median
d. What is the skewness of the data?
e. Calculate the range
f. Calculate the variance
g. Calculate the standard deviation
h. Calculate the coefficient of variation

3. Assume the following data is a population: (Make sure that you can do the following calculations by hand)
Data on speeders

  MPH over the limit    Number of people
  0-5                    3
  10-15                 23
  15-20                 50
  20-25                 25

a. Calculate the mean
b. Calculate the mode
c. Which is the median class?
d. Calculate the variance
e. Calculate the standard deviation

4. Assume the following data is a sample (for stocks): (Make sure that you can do the following calculations by hand)

  Yield %                 Cumulative Relative Frequency
  -10% but under -5%      0.10
  -5% but under 0%        0.20
  0% but under 5%         0.45
  5% but under 10%        0.60
  10% but under 15%       0.85
  15% but under 20%       1.00

a. Calculate the mean
b. Calculate the mode
c. Which is the median class?

5. Answer the following questions using the histogram below:
[Histogram: Frequency (0 to 10) by Bin (20 to 100)]
a) Which way is the skew of the histogram?
b) What can you say about the relationship between the mean and median?

6. Answer the following questions using the histogram below:
[Histogram: Frequency (0 to 9) by Bin (20 to 100)]
a) Which way is the skew of the histogram?
b) What can you say about the relationship between the mean and median?

7. The mean of a data set is 17, the median is 15, the mode is 5, and the variance is 25. The data point you are looking at is 25. How many standard deviations away from the mean is this data point?

8. Use the population data below to answer the following questions:

  City    Amount of Rain (inches)    Ski Sales (thousands)
  A        17                        200
  B        44                        100
  C       100                         80
  D        50                         90
  E        60                         77

a. Calculate the covariance
b. Calculate the correlation coefficient

9. Use the sample data below to answer the following questions:

  Firm    % of Women Working for the Firm    Profits (millions)
  A        17                                2
  B        44                                1
  C       100                                7
  D        50                                0.5
  E        60                                3

a. Calculate the covariance
b. Calculate the correlation coefficient

10. Is a histogram used to describe one variable or two?

11. Is a scatter plot used to describe one variable or two?

Use the following table to answer the next 4 questions

  Weight         # of dogs    Relative frequency    Percent frequency    Cumulative frequency    Cumulative relative frequency (in %)
  0-50 lbs       15
  50-100 lbs     25
  100-150 lbs    10

  Cell labels: III, I, IV, II, V

12. Assume this is a population. Calculate the weighted mean. (round to the nearest tenth)
a. 47.5
b. 48.5
c. 70.0
d. 17.4
e. None of the above

13. Assume this is a population. Calculate the median class
a. 15 dogs
b. 25 dogs
c. 0-50 lbs
d. 50-100 lbs
e. 100-150 lbs

14. Assume this is a population. Calculate the modal class
a. 15 dogs
b. 25 dogs
c. 0-50 lbs
d. 50-100 lbs
e. 100-150 lbs

15. Assume this is a population. Calculate the variance (round to the nearest whole number)
a. 1,225
b. 1,250
c. 15,475
d. 15,791
e. None of the above

Use the following table to answer the next 2 questions

  Cost    Quality rating
  10      25
  30      90
  50      35

16. Assume the data is a population. Calculate the covariance
a. 1.3
b. 2.2
c. 66.7
d. 100.0
e. None of the above

17. Assume the data is a population. Calculate the correlation coefficient (round to the nearest 100th)
a. 0.00
b. 0.04
c. 0.07
d. 0.10
e. 0.14

18. The mean return of investment 1 is 2% and the variance is 100 percent².
The mean return of investment 2 is 50% and the variance is 10,000 percent².
According to the standard deviation, which stock is riskier?
a. Stock 1
b. Stock 2
c. They are equally risky
d. Neither of the stocks has any risk

19.
The mean return of investment 1 is 2% and the variance is 100 percent².
The mean return of investment 2 is 50% and the variance is 10,000 percent².
According to the coefficient of variation, which stock is riskier?
a. Stock 1
b. Stock 2
c. They are equally risky
d. Neither of the stocks has any risk

Questions from book:
Page 84: 8
Page 92: 18
Page 112: 47 & 48
Page 119: 58

ANSWERS
(Some of the answers may be incorrect; let me know if you think you found an incorrect answer)

1. a) 25.91 b) 5, 6, 9 c) 6 d) 6 and above e) right skewed (mean > median) f) 199 g) 3074.45 (remember it is a population) h) 55.45 i) 214

2. a) 517.17 b) none c) 461 d) right e) 1219 f) 230,640.17 (remember it is a sample) g) 480.25 h) 93

3. a) 17.15 MPH b) 15-20 MPH c) 15-20 MPH d) 18.44 MPH² e) 4.29

4. It is easier to work with relative frequency, so add a column:

  Yield %                 Cumulative Relative Frequency    Relative Frequency
  -10% but under -5%      0.10                             0.10
  -5% but under 0%        0.20                             0.10
  0% but under 5%         0.45                             0.25
  5% but under 10%        0.60                             0.15
  10% but under 15%       0.85                             0.25
  15% but under 20%       1.00                             0.15

a) 6.5% b) 0-5% or 10-15% (both classes have the highest relative frequency, 0.25) c) 5-10% (the first class whose cumulative relative frequency reaches 50%)

5. a) left skewed b) mean < median

6. a) right skewed b) mean > median

7. It is 1.6 standard deviations above the mean

8. a) -945.48 b) -0.76

9. a) 62.83 b) 0.80

10. One variable

11. Two variables

  Question    12    13    14    15    16    17    18    19
  Answer      C     D     D     A     C     E     B     A
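
Checking answers outside of Excel: the formulas from this section can also be checked with a few lines of code. The following is only a minimal sketch in Python (Python is not used in this handout, and the helper names grouped_mean_var, covariance, and correlation are made up for illustration). It recomputes practice questions 1, 3, and 8 using the population formulas and the grouped-data formulas given earlier, so the results can be compared with the answer key.

```python
import math
import statistics


def grouped_mean_var(midpoints, freqs, sample=False):
    """Weighted mean and variance for grouped data.

    midpoints are the class middle values m_i, freqs are the counts f_i.
    Divides by N for a population, or by n - 1 when sample=True.
    """
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
    sum_sq = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints))
    return mean, sum_sq / (n - 1 if sample else n)


def covariance(xs, ys, sample=False):
    """Population covariance by default, sample covariance when sample=True."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


def correlation(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations.

    The N (or n - 1) divisors cancel, so the population and sample
    versions give the same number.
    """
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


# Practice question 1 (treated as a population)
q1 = [1, 5, 6, 9, 200, 9, 6, 2, 5, 17, 25]
q1_mean = statistics.mean(q1)            # answer key: 25.91
q1_median = statistics.median(q1)        # answer key: 6
q1_var = statistics.pvariance(q1)        # answer key: 3074.45
q1_sd = math.sqrt(q1_var)                # answer key: 55.45
q1_cv = q1_sd / q1_mean * 100            # answer key: 214

# Practice question 3 (grouped population data, class midpoints as m_i)
q3_mean, q3_var = grouped_mean_var([2.5, 12.5, 17.5, 22.5], [3, 23, 50, 25])

# Practice question 8 (population covariance and correlation)
rain = [17, 44, 100, 50, 60]
ski_sales = [200, 100, 80, 90, 77]
q8_cov = covariance(rain, ski_sales)     # answer key: -945.48
q8_corr = correlation(rain, ski_sales)   # answer key: -0.76

print(round(q1_mean, 2), q1_median, round(q1_var, 2), round(q1_sd, 2), round(q1_cv))
print(round(q3_mean, 2), round(q3_var, 2))
print(round(q8_cov, 2), round(q8_corr, 2))
```

Passing sample=True to grouped_mean_var or covariance switches the divisor from N to n - 1, which is the only difference between the population and sample formulas in this section.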