Today:
- Common problems from assignment 1: Q1, 2, 3, 4.

Question 1. Consider the data set {15, 9, 7, 20, 4, 12, 8, 0, 31}

1f: Determine if this distribution is positively or negatively skewed. How do you know this works?

Common question: How do we know if it’s skewed without graphing the distribution?

First: What is skew?
A positive skew means that the most extreme values are high values.
A negative skew means that the most extreme values are low values.

Synthesis: What does this have to do with the other things we know about this dataset (from previous parts of the question)?
- Range
- Median
- Interquartile Range
- Outliers, if any
- Mean

The mean is ‘pulled’ by extreme values (sensitive to them).
The median is not affected by extreme values (robust to them).

Since the mean is ‘pulled’ by extreme values, it will be closer to the extreme values than the median will be.
So if Mean > Median, we have positive skew. If Mean < Median, we have negative skew.

For {15, 9, 7, 20, 4, 12, 8, 0, 31}
Mean = (0 + 4 + 7 + 8 + 9 + 12 + 15 + 20 + 31)/9 = 106/9 = 11.78
Median = the ½ × (9 + 1)th value, or the 5th value. Median = 9
Remember to sort the numbers from lowest to highest first!
Mean = 11.78 > 9 = Median, so we have positive skew.

The median is not pulled by extreme values.
Consider two datasets: {1,2,3,4,5} and {1,2,3,4,99999}
The median only cares how many numbers are above and below it; in both cases the median is 3. There are two values above the median and two below, and it doesn’t matter how far above or below they are.

The mean IS pulled by extremes.
Consider the same two datasets: {1,2,3,4,5} and {1,2,3,4,99999}
To find the mean, we get the total of the numbers and divide by how many there are. By changing the last number from 5 to 99,999 we change the total a great deal. We also change the mean from 3 to 20,001.8 (about 20,002).

I’m glad we solved that mystery.
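If you want to check the mean-versus-median rule by machine, here is a minimal Python sketch. It is only an illustration, not part of the assignment: the function name skew_direction is made up for this example, and it applies the rule of thumb from these slides (compare mean to median) rather than computing a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule of thumb from the slides: compare the mean to the median.

    skew_direction is a hypothetical helper name, not a library function.
    """
    m, med = mean(values), median(values)
    if m > med:
        return "positive"
    if m < med:
        return "negative"
    return "no skew indicated"

# Question 1 data set (median() sorts the values for us).
data = [15, 9, 7, 20, 4, 12, 8, 0, 31]
print(round(mean(data), 2), median(data), skew_direction(data))

# The median ignores how extreme the largest value is; the mean does not.
small = [1, 2, 3, 4, 5]
extreme = [1, 2, 3, 4, 99999]
print(median(small), median(extreme))
print(mean(small), mean(extreme))
```

The second half repeats the {1,2,3,4,5} versus {1,2,3,4,99999} comparison: the two medians match, while the two means are far apart.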
Question 2 (Part C) “Find the median”

Attitude Towards Lab Meat    Coded Attitude     f
Strongly Against                   1           23
Somewhat Against                   2           10
Slightly Against                   3            2
Neutral                            4           26
Slightly For                       5            8
Somewhat For                       6           17
Strongly For                       7           14
                                          N = 100

This is a frequency table. If we were to put the coded attitudes in a raw data set it would look like
{1,1,1,1,1,1,1,…,1,1,2,2,2,…,2,2,3,3,4,4,…,7,7,7,7}
“…” means skipped values.

“f” stands for “Frequency”, as in “how often”. In this case, 1 has a frequency of 23, meaning that 1 appears in the data set 23 times.
N at the bottom is the sample size. There are 100 values in this data set as a whole.

Attitude Towards Lab Meat    Coded Attitude     f     cf
Strongly Against                   1           23     23
Somewhat Against                   2           10     33
Slightly Against                   3            2     35
Neutral                            4           26     61
Slightly For                       5            8     69
Somewhat For                       6           17     86
Strongly For                       7           14    100
                                          N = 100

Since there are 100 values, the median is the ½ × (100 + 1) = 50.5th value, or between the 50th and 51st values. We find that using the cumulative frequency (part b of the question).

By the cumulative frequency (cf), we find there are 23 values of 1 or less, and 33 values of 2 or less.
Sorted, the 1st, 2nd, 3rd, …, 22nd, 23rd values are all 1’s, and the 24th, 25th, …, 32nd, 33rd values are all 2’s…
…and the 36th, 37th, 38th, …, 59th, 60th, 61st values are all 4’s (Neutral).

The 50th and 51st values are in the 4 (Neutral) range, so that’s where the median is, at Neutral.

All good? Onward then!

Question 2 (Parts D and E) “What assumption is needed to get the mean? Find the mean.”

Note: This opinion data is ordinal (part a). That means the mean doesn’t work for it. The mean is only for interval data.
But… interval data is just ordinal data that has even spaces between responses and no gaps.

Coded Attitude: 1 = Strongly Against, 2 = Somewhat Against, 3 = Slightly Against, 4 = Neutral, 5 = Slightly For, 6 = Somewhat For, 7 = Strongly For

In this question, that means we can use the mean IF we treat this data like interval data. We have to assume that the differences between the categories are evenly spaced.

That is, the difference between “Strongly Against” and “Somewhat Against” is assumed to be the same as the difference between “Neutral” and “Slightly For”. Both of these are a difference of one category.
If we do that, we also assume that the mean makes sense.

Finding the mean.

Attitude Towards Lab Meat    Coded Attitude     f
Strongly Against                   1           23
Somewhat Against                   2           10
Slightly Against                   3            2
Neutral                            4           26
Slightly For                       5            8
Somewhat For                       6           17
Strongly For                       7           14
                                          N = 100

The mean is the total of all the values divided by the number of values.
There are N = 100 values. 23 of those values are “1”, 10 are “2”, and so on.
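The idea that a frequency table is just a compressed raw data set (23 ones, 10 twos, and so on) can be sketched in Python. This is only a rough illustration: the variable names freq, raw and cf are chosen for this example, and it takes the long way round by rebuilding all 100 values so the 50th and 51st can be looked up directly.

```python
from itertools import accumulate
from statistics import median

# Coded attitude -> frequency (f), copied from the table.
freq = {1: 23, 2: 10, 3: 2, 4: 26, 5: 8, 6: 17, 7: 14}

# Cumulative frequency (cf): running total of f, lowest code first.
cf = list(accumulate(freq[code] for code in sorted(freq)))
print(cf)

# Expand the table back into the sorted raw data set {1,1,...,7,7}.
raw = [code for code in sorted(freq) for _ in range(freq[code])]
print(len(raw))

# 50th and 51st values (list positions 49 and 50, since Python counts from 0).
print(raw[49], raw[50], median(raw))
```

Both middle values land in the Neutral (4) block, which matches the answer read off the cf row.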
We add up the 23 “1”s: 1 + 1 + 1 + … = 23 × 1
Then add up the “2”s: 2 + 2 + 2 + … = 10 × 2
and so on.
Total = 23×1 + 10×2 + 2×3 + 26×4 + 8×5 + 17×6 + 14×7 = 23 + 20 + 6 + 104 + 40 + 102 + 98 = 393
Mean = 393 / 100 = 3.93.
By the mean, people as a whole are very slightly against lab-grown meat. (The mean is slightly less than 4, and the lower values are against.)

Question 3 – Crosstab of academic success vs mother’s age.

Child born before/after 30 * Academic Success Cross-tabulation
Count
                                            Academic Success
Mother’s age at first childbirth    Excellent   Satisfactory   Marginal   Total
Over 30                                 29           45           18        92
Under 30                                21           55           32       108
Total                                   50          100           50       200

Crosstabs also show frequency (or count) data, but they show the relationship between two categories of ordinal or nominal data.
Nominal data is just “what type is something”, like flavour of ice cream.
Ordinal data is similar, but has a natural ordering like “OK, good, very good, doubleplusgood.”

3d. What percentage of students are only marginally successful?
The tricky part is identifying…
What group are we discussing? All students.
What do we want? % that are marginal.
We are only looking at one aspect (success), so use the totals.
200 students in total, 50 are marginal.
Proportion that are marginal: 50 / 200 = .25, or 25%

3e. What is the ratio of students with mothers over 30 at childbirth to those under 30?
A ratio is a comparison of the size of one group to another.
92 older mothers to 108 young mothers.
Ratio = 92 to 108, or 92/108. Alternative answers: 46/54, or 0.85/1.

3f. Of the students whose mothers were under 30 at first childbirth, what percentage have Satisfactory success or better?
Start by identifying…
What group are we discussing? ONLY students with young mothers.
What do we want? % that are satisfactory or better.
Only considering one category is called conditioning. Our analysis for this question is conditional on the mothers being young (it doesn’t apply to all students).
Satisfactory or better means Satisfactory or Excellent.

Academic Success (UNDER 30 ONLY)
Excellent   Satisfactory   Marginal   Total
   21            55           32       108

That’s 21 + 55 = 76 students out of a total of 108.
Proportion = 76/108 = .704 = 70.4%

Feelin’ refined and edified? One more problem to go.

Question 4 – Children’s Luck Scores
The histogram should have looked like this.
[Figure: histogram of luck scores]
I would say it’s bimodal, others say it’s multimodal. Both work.

A cumulative frequency table is built by going to Analyse > Descriptive Stats > Frequencies, and just leaving the frequency table box checked.

Luck score   Frequency   Percent   Cumulative Percent
     0           11        20.4          20.4
     1           10        18.5          38.9
     2            3         5.6          44.4
     3            2         3.7          48.1
     5            5         9.3          57.4
     6            1         1.9          59.3
     7            1         1.9          61.1
     8            3         5.6          66.7
     9            6        11.1          77.8
    10           12        22.2         100.0
 Total           54       100.0

How many students reported that they were 2 or less lucky?
The cumulative percent adds up from the lowest values to the highest, so 44.4% of the students reported being 2 or less lucky.

What percentage of students said they were 8 or more lucky?
The table only shows how many are X or less lucky. However, 8 or more is the opposite of 7 or less.
So 100% – 61.1% = 38.9% are 8 or more lucky.
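The cumulative-percent reasoning above can be reproduced from the raw frequencies with a short Python sketch. Treat it as an illustration only: cumulative_percent is a helper name invented here, not an SPSS or library function, and it rounds to one decimal place to match the printed table.

```python
# Luck score -> frequency, copied from the table (no student reported a 4).
freq = {0: 11, 1: 10, 2: 3, 3: 2, 5: 5, 6: 1, 7: 1, 8: 3, 9: 6, 10: 12}

def cumulative_percent(freq):
    """Percent of respondents at or below each score (hypothetical helper)."""
    total = sum(freq.values())
    running = 0
    result = {}
    for score in sorted(freq):
        running += freq[score]          # add from lowest score to highest
        result[score] = round(100 * running / total, 1)
    return result

cum = cumulative_percent(freq)

# "2 or less lucky" is read straight off the cumulative percent for 2.
at_most_2 = cum[2]

# "8 or more lucky" is the opposite of "7 or less lucky".
at_least_8 = round(100 - cum[7], 1)

print(at_most_2, at_least_8)
```

Reading "X or more" as 100% minus the cumulative percent one row lower is the only trick; everything else is a running total.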
Alternative method: Add up the frequencies.

Luck score:   0   1   2   3   5   6   7   8   9   10   Total
Frequency:   11  10   3   2   5   1   1   3   6   12     54

11 + 10 + 3 = 24 respondents reported being 2 or less lucky.
24 out of 54 total. 24/54 = .444 = 44.4%

3 + 6 + 12 = 21 respondents reported being 8 or more lucky.
21 out of 54 total. 21/54 = .389 = 38.9%

Next time, on StatsClass:
Friday:
- Review of midterm material
- (from the assignment 2 realm)
Monday:
- Midterm
(Extra office hours… most of the morning?)