Matching Histograms, Box Plots, and Normality Plots

WHAT IF YOU DO NOT KNOW THE STANDARD DEVIATION OF THE POPULATION?

What are t-distributions?

When you don't know the standard deviation of the population, you use the standard deviation of the sample. Most of the time you do not know the standard deviation of the population, so on the MEAN side you will almost always be using a t-distribution, and on the proportion side you will be using a z-score. The t-distribution only goes with mean problems because you are estimating a standard deviation from the data. On the proportion side you are CREATING the standard deviation from the proportion and the sample size.

When dealing with a t-distribution there are two values that have variation: the mean and the standard deviation. Every sample will most likely have a different mean and a different standard deviation.

The CLT tells us that the sampling distribution of sample means approaches a Normal model as n increases. That's based on the assumption that we know sigma. In the real world, that's exceedingly rare. Not knowing sigma forces us to use the sample's s, and that means we're no longer playing by CLT rules. We have to take this extra uncertainty into account, and Gosset's t does exactly that.

The critical issue in deciding between z and t is the fact that we don't (and indeed can't) know sigma. In the real world, inference for means ALWAYS requires t. The rule of thumb suggesting that one should use z if n > 30 is a pre-technology approximation that's no longer necessary. No matter how large n is, we use t.

Before we had stats software or calculators, finding values of t for any number of df would have required many pages of tables. The easy out was to print only one page of the t-table (usually up to about 30 df) and then tell people to switch over to z.
That's not because the sampling distribution miraculously changed; it's because the t-distribution approaches a Normal as n increases, and 30 was a convenient place to say "okay, it's close enough now". A book suggesting this approach is out of date. We no longer have to resort to this approximation, because the calculator can work with any number of df, and that lets us always (and properly) use t for inference for means. Suggesting that we should sometimes use t and sometimes z confuses kids. Don't go there. We won't know sigma. t is for means, z is for proportions. Keep it simple.

Jeane Swaynos AP Workshop July 2008

There is a family of graphs for the t-distributions, and each one depends on the size of the sample.

T distributions

[Figure: graph of the NORMAL distribution]
[Figure: graph of the t-distribution with 2 degrees of freedom] Do you notice that the tails are thicker on the t-distribution?
[Figure: graph of the t-distribution with 10 degrees of freedom]

As the degrees of freedom increase, the t-distribution gets closer to a Normal distribution. The degrees of freedom are found by taking your sample size minus 1. For example, if your sample size is 15, your degrees of freedom would be 14, so you would look at the distribution for df = 14.

Look at the chart in your book and find the t-distribution table. Do you notice that there are values along the left side? These represent the degrees of freedom.

These are additional assumptions necessary when using the "t" distribution. If you are given the data points, you must graph the points and discuss what you are looking for on the graph. If you are not given the data points, then you will have to assume that these conditions occur. Here are the conditions.

1. n < 15: distribution approximately Normal
With a "t" distribution you have additional assumptions.
If your sample size is less than 15, then you have to assume that the population of interest is NORMAL. Think about this: it would be difficult to show normality with such a small sample. If the data are provided, the reader still expects you to display the data and comment on the histogram.

2. 15 <= n < 30
If your sample size is between 15 and 30, your data do not necessarily have to be Normal, but you cannot have skewness or outliers. Again, if the data are given, show a histogram and describe what you see.

3. n >= 30
If your sample size is 30 or more, the data do not have to be approximately Normal and can also have some skewness in either direction. You will still have to be cautious of outliers, so you will want to state this as part of your assumptions.

You will choose one of the conditions above based on your sample size. You will still need to address the other two assumptions:
You assume this is a SRS from your population of interest.
You assume that the observations in your sample are independent, which is reasonable when the sample is less than 10% of the entire population.

The assumptions on the AP test are graded as right or wrong. If one little piece is missing, for example if you do not define the population in the SRS or you do not state the correct information involving your sample size, then it is wrong; there is no partial credit.

One sample T Confidence Interval

Mrs. Swaynos took a random sample of gas prices around the area and found the price per gallon of regular gasoline was as follows:

3.12 3.21 3.34 3.67 3.78 3.10 3.12 3.32 3.45 3.56 3.78

Make a 95% confidence interval for the data and interpret the results. Show all parts.

Define the statistic, because that is what you are using in the formula
\( \bar{x} \) = the average price of regular gasoline per gallon FROM MY SAMPLE

Assumptions
1. I assume this is a simple random sample of locations.
2. I assume that the 11 prices are independent of each other.
3. Since n < 15 and I am using a t-distribution, I will assume the population of interest is Normal. Below is a normality plot to show the 11 data points, and the histogram shows a slightly skewed distribution. It is difficult to show data are Normal with such a small sample size. I will, therefore, assume that the population of all gas prices is Normally distributed.

Be sure to describe your graph. You could make either a normality plot or a histogram to show the data are Normal.

Mechanics (Name or Formula with substitution, df, and the specific interval)
Note: at the start of the year I do not give the students a choice; they must give the formula with the correct substitutions. If they do not have t-inverse on their calculator, have them use the chart.

\[ \text{95\% CI} = \bar{x} \pm t^{*}\frac{s}{\sqrt{n}} = 3.40 \pm 2.228\left(\frac{0.26}{\sqrt{11}}\right), \qquad df = 10 \]

(3.23 to 3.58)

Conclusion
Interpretation of the INTERVAL: I am 95% confident that the true mean price of regular gasoline in the area is between $3.23 and $3.58 per gallon.
Interpretation of the LEVEL: If I did this process again and again, approximately 95% of the resulting intervals would capture the true mean gas price.

Two Sample T-Test

A teacher wants to know if the method of instruction affects how well students learn. Using two classes of the same level of statistics, she teaches one class using lecture only and the other class using lecture and group work. She measures the level of learning by giving both classes the same test. Assuming that the two classes are representative of all statistics students, what type of inference procedure should be used? State the hypotheses for the appropriate test and identify the inference procedure you would use. Justify your response and include comments on the design of the study.
Solution
Procedure type: Two-sample t-test

\( H_0: \mu_L = \mu_G \)
\( H_a: \mu_L \neq \mu_G \)

where \( \mu_L \) represents the mean score of the tests in the class where lecture only was used and \( \mu_G \) represents the mean score of the tests in the class where lecture and group work were both used.

Notes
The response variable, scores of individual students, is numerical, and there are two independent groups: classes with lecture only and classes with both lecture and group work. This leads us to conclude it is a difference of means, two-sample t problem. The teacher is looking to see if one method is different from the other, which indicates a two-tailed test.

Matched Pair T-Test

Situation
Having done poorly on their math final exam in June, six students repeat the course in summer school and then take another exam in August. If we consider these students representative of all students who might attend this summer school in other years, do these results provide evidence that the program is worthwhile? Show all parts.

June   Aug
54     50
49     65
68     74
66     64
62     68
62     72

This is a matched pair because we are taking two measurements from one experimental unit. The treatment is the summer course remediation. Any pretest and posttest is a Matched Pair design, and we look at the Difference. This is really a one-sample t-test on the difference. We call it a matched pair because each difference comes from two data points. These data points, however, are NOT independent. The same person is taking the test in June and in August.

What to look for when making a decision about whether to use a "Matched Pair" test for the difference or a "Two sample" test:
Data must be paired for a matched pair test. Pairing is not a problem; it is an opportunity. The independence assumption between the two sets of scores is violated, but we can actually do much better than the two-sample t-test. After all, we should be focusing on the changes.
You make a decision about whether the data are paired from understanding how they were collected and what they mean. There is no test to determine whether the data are paired. This comes from reading the problem.

Mechanically, a matched pair t-test is just a one-sample t-test for the mean of the paired differences. The sample size is the number of pairs.

Things to remember
Don't use a two-sample t-test for paired data.
Don't use a paired-t method when the samples aren't paired.
When two groups do not have the same number of values, it's pretty easy to see that they can't be paired. But just because two groups have the same number of observations doesn't mean they can be paired, even if they are shown side-by-side in a table. We might have 25 men and 25 women in the study, but they might be completely independent of one another.
There is most often less variability in the matched pair design than in the two-sample t-test. Matching pairs generally removes so much extra variation that it more than compensates for having only half the degrees of freedom.

Decide whether the following situations are paired or two samples.

Define Parameter
\( \mu_D \) = the AVERAGE DIFFERENCE (test score in August minus test score in June). Mean of the differences!

Null and Alternative in terms of the PARAMETER
\( H_0: \mu_D = 0 \)
\( H_A: \mu_D > 0 \)

Assumptions
1. This is a simple random sample of students.
2. The six students are independent of each other.
3. Since I am using a t-distribution with only 6 points and I have the data points, I MUST SHOW A GRAPH OF THE DATA. I will assume the population is NORMAL.
OR
Since the Normality plot shows an outlier, I will assume that the POPULATION of interest is Normal.
The histogram does not appear Normal because there is a gap. It is difficult to show that a sample this small would be Normal. That is why we assume that the population of interest is Normal.
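Since a matched pair t-test is mechanically a one-sample t-test on the differences, the same few lines can check both worked examples in these notes. What follows is a minimal Python sketch, not the calculator procedure used in class. Assumptions: the helper name `one_sample_summary` is made up for illustration; the June and August scores are paired row by row as (54, 50), (49, 65), and so on, which is the pairing that reproduces a mean difference of 5.33; and the critical values 2.228 (df = 10, 95% confidence) and 2.015 (df = 5, one-sided alpha = .05) are copied from a t-table rather than computed.

```python
import math
import statistics


def one_sample_summary(values):
    """Return (n, mean, sample standard deviation, standard error of the mean)."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, divides by n - 1
    return n, mean, sd, sd / math.sqrt(n)


# One-sample t interval for the gas price data.
gas = [3.12, 3.21, 3.34, 3.67, 3.78, 3.10, 3.12, 3.32, 3.45, 3.56, 3.78]
n, xbar, s, se = one_sample_summary(gas)
t_star = 2.228  # t-table value for df = 10 and 95% confidence (assumed, not computed)
ci = (xbar - t_star * se, xbar + t_star * se)
print(f"gas prices: mean = {xbar:.2f}, s = {s:.2f}, df = {n - 1}")
print(f"95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")

# Matched pair t statistic: a one-sample t on the August minus June differences.
june = [54, 49, 68, 66, 62, 62]      # pairing inferred from the table (assumption)
august = [50, 65, 74, 64, 68, 72]
diffs = [a - j for j, a in zip(june, august)]
n_d, dbar, s_d, se_d = one_sample_summary(diffs)
t_stat = (dbar - 0) / se_d           # the Null says the mean difference is 0
df = n_d - 1
t_crit = 2.015  # t-table value for df = 5, one-sided alpha = .05 (assumed)
print(f"differences: {diffs}")
print(f"mean diff = {dbar:.2f}, s = {s_d:.2f}, t = {t_stat:.2f}, df = {df}")
print("reject the Null" if t_stat > t_crit else "do not reject the Null")
```

The interval should come out at about (3.23, 3.58) and the t statistic at about 1.75, which is below 2.015, so the sketch agrees with not rejecting the Null at alpha = .05.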
Many students did not include a normality plot or a histogram.

Test statistic, type of test by name or formula, p-value, degrees of freedom
Matched Pair t-test
t score is 1.75
degrees of freedom is 5
P(t > 1.75) = .0699

\[ t = \frac{5.33 - 0}{7.45/\sqrt{6}} = 1.75 \]

Conclusion
There is approximately a .07 probability that a mean DIFFERENCE of 5.33 or larger would happen by chance alone if the true mean difference in scores (August minus June) was 0. I will NOT reject the Null at alpha = .05.

Proportion Problem (Two Sample)

Notes about the standard error for the two-proportion test and the two-proportion confidence interval

The standard error for a two-proportion test is

\[ SE = \sqrt{p_C(1-p_C)\left(\frac{1}{n_1}+\frac{1}{n_2}\right)} \]

\( p_C \) is the combination of the proportions. The book does not use this notation, but I use it to say it is a combination of both proportions. For example, if \( \hat{p}_1 = \frac{12}{30} \) and \( \hat{p}_2 = \frac{15}{40} \), then \( p_C = \frac{12+15}{30+40} = \frac{27}{70} \).

When you are doing a TEST you are stating that the two proportions are equal, so you can pool the standard error. When you pool the proportions you combine both the values and create a proportion based on both samples.

Confidence Interval
When you are doing a confidence interval this is not the case. In this situation the standard error is

\[ SE = \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

So the formula for a two-proportion confidence interval is

\[ \text{\_\_\_\% CI} = (\hat{p}_1-\hat{p}_2) \pm z^{*}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1}+\frac{\hat{p}_2(1-\hat{p}_2)}{n_2}} \]

Remember how to find z*: the z* has to match the tail area (1 - C)/2, where C is the confidence level. This determines the area under the curve in each tail. Use InvNorm to find the z* matching the confidence level. Remember, for a confidence interval you will be defining the \( \hat{p} \) from your sample.
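The difference between the pooled standard error (for a test) and the unpooled standard error (for a confidence interval) is easier to see with numbers. The sketch below uses the 12/30 and 15/40 example from these notes. The function names are made up for illustration, and `NormalDist().inv_cdf` from Python's standard library stands in for InvNorm on the calculator.

```python
import math
from statistics import NormalDist


def pooled_se(x1, n1, x2, n2):
    """Standard error for a two-proportion TEST (proportions combined)."""
    p_c = (x1 + x2) / (n1 + n2)
    return math.sqrt(p_c * (1 - p_c) * (1 / n1 + 1 / n2))


def unpooled_se(x1, n1, x2, n2):
    """Standard error for a two-proportion CONFIDENCE INTERVAL."""
    p1, p2 = x1 / n1, x2 / n2
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def z_star(confidence):
    """Critical value that leaves (1 - confidence) / 2 in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)


x1, n1, x2, n2 = 12, 30, 15, 40
p1, p2 = x1 / n1, x2 / n2
se_test = pooled_se(x1, n1, x2, n2)
se_ci = unpooled_se(x1, n1, x2, n2)
zs = z_star(0.95)
ci = (p1 - p2 - zs * se_ci, p1 - p2 + zs * se_ci)

print(f"pooled proportion = {(x1 + x2) / (n1 + n2):.4f}")
print(f"SE for the test (pooled)  = {se_test:.4f}")
print(f"SE for the CI (unpooled)  = {se_ci:.4f}")
print(f"z* for 95% = {zs:.3f}")
print(f"95% CI for p1 - p2: ({ci[0]:.3f}, {ci[1]:.3f})")
```

The two standard errors are close here because the two sample proportions are close; they differ more when the proportions are far apart.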
Situation
A Gallup Poll taken in May 2000 asked the question: "In general, do you feel that the laws covering the sale of firearms should be made: more strict, less strict, or kept as they are now?" Of the n = 493 men who responded, 52% said "more strict," while of the n = 538 women who responded, 72% said "more strict." Assuming these respondents constitute random samples of U.S. men and women, is there sufficient evidence to conclude that a higher proportion of women than men in the population think these laws should be made stricter? Justify your answer.

Procedure type: Difference of two proportions z test

\( H_0: p_w - p_m = 0 \)
\( H_a: p_w - p_m > 0 \)
OR
\( H_0: p_w = p_m \)
\( H_a: p_w > p_m \)

where \( p_m \) and \( p_w \) represent the proportion of men and women respectively who support "more strict" laws in the sale of firearms.

Notes
Performing a large-sample difference of two proportions test solves this problem. This is evident because there are two populations that we are studying, men and women. The problem also asks students to find evidence of a higher proportion of women than men who think that the laws on the sale of firearms should be stricter, indicating a one-tailed test.

Students may need to be reminded that for the hypothesis test, in calculating the test statistic, they must use a pooled estimate for the proportion. In a hypothesis test we are assuming the null hypothesis is true, and the null hypothesis assumes the population proportions for men and women are equal.

Students need to check the condition that the sample size is large enough. One rule for checking this would be: \( n_w \hat{p}_w \ge 10 \), \( n_w(1-\hat{p}_w) \ge 10 \) and \( n_m \hat{p}_m \ge 10 \), \( n_m(1-\hat{p}_m) \ge 10 \), where \( n_m \) and \( n_w \) represent the number of men and women respectively. Some texts use: \( n_w \hat{p} \ge 10 \), \( n_w(1-\hat{p}) \ge 10 \) and \( n_m \hat{p} \ge 10 \), \( n_m(1-\hat{p}) \ge 10 \), where \( \hat{p} \) represents the pooled estimate of p. Other texts use 5 in place of 10.
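The sample-size check and the pooled test statistic described in these notes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts of "more strict" responses are not given in the problem, so they are rebuilt by rounding 52% of 493 and 72% of 538, and the variable names are made up. The difference is taken as women minus men to match the alternative \( p_w > p_m \).

```python
import math
from statistics import NormalDist

n_m, n_w = 493, 538
x_m = round(0.52 * n_m)  # assumed count of men answering "more strict"
x_w = round(0.72 * n_w)  # assumed count of women answering "more strict"

# Large-sample condition: successes and failures in each group are at least 10.
counts = [x_m, n_m - x_m, x_w, n_w - x_w]
conditions_ok = all(c >= 10 for c in counts)

# Pooled estimate, used because the Null says the two proportions are equal.
p_m, p_w = x_m / n_m, x_w / n_w
p_pooled = (x_m + x_w) / (n_m + n_w)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_m + 1 / n_w))

z = (p_w - p_m - 0) / se
# Upper-tail area, computed as the lower tail of -z to avoid rounding to zero.
p_value = NormalDist().cdf(-z)

print(f"counts checked against 10: {counts}, ok = {conditions_ok}")
print(f"pooled proportion = {p_pooled:.4f}")
print(f"z = {z:.2f}")
print(f"p-value = {p_value:.2e}")
```

The p-value is tiny but not exactly zero, which is why a conclusion should say it is approximately zero.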
Possible incorrect solutions would include difference of two means or Chi-Square. Difference of means cannot be correct because we have proportions as opposed to average percentages. If students are confused on this issue, ask them what the original data must look like. Is it numerical or categorical? For answers of categorical, tests of proportion are correct. For answers of numerical, t-tests are appropriate. In this case the raw data must be in the form "more strict," "less strict," or "kept as they are now." This is categorical data. If our original data were presented as a list of percentages for many different samples (which is numerical), a t-test would be used.

Students could also attempt a solution using Chi-Square. Since the Chi-Square test does not look at the direction of a difference and we are doing a one-tailed test, this would not be appropriate. Chi-Square could work as an alternative solution if this example was not one-tailed.

It is important that you define the Null and Alternative Hypothesis in terms of the PARAMETER.

State Null and Alternative in terms of parameters
HO: Pm = Pf
HA: Pm < Pf

Define Parameters in context
Pm = the proportion of males who think that the laws about gun control should be MORE STRICT
Pf = the proportion of females who think the laws about gun control should be made more strict

Give Assumptions for both groups Individually

Males
1. I assume this is a simple random sample of males from the population.
2. I assume that the 493 males are independent.
3. np > 10: .52(493) = 256 > 10, and n(1-p) > 10: .48(493) = 237 > 10. Show numerical values.

Females
1. I assume that this is a simple random sample of females from the population.
2. I assume that the 538 females in the sample are independent of each other.
3. np > 10: .72(538) = 387 > 10, and n(1-p) > 10: .28(538) = 151 > 10. Show numerical values.
4. I assume that the 493 males sampled are independent of the 538 females.

Give formula with test statistic, z-score and p-value
All the values you need for the formula are on the calculator screen.

\[ z = \frac{\hat{p}_m - \hat{p}_f - 0}{\sqrt{p_c(1-p_c)\left(\frac{1}{n_F}+\frac{1}{n_M}\right)}} = \frac{.52 - .72 - 0}{\sqrt{.62(.38)\left(\frac{1}{538}+\frac{1}{493}\right)}} = -6.623 \]

The calculator carries unrounded values, which is why it reports -6.623.

If you had entered the females first (females minus males), the z value would be +6.623 and this would represent the area of the right tail. The p-value would be the same, approximately zero. Remember the p-value will never be exactly zero even though that is what the calculator gives. There is always some tiny area, so state the p-value as APPROXIMATELY equal to zero.

Conclusion (p-value, difference, context, reject or not reject, alpha level)
There is approximately a 0% probability that a DIFFERENCE of .20 or larger would happen by chance alone if the true difference between the proportions of males and females that prefer a stricter gun control law were 0. I will reject the Null at alpha = .05.

Chi Square Test

The Chi Square test is used when you have categorical data and are comparing a number of proportions. There are three basic types of Chi Square Test:
Test of Goodness of Fit
Test for Independence
Test for Homogeneity

The Goodness of Fit is testing whether a particular distribution is as stated. This is comparing many different proportions. Only one of the proportions has to be statistically off, not all of the proportions. These types of problems only have one row and 2 or more columns.

The Test for Independence, also known as the Test for Association, is testing whether there is a relationship between two categorical variables, for example gender and grades. The problem will have more than one row. (Frog Problem 2009 #3)

What does the Chi Square graph look like?
Chi Square for each degree of freedom has a different type of graph. As the degrees of freedom increase, the graph becomes less skewed and is not as tall. Because the tail area for a given Chi Square value depends on the degrees of freedom, the same Chi Square value can be significant for one df and not for another. You will only be using a one-sided test for Chi Square.

The Goodness of Fit test has one row, and the Null states that the distribution is as the manufacturer states or perhaps as the newspaper states.

Example Problem
Arnold Palmer states that the distribution of hair color for newborn babies is 10% Red, 20% Black, 20% Brown, and 50% Bald. You take a random sample of 200 babies and find 25 Red, 60 Black, 50 Brown, and 65 Bald.

You could do a Chi Square Test for Goodness of Fit.

HO: The distribution of newborn hair color is as Arnold Palmer states: 10% Red, 20% Black, 20% Brown, and 50% Bald
HA: The distribution of newborn hair color is NOT as Arnold Palmer states: 10% Red, 20% Black, 20% Brown, and 50% Bald

Assumptions
1. The data are COUNTS.
2. We assume this is a simple random sample of 200 newborn babies.
3. We assume the 200 babies are independent of each other.
4. All expected counts are greater than 5 (20, 40, 40, 100). YOU MUST SHOW THE EXPECTED COUNTS.

There are two ways to look at assumption 4. Alternatively, we can say that all expected counts are greater than 1 and no more than 20% of the expected counts are less than 5. This is about EXPECTED counts, not the actual data.

The degrees of freedom is the number of columns you have minus 1: columns - 1.

List the data in a table.
         Observed   Expected
Red      25         .1(200) = 20
Black    60         .2(200) = 40
Brown    50         .2(200) = 40
Bald     65         .5(200) = 100

Chi Square is found by taking the total of (Observed - Expected)^2 / Expected. This is the symbol for Chi Square:

\[ \chi^2 = \sum \frac{(O-E)^2}{E} = \frac{(25-20)^2}{20} + \frac{(60-40)^2}{40} + \frac{(50-40)^2}{40} + \frac{(65-100)^2}{100} = 26 \]

Chi Square = 1.25 + 10 + 2.5 + 12.25 = 26

Parts of the Chi Square Test

State the Null and Alternative (if this is written in words then you have already defined the variables)

Give the assumptions
1. The data are counts
2. SRS of ______ from _____
3. The samples are independent
4. All expected counts are greater than 5 (show expected counts)

Chi Square Value, degrees of freedom and P Value

Conclusion
There is a ____% probability that a chi square value of ____ or larger would happen by chance alone if the true distribution were _____. I ______ reject the Null at alpha = .05 (Be sure to connect the conclusion to the context of the problem)
OR
There is a ___% probability that I would get this Chi Square value or larger by chance alone. Therefore, I _____ reject the Null at alpha = .05. Therefore it appears that ---- and ---- are or are not independent.

M&M Activity (on your own)
Show all parts for this activity, including the table with observed and expected counts. Show all the mechanical steps to find your Chi Square value.

M&M Plain: Brown 13%, Yellow 14%, Red 13%, Blue 24%, Orange 20%, Green 16%
M&M Peanut, Both Types: Brown 12%, Yellow 15%, Red 12%, Blue 23%, Orange 23%, Green 15%
Peanut Butter: 10 Brown, 20 Yellow, 10 Red, 20 Blue, 20 Orange, 20 Green
Dark Choc.: Uniform Distribution

Review of Chi Square

There are several procedures to compare the distribution of categorical data.
When a categorical variable has multiple categories, when there are two categorical variables under consideration, or when there are multiple populations under study, a Chi Square test is used.

Goodness of Fit: One row. One categorical variable with multiple categories from ONE population. The test compares the distribution of sample counts with the hypothesized distribution of the population.

Test for Homogeneity: More than one row. One categorical variable with multiple categories from two or more populations. The test compares the distribution of sample counts with the distribution expected assuming the populations have identical distributions.

Test of Independence: More than one row. Two categorical variables with multiple categories from ONE population. The test compares the distribution of sample counts with the distribution expected assuming the two variables are independent.

Test for Homogeneity

Situation: A certain brand of bite-size candies comes in three varieties: creamy, crispy, and chewy. The manufacturer is interested in whether preference for the types of candies differs between three school age groups: elementary, middle, and high school. Random samples at three local schools, one of each age group, are taken and the sample data compiled in the table below.

Variety   Elementary   Middle   High School
Creamy    33           21       16
Crispy    14           16       12
Chewy     19           17       32

*Do you notice that I have three different samples here, so I am looking at three different populations? This is what tells me it is a test for Homogeneity and not Association.

HO: P creamy in elementary = P creamy in middle school = P creamy in high school
    P crispy in elementary = P crispy in middle school = P crispy in high school
    P chewy in elementary = P chewy in middle school = P chewy in high school
HA: At least one of the statements in HO is not true

Chi Square test for Homogeneity

Assumptions IN CONTEXT
1. Data come from independent simple random samples: OK, each group of students was randomly selected from their respective schools.
2. The samples were independent, or the size of each sample is less than 10% of the population size: OK as long as there are at least 660 elementary, 540 middle, and 600 high school students in the respective populations.
3. All expected counts are at least 5: see the expected counts below. It is important that you indicate exactly where to find the expected counts.

Calculate the Chi Square test statistic
The expected counts are found by taking (Row Total)(Column Total) and dividing by the (Total Total).
The degrees of freedom is (# of rows - 1)(# of columns - 1).

Observed Counts
Variety   Elementary   Middle   High School   TOTAL
Creamy    33           21       16            70
Crispy    14           16       12            42
Chewy     19           17       32            68
TOTAL     66           54       60            180

Expected Counts
Variety   Elementary             Middle                 High School
Creamy    (66)(70)/180 = 25.67   (54)(70)/180 = 21.00   (60)(70)/180 = 23.33
Crispy    (66)(42)/180 = 15.40   (54)(42)/180 = 12.60   (60)(42)/180 = 14.00
Chewy     (66)(68)/180 = 24.93   (54)(68)/180 = 20.40   (60)(68)/180 = 22.67

Chi Square is found by taking the total of (Observed - Expected)^2 / Expected. This is the symbol for Chi Square:

\[ \chi^2 = \sum \frac{(O-E)^2}{E} \]

The calculator will automatically store the expected counts in Matrix B. This will only work when you have more than one row.

The conclusion
There is approximately a 2% probability that a Chi Square value of 11.552 or larger would happen by chance alone if all the proportions of creamy, crispy and chewy candy were the same for the three types of schools. I will reject the Null at alpha = .05.
OR
There is approximately a 2% probability that, if the preference for candies were distributed in the same proportion among all three age groups of students, a chi-square statistic of 11.552 or larger would occur.
Reject the Null at alpha = .05.

Test for Independence

The Chi-Square test of independence is used to compare the distribution of sample counts of two categorical variables from a SINGLE population to see if there is an association between the variables. For instance, parents of incoming freshmen in a large school district were asked if they supported school uniforms. Parents were classified by whether or not they favored uniforms, and by the type of uniform policy.

Situation
According to the Orlando Sentinel, the proportion of schools that allow cell phones depends on the population of students at the school. These are the data they have collected. Complete an appropriate significance test. Show all parts.

                 Cell Phone
Population       Yes   No
Less than 500    12    23
500-1000         11    5
1001-1500        16    34
1501-2000        15    17
2001-2500        18    24
2501-3000        12    32
Over 3000        34    30

Solution
This is a Chi Square test for Independence. There is a subtle difference in this problem. We are assuming this is all from one sample and then we separated the schools and gathered the data. If this was taken from individual samples, then we would do a test for Homogeneity and be comparing the proportions. Both sets of Null and Alternative will be accepted. The stem of the problem was not clear in how the sample was gathered.

Null and Alternative
HO: Whether a school allows cell phones is independent of the size of the school
HA: Whether a school allows cell phones is not independent of the size of the school
OR
HO: The proportion of schools that allow cell phones is the same for each of the 7 different size schools
HA: The proportion of schools that allow cell phones is NOT the same for at least one of the 7 different size schools

Assumptions
1. I assume this is a simple random sample of schools.
2. I assume that the schools in the sample are independent of each other.
3. All expected counts are greater than 5.
The approximate value of each is as follows: 14, 20, 6, 9, 20, 29, 13, 18, 17, 24, 18, 25, 26, 37. Show the expected counts. You get the expected counts by looking at Matrix B. You do not have to do anything except put the values in Matrix A. The calculator will do the rest. You need to remember how the expected values are calculated because this could be asked on a free response. They could also show you a computer printout of the values.

Define Chi Square test, give test statistic, degrees of freedom and p-value
Chi Square test for independence
Chi Square is 15.12
Degrees of freedom: 6
\( P(\chi^2_6 > 15.12) = .019 \)

Conclusion
There is a .019 probability that I would get a Chi Square value of 15.12 or larger by chance alone if cell phone policy were independent of school size. I will reject the Null at alpha = .05. It appears that cell phone policy is not independent of the size of the school.

Test for Goodness of Fit

Example
According to USA Today, 20% of children are blond, 40% have brown hair, 10% have red hair, and 30% have black hair. Mrs. Swaynos has taken a random sample of children and found the following results: 32 Blond, 48 Brown, 15 Red, 40 Black. Is there evidence to think USA Today is not correct?

Parts to Chi Square for Goodness of Fit
1. Define Null and Alternative in words
2. Give the assumptions
   SRS of _____ from ____
   Data are counts
   n < 10% of the population
   All expected counts are greater than 1
   Not more than 20% of expected counts are less than 5
3. Give the Chi Square value showing the work
4. Give the p-value
5. Give the conclusion
   There is a _______% probability that I would get a result of ____ OR LARGER by chance alone if the true proportions were as stated. I _______ reject the Null at alpha = ______. Make a statement that connects to the problem.
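The hand mechanics for Chi Square can also be written out in Python. The sketch below is one possible way to do it, not the calculator's Matrix A and Matrix B routine. It reuses two examples already worked in these notes: the newborn hair color goodness of fit (Chi Square of 26) and the cell phone table (Chi Square of 15.12). The function names are made up for illustration, and p-values are left to the table or calculator.

```python
def chi_square_gof(observed, proportions):
    """Goodness of fit: one row of counts against stated proportions."""
    total = sum(observed)
    expected = [p * total for p in proportions]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    df = len(observed) - 1  # columns - 1
    return stat, df, expected


def chi_square_two_way(table):
    """Independence or homogeneity: expected = row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    expected = [[r * c / grand for c in col_totals] for r in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(table) - 1) * (len(table[0]) - 1)  # (r - 1)(c - 1)
    return stat, df, expected


# Newborn hair color: Red, Black, Brown, Bald.
gof_stat, gof_df, gof_expected = chi_square_gof(
    [25, 60, 50, 65], [0.10, 0.20, 0.20, 0.50]
)
print(f"goodness of fit: chi-square = {gof_stat:.2f}, df = {gof_df}")

# Cell phone policy (rows: Yes, No) by school size (7 columns).
cell = [
    [12, 11, 16, 15, 18, 12, 34],
    [23, 5, 34, 17, 24, 32, 30],
]
ind_stat, ind_df, ind_expected = chi_square_two_way(cell)
print(f"independence: chi-square = {ind_stat:.2f}, df = {ind_df}")
for label, row in zip(["Yes", "No"], ind_expected):
    print(label, [round(e, 1) for e in row])
```

Printing the expected counts makes it easy to confirm that all of them are above 5 before trusting the statistic.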
Notes about Chi Square

There are three types of Chi Square problems: Goodness of Fit, Test for Independence, Test for Homogeneity.
The difference between Independence and Homogeneity is that Independence comes from one sample from one population. Homogeneity comes from two or more samples. The way the sample is taken and the question you are trying to answer dictate the type of test.
A test for Goodness of Fit only has ONE row. A test for Independence or Homogeneity has at least two rows. This type of test can be done on the calculator. The goodness of fit must be done by hand.

Example for Goodness of Fit
According to USA Today, Skittles colors are equally distributed. Mrs. Swaynos has opened a package of Skittles and found the following results: 12 yellow, 14 red, 22 green, and 28 purple. Do a statistical test for this.

Chi Square Test for Independence
According to Ms. Michalik, the amounts spent by juniors and seniors who attend prom are as follows:

                 Males   Females
Spend 0-100      21      4
Spend 101-250    32      32
Spend 251-400    12      23
Spend over 400   8       30

Chi Square Test for Homogeneity
Ms. Michalik took a sample from Seminole, Oviedo, and Lake Mary and found out how much people spend on prom.

                 Oviedo   Seminole   Lake Mary
Spend 0-100      34       12         5
Spend 101-250    56       25         34
Spend 251-400    123      231        129
Spend over 400   200      129        321

Assumptions
1. SRS of _______ from _______
2. Sample < 10% of the population
3. All expected counts are > 1
4. No more than 20% of expected counts can be less than 5
5. Make sure the data are counts.

Degrees of Freedom: (r - 1)(c - 1) for independence and homogeneity. For a goodness of fit it is (c - 1).

Conclusion
There is a __p-value__% probability that a result of __ or larger (from your sample) would happen by chance alone if the true __Null__ were _______.
I _will or will not_ reject the Null at alpha = .05. The conclusion must always connect to the context of the problem.

Hypothesis Test for Slope
The hypothesis test for slope allows us to determine if there is a useful linear relationship between x and y in the population. That is, does the slope of the population model differ from 0: does y tend to change linearly with changes in x? If there is a linear relationship between the two variables, the slope should not equal 0.
The Null will state: $H_0: \beta = 0$
Here $\beta$ is used as the symbol for the slope of a linear regression. This is the same symbol that is often used for a Type 2 error, so be careful and make sure you define the variable of interest. The Null will most often say $\beta = 0$, but a one-sided test ($\beta < 0$ or $\beta > 0$) is also possible.
The formula to compute the test statistic (t score) is
$t = \dfrac{b_1}{SE_{b_1}}$
where $b_1$ is the slope of the sample and $SE_{b_1}$ is the standard error of the slope.
The number of degrees of freedom is n-2 because two values (the slope and the intercept) are estimated from the data. The P-value will be computed using the t-distribution.
The assumptions
1. SRS of ____ from _____
2. The scatter plot looks linear
3. The residuals do not show a pattern
4. The residuals follow a NORMAL distribution
Conclusion
There is a _____% probability I would get a slope of ______ (or more extreme) given the true slope is 0. I ______ reject the Null at α = .05. (Include a statement that ties this in with the CONTEXT of the problem.)

Key Words in Advanced Placement Statistics Questions

If it says: Describe the distribution (numerical)
Then you must address: Center, Shape, Spread, Gaps, Outliers (G-SOCS)

If it says: Describe the distribution (Categorical)
Then you must address: Make sure the graph is appropriate

If it says: Compare the distributions
Then you must address:
   Compare each of the sets (2009 #1)
   Label and Scale
   If the data are not the same size then you must use relative frequency as the y axis
   Use words like bigger, smaller, wider
   You must compare the center, shape and spread
   When comparing the shape just list the types of distribution
   Be very careful using the word NORMAL: not all symmetric distributions are NORMAL but all NORMAL distributions are symmetric

If it says: Design an experiment
Then you must address:
   Random allocation of experimental units to treatments (RAT)
   Define what tool you will use to randomize
   Same size groups
   Clearly define groups
   Replication: enough experimental units
   Control: control for lurking variables (often by blocking)
   Draw picture
   Define what you will compare and give units
   Write in sentences

If it says: What additional information…
Then you must address: You must state NEW information, not what is previously stated

If it says: Based on the diagram above…
Then you must address: You must use the graph or diagram that is provided

If it says: Based on parts a and b…
Then you must address: You must use your answers from the previous parts

If it says: Give appropriate statistical evidence / Provide evidence / Conduct an appropriate analysis
Then you must address:
   Must perform a test of significance. Hypothesis test choices:
   o One sample t
   o Two sample t
   o One sample proportion z test
   o Two sample proportion z test
   o Chi Square
   o Linear Regression
   Must show all parts (Null & Alternative, name test, assumptions, test statistic, p value, df, conclusion)

If it says: In context of the question, drawing, table, graph, etc.
Then you must address: You must connect back to the stem of the problem

If it says: Explain the slope
Then you must address: ___ change in y is PREDICTED for every 1 unit change in x. CONTEXT. Fudge factor or wiggle word must be there

If it says: Explain the coefficient of determination
Then you must address: R2% of the variation in Y is explained by the LSRL of Y on X. CONTEXT

If it says: Explain “r”
Then you must address: Strength, direction, context

If it says: At most
Then you must address: Means x or less

If it says: No more than
Then you must address: Means x or less

If it says: At least
Then you must address: Means x or greater

If it says: Conclusion/linkage
Then you must address:
   P-value, alpha level, context, reject or not reject, conditional probability statement
   Never accept the alternative
   Never reject the alternative
   Never accept the Null

If it says: Predict
Then you must address: Use a linear regression

If it says: Confidence Interval
Then you must address: Do not make your interval special. You are ____% confident that the true ____ is between ___ and ____

If it says: Confidence Level
Then you must address: This is referring to the process and all the other intervals that were obtained. If I did this process again and again I would capture the TRUE ____ approximately ___% of the time in the various INTERVALS.

If it says: Underlined words
Then you must address: A few words are underlined; pay attention to what words the author chooses to underline. They direct you to the answer they are looking for.

If it says: Describe scatterplots
Then you must address: Strength, direction, placement of the data, outliers, CONTEXT

Assumptions and Conditions

Proportions (z), One Sample
Assumptions
1. Individuals are independent
2. Samples are sufficiently large
Conditions that support the assumptions
1. SRS of ____ from _____ and n < 10% of the population
2. np > 10 and n(1-p) > 10; this is the same as there are at least 10 successes and 10 failures in each sample

Proportions (z), Two Samples
Assumptions
1. Samples are independent
2. Data in each sample are independent
3. Both samples are sufficiently large
Conditions that support the assumptions
1. (Think about how the data was collected)
2. Both samples are SRS of ___ from ____ and n < 10% of the population OR Random Allocation of Treatment
3. np > 10 and n(1-p) > 10 for each sample, or there are at least 10 successes and 10 failures in each of the two samples

Means (t), One sample (df = n-1)
Assumptions
1. Individuals are independent
2. Population distribution is normal
If data is given you must show a normality plot or histogram and check the data
Conditions that support the assumptions
1. SRS of ___ from ____ and n < 10% of the population
2. For n < 15 the data comes from a normal population. For 15 < n < 30 the data has no outliers or skewness. For n > 30 this is considered sufficiently large and outliers could be a problem. The data does not have to be normal.

Means (t), Matched Pair (df = n-1)
Assumptions
1. Data are matched
2. Individuals are independent
3. Population of differences is Normal
If data is given you must show a normality plot or histogram and check the data of the differences
Conditions that support the assumptions
1. (Think about how the data was collected)
2. SRS of ___ from ___ and n < 10% of the population OR Random Allocation of Treatment
3. For n < 15 the data comes from a normal population. For 15 < n < 30 the data has no outliers or skewness. For n > 30 this is considered sufficiently large and outliers could be a problem. The data does not have to be normal.

Means (t), Two independent samples (df from technology)
Assumptions
1. Samples are independent
2. Data in each sample are independent
3. Both populations have a Normal distribution
4. Data is independent of each other
Conditions that support the assumptions
1. (Think about the design)
2. SRS of ___ from ___ and n < 10% of the population OR Random Allocation of Treatment for each of the samples
3. The data for each sample is approximately normal for n < 15. For 15 < n < 30 both data sets have no outliers or skewness. For n > 30 both data sets are considered sufficiently large and outliers could be a problem. The data does not have to be normal.
4. (Think about how the data was collected)

AP Statistics Writing Conclusions and Interpretations for Statistical Inference
Note: All conclusions must be connected to the context of the problem. They must include p-value, alpha level, reject or not reject, and must state given the Null.
Interpretation of R-square
______% of the variation in ______ is explained by the least squares regression line of ___ on _____ (y, y, x)

Interpretation of SLOPE
The ______ will change by APPROXIMATELY _____ as ____ increases by 1.
Example: y = 3 + 14x, y = mileage, x = gas
The mileage will increase by approximately 14 as the gas increases by 1 gallon.
You may also describe the change as PREDICTED instead of approximate.
Know how to read a computer printout. Know how to find the standard deviation of the residuals from the printout and how to interpret this in the context of the problem.

Interpretation of R
This is the correlation. It involves a LINEAR relationship and has strength and direction. You must address all three of these issues.

Residual
The distance Observed – Expected. Points above the line are an UNDER prediction and points below the line are an OVER prediction.

Confidence Interval
I am _____% confident that the true ______ is between ____ and ______

Confidence Level
If I repeated this process again and again I would capture the true _____ ___% of the time in the various intervals.

Conclusion for a Linear Regression Slope
With a p-value of ______ I will or will not reject the Null at the ___% level. I would get this t score of ____ OR LARGER, SMALLER OR BOTH DEPENDING ON THE PROBLEM by chance alone _____% of the time given that the TRUE slope was 0. You could also say given that x and y are independent.

Conclusion for one sample t test or z test
There is a _____ probability that I would get a result of ____ or (larger, smaller, or both) by chance alone given the true mean of ________(context)________ is ________. I will or will not reject the Null at the ___% level.
CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT

Conclusion for a two sample t test or z test
There is a ______ probability that I would get a difference of _______ or (larger, smaller, or both) by chance alone given the true difference in ________(context) is 0. I will or will not reject the Null at alpha equal to .05.
CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT

Conclusion for one sample proportion
There is a _____ probability that I would get a proportion of ___________ or (larger, smaller, or both) by chance alone given the true proportion of __________(context)____________ is _____. I will or will not reject the Null at alpha equal to .05.
CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT

Conclusion for two sample proportion
There is a _______ probability that I would get this proportion difference of _____ or LARGER by chance alone given the true difference _________________(context)___________ is 0. I will or will not reject the Null at the ___% level.
CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT

Conclusion for Chi Square TEST FOR INDEPENDENCE (You have one sample and are comparing two different categories within the one sample)
There is a _____ probability I would get a Chi Square value of ___ or larger. I will or will not reject the Null at alpha equal to .05.
There is a _____ probability I would get (give observed values here and context)
With a p-value of ____ I will or will not reject the Null at the __% level.
I would get a chi square value of ___ or larger by chance alone ___% of the time if the ___ and ___ were independent (or I could say if the ___ and ___ had no association).
CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT CONTEXT

Conclusion for Chi-Square for TEST OF GOODNESS OF FIT (you have one sample and are comparing the distribution to something that the manufacturer claims)
There is a ____ probability that I would get this Chi Square value ______ or larger by chance alone. I will or will not reject the Null at alpha equal to .05.
There is a ____ probability I would get (give the observed values and the context here) by chance alone given the true proportions were (give expected here with context).

Conclusion for Chi-Square for TEST OF HOMOGENEITY (you have two samples and are comparing the distributions of each)
With a p-value of _____ I will or will not reject the Null at the ___% level. I would get a chi square value of ___ or larger by chance alone ___% of the time given that the two (or more) distributions were the same.
CONTEXT

Here are a few pointers and reminders to help you do well on the AP Statistics Exam.

The Exam
The AP Stat exam has 2 sections that take 90 minutes each. The first section is 40 multiple choice questions, and the second section is 6 (technically, 4 to 7, but it’s always been 6) free response questions. Each section counts for half of the overall score. The last free response question counts for 25% of the Section II score. You are allowed to use your calculator(s) throughout the exam, and a standard set of formulas and tables is printed right in the test booklet for your use.
General tips for writing free response answers Understand your obligation as a test taker You are being evaluated not only on the correctness of your answers, but also on your ability to communicate the methods you used to reach them. The answer is everything you write down, not just the last line or number at the end. Convince the reader that you understand the key concepts in the question. Don’t just give them the numbers and hope they will assume you understand the concepts. Be smart about multi-part questions Most AP Stat questions have several parts. Read all the parts before you start answering and think about how they might be related (sometimes they aren’t). If the last part asks you to answer a question based on your results to the previous parts, be sure to actually use your prior results to answer. If you couldn’t do one of the previous parts, make up an answer and explain what you would have done. Answer the question you are asked The test writers spend over a year writing these questions. They word them carefully and specifically. Spend more time reading and less time writing to make sure you really understand what is being asked. When you have answered the question asked, stop writing. They give you much more space than you need. Don’t panic because you haven’t used all the space provided. Answer in context Most, if not all, AP Stat problems will have a real life context. Make sure your answers include the context. This is especially important when defining symbols/variables and writing conclusions. Use vocabulary carefully This isn’t English class. There’s no poetic license here. Terms like normal, independent, and sampling distribution have specific meanings. Don’t say “normal” if you mean “approximately normal” and don’t mix up populations and samples in either words or symbols. 
Leave enough time for the last question
The last free response question counts for more points and is designed to take 20 to 30 minutes. At least read it first, and if you feel OK about it, go ahead and answer. If it looks hard, you can save it for the end, but no matter what, when there are 30 minutes left in the test, stop and go to the last question.

Relax
Having met many of the people who write the exam and grading standards, I can assure you they are not out to trick you. They write challenging but straightforward questions designed to give you an opportunity to demonstrate what you have learned. Seize the opportunity and do your best. Keep in mind that you only need to earn roughly 65 to 70% (it varies from year to year) of the points on the exam to get a 5.

Collecting Data
There are 2 broad areas of data collection we cover in AP Stat, Experiments and Sampling. You are expected to know some general concepts and specific techniques related to each area.

Experiments vs. Samples
Many students confuse experimentation with sampling or try to incorporate ideas from one into the other. This is not totally off-base since some concepts appear in both areas, but it is important to keep them straight. The purpose of sampling is to estimate a population parameter by measuring a representative subset of the population. We try to create a representative sample by selecting subjects randomly using an appropriate technique. The purpose of an experiment is to demonstrate a cause and effect relationship by controlling extraneous factors. Experiments are rarely performed on random samples because both ethics and practicality make it impossible to do so. For this reason, there is always a concern of how far we can generalize the results of an experiment. Generalizing results to a population unlike the subjects in the experiment is very dangerous.

Blocking vs. Stratifying
Students (and teachers) often ask, "What is the difference between blocking and stratifying?" The simple answer is that blocking is done in experiments and stratifying is done with samples. There are similarities between the two, namely the dividing up of subjects before random assignment or selection, but the words are definitely not interchangeable.

Blocking
In blocking we divide our subjects up in advance based on some factor we know or believe is relevant to the study and then randomly assign treatments within each block. The key things to remember:
1. You don't just block for the heck of it. You block based on some factor that you think will impact the response to the treatment.
2. The blocking is not random. The randomization occurs within each block, essentially creating 2 or more miniature experiments.
3. Blocks should be homogeneous (i.e. alike) with respect to the blocking factor.
For example, I want to find out if playing classical music during tests will result in higher mean scores. I could randomly assign half my students to the room with the music and the other half to the normal room, but I know that my juniors consistently score higher than my seniors, and I want to account for this source of variation in the results. I block according to grade by separating the juniors and seniors first and then randomly assigning half the juniors to the music room and the other half to the normal room. I do the same with the seniors. For this design to be valid, I have to expect that each grade will respond to the music similarly. In other words, I know that juniors will score higher, but I expect to see a similar improvement or decline in both groups as a result of having the music. At the end of my study I can subtract out the effect of grade level to reduce the unaccounted for variation in the results.
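The music example above can be sketched as a short randomization routine. This is a minimal illustration, not a prescribed method: the function name, the placeholder student labels, and the seed are all assumptions made for the sketch. The point it shows is that the blocks themselves are fixed by grade, and the random assignment happens separately inside each block.

```python
import random

def assign_within_blocks(blocks, seed=None):
    """Randomly send half of each block to 'music' and the rest to 'normal'."""
    rng = random.Random(seed)
    assignment = {}
    for block_name, students in blocks.items():
        shuffled = students[:]   # copy so the original roster is not modified
        rng.shuffle(shuffled)    # randomization happens inside the block
        half = len(shuffled) // 2
        for student in shuffled[:half]:
            assignment[student] = (block_name, "music")
        for student in shuffled[half:]:
            assignment[student] = (block_name, "normal")
    return assignment

# Hypothetical rosters: the labels J1..J10 and S1..S8 are placeholders
blocks = {
    "juniors": [f"J{i}" for i in range(1, 11)],
    "seniors": [f"S{i}" for i in range(1, 9)],
}
assignment = assign_within_blocks(blocks, seed=2008)

for student, (block, room) in sorted(assignment.items()):
    print(student, block, room)
```

Each grade ends up split evenly between the two rooms, so a difference between juniors and seniors cannot be mistaken for an effect of the music.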
You have learned how to analyze the results of one special type of blocked design, namely, matched pairs. In matched pairs you subtract each pair of values, which eliminates the variation due to the subject. Similar techniques are available for fancier blocked designs.

Stratified Sampling vs. Cluster Sampling
Many students confuse stratified and cluster sampling since both of them involve groups of subjects. There are 2 key differences between them. First, in stratified sampling we divide up the population based on some factor we believe is important, but in cluster sampling the groups are naturally occurring (I picture schools of fish). Second, in stratified sampling we randomly select subjects from each stratum, but in cluster sampling we randomly select one or more clusters and measure every subject in each selected cluster. (Note: There are more advanced techniques in which samples are taken within the cluster(s).)

Final Thoughts
It is especially important to stay focused when answering questions about design. Too many students get caught up in minor details but miss the big ideas of randomization and control. Always remember that your mission in responding to questions is to demonstrate your understanding of the major concepts of the course.

Describing Data

IQR is a number
Many students write things like "The IQR goes from 15 to 32". Every AP grader knows exactly what you mean, namely, "The box in my boxplot goes from 15 to 32", but this statement is not correct. The IQR is defined as Q3 - Q1, which gives a single value. Writing the statement above is like saying "17 goes from 15 to 32." It just doesn't make sense.

Be able to construct graphs by hand
You may be asked to draw boxplots (including outliers), stemplots, histograms, or other graphs by hand. The test writers have become very clever and present problems in such a way that you cannot depend on your calculator to graph for you.

Label, Label, Label
Any graph you are asked to draw should have clearly labeled axes with appropriate scales. If you are asked to draw side-by-side boxplots, be sure to label which boxplot is which.

Refer to graphs explicitly
When answering questions based on a graph(s), you need to be specific. Don't just say, "The female times are clearly higher than the male times." Instead say, "The median female time is higher than the first quartile of the male times." You can back up your statements by marking on the graph. The graders look at everything you write, and, often, marks on the graph make the difference between 2 scores.

Look at all aspects of data
When given a set of data or summaries of data, be sure to consider the Center, Spread, Shape, and Outliers/Unusual Features. Often a question will focus on one or two of these areas. Be sure to focus your answer to match.

It's skewed which way?
A distribution is skewed in the direction that the tail goes, not in the direction where the peak is. This sounds backwards to most people, so be careful.

Slow down
The describing data questions appear easy, so many students dive in and start answering without making sure they know what the problem is about. Make sure you know what variable(s) are being measured and read the labels on graphs carefully. You may be given a type of graph that you have never seen before.

Inference

Not every problem involves inference
You have spent most if not all of this semester on inference procedures. This leads many students to try to make every problem an inference problem. Be careful not to turn straightforward probability or normal distribution questions into full-blown hypothesis tests.
Hypotheses are about populations The point of a hypothesis test is to reach a conclusion about a population based on a sample from it. We don't need to make hypotheses about the sample. When writing hypotheses, conclusions, and formulas, be careful with your wording and symbols so that you do not get the population and sample mixed up. For example, don't write "Ho: x = 12" or "µ = mean heart rate of study participants". Check Assumptions/Conditions Checking assumptions/conditions is not the same thing as stating them. Checking means actually showing that the assumptions are met by the information given in the problem. For example, don't just write "np>10". Write "np=150(.32)=48>10". Everyone knows you can do the math in your head or on your calculator, but writing it down makes it very clear to the reader that you're tying the assumption to the problem rather than just writing a list of things you memorized. Confidence intervals have assumptions too Confidence intervals have the same assumptions as their matching tests, and you need to check them just as carefully. Link conclusions to your numbers Don't just say "I reject Ho and conclude that the mean heart rate for males is greater than 78." This sentence doesn't tell us why you rejected Ho. Instead, say "Since the p-value of .0034 is less than .05, I reject Ho and ...” Be consistent Make sure your hypotheses and conclusion match. If you find an error in your computations, change your conclusion if necessary. Even if your numbers are wrong, you will normally get credit for a conclusion that is correct for your numbers. If you get totally stuck and can't come up with a test statistic or p-value, make them up and say what you would conclude from them. 
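The difference between stating and checking a condition, described under "Check Assumptions/Conditions" above, can be illustrated with a small sketch that prints the worked arithmetic rather than just the inequality. The function name and the default threshold argument are assumptions for illustration; the numbers n = 150 and p = .32 are the ones used in the text.

```python
def check_large_counts(n, p, minimum=10):
    """Show the work for np > minimum and n(1-p) > minimum; return True if both hold."""
    successes = n * p
    failures = n * (1 - p)
    # Print the substitution and the result, the way it should appear on paper
    print(f"np = {n}({p:g}) = {successes:g} > {minimum}: {successes > minimum}")
    print(f"n(1-p) = {n}({1 - p:g}) = {failures:g} > {minimum}: {failures > minimum}")
    return successes > minimum and failures > minimum

# Values from the example in the text
ok = check_large_counts(150, 0.32)
```

On the exam the same idea applies by hand: write the substitution and the resulting number, then the comparison, for both successes and failures.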
Interpreting a confidence interval is different than interpreting the confidence level
Interpreting the confidence interval usually goes something like, "I am 95% confident that the proportion of AP Statistics students who are highly intelligent is between 88% and 93%" or "The superintendent should give seniors Fridays off since we are 99% confident that between 72% and 81% of parents support this plan."
Interpreting a confidence level usually goes something like "If this procedure were repeated many times, approximately 95% of the intervals produced would contain the true proportion of parents who support the plan."

Regression

Graph First, Calculate Later
The most important part of the regression process is looking at plots. Regression questions will frequently provide a scatterplot of the original data along with a plot of residuals from a linear regression. Look at these plots before answering any part of the question and make sure you understand the scales used.

Is it linear?
Remember that an r value is only useful for data we have already decided is linear. Therefore, an r value does not help you decide if data is linear. To determine if data is linear, look at a scatterplot of the original data and the residuals from a linear regression. If a line is an appropriate model, the residuals should appear to be randomly scattered.

Computer Output
It is very likely that you will be given computer output for a linear regression. If you can read the output correctly, these questions are normally easy. You should be able to write the regression equation using the coefficients in the output and also be able to find the values of r and r². Most software packages provide the value of r². If you are asked for the value of r, you will need to take the square root and look at the slope to determine if r should be positive or negative.

Interpreting r
If asked to interpret an r value, be sure to include strength, direction, type, and the context. A good interpretation will be something like, “There is a weak positive linear relationship between the number of math classes a person has taken and yearly income.”

After you make a 5, be sure to take more statistics in college.
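As a small illustration of the "Computer Output" advice above, recovering r from r² takes a square root plus the sign of the slope. The function name and the printout values below are hypothetical, chosen only for this sketch; note that many printouts report R-square as a percentage, which has to be converted to a proportion first.

```python
import math

def r_from_r_squared(r_squared, slope):
    """Return r: the square root of r-squared, carrying the sign of the slope."""
    return math.copysign(math.sqrt(r_squared), slope)

# Hypothetical printout values: R-square = 64%, slope coefficient = -2.5
r_squared = 64 / 100          # convert the percentage to a proportion
r = r_from_r_squared(r_squared, slope=-2.5)
print(f"r = {r:.2f}")
```

Because the slope in this made-up printout is negative, r is negative as well, even though the square root alone would give a positive number.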