AP Statistics Chapter 12 Note-Taking Guide More about regression 3 Review Name _______________________ Per ___ Date _________________ Scatterplots and Linear Regression Scatterplots allow us to visualize the relationship between an _____________ variable (x-axis) and a ___________________ variable (y-axis). r Correlation (r) allows us to describe how closely the data fits a ___________________pattern. Correlation values are always between _______________. x If r =____, then the data forms a straight line with a negative slope. x If r = ____, the data is truly random pattern and does not come close to forming a line. x If r = ____, then the data forms a straight line with a positive slope. Least-Squares Regression Line A regression line takes the data and summarizes the relationship between the explanatory and response variable. It is usually used to predict the value of the _________________variable. Forms for the regression line: y a bx y D Ex y b0 b1 x Where a, D , b0 represent the y-intercept x Interpretation of y-intercept: _____________ value when __________________value is 0 Where b, E , b1 represent the slope of the line x Interpretation of slope: amount the _____________ value changes for each increase of 1 in _________________ value A Residual is the difference in the ______________ response value and the ___________ response value for a certain explanatory value. The purpose of the regression line is to make the square of the residuals (added up) as ________________ as possible. Residual plots tell us if a _______________ model for the prediction equation is appropriate. Does the data really show a linear pattern or does it have more of a curved pattern? s The standard deviation of the residuals (s) allows us to examine the _______________________ _________________ for our regression line, acts almost like a margin of error. r2 Correlation squared (r2) is called the coefficient of determination. It is the proportion of the response value that can be _________________ to the explanatory value alone. Other factors could affect the response value so it can also be looked at as the “accuracy” of the prediction equation. Interpretation: “The regression line accounts for __% of the variation in the (response variable).” 12.1A Many people believe that students learn better if they sit closer to the front of the classroom. Does sitting closer cause higher achievement, or do better students simply choose to sit in the front? To investigate, an AP Statistics teacher randomly assigned students to seat locations in his classroom for a particular chapter and recorded the test score for each student at the end of the chapter. The explanatory variable in this experiment is which row the students were assigned (row 1 is closest to the front and row 7 is the farthest away). Here are the results, including a scatterplot and least-squares regression line: Row 1: 76, 77, 94, 99 y 86 1.12 x Row 2: 83, 85, 74, 79 Row 3: 90, 88, 68, 78 Row 4: 94, 72, 101, 70, 79 Row 5: 76, 65, 90, 67, 96 Row 6: 88, 79, 90, 83 Row 7: 79, 76, 77, 63 1. Interpret the slope of the least-squares regression line in this context. 2. Interpret the y-intercept of the least-squares regression line in this context. 12.1A In this section we will analyze what happens to the slope of the regression line with different samples. x We start with a population that we know the data for. x We sample many, many times and record the slopes of the regression line for each sample. x We analyze the sampling distribution for patterns and relationships. What do we find? 1. The shape of the sampling distribution is close to ____________________. 2. The mean of the sampling distribution is close to the __________________ of the population. b b1 E Pb 3. The standard deviation of the sampling distribution is close to the ___________ standard deviation of the population. Vb 12.1A V sx n 1 From this we can perform inference tests (confidence intervals and significance tests) for the slope of a regression line. To do this we have to meet the conditions for the sampling distribution of a regression line. x Linear: The actual relationship between the two variables is ____________. We analyze the original _____________________ of the sample (overall pattern is roughly linear) and the _______________ (look for a random placement, any pattern suggests that it is not linear in nature). x Random: comes from a well-designed random _________ or a randomized _____________. x Normal: Make a stemplot, histogram, or normal probability plot of the residuals. Check for clear ______________________or other major departures from normality. x Equal variance: Analyze the residual plot, is the __________________ of scatter roughly the same for each possible value of x. x Independent: How was the data produced? Random sampling and random assignment indicates __________________ observations. Sampling without replacement requires using the ______________________ 12.1A Problem: Check whether the conditions for performing inference about the regression model are met. Solution: x Linear x Random x Normal x Equal variance x 12.1B Independent Standard error of slope Since we rarely know the standard deviation of the population, we alter our standard deviation formula to create a standard error formula. Where x s is the standard deviation of the residuals (if s is larger so is the standard error, which means more scatter and lower correlation) x sx is the standard deviation of the x variable (if sx is larger than standard error gets smaller) x n is the sample size (as n gets larger than standard error gets smaller) Interpreting standard error of slope: How far we expect our ______________ to be from the truth in repeated random sampling or repeated random assignment. The substitution of the s value for σ, causes our distribution to become a _______________ with a df = Confidence Interval for Slope 12.1B State: “We will estimate the true slope β of the population regression line relating __(context)___ to __(context)___ at a _____ confidence level” Plan: Use a t-interval if the conditions are met x Linear? Scatterplot looks linear (no curve) and residual plot shows no pattern x Random condition? The data are produce by a random sample of size ______ from population of interest 2 OR group of size ___________ in a randomized experiment. x Normal condition? ¾ Create histogram, boxplot, or normal probability plot of the ________________. ¾ Look for skewness or outliers x Equal Variance? Looking at the residual plot, is the amount of scatter above and below the line, the amount of scatter should be roughly the same from the smallest x-value to the highest x-value x Independence condition? individuals are independent or meet the_____ condition. Do: If conditions are met, perform calculations x Compute linear regression on calculator: ¾ Find the slope of the regression line ¾ 1-var stats of x list (Sx) ¾ Calculate s ¦ resid 2 n2 ¾ Calculate SEb x x x OR use regression output provided. Compute t*= invT ((1-confidence then ÷ 2), df = n - 2) Compute the interval (show work) Conclude: Interpret the results of your test in the context of the problem. x We are ___% confident that the interval from ___ to ___ captures the actual slope β of the population regression line relating __(context)___ to __(context)____ 12.1B Earlier, we used Minitab to analyze the results of an experiment designed to see if sitting closer to the front of a classroom causes higher achievement. We checked the conditions for inference earlier. Here is a scatterplot of the data and some output from the analysis. n = 30 Problem: (a) State the equation of the least-squares regression line. Define any variables you use. (b) Identify the standard error of the slope SEb from the computer output. Interpret this value in context. (b) Calculate the 95% confidence interval for the true slope. Show your work. State: We will estimate the true slope β of the population regression line relating ______________ to ____________________________ at a _____ confidence level Plan: Use a t-interval for the slope, conditions were met in earlier in our notes. Do: Conclude: We are ___% confident that the interval from ___ to ___ captures the actual slope β of the population regression line relating _____________________ to _______________________ 12.1B For their second-semester project, two AP Statistics students decided to investigate the effect of sugar on the life of cut flowers. They went to the local grocery store and randomly selected 12 carnations. All the carnations seemed equally healthy when they were selected. When they got home, the students prepared 12 identical vases with exactly the same amount of water in each vase. They put one tablespoon of sugar in 3 vases, two tablespoons of sugar in 3 vases, and three tablespoons of sugar in 3 vases. In the remaining 3 vases, they put no sugar. After the vases were prepared and placed in the same location, the students randomly assigned one flower to each vase and observed how many hours each flower continued to look fresh. Here are the data: Sugar 0 0 0 1 1 1 2 2 2 3 3 3 Freshness 168 180 192 192 204 204 204 210 210 222 228 234 (a) Construct and interpret a 99% confidence interval for the slope of the true regression line. State: We will estimate the true slope β of the population regression line relating ______________ to ____________________________ at a _____ confidence level Plan: Use a t-interval for the slope, conditions Linear: Random: Normal: Equal Variance: Independent: Do: Conclude: We are ___% confident that the interval from ___ to ___ captures the actual slope β of the population regression line relating _____________________ to _______________________ (b) Would you feel confident predicting the hours of freshness if 10 tablespoons of sugar are used? Explain. . 12.1C Significance Test for Slope 12.1C State: Test the hypothesis true slope of the regression line relating __(context)___ and __(context)___ is 0 at a ____ significance level. x H0 : β = 0 x Ha : β > 0, β < 0, β ≠ 0 Where β is the true slope the regression line Plan: We will use a t test for slope if the conditions are met x Linear? Scatterplot looks linear (no curve) and residual plot shows no pattern x Random condition? The data are produce by a random sample of size ______ from population of interest OR by a group of size ___________ in a randomized experiment. x Normal condition? ¾ Create histogram, boxplot, or normal probability plot of the ________________. ¾ Look for skewness or outliers x Equal Variance? Looking at the residual plot, is the amount of scatter above and below the line, the amount of scatter should be roughly the same from the smallest x-value to the highest x-value x Independence condition? individuals are independent or meet the_____ condition. Do: x Compute linear regression on calculator: ¾ Find the slope of the regression line ¾ 1-var stats of x list (Sx) ¾ Calculate s ¦ resid 2 n2 ¾ Calculate SEb x x OR use regression output provided. Compute the t-statistic x Calculate the p-value (_______) by calculating the __________ of getting a t statistic this large or larger in the direction specified by the alternative hypothesis Ha: Conclude: Compare the p-value to the _________________________. x ¾ If P-value is less than α, we ¾ If P-value is not less than α, we Interpret those results in context ¾ We reject H0, so we _________________ convincing evidence of Ha ¾ We Fail to reject H0, so we _________________ convincing evidence of Ha 12.1C Do customers who stay longer at buffets give larger tips? Charlotte, an AP statistics student who worked at an Asian buffet, decided to investigate this question for her second semester project. While she was doing her job as a hostess, she obtained a random sample of receipts, which included the length of time (in minutes) the party was in the restaurant and the amount of the tip (in dollars). Do these data provide convincing evidence that customers who stay longer give larger tips? Here is the data: Time (minutes) 23 39 44 55 61 65 67 70 74 85 90 99 Tip (dollars) 5.00 2.75 7.75 5.00 7.00 8.88 9.01 5.00 7.29 7.50 6.00 6.50 (a) What is the equation of the least-squares regression line for predicting the amount of the tip from the length of the stay? Define any variables you use. (b) Carry out an appropriate test to answer Charlotte’s question. State: Test the hypothesis true slope βof the regression line relating time spent at table and tip amount is 0 at a ____ significance level. x H0 : β = 0 x Ha : β ____ 0 Where β is the true slope the regression line Plan: If the conditions are met we will do a t test for the slope E . x Linear x Random x Normal x Equal variance x Independent Do: Conclude: Since our P-value _________ less than __________, we ______________ the null hypothesis. We __________ have convincing evidence that customers that spend more time at a table leave larger tips. (c) How does the provided data below compare to what you calculated above? 12.1C On the calculator LinRegTTest or LinRegTInt ****Must have ______________________________________________________ 12.2 What happens if my scatterplot forms a curve? 12.2A What happens if the scatterplot does not show a linear pattern? x We try to transform the data so that it straightens a _________________ pattern. We can then use the regression line to make _____________________. x If the conditions for inference are met, we can _______________ or ___________ a claim about the slope of the true population regression line. 12.2A Transforming with powers and roots. How do we transform for linearity? x Pick a power and its inverse to try: x Raise the values of one variable to the ___________ and graph the scatterplot x Root the value of one variable and _____________ the scatterplot. x If they both look ____________, graph the __________________ to determine which transformation works the best. x 12.2A If you don’t know what power to choose, ________ until you find a transformation that works. For our purposes we will x identify from a ___________________________ what transformation was performed, x write the _________________________ for the transformed regression line, x make ________________________ from the regression line, x identify which _____________________will be best based on transformed graphs 12.2A Power model: y ax p or y Examples of power models: § © S · x2 ¸ 4 ¹ x Area of a pizza related to diameter ¨ y x Distance that an object dropped from a given height related to time y x Time it takes a pendulum to complete one back-and-forth swing related to length of pendulum y x 12.2A p a x or a x ax 2 ax1/2 § © Intensity of a light bulb related distance from the bulb ¨ y a x2 · ax 2 ¸ ¹ What does a country’s under-5 child mortality rate (per 1000 live births) tell us about the income per person for residents of that country? Here are the data and a scatterplot for a random sample of 14 countries in 2009 What does the scatterplot show us? We applied the transformation of and got the following graph. 1 mortality We applied the transformation of and got the following graph. Which graph looks more appropriate and why? 1 income Here is Minitab output from separate regression analyses of the two sets of transformed data. Problem: Do the following for both transformations (a) Give the equation of the least-squares regression line. Define any variables you use. (b) Predict the income per person for Turkey, which had a child mortality rate of 20.3. (c) Which regression is equation is the best fit? Explain your answer. 12.2B Transforming with logarithms Not all curved relationships are described by power models. Logarithm model y = a + b logx Exponential model y = abx Using logarithms to transform an exponential model: x Log the explanatory values ____________ the response values x Graph the points (log x, y) or (x, log y) x Does it display a linear pattern? If so, check the _____________________ to confirm. A student opened a bag of M&M’s, dumped them out, and ate all the ones with the M on top. When he finished, he put the remaining 30 M&M’s back in the bag and repeated the same process over and over until all the M&M’s were gone. Here is a table and scatterplot showing the number of M&M’s remaining at the end of each “course”. Course M&M’s remaining 1 30 2 13 3 10 4 3 5 2 6 1 7 0 (a) A scatterplot of the natural log of the number of M&M’s remaining versus course number is shown. The last observation in the table is not included since ln(0) is undefined. Explain why it would be reasonable to use an exponential model to describe the relationship between the number of M&M’s remaining and the course number. (b) Minitab output from a linear regression analysis on the transformed data is shown below. Give the equation of the least-squares regression line defining any variables you use. Regression Analysis: LnRemaining versus Course Predictor Constant Course Coef 4.0593 -0.68073 S = 0.198897 SE Coef 0.1852 0.04755 R-Sq = 98.1% T 21.92 -14.32 P 0.000 0.000 R-Sq(adj) = 97.6% (c) Use your model from part (b) to predict the original number of M&M’s in the bag. (d) A residual plot of the linear regression in part (b) is shown below. Discuss what this graph tells you about the appropriateness of the model. 0.3 0.2 Residual 12.2B 0.1 0.0 -0.1 -0.2 -0.3 1 2 3 4 Course 5 6 12.2B Which transformation do we choose? Here is a shortcut for you! x ______________ Model – If you log both variables and the graph is relatively linear. x ______________ Model – If you log one variable and the graph is relatively linear. 12.2B What type of model is graph (a)? What type of model is graph (b)? Which model would be most appropriate based on the transformations? 12.2B Rheumatoid Arthritis patients are often treated with large quantities of aspirin. The concentration of aspirin in the bloodstream increase for a period of time after the drug is administered and then decreases in such a way that the amount of aspirin remaining is a function of the amount of time that has elapsed since peak concentration. The table shows the data for a particular arthritis patient who has taken a large dose of aspirin. Use this data to predict how much aspirin remains in the patient’s bloodstream 12 hours after the 15mg reading. Hours Concentration 0 15.0 1 12.5 2 10.5 3 8.70 4 7.35 5 6.10 6 5.10 7 4.30 8 3.60 The scatterplot tell us ____________________________________________________________. We transform the data three different ways: a.) ln of hours b.) ln of blood concentration What type of model is each? c.) ln of both Here is the computer regression analysis of each transformation, write the regression equation for each: a.) b.) c.) For each equation, predict the blood concentration of aspirin 15 hours after peak concentration. a.) b.) c.) The residual plots for each regression analysis is below: Which is the most appropriate regression equation?