Chapter 8 homework
Name/period __________________

Chapter notes: linear regression

If there is a linear relationship between two variables, we can _______ the relationship with a line and give its equation. The linear model (called the least squares regression line) is just an ____________ of a straight line through the data. The modeled line cannot go through all of the points on the scatterplot and still be linear, but it can summarize the general pattern of the relationship. Remember, models are not perfect; they don’t match reality exactly.

Once we have a modeled line (equation), we can predict the response variable (y) using the explanatory variable (x). The estimate for the y-variable is called the _________ __________, and is denoted as _____. The difference between the observed y and the predicted y is called a residual:

\[ \text{residual} = y - \hat{y} \]

The least squares regression line is the “best fit” for the data, meaning that the line passes through the scatterplot in the way that minimizes the sum of the squared residuals. The line does not have to touch any of the points; it is the one line for which the squared residuals add up to the smallest possible total.

Example

Correlation and the line

This section is optional reading. It will help build a better understanding of regression if you read it, but it is not required. Skip to page 176 (The regression line in real units).

We know the line of best fit is called the least squares regression line (LSRL). The LSRL is an equation similar to what you learned in Algebra class. Back then, the equation was in slope-intercept form, \(y = mx + b\). In Statistics, it’s the same thing, but the equation looks different:

\[ \hat{y} = b_0 + b_1 x \]

Notice that the y-intercept is \(b_0\) and the slope is \(b_1\). In your calculator, the y-intercept is a and the slope is b. This equation is used to predict y values for a given x value.
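As a minimal sketch of how a prediction and a residual are computed from a regression equation, here is a short Python example. The intercept, slope, and data point are made-up numbers chosen only for illustration; they do not come from the textbook or the homework, and the function names `predict` and `residual` are just labels for this sketch.

```python
# Hypothetical model, for illustration only: y-hat = 2.0 + 0.5 * x
b0 = 2.0  # y-intercept (assumed value)
b1 = 0.5  # slope (assumed value)


def predict(x):
    """Return the predicted value y-hat = b0 + b1 * x."""
    return b0 + b1 * x


def residual(x, y):
    """Return the residual: observed y minus predicted y."""
    return y - predict(x)


# One made-up observation: x = 10, observed y = 8
x_obs, y_obs = 10.0, 8.0

y_hat = predict(x_obs)          # 2.0 + 0.5 * 10.0
resid = residual(x_obs, y_obs)  # observed minus predicted

print("predicted y:", y_hat)
print("residual:", resid)
```

A positive residual means the observed value is above the line (the model underestimated); a negative residual means the observed value is below the line (the model overestimated).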
The formula for the slope: _____________________
The formula for the y-intercept: __________________

These formulas are given on your formula sheet, but you will never have to use them: the calculator will always find them for you, or you will be given the slope and y-intercept in a computer output page. You do have to know what they both mean in context.

The slope is ____________________________________________________________
The units for slope are ________________________________________________
The y-intercept is ______________________________________
The units of the y-intercept are ____________________________

*******It is very important that you understand how to define the slope and y-intercept in context perfectly. Do the example on page 178 on your own.

Calculating a regression equation.

Residuals are also very important. The formula for the residual is \(\text{residual} = y - \hat{y}\) (observed y minus predicted y).

Residual plots: A residual plot is a very important idea in regression analysis. A residual plot is very similar to a normal scatterplot of x and y. Basically, it is a scatterplot where the y-values are the residuals and the x-values stay the same. Sometimes the y-values are the residuals and the x-values are the predicted values. You should be comfortable reading both kinds.

The residual plot’s main purpose is to determine if the original data (x and y) have ______________________ form. This happens if the residual plot has ______________________ form. We can also use the residual plot to find residuals, raw y-values, or predicted y-values. For example:

The standard deviation of the residuals, \(s_e\): The standard deviation of the residuals measures how much the points are spread around the regression line. That spread is measured by the residuals, so \(s_e\) is simply the standard deviation of the residuals. We always want the smallest possible residuals in a regression analysis, so we want \(s_e\) to be small too.
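To make the residual plot and the standard deviation of the residuals more concrete, here is a small Python sketch. The five data points are made up for illustration, and the line y-hat = 2.2 + 0.6x is assumed to be the least squares line for them. The sketch also assumes the common n - 2 divisor when computing s_e; check your formula sheet for the version used in class.

```python
import math

# Made-up data for illustration only (not from the homework).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# Assumed fitted line for these points: y-hat = 2.2 + 0.6 * x
b0, b1 = 2.2, 0.6

predicted = [b0 + b1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]

# Points for the two kinds of residual plot:
# residuals against x, and residuals against predicted values.
plot_vs_x = list(zip(xs, residuals))
plot_vs_predicted = list(zip(predicted, residuals))

# Standard deviation of the residuals, using an n - 2 divisor (assumption).
n = len(xs)
s_e = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

for x, e in plot_vs_x:
    print(f"x = {x:.0f}, residual = {e:+.1f}")
print(f"s_e = {s_e:.3f}")
```

For these made-up points the residuals add up to zero (apart from rounding) and s_e comes out to about 0.89, meaning a typical point sits a little under one unit away from the line.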
The residuals should also have a symmetric, unimodal shape (which is consistent with a roughly Normal distribution).

Example

R-squared: \(R^2\) is the only definition that you have to memorize without a full understanding of the concept. I will try my best. Here’s the definition:

\(R^2\) is the ____________________________________________

Regression assumptions

Homework

1. Cereals. For many people, breakfast cereal is an important source of fiber in their diets. Cereals also contain potassium, a mineral shown to be associated with maintaining a healthy blood pressure. An analysis of the amount of fiber (in grams) and the potassium content (in milligrams) in servings of 77 breakfast cereals produced the regression model \(\widehat{\text{Potassium}} = 38 + 27\,\text{Fiber}\). If your cereal provides 9 grams of fiber per serving, how much potassium does the model estimate you will get?

2. Horsepower. In Chapter 7’s Exercise 33 we examined the relationship between the fuel economy (mpg) and horsepower for 15 models of cars. Further analysis produces the regression model \(\widehat{\text{mpg}} = 46.87 - 0.084\,\text{HP}\). If the car you are thinking of buying has a 200-horsepower engine, what does this model suggest your gas mileage would be?

3. More cereal. Exercise 1 describes a regression model that estimates a cereal’s potassium content from the amount of fiber it contains. In this context, what does it mean to say that a cereal has a negative residual?

4. Horsepower again. Exercise 2 describes a regression model that uses a car’s horsepower to estimate its fuel economy. In this context, what does it mean to say that a certain car has a positive residual?

5. What slope? If you create a regression model for predicting the Weight of a car (in pounds) from its Length (in feet), is the slope most likely to be 3, 30, 300, or 3000? Explain.

6. Real Estate. A random sample of records of sales of homes from Feb. 15 to Apr. 30, 1993, from the files maintained by the Albuquerque Board of Realtors gives the Price and Size (in square feet) of 117 homes.
A regression to predict Price (in thousands of dollars) from Size has r = 0.845.
a) What are the variables and units in this regression? (List explanatory and response.)
b) What units does the slope have?
c) Do you think the slope is positive or negative? Explain.

7. More real estate. Consider the Albuquerque home sales from Exercise 6 again. The regression analysis gives the model \(\widehat{\text{Price}} = 47.82 + 0.061\,\text{Size}\).
a) Explain what the slope of the line says about housing prices and house size.
b) What price would you predict for a 3000-square-foot house in this market?
c) A real estate agent shows a potential buyer a 1200-square-foot home, saying that the asking price is $6000 less than what one would expect to pay for a house of this size. What is the asking price, and what is the $6000 called?

8. More slope practice. Refer to questions 1 and 2. Explain what the slope and y-intercept mean in context.
a) \(\widehat{\text{Potassium}} = 38 + 27\,\text{Fiber}\) (Potassium is in mg and Fiber is in grams)
b) \(\widehat{\text{mpg}} = 46.87 - 0.084\,\text{HP}\)

9. Birthrates 2005. The table shows the number of live births per 1000 women aged 15-44 years in the US, starting in 1965. (National Center for Health Statistics, www.cdc.gov/nchs/)

Year   Rate
1965   19.4
1970   18.4
1975   14.8
1980   15.9
1985   15.6
1990   16.4
1995   14.8
2000   14.4
2005   14.0

a) Make a scatterplot and describe the association and what the scatterplot tells us overall. (Enter Year as years since 1900: 65, 70, 75, etc.)
b) Find the least squares regression equation.
c) Interpret the slope in context.
d) The table gives rates only at 5-year intervals. Estimate what the rate was in 1978.
e) In 1978, the birthrate was actually 15. How close did your model come?
f) Find the residual for the year 1980.

10. Cereals again. The correlation between a cereal’s fiber and potassium contents is r = 0.903. What percent of the variability in potassium is accounted for by the amount of fiber that servings contain?

11.
Residuals. Tell what each of the residual plots below indicates about the appropriateness of the linear model that was fit to the data. (page 193, #11)

12. Real estate again. The regression of Price and Size of homes in Albuquerque had \(R^2 = 71.4\%\).
a. Write a sentence (in context) summarizing what the \(R^2\) says about this regression.
b. What is the correlation between Price and Size? Explain why you chose + or -.

13. Cereal again. The correlation between a cereal’s fiber and potassium contents is r = 0.903. What percent of the variability in potassium is accounted for by the amount of fiber that servings contain?

14. Last Cereal. For the cereal regression model predicting potassium content (in mg) from the amount of fiber (in g) in breakfast cereals, \(s_e = 30.77\). Explain in this context what this means.

15. Cigarettes. Is the nicotine content in a cigarette related to the “tars”? A collection of data (in milligrams) on 29 cigarettes produced the scatterplot, residual plot, and regression analysis shown on page 194 (#27).
a. Do you think a linear model is appropriate here? Explain.
b. Explain the meaning of \(R^2\) in this context.
c. What is the correlation between Tar and Nicotine?

16. Last Cigarette. Take another look at the regression analysis of tar and nicotine content of the cigarettes in problem 15 (page 194, #27).
a. Write the equation of the least squares regression line.
b. Estimate the nicotine content of cigarettes with 4 milligrams of tar.
c. Find the slope of the regression line and interpret its meaning in context. (Be sure to include the units of the slope.)
d. What does the y-intercept mean in context?
e. If a new brand of cigarette contains 7 mg of tar and a nicotine level whose residual is -0.5, what is the actual nicotine content?

17. Online Clothes. An online retailer keeps track of its customers’ purchases. For those customers who signed up for the company credit card, the company also has information on the customer’s Age and Income.
A random sample of 500 of these customers showed the following scatterplot of Total Yearly Purchases and Age: (scatterplot is shown on page 196, top left corner)

The correlation between Total Yearly Purchases and Age is r = 0.037. Summary statistics for the two variables are:

                         Mean          SD
Age                      29.67 years   8.51 years
Total Yearly Purchases   $572.52       $253.62

a. What is the linear regression equation for predicting Total Yearly Purchases from Age?
b. Do the assumptions and conditions for regression appear to be met?
c. What is the predicted average Total Yearly Purchases for an 18-year-old? A 50-year-old?
d. What percent of the variability in Total Yearly Purchases is accounted for by Age?
e. Do you think the regression might be useful for this company? Explain.
f. Here \(s_e\) is $175. Explain what this number means in context.

For the last 4 problems, put them on your own sheet(s) of paper and staple them to this packet. If you can do these problems with minimal help, then you are on track to do well on the next test. I want to see a lot of effort and good writing on these problems.

1. #37 (a-g) on page 196
2. #41 (a-e) on page 197
3. #43 (a-f) on page 197
4. #45 (a-g) on page 198