9/16/09

Regression
FPP 10 kind of

Regression line
Correlation coefficient: a nice numerical summary of two quantitative variables
  It indicates direction and strength of association
  But does it quantify the association?
It would be of interest to do this for
  Predictions
  Understanding phenomena

Regression line
Correlation measures the direction and strength of the straight-line (linear) relationship between two quantitative variables
If a scatter plot shows a linear relationship, we would like to summarize this overall pattern by drawing a line on the scatter plot
This line represents a mathematical model. Later we will make the mathematical model a statistical one.

Slope intercept form review
[Figure]

Regression line
Slope intercept form notation
  \( y = mx + b \)
Regression form notation
  \( \hat{y} = a + bx \)

Regression
[Scatter plot: Price of Homes Based on Square Feet]
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945

Which line is best
Different people might draw different lines by eye on a scatterplot
What are some ways we can determine which model (line) out of all the possible models (lines) is the “best” one?
What are some ways that we can numerically rank the different models (i.e. the different lines)?
This will come later in the course

Which model to use
Price = -90.2458 + 0.1598 SQFT (red)
Price = -300 + 0.3 SQFT (blue)
Price = 0 + 0.1 SQFT (green)

Slope interpretation
\( \hat{y} = a + bx \)
The slope, b, of a regression line is almost always important for interpreting the data.
The slope is the rate of change, the mean amount of change in y-hat when x increases by 1

Intercept interpretation
\( \hat{y} = a + bx \)
The intercept, a, of the regression line is the value of y-hat when x = 0.
Although we need the value of the intercept to draw the line, it is statistically meaningful only when x can actually take values close to zero.
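As a small check on the two interpretations above, the sketch below plugs numbers into the house-price line from the slides. The function name `predict_price` is hypothetical, and treating Price as thousands of dollars is an assumption that this part of the slides does not state.

```python
# Fitted line from the slides: Price = -90.2458 + 0.1598 * SQFT.
# Assumption: Price is measured in thousands of dollars.
A = -90.2458  # intercept a
B = 0.1598    # slope b


def predict_price(sqft):
    """Return y-hat = a + b*x for a home of the given size (hypothetical helper)."""
    return A + B * sqft


# Slope: the change in y-hat when x increases by 1, from any starting x.
change_per_sqft = predict_price(2001) - predict_price(2000)

# Intercept: the value of y-hat when x = 0.
price_at_zero = predict_price(0)

print("change in y-hat per extra sqft:", round(change_per_sqft, 4))
print("y-hat at 0 sqft:", price_at_zero)
```

The starting size of 2000 sqft is arbitrary; any other starting value gives the same one-unit change, which is what makes the slope a single rate of change for the whole line.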
Slope interpretation
[Scatter plot: Price of Homes Based on Square Feet]
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945
For every 1 sqft increase in the size of a home, the house price increases on average by $159.80 (0.1598 thousand dollars)

Intercept interpretation
[Scatter plot: Price of Homes Based on Square Feet]
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945
If the sqft of a home were 0, the house price would on average be -$90,245.80
This doesn’t make much sense here because x (sqft) doesn’t take on values close to zero.

Prediction
[Scatter plot: Price of Homes Based on Square Feet]
Price = -90.2458 + 0.1598 SQFT
r = 0.8718945
For a 3500 sqft home we would predict the selling price to be
  price = -90.2458 + 0.1598 * 3500
  price = 469.0542 (thousands of dollars), i.e. $469,054.20

OECD data: Income and unemployment in the U.S.
What is the relationship between households’ disposable income and the nation’s unemployment rate?
Data from the U.S., 1980 to 1998
(data provided by the economics department at Duke)

Disposable income vs unemployment rates
[Figure]

Disposable income and unemployment rates regression output
[Figure]

Does regression fit data well?
A regression line is reasonable if
  The association between the two variables is indeed linear
  The points are randomly scattered around the line
The income/unemployment rate data are well described by the regression line.

Regression of AIDS rates per 1000 people on GNP per capita
The line is too low for GNP values near zero and too high for large GNP values.
We shouldn’t use this line for predictions

Birth and death rates in 74 countries
[Figure]

Changing the response variable
When the regression line fits the data badly, sometimes you can transform variables to obtain a better fitting line.
With monetary variables, typically this can be accomplished by taking logarithms.
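The log transformation described above can be sketched as follows. The data are synthetic and hypothetical (an exact power law, not any data set from the slides), and the helper `fit_line` assumes the regression line is the ordinary least-squares line, which the slides have not stated at this point.

```python
import math


def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x (assumed method)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Hypothetical, noise-free data following y = 3 * x**0.5.
# A straight line fits these points badly on the original scale.
xs = [100.0, 400.0, 1600.0, 6400.0, 25600.0]
ys = [3.0 * x ** 0.5 for x in xs]

# Take logs of both variables, then fit a line on the log scale.
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]
a, b = fit_line(log_xs, log_ys)


def predict(x):
    """Predict log(y) from log(x), then exponentiate to get back to y."""
    return math.exp(a + b * math.log(x))


print("slope on log scale:", b)
print("prediction at x = 900:", predict(900.0))
```

Because the synthetic data are exactly linear on the log scale, the fitted slope recovers the exponent. Real data would scatter around the line, and the exponentiated value would then be only a rough estimate on the original scale.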
Regression of log(AIDS) on log(GNP)
Much better fit
Predict log(AIDS) from log(GNP). Exponentiate to estimate AIDS

Facts about regression
The distinction between explanatory and response variables is essential in regression
If you have a slope computed using x as the explanatory variable and y as the response variable, you can’t “back solve” to get predictions of x given y
If you want to predict x given y, then you must find the intercept and slope with y as the explanatory variable and x as the response

Facts about regression
There is a close relationship between the correlation coefficient and the slope of a regression line
  \( b = r \, \frac{SD_y}{SD_x} \)
  They have the same sign
  They are proportional to each other
The intercept has no relationship with the correlation coefficient, but here is the formula
  \( a = \bar{y} - b\bar{x} \)

Warnings about regression
Predicting y at values of x beyond the range of x in the data is called extrapolation
This is risky, because we have no evidence to believe that the association between x and y remains linear for unseen x values
Extrapolated predictions can be absolutely wrong

Extrapolation
Diamond price and carat
The explanatory variable is measured in carats and the response variable in dollars
Predict the price of the Hope diamond
  \( \hat{y} = 48.88 + 2430.77(45.52) = 110697.53 \), i.e. $110,697.53

Extrapolation
The relationship between diamond carat and price doesn’t remain linear after a carat size of about 0.4

Extrapolation
Green line is the linear fit with only diamonds less than 0.4 carats
Blue line is the linear fit with all carat sizes
Red curve is a quadratic fit

Lurking variable
A variable not being considered could be driving the relationship
In practice this is a difficult issue to tackle, especially when everything seems OK

Influential point
An outlier in either the X or Y direction which, if removed, would markedly change the value of the slope and y-intercept.
[Applet]

Causality
On its own, regression only quantifies an association between x and y
It does not prove causality
Under a carefully designed experiment (or in some cases observational studies) regression can be used to show causality.
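The facts about regression earlier in this section (the slope and intercept formulas, and the warning against "back solving") can be illustrated with a short sketch. The five data points are hypothetical and chosen only to keep the arithmetic easy. The helper names are made up, and the SD here divides by n, an assumption that does not affect the ratio of SDs.

```python
import math

# Hypothetical paired data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Standard deviation dividing by n (assumed convention)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def correlation(us, vs):
    """Average product of the two variables in standard units."""
    mu, mv = mean(us), mean(vs)
    su, sv = sd(us), sd(vs)
    return mean([((u - mu) / su) * ((v - mv) / sv) for u, v in zip(us, vs)])


r = correlation(xs, ys)

# Regression of y on x: b = r * SD_y / SD_x, a = y-bar - b * x-bar.
b = r * sd(ys) / sd(xs)
a = mean(ys) - b * mean(xs)

# Regression of x on y: swap the roles of the two variables.
b_xy = r * sd(xs) / sd(ys)
a_xy = mean(xs) - b_xy * mean(ys)

# "Back solving" y = a + b*x for x would give slope 1/b instead.
back_solved_slope = 1.0 / b

print("r =", r)
print("y on x: a =", a, "b =", b)
print("x on y: a =", a_xy, "b =", b_xy)
print("slope from back solving:", back_solved_slope)
```

The slope for predicting x from y differs from the back-solved slope whenever r is not exactly 1 or -1, which is why the regression has to be refit with the roles of the variables exchanged.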