CHAPTER 2: SCATTER PLOTS, CORRELATION, LINEAR REGRESSION, INFERENCES FOR REGRESSION
By: Tasha Carr, Lyndsay Gentile, Darya Rosikhina, Stacey Zarko

SCATTER PLOTS
Show the relationship between two quantitative variables measured on the same individuals
Look at:
  Direction: positive, negative, or none
  Form: linear (straight) or curved
  Strength: little scatter means a strong association; great scatter means a weak association
  Outliers: check that there are no major outliers

CORRELATION
Measures the direction and strength of the linear relationship between two quantitative variables
Usually written as r, the correlation coefficient
Rules:
  Does not change if you switch x and y
  Both variables must be quantitative
  Does not change when we change units of measurement
  Positive r shows a positive association; negative r shows a negative association
  Always between -1 and 1
  Values near 0 show a weak linear relationship
  Strength of the relationship increases as r moves toward -1 or 1 (at exactly -1 or 1 the points lie on a straight line)
  Not resistant, so outliers can change the value
  Bad measure for curved relationships

LEAST-SQUARES REGRESSION
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes
Used to predict the value of y for a given value of x
Based on correlation (the slope is computed from r)
The least-squares line makes the sum of the squares of the vertical distances of the data points from the line as small as possible (not resistant)
ŷ = b0 + b1 x
  b1 = slope
    b1 = r (sy / sx)
    Amount by which the predicted y changes when x increases by one unit
  b0 = y-intercept
    Predicted value of y when x = 0
    b0 = ȳ - b1 x̄
Extrapolation: making predictions outside the range of the given data; often inaccurate
R² = coefficient of determination
  R² is the proportion of the variability in the y-variable that is accounted for by variation in the x-variable in the model
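The correlation rules and the slope and intercept formulas above can be checked numerically. The block below is a minimal sketch, not a reference implementation: the data set and the function names (`correlation`, `least_squares_line`) are hypothetical, made up purely for illustration.

```python
import statistics

def correlation(xs, ys):
    """Sample correlation r: average product of z-scores, dividing by n - 1."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = y_bar - b1 * x_bar."""
    r = correlation(xs, ys)
    b1 = r * statistics.stdev(ys) / statistics.stdev(xs)
    b0 = statistics.mean(ys) - b1 * statistics.mean(xs)
    return b0, b1

# Hypothetical data, invented for this example only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
b0, b1 = least_squares_line(xs, ys)
r_squared = r ** 2  # proportion of variability in y accounted for by x

print(f"r = {r:.4f}, R^2 = {r_squared:.4f}")
print(f"y-hat = {b0:.4f} + {b1:.4f} x")
```

Calling `correlation(ys, xs)`, or rescaling `xs` to different units before calling it, should give the same r, which matches the first and third rules in the list.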
RESIDUALS
Difference between actual and predicted values
Residual = Observed - Expected = Actual - Guess
e = y - ŷ
Positive residual: the line underestimates the actual value
Negative residual: the line overestimates the actual value
The LSRL minimizes the sum of the squared residuals

RESIDUAL PLOT
A scatter plot of the regression residuals against the explanatory variable or the predicted values
Shows if a linear model is appropriate
If there is no apparent shape or pattern and the residuals are randomly scattered, a linear model is a good fit
If there is a curve, a horn shape, or a big change in scatter, a linear model is not a good fit

LURKING VARIABLES
A variable that has an important effect on the relationship among the variables in a study but is not included among the variables studied
Can make a correlation or regression misleading
Outlier: a point that lies outside the overall pattern of the other observations
Influential point: a point whose removal would markedly change the regression line (outliers in the x-direction are often influential)

CAUSATION
An association between an explanatory and a response variable does not show causation (a cause-and-effect relationship), even if the correlation is high
Correlations based on averages are usually higher than correlations based on data from individuals

INFERENCE FOR REGRESSION
Uses sample data to test whether there is an association between two quantitative variables in the population
To test for an association we check β1, the slope of the population regression line
If no association exists, β1 should be zero
Hypotheses:
  H0: β1 = 0. There is no association.
  HA: β1 ≠ 0. There is an association.
Conditions:
  Straight Enough: Check for no curves in the scatter plot.
  Independence: Data are assumed independent.
  Equal Variance: Check the residual plot for changes in spread.
  Nearly Normal: Create a histogram or Normal probability plot of the residuals.
When all conditions have been met, a Student's t-model can be used for a test on the slope of a regression model.
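The Equal Variance and Nearly Normal conditions are both checked on the residuals e = y - ŷ, so it can help to see them computed. This is a minimal sketch under stated assumptions: the data and the helper names (`fit_line`, `residuals`, `describe`) are hypothetical and only illustrate the definitions above.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

def residuals(xs, ys, b0, b1):
    """Residual = observed - predicted, one value per data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def describe(e):
    """Positive residual: line underestimates. Negative: line overestimates."""
    if e > 0:
        return "underestimate"
    if e < 0:
        return "overestimate"
    return "exact"

# Hypothetical data, invented for this example only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)
for x, e in zip(xs, res):
    print(f"x = {x}: residual = {e:+.2f} ({describe(e)})")
```

Plotting `res` against `xs` would give the residual plot; for a least-squares line the residuals always sum to zero (up to rounding), so what matters in the plot is the pattern and spread, not the average.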
INFERENCE FOR REGRESSION
Mechanics:
  df = n - 2
  t = (b1 - 0) / SE(b1)
  P-value = 2 × P(T > |t|), where T follows a t-distribution with n - 2 degrees of freedom

Multiple Regression Model of House Prices
Response attribute (numeric): Price

Predictor   Coefficient   Std Error   t Statistic   P-value
Constant    1244.2712     75.4607     16.489        0.0000
Age         -5.3659       3.8596      -1.390        0.1691

Regression Equation: Price = 1244.2712 - 5.3659 Age
R-Squared: 0.0284526
Adjusted R-Squared: 0.0137322
Standard Deviation of the Error: 400.242

Reading the output:
  b0 = 1244.2712 (Constant coefficient)
  b1 = -5.3659 (Age coefficient)
  SE(b1) = 3.8596 (Age standard error)
  t = (b1 - 0) / SE(b1) = -5.3659 / 3.8596 = -1.390
  P-value = 0.1691
  R² = 0.0285

Conclusion:
  If the P-value is less than alpha, reject the null hypothesis
    If we reject H0, there is evidence of an association
  If the P-value is greater than alpha, we fail to reject the null hypothesis
    If we fail to reject H0, there is not enough evidence of an association
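The mechanics and conclusion steps can be traced through with the Age row of the house-price output. The sketch below uses only the standard library and integrates the t density numerically. Two values are assumptions, not taken from the slides: `df = 66` is back-solved from the reported R-Squared and Adjusted R-Squared (which are consistent with n = 68), and `alpha = 0.05` is a conventional choice. The function names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p_value(t, df, steps=2000):
    """2 * P(T > |t|), via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3  # P(0 < T < |t|)
    return 1.0 - 2.0 * central_area

b1 = -5.3659     # Age coefficient, from the output table
se_b1 = 3.8596   # Age standard error, from the output table
df = 66          # ASSUMPTION: n - 2 with n = 68, inferred from the two R-squared values
alpha = 0.05     # ASSUMPTION: significance level is not stated in the slides

t_stat = (b1 - 0) / se_b1
p_value = two_sided_p_value(t_stat, df)
decision = "reject H0" if p_value < alpha else "fail to reject H0"

print(f"t = {t_stat:.3f}, P-value = {p_value:.4f}: {decision}")
```

If the degrees-of-freedom assumption holds, the computed P-value should land close to the 0.1691 in the table. Since that is larger than 0.05, the test fails to reject H0: these data do not give enough evidence of an association between Age and Price.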