Lecture 17
• Interaction Plots
• Simple Linear Regression (Chapters 18.1–18.2)
• Homework 4 due Friday. The JMP instructions given for question 15.41 are actually for question 15.35.

18.1 Introduction
• In Chapters 18 to 20 we examine the relationship between interval variables via a mathematical equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the independent variables and the dependent variable.

Uses of Regression Analysis
• A building maintenance company plans to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. The costs incurred by the company are proportional to the number of cleaning crews needed for this task. How many crews will be enough?
• The product manager in charge of a brand of children's cereal would like to predict demand during the next year. She has the following "predictor" variables available: price of the product, number of children in the target market, price of competitors' products, effectiveness of advertising, and annual sales this year and the previous year.

Uses of Regression Analysis (continued)
• A community in the Philadelphia area is interested in how crime rates affect property values. If low crime rates increase property values, the community might be able to cover the cost of increased police protection with the gains in tax revenue from higher property values.
• A real estate agent wants to predict the selling price of houses more accurately. She believes the following variables affect the price of a house: size of the house (sq. feet), number of bedrooms, frontage of the lot, condition, and location.

18.2 The Model
• The model has a deterministic component and a probabilistic component.
[Figure: house cost vs. house size, with cost rising along a straight line; most lots sell for $25,000.]
• However, house costs vary even among houses of the same size! Since cost behaves unpredictably, we add a random component to the model.
[Figure: the same house cost vs. house size plot, now with random scatter around the line.]
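The "deterministic component plus random component" idea can be sketched in a few lines of code. This is an illustrative simulation only: the intercept, slope, and noise level below are made-up numbers, not estimates from any data set.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

def simulated_cost(size_sqft):
    """Hypothetical house cost: a straight-line (deterministic) part
    plus a normally distributed random component."""
    deterministic = 25_000 + 75 * size_sqft     # made-up intercept and slope
    random_component = random.gauss(0, 10_000)  # made-up noise level
    return deterministic + random_component

# Two houses of the same size end up with different costs
# because of the random component.
print(simulated_cost(2000))
print(simulated_cost(2000))
```

Running this twice for the same house size gives two different costs, which is exactly why the model needs a random error term.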
18.2 The Model (continued)
• The first-order linear model: y = β0 + β1x + ε
– y = dependent variable
– x = independent variable
– β0 = y-intercept
– β1 = slope of the line (rise/run)
– ε = error variable
• β0 and β1 are unknown population parameters and are therefore estimated from the data.
[Figure: the line y = β0 + β1x, with the slope β1 shown as rise over run.]

Interpreting the Coefficients
• E(Y | X) = β0 + β1X; β0 is called the y-intercept and β1 is called the slope.
• Fitted line for the cleaning example: Roomsclean = 1.78 + 3.70 × Number of Crews.
• Interpretation of the slope: "For every additional cleaning crew, we are able to clean an additional 3.70 rooms on average."
• Interpretation of the intercept: technically, the average number of rooms that can be cleaned with zero cleaning crews; it doesn't make sense here because it involves extrapolation.

Simple Regression Model
• The data (x1, y1), …, (xn, yn) are assumed to be a realization of
yi = β0 + β1xi + εi,  i = 1, …, n,  where ε1, …, εn are iid N(0, σε²).
• β0 + β1xi is the "signal" and εi is the "noise" (error).
• β0, β1, and σε² are the unknown parameters of the model. The objective of regression is to estimate them.
• What is the interpretation of σε²?

18.3 Estimating the Coefficients
• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts into the data.
[Figure: scatterplot of sample points with a candidate line drawn through them.]
• Question: What should be considered a good line?

The Least Squares (Regression) Line
• A good line is one that minimizes the sum of squared differences between the points and the line.
• Let us compare two lines through the points (1, 2), (2, 4), (3, 1.5), and (4, 3.2); the second line is the horizontal line y = 2.5.
– Line 1 (fitted values 1, 2, 3, 4): sum of squared differences = (2 − 1)² + (4 − 2)² + (1.5 − 3)² + (3.2 − 4)² = 7.89
– Line 2: sum of squared differences = (2 − 2.5)² + (4 − 2.5)² + (1.5 − 2.5)² + (3.2 − 2.5)² = 3.99
• The smaller the sum of squared differences, the better the line fits the data.
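The two sums of squared differences in the comparison are easy to verify in code. The points and fitted values below are the ones from the comparison (note the first sum works out to 7.89).

```python
# The four data points from the slide's comparison of two lines.
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]

def sum_sq_diff(points, fitted):
    """Sum of squared vertical distances between each point and the line."""
    return sum((y - f) ** 2 for (_, y), f in zip(points, fitted))

line1_fitted = [1, 2, 3, 4]          # the first (sloped) line
line2_fitted = [2.5, 2.5, 2.5, 2.5]  # the horizontal line y = 2.5

print(round(sum_sq_diff(points, line1_fitted), 2))  # 7.89
print(round(sum_sq_diff(points, line2_fitted), 2))  # 3.99
```

The horizontal line wins this particular comparison, but neither line is the least squares line; the least squares line is the one that makes this sum as small as possible.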
The Estimated Coefficients
• To calculate the estimates of the line coefficients that minimize the sum of squared differences between the data points and the line, use the formulas:
b1 = cov(X, Y) / s_x²
b0 = ȳ − b1·x̄
• The regression equation that estimates the equation of the first-order linear model is ŷ = b0 + b1x.

Typical Regression Analysis
• Observe pairs of data (x1, y1), …, (xn, yn).
• Plot the data! See whether a simple linear regression model seems reasonable. If necessary, transform the data.
• Suspect (or hope) that the SRM assumptions are justified.
• Estimate the true regression line E(y | x) = β0 + β1x by the LS regression line ŷ = b0 + b1x.
• Check the model and make inferences.

The Simple Linear Regression Line
• Example 18.2 (Xm18-02)
– A car dealer wants to find the relationship between the odometer reading (independent variable x) and the selling price (dependent variable y) of used cars.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.

Car  Odometer  Price
1    37388    14636
2    44758    14122
3    45833    14016
4    30862    15590
5    31705    15568
6    34010    14718
…    …        …

• Solution: solving by hand, calculate the summary statistics (n = 100):
x̄ = 36,009.45;  s_x² = Σ(xi − x̄)² / (n − 1) = 43,528,690
ȳ = 14,822.82;  cov(X, Y) = Σ(xi − x̄)(yi − ȳ) / (n − 1) = −2,712,511
b1 = cov(X, Y) / s_x² = −2,712,511 / 43,528,690 = −0.06232
b0 = ȳ − b1·x̄ = 14,822.82 − (−0.06232)(36,009.45) = 17,067
ŷ = b0 + b1x = 17,067 − 0.0623x

Interpreting the Linear Regression Equation
[Figure: odometer line fit plot, price vs. odometer; the fitted line starts at 17,067 and there is no data near an odometer reading of zero.]
• The intercept is b0 = $17,067. Do not interpret the intercept as "the price of cars that have not been driven"; the sample contains no cars with readings near zero, so that would be extrapolation.
• The slope is b1 = −0.0623: for each additional mile on the odometer, the price decreases by an average of $0.0623 (6.23 cents).

Fitted Values and Residuals
• The least squares line decomposes the data into two parts: yi = ŷi + ei, where ŷi = b0 + b1xi and ei = yi − ŷi.
• ŷ1, …, ŷn are called the fitted or predicted values.
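The hand calculation in Example 18.2 can be reproduced directly from the summary statistics given on the slide (the variable names below are my own):

```python
# Summary statistics for the 100-car sample (from the slide).
x_bar = 36_009.45        # mean odometer reading
y_bar = 14_822.82        # mean selling price
s2_x = 43_528_690        # sample variance of odometer readings
cov_xy = -2_712_511      # sample covariance of odometer and price

b1 = cov_xy / s2_x       # slope estimate
b0 = y_bar - b1 * x_bar  # intercept estimate
print(round(b1, 4))      # -0.0623
print(round(b0))         # 17067

# Fitted value and residual for car 1 (odometer 37388, price 14636).
x1, y1 = 37_388, 14_636
y_hat1 = b0 + b1 * x1    # predicted price, about 14737
e1 = y1 - y_hat1         # residual, about -101
```

Car 1 sold for roughly $101 less than the line predicts; that gap is the residual e1.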
• e1, …, en are called the residuals; they are estimates of the errors ε1, …, εn.

18.4 Error Variable: Required Conditions
• The error ε is a critical part of the regression model.
• Four requirements involving the distribution of ε must be satisfied:
– The probability distribution of ε is normal.
– The mean of ε is zero: E(ε) = 0.
– The standard deviation of ε is σε for all values of x.
– The errors associated with different values of y are all independent.

The Normality of ε
[Figure: normal curves centered on the regression line at x1, x2, x3; the mean E(y | x) = β0 + β1x changes with x, but the standard deviation remains constant.]
• From the first three assumptions: y is normally distributed with mean E(y) = β0 + β1x and constant standard deviation σε.

Estimating σε
• The standard error of estimate (root mean squared error)
se = sqrt[ (1 / (n − 2)) · Σ (yi − ŷi)² ],  summing over i = 1, …, n,
is an estimate of σε.
• The standard error of estimate is basically the standard deviation of the residuals.
• If the simple regression model holds, then approximately
– 68% of the data will lie within one se of the LS line,
– 95% of the data will lie within two se of the LS line.

Cleaning Crew Example
• Roomsclean = 1.78 + 3.70 × Number of Crews
• The building maintenance company is planning to submit a bid on a contract to clean 40 corporate offices scattered throughout an office complex. Currently, the company has only 11 cleaning crews. Will 11 crews be enough?

Practice Problems
• 18.4, 18.10, 18.12
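The cleaning-crew question can be answered by plugging 11 crews into the fitted line from the lecture:

```python
# Fitted line from the lecture: Roomsclean = 1.78 + 3.70 * crews.
b0, b1 = 1.78, 3.70

crews = 11
predicted_rooms = b0 + b1 * crews
print(round(predicted_rooms, 2))  # 42.48
print(predicted_rooms >= 40)      # True
```

So 11 crews clean about 42.5 offices on average, which covers the 40 offices needed. Keep in mind this is a prediction of the average: whether 11 crews are reliably enough also depends on the scatter of actual outcomes around the line, which is what the standard error of estimate measures.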