Simple Linear Regression

Regression equation: an equation that describes the average relationship between a response (dependent) variable and an explanatory (independent) variable.

Slope-intercept equation for a line
$y = mx + b$
[Figure: line through the points (2, 3) and (4, 6); slope $= \Delta y / \Delta x = (6 - 3)/(4 - 2) = 3/2 = 1.5$, so $y = 1.5x$.]

[Figure: Fahrenheit versus Celsius, $F = \frac{9}{5}C + 32$, with the slope and the y-intercept marked.]

Deterministic Model
A model that defines an exact relationship between variables.
Example: $y = 1.5x$
There is no allowance for error in the prediction of y for a given x.

Probabilistic Model
A model that accounts for random error.
Includes both a deterministic component and a random error component:
$y = 1.5x + \text{random error}$
This model hypothesizes a probabilistic relationship between y and x.
[Figure: points scattered about the line $y = 1.5x$; $1.5x$ is the deterministic component and the vertical deviation of each point from the line is the random error (also termed a residual), with $\varepsilon \sim N(0, 1)$.]

Probabilistic Model: General Form
y = Deterministic component + Random component
where y is the “variable of interest”.
Assume that the mean value of the random error is zero. Then the mean value of y, E(y), equals the deterministic component of the model.

First-Order (Straight Line) Probabilistic Model
$y = \beta_0 + \beta_1 x + \varepsilon$
where
y = dependent variable
x = independent variable
$\beta_0$ = population y-intercept of the line: the point at which the line intersects or cuts through the y-axis
$\beta_1$ = population slope of the line: the amount of increase (or decrease) in the deterministic component of y for every 1-unit increase (or decrease) in x
$\varepsilon$ = random error component

$\beta_0$ and $\beta_1$ are population parameters.
$\beta_0$ and $\beta_1$ will generally be unknown.

Model Development
The process of developing a model, estimating model parameters, and using the model can be summarized in these five steps:
1.
Hypothesize the deterministic component of the model that relates the mean, E(y), to the independent variable x: $E(y) = \beta_0 + \beta_1 x$
2. Use sample data to estimate the unknown model parameters; find estimates $\hat{\beta}_0$ or $b_0$, and $\hat{\beta}_1$ or $b_1$
3. Specify the probability distribution of the random error term and estimate the SD of this distribution: $\varepsilon \sim N(0, \sigma)$ (will revisit this later)
4. Statistically evaluate the usefulness of the model
5. Use the model for prediction, estimation, or other purposes

The population parameters $\beta_0$ and $\beta_1$ will be known only if the population of all (x, y) measurements is available. $\beta_0$ and $\beta_1$, along with a specific value of the independent variable x, determine the mean value of the dependent variable y.

Example: Reaction time versus drug percentage

Subject   Amount of Drug (%), x   Reaction Time (seconds), y
1         1                       1
2         2                       1
3         3                       2
4         4                       2
5         5                       4

[Figure: scatterplot of Reaction Time (sec.) versus Drug (%).]

[Figure: the same scatterplot with a visually fitted line through (2, 1) and (3, 2); slope = (2 - 1)/(3 - 2) = 1/1 = 1 and intercept = -1, so the predicted value is -1 + x. The vertical deviations of the points from the line are the errors, or residuals.]

Errors of prediction: the vertical differences between the observed and the predicted values of y.

x   y   Predicted y = -1 + x   Error (y - predicted)   Squared error
1   1   0                      (1 - 0) = 1             1
2   1   1                      (1 - 1) = 0             0
3   2   2                      (2 - 2) = 0             0
4   2   3                      (2 - 3) = -1            1
5   4   4                      (4 - 4) = 0             0
Sum of errors = 0; sum of squared errors (SSE) = 2

Least Squares Line
Also called the regression line, or the least squares prediction equation.
The method used to find this line is called the method of least squares.
For our example, we have a sample of n = 5 pairs of (x, y) values.
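As a numerical check on the visually fitted line from the example, the sketch below scores a candidate line by its sum of squared errors. The helper names `prediction_errors` and `sse`, and the second candidate line, are illustrative assumptions and do not come from the slides.

```python
# Drug/reaction data from the example: x = drug (%), y = reaction time (s).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def prediction_errors(b0, b1, xs, ys):
    """Return the errors y - (b0 + b1*x), one per data point."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(xs, ys)]


def sse(b0, b1, xs, ys):
    """Sum of squared errors for the candidate line b0 + b1*x."""
    return sum(e ** 2 for e in prediction_errors(b0, b1, xs, ys))


# Visually fitted line from the example: intercept -1, slope 1.
visual_errors = prediction_errors(-1, 1, x, y)
visual_sum = sum(visual_errors)
visual_sse = sse(-1, 1, x, y)

# A hypothetical alternative line (not from the slides), for comparison only.
other_sse = sse(0, 0.5, x, y)

print("errors:", visual_errors)
print("sum of errors:", visual_sum, "SSE:", visual_sse)
print("SSE of the hypothetical line 0 + 0.5x:", other_sse)
```

Scoring lines this way makes the least squares idea concrete: among all candidate pairs (b0, b1), the method looks for the pair with the smallest SSE.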
The fitted line that we will calculate is written as
$\hat{y} = b_0 + b_1 x$
$\hat{y}$ is an estimator of the mean value of y, E(y); $b_0$ and $b_1$ are estimators of $\beta_0$ and $\beta_1$.

Least Squares Line (continued)
Define the sum of squares of the deviations of the y values about their predicted values for all n data points as:
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} [y_i - (b_0 + b_1 x_i)]^2$
We want to find $b_0$ and $b_1$ to make the SSE a minimum; these values are termed the least squares estimates.
$\hat{y} = b_0 + b_1 x$ is called the least squares line.

Formulas for the Least Squares Estimates
Slope: $b_1 = \frac{SS_{xy}}{SS_{xx}}$ or $b_1 = r \frac{SD_y}{SD_x}$
y-intercept: $b_0 = \bar{y} - b_1 \bar{x} = \frac{\sum y_i}{n} - b_1 \frac{\sum x_i}{n}$
where
$s_{xx} = SS_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{(\sum x_i)^2}{n}$
$s_{xy} = SS_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}$
n = sample size

LS Calculations for Drug/Reaction Example

$x_i$   $y_i$   $x_i^2$   $x_i y_i$
1       1       1         1
2       1       4         2
3       2       9         6
4       2       16        8
5       4       25        20
Sums: $\sum x_i = 15$, $\sum y_i = 10$, $\sum x_i^2 = 55$, $\sum x_i y_i = 37$

$SS_{xy} = 37 - \frac{(15)(10)}{5} = 37 - 30 = 7$
$SS_{xx} = 55 - \frac{(15)^2}{5} = 55 - 45 = 10$
$b_1 = \frac{7}{10} = 0.7$
$b_0 = \frac{10}{5} - 0.7 \cdot \frac{15}{5} = 2 - 0.7(3) = 2 - 2.1 = -0.1$
$\hat{y} = -0.1 + 0.7x$

LS Line for Drug/Reaction Example
[Figure: scatterplot of Reaction Time (sec.) versus Drug (%) with the least squares line $\hat{y} = -0.1 + 0.7x$ and the point $(\bar{x}, \bar{y})$ marked.]

LS Calculations for Drug/Reaction Example

x   y   $\hat{y} = -0.1 + 0.7x$   $(y - \hat{y})$     $(y - \hat{y})^2$
1   1   0.6                       (1 - 0.6) = 0.4     0.16
2   1   1.3                       (1 - 1.3) = -0.3    0.09
3   2   2.0                       (2 - 2.0) = 0       0.00
4   2   2.7                       (2 - 2.7) = -0.7    0.49
5   4   3.4                       (4 - 3.4) = 0.6     0.36
Sum of errors = 0; sum of squared errors (SSE) = 1.10

The LS line has a sum of errors = 0, but its SSE = 1.1 is less than the SSE = 2.0 of the visual model.

Least Squares Line: Interpretation of $\hat{y} = -0.1 + 0.7x$
The estimated intercept is negative, which implies that the estimated mean reaction time is equal to -0.1 seconds when the amount of drug is 0%.
What does this mean, since negative reaction times are not possible?
Model parameters should be interpreted only within the sampled range of the independent variable.

The slope of 0.7 implies that for every unit increase of x, the mean value of y is estimated to increase by 0.7 units.
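The least squares calculations for the drug/reaction example can be reproduced with a short script. This is a minimal sketch that applies the SS_xy and SS_xx shortcut formulas in plain Python; the variable names are illustrative choices, not notation from the slides.

```python
# Drug/reaction data from the example: x = drug (%), y = reaction time (s).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

# Column sums used by the shortcut formulas.
sum_x = sum(x)                                   # 15
sum_y = sum(y)                                   # 10
sum_x2 = sum(xi ** 2 for xi in x)                # 55
sum_xy = sum(xi * yi for xi, yi in zip(x, y))    # 37

# SS_xy = sum(x*y) - (sum x)(sum y)/n and SS_xx = sum(x^2) - (sum x)^2/n.
ss_xy = sum_xy - sum_x * sum_y / n
ss_xx = sum_x2 - sum_x ** 2 / n

# Least squares estimates of the slope and the y-intercept.
b1 = ss_xy / ss_xx
b0 = sum_y / n - b1 * sum_x / n

# Fitted values, errors of prediction, and SSE for the fitted line.
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]
sse = sum(e ** 2 for e in residuals)

print(f"SS_xy = {ss_xy:.1f}, SS_xx = {ss_xx:.1f}")
print(f"b1 = {b1:.2f}, b0 = {b0:.2f}")
print(f"sum of errors = {sum(residuals):.2f}, SSE = {sse:.2f}")
```

Values are printed with rounding because floating-point arithmetic gives results such as -0.1 only approximately.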
In the context of the problem: for every 1% increase in the amount of drug in the bloodstream, the mean reaction time is estimated to increase by 0.7 seconds over the sampled range of drug amounts from 1% to 5%.

Coefficient of Determination
A measure of the contribution of x in predicting y.
Assuming that x provides no information for the prediction of y, the best prediction for the value of y is $\bar{y}$:
$SS_{yy} = \sum (y_i - \bar{y})^2$
[Figure: scatterplot of Reaction Time (sec.) versus Drug (%) with the horizontal line $\hat{y} = \bar{y}$.]

Coefficient of Determination (continued)
[Figure: scatterplot of Reaction Time (sec.) versus Drug (%), annotated with $SSE = \sum (y_i - \hat{y}_i)^2$.]

Coefficient of Determination (continued)
$SS_{yy} = \sum (y_i - \bar{y})^2$: total sample variation around the mean
$SSE = \sum (y_i - \hat{y}_i)^2$: unexplained sample variability after fitting
$SS_{yy} - SSE$: explained sample variability attributable to the linear relationship
$\frac{SS_{yy} - SSE}{SS_{yy}} = \frac{\text{explained}}{\text{total}}$: proportion of total sample variability explained by the linear relationship

Coefficient of Determination (continued)
$r^2 = \frac{SS_{yy} - SSE}{SS_{yy}} = 1 - \frac{SSE}{SS_{yy}}$
The term $\frac{SSE}{SS_{yy}}$ is labeled "Unexplained variability".
In simple linear regression, $r^2$ is computed as the square of the correlation coefficient, r.
$0 \le r^2 \le 1$
Interpretation: $r^2 = 0.75$ means that the sum of squared deviations of the y values about their predicted values has been reduced by 75% by the use of the least squares equation $\hat{y}$, instead of $\bar{y}$, to predict y.
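For the drug/reaction example, the coefficient of determination can be computed directly from these definitions. The sketch below reuses the least squares line $\hat{y} = -0.1 + 0.7x$ from the example. The cross-check uses the standard formula $r = SS_{xy} / \sqrt{SS_{xx} SS_{yy}}$, which is an addition here and does not appear on the slides.

```python
import math

# Drug/reaction data from the example: x = drug (%), y = reaction time (s).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares line from the example: y_hat = -0.1 + 0.7x.
b0, b1 = -0.1, 0.7

# Total variation of y about its mean, and unexplained variation about the line.
ss_yy = sum((yi - y_bar) ** 2 for yi in y)
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Coefficient of determination: r^2 = 1 - SSE / SS_yy.
r_squared = 1 - sse / ss_yy

# Cross-check with the correlation coefficient (standard formula, an addition).
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
r = ss_xy / math.sqrt(ss_xx * ss_yy)

print(f"SS_yy = {ss_yy:.2f}, SSE = {sse:.2f}")
print(f"r^2 = {r_squared:.4f}, r = {r:.4f}, r squared again = {r ** 2:.4f}")
```

Under these assumptions the script gives an r² of about 0.82. Read the same way as the 0.75 illustration, using the least squares line instead of $\bar{y}$ reduces the sum of squared prediction errors by roughly 82% in this sample.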