ESTIMATING A REGRESSION LINE
F. Chiaromonte

This course is about REGRESSION ANALYSIS:
• Constructing quantitative descriptions of the statistical association between y (the response variable) and x (the predictor, or explanatory variable) from the sample data.
• Introducing models, so that estimates of and inferences on the parameters of these descriptions can be interpreted in relation to the underlying population.
MULTIPLE regression is the case in which we consider more than one predictor variable.

Fitting a line through bivariate sample data (as a descriptor)

Least squares fit: find the line that minimizes the sum of squared (vertical) distances from the sample points. With the generic equation of a line, y = β0 + β1 x, the objective function to minimize is

\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2

Setting the derivatives of the objective function with respect to β0 and β1 to zero gives the normal equations

\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i , \qquad \sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2

whose solution is unique:

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} , \qquad b_0 = \bar{y} - b_1 \bar{x}

For each sample point i = 1…n, the fitted value is ŷ_i = b0 + b1 x_i and the residual is e_i = y_i - ŷ_i.

Geometric properties of the least squares line:

\sum_{i=1}^{n} e_i = 0 , \quad \sum_{i=1}^{n} e_i^2 = \min , \quad \sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i , \quad \sum_{i=1}^{n} x_i e_i = 0 , \quad \sum_{i=1}^{n} \hat{y}_i e_i = 0

[Figure: scatterplot of y against x with the fitted line y = b0 + b1 x, which passes through (x̄, ȳ); a residual e is the vertical gap between an observed y and its fitted value ŷ.]

Simple linear regression MODEL: assume the sample data are generated as

y_i = \beta_0 + \beta_1 x_i + \varepsilon_i

where the x_i, i = 1…n, are fixed (or conditioned on) and the ε_i, i = 1…n, are random errors such that

E(\varepsilon_i) = 0 \;\; \forall i \quad \text{(no systematic component)}, \qquad \mathrm{var}(\varepsilon_i) = \sigma^2 \;\; \forall i \quad \text{(constant variance)}, \qquad \mathrm{cor}(\varepsilon_i, \varepsilon_j) = 0 \;\; \forall i \neq j \quad \text{(uncorrelated)}

The values of y (given various values of x) scatter about a line, with constant variance and no correlation among the departures from the line carried by different observations… quite simplistic, but very useful in many applications! Note: the distribution of the errors is unspecified for now.

If we assumed a bell-shaped distribution for the errors (which we will do later!), here is how the "population" picture would look:

E(y_i) = \beta_0 + \beta_1 x_i , \qquad \mathrm{var}(y_i) = \sigma^2 , \qquad \mathrm{cor}(y_i, y_j) = 0

Interpretation of the parameters:
β1 = slope: the change in the mean of y when x changes by one unit.
β0 = intercept: if x = 0 is a meaningful value, the mean of y when x = 0.
σ² = error variance: the scatter of y about the regression line.

Under the assumptions of our simple linear regression model, the slope and intercept of the least squares line are (point) estimates of the population slope and intercept, with the following very important properties.

GAUSS-MARKOV THEOREM: Under the conditions of the simple linear regression model:
• b1 and b0 are unbiased estimates of β1 and β0: E(b1) = β1, E(b0) = β0.
• They are the most accurate (smallest MSE, i.e. smallest variance) among all unbiased estimates that can be computed as linear functions of the response values.

Linearity (verified numerically in the second sketch below):

b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2} = \frac{\sum_i (x_i - \bar{x}) y_i}{\sum_i (x_i - \bar{x})^2} = \sum_{i=1}^{n} k_i y_i , \qquad b_0 = \bar{y} - b_1 \bar{x} = \sum_{i=1}^{n} \Big( \frac{1}{n} - k_i \bar{x} \Big) y_i = \sum_{i=1}^{n} \tilde{k}_i y_i

Point estimation of the error variance:

s^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2 = \frac{1}{n-2} \sum_{i=1}^{n} \big( y_i - (b_0 + b_1 x_i) \big)^2

The sum in the numerator is the SSE (error sum of squares), and s² is the MSE for the regression line. The SSE has n - 2 degrees of freedom (the constraints are the two parameters of the line), and s² is unbiased for σ²: E(s²) = σ².

Point estimation of the mean response:

\hat{E}(y \text{ at } x) = \hat{y}(x) = b_0 + b_1 x
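To make the machinery above concrete, here is a minimal Python sketch on synthetic data (the simulated line, noise level, and seed are illustrative assumptions, not course data). It computes b0 and b1 from the closed-form solution, checks the geometric properties of the least squares line, and produces the point estimates s² and ŷ(x).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.uniform(0, 10, size=n)                 # predictor values, treated as fixed
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=n)   # illustrative "true" line plus error

x_bar, y_bar = x.mean(), y.mean()

# Closed-form least squares solution (unique)
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

y_hat = b0 + b1 * x   # fitted values
e = y - y_hat         # residuals

# Geometric properties of the least squares line (identities, up to rounding)
print(np.isclose(e.sum(), 0))            # sum of residuals is zero
print(np.isclose(y_hat.sum(), y.sum()))  # fitted values have the same total as y
print(np.isclose((x * e).sum(), 0))      # residuals are orthogonal to x
print(np.isclose((y_hat * e).sum(), 0))  # residuals are orthogonal to fitted values

# Point estimates: error variance s^2 = SSE/(n - 2), mean response at a new x
s2 = np.sum(e ** 2) / (n - 2)
print(b0, b1, s2, b0 + b1 * 5.0)
```

All four printed checks return True: the geometric properties are algebraic consequences of the normal equations, so they hold for any dataset up to floating-point rounding.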
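And here is the linearity check promised in the Gauss-Markov discussion: the slope and intercept are linear functions of the responses, with weights k_i and k̃_i built from the x values alone. Again a sketch on synthetic, illustrative data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.3 * x + rng.normal(0, 0.5, size=n)  # synthetic, illustrative

x_bar = x.mean()
Sxx = np.sum((x - x_bar) ** 2)

k = (x - x_bar) / Sxx            # slope weights k_i (depend on x only)
k_tilde = 1.0 / n - k * x_bar    # intercept weights k~_i

# Closed-form estimates for comparison
b1 = np.sum((x - x_bar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x_bar

print(np.isclose(np.sum(k * y), b1))        # b1 = sum_i k_i y_i
print(np.isclose(np.sum(k_tilde * y), b0))  # b0 = sum_i k~_i y_i
```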
Example: simple linear regression of y = log length ratio between human and chicken DNA on x = log large insertion ratio, sampled on n = 100 genome windows.

Estimates from least squares:
Line parameters: intercept 0.19210, slope 0.21777.
Error variance: 0.033 on 98 degrees of freedom.
Mean responses: ŷ(1) = 0.41, ŷ(2) = 0.63 … would you trust this?

Example: simple linear regression of y = mortality rate due to malignant skin melanoma per 10 million people on x = latitude, sampled on 49 US states.

#    State        LAT    MORT
1    Alabama      33.0   219
2    Arizona      34.5   160
3    Arkansas     35.0   170
4    California   37.5   182
5    Colorado     39.0   149
...
49   Wyoming      43.0   134

[Regression plot: Mortality = 389.189 - 5.97764 Latitude, with S = 19.1150, R-Sq = 68.0%, R-Sq(adj) = 67.3%.]

Estimates from least squares:
Line parameters: intercept 389.19, slope -5.977.
Error variance: 365.57 on 47 degrees of freedom.
Mean responses: ŷ(30) = 209.88, ŷ(40) = 150.11 … would you trust this?

Maximum likelihood estimation under normality

Take x_i, i = 1…n, fixed (or conditioned on), and assume a distribution for the errors: ε_i, i = 1…n, iid N(0, σ²). Then the responses are normal and independent:

y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2) \text{ and independent}, \quad i = 1, \dots, n

Likelihood function:

L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \Big\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \Big\}

Maximizing L(β0, β1, σ²) over the three parameters (the objective function) again has a unique solution:

b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} , \qquad b_0 = \bar{y} - b_1 \bar{x}

the same as from the least squares fit! For the error variance,

\tilde{s}^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (b_0 + b_1 x_i) \big)^2 = \frac{n-2}{n} s^2

so the maximum likelihood estimate slightly underestimates σ² (it is biased), which is why the unbiased s² is used in practice.

Some remarks
• A strong statistical association between y and x does not automatically imply causation (e.g. a functional relationship of linear form); x can "proxy" for the real causing variable, perhaps in a spurious fashion!
• In observational studies, x (and likewise y) is not controlled by the researchers; we "condition" on the observed values of x. In experimental studies, x is controlled; we can consider the values of x as fixed (although the assignment of x levels to units may be arranged at random in an experimental design). Experimental design facilitates causality assessment.
• Extrapolating a statistical association (e.g. a regression line) outside the range of the data on which it was "fitted" is dangerous; we don't really know how the association would be shaped where we didn't get to look! With an experimental design we can make sure we cover the range that is of interest.
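Returning to the maximum likelihood discussion above: a short sketch (synthetic data; scipy.optimize for the numerical optimizer) that maximizes the normal likelihood directly and confirms it reproduces the least squares line, with a variance estimate equal to (n-2)/n times the unbiased s².

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(0, 10, size=n)
y = 0.2 + 0.22 * x + rng.normal(0, 0.2, size=n)  # synthetic, illustrative

def neg_log_lik(theta):
    """Negative log of L(beta0, beta1, sigma^2); sigma^2 kept positive via log."""
    b0, b1, log_s2 = theta
    s2 = np.exp(log_s2)
    resid = y - (b0 + b1 * x)
    return 0.5 * n * np.log(2 * np.pi * s2) + np.sum(resid ** 2) / (2 * s2)

fit = minimize(neg_log_lik, x0=np.zeros(3))
b0_ml, b1_ml, s2_ml = fit.x[0], fit.x[1], np.exp(fit.x[2])

# Closed-form least squares for comparison
b1_ls = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0_ls = y.mean() - b1_ls * x.mean()
s2_unbiased = np.sum((y - (b0_ls + b1_ls * x)) ** 2) / (n - 2)

print(np.allclose([b0_ml, b1_ml], [b0_ls, b1_ls], atol=1e-3))  # same line
print(np.isclose(s2_ml, (n - 2) / n * s2_unbiased, atol=1e-3)) # s~^2 = ((n-2)/n) s^2
```

Minimizing the negative log-likelihood is equivalent to maximizing L itself, and for fixed σ² it reduces to minimizing the sum of squared residuals; that is why the numerical optimum lands on the least squares solution.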
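Finally, a quick arithmetic check on the melanoma example, using only the line parameters reported above (a hypothetical reproduction; the full 49-state dataset is not included here):

```python
# Line parameters as reported for the melanoma example (rounded values)
b0, b1 = 389.19, -5.977

def y_hat(latitude):
    # point estimate of the mean mortality rate at a given latitude
    return b0 + b1 * latitude

print(round(y_hat(30), 2))  # 209.88, matching the reported mean response
print(round(y_hat(40), 2))  # 150.11, matching the reported mean response
```

Note that latitude 30 sits below the smallest latitude in the table excerpt (33.0), so that prediction already involves a mild extrapolation; this is arguably the point of the "would you trust this?" question, and exactly the danger flagged in the closing remarks.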