732G21/732A35/732G28

Courses
◦ 732G21 Sambandsmodeller, http://www.ida.liu.se/~732G21. One semester = Regression analysis+ + analysis of variance (teacher: Lotta Hallberg)
◦ 732G28 Regression methods, http://www.ida.liu.se/~732G28. Half a semester = Regression analysis
◦ 732A35 Linear statistical models, http://www.ida.liu.se/~732A22. Almost one semester = Regression analysis+ + analysis of variance (teacher: Lotta Hallberg)

Course information
◦ Course language: English, but you may use Swedish
◦ We use It’s learning (accessed via the Student portal) (show…)
◦ 9 lectures
◦ 8 computer labs; deadlines are around 5 days after each lab ends
◦ 8 lessons: I solve problems on the whiteboard, plus lab discussion
◦ One written final exam
◦ Course book: Kutner, M.H., Nachtsheim, C.J., Neter, J. and Li, W. Applied Linear Statistical Models with Student Data CD, 5th Edition, ISBN 0073108742.

Linear statistical models are widely used in
◦ Business
◦ Economics
◦ Engineering
◦ Social and biological sciences
◦ Etc.
Example: A database contains the prices of houses sold in Linköping in 2009, together with their age, size and other parameters.
◦ Given the parameters of a new house, determine its approximate market price
◦ Determine reasonable price bounds

Analysis of databases

No   Area (X1)   Age (X2)   Price (Y)
1    320         14         2,530,000
2    210         1          1,800,000
…    …           …          …

◦ Observations (records, cases) in rows
◦ Variables in columns
  ◦ Explanatory variables (predictors, inputs) $X_i$
  ◦ Response $Y$; we assume $Y = f(X_1, \dots, X_n)$
In this lecture: models with only one explanatory variable.

Real data can seldom be represented exactly as $Y = \beta X$ (observation errors, missing inputs, etc.)
Example: Age and salary for a sample of eight persons from a company.
Age (x)   Salary (y)
21        17
32        30
40        27
56        35
61        44
55        38
39        36
33        25

[Figure: scatterplot of Salary (y) against Age (x)]

The relation shown is almost linear.
Linear regression analysis: find a linear function as close as possible to the data.

[Figure: scatterplot of Salary (y) against Age (x) with fitted line y = 0.5471x + 8.4545]

For each X, there is a probability distribution $P(Y = y \mid X = x)$ of Y.
The aim is to find the regression function $E(Y \mid X = x)$.

Construction of regression models
◦ Selection of prediction variables (variance reduction)
◦ Functional form (from theory, approximation)
◦ Domain of the model
Software
◦ MINITAB
◦ SAS
◦ SPSS
◦ Matlab
◦ Excel

Formal statement
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
◦ $Y_i$ is the i-th response value
◦ $\beta_0$, $\beta_1$ are the model parameters (regression parameters: intercept and slope)
◦ $X_i$ is the i-th predictor value
◦ $\varepsilon_i$ are i.i.d. random variables with expectation zero and variance $\sigma^2$

Features (show…)
◦ $E(Y_i) = \beta_0 + \beta_1 X_i$
◦ $\sigma^2(Y_i) = \sigma^2$
◦ All $Y_i$ and $Y_j$ ($i \neq j$) are uncorrelated
Meaning of regression parameters
◦ $\beta_0$: mean response at X = 0
◦ $\beta_1$: change in $E(Y)$ per unit increase in X

Given data set $S = \{(X_1, Y_1), \dots, (X_n, Y_n)\}$
Method of least squares:
◦ Observed response: $Y_i$
◦ Estimated response: $\beta_0 + \beta_1 X_i$
◦ Deviation: $Y_i - (\beta_0 + \beta_1 X_i)$
The regression fit is good when all deviations are small (see picture), so we minimize the sum of squares
$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$

How to find the minimum of Q?
$\frac{\partial Q}{\partial \beta_0} = 0, \qquad \frac{\partial Q}{\partial \beta_1} = 0$
Setting both derivatives to zero gives the normal equations
$\sum_{i=1}^{n} Y_i = n b_0 + b_1 \sum_{i=1}^{n} X_i, \qquad \sum_{i=1}^{n} X_i Y_i = b_0 \sum_{i=1}^{n} X_i + b_1 \sum_{i=1}^{n} X_i^2$
Estimators of $\beta_0$ and $\beta_1$:
$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X}$

Exercise (for the salary data, in MINITAB):
1. Make a scatterplot (Scatterplot…, with and without regression line)
2. Perform regression using ”Regression…”
3. Perform regression using ”Fitted line plot..”
4. Calculate the coefficients by hand

[Figure: Salary (y) against Age (x) with fitted line y = 0.5471x + 8.4545]

Gauss-Markov theorem
Estimators $b_0$ and $b_1$ are unbiased and have minimum variance among all unbiased linear estimators.
◦ Unbiased: bias $= E(b_0) - \beta_0 = 0$, i.e. $E(b_0) = \beta_0$
◦ Analogously, $E(b_1) = \beta_1$
Show illustration…

Mean (expected) response: $E(Y) = \beta_0 + \beta_1 X$
Point estimator of the mean response (fitted value): $\hat{Y} = b_0 + b_1 X$
Residuals: $e_i = Y_i - \hat{Y}_i$

Plot of residuals (obtain it with MINITAB)

[Figure: residuals against Age]

Properties of residuals
1. $\sum_{i=1}^{n} e_i = 0$ (because $\partial Q / \partial \beta_0 = 0$)
2. $\sum_{i=1}^{n} e_i^2$ is the minimum possible
3. $\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$ (because of 1)
4. $\sum_{i=1}^{n} X_i e_i = 0$
5. $\sum_{i=1}^{n} \hat{Y}_i e_i = 0$ (can be shown)
The regression line always goes through $(\bar{X}, \bar{Y})$.

Estimate of the variance of a single population (sample variance):
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$
In regression, we compute $s^2$ using the residuals (look at the residual plot):
$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2, \qquad s^2 = MSE = \frac{SSE}{n-2}$

Why divide by n−2? Because then $E(MSE) = \sigma^2$, i.e. MSE is an unbiased estimator of $\sigma^2$.
Important: in general, the unbiased estimator is
$s^2 = MSE = \frac{SSE}{n-d}$
where d is the number of degrees of freedom used by the model, i.e. the number of model parameters.
Example: compute the residuals, SSE and MSE, and find them in the MINITAB output.

Minitab
◦ Graph → Scatterplot
◦ Stat → Regression
◦ Stat → Fitted Line Plot

Reading: Course book, Ch. 1 up to page 27.
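As a rough check on the hand calculation asked for in the exercise (coefficients, residuals, SSE and MSE for the salary data), the least-squares formulas can be sketched in plain Python. This is only an illustration under stated assumptions: the eight (age, salary) pairs are read column-wise from the table in the salary example, and the helper name `fit_simple_regression` is hypothetical, not something from the course material or from MINITAB.

```python
# Salary example: ages (x) and salaries (y) of eight persons.
# ASSUMPTION: the pairs are read column-wise from the slide's table.
ages = [21, 32, 40, 56, 61, 55, 39, 33]
salaries = [17, 30, 27, 35, 44, 38, 36, 25]


def fit_simple_regression(x, y):
    """Return least-squares estimates (b0, b1) for the line y = b0 + b1 * x.

    Hypothetical helper implementing
    b1 = sum((Xi - Xbar)(Yi - Ybar)) / sum((Xi - Xbar)^2), b0 = Ybar - b1 * Xbar.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_simple_regression(ages, salaries)

# Fitted values and residuals e_i = Y_i - Yhat_i.
fitted = [b0 + b1 * x for x in ages]
residuals = [y - y_hat for y, y_hat in zip(salaries, fitted)]

# Error sum of squares and its unbiased scaling by n - 2
# (two estimated parameters: intercept and slope).
sse = sum(e ** 2 for e in residuals)
mse = sse / (len(ages) - 2)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"sum of X_i * e_i = {sum(x * e for x, e in zip(ages, residuals)):.2e}")
print(f"SSE = {sse:.3f}, MSE = {mse:.3f}")
```

With these data the estimates should agree with the fitted line shown on the slides, y = 0.5471x + 8.4545, and the two printed sums should be zero up to rounding error, which illustrates residual properties 1 and 4.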