Statistical Modelling

Models for explanation and prediction
• Response variable - y
• Explanatory variables – x's
  – Can be continuous, categorical or a combination
• Simple random sample of n observations
• Seek relationship between y and x's
  – Aim for a parsimonious model

Simple Linear Regression

Single explanatory variable
• Parameters
  – Intercept
  – Slope
• Errors

Assumptions

• Linearity
• Error terms
  – Mean zero
  – Constant variance
  – Independence
  – Often also assume normality, i.e. errors ~ N(0, σ²)

Fitting

Least squares: find the parameters that minimize the error sum of squares.
Under normality this is equivalent to maximum likelihood.

Linear Regression - Mussels

• Consider relationship between edible mass and shell mass

[Figure: Scatterplot of edible vs shell; x-axis: shell (0 to 400), y-axis: edible (0 to 50)]

Fitting the Least Squares Linear Regression Line

[Figure: plot not recoverable; one axis labelled weight]

Regression Analysis: edible versus shell

The regression equation is
edible = 4.42 + 0.133 shell

Predictor  Coef      SE Coef   T      P
Constant   4.4234    0.8569    5.16   0.000
shell      0.132712  0.005768  23.01  0.000

S = 4.27444   R-Sq = 86.9%   R-Sq(adj) = 86.7%

Regression Output

For each predictor, the Coef and SE Coef columns give the estimate and its variability.
Test of significance is given by the t-value, e.g. for testing H0: slope = 0 vs H1: slope ≠ 0 use the t-value and associated p-value.
Rule-of-thumb: t-values with absolute value greater than 2 are significant.

• Estimate of σ
  – S = 4.27444, based on 80 degrees of freedom (= 82 obs – 2 estimated parameters)
• Proportion of variation explained
  – R-Sq = 86.9%
  – Multiple R² coefficient, here the squared correlation coefficient

Analysis of Variance

Source          DF  SS       MS      F       P
Regression       1   9671.1  9671.1  529.32  0.000
Residual Error  80   1461.7    18.3
Total           81  11132.8

The F-test gives an overall test of significance of the regression. For simple regression it is equivalent to the t-test for the slope estimate.
Residual Error MS gives the error variance s².

Fitted Line Plot

edible = 4.423 + 0.1327 shell
S = 4.27444   R-Sq = 86.9%   R-Sq(adj) = 86.7%

[Figure: fitted line plot; x-axis: shell (0 to 400), y-axis: edible (0 to 50)]

Model Checking

Key components for model checking are:
• Fitted values
• Residuals
  – observed - fitted
Can be obtained as additional output. Basic diagnostic plots can also be requested.
Minitab automatically prints out a list of points with large residuals (R) or large influence (X).

Basic Residual Plots

• Plot residuals vs x-variable
  – Linearity
• Plot residuals vs fitted values
  – Linearity
  – Constant variance
• Normal probability plot of residuals
  – Normality
Plot residuals against included variables and other variables of potential importance.

Residual Plots for edible

[Figure: four panels of standardized residuals: Normal Probability Plot of the Residuals; Residuals Versus the Fitted Values; Histogram of the Residuals; Residuals Versus the Order of the Data]

Interval Estimates

• Parameter estimates: coefficient ± 2 × SE of coefficient
• Fitted values – confidence interval
• Predicted values – prediction interval

New Obs  shell          Fit     SE Fit  95% CI            95% PI
1        150            24.330  0.495   (23.344, 25.316)  (15.767, 32.894)
1        400 (extreme)  57.508  1.661   (54.203, 60.813)  (48.382, 66.634) XX

What do we see?
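
The least-squares fit described under Fitting can be sketched in a few lines of plain Python. This is a minimal illustration, not the Minitab procedure: the function name fit_simple_regression and the five data points are hypothetical, and the toy data are not the mussel measurements. It uses the standard closed-form estimates (slope = Sxy / Sxx, intercept = mean of y minus slope times mean of x) and then derives S, R-Sq, the slope t-value and the F statistic from the residuals.

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with basic summaries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Corrected sums of squares and cross-products.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Closed-form least-squares estimates.
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residuals are observed minus fitted values.
    fitted = [intercept + slope * xi for xi in x]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]

    ss_err = sum(r ** 2 for r in residuals)   # residual error sum of squares
    ss_reg = syy - ss_err                     # regression sum of squares
    ms_err = ss_err / (n - 2)                 # n - 2 degrees of freedom
    se_slope = math.sqrt(ms_err / sxx)

    return {
        "intercept": intercept,
        "slope": slope,
        "residuals": residuals,
        "s": math.sqrt(ms_err),
        "r_sq": ss_reg / syy,
        "t_slope": slope / se_slope,
        "f": ss_reg / ms_err,
        "corr": sxy / math.sqrt(sxx * syy),
    }


# Hypothetical toy data, NOT the mussel measurements.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = fit_simple_regression(x, y)
print(result["intercept"], result["slope"], result["s"], result["r_sq"])
```

On any data set, the returned f should equal t_slope squared and r_sq should equal corr squared, up to floating-point error. These are the two equivalences noted for simple regression in the slides.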
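
The quantities in the Minitab output are linked by simple arithmetic, which the sketch below recomputes from the reported numbers. All inputs are copied from the output shown in these slides. The helper approx_intervals is a hypothetical name, and it uses the rule-of-thumb multiplier 2 from the Interval Estimates slide in place of the exact t critical value, so its intervals only approximate the reported 95% CI and PI.

```python
import math

# Values copied from the regression output in the slides.
coefficients = {"Constant": (4.4234, 0.8569), "shell": (0.132712, 0.005768)}
ss_reg, ss_err, ss_total = 9671.1, 1461.7, 11132.8
n_obs, n_params = 82, 2

# Residual degrees of freedom: observations minus estimated parameters.
df_err = n_obs - n_params

# t-value = coefficient / standard error of coefficient.
t_values = {name: est / se for name, (est, se) in coefficients.items()}

ms_err = ss_err / df_err    # Residual Error MS: estimate of the error variance
s = math.sqrt(ms_err)       # S: estimate of sigma
r_sq = ss_reg / ss_total    # proportion of variation explained
f_stat = ss_reg / ms_err    # regression MS (1 degree of freedom) over error MS


def approx_intervals(fit, se_fit, s, multiplier=2.0):
    """Approximate 95% CI and PI using the rule-of-thumb multiplier of 2."""
    half_ci = multiplier * se_fit
    # A new observation adds the error variance to the variance of the fit.
    half_pi = multiplier * math.sqrt(s ** 2 + se_fit ** 2)
    return (fit - half_ci, fit + half_ci), (fit - half_pi, fit + half_pi)


ci_150, pi_150 = approx_intervals(24.330, 0.495, s)   # shell = 150
ci_400, pi_400 = approx_intervals(57.508, 1.661, s)   # shell = 400 (extreme)

print(t_values, s, r_sq, f_stat)
print(ci_150, pi_150)
print(ci_400, pi_400)
```

Small differences from the printed output are expected, because the sums of squares in the table are rounded and the multiplier 2 is slightly larger than the t critical value on 80 degrees of freedom.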