STAT 602 - Simple Linear Regression and Correlation Spring 2015 Regression is used to study relationships between variables. Linear regression is used for a special class of relationships, namely, those that can be described by straight lines, or by generalizations of straight lines to many dimensions. In regression we seek to understand how the value of a response of variable (Y) is related to a set of explanatory variables or predictors (X’s). This is done by specifying a model that relates the mean or average value of the response to the predictors in mathematical form. In simple linear regression we have single predictor variable (X) and we use a line (of the form y = mx + b) to relate the response to the predictor. Regression Model Notation: E(Y|X) = Var(Y|X) = In regression we first specify a functional form or model for E(Y|X). We then fit that model with using available data and then assess the adequacy of the model. We may then decide to remove some of the explanatory variables that do not appear important, add others to improve the model, change the function form of the model, etc. Regression is an iterative process where we fit a preliminary model and then modify our model based on the results. Multiple Regression Examples: a) E(Mercury Level in Walleyes |Weight, Length, Alkalinity, pH) = 0 1Weight 2 Length 3 Alkalinity 4 pH b) E(Systolic BP| )= 1 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Example 1: Length of Sunfish in Camp Lake Data File: Camplake.JMP Background: These data were collected as part of study of bluegill sunfish in Camp Lake. The variables measured in this study were: Age - age of the sunfish (determined by counting growth rings on scales, and recorded to the nearest year) Scalerad - scale radius (mm/100) Length - length of sunfish (mm) Response of interest (Y) Goal: Investigate the relationship between age (X) and the length of a sunfish from Camp Lake (Y). Also the relationship between scale radius and length is of potential interest. Assumptions for a Simple Linear Regression Model 1. The mean of the response variable (Y) can be modeled using X in the following form: E (Y | X ) o x * X 2. The variability in the response variable (Y) must be the same for each X, i.e. Var (Y | X ) 2 or equivalently SD(Y | X ) . In other words, the variation in the response Y must be constant across the entire range of X. 3. The response measurements (Y’s) should be independent of each other. If a simple random sample is taken from the population this typically the case. One situation where this assumption is violated is when the data are collected over time. 4. The response measurements (Y), for a given value of X, should follow a normal distribution. We will discuss how to check the assumptions outlined above after fitting our initial model. 2 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Correlation (initial investigation): This gives us some idea about whether or not the predictor(s) X and response Y are linearly related. The correlations are obtained under Analyze > Multivariate. Next, select the variables of interest and place them all in the Y box. Correlation measures to what degree two numeric variables X and Y are related to another in a linear fashion. If the correlation is 1 or -1, then the variables are perfectly linearly related. A scale for correlations is given below. Sample Correlation (r) (Pearson’s) n r (x i 1 n (x i 1 i i x )( y i y ) x)2 n (y i 1 i y) 2 What is the correlation between the variables here? In particular, what is the correlation between age and scale radius with length? 3 STAT 602 - Simple Linear Regression and Correlation Spring 2015 NEVER CALCULATE CORRELATIONS WITHOUT PLOTTING YOUR DATA! To test whether a correlation is significantly different from 0, we select Multivariate > Pairwise Correlations. Here we see that all three pair-wise correlations are significantly different from zero. 4 STAT 602 - Simple Linear Regression and Correlation Spring 2015 More on Correlation 5 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Fitting the regression model relating Length (Y) to Age (X) What is the population model? We want to model the mean value of Y using X, so the model is given by: E (Y | X ) o x * X or being specific for this situation, we have E ( Length | Age) o 1 Age Again: E(Y|X) is the notation we use to denote the mean value of Y given X. Taking a closer look at its pieces: For this data set, Y = Length (mm) and X = Age (years). How is the line that best fits our data determined? Using the Method of Least Squares. 6 STAT 602 - Simple Linear Regression and Correlation Spring 2015 To fit the model E ( Length | Age) o Age * Age we first select Analyze > Fit Y by X and place Length in the Y box and Age in the X box as shown below. This will give us scatter plot of Y vs. X, from which we can fit the simple linear regression model. The resulting scatter plot is shown below. With regression line added. To perform the regression of Length on Age in JMP, select Fit Line from the Bivariate Fit... pull-down menu. 7 STAT 602 - Simple Linear Regression and Correlation Spring 2015 The resulting output is shown below. We begin by looking at whether or not this regression stuff is even helpful: H o : Regression is NOT useful H a : Regression is useful This says that Age (X) explains a significant amount of the variation in the response Length (Y). Next we can assess the importance of the X variable, Age in this case: H o : Age 0 H 1 : Age 0 This is equivalent to saying the X is not useful for explaining Y. Thus the results of these two tests are identical in every 8 way! STAT 602 - Simple Linear Regression and Correlation Spring 2015 Conclusions from both tests: Determining how well the model is doing in terms of explaining the response is done using the R-Square and Root Mean Square Error: The proportion or percentage of variation explained by the regression of Length (Y) on Age (X) is given by the R-Square = .838 or 83.8% More on the R-Square The amount of unexplained variability in Length (Y) taking Age (X) into consideration is given by the RMSE (root mean square error). This is an estimate of the SD(Length|Age). Describing the Relationship: Eˆ ( Length | Age) 70.99 16.71 * Age Interpret each of the parameter estimates: 9 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Estimating the Mean Y for a Given X and Making Individual Predictions What do we estimate for the mean length of all sunfish in Camp Lake that are 4 years old? If picked a 4 year old fish from Camp Lake at random, what do we predict its length will be? C hecking the Assumptions: To check the adequacy of the model in terms of the basic assumptions involves looking a plots of the residuals from the fit. One plot that is particular helpful for checking a number of the model assumptions is a plot of the residuals vs. the fitted values. When we are performing simple linear regression we can alternatively plot the residuals vs. X, which is what JMP gives by default when selecting the Linear Fit > Plot Residuals option. Ideal Residual Plot: Violations to Assumption #1: The trend need not be linear (BAD) The trend need not be linear (BAD) 10 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Violations to Assumption #2: Variance of Y given X is NOT constant. Megaphone (BAD) Violations to Assumption #3: One point closely following another -positive autocorrelation, (BAD) Extreme bouncing back and forth -- negative autocorrelation (BAD) Violations to Assumption #4: To check this assumption, simple save the residuals out and make a histogram of the residuals and/or look at a normal quantile plot. Recall, you can easily make a histogram of a variable under Analyze > Distribution. We should generally assess normality using a normal quantile plot as well. 11 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Checking for outliers: Determine the value of 2*RMSE. Any observations outside these bands are potential outliers and should be investigated further to determine whether or not they adversely affect the model. Checking for outliers in this example we find: To obtain the residual plot in JMP select Linear Fit > Plot Residuals 12 STAT 602 - Simple Linear Regression and Correlation Spring 2015 THE ASSUMPTION CHECKLIST: Model Appropriate: Constant Variance: Independence: Normality Assumption (see histogram above): Identify Outliers: Confidence Interval for the Average Length: (i.e. the average for the entire population of sunfish of a specified age.) Select Confid Curves Fit from the Linear Fit pull-down menu located below the scatter plot. The narrow bands in plot below represent the CI for the Average Length. For example, from the plot below we estimate that the mean length for 2 year old sunfish is likely to be somewhere between 95 – 107 mm. We will examine a more precise way for obtaining such intervals later in the tutorial. 13 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Prediction Interval for the Length of Single Sunfish : (i.e., for a single sunfish sampled from the population of all sunfish with a specified age.) Select Confid Curves Indiv from the Linear Fit pull-down menu which is located directly beneath the scatter plot. These are the wider bands in the plot above. For example if we were to sampled a single 4 year old sunfish we estimate with 95% confidence that its length will be somewhere between 115 – 155 mm. We will examine a more precise way for obtaining exact intervals of this form later in the tutorial. 14 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Using the Analyze > Fit Model Option to Perform the Regression An alternative to using Fit Y by X to perform simple linear regression, is to use the Fit Model option from the Analyze menu. The advantages of this approach are two-fold: 1) You have access to more detailed results from your regression and have enhanced features for estimation/prediction of Y. 2) Allows for the addition of more predictors (X’s) to your model. This is called multiple regression and will be discussed later. For the Camp Lake sunfish example we fit the model as follows: Select Analyze > Fit Model and place Length in the Y box and Age in Model Effects box. Y variable X variables go here. If we had more predictors (X’s) that we wanted to add to our model we would simply put them in the Model Effects box, e.g. we could scale radius as a predictor of length as well. Remember when we have more than one predictor in a linear regression we call it a multiple regression. 15 STAT 602 - Simple Linear Regression and Correlation Spring 2015 The output from the Analyze > Fit Model option is shown below: The same numeric summaries, parameter estimates, and test results are contained as part of the standard output. The plot at the top is NOT a scatter plot of Y vs. X. It is a plot of the actual Y values vs. the predicted values from the regression model. The stronger the trend exhibited the better the fit. It is essentially a visualization of the R-square for the fitted model. In cases where we have multiple values of the response for some of the X values we will be given the results of the Lack Of Fit test. If p-value here is small it can indicate that our model does not adequately represent the mean of the response variable Y. For example, if we fit line to clearly nonlinear relationship we would expect this p-value to be significant. In cases where there is significant lack of fit the plot of the residuals vs. the fitted values will generally show some non-linear trend. Notice the bulk of the output is same as that obtained using the Analyze > Fit Y by X approach. 16 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Interval Estimation for E(Y|X), the Mean Value of Y for a given X & Prediction of Y for an Individual with a given X (when using Fit Model) You can save 95% Confidence Intervals for E(Y|X) to the data spread sheet by selecting Mean Confidence Interval. You can save 95% Prediction Intervals for Individual Y values to the data spread sheet by selecting Indiv Confidence Interval. The table on the following page is a portion data spread sheet showing both types of intervals for these data. Interpretation of the 95% Confidence Interval for E(Length|Age=5) Consider estimating the average/mean length for all five year old sunfish in Camp Lake. A 95% confidence interval for this mean is given by the interval 151.76 mm to 157.28 mm. There is a 95% chance this interval covers the true mean length of 5 year old sunfish in Camp Lake. Interpretation of the 95% Prediction Interval for Length|Age=5 Suppose we picked one five year old sunfish at random from the population of all 5 year old sunfish in Camp Lake. What do estimate the length of this sunfish will be? We estimate, with 95% confidence, that the actual length for this particular sunfish will be somewhere between 135.2 mm and 173.8 mm. This range of scores has a 95% chance of covering the actual length for this randomly selected 5 year old sunfish. Notice how much wider this interval is when compared to interval for the mean length for all 5 year old sunfish in Camp Lake. This should seem natural as it is much harder to predict the length of a single randomly selected fish than the average score for all fish. 17 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Example 2 – Length (Y) vs. Scale Radius (X) 190 180 Length 170 160 150 140 130 120 110 100 90 80 1.5 2 2.5 3 3.5 Scalerad Linear Fit Length = 42.346763 + 40.17735 Scalerad Summary of Fit RSquare RSquare Adj Root Mean Square Error Mean of Response Observations (or Sum Wgts) 0.731336 0.726858 12.28663 142.9355 62 Summary of R-square: Analysis of Variance Source Model Error C. Total DF 1 60 61 Sum of Squares 24656.064 9057.678 33713.742 Mean Square 24656.1 151.0 F Ratio 163.3271 Prob > F <.0001 Parameter Estimates Term Intercept Scalerad Estimate 42.346763 40.17735 Std Error 8.02401 3.143781 t Ratio 5.28 12.78 Prob>|t| <.0001 <.0001 Conclusions from tests results highlighted above: 18 STAT 602 - Simple Linear Regression and Correlation Spring 2015 Residual Plot (residuals vs. scale radius (X)) 40 30 Residual 20 10 0 -10 -20 -30 1.5 2 2.5 3 3.5 Scalerad Discussion of Residual Plot: Distribution of the Residuals 19