This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of these materials is permitted only in accordance with the license rights granted. Materials are provided "AS IS"; no representations or warranties are given. The user assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. These materials may contain content owned by others; the user is responsible for obtaining any needed permissions from third parties.

Section C
Variability in MLR: Assessing Uncertainty and Goodness of Fit

MLR
- The algorithm used to estimate the equation of the MLR line is called "least squares" estimation.
- The idea is to find the line (actually a multi-dimensional object, such as a plane or beyond) that gets "closest" to all of the points in the sample.
- How to define closeness to multiple points? In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of $\hat{y}$ for that point's set of x-values: in other words, the squared distance between an observed y-value and the estimated mean y-value for all points with the same values of each x.

MLR
- This distance is computed for each data point in the sample.
- The values chosen for $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are the values that minimize the cumulative squared distances, i.e., that minimize

  $\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} \bigl(y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_{1i} + \cdots + \hat{\beta}_p x_{pi})\bigr)^2$

Example: Arm Circumference and Height
- The values chosen for $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ are just estimates based on a single sample.
- If you were to take a different random sample, the resulting estimates would likely be different: i.e., the values that minimized the cumulative squared distance for this second sample of points would likely be different.
- As such, all estimated regression coefficients have an associated standard error that can be used to make statements about the true relationship between the mean of y and $x_1, x_2, \ldots, x_p$ (the true slopes) based on a single sample.

Example: Arm Circumference and Height
- The random sampling behavior of estimated regression coefficients is normal for large samples (n > 60) and centered at the true values.
- As such, we can use the same ideas as before to create 95% CIs and get p-values.

Arm Circumference MLR
- How were the 95% CIs for the slopes of height and weight estimated?

Arm Circumference MLR
- Notice that each slope has an estimated standard error.

Arm Circumference MLR
- Just as in SLR, the sampling distributions of estimated MLR slopes are normal and centered at the true population values (when n is large, i.e., n > 60).
- So the approach to constructing a 95% CI for $\beta_j$ is the same old approach: $\hat{\beta}_j \pm 2\,\widehat{SE}(\hat{\beta}_j)$.
- For example, for the slope of height from the previous regression results, the 95% CI for $\beta_{height}$ is given by $\hat{\beta}_{height} \pm 2\,\widehat{SE}(\hat{\beta}_{height}) \approx (-0.21, -0.11)$.

Arm Circumference MLR
- How to get a p-value?
  - Ho: $\beta_p = 0$ (no relationship between y and $x_p$ after accounting for the other xs)
  - Ha: $\beta_p \neq 0$ ($x_p$ is associated with y after accounting for the other xs)
- Same "recipe" as before (a code sketch of the least squares fit and these large-sample CIs follows below).
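As a concrete companion to the least squares and large-sample CI machinery above, here is a minimal sketch in Python using statsmodels. The data are simulated stand-ins (the lecture's actual arm circumference dataset is not reproduced here); all variable names and generated values are illustrative assumptions, with the height slope seeded near the lecture's estimate so the output is recognizable.

```python
# Minimal sketch: least squares MLR fit and large-sample 95% CIs
# (estimate +/- 2*SE). Data are SIMULATED; not the lecture's dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 150                                    # "large" sample: n > 60
height = rng.normal(90, 10, size=n)        # hypothetical heights (cm)
weight = rng.normal(13, 2, size=n)         # hypothetical weights (kg)
# y built so the height slope lands near the lecture's estimate (about -0.16)
armc = 20 - 0.16 * height + 1.2 * weight + rng.normal(0, 1, size=n)

X = sm.add_constant(np.column_stack([height, weight]))
res = sm.OLS(armc, X).fit()                # minimizes sum of (y_i - yhat_i)^2

# Each estimated coefficient comes with a standard error; for large n the
# "same old" 95% CI is estimate +/- 2*SE.
for name, b, se in zip(["const", "height", "weight"], res.params, res.bse):
    print(f"{name:7s} est = {b:7.3f}  SE = {se:.3f}  "
          f"95% CI ~ ({b - 2 * se:.3f}, {b + 2 * se:.3f})")
```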
Arm Circumference MLR
- How to get a p-value for the slope of height in this MLR?
  - Ho: $\beta_1 = 0$ (no relationship between arm circumference and height after accounting for weight)
  - Ha: $\beta_1 \neq 0$
- Same "recipe" as before:
  - Assume Ho is true
  - Compute the "distance" of the sample result from 0 in units of standard error
  - Compare this distance to the sampling distribution to get a p-value

Arm Circumference MLR
- To get the p-value:
  - We have a result that is 6.6 standard errors below 0; the sampling distribution is normal and centered at the assumed null truth of 0.
  - The resulting probability of getting a sample estimate 6.6 or more standard errors away from 0 if Ho is true is the p-value, and it is really small: p < .0001.

Hemoglobin MLR
- How were the 95% CIs for the slopes of PCV and age estimated?

Hemoglobin MLR
- When n is small (i.e., n < 60), just as in SLR, the sampling distribution of the MLR slopes is a t-distribution, but with n - (1 + number of xs) degrees of freedom.
- So the approach to constructing a 95% CI for $\beta_j$ is the same old approach, with the t-cutoff in place of 2: $\hat{\beta}_j \pm t\,\widehat{SE}(\hat{\beta}_j)$.
- For example, for the slope of PCV from the previous regression results, the 95% CI for $\beta_{PCV}$ is given by $\hat{\beta}_{PCV} \pm t_{18}\,\widehat{SE}(\hat{\beta}_{PCV}) \approx (0.037, 0.163)$.

Hemoglobin MLR
- How to get a p-value?
  - Ho: $\beta_p = 0$ (no relationship between y and $x_p$ after accounting for the other xs)
  - Ha: $\beta_p \neq 0$ ($x_p$ is associated with y after accounting for the other xs)
- Same "recipe" as before

Hemoglobin MLR
- How to get a p-value for the slope of PCV in the MLR of hemoglobin on PCV and age?
  - Ho: $\beta_1 = 0$ (no relationship between hemoglobin and PCV after accounting for age)
  - Ha: $\beta_1 \neq 0$
- Same "recipe" as before:
  - Assume Ho is true
  - Compute the "distance" of the sample result from 0 in units of standard error
  - Compare this distance to the sampling distribution to get a p-value

Hemoglobin MLR
- To get the p-value:
  - We have a result that is 3.3 standard errors above 0 (the estimated PCV slope is positive); the sampling distribution is a t-distribution with 18 degrees of freedom, centered at the assumed null truth of 0.
  - The resulting probability of getting a sample estimate 3.3 or more standard errors away from 0 if Ho is true is the p-value: p = 0.004.

The Overall F-Test
- In both small and large samples, the p-value for each slope in an MLR is based on testing for a relationship between y and a specific x, in a model that includes multiple xs.
- In some sense, it may be nice to know whether any of the xs are associated with y before assessing which ones are, by looking at the inferences (CI, p-value) on the individual slopes.
- The overall F-test answers this question.

The Overall F-Test
- Generic formulation: null and alternative
  - Ho: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
  - Ha: at least one slope ($\beta_j$) not equal to 0
- The test gives only a p-value (no 95% CI) for choosing between the null and alternative hypotheses.
- If the null is rejected, the individual CIs/p-values for each slope can be used to find out which xs are statistically significant.

The Overall F-Test
- Null and alternative:
  - Ho: $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
  - Ha: at least one slope ($\beta_j$) not equal to 0
- The p-value for this test is reported in the regression output.

Measuring Variability Explained by MLR
- (SR1 flashback) The sample standard deviation of the y-values, ignoring the corresponding potential information in the xs, is

  $s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$

- This measures how far, on average, each of the sample y-values falls from the overall mean of all y-values.
- In the arm circumference example, s = 1.48 cm.

Measuring Variability Explained by MLR
- "Visualization" on the scatterplot

Measuring Variability Explained by MLR
- The standard deviation of the regression, $s_{y|x}$, referred to as the root mean square error, is the "average" distance of the points from the line: how far, on average, each y falls from its mean as predicted by its corresponding x-values:

  $s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - (p + 1)}}$

Measuring Variability Explained by MLR
- The regress command in Stata gives $s_{y|x}$ (named "Root MSE" on the output).
- A code sketch of the small-sample t-based CIs, the overall F-test, and the root MSE follows below.
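To tie these pieces together, here is a minimal sketch in Python (statsmodels and scipy), again on simulated data; the sample size is set to 21 so the residual degrees of freedom match the lecture's hemoglobin example (18 = 21 - (1 + 2)). All variable names and generated values are illustrative assumptions, not the lecture's data.

```python
# Minimal sketch: small-sample (t-based) slope inference, the overall F-test,
# and the root MSE. Data are SIMULATED; not the lecture's hemoglobin dataset.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(2)
n = 21                                     # small sample: n - (1 + 2) = 18 df
pcv = rng.normal(33, 5, size=n)            # hypothetical packed cell volumes
age = rng.normal(30, 8, size=n)            # hypothetical ages (years)
hgb = 2 + 0.10 * pcv + 0.05 * age + rng.normal(0, 0.8, size=n)

X = sm.add_constant(np.column_stack([pcv, age]))
res = sm.OLS(hgb, X).fit()

df = int(res.df_resid)                     # n - (1 + number of xs)
tcut = stats.t.ppf(0.975, df)              # t-cutoff replaces 2 when n is small
b, se = res.params[1], res.bse[1]          # estimated slope of pcv and its SE
print(f"95% CI for pcv slope: ({b - tcut * se:.3f}, {b + tcut * se:.3f})")
print(f"t = {res.tvalues[1]:.2f}, p = {res.pvalues[1]:.4f}")  # slope test

# Overall F-test: Ho: all slopes = 0 vs. Ha: at least one slope != 0
print(f"F = {res.fvalue:.2f}, overall p = {res.f_pvalue:.4f}")

# Root MSE (s_y|x): "average" distance of the points from the fitted surface
print(f"root MSE = {np.sqrt(res.mse_resid):.3f}")
```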
Measuring Variability Explained by MLR
- If $s = s_{y|x}$, then knowing x does not yield a better guess for the mean of y than using the overall mean (a flat regression line).
- The smaller $s_{y|x}$ is relative to s, the closer the points are to the regression line.
- $R^2$ functionally measures how much smaller $s_{y|x}$ is than s: as such, it is an estimate of the amount of variability in y explained by taking all of the xs into account.

Measuring Variability Explained by MLR
- The regress command in Stata gives $R^2$: child height, sex, and weight together explain (an estimated) 78% of the variation in arm circumferences.

Example: Arm Circumference and Height
- One mathematical quirk about $R^2$ in MLR is that adding more xs will always increase $R^2$, even if an x is not informative about y.
- There is a quantity called "adjusted $R^2$" that penalizes the original $R^2$ for this property; a code sketch of this behavior follows below.
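To make the $R^2$ quirk concrete, here is a minimal sketch in Python (statsmodels), reusing the same hypothetical arm circumference setup as the first snippet plus one pure-noise predictor. Names and values are illustrative assumptions only.

```python
# Minimal sketch: R^2 never decreases when a predictor is added, even a useless
# one, while adjusted R^2 penalizes the extra term. SIMULATED data only.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 150
height = rng.normal(90, 10, size=n)        # hypothetical heights (cm)
weight = rng.normal(13, 2, size=n)         # hypothetical weights (kg)
armc = 20 - 0.16 * height + 1.2 * weight + rng.normal(0, 1, size=n)
noise = rng.normal(size=n)                 # carries no information about y

X1 = sm.add_constant(np.column_stack([height, weight]))
X2 = sm.add_constant(np.column_stack([height, weight, noise]))
r1 = sm.OLS(armc, X1).fit()
r2 = sm.OLS(armc, X2).fit()

print(f"without noise x: R^2 = {r1.rsquared:.4f}  adj R^2 = {r1.rsquared_adj:.4f}")
print(f"with noise x:    R^2 = {r2.rsquared:.4f}  adj R^2 = {r2.rsquared_adj:.4f}")
# R^2 ticks up with the junk predictor; adjusted R^2 typically goes down.
```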