Multiple Regression Analysis Example. Data: SDSS Quasar

A scientifically interesting study uses z as the dependent variable. The 5 independent variables of interest are g_mag, u_mag - g_mag, g_mag - r_mag, r_mag - i_mag, and i_mag - z_mag, where "-" is simple subtraction. For example, u_mag - g_mag is the difference between u_mag and g_mag. That is, we seek to predict z values (quasar distance) as a function of a brightness (g_mag) and four "colors" represented by the difference between brightnesses in adjacent spectral bands. These new variables are created in the dataset, and we use the four colors and the original g_mag to predict redshift z.

We are not making any transformation on z or the other variables. g_mag (and the other mags) and M_i are already log variables. z is a linear measure of distance, but there is a long tradition of treating it without a sqrt or log transformation. Note that astronomers never use ln, only log10 (really?).

We first examine correlations among the variables.

Correlations: z, g_mag, u-g, g-r, r-i, i-zmag  (u = u_mag, g = g_mag, r = r_mag, i = i_mag)

          z        g_mag    u-g      g-r      r-i
g_mag     0.445
          0.000
u-g       0.591    0.343
          0.000    0.000
g-r       0.436    0.531    0.478
          0.000    0.000    0.000
r-i       0.234    0.372    0.096    0.343
          0.000    0.000    0.000    0.000
i-zmag   -0.015    0.127    0.112    0.130    0.085
          0.001    0.000    0.000    0.000    0.000

Cell Contents: Pearson correlation / P-Value

Covariances: z, g_mag, u-g, g-r, r-i, i-zmag

          z            g_mag        u-g          g-r          r-i          i-zmag
z         0.6579231
g_mag     0.3188847    0.7810997
u-g       0.3640678    0.2302510    0.5758677
g-r       0.1170554    0.1552934    0.1200554    0.1094255
r-i       0.0335933    0.0581651    0.0129094    0.0200688    0.0313701
i-zmag   -0.0021379    0.0191642    0.0145140    0.0073726    0.0025639    0.0292995

SD        0.811125     0.883798     0.758859     0.330795     0.177116     0.171171

The correlation between two variables equals their covariance divided by the product of their standard deviations. For example, the correlation between z and g_mag (or g) is obtained as follows: 0.3188847 / [(0.811125)(0.883798)] = 0.445.

Regression Analysis: Multiple Linear Regression. We now proceed to a regression analysis of z (the response) on the five predictors g (= g_mag), u-g, g-r, r-i, and i-zmag.

Regression Analysis: z versus g_mag, u-g, g-r, r-i, i-zmag

The regression equation is
z = - 2.70 + 0.203 g_mag + 0.519 u-g + 0.172 g-r + 0.414 r-i - 0.543 i-zmag

Predictor       Coef      SE Coef        T      P    VIF
Constant    -2.69896     0.07400    -36.47  0.000
g_mag        0.203461    0.003922    51.88  0.000    1.5
u-g          0.519470    0.004294   120.96  0.000    1.3
g-r          0.17162     0.01110     15.46  0.000    1.7
r-i          0.41442     0.01757     23.59  0.000    1.2
i-zmag      -0.54282     0.01668    -32.55  0.000    1.0

S = 0.607280   R-Sq = 44.0%   R-Sq(adj) = 43.9%

Analysis of Variance

Source             DF       SS      MS        F      P
Regression          5  13423.2  2684.6  7279.59  0.000
Residual Error  46414  17117.0     0.4
Total           46419  30540.1

Source   DF  Seq SS
g_mag     1  6043.1
u-g       1  6664.7
g-r       1   137.6
r-i       1   187.0
i-zmag    1   390.7

Explanation of Output:

Total SS = SSTO = Σ(Y_i - Ȳ)² = 30540.1
SS(Residual Error) = SSE = Σ(Y_i - Ŷ_i)² = 17117.0
SS(Regression) = SSR = SSTO - SSE = 30540.1 - 17117.0 = 13423.2 (within rounding)
R² = SSR/SSTO = 13423.2/30540.1 = .44 (or 44%)
MSE = SSE/DF = 17117.0/46414 = 0.368790 ~ 0.4
s = √MSE = √0.368790 = 0.607280 = SD of the residuals
Seq SS (see below, after 'Unusual Observations')
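For readers who want to reproduce this fit outside Minitab, here is a minimal Python sketch using pandas and statsmodels. The file name SDSS_quasar.csv and the magnitude column names are assumptions; substitute whatever your copy of the dataset uses.

    import pandas as pd
    import statsmodels.api as sm

    # Assumed file and column names; adjust to the actual SDSS quasar table.
    df = pd.read_csv("SDSS_quasar.csv")

    # Create the four colors as differences of adjacent-band magnitudes.
    df["u-g"] = df["u_mag"] - df["g_mag"]
    df["g-r"] = df["g_mag"] - df["r_mag"]
    df["r-i"] = df["r_mag"] - df["i_mag"]
    df["i-zmag"] = df["i_mag"] - df["z_mag"]

    # Regress z on g_mag and the four colors (with an intercept).
    predictors = ["g_mag", "u-g", "g-r", "r-i", "i-zmag"]
    fit = sm.OLS(df["z"], sm.add_constant(df[predictors])).fit()
    print(fit.summary())    # coefficients, t-values, R-Sq
    print("SSR =", fit.ess, " SSE =", fit.ssr, " MSE =", fit.mse_resid)

The printed SSR, SSE, and MSE should match the ANOVA table above up to rounding.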
Unusual Observations. These are observations that have large standardized residuals and/or are highly influential (high leverage). Studentized residuals are defined by

Studentized residual = (Y_i - Ŷ_i) / s.e.(e_i), where s.e.(e_i) = √(MSE(1 - h_ii))

and h_ii is the ith diagonal element of the hat matrix H = X(X'X)⁻¹X'.

Altogether, there are 3163 'unusual' observations, of which 1821 have high leverage only, 391 have both high leverage and large standardized residuals, and 1001 have large standardized residuals only.

What is an observation with a large studentized residual? Minitab classifies an observation as such if its studentized residual is greater than 2 in absolute value. An observation i is classified as having high leverage if h_ii > 3p/n, where p is the number of parameters in the model and n is the number of observations. In this example, p = 6 (5 predictors and the constant) and n = 46420, so an observation has high leverage if h_ii > 18/46420 = 0.0003877. (The average value of h_ii is p/n.) Here are the first few observations declared to have large studentized residuals and/or high leverage (R flags a large standardized residual, X flags high leverage):

Obs  g_mag        z      Fit   SE Fit  Residual  St Resid
 16   18.4  2.64020  1.42290  0.00730   1.21730    2.00R
 55   21.8  3.94060  4.14009  0.01533  -0.19949   -0.33 X
 61   20.5  3.64670  4.33044  0.01874  -0.68374   -1.13 X
 73   20.0  0.47620  1.71640  0.00540  -1.24020   -2.04R
 77   18.6  2.43290  1.20272  0.00451   1.23018    2.03R
 82   21.3  3.19920  3.47476  0.01321  -0.27556   -0.45 X
 98   20.9  3.68640  3.75297  0.01368  -0.06657   -0.11 X
109   19.8  0.60430  1.85453  0.00529  -1.25023   -2.06R
112   23.0  1.52000  3.09531  0.02118  -1.57531   -2.60RX

What do the residuals look like? Here are descriptive statistics for the raw residuals (RESI1) and the studentized residuals (SRES1):

Descriptive Statistics: RESI1

Variable      N  N*          Mean  SE Mean    StDev   Minimum        Q1
RESI1     46420   0  -1.33621E-14  0.00282  0.60725  -4.81673  -0.39589

Variable    Median       Q3  Maximum
RESI1     -0.00494  0.43333  2.91842

Descriptive Statistics: SRES1

Variable      N  N*         Mean  SE Mean    StDev   Minimum        Q1
SRES1     46420   0  0.000000961  0.00464  1.00006  -7.94115  -0.65194

Variable    Median       Q3  Maximum
SRES1     -0.00813  0.71362  4.82384

Graphical displays of residuals can be informative about the assumptions of regression analysis. Here is a '4 in 1' residual plot, based on 3272 observations; the plots for the full sample are very similar. (Open the separate file to view it.)

Comments:
1. The normal probability plot shows some departure of the residuals from a normal distribution, but nothing really bad considering the sample size.
2. The histogram looks reasonably normal.
3. The graph of fitted values vs. residuals has an apparent aberrant feature, possibly indicating 'truncation' or 'censoring'.
4. The plot of residuals by order seems to indicate randomness, although there are so many observations that the resolution is very low.

Test for Normality of Errors. The Anderson-Darling (AD) test for normality of the residuals is significant at the .005 level. The graph indicates the presence of some exceedingly large standardized residuals. The AD statistic is essentially a function of the order statistics (the ordered values) of the residuals.

Sequential SS (Seq SS): Sequential sums of squares represent the contribution of each variable to the regression, adjusted for the order in which the variables are entered into the regression analysis. In this example, the variables were entered sequentially in the order g_mag, u-g, g-r, r-i, i-zmag.
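These diagnostics can be computed in Python with statsmodels, continuing from the fit object in the earlier sketch; the 3p/n cutoff and the |residual| > 2 rule below are the Minitab conventions just described, not statsmodels defaults.

    import numpy as np

    # Influence diagnostics for the fitted model ('fit' from the earlier sketch).
    infl = fit.get_influence()
    h = infl.hat_matrix_diag                    # leverages h_ii
    st_res = infl.resid_studentized_internal    # (Y_i - Yhat_i)/sqrt(MSE(1 - h_ii))

    p = int(fit.df_model) + 1                   # parameters incl. constant (6 here)
    n = int(fit.nobs)                           # 46420 here
    high_leverage = h > 3 * p / n               # Minitab's 3p/n leverage rule
    large_resid = np.abs(st_res) > 2            # Minitab's 'R' flag

    print("high leverage only: ", (high_leverage & ~large_resid).sum())
    print("both:               ", (high_leverage & large_resid).sum())
    print("large residual only:", (~high_leverage & large_resid).sum())

The three printed counts correspond to the 1821 / 391 / 1001 breakdown of unusual observations reported above.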
Then SS(Regression on g_mag) = 6043.1:

Regression Analysis: z versus g_mag

The regression equation is
z = - 6.37 + 0.408 g_mag

Predictor       Coef   SE Coef        T      P
Constant    -6.37083   0.07380   -86.33  0.000
g_mag        0.408251  0.003815  107.01  0.000

S = 0.726464   R-Sq = 19.8%   R-Sq(adj) = 19.8%

Analysis of Variance

Source             DF       SS      MS         F      P
Regression          1   6043.1  6043.1  11450.62  0.000
Residual Error  46418  24497.1     0.5
Total           46419  30540.1

Next, we add u-g to the regression model, resulting in the following output:

Regression Analysis: z versus g_mag, u-g

The regression equation is
z = - 3.58 + 0.252 g_mag + 0.532 u-g

Predictor       Coef   SE Coef        T      P
Constant    -3.58350   0.06642   -53.95  0.000
g_mag        0.251537  0.003466   72.58  0.000
u-g          0.531635  0.004036  131.71  0.000

S = 0.619820   R-Sq = 41.6%   R-Sq(adj) = 41.6%

Analysis of Variance

Source             DF       SS      MS         F      P
Regression          2  12707.8  6353.9  16538.94  0.000
Residual Error  46417  17832.4     0.4
Total           46419  30540.1

Source   DF  Seq SS
g_mag     1  6043.1
u-g       1  6664.7

The increase in SS(Regression) due to adding u-g to the model already containing g_mag is

SS[Regression on u-g | adjusted for g_mag] = 12707.8 - 6043.1 = 6664.7.

The remaining sequential sums of squares are calculated similarly.

If one wishes to look at the relative contributions of the predictors in terms of reducing the sum of squares for error (equivalently, increasing their contribution to SS(Regression)), one can perform a stepwise regression (or some other model selection procedure):

Stepwise Regression: z versus g_mag, u-g, g-r, r-i, i-zmag

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is z on 5 predictors, with N = 46420

Step               1        2        3        4        5
Constant       1.231   -3.584   -3.732   -3.109   -2.699

u-g           0.6322   0.5316   0.5406   0.5448   0.5195
T-Value       158.04   131.71   134.89   136.95   120.96
P-Value        0.000    0.000    0.000    0.000    0.000

g_mag                  0.2515   0.2615   0.2255   0.2035
T-Value                 72.58    75.85    61.59    51.88
P-Value                 0.000    0.000    0.000    0.000

i-zmag                          -0.512   -0.532   -0.543
T-Value                         -30.42   -31.83   -32.55
P-Value                          0.000    0.000    0.000

r-i                                      0.472    0.414
T-Value                                  27.42    23.59
P-Value                                  0.000    0.000

g-r                                               0.172
T-Value                                           15.46
P-Value                                           0.000

S              0.654    0.620    0.614    0.609    0.607
R-Sq           34.98    41.61    42.75    43.66    43.95
R-Sq(adj)      34.98    41.61    42.75    43.66    43.95
Mallows C-p   7425.2   1939.8    996.6    243.1      6.0

I originally regressed M_i on z, g_mag, and i_mag and got an R-square of about 86%. Then, regressing M_i on sqrt(z), g_mag, and i_mag, R-square goes up to 96%; and the regression of M_i on ln z, g_mag, and i_mag gives (of course) R-square = 99.9%. "But it will be dominated by the uninteresting M_i-z correlation; in this bivariate plot you will see a sharp parabolic envelope that represents the uninteresting detection limit of the survey. That is, there are no points with high M_i (i.e., negative values closer to zero) and high z (i.e., great distance) because we simply can't see faint and distant quasars."
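The sequential decomposition above can also be seen in code: fit the nested models in their entry order and difference the regression (explained) sums of squares. A minimal sketch, reusing the DataFrame and column names assumed in the first code sketch:

    import statsmodels.api as sm

    # Sequential SS: each term's Seq SS is the increase in SS(Regression)
    # when that term is added to the model containing the earlier terms.
    order = ["g_mag", "u-g", "g-r", "r-i", "i-zmag"]
    prev_ess = 0.0
    for k in range(1, len(order) + 1):
        X = sm.add_constant(df[order[:k]])
        m = sm.OLS(df["z"], X).fit()
        print(f"{order[k-1]:8s} Seq SS = {m.ess - prev_ess:10.1f}")
        prev_ess = m.ess    # ess is SS(Regression) for the current model

Run on the full table, this should reproduce the Seq SS column above (6043.1, 6664.7, 137.6, 187.0, 390.7) up to rounding.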