Multiple Regression Analysis Example. Data: SDSS Quasar
A scientifically interesting study uses z as the dependent variable. The 5 independent variables of interest are

g_mag, u_mag-g_mag, g_mag-r_mag, r_mag-i_mag, and i_mag-z_mag,

where "-" is a simple "minus" operation; for example, u_mag-g_mag is the difference between u_mag and g_mag. That is, we seek to predict z values (quasar distance) as a function of brightness (g_mag) and four "colors" represented by the differences between brightnesses in adjacent spectral bands. These new variables are created in the dataset, and we use the four colors together with the original g_mag to predict redshift z.

We make no transformation of z or the other variables. g_mag (and the other mags) and M_i are already log variables. z is a linear measure of distance, but there is a long tradition of treating it without a sqrt or log transformation. Note that astronomers never use ln, only log10 (really?).
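As a sketch of this variable construction in Python/pandas (the file name SDSS_quasar.csv and the exact magnitude column names are assumptions based on the text, not confirmed):

    # Sketch: build the four color variables from the SDSS magnitudes.
    # "SDSS_quasar.csv" and the column names are assumed, not confirmed.
    import pandas as pd

    df = pd.read_csv("SDSS_quasar.csv")

    # Colors are differences of magnitudes in adjacent spectral bands.
    df["u-g"] = df["u_mag"] - df["g_mag"]
    df["g-r"] = df["g_mag"] - df["r_mag"]
    df["r-i"] = df["r_mag"] - df["i_mag"]
    df["i-zmag"] = df["i_mag"] - df["z_mag"]

    # The response is redshift z; the predictors are g_mag plus the colors.
    predictors = ["g_mag", "u-g", "g-r", "r-i", "i-zmag"]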
We first examine correlations among the variables.
Correlations: z, g_mag, u-g, g-r, r-i, i-zmag   (u = u_mag, g = g_mag, r = r_mag, i = i_mag)

               z   g_mag     u-g     g-r     r-i
g_mag      0.445
           0.000

u-g        0.591   0.343
           0.000   0.000

g-r        0.436   0.531   0.478
           0.000   0.000   0.000

r-i        0.234   0.372   0.096   0.343
           0.000   0.000   0.000   0.000

i-zmag    -0.015   0.127   0.112   0.130   0.085
           0.001   0.000   0.000   0.000   0.000

Cell Contents: Pearson correlation / P-Value
Covariances: z, g_mag, u-g, g-r, r-i, i-zmag

                 z        g_mag        u-g          g-r          r-i        i-zmag         SD
z         0.6579231                                                                  0.811125
g_mag     0.3188847   0.7810997                                                     0.883798
u-g       0.3640678   0.2302510   0.5758677                                         0.758859
g-r       0.1170554   0.1552934   0.1200554   0.1094255                             0.330795
r-i       0.0335933   0.0581651   0.0129094   0.0200688   0.0313701                 0.177116
i-zmag   -0.0021379   0.0191642   0.0145140   0.0073726   0.0025639   0.0292995     0.171171
The correlation between two variables equals their covariance divided by the product of their standard deviations, $r_{XY} = s_{XY}/(s_X s_Y)$. For example, the correlation between z and g_mag (or g) is obtained as follows:

0.3188847 / [(0.811125)(0.883798)] = 0.445.
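The same arithmetic as a minimal check in Python, using the values from the tables above:

    # Correlation = covariance / (product of standard deviations).
    cov_z_g = 0.3188847
    sd_z, sd_g = 0.811125, 0.883798

    r = cov_z_g / (sd_z * sd_g)
    print(round(r, 3))  # 0.445, matching the correlation table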
Regression Analysis: Multiple Linear Regression.

We now proceed to a regression analysis of z (the response) on the five predictors g_mag, u-g, g-r, r-i, and i-zmag.
Regression Analysis: z versus g_mag, u-g, g-r, r-i, i-zmag

The regression equation is
z = - 2.70 + 0.203 g_mag + 0.519 u-g + 0.172 g-r + 0.414 r-i - 0.543 i-zmag

Predictor       Coef     SE Coef        T      P    VIF
Constant    -2.69896    0.07400    -36.47  0.000
g_mag        0.203461   0.003922    51.88  0.000    1.5
u-g          0.519470   0.004294   120.96  0.000    1.3
g-r          0.17162    0.01110     15.46  0.000    1.7
r-i          0.41442    0.01757     23.59  0.000    1.2
i-zmag      -0.54282    0.01668    -32.55  0.000    1.0

S = 0.607280   R-Sq = 44.0%   R-Sq(adj) = 43.9%

Analysis of Variance
Source              DF       SS      MS        F      P
Regression           5  13423.2  2684.6  7279.59  0.000
Residual Error   46414  17117.0     0.4
Total            46419  30540.1
Source   DF  Seq SS
g_mag     1  6043.1
u-g       1  6664.7
g-r       1   137.6
r-i       1   187.0
i-zmag    1   390.7
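For readers working outside Minitab, here is a minimal sketch of the same fit with statsmodels (df is the data frame from the earlier sketch, so the column names are assumptions):

    # Sketch: the five-predictor fit via ordinary least squares.
    import statsmodels.api as sm

    X = sm.add_constant(df[["g_mag", "u-g", "g-r", "r-i", "i-zmag"]])
    fit = sm.OLS(df["z"], X).fit()

    print(fit.params)    # coefficients, constant first
    print(fit.rsquared)  # about 0.44, as in the output above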
Explanation of Output:

Total SS: $\mathrm{SSTO} = \sum_i (Y_i - \bar{Y})^2 = 30540.1$

SS(Residual Error): $\mathrm{SSE} = \sum_i (Y_i - \hat{Y}_i)^2 = 17117.0$

SS(Regression): $\mathrm{SSR} = \mathrm{SSTO} - \mathrm{SSE} = 13423.2$

$R^2 = \mathrm{SSR}/\mathrm{SSTO} = 13423.2/30540.1 = 0.44$ (or 44%)

$\mathrm{MSE} = \mathrm{SSE}/\mathrm{DF} = 17117.0/46414 = 0.368790 \approx 0.4$

$s = \sqrt{\mathrm{MSE}} = \mathrm{SD\ of\ the\ residuals} = 0.607280$

Seq SS is discussed below, after 'Unusual Observations'.
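These quantities are easy to recompute from the observed and fitted values; a minimal sketch:

    # Sketch: the ANOVA decomposition from y (observed) and yhat (fitted),
    # both numpy arrays; p is the number of estimated parameters (here 6).
    import numpy as np

    def anova_summary(y, yhat, p):
        n = len(y)
        ssto = np.sum((y - y.mean()) ** 2)  # total SS           30540.1
        sse = np.sum((y - yhat) ** 2)       # residual error SS  17117.0
        ssr = ssto - sse                    # regression SS      13423.2
        r2 = ssr / ssto                     # 0.44
        mse = sse / (n - p)                 # 0.368790
        s = np.sqrt(mse)                    # 0.607280
        return ssto, sse, ssr, r2, mse, s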
Unusual Observations. These are observations with large standardized residuals and/or high influence (high leverage). Standardized (studentized) residuals are defined by

$r_i = (Y_i - \hat{Y}_i)/\mathrm{s.e.}(e_i)$, where $\mathrm{s.e.}(e_i) = \sqrt{\mathrm{MSE}\,(1 - h_{ii})}$

and $h_{ii}$ is the ith diagonal element of the hat matrix $H = X(X'X)^{-1}X'$.

Altogether, there are 3163 'unusual' observations: 1821 have high leverage only, 391 have both high leverage and large standardized residuals, and 1001 have large standardized residuals only.

What is an observation with a large standardized residual? Minitab classifies an observation as such if its standardized residual is greater than 2 in absolute value.
An observation i is classified as having high leverage if $h_{ii} > 3p/n$, where p is the number of parameters in the model and n is the number of observations. In this example, p = 6 (5 predictors and the constant) and n = 46420, so an observation has high leverage if $h_{ii} > 18/46420 = 0.0003877$. The average value of $h_{ii}$ is p/n.
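A sketch of these diagnostics in Python (X is the n x p design matrix including the constant column, e the raw residuals, and mse as computed above):

    # Sketch: leverages and standardized residuals, with Minitab's rules.
    import numpy as np

    def flag_unusual(X, e, mse):
        n, p = X.shape
        # Diagonal of the hat matrix H = X (X'X)^(-1) X', computed
        # row by row to avoid forming the full n x n matrix.
        h = np.einsum("ij,ij->i", X @ np.linalg.inv(X.T @ X), X)
        st_resid = e / np.sqrt(mse * (1.0 - h))
        large_resid = np.abs(st_resid) > 2   # the "R" flag
        high_leverage = h > 3.0 * p / n      # the "X" flag
        return st_resid, large_resid, high_leverage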
Here are the first observations declared to have large standardized residuals (flagged R) and/or high leverage (flagged X):
Obs   g_mag        z      Fit   SE Fit   Residual   St Resid
 16    18.4  2.64020  1.42290  0.00730    1.21730    2.00R
 55    21.8  3.94060  4.14009  0.01533   -0.19949   -0.33 X
 61    20.5  3.64670  4.33044  0.01874   -0.68374   -1.13 X
 73    20.0  0.47620  1.71640  0.00540   -1.24020   -2.04R
 77    18.6  2.43290  1.20272  0.00451    1.23018    2.03R
 82    21.3  3.19920  3.47476  0.01321   -0.27556   -0.45 X
 98    20.9  3.68640  3.75297  0.01368   -0.06657   -0.11 X
109    19.8  0.60430  1.85453  0.00529   -1.25023   -2.06R
112    23.0  1.52000  3.09531  0.02118   -1.57531   -2.60RX
What do the residuals look like? Here are descriptive statistics for them:
Descriptive Statistics: RESI1

Variable      N   N*          Mean  SE Mean    StDev   Minimum        Q1    Median       Q3  Maximum
RESI1     46420    0  -1.33621E-14  0.00282  0.60725  -4.81673  -0.39589  -0.00494  0.43333  2.91842
Descriptive Statistics: SRES1

Variable      N   N*         Mean  SE Mean    StDev   Minimum        Q1    Median       Q3  Maximum
SRES1     46420    0  0.000000961  0.00464  1.00006  -7.94115  -0.65194  -0.00813  0.71362  4.82384
Graphical displays of residuals can be informative about the assumptions of the regression analysis. Here is a '4-in-1' plot, based on 3272 observations (the plots for the full sample are very similar); open the separate file to view it.
Comments:
1. The normal probability plot shows some departure of the residuals from a normal distribution, but not really bad considering the sample size.
2. The histogram looks reasonably normal.
3. The graph of fitted values vs. residuals has an apparent aberrant feature, indicating possible 'truncation' or 'censoring'.
4. The plot of residuals by order seems to indicate randomness, although there are so many observations that the resolution is very low.
Test for Normality of Errors. The Anderson-Darling (AD) test for normality of the residuals is significant at the .005 level. The graph indicates the presence of some exceedingly large standardized residuals. The test statistic essentially compares the empirical distribution of the ordered residuals with the fitted normal distribution.
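The analogous test is available in scipy; a sketch (resid is the vector of residuals):

    # Sketch: Anderson-Darling test of the residuals against normality.
    from scipy import stats

    result = stats.anderson(resid, dist="norm")
    print(result.statistic)        # the A-squared statistic
    print(result.critical_values)  # at the 15%, 10%, 5%, 2.5%, 1% levels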
Sequential SS (Seq SS):

The sequential sums of squares represent each variable's contribution to SS(Regression), adjusted for the variables entered before it in the regression analysis. In this example, the variables were entered in the order g_mag, u-g, g-r, r-i, i-zmag. Then SS(Regression on g_mag) = 6043.1:
Regression Analysis: z versus g_mag

The regression equation is
z = - 6.37 + 0.408 g_mag

Predictor       Coef     SE Coef        T      P
Constant    -6.37083    0.07380    -86.33  0.000
g_mag        0.408251   0.003815   107.01  0.000

S = 0.726464   R-Sq = 19.8%   R-Sq(adj) = 19.8%

Analysis of Variance
Source              DF       SS      MS         F      P
Regression           1   6043.1  6043.1  11450.62  0.000
Residual Error   46418  24497.1     0.5
Total            46419  30540.1
Next, we add u-g to the regression model, resulting in the following output:
Regression Analysis: z versus g_mag, u-g

The regression equation is
z = - 3.58 + 0.252 g_mag + 0.532 u-g

Predictor       Coef     SE Coef        T      P
Constant    -3.58350    0.06642    -53.95  0.000
g_mag        0.251537   0.003466    72.58  0.000
u-g          0.531635   0.004036   131.71  0.000

S = 0.619820   R-Sq = 41.6%   R-Sq(adj) = 41.6%

Analysis of Variance
Source              DF       SS      MS         F      P
Regression           2  12707.8  6353.9  16538.94  0.000
Residual Error   46417  17832.4     0.4
Total            46419  30540.1

Source   DF  Seq SS
g_mag     1  6043.1
u-g       1  6664.7
The increase in SS(Regression) due to adding u-g to the model containing g_mag is

SS[Regression on u-g | adjusted for g_mag] = 12707.8 - 6043.1 = 6664.7.

The remaining sequential sums of squares are calculated similarly.
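Equivalently, the sequential sums of squares can be obtained by fitting the nested models in entry order; a sketch (again using the assumed df):

    # Sketch: Seq SS as the increase in regression SS when each variable
    # is added, in order, to the model containing its predecessors.
    import statsmodels.api as sm

    order = ["g_mag", "u-g", "g-r", "r-i", "i-zmag"]
    prev_ssr = 0.0
    for k in range(1, len(order) + 1):
        X = sm.add_constant(df[order[:k]])
        fit = sm.OLS(df["z"], X).fit()
        print(order[k - 1], fit.ess - prev_ssr)  # Seq SS for this variable
        prev_ssr = fit.ess                       # ess = regression (model) SS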
If one wishes to look at the relative contributions of the predictors in terms of reducing the sum of squares for error (or, equivalently, increasing their contribution to SS(Regression)), one can perform a stepwise regression (or some other model selection procedure).
Stepwise Regression: z versus g_mag, u-g, g-r, r-i, i-zmag

Alpha-to-Enter: 0.15   Alpha-to-Remove: 0.15

Response is z on 5 predictors, with N = 46420

Step              1        2        3        4        5
Constant      1.231   -3.584   -3.732   -3.109   -2.699

u-g          0.6322   0.5316   0.5406   0.5448   0.5195
T-Value      158.04   131.71   134.89   136.95   120.96
P-Value       0.000    0.000    0.000    0.000    0.000

g_mag                 0.2515   0.2615   0.2255   0.2035
T-Value                72.58    75.85    61.59    51.88
P-Value                0.000    0.000    0.000    0.000

i-zmag                         -0.512   -0.532   -0.543
T-Value                        -30.42   -31.83   -32.55
P-Value                         0.000    0.000    0.000

r-i                                      0.472    0.414
T-Value                                  27.42    23.59
P-Value                                  0.000    0.000

g-r                                               0.172
T-Value                                           15.46
P-Value                                           0.000

S             0.654    0.620    0.614    0.609    0.607
R-Sq          34.98    41.61    42.75    43.66    43.95
R-Sq(adj)     34.98    41.61    42.75    43.66    43.95
Mallows C-p  7425.2   1939.8    996.6    243.1      6.0
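Minitab's stepwise procedure is not built into the common Python libraries, but the forward half of it (the Alpha-to-Enter rule) is easy to sketch; the removal step is omitted for brevity:

    # Sketch: forward selection with an alpha-to-enter rule of 0.15.
    import statsmodels.api as sm

    def forward_select(df, response, candidates, alpha_enter=0.15):
        selected, remaining = [], list(candidates)
        while remaining:
            # p-value of each candidate when added to the current model
            pvals = {}
            for var in remaining:
                X = sm.add_constant(df[selected + [var]])
                pvals[var] = sm.OLS(df[response], X).fit().pvalues[var]
            best = min(pvals, key=pvals.get)
            if pvals[best] > alpha_enter:
                break
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. forward_select(df, "z", ["g_mag", "u-g", "g-r", "r-i", "i-zmag"])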
I originally regressed M_i on z, g_mag, and i_mag and got an R-square of about 86%. Regressing M_i on sqrt(z), g_mag, and i_mag raises the R-square to 96%, and the regression of M_i on ln z, g_mag, and i_mag gives (of course) R-square = 99.9%. “But it will be dominated by the uninteresting M_i-z correlation; in this bivariate plot you will see a sharp parabolic envelope that represents the uninteresting detection limit of the survey. That is, there are no points with high M_i (i.e. negative values closer to zero) and high z (i.e. great distance) because we simply can't see faint & distant quasars.”