Sample R 2 May Be Excluded From the Confidence Interval for Population R 2
Conf-Interval-R2-Regr.sas may produce a confidence interval which excludes the sample R 2 . For example, I used it with F = 3.0928, df num
= 153, and df den
= 9753. [Why
I was playing with such outrageously high df num
I do not recall.] Conf-Interval-R2-
Regr.sas reported R 2 = .046), with a CI that ran from .024 to .039. William Sears
[wsears@uoguelph.ca] explained this to me (below). With great df num
the bias in the estimation of
ρ 2 can push the sample R 2 outside of the confidence interval. If, however, you employ a less biased estimate (shrunken R 2 ), this should not happen. on R 2
I employ Conf-Inerval-R2.sas
and CI-R2-SPSS to construct confidence intervals
when the predictors are fixed (regression model) and Steiger and Foul adi’s R2 when the predictors are random (correlation model). In the tables below I compare the output of these procedures.
R 2 = .046, F (153, 9753) = 3.0928, N = 9907, shrunken R 2 = .0313
Method
95% Confidence Interval
Lower Upper
Conf-Interval-R2.sas .0242
CI-R2-SPSS .0126
.0392
.0132
Stieger & Fouladi R2 N cannot exceed 5,000
Sears gives .0242 .0393
1) You are using the non-central F (ncf) to approx. the intervals; the intervals so constructed are not, in fact, exact. While it is true that the R 2 can be transformed to a
(central) F under the null, the same is not true under alternatives where the true rhosquared (rho2) is > 0. Try this: for df num
= 3, df den
= 111 and R 2 = .3, the exact 95% CI should be 0.1493887, 0.4283222 (your program gives 0.15203, 0.40890). Now try df num
= 4, df den
= 12 and R 2 = 0.6; the exact 95% CI should be 0.022857, 0.784126 while your program gives limits of .0217, 0.714.
R 2 = .3, F (3, 111) = 15.857, N = 115
Method
95% Confidence Interval
Lower
Conf-Interval-R2.sas .1520
CI-R2-SPSS .1521
Upper
.4089
.4089
Stieger & Fouladi R2
Sears gives
.1494
.1494
.4283
.4283
R 2 = .6, F (4, 12) = 4.5, N = 17
95% Confidence Interval
Method Lower
Conf-Interval-R2.sas .0217
CI-R2-SPSS .0218
Stieger & Fouladi R2 .0229
Sears gives .0229
Upper
.7141
.7141
.7833
.7841
2) You are upset that the CI runs from 0.024236 to 0.039231 when the estimate is
0.046273. However, you are not taking into account the fact that the Maximum
Likelihood Estimate (MLE) of rho2 is biased; it becomes increasingly biased the larger ndf becomes compared to ddf. In fact, for the degrees of freedom you give (153, 9753), the bias is quite large and an unbiased estimate of rho2 is 0.03132. The CI given is, in fact, correct (using the correct inversion of the R 2 distribution gives a 95% CI of
0.024207, 0.039304). An approx. correction for bias is: 1 - ( df num
+ df den
)/ df den
*(1-R2); the approx. is fairly good, generally improving as n increases. [I refer to this statistic by the term J. Cohen used, “shrunken R 2 .”] Using R 2 = 0.0462731488, the approx. unbiased estimate of rho2 is 0.0313116 which compares favorably with the
UMVUEstimate 0.0313177. For another instance, the expected value of R 2 , if the true rho2 is 0, is given by df num
/( df num
+ df den
) exactly. So, if the true rho 2 is 0, for df num
=
153 and df den
= 9753, the expected value of R 2 is 0.0154452. A 95% CI using this R 2 gives limits 0, 0.003343 (the upper limit uses an alpha of 0.05, since we know the lower limit is 0).
The method of inverting the true distribution for R 2 is tedious, to say the least! The version I use involves the hypergeometric functions of order 3 [denoted: F(a,b;c;x) ] in addition to numerical integrations.
I wrote a Fortran program to find p values and then to numerically search for the confidence limits. A (usually) quite good approx. may be found in: Moschopoulos, PG
& Mudholker, GS., 1983 Commun. Statist. (Simul. & Computa.) 12: 355-371 (the approximation is a bit tedious, but could be implemented in SAS with some work). An exact routine may be found in: Ding, CG. 1991. Applied Statistics 40: 195-236 (it would be quite a bit harder to implement this in SAS, but it could be done).
On further thinking about this problem, I should mention that the exact methods applied to the multi-variate normal correlation problem; thus, CIs are on rho 2 when all explanatory variables are normal. However, in general ANOVA, some (or all) explanatory variables may be categorical, and yet you might want a CI on the model rho2 (estimated biasedly by R2). In this case, I wonder if your non-central F solution may have some merit after all (but I don't know)--this might be worthy of a paper!
My statements concerning bias still hold, however. I should mention that the bias correction formula is the one used by SAS in giving what it labels in Proc REG an
'adjusted R 2 and it can give negative values; the formula is given in many regression text books. For instance, suppose df num
=25, df den
and R 2 = 0.1 then the approx.
adjusted R 2 is -0.35 (which would be set to 0); for these degrees of freedom, any R 2 less than 0.333 will give a negative result.
An Approximation Procedure
I found an approximation procedure described http://rss.acs.unt.edu/Rdoc/library/psychometric/html/CI.Rsq.html
. Conf-Interval_R2-
LargeN.sas uses this approximation procedure and is recommended when N is very large. For the R 2 = .046, F (153, 9753) = 3.0928, N = 9907, shrunken R 2 = .0313, it returned a CI of .038, .054.
Return to Wuensch’s SAS Programs Page