Conf-Interval-R2-Regr

advertisement

Sample R 2 May Be Excluded From the Confidence Interval for Population R 2

Conf-Interval-R2-Regr.sas may produce a confidence interval which excludes the sample R 2 . For example, I used it with F = 3.0928, df num

= 153, and df den

= 9753. [Why

I was playing with such outrageously high df num

I do not recall.] Conf-Interval-R2-

Regr.sas reported R 2 = .046), with a CI that ran from .024 to .039. William Sears

[wsears@uoguelph.ca] explained this to me (below). With great df num

the bias in the estimation of

ρ 2 can push the sample R 2 outside of the confidence interval. If, however, you employ a less biased estimate (shrunken R 2 ), this should not happen. on R 2

I employ Conf-Inerval-R2.sas

and CI-R2-SPSS to construct confidence intervals

when the predictors are fixed (regression model) and Steiger and Foul adi’s R2 when the predictors are random (correlation model). In the tables below I compare the output of these procedures.

R 2 = .046, F (153, 9753) = 3.0928, N = 9907, shrunken R 2 = .0313

Method

95% Confidence Interval

Lower Upper

Conf-Interval-R2.sas .0242

CI-R2-SPSS .0126

.0392

.0132

Stieger & Fouladi R2 N cannot exceed 5,000

Sears gives .0242 .0393

1) You are using the non-central F (ncf) to approx. the intervals; the intervals so constructed are not, in fact, exact. While it is true that the R 2 can be transformed to a

(central) F under the null, the same is not true under alternatives where the true rhosquared (rho2) is > 0. Try this: for df num

= 3, df den

= 111 and R 2 = .3, the exact 95% CI should be 0.1493887, 0.4283222 (your program gives 0.15203, 0.40890). Now try df num

= 4, df den

= 12 and R 2 = 0.6; the exact 95% CI should be 0.022857, 0.784126 while your program gives limits of .0217, 0.714.

R 2 = .3, F (3, 111) = 15.857, N = 115

Method

95% Confidence Interval

Lower

Conf-Interval-R2.sas .1520

CI-R2-SPSS .1521

Upper

.4089

.4089

Stieger & Fouladi R2

Sears gives

.1494

.1494

.4283

.4283

R 2 = .6, F (4, 12) = 4.5, N = 17

95% Confidence Interval

Method Lower

Conf-Interval-R2.sas .0217

CI-R2-SPSS .0218

Stieger & Fouladi R2 .0229

Sears gives .0229

Upper

.7141

.7141

.7833

.7841

2) You are upset that the CI runs from 0.024236 to 0.039231 when the estimate is

0.046273. However, you are not taking into account the fact that the Maximum

Likelihood Estimate (MLE) of rho2 is biased; it becomes increasingly biased the larger ndf becomes compared to ddf. In fact, for the degrees of freedom you give (153, 9753), the bias is quite large and an unbiased estimate of rho2 is 0.03132. The CI given is, in fact, correct (using the correct inversion of the R 2 distribution gives a 95% CI of

0.024207, 0.039304). An approx. correction for bias is: 1 - ( df num

+ df den

)/ df den

*(1-R2); the approx. is fairly good, generally improving as n increases. [I refer to this statistic by the term J. Cohen used, “shrunken R 2 .”] Using R 2 = 0.0462731488, the approx. unbiased estimate of rho2 is 0.0313116 which compares favorably with the

UMVUEstimate 0.0313177. For another instance, the expected value of R 2 , if the true rho2 is 0, is given by df num

/( df num

+ df den

) exactly. So, if the true rho 2 is 0, for df num

=

153 and df den

= 9753, the expected value of R 2 is 0.0154452. A 95% CI using this R 2 gives limits 0, 0.003343 (the upper limit uses an alpha of 0.05, since we know the lower limit is 0).

The method of inverting the true distribution for R 2 is tedious, to say the least! The version I use involves the hypergeometric functions of order 3 [denoted: F(a,b;c;x) ] in addition to numerical integrations.

I wrote a Fortran program to find p values and then to numerically search for the confidence limits. A (usually) quite good approx. may be found in: Moschopoulos, PG

& Mudholker, GS., 1983 Commun. Statist. (Simul. & Computa.) 12: 355-371 (the approximation is a bit tedious, but could be implemented in SAS with some work). An exact routine may be found in: Ding, CG. 1991. Applied Statistics 40: 195-236 (it would be quite a bit harder to implement this in SAS, but it could be done).

On further thinking about this problem, I should mention that the exact methods applied to the multi-variate normal correlation problem; thus, CIs are on rho 2 when all explanatory variables are normal. However, in general ANOVA, some (or all) explanatory variables may be categorical, and yet you might want a CI on the model rho2 (estimated biasedly by R2). In this case, I wonder if your non-central F solution may have some merit after all (but I don't know)--this might be worthy of a paper!

My statements concerning bias still hold, however. I should mention that the bias correction formula is the one used by SAS in giving what it labels in Proc REG an

'adjusted R 2 and it can give negative values; the formula is given in many regression text books. For instance, suppose df num

=25, df den

and R 2 = 0.1 then the approx.

adjusted R 2 is -0.35 (which would be set to 0); for these degrees of freedom, any R 2 less than 0.333 will give a negative result.

An Approximation Procedure

I found an approximation procedure described http://rss.acs.unt.edu/Rdoc/library/psychometric/html/CI.Rsq.html

. Conf-Interval_R2-

LargeN.sas uses this approximation procedure and is recommended when N is very large. For the R 2 = .046, F (153, 9753) = 3.0928, N = 9907, shrunken R 2 = .0313, it returned a CI of .038, .054.

Return to Wuensch’s SAS Programs Page

Download