LIBRARY OF THE MASSACHUSETTS INSTITUTE OF TECHNOLOGY coey I Evaluation of an Ad Hoc Procedure for Estimating Parameters of Some Linear Models A. Ando and G.M. Kaufmati MASS. ir^CT. Tc OCT 30 1' 94-64 DEWEY MASSACHUSE 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETl LIDRA: COPY I Evaluation of an Ad Hoc Procedure for Estimating Parameters of Some Linear Models A. Ando and G.M. Kaufman. MASS. INST. TcCH. 94-64 j OCT 30 ^9P4 Authors are Associate Professor of Economics and Finance, University of Pennsylvania and Assistant Professor of Industrial Management, Massachusetts Institute of Technology. The contribution of Ando to this paper was supported by a grant from the National Science Foundation. The authors acknowledge the contribution of Mr. Bridger Mitchell of Massachusetts Institute of Technology, who prepared the computer program for calculations reported in this paper. Y\A NOV ,, , I. 9 1965 LlBRAKlE-S Economists and other users of statistical methodology often posit a probabilistic model of some real world phenomenon which has more unknown parameters than there are sample observations. In such cases it is usually impossible to jointly estimate all parameters from the sample data. Even in those instances where there are well established estimation procedures when the number of sample observations n is larger than the number of parameters r, these methods are generally inadequate when n < r^ as the necessary calculations cannot be carried out. Furthermore, when n is only "slightly larger" than r, such estimates often prove to be unreliable in more than one sense. One particular example of such a problem that frequently occurs in analysis of psychological and of economic data is this: the researcher posits a linear regression model as defined in (2) below with r-1 independent variables but only n < r observations on the dependent variable. Since the standard least squares procedure cannot be applied, he may ask the following seemingly reasonable question: "What subset of the r-1 independent variables should for inclusion in a "new" model to which I I select can apply the standard least squares procedure?" Our purpose here is to demonstrate that one frequently used ad hoc method for determining such a subset by ordering simple sample correlation coefficients can be highly misleading. Any procedure which uses to both determine the structure of the model a given set of sample data to be tested and to estimate param- eters of this model is intuitively unsettling. Here we present tables which quantitatively demonstrate how dangerous such ad hoc methods can be. Ad Hoc Use of Simple Correlation Coefficients to Determine Model Structure Consider an r-dimensional Independent Multinormal process defined as one that generates independent r ~(1) x random vectors x 1 ,x (J) with identical densities 4'^(2ikh) = (2 ^)-ir e"^(3S-li)''^(2i-y.) |h|2 (1) h is PDS We wish to estimate parameters of the conditional distribution of x^ ,...,x^ ' when neither ^ nor vector sample observations x "^ h is ^x ~(k)' xj^ given known with certainty from a set of n ,...^x^" . Alternatively, we may consider an r-dimensional Normal Regression process defined as a process generating independent scalar random variables according to the model ;p>, 'i the X. s + iiz A^^^i. ^^^""^ (2) are known numbers which in general may vary from one observation to the next,' and the e.s are independent random variables with identical normal J fj^(e|0, h) = (2n)" Suppose now that we have n observations, and let (1) ^2 (3) . be a matrix of observations of "independent" variables together with an additional column of I's. If the rank of X X is singular the following method, or a slight variation of it, is often employed to make standard least squares "work": 1. Using the n sample observations, compute r-1 sample statistics p., i=2,...,r, where p. coefficient between x, is the simple sample correlation and x. > > 2. Relabel variables X2,...,x 3. Choose an integer r* < rank (X X). 4. Restructure the model, eliminating from consideration so that relabelled variables x .,.,...,' x r*+l' . jp-] [po] ... > |p [• (Henceforth we call the r initial model, "Model A" and the restructured model, "Model B".) 5. Assuming that Model B is an adequate approximation to Model A use X X to estimate parameters of the conditional distribution of X, given the values of relabelled variables X2,<...^x^^, 3« Simulation Study Model B is a strongly biased approximation to Model A, and the magnitude of the bias is a function of the joint distribution of the r*-l largest simple sample correlation coefficients. Analytical expressions for this distribution and for the distribution of the sample multiple correlation coefficient in Model B are in general extremely complex and unwieldy. Hence we have resorted - to simulation! 4 - in order to numerically determine the salient features of Model B. The steps in this simulation were: 1. We assumed that: (a) the data generating process is a Model A as defined in (2)j (b) the dependent variables x!" , > . . yX , y*'-''^ and the independent variables j=1^2^..,,n, are mutually independent random variables identically distributed uniformly on [0^ 1], 2. For each (n, r) pair, n=10(2)20, 30, 40, 50, and r=10(5)50, 50(10)100, 150, 200 we generated 50 sets of n observations each. 3« For each set of n observations we used the ad hoc procedure out- lined in section 2 to calculate the five largest simple sample correlation coefficients and Model B multiple correlation coefficients for Model B's with m=2, 3, 4, and 5 independent variables. 4. Regarding each of the five sets of 50 simple sample correlation coefficients as a set of sample observations we then calculated the sample meanjp.| and sample variance V(|p.|) of the absolute value of each of the. five largest simple sample correlation coefficients. results are tabulated in Tables 1 and 2. 'Ames and Reiter [1] performed a somewhat similar experiment in a different context: they drew 100 economic time series at random from the Historical Statistics of the United States and then computed the sampling distributions of correlation and autocorrelation coefficients of the series drawn, finding that "correlations and lagged cross-correlations are quite high for all classes of data. E.g., given a randomly selected series, it is possible to find, by random drawing, another series which explains at least 50 per cent of the variances of the first one, in from 2 to 6 random trials, depending on the class of data involved." [1], p. 637. The | 5. We follow a similar procedure with the sample multiple correlation coefficients. In Table 3 the mean R of the square of 50 sample multiple correlation coefficients for model B's with a constant term plus m=2, 3, k, and 5 independent variables are tabulated. The tables given us a reasonably accurate ideal of the order of magnitude of Model B sample correlation coefficients and of the five largest simple sample correlation coefficients whatever the distribution of the independent variables-so long as they are mutually independent and have finite mean and variance. Tl Tables 2 and 4 of the sample variances of the |p | and of sample multiple . correlation coefficients show that variations in Tables to sampling error are practically negligible; of Ip-,1 for n= 10 and r= 10 is 1 and 3 entries due e.g. the coefficient of variation less than .001 as calculated from Tables 1 and 2. Evaluation Examination of Tables 1 procedure outlined in section shown in Tables 1 and 3 illustrate clearly that not only is the 2 biased, but that bias of an order of magnitude and 3 entries will occur with extremely high probability. For 'In order to assess the effect of assuming that the data generating process generates values of mutually independent but uniformly distributed random variables, we duplicated the experiment for n= and r= 10 under a different assumption: each sample observation was generated by summing 10 values of mutually independent, identically uniformly distributed random variables. This spot check using an approximately normal data generating process revealed no differences to the third significant digit between values generated under this assumption and values tabled in Tables 1, 2, 3, and 4. _1_L _ 2 ''One interesting feature of Table 1 is that for given r and i, |p. decreases almost exactly proportional to 1/n. This is most pronounced for i=2. - example^ for n=10^ r=10^ n= f ^= } \po\- population values of |p.| |p^2l = "61^ 6 - with sample variance of only „0172 and for with sample variance of only even when the i=2,...,r are controlled to be 01 One effective way of dealing with the problem is to reformulate it in Bayesian terms so that we may systematically incorporate information the researcher possesses from sources other than the sample into the analysis. In [2] we indicate how Bayesian estimation of the mean vector and variance-covariance matrix of a multinormal process may be done even when the dimensionality of the process is r and there are only n < r sample observations available. And in [3] we show how Bayesian estimation can be done when the data generating process is one frequently occurring in econometric analysis--a set of simultaneous linear equations with stochastic components--and there are less vector observations than parameters. [1] Ames^E. and Reiter S.;, "Distributions of Correlation Coefficients in Economic Time Series," Journal of the American Statistical Association , 56 (1961), pp. 637-656, [2] Ando, Albert and Kaufman, Gordon M. ,"Bayesian Analysis of the Independent Multinormal Process-Neither Mean nor Precision Known-Part I,' Massachusetts Institute of Technology, Alfred P« Sloan School of Management, Working Paper No, 41-63. [3] Ando, Albert and Kaufman, Gordon M. ,'^ Bayesian Analysis of the Reduced Form System, "Massachusetts Institute of Technology, Alfred P, Sloan School of Management, Working Paper No. 79-64, rooaovOiA O-grOtOoO lA-tONt-o OOvONJtn TOO»r«-cy.in t<\ ^ r^ vOMooor* >tT>u^oao in-vjooh* -*-*C*%iAO .n-iosr«-vo -«-<ooo iT, cr> -«-tr-4oo ^^-ijyiaor'- o '-« (\l -< O vO r^ --H t>a .->< 00 -*• N o ^ -* u> vO r* -* -• c^ -H -^ '-' -t i-t ^ ^ i-M (MiAOO'^O r~t ^ o f^ o> (*> .-«vOOr>--4 r»fN4<7,,ocn f>4^^^^ r^'Nr»inr<> -< f^ -r -a- <N ^• r» >^ rvj r>j rg ~« o ^^ in vO rg c^ '>i 00 in ro rvj o i>J (*\ ^^^ O rvjrsj^^^ O <«> (O (*\ 4 >0 vO O^ r* CT> fO oo in rg eg ,-, ^ t«N in c^ r^ ro rg <\i fsj fn (V4 -^ o o* ^ O -H <o m ^ r<^ r'> cn -^ O* >0 ,-, -^ (J» (*^ ^ (Nj • o o o O om oo 00 -4 (*> •*««« o» rg vO in —< 00 \0 "NJ rg ^H -H ^ f«% CO vO ^ o r> rg rg -H nj 4- ON 4- in <0 r» in r-t •* fO CO rg (M o o ^ eg n <j- m \o O 4 o M -H --I -1 oooo^f^in r-'HoovO^ 'Mrg,^^^,^ fOrgpg.-<,i^ r*-vOor>»in f^vOOr-in ng,n-<(><0 AAAAA Oinrg<>i*. roojrg-H-* A.-.^. m r~ r* oo in r» vO >j <o r\4 ^^ f-t in 00 oo in >0 >J -» rg ro in o» in og (y> -g — ,'^J ,-Nj f* ^J' rg <o in in rg o» ro rg rg rg .^ o (t% <*\ ^ o m o «^ >o Qo C^ 04 (N rg ..H o m o o> r- ,» CO in oo rg CO >^ -4 f> rg rg rg in CO f\j 00 o* '-« CO 0* in rg -» CO rg rg rg r- o> in o» rg 00 in -4 -» CO rg f\i rg QO in -# CO On On CO On >0 CO CO (O CNJ rg pg CO r- o» in -« CO 00 CO On \^ >*• CO CO rg rg CO >» fO «! CO 0^ •» (O CO in f>j ^» rg in r«. rg <0 CO in ^^ «o <o (O <^ On On rg r» nO oo rg 00 nO in ro >0 00 >0 CM CM 00 >o in nO in .^ o o •#• vO CO in CO CO eg pg oo in -4 00 oo in in r«- ^H lO «o »n <o o fvi M'-«-|^-4 (*> O nj •^ nO «o rCO (O <o >o in lo <o • o in m M rg in -» .<• vO in rg On <n rg rg f\j -< in r» oo rH 00 r- r-t CO (O «0 <-• vO /> » -gp^™«r--ao m «o r^ «o lA eg <o r* in >* <n «o rg o^ r>j >J >OoaoinfO -« 00 in rin oo vO rn rg rg ,^ .^ O J> >J f^ CO rg >i -* -« _» O O 30 >J >o ^ ^ r^ ^ ^ •^ro^o.O >^r'it/\m.fl "^^JOr^in On »n fN" <o -H ON in in •* «o CM rg f*- ~t ^ ^ ^ -« Oao-«oco aofV(>iA(*> rgoj,-^rHr^ in <*% in 0» 0« rg >0 -« <«% lo rg rg ^ ^ fOf\jrg-<-( -^ -*• >J- C> 00 «o 00 c*\ (^ .o CO rg rg -^ r» -* ^ '-•uiu^r^>-i I **••• (*> r- f^ 00 >0 >» >o r~ ,» ro rg rg -• ^ r-i c<^rgrg-Hr-» .v-<'<~v— O rS vO CO CO O»r»\0«>O-* f^ o St -* x» r— f>4 -< r- rgrg,-<-«^ f^ CO rH (Jv rg in r>. ro ro iM rg r-< -.^ o -«-«-*oo r^ rO xO (*^ CO J^ ro <—* -» r^ -» rH r» >» rg Ai -• -• -I -St t>i-<-^-<-i ^ :>j inOOrOO '^ -t est -O -t ~i r\ r-f -O -* r>4 r>t -* -* -^ .'n O NJ r^ 4^ "M -H >i -< -H -< r-< c'N («>a04oac> '>J-»-<-«0 -*ooin(NO oo r*% ^ o -J-Lnvooo >0(n-»oo -* M ~* ^ a r- -o vo rro —« 0» -* --i r-i (M ~i O x>Nrr»ao«o ^o«^-*c^3^ -i-«-»oo o o * ^ "-^ .» r~ in o» -H r- <o ro rg oj rg o O rin o ^ (*> 4 ^ r»- o O O4 00 >o >t r~ CM 00 in >J- <o <o CM rg m * o m «o ^ § oooo oooo '^oooo ^ >-* a> -f cr\ ^ •t ^ OrHOoft oooo oooo OOOO oooo oooo ^i/>vOf^ ^vTVin^O -f a00j-<0 -^^r<%C> oooo oooo O oo oo o o-« -H rg 0» r- ao (niNvAvO o» -• oo . O -* _, - iS\ O \0 yO fO -« iM -• r~ a> -« tn^r-r* \f\ t^ r»CT»Of>4 (NJO^^O (M OO O OO O o OrH O -H N -^ -H r-* o o '-* OO-HrM O O O -< O OO -H CO -I -« OO—'-^ 0'^<-H'-« rHO'tT* ODO'HCM x\Oir»vO -4-«0(M O^^^ a>'-«(<>-* ^H^^OQ0 Op^,-|rH OOOfO^ in'-^f*-r* rgo^ooo >-> oooorxj-^ f^ c^^^<^-* rginr-flo Os'HfO^ O^fHrH -t o\ ~i oo-«-« ^ 00 in o-r ON CO ^AirOO ^,-4^^ 00000>0 p^OiAr* ao r^ vO'-^'^'^ >-«inr»o» <f\ ,H^>-HrH '^rou>r^ ofo-^iA ^^^^ o^^^ ^^^^ >^f>inf^ ^ ^ ^ _, lAifxo^iA {*>c<>fO<-^ -r-f^oj r^-H^rg ,-«^^rg *r»o»>-i »r>oO'^-* o^oiNcA 0-i-«-« o>«OflO'^ r-* r^ • CO -t 'C ^ ^^ • • • ^ O -t r^ ,t r* (> -H P-l rH rH fVJ '-4 -t <j\ ^,Hp-t.H cnoooo <MO<*^<o or-rg-H ^r^rgrg ^.-(pjjvj ^r»ong ^flOr^tn ^^^^ -H-^rjrg (O^Of>4 -j-rg^r* .^rgogrg ^r^r\jrg ooroin iao^^nj-^ ^1 '^ i 1 1 2 no. 95-64 no. 96-64 PAMrcTTirn CANCELLED .f'SmQ''! ""Date Due ir-bH 3 laao D03 TDD 2bD 3 TOfiD 3 TDfiD DD3 flbT 21fl 1751801 DD3 TDD ED3 ''Ran mmiiimi 0D3 TOO 511 - iiiifilii TDfiD 0( TOfiD 3f'llf[..., 3 ^DflO 3 TOAD 003 DD3 TDD 4?! flhT 4S7 ""''' liiiiiiiiiyjijniiiiiii 3 TOfiO 003 fibT SD7 <1i'bH <\^-bH 3 TOflO 003 TOO Mfl4 3 TOfiO 003 4b5 3 TOflO 003 TOO 4bfl 3 TOflO 003 4fll 9v^6W fibT q~l^&>H flbT f 7 '^^l 3 TOflO 003 flbT 531 ^ '\%