Evaluation of an Ad Hoc Procedure for Estimating
Parameters of Some Linear Models
A. Ando and G. M. Kaufman
94-64
Authors are Associate Professor of Economics and
Finance, University of Pennsylvania and Assistant
Professor of Industrial Management, Massachusetts
Institute of Technology. The contribution of Ando
to this paper was supported by a grant from the
National Science Foundation.
The authors acknowledge the contribution of Mr. Bridger Mitchell of
Massachusetts Institute of Technology, who prepared
the computer program for calculations reported in
this paper.
Economists and other users of statistical methodology often posit a
probabilistic model of some real world phenomenon which has more unknown parameters than there are sample observations.
In such cases it is usually impossible
to jointly estimate all parameters from the sample data.
Even in those instances
where there are well established estimation procedures when the number of sample
observations n is larger than the number of parameters r, these methods are
generally inadequate when n < r, as the necessary calculations cannot be carried
out.
Furthermore, when n is only "slightly larger" than r, such estimates often
prove to be unreliable in more than one sense.
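A minimal sketch of why the calculations cannot be carried out when n < r, assuming the usual least squares setup in which an n-by-r data matrix X (here with a leading column of 1's) enters through the r-by-r matrix $X^t X$. The helper names `rank` and `gram` and the small data matrix are hypothetical, chosen only for illustration. Since the rank of $X^t X$ cannot exceed n, it is singular whenever n < r.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    rk = 0
    n_cols = len(rows[0]) if rows else 0
    for col in range(n_cols):
        # Look for a nonzero pivot in this column at or below row rk.
        pivot = next((i for i in range(rk, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[rk], rows[pivot] = rows[pivot], rows[rk]
        for i in range(rk + 1, len(rows)):
            factor = rows[i][col] / rows[rk][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[rk])]
        rk += 1
    return rk

def gram(X):
    """Return X^t X for a matrix X given as a list of rows."""
    cols = list(zip(*X))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

# Illustrative data: n = 3 observations, r = 4 columns (constant plus 3 regressors).
X = [[1, 2, 5, 1],
     [1, 3, 1, 4],
     [1, 7, 2, 2]]
XtX = gram(X)
print(rank(X), rank(XtX), len(XtX))  # rank 3 for both, but X^t X is 4 x 4
```

Because the 4-by-4 matrix $X^t X$ has rank 3, it has no inverse and the normal equations have no unique solution.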
One particular example of such a problem that frequently occurs in analysis
of psychological and of economic data is this:
the researcher posits a linear
regression model as defined in (2) below with r-1 independent variables but
only n < r observations on the dependent variable.
Since the standard least
squares procedure cannot be applied, he may ask the following seemingly reasonable question:
"What subset of the r-1 independent variables should I select
for inclusion in a "new" model to which I can apply the standard least squares
procedure?"
Our purpose here is to demonstrate that one frequently used ad hoc method
for determining such a subset by ordering simple sample correlation coefficients
can be highly misleading.
Any procedure which uses a given set of sample data
both to determine the structure of the model to be tested and to estimate
parameters of this model is intuitively unsettling.
Here we present tables which
quantitatively demonstrate how dangerous such ad hoc methods can be.
2. Ad Hoc Use of Simple Correlation Coefficients to Determine Model Structure
Consider an r-dimensional Independent Multinormal process defined as one
that generates independent $r \times 1$ random vectors $\tilde{x}^{(1)}, \ldots, \tilde{x}^{(j)}, \ldots$ with identical
densities
$$f_N^{(r)}(x \mid \mu, h) = (2\pi)^{-\frac{1}{2} r} \, e^{-\frac{1}{2}(x-\mu)^t h (x-\mu)} \, |h|^{\frac{1}{2}}, \qquad h \text{ is PDS}. \tag{1}$$
We wish to estimate parameters of the conditional distribution of
$\tilde{x}_1^{(k)}$ given $x_2^{(k)}, \ldots, x_r^{(k)}$ when neither $\mu$ nor $h$ is
known with certainty from a set of n vector sample observations
$x^{(1)}, \ldots, x^{(n)}$.
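As a worked illustration of the conditional distribution referred to here, the following sketch uses the standard partitioned-precision result for a multinormal density. The partition symbols $x_{(2)}$, $\mu_{(2)}$, $h_{11}$, $h_{12}$, $h_{22}$ are notation assumed for this illustration and do not appear in the paper.

$$
x = \begin{bmatrix} x_1 \\ x_{(2)} \end{bmatrix}, \qquad
\mu = \begin{bmatrix} \mu_1 \\ \mu_{(2)} \end{bmatrix}, \qquad
h = \begin{bmatrix} h_{11} & h_{12} \\ h_{12}^t & h_{22} \end{bmatrix},
$$

$$
(x-\mu)^t h (x-\mu) = h_{11}\Big[(x_1-\mu_1) + h_{11}^{-1} h_{12}\,(x_{(2)}-\mu_{(2)})\Big]^2 + \text{terms not involving } x_1 .
$$

Under these assumptions, completing the square shows that $x_1$ given $x_{(2)}$ is normal with mean $\mu_1 - h_{11}^{-1} h_{12}\,(x_{(2)}-\mu_{(2)})$ and precision $h_{11}$. The conditional mean is therefore linear in $x_2, \ldots, x_r$, which is the sense in which a linear regression formulation is an alternative description of the same estimation problem.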
Alternatively, we may consider
an r-dimensional Normal Regression process defined as a process generating
independent scalar random variables according to the model
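$$\tilde{x}_1^{(j)} = \beta_1 + \sum_{i=2}^{r} \beta_i x_i^{(j)} + \tilde{\epsilon}_j , \tag{2}$$
where the $x_i^{(j)}$s are known numbers which in general may vary from one observation to
the next, and the $\tilde{\epsilon}_j$s are independent random variables with identical normal
densities
$$f_N(\epsilon \mid 0, h) = (2\pi)^{-\frac{1}{2}} \, e^{-\frac{1}{2} h \epsilon^2} \, h^{\frac{1}{2}} .$$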
Suppose now that we have n observations, and let
$$X = \begin{bmatrix} 1 & x_2^{(1)} & \cdots & x_r^{(1)} \\ \vdots & \vdots & & \vdots \\ 1 & x_2^{(n)} & \cdots & x_r^{(n)} \end{bmatrix} \tag{3}$$
be a matrix of observations of "independent" variables together with an
additional column of 1's.
If $X^t X$ is singular, the following method,
or a slight variation of it, is often employed to make standard least squares
"work":
1. Using the n sample observations, compute r-1 sample statistics
$\rho_i$, i=2,...,r, where $\rho_i$ is the simple sample correlation
coefficient between $x_1$ and $x_i$.
2. Relabel variables $x_2, \ldots, x_r$ so that
$|\rho_{[2]}| > |\rho_{[3]}| > \cdots > |\rho_{[r]}|$.
3. Choose an integer r* < rank($X^t X$).
4. Restructure the model, eliminating from consideration
relabelled variables $x_{[r^*+1]}, \ldots, x_{[r]}$.
(Henceforth we call the
initial model "Model A" and the restructured model "Model B".)
5. Assuming that Model B is an adequate approximation to Model A,
use $X^t X$ to estimate parameters of the conditional distribution
of $x_1$ given the values of relabelled variables $x_{[2]}, \ldots, x_{[r^*]}$.
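The five steps above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' program: the function names `pearson`, `solve`, and `ad_hoc_model_b`, the argument `m` (the number of retained independent variables, so that r* = m + 1), and the uniform test data are all assumptions made for this example.

```python
import random

def pearson(u, v):
    """Simple sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    k = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, k):
            f = M[i][c] / M[c][c]
            M[i] = [a - f * t for a, t in zip(M[i], M[c])]
    z = [0.0] * k
    for i in reversed(range(k)):
        z[i] = (M[i][k] - sum(M[i][j] * z[j] for j in range(i + 1, k))) / M[i][i]
    return z

def ad_hoc_model_b(y, columns, m):
    """Keep the m regressors most correlated with y, then fit least squares.

    y: n values of the dependent variable x_1.
    columns: r-1 regressor columns, each a list of n values.
    Returns (indices kept, coefficients with the constant first, R^2).
    """
    rho = [pearson(y, col) for col in columns]                       # step 1
    order = sorted(range(len(columns)), key=lambda i: -abs(rho[i]))  # step 2
    keep = order[:m]                                                 # steps 3-4
    X = [[1.0] + [columns[i][j] for i in keep] for j in range(len(y))]
    Xt = list(zip(*X))
    XtX = [[sum(a * b for a, b in zip(ci, cj)) for cj in Xt] for ci in Xt]
    Xty = [sum(a * b for a, b in zip(ci, y)) for ci in Xt]
    beta = solve(XtX, Xty)                                           # step 5
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return keep, beta, 1.0 - ss_res / ss_tot

# Illustrative data: n = 10 observations, r = 10, all variables independent.
rng = random.Random(1)
n, r = 10, 10
y = [rng.random() for _ in range(n)]
columns = [[rng.random() for _ in range(n)] for _ in range(r - 1)]
keep, beta, r2 = ad_hoc_model_b(y, columns, m=2)
print(keep, round(r2, 3))
```

Because the retained set always contains the regressor with the largest $|\rho_i|$, the resulting $R^2$ is at least the square of that coefficient, even though the example data are generated independently of y.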
3. Simulation Study
Model B is a strongly biased approximation to Model A, and the magnitude
of the bias is a function of the joint distribution of the r*-l largest simple
sample correlation coefficients.
Analytical expressions for this distribution
and for the distribution of the sample multiple correlation coefficient in
Model B are in general extremely complex and unwieldy.
Hence we have resorted
to simulation¹ in order to numerically determine the salient features of Model B.
The steps in this simulation were:
1. We assumed that:
(a) the data generating process is a Model A as defined in (2);
(b) the dependent variables $x_1^{(j)}$ and the independent variables
$x_2^{(j)}, \ldots, x_r^{(j)}$, j=1,2,...,n, are mutually independent random
variables identically distributed uniformly on [0, 1].
2. For each (n, r) pair, n=10(2)20, 30, 40, 50, and r=10(5)50, 50(10)100,
150, 200 we generated 50 sets of n observations each.
3. For each set of n observations we used the ad hoc procedure outlined
in section 2 to calculate the five largest simple sample
correlation coefficients and Model B multiple correlation coefficients
for Model B's with m=2, 3, 4, and 5 independent variables.
4. Regarding each of the five sets of 50 simple sample correlation
coefficients as a set of sample observations we then calculated the
sample mean $\overline{|\rho_i|}$ and sample variance $V(|\rho_i|)$ of the absolute value of
each of the five largest simple sample correlation coefficients. The
results are tabulated in Tables 1 and 2.
5. We follow a similar procedure with the sample multiple correlation
coefficients.
In Table 3 the mean $\bar{R}^2$ of the squares of 50 sample
multiple correlation coefficients for Model B's with a constant
term plus m=2, 3, 4, and 5 independent variables is tabulated.

¹ Ames and Reiter [1] performed a somewhat similar experiment in a
different context:
they drew 100 economic time series at random from the
Historical Statistics of the United States and then computed the sampling
distributions of correlation and autocorrelation coefficients of the series
drawn, finding that "correlations and lagged cross-correlations are quite high
for all classes of data.
E.g., given a randomly selected series, it is possible
to find, by random drawing, another series which explains at least 50 per cent
of the variances of the first one, in from 2 to 6 random trials, depending
on the class of data involved." [1], p. 637.
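The part of the simulation that produces the means and variances of the five largest simple sample correlation coefficients can be sketched as follows. This is a hedged reconstruction rather than the program used for the paper: the names `pearson`, `top_abs_correlations`, and `simulate`, the seed, and the choice of the n-1 divisor for the sample variance are assumptions, and the multiple correlation part of the study is omitted.

```python
import random
from statistics import mean, variance

def pearson(u, v):
    """Simple sample correlation coefficient between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def top_abs_correlations(n, r, k, rng):
    """One replication: draw n observations of r independent U[0, 1] variables
    and return the k largest |rho_i| between x_1 and x_2, ..., x_r (descending)."""
    y = [rng.random() for _ in range(n)]
    rhos = []
    for _ in range(r - 1):
        x = [rng.random() for _ in range(n)]
        rhos.append(abs(pearson(y, x)))
    return sorted(rhos, reverse=True)[:k]

def simulate(n, r, k=5, reps=50, seed=0):
    """Mean and sample variance, over reps replications, of each of the
    k largest absolute simple sample correlation coefficients."""
    rng = random.Random(seed)
    draws = [top_abs_correlations(n, r, k, rng) for _ in range(reps)]
    by_rank = list(zip(*draws))  # one tuple of 50 values per rank
    return [mean(c) for c in by_rank], [variance(c) for c in by_rank]

means, variances = simulate(n=10, r=10)
print([round(v, 3) for v in means])
print([round(v, 4) for v in variances])
```

Looping `simulate` over the (n, r) grid listed in step 2 would yield one row of means and one row of variances per pair, which is the layout the text describes for Tables 1 and 2.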
The tables give us a reasonably accurate idea of the order of magnitude
of Model B sample correlation coefficients and of the five largest simple sample
correlation coefficients whatever the distribution of the independent variables, so long as they are mutually independent and have finite mean and variance.²
Tables 2 and 4 of the sample variances of the $|\rho_i|$ and of sample multiple
correlation coefficients show that variations in Tables 1 and 3 entries due
to sampling error are practically negligible;
e.g. the coefficient of variation
of $|\rho_{[2]}|$ for n=10 and r=10 is
less than .001 as calculated from Tables 1 and 2.
² In order to assess the effect of assuming that the data generating process
generates values of mutually independent but uniformly distributed random
variables, we duplicated the experiment for n = [illegible] and r = 10 under a different
assumption:
each sample observation was generated by summing 10 values of mutually
independent, identically uniformly distributed random variables. This spot check
using an approximately normal data generating process revealed no differences to
the third significant digit between values generated under this assumption and
values tabled in Tables 1, 2, 3, and 4.

³ One interesting feature of Table 1 is that for given r and i, $\overline{|\rho_i|}$
decreases almost exactly proportional to 1/n. This is most pronounced for
i=2.

4. Evaluation
Examination of Tables 1 and 3 illustrates clearly that not only is the
procedure outlined in section 2 biased, but that bias of an order of magnitude
shown in Tables 1 and 3 entries will occur with extremely high probability.
For example, for n=10, r=10, $\overline{|\rho_{[2]}|}$ = .61
with sample variance of only .0172, and for
n = [illegible], r = [illegible], $\overline{|\rho_{[2]}|}$ = [illegible]
with sample variance of only [illegible],
even when the population values of $|\rho_i|$,
i=2,...,r are controlled to be 0.
One effective way of dealing with the problem is to reformulate it in
Bayesian terms so that we may systematically incorporate information the researcher
possesses from sources other than the sample into the analysis.
In [2] we indicate
how Bayesian estimation of the mean vector and variance-covariance matrix of a
multinormal process may be done even when the dimensionality of the process is r
and there are only n < r sample observations available.
And in [3] we show how
Bayesian estimation can be done when the data generating process is one frequently
occurring in econometric analysis--a set of simultaneous linear equations with
stochastic components--and there are fewer vector observations than parameters.
[1] Ames, E. and Reiter, S., "Distributions of Correlation Coefficients in Economic
Time Series," Journal of the American Statistical Association, 56 (1961),
pp. 637-656.
[2] Ando, Albert and Kaufman, Gordon M., "Bayesian Analysis of the Independent
Multinormal Process - Neither Mean nor Precision Known - Part I," Massachusetts
Institute of Technology, Alfred P. Sloan School of Management, Working
Paper No. 41-63.
[3] Ando, Albert and Kaufman, Gordon M., "Bayesian Analysis of the Reduced Form
System," Massachusetts Institute of Technology, Alfred P. Sloan School
of Management, Working Paper No. 79-64.
[Tables 1-4: numerical entries not recoverable. As described in the text, Tables 1 and 2 give, for each (n, r) pair, the sample means and sample variances of the absolute values of the five largest simple sample correlation coefficients; Tables 3 and 4 give the means and sample variances of the squared sample multiple correlation coefficients for Model B's with m=2, 3, 4, and 5 independent variables.]