Solution 2005 (postponed)

UNIVERSITY OF OSLO
DEPARTMENT OF ECONOMICS
Exam: ECON4135 - Applied statistics and econometrics, fall 2005, continuation exam
Date of exam: Monday, January 16, 2006
Time for exam: 2:30 p.m. – 5:30 p.m.
The problem set covers 4 pages
Resources allowed:
 All written and printed resources, as well as calculators, are allowed
Grades given: A (best), B, C, D, E and F, with E as the weakest passing grade.
Comments in arial font
Scientific journals constitute the medium of communication between scientists, and also the
memory (storage) of science. The economics of (scientific) journals is interesting. Bergstrom [1]
argues that journals owned by private publishers are grossly overpriced, and he recommends
several actions to reduce the large profits made by these publishers. Bergstrom provides data
to substantiate his case. There are 180 economic journals in his database, of which 16 are
published by scholarly societies such as the American Economic Association. These 16
journals are published on a non-profit basis, as opposed to the remaining journals that have
private publishers. We shall particularly be interested in the separation between society
journals and privately published journals. Consider the variables:
P : Library subscription price for the journal per year (USD).
Y : Number of libraries subscribing to the journal.
C : Total number of times papers in the journal were cited in 1998.
A : Age of the journal (years).
N : Number of pages in the journal in 1998.
S : Binary variable (dummy); 1 if non-profit (scholarly society), 0 otherwise.
1. Figure 1 shows the dummy S for society journal plotted against age A. Would you think
that age of journals is normally distributed within the two groups of journals? Explain
what it would mean that A and S are stochastically independent. Would
$E(A \mid S = 1) = E(A \mid S = 0)$ if A and S are independent?
No, the distribution seems skewed, with a long tail towards high ages,
particularly for privately published journals. Independence of A and S means
that the conditional density of A given $S = s$ is the same for the two values
of s (and equals the marginal density). Yes: since the conditional densities are
then equal, the conditional expected values must be equal.
[1] Bergstrom, T.C. 2000. Free Labor for Costly Journals? Journal of Economic Perspectives 15: 183-198.
2. To estimate the mean journal age in the two groups one could consider the regression
presented in R1 below. What is the estimated mean age for privately owned journals, and
what is it for society journals? It is of interest to estimate the difference in age distribution
between the two groups. What is the p-value for testing the null hypothesis
$H_0: E(A \mid S = 1) = E(A \mid S = 0)$ versus a two-sided alternative? What would the p-value
be for testing $H_0$ versus the one-sided alternative $H_1: E(A \mid S = 1) > E(A \mid S = 0)$? Can you
find a 95% confidence interval for the difference in mean age?
Let $\mu_s = E(A \mid S = s)$. Then $\hat{\mu}_0 = 33.98$ years, and $\hat{\mu}_1 = 33.98 + 12.52 = 46.50$
years. The two-sided p-value is 0.032. Since the estimated difference is in the
direction of $H_1$, the one-sided p-value is half of that, 0.016. The
95% confidence interval for $\mu_1 - \mu_0$ is (1.12; 23.91) years.
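The arithmetic behind this answer can be retraced from the R1 output. The following is a minimal sketch, not part of the original solution: the coefficient, standard error and p-value are copied from R1, while the critical value 1.9734 is an assumed 0.975 quantile of the t distribution with 178 degrees of freedom.

```python
# Values copied from regression R1 (robust standard errors).
CONS = 33.98171       # estimated mean age for private journals (S = 0)
COEF_S = 12.51829     # estimated difference in mean age, society minus private
SE_S = 5.774714       # robust standard error of COEF_S
P_TWO_SIDED = 0.032   # reported P>|t| for S

# Assumption: 0.975 quantile of the t distribution with 178 degrees of freedom.
T_CRIT = 1.9734

mean_private = CONS
mean_society = CONS + COEF_S

# The estimate is positive, i.e. in the direction of H1, so the
# one-sided p-value is half the two-sided one.
p_one_sided = P_TWO_SIDED / 2

# 95% confidence interval for the difference in mean age.
ci_low = COEF_S - T_CRIT * SE_S
ci_high = COEF_S + T_CRIT * SE_S

print(f"mean age, private journals: {mean_private:.2f}")
print(f"mean age, society journals: {mean_society:.2f}")
print(f"one-sided p-value: {p_one_sided:.3f}")
print(f"95% CI for the difference: ({ci_low:.2f}; {ci_high:.2f})")
```

Up to rounding of the critical value, the interval agrees with the one printed by Stata in R1.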
3. Regression R2 is similar to R1, but now the response variable is $LA = \log(A)$. Histograms
are shown in Figure 2. Calculate estimated mean log age in the two groups of journals. Do
your results agree with those in point 2? If you now want to test the independence
between A and S you might test $H_0: E(LA \mid S = 1) = E(LA \mid S = 0)$ versus a two-sided
alternative. What is the p-value? Why is it different from what you found in point 2?
Would you prefer to compare age between the two groups on the arithmetic scale (age in
years) or on the logarithmic scale?
The estimated mean log ages are 3.3149 = log(27.52) for private and
3.7223 = log(41.36) for society journals (in log-year units). The two sets of
results do not quite agree: the estimated centre of the distribution on the
arithmetic scale is systematically higher than that obtained from the mean log
age, because log(a) is a concave function of a, and thus $E(\log A) \le \log E(A)$ by
Jensen's inequality. The two-sided p-value is now 0.003. I would rather use
the logarithmic scale because the distribution is more symmetric on that scale
(Figure 2), and the mean is thus a more meaningful measure of the distribution
centre. Also, the two conditional distributions separate better on the log scale,
as indicated by the smaller two-sided p-value.
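The comparison between the two scales can be checked numerically. This is a small sketch under the assumption that only the R1 and R2 coefficients are available (the raw data are not): it back-transforms the mean log ages and confirms that they fall below the arithmetic means, as Jensen's inequality requires.

```python
import math

# Coefficients copied from R2 (log age) and R1 (age in years).
CONS_LOG = 3.314908     # mean log age, private journals
COEF_LOG = 0.4074081    # difference in mean log age, society minus private
MEAN_AGE_PRIVATE = 33.98171
MEAN_AGE_SOCIETY = 33.98171 + 12.51829

mean_log_private = CONS_LOG
mean_log_society = CONS_LOG + COEF_LOG

# exp(mean of log A) is the geometric mean age in each group.
geo_private = math.exp(mean_log_private)
geo_society = math.exp(mean_log_society)

print(f"private: mean log age {mean_log_private:.4f}, back-transformed {geo_private:.2f} years")
print(f"society: mean log age {mean_log_society:.4f}, back-transformed {geo_society:.2f} years")

# Jensen: E(log A) <= log E(A), so the geometric mean cannot exceed the arithmetic mean.
print("Jensen holds for private journals:", geo_private <= MEAN_AGE_PRIVATE)
print("Jensen holds for society journals:", geo_society <= MEAN_AGE_SOCIETY)
```

The back-transformed values are geometric means, which is one way to read the log-scale results in years.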
4. If now the issue is to find determinants for what makes a journal society published
rather than privately published, logistic regression might be useful. Regression R3 shows
the result of fitting the equation $P(S = 1) = F(\beta_0 + \beta_1 LA)$ where $F(y) = 1/(1 + e^{-y})$ is the
cumulative logistic distribution function. Explain why the estimated probability of a one
year old journal being society published is 0.00275. What is the estimated probability of a
hundred year old journal being society published? What is the age which would make the
probability about ½ for the journal being society published? Note that natural logarithms
are used.
For $A = 1$, $LA = 0$ and $P(S = 1) = F(\beta_0)$, which is estimated as
$1/(1 + e^{-(-5.8916)}) = 0.00275$. For a hundred year old journal, $\log(100) = 4.605$. Therefore, the estimated probability
is $1/(1 + e^{-(-5.891597 + 1.01383 \cdot 4.60517)}) = 0.2275$. To get the estimated probability equal to
½, the exponent must be 0. That is, $\log(A) = 5.891597/1.01383 = 5.8112$ and
$A = 334.0$ years, which is well outside the range of the data.
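These three calculations can be reproduced directly from the R3 coefficients. The sketch below is illustrative only; the function names `logistic_cdf` and `prob_society` are made up for this example.

```python
import math

# Coefficients copied from the logit fit R3.
B0 = -5.891597   # _cons
B1 = 1.01383     # coefficient on LA = log(A)


def logistic_cdf(y):
    """Cumulative logistic distribution function F(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def prob_society(age):
    """Estimated probability that a journal of the given age (years) is society published."""
    return logistic_cdf(B0 + B1 * math.log(age))


p_one_year = prob_society(1)        # LA = 0, so this is F(B0)
p_hundred_years = prob_society(100)

# The probability is 1/2 when the linear predictor is 0: log(A) = -B0 / B1.
log_age_half = -B0 / B1
age_half = math.exp(log_age_half)

print(f"P(S = 1 | A = 1)   = {p_one_year:.5f}")
print(f"P(S = 1 | A = 100) = {p_hundred_years:.4f}")
print(f"log age giving probability 1/2: {log_age_half:.4f}")
print(f"age giving probability 1/2: {age_half:.1f} years")
```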
5. A more complex logistic regression is shown in R4, where $LY = \log(Y)$ and
$LN = \log(N)$ etc. How would you explain to your fellow economist who has never heard
of logistic regression what these results mean? Are the data markedly better fitted by this
regression than by R3?
I would only try to explain what we can read from the signs of the
estimated regression coefficients. Everything else being constant, if the age
increases, the probability of the journal being society published is reduced;
if the number of subscribing libraries increases, it is also reduced; but if the
number of pages increases, the probability increases, as it does when the
number of citations increases. The most important determinant is the price:
ceteris paribus, if the price increases, the probability of the journal
being non-profit decreases, as expected. In R3, LA is clearly significant, but
not in model R4, presumably since LA, LY, LN and LC are quite strongly
correlated. None of these has a significant logistic regression effect on its
own, but as a collective they are strongly significant, with p-value 0.0058. The
model R4 certainly fits the data better, with the log likelihood increased by 13
units on 4 extra parameters. The overall significance of the set of covariates is
increased, with the p-value decreased from 0.0144 to 0.0000.
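One way to put a number on "13 units on 4 extra parameters" is a likelihood ratio test of R3 against R4. This is an addition to the answer above, offered as a sketch: the log likelihoods and the Wald statistic are copied from the Stata output, and the helper `chi2_sf_even_df` is a hypothetical name for the closed-form chi-square tail probability that holds for even degrees of freedom.

```python
import math

LL_R3 = -51.000589   # log likelihood of R3 (LA only)
LL_R4 = -37.935238   # log likelihood of R4 (LA, LY, LN, LC, LP)
EXTRA_PARAMS = 4     # LY, LN, LC and LP


def chi2_sf_even_df(x, df):
    """Upper tail probability P(X > x) for a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k < df/2} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Likelihood ratio statistic for the 4 extra parameters in R4.
lr_stat = 2.0 * (LL_R4 - LL_R3)
p_lr = chi2_sf_even_df(lr_stat, EXTRA_PARAMS)

# Check of the joint Wald test of LC, LN, LA, LY reported after R4.
p_wald = chi2_sf_even_df(14.54, 4)

print(f"LR statistic: {lr_stat:.2f} on {EXTRA_PARAMS} df, p-value {p_lr:.6f}")
print(f"Wald chi2(4) = 14.54, p-value {p_wald:.4f}")
```

The second calculation reproduces the p-value 0.0058 printed by Stata's `test` command, which suggests the closed form is being applied correctly.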
6. Let the odds of the probability of a journal being society published be
$O = P(S = 1)/P(S = 0)$. Show that $\log(O) = \beta_0 + \beta_1 LA + \beta_2 LY + \beta_3 LN + \beta_4 LC + \beta_5 LP$
when $P(S = 1) = F(\beta_0 + \beta_1 LA + \beta_2 LY + \beta_3 LN + \beta_4 LC + \beta_5 LP)$. Could you interpret $\beta_5$ as
the price elasticity of the odds for a journal being society published? What is the 95%
confidence interval for the price elasticity on the odds?



$\log\left[\frac{1/(1 + e^{-y})}{1 - 1/(1 + e^{-y})}\right] = \log(e^{y}) = y$. Thus the expression for $\log(O)$, and
$\frac{d}{dP}\log(O) = \beta_5 \frac{d}{dP}\log(P)$. $\beta_5$ is therefore the price elasticity of the odds for being
society- rather than privately published. The 95% confidence interval is given in R4,
and is $(-2.27;\ -0.61)$.
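Both steps of this answer lend themselves to a quick numerical check. The sketch below is illustrative: the coefficient and standard error are copied from R4, and the critical value 1.959964 is the standard normal 0.975 quantile, which is assumed to be what Stata uses for logit intervals.

```python
import math

# Coefficient and standard error for LP copied from R4.
B5 = -1.438341
SE_B5 = 0.4243647
Z_CRIT = 1.959964   # assumed 0.975 quantile of the standard normal distribution


def logistic_cdf(y):
    """F(y) = 1 / (1 + exp(-y))."""
    return 1.0 / (1.0 + math.exp(-y))


def log_odds(p):
    """log(p / (1 - p))."""
    return math.log(p / (1.0 - p))


# Identity log(O) = y when P(S = 1) = F(y), checked at a few points.
max_deviation = max(abs(log_odds(logistic_cdf(y)) - y) for y in (-3.0, -0.5, 0.0, 2.0))

# 95% confidence interval for the price elasticity of the odds.
ci_low = B5 - Z_CRIT * SE_B5
ci_high = B5 + Z_CRIT * SE_B5

# Elasticity reading: multiplying the price by 1.01 multiplies the odds by 1.01 ** B5.
odds_factor = 1.01 ** B5

print(f"largest deviation from log(O) = y: {max_deviation:.2e}")
print(f"95% CI for the elasticity: ({ci_low:.2f}; {ci_high:.2f})")
print(f"odds factor for a 1% price increase: {odds_factor:.4f}")
```

A 1% price increase changes the estimated odds by roughly $\beta_5$ percent, which is the elasticity interpretation in practice.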
7. Explain, with your statistically ignorant fellow economist in mind, what is meant by the
interval having degree of confidence 95%. Our sample of 180 economics journals is about
the total population of English language journals in the field of economics. The sample
can therefore not rightly be thought of as a random sample from a big population. Is this a
problem for your interpretation of the confidence interval?
If the experiment were repeated many times, and if the same
method were used to calculate the 95% interval each time, the interval would
cover the true value in 95% of the replicates in the long run. But what should
the experiment be (my dear friend)? The history of the field of economics has
had its realization so far, with its 180 journals in the English language. You
can probably not envisage the development being repeated independently a large
number of times. But you might think hypothetically. Suppose the logistic model
is true, in the hypothetical sense that age, number of subscriptions etc.
developed as they did, but that for each journal a coin is flipped to determine
whether it should be an $S = 1$ or an $S = 0$ journal, with success probability
determined by the logistic equation. Then replicated data from the assumed
model can be simulated. And if simulated a good many times, the 95% intervals
calculated by the method will, if Stata is right, cover the true value of the
parameter in about 95% of the replications. With this interpretation, it is
really no problem that the sample is the complete population of economics
journals in the English language.
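The thought experiment can be tried out in a small simulation. The following is a sketch under several assumptions, not a reproduction of the Stata analysis: the 180 log ages are hypothetical draws from a normal distribution (the real data are not available here), the R3 estimates are treated as the true parameter values, and the model is fitted with a hand-written Newton-Raphson routine (`fit_logit`, a name invented for this example) instead of Stata's `logit`.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic cdf F(z) = 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(x, s, iterations=50):
    """Newton-Raphson ML fit of P(S=1) = F(b0 + b1*x).

    Returns (b0, b1, se_b1), or None if the fit does not converge.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        p = [sigmoid(b0 + b1 * xi) for xi in x]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector.
        g0 = sum(si - pi for si, pi in zip(s, p))
        g1 = sum((si - pi) * xi for si, pi, xi in zip(s, p, x))
        # Information matrix [[a, b], [b, d]].
        a = sum(w)
        b = sum(wi * xi for wi, xi in zip(w, x))
        d = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = a * d - b * b
        if det < 1e-10:
            return None
        step0 = (d * g0 - b * g1) / det
        step1 = (a * g1 - b * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-8:
            # Var(b1) is the (2,2) element of the inverse information matrix.
            return b0, b1, math.sqrt(a / det)
    return None


def coverage(true_b0, true_b1, x, replications, rng):
    """Share of replications whose 95% Wald interval for b1 covers true_b1."""
    covered = 0
    fitted = 0
    for _ in range(replications):
        # Flip one coin per journal, with success probability from the logistic model.
        s = [1 if rng.random() < sigmoid(true_b0 + true_b1 * xi) else 0 for xi in x]
        fit = fit_logit(x, s)
        if fit is None:
            continue
        _, b1, se = fit
        fitted += 1
        if b1 - 1.96 * se <= true_b1 <= b1 + 1.96 * se:
            covered += 1
    return covered / fitted, fitted


rng = random.Random(2006)
# Assumption: hypothetical log ages for 180 journals, held fixed across replications.
log_ages = [rng.gauss(3.35, 0.63) for _ in range(180)]
# Assumption: the R3 estimates play the role of the true parameters.
rate, fitted = coverage(-5.891597, 1.01383, log_ages, 300, rng)
print(f"coverage of the 95% interval: {rate:.3f} over {fitted} fitted replications")
```

The covariates are held fixed and only S is redrawn, which mirrors the hypothetical repetition described above. With a finite number of replications the observed share will scatter around the nominal level.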
R1
Regression with robust standard errors                 Number of obs =     180
                                                       F(  1,   178) =    4.70
                                                       Prob > F      =  0.0315
                                                       R-squared     =  0.0193
                                                       Root MSE      =  25.534

------------------------------------------------------------------------------
             |               Robust
       Age A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   Society S |   12.51829   5.774714     2.17   0.032     1.122582      23.914
       _cons |   33.98171    2.02106    16.81   0.000     29.99339    37.97003
------------------------------------------------------------------------------
R2
Regression with robust standard errors                 Number of obs =     180
                                                       F(  1,   178) =    9.05
                                                       Prob > F      =  0.0030
                                                       R-squared     =  0.0335
                                                       Root MSE      =   .6261

------------------------------------------------------------------------------
             |               Robust
  Log age LA |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           S |   .4074081   .1354304     3.01   0.003     .1401523    .6746639
       _cons |   3.314908   .0497229    66.67   0.000     3.216786    3.413031
------------------------------------------------------------------------------
R3
Logit estimates                                   Number of obs   =        180
                                                  LR chi2(1)      =       5.98
                                                  Prob > chi2     =     0.0144
Log likelihood = -51.000589                       Pseudo R2       =     0.0554

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          LA |    1.01383   .4217932     2.40   0.016     .1871301    1.840529
       _cons |  -5.891597   1.571597    -3.75   0.000     -8.97187   -2.811323
------------------------------------------------------------------------------
R4
Logit estimates                                   Number of obs   =        180
                                                  LR chi2(5)      =      32.11
                                                  Prob > chi2     =     0.0000
Log likelihood = -37.935238                       Pseudo R2       =     0.2974

------------------------------------------------------------------------------
           S |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
          LA |  -.0690083   .5582708    -0.12   0.902    -1.163199    1.025182
          LY |  -.4324953   .4960825    -0.87   0.383    -1.404799    .5398084
          LN |   1.825738   .9343176     1.95   0.051    -.0054905    3.656967
          LC |   .6336932   .4133132     1.53   0.125    -.1763857    1.443772
          LP |  -1.438341   .4243647    -3.39   0.001    -2.270081   -.6066017
       _cons |  -8.368748   5.135016    -1.63   0.103    -18.43319    1.695699
------------------------------------------------------------------------------

. test LC LN LA LY

 ( 1)  LC = 0
 ( 2)  LN = 0
 ( 3)  LA = 0
 ( 4)  LY = 0

           chi2(  4) =   14.54
         Prob > chi2 =    0.0058

Figure 1. Dummy S for society journal plotted against age A. (Vertical axis: S, from 0 to 1; horizontal axis: A, from 0 to 150.)
Figure 2. Histogram of log age by journal type (privately published to the left). (Vertical axis: Density; horizontal axis: LA; graphs by S.)