UNIVERSITY OF OSLO
DEPARTMENT OF ECONOMICS

Exam: ECON4135 - Applied statistics and econometrics, fall 2004
Date of exam: Wednesday, December 1, 2004
Time for exam: 14:30 – 17:30
The problem set covers 6 pages
Resources allowed: All written and printed resources, as well as calculators, are allowed
Grades given: A (best), B, C, D, E and F, with E as the weakest passing grade.

Comments given in arial font

“Broken limits to life expectancy” by Oeppen and Vaupel (Science, Vol. 296, 10 May 2002) showed that many previous claims of upper limits to expected life length for a newborn have been broken, and also that expected life length has shown a remarkably linear development since 1840. We shall look at some of these data for females.

For each of the years 1840 – 2000 Oeppen and Vaupel looked at observed life expectancy in the best practicing country, called record life expectancy. The best practicing country is defined as the country with the highest life expectancy in the year in question. Life expectancy in a country, often denoted by $e(0)$, is calculated from the observed age-specific mortality rates in the year in question, and is the expected life length for a newborn under the hypothesis that it is subject throughout its life to the mortality rates observed in its year of birth. Record life expectancy in a given year is denoted by $Y_{year}$.

Problems

1. Female life expectancy in best practicing country is plotted against year in Figure 1. A linear regression model $Y_{year} = \beta_0 + \beta_1 \, year + U_{year}$ is fitted to the data by ordinary least squares, see Stata output in Exhibit 1. What is the interpretation of $\hat\beta_1$? Is the intercept estimate directly meaningful? Give a 95% confidence interval for the gain in life expectancy in one calendar year, and also in 10 calendar years. What is the 99% confidence interval for yearly gain in expected life length?

$\beta_1$ is the yearly gain in female life expectancy in the best practicing country in the model, and $\hat\beta_1$ is its estimate from the 1840-2000 series.
Since year is measured in years AD, year = 0 is way outside the observed data span, and extrapolation to the value at year 0, the intercept, is risky business. The estimated intercept is large and negative, which is nonsense for life expectancy.

95% CI for $\beta_1$: (0.238, 0.248); the 95% CI for $10\beta_1$ is 10 times that for $\beta_1$: (2.38, 2.48); 99% CI for $\beta_1$: (0.237, 0.249).

2. From Figure 1 there was clearly more variation in the data in the first part of the period than in the remaining period, and there were outlying observations in the period 1916-1919. What could have caused these patterns? Another pattern is that record life expectancy is flat over several periods around 1900. Why could that be? The regression results in Exhibit 1 were calculated with robust standard errors. Why is it a good idea to calculate robust standard errors in this case?

Few countries gathered statistical data on death rates in the early part of the period, and those that did produced vital statistics more prone to error and variability than in the 20th century.

1916-1919: war and the Spanish flu.

Flats over periods: Around 1900, several countries, including Norway, published vital statistics every fifth year. With the best practicing country in this group, the series has flats between publication years.

The standard errors for regression coefficients are biased if computed with the classical method rather than the robust method when there is heteroscedasticity such as that observed here.

3. For a given year from 1841 on, the first difference in record life expectancy is $D_{year} = Y_{year} - Y_{year-1}$. Figure 2 shows first differences in record life expectancy versus year, and Exhibit 2 gives summary information for this variable. Comment briefly on Figure 2 in view of Figure 1, and explain how the regression result relates to the mean in Exhibit 2.
Figure 2 shows more variability early in the period (due to more variation around the regression curve), several flats at zero around 1900 (due to flats in Figure 1), and variation around a constant level slightly above zero (due to linearity in Figure 1). The mean 0.243 in Exhibit 2 estimates the yearly gain in record life expectancy, which is modelled as $\beta_1$. It agrees with the regression estimate of 0.243. The standard error obtained from Exhibit 2, $1.097/\sqrt{160}$, is 0.087, which does not match the standard error in Exhibit 1 (0.0024). The discrepancy is due to the former being based on all differences having the same variance, which certainly is not the case, while the latter is calculated by the robust method, which does not rely on this assumption.

4. From 1946 on D appears to have a rather stable development. An auto-regressive model of order 1 was estimated for this period. What could be a rationale for this model? The Stata result (in condensed form) is given in Exhibit 3, which means that the estimated model is $\hat{D}_{year} = 0.257 - 0.135 \, D_{year-1}$. What is the standard error for the auto-regressive coefficient? Is there a significant first order auto-regression in the first differences in record life expectancy?

The standard error for the auto-regressive coefficient is 0.148. With a two-sided p-value of 0.36 (from the exhibit) when testing for no auto-correlation, the auto-regression coefficient is certainly not statistically different from zero. The first differences in record life expectancy might thus very well be uncorrelated.

5. It is puzzling that record life expectancy has been growing nearly linearly over such a long period, and indeed seems to continue to grow at about the same pace. Figure 3 is taken from Oeppen and Vaupel (2000). In the first half of the period, only a few countries had life expectancy close to record life expectancy (or, in fact, adequate vital statistics).
In more recent years, more and more countries, such as Chile, are getting their vital statistics in shape, and are catching up with the leading group. That the group of nations with nearly record life expectancy is growing in number is due to economic and other development in many nations. Discuss whether the continued growth in record life expectancy could partly be a statistical consequence of the fact that more and more countries belong to the group of leading nations regarding female life expectancy.

In a hypothetical situation with life conditions (underlying mortality) not changing in the group of best practicing countries, but with new countries joining this group, estimated record life expectancy will tend to grow simply because the record is the maximum of a larger and larger number of largely independent random variables. If, say, country i in the group of size $n_t$ in year t has observed life expectancy $X_{it}$, and these are iid with cumulative distribution function F, the record $Y_t = \max(X_{1t}, \dots, X_{n_t t})$ has distribution

$P(Y_t \le y) = P\left(\bigcap_{i=1}^{n_t} \{X_{it} \le y\}\right) = \prod_{i=1}^{n_t} P(X_{it} \le y) = F(y)^{n_t}$

due to independence. As $n_t$ increases, this certainly decreases for each y such that $0 < F(y) < 1$. The distribution of record life expectancy is thus moving to the right from year to year. This formal argument was not required at the exam.

6. Scholars have made claims of upper limits to female life expectancy. These claims have been based on a variety of biological, demographic and other grounds. A claim of an upper limit, say 64.8 years, is that no country will ever have a female life expectancy above that limit. Oeppen and Vaupel (2000) identify 19 independent such claims or asserted ceilings on female life expectancy, see Figure 4. The first claim was made by Dublin in 1928, and the claim was that female life expectancy could not exceed 64.8 years. This was a failure even when it was made, since the record life expectancy had exceeded the limit already in 1921 (New Zealand had 65.9).
Of the 19 claims, 14 have come out as failures by 2002. For claim i let $t_i$ be the year the claim was made, and let $F_i$ be the binary variable (coded 1 for failure) recording whether the claim has come out as a failure, i.e. has been beaten by record life expectancy by 2002. Exhibit 4 shows output from two logistic regressions, both with F as the dependent variable. The first logistic regression had only t as regressor, while in the second both year of claim t and lapse time $x = 2002 - t$ were entered as regressors. Interpret the two sets of results. In the second case t was dropped by Stata. Why?

The two results agree since the linear predictors in the two logistic regressions are identical: $a + bt = (a + 2002b) - bx$. Here, $b = -0.5955806$ is the regression estimate, and a is the intercept, both from the first regression in Exhibit 4. t and $x = 2002 - t$ are perfectly collinear, and Stata rightly refuses to include both terms in a linear logistic regression. Otherwise the regression coefficients would not have been identifiable.

The regression is logistic, and the regression coefficient is therefore interpreted as a log odds ratio. With $p_t = P(F_t = 1)$ as the failure probability, the model is $\ln\left(p_t / (1 - p_t)\right) = \ln O_t = \beta_0 + \beta_1 t$. This gives $\beta_1 = \ln(O_{t+1} / O_t)$ for all t. The result is thus that the odds of failure for a claim made in year t + 1 are the odds of failure for a claim made in year t multiplied by an estimated factor of $\exp(\hat\beta_1) = 0.55$. The second regression gives exactly the same result. The confidence interval for this log odds ratio is however wide and contains 0.

One should expect a reduction in the odds, and thus in the failure probability, the closer to 2002 the claim was made. The result is in agreement with this, but because the data are weak, the estimated decrease in failure probability is not statistically significant. From Figure 4 it seems that only 4 or 5 of the 19 claims held water by 2002, and they were mixed with failed claims made in recent years.
The large standard error is due to the low number of data points (19) and the relatively low resolution in the timing of non-failed claims.

7. To what extent is life expectancy determined by economic variables like GDP per capita? Suppose you had data on life expectancy $e(0)$ and GDP per capita, Z, for your own country, or say Norway. Would you think that a regression of the form $e(0)_{year} = \beta_0 + \beta_1 Z_{year} + U_{year}$ would yield valid results regarding the posed question? Regard your hypothetical data as the outcome of a quasi-experiment, and discuss potential threats to internal and external validity.

The suggested regression will probably be subject to omitted variables bias since both GDP and mortality are likely to depend on common variables. One might also have validity problems with the regression since GDP and life expectancy are likely to be mutually dependent in the sense that they are determined endogenously. These are threats to internal validity.

The dependency between the two variables need not be the same in different parts of the world. One might, for example, have egalitarian societies like the Norwegian, which had high life expectancy while being among the poorer European nations in the 19th century. Differences in the relationship between the two variables are threats to external validity: results from Norway might not be valid for U.A.R. etc.

Exhibit 1

Regression with robust standard errors          Number of obs =     161
                                                F(  1,   159) = 9928.61
                                                Prob > F      =  0.0000
                                                R-squared     =  0.9821
                                                Root MSE      =  1.5325

------------------------------------------------------------------------------
             |               Robust
           Y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        year |   .2429773   .0024385    99.64   0.000     .2381613    .2477933
       _cons |  -401.4199   4.754271   -84.43   0.000    -410.8096   -392.0303
------------------------------------------------------------------------------

Exhibit 2

    Variable |     Obs        Mean   Std. Dev.        Min        Max
-------------+--------------------------------------------------------
           D |     160    .2431875   1.096574  -4.209999   5.060001

Exhibit 3

Sample: 1946 to 2000                            Number of obs =      55
                                                Wald chi2(1)  =    0.84
                                                Prob > chi2   =  0.3607

------------------------------------------------------------------------------
           D |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   .2566701   .0598694     4.29   0.000     .1393282    .3740119
-------------+----------------------------------------------------------------
       ar L1 |  -.1353576   .1480772    -0.91   0.361    -.4255836    .1548684
-------------+----------------------------------------------------------------

Exhibit 4

Logit estimates                                 Number of obs =      19
                                                LR chi2(1)    =   14.17
                                                Prob > chi2   =  0.0002
Log likelihood = -3.865933                      Pseudo R2     =  0.6470

------------------------------------------------------------------------------
           F |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           t |  -.5955806   .4474039    -1.33   0.183    -1.472476    .2813149
       _cons |   1184.766   889.9979     1.33   0.183    -559.5983    2929.129
------------------------------------------------------------------------------

note: t dropped

Logit estimates                                 Number of obs =      19
                                                LR chi2(1)    =   14.17
                                                Prob > chi2   =  0.0002
Log likelihood = -3.865933                      Pseudo R2     =  0.6470

------------------------------------------------------------------------------
           F |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .5955806   .4474039     1.33   0.183    -.2813149    1.472476
       _cons |  -7.586783   5.772359    -1.31   0.189     -18.9004    3.726832
------------------------------------------------------------------------------

[Figure 1: plot of Y (vertical axis, 40 to 90) against year (horizontal axis, 1850 to 2000)]
Figure 1. Female life expectancy (in years) in best practicing country by calendar year. Source: Oeppen and Vaupel (2000).

[Figure 2: plot of D (vertical axis, -4 to 6) against year (horizontal axis, 1850 to 2000)]
Figure 2. First differences in record life expectancy versus year.

Figure 3. Female life expectancy in five countries compared with the trend in record life expectancy. Source: Oeppen and Vaupel (2000).

Figure 4. Record female life expectancy from 1840 to the present. The linear-regression trend is depicted by a bold black line and the extrapolated trend by a dashed gray line. The horizontal black lines show asserted ceilings on life expectancy, with a short vertical line indicating the year of publication. The three dashed red lines denote projections of female life expectancy in Japan published by the United Nations in 1986, 1999, and 2001. It is encouraging that the U.N. altered its projection so radically between 1999 and 2001. Source: Oeppen and Vaupel (2001).