Stochastic Population Forecasting and ARIMA time series modelling
Lectures, QMSS Summer School, 2 July 2009
Nico Keilman
Department of Economics, University of Oslo

Stochastic
• Stochastic (from the Greek "Στόχος" for "aim" or "guess") means random.
• A stochastic process is one whose behaviour is nondeterministic: a system's subsequent state is determined both by the process's predictable actions and by a random element.
• In a stochastic population forecast, uncertainty is made explicit: random variables are part of the forecast model.

Stochastic population forecast
Future population / births / deaths / migrations as probability distributions, not as one number (or perhaps three)

Why Stochastic Population Forecasts (SPF)?
Users should be informed about the expected accuracy of the forecast
- probability of alternative future paths?
- which forecast horizon is reasonable?
Traditional deterministic forecast variants (e.g. High, Medium, Low)
- do not quantify uncertainty: Prob(MediumPop) = 0 !!
- give a misleading impression of uncertainty (example later)
- leave room for politically motivated choices by the user

Outline
• Uncertainty of population forecasts
• Principles of SPF
• Time series models (selected examples)
• Alho's scaled model for error
• Examples from UPE
• Using a SPF
Focus on national forecasts

How uncertain are population forecasts?
Empirical findings: historical forecasts evaluated against actual population numbers (ex post facto)

Main findings for official forecasts in Western countries
• Uncertainty in forecasts of certain population variables is surprisingly large
• Forecasts for the young and the old age groups are the least reliable
• Forecast errors increase as the forecast interval lengthens
• Large uncertainty for small countries
• Large uncertainty for countries that are strongly affected by migration
• European forecasts have not become more accurate since WW2

Errors in age structure forecasts: Europe
[Figure: Percentage errors in age structure forecasts for Europe, UN forecasts 1968-1990. Series: 5, 10 and 15 years ahead. Horizontal axis: age group (0-4 to 80+). Vertical axis: -20 % (forecasts too low) to 20 % (forecasts too high).]

United Kingdom - men
[Figure: Percentage errors in age structure forecasts for the UK, GAD forecasts 1971-1994, men. Series: 10, 15 and 20 years ahead. Horizontal axis: age group (0-4 to 85+). Vertical axis: -30 % (forecasts too low) to 30 % (forecasts too high).]

United Kingdom - women
[Figure: Percentage errors in age structure forecasts for the UK, GAD forecasts 1971-1994, women. Series: 10, 15 and 20 years ahead. Horizontal axis: age group (0-4 to 85+). Vertical axis: -30 % (forecasts too low) to 30 % (forecasts too high).]

Why uncertain?
• Data quality (LDCs)
• Social science predictions: no accurate behavioural theory
• Rely on observed regularities instead
Problems when sudden trend shifts occur
- stagnation of life expectancy for men in the 1950s
- baby boom / baby bust

Traditional population forecasts do not give a correct impression of uncertainty
Example: Old Age Dependency Ratio (OADR) for Norway in 2060
Source: Statistics Norway population forecast of 2005

                      High    Middle   Low     |H-L|/M (%)
POP67+ (millions)     1.55    1.33     1.13    31
POP20-66 (millions)   4.03    3.39     2.83    36
OADR                  0.38    0.39     0.40     4

Two major problems
• Wide margins for some variables, narrow margins for others
• Narrow margins in the short run, wide margins in the long run
  - implicitly assumed perfect autocorrelation (and sometimes perfect correlation across components)

Coverage probabilities for the H-L margin of total population in official forecasts
Rows: Statistics Norway; Statistics Sweden (Fertility, Mortality, Migration). Columns: 2010, 2050.
Values in source order: 47%, 78%, 19%, 4%, 1%, 32%, 20%, 34%
Sources: stochastic population forecasts from UPE; traditional forecasts from Statistics Norway and Statistics Sweden

Cohort-component method
Deterministic population forecast
Needed for the country in question: annual assumptions on future
- Fertility: Total Fertility Rate
- Mortality: Life expectancy at birth, M/F
- Migration: Net immigration
as well as rates (fertility, mortality) and numbers (migration) by age and sex

Stochastic Population Forecast: How?
• Cohort-component method
• Random rates for fertility and mortality, random numbers for net migration
• Normal distributions in the log scale (rates) or in the original scale (migration numbers)
  - expected values ("point predictions"), cf. the Medium variant in a traditional deterministic forecast
  - standard deviations
  - correlations (age, time, sex, components, countries)

SPF: How? (continued)
• Joint distribution of all random input variables (rates, migration numbers)
• In practice: simplifications, e.g.
  - independence of components (fertility, mortality, migration)
  - correlation between male and female mortality (constant across ages, time)
• One random draw from all probability distributions gives one sample path
• Repeated draws give thousands of sample paths

SPF: How? (continued)
Three main approaches: uncertainty parameters based on
- historical errors
- expert knowledge
- statistical model

SPF: Examples
Multivariate time series models for all parameters of interest
Examples for Norway 1995-2050, see http://folk.uio.no/keilman/6-15.pdf
and European countries 2003-2050, see http://www.stat.fi/tup/euupe/index_en.html
Alho's scaled model for error, implemented in PEP (Program for Error Propagation)
Example for aggregate of 18 European countries 2003-2050, see http://www.stat.fi/tup/euupe/index_en.html

Time series example, Norway: log(TFR) = ARIMA(1,1,0)
Z_t = 0.67 (0.10) Z_{t-1} + ε_t
Z_t = log(TFR_t) - log(TFR_{t-1})

Prediction intervals, age-specific fertility rates, Norway 2050

Time series models for
• parameters of Gamma model for age-specific fertility (TFR, MAC, variance in age at childbearing)
• e0
• parameters of Heligman-Pollard model for age-specific mortality
• immigration numbers
• emigration numbers
(deterministic age patterns for both migration flows)
5000 simulations

[Figure: Population size, Norway, 2000-2050. Vertical axis: 3,800,000 to 6,200,000. The same chart is repeated over several consecutive slides.]
[Figure: 100 simulations of population size, Norway 2003-2050. Horizontal axis: 2000 to 2050. Vertical axis: 3,000,000 to 10,000,000.]

Time series models, two examples
1. Autoregressive model of order 1 - AR(1)
Z_t = φ Z_{t-1} + ε_t, |φ| < 1
ε_t i.i.d. random variables, zero expectation, constant variance ("white noise")
Var(Z_t) = Var(ε_t)(1 - φ^{2t})/(1 - φ^2), constant in the long run (large t)
For large t: the k-step-ahead autocorrelation Corr(Z_t, Z_{t+k}) equals φ^k, independent of time

2. Random Walk - RW
Z_t = Z_{t-1} + ε_t
Var(Z_t) = t·Var(ε_t), unbounded for large t
Independent increments (zero autocorrelation)
(Both variance formulas take the starting value Z_0 as fixed.)

Forecasts and 95% prediction intervals for net migration. Data 1960-2000
Outliers: 1989. AR(1) & constant: Z_t = 5688 + 0.76 Z_{t-1} + ε_t
Outliers: 1962, 1988. AR(1) & constant: Z_t = 7819 + 0.39 Z_{t-1} + ε_t

Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data 1950-2000.
Observed TFR value for the year 2000 is given as "y2000"
Model: AR(1) & constant
Z_t (= log TFR_t) = 0.001 + 0.988 Z_{t-1} + ε_t

Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data 1900-2000.
Observed TFR value for the year 2000 is given as "y2000"
Model: AR(1) & constant
Outliers 1920, 1942
Z_t (= log TFR_t) = -0.003 + 0.995 Z_{t-1} + ε_t

Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data 1950-2000.
Observed TFR value for the year 2000 is given as "y2000"
Model: AR(2) & constant
Z_t (= log TFR_t) = 0.002 + 0.941 Z_{t-1} - 0.408 Z_{t-2} + ε_t

Forecasts and 67%, 80%, and 95% prediction intervals for the TFR. Data 1900-2000.
Observed TFR value for the year 2000 is given as "y2000"
Model: AR(2)-ARCH(1)
Outliers 1919, 1920, 1940, 1941
Z_t (= log TFR_t) = 0.005 + 0.981 Z_{t-1} + v_t + dummies
v_t = 0.214 v_{t-2} + ε_t, ε_t = (√h_t) e_t, h_t = 7E-4 + 0.708 ε_{t-1}^2

Time series approach to SPF
+ conceptually simple
- inflexible

Alternative: Alho's scaled model for error
Implemented in Program for Error Propagation (PEP)
http://www.joensuu.fi/statistics/software/pep/pepstart.htm

Scaled model for error
Suppose the true age-specific rate in age j during forecast year t > 0 is of the form
R(j,t) = F(j,t) exp(X(j,t)),
where F(j,t) is the point forecast and X(j,t) is the relative error.
Suppose that the error processes are of the form
X(j,t) = ε(j,1) + ... + ε(j,t)
with error increments of the form
ε(j,t) = S(j,t)(η_j + δ(j,t))
S(j,t) are deterministic scales.
δ(j,t) are independent over time t.
δ(j,t) are independent of η_j for all t and j.
η_j ~ N(0, κ), δ(j,t) ~ N(0, 1 - κ), 0 ≤ κ ≤ 1
Note that Var(ε(j,t)) = S(j,t)^2
A positive κ means that there is systematic error in the time trend of the rate.
κ = Corr[ε(j,t), ε(j,t+h)] for all h > 0, thus κ is the (constant) autocorrelation between the error increments.
Together, the autocorrelation κ and the scale S(j,t) determine the variance of the relative error X(j,t):
Var(X(j,t)) = κ (S(j,1) + ... + S(j,t))^2 + (1 - κ)(S(j,1)^2 + ... + S(j,t)^2)
Ex. 1. Under a random walk model the error increments are uncorrelated, with κ = 0.
Ex. 2. The model with constant scales (S(j,t) = S(j)) can be interpreted as a random walk with a random drift. The relative importance of the two components is determined by κ.

Migration
Net migration is represented in absolute terms
Dependence on age is deterministic, given by a fixed distribution g(j,x) over age x
The error of net migration in age x, for sex j, during year t > 0, is additive and of the form
Y(j,x,t) = S(j,t) g(j,x)(η_j + δ(j,t))

Key properties of the scaled model
• The choice of the scales S(j,t) is unrestricted. Hence any sequence of non-decreasing error variances can be matched (e.g.
heteroscedasticity)
• Any sequence of cross-correlations over ages can be majorized using the AR(1) models of correlation
• Any sequence of autocorrelations for the error increments can be majorized.

Scaled model for error
Used for the UPE project: Uncertain Population of Europe
• 18 countries: EU15 + Iceland, Norway, Switzerland (EEA+)
• 2003-2050
• Probability distributions specified on the basis of
  - time series analysis (TFR, e0, net migration)
  - empirical forecast errors
  - expert judgement
• 3000 simulations for each country, PEP
• http://www.stat.fi/tup/euupe/index_en.html

[Figure: Population size EEA+, median (black) and 80% prediction intervals (red). Horizontal axis: 2000 to 2050. Vertical axis: 300 to 500 million.]
77% chance > 400 million in 2050 (UN)
83% chance > 392 million in 2050 (2003)

[Figure: Age pyramid for 2050, EEA+ countries, median (black) and 80% prediction intervals (red). Men and women by five-year age group (0-4 to 95+), numbers in millions (0 to 20).]
[Figure: b. 2003. Age pyramid, men and women by five-year age group (0-4 to 95+), numbers in millions (0 to 20).]

How to use SPF results?
User's loss function
What are the costs associated with underpredictions/overpredictions of certain sizes?
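The statements attached to the EEA+ population chart above (a median, an 80% prediction interval, the chance of exceeding a given population size) are all summaries of the simulated sample paths. A minimal sketch of that bookkeeping in Python follows. The draws are synthetic stand-ins from an arbitrary normal distribution, not UPE or PEP output, and the names `summarize` and `simulated_2050` are hypothetical.

```python
import random
import statistics

def summarize(draws, threshold, coverage=0.80):
    """Median, central prediction interval and exceedance share of simulated values."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - coverage) / 2.0
    # Simple empirical quantiles: take the order statistic at each tail position.
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int((1.0 - tail) * (n - 1))]
    # Share of simulated values that lie above the threshold.
    prob_above = sum(1 for x in ordered if x > threshold) / n
    return statistics.median(ordered), (lower, upper), prob_above

# Synthetic placeholder for 3000 simulated population sizes in 2050 (millions).
# The mean 430 and standard deviation 35 are arbitrary illustration values.
rng = random.Random(2003)
simulated_2050 = [rng.gauss(430.0, 35.0) for _ in range(3000)]

median, (low, high), p_above = summarize(simulated_2050, threshold=400.0)
print(f"median = {median:.0f} million")
print(f"80% prediction interval = ({low:.0f}, {high:.0f}) million")
print(f"share of paths above 400 million = {p_above:.2f}")
```

With real output, `simulated_2050` would be replaced by the end-of-horizon values of the sample paths. No distributional assumption is needed at this step, because everything is read off the empirical distribution.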
Loss function, stylized example
F = forecast, O = observed
Loss = c·(F - O)      if F > O
     = λ·c·(O - F)    if F < O      (c, λ > 0)
λ characterizes the degree of asymmetry in the loss function
λ > 1: underprediction is more severe than overprediction

The forecast variable is stochastic, with a predictive distribution
Hence Loss is a stochastic variable, which has a distribution
Compute expected Loss
Pick the value of F that minimizes expected Loss
The optimal F is the value at which the predictive distribution function equals λ/(λ + 1)
λ = 1: the optimal F is the median
λ > 1: the optimal F is larger than the median

[Figure: Optimal choice for e62, with e62 ~ Normal(20, stdev), for stdev = 2 years and stdev = 4 years. Horizontal axis: lambda (0.1 to 10). Vertical axis: years (14 to 26). λ > 1: underprediction is more severe than overprediction.]

Important
Are overpredictions more/less harmful than underpredictions?

Challenges
• Multi-state forecasts (sub-national, household)
• Limited data
• Educate the users

Thank you!

Autocorrelation of error increments
The error processes are of the form
X(j,t) = ε(j,1) + ... + ε(j,t)
with error increments of the form
ε(j,t) = S(j,t)(η_j + δ(j,t))
S(j,t) are deterministic scales.
δ(j,t) are independent over time t.
δ(j,t) are independent of η_j for all t and j.
η_j ~ N(0, κ), δ(j,t) ~ N(0, 1 - κ), 0 ≤ κ ≤ 1
A positive κ means that there is systematic error in the time trend of the rate.

UPE: age-specific fertility rates
We assumed that κ = 0: random walk, non-correlated error increments
ε(j,t) = S(j,t) δ(j,t)
δ(j,t) i.i.d.
~ N(0, 1)

Example: Italy, population aged 0 in 2050
- Expected value = 474,000
- Median = 420,000
- Standard deviation = 261,000
- Coefficient of variation = 0.55

Alternative assumption: κ = 0.05
Italy, population aged 0 in 2050
- Expected value = 678,000
- Median = 457,000
- Standard deviation = 794,000
- Coefficient of variation = 1.17
κ = 0.1 gives unrealistically wide prediction intervals for the population aged 0 in 2050

EEA+
• 15 EU countries: Austria, Belgium, Denmark, Finland, France, Germany, Greece, Italy, Ireland, Luxembourg, Netherlands, Portugal, Spain, Sweden, United Kingdom
• Iceland, Norway, Switzerland

[Figure: Total Fertility Rate in 18 European countries, 1900-2000, children per woman (0 to 6). Labelled series: Finland, Iceland, Ireland, France.]

[Figure: Net migration to the countries of the EEA+: upward trend. 1960-2000, millions (-1.0 to 1.5). Positive values: immigration surplus; negative values: emigration surplus.]

[Figure: Net migration to Italy, 1960-2000 (-200,000 to 200,000).]

UPE assumptions for net migration
• Increase to ca. 3.5 ‰ by 2050 for the whole of the EEA+
• Demand for labour (ageing, economic developments)
• North-South divide

[Figure: Life expectancy at birth, 18 European countries, men. 1900-2000, years (20 to 90). Labelled series: Iceland, Sweden, Portugal, France.]

[Figure: Life expectancy at birth, 18 European countries, women. 1900-2000, years (20 to 90). Labelled series: Norway, Portugal, Italy.]

UPE assumptions for mortality
• By 2030, mortality reductions in EEA+ countries will follow a common pattern
• Sex gap of life expectancy reduces to 4 years
• Life expectancy gains to 2050 of
  6.5 (NL) to 10 (Lux, Pt, E) years for men
  5.7 (NL) to 9.6 (EIR) years for women
• On average 2-3 years higher than Eurostat/UN

UPE life expectancies too high?
Historically, increases in European life expectancies have been under-estimated by
- 2 years (15 years ahead)
- 4.5 years (25 years ahead)
Record life expectancy is higher and increases faster than UPE
- ca. 0.23 years per calendar year

UPE assumptions for fertility
• Mediterranean and German-speaking countries: low
  - little catching up
  - problems with child care facilities, housing
  - preference for one child
  Total Fertility Rate = 1.4 c/w
• Western and Northern Europe
  Total Fertility Rate = 1.8 c/w
• Similar to Eurostat, on average 0.2 c/w lower than UN

UPE: probabilistic forecast
• Similar method as UN, Eurostat (cohort-component)
• But parameters are drawn from assumed distributions: simulation
• Volatility in fertility, mortality, migration
• Autocorrelations
• Correlations across ages, sexes, countries

[Figure: Population size, medians (black) and 80% prediction intervals (red), Norway and Sweden, 2000-2050, in millions. Chart annotations: SCB, SSB; 2050: 10.5 mln, 4.8 mln.]

[Figure: Age pyramid 2050, medians and 80% prediction intervals, Norway and Sweden. Men and women by five-year age group (0-4 to 95+), numbers in thousands (0 to 600).]

UPE assumptions Sweden 2050

        exp. value   80%L     80%H     SCB
TFR     1.80         1.12     2.89     1.85
e0M     84.7         80.3     88.7     83.6
e0F     89.4         84.2     94.3     86.2
migr    26 600       3 500    49 600   23 000
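The Sweden 2050 TFR figures above (1.80, with 80% bounds 1.12 and 2.89) can be checked against the earlier statement that rates are given normal distributions in the log scale. A rough sketch follows. It assumes that the bounds are the 10th and 90th percentiles of a lognormal distribution and that 1.80 plays the role of its median. Both are assumptions made here for illustration, and the function name is hypothetical.

```python
import math
import random
import statistics

def lognormal_params_from_interval(lower, upper, coverage=0.80):
    """Return (mu, sigma) of log(X) when [lower, upper] is a central interval of X."""
    # Standard normal quantile for the upper bound (0.90 for an 80% interval).
    z = statistics.NormalDist().inv_cdf(0.5 + coverage / 2.0)
    mu = (math.log(lower) + math.log(upper)) / 2.0
    sigma = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return mu, sigma

mu, sigma = lognormal_params_from_interval(1.12, 2.89)
median = math.exp(mu)  # median of the lognormal; the mean is exp(mu + sigma**2 / 2)

# Simulate TFR values and read the 10th and 90th percentiles back off the draws.
rng = random.Random(2050)
draws = [math.exp(rng.gauss(mu, sigma)) for _ in range(20000)]
deciles = statistics.quantiles(draws, n=10)
lo, hi = deciles[0], deciles[-1]

print(f"implied median = {median:.2f}, sigma of log(TFR) = {sigma:.2f}")
print(f"simulated 80% interval = ({lo:.2f}, {hi:.2f})")
```

Under these assumptions the midpoint of the two bounds in the log scale maps back to about 1.80, and the implied standard deviation of log(TFR) is about 0.37. The mean of such a lognormal distribution lies above its median, so reading the tabulated 1.80 as a median rather than as a mean is part of the assumption.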