Principles of stochastic population projection

Stochastic Population Forecasting
and
ARIMA time series modelling
Lectures QMSS Summer School, 2 July 2009
Nico Keilman
Department of Economics, University of Oslo
Stochastic
• Stochastic (from the Greek "Στόχος" for "aim" or
"guess") means random.
• A stochastic process is one whose behaviour is nondeterministic in that a system's subsequent state is
determined both by the process's predictable actions
and by a random element.
• In a stochastic population forecast, uncertainty is
made explicit: random variables are part of the
forecast model.
Stochastic population forecast
Future population, births, deaths and migrations are given
as probability distributions, not as one number (or perhaps
three)
Why Stochastic Population Forecasts (SPF)?
Users should be informed about the expected accuracy of the
forecast
- probability of alternative future paths?
- which forecast horizon is reasonable?
Traditional deterministic forecast variants (e.g. High, Medium,
Low)
- do not quantify uncertainty ⇒ Prob(MediumPop) = 0 !!
- give a misleading impression of uncertainty
(example later)
- leave room for politically motivated choices by the user
Outline
• Uncertainty of population forecasts
• Principles of SPF
• Time series models (selected examples)
• Alho’s scaled model for error
• Examples from UPE
• Using a SPF
Focus on national forecasts
How uncertain are population
forecasts?
Empirical findings – historical forecasts evaluated
against actual population numbers (ex post facto)
Main findings for official forecasts in
Western countries
• Uncertainty in forecasts of certain population
variables surprisingly large
• Forecasts for the young and the old age groups are
the least reliable
• Forecast errors increase as forecast interval
lengthens
• Large uncertainty for small countries
• Large uncertainty for countries that are strongly
affected by migration
• European forecasts have not become more accurate
since WW2
Errors in age structure forecasts
Europe
[Figure: Percentage errors in age structure forecasts for Europe, UN forecasts 1968-1990. Horizontal axis: age group (0-4 to 80+); vertical axis: error from -20% to 20% (positive: forecasts too high, negative: forecasts too low); series for 5, 10 and 15 years ahead.]
United Kingdom - men
[Figure: Percentage errors in age structure forecasts for the UK, GAD forecasts 1971-1994, men. Horizontal axis: age groups (0-4 to 85+); vertical axis: error from -30% to 30% (positive: forecasts too high, negative: forecasts too low); series for 10, 15 and 20 years ahead.]
United Kingdom - women
[Figure: Percentage errors in age structure forecasts for the UK, GAD forecasts 1971-1994, women. Horizontal axis: age groups (0-4 to 85+); vertical axis: error from -30% to 30% (positive: forecasts too high, negative: forecasts too low); series for 10, 15 and 20 years ahead.]
Why uncertain?
• Data quality (LDCs)
• Social science predictions, no accurate behavioural
theory
• Rely on observed regularities instead
→ Problems when sudden trend shifts occur
- stagnation of life expectancy for men in the 1950s
- baby boom/baby bust
Traditional population forecasts do
not give a correct impression of
uncertainty
Example: Old Age Dependency Ratio (OADR)
for Norway in 2060
Source: Statistics Norway population forecast of 2005

                      High    Middle   Low     |H-L|/M (%)
POP67+ (millions)     1.55    1.33     1.13    31
POP20-66 (millions)   4.03    3.39     2.83    36
OADR                  0.38    0.39     0.40     4
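The arithmetic behind the last column can be retraced with a short sketch. It uses only the rounded High/Middle/Low figures printed above, so the relative margins it produces may differ by about one percentage point from the published column, which was presumably computed from unrounded numbers. The variable and function names are illustrative.

```python
# Rounded High/Middle/Low values from the example above (millions).
variants = {
    "High":   {"pop67": 1.55, "pop20_66": 4.03},
    "Middle": {"pop67": 1.33, "pop20_66": 3.39},
    "Low":    {"pop67": 1.13, "pop20_66": 2.83},
}
ORDER = ("High", "Middle", "Low")


def relative_margin(high, middle, low):
    """Return |H - L| / M as a percentage."""
    return 100.0 * abs(high - low) / middle


# OADR = population 67+ divided by population 20-66, per variant.
oadr = {name: v["pop67"] / v["pop20_66"] for name, v in variants.items()}

margin_pop67 = relative_margin(*(variants[k]["pop67"] for k in ORDER))
margin_pop20_66 = relative_margin(*(variants[k]["pop20_66"] for k in ORDER))
margin_oadr = relative_margin(*(oadr[k] for k in ORDER))

print(f"POP67+   margin: {margin_pop67:.1f} %")
print(f"POP20-66 margin: {margin_pop20_66:.1f} %")
print(f"OADR     margin: {margin_oadr:.1f} %")
```

Because numerator and denominator move in the same direction across variants, the ratio barely changes, which is why the OADR margin comes out so much narrower than the margins of its two components.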
Two major problems
• Wide margins for some variables, narrow margins for
others
• Narrow margins in the short run,
wide margins in the long run
- implicitly assumed perfect autocorrelation (and
sometimes perfect correlation across components)
Coverage probabilities for H-L margin
of total population in official forecasts
Statistics Norway
Statistics Sweden
- Fertility
- Mortality
- Migration
2010
2050
47%
78%
19%
4%
1%
32%
20%
34%
Sources:
Stochastic population forecasts from UPE
Traditional forecasts from Statistics Norway and Statistics Sweden
Cohort-component method
Deterministic population forecast
Needed for the country in question:
annual assumptions on future
– Fertility → Total Fertility Rate
– Mortality → Life expectancy at birth M/F
– Migration → Net immigration
– as well as rates (fertility, mortality) & numbers (migration) by age &
sex
Stochastic Population Forecast: How?
• Cohort-component method
• Random rates for fertility and mortality, random
numbers for net-migration
• Normal distributions in the log scale (rates) or in the
original scale (migration numbers)
- expected values (“point predictions”) – cf. Medium
variant in traditional deterministic forecast
- standard deviations
- correlations (age, time, sex, components,
countries)
SPF: How? (continued)
• Joint distribution of all random input variables (rates,
migration numbers)
• In practice: simplifications, e.g.
- independence of components (fertility, mortality,
migration)
- correlation between male and female mortality
(constant across ages, time)
• One random draw from all probability distributions → one
sample path
• Repeated draws → thousands of sample paths
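The draw-and-repeat logic can be sketched with a toy cohort-component model. Everything numerical here is a hypothetical assumption for illustration: four 20-year age groups, one sex, a 20-year projection step, and errors that are independent over time and across components (a stronger simplification than the ones listed above). Rates get multiplicative errors that are normal in the log scale; net migration gets an additive normal error.

```python
import math
import random
import statistics

# Hypothetical starting population (millions) in 20-year age groups:
# 0-19, 20-39, 40-59, 60+. All numbers below are illustrative assumptions.
START_POP = [1.2, 1.3, 1.2, 0.9]
POINT_TFR = 1.8                               # point forecast, children per woman
POINT_DEATH_PROB = [0.01, 0.03, 0.15, 0.65]   # 20-year death probabilities
POINT_NET_MIGRATION = 0.10                    # millions per 20-year step


def sample_path(rng, steps=3):
    """One random draw of all inputs gives one sample path of total population."""
    pop = list(START_POP)
    totals = []
    for _ in range(steps):
        # Rates are random in the log scale (multiplicative errors).
        tfr = POINT_TFR * math.exp(rng.gauss(0.0, 0.15))
        death = [min(1.0, q * math.exp(rng.gauss(0.0, 0.10)))
                 for q in POINT_DEATH_PROB]
        # Net migration is random in the original scale (additive error).
        migration = rng.gauss(POINT_NET_MIGRATION, 0.05)

        # Toy assumption: all births occur to women aged 20-39 (half the group).
        births = tfr * pop[1] / 2.0
        pop = [
            births,
            pop[0] * (1.0 - death[0]) + migration,   # migrants enter at 20-39
            pop[1] * (1.0 - death[1]),
            pop[2] * (1.0 - death[2]) + pop[3] * (1.0 - death[3]),
        ]
        totals.append(sum(pop))
    return totals


rng = random.Random(2009)
paths = [sample_path(rng) for _ in range(2000)]      # repeated draws

final_totals = [path[-1] for path in paths]
deciles = statistics.quantiles(final_totals, n=10)
lower, upper = deciles[0], deciles[-1]               # 80% prediction interval
median = statistics.median(final_totals)
print(f"median {median:.2f} mln, 80% interval {lower:.2f} to {upper:.2f} mln")
```

The predictive distribution is simply the empirical distribution of the simulated paths, so any quantile or probability statement can be read off the sample.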
SPF: How? (continued)
Three main approaches: uncertainty parameters based
on
- historical errors
- expert knowledge
- statistical model
SPF: Examples
Multivariate time series models for all parameters of interest
Examples for Norway 1995-2050, see
http://folk.uio.no/keilman/6-15.pdf
and European countries 2003-2050, see
http://www.stat.fi/tup/euupe/index_en.html
Alho’s scaled model for error, implemented in PEP (Program for
Error Propagation)
Example for aggregate of 18 European countries 2003-2050,
see
http://www.stat.fi/tup/euupe/index_en.html
Time series example, Norway:
log(TFR) = ARIMA(1,1,0)
Z_t = 0.67 Z_{t-1} + ε_t, with standard error 0.10 for the coefficient 0.67,
where Z_t = log(TFR_t) - log(TFR_{t-1})
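A minimal simulation sketch of this model: the first differences of log(TFR) follow an AR(1) process with coefficient 0.67, and the level is obtained by cumulating the differences. The innovation standard deviation, the starting TFR and the starting difference are not given in the slides; the values below are assumptions chosen only to make the example run.

```python
import math
import random
import statistics

PHI = 0.67        # AR coefficient from the slide
SIGMA = 0.02      # ASSUMED innovation standard deviation (not in the source)
START_TFR = 1.8   # ASSUMED last observed TFR
START_Z = 0.0     # ASSUMED last observed change in log(TFR)


def simulate_tfr_path(rng, horizon=50, sigma=SIGMA):
    """Simulate one TFR path from an ARIMA(1,1,0) model for log(TFR)."""
    log_tfr = math.log(START_TFR)
    z = START_Z
    path = []
    for _ in range(horizon):
        z = PHI * z + rng.gauss(0.0, sigma)   # AR(1) on the first differences
        log_tfr += z                          # integrate once to get the level
        path.append(math.exp(log_tfr))
    return path


rng = random.Random(1)
paths = [simulate_tfr_path(rng) for _ in range(1000)]

final_tfr = [p[-1] for p in paths]
deciles = statistics.quantiles(final_tfr, n=10)
lower, upper = deciles[0], deciles[-1]        # 80% prediction interval
print(f"TFR after 50 years: 80% interval {lower:.2f} to {upper:.2f}")
```

Because the model is specified on the log scale, every simulated TFR is positive and the resulting prediction intervals are asymmetric around the point forecast.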
Prediction intervals, age-specific fertility rates, Norway 2050
Time series models for
• parameters of Gamma model for age-specific fertility
(TFR, MAC, variance in age at childbearing)
• e0
• parameters of Heligman-Pollard model for age-specific mortality
• immigration numbers
• emigration numbers
(deterministic age patterns for both migration flows)
→ 5000 simulations
Population size, Norway
[Figure: population size, Norway, 2000-2050; vertical axis from 3,800,000 to 6,200,000.]
100 simulations of population size,
Norway 2003-2050
[Figure: 100 simulated sample paths, 2000-2050; vertical axis from 3,000,000 to 10,000,000.]
Time series models, two examples
1. Autoregressive model of order 1 - AR(1)
Z_t = φ Z_{t-1} + ε_t
|φ| < 1, ε_t i.i.d. random variables, zero expectation, constant
variance ("white noise")
Var(Z_t) = Var(ε_t) (1 - φ^{2t}) / (1 - φ^2), constant in the long run (large t)
For large t:
k-step ahead autocorrelation Corr(Z_t, Z_{t+k}) equals φ^k, independent
of time
2. Random Walk - RW
Z_t = Z_{t-1} + ε_t
Var(Z_t) = t·Var(ε_t), unbounded for large t
Independent increments (zero autocorrelation)
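The contrast between the two variance formulas can be checked numerically. The sketch below assumes both processes start from a known value Z_0 (which is what the AR(1) formula above implies) and verifies the closed form against the one-step recursion Var(Z_t) = φ^2 Var(Z_{t-1}) + Var(ε_t). Function names are illustrative.

```python
def ar1_variance(phi, sigma2, t):
    """Closed-form Var(Z_t) for an AR(1) process started from a fixed Z_0."""
    return sigma2 * (1.0 - phi ** (2 * t)) / (1.0 - phi ** 2)


def ar1_variance_recursive(phi, sigma2, t):
    """Same quantity from the recursion Var(Z_t) = phi^2 Var(Z_{t-1}) + sigma2."""
    var = 0.0
    for _ in range(t):
        var = phi ** 2 * var + sigma2
    return var


def rw_variance(sigma2, t):
    """Var(Z_t) for a random walk started from a fixed Z_0."""
    return t * sigma2


def ar1_autocorrelation(phi, k):
    """Long-run k-step-ahead autocorrelation of an AR(1) process."""
    return phi ** k


# AR(1) variance levels off at sigma2 / (1 - phi^2); random-walk variance keeps growing.
for t in (1, 10, 50):
    print(t, round(ar1_variance(0.8, 1.0, t), 3), rw_variance(1.0, t))
```

For forecasting, this is the practical difference: an AR(1) model yields prediction intervals that stop widening after some horizon, whereas a random walk yields intervals that widen without limit.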
Forecasts and 95% prediction intervals for net migration. Data 1960-2000
Outliers: 1989
AR(1) & const: Z_t = 5688 + 0.76 Z_{t-1} + ε_t
Outliers: 1962, 1988
AR(1) & const: Z_t = 7819 + 0.39 Z_{t-1} + ε_t
Forecasts and 67%, 80%, and 95% prediction intervals for the TFR.
Data 1950-2000.
Observed TFR-value for the year 2000 is given as “y2000”
Model: AR(1) & constant
Z_t (= log TFR_t) = 0.001 + 0.988 Z_{t-1} + ε_t
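For an AR(1) model with constant, Z_t = c + φ Z_{t-1} + ε_t, the k-step-ahead point forecast and prediction intervals have closed forms. The sketch below uses c = 0.001 and φ = 0.988 from the model above; the innovation standard deviation and the last observed TFR are not stated in the slides and are assumptions. It also ignores parameter estimation uncertainty, so it is an illustration rather than a reproduction of the plotted intervals.

```python
import math
from statistics import NormalDist


def ar1_forecast(z0, k, c=0.001, phi=0.988, sigma=0.03):
    """Mean and standard deviation of Z_{t+k} given Z_t = z0.

    sigma is an ASSUMED innovation standard deviation (not in the source).
    """
    mean = c * (1.0 - phi ** k) / (1.0 - phi) + phi ** k * z0
    var = sigma ** 2 * (1.0 - phi ** (2 * k)) / (1.0 - phi ** 2)
    return mean, math.sqrt(var)


def tfr_interval(z0, k, level):
    """Prediction interval for TFR (not log TFR) at the given coverage level."""
    mean, sd = ar1_forecast(z0, k)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return math.exp(mean - z * sd), math.exp(mean + z * sd)


z_last = math.log(1.85)   # ASSUMED last observed TFR of 1.85
for level in (0.67, 0.80, 0.95):
    low, high = tfr_interval(z_last, k=50, level=level)
    print(f"{int(level * 100)}% interval, 50 years ahead: {low:.2f} to {high:.2f}")
```

With φ this close to 1, the interval width after 50 years is still far from its long-run limit, so the model behaves almost like a random walk over typical forecast horizons.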
Forecasts and 67%, 80%, and 95% prediction intervals for the TFR.
Data 1900-2000.
Observed TFR-value for the year 2000 is given as “y2000”
Model: AR(1) & constant
Outliers 1920, 1942
Z_t (= log TFR_t) = -0.003 + 0.995 Z_{t-1} + ε_t
Forecasts and 67%, 80%, and 95% prediction intervals for the TFR.
Data 1950-2000.
Observed TFR-value for the year 2000 is given as “y2000”
Model: AR(2) & constant
Z_t (= log TFR_t) = 0.002 + 0.941 Z_{t-1} - 0.408 Z_{t-2} + ε_t
Forecasts and 67%, 80%, and 95% prediction intervals for the TFR.
Data 1900-2000.
Observed TFR-value for the year 2000 is given as “y2000”
Model: AR(2)-ARCH(1)
Outliers 1919, 1920, 1940, 1941
Z_t (= log TFR_t) = 0.005 + 0.981 Z_{t-1} + v_t + dummies
v_t = 0.214 v_{t-2} + ε_t,
ε_t = √(h_t) e_t,
h_t = 7E-4 + 0.708 ε_{t-1}^2
Time series approach to SPF
+ conceptually simple
- inflexible
Alternative: Alho’s scaled model for error
Implemented in Program for Error Propagation (PEP)
http://www.joensuu.fi/statistics/software/pep/pepstart.htm
Scaled model for error
Suppose the true age-specific rate in age j during
forecast year t > 0 is of the form
R(j,t) = F(j,t)exp(X(j,t)),
where F(j,t) is the point forecast, and X(j,t) is the
relative error
Suppose that the error processes are of the form
X(j,t) = ε(j,1) + ... + ε(j,t)
with error increments of the form
ε(j,t) = S(j,t)(η_j + δ(j,t))
S(j,t) are deterministic scales. The δ(j,t) are independent over time t, and
independent of η_j for all t and j
η_j ~ N(0, κ), δ(j,t) ~ N(0, 1 - κ), 0 ≤ κ ≤ 1
Note that Var(ε(j,t)) = S(j,t)^2
A positive kappa means that there is systematic error in the time
trend of the rate.
κ = Corr[ε(j,t), ε(j,t+h)] for all h > 0,
thus κ is the (constant) autocorrelation between the
error increments.
Together, the autocorrelation κ and the scale S(j,t)
determine the variance of the relative error X(j,t).
Ex. 1. Under a random walk model the error increments
are uncorrelated with κ = 0.
Ex. 2. The model with constant scales (S(j,t)=S(j)) can
be interpreted as a random walk with a random drift.
The relative importance of the two components is
determined by κ.
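The two stated properties of the error increments, Var(ε(j,t)) = S(j,t)^2 and Corr[ε(j,t), ε(j,t+h)] = κ, can be checked by simulation. This is a sketch for a single age j with hypothetical values for κ and the scales; it is not the PEP implementation.

```python
import math
import random


def simulate_increments(rng, scales, kappa):
    """Draw eps(j,1..T) = S(j,t) * (eta_j + delta(j,t)) for one age j."""
    eta = rng.gauss(0.0, math.sqrt(kappa))     # Var(eta_j) = kappa
    sd_delta = math.sqrt(1.0 - kappa)          # Var(delta(j,t)) = 1 - kappa
    return [s * (eta + rng.gauss(0.0, sd_delta)) for s in scales]


def sample_variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


def sample_correlation(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)


KAPPA = 0.3                    # hypothetical autocorrelation
SCALES = [0.05, 0.08, 0.10]    # hypothetical scales S(j,1), S(j,2), S(j,3)

rng = random.Random(42)
draws = [simulate_increments(rng, SCALES, KAPPA) for _ in range(20000)]
eps1 = [d[0] for d in draws]
eps3 = [d[2] for d in draws]

corr_13 = sample_correlation(eps1, eps3)   # should be close to KAPPA
var_1 = sample_variance(eps1)              # should be close to SCALES[0] ** 2
print(f"corr = {corr_13:.3f}, var = {var_1:.5f}")
```

The shared term η_j is what ties the increments together: it is drawn once per path, so every increment on that path is pushed in the same direction, which is the "random drift" interpretation in Ex. 2.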
Migration
Migration (net) is represented in absolute terms
Dependence on age is deterministic, given by a fixed
distribution g(j,x) over age x
The error of net migration in age x, for sex j, during
year t > 0, is additive and of the form
Y(j,x,t) = S(j,t)g(j,x)(η_j + δ(j,t))
Key properties of the scaled model
• The choice of the scales S(j,t) is unrestricted.
Hence any sequence of non-decreasing error
variances can be matched (e.g. heteroscedasticity)
• Any sequence of cross-correlations over ages can
be majorized using the AR(1) models of correlation
• Any sequence of autocorrelations for the error
increments can be majorized.
Scaled model for error
Used for UPE project: Uncertain Population of Europe
• 18 countries: EU15 + Iceland, Norway, Switzerland
(EEA+)
• 2003 – 2050
• Probability distributions specified on the basis of
- time series analysis (TFR, e0, net-migr.)
- empirical forecast errors
- expert judgement
• 3000 simulations for each country, PEP
• http://www.stat.fi/tup/euupe/index_en.html
Population size EEA+
median (black), 80% prediction intervals (red)
[Figure: population in millions (300 to 500), 2000-2050.]
77% chance > 400 million in 2050 (UN)
83% chance > 392 million in 2050 (2003)
Age pyramids, EEA+ countries
[Figure: Age pyramid for 2050, men and women by five-year age group (0-4 to 95+), numbers in millions (0 to 20); median (black), 80% prediction intervals (red).]
[Figure: b. 2003, age pyramid with the same age groups, numbers in millions (0 to 20).]
How to use SPF results?
User’s Loss function
What are the costs associated with underpredictions/
overpredictions of certain sizes?
Loss function, stylized example
F = forecast
O = observed
Loss = c·(F - O) if F > O
Loss = λ·c·(O - F) if F < O
(c, λ > 0)
λ characterizes the degree of asymmetry in the loss function
λ > 1: underprediction is more severe than overprediction
Forecast F is a stochastic variable with a predictive distribution
Hence Loss is a stochastic variable, which has a distribution
Compute expected Loss
Pick that value of F which minimizes expected Loss
→ The optimal F is that value of F at which the statistical distribution
function equals λ/(λ+1)
λ = 1: median value of F
λ > 1: optimal F is larger than the median
Optimal choice for e62, with e62 ~ Normal(20, stdev)
[Figure: optimal e62 in years (14 to 26) as a function of lambda (0.1 to 10), for stdev = 2 years and stdev = 4 years. λ > 1: underprediction is more severe than overprediction.]
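The quantile rule for the optimal forecast, F at the point where the predictive distribution function equals λ/(λ+1), can be sketched for the Normal(20, stdev) example. The function names are illustrative, and the numerical integration serves only as a sanity check that the quantile does minimize expected loss.

```python
from statistics import NormalDist


def optimal_forecast(lam, mean=20.0, stdev=2.0):
    """Quantile of the predictive distribution at probability lam / (lam + 1)."""
    return NormalDist(mean, stdev).inv_cdf(lam / (lam + 1.0))


def expected_loss(f, lam, mean=20.0, stdev=2.0, c=1.0, n=4000):
    """Expected loss of forecast f, by trapezoidal integration over outcomes O."""
    dist = NormalDist(mean, stdev)
    lo, hi = mean - 8.0 * stdev, mean + 8.0 * stdev
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        o = lo + i * dx
        loss = c * (f - o) if f > o else lam * c * (o - f)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * loss * dist.pdf(o) * dx
    return total


for lam in (0.5, 1.0, 3.0, 10.0):
    for sd in (2.0, 4.0):
        print(f"lambda={lam}, stdev={sd}: optimal e62 = {optimal_forecast(lam, stdev=sd):.2f}")
```

A user for whom underprediction is three times as costly as overprediction (λ = 3) should therefore plan with the 75% quantile of the predictive distribution, not with the median; the wider the distribution, the further that quantile lies from the median.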
Important
Are overpredictions more/less harmful than
underpredictions?
Challenges
• Multi-state forecasts (sub-national, household)
• Limited data
• Educate the users
Thank you!
Autocorrelation of error increments
The error processes are of the form
X(j,t) = ε(j,1) + ... + ε(j,t)
with error increments of the form
ε(j,t) = S(j,t)(η_j + δ(j,t))
S(j,t) are deterministic scales. The δ(j,t) are independent over time t, and
independent of η_j for all t and j
η_j ~ N(0, κ), δ(j,t) ~ N(0, 1 - κ), 0 ≤ κ ≤ 1
A positive kappa means that there is systematic error in the time
trend of the rate.
UPE: age specific fertility rates
We assumed that kappa = 0 → random walk, non-correlated error
increments
ε(j,t) = S(j,t)δ(j,t)
δ(j,t) i.i.d. ~ N(0, 1)
Example Italy Pop aged 0 in 2050:
- Expected value = 474,000
- Median = 420,000
- Standard deviation = 261,000
- Coefficient of variation = 0.55
Alternative assumption: kappa = 0.05
Italy Pop aged 0 in 2050:
- Expected value = 678,000
- Median = 457,000
- Standard deviation = 794,000
- Coefficient of variation = 1.17
Kappa = 0.1 gives unrealistically wide prediction
intervals for Pop aged 0 in 2050
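Why such a small kappa has such a large effect follows from the variance of the cumulated error. Writing X(j,t) = η_j·ΣS(j,s) + ΣS(j,s)δ(j,s), the variance is κ·(ΣS)^2 + (1 - κ)·ΣS^2, so the κ term grows with the square of the horizon. The sketch below assumes constant unit scales over 47 annual steps (2003 to 2050), which is an illustrative assumption and not the UPE specification; it describes the relative error of a rate on the log scale, not the population aged 0 itself.

```python
import math


def relative_error_variance(scales, kappa):
    """Var(X(j,t)) for X = sum of S(j,s) * (eta_j + delta(j,s)) over the horizon."""
    total = sum(scales)
    total_sq = sum(s * s for s in scales)
    return kappa * total ** 2 + (1.0 - kappa) * total_sq


HORIZON = 47                 # annual steps from 2003 to 2050
scales = [1.0] * HORIZON     # ASSUMED constant unit scales

base = relative_error_variance(scales, 0.0)
for kappa in (0.0, 0.05, 0.1):
    var = relative_error_variance(scales, kappa)
    ratio = math.sqrt(var / base)
    print(f"kappa={kappa}: variance {var:.1f}, std. dev. {ratio:.2f} x the kappa=0 value")
```

Under these assumptions, moving from kappa = 0 to kappa = 0.05 nearly doubles the standard deviation of the log-scale error after 47 years, which is in line with the much wider distribution reported for the alternative assumption.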
EEA+
• 15 EU countries:
Austria, Belgium, Denmark, Finland, France,
Germany, Greece, Italy, Ireland, Luxembourg,
Netherlands, Portugal, Spain, Sweden, United
Kingdom
• Iceland, Norway, Switzerland
Total Fertility Rate in 18 European countries
[Figure: TFR in children per woman (ch/w, 0 to 6), 1900-2000; labelled series include Finland, Iceland, Ireland and France.]
Net migration to the countries of the EEA+:
upward trend
[Figure: net migration in millions (-1.0 to 1.5), 1960-2000; positive values: immigration surplus, negative values: emigration surplus.]
Net migration to Italy
[Figure: net migration (-200,000 to 200,000), 1960-2000.]
UPE assumptions for net migration
• Increase to ca. 3.5 ‰ by 2050 for the whole of the
EEA+
• Demand for labour (ageing, economic
developments)
• North – South divide
Life expectancy at birth, 18 European countries, men
[Figure: life expectancy in years (20 to 90), 1900-2000; labelled series include Iceland, Sweden, Portugal and France.]
Life expectancy at birth, 18 European countries, women
[Figure: life expectancy in years (20 to 90), 1900-2000; labelled series include Norway, Portugal and Italy.]
UPE assumptions for mortality
• By 2030, mortality reductions in EEA+ countries will follow a
common pattern
• Sex gap of life expectancy reduces to 4 years
• Life expectancy gains to 2050 of
6.5 (NL) to 10 (Lux, Pt, E) years for men
5.7 (NL) to 9.6 (EIR) years for women
• On average 2-3 years higher than Eurostat/UN
UPE life expectancies too high?
Historically, increases in European life expectancies
have been under-estimated by
- 2 years (15 years ahead)
- 4.5 years (25 years ahead)
Record life expectancy is higher & increases faster
than UPE - ca. 0.23 years per calendar year
UPE assumptions for fertility
• Mediterranean and German speaking countries low
- little catching up
- problems with child care facilities, housing
- preference for one child
Total Fertility Rate = 1.4 c/w
• Western and Northern Europe
Total Fertility Rate = 1.8 c/w
• Similar to Eurostat, on average 0.2 c/w lower than UN
UPE: probabilistic forecast
• Similar method as UN, Eurostat (cohort-component)
• But parameters are drawn from assumed
distributions -- simulation
• Volatility in fertility, mortality, migration
• Autocorrelations
• Correlations across ages, sexes, countries
Population size
medians (black) and 80% prediction intervals (red)
[Figure: two panels, Norway and Sweden, population in millions (4 to 12), 2000-2050; annotations SCB and SSB, with labels 10.5 mln and 4.8 mln for 2050.]
Age pyramid 2050
medians & 80 % prediction intervals
[Figure: two panels, Norway and Sweden; men and women by five-year age group (0-4 to 95+), numbers in thousands (0 to 600).]
UPE assumptions Sweden, 2050

        exp. value   80%L     80%H     SCB
TFR     1.80         1.12     2.89     1.85
e0M     84.7         80.3     89.4     83.6
e0F     88.7         84.2     94.3     86.2
migr    26 600       3 500    49 600   23 000