Stochastic Parametrisation and Model Uncertainty
Hannah Mary Arnold
Jesus College
University of Oxford
A thesis submitted for the degree of
Doctor of Philosophy
Trinity Term 2013
Stochastic Parametrisation and Model Uncertainty
Hannah Mary Arnold, Jesus College
Submitted for the degree of Doctor of Philosophy, Trinity Term 2013
Abstract
Representing model uncertainty in atmospheric simulators is essential for the production
of reliable probabilistic forecasts, and stochastic parametrisation schemes have been proposed
for this purpose. Such schemes have been shown to improve the skill of ensemble forecasts,
resulting in a growing use of stochastic parametrisation schemes in numerical weather prediction. However, little research has explicitly tested the ability of stochastic parametrisations
to represent model uncertainty, since the presence of other sources of forecast uncertainty has
complicated the results.
This study seeks to provide firm foundations for the use of stochastic parametrisation
schemes as a representation of model uncertainty in numerical weather prediction models.
Idealised experiments are carried out in the Lorenz ’96 (L96) simplified model of the atmosphere, in which all sources of uncertainty apart from model uncertainty can be removed.
Stochastic parametrisations are found to be a skilful way of representing model uncertainty
in weather forecasts in this system. Stochastic schemes which have a realistic representation of model error produce reliable forecasts, improving on the deterministic and the more
“traditional” perturbed parameter schemes tested.
The potential of using stochastic parametrisations for simulating the climate is also considered; this is an area in which there has been little research. A significant improvement is observed when
stochastic parametrisation schemes are used to represent model uncertainty in climate simulations in the L96 system. This improvement is particularly pronounced when considering
the regime behaviour of the L96 system — the stochastic forecast models are significantly
more skilful than using a deterministic perturbed parameter ensemble to represent model uncertainty. The reliability of a model at forecasting the weather is found to be linked to that
model’s ability to simulate the climate, providing some support for the seamless prediction
paradigm.
The lessons learned in the L96 system are then used to test and develop stochastic and
perturbed parameter representations of model uncertainty for use in an operational numerical
weather prediction model, the Integrated Forecasting System (IFS). A particular focus is on
improving the representation of model uncertainty in the convection parametrisation scheme.
Perturbed parameter schemes are tested, which improve on the operational stochastic scheme
in some regards, but are not as skilful as a new generalised version of the stochastic scheme.
The proposed stochastic scheme has a potentially more realistic representation of model error
than the operational scheme, and improves the reliability of the forecasts.
While studying the L96 system, it was found that there is a need for a proper score which
is particularly sensitive to forecast reliability. A suitable score is proposed and tested, before
being used for verification of the forecasts made in the IFS.
This study demonstrates the power of using stochastic over perturbed parameter representations of model uncertainty in weather and climate simulations. It is hoped that these
results motivate further research into physically-based stochastic parametrisation schemes, as
well as triggering the development of stochastic Earth-system models for probabilistic climate
prediction.
Acknowledgements
I have benefitted from the help and advice of many over the course of my D.Phil.
Firstly, I would like to thank my supervisors, Tim Palmer and Irene Moroz, for all
their insightful comments, support and guidance over the last three years. I have
been very fortunate to have such excellent supervisors.
I have really enjoyed working at AOPP, and have never had far to go for advice. Thank you in particular to Andrew Dawson for being a first-rate office-mate,
and for his help with all things computational. Thanks to Peter Düben, Fenwick
Cooper and Hugh McNamara for many interesting conversations, and thanks to
Laure Zanna and Lesley Gray for their useful comments during my transfer and
confirmation of status vivas. Thanks also to Ed Gryspeerdt and Peter Watson, for
many hours of excellent discussion on life, the universe and The Simpsons.
I would like to thank everyone at ECMWF for their support. In particular, I am
grateful to Antje Weisheimer for all her time and patience spent explaining the
details of working with the IFS. I want to thank Paul Dando for his help with
running the IFS from Oxford, and for all his work which made it possible. I
also want to thank Alfons Callado Pallares for providing me with his SPPT code,
and Martin Leutbecher for many useful discussions about SPPT and advice on
developing Alfons’ work. Thanks to Sarah-Jane Lock for running the high resolution
experiments for me, and to Heikki Järvinen, Pirkka Ollinaho and Peter Bechtold
for providing me with the parameter uncertainty information for my perturbed
parameter scheme. Thanks also to Glenn Shutts and Simon Lang for many helpful
discussions on stochastic parametrisation schemes.
I want to thank Paul Williams for his continued interest in my work — I always
come away from our meetings with an improved understanding and with lots of
new ideas. I want to thank Jochen Bröcker, Chris Ferro and Martin Leutbecher for
teaching me about proper scoring rules, and Cecile Penland for teaching me about
stochastic processes. I have also enjoyed many statistical discussions with Dan
Rowlands, Dan Cornford and Jonty Rougier, for which I am very grateful. Thanks
must go to everyone who has helped me improve my thesis by commenting on
various chapters: Tim Palmer, Irene Moroz, David Arnold, Peter Düben, Fenwick
Cooper, Sarah-Jane Lock, Heikki Järvinen, Antje Weisheimer and Andrew Dawson.
On a personal note, thank you to my parents for always supporting my latest
endeavour, and for always being there for me. Thanks to all the people who have
made my time in Oxford a happy one: to friends from Jesus College, AOPP, Aldates
and Caltech. In particular, a big thank you to Nicola Platt, and to Benjamin
Winter, Matthew Moore and Duncan Hardy for being excellent flatmates.
Finally, thank you Nikolaj, for your limitless encouragement, love and support. I
truly couldn’t have done it without you.
Abbreviations
1DD 1 Degree Daily YOTC dataset
A Additive noise stochastic parametrisation used in Chapters 2 and 3
ALARO Aire Limitée Adaptation/Application de la Recherche à l’Opérationnel
AMIP Atmospheric Model Intercomparison Project
AR(1) First Order Autoregressive
BS Brier Score
BSS Brier Skill Score, usually calculated with respect to climatology.
CA Cellular Automaton
CAPE Convectively Available Potential Energy
CASBS Cellular Automaton Stochastic Backscatter Scheme
CCN Cloud Condensation Nuclei
CIN Convective Inhibition
CMIPn Coupled Model Intercomparison Project, Phase n
CONV IFS Convection parametrisation scheme
CONVi CONV perturbed independently using SPPT
CRM Cloud Resolving Model
DEMETER Development of a European Multimodel Ensemble system for seasonal to inTERannual prediction
ECMWF European Centre for Medium-Range Weather Forecasts
EDA Ensembles of Data Assimilation
ENSO El Niño Southern Oscillation
EOF Empirical Orthogonal Function
EPS Ensemble Prediction System
EUROSIP European Seasonal to Interannual Prediction project
GCM General Circulation Model
GLOMAP Global Model of Aerosol Processes
GPCP Global Precipitation Climatology Project
IFS Integrated Forecasting System — the ECMWF global weather forecasting model
IGN Ignorance Score
IGNL Ignorance Score calculated following Leutbecher (2010)
IGNSS Ignorance Skill Score, usually calculated with respect to climatology
IPCC AR4 Intergovernmental Panel on Climate Change’s fourth assessment report
ITCZ Intertropical Convergence Zone
KL Kullback-Leibler Divergence
KS Kolmogorov-Smirnov Statistic
LES Large Eddy Simulation
LSWP IFS Large Scale Water Processes (clouds) parametrisation scheme
LSWPi LSWP perturbed independently using SPPT
L96 The Lorenz ’96 System — the second model described in Lorenz (1996)
M Multiplicative noise stochastic parametrisation used in Chapters 2 and 3
MA Multiplicative and Additive noise stochastic parametrisation used in Chapters 2 and 3
MME Multi-Model Ensemble
MOGREPS Met Office Global and Regional Ensemble Prediction System
MTU Model Time Units in the Lorenz ’96 system. One MTU corresponds to approximately
five atmospheric days.
NAO North Atlantic Oscillation
NCEP National Centers for Environmental Prediction
NOGW IFS Non-Orographic Gravity Wave Drag parametrisation scheme
NOGWi NOGW perturbed independently using SPPT
NWP Numerical Weather Prediction
PC Principal Component
pdf Probability Density Function
PPT Precipitation
RDTT IFS Radiation parametrisation scheme
RDTTi RDTT perturbed independently using SPPT
REL Reliability component of the Brier Score
RMS Root Mean Square
RMSE RMS Error
RPS Ranked Probability Score
RPSS Ranked Probability Skill Score, usually calculated with respect to climatology
SCM Single Column Model
SD State Dependent additive noise stochastic parametrisation used in Chapters 2 and 3
SKEB Stochastic Kinetic Energy Backscatter
SME Single-Model Ensemble
SPPT Stochastically Perturbed Parametrisation Tendencies
SPPTi Independent Stochastically Perturbed Parametrisation Tendencies
TCWV Total Column Water Vapour
TGWD IFS Turbulence and Gravity Wave Drag parametrisation scheme
TGWDi TGWD perturbed independently using SPPT
THORPEX The Observing-System Research and Predictability Experiment
T159 IFS spectral resolution - triangular truncation of 159
T850 Temperature at 850 hPa
U200 Zonal wind at 200 hPa
U850 Zonal wind at 850 hPa
UM Unified Model — the U.K. Met Office weather forecasting model
WCRP World Climate Research Programme
WWRP World Weather Research Programme
YOTC Year of Tropical Convection
Z500 Geopotential height at 500 hPa
Contents

Abstract

1 Introduction
1.1 Why are Atmospheric Models Useful?
1.2 The need for parametrisation
1.3 Predicting Predictability: Uncertainty in Atmospheric Models
1.3.1 Multi-model Ensembles
1.3.2 Multiparametrisation
1.3.3 Perturbed Parameters
1.4 Stochastic Parametrisations
1.4.1 Proof of concept: Stochastic Parametrisations in the Lorenz ’96 System
1.4.2 Stochastic Parametrisation of Convection
1.4.3 Developments in operational NWPs
1.5 Comparison with Other Representations of Model Uncertainty
1.6 Probabilistic Forecasts and Decision Making
1.7 Evaluation of Probabilistic Forecasts
1.7.1 Scoring Rules
1.7.2 Other Scalar Forecast Summaries
1.7.3 Graphical Verification Techniques
1.8 Outline of Thesis
1.9 Statement of Originality
1.10 Publications

2 The Lorenz ’96 System: Initial Value Problem
2.1 Introduction
2.2 The Lorenz ’96 System
2.3 Description of the Experiment
2.3.1 “Truth” model
2.3.2 Forecast model
2.4 Weather Forecasting Skill
2.5 Representation of Model Uncertainty
2.6 Perturbed Parameter Ensembles in the Lorenz ’96 System
2.6.1 Weather Prediction Skill
2.7 Conclusion

3 The Lorenz ’96 System: Climatology and Regime Behaviour
3.1 Introduction
3.2 Climatological Skill: Reproducing the pdf of the Atmosphere
3.2.1 Perturbed Parameter Ensemble
3.3 Climatological Skill: Regime Behaviour
3.3.1 Data and Methods
3.3.2 The True Attractor
3.3.3 Simulating the Attractor
3.3.4 Simulating Regime Statistics
3.4 Conclusion

4 Evaluation of Ensemble Forecast Uncertainty: The Error-Spread Score
4.1 Introduction
4.2 Evaluation of Ensemble Forecasts
4.3 The Error-Spread Score
4.4 Propriety of the Error-Spread Score
4.5 Decomposition of the Error-Spread Score
4.6 Testing the Error-Spread Score: Evaluation of Forecasts in the Lorenz ’96 System
4.7 Testing the Error-Spread Score: Evaluation of Medium-Range Forecasts
4.8 Evaluation of Reliability, Resolution and Uncertainty for EPS forecasts
4.9 Application to Seasonal Forecasts
4.10 Conclusion

5 Experiments in the IFS: Perturbed Parameter Ensembles
5.1 Introduction
5.2 The Integrated Forecasting System
5.2.1 Parametrisation Schemes in the IFS
5.3 Uncertainty in Convection: Generalised SPPT
5.4 Perturbed Parameter Approach to Uncertainty in Convection
5.4.1 Perturbed Parameters and the EPPES
5.4.2 Method
5.5 Experimental Procedure
5.5.1 Definition of Verification Regions
5.5.2 Chosen Diagnostics
5.6 Verification of Forecasts
5.6.1 Verification in Non-Convecting Regions
5.6.2 Verification in Convecting Regions
5.7 Discussion and Conclusion

6 Experiments in the IFS: Independent SPPT
6.1 Motivation
6.2 Global Diagnostics
6.3 Effect of Independent SPPT in Tropical Areas
6.4 Convection Diagnostics
6.4.1 Precipitation
6.4.2 Total Column Water Vapour
6.5 Individually Independent SPPT
6.6 High Resolution Experiments
6.6.1 Global Diagnostics
6.6.2 Verification in the Tropics
6.7 Discussion and Conclusion

7 Conclusion

A Skill Score Significance Testing
A.1 Weather Forecasts in the Lorenz ’96 System
A.2 Simulated Climate in the Lorenz ’96 System
A.3 Skill Scores for the IFS
A.3.1 Experiments in the IFS

B The Error-spread Score: A Proper Score
B.1 Derivation of the Form of the Error-Spread Score
B.2 Confirmation of Propriety of the Error-spread Score
B.3 Decomposition of the Error-spread Score
B.4 Mathematical Properties of Moments

Bibliography
1 Introduction
Det er svært at spå, især om fremtiden.
(It is difficult to make predictions, especially about the future)
– Niels Bohr
1.1 Why are Atmospheric Models Useful?
Mankind has always wanted to understand and predict the weather. In 650 B.C., the Babylonians recorded the weather and predicted it in the short term from the appearance of
clouds (Nebeker, 1995). In 340 B.C., Aristotle wrote Meteorologica which included his theories
of the formation of winds, cloud, mist and dew. However, the earliest forecasts were not based
on theoretical descriptions of the weather, but were deduced by making records of observations, and identifying patterns in these records. With the birth of meteorological instruments
in the 17th Century, these records became quantifiable, and scientists such as Edmond Halley
proposed theories for the observed weather, such as the cause of the trade winds so important
for shipping (Halley, 1686). However, even up until the 1960s, pattern forecasting (or “analogues”) was promoted as a potential way to produce weather forecasts out to very long lead
times. Weather patterns are identified where the large-scale flow evolves similarly with time. If
a long enough historical record of the state of the atmosphere is maintained, the forecaster has
the (relatively) simple job of looking through the record for a day when the atmospheric state
looks the same as today, and then issuing the historical evolution of the atmosphere from that
state as today’s forecast. To allow this, catalogues were prepared of different weather regimes,
such as the Grosswetterlagen (Hess and Brezowsky, 1952). These qualitatively describe the
different major flow regimes of the atmosphere and their associated weather conditions.
Nevertheless, the atmosphere is inherently unpredictable, so such methods are doomed to
fail. The origin of this unpredictability is the chaotic nature of the atmosphere. Chaos theory
was first described by Lorenz in his seminal paper “Deterministic Nonperiodic Flow” (Lorenz,
1963). Chaos is a property of certain non-linear dynamical systems which exhibit a strong
sensitivity to their initial conditions. Two states of such a system, initially very close in phase
space, will diverge, making long term prediction impossible in general. The Lorenz (1963)
model is a set of three coupled equations which exhibit this chaotic behaviour, derived as a
truncation of Rayleigh-Bénard convection for a plane layer of fluid heated from below. Lorenz
did not agree with the notion that analogues would provide a way to predict the weather
months in advance, and successfully discredited the theory with his 1963 paper. By showing
that the behaviour of a simple deterministic system with just three variables could not be
predicted using analogues, he argued that neither could the atmosphere.
Given that the atmosphere is chaotic, more accurate weather forecasts can be derived by
acknowledging that the atmosphere is a fluid which obeys the Newtonian equations of motion. By bringing together observations and theory the weather can, in principle, be predicted
with higher accuracy than by using analogues. This theoretically based approach was first
proposed by Vilhelm Bjerknes in 1903, and first attempted by Lewis Fry Richardson, who solved the relevant partial differential equations by hand during the First World War (Richardson,
2007). Richardson’s six hour forecast took six weeks to compute, but due to the sparseness and
noisiness of the observations used to initialise the forecast, the result was very inaccurate, predicting pressure changes of 145 mb over the duration of the forecast (Nebeker, 1995). It was not
until the 1950s and 1960s with the birth of the electronic computer that numerical weather
prediction became practical and computational atmospheric models became indispensable.
A numerical weather prediction system provides a forecaster with a framework for using all
of his or her knowledge about the atmosphere to make a forecast. The atmospheric model uses
theoretical knowledge from fluid dynamics, thermodynamics, radiation physics, and numerical analysis to predict the future state of the atmosphere. Data from satellites, radiosondes
and ground based measurements are incorporated into the model using data assimilation and
provide the starting conditions for the forecast. By using data assimilation to combine our
observations with our theoretical knowledge of the atmosphere, we ensure that the models are
initialised from physically reasonable initial conditions, smoothing out the errors in the observations. With better starting conditions, Richardson’s forecast would have been significantly
improved, since high frequency gravity waves would not have been excited from imbalances in
the starting conditions (Lynch, 1992). The use of atmospheric models has unified the three
main fields of meteorology, with observationalists, theorists and forecasters all using atmospheric models to further their science and focus their research efforts (Nebeker, 1995).
1.2 The need for parametrisation
The Navier-Stokes equation (1.1), combined with the continuity equation (1.2) and equation
of state (1.3), describes the evolution of a fluid flow and forms the basis of all atmospheric
models:
\begin{align}
\rho \left( \frac{\partial \mathbf{u}}{\partial t} + \mathbf{u} \cdot \nabla \mathbf{u} \right) &= -\nabla p - \rho g \hat{\mathbf{k}} + \mu \nabla^2 \mathbf{u}, \tag{1.1} \\
\frac{\partial \rho}{\partial t} &= -\nabla \cdot (\rho \mathbf{u}), \tag{1.2} \\
p &= R_a T \rho, \tag{1.3}
\end{align}
where u is the fluid velocity, ρ is the fluid density, p is pressure, g is the gravitational acceleration, k̂ is the vertical unit vector, µ is the dynamic viscosity, T is the temperature, and
R_a is the gas constant per unit mass of air. In general, the Navier-Stokes equation cannot be
solved exactly. Instead, an approximate solution is obtained by discretising the equations, and
truncating scales below some scale in space and time. However, this leaves fewer equations of
motion than there are unknowns: the effect of the sub-grid scale variables on the grid scale
flow is required, but not explicitly calculated. This is the closure problem. Unknown variables
must be approximated in terms of known variables in order to complete the set of equations
and render them soluble.
In atmospheric models, closure is achieved through deterministically parametrising the
sub-grid scale processes as a function of the grid scale variables. The representation of these
processes often involves a conceptual representation of the physics involved (Jakob, 2010).
For example, convection is often represented by the mass-flux approximation, in which the
spectrum of clouds within a grid cell is represented by a single mean cloud. The grid cell
is assumed to be large enough to contain an ensemble of clouds but small enough that the
atmospheric variables are fairly constant within the grid box (Arakawa and Schubert, 1974).
This ensures the statistical effect of the cloud field on the grid scale variables is well represented
by the mean. In fact, this condition of convective quasi-equilibrium is rarely met in the
atmosphere, and deterministic parametrisations provide no way of estimating the uncertainty
due to such deficiencies.
The source of the problem can be found by considering (1.1). The Navier-Stokes equation
is scale invariant: if u(x, t), p(x, t) is a solution, then so too is:
\begin{align}
\mathbf{u}_\tau(\mathbf{x}, t) &= \tau^{-1/2}\, \mathbf{u}\!\left( \frac{\mathbf{x}}{\tau^{1/2}}, \frac{t}{\tau} \right), \tag{1.4} \\
p_\tau(\mathbf{x}, t) &= \tau^{-1}\, p\!\left( \frac{\mathbf{x}}{\tau^{1/2}}, \frac{t}{\tau} \right), \tag{1.5}
\end{align}
for any τ > 0 (Palmer, 2012); this scaling symmetry is only strictly true in the absence of gravity. The symmetry implies a power law spectrum of
energy in the flow, as observed in the atmosphere. Figure 1.1, taken from Nastrom and Gage
(1985), shows the atmospheric energy spectrum estimated from aircraft measurements of wind
and temperature. At smaller spatial scales (high wavenumbers, k) the spectral slopes are
approximately −5/3, while at larger scales the spectral slopes are close to −3. The −5/3 spectral
slope is as expected for a three dimensional turbulent flow (Kraichnan and Montgomery, 1980).
At larger scales, the rotation of the Earth inhibits velocity variations with height, resulting
in a quasi-two-dimensional turbulent flow (Kraichnan and Montgomery, 1980), which indeed
predicts the k^{-3} slope observed at large spatial scales.
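To see the invariance explicitly (a short check added here, not part of the original text), substitute (1.4) and (1.5) into (1.1), treating ρ as constant for the check and writing x′ = x/τ^{1/2}, t′ = t/τ. Every term acquires the common factor τ^{−3/2}:

\begin{align}
\rho \frac{\partial \mathbf{u}_\tau}{\partial t} &= \tau^{-3/2}\, \rho \frac{\partial \mathbf{u}}{\partial t'}, &
\rho\, (\mathbf{u}_\tau \cdot \nabla)\, \mathbf{u}_\tau &= \tau^{-3/2}\, \rho\, (\mathbf{u} \cdot \nabla')\, \mathbf{u}, \\
\nabla p_\tau &= \tau^{-3/2}\, \nabla' p, &
\mu \nabla^2 \mathbf{u}_\tau &= \tau^{-3/2}\, \mu \nabla'^2 \mathbf{u},
\end{align}

so the rescaled fields satisfy the same equation. The gravity term −ρg k̂ acquires no factor of τ, which is why the symmetry only holds strictly in the absence of gravity.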
Importantly, Figure 1.1 shows a continuous spectrum of energy in the atmospheric flow;
there is no scale with a low observed energy density marking the boundary between small and
large scales at which we should truncate. Whatever the truncation scale, there will always be
motion occurring just below that scale, so the statistical assumptions of Arakawa and Schubert
(1974), which form the basis of deterministic parametrisation schemes, will break down.
Figure 1.1: Power spectrum for wind and potential temperature near the tropopause, calculated from aircraft data. The spectra for meridional wind and temperature are shifted to the right. The plotted lines have slopes −3 and −5/3. Taken from Nastrom and Gage (1985). © American Meteorological Society. Used with permission.

An alternative approach is to use stochastic parametrisation schemes. These acknowledge that the sub-grid scale motion is not fully constrained by the grid scale variables, so the effect of the sub-grid on the grid scale cannot be represented as a function of the grid scale variables. Instead, random numbers are included in the equations of motion to represent one possible
evolution of the sub-grid scale. An ensemble of forecasts is generated to give an indication
of the uncertainty in the forecasts due to the simplifications and approximations made when
developing the atmospheric model. Furthermore, by using spatially and temporally correlated
noise, the effects of poorly resolved processes occurring at scales larger than the grid scale can
be accounted for, going beyond the traditional remit of parametrisation schemes. The coupling
of scales in a complex system means a successful parametrisation must represent the effects
of sub-grid scale processes acting on spatial and temporal scales greater than the truncation
level. Stochastic parametrisations are therefore more consistent with the power law scaling
observed in the atmosphere than traditional deterministic schemes.
1.3
Predicting Predictability: Uncertainty in Atmospheric Models
There are two main sources of error in atmospheric modelling: errors in the initial conditions
and errors in the model’s representation of the atmosphere (Slingo and Palmer, 2011). A single
deterministic forecast is of limited use as it gives no indication of how confident the forecaster
is in his or her prediction. Instead, an ensemble of forecasts should be generated which explores
these uncertainties, and a probabilistic forecast issued to the user.
The first source of uncertainty, initial condition uncertainty, arises in part from measurement limitations. These restrict the accuracy with which the starting state of the atmosphere
may be estimated[2]. The atmosphere is a chaotic system which exhibits a strong sensitivity
to its initial conditions (Lorenz, 1963): the non-linearity of the equations of motion describing the atmosphere results in error growth which is a function of the flow, and makes long
term prediction impossible in general (Lorenz, 1972). This uncertainty can be quantified by
initialising the ensemble of forecasts from perturbed initial conditions. These aim to represent
the probability density function (pdf) of initial error, and can be generated in such a way as
to capture the finite time linear instabilities of the flow using, for example, singular vectors
(Buizza and Palmer, 1995).
The second major source of uncertainty is model uncertainty, which stems from limitations
in the computational representation of the equations of motion of the atmosphere. The atmospheric model has a finite resolution and, as discussed above, sub-grid scale processes must
be represented through schemes which often grossly simplify the physics involved. For each
state of the resolved, macroscopic variables, there are many possible states of the unresolved
variables, so this parametrisation process is a significant source of forecast error. The large-scale equations must also be discretised in some way, which is a secondary source of error. If
only initial condition uncertainty is represented, the forecast ensemble is under-dispersive, i.e.
it does not accurately represent the error in the ensemble mean (e.g. Stensrud et al., 2000).
The verification frequently falls outside of the range of the ensemble; model uncertainty must
be included for a skilful forecast.
In this study, stochastic parametrisations are investigated as a way of accurately representing model uncertainty. However, before existing stochastic schemes are discussed in Section 1.4,
alternative methods of representing model uncertainty will be considered here.
[2] In fact, there can also be a significant model error component to initial condition uncertainty. At the European Centre for Medium-Range Weather Forecasts, the Ensembles of Data Assimilation (EDA) system is used to estimate the initial conditions for each forecast. The EDA system requires both measurements and a forecast model, so limitations in both contribute to initial condition uncertainty.

1.3.1 Multi-model Ensembles
There are many different weather forecasting centres, each developing its own Numerical
Weather Prediction (NWP) model. Initial condition perturbations allow for an ensemble
forecast to be made at each centre which represents the initial condition uncertainty. In a
multi-model ensemble, several centres’ ensemble prediction systems are combined to form one
super-ensemble. The different forecasts from different NWP models allow for a pragmatic
representation of model uncertainty.
This representation of model uncertainty is particularly common for climate projections.
Since the mid 1990s, the World Climate Research Programme (WCRP) has organised global
climate model intercomparisons. Participating centres perform experiments with their models
using different suggested forcings for different emission scenarios. These are then compared,
most recently in the Coupled Model Intercomparison Project, Phase 5 (CMIP5) (Taylor et al.,
2012), which contains climate projections from more than 50 models, run by more than 20
groups from around the world.
Multi-model ensembles (MMEs) perform better than the best single model in the ensemble
if and only if the single-model ensembles are over-confident (Weigel et al., 2008). An over-confident (under-dispersive) single-model ensemble (SME) is penalised by the forecasting skill
score (Section 1.7) for not sampling the full range of model uncertainty. Since different models
are assumed to have different errors, combining a number of over-confident models allows the
full range of uncertainty to be sampled, improving the forecast performance of the MME over
the SMEs.
MME seasonal predictions were made at the European Centre for Medium-Range Weather
Forecasts (ECMWF) as part of the Development of a European Multimodel Ensemble system
for seasonal to inTERannual prediction (DEMETER) project (Palmer et al., 2004). Seasonal
predictions of the MME have higher skill than the ECMWF SME, which is mainly due to
an improvement in the reliability of the ensemble. This supports the use of MMEs as a
way of representing model uncertainty. The DEMETER project has evolved into EUROSIP
(European Seasonal to Interannual Prediction), a joint initiative between ECMWF, the U.K.
Met Office and Météo-France, which produces multi-model seasonal forecasts out to a lead
time of seven months.
An advantage of using MMEs to represent model uncertainty is that they represent uncertainty due to assumptions made when designing the dynamical core, not just due to the
formulation of the parametrisation schemes. Different centres use different discretisations for
the dynamical core (e.g. ECMWF use a spectral discretisation method whereas the U.K. Met
Office use a grid point model), and may also implement different time stepping schemes. An
ensemble that only perturbs the models’ parametrisation schemes will not explore this aspect
of model uncertainty.
A major disadvantage of using MMEs is that they have no way of representing systemic
errors common to all models. In addition, MMEs are “ensembles of opportunity” (Masson and
Knutti, 2011) which have not been designed to fully explore the model uncertainty. Furthermore, it can be shown that the individual models in an MME are not independent. Masson and Knutti (2011) use the Kullback-Leibler divergence applied to temperature and precipitation projections to construct a ‘family tree’ of model dependencies for the 23 ensemble members in the Coupled Model Intercomparison Project, Phase 3 (CMIP3) MME. They found that different models from the same institution are closely related, as well as different models with (for
example) the same atmospheric model basis. This leads to the conclusion that the number of
independent models is far smaller than the total number of models. This result was supported
by a similar study (Pennell and Reichler, 2011), which proposes that the effective number of
climate models in CMIP3 is between 7.5 and 9. This lack of diversity adversely affects how
well a MME can represent model uncertainty.
1.3.2 Multiparametrisation
A large source of forecast model error is the assumptions built into the physical parametrisation
schemes. The model error from these assumptions can be explored by using several different
parametrisation schemes to generate an ensemble of forecasts. This is called multiparametrisation (or multiphysics). Ideally, the different parametrisation schemes should give equally
skilful forecasts.
Houtekamer et al. (1996) use the multiparametrisation approach to represent model uncertainty in the Canadian Meteorological Centre General Circulation Model (GCM). This was the
first attempt to represent model uncertainties in an ensemble prediction system. The parametrisations which were varied were the horizontal diffusion scheme, the convection and radiation
code, the representation of orography, and the inclusion of a gravity wave drag scheme. An ensemble
of eight models was run with different combinations of these schemes, together with initial
condition perturbations. Analysis of the ensemble showed that the spread improved with the
addition of the multiparametrisation scheme, but that the ensemble was still under-dispersive.
It was proposed to include a “more dramatic” perturbation to the model in a future study
to increase this spread further. The Meteorological Service of Canada operationally use this
multiparametrisation strategy to represent model uncertainty in their Ensemble Kalman Filter. They create a 24 member ensemble by altering the parametrisation schemes used for deep
convection, the land surface, and for calculating the turbulent mixing length (Houtekamer
et al., 2007).
The use of a multiparametrisation scheme requires several different parametrisations to be
maintained as operational. This is costly for a single centre to do, but could be shared between
multiple centres. Additionally, multiparametrisation schemes, like multi-model ensembles, are
ensembles of opportunity. It is unclear whether such an ensemble represents the full model error, which can result in under-dispersive ensembles. To overcome this limitation, new parametrisations
must be systematically designed to span the full range of uncertainty in the model physics,
further increasing the cost of this approach.
1.3.3 Perturbed Parameters
A simple alternative to MMEs or multiparametrisation schemes is using a perturbed parameter
ensemble. When developing a parametrisation scheme, new parameters are introduced to
describe unresolved physical processes. Many of these parameters are poorly constrained as
they cannot be measured directly (they often represent complex processes) and there are only
limited available data. Uncertainty due to the approximations in the parametrisation scheme
can therefore be represented by varying these uncertain parameters within their physical range.
The largest perturbed parameter experiment is ‘climateprediction.net’ (Stainforth et al.,
2005). This is a distributed-computing experiment which uses the idle processing time on the
personal computers of volunteers across the world. Model uncertainty is probed by varying
the parameters in the physical parametrisation schemes. Each parameter can be set to one
of three values — standard, low or high — where the range is proposed by an expert in the
parametrisation scheme. For each set of parameters, an initial condition ensemble is generated,
and the spread of the “ensemble of ensembles” used as an indicator of uncertainty in climate
change projections.
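To make the sampling strategy concrete, the following minimal Python sketch (an illustration added here, not code from climateprediction.net) enumerates every combination of expert-proposed low/standard/high parameter values and attaches an initial condition ensemble to each; all parameter names and numbers below are hypothetical placeholders.

```python
import itertools
import random

# Illustrative (hypothetical) uncertain parameters, each restricted to a
# low / standard / high value proposed by an expert, as in
# climateprediction.net (Stainforth et al., 2005).
PARAM_LEVELS = {
    "entrainment_coef": (0.5, 1.0, 2.0),   # placeholder values
    "ice_fall_speed":   (0.5, 1.0, 1.5),
    "cloud_lifetime":   (0.5, 1.0, 2.0),
}

def parameter_sets():
    """Enumerate every combination of low/standard/high parameter values."""
    names = list(PARAM_LEVELS)
    for combo in itertools.product(*(PARAM_LEVELS[n] for n in names)):
        yield dict(zip(names, combo))

def initial_condition_ensemble(x0, n_members, eps=1e-3):
    """Perturb a reference initial state to sample initial condition uncertainty."""
    return [[x + random.gauss(0.0, eps) for x in x0] for _ in range(n_members)]

# An "ensemble of ensembles": one initial-condition ensemble per parameter set.
x0 = [0.0] * 8
experiments = [(params, initial_condition_ensemble(x0, n_members=4))
               for params in parameter_sets()]
print(len(experiments), "parameter sets, each with its own IC ensemble")
```

The resulting "ensemble of ensembles" samples initial condition uncertainty within each model version and parameter uncertainty across them.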
Perturbing parameters gives a greater control over the ensemble than multi-model or multiparametrisation approaches, but the final results of the ensemble depend on the choice of
parameters perturbed as well as the choice of base model. It is very expensive to run a GCM
many times with different parameter perturbations. However, a statistical emulator can be
constructed to allow interpolation away from the tested parameter sets (Rougier et al., 2009).
Lee et al. (2012) use emulation to construct a large perturbed parameter experiment for eight
parameters in the Global Model of Aerosol Processes (GLOMAP) system. By considering
cloud condensation nuclei (CCN) concentrations and performing a sensitivity analysis, they
are able to deduce which parameters (and therefore which processes) contribute the most to
the CCN uncertainty at different global locations. This is a powerful tool which can be used
to identify weaknesses in the model, and focus future research efforts.
There are several drawbacks to the perturbed parameter approach, including the inability
to explore structural or systemic errors as a single base model is used for the experiment.
Additionally, some combinations of parameter perturbations may be unphysical, though this
can be avoided by identifying “good” parts of the parameter space and weighting the different climate projections accordingly (Rodwell and Palmer, 2007). However, this constraint further
limits the degree to which the perturbed parameter ensemble can explore model uncertainty.
1.4 Stochastic Parametrisations
The equations governing the evolution of the atmosphere are deterministic. However, the process of discretising these equations in a GCM renders them non-deterministic, as the unresolved
sub-grid tendencies must be approximated in some way (Palmer et al., 2009). The unresolved
variables are not fully constrained by the grid-scale variables, so a one-to-one mapping of the
large-scale onto the small-scale variables, as is the case in a deterministic parametrisation,
seems unjustified. A stochastic scheme, in which random numbers are included in the computational equations of motion, is able to explore other nearby regions of the attractor compared
to a deterministic scheme. An ensemble generated by repeating a stochastic forecast gives an
indication of the uncertainty in the forecast due to the parametrisation process. A stochastic
parametrisation must be viewed as a possible realisation of the sub-grid scale motion, whereas
a deterministic parametrisation represents the average sub-grid scale effect.
1.4.1 Proof of concept: Stochastic Parametrisations in the Lorenz ’96 System
There are many benefits of performing proof of concept experiments using simple systems
before moving to a GCM or NWP model. Simple chaotic systems are transparent and computationally cheap, but are able to mimic certain properties of the atmosphere. They also
allow for a robust definition of “truth”, important for development and testing of parametrisations, and verification of forecasts. The Lorenz ’96 system was designed by Lorenz (1996)
to be a “toy model” of the atmosphere, incorporating the interaction of variables of different
scales. It is therefore particularly suited as a testbed for new parametrisation methods which
must represent this interaction of scales. This study will begin by testing stochastic parametrisation schemes using the second model proposed in Lorenz (1996), henceforth, the L96
system (Chapter 2). This system describes a coupled set of equations for two types of variables
arranged around a latitude circle (Lorenz, 1996):
\begin{align}
\frac{\mathrm{d}X_k}{\mathrm{d}t} &= -X_{k-1}(X_{k-2} - X_{k+1}) - X_k + F - \frac{hc}{b} \sum_{j=J(k-1)+1}^{kJ} Y_j, & k &= 1, \ldots, K; \tag{1.6a} \\
\frac{\mathrm{d}Y_j}{\mathrm{d}t} &= -cb\, Y_{j+1}(Y_{j+2} - Y_{j-1}) - cY_j + \frac{hc}{b} X_{\mathrm{int}[(j-1)/J]+1}, & j &= 1, \ldots, JK; \tag{1.6b}
\end{align}

where the variables have cyclic boundary conditions: X_{k+K} = X_k and Y_{j+JK} = Y_j. The X_k variables are large amplitude, low frequency variables, each of which is coupled to many small amplitude, high frequency Y_j variables. Lorenz suggested that the Y_j represent convective events, while the X_k could represent, for example, larger scale synoptic events. The interpretation of the other parameters is outlined in Chapter 2 (Table 2.1), where the L96 model is
used to test stochastic and perturbed parameter representations of model uncertainty.
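As a concrete illustration (a sketch added here, not the thesis code), the following Python fragment integrates (1.6) with a fourth-order Runge-Kutta scheme; the parameter values shown are placeholders, the settings actually used being those of Table 2.1 in Chapter 2.

```python
import numpy as np

# Illustrative settings only: the values used in the thesis are those of
# Table 2.1 in Chapter 2.
K, J = 8, 32                          # number of X variables; Y variables per X
h, c, b, F = 1.0, 10.0, 10.0, 20.0    # coupling, time-scale ratio, amplitude ratio, forcing

def l96_tendency(X, Y):
    """Right-hand side of eq. (1.6) for the two-scale Lorenz '96 system."""
    dX = np.empty(K)
    dY = np.empty(J * K)
    for k in range(K):
        # Negative Python indices implement the cyclic boundary X_{k+K} = X_k.
        dX[k] = (-X[k - 1] * (X[k - 2] - X[(k + 1) % K]) - X[k] + F
                 - (h * c / b) * Y[J * k:J * (k + 1)].sum())
    n = J * K
    for j in range(n):
        dY[j] = (-c * b * Y[(j + 1) % n] * (Y[(j + 2) % n] - Y[j - 1])
                 - c * Y[j] + (h * c / b) * X[j // J])
    return dX, dY

def rk4_step(X, Y, dt=0.001):
    """Advance the coupled system one step with fourth-order Runge-Kutta."""
    kx1, ky1 = l96_tendency(X, Y)
    kx2, ky2 = l96_tendency(X + 0.5 * dt * kx1, Y + 0.5 * dt * ky1)
    kx3, ky3 = l96_tendency(X + 0.5 * dt * kx2, Y + 0.5 * dt * ky2)
    kx4, ky4 = l96_tendency(X + dt * kx3, Y + dt * ky3)
    return (X + dt * (kx1 + 2 * kx2 + 2 * kx3 + kx4) / 6,
            Y + dt * (ky1 + 2 * ky2 + 2 * ky3 + ky4) / 6)
```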
A particular subclass of stochastic parametrisations is data-driven schemes, which use a
statistical approach to derive the form of the parametrisation. In such models, the stochastic
parametrisation is conditioned on data collected from the system. While these do not necessarily aid understanding of the physical source of the stochasticity, they are free from a
priori assumptions and have been shown to perform well. Parametrisation schemes designed
and tested in the context of the L96 system are often of this form, firstly because there is
no physical basis from which to develop a deterministic parametrisation scheme, and secondly
because it is computationally feasible to perform the very long “truth” integrations required
to condition such a statistical scheme.
Wilks (2005) uses the L96 system as a testbed to explore the effects of stochastic parametrisations on the model’s short term forecasting skill and climatology. The full set of coupled
equations was first run to define the “truth”. The forecast model then used a truncated set
of equations in which the effect of the Y variables on the grid-scale motion was parametrised.
The parametrisation used was a quartic polynomial in X, with a first order autoregressive
additive stochastic term. The magnitude and degree of autocorrelation in the stochastic term
were determined through measurements of the true sub-grid tendency.
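The structure of this scheme is simple to sketch. In the following Python fragment (illustrative only), the quartic coefficients and the noise parameters stand in for the values Wilks (2005) fitted to measurements of the true sub-grid tendency; phi denotes the lag-1 autocorrelation of the AR(1) process.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder values: Wilks (2005) fits the quartic coefficients and the noise
# parameters (lag-1 autocorrelation phi, standard deviation sigma_e) to
# measurements of the true sub-grid tendency.
poly_coeffs = [1e-3, -1e-2, 0.1, 1.0, 0.0]   # hypothetical quartic g(X), highest power first
phi, sigma_e = 0.95, 1.0

def parametrised_tendency(X, e_prev):
    """Deterministic quartic polynomial in X plus first-order autoregressive
    (AR(1)) additive noise; returns the tendency and the updated noise state."""
    g = np.polyval(poly_coeffs, X)
    # AR(1) update with stationary standard deviation sigma_e.
    e = phi * e_prev + sigma_e * np.sqrt(1.0 - phi**2) * rng.standard_normal(np.shape(X))
    return g + e, e
```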
The climatology of the stochastically parametrised model was shown to improve over the
deterministic model, and the inclusion of temporally autocorrelated noise resulted in improvements over white additive noise. Wilks then studied the effects of stochastic parametrisations
on the short term forecast skill. Ten thousand perturbed initial condition ensembles of approximately 900 members were generated. Studying the root mean square error (RMSE) indicated
that the stochastic parametrisations improved over the deterministic parametrisations with an
ensemble size of 20, while the accuracy of single member stochastic integrations was worse
than the deterministic integrations. The stochastic parametrisation scheme resulted in an
improvement in the reliability of the ensemble forecast.
Crommelin and Vanden-Eijnden (2008) used Markov processes, conditional on the resolved
X variables, to represent the effects of the sub-grid scale Y variables on the X variables. The
closure they proposed was also determined purely from data with no knowledge of the physics
of the sub-grid scales. The sub-grid tendency, B_k, was modelled as a collection of Markov chains: B_k(t_2) is conditional on B_k(t_1), X_k(t_2) and X_k(t_1). This parametrisation is local, i.e. the tendency for the kth X variable, X_k, depends only on that variable. Secondly, including X_k(t_2) in the conditions for B_k(t_2) gives a directionality to the parametrised tendency: B_k(t_2) depends on the direction in which X_k is moving. The Markov chains are generated by splitting the (X_k, B_k) plane into 16 × 4 non-overlapping bins, and the transition probability matrix between these bins is calculated.
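A simplified sketch of how such a transition matrix can be estimated from a long "truth" integration is given below; to keep the example short it conditions only on the X bin at the new time, whereas the full scheme described above also conditions on X at the previous time.

```python
import numpy as np

N_X, N_B = 16, 4   # bin counts used by Crommelin and Vanden-Eijnden (2008)

def fit_transition_matrices(X_series, B_series):
    """Estimate Markov transition probabilities for the binned sub-grid
    tendency B, conditioned here on the bin of X at the new time only
    (a simplification of the full conditioning described in the text)."""
    x_edges = np.linspace(X_series.min(), X_series.max(), N_X + 1)
    b_edges = np.linspace(B_series.min(), B_series.max(), N_B + 1)
    x_bin = np.clip(np.digitize(X_series, x_edges) - 1, 0, N_X - 1)
    b_bin = np.clip(np.digitize(B_series, b_edges) - 1, 0, N_B - 1)
    counts = np.zeros((N_X, N_B, N_B))
    for t in range(len(B_series) - 1):
        counts[x_bin[t + 1], b_bin[t], b_bin[t + 1]] += 1.0
    # Normalise each conditional row; fall back to uniform where no data exist.
    totals = counts.sum(axis=2, keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / N_B)
```

At forecast time, the current X bin and the previous B bin select a row of probabilities from which the next sub-grid tendency bin is drawn.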
This conditional Markov chain Monte Carlo scheme is more sophisticated than the Wilks
(2005) scheme, and performs better when reproducing the pdf of Xk . The model’s performance
in weather forecasting mode was also analysed for perturbed initial condition ensembles of 1, 5
and 20 members. Improvements in the forecast’s RMSE, anomaly correlation (AC) and rank
histograms were observed for the proposed parametrisation when compared to Wilks (2005)
and deterministic schemes.
Kwasniok (2012) proposed an approach which combines cluster weighted modelling (Gershenfeld et al., 1999) with conditional Markov chains (Crommelin and Vanden-Eijnden, 2008).
The sub-grid tendency is conditional on both X_k(t) and δX_k(t) = X_k(t) − X_k(t − 1). The closure model, referred to as the cluster weighted Monte Carlo (CWMC) model, is determined purely from the initial “truth” dataset. Firstly, the three dimensional dataset (X_k, δX_k, B_k) is mapped onto a series of discrete points (s, d, b) by binning the (X_k, δX_k, B_k) space into N_X by N_δX by N_B bins. The set of possible sub-grid tendencies is given by the average value of B_k in each bin. A local Markov process dictates which of the N_B values of the sub-grid tendency is used for a given (X_k, δX_k) pair. The joint probability density p(s, b, d) is modelled as a sum over M cluster states (Gershenfeld et al., 1999). The parameters of the sub-grid model are fitted using an expectation-maximisation (EM) algorithm. This model makes no a priori assumptions about the form of the stochastic parametrisation. The only parameters to be set by the user are the number of clusters, M, and the fineness of the discretisation.
The CWMC closure shows improvement over the Wilks (2005) scheme in representation of
the long term dynamics (the pdf) of the system. Kwasniok then studied the CWMC model
in ensemble forecasting mode. Reliability diagrams indicate little improvement over the Wilks
(2005) scheme for studies with and without initial condition perturbations. However, the
forecast skill of the CWMC scheme shows a significant improvement over a simple first order
autoregressive (AR(1)) additive noise scheme; this increase in skill must be due to an increase in
forecast resolution (see Section 1.7).
1.4.2 Stochastic Parametrisation of Convection
Convection is an important atmospheric process, responsible for the vertical transport of heat, water and momentum. It occurs at scales on the order of a few kilometres
— smaller than the 10 km grid scale in NWP models, and far smaller than the 100 km grid
scale in GCMs. In order to capture the convection dynamics realistically, a grid scale of 100 m
is needed (Dorrestijn et al., 2012). Convection must therefore be parametrised in both weather
and climate models.
Representing moist convection in models is challenging because convection links processes
on vastly different scales. For example, the interaction between clouds and aerosol particles on
the micrometer scale alters the radiative forcing of the climate system on a global scale through
the aerosol direct and indirect effects (Solomon et al., 2007). Convection is also coupled to
the large scale dynamics of the atmosphere, as the condensation that produces precipitation releases latent heat. Through its importance in the Hadley and Walker circulations, variability in convection
is linked to the El Niño Southern Oscillation (ENSO) (Oort and Yienger, 1996), affecting
the coupled ocean-atmosphere system on an interannual time scale. Therefore, a realistic
convective parametrisation must also take a wide variety of scales, and their interactions, into
account.
At the longest time scales, the Intergovernmental Panel on Climate Change’s fourth assessment report (IPCC AR4, Solomon et al., 2007) confirmed that cloud feedbacks are the
main cause for differences in predicted climate sensitivity between different GCMs. Climate
sensitivity is defined as the change in global mean surface temperature from a doubling of atmospheric CO2 concentration, and is sensitive to internal feedback mechanisms in the climate
system. Some estimates suggest that up to 30% of the variation in climate sensitivity can
be attributed to uncertainty in the convection parametrisation schemes, for example due to
uncertainty in the entrainment coefficient which governs the turbulent mixing of ambient air
into the cloud (Knight et al., 2007). In order to produce reliable probabilistic forecasts, it is
therefore imperative that we represent the uncertainty in models due to the representation of
convective clouds.
Current state-of-the-art deterministic convection schemes are designed to simulate the
mean (first-order moment) of convective ensembles, following the assumptions of Arakawa
and Schubert (1974). Higher order moments, which indicate the potential variability of the
forcing for a given resolved state, are not calculated. However, there is evidence that the
unresolved forcing for a given resolved state can show considerable variance about the mean
(Xu et al., 1992; Shutts and Palmer, 2007; Peters et al., 2013), so a given large scale forcing
could result in a range of small scale convective responses. It is not clear how much of this
variability feeds back to the larger scales. However, in current (deterministically parametrised) GCMs, the high-frequency convective variability is underestimated when compared with
observations, there is too little power in high frequency modes, and the spatial distribution
of variability shows significant deviations from the true distribution (Ricciardulli and Garcia,
2000). Stochastic convection parametrisation schemes provide a way to represent this sub-grid
scale variability and thereby aim to improve the variability and distribution of forcing associated with convective processes, which is likely to result in improved tropical dynamics in the
host GCM.
There has been much interest in recent years in developing stochastic convection parametrisation schemes for two reasons: the importance of convection, and the shortcomings of current
deterministic schemes. In this study, we will develop and compare a number of representations of model uncertainty in the ECMWF convection parametrisation scheme (Chapter 5).
In preparation for this chapter, and as an example of the breadth of possible stochastic parametrisation schemes, current research into stochastic parametrisation of convection will be discussed here in detail. Lin and Neelin (2002) describe two generalised approaches for stochastic
parametrisation of convection:
1. “Directly controlling the statistics of the overall convective heating by specifying a distribution as a function of the model variables, with this dependence estimated empirically”
2. “Stochastic processes introduced within the framework of the convective parametrisation,
informed by at least some of the physics that contribute to the unresolved variance”
Stochastic convection parametrisation schemes following each of these approaches will be
discussed below.
1.4.2.1 Statistical Approaches
As in the L96 system, there has been interest in developing statistical parametrisations of convection following the first approach outlined above. These are free from a priori assumptions,
so can explore the full range of uncertainty associated with convection. They are statistical
emulators, and are able to reproduce the sub-grid scale effects measured from observations or
high resolution simulations. However, they are only able to reproduce behaviour similar to that
in their training data-set, which may not be very long for the case of atmospheric simulations.
LES derived clusters: The approach taken by Dorrestijn et al. (2012) follows the method
used by Crommelin and Vanden-Eijnden (2008) using the L96 system. A Large Eddy Simulation (LES) is used to provide realistic profiles of heat and moisture fluxes due to shallow
cumulus convection. The profiles are clustered, and the parametrisation scheme is formulated
15
as a conditional Markov chain, which uses the cluster centroids as its states. Cluster transition
probabilities are estimated from the LES data and conditioned on the large scale state.
The parametrisation was tested in a single column model (SCM) setting, and produced a
realistic spread of fluxes and a good distribution of cloud states. It was not tested whether
the fluctuations at the grid scale were able to cascade up to the larger scales. Since the
parametrisation scheme did not include explicit spatial correlations (the Markov chain was
not conditioned on neighbouring states), the lack of mesoscale structures might prevent this
cascade. However, the implicit correlations imposed by the conditioning on the large scale
state could be sufficient (Dorrestijn et al., 2012).
In Dorrestijn et al. (2013), a similar approach was used for deep convection. However,
instead of using clustering to determine the states for the Markov chain, five physically motivated cloud types were defined according to their cloud top height and column rain fraction
(ratio of rain water path to cloud top height): clear sky, shallow cumulus, congestus, deep,
and stratiform, following the work of Khouider et al. (2010). As before, data from a LES was
used to derive the transition matrices for the Markov chain, and both conditioning on the large
scale variables and on neighbouring cells (stochastic cellular automata) was considered.
Dorrestijn et al. (2013) found that conditioning the Markov chain on convective available
potential energy (CAPE) and convective inhibition (CIN) gave reasonable fractions of different
clouds, but the variability in these fractions was too small when compared to the LES data.
The variability improved when a stochastic cellular automaton was considered. Combining
both methods to produce a conditional stochastic cellular automaton gave the best results,
highlighting the importance of including spatial correlations and information about the large
scale in a parametrisation.
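The basic machinery of such a cloud-type Markov chain is illustrated in the Python sketch below; the transition matrix is invented for illustration (in the paper it is estimated from LES data), and the conditioning on CAPE, CIN and neighbouring cells used by Dorrestijn et al. (2013) is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
TYPES = ["clear sky", "shallow cumulus", "congestus", "deep", "stratiform"]

def step_cloud_field(field, trans):
    """Advance a grid of cloud-type labels one step: each cell evolves by a
    Markov transition from its current type (no conditioning on the large
    scale state or on neighbouring cells in this simplified sketch)."""
    new = field.copy()
    for idx, t in np.ndenumerate(field):
        new[idx] = rng.choice(len(TYPES), p=trans[t])
    return new

# Hypothetical, persistence-heavy transition matrix (rows sum to one); in the
# paper the matrix is estimated from LES data.
trans = np.full((5, 5), 0.05) + 0.75 * np.eye(5)
field = np.zeros((16, 16), dtype=int)        # start from clear sky everywhere
field = step_cloud_field(field, trans)
```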
Empirical Lognormal Scheme for Rainfall Distribution: Lin and Neelin (2002) test
the first generalised approach for stochastic parametrisation by developing a parametrisation
scheme which captures the rainfall statistics obtained by remote sensing, aiming to simulate
both the observed variance and distribution of precipitation. The model’s deterministic convection parametrisation scheme is assumed to represent the relationship between the ensemble
mean sub-grid scale precipitation and the grid scale variables correctly. The convective heating
output by this deterministic scheme, Q_C^det, defines a mixed-lognormal probability distribution for precipitation with mean equal to Q_C^det and a constant shape factor estimated from observations (a constant shape factor implies that the standard deviation of the rain rate increases in proportion to the mean). The parametrised value of convective heating is drawn from the defined lognormal
distribution and follows an AR(1) process. There was a large impact on intraseasonal variability, though the impact on the pdf of daily mean precipitation was poorer than when using
the stochastic CAPE scheme described in the following section. The authors conclude that
the impact of a stochastic scheme on the climatology of a model can be very different from its
impact on the model’s variability. The interactions between heating and large-scale dynamics
result in an atmosphere that selectively modifies the input stochasticity, making offline calibration difficult. Nevertheless, the effects of higher-order moments of convective motions have
an important impact on the climate system, and should therefore be included in atmospheric
models.
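As an illustration, the sketch below draws a heating value from a lognormal distribution whose mean equals the deterministic tendency, with the underlying Gaussian noise evolving as an AR(1) process. The shape factor and autocorrelation are assumed values, and the mixed (zero-rain) component of the distribution is ignored.

```python
import numpy as np

# Sketch: convective heating drawn from a lognormal distribution with mean
# equal to the deterministic value q_det (cf. Lin and Neelin, 2002). The
# shape factor s and AR(1) parameter phi are illustrative assumptions, and
# the mixed (zero-rain) component of the distribution is ignored.
rng = np.random.default_rng(1)
s, phi = 0.8, 0.9

def stochastic_heating(q_det, eta):
    """Return (heating, new AR(1) noise state); requires q_det > 0."""
    eta = phi * eta + np.sqrt(1.0 - phi**2) * rng.standard_normal()
    mu = np.log(q_det) - 0.5 * s**2         # ensures E[Q] = q_det
    return np.exp(mu + s * eta), eta

q, eta = stochastic_heating(q_det=2.5, eta=0.0)
```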
1.4.2.2 Physically motivated schemes
There are benefits to following the second approach outlined above. Physically motivated
schemes make use of the intuition of the scientist developing the scheme, in contrast to data-driven stochastic parametrisation schemes, which offer no insight as to the reasons for including
stochastic terms. Physically motivated schemes can also be developed to make use of existing
deterministic convection schemes, and can therefore benefit from the years of experience accumulated for that deterministic scheme. At a recent workshop at ECMWF (Representing model
uncertainty and error in numerical weather and climate prediction models, 20–24 June 2011),
the call went out to establish a firm physical basis for stochastic parametrisation schemes, and
ECMWF was urged to develop future parametrisations which are explicitly stochastic (Recommendations of Working Group 1). This section will discuss examples of such physically
motivated stochastic schemes.
Stochastic CAPE Closure: Lin and Neelin (2000) propose a simple stochastic modification
to a CAPE-based deterministic parametrisation scheme. In the deterministic scheme, the
convective heating Qc is set proportional to C1 , a measure of CAPE. In the stochastic scheme,
an AR(1) random noise term is added to C1 . The standard deviation of the noise term is
estimated from observations to be 0.1 K, and three autocorrelation time scales are tested. The
noise has a mean of zero, so the mean of Qc is not strongly affected by the stochastic term,
but the variability of Qc is increased. In Lin and Neelin (2003), the stochastic modification
to the CAPE closure is justified by considering the link between CAPE and cloud base mass
flux. They show that the stochastic CAPE closure is equivalent to assuming the presence of
random variations in the mass flux at cloud base, which could represent the effects of small
scale dynamics on the convective cloud.
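A minimal sketch of this noise term follows, assuming an illustrative model time step and autocorrelation time: the AR(1) perturbation with 0.1 K standard deviation is added to the CAPE measure before it enters the closure.

```python
import numpy as np

# Sketch of the AR(1) noise added to the CAPE measure C1 (Lin and Neelin,
# 2000). sigma = 0.1 K follows the text; the time step and autocorrelation
# time are illustrative assumptions.
rng = np.random.default_rng(2)
sigma, tau, dt = 0.1, 2.0, 0.125            # K, days, days (assumed)
phi = np.exp(-dt / tau)

def perturbed_c1(c1, xi):
    """Return (perturbed CAPE measure, new AR(1) noise state)."""
    xi = phi * xi + sigma * np.sqrt(1.0 - phi**2) * rng.standard_normal()
    return c1 + xi, xi

c1_stoch, xi = perturbed_c1(c1=1.5, xi=0.0)
```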
The scheme was tested in a model of intermediate complexity (Lin and Neelin, 2000).
Precipitation was found to be strongly affected by the stochastic scheme, with the longest time
scale scheme producing a distribution that closely resembles the observations. The variance of
precipitation was also much higher and had a more realistic spatial distribution for the longer
time scale than for the shorter time scale cases — it is clear that the autocorrelation time scale
is an important parameter in the stochastic parametrisation, and has a large impact on the
efficacy of the scheme. Zonal wind at 850 hPa shows an improved variability at longer time
scales of 10–40 days. This highlights the importance of capturing unresolved, short time scale
(hours–days) variability in convection as it can impact variability in the tropics at intraseasonal
time scales. The scheme was also tested in a climate model (Lin and Neelin, 2003), and showed
an improvement in both the variance and spatial distribution of daily precipitation.
Stochastic Vertical Heating Structure: The stochastic CAPE closure described above
assumes the vertical structure produced by the deterministic parametrisation scheme is satisfactory, and perturbs only the input to the deterministic scheme. However, there is also
uncertainty associated with the vertical structure of heating due to, for example, varying levels
of detrainment for different convective elements or due to differences in squall line organisation
in the presence of vertical wind shear (Lin and Neelin, 2003). In order to probe uncertainty in
the parametrised vertical structure of heating, Lin and Neelin (2003) propose a simple additive
noise scheme for the temperature, T , at each vertical level k:
$$T = \tilde{T}_t + \xi_t - \frac{\Delta p_k}{\Delta p_{tot}} \langle \xi_t \rangle, \qquad (1.7)$$
where $\tilde{T}_t$ is the grid scale temperature at time step $t$ after the convective heating has been applied, $\xi_t$ is the stochastic noise term, and the mass weighted vertical mean of the noise, $\langle \xi_t \rangle$, has been subtracted to ensure energy is conserved. The scheme is tested in a GCM, and
precipitation variance is observed to increase, though the placement of precipitation is not
improved. Since the scheme does not directly affect precipitation at a given time step, the
stochastic term must feed through the large scale dynamics before impacting on precipitation.
This scheme could therefore be used to identify large scale features which are sensitive to the
vertical structure of convective heating.
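The conservation constraint can be illustrated with the following sketch, which removes the mass weighted column mean of the noise so that the perturbation leaves the mass weighted column integral of temperature unchanged; the level thicknesses and noise amplitude are assumptions.

```python
import numpy as np

# Sketch of the energy-conserving temperature noise of equation (1.7): the
# mass weighted column mean of the noise is removed, so the mass weighted
# column integral of temperature is unchanged. Level thicknesses and noise
# amplitude are illustrative assumptions.
rng = np.random.default_rng(3)

def perturb_column(T_conv, dp, sigma=0.1):
    """T_conv: temperature after convective heating (per level);
    dp: pressure thickness of each level."""
    xi = sigma * rng.standard_normal(T_conv.shape)
    xi_bar = np.sum(dp * xi) / np.sum(dp)    # mass weighted mean <xi>
    return T_conv + xi - xi_bar

dp = np.array([100.0, 150.0, 200.0, 250.0, 300.0])              # hPa
T_new = perturb_column(np.array([250.0, 255.0, 261.0, 268.0, 275.0]), dp)
```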
Stochastic Convective Inhibition: A model for stochastic CIN was proposed by Majda
and Khouider (2002). There is significant CAPE over much of the western Pacific warm
pool, yet deep convection only occurs over a small fraction of the area. A reason for this is
the presence of negative potential energy for vertical motion which inhibits convection: CIN
(Majda and Khouider, 2002). CIN has significant fluctuations at scales much smaller than the
grid scale due to turbulent motion in the boundary layer, so the authors propose a stochastic
model to account for the effect of this sub-grid scale variability on convection. They model CIN
using an integer parameter, σI , where σI = 1 indicates a site with CIN, and σI = 0 indicates
a site without CIN where deep convection may develop. The interaction rules governing the
state of the parameter at different sites are derived following a statistical mechanics “spin-flip”
formulation. The macroscopic value of CIN acts as a “heat bath” for the local sites, and the
spin-flip probabilities are defined following intuitive rules. This stochastic CIN formulation can
be coarse-grained and coupled to a standard mass-flux convection scheme to give a stochastic
convection parametrisation scheme. This parametrisation was tested in a local area model:
the scheme is shown to significantly alter the climatology and improves the variability when
compared to the deterministic scheme (Khouider et al., 2003).
Stochastic Multicloud Model: The deterministic convection parametrisation scheme proposed by Khouider and Majda (2006, 2007) is based on analysis of observations, and theoretical
understanding of tropical dynamics. They propose a parametrisation scheme centred around
three cloud types observed over the warm pool and in convectively coupled waves: shallow congestus, stratiform and deep penetrative cumulus clouds. The model emphasises the dynamic
role of each of the cloud types, and avoids introducing many of the ad hoc parameters common
in convection parametrisation schemes. The parametrisation reproduces large-scale organised
convection, and was tuned to reproduce the observed tropical wave dynamics. However, in
some physically motivated regions in parameter space, the model performs very poorly, and
simulations show reduced variability when compared to the model which has been tuned away
from these physical parameter values.
This multicloud scheme was used as the basis for a stochastic Markov chain lattice model for
use in GCMs with a grid box of ∼ 100 km (Khouider et al., 2010), with the aim of accounting
for the unresolved sub-grid scale variability associated with convective clouds. Each GCM grid
box is divided into n × n lattice sites, where n ∼ 100. Each lattice point is assumed to be
occupied by one of the three cloud types, or by clear sky, and is assumed to be independent
of its neighbours. A given site switches from cloud type to cloud type following a set of
probabilistic rules, conditioned on the large scale state. The transition time scales are tuned
to set the cloud coverage at equilibrium to the desired level. The stochastic multicloud model
produced the desired large degree of variability in single column mode. The model was tested
in a GCM using physically motivated regions in parameter space (Frenkel et al., 2012), and
was found to produce a mean circulation and wave structure similar to those observed in
high resolution cloud resolving model (CRM) simulations: including stochastic terms into the
deterministic model corrected the bias in the deterministic model. Furthermore, the stochastic
parametrisation was shown to scale well from a medium to a coarse resolution GCM grid,
preserving the high variability and the statistical structure of the convective systems.
Stochastic Cellular Automata: A cellular automaton (CA) is a set of rules governing the
temporal evolution of a grid of cells, each of which can be in a number of discrete states. The
rules can be probabilistic or deterministic. This provides an interesting option for a convection
parametrisation, as it already includes the self-organisation, horizontal communication and
memory observed in mesoscale convective systems (Palmer, 2001). Bengtsson et al. (2013)
describe a convection parametrisation scheme which uses a CA to represent sub-grid variability.
The CA is tested in the Aire Limitée Adaptation/Application de la Recherche à l’Opérationnel
(ALARO) limited area model, using a grid scale of 5.5 km. The CA is defined on a grid 4 × 4 times finer than the host model resolution, and both deterministic and probabilistic evolution rules
are tested. The size of the CA cells was chosen to represent the horizontal scale of one convective element. The fractional area of active CA cells acts as an input to the deterministic
mass-flux convection scheme. At each time step, variability is generated by randomly seeding
new CA cells in grid boxes where the CAPE exceeds some threshold value.
Forecasts made with the CA parametrisation scheme were compared to a control deterministic forecast. The CA scheme is able to reproduce mesoscale convective systems, and captures
the precipitation intensity and convective organisation observed in a squall line in summer 2009
better than the deterministic model. A time lagged ensemble is constructed for the deterministic and CA cases — a 10% increase in spread is observed when the CA is used, improving
the reliability of the forecasts, though the ensembles remain under-dispersive.
Insights from Statistical Mechanics: Convective variability can be characterised mathematically in terms of large scale properties of the atmosphere if a number of simplifying
assumptions are made (Craig and Cohen, 2006). Firstly, the equilibrium case is considered,
i.e., the forcing is assumed to vary slowly in time and space such that a grid box contains a
large ensemble of clouds that have adjusted to the environmental forcing. Secondly, the ensemble is assumed to be non-interacting: individual convective clouds interact with each other
only through the large scale flow. These two assumptions are reasonable in cases of weakly
forced, unorganised convection. Starting from these assumptions, and assuming that the large
scale constrains the mean total convective mass flux, where the mean is taken over possible
realisations of the ensemble of convective clouds, Craig and Cohen (2006) derive an expression
for the distribution of individual mass fluxes, and for the probability distribution of total mass
flux. The distribution is also a function of the mean mass flux per cloud, which some studies
indicate is independent of large scale forcing (Craig and Cohen, 2006). The variance of the
convective mass flux scales inversely with the number of convective clouds in the ensemble. In
the case of a large grid box, or a strong forcing, the number of clouds will be large and an
equilibrium convection parametrisation scheme will be at its most accurate. The variability
about the mean becomes increasingly important as the grid box size is reduced, and in cases
of weak forcing.
The predictions made by this theory were tested in CRM simulations (Cohen and Craig,
2006). The distribution of individual cloud mass fluxes closely followed the predicted distribution. The simulated distribution of total mass flux was also close to the predicted distribution,
but showed less variance, though this deficit was somewhat corrected for when the finite size
of simulated clouds was taken into account. Simulations with imposed vertical wind shear
produced organised convection, which also followed the theory. The theoretical distribution
predicted by Craig and Cohen (2006) characterises the observed convective distribution, so
appears suitable for use in a stochastic convective parametrisation scheme.
Plant and Craig (2008) describe such a stochastic parametrisation scheme. The theoretical
distribution of Craig and Cohen (2006) is assumed to represent the equilibrium statistics of
convection for a given atmospheric state. The distribution of convective mass fluxes for a grid
box is drawn from this distribution at each time step, and used to calculate the convective
tendencies experienced by the resolved scales. The scheme follows the assumptions of Arakawa
and Schubert (1974), namely that the observed ensemble of convective clouds is determined by
the large-scale properties of the environment. Since this large-scale region could be larger than
the size of a grid box, the atmospheric state is first averaged over neighbouring grid boxes to
ensure that the region will contain many clouds. This also introduces spatial correlations into
the parametrisation scheme. Temporal correlations are introduced by assuming that clouds
have a finite lifetime. An existing deterministic parametrisation scheme is required to link the
modelled distribution of cloud mass fluxes with a vertical profile of convective heating and
moistening. The scheme is tested in the single column version of the U.K. Met Office Unified
Model (UM), and the results show many desirable traits; the mean temperature and humidity
profiles approximate those observed in CRM integrations, and in the limit of a large grid box
the parametrisation scheme approaches a deterministic scheme, though further work testing
the variability introduced into the model by the stochastic scheme would be beneficial. The
scheme was later tested in a regional version of the UM (Keane and Plant, 2012). The resultant
mean vertical profiles were similar to conventional schemes, and the statistics of the mass flux
of convective clouds followed the predictions of the underlying theory (Craig and Cohen, 2006).
1.4.3 Developments in operational NWPs
Two complementary approaches to stochastic parametrisation have been developed at ECMWF
in collaboration with the U.K. Met Office. The Stochastically Perturbed Parametrisation
Tendencies (SPPT) scheme aims to represent random errors associated with model uncertainty
from the physical parametrisation schemes, and so perturbs the parametrised tendencies about
the average value that a deterministic scheme represents. In contrast, the Stochastic Kinetic
Energy Backscatter (SKEB) scheme (usually called Spectral stochastic Back-Scatter — SPBS
— at ECMWF) aims to represent a physical process absent from the parametrisation schemes
(Palmer et al., 2009).
1.4.3.1 Stochastically Perturbed Parametrisation Tendencies
SPPT involves multiplying the tendencies from parametrised processes by a random number.
The first version of SPPT was incorporated into the ECMWF ensemble prediction system
(EPS) in 1998 (Buizza et al., 1999). Prior to this, the EPS was based on the perfect model
assumption, i.e. it was assumed that the only uncertainty in the forecast is due to errors in the
initial conditions. However, the reliability of the forecast could not be made consistent over a
range of lead times by altering the initial condition perturbations. Including SPPT accounted
for errors in the model and significantly improved the reliability. In this first version of SPPT,
the perturbed tendencies, Xp , of the horizontal wind components, temperature and humidity
were calculated as
$$X_p = (1 + r_X)X_c, \qquad (1.8)$$
where $r_X$ is a uniform random number between −0.5 and 0.5, and $X_c$ is the deterministic parametrised tendency. Different random numbers are used for the different variables. Spatial
correlations were imposed by using the same random numbers over a 10◦ by 10◦ area, and
temporal correlations by holding the random numbers constant for six model time steps (3
hours and 4.5 hours for T399 and T255 respectively). The amplitude of the stochastic term
and the degrees of correlation were determined by evaluating the forecast skill when using a
range of values, though this tuning process implied that these parameters were poorly constrained. Nevertheless, including this scheme into the ECMWF EPS resulted in a significant
improvement in the reliability of the forecasts.
This scheme was revised to remove the unphysical spatial discontinuities in the perturbations. The new scheme (Palmer et al., 2009) uses a spectral pattern generator (Berner et al.,
2009) to generate a smoothly varying perturbation field. All variables are perturbed with the
same random number:
$$X_p = (1 + r\mu)X_c, \qquad r = \sum_{m,n} \hat{r}_{mn} Y_{mn}, \qquad (1.9)$$
where $Y_{mn}$ denotes a spherical harmonic of zonal wavenumber $m$ and total wavenumber $n$. The spectral coefficients, $\hat{r}_{mn}$, evolve in time according to an AR(1)
process. The constant µ in (1.9) tapers the perturbation to zero close to the surface and in the
stratosphere. This is because large perturbations in the boundary layer resulted in numerical
instabilities, and radiative tendencies are considered to be well known in the stratosphere.
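The structure of (1.9) is illustrated by the sketch below, with a toy gridpoint AR(1) field standing in for the spectral pattern generator (so, unlike the operational pattern, it carries no spatial correlations); all numerical values are assumptions.

```python
import numpy as np

# Sketch of the structure of SPPT, equation (1.9). A toy gridpoint AR(1)
# field stands in for the spectral pattern generator, so it lacks the
# spatial correlations of the operational pattern; phi and sigma are
# illustrative assumptions.
rng = np.random.default_rng(4)
phi, sigma = 0.95, 0.5
r = np.zeros((64, 128))                      # toy lat-lon pattern

def evolve_pattern(r):
    return phi * r + sigma * np.sqrt(1.0 - phi**2) * rng.standard_normal(r.shape)

def sppt(tendency, r, mu):
    """Perturb a summed parametrised tendency; mu in [0, 1] is the vertical
    taper, zero near the surface and in the stratosphere."""
    return (1.0 + mu * r) * tendency

r = evolve_pattern(r)
perturbed = sppt(np.ones((64, 128)), r, mu=1.0)
```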
The improved scheme was tested and its performance compared to the old version of SPPT
and to a “perturbed initial condition only” ensemble. The upper air temperature predicted
by the improved scheme showed a slight improvement over the old scheme in the extra-tropics
and a very significant improvement in the tropics in terms of the ranked probability skill score.
The effects on precipitation were also considered: the Buizza et al. (1999) version of SPPT
showed a significant wet bias in the tropics, which has been substantially reduced in the new
version of the scheme.
The main criticism against SPPT is that the form of its stochastic perturbations is ad hoc: the spatial and temporal correlation scales have no physical motivation and have simply been
tuned to give the best results. However, the magnitude and type of noise were retrospectively
justified using coarse graining studies. Shutts and Palmer (2007) defined an idealised CRM simulation as truth. The resultant fields and their tendencies were
then coarse grained to the resolution of a NWP model to study the sub-grid scale variability
which a stochastic parametrisation seeks to represent. The effective heating function for the
$n$th coarse grid box, $\tilde{Q}_n$, was calculated by averaging over nine fine grid boxes. This was compared to the heating calculated from a convective parametrisation scheme, $Q_1 = Q_1(X)$, where $X$ represents the coarse grained CRM fields.
The validity of the multiplicative noise in the SPPT scheme was analysed by studying
histograms of $\tilde{Q}$ conditioned on different ranges of $Q_1$. The mean and standard deviation of $\tilde{Q}$ are observed to increase as a function of $Q_1$, providing some support for the SPPT scheme. The histograms also become more asymmetric as $Q_1$ increases. It is interesting to note that the mean and standard deviation are both non-zero for $Q_1 = 0$, behaviour which is not represented by a purely multiplicative scheme. Explicit measurements of the standard deviation of $\tilde{Q}$ as a function of its mean, and of their dependence on grid box size, could be included in a future parametrisation scheme.
1.4.3.2 Stochastic Kinetic Energy Backscatter
Kinetic energy loss is common in numerical integration schemes and physical parametrisations
(Berner et al., 2009). For example, studies of mesoscale organised convective systems indicate
that convection acts to convert CAPE to kinetic energy on the model grid; Shutts and Gray
(1994) showed that up to 30% of energy released by these systems is converted to kinetic energy
in the large scale balanced flow. However, most convection parametrisations do not include
this kinetic energy transfer, and instead focus on the thermodynamic effects of deep convection
(Shutts, 2005). The loss of kinetic energy is common in other parametrisation schemes, such as
the representation of sub-grid orography and turbulence. It was proposed that upscale kinetic
energy transfer could counteract the kinetic energy loss from too much dissipation, and that
this upscale transfer could be represented by random streamfunction perturbations.
The Stochastic Kinetic Energy Backscatter (SKEB) scheme proposed by Berner et al. (2009)
builds on the Cellular Automaton Stochastic Backscatter Scheme (CASBS) of Shutts (2005).
In CASBS, the streamfunction perturbations were modulated by a pattern generated by a CA,
as such patterns exhibit desirable spatial and temporal correlations. The SKEB scheme uses a
spectral pattern generator instead of CA to allow for easier manipulation of these correlations.
Each spherical harmonic, ψ, evolves separately in time according to an AR(1) process:
$$\psi_n^m(t + \Delta t) = (1 - \alpha)\,\psi_n^m(t) + g_n \sqrt{\alpha}\,\epsilon(t), \qquad (1.10)$$
where $(1 - \alpha)$ is the first order autoregressive parameter, $g_n$ is the noise amplitude for zonal wavenumber $n$, and $\epsilon$ is Gaussian zero mean white noise with standard deviation $\sigma_z$.
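A minimal sketch of the spectral AR(1) evolution in (1.10) follows; the truncation, $\alpha$, and the wavenumber dependence of $g_n$ are illustrative assumptions.

```python
import numpy as np

# Sketch of the AR(1) evolution of the streamfunction spectral coefficients
# in equation (1.10). Truncation, alpha and the wavenumber dependence of
# the noise amplitude g_n are illustrative assumptions.
rng = np.random.default_rng(5)
N, alpha = 21, 0.05
n = np.arange(1, N + 1)
g = n ** -1.5                                # assumed noise amplitudes

def skeb_step(psi):
    eps = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    return (1.0 - alpha) * psi + g * np.sqrt(alpha) * eps

psi = skeb_step(np.zeros(N, dtype=complex))
```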
Including the SKEB scheme resulted in an improved ensemble spread, and more importantly, a number of diagnostics indicated the overall skill of the forecast increased over much
of the globe. Palmer et al. (2009) investigated incorporating both SPPT and SKEB into the
ECMWF EPS — including SKEB results in a further improvement in spread and skill when
compared to SPPT alone.
In a recent study, Berner et al. (2012) show that including the SKEB scheme in the ECMWF
model at a typical climate resolution (T95) results in a reduction in a systematic bias observed
in the model’s Northern Hemisphere circulation. A reduction in the zonal flow and an improved
frequency of blocking was observed. Increasing the horizontal resolution significantly to T511
gave a comparable reduction in model bias in the Northern Hemisphere, implying it is the poor
representation of small-scale features which leads to this bias.
1.4.3.3 Stochastic Physics in the U.K. Met Office EPS
In addition to using a version of SKEB, “SKEB2” (Shutts, 2009), in the Met Office Global and
Regional Ensemble Prediction System (MOGREPS), the Random Parameters (RP) scheme is
used at the Met Office to represent uncertainty in a subset of the physical parameters in the
UM (Bowler et al., 2008). The uncertain parameters describe processes in the convection, large
scale cloud, boundary layer and gravity wave drag parametrisation schemes in the UM. The
parameters are varied globally following a first order autoregressive process. The parameters
are bounded, and the maximum and minimum permitted values of the parameters are set by
experts in the respective parametrisation schemes. The stochastic parametrisation schemes
were found to have a neutral impact on the climatology of the UM, but have a significant
impact on individual forecasts in the ensemble (Bowler et al., 2008).
1.5 Comparison with Other Representations of Model Uncertainty
Several methods of representing model uncertainty have been discussed. The question then
follows: which is best? Following on from the work of Doblas-Reyes et al. (2009), Weisheimer
et al. (2011) compared the forecast skill of three different representations of model uncertainty: the multi-model method, the perturbed parameter approach and the use of stochastic
parametrisations. For the multi-model ensemble (MME), five different models were run, each with a nine member initial condition ensemble. The perturbed parameter ensemble consists of one standard control
model and eight versions with simultaneous perturbations to 29 parameters. The nine member stochastic physics ensemble used the SPPT and SKEB parametrisation schemes, including
initial condition perturbations. A set of control forecasts with the ECMWF model without
stochastic physics was also generated.
The stochastic parametrisation ensemble performed the best for lead times out to one
month in terms of the Brier skill score. For longer lead times of two to four months, the
multi-model ensemble achieved the highest Brier skill score for surface temperature. At these
lead times the stochastic ensemble has higher forecast skill for precipitation events, apart from
dry December/January/February, where the perturbed parameter ensemble performs the best.
In none of the situations studied does the control ensemble, without representation of model
uncertainty, perform the best. The forecast skill was studied for different land regions; at lead
times of one month, the stochastic parametrisation ensemble performed the best in the majority
of cases, while at lead times of 2–4 months there was no clear “winner”. The reliability of the
ensembles was studied through comparison of the RMSE and ensemble spread. The MME
performed extremely well, and showed an almost perfect match between RMSE and spread at
all lead times. The stochastic ensemble also performed well for lead times of 1 to 4 months.
However, the control experiment and perturbed parameter ensemble showed a substantial
difference between the RMSE and ensemble spread, indicating these forecasts were unreliable.
1.6 Probabilistic Forecasts and Decision Making
There are many different frameworks for including a representation of uncertainty in atmospheric models, some of which have been discussed above. However, why is it important to
calculate the uncertainty in a forecast? Probabilistic forecasts enable users to make better
informed decisions. In this way, probabilistic forecasts are economically valuable to the end
user of the forecast, and this economic value allows the benefit of probabilistic forecasts to be quantified.
The value of a weather forecast can be understood using the framework of the “cost-loss
ratio situation”, commonly used to discuss decision making in meteorology (Murphy, 1977).
The situation consists of a user who must decide whether to take protective action against
some weather event, such as crops being destroyed by a drought, or a home being destroyed by
a flood. Taking the protective action costs the decision maker C, but if the protective action is
not taken and the destructive event does occur, the user suffers a loss L. Table 1.1 summarises
the outcomes of such a situation, for the example of insuring a property against flood damage.
The cost of insurance is independent of whether the event occurs or not, but the loss is only
incurred if the flood happens, where p is the probability of a flood occurring. The economically
logical choice for the decision maker is to insure the property if C < pL. The only estimate of p
available to the decision maker is from a probabilistic forecast. Therefore, the user should act
to take out protection if the forecast probability p > C/L and should not protect otherwise.
The user requires a probabilistic forecast to make his or her decision.
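The decision rule can be written in a few lines; the numbers below are purely illustrative.

```python
def should_protect(p_forecast, cost, loss):
    """Cost-loss decision rule: protect when the expected loss p*L exceeds
    the cost C of taking protective action."""
    return p_forecast > cost / loss

should_protect(p_forecast=0.2, cost=100.0, loss=2000.0)   # True: 0.2 > 0.05
```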
An important quality of a probabilistic forecast is that it is reliable. This refers to the
consistency, when statistically averaged, of the forecast probability of an event and the measured probability of an event (Wilks, 2006). For example, if all the occasions when a flood
was forecast with a 10% probability were collected together, the observed frequency of floods
should be 10%. If the perfect deterministic forecast could be generated, issuing this forecast
would be of greatest value to the user as they would only take out protective action if a
                       Flood   No Flood   Expected Cost
Cost of Insurance      C       C          Cp + C(1 − p) = C
Cost of No Insurance   L       0          Lp + 0(1 − p) = Lp
Probability            p       (1 − p)
Table 1.1: Decision making using the cost-loss scenario. Should a decision maker insure his property against flood damage if the cost of insurance is C and the expected loss is L, given that the forecast probability of a flood is p?
guaranteed flood was on its way (Murphy, 1977). Since the goal of perfect deterministic forecasts is
unattainable, a well calibrated probabilistic forecast should be the central aim for forecasters.
In reality, probabilistic forecasts are not perfectly reliable. For example, consider a flood
that was forecast with probability p, but due to shortcomings in the forecast model, the actual
probability of a flood occurring was q. The user will take out protective action or not
based on an incorrect forecast probability, increasing the expected cost to the user.
An example of a perfectly reliable forecast, if a sufficiently long time window is used for
verification, is the climatological forecast. This forecast is not particularly useful as it contains
no information about the flow dependency of the forecast probability. However, it serves as
a useful baseline for considering the value of a forecasting system, as all users are assumed
to have access to the climatological information. Therefore, the economic value of a forecast
system to a user should be calculated with respect to the economic value of the climatological
forecast. The economic value, V , of a forecasting system is defined to be:
$$V = \frac{E_{climate} - E_{forecast}}{E_{climate} - E_{perf.det.}}, \qquad (1.11)$$
where $E_i$ indicates the expected cost of the climatological forecast, the perfect deterministic
forecast or the forecast system under test (Wilks, 2006). The maximum economic value of 1
is obtained by the perfect deterministic forecasting system, and negative economic values are
obtained if following the forecast system will result in costs to the user which are greater than
following the climatological forecast.
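A sketch of (1.11) follows, assuming the standard cost-loss expressions for the expected expense of the climatological strategy (the cheaper of always or never protecting) and of the perfect deterministic forecast (protecting only when the event occurs), cf. Wilks (2006).

```python
def economic_value(E_forecast, C, L, obar):
    """Economic value (1.11). obar is the climatological event frequency;
    E_forecast is the user's mean expense when following the forecasts."""
    E_climate = min(C, obar * L)    # cheaper of always/never protecting
    E_perfect = obar * C            # protect only when the event occurs
    return (E_climate - E_forecast) / (E_climate - E_perfect)
```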
It should be stressed again that, while the perfect deterministic forecast would have the
highest economic value, this forecast is an idealised theoretical construct. Uncertainty in
forecasts arising from initial conditions, boundary conditions and from model approximations
will never be eliminated, so the perfect deterministic forecast is an unattainable goal. In
general, using an imperfect deterministic forecast results in higher costs to the user than using
a probabilistic forecast, even if the probabilistic forecast is not perfectly reliable. For example,
Zhu et al. (2002) evaluate the economic value of the National Centers for Environmental
Prediction (NCEP) ensemble forecast compared to a higher resolution deterministic forecast
and to a reference deterministic forecast of the same resolution. The ensemble forecast was
shown to have a higher economic value for most cost-loss ratios at a range of lead times.
The above analysis has assumed that decision makers are rational, and will make a decision
based on the current forecast and their cost-loss analysis. However, forecast users are not
perfectly rational. They can have a low tolerance to occasions when the forecast probability
indicated they should act, but subsequently the event did not occur (“false alarm”), and
similarly to occasions when they chose not to act based on the forecast, but the event did
occur (“miss”). These events occur even if the forecast is perfectly reliable, but a low tolerance
to these events can affect the user’s confidence in the forecast and alter their behaviour. For
example, Roulston and Smith (2004) consider the case of Aesop’s fable “The Boy Who Cried
Wolf”, where a shepherd boy alerts the villagers to a wolf attacking their sheep twice, but
when the villagers come to protect their sheep they find no wolf. The third time he cries
“wolf”, the villagers do not believe him, so do nothing to protect the flock from the wolf which
then appears. The villagers’ intolerance to false alarms affects how they interpret the forecast:
their cost-loss ratio, based on the value of their sheep, is estimated to be 0.1, so logically they
should act if the probability of wolf attack is just 10%, but in reality they cannot tolerate the
associated 90% probability of a false alarm. In fact, if users are intolerant to false alarms and
only act on the warning with some probability proportional to the false alarm rate, the optimal
warning threshold should be set higher, and is shown to be closer to 60% (Roulston and Smith,
2004). This is considerably higher than the 10% threshold predicted by cost-loss analysis, but
is close to the threshold used by the U.K. Met Office for its early warning system.
The example of “The Boy Who Cried Wolf” highlights the importance of effective communication of probabilistic forecasts to decision makers. The producers of forecasts should engage
with users to assist their interpretation of the forecasts, and to discover both the clearest and
most useful way to present forecasts and what forecast products will be of greatest economic
value to the user (Stephens et al., 2012).
1.7 Evaluation of Probabilistic Forecasts
Section 1.6 outlined the importance of developing reliable probabilistic forecasts. The additional property that a probabilistic forecast must possess is resolution. Resolution is the
property which provides the forecaster with information specifying the future state of the
system (Bröcker, 2009). It sorts the potential states of the system into separate groups (Leutbecher, 2010). In order to have a high resolution, the forecast must be sharp, i.e. localised
in state space. Gneiting and Raftery (2007) consider the goal of a probabilistic forecast to be
maximising the sharpness while retaining the reliability.
Having produced a probabilistic forecast, how can we evaluate the skill of this forecast?
More specifically, how can we test whether this forecast is reliable, and has resolution? There
are many different methods for forecast verification that are commonly used (Wilks, 2006).
Graphical forecast summaries provide a lot of information about the forecast, but it can be
difficult to compare many forecasting models using them. Instead, it is often necessary to
choose a scalar summary of forecast performance allowing several forecasts to be ranked unambiguously. Different scores may produce different rankings, so deciding which is appropriate
for the situation is important, but not obvious. Some of the more common verification techniques will be discussed in this section.
1.7.1 Scoring Rules
Scoring rules provide a framework for forecast verification. They summarise the accuracy of
the forecast by giving a quantitative score based on the forecast probabilities and the actual
outcome, and can be considered as rewards which a forecaster wants to maximise. Due to
their ability to rank several forecasting systems unambiguously, scalar summaries of forecast
performance remain a popular verification method for probabilistic forecasts.
Scoring rules must be carefully designed to encourage honesty from the forecaster: they
must not contain features which promote exaggerated or understated probabilities. This constitutes a proper score, without which the forecaster may feel pressurised to present a forecast
which is not their best guess (Brown, 1970). For example, a forecaster may want to be deliberately vague such that their prediction will be proved correct, regardless of the outcome.
Alternatively, the user may demand that the forecaster backs a certain outcome instead of
providing the full probabilistic forecast. It can be shown that a proper skill score must evaluate both the reliability and the resolution of a probabilistic forecast (Bröcker, 2009). All
the scores used in this thesis are proper, and are some measure of the difference between the
forecast and verification, so small values indicate better forecasts.
Following Gneiting and Raftery (2007), let P be the probability distribution predicted by
the forecaster, and x be the final outcome, or ‘verification’ (Bröcker et al., 2009). The scoring
rule S(P, x) takes the forecast as its first argument and the verification as its second. Let
S(P, Q) represent the expected value of S(P, x) where Q is the verification distribution. A
forecaster seeks to minimise S(P, Q), so will only predict P = Q (i.e. predict the true result) if
S(P, Q) ≥ S(Q, Q) with equality if and only if P = Q. Such a scoring rule is strictly proper. If
the inequality is true for all P and Q, but S could also be optimised for some P not identical
to the verification distribution, the scoring rule is referred to as proper.
A score is local if it depends only on the probability forecast for the actual observation. A
score dependent on the full distribution, such as the Continuous Ranked Probability Score, is
not local.
It is useful to consider the improvement in the predictive ability of a forecast with respect to
some reference forecast. The reference forecast is often chosen to be the climatology, although
other choices such as persistence or an older forecasting system can be used instead (Wilks,
2006). Thus, the score (S) is expressed as a skill score (SS):
$$SS = \frac{S - S_{ref}}{S_{perf} - S_{ref}}. \qquad (1.12)$$
(1.12)
For many scoring rules, the perfect score, $S_{perf}$, is zero, so the skill score can be expressed as:
$$SS = 1 - \frac{S}{S_{ref}}. \qquad (1.13)$$
(1.13)
A perfect forecast has a skill score of one. A forecast with no improvement over the reference
forecast, $S_{ref}$, has a skill score of zero, and a forecast worse than $S_{ref}$ has negative skill. All
the scores discussed below may be converted into a skill score in this way.
1.7.1.1 The Brier Score
The Brier Score (BS) (Wilks, 2006) is used when considering dichotomous events (e.g. rain or
no rain). It is the mean square difference between the forecast and observed probability of an
event occurring. For n forecast/observation pairs, the Brier score can be written as
$$BS = \frac{1}{n}\sum_{k=1}^{n} (y_k - o_k)^2, \qquad (1.14)$$
where $y_k$ is the $k$th predicted probability of the event occurring, and $o_k = 1$ if the $k$th event occurred and 0 otherwise (note that this formulation differs from that originally proposed by Brier (1950), which summed over both the event and the non-event).
The Brier score can be decomposed explicitly into its reliability and resolution components
(Murphy, 1973). Assume that forecast probabilities are allowed to take one of a set of $I$ discrete values, $y_i$, and that $N_i$ is the number of times a forecast $y_i$ is made in the forecast-verification data set. For each of the $I$ subsamples, sorted according to forecast probability, the observed frequency of occurrence of the event, $\bar{o}_i$, can be evaluated as
$$\bar{o}_i = p(o|y_i) = \frac{1}{N_i}\sum_{k \in N_i} o_k. \qquad (1.15)$$
This is equal to the conditional probability of the event given the forecast. The climatological
frequency can also be defined as
$$\bar{o} = \frac{1}{n}\sum_{k} o_k, \qquad (1.16)$$
where n is the total number of forecast-verification pairs.
The Brier score may then be written as:
$$BS = \frac{1}{n}\sum_{i=1}^{I} N_i (y_i - \bar{o}_i)^2 - \frac{1}{n}\sum_{i=1}^{I} N_i (\bar{o}_i - \bar{o})^2 + \bar{o}(1 - \bar{o}), \qquad (1.17)$$
where the first term is the reliability and the second term is the resolution. The decomposition
includes a third term: uncertainty. This term is inherent to the forecasting situation and
cannot be improved upon. It is related to the variance of the climatological distribution, and
therefore to the intrinsic predictability of the system.
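The decomposition (1.17) is straightforward to compute when forecasts take discrete probability values, as the following sketch illustrates; the example forecasts are invented.

```python
import numpy as np

# Sketch of the Brier score and its Murphy (1973) decomposition (1.17) into
# reliability, resolution and uncertainty, for forecasts restricted to a
# discrete set of probability values.
def brier_decomposition(y, o):
    """y: forecast probabilities (n,); o: binary outcomes (n,)."""
    y, o = np.asarray(y, float), np.asarray(o, float)
    n, obar = len(y), o.mean()
    rel = res = 0.0
    for yi in np.unique(y):
        mask = y == yi
        Ni, oi = mask.sum(), o[mask].mean()
        rel += Ni * (yi - oi) ** 2 / n
        res += Ni * (oi - obar) ** 2 / n
    unc = obar * (1.0 - obar)
    return rel, res, unc                     # BS = rel - res + unc

y = np.array([0.1, 0.1, 0.5, 0.5, 0.9, 0.9])   # invented forecasts
o = np.array([0,   0,   1,   0,   1,   1  ])   # invented outcomes
rel, res, unc = brier_decomposition(y, o)
assert np.isclose(np.mean((y - o) ** 2), rel - res + unc)
```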
The BS is linked to the economic value of a forecast, given by (1.11): the integral of V with
respect to the cost/loss ratio is equivalent to using the BS to evaluate the forecasts (Murphy,
1966). In other words, BS is a measure of the average value of the forecast assuming that the
users have cost/loss ratios distributed evenly between zero and one (Richardson, 2001).
The BS is also closely linked to the reliability diagram (Wilks, 2006). This is a graphical diagnostic which summarises the full joint distribution of forecasts and observations, so can be
used to identify forecast resolution as well as reliability. It consists of two parts. The calibration
function shows the conditional distribution of observations given forecast probabilities, oi ,
plotted against the forecast probabilities, yi . The refinement distribution shows the distribution
of issued forecasts. Reliability and resolution, as defined in (1.17), can be identified using the
reliability diagram.
1.7.1.2 The Ranked Probability Score
The Ranked Probability Score (RPS) is a scoring rule used to evaluate a multi-category forecast.
Such forecasts can take two forms: nominal, where there is no natural ordering of events, and
ordinal, where the events may be ordered numerically. For ordinal predictions, it is desirable
that the score takes ordering into account - for example in Table 1.2, Forecast A and Forecast B both predict the true event, a temperature of 18–20°C, with a probability of 0.3. However, it might be desirable for Forecast A to be ranked better than B, as its forecast distribution was clustered closer to the truth than B’s.

Temperature / °C   Forecast A   Forecast B   Verification
18–20              0.3          0.3          1
20–22              0.6          0.2          0
22–24              0.1          0.1          0
24–26              0.0          0.4          0
Table 1.2: An illustrative example: comparing two different temperature forecasts with observation.
The RPS is defined as the sum of squared differences between the forecast and observed probabilities, so is closely related to the BS. However, in order to include the effects of distance discussed above, the difference is calculated between the cumulative forecast probabilities, $Y_m$, and the cumulative observations, $O_m$ (Wilks, 2006). This means that the RPS is a non-local score. Defining the number of event categories to be $J$,
$$Y_m = \sum_{j=1}^{m} y_j, \qquad m = 1, 2, \ldots, J, \qquad (1.18)$$
and
$$O_m = \sum_{j=1}^{m} o_j, \qquad m = 1, 2, \ldots, J, \qquad (1.19)$$
then
$$RPS = \sum_{m=1}^{J} (Y_m - O_m)^2, \qquad (1.20)$$
and the RPS is averaged over many forecast-verification pairs.
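A minimal sketch of (1.20) follows, applied to the two forecasts of Table 1.2.

```python
import numpy as np

def rps(y, o):
    """Equation (1.20). y: forecast probabilities over J ordered categories;
    o: one-hot vector marking the observed category."""
    return np.sum((np.cumsum(y) - np.cumsum(o)) ** 2)

obs = np.array([1, 0, 0, 0])                       # 18-20 C occurred
rps_a = rps(np.array([0.3, 0.6, 0.1, 0.0]), obs)   # 0.50
rps_b = rps(np.array([0.3, 0.2, 0.1, 0.4]), obs)   # 0.90
# rps_a < rps_b: A is ranked better, as its mass lies closer to the truth
```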
1.7.1.3 Ignorance and Entropic Scores
The Ignorance Score (IGN) was proposed by Roulston and Smith (2002) as a way of evaluating
a forecast based on the information it contains.
As for the RPS, define J event categories and consider N forecast-observation pairs. The
forecast probability that the $k$th verification will be event $i$ is defined to be $f(k)_i$ (where $i = 1, 2, \ldots, J$ and $k = 1, 2, \ldots, N$). If the corresponding outcome event was $j(k)$, define Ignorance to be
$$IGN = -\frac{1}{N}\sum_{k=1}^{N} \log_2 f(k)_{j(k)}, \qquad (1.21)$$
where the score has been averaged over the N forecast-verification pairs.
Ignorance is particularly sensitive to outliers, and heavily penalises situations where the
verification lies outside of the forecast range. Over-dispersive forecasts are not as heavily
penalised. Unlike the RPS, the value of IGN depends only on the prediction at the verification
value; it is a local score.
To calculate IGN, Roulston and Smith (2002) suggest defining M + 1 categories for an
ensemble forecast of M members, and approximating the pdf as a uniform distribution between
consecutive ensemble members.
An alternative way to calculate Ignorance has been proposed by Leutbecher (2010). It
assumes the forecast distribution is Gaussian, and calculates the logarithm of the probability density predicted for the verification value. This results in the following expression for Ignorance:
$$IGN_L = \frac{1}{\ln 2}\left( \frac{(z - m)^2}{2s^2} + \ln\!\left(s\sqrt{2\pi}\right) \right), \qquad (1.22)$$
where m and s are the ensemble forecast mean and standard deviation respectively, and z is
the observed value.
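A sketch of (1.22), fitting a Gaussian to a toy ensemble, follows; the ensemble values are invented.

```python
import numpy as np

def ignorance_gaussian(ensemble, z):
    """Equation (1.22): -log2 of the Gaussian density fitted to the
    ensemble, evaluated at the verification z."""
    m, s = np.mean(ensemble), np.std(ensemble, ddof=1)
    return ((z - m) ** 2 / (2 * s ** 2)
            + np.log(s * np.sqrt(2 * np.pi))) / np.log(2)

ign = ignorance_gaussian(np.array([18.2, 19.1, 20.3, 19.7, 18.9]), z=19.5)
```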
Information theory is also useful for evaluating the predictability of a system. The difference between the forecast distribution and the climatological distribution should be evaluated.
The greater the difference between the two distributions, the greater the predictability. If the
two are equal, the event is defined to be unpredictable (DelSole, 2004). The difference between
the two distributions (and therefore, the degree of predictability) can be evaluated using information theoretic principles. The information in a forecast is defined to be a function of
both the forecast distribution and the verification, and is closely related to IGN. Entropy, E,
is defined to be the average information in the forecast, weighted according to the probability
of the event occurring:
$$E = \int p(x) \log p(x)\, dx. \qquad (1.23)$$
One measure of the difference between the forecast and climatological distributions is the
difference in entropy between the two distributions (DelSole, 2004).
1.7.2 Other Scalar Forecast Summaries
1.7.2.1 Bias
The bias of a forecast is defined as the systematic error in the ensemble mean:
$$BIAS = \langle m - z \rangle, \qquad (1.24)$$
where z is the verification and m is the ensemble mean, and the average is taken over the
region of interest and over all start dates. This can be scaled by comparing to the root mean
squared value of the verification in that region (allowing for the verification to take positive
and negative values). This diagnostic tests only the ensemble mean, and is suitable for both
probabilistic and deterministic forecasts.
1.7.2.2 Root Mean Squared Error
The root mean squared error (RMSE) indicates the typical magnitude of errors in the ensemble mean:
$$RMSE = \sqrt{\langle (m - z)^2 \rangle}, \qquad (1.25)$$
where the average is taken over the region of interest and over all start dates, as for the bias.
This diagnostic also tests only the ensemble mean, so is also suitable for deterministic forecasts.
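Both diagnostics reduce to a few lines; in the sketch below, m and z are assumed to be arrays of ensemble means and verifications over the sample of interest.

```python
import numpy as np

def bias(m, z):
    """Equation (1.24): mean error of the ensemble mean."""
    return np.mean(m - z)

def rmse(m, z):
    """Equation (1.25): root mean squared error of the ensemble mean."""
    return np.sqrt(np.mean((m - z) ** 2))
```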
1.7.3 Graphical Verification Techniques
While skill scores are a useful tool as they allow for the unambiguous ranking of different
forecasts, they also have their limitations. A single scalar measure cannot fully describe the
quality of a forecast. For example, Murphy and Ehrendorfer (1987) show that reducing the
Brier score does not necessarily correspond to increasing the economic value of a forecast (an
example of why the appropriate score for a situation should be chosen with care). Moreover,
skill scores give no indication as to why one forecast is better than another. Is the source
of additional skill improved reliability, or better resolution? Graphical verification techniques
provide a way to identify the shortcomings in a forecast so that they can be targeted in future
model development.
1.7.3.1 Error-spread Diagnostic
The full forecast pdf represents our uncertainty of the future state of the system. The consistency condition is that the verification behaves like a sample from that pdf (Anderson, 1997;
Wilks, 2006). In order to meet the consistency condition, the ensemble must have the correct
second moment. If it is under-dispersive, the verification will frequently fall as an outlier.
Conversely, if the ensemble is over-dispersive, the verification may fall too often towards the
centre of the distribution.
The reliability of an ensemble forecast can be tested through the spread-error relationship
(Leutbecher and Palmer, 2008; Leutbecher, 2010). The expected squared error of the ensemble
mean can be related to the expected ensemble variance by assuming the M ensemble members
and the truth are independently identically distributed random variables with variance σ 2 .
Assuming the ensemble is unbiased, this gives the following requirement for a statistically
consistent ensemble:
$$\frac{M}{M-1}\,\overline{\text{estimated ensemble variance}} = \frac{M}{M+1}\,\overline{\text{squared ensemble mean error}}, \qquad (1.26)$$
where the overbar indicates that the variance and mean error should be averaged over many forecast-verification pairs. For a relatively large ensemble size, $M \gtrsim 50$, we can consider the correction factors to be close to 1.
This measure can be assessed in two ways. Firstly, the root mean square (RMS) error
and RMS ensemble spread can be evaluated for the forecast as a function of time for the
entire sample of cases, and the two compared. This gives a good summary of the forecast
calibration. However, (1.26) can be used in a stricter sense — the equation should also be
satisfied for subsamples of the forecast cases conditioned on the spread. This diagnoses the
ability of the forecasting system to make flow-dependent uncertainty estimates. This measure
can be assessed visually by binning the cases into subsamples of increasing RMS Spread, and
plotting against the average RMS Error in each bin. The plotted points should lie on the
diagonal (“the RMS error-spread graphical diagnostic”).

Figure 1.2: RMS error-spread graphical diagnostic for predictions of 500 hPa geopotential height between 35–65°N for forecasts from February–April 2006. Forecasts have a lead time of (a) two days, (b) five days, and (c) ten days. The bins are equally populated. Taken from Leutbecher and Palmer (2008).
Figure 1.2 shows ECMWF forecast data for the 500 hPa geopotential height between 35–65°N. All three cases show that the spread of the ensemble forecast can be used as an indicator of the ensemble mean error. At longer lead times of 5–10 days (Figures 1.2b and c), the ensemble
is well calibrated, and the average ensemble spread is a good predictor of RMSE. However
at shorter lead times, the forecast is under-dispersive for small errors, and over-dispersive for
large errors.
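A sketch of the binned diagnostic follows, assuming per-case spreads and ensemble mean errors have already been computed.

```python
import numpy as np

def error_spread_points(spread, error, n_bins=10):
    """Bin cases by ensemble spread (equally populated bins) and return the
    RMS spread and RMS ensemble mean error in each bin; for a statistically
    consistent ensemble the points lie on the diagonal."""
    order = np.argsort(spread)
    rms = lambda x: np.sqrt(np.mean(x ** 2))
    bins = np.array_split(order, n_bins)
    rms_spread = np.array([rms(spread[b]) for b in bins])
    rms_error = np.array([rms(error[b]) for b in bins])
    return rms_spread, rms_error
```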
There are other standard ways to evaluate the reliability of an ensemble forecast, such
as reliability diagrams (briefly discussed in Section 1.7.1.1) and verification rank histograms
(Wilks, 2006), which will not be elaborated on here. However, all of these standard verification techniques involve a visual aspect, making comparison of many different forecast models
difficult. This motivates the development of a new proper scalar score, which is particularly
sensitive to the reliability of the probabilistic forecast (Chapter 4).
1.8 Outline of Thesis
As discussed above, it is important to represent model uncertainty in weather forecasts, and
there is a growing use of stochastic parametrisations for this purpose. This study seeks to
provide firm foundations for the use of stochastic parametrisation schemes as a representation
of model uncertainty in weather forecasting models. In Chapter 2, idealised experiments in the
L96 system are described, in which the initial conditions for the forecasts are known exactly.
This allows a clean test of the skill of stochastic parametrisation schemes at representing
model uncertainty. This study also analyses the potential of using stochastic parametrisations
for simulating the climate, an area in which there has been little research. In Chapter 3, this is
considered in the context of the L96 system. The link between the skill of the forecast models
for weather prediction and climate simulation is also discussed in this chapter. In Chapters 2
and 3, the skill of stochastic parametrisation schemes is compared to a deterministic perturbed
parameter approach.
While studying the L96 system, it was found that there is a need for a proper score which is
particularly sensitive to forecast reliability, and which could be used to summarise the information contained in the RMS error-spread graphical diagnostic. A suitable score is proposed
and tested in Chapter 4.
The final aim of this study is to use the lessons learned in the L96 system to test and develop stochastic and perturbed parameter representations of model uncertainty for use in the
ECMWF EPS. In Chapter 5, the representation of model uncertainty in the convection parametrisation scheme in the ECMWF model is considered. A perturbed parameter representation
of uncertainty is compared to a stochastic scheme, and to forecasts with no representation of
uncertainty in convection. In Chapter 6, a generalised version of SPPT is developed and compared to the existing version. In Chapters 2–6, important results will be emphasised to aid the
reader. In Chapter 7, some conclusions are drawn, limitations of the study are outlined, and
possible future work is suggested.
1.9 Statement of Originality
The content of this thesis is entirely my own work, unless stated below.
Chapter 4: The Error-Spread Score
Antje Weisheimer (ECMWF, University of Oxford) ran the System 4 seasonal forecasts, and
calculated the seasonal forecast anomalies.
Chapters 5 and 6: Experiments in the ECMWF Model
The experiments in the ECMWF model required a number of code changes. These built on
changes developed by Alfons Callado Pallares (La Agencia Estatal de Meteorologı́a). Alfons
Callado Pallares (ACP) made the following changes to the model, which have been used or
developed further in this thesis:
1. Generalisation of SPPT to allow the SPPT perturbation for a particular scheme to be
switched off. The changes to SPPT perturb the individual parametrisation tendencies
sequentially, before each tendency is passed to the next physics scheme.
2. Generalisation of the spectral pattern generator code to allow more than one multi-scale spectral field to be generated and evolved. This allows tendencies from different
parametrisation schemes to be independently perturbed.
I made the following additional changes:
1. Generalisation of SPPT following (1) above. However, significant changes were made
such that the SPPT perturbations are not made sequentially. If the ACP code is used,
when SPPT is switched off for one physics scheme, that scheme will still be subject to
stochasticity in its input tendencies. Perturbing the parametrisation tendencies once they
have all been calculated removes this problem. A new subroutine (SPPTENI.F90) was
written which ensures that the perturbations in the new “independent SPPT” scheme
are truly independent, removing correlations introduced by sequentially perturbing the
tendencies.
2. Code development for fixed perturbed parameter scheme.
3. Code development for varying perturbed parameter scheme. This used ACP code (2) to
generate and evolve four new spectral fields.
In Chapter 5, the Ensemble Prediction and Parameter Estimation System used to estimate
parameter uncertainties was developed and tested in the IFS by Peter Bechtold (ECMWF),
Pirkka Ollinaho (Finnish Meteorological Institute) and Heikki Järvinen (University of Helsinki). Bechtold, Ollinaho and Järvinen provided me with the resultant joint probability distribution, which I used to develop the fixed and varying perturbed parameter schemes described
in Chapter 5. Sarah-Jane Lock (ECMWF) carried out the T639 high resolution integrations
presented in Chapter 6.
1.10 Publications
The work presented in this thesis has resulted in the following publications.
Chapters 2 and 3: Experiments in the Lorenz ’96 System
Chapter 2 and Chapter 3, Section 3.2, are based on a paper published in Philosophical Transactions of the Royal Society A (Arnold et al., 2013). Chapter 3, Section 3.3, considering regime
behaviour in the L96 system, is based on a paper currently in preparation.
Chapter 4: The Error-Spread Score
This chapter is based on a paper presenting the Error-Spread Score, which has been accepted
for publication in Quarterly Journal of the Royal Meteorological Society, pending minor corrections. The decomposition of the Error-Spread Score (Sections 4.5 and 4.8, and Appendix B.3)
is in preparation for submission to Monthly Weather Review.
Chapters 5 and 6: Experiments in the ECMWF Model
It is expected that the results presented in Chapters 5 and 6 will be published as two papers.
2 The Lorenz ’96 System: Initial Value Problem
Before attending to the complexities of the actual atmosphere ... it may be well to
exhibit the working of a much simplified case.
– Lewis Fry Richardson, 1922
2.1 Introduction
The central aim of any atmospheric parametrisation scheme must be to improve the forecasting
skill of the atmospheric model in which it is embedded and to better represent our beliefs about
the future state of the atmosphere, be this the weather in five days time or the climate in 50
years time. One aspect of this goal is the accurate representation of uncertainty: a forecast
should skilfully indicate the confidence the forecaster can have in his or her prediction. As
discussed in Section 1.3, there are two main sources of error in atmospheric modelling: errors
in the initial conditions and errors in the model’s representation of the atmosphere. The
ensemble forecast should explore these uncertainties, and a probabilistic forecast should then
be issued to the user (Palmer, 2001). A probabilistic forecast is of great economic value to
the user as it allows reliable assessment of the risks associated with different decisions, which
cannot be achieved using a deterministic forecast (Palmer, 2002).
The need for stochastic parametrisations has been motivated by considering the requirement
that an ensemble forecast should include an estimate of uncertainty due to errors in the forecast
model. In this section, the ability of stochastic parametrisation schemes to skilfully represent
this model uncertainty is tested using the Lorenz ’96 system. The full, two-scale system is run
and defined as “truth”. An ensemble forecast model is then developed by assuming the small
scale ‘Y ’ variables are unresolved, and by parametrising the effects of these small scale variables
on the resolved scale. Initial condition uncertainty is removed by using perfect initial conditions
for all ensemble members. Therefore the only source of uncertainty in the forecast is due to
model error from imperfect parametrisation of the ‘Y ’ variables, and from errors due to the time
stepping scheme. The spread in the forecast ensemble is generated purely from the stochastic
parametrisation schemes, so the ability of such a scheme to represent model uncertainty can be
rigorously tested. Such a separation of model and initial condition uncertainty is only possible in an idealised setting. However, this work differs from Wilks (2005)
and Crommelin and Vanden-Eijnden (2008), where model uncertainty is not distinguished
from initial condition uncertainty in this way, so the ability of stochastic parametrisations to
represent model uncertainty is not explicitly investigated. The performance of the different
stochastic parametrisation schemes is compared to an approach using a perturbed parameter
ensemble. This is a commonly used deterministic representation of model uncertainty, so serves
as a useful benchmark (Stainforth et al., 2005; Rougier et al., 2009; Lee et al., 2012).
In Section 2.2, the Lorenz ’96 System used in this experiment is described, and the different
stochastic schemes tested are described in Section 2.3. Sections 2.4 and 2.5 discuss the effects
of the stochastic schemes on short term weather prediction skill and reliability respectively.
Section 2.6 discusses experiments with perturbed parameter ensembles and Section 2.7 draws
some conclusions.
Figure 2.1: Schematic of the L96 system described by (1.6) (taken from Wilks (2005)). Each
of the K = 8 large-scale X variables is coupled to J = 32 small-scale Y variables.
2.2 The Lorenz ’96 System
The Lorenz ’96 (L96) simplified model of the atmosphere was used to test the ability of
stochastic parametrisation schemes to skilfully represent model uncertainty. The L96 system consists of two kinds of variables acting at two scales, as described by (1.6). The high
frequency, small scale Y variables are driven by the low frequency, large scale X variables, but
also affect the evolution of the X variables. This interaction between small and large scales is
observed in the atmosphere (see Section 1.2), and is what makes the parametrisation problem
non-trivial. The values of the parameters have been chosen such that both X and Y variables
are chaotic. The values and interpretation of the parameters are shown in Table 2.1. The L96
model is an ideal model for testing parametrisation schemes as it mimics important properties
of the atmosphere (interaction of scales and chaotic motion), but by integrating the full set of
equations, it allows for a rigorous definition of “truth” against which forecasts can be verified.

Parameter                              Symbol   Setting
Number of X variables                  K        8
Number of Y variables per X variable   J        32
Coupling constant                      h        1
Forcing term                           F        20
Spatial scale ratio                    b        10
Time scale ratio                       c        4 or 10

Table 2.1: Parameter settings for the L96 system (1.6) used in this experiment.
2.3 Description of the Experiment
A series of experiments is carried out using the L96 system. Each of the K = 8 low frequency,
large amplitude X variables is coupled to J = 32 high frequency, small amplitude Y variables,
as illustrated schematically in Figure 2.1. The X variables are considered resolved and the Y
variables unresolved, so must therefore be parametrised in a truncated model. The effects of
different stochastic parametrisations are then investigated by comparing the truncated forecast
model to the “truth”, defined by running the full set of coupled equations.
Two different values of the time scale ratio, c = 4 and c = 10, are used in this experiment.
The c = 10 case was proposed by Lorenz (1996), and was also considered by Wilks (2005). This
case has a large time scale separation so can be considered “easy” to parametrise. However, it
has been shown that there is no such time scale separation in the atmosphere (Nastrom and
Gage, 1985), so a second parameter setting of c = 4 is chosen, where parametrisation of the
sub-grid is more difficult, but which more closely represents the real atmosphere.
By comparing the error doubling time of the model to that observed in atmospheric GCMs,
Lorenz (1996) deduced that one model time unit in the L96 system is approximately equal
to five atmospheric days. This scaling gives an error doubling time of 2.1 days in the L96
system, as was observed in GCMs. However, since 1996, the resolution of GCMs has improved
significantly, and the error doubling time of GCMs has reduced due to a better representation
of the small scales in the model (Lorenz, 1996; Buizza, 2010). For example, the error doubling
time for the ECMWF NWP model at T399 resolution was measured to be between 0.84 and
1.61 days for perturbed forecasts at lead times of 1–3 days (Buizza, 2010). This is substantially less
than the estimate used by Lorenz to scale the L96 model. Reducing the time scale ratio to
c = 4 (and assuming the same scaling of 1 MTU = 5 days) reduces the error doubling time in
the L96 model to 0.80 days, which is closer to the error doubling time in current operational
weather forecasting models. Instead of independently re-scaling the model time units such that
the error doubling times in the c = 4 and c = 10 cases are equal, the standard scaling of
1 MTU = 5 days is used for both cases. The c = 4 case is therefore a more realistic simulation,
both due to the closer scale separation and the more realistic error doubling time.
2.3.1 “Truth” model
The full set of equations (1.6) is run and the resultant time series defined as “truth”. The
equations are integrated using an adaptive fourth order Runge-Kutta time stepping scheme,
with a maximum time step of 0.001 model time units (MTU). Having removed the transients,
the 300 initial conditions on the attractor are selected at intervals of 10 MTU, corresponding
to 50 “atmospheric days”. This interval was selected to ensure adjacent initial conditions are
uncorrelated — the temporal autocorrelation of the X variables is close to zero after 10 MTU.
A truth run is carried out from each of these 300 initial conditions.
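As an illustration, the construction of the truth run might be sketched as follows (a minimal sketch, assuming the standard two-scale L96 equations for (1.6); a fixed-step RK4 integrator is used for simplicity in place of the adaptive scheme used here, and all names are illustrative):

```python
import numpy as np

K, J, h, F, b, c = 8, 32, 1.0, 20.0, 10.0, 10.0   # Table 2.1 (c = 4 or 10)

def l96_full(state):
    """Tendencies of the two-scale L96 system: X driven by F and by the
    coupled Y variables; Y driven by the fast dynamics and by X."""
    X, Yf = state[:K], state[K:]
    Y = Yf.reshape(K, J)
    dX = (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F
          - (h * c / b) * Y.sum(axis=1))
    dY = (-c * b * np.roll(Yf, -1) * (np.roll(Yf, -2) - np.roll(Yf, 1))
          - c * Yf + (h * c / b) * np.repeat(X, J))
    return np.concatenate([dX, dY])

def rk4_step(f, s, dt):
    k1 = f(s); k2 = f(s + 0.5 * dt * k1)
    k3 = f(s + 0.5 * dt * k2); k4 = f(s + dt * k3)
    return s + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

dt = 0.001                                 # maximum time step used here
state = np.random.default_rng(0).standard_normal(K * (J + 1))
for _ in range(int(30 / dt)):              # spin-up: discard transients
    state = rk4_step(l96_full, state, dt)

ics = []                                   # 300 uncorrelated initial conditions
for _ in range(300):
    for _ in range(int(10 / dt)):          # 10 MTU apart (autocorrelation ~ 0)
        state = rk4_step(l96_full, state, dt)
    ics.append(state.copy())
```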
[Figure: measured sub-grid tendency U against X for (a) c = 4, with fitted cubic Udet = −0.000223X³ − 0.00550X² + 0.575X − 0.198, and (b) c = 10, with fitted cubic Udet = −0.00235X³ − 0.0136X² + 1.30X + 0.341.]
Figure 2.2: Measured sub-grid tendency, U , as a function of the X variables (circles) for the
(a) c = 4 and (b) c = 10 cases. For each case, the data was generated from a long “truth”
integration of (1.6). The figure shows a time series of 3000 MTU ≈ 40 “atmospheric years”
duration, sampled at intervals of 0.125 MTU. The solid line on each graph is a cubic fit
to the truth data, representing a deterministic parametrisation of the tendencies. There is
considerable variability in the tendencies not captured by such a deterministic scheme.
2.3.2 Forecast model
A forecast model is constructed by assuming that only the X variables are resolved, and
parametrising the effect of the unresolved sub-grid scale Y variables in terms of the resolved
X variables:
dX*_k/dt = −X*_{k−1}(X*_{k−2} − X*_{k+1}) − X*_k + F − Up(X*_k),   k = 1, ..., K,   (2.1)
where Xk∗ (t) is the forecast value of Xk (t) and Up is the parametrised sub-grid tendency.
The forecast model (2.1) is integrated using a piecewise deterministic, adaptive second order
Runge-Kutta (RK2) scheme. A forecast time step is defined, at which the properties of the
truth time series are estimated. The stochastic noise term in Up is held constant over this
time step.¹ Such a stochastic Runge-Kutta scheme has been shown to converge to the true
Stratonovich forward integration scheme, as long as the parameters in such a scheme are used
in the same way they are estimated (Hansen and Penland, 2006, 2007). This was verified for
the different stochastic parametrisations tested. The parametrisations Up (Xk∗ ) approximate
the true sub-grid tendencies,
U(Xk) = (hc/b) Σ_{j=J(k−1)+1}^{kJ} Yj,   (2.2)
which are estimated from the truth time series as
U(Xk) = [−X_{k−1}(X_{k−2} − X_{k+1}) − Xk + F] − (Xk(t + ∆t) − Xk(t))/∆t.   (2.3)
The forecast time step was set to ∆t = 0.005. The L96 system exhibits cyclic symmetry, so the same parametrisation is used for all Xk.

¹At the start of a new time step, a new random number is drawn for each X variable tendency. All RK2 derivatives are calculated using this random number until the next time step is reached, which may involve several RK2 iterations because of the adaptive time stepping.
This estimated ‘true’ sub-grid tendency, U , is plotted as a function of the large-scale X
variables for both c = 4 and c = 10 (Figure 2.2). This can be modelled in terms of a
deterministic parametrisation, Udet , where
U(X) = Udet(X) + r(t),   (2.4)

for

Udet(X) = b_0 + b_1 X + b_2 X² + b_3 X³,   (2.5)
and the parameter values (b0 , b1 , b2 , b3 ) were determined by a least squares fit to the (X, U ) truth
data, to minimise the residuals, r(t). However, Figure 2.2 shows significant scatter about the
deterministic parametrisation — the residuals r(t) are non-zero. This variability can be taken
into account by incorporating a stochastic component, e(t), into the parametrised tendency,
Up .
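A sketch of this measurement, (2.3), and of the least squares cubic fit, (2.5), follows (X_truth is assumed to be a (T, K) array of truth X values sampled every ∆t = 0.005 MTU; names are illustrative):

```python
import numpy as np

def measure_subgrid_tendency(X, dt, F=20.0):
    """Estimate U via (2.3): the resolved part of the tendency minus the
    actual tendency, approximated here by a forward difference."""
    resolved = (-np.roll(X, 1, axis=1) * (np.roll(X, 2, axis=1)
                - np.roll(X, -1, axis=1)) - X + F)
    actual = (X[1:] - X[:-1]) / dt
    return resolved[:-1] - actual           # shape (T-1, K)

U = measure_subgrid_tendency(X_truth, dt=0.005)

# Cubic fit (2.5); all K columns are pooled, using the cyclic symmetry.
x, u = X_truth[:-1].ravel(), U.ravel()
b3, b2, b1, b0 = np.polyfit(x, u, 3)        # coefficients, highest power first
r = u - np.polyval([b3, b2, b1, b0], x)     # residuals r(t) for the noise models
```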
A number of different stochastic parametrisations are considered. These use different statistical models to represent the sub-grid scale variability lost when truncating the Y variables.
The different noise models will be described below.
2.3.2.1 Additive Noise (A)
This work builds on Wilks (2005), where the effects of white and red additive noise were
considered on the skill of the forecast model. The parametrised tendency is modelled as the
deterministic tendency and an additive noise term, e(t):
Up = Udet + e(t).   (2.6)
Figure 2.3: Temporal and spatial autocorrelation for the residuals, r(t), measured from the
truth data (1.6). Figures (a) and (b) show the measured temporal autocorrelation function for
c = 4 and c = 10 respectively as grey triangles. Also shown is the temporal autocorrelation
function for an AR(1) process with different values of the lag-1 autocorrelation, φ (grey dot–
dash, dash, solid and dotted lines). The fitted AR(1) process is indicated by the darker grey
line in each case. Figure (c) shows the measured spatial correlation for the residuals measured
from the truth data. The spatial correlation is close to zero for spatial separation ≠ 0.
The stochastic term, e(t), is designed to represent the residuals, r(t). The temporal and
spatial autocorrelation functions for the residuals are shown in Figure 2.3. The temporal
autocorrelation is significant but the spatial correlation is small. Therefore, the temporal
characteristics of the residuals are included in the parametrisation by modelling the e(t) as
an AR(1) process. The stochastic tendencies for each of the X variables are assumed to be
mutually independent. It is expected that in more complicated systems, including the effects
of both spatial and temporal correlations will be important to accurately characterise the subgrid scale variability. A second order autoregressive process was also considered, but fitting the
increased number of parameters proved difficult, and the resultant improvement over AR(1)
was slight, so it is not discussed further here.
A zero mean AR(1) process, e(t), can be written as (Wilks, 2006):
e(t) = φ e(t − ∆t) + σe (1 − φ²)^{1/2} z(t),   (2.7)
where φ is the first autoregressive parameter (lag-1 autocorrelation), σe² is the variance of the stochastic tendency and z(t) is unit variance white noise: z(t) ∼ N(0, 1). φ and σe can be fitted from the truth time series.
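For example, fitting and simulating the AR(1) process (2.7) might be sketched as follows (r is assumed to hold the residual time series for one X variable at the forecast time step):

```python
import numpy as np

def fit_ar1(r):
    """Lag-1 autocorrelation and standard deviation of the residuals."""
    phi = np.corrcoef(r[:-1], r[1:])[0, 1]
    return phi, r.std()

def ar1_noise(n, phi, sigma, rng):
    """Zero-mean AR(1) process (2.7); stationary variance is sigma**2."""
    e = np.zeros(n)
    for t in range(1, n):
        e[t] = phi * e[t - 1] + sigma * np.sqrt(1 - phi**2) * rng.standard_normal()
    return e

phi, sigma = fit_ar1(r)
e = ar1_noise(10_000, phi, sigma, np.random.default_rng(1))
```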
2.3.2.2 State Dependent Noise (SD)
A second type of noise is considered where the standard deviation of additive noise is dependent
on the value of the X variable. This is called state dependent noise. It can be motivated in
the L96 system by studying Figure 2.2; the degree of scatter about the cubic fit is greater for
large magnitude X values. The parametrised tendency is

Up = Udet + e(t),   (2.8)
where the state dependent standard deviation of e(t) is modelled as
σe = σ1 |X(t)| + σ0.   (2.9)
As Figure 2.3 shows a large temporal autocorrelation, it is unlikely that white state dependent
noise will adequately model the residuals. Instead, e(t) will be modelled as an AR(1) process:
e(t) = [σe(t)/σe(t − ∆t)] φ e(t − ∆t) + σe(t) (1 − φ²)^{1/2} z(t),   (2.10)
where the time dependency of the standard deviation and the requirement that e(t) must be
a stationary process have motivated the functional form.
The parameters σ1 and σ0 can be estimated by binning the residuals according to the
magnitude of X and calculating the standard deviation in each bin. The lag-1 autocorrelation
was estimated from the residual time series.
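A sketch of that estimation (the bin count and the minimum bin occupancy are illustrative choices):

```python
import numpy as np

def fit_state_dependent_sigma(X, r, nbins=20):
    """Fit sigma_e = sigma1*|X| + sigma0 (2.9): bin the residuals on |X|
    and fit a straight line to the per-bin standard deviations."""
    absx = np.abs(X)
    edges = np.linspace(absx.min(), absx.max(), nbins + 1)
    centres, stds = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (absx >= lo) & (absx < hi)
        if mask.sum() > 10:                # skip sparsely populated bins
            centres.append(0.5 * (lo + hi))
            stds.append(r[mask].std())
    sigma1, sigma0 = np.polyfit(centres, stds, 1)
    return sigma1, sigma0
```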
2.3.2.3 Multiplicative Noise (M)
Multiplicative noise has been successfully implemented in the ECMWF NWP model using the
SPPT scheme, and has been shown to improve the skill of the forecasting system (Buizza et al.,
1999). Therefore it is of interest whether a parametrisation scheme involving multiplicative
noise could give significant improvements over additive stochastic schemes in the L96 system.
The parametrisation proposed is
Up = (1 + e(t)) Udet,   (2.11)
where e(t) is modelled as an AR(1) process, given by (2.7).
The parameters in this model can be estimated by forming a time series of the truth
“residual ratio”, Rk , that needs to be represented:
Rk + 1 = U/Udet.   (2.12)
However, whenever Udet approaches zero, the residual ratio tends to infinity. Therefore, the
time series was first filtered such that only sections away from Udet = 0 were considered, and
the temporal autocorrelation and standard deviation estimated from these sections.
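That filtering might be sketched as follows (the threshold is an illustrative choice; lag-1 pairs spanning a filtered-out gap slightly contaminate the estimate and are ignored here for brevity):

```python
import numpy as np

def fit_multiplicative(U, Udet, threshold=1.0):
    """Fit phi and sigma for the multiplicative noise from the residual
    ratio (2.12), keeping only samples where Udet is away from zero."""
    keep = np.abs(Udet) > threshold
    Rk = U[keep] / Udet[keep] - 1.0
    phi = np.corrcoef(Rk[:-1], Rk[1:])[0, 1]
    return phi, Rk.std()
```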
For multiplicative noise, it is assumed that the standard deviation of the true tendency is
proportional to the parametrised tendency, such that when the parametrised tendency is zero
the uncertainty in the tendency is zero. Figure 2.2 shows that multiplicative noise does not
appear to be a good model for the uncertainty in the L96 system as the uncertainty in the true
tendency is large even when Udet is zero. In the case of the L96 system, the uncertainty when
Udet is zero is likely to be because the deterministic parametrisation has not fully captured
the behaviour of the Y variables. Time stepping errors may also contribute to this error.
Nevertheless, multiplicative noise is investigated here.
2.3.2.4 Multiplicative and Additive Noise (MA)
Figure 2.2 motivates a final stochastic parametrisation scheme for testing in the L96 system,
which will include both multiplicative and additive noise terms. This represents the uncertainty
in the parametrised tendency even when the deterministic tendency is zero. This type of
uncertainty has been observed in coarse-graining studies. For example, Shutts and Palmer
(2007) observed that the standard deviation of the true heating in a coarse gridbox does not
go to zero when Q, the parametrised heating, is zero. This type of stochastic parametrisation
can also be motivated by considering errors in the time stepping scheme, which will contribute
to errors in the total tendency even if the sub-grid scale tendency is zero.
When formulating this parametrisation, the following points were considered:
1. In a toy model setting, random number generation is computationally cheap. However in
a weather or climate prediction model, generation of spatially and temporally correlated
fields of random numbers is comparatively expensive, and two separate generators must
be used if two such fields are required. It is therefore desirable to use only one random number per time step so that the parametrisation can be further developed for use in an atmospheric model.

2. The fewer parameters there are to fit, the less complicated the methodology required to fit them, and the easier it will be to apply this method to a more complex system such as an atmospheric model. This also avoids overfitting.

Full Name                   Abbr.  Functional Form                                        Measured Parameters
                                                                                          c = 4                c = 10
Additive                    A      Up = Udet + e(t)                                       φ = 0.993            φ = 0.986
                                   e(t) = φ e(t−∆t) + σ (1−φ²)^{1/2} z(t)                 σ = 2.12             σ = 1.99
State Dependent             SD     Up = Udet + e(t)                                       φ = 0.993            φ = 0.989
                                   e(t) = (σt/σt−∆t) φ e(t−∆t) + σt (1−φ²)^{1/2} z(t)     σ0 = 1.62            σ0 = 1.47
                                   where σt = σ1 |X(t)| + σ0                              σ1 = 0.078           σ1 = 0.0873
Multiplicative              M      Up = (1 + e(t)) Udet                                   φ = 0.950            φ = 0.940
                                   e(t) = φ e(t−∆t) + σ (1−φ²)^{1/2} z(t)                 σ = 0.746            σ = 0.469
Multiplicative & Additive   MA     Up = Udet + e(t)                                       φ = 0.993            φ = 0.988
                                   e(t) = ǫ(t) (σm |Udet| + σa)                           σm = 0.177           σm = 0.101
                                   where ǫ(t) = ǫ(t−∆t) φ + (1−φ²)^{1/2} z(t)             σa = 1.55            σa = 1.37

Table 2.2: Stochastic parametrisations of the sub-grid tendency, U, used in this experiment, and the values of the model parameters fitted from the truth time series.
The most general form of additive and multiplicative noise is considered:
Up = (1 + ǫm) Udet + ǫa = Udet + (ǫm Udet + ǫa),   (2.13)
where ǫm is the multiplicative noise term, and ǫa is the additive noise term. This can be written
as pure additive noise:
Up = Udet + e(t),   (2.14)

where

e(t) = ǫm(t) Udet + ǫa(t).   (2.15)
Following point (1) above, it is assumed that ǫm (t) and ǫa (t) are the same random number,
ǫ(t):

e(t) = ǫ(t) (σm Udet + σa),   (2.16)
where ǫ(t) has been scaled using the standard deviations of the multiplicative and additive
noise, σm and σa respectively. In the current form, (2.16) is not symmetric about the origin with
respect to Udet . The standard deviation of the stochastic tendency is zero when σm Udet = −σa .
Therefore, Udet in the above equation will be replaced with |Udet |:
e(t) = ǫ(t) (σm |Udet| + σa),   (2.17)

where

ǫ(t) = ǫ(t − ∆t) φ + (1 − φ²)^{1/2} z(t).   (2.18)
This does not change the nature of the multiplicative noise as ǫ(t) has mean zero, but the additive part of the noise will act in the same direction as the multiplicative part. ǫ(t) is modelled as an AR(1) process of unit variance. The parameters are fitted from the residual time series.
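A sketch of the resulting noise term, (2.17)–(2.18), for one X variable (names illustrative):

```python
import numpy as np

def ma_noise(Udet_series, sigma_m, sigma_a, phi, rng):
    """e(t) = eps(t) * (sigma_m*|Udet| + sigma_a), where eps(t) is a single
    unit-variance AR(1) stream shared by the two noise terms (2.17)-(2.18)."""
    eps = rng.standard_normal()            # start from the stationary distribution
    e = np.empty(len(Udet_series))
    for t, udet in enumerate(Udet_series):
        eps = phi * eps + np.sqrt(1 - phi**2) * rng.standard_normal()
        e[t] = eps * (sigma_m * abs(udet) + sigma_a)
    return e
```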
Figure 2.4: Weather Skill Scores (RPSS and IGNSS) for a forecast model with an additive
AR(1) stochastic parametrisation for (a) and (b) the c = 4 case, and (c) and (d) the c = 10
case. The skill scores were evaluated as a function of the tuneable parameters in the model:
the lag-1 autocorrelation, φ, and the standard deviation of the noise, σ. The black crosses
indicate the measured parameter values. The contour interval is 0.01 for (a) and (c), and 0.1
for (b) and (d). All skill scores are calculated at a lead time of 0.6 MTU (3 atmospheric days).
The possibility of using an additive and multiplicative noise scheme in the ECMWF NWP
model is discussed in Section 5.3.
The different stochastic parametrisations used in this experiment are summarised in Table
2.2, together with the parameters measured from the truth time series.
2.4 Weather Forecasting Skill
The stochastic parametrisation schemes are first tested on their ability to predict the “weather”
of the L96 system, and represent the uncertainty in their predictions correctly. An ensemble of
40 members is generated for each of the 300 initial conditions on the attractor. Each ensemble
member is initialised from the perfect initial conditions defined by the “truth” time series.
Each stochastic parametrisation involves two or more tunable parameters which may be
estimated from the “truth” time series. In addition to the measured parameter values, many
other parameter settings were considered, and the skill of the parametrisation evaluated for
each setting using three scalar skill scores. The RPS (Section 1.7.1.2) was evaluated for a
ten-category forecast, where the categories were defined as the ten deciles of the climatological
distribution. IGN (Section 1.7.1.3) was evaluated using the method suggested by Roulston
and Smith (2002). The BS (Section 1.7.1.1) was evaluated for the event “the X variable is in
the upper tercile of the climatological distribution”. The BS is mathematically related to the
RPS (Wilks, 2006), and gave very similar results to the RPS, so is not shown here for brevity.
A skill score was calculated for each score with respect to climatology. Forecasts were verified
at a lead time of 0.6 units, equivalent to 3 atmospheric days.
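As an illustration, the RPS and the skill score with respect to climatology might be computed as follows (a sketch: climatology, ens_forecasts and obs are assumed to hold a long climatological sample, the 40-member forecasts and the verifying truth values):

```python
import numpy as np

def rps(p, ob, edges):
    """Ranked probability score: squared distance between the forecast and
    observed cumulative probabilities over the ten categories."""
    o = np.zeros_like(p)
    o[np.searchsorted(edges, ob)] = 1.0
    return np.sum((np.cumsum(p) - np.cumsum(o)) ** 2)

def ens_probs(ens, edges):
    """Fraction of ensemble members falling in each decile category."""
    return np.bincount(np.searchsorted(edges, ens), minlength=10) / len(ens)

edges = np.percentile(climatology, np.arange(10, 100, 10))  # 9 decile edges
p_clim = np.full(10, 0.1)                  # the climatological forecast
rps_f = np.mean([rps(ens_probs(e, edges), o, edges)
                 for e, o in zip(ens_forecasts, obs)])
rps_c = np.mean([rps(p_clim, o, edges) for o in obs])
rpss = 1.0 - rps_f / rps_c                 # skill with respect to climatology
```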
Figure 2.4 shows the calculated skill scores for a forecast model with an additive AR(1)
stochastic parametrisation, for both the c = 4 and c = 10 cases. There is a broad peak in
forecasting skill according to each skill score, with a range of parameter settings scoring highly.
The shape of the peak in forecasting skill is qualitatively similar in each case, but the peak is lower for the
c = 4 case. The closer time scale separation of the c = 4 case is harder to parametrise, so a lower skill is to be expected.

Figure 2.6: The skill of the different parametrisations according to (a) RPSS and (b) IGNSS are compared for the c = 4 and c = 10 cases. The parameters in each parametrisation have been estimated from the truth time series. In each case, “White” indicates that the measured standard deviations have been used, but the autocorrelation parameters set to zero.
IGNSS shows a different behaviour to RPSS. Ignorance heavily penalises an underdispersive ensemble, but does not heavily penalise an overdispersive ensemble. This asymmetry is
observed in the contour plot for IGNSS — the peak is shifted upwards compared to the peak
for RPSS, and deterministic parametrisations have negative skill (are worse than climatology).
The very large magnitude and high autocorrelation noise parametrisations are not penalised,
but score highly, despite being overdispersive.
The RPSS may be decomposed explicitly into reliability, resolution and uncertainty components (Wilks, 2006). This decomposition (not shown) demonstrates that the deterministic
and low amplitude noise parametrisations score highly on their resolution, but poorly for their
reliability, and the converse is true for the large amplitude, highly autocorrelated noise parametrisations. The peak in skill according to the RPSS corresponds to parametrisations which
score reasonably well on both accounts.
A number of important parameter settings can be identified on Figure 2.4. The first corresponds to the deterministic parametrisation, which occurs on the x-axis where the standard
deviation of the noise is zero. The second corresponds to white noise, which occurs on the
y-axis where the autocorrelation parameter is set to zero. In particular, (φ, σ/σmeas ) = (0, 1)
corresponds to additive white noise with a magnitude fitted to the truth time series. The third
setting is the measured parameters, marked by a black cross. Comparing the skill of these
three cases shows an improvement over the deterministic scheme as first white noise, then red
noise is included in the parametrisation.
Figure 2.5: Comparing the RPSS for the c = 4 and c = 10 cases for the different stochastic parametrisations. For c = 4, (a) the skill of the state dependent (SD) additive parametrisation is shown for different values of the noise standard deviations, σ1 and σ0, with the lag-1 autocorrelation set to φ = φmeas. (b) The skill of the pure multiplicative (M) noise is shown for different values of the lag-1 autocorrelation, φ, and the magnitude of the noise, σ. The parametrisation scheme was found to be numerically unstable for σ > 2σmeas. (c) The skill of the additive and multiplicative (MA) parametrisation is shown for different values of the noise standard deviations, σm and σa, with the lag-1 autocorrelation set to φ = φmeas. The equivalent figures for c = 10 are shown in (d)–(f). In all cases, the measured parameters are indicated by the black cross. The contour interval is 0.01 in each case. The skill scores were evaluated at a lead time of 0.6 model time units (3 atmospheric days).

The RPSS calculated for the other stochastic parametrisation schemes is shown in Figure 2.5. The contour plots for IGNSS are comparable, so are not shown for brevity. The forecasts are more skilful for the c = 10 case, but the results are qualitatively similar. This result is as expected: the closer time scale separation for the c = 4 case is harder to parametrise, so the
forecast models perform less well than for the c = 10 case. For both cases considered, for all
parametrisation schemes, including a stochastic term in the parametrisation scheme results in
an improvement in the skill of the forecast over the deterministic scheme. This result is robust
to error in the measurement of the parameters — a range of parameters in each forecast model
gave good skill scores. This is encouraging, as it indicates that stochastic parametrisations
could be useful in modelling the real atmosphere, where noisy data restrict how accurately
these parameters may be estimated.
The results are summarised in Figure 2.6. For each parametrisation, the skill with the measured parameters is shown both when no temporal autocorrelation is used (“white” noise) and when the measured temporal autocorrelation characteristics are used. The significance of the difference between pairs of parametrisations was estimated using a Monte-Carlo technique. See
Appendix A for more details. For example, there is no significant difference between the RPS
for AR(1) MA noise and for AR(1) SD noise for the c = 10 case, but AR(1) M noise gave a
significant improvement over both of these.
The stochastic parametrisations are significantly more skilful than the deterministic parametrisation in both the c = 4 and c = 10 cases. For the c = 4 case, the more complicated
parametrisations show a significant improvement over simple additive noise, especially the multiplicative noise. For the closer time scale separation, the more accurate the representation of
the sub-grid scale forcing, the higher the forecast skill. For the c = 10 case, the large time
scale separation allows the deterministic parametrisation to have reasonable forecasting skill,
and a simple representation of sub-grid variability is sufficient to represent the uncertainty in
the forecast model; the more complicated stochastic parametrisations show little improvement
over simple additive AR(1) noise.
Traditional deterministic parametrisation schemes are a function of the grid scale variables
at the current time step only. If a stochastic parametrisation needs only to represent the sub-
grid- and time-scale variability, the white noise schemes would be adequate. However, for both time scale separations, the skill of stochastic parametrisations which include a temporal autocorrelation is significantly higher than those which use white noise. This challenges the standard idea that a parametrisation should only represent sub-grid scale and sub-time step variability: including temporal autocorrelation accounts for the effects of the sub-grid scale at time scales greater than the model time step. In the L96 system, the spatial correlations are low. However, in an atmospheric situation, it is likely that spatial correlations will be significant, and a stochastic parametrisation must account for the effects of the sub-grid at scales larger than the spatial discretisation scale.

Figure 2.7: (a) RMS error-spread graphical diagnostic for the c = 4 case (Section 1.7.3.1). For (i) a deterministic forecast model, (ii) the parametrisation scheme using additive white noise, and (iii) the parametrisation scheme which includes the measured temporal autocorrelation in the stochastic term. (b) The skill of the different parametrisations according to REL for the c = 4 and c = 10 cases. The smaller the REL, the more reliable the forecast. The parameters in each parametrisation have been estimated from the truth time series. In each case, “White” indicates that the measured standard deviations have been used, but the autocorrelation parameters set to zero.
2.5 Representation of Model Uncertainty
In this idealised experiment, the forecast integrations are initialised from perfect initial conditions, leaving model uncertainty as the only source of error in the forecast. The forecast
ensembles use stochastic parametrisations to represent this uncertainty. The RMS error-spread
graphical diagnostic (described in Section 1.7.3.1) is a useful test of reliability, and therefore
a good indicator of how well the forecast model represents uncertainty. Figure 2.7(a) shows
this diagnostic for a selection of the parametrisation schemes tested for the c = 4 case. For a
well calibrated ensemble, the points should lie on the diagonal. A clear improvement over the
deterministic scheme is seen as first white, then red additive noise is included in the parametrisation scheme.
Visual forecast verification measures are limited when comparing many different models as
they do not give an unambiguous ranking of the performance of these models. Therefore, the
reliability component of the Brier Score (1.17), REL, is also considered. This is a scalar scoring
rule which tests the reliability of an ensemble forecast at predicting an event. The event is
defined to be “in the upper tercile of the climatological distribution”. The smaller the REL,
the closer the forecast probability is to the average observed frequency, and the more reliable the forecast.
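A sketch of this calculation for a 40-member ensemble (p_fc, the forecast probability of the event, and outcomes, the binary verification, are assumed inputs):

```python
import numpy as np

def brier_reliability(p_fc, outcomes, n_ens=40):
    """Reliability component of the Brier score: group forecasts by issued
    probability (41 possible values for a 40-member ensemble) and compare
    each group's probability with its observed frequency."""
    rel = 0.0
    for k in range(n_ens + 1):
        pk = k / n_ens
        mask = np.isclose(p_fc, pk)
        if mask.any():
            obar = outcomes[mask].mean()   # observed frequency in this group
            rel += mask.sum() * (pk - obar) ** 2
    return rel / len(p_fc)
```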
The results are summarised in Figure 2.7(b). The REL score indicates that the different
AR(1) noise terms all perform similarly. A significant improvement is observed when temporal autocorrelation is included in the parametrisation, particularly for the c = 4 case. This
improvement between white and red stochastic models is much greater than the difference
between deterministic and white stochastic models, whereas the RPSS indicated a similar improvement as first white then red noise schemes were tested. This indicates that while overall
forecast skill can be improved to a reasonable extent using a white noise stochastic scheme, for
reliable forecasts it is very important that the stochastic parametrisation includes temporally
correlated noise as this captures the behaviour of the unresolved sub-grid scale variables better.
2.6 Perturbed Parameter Ensembles in the Lorenz ’96 System
It is of interest to determine whether a perturbed parameter ensemble can also provide a
reliable measure of model uncertainty in the Lorenz ’96 system. The four measured parameters
(b_0^meas, b_1^meas, b_2^meas, b_3^meas) defining the cubic polynomial are perturbed to generate a 40 member ensemble. The skill of this representation of model uncertainty is evaluated as for the stochastic parametrisations.
Following Stainforth et al. (2005), each of the four parameters is set to one of three values:
low (L), medium (M) or high (H). The degree to which the parameters should be varied is estimated from the truth time series. The measured U(X) is split into sections 3 MTU long, and a cubic polynomial is fitted to each section. The measured variability in each of the parameters is defined to be the standard deviation of the parameters fitted to each section, σ(b_i^samp). The measured standard deviations are shown in Table 2.3. The low, medium and high values of the parameters are given by:

L = b_i^meas − S σ(b_i^samp),
M = b_i^meas,
H = b_i^meas + S σ(b_i^samp),   (2.19)

where the scale factor, S, can be varied to test the sensitivity of the scheme. There are 3⁴ = 81 possible permutations of the parameter settings, from which a subset of 40 permutations was selected to sample the uncertainty. This allows for a fair comparison to be made with the stochastic parametrisations, which also use a 40 member ensemble. The selected permutations are shown in Table 2.4.

Figure 2.8: The ensemble of deterministic parametrisations used to represent model uncertainty in the perturbed parameter ensemble (solid lines), compared to the measured sub-grid tendency, U, as a function of the grid scale variables, X (circles), for both the (a) c = 4 and (b) c = 10 cases. The degree to which the parameters are perturbed has been estimated from the truth time series in each case (S = 1).

                 c = 4        c = 10
b_0^meas         −0.198       0.341
σ(b_0^samp)      0.170        0.146
b_1^meas         0.575        1.30
σ(b_1^samp)      0.0464       0.0381
b_2^meas         −0.00550     −0.0136
σ(b_2^samp)      0.00489      0.00901
b_3^meas         −0.000223    −0.00235
σ(b_3^samp)      0.000379     0.000650

Table 2.3: Measured parameters defining the cubic polynomial, (b_0^meas, b_1^meas, b_2^meas, b_3^meas), and the variability of these parameters, σ(b_i^samp), calculated by sampling from the truth time series, for the c = 4 and c = 10 cases.
The same “truth” model is used as for the stochastic parametrisations, and the forecast
model is constructed in an analogous way: only the X variables are assumed resolved, and
the effects of the unresolved sub-grid scale Y variables are represented by an ensemble of
deterministic parametrisations:
U_p^p(Xk) = b_0^p + b_1^p Xk + b_2^p Xk² + b_3^p Xk³,   (2.20)

where the values of the perturbed parameters, b_i^p, vary between ensemble members. The
scale factor, S, in (2.19) is varied to investigate the effect on the skill of the forecast. The
ensemble of deterministic parametrisations is shown in Figure 2.8 where the degree of parameter
perturbation has been measured from the truth time series (i.e. S = 1). The truncated model
is integrated using an adaptive second order Runge-Kutta scheme.
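A sketch of the ensemble construction, (2.19) and (2.20), follows (perms is assumed to hold the 40 rows of Table 2.4 with H, M and L mapped to +1, 0 and −1; the example coefficients are the c = 4 values from Table 2.3):

```python
import numpy as np

def perturbed_parameters(b_meas, b_sigma, perms, S=1.0):
    """Cubic coefficients for each member, (2.19):
    b_i = b_i^meas + {+1, 0, -1} * S * sigma(b_i^samp)."""
    return b_meas + perms * S * b_sigma    # shape (40, 4)

def U_pp(X, bp):
    """Deterministic sub-grid tendency (2.20) for one ensemble member."""
    b0, b1, b2, b3 = bp
    return b0 + b1 * X + b2 * X**2 + b3 * X**3

b_meas  = np.array([-0.198, 0.575, -0.00550, -0.000223])   # c = 4 (Table 2.3)
b_sigma = np.array([0.170, 0.0464, 0.00489, 0.000379])
members = perturbed_parameters(b_meas, b_sigma, perms, S=1.0)
```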
2.6.1 Weather Prediction Skill
The skill of the ensemble forecast is evaluated using the RPSS and IGNSS at a lead time of
0.6 model time units for both the c = 4 and c = 10 cases. The results are shown in Figure 2.9.
Selected Permutations

Number  b0  b1  b2  b3      Number  b0  b1  b2  b3
1       H   H   H   H       21      M   M   L   H
2       H   H   H   L       22      M   M   L   L
3       H   H   M   M       23      M   L   H   M
4       H   H   L   H       24      M   L   M   H
5       H   H   L   L       25      M   L   M   L
6       H   M   H   M       26      M   L   L   M
7       H   M   M   H       27      L   H   H   H
8       H   M   M   L       28      L   H   H   L
9       H   M   L   M       29      L   H   M   M
10      H   L   H   H       30      L   H   L   H
11      H   L   H   L       31      L   H   L   L
12      H   L   M   M       32      L   M   H   M
13      H   L   L   H       33      L   M   M   H
14      H   L   L   L       34      L   M   M   L
15      M   H   H   M       35      L   M   L   M
16      M   H   M   H       36      L   L   H   H
17      M   H   M   L       37      L   L   H   L
18      M   H   L   M       38      L   L   M   M
19      M   M   H   H       39      L   L   L   H
20      M   M   H   L       40      L   L   L   L

Table 2.4: Chosen permutations for the perturbed parameter experiment. H, M and L represent the High, Medium and Low settings respectively.
Figure 2.9: Weather forecasting skill scores for the perturbed parameter model as a function
of the scale factor, S. The skill of the c = 4 forecasts according to (a) RPSS, (b) IGNSS and
(c) REL. The equivalent results for c = 10 are shown in (d)–(f). The contour interval is 0.01
in (a) and (d), 0.1 in (b) and (e), and 0.001 in (c) and (f). The skill scores are calculated at a
lead time of 0.6 MTU in each case.
[Figure: panels (a) c = 4 and (b) c = 10; each panel shows RMS error against RMS spread for increasing scale factor, S.]
Figure 2.10: RMS Spread vs. RMS Error plots for the perturbed parameter ensemble as a function of the scale factor, S. The separate figures correspond to S = [0.0, 0.4, 0.8, 1.2, 1.6, 2.0].
The measured parameter perturbations are indicated with a black cross in each case.
Both RPSS and IGNSS indicate that the measured perturbed parameter ensemble is significantly less skilful than the stochastic ensemble for both the c = 4 and c = 10 cases. Figures 2.9(c)
and (f) show the reliability component of the Brier Score calculated for the perturbed parameter ensembles. Comparing these figures with Figure 2.7(b), REL for the perturbed parameter
schemes is greater, indicating that the perturbed parameter ensemble forecasts are less reliable
than the stochastic parametrisation forecasts, and the ensemble is a poorer representation of
model uncertainty. The significance of the difference between skill scores for the measured
stochastic parametrisation schemes and the perturbed parameter schemes with the measured
perturbations (S = 1) are shown in Appendix A.
The reliability of the forecast ensemble is also considered using the RMS error-spread
diagnostic as a function of the scale factor (Figure 2.10). For small scale factors, the ensemble
is systematically underdispersive for both the c = 4 and c = 10 cases. For larger scale
factors for the c = 10 case, the ensemble is systematically overdispersive for large errors,
and underdispersive for small errors. Comparing Figure 2.10 with Figure 2.7(a) shows that
none of the perturbed parameter ensembles are as reliable as the AR(1) additive stochastic
parametrisation, reflected in the poorer REL score for the perturbed parameter case.
2.7 Conclusion
Several different stochastic parametrisation schemes were investigated using the Lorenz ’96
(L96) system. All showed an improvement in weather forecasting skill over deterministic
parametrisations. This result is robust to error in measurement of the parameters — scanning
over parameter space indicated a wide range of parameter settings gave good skill scores.
Importantly, stochastic parametrisations have been shown to represent the uncertainty in a
forecast due to model deficiencies accurately, as demonstrated by an increase in the reliability
of the forecasts.
A significant improvement in the skill of the forecast models was observed when the
stochastic parametrisations included temporal autocorrelation in the noise term. This challenges the notion that a parametrisation scheme should only represent sub-grid scale (both
temporal and spatial) variability. The coupling of scales in a complex system means a successful parametrisation must represent the effects of the sub-grid scale processes acting on spatial
and time scales greater than the truncation level.
Stochastic representations of model uncertainty are shown to outperform perturbed parameter ensembles in the L96 system. They have improved short term forecasting skill and are
more reliable than perturbed parameter ensembles.
The L96 system is an excellent tool for testing developments in stochastic parametrisations.
These ideas can now be applied to numerical weather prediction models and tested on the
atmosphere.
3 The Lorenz ’96 System: Climatology and Regime Behaviour
I believe that the ultimate climatic models ... will be stochastic, i.e., random numbers
will appear somewhere in the time derivatives.
– Ed Lorenz, 1975
3.1 Introduction
There is no formal definition of climate. The climate of a region can be defined in terms of the
long term statistics of weather in that region, including both the mean and the variability. The
World Meteorological Organisation defines thirty years as a suitable period of time for calculating these statistics. The calculated “climate” will be sensitive to this selected period, especially
if the climate is not stationary. Lorenz (1997) proposes several definitions of climate motivated
by dynamical systems theory, including suggesting that “the climate [can be identified as] the
attractor of the dynamical system”. Whereas weather forecasting is considered to be an initial
value problem, climate is often considered to be a boundary condition problem (Bryson, 1997),
with climate change occurring in response to changes in the boundary conditions.
As well as being difficult to define, it is difficult to verify climate projections. Probabilistic
weather forecasts can be verified since many forecast-observation pairs are available to test the
statistics of the forecast. This is not possible for climate predictions as there is only one evolution of the climate, and we must wait many years for new data to become available. However,
the strong coupling between different temporal scales (Section 1.2) has motivated investigation
into “seamless prediction”, whereby climate projections are evaluated or constrained through
study of the climate model’s ability to predict shorter time-scale atmospheric events (Palmer
et al., 2008). This is possible due to the non-linearity of the atmosphere, which allows interactions between different temporal and spatial scales. Fast diabatic processes on day-long time
scales (such as radiative and cloud effects) can ultimately affect the cryosphere and biosphere
with time scales of many years. In fact, most of the uncertainty in our climate projections has
been attributed to uncertainty in the representation of cloud feedbacks, which operate on the
shortest time scales (Solomon et al., 2007).
There are several different methods which use this idea to verify climate projections. Rodwell and Palmer (2007) use the initial tendencies in NWP models to assess how well different
climate models represent the model physics. Through studying the six hour tendencies, Rodwell and Palmer were able to discount results from perturbed parameter experiments since
they resulted in unphysical fast scale physics, allowing the associated climate projections to be
proven false. Palmer et al. (2008) use the reliability of seasonal forecasts to verify and calibrate
climate change projections. They argue that it is a necessary requirement that a climate model
is reliable when run in weather prediction mode. The requirement is not sufficient as NWP
models do not include the longer time scale physics of the cryosphere and biosphere which are
nevertheless important for accurate climate prediction. Errors in representations of small-scale
features, testable in a numerical weather prediction model, will manifest themselves as errors in
large-scale features predicted by a climate model. Having analysed the reliability of members
of a MME, Palmer et al. (2008) calibrate the climate change projections to discount unreliable
models. The Transpose-Atmospheric Model Intercomparison Project (Transpose-AMIP) uses
climate models to make weather forecasts. The source of systematic errors in climate simulations is identified and corrected using these short term forecasts, which results in improvements
across all scales (Martin et al., 2010).
In the L96 system, the hypothesis of “seamless prediction” can be rigorously tested. The
same deterministic, stochastic and perturbed parameter forecast models can be used to make
weather and climate predictions, and the skill of these forecasts verified by comparison with
the full “truth” model.
In the L96 system, the boundary conditions are held constant, and very long integrations
can be used to define the climate. We consider two definitions of “climate” in the L96 system.
The first is that the climate of L96 can be described as the pdf of the X variables calculated
over a sufficiently long time window. This will include information about both the mean and
the variability, as required for the conventional definition of climate outlined above. The second
definition will consider the presence and predictability of regime behaviour in the L96 system.
This is a dynamics-driven definition of climate as it includes information about the temporal
statistics of weather. It uncovers some of the characteristics of the system’s attractor, following
Lorenz’s definition of climate given above.
In Section 3.2, the forecast models outlined in Chapter 2 are evaluated according to their
skill at reproducing the pdf of the atmosphere. The “seamless prediction” hypothesis is tested
by comparing the climatological skill with the weather prediction skill, evaluated in Chapter 2.
In Section 3.3, atmospheric regime behaviour is introduced, and a subset of the L96 forecast
models are tested on their ability to reproduce the regime behaviour of the L96 system. For
each definition of climate, the performance of stochastic and perturbed parameter models will
be compared with the results from the full “truth” system.
3.2 Climatological Skill: Reproducing the pdf of the Atmosphere
The climatology of the L96 system is defined to be the pdf of the X variables, averaged over
a long run (10,000 model time units ∼ 140 “atmospheric years”), and the forecast climatology
is defined in an analogous way. The skill at predicting the climatology can then be quantified
by measuring the difference between these two pdfs, which may be evaluated in several ways.
The Kolmogorov-Smirnov (KS) statistic, Dks , has been used in this context in several other
studies (Wilks, 2005; Kwasniok, 2012), where
Dks = max_{Xk} |P(Xk) − Q(Xk)|.   (3.1)
Here P is the forecast cumulative distribution function, and Q is the verification cumulative distribution function. A second measure, the Hellinger distance, DHell, was also calculated for each forecast model:
D²Hell(p, q) = (1/2) ∫ (√p(x) − √q(x))² dx,   (3.2)
where p(x) is the forecast pdf, and q(x) is the verification pdf (Pollard, 2002). Similarly, the
Kullback-Leibler (KL) divergence (Dkl ), is defined as
Dkl(p, q) = ∫ p(x) ln(p(x)/q(x)) dx.   (3.3)
This measure is motivated by information theory (Kullback and Leibler, 1951), and is equivalent to the relative entropy between the two distributions (Section 1.7.1.3). For all these
measures, the smaller the measure, the better the match between forecast and verification
climatologies.
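For instance, both measures might be estimated from histograms of long truth and forecast integrations (a sketch; the bin count is an illustrative choice):

```python
import numpy as np

def hellinger_kl(x_fc, x_truth, nbins=100):
    """Hellinger distance (3.2) and KL divergence (3.3) between histogram
    estimates of the forecast and truth climatological pdfs."""
    lo = min(x_fc.min(), x_truth.min())
    hi = max(x_fc.max(), x_truth.max())
    p, edges = np.histogram(x_fc, bins=nbins, range=(lo, hi), density=True)
    q, _ = np.histogram(x_truth, bins=nbins, range=(lo, hi), density=True)
    dx = edges[1] - edges[0]
    d_hell = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)
    mask = (p > 0) & (q > 0)               # empty bins would give log(0)
    d_kl = np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx
    return d_hell, d_kl
```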
The Hellinger distance was found to be a much smoother measure of climatological skill
than the KS statistic as it integrates over the whole pdf. Therefore, the KS statistic has
not been considered further here. Figure 3.1 shows that the Hellinger distance and Kullback-Leibler divergence give very similar results, so for these reasons only the Hellinger distance will be considered. Pollard (2002) shows that the two measures are linked, so this similarity is not surprising.

Figure 3.1: Comparing the Hellinger distance and the Kullback-Leibler divergence as measures of climatological skill. The measures are shown for the additive AR(1) noise parametrisation, as a function of the tuneable parameters, for both (a) and (b) the c = 4 case, and (c) and (d) the c = 10 case. The crosses indicate the measured parameters.

Figure 3.2: The climatological skill of different stochastic parametrisations for the c = 4 and c = 10 cases, as a function of the tuneable parameters in each parametrisation. The crosses indicate the measured parameters in each case. For the SD and MA cases, φ is set to φmeas. The Hellinger distance was calculated between the truth and forecast probability density functions (pdfs) of the X variables. The smaller the Hellinger distance, the better the forecast pdf.
The Hellinger distance is evaluated for the different stochastic parametrisations for the two
cases, c = 4 and c = 10. Figures 3.1 and 3.2 show the results for c = 4 and c = 10 when the tuneable parameters were varied as for the weather skill scores (Section 2.4). The climatological forecasts for the c = 10 case are more skilful than for the c = 4 case, as indicated by smaller Hellinger
distances. The larger time scale separation for the c = 10 case is easier to parametrise, so the
forecast models perform better than for the c = 4 case. The peak of skill is shifted for the c = 4
case compared to the c = 10 case, towards parametrisation schemes with larger magnitude
noise. Qualitatively, the shape of the plots is similar to those for the equivalent weather skill
scores. This suggests that if a parametrisation performs well in weather forecasting mode, it
performs well at simulating the climate. The peak is shifted up and to the right compared to
the RPSS, but is in a similar position to IGNSS and REL.
Figure 3.3: The Hellinger distance for the different parametrisations is compared for the c = 4
and c = 10 cases. The smaller the Hellinger distance, the better the climatological “skill”.
The parameters in each parametrisation have been estimated from the truth time series. In
each case, “White” indicates that the measured standard deviations have been used, but the
autocorrelation parameters set to zero.
The climatological skill for the different parametrisations is summarised in Figure 3.3. As
for the weather forecasting skill, a significant improvement in the climatological skill is observed
when temporal autocorrelation is included in the parametrisations. This is demonstrated in
Appendix A.2, which shows the significance of the difference between the parametrisation
schemes tested when white or red noise is used. The white noise climatologies do not show a
significant improvement over the deterministic climatology for either the c = 4 or the c = 10 case.
The red noise schemes are all significantly better than both the deterministic and white noise
stochastic parametrisation schemes, with the red multiplicative scheme performing significantly
better than all other schemes for both the c = 4 and c = 10 cases.
The climatological skill, as measured by the Hellinger distance, can be compared to the
weather skill scores using scatter diagrams (Figure 3.4). This is of interest, as the seamless
prediction paradigm suggests that climate models could be verified by evaluating the model in
weather forecasting mode. Figures 3.4(a) and 3.4(d) show the relationship between RPSS and
the Hellinger distance. For the c = 10 case, there appears to be a strong negative correlation
between the two. However, the peak in RPSS is offset slightly from the minimum in Hellinger
distance giving two branches in the scatter plot. The c = 4 case can be interpreted as being
positioned at the joining point of the two branches, and shows how using the RPSS as a method
to verify a model’s climatological skill could be misleading.
Figure 3.4: Scores (RPSS, IGNSS, REL) calculated when the forecast model is in weather mode are compared to the climatological skill of the forecast model, as measured by the Hellinger distance. Figures (a)–(c) are for the c = 4 case, and Figures (d)–(f) are for the c = 10 case. The greater the forecast skill score (RPSS, IGNSS), the better the forecast. The lower the reliability score (REL), the more reliable the forecast. The lower the Hellinger distance, the closer the match between the forecast climatology and the true climatology of the system. Each point in the scatter diagram corresponds to a forecast model with one of the tested sets of parameters, (σ, φ). The symbols represent the different parametrisations; the legend corresponds to all panels.

Figures 3.4(b) and 3.4(e) compare IGNSS with the Hellinger distance. The upper branch in
(e) corresponds to the large magnitude high temporal autocorrelation parametrisations which
have a high IGNSS, but a poor climatological skill. This makes IGN unsuitable as an evaluation
method for use in seamless prediction.
Figures 3.4(c) and 3.4(f) show the results for REL. For the c = 10 case, there is a strong
correlation between REL and the Hellinger distance. For the c = 4 case, a small Hellinger
distance is conditional on having a small value for REL, but a small REL does not guarantee
a small Hellinger distance. This indicates that reliability in weather forecasting mode is a
necessary but not a sufficient requirement of a good climatological forecast, as was suggested
by Palmer et al. (2008). The results indicate that REL is a suitable score for use in seamless
prediction. It is not surprising that the REL is well suited for this task, as it is particularly
sensitive to the reliability of an ensemble, which is the characteristic of a weather forecast
which is important for climate prediction (Palmer et al., 2008). The other weather skill scores
studied put too much weight on resolution to be used for this purpose.
3.2.1 Perturbed Parameter Ensemble
The climatology of the perturbed parameter ensemble was evaluated for each value of the scale
factor, S, which determines the degree to which the parameters are perturbed following
(2.19). The climatology of the perturbed parameter ensemble must include contributions from
each of the 40 ensemble members. Therefore, the climatology is defined as the pdf of the X
variables, averaged over the same total number of model time units as the stochastic parametrisations (10,000); each of the 40 ensemble members is integrated for 250 model time units.
The Hellinger distance between the truth and forecast climatologies can then be calculated as
a function of the scale factor (Figure 3.5).
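For illustration, the Hellinger distance between two climatologies estimated from long model runs might be computed as in the sketch below. This assumes that (3.2) takes the standard form $D_{Hell}^2 = \frac{1}{2}\int (\sqrt{p} - \sqrt{q})^2\,dx$; the histogram-based density estimate and the bin count are illustrative choices, not those of this study.

```python
import numpy as np

def hellinger_distance(x_truth, x_forecast, n_bins=50):
    """Hellinger distance between two pdfs estimated by histograms
    sharing a common set of bin edges."""
    lo = min(x_truth.min(), x_forecast.min())
    hi = max(x_truth.max(), x_forecast.max())
    edges = np.linspace(lo, hi, n_bins + 1)
    p, _ = np.histogram(x_truth, bins=edges, density=True)
    q, _ = np.histogram(x_forecast, bins=edges, density=True)
    widths = np.diff(edges)
    # D^2 = (1/2) * integral of (sqrt(p) - sqrt(q))^2 dx
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2 * widths))
```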
For the c = 10 case, the climatology of the measured perturbed parameter ensemble is
significantly worse than all red noise stochastic parametrisations (see Appendix A.2). This is
as predicted by the “seamless prediction” paradigm; the perturbed parameter ensembles are
less reliable than the stochastic parametrisations, and so predict a less accurate climatology.
However, for the c = 4 case, the pdf of the measured perturbed parameter ensemble is not
significantly different to the red noise stochastic parametrisation schemes; reliability is not
sufficient for a good climatological forecast according to the Hellinger distance.
Figure 3.5: The Hellinger distance between the truth and forecast climatologies of the perturbed parameter model as a function of the scale factor, S. The smaller the Hellinger distance,
the better the predicted climatology. The S > 2.4 and S > 1.2 forecast models for the c = 4
and c = 10 cases respectively are numerically unstable over long integrations, so a climatology
could not be calculated.
3.3 Climatological Skill: Regime Behaviour
The presence of regimes is a characteristic of non-linear, chaotic systems (Lorenz, 2006). In
the atmosphere, regimes emerge as familiar circulation patterns such as the El-Niño Southern
Oscillation (ENSO), the North Atlantic Oscillation (NAO) and Scandinavian Blocking events.
More generally, a regime can be defined as “a region of state space that is more populated than
neighbouring regions” (Stephenson et al., 2004). Identifying this localised clustering in state
space is a non-trivial statistical problem (Stephenson et al., 2004), but can be achieved using a
clustering algorithm such as k-means clustering¹ (Dawson et al., 2012; Pohl and Fauchereau,
2012; Straus et al., 2007) or by estimating the pdf of the distribution and searching for multiple
maxima (Corti et al., 1999).

¹K-means clustering partitions n datapoints into k clusters such that each observation is assigned to the cluster with the nearest mean. This is performed by an iterative procedure, which minimises the sum of the squared differences between each data point and the mean of the cluster it belongs to. The resultant clusters may depend on the initial clusters chosen at random, so the clustering process is repeated a large number of times and the optimal clusters are selected.
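As an illustration of the procedure described in the footnote, a basic k-means with random restarts might look as follows. This is a sketch: the function name, the number of restarts and iterations, and the convergence test are assumptions, and published studies typically rely on standard library implementations.

```python
import numpy as np

def kmeans(data, k, n_restarts=50, n_iter=100, seed=0):
    """Basic k-means with random restarts; returns the partition with
    the smallest within-cluster sum of squared distances."""
    rng = np.random.default_rng(seed)
    best_inertia, best_labels, best_means = np.inf, None, None
    for _ in range(n_restarts):
        means = data[rng.choice(len(data), k, replace=False)]
        for _ in range(n_iter):
            # assign each point to the nearest cluster mean
            d = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            new_means = np.array([data[labels == j].mean(axis=0)
                                  if np.any(labels == j) else means[j]
                                  for j in range(k)])
            if np.allclose(new_means, means):
                break
            means = new_means
        inertia = ((data - means[labels]) ** 2).sum()
        if inertia < best_inertia:
            best_inertia, best_labels, best_means = inertia, labels, means
    return best_labels, best_means
```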
In recent years there has been much interest in the problem of identifying and studying
atmospheric regimes (Palmer, 1993, 1999). In particular, there is much interest in how these
regimes respond to an external forcing such as anthropogenic greenhouse gas emissions. An
attractor with regime structure could respond to a forcing in two possible ways. Hasselmann
(1999) discusses the climate attractor in terms of a field with several potential wells, each of
which represents a different atmospheric regime. The first possible response to an external
forcing would be a change in the relative depths of the potential wells. This would lead to
changes in both the relative residence times in the wells, and to the transition frequencies
between regimes. Studying 20th Century reanalysis data indicates greenhouse gas forcing
leads, in part, to this response in the climate system. Corti et al. (1999) observed changes
in the frequency of Northern Hemisphere intraseasonal–interannual regimes between 1949 and 1994, though the structure of the regimes remained unchanged over this time period.
The second possible response to an external forcing is a change in regime properties, such
as centroid location and number of regimes (i.e. the position and number of potential wells). In
addition to observing changes in frequency of regimes over the time period 1948–2002, Straus
et al. (2007) observe this second type of response in the reanalysis data; the structure of the
Pacific trough regime at the end of the time period is statistically significantly different from that at the beginning.
The importance of regimes in observed trends over the past 50–100 years indicates that in
order to predict anthropogenic climate change, our climate models must be able to accurately
represent natural circulation regimes, their statistics and variability. Dawson et al. (2012) show
that while NWP models are able to capture the regime behaviour of the climate system with
reasonable accuracy, the same model run at climate resolution does not show any statistically
significant regime structure. However, the model used in this study has no representation of
model uncertainty; a single deterministic forecast is made from each starting date.
It is now well established that representing model uncertainty as well as initial condition
uncertainty is important for reliable weather forecasts (Ehrendorfer, 1997). Many possible
methods for representing model uncertainty have been discussed in Section 1.3. It is possible
that including a representation of model uncertainty could enable the simulator to explore
larger regions of the climate attractor, including other flow regimes. This section seeks to
investigate the effect of including representations of model uncertainty on the regime behaviour of a simulator. A deterministic parametrisation scheme will be compared to stochastic
parametrisation approaches and a perturbed parameter ensemble (please refer to Chapter 2
for experimental details). A simple chaotic model of the atmosphere, the L96 system, will be
used to study the predictability of regime changes (Lorenz, 1996, 2006).
3.3.1 Data and Methods
The Lorenz (1996) simplified model of the atmosphere was used in this investigation, as described in Chapter 2. Firstly, it needs to be established whether the L96 two–scale model exhibits regime behaviour. Lorenz (2006) carried out a series of experiments using the one-scale
Lorenz ’96 system (hereafter L96 1D), which describes the evolution of the L96 Xk variables,
Figure 3.6: The time series of total energy for the L96 1D system, where total energy is defined as $E = \frac{1}{2}\sum_k X_k^2$. The labels [4.95–5.25] indicate the value of the forcing, F, in (3.4). Taken from Lorenz (2006) (Fig. 3).
Figure 3.7: The time series of total energy for the L96 system, c = 4 case, where total energy is defined as $E = \frac{1}{2}\sum_k X_k^2$. Total energy is not conserved as the system is forced and dissipative. The time series for total energy does not appear to show regime behaviour for the c = 4 case with forcing F = 20.
without the influence of the smaller scale Yj variables (Lorenz, 1996):
$$\frac{\mathrm{d}X_k}{\mathrm{d}t} = -X_{k-1}(X_{k-2} - X_{k+1}) - X_k + F, \qquad k = 1, \ldots, K \qquad (3.4)$$
Lorenz defines a dynamical system as having regime behaviour if:
1. The phase space of the dynamical system has two separate regions, A and B.
2. Both transitions A–B and B–A are observed.
3. For both modes, the average length of time between transitions must be long compared
to some other significant oscillation of the system.
He performed extended numerical runs and examined the resultant time series of total energy.
$$E = \frac{1}{2}\sum_{k=1}^{K} X_k^2 \qquad (3.5)$$
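For reference, (3.4) and (3.5) can be reproduced numerically with a few lines of code. The sketch below uses a fourth-order Runge-Kutta integrator; the time step, run length and initial condition are illustrative assumptions, not the settings used by Lorenz (2006).

```python
import numpy as np

def l96_1d_tendency(x, F):
    """One-scale L96 tendency, (3.4), with cyclic boundary conditions."""
    return -np.roll(x, 1) * (np.roll(x, 2) - np.roll(x, -1)) - x + F

def integrate(x0, F, dt=0.005, n_steps=40_000):
    """RK4 integration; returns the total-energy time series (3.5).
    40,000 steps at dt = 0.005 corresponds to 200 MTU."""
    x, energy = x0.copy(), []
    for _ in range(n_steps):
        k1 = l96_1d_tendency(x, F)
        k2 = l96_1d_tendency(x + 0.5 * dt * k1, F)
        k3 = l96_1d_tendency(x + 0.5 * dt * k2, F)
        k4 = l96_1d_tendency(x + dt * k3, F)
        x += dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        energy.append(0.5 * np.sum(x ** 2))
    return np.array(energy)

# e.g. starting from a slightly perturbed rest state:
# x0 = 5.1 * np.ones(8); x0[0] += 0.01
# E = integrate(x0, F=5.1)
```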
Figure 3.6 shows the time series for the L96 1D system for different values of the forcing
parameter, F . From his criteria, Lorenz identified regimes in the time series for F = 5.05−5.25.
Figure 3.7 and Figure 3.8(a) show the time series of total energy for the c = 4 and c = 10
cases for the L96 system respectively. The c = 4 case does not appear to have regimes as
defined by Lorenz (2006) for F = 20. However, the c = 10 case shows drops in the total energy
of the system which persist for a few model time units, which could indicate the presence of
regimes. The time series for c = 10 looks qualitatively similar to the series for F = 5.25 in
Figure 3.6, for the L96 1D system.
Lorenz also considers the spatial distribution of the X variables as a test for the presence of
regimes. Figure 3.9 (a) shows profiles of the L96 system Xk at 6 hour intervals taken from 60
MTU after the start of the dataset in Figure 3.8, when the total energy of the system tends to be higher with large oscillations.
Figure 3.8: (a) The time series of total energy for the L96, c = 10, F = 20 case, where total energy is defined as $E = \frac{1}{2}\sum_k X_k^2$. (b) The covariance diagnostic evaluated for the data shown in (a). If the diagnostic is positive, a wave–2 pattern dominates the behaviour of the X variables, whereas if it is negative, a wave–1 pattern is dominant. See text for details. (c) The same data set, interpreted in terms of two regimes, A and B, defined using the covariance diagnostic. If the diagnostic is positive, the system is interpreted as being in Regime A.
Figure 3.9 (b) shows profiles from 50 MTU after the start of
the dataset, when the total energy of the system has dropped to a lower, more quiescent state.
The two sets of profiles are different: since these two samples are characterised by physically
different states, it is reasonable to interpret them as coming from two different regimes.
The difference in structure between the two regimes is most clearly revealed by consideration
of the covariance matrix of the X variables, shown in Figure 3.10. It is convenient to define
Regimes A and B in terms of this covariance matrix C, calculated using samples of the time
series 1 MTU long, where C(m, n) represents the covariance between Xm and Xn :
$$\text{Regime} = A \iff (C(1,5) + C(2,6) + C(3,7) + C(4,8)) > 0$$
$$\text{Regime} = B \iff (C(1,5) + C(2,6) + C(3,7) + C(4,8)) < 0 \qquad (3.6)$$
In other words, the system is defined to be in Regime A if opposite X variables are in phase
for K = 8, and in Regime B if opposite X variables are out of phase. The time series of this
“covariance diagnostic” and the resultant identified regimes are shown in Figure 3.8 (b) and (c) respectively.
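In code, the classification (3.6) might be implemented as below, assuming the X time series is stored as an array of shape (n_times, 8) and that consecutive, non-overlapping 1 MTU windows are used; both the array layout and the function name are illustrative.

```python
import numpy as np

def classify_regimes(X, steps_per_mtu):
    """Label each 1 MTU window of an (n_times, 8) series as Regime A or B
    using the covariance diagnostic (3.6)."""
    labels = []
    for start in range(0, len(X) - steps_per_mtu + 1, steps_per_mtu):
        window = X[start:start + steps_per_mtu]
        C = np.cov(window, rowvar=False)                 # 8 x 8 covariance matrix
        diag = C[0, 4] + C[1, 5] + C[2, 6] + C[3, 7]     # C(1,5)+C(2,6)+C(3,7)+C(4,8)
        labels.append('A' if diag > 0 else 'B')
    return labels
```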
Figure 3.9: Profiles of the K = 8 X variables for the L96 c = 10 case. The profiles are taken
from (a) 60 MTU and (b) 50 MTU after the start of the time series shown in Figure 3.8. The
labelling on the Y-axis indicates the number of “atmospheric days” since the first profile. The
profiles from Regime A show a wave–2 type behaviour, while those from Regime B show a
dominant wave–1 pattern.
Figure 3.10: The covariance matrix, C(m, n), for the covariance between Xm and Xn calculated
from a 1 MTU sample, (a) 60 MTU and (b) 50 MTU after the start of the time series shown
in Figure 3.8. (a) The dominant feature is a wave–2 pattern, with the ‘opposite’ X variables
in phase with each other. (b) The dominant feature is a wave–1 pattern, with the ‘opposite’
X variables out of phase with each other.
Figure 3.11: The probability distribution function (pdf) for the duration of (a) Regime A and
(b) Regime B. Regime A events are observed to be longer lasting on average than Regime B
events.
Figure 3.12: The first four Empirical Orthogonal Functions (EOFs) calculated from the c = 10
truth time series. Due to the symmetry of the system, the EOFs correspond to the leading
harmonics.
Lorenz’s third criterion for regimes requires their duration to be longer than some other
significant oscillation of the system. In the L96 system for the c = 10 case, the dominant
oscillation of the X variables has a time period of approximately 0.5 MTU. Figure 3.11 shows
a pdf of the duration of each regime. The average duration of Regime A is 5.08 MTU, and the
average duration of Regime B is 2.06 MTU. The average duration of both regimes is greater
than 0.5 MTU, so we can conclude that, for the c = 10 case, the L96 system does indeed
exhibit regime behaviour, and is a suitable model for use in this investigation.
The predictability of the regime behaviour of the L96 system, c = 10 case, will be studied
using the same techniques used for atmospheric data. Firstly, it has been suggested that the
time series should be temporally smoothed to help identify the regimes (Straus et al., 2007;
Stephenson et al., 2004). For example, consider the well–known Lorenz (1963) system (the
“butterfly attractor”): the system clearly has two regimes corresponding to the two lobes of
the attractor, but these regimes are only apparent in a pdf of the system if the time series is
first temporally averaged (Corti et al., 1999; Stephenson et al., 2004)². In the L96 system, the
modal residence time in Regime B is ∼ 0.4 MTU (Figure 3.11), so a running time average over
0.4 MTU will be used to smooth the time series.

²The time series must not be too heavily smoothed, as this will cause the pdf to tend towards a Gaussian distribution (following the central limit theorem).
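The running average itself is straightforward; a minimal sketch, in which the sampling interval (and hence the window length in steps) is an assumption:

```python
import numpy as np

def running_mean(x, window_steps):
    """Running average along the time axis (axis 0) of a 1-D or 2-D series."""
    kernel = np.ones(window_steps) / window_steps
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode='valid'), 0, x)

# e.g. with a 0.005 MTU sampling interval, a 0.4 MTU window is 80 steps:
# X_smooth = running_mean(X, 80)
```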
When studying atmospheric data sets, the dimensionality of the problem is usually reduced
using an empirical orthogonal function (EOF) analysis on the temporally smoothed data series
(Straus et al., 2007). Due to the symmetry of the L96 system, Figure 3.12 shows that the leading
EOFs of the L96 system are simply the dominant harmonics. The first two EOFs are degenerate,
and are π/4 out of phase wavenumber two oscillations, i.e. are in phase quadrature. The third and fourth EOFs are similarly in phase quadrature, and are π/2 out of phase wavenumber one
oscillations. Consideration of Figure 3.12 shows that EOF1 and EOF2 are likely to dominate
in Regime A, whereas EOF3 and EOF4 dominate in Regime B. The principal components
(PCs) were calculated for each EOF. Due to the degeneracies of the EOFs, the magnitude of
the principal component vectors, ||[PC1, PC2]|| and ||[PC3, PC4]||, will be considered and the pdf of the system plotted in this space.³ The corresponding eigenvalues show that EOF1 and EOF2 account for 68.7% of the variance, while EOF3 and EOF4 account for a further 14.4%.

³This is equivalent to considering complex EOFs, which are used to capture propagating modes. A Hilbert transform is applied to the time series X to calculate H(X), which includes information about the first derivative of the data, and indicates the presence of the orthogonal sine/cosine signals necessary for a propagating signal. A standard EOF analysis is then performed for the modified time series: X + H(X). In this study, EOF1 and EOF2 together represent the first complex EOF, and EOF3 and EOF4 represent the second complex EOF.
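A sketch of this dimension reduction, with the EOFs obtained from an eigendecomposition of the covariance matrix of the (already smoothed) anomalies; the function name and array conventions are illustrative:

```python
import numpy as np

def pc_pair_magnitudes(X_smooth):
    """EOFs of a smoothed (n_times, K) series via its covariance matrix,
    and the reduced coordinates ||[PC1, PC2]|| and ||[PC3, PC4]||."""
    anom = X_smooth - X_smooth.mean(axis=0)
    evals, evecs = np.linalg.eigh(np.cov(anom, rowvar=False))
    order = np.argsort(evals)[::-1]              # largest variance first
    eofs = evecs[:, order[:4]]                   # four leading EOFs
    pcs = anom @ eofs                            # principal component series
    explained = evals[order[:4]] / evals.sum()   # fraction of variance
    return (np.linalg.norm(pcs[:, :2], axis=1),
            np.linalg.norm(pcs[:, 2:4], axis=1),
            explained)
```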
3.3.2 The True Attractor
The full set of c = 10 L96 equations is integrated forward for 10,000 MTU ∼ 140 “atmospheric
years” to ensure the attractor is fully sampled. The time series is temporally smoothed with
a running average over 0.4 MTU. For comparison, the raw, unsmoothed time series is also
considered. An EOF decomposition is carried out on each of the truth time series (raw and
smoothed), and the PCs calculated. The dimensionality of the space is further reduced by
considering only the magnitude of the PC vectors [PC1, PC2] and [PC3, PC4] as a function
of time. The state vector pdf for the full “truth” model is shown in Figure 3.13 for (a)
the unsmoothed time series and (b) the smoothed time series. Temporally smoothing the
time series helps to identify the two regimes. The maximum of the pdf is located at large
[PC1, PC2] and small [PC3, PC4], corresponding to the more common wave–2 “Regime A” state of the system. However, in (b) the pdf is elongated away from this maximum towards large [PC3, PC4] and small [PC1, PC2], where there is a small but distinct subsidiary peak;
this corresponds to the less common “Regime B”. Figure 3.13(a) does not have a distinct second
peak, so does not indicate the presence of regimes.
Figures 3.13(c) and (d) show the mean residence time of trajectories within local areas of
phase space. For each point in PC space, a circular region with radius R is defined, and the
average residence time of trajectories within that region is calculated, following Frame et al.
(2013). Here R = 2, and the displayed circle indicates the size of the region for comparison. For
both (c) the unsmoothed and (d) the smoothed time series, two regions of high residence time
can be identified. The longest residence times occur at large [P C1, P C2] and small [P C3, P C4],
corresponding to Regime A. There is a further peak in residence time at large [P C3, P C4]
and small [P C1, P C2], corresponding to Regime B. These two distinct peaks provide further
evidence for the regime nature of the L96 system: there are two regions in phase space which
the system preferentially occupies for extended periods of time, and transitions between these
regions are more rapid. This diagnostic confirms that Regime A is more persistent than Regime
B, as expected from Figure 3.11.
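The residence-time diagnostic of Frame et al. (2013) might be sketched as follows, where `dt` is the assumed sampling interval of the PC time series and the grid of centres is an illustrative choice:

```python
import numpy as np

def mean_residence_time(traj, centres, radius=2.0, dt=0.005):
    """Mean length of time (in MTU) the trajectory spends inside a circle
    of radius `radius` around each centre, averaged over visits."""
    result = []
    for c in centres:
        inside = np.linalg.norm(traj - c, axis=1) < radius
        runs, count = [], 0
        for flag in inside:
            if flag:
                count += 1
            elif count:
                runs.append(count)
                count = 0
        if count:
            runs.append(count)
        result.append(dt * np.mean(runs) if runs else 0.0)
    return np.array(result)

# e.g. evaluated on a grid in the ||[PC1, PC2]||-||[PC3, PC4]|| plane:
# grid = [(x, y) for x in np.arange(0, 15, 0.5) for y in np.arange(0, 10, 0.5)]
# times = mean_residence_time(pc_traj, grid, radius=2.0)
```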
Figures 3.13(e) and (f) show the mean velocity of the system’s motion through phase space.
The colour indicates mean speed, and arrows indicate mean direction. A region with radius 0.5
is defined centred on each point in phase space, and the net displacement of trajectories starting
within this region is calculated over 0.05 MTU. The average magnitude and average direction
of displacement is then calculated. Both Figures (e) and (f) show two centres of rotation in
phase space corresponding to the two regimes. On average, trajectories circle these centres,
resulting in persistent conditions (cf. the Lorenz ’63 ‘butterfly attractor’). The structure of
the flow field is somewhat different for the smoothed time series — the second centre is less
clearly defined, but coincides with a maximum in the average magnitude of displacement.
In fact, trajectories are observed to oscillate vertically about this centre during a persistent
Regime B phase, resulting in the large average magnitude, but small average displacement
vector. Nevertheless, both Figures 3.13(e) and (f) provide conclusive evidence of the existence
of regimes in the L96 system, which can be detected in both the raw and smoothed data.
3.3.3 Simulating the Attractor
Having demonstrated that the full L96 system exhibits regime behaviour, the skill of each of
the truncated L96 forecast models at reproducing this regime behaviour will be evaluated.
Each forecast model is integrated forward for 10,000 MTU as for the full L96 system. For
the perturbed parameter experiment, each of the 40 deterministic models is integrated for
250 MTU. While regime behaviour can be detected in both the raw and smoothed truth time
series, the results in Section 3.3.2 indicate that it is easier to detect the presence of regimes in
Figure 3.13: Regime characteristics of the full L96 system. Both the raw and temporally
smoothed time series are considered, where the smoothing is a running average over 0.4
MTU. Each diagnostic is shown in the space of pairs of leading EOFs, [EOF 1, EOF 2] and
[EOF 3, EOF 4]. See text for details. (a) Raw and (b) smoothed pdfs. (c) Raw and (d)
smoothed mean residence times (MTU): the mean length of time a trajectory remains within 2
units of each location. A circle of radius two is indicated. (e) Raw and (f) smoothed magnitude
(colour) and orientation (arrows) of the average displacement in phase space over 0.05 MTU,
averaged over trajectories passing within 0.5 units of each position in phase space.
the smoothed time series pdf, as was suggested by Straus et al. (2007). Therefore, the time
series are temporally smoothed with a running average over 0.4 MTU. The four leading order
truth EOFs are used to calculate the PCs of the forecast models to ensure a fair comparison
(Corti et al., 1999). The dimensionality of the space is further reduced by considering only the
magnitude of the PC vectors [PC1, PC2] and [PC3, PC4] as a function of time (equivalent
to considering the first two complex EOFs). The state vector pdfs for each of the forecast
models considered are shown in Figure 3.14; the pdf for the full “truth” model has also been
reproduced for ease of comparison.
Panel (a) in Figure 3.14 corresponds to the full “truth” equations (1.6), reproduced from
Figure 3.13, and shows two distinct peaks corresponding to the two regimes, A and B. None of
the forecast models are able to accurately capture the second subsidiary peak, corresponding
to Regime B. The white noise stochastic models and the deterministic (DET) parametrisation
scheme all put too little weight on this area of phase space. However, the AR(1) stochastic
models show a large improvement — Regime B is explored more frequently than by the deterministic or white stochastic models, though not as frequently as by the full truth system. The
attractor of the perturbed parameter model, Figure 3.14(c), shows a distinct peak for Regime
B, unlike the other forecast models. However, the attractor has a very different structure to
that for the truth time series — Regime B is visited too frequently.
In fact, it is surprising how reasonable the perturbed parameter attractor looks! It consists
of an average of 40 constituent members, shown in Figure 3.15. The contour colours are
consistent with Figure 3.14. Many of the 40 different perturbed parameter ensemble members
show vastly different regime behaviour to the true attractor. While some ensemble members (for
example, numbers 8, 13, 18 and 38) look reasonable, many spend all their time in Regime A
and do not explore Regime B at all (for example, numbers 1–4, etc), while some predominantly
inhabit Regime B (e.g. numbers 5, 37, 39). Perturbed parameter ensembles are often used for
climate prediction. However, if individual members only explore one region of the true climate
attractor, how can the effect of forcing on the frequency of different regimes be established?
For ease of comparison, the 2D Figure 3.14 has been decomposed into two, 1D pdfs for
each of the forecast models (Figure 3.16). The Kolmogorov-Smirnov Statistic (3.1), Dks , and
Hellinger distance (3.2), DHell , have been calculated as measures of the difference between the
true pdf and the forecast pdf for each case. For both Dks and DHell, the smaller the measure, the closer the forecast pdf is to the true pdf.
Figure 3.14: Ability of different parametrisation models to reproduce the true attractor (top
left). The pdf of the state vector for the c = 10 L96 system is plotted in the space of pairs of
leading EOFs. See text for details. Six different forecasting models are shown. (b) Deterministic parametrisation scheme; (c) perturbed parameter scheme; additive stochastic parametrisation with (d) white and (e) AR(1) noise; multiplicative stochastic parametrisation with (f)
white and (g) AR(1) noise. The degree of perturbation of the perturbed parameters, and the
standard deviation and autocorrelation in the stochastic parametrisations have been estimated
from the truth time series (see Chapter 2 for more details). The same EOFs determined from
the full truth data set are used in each panel, and the colour of the contours is also consistent.
Figure 3.15: Ability of the perturbed parameter scheme to reproduce the true attractor. The
pdf of the state vector for the L96 system is plotted in the space of pairs of leading EOFs,
[EOF 1, EOF 2] and [EOF 3, EOF 4] for each of the forty perturbed parameter ensemble members (numbered). The attractors of individual ensemble members appear very different to
the true attractor, with some members only exploring one of the two regimes present in the
full system. The figure numbers correspond to the numbering of the ensemble members in
Table 2.4. The same colour bar is used as in Figure 3.14. Note that many of the ensemble
members saturate this colour bar.
Figure 3.16: Ability of different parametrisation models to reproduce two of the dimensions of
the true attractor, shown as two 1–dimensional plots. The pdf of the state vector for the L96
system is plotted for (a) the magnitude of [PC1, PC2] and (b) the magnitude of [PC3, PC4].
Four different forecasting models are shown on each panel (coloured lines) together with that
for the truth data. White noise, whether additive or multiplicative, is indistinguishable from
the deterministic case, so has not been shown.
Parametrisation                      [PC1,PC2]         [PC3,PC4]
                                     Dks     Dhell     Dks     Dhell
Deterministic                        0.083   0.125     0.112   0.117
White Additive Stochastic            0.084   0.125     0.114   0.118
AR(1) Additive Stochastic            0.037   0.062     0.049   0.055
White Multiplicative Stochastic      0.083   0.121     0.110   0.115
AR(1) Multiplicative Stochastic      0.040   0.052     0.028   0.035
Perturbed Parameters                 0.072   0.110     0.040   0.048
Table 3.1: The skill of different parametrisation schemes at reproducing the structure of the
Truth attractor along each of two directions defined by the dominant EOFs. The Kolmogorov-Smirnov distance, Dks, and Hellinger distance, Dhell, are used to measure the similarity between
the true and forecast pdfs. The smaller each of these measures, the closer the forecast pdf is
to the true pdf. The best forecast model according to each measure is shown in red.
The results are shown in Table 3.1. The AR(1)
multiplicative stochastic forecast is the most skilful for each case according to the Hellinger
distance. The AR(1) additive scheme also scores well for both cases, and is the most skilful
representation of [PC1, PC2] according to the Kolmogorov-Smirnov distance. The perturbed parameter ensemble greatly improves over the deterministic and white stochastic forecasts for the [PC3, PC4] case, but does not greatly improve over the deterministic scheme for the [PC1, PC2] pdf.
However, a perturbed parameter ensemble must be interpreted carefully. Since each member
of the ensemble is a physically distinct model of the system, the forecast climatology of each
member should be assessed individually. The 1D pdfs of [PC1, PC2] and [PC3, PC4] were
calculated for each ensemble member, and the Hellinger distance between the forecast and true
pdfs evaluated for each case. For comparison, the deterministic and stochastic forecasts were
also split into 40 sections, each 250 MTU long, and the Hellinger distance evaluated for each
section as for the perturbed parameter ensemble. This allows the effect of sampling error to be
considered. The distribution of Hellinger distances for each case is shown in Figure 3.17. The
spread of skill is largest for the perturbed parameter ensemble, with some members showing
very poor climatologies, while others are more skilful. The white additive and multiplicative
schemes do not show a significant difference from the deterministic forecast, while the AR1
additive and AR1 multiplicative schemes are consistently better than the other schemes.
3.3.4 Simulating Regime Statistics
While reproducing the pdf of the true system is important for capturing regime behaviour, it is
also necessary for a forecast model to represent the temporal characteristics of the regimes well.
This is evaluated using the distribution of persistence of each regime (Dawson et al., 2012; Pohl
and Fauchereau, 2012; Frame et al., 2013), and will be considered using two different techniques.
Firstly, the behaviour of the system in PC space is used to examine the temporal characteristics of the system. The mean residence time of trajectories in phase space is calculated. For
each point in PC space, a circular region with radius R is defined, and the average residence
time of trajectories within that region is calculated, as for Figure 3.13(c) and (d). Figure 3.18
shows the mean residence time of trajectories when R = 2 PC units, and Figure 3.19 shows the
same diagnostic when R = 4 PC units. For each case, the circle in panel (a) indicates the size of the region.
Figure 3.17: The distribution of Hellinger distance calculated for the difference between forecast and observed EOF climatologies. The pdf for the magnitude of (a) [PC1, PC2] and (b) [PC3, PC4] is calculated. For the deterministic and stochastic models, the time series is split into 40 sections, 250 MTU long, and the pdfs calculated for each. For the perturbed parameter
ensemble, the pdfs are calculated for each ensemble member separately. The Hellinger distance
between each forecast pdf and the true pdf is evaluated, and the distribution of Hellinger distance represented by a box and whisker plot. The median value is marked by a horizontal red
line. The 25th and 75th percentiles are indicated by the edges of the box, and the whiskers
extend to the minimum and maximum value in each case, unless there are outliers, which are
marked by a red cross. An outlier is defined as a value smaller than 1.5 times the inter–quartile
range (IQR) below the lower quartile, or greater than 1.5 IQR above the upper quartile.
For both R = 2 and R = 4, two regions of comparatively high residence time
can be identified in the truth simulation shown in panel (a). The longest residence times occur
at large [PC1, PC2] and small [PC3, PC4], corresponding to Regime A. There is a smaller peak in residence time at large [PC3, PC4] and small [PC1, PC2], corresponding to Regime
B. Figures 3.18 (a) and 3.19 (a) are qualitatively similar, except that the peaks have a more
equal depth for the R = 2 case than for the R = 4 case.
The forecast models are able to capture the temporal characteristics of the true system. They
show two distinct peaks in residence time of approximately the correct magnitude. However,
there are subtle differences between the different forecast models. In Figure 3.18, the DET, PP,
WA and WM forecast models have regimes that are too persistent — the two peaks in residence
time are too high, particularly for Regime A. The red noise stochastic schemes perform better,
with the AR1M scheme capturing the average residence time for Regime B particularly well.
In Figure 3.19, all models predict too high residence times for Regime A. However, the AR1M
scheme performs the best, with a good representation of residence times in the tail of the pdf,
and a lower and more accurate peak residence time for Regime A.
Figure 3.18: Mean residence time in model time units. The mean length of time trajectories
remain within 2 units of each position in PC space. A circle of radius 2 units is shown for
comparison in panel (a).
Figure 3.19: Mean residence time in model time units. The mean length of time trajectories
remain within 4 units of each position in PC space. A circle of radius 4 units is shown for
comparison in panel (a).
As for the pdf of the state vector, it is important to recall that the PP ensemble consists
of 40 physically distinct representations of the system. The residence time pdfs are plotted for
each perturbed parameter ensemble member in Figure 3.20 for the R = 2 case. The individual
ensemble members have vastly different temporal characteristics. Some members show very
persistent regimes with very few transitions, and some (e.g. 8, 13) indicate the presence of a
third regime. The same colour scale is used as for Figure 3.18, but has saturated for several
panels. For example, the maximum residence time of Regime A for ensemble member 1 is 1.08
MTU, which is more than double the maximum residence time observed in the full system.
Similarly, the maximum residence time of Regime B for ensemble member 5 is 2.04 MTU,
more than six times greater than that observed for Regime B in the full system. Considered
individually, the PP ensemble members are a poor representation of the regime behaviour of
the true system.
The second technique used to study the regime statistics uses the definition of Regimes
A and B given by (3.6). The definition is used to determine the regime at each time step,
and the pdf of persistence of each regime calculated as for Figure 3.11. Figure 3.21 compares
the persistence pdfs for the full L96 system with that of the truncated forecast models. The
AR(1) stochastic parametrisation schemes (red and magenta lines) improve significantly over
the white stochastic schemes (blue and cyan) and the deterministic scheme (green) — the
distribution of regime durations more closely matches the true distribution for the AR(1) noise
cases. The proportion of time spent in each regime also improves (Table 3.2); the deterministic
and white noise schemes visit Regime B too rarely, whereas the proportion of time spent in
Regime B by the AR(1) stochastic schemes is close to the truth.
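Given the regime label sequence obtained from (3.6), the persistence distributions follow from the run lengths of each label; a minimal sketch, assuming labels are produced at a fixed interval `dt` (in MTU):

```python
def regime_durations(labels, dt=1.0):
    """Durations (in MTU) of consecutive runs of each regime label."""
    durations = {'A': [], 'B': []}
    current, count = labels[0], 1
    for lab in labels[1:]:
        if lab == current:
            count += 1
        else:
            durations[current].append(count * dt)
            current, count = lab, 1
    durations[current].append(count * dt)
    return durations

# e.g. the proportion of time spent in Regime A:
# p_A = sum(d['A']) / (sum(d['A']) + sum(d['B']))  where d = regime_durations(labels)
```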
The perturbed parameter scheme (grey lines) appears to perform very well. Table 3.2
shows that averaged over all ensemble members, the proportion of time spent in each regime
very accurately represents the true system. Figure 3.21 shows that the frequency and duration
of the modal persistence match the truth time series well for both Regime A and Regime B.
However, this figure is misleading. The tail of the perturbed parameter pdf extends well beyond
the X–axis limit for both Figures (a) and (b), as some ensemble members showed very persistent
regimes with only rare transitions, as indicated in Figure 3.20. The members with only one or
two transitions only contribute one or two persistence values to the data set, so affect the modal
height and position very little. Nevertheless, it is interesting to see how, while each individual
Figure 3.20: Mean residence time in model time units for each member of the perturbed
parameter ensemble. The mean length of time trajectories remain within 2 units of each
position in PC space. The figure numbers correspond to the numbering of the ensemble
members in Table 2.4.
Figure 3.21: Predicting the distribution of persistence of (a) Regime A and (b) Regime B. The
true distribution is shown in black, and the six different forecast models shown as coloured
lines.
A perturbed parameter ensemble where the selected parameters vary in
time would allow each ensemble member to sample the parameter uncertainty, allowing each
individual ensemble member to capture the regime behaviour. Such a stochastically perturbed
parameter ensemble is tested in the ECMWF NWP model in Chapter 5.
As before, it is helpful to consider the climatology of each perturbed parameter ensemble
member separately. The Hellinger distance between the truth and each perturbed parameter
persistence pdf was calculated. For comparison, the deterministic and stochastic forecasts
were split into 40 sections as before, and the pdfs of persistence calculated for each section.
Figure 3.22 shows the distribution of Hellinger distance for each of the forecast schemes. The
white additive and white multiplicative schemes improve slightly over the deterministic scheme
— the median Hellinger distance improves in each case, though the spread in skill also increases.
The AR(1) stochastic schemes are significantly more skilful at predicting the regime statistics
than both the deterministic and white stochastic schemes, with consistently higher skill. The
skill of the perturbed parameter ensemble shows the greatest variability. While the median
score is only slightly greater than the median for the deterministic ensemble, some perturbed
parameter ensemble members score a Hellinger distance of one, indicating the forecast and
truth distributions are mutually exclusive.
Figure 3.22: The distribution of Hellinger distance calculated for the difference between forecast
and observed distributions of regime persistence. The pdf for the forecast durations of (a)
Regime A and (b) Regime B is calculated. The Hellinger distance between each forecast pdf
and the true pdf is evaluated, and the distribution of Hellinger distance represented by a box
and whisker plot (see caption to Figure 3.17 for more details).
Parametrisation                      p(Regime A)   p(Regime B)
Truth                                0.7904        0.2096
Deterministic                        0.9057        0.0943
White Additive Stochastic            0.9051        0.0949
AR(1) Additive Stochastic            0.8268        0.1732
White Multiplicative Stochastic      0.9039        0.0961
AR(1) Multiplicative Stochastic      0.8002        0.1998
Perturbed Parameter Scheme           0.7898        0.2102
Table 3.2: Predictability of regime frequencies by different forecast models. The deterministic
and white stochastic schemes all underpredict the proportion of time spent in the rarer Regime
B, while the AR(1) stochastic and perturbed parameter scheme explore this region of phase
space with the correct frequency.
3.4 Conclusion
The same stochastic parametrisation schemes presented in Chapter 2 are tested for their ability
to reproduce the climate of the Lorenz ’96 (L96) system. Two definitions of climate are
considered. The first defines the climate to be the pdf of the X variables, and the difference
between the forecast and true climate is evaluated using the Hellinger distance. According
to this measure, including white noise into the parametrisation scheme does not significantly
improve the climatology over that of a deterministic parametrisation scheme. This result is
observed for all noise models tested in both the c = 4 and c = 10 L96 cases. However, a
large, highly significant improvement in skill is observed when temporally autocorrelated noise
is used in the stochastic parametrisation schemes for both the c = 4 and c = 10 cases.
It was found that the climatological skill of the forecast models is correlated with the
performance of the forecast model in weather prediction mode, in particular, with the reliability
of short-range forecasts. The correlation between the short term predictive skill of the forecast
model and its ability to reproduce the climatology of the L96 system provides support for
the “Seamless Prediction” paradigm. This provides a method of verifying climate predictions:
the climate model can be evaluated in weather forecasting mode to indicate the potential for
climatological skill.
The climate of the perturbed parameter forecast models described in Chapter 2 was also
tested. For the c = 10 case, the measured perturbed parameter model showed an improved
climatology over the deterministic and white noise stochastic models, but a significantly poorer
climatology than the red noise stochastic models, when the climate of the L96 system is defined
as the pdf of the X variables. However, for the c = 4 case, the perturbed parameter model
is not significantly different to any of the red noise stochastic models, and has a significantly
improved climatology over the deterministic and white noise schemes.
Regime behaviour, commonly observed in the atmosphere, is also observed in the L96
system. It is argued that the L96 system has two regimes for c = 10 and F = 20 — the
system is in Regime A 79% of the time, while the less common Regime B occurs 21% of
the time. The regime behaviour of this system makes it a useful testbed for analysing the
ability of different forecast models to reproduce regime behaviour. Three types of models were
considered: a deterministic parametrisation scheme, stochastic parametrisation schemes with
additive or multiplicative noise, and a perturbed parameter ensemble.
Each forecasting scheme was tested on its ability to reproduce the attractor of the full
system, defined in a reduced space based on an EOF decomposition of the truth time series.
None of the forecast models accurately capture the less common Regime B, though a significant
improvement is observed over the deterministic parametrisation when a temporally correlated
stochastic parametrisation is used instead. The stochastic parametrisation enables the system
to explore a larger portion of the attractor, in the same way in which a ball–bearing in a
potential well will explore around its equilibrium position when subjected to a random forcing.
The regime statistics describing the persistence of the regimes and their frequency of occurrence
were also improved for the stochastic parametrisations with AR(1) noise compared to the
deterministic scheme, and multiplicative noise was found to be particularly skilful.
The attractor for the perturbed parameter ensemble improves on that forecast by the
deterministic or white additive schemes; it shows a distinct peak in the attractor corresponding
to Regime B, though this peak is more pronounced than in the truth attractor. The ensemble
is also very skilful at forecasting the correct statistics for the regime behaviour of the system.
However, the 40 constituent members of the perturbed parameter ensemble differ greatly from
the true attractor, with many only showing one dominant regime with very rare transitions.
It is interesting that, while each individual ensemble member models the regime behaviour
poorly, when averaged together, the ensemble performs very well.
Using regime behaviour to study the climate of a system provides considerably more information than studying the pdf of the system. The pdf of the perturbed parameter ensemble,
while not as skilful as the red noise stochastic parametrisations, shows skill for the c = 10 case.
The pdfs of individual ensemble members are also all skilful, with a mean Hellinger distance
of 0.06 ± 0.03 (not shown) — all perturbed parameter ensemble members have a reasonable
pdf. However, the regime behaviour of the individual perturbed parameter ensemble members
varies widely. In order to correctly simulate the statistics of the weather (for example, the duration of blocking events over Europe), a climate simulator must accurately represent regime
behaviour. It is therefore important that climate models are explicitly tested on this ability.
The results presented here indicate that while the average of a perturbed parameter ensemble
performs well, individual ensemble members are at risk of failing this test.
4 Evaluation of Ensemble Forecast Uncertainty: The Error-Spread Score
It is far better to foresee even without certainty than not to foresee at all.
– Henri Poincaré
4.1 Introduction
The first and arguably most important lesson in any experimental physics course is that a
physicist must quantify the uncertainty in his or her measurement or prediction. Firstly, this
allows for comparison of measurement with theory: a theory can only be discounted if it predicts
a value outside of the “error bars” of a measurement. Additionally, a theory does not necessarily
predict a single value. For example, for the famous experiment in which electrons are fired one
at a time through a double-slit, the theory of quantum mechanics predicts the probability of
an electron striking a screen behind the slit at any given point. The statistical reliability of
the forecast pdf can only be verified by repeated measurements which then provide evidence
for the validity of the theory; many individual electrons were passed through the double slit,
and an interference pattern observed as predicted (Donati et al., 1973).
The same lesson is valid in the atmospheric sciences. A weather or climate prediction
should include an estimate of the uncertainty of the prediction. In weather forecasting, Ensemble Prediction Systems (EPS) are commonly used to give an estimate of error in a forecast;
the ensemble of forecasts is assumed to sample the full forecast uncertainty. As outlined in
Section 1.3, there are two main sources of uncertainty which must be represented in a weather
forecast, initial condition uncertainty and model uncertainty (Palmer, 2001).
Having attempted to represent the errors in our prediction, the accuracy of the forecast pdf
must be verified. In the same way that Donati et al. (1973) measured the location of many
electrons, many forecast-observation (or -verification) pairs must be used to evaluate how well
the forecast ensemble represents uncertainty. Ideally, the ensemble forecast should behave like
a random sample from the forecast pdf (the hypothetical pdf representing the uncertainty in
the forecast). The consistency condition is that the verification also behaves like a sample from
that pdf (Anderson, 1997; Wilks, 2006). If this condition is fulfilled, the ensemble is perfectly
capturing the uncertainty in the forecast.
In this chapter, techniques for evaluation of the predicted uncertainty in a forecast are
considered in the context of predicting the weather. In Section 4.2, the problems with current
methods of forecast verification are discussed, and the need for a new scoring rule, the Error-spread Score, is motivated; the score is defined in Section 4.3. The new Error-spread Score is shown
to be proper in Section 4.4, and in Section 4.5 the decomposition of the score into reliability,
resolution and uncertainty components is discussed. In Section 4.6 the Error-spread Score is
tested and compared to existing diagnostics using forecasts made in the Lorenz ’96 system. In
Section 4.7 the Error-spread Score is tested using operational ensemble forecasts from ECMWF.
The decomposition of the score is evaluated for the ECMWF forecasts in Section 4.8, which
gives a more complete understanding of the new score. Finally, in Section 4.9 the score is
used to evaluate forecasts made using the ECMWF seasonal forecasting system, and some
conclusions are drawn in Section 4.10.
4.2 Evaluation of Ensemble Forecasts
Section 1.7 outlined some of the different methods commonly used for forecast verification.
All the methods discussed are sensitive to the two properties which a probabilistic forecast
must have to be useful: reliability and resolution. Graphical forecast diagnostics provide a
comprehensive summary of the forecast, including an indication of reliability and resolution.
However, they do not produce an unambiguous ranking of forecasts, so it is difficult to use them
to compare many models. Scoring rules are useful as they provide a quantitative indication of
forecast skill, allowing many different forecasts to be compared. Bröcker (2009) showed that all
strictly proper scores can be explicitly decomposed into a component which tests reliability and
a component which tests resolution. The decomposition also includes a third term, uncertainty,
which depends only on the statistics of the observations.
Currently, many scoring rules used for forecast verification, such as the Continuous Ranked
Probability Score, CRPS (Wilks, 2006), and the Ignorance Score, IGN (Roulston and Smith,
2002), require an estimate of the full forecast pdf. This is usually achieved using kernel smoothing estimates or by fitting the parameters in some predetermined distribution, both of which
require certain assumptions about the forecast pdf to be made. Alternatively, the pdf must be
discretised in some way, such as for the Brier Score, BS, and Ranked Probability Score, RPS
(Wilks, 2006), which were both originally designed for multi-category forecasts. On the other
hand, the RMS error-spread graphical diagnostic used in Chapter 2 is an attractive verification tool as it does not require an estimation of the full forecast pdf, and instead is calculated
using the raw ensemble forecast data. However, being a graphical diagnostic, it is very difficult
to compare many forecast models using this tool (for example, comparing the many different forecasts generated by changing the tunable parameters in a stochastic parametrisation
scheme).
A new scoring rule is proposed, designed for ensemble forecasts of continuous variables,
that is particularly sensitive to the reliability of a forecast and which seeks to summarise the
RMS error-spread graphical diagnostic. In a similar way to the RMS error-spread graphical
diagnostic, it is formulated with respect to moments of the forecast distribution, and not
using the full distribution itself. These moments may be calculated directly from the ensemble
forecast, provided it has sufficient members for an accurate estimate to be made. The new
score proposed does not require the forecast to be discretised, and acknowledges the inability
of the forecaster to fully specify a probability distribution for a variable due to the amount
of information needed to estimate the distribution. This limitation has been recognised by
other authors and forms the basis for the development of Bayes Linear Statistics (Goldstein
and Wooff, 2007).
The new score will be compared with a number of existing proper scores: the Brier Score,
BS (1.14), the reliability component of the Brier Score, REL (1.17), the Ranked Probability
Score, RPS (1.20) and the Ignorance Score, IGN (1.21). Each score is converted into a Skill
Score by comparison with a reference forecast following (1.12).
4.3 The Error-Spread Score
Consider two distributions. Q(X) is the truth probability distribution function for variable X
which has moments mean, µ, variance, σ², skewness, γ, and kurtosis¹, β, defined in the usual
way:
$$\mu = E[X], \qquad (4.1)$$
$$\sigma^2 = E[(X - \mu)^2], \qquad (4.2)$$
$$\gamma = E\left[\left(\frac{X - \mu}{\sigma}\right)^{3}\right], \qquad (4.3)$$
$$\beta = E\left[\left(\frac{X - \mu}{\sigma}\right)^{4}\right], \qquad (4.4)$$
where E[·] denotes the expectation of the variable. The probabilistic forecast issued is denoted
P (X), with mean, m, variance, s2 , skewness, g, and kurtosis, b, defined in the same way.
The perfect probabilistic forecast will have moments equal to those of the truth distribution:
m = µ, s² = σ², g = γ, and b = β, etc.

¹Note that kurtosis is used, not excess kurtosis. The kurtosis of the normal distribution is β_N = 3.
The proposed Error-spread Score, ES, is written
$$\mathrm{ES} = (s^2 - e^2 - esg)^2, \qquad (4.5)$$
where the difference between the verification, z, and the ensemble mean, m, is the error in the
ensemble mean,
$$e = m - z, \qquad (4.6)$$
and the verification, z, follows the truth probability distribution, Q. The mean value of
the score is calculated over many forecast-verification pairs, both from different grid point
locations and from different starting dates. A smaller average value of the score indicates a
better forecast.
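Computed directly from an ensemble, (4.5) might be implemented as below. This is a sketch: the moments are estimated with plain sample estimators, which implicitly assumes the ensemble has enough members for those estimates to be stable.

```python
import numpy as np

def error_spread_score(ensemble, z):
    """Error-spread Score (4.5) for one forecast-verification pair.
    `ensemble` has shape (n_members,); `z` is the verifying observation."""
    m = ensemble.mean()
    s = ensemble.std()                       # ensemble spread (plain estimator)
    g = np.mean(((ensemble - m) / s) ** 3)   # sample skewness
    e = m - z                                # error in the ensemble mean, (4.6)
    return (s ** 2 - e ** 2 - e * s * g) ** 2

# averaged over many forecast-verification pairs:
# ES = np.mean([error_spread_score(ens, z) for ens, z in pairs])
```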
The first two terms in the square on the right hand side of (4.5) are motivated by the error-spread relationship (1.26): for a reliable ensemble, it is expected that the ensemble variance, s², will give an estimate of the expected squared error in the ensemble mean, e² (Leutbecher, 2010).
Figure 4.1: Schematic illustrating how the Error-spread Score accounts for the skewness of the
forecast distribution. The hypothetical pdf shown as a function of X is positively skewed, and
has mean (m) and standard deviation (s) as indicated. See text for more details.
However, with these two terms alone, the score is not proper. Consider the trial score

$$\mathrm{ES}_{\mathrm{trial}} = (s^2 - e^2)^2. \qquad (4.7)$$

It can be shown that the expected value of this score is not minimised by predicting the true moments, m = µ, s = σ. In fact it is minimised by forecasting m = µ + γσ/2 and s² = σ²(1 + γ²/4) (see Appendix B.1). The substitutions m → m + gs/2 and s² → s²(1 + g²/4) transform the trial score into the Error-spread Score, (4.5), which can be shown to be a proper score (Section 4.4).
The third term in the Error-spread Score can be understood as acknowledging that the
full forecast pdf contains more information than is in the first two moments alone. This
term depends on the forecast skewness, g. Consider the case when the forecast distribution
is positively skewed (Figure 4.1). If the observed error is smaller than the predicted spread,
e2 < s2 , the verification must fall in either section B or C in Figure 4.1. The skewed forecast
distribution predicts the verification is more likely to fall in B, so this case is rewarded by the
scoring rule. If the observed error is larger than the predicted spread, e2 > s2 , the verification
must fall in either section A or D in Figure 4.1. Now, the forecast pdf indicates section D is
the more likely of the two, so the scoring rule rewards a negative error e.
The Error-spread Score is a function of the first three moments only. This can be understood by considering (4.7). When expanded, the resultant polynomial is fourth order in the
verification, z. The coefficient of the z⁴ term is unity, i.e. it is dependent on the fourth power of the verification only, so the forecaster cannot hedge his or her bets by altering the kurtosis of the forecast distribution. The first term with a non-constant coefficient is the z³ term, indicating that skewness is the first moment of the true distribution which interacts with the forecaster’s prediction. The forecast skewness is therefore important, and should appear in the proper score. If the score were based on higher powers, for example motivated from (sⁿ − eⁿ)², the highest-order moment required would be the (2n − 1)st moment.²

²Scores based on the magnitude of the error were also considered, for example (|s| − |e|)², but a proper score could not be found.
4.4 Propriety of the Error-spread Score
A scoring rule must be proper in order to be a useful measure of forecast skill. The ES cannot be strictly proper, as it is a function only of the moments of the forecast distribution — a pdf with the same moments as the true pdf will score equally well. However, it is important to confirm that the ES is a proper score.
To test for propriety, we calculate the expected value of the score, assuming the verification
follows the truth distribution (refer to Appendix B.2 for the full derivation).
E[\mathrm{ES}] = \left[(\sigma^2 - s^2) + (\mu - m)^2 - sg(\mu - m)\right]^2 + \sigma^2\big(2(\mu - m) + (\sigma\gamma - sg)\big)^2 + \sigma^4\big(\beta - \gamma^2 - 1\big) \qquad (4.8)
In order to be proper, the expected value of the scoring rule must be minimised when
the truth distribution is forecast. Appendix B.2 confirms that the truth distribution falls
at a stationary point of the score, and that this stationary point is a minimum. Therefore,
the scoring rule is proper, though not strictly proper, and is optimised by issuing the truth
distribution. Appendix B.2 also includes a second test of propriety, from Bröcker (2009).
4.5 Decomposition of the Error-spread Score
It is useful to decompose a score into its constituent components, reliability and resolution, as
it gives insight into the source of skill of a forecast. It allows the user to identify the strengths of
one forecasting system over another. Importantly, it indicates the characteristics of the forecast
which require improvement, providing focus for future research efforts. Many of the existing
scoring rules have been decomposed into their constituent components. The BS (Brier, 1950)
has been decomposed in several ways (e.g. Sanders, 1963; Murphy, 1973, 1986). Similarly,
the CRPS can be decomposed into two parts scoring reliability and resolution/uncertainty
(Hersbach, 2000). Tödter and Ahrens (2012) show that a generalisation of IGN can also be
decomposed into reliability, resolution and uncertainty components. In each of these cases, the
decomposition allows the source of skill in a forecast to be identified.
It is desirable to be able to decompose the ES into its constituent components as has been
carried out for the BS and CRPS. Appendix B.3 shows that the ES score, as a proper score,
can be decomposed into a reliability, resolution and uncertainty component:

\mathrm{ES} = \frac{1}{n}\sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\left[\underbrace{\left(s_i^2 - \overline{e^2}_{i,j}\right)^2}_{a} + \underbrace{\left(\frac{s_i g_j \overline{e^2}_{i,j} + G_{i,j}}{\overline{e}_{i,j}}\right)^2}_{b}\right]
- \frac{1}{n}\sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\left[\underbrace{\left(\overline{e^2}_{i,j} - \overline{e^2}\right)^2}_{c} + \underbrace{\overline{e^2}_{i,j}\left(\frac{G_{i,j}}{\overline{e^2}_{i,j}} - \frac{\overline{G}}{\overline{e^2}}\right)^2}_{d}\right]
+ \underbrace{\frac{1}{n}\sum_{k=1}^{n}\left[\left(\overline{e^2} - e_k^2\right)^2 + e_k^2\left(\frac{\overline{G}}{\overline{e^2}}\right)^2 - 2e_k^3\,\frac{\overline{G}}{\overline{e^2}}\right]}_{e} \qquad (4.9)
The first term evaluates the reliability of the forecast. This has two components, a and
b, which test the reliability of the ensemble spread and the reliability of the ensemble shape
respectively. Term a is the squared difference between the forecast variance and the observed
mean square error for that forecast variance. For a reliable forecast, these terms should be
equal (Leutbecher and Palmer, 2008; Leutbecher, 2010). The smaller the term a, the more
reliable the forecast spread. Term b is the squared difference between the measured shape factor, $G_{i,j}$, and the expression the shape factor takes if the ensemble spread and skew are accurate, $-s_i g_j \overline{e^2}_{i,j}$ (B.41). If the forecast skewness, or ‘shape’, of the probability distribution
is a good indicator of the skewed uncertainty in the forecast distribution, this term will be
small. For both terms a and b, the sum is weighted by the number of forecast-verification
pairs in each bin, Ni,j .
The second term evaluates the resolution of the forecast. This also has two components, c
and d, testing the resolution of the predicted spread and the resolution of the predicted shape
respectively. Both terms evaluate how well the forecasting system is able to distinguish between
situations with different forecast uncertainty characteristics. Term c is the squared difference
between the mean square error in each bin and the climatological mean squared error. If the
forecast has high resolution, the spread of the forecast should separate predictions into cases
with low uncertainty (low mean square error), and those with high uncertainty (high mean
square error), resulting in a large value for term c. If the forecast spread does not indicate
the expected error in the forecast, term c will be small as all binned mean squared errors
will be close to the climatological value. Therefore a large absolute value of term c indicates
high resolution in the predicted spread. This is subtracted when calculating the Error-spread
Score, contributing to the low value of ES for a skilful forecast. Similarly, term d indicates the
resolution of the skewness or shape of the ensemble forecast, evaluating the squared difference
between the binned and climatological shape factors. If this term is large, the forecast has
successfully distinguished between situations with different degrees of skewness in the forecast
uncertainty: it has high shape resolution. Again, for both terms c and d, the sum is weighted
by the number of forecast-verification pairs in each bin.
The last term, e, is the uncertainty in the forecast, which is not a function of the binning
process. It depends only on the measured climatological error distribution, compared to the
individual measurements. Nevertheless, unlike for the Brier score decomposition, this term
is not independent of the forecast system, and instead provides information about the error
characteristics of the forecast system: a system with larger errors on average will have a larger
uncertainty term. The term is reduced by reducing the mean square error in the forecast.
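The binned quantities entering (4.9) can be estimated along the following lines. This is an illustrative sketch only: for brevity it bins in one dimension (forecast spread) and computes the spread reliability and resolution terms, a and c; the full decomposition also bins by skewness and includes the shape terms b and d, as well as the uncertainty term e:

```python
import numpy as np

def es_spread_components(s, e, n_bins=10):
    """Sketch of the spread terms of the ES decomposition (4.9).

    s : (n,) forecast standard deviations
    e : (n,) errors in the ensemble mean, e = m - z
    """
    n = len(s)
    order = np.argsort(s)
    bins = np.array_split(order, n_bins)   # equal-population bins by spread
    e2_clim = np.mean(e**2)                # climatological mean square error

    reliability = 0.0                      # term a, summed over bins
    resolution = 0.0                       # term c, summed over bins
    for idx in bins:
        N_i = len(idx)
        s2_i = np.mean(s[idx]**2)          # mean forecast variance in the bin
        e2_i = np.mean(e[idx]**2)          # mean square error in the bin
        reliability += N_i * (s2_i - e2_i)**2
        resolution += N_i * (e2_i - e2_clim)**2
    return reliability / n, resolution / n  # resolution enters (4.9) negatively
```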
4.6 Testing the Error-spread Score: Evaluation of Forecasts in the Lorenz ’96 System
The experiments carried out in the L96 simplified model of the atmosphere described in Chapter 2 can be used to test the ES. Forecasts made using the additive stochastic parametrisation scheme are evaluated at a lead time of 0.9 model time units (∼4.5 atmospheric days). The other experimental details are identical to Chapter 2, including the details of the parametrisation scheme and the number of ensemble members. The tunable parameters in the forecast model, the magnitude of the noise term (σn) and the temporal autocorrelation of the noise term (φ), are varied, and the forecasts for each are evaluated using three techniques. Figure 4.2(a)
shows the graphical error-spread diagnostic (Section 1.7.3.1). The forecast-verification pairs
are binned according to the variance of the forecast. The average variance in each bin is plotted against the mean square error in each bin. For a reliable forecast system, these points
should lie on the diagonal (Leutbecher and Palmer, 2008). Figure 4.2(b) shows the reliability
component of the Brier Score, REL (1.17), where the “event” was defined as “the Xk variable
is in the top third of its climatological distribution”. Figure 4.2(c) shows the new Error-spread
Skill Score (ESS), which is calculated with respect to the climatological forecast.
The difficulty of analysing many forecasts using a graphical method can now be appreciated.
Trends can easily be identified in Figure 4.2(a), but the best set of parameter settings is hard
to identify. The stochastic forecasts with small magnitude noise (low σn ) are under-dispersive.
The error in the ensemble mean is systematically larger than the spread of the ensemble, i.e.
they are overconfident. However, the stochastic parametrisations with very persistent, large-magnitude noise (large σn, large φ) are over-dispersive and under-confident. Figure 4.2(b)
shows REL evaluated for each parameter setting, which is small for a reliable forecast. It
scores highly those forecasts where the variance matches the mean square error, such that
the points in (a) lie on the diagonal. The ESS is a proper score, and is also sensitive to the
resolution of the forecast. It rewards well calibrated forecasts, but also those which have a
small error. The peak of the ESS in Figure 4.2(c) is shifted down compared to REL, and it
penalises the large σn , large φ models for the increase in error in their forecasts. The ESS
summarises Figure 4.2(a), and shows a sensitivity to both reliability and resolution as required
of a proper score.
4.7 Testing the Error-spread Score: Evaluation of Medium-Range Forecasts
The ESS was tested using ten day operational forecasts made with the ECMWF EPS. The
EPS uses a spectral atmospheric model, the Integrated Forecasting System (IFS) (described in
detail in Section 5.2). The EPS is operationally run out to day ten with a horizontal triangular truncation of T639 (the IFS is a spectral model, and resolution is indicated by the wave number at which the model is truncated; for comparison, a spectral resolution of T639 corresponds to 30 km resolution, or a 0.28° latitude/longitude grid at the equator), with 62 vertical levels, and uses persisted sea surface temperature (SST) anomalies instead of a dynamical ocean model.
Figure 4.2: (a) The Mean Square (MS) Error-Spread diagnostic, (b) the Reliability component of the Brier Score and (c) the Error-spread Skill Score, evaluated for forecasts of the L96 system using an additive stochastic parametrisation scheme. In each figure, moving left–right the autocorrelation of the noise in the forecast model, φ, increases. Moving bottom–top, the standard deviation of the noise, σn, increases. The individual figures in (a) correspond to different values of (φ, σn). The bottom row of figures in (a) is blank because deterministic forecasts cannot be analysed using the MS Error-Spread diagnostic: there is no forecast spread to condition the binning on.
Figure 4.3: The 4DVar analyses (black) of temperature at 850 hPa (T850) are compared to the ten-day ensemble forecasts (grey) at 11 longitudes at 4°N for (a) the EPS and (b) the DD system, and at 11 longitudes at 53°N for (c) the EPS and (d) the DD system. The forecasts are initialised on 19th April 2012 for all cases. The horizontal black dashed lines correspond to the deciles of the climatological distribution of T850 at this latitude.
The 50 member ensemble samples initial condition uncertainty using the EDA system (Isaksen et al., 2010). The perturbations prescribed
by the EDA system are combined with perturbations from the leading singular vectors to
define an ensemble which represents the initial condition uncertainty well (Buizza et al., 2008).
The EPS system uses stochastic parametrisations to represent uncertainty in the forecast due
to model deficiencies. The 50 ensemble members differ as each uses a different seed for the
stochastic parametrisation schemes. Two stochastic parametrisation schemes are used: SPPT
(Section 1.4.3.1) and SKEB (Section 1.4.3.2).
Ten day forecasts are considered, initialised from 30 dates between April 14th and September 15th 2012, and separated from each other by five days. The high resolution 4D variational
(4DVar) analysis (T1279, 16 km) is used for verification. Both forecast and verification fields
are truncated to T159 (125 km) before verification. Forecasts of temperature at 850 hPa (T850
— approximately 1.5 km above ground level) are considered, and the ES evaluated as a function
of latitude.
For comparison, a perfect statistical probabilistic forecast is generated based on the high
resolution T1279 operational deterministic forecast. This is defined in an analogous way to
the idealised hypothetical forecasts in Leutbecher (2010). The error between the deterministic
forecast and the 4DVar analysis is computed for each ten day forecast, and the errors grouped
as a function of latitude. Each deterministic forecast is then dressed by adding a 50 member
ensemble of errors to the deterministic forecast, where the errors are drawn from this latitudinally dependent distribution. The error distribution does not include spatial or temporal
correlations. This dressed deterministic (DD) ensemble can be considered a “perfect statistical” forecast, as the error distribution is correct if averaged over all time. However, the error distribution is static: it does not vary from day to day as the predictability of the atmospheric flow varies. A useful score should distinguish between this perfect statistical forecast and the dynamic probabilistic forecasts made using the EPS.

Figure 4.4: RMS error-spread plot for forecasts made using the EPS (pale grey) and the DD system (dark grey). Ten day forecasts of T850 are considered for latitudes between (a) 18°S and 10°S, (b) 0°N and 8°N, and (c) 50°N and 60°N. The ensemble forecasts are sorted and binned according to their forecast spread. The standard deviation of the error in the ensemble mean in each bin is plotted against the RMS spread for each bin. For a reliable ensemble, these should lie on the diagonal shown (Leutbecher and Palmer, 2008).
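A sketch of the DD construction described above follows (hypothetical variable names; errors are pooled per latitude band and drawn independently, with no spatial or temporal correlation, as in the text):

```python
import numpy as np

rng = np.random.default_rng(1)

def dressed_deterministic(det_fc, errors_by_lat, lat_idx, n_members=50):
    """Dress a deterministic forecast with an ensemble of climatological errors.

    det_fc        : (n_points,) deterministic forecast values
    errors_by_lat : list of 1-D arrays of historical errors, one per latitude band
    lat_idx       : (n_points,) latitude-band index of each forecast point
    """
    members = np.empty((n_members, det_fc.size))
    for k, (x, j) in enumerate(zip(det_fc, lat_idx)):
        draws = rng.choice(errors_by_lat[j], size=n_members)  # resample errors
        members[:, k] = x + draws        # add the error sample to the forecast
    return members
```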
An example ten day forecast using these two systems is shown in Figure 4.3 for 11 longitudes
close to the equator, and for 11 longitudes at mid-latitudes. The flow dependency of the
EPS is evident — the spread of the ensemble varies with position giving an indication of the
uncertainty in the forecast. The spread of the DD ensemble varies slightly, indicating the
sampling error for a 50 member ensemble.
Figure 4.4 shows the RMS error-spread diagnostic for three different latitude bands. On
average, the DD is perfectly reliable — the mean of the scattered points lies on the diagonal
for each case considered. However, the spread of the forecast does not indicate the error in the
ensemble mean. In contrast, the EPS forecasts contain information about the expected error
in the ensemble mean. The spread of the EPS is well calibrated, though at latitudes close to
the equator it is slightly under-dispersive (Figure 4.4(b)). At a lead time of ten days, the RMS
error between the deterministic forecast and the verification is higher than the RMS error in
the ensemble mean for the lower resolution EPS. This difference is greatest at mid-latitudes,
and can be observed in Figure 4.4(c).
Figure 4.5: Forecasting skill of the EPS as a function of latitude using the ES (solid); IGN calculated following Roulston and Smith (2002) (dot-dash), for ten events defined with respect to the deciles of the climatology; and RPS (dash), for ten events defined with respect to the deciles of the climatology. The ten day forecasts are compared with the 4DVar analysis for T850 and averaged over 30 initial dates in Summer 2012. The scores have been scaled to allow them to be displayed on the same axes: the ES by a factor of 1/300, and the RPS by a factor of 5.

Figure 4.5 shows the skill of the EPS forecasts calculated using three different proper scores: ES, RPS and IGN. For each, the smaller the score, the better the forecast. The BS was also calculated, but the results were very similar to the RPS and so are not shown here. All scores agree that the skill of the EPS forecast is lower in mid-latitudes than at ±20°. However, they disagree as to the skill of the forecast near the equator. The RPS and IGN indicate a reduced skill at the equator, whereas the ES indicates a higher skill there than at mid-latitudes. The cause of this difference is that at the equator the climatological variability is
much smaller than at mid-latitudes, so the climatological deciles are closer together. This affects
the scores conditioned on the climatological percentiles (RPS, IGN), which do not account for
the spacing of the bins. At the equator, even if the forecast mean and verification are separated
by several bins, the magnitude of the error is actually small. It seems unreasonable to penalise
forecasts twice near the equator when calculating a skill score — the first time from the closer
spaced bins, and the second time by calculating the skill score with respect to a more skilful
climatological forecast. The ES score is not conditioned on the climatological percentiles, so
is not susceptible to this. It rewards forecasts made close to the equator for their small RMS
error when compared to forecasts made at other latitudes.
Figure 4.6 shows skill scores for the EPS forecast calculated with reference to the DD
forecast, using three different proper scores, ESS, RPSS and IGNSS. In each case, the skill
score, SS, is related to the score for the EPS, S_EPS, and for the DD, S_DD, by

\mathrm{SS} = 1 - \frac{S_{\mathrm{EPS}}}{S_{\mathrm{DD}}}. \qquad (4.10)

The higher the skill score, the better the scoring rule is able to distinguish between the dynamic probabilistic forecast made using the EPS, and the statistical forecast using the DD ensemble.
Figure 4.6: Skill scores for the EPS forecast as a function of latitude. Three proper skill scores
are calculated using the dressed deterministic forecast as a reference: the ESS (solid), IGNSS
(dot-dash), RPSS (dash). The ten day T850 forecasts are compared with the 4DVar analysis
and averaged over 30 initial dates in Summer 2012.
Figure 4.6 indicates that the Error-spread Skill Score is considerably more sensitive to this
property of an ensemble than the other scores, though it still ranks the skill of different latitudes
comparably. All scores indicate forecasts of T850 at the equator are less skilful than at other
latitudes: the ESS indicates there is forecast skill at these latitudes, though the other scores
suggest little improvement over the climatological forecast — the skill scores are close to zero.
It has been observed that the deterministic forecast has a larger RMS error than the mean
of the EPS forecast. This will contribute to the poorer scores for the DD forecast compared to
the EPS forecast. A harsher test of the scores is to compare the EPS forecast with a forecast
which dresses the EPS ensemble mean with the correct distribution of errors. This dressed
ensemble mean (DEM) forecast differs from the EPS forecast only in that it has a fixed (perfect)
ensemble spread, whereas the EPS produces a dynamic, flow-dependent indication of forecast
uncertainty. Figure 4.7 shows the skill of the EPS forecast calculated with respect to the DEM
forecast. The ESS is able to detect the skill in the EPS forecast from the dynamic reliability
of the ensemble. Near the equator, the EPS forecast is consistently under-dispersive, so has
negative skill compared to the DEM ensemble, which has the correct spread on average (in
Chapter 6, a new stochastic parametrisation scheme is proposed which substantially improves
the spread of the ECMWF ensemble forecasts at equatorial latitudes). The skill observed when comparing the EPS to the DD forecast is due to the lower RMS error for the EPS forecast at equatorial latitudes. The other skill scores only indicate a slight improvement of the EPS over the DEM — compared to the ESS, they are insensitive to the dynamic reliability of a forecast.
Figure 4.7: Skill Scores for the EPS forecast evaluated as a function of latitude. Three proper
skill scores are calculated using the dressed deterministic forecast as a reference, where the
ensemble mean is used as the deterministic forecast: the ESS (solid), IGNSS calculated following Roulston and Smith (2002) (dot-dash), RPSS (dash). The ten-day T850 forecasts are
compared with the 4DVar analysis and averaged over 30 initial dates in Summer 2012.
4.8 Evaluation of Reliability, Resolution and Uncertainty for EPS forecasts
The source of skill in the EPS forecasts can be investigated further by calculating the decomposition of the ES as a function of latitude and longitude. Operational ten-day forecasts
made using the ECMWF EPS are considered, initialised from 30 dates between April 14th
and September 15th in 2010, 2011 and 2012 respectively, and 10 dates from the same period
in 2009: a large sample is required since the forecast is binned in two dimensions. As before,
the high resolution 4D variational (4DVar) analyses (T1279, 16 km) are used for verification.
Forecast and verification fields are truncated to T159 (125 km) before verification. Forecasts
of T850 are considered. The bias (expected value of the error) in these forecasts is small
(< 0.25% at any one latitude, < 0.05% globally), so the approximations made in deriving the
decomposition are valid.
To perform the decomposition, the forecast-verification pairs are sorted into ten bins of
equal population according to the forecast standard deviation. Within each of these bins,
the forecasts are sorted into 10 further bins of equal population according to their skewness.
To increase the sample size, for each latitude-longitude point the forecast-verification pairs
within a radius of 285 km are used in the binning process, which has the additional effect
of spatially smoothing the calculated scores. The number of data points within a bin varies
slightly depending on latitude, but is approximately 20. The average standard deviation and
skewness are calculated for each bin, as are the average error characteristics required by (4.9).
The EPS is compared with the dressed ensemble mean (DEM) forecast described above.
The decomposition of the ES should distinguish between this perfect statistical forecast and
the dynamic probabilistic forecasts made using the EPS, and identify in what way the dynamic
probabilistic forecast improves over the perfect statistical case.
Figure 4.8(a) shows the forecasting skill of the EPS evaluated using the ES. The lower the
value of the score, the better the forecast. A strong latitudinal dependency in the value of the
score is observed, with better scores found at low latitudes. This can be attributed largely
to the climatological variability, which is strongly latitudinally dependent. At high latitudes
the variability is greater and the mean square error is larger, so the ES is larger. This is explained in
more detail in Section 4.7 above. Figure 4.8(b) shows the forecasting skill of the EPS evaluated
using the ES, where the ES has been calculated using the decomposition described in (4.9).
The results are similar to using the raw ES, confirming the decomposition is valid. The small
observed differences can be attributed to two causes. Firstly, the decomposition assumes that
spread and skew are discrete variables constant within a bin, which is not true. Secondly, the
decomposition uses neighbouring forecast-verification pairs to increase the sample size for the
binning process, which is not necessary when the ES is evaluated using (4.5). Figure 4.8(c)
shows the Error-spread Skill Score (ESS) calculated for the EPS with reference to the DEM
forecasts following (4.10). A positive value of the skill score indicates an improvement over
the DEM forecast whereas a negative value indicates the DEM forecast was more skilful. The
results overwhelmingly indicate the EPS is more skilful than the DEM, with positive scores
over much of the globe. The highest skill is found in two bands north and south of the equator
in the western Pacific Ocean. There are some small regions with negative skill over equatorial
land regions and over the equatorial east Pacific. It is these regions which are responsible for
the negative ESS at low latitudes in Figure 4.7.
To investigate the source of skill in the EPS compared to the DEM forecast, the decomposition of the ES was calculated for both sets of forecasts. Figure 4.9 shows the reliability,
resolution and uncertainty terms calculated for the EPS (left hand column) and DEM (centre
column) forecasts. Visually, the plots in the two columns look similar. Comparing Figure 4.9(a)
and (b) indicates that the reliability term tends to be smaller for the EPS across much of the
tropics, and comparing (d) and (e) shows that the resolution term tends to be smaller for
Figure 4.8: Forecasting skill of the EPS evaluated using the Error-spread Score. Ten day forecasts of T850 are compared with the 4DVar analysis and averaged over 100 dates sampled from April–Sept, 2009–2012. (a) The score is calculated the standard way using (4.5). (b) The score is calculated using the decomposition in (4.9). (c) The Error-spread Skill Score for the EPS forecast, calculated the standard way, with respect to the DEM forecast. For (a) and (b), the score is plotted on a logarithmic scale — a contour level of “n” indicates a score of 10ⁿ.
the DEM. The uncertainty term, shown in (g) and (h), is similar for the EPS and DEM. In
Figure 4.9(b), the method of construction of the DEM forecast results in a strong horizontal
banding across much of the equatorial Pacific Ocean. The standard deviation of the DEM
forecast is constant as a function of longitude, and the error characteristics are similar, so the
reliability term is approximately constant.
To ease comparison, Figures 4.9 (c), (f) and (i) show the skill score calculated for the EPS
forecasts with reference to the DEM forecasts for the reliability, resolution and uncertainty
components of the ES respectively. Figure 4.9(c) shows the ES reliability skill score. High skill
scores indicate the EPS is more reliable than the DEM. Very high skill scores of greater than
0.8 are found in two bands north and south of the equator in the western Pacific Ocean, with
lower positive scores observed over much of the Pacific Ocean. This indicates that the high
skill in these regions, as indicated by the ESS, is largely attributable to an improvement in the
reliability of the forecast.
At polar latitudes and in the south-east Pacific, the reliability skill score is negative, indicating the DEM is more reliable than the EPS. However, in these regions, Figure 4.9(f) shows an ES resolution skill score that is large and negative. Because resolution contributes negatively
to the total score, a large value of resolution is desirable and negative values of the resolution
skill score indicate skill in the EPS forecast. At polar latitudes and in the south-east Pacific,
the EPS forecasts have more resolution than the DEM forecasts. Therefore, despite their low
reliability at these latitudes, the overall ESS indicates an improvement over the DEM. The
improvement in ES in these regions can be attributed to an improvement in resolution of the
forecast. At low latitudes, the resolution of the EPS is similar to that of the DEM.
Figure 4.9(i) shows the ES uncertainty skill score. This is zero over much of the globe,
indicating the EPS and DEM forecasts have very similar uncertainty characteristics. This is
as expected, since the forecast error characteristics are near identical. The small deviations
from zero can be attributed to sampling: the sample distribution of errors used to dress the
deterministic forecast does not necessarily have a mean of zero.
The ES decomposition has indicated in what ways the EPS forecast is more skilful than the DEM forecast, and has also highlighted regions of concern. It is of interest to see if this skill is reflected in other diagnostic tools. The calibration of the second moment of the ensemble forecast can be evaluated by constructing RMS error-spread diagrams, which test whether (1.26) is followed. This diagnostic is a more comprehensive analysis of the calibration of the forecast, and can be used to identify the shortcomings of a forecast in more detail.

Figure 4.9: Source of forecasting skill evaluated using the ESS, comparing the EPS and DEM forecasts. See text for more details. The reliability component of (a) the EPS forecasts and (b) the DEM forecasts. (c) The reliability skill score: positive values indicate the EPS is more reliable than the DEM forecast. The resolution component of (d) the EPS forecasts and (e) the DEM forecasts. (f) The resolution skill score: negative values indicate the EPS has more resolution than the DEM forecast. The uncertainty component of (g) the EPS forecasts and (h) the DEM forecasts. (i) The uncertainty skill score: positive values indicate the EPS has lower uncertainty than the DEM forecast. The colourbar in (a) also corresponds to figures (b), (d–e) and (g–h). The colourbar in (c) also corresponds to figures (f) and (i). In (a–b), (d–e) and (g–h), the components of the score are plotted on a logarithmic scale — a contour level of “n” indicates a score of 10ⁿ.

Figure 4.10: The three regions of interest defined by considering the decomposition of the ES. Region 1 is defined as 10–25°N, 120–200°E. Region 2 is defined as 0–8°N, 220–280°E. Region 3 is defined as 35–50°S, 200–280°E.
The forecast-verification pairs are sorted and binned according to the forecast variance, and
the RMS error and spread evaluated for each bin. The spread reliability and spread resolution
can be identified on these diagrams. A forecast with high spread reliability has scattered points
lying close to the diagonal line. If the vertical range of the scattered points is large, the forecast has successfully sorted the cases according to their uncertainty, and the forecast has high resolution.
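For reference, the construction of this diagnostic can be sketched as follows (illustrative only; the scatter of the returned points about the diagonal is what Figures 4.4 and 4.11 display):

```python
import numpy as np

def rms_error_spread(s, e, n_bins=20):
    """Points for an RMS error-spread diagram.

    s : (n,) forecast standard deviations
    e : (n,) errors in the ensemble mean
    """
    order = np.argsort(s)                    # sort pairs by forecast spread
    bins = np.array_split(order, n_bins)     # equal-population bins
    rms_spread = [np.sqrt(np.mean(s[idx]**2)) for idx in bins]
    rms_error = [np.sqrt(np.mean(e[idx]**2)) for idx in bins]
    return np.array(rms_spread), np.array(rms_error)  # on the diagonal = reliable
```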
Three regions of interest were defined by consideration of Figure 4.9. The three regions are indicated in Figure 4.10. Region 1 is defined as 10–25°N, 120–200°E, and covers the region in the north-west Pacific Ocean with a very high reliability skill score. Region 2 is defined as 0–8°N, 220–280°E, and covers the region in the east Pacific Ocean with a very low (negative) reliability skill score. Region 3 is defined as 35–50°S, 200–280°E, and covers a region in the south-east Pacific Ocean with a negative reliability skill score, but also a negative resolution skill score indicating an improvement in resolution.
Figure 4.11 shows the RMS Error-Spread diagnostic evaluated for each region for both the
EPS and DEM forecasts. This can be compared to the skill score for the ES reliability and
resolution components averaged over each region, shown in Table 4.1. Figure 4.11(a) shows
the results from region 1. As expected, the reliability of the EPS is markedly better than for
the DEM, with the scattered points for the EPS forecasts falling on the diagonal as required
for a statistically consistent ensemble forecast. There is a slight improvement in resolution,
dominated by the cases with the highest uncertainty. This is reflected in Figure 4.9(f) and
Table 4.1, which show an improvement in the region on average.

Figure 4.11: RMS Error-Spread plot for forecasts made using the EPS (pale grey) and the DEM system (dark grey). Ten day forecasts of T850 are considered for three regions: (a) region 1: 10–25°N, 120–200°E, (b) region 2: 0–8°N, 220–280°E and (c) region 3: 35–50°S, 200–280°E. The ensemble forecasts are sorted and binned according to their forecast spread. The RMS error in each bin is plotted against the RMS spread for each bin. For a reliable ensemble, these should lie on the diagonal shown (Leutbecher and Palmer, 2008).

Region   RELSS   RESSS
1         0.81   −0.39
2        −0.48   −0.93
3        −0.72   −1.19

Table 4.1: Skill scores for the reliability and resolution components of the ES (RELSS and RESSS respectively) for the ECMWF EPS forecast compared to the DEM forecast, for each of the three regions defined in the text.
In Figure 4.11(b), the results are shown for region 2. The reliability of the EPS forecast
is indeed poorer than for the DEM forecast; the ensemble is consistently under-dispersive.
However, the figure indicates an improvement in resolution in this region. This improvement
can be traced to a tongue of very low resolution skill score extending north-west from the
Peruvian coast, visible in Figure 4.9(f).
Figure 4.11(c) shows the results for region 3. As for region 2, the EPS forecast is less reliable than the DEM forecast, being somewhat under-dispersive, though the difference is smaller than in region 2. The resolution of the EPS forecast is better than for the DEM forecast, as expected from the ES decomposition. The ES decomposition has correctly identified regions of interest for the ECMWF EPS which have particularly high or low skill with respect to the reliability or resolution of the forecast. Investigating these regions further using more complete, graphical tests of statistical consistency can then indicate in what way the forecast is unreliable or has poor resolution.
Figure 4.12: RMS error-spread diagnostic for System 4 seasonal forecasts of SST initialised in (a–c) May and (d–f) November. Forecasts of the average SST over each season are considered, and compared to reanalysis. The upright dark grey triangles are for the Niño 3.4 region, the inverted mid-grey triangles are for the Equatorial Indian Ocean region, and the light grey circles are for the North Pacific region, where the regions are defined in the text. To increase the sample size for this diagnostic, the unaveraged fields of SSTs in each region are used instead of their regionally averaged value.
4.9 Application to Seasonal Forecasts
Having confirmed that the Error-spread Score is a proper score, sensitive to both reliability and resolution but particularly sensitive to the reliability of a forecast, the score can now be used to evaluate forecasts made with the ECMWF seasonal forecasting system, System 4.
In System 4, the IFS has a horizontal resolution of T255 (∼ 80 km grid) with 91 levels in the
vertical. The IFS is coupled to the ocean model, Nucleus for European Modelling of the Ocean
(NEMO), and a 51 member ensemble forecast is produced out to a lead time of seven months.
The forecasts are initialised from 1st May and 1st November for the period 1981–2010. These
seasonal forecasts were provided by Antje Weisheimer (ECMWF, University of Oxford).
Three regions are selected for this case study: the Niño 3.4 (N3.4) region is defined as 5°S–5°N, 120–170°W; the equatorial Indian Ocean (EqIO) region is defined as 10°S–10°N, 50–70°E; and the North Pacific (NPac) region is defined as 30–50°N, 130–180°W. The monthly and areally averaged SST anomaly forecasts are calculated for a given region, and compared to the analysis averaged over that region. The forecasts made with System 4 are compared to two reference forecasts. The climatological forecast is generated by calculating the mean, standard deviation and skewness of the areally averaged reanalysis SST for each
region over the 30 year time period considered. This forecast is therefore perfectly reliable, though it has no resolution. A persistence forecast is also generated. The mean of the persistence forecast is set to the average reanalysis SST for the month prior to the start of the forecast (e.g. April for the May initialised forecasts). The mean is calculated separately for each year, and analysis increments are calculated as the difference between the SST reanalysis and the starting SST. The standard deviation and skewness of the analysis increments are calculated and used for the persistence forecast.

Figure 4.13: The ES score as a function of lead time for forecasts of monthly averaged sea surface temperatures, averaged over each region. In each panel, the solid line with circle markers corresponds to the System 4 forecast, the solid line is for the climatological forecast, and the dashed line is for the persistence forecast. Panels (a)–(b) are for the Niño 3.4 region, (c)–(d) are for the Equatorial Indian Ocean region, and (e)–(f) are for the North Pacific region. The left (right) hand column is for forecasts initialised in May (November).
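The two reference forecasts can be sketched as follows (hypothetical names; each returns the mean, standard deviation and skewness that define the reference forecast distribution):

```python
import numpy as np

def _moments(x):
    """Mean, standard deviation and skewness of a sample."""
    m, s = x.mean(), x.std()
    return m, s, np.mean((x - m)**3) / s**3

def climatological_forecast(sst_reanalysis):
    """Moments of the areally averaged reanalysis SST over 1981-2010:
    perfectly reliable by construction, but with no resolution."""
    return _moments(np.asarray(sst_reanalysis))

def persistence_forecast(start_sst, increments):
    """Persistence forecast for one start date. The mean persists the
    pre-forecast SST; spread and skewness come from the historical analysis
    increments (reanalysis SST minus starting SST) at this lead time."""
    _, s, g = _moments(np.asarray(increments))
    return start_sst, s, g
```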
Figure 4.12 shows the RMS error-spread diagnostic for each region calculated for each
season. The spread of the forecasts for each region gives a good indication of the expected
error in the forecast. However, it is difficult to identify which region has the most skilful
forecasts: the EqIO has the smallest error on average, but the forecast spread does not vary
greatly from the climatological spread. In contrast, the errors in the forecasts for the N3.4
region and the NPac region are much larger, but the spread of the ensemble also has a greater
degree of flow dependency.
Figure 4.13 shows the average ES score calculated for each region for the System 4 ensemble
forecasts (solid line with circles), for the climatological forecast (solid line) and for the persistence forecast (dashed line), for the May and November start dates respectively. The System 4
forecasts for the EqIO (panels (c) and (d)) have the lowest (best) ES for the forecast period
for both start dates. However, this region shows little variability, so the climatological and
persistence forecasts are also very skilful. In the NPac region (panels (e) and (f)), variations
in SST are much greater, but show little long term signal: the ensemble is unable to forecast
the observed variations, so the ES is higher in this region. The climatological and persistence
forecasts are also poorer due to the high variability. System 4 forecasts for the N3.4 region also
have a high ES. However, the climatological and persistence forecasts score very poorly, and
have considerably higher ES than System 4 at all lead times for the May initialised forecasts,
and at some lead times for the November initialised cases. This indicates that there is considerable skill in the System 4 forecasts for the N3.4 region: they contain significant information beyond that in the climatological or persistence forecasts. For the November start date, the persistence forecast is most skilful at short lead times though very poor at long lead times, and
the climatological forecast is most skilful for the longest lead times but very poor at short lead
times. The System 4 forecasts perform well throughout the time window.
Consideration of Figure 4.13 also indicates how the ES balances scoring reliability and
resolution in a forecast. Since the climatological and persistence forecasts are perfectly reliable
by construction, the difference in their scores is due to resolution. Figure 4.14 shows the spread
of each reference forecast as a function of lead time for all regions and both start dates. The
skill of the reference forecasts as indicated by Figure 4.13 can be seen to be directly linked
to their spread: the ES scores a reliable forecast with narrow spread as better than a reliable
forecast with large spread. The strong seasonal dependency of the variability of SSTs in the
N3.4 region explains the high skill of the climatological forecast for March–May, but low skill
at other times.
Figure 4.13 shows that the ES detects considerable skill in System 4 forecasts when compared to the climatological or persistence forecasts, but that this skill is dependent on the
region under consideration and the time of year. The skill in the forecasts indicates that the
full forecast pdf gives a reliable estimate of the uncertainty in the ensemble mean, and varies
according to the predictability of the atmospheric flow.
Figure 4.14: The standard deviation of the climatological (solid line) and persistence (dashed
line) reference forecasts for SST, as a function of forecast lead time. The forecasts were
calculated using the analysis data over the time period analysed, 1981–2010. Panels (a)–
(b) are for the Niño 3.4 region, (c)–(d) are for the Equatorial Indian Ocean region, and (e)–(f)
are for the North Pacific region. The left (right) hand column is for forecasts initialised in May
(November).
4.10 Conclusion
A new proper score, the Error-spread Score (ES), has been proposed for evaluation of ensemble
forecasts of continuous variables. It is unique as it is formulated purely with respect to moments
of the ensemble forecast distribution, instead of using the full distribution itself. This means
that the full forecast pdf does not need to be estimated or stored. It is suitable for evaluation of
continuous forecasts, and does not require the discretisation of the forecast using bins, as is the
case for the categorical Brier and Ranked Probability Scores. The score is designed to evaluate
how well a forecast represents uncertainty: is the forecast able to distinguish cases where the atmospheric flow is very predictable from those where the flow is unpredictable? A
well calibrated probabilistic forecast that represents uncertainty is essential for decision making,
and therefore has high value to the user of the forecast. The ES is particularly sensitive to
testing this requirement.
In a similar manner to other proper scores, the ES can be decomposed into reliability,
resolution and uncertainty components. The ES reliability component evaluates the reliability
of the forecast spread and skewness. This term is small if the forecast and verification are
statistically consistent, and the moments of the ensemble forecast are a good indication of the
statistical characteristics of the verification. Similarly, the ES resolution component evaluates
the resolution of the forecast spread and shape. This term contributes negatively to the ES,
so a large resolution term is desirable. This term is large if the spread and skewness of the
ensemble forecast vary according to the state of the atmosphere and the predictability of the
atmospheric flow. The spread of a forecast system with high ES resolution separates forecast
situations with high uncertainty (large mean square error) from those with low uncertainty. The
ES uncertainty component depends only on the measured (climatological) error distribution,
and is independent of the forecast spread or skewness. A forecast system with larger errors on
average will have a larger (poorer) uncertainty component.
The ESS was tested using forecasts made in the Lorenz ’96 system, and was found to
be sensitive to both reliability and resolution as expected. The score was also tested using
forecasts made with the ECMWF IFS. The score indicates that EPS forecasts, which have
a dynamic representation of model uncertainty, are considerably more skilful than a dressed
deterministic ensemble which does not have a flow dependent probability distribution. Existing
scores are not particularly sensitive to this characteristic of probabilistic forecasts. The ES
decomposition attributed the improvement in skill at low latitudes to an improvement in
reliability, whereas the skill at higher latitudes was due to an improvement in resolution. The
ES decomposition was used to highlight a number of regions of interest for the EPS, and the
RMS error-spread diagnostic was calculated for these regions. The results were as expected
from the ES decomposition, but also indicated in what way the forecast was reliable or showed
resolution. The decomposition shown in this chapter is therefore a useful tool for analysing
the source of skill in ensemble forecasts, and for identifying regions which can be investigated
further using more comprehensive graphical diagnostic tools.
The ESS was used to evaluate the skill of seasonal forecasts made using the ECMWF
System 4 model. The score indicates significant skill in the System 4 forecasts of the Niño
3.4 region, as the ensemble is able to capture the flow-dependent uncertainty in the ensemble
mean. The results indicate that the ESS is a useful forecast verification tool due to its ease of
use, computational cheapness and sensitivity to desirable properties of ensemble forecasts.
5 Experiments in the IFS: Perturbed Parameter Ensembles
But as the cool and dense Air, by reason of its greater Gravity, presses upon the hot
and rarified, ’tis demonstrated that this latter must ascend in a continued stream as
fast as it Rarifies
– Edmund Halley, 1686
5.1 Introduction
In Chapters 2 and 3, the impact of perturbed parameter and stochastic representations of model
uncertainty on forecasts in the L96 system was considered. The results from that simple system
indicated that the best stochastic parametrisations produced more skilful forecasts than the
perturbed parameter schemes. However, the perturbed parameter ensembles were skilful in
forecasting the weather of the system, and performed better than many of the sub-optimal
stochastic schemes, such as those which used white noise. This chapter will extend the earlier
work in the Lorenz ’96 system by comparing the performance of a stochastic and perturbed
parameter representation of model uncertainty in the ECMWF convection scheme. Convection
is generally acknowledged to be the parametrisation to which weather and climate models
are most sensitive (Knight et al., 2007), and it is therefore imperative that the uncertainty
originating in the parametrisation of convection is well represented.
In Section 5.2, the ECMWF model, the Integrated Forecasting System (IFS), is described,
and its parametrisation schemes are outlined. In Section 5.3, a generalisation to SPPT is
formulated which allows the stochastic perturbations to the convection tendency to be turned
off and replaced with alternative schemes. Section 5.4 describes the perturbed parameter
representations of model uncertainty which have been developed for this study. In Section 5.5,
the experimental procedure and verification techniques are described, and results are presented
in Section 5.6. In Section 5.7, the results are discussed and some conclusions are drawn.
5.2 The Integrated Forecasting System
The IFS is the operational global weather forecasting model developed and operated by
ECMWF. The following description refers to model version CY37R2, and the configuration
which was operational in 2011. The IFS comprises several components (Anderson and Persson,
2013; ECMWF, 2012). The atmospheric general circulation model consists of diagnostic
equations describing the physical relationship between pressure, density, temperature and
height, together with prognostic equations describing the time evolution of horizontal wind
speed (zonal, U, and meridional, V), temperature (T), humidity (q), and surface pressure. The
model dynamical equations describe the evolution of the resolved-scale variables, while the effect of sub-grid scale processes is included using physically motivated, but statistically derived,
parametrisation schemes. The atmospheric model contains a number of these parametrisation
schemes, which will be discussed further in Section 5.2.1. Each scheme operates independently
on each atmospheric vertical column. Two stochastic parametrisation schemes can be used to
represent model uncertainty: SPPT (Section 1.4.3.1) and SKEB (Section 1.4.3.2).
The atmospheric model is numerically integrated using a semi-Lagrangian advection scheme
combined with a semi-implicit time integration scheme. Together, these provide stability and
accuracy, enabling the use of larger time steps to reduce integration time. Horizontally, the IFS
is a dual spectral/grid-point model. The dynamical variables are represented in spectral space
to aid the calculation of horizontal derivatives and the time-stepping scheme. The physical
parametrisations are spatially localised, so are implemented in grid-point space on a reduced
Gaussian grid. The model then converts back and forth between grid-point and spectral space.
Vertically, the atmospheric model is discretised using sigma co-ordinates. This is a hybrid
co-ordinate system; near the surface, the sigma levels follow the orographic contours whereas
higher in the atmosphere the sigma levels follow surfaces of constant pressure.
Physics Parametrisation Scheme        Abbreviation
Radiation                             RDTT
Turbulence and Gravity Wave Drag      TGWD
Non-orographic Gravity Wave Drag      NOGW
Convection                            CONV
Large Scale Water Processes           LSWP

Table 5.1: Physical parametrisation schemes in the IFS atmospheric model
The atmospheric model is coupled to a land surface model — the H-TESSEL scheme
(Hydrology-Tiled ECMWF Scheme for Surface Exchange over Land). The land within each
grid box is represented by up to six different types of surface, with which the atmosphere
exchanges water and energy. The atmospheric model is also coupled to an ocean wave
model called “WAM”. The coupling allows the exchange of energy between wind and waves
in both directions. Persisted SST anomalies are used out to day ten; the atmosphere and the
ocean are coupled through exchanges of heat, momentum and mass.
Data assimilation is used to calculate the starting conditions for the IFS forecasts. The
four-dimensional variational data assimilation (4DVar) system combines information from observations with the physical description of the atmosphere contained in the model. This generates a physically reasonable estimate of the state of the atmosphere. However, this method
produces no flow-dependent estimate of the uncertainty in the analysis. To estimate this, an
ensemble of data assimilations (EDA) is generated: ten equally likely analyses are calculated at
a resolution of T399 (Isaksen et al., 2010). They differ from each other due to the introduction
of small perturbations in the observations and SST, as well as perturbations from SPPT.
Operationally, the model is used to produce both a high resolution deterministic forecast
and a lower resolution ensemble forecast out to a lead time of fifteen days. The deterministic
model is run at a spectral resolution of T1279 (16 km) with 91 levels in the vertical and a time
step of 10 minutes. A single forecast is made using unperturbed initial conditions from the
4DVar system, without the SPPT and SKEB parametrisation schemes in the forecast model.
The EPS is operationally run with a spectral resolution of T639 (30 km) and 62 vertical levels,
and with a time step of 20 minutes. The ensemble has fifty perturbed members and one
control member. Initial condition uncertainty is sampled using the EDA system, combined
with perturbations from the leading singular vectors. The stochastic parametrisations are
activated to sample model uncertainty.
5.2.1 Parametrisation Schemes in the IFS
There are five main parametrisation schemes in the IFS, shown in Table 5.1. The physics tendencies from these five schemes are combined with the dynamical tendencies using a technique
called “fractional stepping” (Wedi, 1999). The schemes are called sequentially, and schemes
called later use updated variables. This has the disadvantage of introducing intermediate time
steps (hence “fractional stepping”) at which the tendencies are updated by each parametrisation scheme in turn. However, it has the advantage of ensuring a balance between the different
physical parametrisation schemes.
The first scheme to be called is the radiation scheme (RDTT), including both a long- and
short-wave radiation calculation. The full radiation scheme is very expensive, so it is calculated on a coarser grid than the other parametrisation schemes, and the resultant tendencies
interpolated to the required resolution. Furthermore, the scheme is not run at every time step:
for the high resolution forecast, it is run once an hour, and for the EPS it is run once every
three hours. The radiation scheme interacts with clouds: incoming short wave radiation is
reflected by clouds, and the clouds emit long wave radiation. Since 2007, the Monte Carlo
Independent Column Approximation (McICA) approach has been used to account for clouds.
The grid box is divided into a number of sub-columns, each of which has a cloud fraction of ‘0’
or ‘1’ at each vertical level. Instead of calculating the sum over all columns in a grid box and
over all radiation intervals (which would be prohibitively expensive), a Monte Carlo approach
is used whereby the radiative transfer calculation is performed for a single randomly selected
sub-column only. This will introduce unbiased random errors into the solution.
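The unbiasedness of this Monte Carlo sampling is easily illustrated (a schematic sketch, not the IFS radiation code: the per-sub-column fluxes below are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical grid box split into 100 sub-columns, each with a (0/1) cloud
# profile giving some net radiative flux; values here are placeholders.
flux = rng.uniform(100.0, 300.0, size=100)   # W m^-2 per sub-column

exact = flux.mean()                # prohibitively expensive full calculation

# McICA-style estimate: one randomly selected sub-column per calculation.
samples = rng.choice(flux, size=50_000)
print(exact, samples.mean())       # estimate is unbiased, but individually noisy
```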
The second scheme called accounts for vertical exchange of energy, momentum and moisture
due to turbulence and gravity wave drag (TGWD). The scheme accounts for turbulent exchange
between the surface and the lowest atmospheric levels. Atmospheric momentum is also affected
by sub-grid scale orography. Orography exerts a drag on the atmospheric flow both from
blocking the flow in the lowest levels, and due to reflection and absorption of gravity waves.
The third scheme is the non-orographic gravity wave drag scheme (NOGW). Non-orographic
gravity waves are generated by convection, the jet stream and frontogenesis. They are particularly important in the stratosphere and mesosphere, where they contribute to driving the
Brewer-Dobson circulation, and the quasi-biennial and semi-annual oscillations.
The convection parametrisation (CONV) is based on the mass-flux scheme of Tiedtke (1989). The scheme describes three types of convective cloud: deep, shallow and mid-level. The convective clouds in a column are represented by a pair of entraining and detraining plumes of a given convective type, which describe updraft and downdraft processes respectively. (Entrainment is the mixing of dry environmental air into the moist convective plume, while detrainment is the reverse.) The
choice of convective type determines certain properties of the cloud (such as the entrainment
formulation). The mass flux at cloud base for deep convection is estimated by assuming that
deep convection acts to reduce convective available potential energy (CAPE) over some specified (resolution-dependent) time scale. Mid-level convection occurs at warm fronts. The
mass flux at cloud base is set to be the large scale vertical mass flux at that level. For shallow
convection, the mass flux at cloud base is derived by assuming that the moist static energy in
the sub-cloud layer is in equilibrium.
Finally, the large scale water processes (LSWP, or “cloud”) scheme contains the prognostic
equations for cloud liquid water, cloud ice, rain, snow and cloud fraction. It builds on the
scheme of Tiedtke (1993), but is a more complete description, including more prognostic variables and an improved representation of mixed phase clouds. Whereas the convection scheme
calculates the effect of unresolved convective clouds, the cloud scheme calculates the impact of
clouds which are resolved by the model. This means that the same cloud could be represented
by a different parametrisation scheme if the resolution of the model changed.
The IFS also contains parametrisations of methane oxidation and ozone chemistry. The
tendencies from these schemes do not affect the variable tendencies perturbed by SPPT, so
the schemes will not be considered further here.
5.3 Uncertainty in Convection: Generalised SPPT
The operational SPPT scheme addresses model uncertainty in the IFS due to the physics
parametrisation schemes by perturbing the physics tendencies using multiplicative noise; the
word ‘tendency’ refers to the change in a variable over a time step. SPPT perturbs the sum
of the parametrisation tendencies:
T = \frac{\partial X}{\partial t} = D + K + (1 + e) \sum_{i=1}^{5} P_i ,    (5.1)
where T is the total tendency in X, D is the tendency from the dynamics, K is the horizontal diffusion tendency, P_i is the tendency from the ith physics scheme in Table 5.1, and e is the zero-mean random perturbation. The scheme perturbs the tendency for four variables: T, U, V and q.
Each variable tendency is perturbed using the same random number field. The perturbation
field is generated using a spectral pattern generator. The pattern at each time step is the sum
of three independent random fields with horizontal correlation scales of 500, 1000 and 2000 km.
These fields are evolved in time using an AR(1) process on time scales of 6 hours, 3 days and
30 days respectively, and the fields have standard deviations of 0.52, 0.18 and 0.06 respectively.
It is expected that the smallest scale (500 km and 6 hours) will dominate at a 10 day lead time
— the larger scale perturbations are important for monthly and seasonal forecasts.
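To make the construction concrete, the temporal part of the pattern generator can be sketched as three AR(1) processes with the quoted time scales and standard deviations (a minimal sketch: the spatial spectral structure is omitted, each field is reduced to a single point, and the time step is illustrative):

import numpy as np

# Three independent AR(1) processes with time scales of 6 hours, 3 days
# and 30 days and standard deviations 0.52, 0.18 and 0.06; their sum is
# the composite SPPT pattern at one point.
dt = 1.0                                   # time step in hours (illustrative)
taus = np.array([6.0, 72.0, 720.0])        # correlation time scales in hours
sigmas = np.array([0.52, 0.18, 0.06])
phi = np.exp(-dt / taus)                   # AR(1) coefficient per time step

rng = np.random.default_rng(0)
e = np.zeros(3)
for step in range(240):                    # ten days of hourly steps
    # innovation scaled so each process keeps stationary variance sigma^2
    e = phi * e + np.sqrt(1.0 - phi**2) * sigmas * rng.standard_normal(3)
pattern_value = e.sum()                    # composite perturbation e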
SPPT does not distinguish between the different parametrisation schemes. However, the parametrisation schemes likely have very different error characteristics, so the implicit assumption that a single perturbation field is appropriate for all of them may not be valid. In particular, this chapter considers alternative, perturbed parameter representations of model uncertainty in convection. In order to test a perturbed parameter scheme in convection, it is necessary to be able to ‘switch off’ the SPPT perturbations for the convection parametrisation tendency. A generalised version of SPPT was developed for this chapter, building on earlier work by Alfons Callado Pallares (AEMET; see Section 1.9 for an outline of the code changes to the IFS which have been incorporated from Callado Pallares, and the changes which have been developed as part of this thesis). In this scheme, the multiplicative
noise is applied separately to the tendencies from each physics parametrisation scheme,
T = D + K + \sum_{i=1}^{5} (1 + e_i) P_i ,    (5.2)
where the stochastic field, e_i, for the convection tendency can be set to zero. In order to detect
an improvement in the representation of uncertainty in the convection scheme, the uncertainty
in the other four schemes must be well represented. In this experiment, SPPT is used to
represent uncertainty in the other four schemes, applying the same stochastic perturbation to
each scheme. The stochastic perturbations are three-scale fields with the same characteristics
as used in operational SPPT. The SKEB scheme (Section 1.4.3.2) represents a process that is otherwise missing from the model, so it will be used in these experiments.
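A minimal sketch of how (5.1) and (5.2) relate in code (the function and array names are illustrative; the operational scheme is recovered by passing the same field for every scheme):

import numpy as np

def perturbed_total_tendency(D, K, P, e):
    """Generalised SPPT, as in (5.2).

    D, K : dynamics and horizontal diffusion tendencies
    P    : array of the five physics tendencies, shape (5, ...)
    e    : five random fields, shape (5, ...); setting e[i] = 0 leaves
           scheme i unperturbed (e.g. convection), while using the same
           field for all five schemes recovers (5.1).
    """
    P = np.asarray(P)
    e = np.asarray(e)
    return D + K + ((1.0 + e) * P).sum(axis=0)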
The results of Chapters 2 and 3 indicate that a multiplicative stochastic parametrisation
scheme is a skilful representation of model uncertainty in the Lorenz ’96 system. Using SPPT
to represent convective uncertainty is therefore a good benchmark when testing the perturbed
parameter schemes outlined below. The Lorenz ’96 experiments also indicated that both multiplicative and additive noise stochastic schemes produce skilful forecasts. It would therefore be interesting to test an additive noise scheme in addition to SPPT for the convective tendencies. Additive noise
represents uncertainty in the convection tendency when the deterministic tendency is zero. This
uncertainty could be due to the model discretisation, or from errors in the formulation of the
convection parametrisation scheme which cannot be captured by a multiplicative noise scheme.
However, additive noise schemes will not be investigated further here. Implementing an additive
noise scheme in the IFS is problematic in the context of convection (Martin Leutbecher, pers.
comm., 2013). The deterministic convection parametrisation acts to vertically redistribute heat
and moisture in the atmosphere, drying some levels, and moistening by an equivalent amount
at others. A multiplicative noise term preserves this balance, whereas an additive term would disrupt it; developing and implementing an additive scheme which preserves the balance is outside the scope of this thesis.
5.4 Perturbed Parameter Approach to Uncertainty in Convection

5.4.1 Perturbed Parameters and the EPPES
When developing a parametrisation scheme, parameters are introduced to represent physical
processes within the scheme. For example, in the entraining plume model of convection, the
degree to which dry environmental air is turbulently mixed into the plume is assumed to be
proportional to the inverse of the radius of the plume, with the constant of proportionality
defined to be the entrainment coefficient. This is a simplification of the true processes involved
in the convective cloud, and because of this and the sparsity of the required environmental
data, physical parameters such as the entrainment coefficient are poorly constrained. However,
the evolution of convective clouds and the resultant effects on weather and ultimately global
climate are very sensitive to these parameters, and to the entrainment coefficient in particular
(Sanderson et al., 2008). Because of this, perturbed parameter models have been proposed to
represent the uncertainty in predictions due to the uncertainty in these parameters.
In a perturbed parameter ensemble, the values of a selected set of parameters are sampled
from a distribution representing the uncertainty in their values, and each ensemble member is
assigned a different set of parameters. These parameters are fixed globally and for the duration of the integration. The parameter distribution is usually determined through “expert
elicitation” whereby scientists with the required knowledge and experience of using the parametrisation suggest upper and lower bounds for the parameter (Stainforth et al., 2005). No
information about the relationships between parameters is included in the ensemble, though
unrealistic simulations can be removed from the ensemble later (Stainforth et al., 2005).
The poorly constrained nature of these physical parameters can have adverse effects on
high-resolution deterministic integrations. Tuning the many hundreds of parameters in atmospheric models is a difficult, lengthy, costly process, usually performed by hand. An attractive
alternative is the use of a Bayesian parameter estimation approach. This seeks to provide the
probability distribution of parameters given the data, and provides a framework for using new
data from forecasts and observations to update prior knowledge or beliefs about the parameter
distribution (Beck and Arnold, 1977). One specific proposed technique is the Ensemble Prediction and Parameter Estimation System (EPPES) (Järvinen et al., 2012; Laine et al., 2012),
which runs on-line in conjunction with an operational ensemble forecasting system. At the
start of each forecast, a set of parameter values for each ensemble member is sampled from the
parameters’ joint distribution. The joint distribution is updated by evaluating the likelihood
function for the forecast and observations after the verifying observations are available. Note
that this may be many days after the forecast was initialised, so other perturbed parameter ensemble forecasts will have been initialised in the meantime. In this way, the EPPES approach
differs from a Markov Chain Monte Carlo method, which updates the parameter distribution
before each new draw.
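A heavily simplified conceptual sketch of one EPPES-style update is given below; the operational algorithm (Järvinen et al., 2012; Laine et al., 2012) uses a hierarchical formulation with additional hyperparameters and handles the delayed arrival of verifying observations, so this importance-weighted moment update should be read only as an illustration of the idea:

import numpy as np

def eppes_style_update(theta, log_lik):
    """theta:   (N, d) parameter sets used by the N ensemble members.
    log_lik: (N,) log-likelihood of each member's forecast against the
             verifying observations (available days after initialisation).
    Returns an importance-weighted estimate of the parameters' mean and
    covariance, from which subsequent forecasts' parameters are drawn."""
    w = np.exp(log_lik - log_lik.max())    # stabilised importance weights
    w /= w.sum()
    mu = w @ theta                         # weighted mean
    diff = theta - mu
    Sigma = (w[:, None] * diff).T @ diff   # weighted covariance
    return mu, Sigma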
My collaborators, Peter Bechtold (ECMWF), Pirkka Ollinaho (Finnish Meteorological Institute) and Heikki Järvinen (University of Helsinki), have used this approach with the IFS to
better constrain four of the parameters in the convection scheme:
• ENTRORG represents organised entrainment for positively buoyant deep convection, with a default value of 1.75 × 10^{-3} m^{-1}.

• ENTSHALP × ENTRORG represents shallow entrainment, and the default value for ENTSHALP is 2.

• DETRPEN is the average detrainment rate for penetrative convection, and has a default value of 0.75 × 10^{-4} m^{-1}.

• RPRCON is the coefficient for determining the conversion rate from cloud water to rain, and has a default value of 1.4 × 10^{-3}.
The likelihood function used was based on the geopotential height at 500 hPa from a ten-day forecast.
The resultant optimised value of each parameter was used in the high resolution deterministic
forecast model, and many of the verification metrics were found to improve when compared to
using the default values (Pirkka Ollinaho, pers. comm., 2013). This is very impressive, since
the operational version of the IFS is already a highly tuned system.
The EPPES approach also produces a full joint pdf for the chosen parameters. Since
Gaussianity is assumed, this takes the form of a covariance matrix for the four parameters.
This information is useful for model tuning as it can reveal parameter correlations, and can
therefore be used to identify redundant parameters. However, the joint pdf also gives an
indication of the uncertainty in the parameters. I have been provided with this information,
which I have used to develop a perturbed parameter representation of uncertainty for the
ECMWF convection scheme.
5.4.2 Method
The EPPES approach was used to determine the posterior distribution of four parameters in
the ECMWF convection scheme at T159. This was calculated in terms of a mean vector with elements M(i) and a covariance matrix with elements Σ_{i,j}, where i = 1 represents ENTRORG, i = 2 represents ENTSHALP, i = 3 represents DETRPEN, and i = 4 represents RPRCON:
M = \begin{pmatrix} 0.182804 \times 10^{-2} \\ 0.214633 \times 10^{1} \\ 0.778274 \times 10^{-4} \\ 0.151285 \times 10^{-2} \end{pmatrix}

\Sigma = \begin{pmatrix}
 0.9648 \times 10^{-7} & -0.2127 \times 10^{-4} & -0.4199 \times 10^{-9}  & -0.1839 \times 10^{-7} \\
-0.2127 \times 10^{-4} &  0.9255 \times 10^{-1} &  0.1318 \times 10^{-5}  & -0.3562 \times 10^{-4} \\
-0.4199 \times 10^{-9} &  0.1318 \times 10^{-5} &  0.5194 \times 10^{-10} & -0.1134 \times 10^{-8} \\
-0.1839 \times 10^{-7} & -0.3562 \times 10^{-4} & -0.1134 \times 10^{-8}  &  0.4915 \times 10^{-7}
\end{pmatrix}
By comparison with the default values, the M vector indicates the degree to which the parameters should be changed to optimise the forecast. The off-diagonal terms in the Σ matrix
indicate there is significant covariance between parameters. This highlights one of the problems
with using “expert elicitation” to define parameter distributions — such distributions contain
no information about parameter inter-dependencies.
5.4.2.1 Fixed Perturbed Parameter Distribution
The usual method used in perturbed parameter experiments is a fixed perturbed parameter ensemble (Murphy et al., 2004; Sanderson, 2011; Stainforth et al., 2005; Yokohata et al., 2010).
Each ensemble member is assigned a set of parameter values which are held constant spatially and over the duration of the integration. Such ensembles are traditionally used for
climate-length integrations. It will be interesting to see how well such an ensemble performs
at representing uncertainty in weather forecasts.
The multivariate normal distribution supplied by Bechtold, Ollinaho and Järvinen in October 2012 was sampled to give N sets of the four parameters, where the number of ensemble members, N = 50. The procedure for this is as follows. N sample vectors, z_n (1 ≤ n ≤ N), are drawn from the four-dimensional standard multivariate normal distribution (M = 0, Σ = I).
The Cholesky decomposition is used to find the matrix A such that AA^T = Σ:

A = \begin{pmatrix}
 3.1062 \times 10^{-4} & 0 & 0 & 0 \\
-6.8490 \times 10^{-2} &  2.9641 \times 10^{-1} & 0 & 0 \\
-1.3518 \times 10^{-6} &  4.1332 \times 10^{-6} &  5.7470 \times 10^{-6} & 0 \\
-5.9200 \times 10^{-5} & -1.3387 \times 10^{-4} & -1.1492 \times 10^{-4} & 1.2048 \times 10^{-4}
\end{pmatrix}
The samples from the standard multivariate distribution are transformed to samples from the correct parameter distribution, x_n, using the transformation

x_n = M + A z_n .    (5.3)
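In numpy, the sampling and transformation can be sketched as follows, using the M and Σ quoted above (a minimal sketch; np.linalg.cholesky reproduces the matrix A):

import numpy as np

M = np.array([0.182804e-2, 0.214633e+1, 0.778274e-4, 0.151285e-2])
Sigma = np.array([
    [ 0.9648e-7, -0.2127e-4, -0.4199e-9,  -0.1839e-7],
    [-0.2127e-4,  0.9255e-1,  0.1318e-5,  -0.3562e-4],
    [-0.4199e-9,  0.1318e-5,  0.5194e-10, -0.1134e-8],
    [-0.1839e-7, -0.3562e-4, -0.1134e-8,   0.4915e-7]])

A = np.linalg.cholesky(Sigma)       # lower triangular, A @ A.T == Sigma
rng = np.random.default_rng(0)
z = rng.standard_normal((50, 4))    # N = 50 standard normal vectors z_n
x = M + z @ A.T                     # N sets of the four parameters, (5.3)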
Two types of fixed perturbed parameter ensemble are considered here. The first uses the
same fifty sets of four parameters for all starting dates (“TSCP”). Sampling of the parameters
is performed offline: Latin hypercube sampling is used to define fifty percentiles at which
to sample the standard multivariate normal distribution, before (5.3) is used to transform to
parameter space. This technique ensures the joint distribution is fully explored. The covariance
of the resultant sample is checked against the EPPES covariance matrix; over 10,000 iterations, a sample was found whose covariance matrix differed by less than 5% from the true matrix. The sampled
parameter values are shown in Table 5.2. The second type of fixed perturbed parameter
ensemble uses N new sets of parameters for each initial condition (“TSCPr”). This sampling
is performed online, and the samples are not optimised. However, when forecasts from many
starting conditions are taken together, the ensemble is sufficient to fully sample the joint pdf.
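The offline TSCP sampling can then be sketched as below (M, Σ and A as in the previous sketch; the Latin hypercube construction and the acceptance criterion are simplified relative to the actual procedure):

import numpy as np
from scipy.stats import norm

# M and Sigma as in the previous sketch; A = np.linalg.cholesky(Sigma).

def lhs_sample(M, A, n=50, rng=None):
    """Latin hypercube sample of N(M, A A^T): one stratum per ensemble
    member in each dimension, transformed through (5.3)."""
    if rng is None:
        rng = np.random.default_rng()
    d = len(M)
    u = (np.stack([rng.permutation(n) for _ in range(d)], axis=1)
         + rng.random((n, d))) / n         # stratified uniforms in (0, 1)
    return M + norm.ppf(u) @ A.T

def max_rel_cov_error(x, Sigma):
    """Largest element-wise relative error of the sample covariance."""
    return np.max(np.abs(np.cov(x, rowvar=False) - Sigma) / np.abs(Sigma))

# Keep the best sample from repeated trials, mimicking the
# 10,000-iteration search described above.
rng = np.random.default_rng(1)
best = min((lhs_sample(M, A, rng=rng) for _ in range(10000)),
           key=lambda s: max_rel_cov_error(s, Sigma))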
5.4.2.2 Stochastically Varying Perturbed Parameter Distribution
Khouider and Majda (2006) recognise that a problem with many deterministic parametrisation
schemes is the presence of parameters that are “nonphysically kept fixed/constant and spatially
homogeneous”. An alternative to the fixed perturbed parameter ensemble described above is a
stochastically varying perturbed parameter ensemble (“TSCPv”) where the parameter values
are varied spatially and temporally following the EPPES distribution. However, the EPPES
technique contains no information about the correct spatial and temporal scales on which to
vary the parameters. Since the likelihood function is evaluated at day ten of the forecast, the
set of parameters must perform well over this time window to produce a skilful forecast; this
indicates that ten days could be a suitable temporal scale. The likelihood function evaluates
the skill of the forecast using the geopotential height at 500 hPa. The likelihood function
will therefore focus on the midlatitudes, where the geopotential height has high variability. A
suitable spatial scale could therefore be ∼ 1000 km. The SPPT spectral pattern generator is
a suitable technique for stochastically varying the parameters in the convection scheme. It
generates a spatially and temporally correlated field of random numbers.
Number  ENTRORG    ENTSHALP   DETRPEN    RPRCON
1       1.8203e-3  2.3121e+0  8.7197e-5  1.3731e-3
2       2.4122e-3  1.7785e+0  6.8076e-5  1.5406e-3
3       1.6561e-3  1.9652e+0  6.8473e-5  1.6606e-3
4       1.1054e-3  2.4199e+0  8.0037e-5  1.5757e-3
5       2.0785e-3  1.4016e+0  7.5603e-5  1.4454e-3
6       1.7574e-3  1.8280e+0  6.4998e-5  1.6689e-3
7       1.9311e-3  2.2220e+0  8.9559e-5  1.1177e-3
8       1.6377e-3  2.4955e+0  7.8693e-5  1.4802e-3
9       1.8987e-3  1.9838e+0  7.6484e-5  1.3189e-3
10      1.5554e-3  2.4064e+0  7.5846e-5  1.4879e-3
11      1.4116e-3  2.1398e+0  6.7459e-5  1.7171e-3
12      1.5061e-3  2.2247e+0  8.3208e-5  1.4336e-3
13      1.9821e-3  2.3726e+0  8.0642e-5  1.3901e-3
14      1.9148e-3  2.3462e+0  8.5138e-5  1.3017e-3
15      2.0575e-3  2.3785e+0  7.5289e-5  1.4298e-3
16      1.2438e-3  2.2379e+0  7.3377e-5  1.9182e-3
17      1.4471e-3  2.1781e+0  8.0664e-5  1.5861e-3
18      1.6914e-3  2.6139e+0  7.6817e-5  1.5926e-3
19      1.3696e-3  1.9646e+0  7.4865e-5  1.8839e-3
20      1.5985e-3  2.0827e+0  7.8839e-5  1.8027e-3
21      2.1007e-3  2.7757e+0  9.9625e-5  9.2852e-4
22      1.9999e-3  1.8012e+0  7.8752e-5  1.5595e-3
23      1.7084e-3  2.2402e+0  8.1817e-5  1.4210e-3
24      2.5506e-3  2.0698e+0  7.2315e-5  1.6063e-3
25      1.8514e-3  2.1935e+0  7.3820e-5  1.6587e-3
26      1.8047e-3  1.9516e+0  7.9388e-5  1.6842e-3
27      1.8358e-3  2.1223e+0  7.5268e-5  1.6807e-3
28      1.9477e-3  2.1125e+0  7.7636e-5  1.3370e-3
29      1.7890e-3  2.3366e+0  7.9808e-5  1.5194e-3
30      1.6740e-3  2.1128e+0  7.6250e-5  1.6079e-3
31      1.7733e-3  2.0756e+0  7.1866e-5  1.8394e-3
32      2.3390e-3  2.5212e+0  8.4617e-5  1.1419e-3
33      1.8828e-3  1.9526e+0  8.0540e-5  1.4474e-3
34      2.2864e-3  1.8812e+0  7.4853e-5  1.5983e-3
35      2.1500e-3  1.5878e+0  7.2808e-5  1.6111e-3
36      1.9647e-3  2.6737e+0  8.4574e-5  1.2773e-3
37      1.5776e-3  2.5651e+0  8.6837e-5  1.2864e-3
38      1.6185e-3  2.5899e+0  8.1101e-5  1.3479e-3
39      2.2445e-3  1.6171e+0  6.7066e-5  1.7770e-3
40      1.5317e-3  1.9515e+0  8.3194e-5  1.3346e-3
41      2.0375e-3  2.2471e+0  7.9687e-5  1.3254e-3
42      2.2090e-3  2.3014e+0  7.0050e-5  1.4242e-3
43      1.7413e-3  2.2027e+0  8.2246e-5  1.4821e-3
44      2.1779e-3  1.6718e+0  5.7394e-5  1.6127e-3
45      1.4782e-3  2.5573e+0  9.0479e-5  1.2180e-3
46      2.1244e-3  1.9506e+0  7.9765e-5  1.3744e-3
47      1.3171e-3  1.8954e+0  8.4434e-5  1.6384e-3
48      1.8671e-3  2.1600e+0  7.8113e-5  1.5982e-3
49      2.0184e-3  2.2348e+0  7.7212e-5  1.4772e-3
50      1.7250e-3  1.6116e+0  6.8595e-5  2.1027e-3

Table 5.2: Chosen perturbed convection parameters for the fixed perturbed parameter experiment.
In this experiment, the standard SPPT settings were used in the pattern generator. A three-scale composite pattern is used which has the same spatial and temporal correlations as used
in SPPT. The standard deviations of these independent patterns are 0.939 (smallest scale),
0.325 and 0.108 (largest scale) to give a total standard deviation of 1. These settings vary the
parameters faster and on smaller spatial scales than the scales to which EPPES is sensitive,
as estimated above. However, it will still be useful as a first test, and when combined with the fixed perturbed parameter ensemble (which effectively varies the parameters on an infinite spatial and temporal scale), it can
provide bounds on the skill of such a representation of model uncertainty.
The SPPT pattern generator is used to generate four independent composite fields with
mean 0 and standard deviation 1. The correct covariance structure is introduced using the
transformation matrix, A: x_{i,j,t} = M + A z_{i,j,t}, where the indices i, j refer to latitude and longitude and t refers to time. The parameters do not vary as a function of height since the
convection parametrisation is applied columnwise in the model. The resultant four covarying
fields are used to define the values of the four convection parameters as a function of position
and time.
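A sketch of this columnwise transformation (M and A as in the earlier sketches; spatially white noise stands in for the spectral pattern generator, whose output would simply replace z below):

import numpy as np

# Four independent unit-variance patterns, one per parameter; at every
# grid point (i, j) the vector of parameters is M + A @ z[:, i, j].
nlat, nlon = 160, 320                     # illustrative grid
rng = np.random.default_rng(0)
z = rng.standard_normal((4, nlat, nlon))  # stand-in for the pattern generator
x = M[:, None, None] + np.einsum("ab,bij->aij", A, z)
entrorg, entshalp, detrpen, rprcon = x    # covarying parameter maps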
5.5 Experimental Procedure
Parameter estimation was carried out with the EPPES system using IFS model version CY37R3
for 45 dates between 12 May 2011 and 8 August 2011, with forecasts initialised every 48
hours. The same model version is used here for consistency. Initial dates different to those used to estimate the joint pdfs must be selected (an out-of-sample test), but taken from the same time of year, since the EPPES-estimated pdfs may be seasonally dependent. In
order to detect improvements in the model uncertainty representation, it is important that
initial condition uncertainty is well represented in the ensemble forecast. The best technique
possible will be used, which for the IFS involves using hybrid EDA/singular vector estimates
for the perturbations. The initial dates used must be after June 2010, when the EDA system
became operational. The selected dates for the hindcasts are therefore from Summer 2012.
The parametrisation schemes will be tested at T159 (1.125◦ ) using a fifty member ensemble
forecast. The schemes are tested using ten-day hindcasts initialised every five days between
14 April and 6 September 2012 (30 dates in total). Persistent SSTs are used instead of a
dynamical ocean. The high-resolution ECMWF 4DVar analysis is used for verification.
Other Four Tendencies: SPPT
Convection: Zero                              TSCZ
Convection: SPPT                              TSCS
Convection: Perturbed Parameters (constant)   TSCP
Convection: Perturbed Parameters (resampled)  TSCPr
Convection: Perturbed Parameters (varying)    TSCPv

Table 5.3: Proposed experiments for investigating the representation of uncertainty in the ECMWF convection parametrisation scheme.
Five experiments are proposed to investigate the representation of model uncertainty in
the convection scheme in the IFS (Table 5.3). In each experiment, the uncertainty in the other
four parametrisation tendencies (radiation, turbulence and gravity wave drag, non-orographic
gravity wave drag, large scale water processes) is represented by SPPT (“TS”). In the first
experiment, there is no representation of uncertainty in the convection tendency (“CZ”). In
the second, SPPT is used to represent uncertainty in the convection tendency (“CS” — equivalent to the operational SPPT parametrisation scheme). In the final three, uncertainty in
the convection tendency is represented by a static perturbed parameter ensemble, with and
without resampling of parameters for different start dates (“CPr” and “CP” respectively), and
by a stochastically varying perturbed parameter ensemble (“CPv”).
In order to compare the different representations of convection model uncertainty, the SPPT
scheme must correctly account for uncertainty in the other four tendencies. Therefore, verification will be performed in a two-stage process. Firstly, the calibration of the ensemble will
be checked in a region with little uncertainty due to convection, i.e., where there is little convective activity. The five experiments in Table 5.3 should perform very similarly in this region
as they have the same representation of uncertainty in the other four tendencies. Secondly,
a region where convection is the dominant process will be selected to test the different uncertainty schemes. Given that model uncertainty has been accounted for in the other four
parametrisations using SPPT, and that a region has been selected where the model uncertainty is dominated by deep convection, a scheme which accurately represents uncertainty in
deep convection will give a reliable forecast in this region, and any detected improvements
in forecast skill can be attributed to an improvement in representation of uncertainty in the
convection scheme.
Figure 5.1: Convection diagnostic (colour) derived from the IFS tendencies calculated as part of the YOTC project (see text for details). (a) Regions where the diagnostic is close to zero (bounded by grey boxes), indicating there is little convection. (b) Regions where the diagnostic is large (bounded by grey box), indicating convection is the dominant process.
5.5.1 Definition of Verification Regions
The regions of interest are defined using the Year of Tropical Convection (YOTC) dataset from
ECMWF. YOTC was a joint WCRP and World Weather Research Programme/The Observing System Research and Predictability Experiment (WWRP/THORPEX) project which aimed
to focus research efforts on the problem of organised tropical convection. The ECMWF YOTC
dataset consists of high-resolution analysis and forecast data for May 2008 to April 2010. In
particular, the IFS parametrisation tendencies were archived at every time step out to a lead
time of ten days.
The 24-hour cumulative temperature tendencies at 850 hPa for each parametrisation scheme
are used. Forecasts initialised from 30 dates between 14 April and 6 September 2009 are selected, with subsequent start dates separated by five days. To identify regions where convection is
the dominant process, the ratio between the magnitude of the convective tendency and the sum
of the magnitudes of all tendencies is calculated, and is shown in Figure 5.1. This diagnostic
can be used to define regions where there is little convection (the ratio is close to zero) or
where convection dominates (the ratio is greater than 0.5). Since the forecasting skill of the IFS
is strongly latitudinally dependent, both the regions with little convection and with significant
convection are defined in the tropics (25◦ S–25◦ N). Both regions are approximately the same
size, and cover areas of both land and sea. Any differences in the forecast verification between
these two regions will be predominantly due to convection.
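A sketch of the diagnostic, assuming the 24-hour cumulative temperature tendencies at 850 hPa are available per scheme (the dictionary keys and array shapes are illustrative):

import numpy as np

def convection_ratio(tend):
    """tend: dict of 24 h cumulative T tendencies at 850 hPa, one array
    per parametrisation scheme, each of shape (ndates, nlat, nlon);
    the key "conv" denotes the convection scheme."""
    total = sum(np.abs(t) for t in tend.values())
    return np.abs(tend["conv"]) / total

# ratio close to 0  -> little convection
# ratio above 0.5   -> convection is the dominant process
# (both regions are restricted to the tropics, 25S-25N, in the text)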
5.5.2 Chosen Diagnostics
Four variables of interest have been selected which will be used to verify the forecasts. Temperature and zonal wind at 850 hPa (T850 and U850 respectively) correspond to fields at approximately 1.5 km altitude, which falls above the boundary layer in many places. The geopotential height at 500 hPa (Z500) is a standard ECMWF diagnostic variable. It is particularly
useful in the extra-tropics where it shows characteristic features corresponding to low and high
pressure weather systems. The zonal wind at 200 hPa (U200) is particularly interesting when
considering convection. This is because 200 hPa falls close to the tropopause, where deep
convection is capped. Convective outflow often occurs at this level, which can be detected in
U200.
For each variable, the impact of the schemes will be evaluated using a number of the
diagnostics described in Chapter 1:
• Bias (Section 1.7.2.1)
• RMSE compared to RMS spread (Section 1.7.3.1)
• RMS error-spread graphical diagnostic (Section 1.7.3.1)
• Forecast skill scores: RPSS, IGNSS, ESS (Sections 1.7.1.2 and 1.7.1.3, and Chapter 4
respectively)
In convecting regions, precipitation (PPT) and total column water vapour (TCWV) will also
be considered. PPT is a parametrised product of the convection and large scale water processes
parametrisation schemes, so an improvement to the convection scheme should be detectable by
studying this variable. The convection scheme effectively redistributes and removes moisture
from the atmosphere, so an improvement in TCWV could be indicative of an improvement in
the convection scheme.
Figure 5.2: Percentage forecast bias in tropical regions with little convection as a function of time for (a) T850, (b) U850, (c) Z500 and (d) U200. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The bias is calculated as described in the text, and given as a percentage of the root mean square of the analysis in the region of interest. The red line is obscured by the magenta line in each figure.
5.6 Verification of Forecasts

5.6.1 Verification in Non-Convecting Regions
Firstly, the impact of the different representations of model uncertainty will be considered in
the non-convecting regions defined in Figure 5.1(a). Figure 5.2 shows the percentage bias of
forecasts, calculated following (1.24), for regions of little convection. This is a useful diagnostic,
as it can indicate the presence of systematic errors in a forecast. A small change is observed
in the bias when different uncertainty schemes are used, and in particular the TSCPv scheme
(green) performs well for all variables considered. As expected, the TSCP and TSCPr schemes
(red and magenta respectively) perform similarly — the bias in the ensemble mean is unaffected
by whether the parameter perturbations are resampled for each initial condition. For U850,
Z500 and U200, the TSCZ scheme (black) outperforms the TSCS scheme (blue), a result which
will be discussed in Section 5.7.
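The precise form (1.24) is defined in Chapter 1 and is not reproduced here; the sketch below is an assumed reading of the description in the caption of Figure 5.2 (ensemble-mean error as a percentage of the RMS of the analysis):

import numpy as np

def percentage_bias(ens, ana):
    """ens: (ndates, nmembers, npoints) forecasts at one lead time;
    ana: (ndates, npoints) verifying analyses over the same region."""
    err = ens.mean(axis=1) - ana            # ensemble-mean error
    rms_ana = np.sqrt(np.mean(ana**2))      # RMS of the analysis
    return 100.0 * err.mean() / rms_ana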
The impact of the uncertainty schemes on the calibration of the ensemble can be summarised
by evaluating the RMS error in the ensemble mean and the RMS ensemble spread as a function
of time within the region of interest, which should be equal for a well calibrated ensemble.
Figure 5.3: Temporal evolution of root mean square ensemble spread (dashed lines) and root mean square error (solid lines) for regions with little convection for (a) T850, (b) U850, (c) Z500 and (d) U200. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The grey curves indicate the results for the operational (T639) EPS forecasts for comparison. The red line is obscured by the magenta line in each figure.
Figure 5.3 shows this diagnostic for the regions with little convection. Forecasts are observed
to be slightly under-dispersive for all variables. The under-dispersion is large for Z500, but is
small for the other variables (note that the y-axes do not start at zero). For comparison, the
RMS spread and error curves are also shown for operational ensemble forecasts at T639 (grey).
At this higher resolution the ensemble spread is similar, but the RMSE is smaller. The low
resolution (T159) used in the five test experiments is responsible for the higher RMSE and
therefore contributes to the under-dispersive nature of the ensemble.
A more comprehensive understanding of the calibration of the ensemble can be gained by
considering the RMS error-spread diagnostic. The forecast-verification pairs are collected for
each spatial point in the region of interest for each starting condition. These pairs are ordered
according to their forecast variance, and divided into 30 equally populated bins. The RMS
spread and RMSE are evaluated for each bin and displayed on scatter plots. Figure 5.4 shows
this diagnostic for regions with little convection at lead times of one, three and ten days.
The scattered points should lie on the one-to-one diagonal, shown in black, for a statistically
consistent ensemble following (1.26). The diagnostic indicates a large degree of flow-dependent
spread in the ensemble forecasts, with scattered points lying close to the one-to-one line. The
T850 forecasts are particularly well calibrated, and the spread of the U850 and U200 forecasts is also a skilful indicator of the expected error for all five experiments. The Z500 forecasts
show little flow dependency at short lead times, but improve when the longer ten-day forecasts
are considered. As expected, the results from the five experiments are very similar, and show
moderately under-dispersive but otherwise well calibrated forecasts.
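The binning procedure underlying this diagnostic can be sketched as follows (a minimal sketch that neglects any finite-ensemble correction factors in (1.26)):

import numpy as np

def error_spread_points(ens, ana, nbins=30):
    """ens: (npairs, nmembers) forecasts; ana: (npairs,) verifications.
    Pairs are ordered by ensemble variance and split into equally
    populated bins; RMS spread and RMSE are returned per bin."""
    var = ens.var(axis=1, ddof=1)           # per-pair ensemble variance
    err2 = (ens.mean(axis=1) - ana) ** 2    # squared ensemble-mean error
    order = np.argsort(var)
    spread, rmse = [], []
    for idx in np.array_split(order, nbins):
        spread.append(np.sqrt(var[idx].mean()))
        rmse.append(np.sqrt(err2[idx].mean()))
    return np.array(spread), np.array(rmse)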
5.6.2 Verification in Convecting Regions
The previous section indicates that the model uncertainty in the other four tendencies is sufficiently well represented by SPPT. This section considers forecasts for the strongly convecting
regions defined in Figure 5.1(b) to evaluate the impact of the new uncertainty schemes. Figure 5.5 shows the percentage bias for forecasts of T850, U850, Z500 and U200 in this region
for the five different schemes considered. The bias is similar for all schemes; no one scheme is
systematically better or worse than the others.
Figure 5.6 shows the RMS error and spread as a function of time averaged over all cases
for all points within the region of interest. The RMS error in the forecast is similar for all
experiments — the perturbed parameter ensembles have not resulted in an increase in error
over the operational scheme, except for a slight increase for T850. However, the fixed perturbed
parameter ensemble (red/magenta) has resulted in an increase in spread over the operational
TSCS forecast (blue). This is especially large for T850, where the observed increase is 25%
at long lead times. Interestingly, the TSCZ ‘deterministic convection’ forecasts of T850 also
result in an increase in ensemble spread over TSCS. This is a counter-intuitive result, as it is
expected that using a stochastic parametrisation would increase the spread of the ensemble.
This result will be discussed in Section 5.7, and motivates the experiments carried out in
Chapter 6. For comparison, the results for the operational EPS are also shown in grey. As
is the case in regions with little convection, some of the ensemble under-dispersion at T159 is
due to an increased forecast RMSE compared to the operational T639 forecasts, though the
forecasts are under-dispersive at both resolutions.
Figure 5.7 shows the RMS error-spread graphical diagnostic for the five forecast models in
regions with significant convection. The impact of the different schemes is slight. However,
there is a larger difference than in regions with little convection (see Figure 5.4). All schemes
remain well calibrated, and do not show large increases in error compared to the operational
Figure 5.4: Root mean square error-spread diagnostic for tropical regions with little convection for (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200, at lead times of 1 day (first column), 3 days (second column) and 10 days (third column) for each variable. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. For a well calibrated ensemble, the scattered points should lie on the one-to-one diagonal shown in black.
Figure 5.5: Percentage forecast bias in tropical regions with significant convection as a function of time for (a) T850, (b) U850, (c) Z500 and (d) U200. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The bias is calculated as described in the text, and given as a percentage of the root mean square of the analysis in the region of interest. The red line is obscured by the magenta line in each figure.
Figure 5.6: Temporal evolution of root mean square ensemble spread (dashed lines) and root mean square error (solid lines) for regions with significant convection for (a) T850, (b) U850, (c) Z500 and (d) U200. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The grey curves indicate the results for the operational (T639) EPS forecasts for comparison. The red line is obscured by the magenta line in each figure.
TSCS forecasts. The fixed perturbed parameter schemes (red and magenta) have larger spread
than the other schemes, which is most apparent for T850 in Figures 5.7(a–c). TSCS (blue)
has the most under-dispersive ensemble at long lead times, though is better calibrated than
the other experiments at short lead times. The TSCPv experiment has intermediate spread,
improving on TSCS but under-dispersive compared to the TSCP experiments.
The skill of the forecasts is evaluated using the RPS, IGN and ES as a function of lead time, and the skill scores are evaluated with respect to the climatological forecast for the convecting region. The results are shown in Figure 5.8 for each variable of interest. There is
no significant difference between the TSCP and TSCPr forecasts according to the skill scores.
The TSCP/TSCPr schemes score highly for a range of variables according to each score: they
perform significantly better than the other forecasts for T850 according to RPSS and IGNSS,
for U850 according to IGNSS and ESS, and for Z500 according to all scores (see Appendix A
for details of significance testing). For U200, the TSCS forecasts are significantly better than
the other forecasts, and the TSCZ forecasts are significantly poorer. However for the other
variables, TSCS performs comparatively poorly, and often produces significantly the worst
forecasts. This is probably due to the poorer forecast ensemble spread.
5.6.2.1 Precipitation Forecasts
The impact of the different model uncertainty schemes on forecasts of convective precipitation
is a good indicator of improvement in the convection scheme. However, it is difficult to verify
precipitation forecasts as measurements of precipitation are not assimilated into the IFS using
the 4DVar or EDA systems, unlike T, U and Z. One option is to use short-range high resolution
(T1279) deterministic forecasts for verification. However, there are known problems with spinup for accumulated fields like precipitation — the model takes a few time steps to adjust to
the initial conditions (Kaallberg, 2011). Instead, the Global Precipitation Climatology Project
(GPCP) dataset is used for verification of precipitation forecasts. The GPCP, established by
the WCRP, combines information from a large number of satellite and ground based sources
to estimate the global distribution of precipitation. The data set used here is the One-Degree
Daily (1DD) product (Huffman et al., 2001), which has been conservatively re-gridded onto a
T159 reduced Gaussian grid to allow comparison with the IFS forecasts.
Figure 5.9 shows the RMS error-spread diagnostic for convective precipitation. All forecasts
Figure 5.7: Root mean square error-spread diagnostic for tropical regions with significant convection for (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200, at lead times of 1 day (first column), 3 days (second column) and 10 days (third column) for each variable. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. For a well calibrated ensemble, the scattered points should lie on the one-to-one diagonal shown in black.
Figure 5.8: Ensemble forecast skill scores calculated for tropical regions with significant convection. First column: Ranked Probability Skill Score. Second column: Ignorance Skill Score. Third column: Error-spread Skill Score. (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The red line is obscured by the magenta line in each figure.
Figure 5.9: RMS error-spread diagnostic for cumulative convective precipitation for the 24 hour window before a lead time of (a) 1 day, (b) 3 days and (c) 10 days. The diagnostic is calculated for tropical regions with significant convection. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. For a well calibrated ensemble, the scattered points should lie on the one-to-one diagonal shown in black.
are under-dispersive, and the different uncertainty schemes have only a slight impact on the
calibration of the ensemble. Figure 5.10(b) indicates more clearly the impact of the different
schemes on the ensemble spread and error. On average, the TSCZ scheme is significantly the
most under-dispersive and has a significantly larger RMSE. The two stochastic schemes, TSCS
and TSCPv, have significantly the smallest error. TSCS has significantly the largest spread at
short lead times, and TSCP and TSCPr have significantly the largest spread at later lead times.
Figure 5.10(a) shows the bias in forecasts of convective precipitation. The stochastic schemes,
TSCS and TSCPv, have the smallest bias over the entire forecasting window. Figures 5.10(c)–
(e) show the forecast skill scores for convective precipitation. TSCS is significantly the best
between days three and five according to RPS and ES. TSCZ is significantly the poorest
according to RPS, but the other schemes score very similarly. ES and IGN also score TSCZ as
significantly the worst at early lead times, but at later lead times, no one scheme is significantly
different to the others.
It is important for a model to capture the spatial and temporal characteristics of precipitation. The global frequency distribution of rain rate (in mm/day) was considered for the
different forecast models and compared to the GPCP 1DD dataset. The results are shown
in Figure 5.11. All five forecast models perform similarly well, and no one model performs
particularly well or poorly compared to the others. All forecasts under-predict the proportion of low rain rates and over-predict the proportion of high rain rates when compared to the GPCP data set (grey), but overall predict the distribution of rain rates well.
The spatial distribution of cumulative precipitation (convective plus large scale) was also
Figure 5.10: Summary forecast diagnostics for 24 hour cumulative convective precipitation (prior to the indicated lead time) in tropical regions with significant convection. (a) Percentage bias. (b) Temporal evolution of RMS ensemble spread (dashed lines) and error (solid lines) averaged over the region. (c) Ranked Probability Skill Score. (d) Ignorance Skill Score. (e) Error-spread Skill Score. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The red line is obscured by the magenta line in each figure.
Figure 5.11: Probability distribution of rain rate (mm/24hrs) evaluated globally. The distribution has been normalised to 1, given that rain is observed in each 24 hour window. The observed result from the GPCP dataset (grey) is compared to the five experiments: TSCZ (solid black), TSCS (blue), TSCP (red), TSCPr (magenta) and TSCPv (green).
considered for the different forecast models. All schemes performed equally well (not shown).
When compared to the GPCP data, all showed too much precipitation over the ocean, and in
particular forecast intensities of rain in the intertropical and South Pacific convergence zones
that were higher than observed. The results were indistinguishable by eye — the difference
between forecast and observations is far greater than the differences between different forecasts.
5.6.2.2 Total Column Water Vapour
The impact of the different model uncertainty schemes on forecasts of total column water
vapour (TCWV) is also a good indicator of improvement in the convection scheme. Figure 5.12
shows the RMS error-spread diagnostic for TCWV. The forecasts for this variable are poorly
calibrated when compared to convective precipitation. The RMSE is systematically larger
than the spread, and the slope of the scattered points is too shallow. This shallow slope
indicates that the forecasting system is unable to distinguish between cases with low and high
predictability for this variable — the expected error in the ensemble mean is poorly predicted
by the ensemble spread. The different forecast schemes show a larger impact than for forecasts
of precipitation — the TSCS model produces forecasts which are under-dispersive compared
to the other forecasts. Figure 5.13 shows (a) the bias, (b) the RMSE and spread as a function
of time, and (c)–(e) the forecast skill scores for each experiment. Panel (b) shows that the
TSCPv forecasts have significantly the largest spread at lead times of 24 hours and greater. The
TSCS forecasts have significantly the smallest spread at later lead times, but also significantly
the largest error at all lead times. Figure 5.13 (a) shows the bias is also largest for the TSCS
forecasts, and Figure 5.13 (c–e) indicates the skill is the lowest.
An early version of SPPT was found to dry out the tropics, and resulted in a decrease in
TCWV of approximately 10% (Martin Leutbecher, pers. comm., 2013). This was corrected in a
later version. It is possible that TCWV could be sensitive to the proposed perturbed parameter
representations of model uncertainty. The TCWV between 20°N and 20°S is averaged over all start dates separately for each ensemble member, and is shown in Figure 5.14. Initially,
all experiments show a drying of the tropics of approximately 0.5 kgm−2 over the first 12 hours,
indicating a spin-up period in the model. The TSCZ, TSCS and TSCPv forecasts then stabilise.
However, each ensemble member in the TSCP model has vastly different behaviour, with some
showing systematic drying, and others showing systematic moistening over the ten day forecast.
Figure 5.12: RMS error-spread diagnostic for total column water vapour for lead times of (a) 1 day, (b) 3 days and (c) 10 days. The diagnostic is calculated for tropical regions with significant convection. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. For a well calibrated ensemble, the scattered points should lie on the one-to-one diagonal shown in black.
Figure 5.13: Summary forecast diagnostics for total column water vapour in tropical regions with significant convection. (a) Percentage bias. (b) Temporal evolution of RMS ensemble spread (dashed lines) and error (solid lines) averaged over the region. (c) Ranked Probability Skill Score. (d) Ignorance Skill Score. (e) Error-spread Skill Score. Results are shown for the five experiments: black — TSCZ; blue — TSCS; red — TSCP; magenta — TSCPr; green — TSCPv. The red line is obscured by the magenta line in each figure.
Figure 5.14: Average total column water vapour (TCWV) between 20°S and 20°N as a function of time. The spatial average is calculated for each ensemble member averaged over all start dates, and the averages for each of the fifty ensemble members are shown. Results are shown for the five experiments: (a) TSCZ, (b) TSCS, (c) TSCP, (d) TSCPr and (e) TSCPv.
The TSCPr model does not show this behaviour to the same extent. Figure 5.15 shows an alternative diagnostic: the TCWV is averaged over the region, and the mean and standard deviation of this diagnostic are calculated over all ensemble members and start dates. The
average TCWV is similar for all experiments. The standard deviation initially decreases for all
experiments. However, at longer lead times, the standard deviation increases for both TSCP
and TSCPr, indicating differing trends in TCWV for different ensemble members for both
experiments.
5.7 Discussion and Conclusion
The results presented above show that the perturbed parameter schemes have a positive impact
on the IFS, though the impact is relatively small. Introducing the TSCP/TSCPr schemes
Figure 5.15: The total column water vapour (TCWV) is averaged between 20°S and 20°N as a function of time. The (a) mean and (b) standard deviation are then calculated over all ensemble members and start dates. Results are shown for the five experiments: black — TSCZ, blue — TSCS, red — TSCP, magenta — TSCPr and green — TSCPv.
(defined in Table 5.3) does not lead to increased bias in T850, U850, Z500 or U200, indicating
that systematic errors in these fields have not increased. An increase in ensemble spread is
observed when the perturbed parameter schemes are used to represent uncertainty in convection
instead of SPPT, and the TSCP/TSCPr forecasts have significantly the largest spread for
T850, U850 and Z500 forecasts, which Figure 5.7 indicates is flow-dependent. The perturbed
parameter schemes produce significantly the most skilful forecasts of T850, U850 and Z500 as
ranked by the RPSS, IGNSS and ESS.
These results indicate that using a fixed perturbed parameter ensemble instead of SPPT
improves the representation of uncertainty in convection. However, the fixed perturbed parameter ensembles remain under-dispersive. While an increase in spread is observed when the
perturbed parameter schemes are used to represent uncertainty in convection compared to
SPPT, a substantial proportion of this increase is also observed in Figure 5.6 when SPPT is
switched off for the convection scheme (this counter-intuitive result is analysed in Chapter 6).
Since SPPT is switched off for convection in TSCP and TSCPr, this indicates that the parameter perturbations are contributing only slightly to the spread of the ensemble, and much of the
spread increase can be attributed to this decoupling of the convection scheme from SPPT (see
Section 6.5 for further experiments which confirm this “decoupling” hypothesis). The small
impact of the perturbed parameter scheme indicates that such schemes are not fully capturing
the uncertainty in the convection scheme at weather forecasting time scales. This is surprising
as the parameter uncertainty has been explicitly measured and used to develop the scheme.
The TSCPv scheme had a positive impact on the skill of the weather forecasts, and significantly improved over the TSCZ and TSCS forecasts for many diagnostics. The impact
on spread and skill was smaller than the static perturbed parameter schemes. It is possible
that the parameter perturbations vary on too fast a time scale for a significant impact to be
observed — if the parameters varied more slowly, a larger, cumulative effect could be observed
in the forecasts. It would be interesting to test the TSCPv scheme using a longer correlation
time scale to test this hypothesis.
The two types of perturbed parameter scheme presented here represent fundamentally different error models. Fixed perturbed parameter schemes are based on the ansatz that there
exists some optimal (or “correct”) value of the parameters in the deterministic parametrisation scheme. Even using EPPES, the optimal parameters cannot be known with certainty,
so a perturbed parameter ensemble samples from a set of likely parameter values. The fixed
perturbed parameter ensembles tested in this chapter were under-dispersive, and did not fully
capture the uncertainty in the forecasts. This indicates that fixed parameter uncertainty is
not the only source of model uncertainty, and that fixed perturbed parameter ensembles cannot
be used alone to represent model uncertainty in an atmospheric simulation. While parameter
uncertainty could account for systematic errors in the forecast, the results indicate that some
component of the error cannot be captured by a deterministic uncertainty scheme. In particular, perturbed parameter ensembles are unable to represent structural uncertainty due to
the choices made when developing the parametrisation scheme, and a different approach is
required to represent uncertainties due to the bulk formula assumption.
The second error model recognises that in atmospheric modelling there is not necessarily a
“correct” value for the parameters in the physics parametrisation schemes. Instead there exists
some optimal distribution of the parameters in a physical scheme. Since in many cases the
parameters in the physics schemes have no direct physical interpretation, but represent a group
of interacting processes, it is likely that their optimal value may vary from day to day, or from
grid box to grid box, or on larger scales they may be seasonally or latitudinally dependent.
A stochastically perturbed parameter ensemble represents this parameter uncertainty. The
stochastically perturbed parameter scheme also underestimated the error in the forecasts. Even
generalised to allow varying parameters, parameter uncertainty is not the only source of model
uncertainty in weather forecasts. Not all sub-grid scale processes can be accurately represented
using a statistical parametrisation scheme, and some forecast errors cannot be represented
using the phase space of the parametrised tendencies.
The EPPES indicated that the uncertainty in the convection parameters was moderate, and
smaller than expected (Heikki Järvinen, pers. comm., 2013). The results presented here also
indicate larger parameter perturbations could be necessary to capture the uncertainty in the
forecast from the convection scheme. However, the average tropical total column water vapour
indicates that even these moderate perturbations are sufficient for biases to develop in this field
over the ten day forecast period. The ensemble members with different sets of parameters have
vastly different behaviours, with some showing a systematic drying and others a systematic
moistening in this region. This is very concerning. The second diagnostic presented indicates
that TSCPr also has the problem of systematic drying or moistening for individual ensemble
members depending on the model parameters, and suggests that this is a fundamental problem with using a fixed perturbed parameter ensemble. The fact that this problem develops
noticeably over a ten day window indicates that this could be a serious problem in climate
prediction, where longer forecasts could result in even larger biases developing. This result
supports the conclusions made in Chapter 3 in the context of L96, where individual perturbed
parameter ensemble members were observed to have vastly different regime behaviour. The
TSCPv forecasts did not develop biases in this way, as the parameter sets for each ensemble
member varied over the course of the forecast, which did not allow these biases to develop.
Therefore, stochastically varying perturbed parameter ensembles could be an attractive way
of including parameter uncertainty into weather and climate forecasts.
A particularly interesting and counter-intuitive result is that removing the stochastic perturbations from the convection tendency resulted in an increase in forecast spread for some
variables. This is observed for T850, U850 and TCWV in both regions considered, and for
Z500 and U200 in non-convecting regions. SPPT perturbs the sum of the physics tendencies.
It does not represent uncertainty in individual tendencies, but assumes uncertainty is proportional to the total tendency. The increase in spread for TSCZ forecasts compared to TSCS
forecasts suggests that convection could act to reduce the sum of the tendencies, resulting in a
smaller SPPT perturbation. This is as expected from the formulation of the parametrisation
schemes in the IFS, and will be discussed further in Section 6.1. Perturbing each physics tendency independently would allow for an estimation of the uncertainty in each physics scheme,
potentially improving the overall representation of model uncertainty. This is the subject of
the next chapter.
Despite the reduced spread, the TSCS scheme outperforms the TSCZ scheme according
to other forecast diagnostics. The error in T850 forecasts is reduced using the TSCS scheme,
reflected by higher skill scores for this variable, and TSCS is significantly more skilful than
TSCZ at lead times of up to 3 days for U850 and Z500. Additionally, TSCS results in an
increase of spread for U200 and convective precipitation compared to TSCZ. At this point, it
is important to remember that the parametrisation tendencies are not scalar quantities, but are
vectors of values corresponding to the tendency at different vertical levels, and that SPPT uses
the same stochastic perturbation field at each vertical level (the perturbation is constant vertically except for tapering in the boundary layer and the stratosphere). The convection parametrisation
scheme is sensitive to the vertical distribution of temperature and humidity, and it is possible
that the tendencies output by the convection parametrisation scheme act to damp or excite
the scheme at subsequent time steps. Therefore perturbing the convective (vector) tendency
using SPPT could lead to an increased variability in convective activity between ensemble
members through amplification of this excitation process. Since both U200 and convective
precipitation are directly sensitive to the convection parametrisation scheme, these variables
are able to detect this increased variability, and show an increased ensemble spread as a result.
In fact, TSCS has significantly the most skilful forecasts out of all five experiments between
days three and ten for U200, and between days three and five for convective precipitation.
T850, U850 and Z500 are less sensitive to convection than U200 and precipitation. Since in
general the total perturbed tendency is reduced for TSCS compared to TSCZ, this could lead
to the reduction in ensemble spread observed for these variables.
The experiments presented in this chapter have used the IFS at a resolution of T159. This
is significantly lower than the operational resolution of T639. Nevertheless, the experiments
give a good indication of what the impact of the different schemes would be on the skill of
the operational resolution IFS. In Chapter 6, results are presented for T159 experiments which
were repeated at T639; the same trends can be observed at the higher resolution. Therefore,
the low resolution runs presented in this chapter can be used to indicate the expected results of
the models at T639, and can suggest whether it would be interesting to run further experiments
at higher resolution.
6 Experiments in the IFS: Independent SPPT
The only relevant test of the validity of a hypothesis is comparison of prediction with
experience.
– Milton Friedman, 1953
6.1 Motivation
The generalised stochastically perturbed parametrisation tendencies (SPPT) scheme developed
in the previous chapter allowed the SPPT perturbation to be switched off for the convection
tendency and replaced with a perturbed parameter scheme. However, it also enables one
to perturb the five IFS physics schemes with independent random fields. In the operational
SPPT, the uncertainty is assumed to be proportional to the total net tendency, whereas this
generalisation to SPPT assumes that the errors from the different parametrisation schemes
are uncorrelated, and that the uncertainty in the forecast is proportional to the individual
tendencies.
The standard deviation of the perturbed tendency, $\sigma_{\mathrm{tend}}$, using operational SPPT is given by

    \sigma_{\mathrm{tend}}^2 = \sigma_n^2 \left( \sum_{i=1}^{5} P_i \right)^2,    (6.1)
where $\sigma_n$ is the standard deviation of the noise perturbation and $P_i$ is the parametrised tendency from the $i$-th physics scheme. This can be compared to the standard deviation using independent SPPT (SPPTi):

    \sigma_{\mathrm{tend}}^2 = \sum_{i=1}^{5} \sigma_i^2 P_i^2.    (6.2)
If the physics tendencies tend to act in opposite directions, SPPTi will acknowledge the large
uncertainty in the individual tendencies and will increase the forecast ensemble spread.
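This difference can be made concrete with a short numerical sketch comparing (6.1) and (6.2). The tendency values and noise amplitude below are invented for illustration and are not taken from the IFS.

    import numpy as np

    # Hypothetical parametrised tendencies (K per time step) from the five IFS
    # physics schemes; the convective warming is largely offset by cooling from
    # the cloud scheme, so the net tendency is small.
    P = np.array([-0.2, 0.1, 0.0, 1.5, -1.3])
    sigma_n = 0.5                       # noise standard deviation, operational SPPT
    sigma_i = np.full(5, sigma_n)       # assume the same amplitude for each scheme

    sd_sppt = sigma_n * abs(P.sum())                  # from equation (6.1)
    sd_sppti = np.sqrt(np.sum(sigma_i**2 * P**2))     # from equation (6.2)

    print(sd_sppt)    # 0.05: tiny perturbation, since the net tendency is small
    print(sd_sppti)   # ~1.0: reflects the large individual tendencies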
A priori, it is not known whether SPPT or SPPTi is the more physically plausible error
model for the IFS, though it is very unlikely that uncertainties in the different processes are
precisely correlated, as modelled by SPPT. However, the different physics schemes in the
IFS have been developed in tandem, and are called sequentially in the IFS to maintain a
balance. For example, the cloud and convection schemes model two halves of the same set of
atmospheric processes, as described in Section 5.2.1. The convection parametrisation scheme
represents the warming due to moist convection, but the cloud scheme calculates the cooling
due to evaporation of cloudy air that has been detrained from the convective plume. This
means that the net tendency from the two schemes is smaller than each individual tendency,
and SPPT represents a correspondingly small level of uncertainty. SPPTi could be beneficial
in this case, as it is able to represent the potentially large errors in each individual tendency.
On the other hand, if the two schemes have been closely tuned to each other, potentially with
compensating errors, decoupling the two schemes by using independent perturbations could
reduce the forecast skill and introduce errors into the forecasts.
The impact of using SPPTi in the IFS will be tested in this chapter. As a first attempt,
each independent random number field will have the same characteristics as the field used
operationally in SPPT, i.e. each of the five independent fields is itself a composite of three
independent fields with differing temporal and spatial correlations and magnitudes (see Section 5.3). A series of experiments is carried out in the IFS. Four experiments are considered
to investigate the impact of the SPPTi scheme, and are detailed in Table 6.1. The impact of
the operational SKEB scheme is also considered as a benchmark. The same resolution, start
dates and lead time were used as in Chapter 5.
Firstly, the impact of SPPTi on global diagnostics is presented in Section 6.2. In Section 6.3, the impact of the SPPTi scheme in the tropics is considered, including differences between the behaviour of the scheme in the tropical regions with significant and with little convection which were defined in Chapter 5. The impact of the scheme on the skill of convection diagnostics is presented in Section 6.4. The experiments described in Section 6.5
Experiment Abbreviation    SPPT    SPPTi    SKEB
TSCS                       ON      OFF      OFF
TSCS + SKEB                ON      OFF      ON
TSCSi                      OFF     ON       OFF
TSCSi + SKEB               OFF     ON       ON
Table 6.1: The four experiments and their abbreviations considered in this chapter to investigate the impact of independent SPPT over operational SPPT. The impact of the SKEB scheme
is also considered for comparison. See Table 5.3 for an explanation of the abbreviations.
aim to increase understanding of the mechanisms by which SPPTi impacts the ensemble. In
Section 6.6, the results from experiments at operational T639 resolution are presented. In
Section 6.7, the results are discussed and some conclusions are drawn.
6.2 Global Diagnostics
The representations of model uncertainty considered in Chapter 5 focused on the convection
scheme only, so the different schemes were verified in the tropics where convection is most
active. The SPPTi scheme discussed in this chapter affects all parametrisation schemes, so
has a global impact. It is therefore important to evaluate the impact of the scheme on global
forecasts. The global bias, calculated following (1.24), is shown as a percentage in Figure 6.1.
Globally, the bias is small for each variable considered. However, for all variables, implementing
the SPPTi scheme results in an increase in global bias. The impact is particularly large for
U200, where the global bias has more than doubled in magnitude. The impact of SKEB
(comparing blue with cyan, and red with magenta) is considerably smaller than the impact of
the SPPTi scheme (comparing the warm and cool pairs of lines).
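As an illustration of how this diagnostic is constructed, the following sketch computes the bias of the ensemble mean as a percentage of the RMS of the verification. This is a minimal illustration with synthetic data: the calculation in the thesis follows (1.24), and in practice an area weighting and a region mask would also be applied.

    import numpy as np

    rng = np.random.default_rng(0)
    n_mem, n_lat, n_lon = 50, 45, 90
    # Synthetic stand-ins for an ensemble forecast and the verifying analysis.
    forecast = rng.normal(0.1, 1.0, (n_mem, n_lat, n_lon))   # small bias built in
    analysis = rng.normal(0.0, 1.0, (n_lat, n_lon))

    ens_mean = forecast.mean(axis=0)
    bias = (ens_mean - analysis).mean()          # mean error of the ensemble mean
    rms_verif = np.sqrt((analysis**2).mean())    # RMS of the verification field
    print(100.0 * bias / rms_verif)              # percentage bias, as in Figure 6.1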
Figure 6.2 shows the temporal evolution of the RMS ensemble spread and error, for each experiment considered, averaged over each standard ECMWF region: the northern extra-tropics is defined as north of 25°N, the southern extra-tropics as south of 25°S, and the tropics as 25°S–25°N. In the northern and southern extra-tropics (first
and third columns respectively), the impact of SPPTi on the ensemble spread is comparable
to, but slightly smaller than, the impact of SKEB, and all experiments have under-dispersive
forecasts. The different schemes have little impact on the RMSE. However, in the tropics (centre
column), SPPTi has a significant positive impact on the spread of the ensemble forecasts. The
impact is significantly larger than that of SKEB, and corrects the under-dispersive forecasts for
Figure 6.1: Global forecast bias as a function of time for (a) T850, (b) U850, (c) Z500 and (d)
U200. Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red
— TSCSi; magenta — TSCSi + SKEB. The bias is calculated as described in the text, and
given as a percentage of the root mean square of the verification in the region of interest.
T850, U850 and U200. While SPPTi has a larger impact on the spread of Z500 forecasts than
SKEB, the ensembles remain under-dispersive. A small impact on the RMSE is observed —
the T850 and Z500 errors are slightly increased and the U850 and U200 errors slightly reduced
by the SPPTi scheme. These results are very positive, and indicate the potential of the SPPTi
scheme.
Figure 6.2 indicates that SPPTi has the largest impact in the tropics. Figure 6.3 shows
the skill of the forecasts in this region evaluated using the RPSS, IGNSS and ESS for the four
variables of interest. IGNSS indicates an improvement of skill for all variables when SPPTi is
implemented. This is as expected from Figure 6.2: IGNSS strongly penalises under-dispersive
ensemble forecasts, so reducing the degree of under-dispersion results in an improved score.
RPSS and ESS indicate a slight improvement in skill for the U850 and U200 forecasts, but a
slight reduction in skill for the T850 and Z500 forecasts when the SPPTi scheme is used. This
could be due to the increase in root mean square error observed for these variables, linked to
the increase in bias observed in Figure 6.1. The IFS is such a highly tuned forecast model that
it would be very surprising if a newly proposed scheme resulted in an improvement in skill for
all variables in all areas. Before operationally implementing a new scheme, the scheme would
Figure 6.2: Temporal evolution of the RMS ensemble spread (dashed lines) and RMSE (solid
lines) for each standard ECMWF region. First column: northern extra-tropics, north of 25N.
Second column: tropics, 25S–25N. Third column: southern extra-tropics, south of 25S. (a)–(c)
T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200. Results are shown for the four experiments:
blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi + SKEB.
159
need to be tuned, and the model re-calibrated to account for the effects of the new scheme.
However, the significant improvement in spread observed in the tropics is sufficient to merit
further investigation.
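As noted above, the Ignorance score strongly penalises under-dispersive forecasts. This asymmetry can be seen in a short sketch, a minimal illustration assuming a Gaussian forecast distribution (which is not necessarily how the score is evaluated on the ensemble here).

    import numpy as np

    rng = np.random.default_rng(1)
    truth = rng.normal(0.0, 1.0, 100_000)    # verification drawn from N(0, 1)

    def mean_ign(mu, sigma, x):
        """Average Gaussian ignorance, -log p(x) for a N(mu, sigma^2) forecast."""
        return np.mean(0.5 * np.log(2 * np.pi * sigma**2)
                       + (x - mu)**2 / (2 * sigma**2))

    print(mean_ign(0.0, 1.0, truth))   # ~1.42: perfectly calibrated forecast
    print(mean_ign(0.0, 0.5, truth))   # ~2.23: under-dispersive, heavily penalised
    print(mean_ign(0.0, 2.0, truth))   # ~1.74: over-dispersion is penalised more gently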
6.3 Effect of Independent SPPT in Tropical Areas
What is the cause of the improved spread in the tropics when the SPPTi scheme is implemented? As in Chapter 5, let us consider areas in the tropics where there is little convection, and
areas where convection is the dominant process. The regions defined in Section 5.5.1 will be
used as before.
The percentage bias as a function of lead time is shown for areas of little convection in
Figure 6.4. The SPPTi forecasts have a larger bias at lead times greater than 24 hrs for T850
than for operational SPPT, and similar bias characteristics for Z500. However, U850 and U200
both show a reduction in the forecast bias when the SPPTi scheme is used compared to the
operational SPPT scheme. Figure 6.5 shows the same diagnostic for regions with significant
convection. The results are similar for T850 and Z500. U850 shows a slight improvement,
but U200 indicates that SPPTi results in a small increase in bias. For operational SPPT,
the negative bias in non-convecting regions cancels the positive bias in convecting regions to
produce a small globally averaged bias. The SPPTi scheme has effectively reduced the negative
bias in non-convecting regions, but has not had a large impact on the bias in convecting regions,
resulting in the large increase in magnitude of the bias observed for global U200 in Figure 6.1(d).
This demonstrates that regionally averaged biases can be misleading due to compensating errors.
Figures 6.6 and 6.7 show the evolution of the RMSE (solid lines) and RMS spread (dashed
lines) for the tropical regions with little convection and with significant convection respectively.
The operational SPPT ensembles (blue lines) are under-dispersive at all times, for all variables,
in both regions. The under-dispersion is greater in regions with significant convection. Including SKEB (cyan lines) does not significantly increase the spread of the ensemble.
In regions with little convection, using SPPTi results in a moderately large correction to
this under-dispersion, approximately halving the difference between spread and error for T850,
U850 and U200 when compared to the operational runs. The impact is larger than the impact
of including SKEB. The impact on spread is smaller for Z500, but still positive.
For regions with significant convection, the improvement in spread is greater than in regions
Figure 6.3: Ensemble forecast skill scores calculated for the tropics (25S–25N). First column:
Ranked Probability Skill Score. Second column: Ignorance Skill Score. Third column: Error-spread Skill Score. (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200. Results are shown
for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta —
TSCSi + SKEB.
Figure 6.4: Percentage forecast bias in tropical regions with little convection as a function of
time for (a) T850, (b) U850, (c) Z500 and (d) U200. Results are shown for the four experiments:
blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi + SKEB. The bias
is calculated as described in the text, and given as a percentage of the root mean square of the
verification in the region of interest.
Figure 6.5: As for Figure 6.4, except for tropical regions with significant convection.
Figure 6.6: Temporal evolution of RMS ensemble spread (dashed lines) and RMSE (solid lines)
for tropical regions with little convection for (a) T850, (b) U850, (c) Z500 and (d) U200.
Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red —
TSCSi; magenta — TSCSi + SKEB.
of little convection, whereas the improvement due to SKEB remains small. The spread of the
ensembles closely matches the RMSE, and is slightly over-dispersive for U850. Moreover,
the temporal evolution of the spread has an improved profile for T850, U850 and U200. For
operational SPPT, the increase in spread is a fairly linear function of time, whereas for SPPTi,
there is an initial period of rapid spread increase, followed by a reduction in rate, which closely
matches the observed error growth. Figures 6.6 and 6.7 also indicate that it is the convectively
active regions that are primarily responsible for the observed increase in RMSE for T850 and
Z500 for the SPPTi experiments. This increase in error is a concern.
The RMS error-spread graphical diagnostic gives more information about the calibration
of the forecast, testing whether the ensemble is able to skilfully indicate flow-dependent uncertainty. Figures 6.8 and 6.9 show this diagnostic for tropical regions with little and significant
convection respectively, for each variable of interest, at lead times of 1, 3 and 10 days. In both
regions, Z500 is comparatively poorly forecast by the model. The error-spread relationship
is weakly captured, and the ensemble spread is a poor predictor of the expected error in the
ensemble mean. For the other variables, ensemble spread is a good predictor of RMSE in both
regions.
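The construction of this diagnostic can be sketched as follows. This is a minimal illustration with synthetic data; the quantile binning into equally populated bins and the 50-member ensemble are assumptions chosen to mirror the verification set-up used here.

    import numpy as np

    rng = np.random.default_rng(2)
    n_fc, n_mem, n_bins = 20_000, 50, 30
    sigma_true = rng.uniform(0.5, 2.0, n_fc)            # flow-dependent uncertainty
    truth = rng.normal(0.0, sigma_true)
    ens = rng.normal(0.0, sigma_true[:, None], (n_fc, n_mem))

    err = ens.mean(axis=1) - truth                      # error of the ensemble mean
    spread = ens.std(axis=1, ddof=1)                    # ensemble standard deviation

    # Sort cases into equally populated spread bins; within each bin compare the
    # RMSE with the RMS spread. For a well calibrated ensemble the pairs fall on
    # the one-to-one diagonal (up to a small finite-ensemble-size correction).
    edges = np.quantile(spread, np.linspace(0.0, 1.0, n_bins + 1))
    idx = np.clip(np.digitize(spread, edges) - 1, 0, n_bins - 1)
    pairs = [(np.sqrt(np.mean(spread[idx == b]**2)),
              np.sqrt(np.mean(err[idx == b]**2))) for b in range(n_bins)]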
Figure 6.7: As for Figure 6.6, except for tropical regions with significant convection.
For the regions with little convection in Figure 6.8, the ensembles appear fairly well calibrated, and SPPTi has a small positive effect. The largest effect is seen at a lead time of
10 days for U200, where the operational SPPT ensemble was most under-dispersive. For the
regions with significant convection shown in Figure 6.9, forecasts show a large improvement.
The increase in spread of the ensembles is state dependent: for most cases, the increase in
spread results in an improved one-to-one relationship. One exception is Figure 6.9(e), where
the slope of the SPPTi scattered points is too shallow. It is interesting to note that the increase
in RMSE for the T850 forecasts appears to occur predominantly for small error forecasts. This
results in a flat tail to Figures 6.9(a)–(c) at small RMSE and spread, instead of a uniform increase in error across all forecasts. This tail appears to be unique to T850 out of the variables
considered, visible at all lead times, and only in regions where convection dominates.
Figures 6.10 and 6.11 show the skill of the forecasts in regions of little and significant
convection respectively, according to the RPSS, IGNSS and ESS. The skill scores for these
two regions effectively summarise Figures 6.8 and 6.9, and provide more information as to the
source of the skill observed in Figure 6.3. In particular:
• The poorer RPSS and ESS for SPPTi in the tropics for T850 (Figure 6.3) is mainly due
to poorer forecast skill in convecting regions, though a small reduction in skill in non-convecting regions is also observed. The significant improvement in IGNSS in this region
Figure 6.8: RMS error-spread diagnostic for tropical regions with little convection for (a)–
(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200, at lead times of 1, 3 and 10 days for
each variable (first, second and third columns respectively). Results are shown for the four
experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi +
SKEB. The one-to-one diagonal is shown in black.
Figure 6.9: As for Figure 6.8, except for tropical regions with significant convection.
Figure 6.10: Ensemble forecast skill scores calculated for tropical regions with little convection. First column: Ranked Probability Skill Score. Second column: Ignorance Skill Score.
Third column: Error-spread Skill Score. (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l)
U200. Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB;
red — TSCSi; magenta — TSCSi + SKEB.
Figure 6.11: As for Figure 6.10, except for tropical regions with significant convection.
indicates an improvement in ensemble spread as observed, so it is likely this reduction in
RPSS and ESS is due to the increased RMSE and the flat tail observed in the RMSE-spread diagnostic plots.
• The significantly improved RPSS and ESS for SPPTi for U850 are due to improved skill in
convecting regions. The improvements in IGNSS involve contributions from both regions.
• Z500 is not a particularly informative field to study in the tropics as it is very flat and
featureless. IGNSS indicates negative skill for lead times greater than 3 days, and RPSS
and ESS also indicate little skill for this variable.
• The improvement of RPSS and IGNSS for U200 in the tropics is mostly due to an
improved forecast skill in the regions with little convection. This improvement was clearly
visible in the RMSE-spread scatter diagrams. A small but significant improvement in
skill in convecting regions is also observed, especially at later lead times.
6.4 Convection Diagnostics
The impact of SPPTi is largest in regions with significant convection. To investigate whether this indicates that convection is modelled better by this scheme, the convection diagnostics
discussed in Chapter 5 will be considered here, evaluated for tropical regions with significant
convection.
6.4.1 Precipitation
Firstly, the skill of forecasting convective precipitation is considered. Convective precipitation
is calculated by the convection scheme, and is not directly perturbed by SPPT. Therefore any
impact of SPPTi on forecasting this variable indicates a feedback mechanism: the convection
physics scheme is responding to the altered atmospheric state. As in Chapter 5, the GPCP data
set is used for verification of precipitation. Figure 6.12 shows the RMS error-spread diagnostic
for convective precipitation. It indicates that operational SPPT is under-dispersive for this
variable, and that SPPTi results in an improved spread in forecast convective precipitation at
all lead times. Figure 6.13 shows (a) the bias, (b) the RMSE and spread as a function of time,
and (c)–(e) the forecast skill scores for convective precipitation. The bias in the convective
Figure 6.12: RMS error-spread diagnostic for cumulative convective precipitation for the 24
hour window before a lead time of (a) 1 day, (b) 3 days and (c) 10 days. The diagnostic is
calculated for tropical regions with significant convection. Results are shown for the four
experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi +
SKEB. The one-to-one diagonal is shown in black.
Figure 6.13: Summary forecast diagnostics for 24 hour cumulative convective precipitation in
tropical regions with significant convection. (a) Percentage bias. (b) Temporal evolution
of RMS ensemble spread (dashed lines) and error (solid lines) averaged over the region. (c)
Ranked Probability Skill Score. (d) Ignorance Skill Score. (e) Error-spread Skill Score. Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi + SKEB.
precipitation forecasts is higher for SPPTi than for SPPT, but all other diagnostics indicate
that the SPPTi forecasts are more skilful than the operational SPPT forecasts.
The skill of the precipitation forecasts can also be evaluated by considering the spatial
distribution of cumulative precipitation (convective plus large-scale) for the different forecast
models. The average 24-hour cumulative precipitation is shown for the GPCP data set in
Figure 6.14. The difference between the forecast and GPCP fields is shown in Figure 6.15 for
each of the four experiments in Table 6.1. Blue indicates the forecast has too little precipitation
whereas red indicates too much precipitation. Figures (a) and (b) show the results for the
operational SPPT scheme, with and without SKEB respectively. The results are very similar.
Both show too much precipitation across the oceans. Figures (c) and (d) show the results for
the SPPTi scheme, with and without SKEB respectively. Again, including SKEB has little
impact, but including the SPPTi scheme has slightly increased the amount of precipitation
over the oceans, as indicated earlier by the increase in bias in Figure 6.13(a). Using SPPTi
does not result in a significant change in the spatial distribution of rain.
The skill of the precipitation forecasts can also be evaluated by considering the global
frequency distribution of rain rate for the different forecast models and comparing to the
observed rain rate distribution in the GPCP dataset. This is shown in Figure 6.16. All four
forecast models perform well, though all underestimate the proportion of low rain rates and
overestimate the proportion of high rain rates. The operational SPPT scheme is closer to the
observations at mid to high rain rates, and the SPPTi scheme is marginally better at low rain
rates. Importantly, the diagnostic does not flag up any major concerns for the new SPPTi:
when SPPT was originally being developed, such a diagnostic showed that very high rain rates
occurred at a significantly inflated frequency, which led to alterations to the SPPT scheme
(Martin Leutbecher, pers. comm., 2013).
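The rain rate distribution diagnostic can be sketched as follows. This is a minimal illustration: the gamma-distributed accumulations and the 0.1 mm rain threshold are assumptions for demonstration, not properties of the IFS or GPCP data.

    import numpy as np

    rng = np.random.default_rng(3)
    obs = rng.gamma(shape=0.4, scale=12.0, size=500_000)    # synthetic accumulations (mm)
    model = rng.gamma(shape=0.5, scale=14.0, size=500_000)

    def rain_rate_pdf(r, edges, threshold=0.1):
        """Frequency distribution of rain rate, normalised to 1 given rain occurs."""
        wet = r[r > threshold]               # condition on rain being observed
        hist, _ = np.histogram(wet, bins=edges)
        return hist / hist.sum()

    edges = np.linspace(0.0, 150.0, 76)      # 2 mm bins over the range of Figure 6.16
    pdf_obs = rain_rate_pdf(obs, edges)
    pdf_model = rain_rate_pdf(model, edges)
    # Plotted on a logarithmic probability axis, an inflated frequency of very high
    # rain rates would appear as pdf_model sitting above pdf_obs in the tail.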
6.4.2 Total Column Water Vapour
Secondly, the skill of forecasting total column water vapour (TCWV) is considered, to which
the convection scheme is sensitive. Figure 6.17 shows the RMS error-spread diagnostic for
TCWV. As observed in Section 5.6.2.2, forecasts for this variable are poorly calibrated for
all experiments. Their RMSE is systematically too large, and the RMS error-spread
diagnostic has a shallow slope. Nevertheless, the SPPTi scheme increases the spread of the
Figure 6.14: Distribution of 24-hour cumulative precipitation (metres) in the GPCP dataset,
averaged for each successive 24-hour window between 14th April and 9th September 2012.
Figure 6.15: Difference between forecast and GPCP 24-hour cumulative precipitation (m).
Blue indicates too little precipitation in the forecast, red indicates too much. The colour bar
corresponds to all figures. Results are shown for the four experiments: (a) TSCS, (b) TSCS +
SKEB, (c) TSCSi and (d) TSCSi + SKEB.
Figure 6.16: Probability distribution of rain rate (mm/12hrs) evaluated globally. The distribution has been normalised to 1, given that rain is observed in each 12 hour window. The
observed result from the GPCP dataset (grey) is compared to TSCS (blue), TSCS + SKEB
(cyan), TSCSi (red) and TSCSi + SKEB (magenta) forecasts.
ensemble compared to the operational scheme, improving the calibration. Figure 6.18 shows (a)
the bias, (b) the RMSE and spread as a function of time, and (c)–(e) the forecast skill scores for
TCWV. These diagnostics indicate a significant improvement in forecast skill of TCWV when
SPPTi is used. The forecast bias is reduced, the RMS spread is increased without increasing
the RMSE, and the RPSS, IGNSS and ESS all indicate higher skill.
It is possible that SPPTi could result in significant changes of TCWV in the tropics as
was observed for the perturbed parameter experiments in Chapter 5, so this must be checked.
Figure 6.19 shows the average TCWV between 20°S and 20°N (calculated as described in
Section 5.6.2.2). This will diagnose if using SPPTi results in a systematic drying or moistening
of the tropics. All experiments show an initial spin-down period where the tropics dry by
0.5 kgm−2 over the first 12 hours, before stabilising. The operational SPPT forecasts in figures
(a) and (b) show a slight drying over the 240 hour forecast window, whereas the SPPTi forecasts
in figures (c) and (d) have a more constant average TCWV. All four experiments show stable
results.
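The diagnostic in Figure 6.19 amounts to an area-weighted mean of TCWV over the 20°S–20°N band for each ensemble member. A sketch of the calculation follows, using synthetic data; the grid resolution and cosine-latitude weighting are assumptions for illustration.

    import numpy as np

    rng = np.random.default_rng(4)
    n_mem, n_time, n_lat, n_lon = 50, 21, 61, 240
    lats = np.linspace(-30.0, 30.0, n_lat)
    tcwv = rng.normal(40.0, 2.0, (n_mem, n_time, n_lat, n_lon))   # kg m-2, synthetic

    band = np.abs(lats) <= 20.0                    # restrict to 20S-20N
    w = np.cos(np.deg2rad(lats[band]))             # area weighting by cos(latitude)
    w /= w.sum()

    zonal = tcwv[:, :, band, :].mean(axis=3)       # average over longitude
    band_mean = (zonal * w).sum(axis=2)            # weighted average over latitude
    # band_mean has shape (member, lead time); plotting each member's trajectory
    # reveals any systematic drying or moistening of the tropics.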
Figure 6.17: RMS error-spread diagnostic for TCWV for lead times of (a) 1 day, (b) 3 days and
(c) 10 days. The diagnostic is calculated for tropical regions with significant convection.
Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red —
TSCSi; magenta — TSCSi + SKEB. The one-to-one diagonal is shown in black.
Figure 6.18: Summary forecast diagnostics for TCWV in tropical regions with significant
convection. (a) Percentage bias. (b) Temporal evolution of RMS ensemble spread (dashed
lines) and error (solid lines) averaged over the region. (c) Ranked Probability Skill Score. (d) Ignorance Skill Score. (e) Error-spread Skill Score. Results are shown for the four experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi; magenta — TSCSi + SKEB.
Figure 6.19: Average TCWV between 20°S and 20°N as a function of time. The spatial average
is calculated for each ensemble member averaged over all start dates, and the averages for each
of the fifty ensemble members are shown. Results are shown for the four experiments: (a)
TSCS, (b) TSCS + SKEB, (c) TSCSi, and (d) TSCSi + SKEB.
6.5 Individually Independent SPPT
Independent SPPT assumes that the errors associated with different physics schemes are uncorrelated. It also has the effect of decoupling the physics schemes in the IFS: the random
patterns are introduced after all calculations have been made, so each physics scheme does not
have the opportunity to react to the modified tendencies from the other schemes. The results
presented in this chapter show that this assumption results in a large increase of spread, particularly in convecting regions, and for U200 in non-convecting regions. To probe further into the
mechanisms of SPPTi, a series of five experiments was carried out. In each experiment, just
one of the five physics schemes was perturbed with a random number field independent of that used for the other four (Table 6.2). These “individually independent SPPT” experiments should indicate
the degree to which a particular physics scheme should have an independent error distribution
from the others. In particular, these experiments aim to answer the following questions:
1. Is it decoupling one particular scheme from the others that results in the large increase
in spread, or is it important that all schemes are treated independently?
2. Does decoupling one particular scheme result in the increased error observed for T850?
Physics Scheme                        Experiment Abbreviation if Independently Perturbed
Radiation                             RDTTi
Turbulence and Gravity Wave Drag      TGWDi
Non-Orographic Gravity Wave Drag      NOGWi
Convection                            CONVi
Large Scale Water Processes           LSWPi
Table 6.2: The experiment abbreviations for the individually independent SPPT experiments,
in which each physics scheme in turn is perturbed with a different pattern to the other four
schemes, which are perturbed together.
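The perturbation structure of the experiments in Table 6.2 can be sketched at a single grid point as follows. This is schematic only: in the IFS the random numbers are smooth, evolving spectral fields rather than independent draws, and the tendency values below are invented.

    import numpy as np

    rng = np.random.default_rng(5)
    schemes = ["RDTT", "TGWD", "NOGW", "CONV", "LSWP"]
    P = {s: rng.normal(0.0, 1.0) for s in schemes}   # stand-in tendencies at a point

    def perturbed_tendency(P, independent, sigma=0.5):
        """Total perturbed tendency: schemes in `independent` receive their own
        random number; all remaining schemes share a single random number."""
        r_shared = rng.normal(0.0, sigma)
        total = 0.0
        for s, p in P.items():
            r = rng.normal(0.0, sigma) if s in independent else r_shared
            total += (1.0 + r) * p                   # multiplicative SPPT perturbation
        return total

    t_sppt = perturbed_tendency(P, independent=set())         # operational SPPT
    t_convi = perturbed_tendency(P, independent={"CONV"})     # the CONVi experiment
    t_sppti = perturbed_tendency(P, independent=set(schemes)) # full SPPTi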
Figure 6.20: RMS error (solid lines) and spread (dashed lines) as a function of time for forecasts
in tropical regions with little convection. (a) T850, (b) U850, (c) Z500 and (d) U200. The five
individually independent SPPT experiments are shown. Black — RDTTi. Grey — TGWDi.
Yellow — NOGWi. Green — CONVi. Magenta — LSWPi. Blue — operational SPPT, and
Red — SPPTi are included for comparison. The blue lines are obscured by the yellow and
grey lines in each figure.
Figure 6.20 shows the RMSE in the ensemble mean and RMS ensemble spread as a function
of time for each of the five individually independent SPPT experiments in regions with little
convection. The results for SPPT and SPPTi are also shown for comparison. The largest
impact is observed for U200, where both CONVi and RDTTi show the same spread increase
as SPPTi. This indicates that it is decoupling these two schemes which results in the large
spread increase observed for U200 when SPPTi is used instead of SPPT. For the other variables
considered, the impact of each individually independent scheme is more moderate, though in
each case CONVi and RDTTi result in the largest increase in spread. For T850, RDTTi also
results in a reduction in RMSE when compared to forecasts which use the operational SPPT
scheme.
Except for U200, SPPTi has the largest impact in regions with significant convection.
Figure 6.21 shows the RMSE and RMS spread as a function of time in these regions. CONVi
has the largest impact for each variable — perturbing the convection tendencies independently
from the other schemes results in an increase of ensemble spread equal to or greater than
that from independently perturbing all physics schemes. This supports the results from Chapter 5, in
which it was observed that decoupling the convection scheme by not perturbing its tendencies
resulted in an increase in spread. The next most influential scheme is radiation. For Z500 and
U200, perturbing this scheme independently also results in an increase of ensemble spread equal
to SPPTi. A large impact is also seen for U850 and T850. For the variables at 850 hPa, LSWPi
has a large impact. This is especially true at short lead times, when the impact is greater than
that of radiation. Using independent random fields for TGWDi (grey) and NOGWi (yellow)
has little impact on the ensemble spread — their RMSE and RMS spread are almost identical
to those from the operational SPPT forecasts. This is probably because these two schemes
act mainly in the boundary layer (TGWD) or in the middle atmosphere (NOGW), away from
the variables of interest. Additionally, the stochastic perturbations to these schemes will be
tapered, which will further reduce the impact of SPPTi.
Figure 6.21 also indicates which schemes contribute to the observed increase/decrease in
RMSE for SPPTi in regions with significant convection. For T850, SPPTi resulted in an
increase in RMSE. This same increase is observed for CONVi, and to a lesser extent for the
LSWPi forecasts. For the other variables, CONVi shows a similar RMSE to operational SPPT.
It is interesting to note that the RDTTi experiment does not result in an increase in error for
Figure 6.21: RMS error (solid lines) and spread (dashed lines) as a function of time for forecasts
in tropical regions with significant convection. (a) T850, (b) U850, (c) Z500 and (d) U200.
The five individually independent SPPT experiments are shown. Black — RDTTi. Grey —
TGWDi. Yellow — NOGWi. Green — CONVi. Magenta — LSWPi. Blue — operational
SPPT, and Red — SPPTi are included for comparison. The blue lines are obscured by the
yellow and grey lines in each figure. The black solid line in (a) is obscured by the grey solid
line.
T850, but does give a substantial increase in spread. The RDTTi experiment also performs
well for U850 and Z500, with an increase in spread and no increase in error observed for both
variables. For U200, the RDTTi scheme results in a decrease in error. These results imply
that much of the spread increase observed with SPPTi could be achieved by perturbing radiation
independently from the other physics schemes, without incurring the increase in RMSE observed for T850.
Figure 6.22 shows the RMS error-spread diagnostic at a lead time of ten days for the
individually independent experiments in regions with significant convection. This diagnostic
confirms that both CONVi (green) and RDTTi (black) produce forecasts with a similar degree
of spread to SPPTi (red). Furthermore, these individually independent schemes improve the
one-to-one relationship between RMSE and RMS spread. Figure 6.22(a) shows the results
for T850, including the increased error for predictable situations. The inset figure shows the
region of interest in more detail, indicated by the grey rectangle. LSWPi results in a significant
increase of error and a flatter ‘tail’. CONVi also results in an increase of error for the smallest
forecast spread cases, giving an upward ‘hook’ in the scatter diagnostic at smallest spreads.
This indicates poorly calibrated forecasts: the forecast spread does not correctly indicate the
error in the ensemble mean, and forecasts with the smallest spreads of between 0.4 and 0.5°C consistently have a higher error than those with spreads between 0.5 and 0.6°C. The results
for RDTTi are positive, showing an increase in spread but no associated increase in error.
Figure 6.23 shows the skill of the individually independent SPPT forecasts in regions with
significant convection, as indicated by the RPSS, IGNSS and ESS. Overall, the RDTTi forecasts
are more skilful than forecasts from any other scheme. In fact, RDTTi is more skilful than
SPPT for T850, whereas SPPTi was less skilful than SPPT for this variable. RDTTi also
performs well for the other variables considered, and has skill equal to or better than the
SPPTi scheme in most cases.
6.6 High Resolution Experiments
Due to limitations in computer resources, the experiments presented above ran the IFS at a
relatively low resolution of T159. The question then arises: does SPPTi have the same impact
when the model is run at the operational resolution of T639? I am grateful to Sarah-Jane Lock
(ECMWF), who ran two experiments on my behalf to test SPPTi at operational resolution.
Figure 6.22: RMS error-spread diagnostic for tropical regions with significant convection for
(a) T850, (b) U850, (c) Z500 and (d) U200, at a lead time of 10 days. The five individually
independent SPPT experiments are shown (triangles): Black — RDTTi, Grey — TGWDi,
Yellow — NOGWi, Green — CONVi, and Magenta — LSWPi. For comparison, the operational
SPPT (blue circles) and SPPTi (red circles) are also shown. The one-to-one diagonal is shown
in black. The inset figure in (a) is a close-up of the region indicated by the grey rectangle.
Figure 6.23: Ensemble forecast skill scores calculated for tropical regions with significant
convection. First column: Ranked Probability Skill Score. Second column: Ignorance Skill
Score. Third column: Error-spread Skill Score. (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and
(j)–(l) U200. Results are shown for the five individually independent SPPT experiments: Black
— RDTTi, Grey — TGWDi, Yellow — NOGWi, Green — CONVi, and Magenta — LSWPi.
Blue — operational SPPT, and Red — SPPTi are included for comparison. The blue lines are
obscured by the yellow lines in each figure. Additionally, in (j) the grey line is obscured by the
magenta line; in (k) the yellow and grey lines are obscured by the magenta line; in (l) the grey
line is obscured by the yellow line.
Ten-day ensemble hindcasts were initialised every five days between 14 April and 18 June
2012 (14 dates in total). The ensembles have 20 members instead of the operational 50. The
two experiments repeated at T639 were “SPPT” and “SPPTi”. A subset of twenty ensemble
members is taken from the operational forecasts for these dates to produce equivalent “SPPT
+ SKEB” forecasts for comparison.
6.6.1 Global Diagnostics
Figure 6.24 shows the RMSE and RMS spread for each of the standard ECMWF global regions
as a function of time for the variables of interest. At this higher resolution, the spread of the
forecasts is well calibrated in the extra-tropics (first and third column). SPPTi has little
impact here, so the ensembles remain well calibrated. In the tropics, the T639 forecasts are
under-dispersive. Here, SPPTi results in a significant increase in spread, and has a larger
impact on ensemble spread than the operational SKEB scheme. These results are similar to
those at T159, shown in Figure 6.2. The key difference between T159 and T639 is that the
operational T639 forecasts are better calibrated than the equivalent T159 forecasts. This means
that when SPPTi is implemented at T639, the ensemble forecasts become over-dispersive for
some variables (e.g. U850). There is also a significant increase in RMSE for T850 and Z500
forecasts at T639. For U200, the impact of SPPTi results in a good match between the ensemble spread and the RMS error in the ensemble mean.
It is important to note that for T850, U850 and U200, the ensemble spread is greater than
the RMSE at a lead time of 12 hours for all experiments. This is indicative of inflation of initial
condition uncertainty to compensate for an incomplete representation of model uncertainty.
Because the ensembles are under-dispersive at longer lead times, the initial condition perturbations have been artificially inflated to increase the ensemble spread. In fact, in the IFS, the
initial condition perturbations calculated by the EDA system are combined with singular vector perturbations before they are used. If SPPT is replaced by SPPTi, this artificial inflation
could be removed, and the raw initial condition uncertainty estimated by the EDA system used
instead. The temporal evolution of the ensemble spread for forecasts of U850 closely matches
the evolution of the RMSE (Figure 6.24(e)), and if the initial condition perturbations were
to be reduced to the raw EDA output, the results here indicate that SPPTi could produce a
forecast that is well calibrated at all lead times for this variable. It would also be interesting
Figure 6.24: Temporal evolution of RMS ensemble spread (dashed lines) and RMSE (solid
lines) for each standard ECMWF region. First column: northern extra-tropics, north of 25N.
Second column: tropics, 25S–25N. Third column: southern extra-tropics, south of 25S. (a)–
(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l) U200. Results are shown for the three T639
experiments: blue — TSCS; cyan — TSCS + SKEB; red — TSCSi.
to test using SPPTi in the EDA system, as it is possible that using SPPTi will impact the
initial condition uncertainty estimated using the EDA.
6.6.2 Verification in the Tropics
As at T159, SPPTi has the largest impact in the tropics. To investigate the source of the
increased spread, the forecasts will be verified in areas in the tropics where there is little
convection, and in areas where convection is the dominant process. The areas considered will
be those defined in Section 5.5.1. Results are shown in Figures 6.25 and 6.26 for regions with
little and significant convection respectively. The operational forecasts are under-dispersive
in both regions. As at T159, the under-dispersion is more severe in regions with significant
convection. For both regions, SPPTi has the effect of significantly increasing the ensemble
spread, whereas the impact of SKEB is more moderate. However, SPPTi also increases the
RMSE for T850 and Z500 forecasts in both regions, and results in a slight increase of RMSE
for U850 and U200 forecasts in convecting regions. In non-convecting regions, SPPTi results
in a slight reduction in RMSE for U850 and U200. The improved temporal evolution of the
ensemble spread identified above is observed in convecting regions, but not in non-convecting
regions. As at T159, the difference in behaviour between convecting and non-convecting regions
indicates that it is convection, and its interactions with other physical processes, that is the key
mechanism by which SPPTi affects the ensemble.
Figures 6.27 and 6.28 show the RMS error-spread graphical diagnostic at a lead time of
ten days for the three T639 experiments for regions with little and significant convection
respectively. The forecasts have been binned into 14 bins instead of 30 to ensure the population
of each bin is the same as before, and is a sufficiently large sample to estimate the statistics. The
impact of SPPTi is small in regions with little convection, though the spread-error relationship is improved slightly for U850. The average spread of forecasts is also improved for U200, but the flow-dependent calibration is poor — the scattered points do not follow the one-to-one
line. In regions of significant convection, the impact is greater. For T850 and Z500 there is an
improved error-spread relationship when SPPTi is used instead of SPPT. The spread of the
ensemble forecasts has increased, and the forecasts continue to give a flow-dependent indication
of uncertainty in the forecast. However, for T850, an increase in RMSE is observed. Unlike at
T159, this increase in RMSE occurs for all forecast spreads, not just for the small spread and
Figure 6.25: Temporal evolution of RMS ensemble spread (dashed lines) and RMSE (solid
lines) for tropical regions with little convection for (a) T850, (b) U850, (c) Z500 and (d)
U200. Results are shown for the three T639 experiments: blue — TSCS; cyan — TSCS +
SKEB; red — TSCSi.
Figure 6.26: As for Figure 6.25, except for tropical regions with significant convection.
Figure 6.27: RMS error-spread diagnostic for tropical regions with little convection for (a)
T850, (b) U850, (c) Z500 and (d) U200, at a lead time of 10 days for each variable. Results
are shown for the three T639 experiments: blue — TSCS; cyan — TSCS + SKEB; red —
TSCSi. The one-to-one diagonal is shown in black.
error cases. For U850, SPPTi also increases the spread of the forecasts, but by too great an
amount. For U200, the SPPTi results follow a shallower slope than the SPPT forecast results.
This indicates a reduced degree of flow-dependent predictability, though this is not a problem
at earlier lead times (not shown).
Figures 6.29 and 6.30 show the skill of the ensemble forecasts in regions with little and
significant convection respectively. Despite the improvement in spread, the SPPTi forecasts
tend to score poorly due to an associated increase in RMSE. The SPPTi forecasts are more
skilful than the SPPT forecasts for U850 and U200 in non-convecting regions, and for Z500 at
long lead times in convecting regions according to all skill scores.
6.7 Discussion and Conclusion
SPPTi results in a significant increase of spread for all variables at all lead times. In the extra-tropics, the ensemble forecasts are well calibrated at T639, and moderately under-dispersive
at T159. The impact of SPPTi is small in these regions. At T159, a small increase in ensemble
spread is observed, correcting for the under-dispersion; at T639 the impact is smaller, and
the ensemble forecasts remain well calibrated. The impact of SPPTi is similar to SKEB in
these regions. In the tropics, forecasts made with SPPT are significantly under-dispersive at
Figure 6.28: As for Figure 6.27, except for tropical regions with significant convection.
both T159 and T639. SPPTi has a large beneficial impact in these regions. The forecast
spread is significantly larger than when SPPT is used, and the impact is considerably larger
than the impact of SKEB. This is observed at both T159 and T639. SPPTi produces skilful,
flow-dependent estimates of forecast uncertainty, having a larger impact on forecasts that were
more under-dispersive when using SPPT.
The impact of SPPTi in tropical regions with significant convection (Figure 6.7) is considerably greater than in tropical regions with little convection (Figure 6.6) for T850 and U850, and
to a lesser extent, for Z500. This indicates that convection, together with its interactions with
other physics schemes, is a key process by which SPPTi impacts the ensemble. Equation (6.2)
indicates that the forecast uncertainty represented by SPPTi will only be greater than that represented by SPPT
for regions where the model tendencies act in opposite directions, i.e., where the individual
tendencies are large but the net tendency is small. In tropical regions with significant convection, this is indeed the case for the IFS. The convection scheme parametrises the effect of
convective latent heating on the atmosphere. This scheme interacts directly with the large
scale water processes (clouds) scheme: water detrained from the convective plume acts as a
source of water for clouds in the LSWP scheme, which then calculates the effect of evaporative
cooling on the atmosphere (ECMWF, 2012). This interaction means that a warming due to
convection tends to be associated with a cooling from the cloud scheme. The opposing nature
of these tendencies results in the significant increase in ensemble spread associated with SPPTi
Figure 6.29: Ensemble forecast skill scores calculated for tropical regions with little convection. First column: Ranked Probability Skill Score. Second column: Ignorance Skill Score.
Third column: Error-spread Skill Score. (a)–(c) T850, (d)–(f) U850, (g)–(i) Z500 and (j)–(l)
U200. Results are shown for the three T639 experiments: blue — TSCS; cyan — TSCS +
SKEB; red — TSCSi.
Figure 6.30: As for Figure 6.29, except for tropical regions with significant convection.
in these regions. The individually independent SPPT experiments also suggest that it is the
decoupling of the cloud and convection schemes from each other that produces this large increase
in spread for T850 and U850, as both the CONVi and LSWPi experiments showed increases in
spread compared to SPPT (Figure 6.21).
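This cancellation argument can be checked with a simple calculation. The sketch below (Python, with illustrative tendency values and noise amplitude rather than the IFS values, and assuming SPPT multiplies the summed tendency by a single factor (1 + e) while SPPTi applies an independent factor to each tendency) shows why decoupling opposing tendencies inflates the spread:

import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5                                  # multiplicative noise amplitude (illustrative)
t_conv, t_lswp = 5.0, -4.0                   # opposing convective/cloud tendencies, K/day (illustrative)

# SPPT: one pattern multiplies the NET tendency, so the perturbation
# scales with the small residual of the two opposing tendencies.
e = rng.normal(0.0, sigma, 100_000)
net_sppt = (1.0 + e) * (t_conv + t_lswp)

# SPPTi: independent patterns for each scheme, so the perturbations
# scale with the large individual tendencies and no longer cancel.
e1 = rng.normal(0.0, sigma, 100_000)
e2 = rng.normal(0.0, sigma, 100_000)
net_sppti = (1.0 + e1) * t_conv + (1.0 + e2) * t_lswp

print(net_sppt.std())    # ~ sigma * |t_conv + t_lswp| = 0.5 K/day
print(net_sppti.std())   # ~ sigma * sqrt(t_conv**2 + t_lswp**2) ≈ 3.2 K/day

With a large warming opposed by a large cooling, the single-pattern perturbation scales with the small net tendency, while the independent perturbations scale with the individual tendencies.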
The impact of the convection scheme on clouds also impacts the radiation parametrisation
scheme. As noted in Section 5.2.1, the radiation scheme interacts with the cloud scheme
since both short- and long-wave radiative transfer are sensitive to the cloud fraction predicted
by the cloud scheme. In particular, low level cloud is often associated with cooling from
the radiation scheme (Morcrette, 2012), which opposes the warming from convection. This
interaction between radiation and convection could contribute to the increase in spread for the
SPPTi forecasts in regions with significant convection. The RDTTi experiment showed that
decoupling RDTT from the other parametrisation schemes results in a large increase in spread
for T850 and U850 (Figure 6.21), supporting this hypothesis.
For U200, the largest increase in spread was also observed in the tropics (Figure 6.2).
However, unlike for the other variables, this spread increase is predominantly from forecasts
for tropical regions with little convection (Figure 6.6). The individually independent SPPT
experiments shown in Figure 6.20 show that independently perturbing just RDTT or CONV
gives the same increase in ensemble spread as SPPTi: for U200, it is decoupling RDTT from
CONV that results in the large increase in forecast spread. The variable U200 is sensitive to
convection as it is located close to the level at which maximum convective outflow occurs. In
regions with significant convection, it is expected that there will be thick cloud at this level
due to the spreading out of the convective anvil. If the 200 hPa level falls at the top of an
anvil cloud, significant radiative cooling will be observed as the cloud efficiently emits longwave
radiation to space (Gray and Jacobson, 1977). However if the 200 hPa level falls below an anvil,
a radiative warming would be observed due to an increase in trapped longwave radiation. This
characteristic warming profile is shown schematically in Figure 6.31, taken from Gray and
Jacobson (1977). For this reason, in regions of significant convection the radiation scheme
will produce tendencies at 200 hPa which either oppose or enhance the convective warming,
reducing the impact of SPPTi when averaged over many cases. In contrast, regions with little
convection have reduced amounts of optically thick high level cloud, so the radiation scheme
will tend to cool the atmosphere at this level. A large impact is observed in these regions
Figure 6.31: Typical radiation induced temperature changes in the tropics for a clear sky
region compared to a ‘disturbance’ region with thick high cloud cover. Taken from Gray and
Jacobson (1977).
when the opposing CONV and RDTT tendencies are decoupled using SPPTi. Decoupling the
convective and radiative temperature tendencies affects the horizontal temperature gradients
at 200 hPa, which could then affect the zonal wind speed at that level.¹
Considering the YOTC tendencies at 200 hPa provides support for this hypothesised mechanism. Figure 6.32 shows the 24 hr temperature tendencies of RDTT, CONV and LSWP at
200 hPa, averaged over 30 start dates between 14 April and 6 September 2009. Each scattered
point represents the tendency from two schemes at a particular location. Figure (a) shows that
in regions of significant convection, while the convective temperature tendencies are consistently positive,
radiation tendencies can be either positive or negative. In regions with little convection, the
convective tendencies remain positive on average, but the radiative tendencies are negative on
average.
The impact of SPPTi on U200 could also be indicative of an improved variability of convection in the IFS. The upper level wind field is sensitive to waves generated remotely by
convective systems (Stensrud, 2013). The improvement in ensemble spread and reduction in
error is most apparent at a lead time of ten days (Figure 6.8), which could suggest that this
improvement is due to a remote source. Other diagnostics indicate that convection is better
represented when SPPTi is used instead of SPPT. The skill of convective precipitation forecasts
at T159 improves at all lead times when SPPTi is used; forecasts show an improved ensemble
spread and a slight reduction in RMSE, though the wet bias of the model is also increased.
Forecasts of TCWV show a significant improvement when SPPTi is used. The bias is reduced,
¹ The RDTT parametrisation scheme directly affects the atmospheric temperature only, so the observed impact of RDTTi on U indicates a feedback mechanism.
[Figure: panels (a)–(c), scatter plots of paired temperature tendencies, ∆T, at 200 hPa.]
Figure 6.32: The 24-hour cumulative temperature tendencies at 200 hPa taken from the YOTC
data set for the RDTT, CONV and LSWP parametrisation schemes. The tendencies have
been averaged over 30 dates between 14 April and 6 September 2009, with subsequent start
dates separated by five days. The scattered points represent pairs of parametrised tendencies
from different spatial locations sampled over: the entire globe (grey); regions with significant
convection (red); regions with little convection (blue). The regions are those defined in Section
5.5.1. Figure (a) compares the RDTT and CONV tendencies; (b) compares the RDTT and
LSWP tendencies; (c) compares the LSWP and CONV tendencies.
and the spread of the forecasts improves significantly.
The increase in ensemble spread is certainly beneficial. However, is it physically reasonable
to decouple the parametrisation schemes in this way? As described in Section 5.2.1, the IFS
calls the parametrisation schemes sequentially to ensure balance between the different physics
schemes. Additionally, the different schemes in the IFS have been tuned to each other, possibly
with compensating errors. An increase of forecast error could be expected on decoupling the
schemes. This is indeed the case when SPPTi is implemented, and an increase in forecast bias
is observed for many variables. Maintaining balance is particularly important for the cloud
and convection parametrisation schemes as they represent two halves of the same process,
and because the division of labour between the two schemes is primarily dependent on model
resolution (ECMWF, 2012). The convection scheme is in balance with the cloud scheme:
the net deterministic tendency for the two schemes is close to zero. It is plausible that the
increase in RMSE in the SPPTi, CONVi and LSWPi experiments, most noticeably for T850
and Z500, could be attributed to the decoupling of the cloud and convection schemes, which
could remove this balance. At T159, the increase in RMSE for T850 only occurs for predictable
forecast situations with small forecast spread. It is possible that the CONV and LSWP schemes
have been tuned to be very accurate for these specific cases, so decoupling the two schemes
by introducing SPPTi has a detrimental effect on the forecast accuracy. This explanation is
supported by the results of the CONVi experiment — the RMS error-spread diagnostic for this
experiment had an upward ‘hook’ at small spreads, indicating a significant increase in error
for previously accurate forecast situations. At T639, Figure 6.28(a) indicates that the increase
in RMSE for T850 forecasts occurs uniformly across forecast cases. It is interesting that the
high resolution model behaves differently to the low resolution model, but further experiments
are required to diagnose the cause.
Despite the increase in forecast error, SPPTi clearly has some merit. It skilfully increases
the ensemble spread in under-dispersive regions, which tend to be those with significant convection. The resultant ensemble spread evolves in a very similar way to the RMSE — the scheme
appears to represent the uncertainty in the forecast very accurately. In the extratropics,
forecasts remain well calibrated and SPPTi has little effect. Hermanson (2006) considers the
difference between parametrised IFS tendencies at T799 and at T95, using the T799 integration
as a proxy for “truth”. He calculates a histogram for the sum of the temperature tendencies
from clouds and convection for each model. Both peaks were centred between −1 and 0 K/day.
However, the T95 model was observed to have a narrower, taller peak in the histogram than
the T799 model. This indicates that the lower resolution version of the IFS is underestimating
the variability of the sum of the cloud and convective tendencies — the low resolution model is
too balanced. It would be interesting to perform similar experiments using the current model
to test this balance hypothesis. If it is the case that the convection and cloud schemes are
too balanced, this could justify the use of SPPTi, and explain the source of the increased
skill, especially for the convection diagnostics. It would also be interesting to perform further
experiments using SPPTi — it is possible that using a common base pattern for the CONV
and LSWP perturbations and applying independent patterns to a smaller degree would be
beneficial, retaining a large degree of the balance between convection and clouds, but allowing
ensemble members to explore the slightly off-balance scenarios observed by Hermanson (2006)
in the higher resolution model.
The RDTTi experiment also merits further investigation as it resulted in a significant improvement in ensemble spread when tested at T159, but with no associated increase in RMSE.
In fact, RDTTi resulted in a reduction of error for U200 in regions with little convection, which
is also observed for forecasts using SPPTi in this region — the bias for U200 is reduced in regions with little convection. The reduction in RMSE indicates that the stochastic perturbations
to CONV and RDTT should be uncorrelated. The radiation scheme is affected by the convection scheme through the cloud parametrisation. However, this coupling between radiation and
clouds is through a Monte-Carlo calculation (the McICA scheme), so unless the cloud fractions
are systematically wrong, the correlation between the error from the cloud scheme and the
error from the radiation scheme should be reduced. The independent pattern approach could
therefore be a physically reasonable model for the errors between the convection and radiation
tendencies. It would be interesting to test the RDTTi scheme at T639 to see if a reduction in
RMSE is observed and if the forecast skill is improved at operational resolution.
In conclusion, modifying the SPPT scheme to allow independent perturbations to the parametrised tendencies has a significant positive impact on the spread of the ensemble forecasts
at T159 resolution. The improvement in ensemble spread was also observed at T639, though
this was accompanied by an increase in RMSE for some variables. Nevertheless, ECMWF is
very interested in performing further experiments using the SPPTi scheme to test its potential
for use in the operational EPS.
7 Conclusion
It seems to me that the condition of confidence or otherwise forms a very important
part of the prediction, and ought to find expression. It is not fair to the forecaster that
equal weight be assigned to all his predictions and the usual method tends to retard
that public confidence which all practical meteorologists desire to foster.
– W. Ernest Cooke, 1906
Reliability is a necessary property of forecasts for them to be useful to decision makers for
risk assessment, as described in Chapter 1 (Zhu et al., 2002). In order to produce reliable probabilistic weather forecasts, it is important to account for all sources of error in atmospheric
models. In the case of weather prediction, the two main sources of error arise from initial
condition uncertainty and model uncertainty (Slingo and Palmer, 2011). There has been much
interest in recent years in using stochastic parametrisations to represent model uncertainty
in atmospheric models (Buizza et al., 1999; Lin and Neelin, 2002; Craig and Cohen, 2006;
Khouider and Majda, 2006; Palmer et al., 2009; Bengtsson et al., 2013). Stochastic parametrisations have been shown to correct under-dispersive ensemble spread and improve the overall
skill of the forecasts. However, there has been little research explicitly testing the skill of
stochastic parametrisations at representing model uncertainty — existing stochastic schemes
have been tested in situations where there is also initial condition uncertainty, so it is hard
to determine to what degree the representation of initial condition uncertainty accounts for
model uncertainty, and vice versa. Additionally, there has been little research into the impact
of stochastic schemes on climate prediction. However, the ‘seamless prediction’
paradigm holds that in order to predict the climate skilfully, a model should be skilful at
predicting shorter time scale events. This ansatz can be used to evaluate and improve climate
models (Rodwell and Palmer, 2007; Palmer et al., 2008; Martin et al., 2010).
This study has tested stochastic parametrisations for their ability to represent model uncertainty skilfully in atmospheric models, and thereby to produce reliable, flow-dependent
probabilistic forecasts. The main aims were:
1. To explicitly test stochastic parametrisations as a way of representing model uncertainty
in atmospheric models. An idealised set up in the Lorenz’96 system allows the initial
conditions to be known exactly, leaving model uncertainty as the only source of error in
the forecast.
2. To test stochastic parametrisations for their ability to simulate the climate of a model.
Is it true that, following the seamless prediction paradigm, a model which is unreliable
in predicting the weather is likely to be unreliable in predicting the climate?
For both of these, the skill of stochastic parametrisation schemes is compared to perturbed
parameter schemes. These are commonly used as deterministic representations of uncertainty
in climate models. The final aim is:
3. To use the lessons learned in a simple system to test and develop stochastic and perturbed parameter representations of model uncertainty for use in an operational weather
forecasting model.
The main findings of this thesis are summarised below.
Stochastic parametrisations are a skilful way of representing model uncertainty in
weather forecasts.
It is important to represent model uncertainty in weather forecasts: the forecasts in the L96
system, described in Chapter 2, showed a significant improvement in skill when a representation of model uncertainty was included in the forecast model compared to the deterministic
forecasts. The stochastic parametrisation schemes produced the most skilful forecasts. More
importantly, the best stochastic parametrisation scheme, using multiplicative AR(1) noise, was
shown to be reliable, i.e., the ensemble was able to capture the uncertainty in the forecast due
to limitations in the forecast model. This indicates that stochastic parametrisations are a
skilful approach for representing model uncertainty in weather forecasts.
Using the L96 system to test stochastic parametrisations was advantageous because it
allowed initial condition uncertainty to be removed, leaving only model uncertainty. This is not
the case for the experiments using the IFS. Nevertheless, the new independent SPPT scheme
presented in Chapter 6 seems to be a skilful way of representing model uncertainty in the IFS,
improving the reliability of the ensemble forecasts, though resulting in too much spread for
some variables. However, the results at T639 indicate that initial condition perturbations are
routinely over-inflated to correct for the lack of spread generated by the operational stochastic
parametrisation schemes. The proposed SPPTi scheme could make inflation of initial condition
perturbations unnecessary in the IFS.
The perturbed parameter ensemble forecasts tested in the L96 system were more skilful
than a single deterministic forecast with no representation of model uncertainty, but they
were not as skilful as the best stochastic schemes (Chapter 2). This is despite estimating the
degree of parameter perturbation from the truth time series. The same result was obtained in
Chapter 5 when perturbed parameter ensembles were tested in the IFS — a relatively small
impact on forecast skill was observed, despite estimating the uncertainty in the parameters
using a Bayesian approach. This indicates that parameter uncertainty is not the only source
of model uncertainty in atmospheric models. In fact, assumptions and approximations made
when constructing the parametrisation scheme also result in errors which cannot be represented
by varying uncertain parameters. The stochastically perturbed parameter scheme tested in
the IFS also had a small positive impact on skill. However, the temporal and spatial noise
correlations were not estimated or tuned for this scheme, and the standard SPPT values were
used instead. It is likely that the optimal noise parameters would be different for a perturbed
parameter scheme than for SPPT, and using measured noise parameters could result in further
improvement in the forecast skill.
It is important to represent short time-scale model uncertainty in climate forecasts.
Stochastic parametrisations are a skilful way of doing this.
In Chapter 3, the L96 forecast model climatology showed a significant improvement over the
deterministic forecast model when stochastic parametrisations were used to represent model
uncertainty. The climate pdf simulated by the stochastic models was a better estimate of the
truth pdf than that simulated by the deterministic model. The improvement in climatology
was particularly clear when considering the regime behaviour of the L96 system. The deterministic forecast was unable to explore both regimes, and did not capture the variability of the
‘truth’ model. The stochastic forecast model performed significantly better, and was able to
capture both the proportion of time spent in each regime, and the regime transition time scales.
Studying the regime behaviour of a system provides much more information than comparing
the pdfs: this verification technique should be used when testing climate models.
The perturbed parameter ensembles were also tested on their ability to simulate the climate
of the L96 system. As each ensemble member is a physically distinct model of the system,
ensemble members should be treated independently. The average perturbed parameter pdf is
significantly more skilful than the deterministic pdf, though less skilful than the best stochastic
schemes. Each individual ensemble member also produces a skilful pdf. The perturbed parameter model was tested on its ability to simulate regime behaviour in the L96 system. While
the average of the ensemble members performed well, individual ensemble members varied
widely. Many performed very poorly and only explored one regime. A similar result was observed in the IFS in Chapter 5, when forecasts with the perturbed parameter ensemble were
considered. Over the ten day forecast time window, significant trends in tropical total column
water vapour were observed, with some ensemble members systematically drying and others
systematically moistening. Over a climate length integration, it is possible that these biases
would continue to grow. The stochastically perturbed parameter ensemble did not develop
these biases, so this could be a better way of representing parameter uncertainty in climate
models.
The results presented provide some support for the ‘seamless prediction’ paradigm: the
climatological skill of the forecast models was related to their skill at predicting the weather.
The results suggest that it is a necessary but not sufficient condition that a skilful climate
model produce reliable short range forecasts.
Skilful stochastic parametrisations should have a realistic representation of the
model error.
In Chapters 2 and 3, a significant improvement in weather and climate forecast skill for the
L96 system was observed when the stochastic parametrisation schemes included temporally
correlated (red) noise. This more accurately reflects the true sub-grid scale forcing from the
Y variables, which shows high temporal autocorrelation. In the atmosphere, it is likely that
the error in the sub-grid scale tendency (the difference between the parametrised and true
tendencies) also has a high degree of temporal and spatial correlation, which a stochastic
parametrisation scheme should represent.
In the L96 system, the more realistic c = 4 case further demonstrates the importance
of realistically representing noise in a stochastic parametrisation scheme: the more accurate
stochastic representations of the sub-grid scale forcing (the SD, M and MA schemes — see
Chapter 2 for more details) produced significantly more skilful weather forecasts than the
simple additive noise case. This is also the case for both the pdf and regime definitions of
climate considered.
In the IFS, it is not realistic to assume that the errors arising from different parametrisation
schemes are perfectly correlated, and relaxing this assumption in Chapter 6 was found to result
in a significant improvement in forecast reliability. However, it is unlikely that the errors
from different parametrisation schemes are completely uncorrelated, so measuring the degree
of correlation would be an important step for producing a realistic representation of model
uncertainty in the IFS.
Stochastic ensemble forecasts contain information about flow-dependent model
uncertainty, which is absent from statistically generated ensembles.
A new proper score, the Error-spread Score (ES), was proposed in Chapter 4, suitable for evaluation of ensemble forecasts. The score is particularly sensitive to the dynamic (flow-dependent)
reliability of ensemble forecasts, and detects a large improvement in skill for ensemble forecasts
compared to a statistically generated ‘dressed deterministic’ forecast. The decomposition of
the score indicates that the stochastic EPS forecasts have improved reliability over much of
the tropics compared to a statistical approach, and also have an improved resolution. The
ES detects skill in probabilistic seasonal forecasts of sea surface temperature in the Niño 3.4
region compared to both climatological and persistence forecasts.
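As a concrete illustration, the ES can be computed directly from raw ensemble output using the form ES = (s^2 − e^2 − esg)^2 derived in Appendix B, with s, g and m the sample spread, skewness and mean of each ensemble and e = m − z. A minimal sketch (Python; the toy forecast-verification data are illustrative, not the operational verification suite):

import numpy as np

def error_spread_score(ens, z):
    """Mean Error-spread Score over forecast-verification pairs.
    ens: (n_forecasts, n_members); z: (n_forecasts,) verifications."""
    m = ens.mean(axis=1)                         # ensemble mean
    s = ens.std(axis=1, ddof=1)                  # ensemble spread
    dev = ens - m[:, None]
    g = (dev**3).mean(axis=1) / (dev**2).mean(axis=1)**1.5   # sample skewness
    e = m - z                                    # error in the ensemble mean
    return np.mean((s**2 - e**2 - e * s * g)**2)

# Toy reliable ensemble: members and verification drawn from the same
# flow-dependent distribution, so the ES should be small.
rng = np.random.default_rng(1)
centres = rng.normal(0.0, 2.0, 1000)
ens = centres[:, None] + rng.normal(0.0, 1.0, (1000, 50))
z = centres + rng.normal(0.0, 1.0, 1000)
print(error_spread_score(ens, z))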
The skill of ensemble forecasts at predicting flow-dependent uncertainty is also reflected
in an improved RMS error-spread scatter plot: the forecast model can distinguish between
predictable and unpredictable days, and correctly predicts the error in the ensemble mean.
The stochastic parametrisations tested in the L96 system (Chapter 2), and the SPPTi scheme
tested in the IFS (Chapter 6) both showed improvements using this diagnostic.
Limitations of Work and Future Research
It is important to consider the limitations of this study when drawing conclusions. These
limitations also suggest interesting or constructive research projects which could extend the
work presented in this thesis.
The L96 model was designed to be a simple model of the atmosphere, with desirable properties including fully chaotic behaviour and interaction between variables of different scales.
However, it is a very simple model, and so has limitations. It is only possible to test statistically derived parametrisation schemes in the context of this model as opposed to those with
a more physical basis. This is true for both the deterministic and stochastic parametrisation
schemes analysed in this study. Nevertheless, the L96 system has many benefits. It allows
idealised experiments to be carried out, in which initial condition uncertainty can be removed.
It is also cheap to run, which allows the generation of a very large truth data set, so that the
parameters in the deterministic and stochastic schemes can be accurately determined. It also
allows large forecast-verification data sets and very long climate runs to be produced, giving
statistically significant results.
In Chapter 3, the L96 model was shown to have two distinct regimes, similar to regimes
observed in the atmosphere. However, in the L96 system, one regime is much weaker than
the other, and describes only 20% of the time series. This is dissimilar to what is observed
in the atmosphere. For example, four distinct weather regimes are observed in the North
Atlantic in winter, each accounting for between 21.0% and 29.2% of the time series — a relatively
even split between the four regimes (Dawson et al., 2012). Nevertheless, the weaker regime in
the L96 system is sufficiently robust to appear in several diagnostics, for both the smoothed
and unsmoothed time series, and the different representations of model uncertainty tested in
this study had a noticeable impact on simulating the regime behaviour. It would be very
interesting to further this work by considering atmospheric models. Recent work indicates
that stochastic parametrisations can improve the representation of regimes in the IFS (Andrew
Dawson, pers. comm., 2013), but there has been no work considering the representation of
regimes by perturbed parameter ensembles. We have carried out preliminary experiments
testing the regime behaviour of the climateprediction.net ‘Weather at Home’ data set, which
looks like an interesting route for further investigation.
The proposed ES is an attractive score as it uses the raw ensemble output to verify the
forecasts, and it is suitable for continuous forecasts. It is also sensitive to the ‘dynamic reliability’ of the forecast, to which other scores seem insensitive. However, it is only suitable
for ensemble forecasts with a sufficiently large number of members. This is because of the
need to calculate the third order moment of the ensemble, which is particularly sensitive to
sampling errors. Poor estimates of the ensemble skewness can result in erroneously large values
of the score. The ES decomposition provides useful information about the source of skill in the
forecast, as well as aiding an understanding of how the score works. However, this requires a
very large forecast-verification sample because of the need to bin the forecast-verification pairs
in two dimensions. This makes it unsuitable for small or preliminary studies where only a few
dates may be tested. Nevertheless, ECMWF have expressed an interest in including the ES
in their operational verification suite, which I will implement over the next few months. This
will provide a very large data set which I can use to test the score further, which should help
provide a more complete understanding of the score’s strengths and weaknesses.
In Chapter 5, precipitation is an important variable to consider for verification. It is produced by the convection scheme, so studying precipitation should detect improvements in the
parametrisation of convection. However, verification of precipitation is difficult as measurements of precipitation are not assimilated into the IFS using the 4DVar or EDA systems. In
Chapter 5, the GPCP data set was used for verification which includes information from both
satellites and rain gauges. This data set is likely to contain errors, particularly in small scale
features, which have not been accounted for when verifying the IFS forecasts. It is likely that
the regridding process (from a one-degree to a T159 reduced Gaussian grid) introduces additional errors. Spatially averaging the GPCP and forecast fields before verification, e.g. to a
T95 grid, should reduce this error, and will be considered in future work.
The main limitation of the experiments carried out in the IFS stem from the low resolution
of the forecasts. Using a resolution of T159 is computationally affordable, but has a grid point
spacing four times larger than the operational T639 resolution. While the T639 forecasts in
Chapter 6 showed similar trends to the T159 forecasts, it is apparent that the operational T639
forecasts are better calibrated than the T159 forecasts. This makes it hard to analyse the T159
forecasts, which are significantly under-dispersive, and which amplify the need to increase the
forecast spread in the tropics. At T159, the SPPTi scheme results in a well calibrated ensemble,
while at T639, it results in a significantly over-dispersive ensemble forecast for some variables.
Future research will focus on repeating experiments of interest at operational resolution. In
particular, it will be interesting to test the RDTTi scheme, as well as testing a common base
pattern for the CONV and LSWP tendencies. Additionally, it will be interesting to consider
ensemble forecasts using a reduced initial condition perturbation; specifically, we propose to
use the raw EDA output to provide the initial condition perturbations. This could be a
more realistic representation of initial condition uncertainty, and could improve the spread-error relationship when used in conjunction with SPPTi. This work is already scheduled to
be done in collaboration with ECMWF. The ultimate aim is to improve the SPPTi scheme,
reducing the forecast RMSE while maintaining the improvement in reliability, such that it can
be incorporated into a future cycle of the IFS.
The results from the L96 system motivate the importance of including accurate spatial
and temporal correlations in a stochastic parametrisation scheme. However, this correlation
structure has never been studied in a weather forecasting model, and an arbitrary structure is
currently imposed in the SPPT scheme. An interesting research direction will be to use coarse
graining experiments to physically derive this correlation structure, before implementing it in
the ECMWF model. This could significantly increase the skill of the forecast, while increasing
our understanding of the limitations of parametrisation schemes.
The key assumption in the formulation of SPPT is that one can treat uncertainty from
each parametrisation scheme in the same way. In contrast, the SPPTi scheme assumes that
the different parametrisation schemes have independent errors, but that multiplicative noise
is still a good approach. There has been no research, to my knowledge, which addresses the
interaction of uncertainties from different physics schemes, yet it is particularly important that
this is understood if a move is made away from schemes such as SPPT which represent overall
uncertainty, to using different representations for different schemes. To address this, I propose
to use a series of coarse graining experiments, which use high resolution simulations (where,
to a large extent, parametrisation schemes are not required) to evaluate how the uncertainties
from each parametrisation scheme interact, and how SPPT and/or SPPTi should be adapted
if uncertainties from different parametrisation schemes must be treated independently.
Finally, the results from the L96 system indicate that stochastic parametrisations could
be a powerful tool for improving a model’s climatology. It would be useful to perform experiments which could provide further evidence of whether stochastic parametrisation in climate
models is a direction worth pursuing. Firstly, it would be interesting to carry out a series
of experiments using the ECMWF model at longer lead times of one month. At these time
scales, initial condition uncertainty becomes less important, and the climatology of the forecast model becomes significant. The seamless prediction paradigm proposes that the skill of
the stochastic and perturbed parameter schemes at monthly time scales is linked to their skill
at weekly time scales: these experiments would test this hypothesis. Furthermore, perturbed
parameter experiments have traditionally been used in climate prediction, whereas stochastic
parametrisations have remained confined to weather forecasts. The consideration of monthly
forecasts, where validation is possible, will indicate which scheme could produce the most reliable estimates of anthropogenic climate change, which cannot otherwise be tested. I would
then move on to apply the insights gained from these experiments to seasonal forecasting. Accurate representation of model uncertainty is crucial at seasonal time scales, and the reliability
of seasonal forecasts can be tested through comparison with observations. Because of this,
seasonal forecasts provide an excellent way of testing stochastic parametrisations before they
are implemented in climate models.
The results presented in this study indicate that stochastic parametrisations are a skilful approach to representing model uncertainty in atmospheric simulators. The reliability
of weather forecasts can be improved by using stochastic parametrisation schemes, provided
these schemes are designed to represent the model error accurately, for example, by using
spatially and temporally correlated noise. Furthermore, stochastic schemes have the potential
to improve a model’s climatology; testing and development of stochastic parametrisations in
climate models should be an important focus for future research. With further development of
physically-based stochastic parametrisation schemes, we could have the potential to produce
the reliable, flow-dependent, probabilistic forecasts required by users for decision making.
Appendix A
Skill Score Significance Testing
A.1 Weather Forecasts in the Lorenz ‘96 System
In order to state with confidence that one parametrisation is better than another, it is necessary
to know how significantly different one skill score is from another. The simple Monte-Carlo
technique used here evaluates how significant the difference is between two skill scores, assuming
the null hypothesis that the two parametrisations have equal skill.
Consider the situation in which the significance of the difference between two Skill Scores (SS)
must be evaluated. Two vectors, A and B, contain the values of the skill score
evaluated for each forecast-verification pair for forecast models A and B respectively. The skill
score for the forecast model is the average of these individual scores. The vectors are each of
length n, corresponding to the number of forecast-verification pairs considered.
If the forecasts have equal skill, the elements of A and B are interchangeable, and any
apparent difference in skill of forecast system A over B is due to chance. Therefore, the
elements of A and B were pairwise shuffled 4000 times, and the skill of the shuffled vector
forecasts calculated. The difference in skill score, D = SS(A) − SS(B) is calculated, and
the probability of occurrence of the measured D assuming the null hypothesis is evaluated.
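A minimal sketch of this shuffling procedure is given below (Python; illustrative, not the verification code used in this study):

import numpy as np

def pairwise_shuffle_test(a, b, n_shuffles=4000, seed=0):
    """Proportion of shuffled skill differences D = SS(A) - SS(B) smaller than
    the measured difference, under the null hypothesis of equal skill.
    a, b: per-pair skill scores for forecast models A and B."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    d_meas = a.mean() - b.mean()
    smaller = 0
    for _ in range(n_shuffles):
        swap = rng.random(a.size) < 0.5          # swap each pair with probability 1/2
        d = np.where(swap, b, a).mean() - np.where(swap, a, b).mean()
        smaller += d < d_meas
    return smaller / n_shuffles

The returned proportion is then compared with the 0.05 and 0.95 thresholds described below.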
The significance of the difference between the skill of the forecast models tested in the L96
system is calculated following the method above. The details of the different models, and the
skill of the models at short term forecasts in the L96 system are presented in Chapter 2.
The following tables contain the proportion of shuffled forecast skill scores with a smaller
difference, D, than that measured (to two decimal places). For the RPSS and IGNSS, if the
proportion is greater than 0.95, and Dmeas is positive, forecast A is considered to be significantly
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.1: Significance values for improvement of [column heading] parametrisation (C) over
[row heading] parametrisation (R), calculated for the RPSS for the c = 4 case. Blue indicates
C is significantly more skilful than R at the 95% significance level, while red indicates R is
significantly more skilful than C.
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.2: As for Table A.1, except the significance of difference in RPSS for the c = 10 case.
more skilful than forecast B (at the 95% level), and is coloured blue in Tables A.1–A.4. If the
proportion is less than 0.05, and Dmeas is negative, forecast A is considered to be significantly
worse than forecast B, and is coloured red.
For REL a smaller value indicates improved reliability. Therefore if the proportion is less
than 0.05, and Dmeas is negative, forecast A is considered to be significantly more skilful than
forecast B (at the 95% level), and is coloured blue in Tables A.5–A.6. If the proportion is
greater than 0.95, and Dmeas is positive, forecast A is considered to be significantly worse than
forecast B, and is coloured red.
A.2 Simulated Climate in the Lorenz ‘96 System
The significance of the difference between two climatological pdfs is calculated using a similar
Monte-Carlo technique. The details of the forecast models are presented in Chapter 2, and
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.3: As for Table A.1, except the significance of difference in IGNSS for the c = 4 case.
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.4: As for Table A.1, except the significance of difference in IGNSS for the c = 10 case.
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.5: Significance values for improvement of [column heading] parametrisation (C) over
[row heading] parametrisation (R), calculated for REL for the c = 4 case. Blue indicates
C is significantly more skilful than R at the 95% significance level, while red indicates R is
significantly more skilful than C.
[Table: matrix of pairwise significance values for the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.6: As for Table A.5, except the significance of difference in REL for the c = 10 case.
the skill of the different models at simulating the climate of the L96 system is presented in
Chapter 3.
The significance of the difference between the two climatological vectors X_A and X_B must
be evaluated. Each vector samples 10,000 MTU with a resolution of 0.05 MTU. Firstly, each
vector is divided into sections 50 MTU long. These sections are pairwise shuffled to create two
new climatological vectors, X_P and X_Q. The Hellinger distance between each shuffled vector
and the true climatological vector is calculated following (3.2) to give D_hell(X_P) and D_hell(X_Q)
respectively. The difference, D = D_hell(X_P) − D_hell(X_Q), is calculated. This is repeated 2000
times, and the distribution of D compared to the improvement in Hellinger distance between
the original X_A and X_B, D_tru = D_hell(X_A) − D_hell(X_B), where each Hellinger distance is
calculated by comparing to the true climatological distribution. The proportion of D smaller
than D_tru is calculated and shown in Tables A.7 and A.8.
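A sketch of this block-shuffling test follows (Python; it assumes the standard discrete form of the Hellinger distance over uniform histogram bins for (3.2), with the truth climatology supplied as binned probabilities):

import numpy as np

def hellinger(x, truth_pdf, bins):
    # Hellinger distance between the empirical pdf of x and a binned truth pdf
    p, _ = np.histogram(x, bins=bins)
    p = p / p.sum()
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(truth_pdf))**2))

def climate_shuffle_test(x_a, x_b, truth_pdf, bins, block=1000, n_shuffles=2000, seed=0):
    """Proportion of block-shuffled differences D = Dhell(XP) - Dhell(XQ)
    smaller than D_tru; block=1000 points is one 50 MTU section at 0.05 MTU."""
    rng = np.random.default_rng(seed)
    d_tru = hellinger(x_a, truth_pdf, bins) - hellinger(x_b, truth_pdf, bins)
    a = np.asarray(x_a).reshape(-1, block)       # split into 50 MTU sections
    b = np.asarray(x_b).reshape(-1, block)
    smaller = 0
    for _ in range(n_shuffles):
        swap = rng.random(a.shape[0]) < 0.5      # pairwise shuffle whole sections
        xp = np.where(swap[:, None], b, a).ravel()
        xq = np.where(swap[:, None], a, b).ravel()
        smaller += (hellinger(xp, truth_pdf, bins)
                    - hellinger(xq, truth_pdf, bins)) < d_tru
    return smaller / n_shuffles

Shuffling whole 50 MTU sections, rather than individual points, preserves the temporal correlation of each climatological time series.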
The smaller the value of D_hell, the better the simulation of the true climatology. Therefore,
if D_tru is negative, A has a better representation of the true climatology than B. In this case,
if less than 5% of the distribution of D is smaller (more negative) than D_tru, the climate of A
is significantly better than that of B at the 95% significance level, and the entry is coloured blue
in Tables A.7 and A.8. Conversely, if D_tru is positive and the proportion of D smaller than
D_tru is 0.95 or greater, the climate of A is significantly worse than the climate of B, and the
proportion is coloured red in Tables A.7 and A.8.
[Table: matrix of pairwise significance values for the climatologies of the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.7: Significance values for improvement of the climatology of [column heading] parametrisation over [row heading] parametrisation, calculated for the Hellinger Distance for the c
= 4 case. “0” indicates that R is better than C with a likelihood of less than 1/2000, while “1”
indicates a likelihood of greater than 1999/2000. Blue indicates C is significantly more skilful
than R at the 95% significance level, while red indicates R is significantly more skilful than C.
[Table: matrix of pairwise significance values for the climatologies of the Det., W A, AR1 A, W SD, AR1 SD, W M, AR1 M, W MA, AR1 MA and PP schemes.]
Table A.8: As for Table A.7, except for the c = 10 case.
A.3 Skill Scores for the IFS
It is necessary to calculate the significance of the difference between the skill of the different
IFS model versions to establish if a significant improvement has been made. The technique
described in Section A.1 is used. As for the L96 system, the score vectors will be pairwise
shuffled. This is because the difference between the skill of forecast system A under predictable flow conditions and under unpredictable flow conditions is likely to be greater than the
difference between forecast system A and B for the same conditions. It is therefore important
that each shuffled vector contains the same ratio of predictable to unpredictable cases as for
the un-shuffled cases.
When considering forecasts made using the IFS, it is important to consider spatial correlations as well as temporal correlation. A time series of the thirty initial conditions was
constructed, and the spatial correlation estimated as a function of horizontal displacement
for the tropical region with significant convection. Figure A.1 shows the correlation for each
variable of interest. The time series show significant spatial correlation out to large horizontal
separations. For synoptic scales of 1,500–2,000 km, the correlations are less than 0.5 for all variables, except Z500, which varies on larger spatial scales. This corresponds to a longitudinal
separation of approximately 15° at the equator. Therefore, to preserve the spatial correlation
in the dataset to a large degree, the skill scores for each forecast will be split into blocks
15° × 15° in size, which will then be treated independently. The tropical region with significant
convection is split into sixteen blocks, for forecasts starting from 30 initial dates. There are
therefore 480 blocks of scores in total which are pairwise shuffled using the method described
in Section A.1.
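One way of forming the blocked scores is sketched below (Python; the helper is hypothetical and assumes each block's score is the mean of the grid-point scores falling inside it, with the resulting block scores then passed to the pairwise-shuffling routine of Section A.1):

import numpy as np

def block_mean_scores(scores, lons, lats, block_deg=15.0):
    """Average per-gridpoint scores into block_deg x block_deg blocks so that
    spatially correlated points stay together when blocks are shuffled."""
    ix = np.floor(np.asarray(lons) / block_deg).astype(int)
    iy = np.floor(np.asarray(lats) / block_deg).astype(int)
    blocks = {}
    for i, j, score in zip(ix, iy, scores):
        blocks.setdefault((i, j), []).append(score)
    return np.array([np.mean(v) for v in blocks.values()])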
A.3.1 Experiments in the IFS
The significance of the difference between the RPS is calculated for experiments using the
IFS. The results for IGN and ES are similar, so are not shown for brevity. For each figure,
each row corresponds to a different variable. Each column corresponds to a different
experiment of interest. The figure shows the likelihood that the experiment of interest had
a significantly higher (poorer) score than each of the other four experiments. A likelihood
of greater than 0.95 indicates the experiment of interest is significantly worse than the other
experiment. A likelihood of less than 0.05 indicates the experiment of interest is significantly
[Figure: correlation plotted against spatial separation from 0 to 4000 km.]
Figure A.1: Spatial correlation between initial condition time series for T850 (red), U850
(green), U200 (magenta), Z500 (black), TCWV (blue) and PPT (cyan). The correlation is
calculated between the initial condition time series at a given spatial separation, for all points
within the tropical region of significant convection defined in Figure 5.1(b), as a function of
spatial separation.
better than the other experiment. The other experiments (to which we are comparing) are
distinguished by the colour of the symbols.
Figure A.2 shows the results for the stochastic and perturbed parameter experiments from
Chapter 5: TSCZ, TSCS, TSCP, TSCPr, and TSCPv.
Figure A.3 shows the results for the independent SPPT experiments from Chapter 6: TSCS,
TSCS + SKEB, TSCSi and TSCSi + SKEB.
Figure A.4 shows the results for the individually independent SPPT experiments from
Chapter 6: TSCS, TSCSi, RDDTi, TGWDi, NOGWi, CONVi, and LSWPi.
Figure A.5 shows the results for the T639 independent SPPT experiments from Chapter 6:
TSCS, TSCS + SKEB, and TSCSi.
For example: either Figure A.2 (k) or (l) can be used to ascertain the significance of the
difference between the RPS for Z500 for the TSCZ and TSCS experiments of Chapter 5. In
figure (k), the line with blue crosses indicates that TSCZ was significantly less skilful than
TSCS at short lead times, that there was no significant difference at a lead time of 120 hrs,
and that at 240 hrs, TSCZ was significantly more skilful. In figure (l), the line with black
crosses gives the same information.
[Figure: significance plotted against lead time in hours; one row per variable, one column per experiment of interest.]
Figure A.2: Significance of difference between RPS for each pair of experiments in Chapter 5:
TSCZ (black), TSCS (blue), TSCP (red), TSCPr (magenta) and TSCPv (green). Each column
corresponds to a different experiment of interest, and the four plotted lines are the significance
of the difference between that experiment and the other four, indicated by colour of markers.
A value of less than 0.05 indicates the experiment of interest significantly improves on the
other experiment, and a value of greater than 0.95 indicates it is significantly poorer.
[Figure: significance plotted against lead time in hours; one row per variable, one column per experiment of interest.]
Figure A.3: As for Figure A.2, except for each pair of experiments in Chapter 6: TSCS (blue),
TSCS + SKEB (cyan), TSCSi (red), and TSCSi + SKEB (magenta).
[Figure: significance plotted against lead time in hours; one row per variable, one column per experiment of interest.]
Figure A.4: As for Figure A.2, except for each pair of ‘individually independent’ experiments in Chapter 6: TSCS (blue), TSCSi (red), RDDTi
(black), TGWDi (grey), NOGWi (yellow), CONVi (green), and LSWPi (magenta).
[Figure: significance plotted against lead time in hours; one row per variable, one column per experiment of interest.]
Figure A.5: As for Figure A.2, except for each pair of T639 experiments in Chapter 6: TSCS
(blue), TSCS + SKEB (cyan), and TSCSi (red).
Appendix B
The Error-spread Score: A Proper Score
B.1 Derivation of the Form of the Error-spread Score
The starting point when deriving the Error-spread Score is the spread-error relationship: the expected squared error of the ensemble mean can be related to the expected ensemble variance according to (1.26) by assuming the ensemble members and the truth are independently identically distributed random variables with variance σ^2 (Leutbecher, 2010). Consider the trial Error-spread score,

ES_trial = (s^2 − e^2)^2.    (B.1)

Expanding out the brackets, and expressing the error, e, in terms of the forecast ensemble mean, m, and the verification, z:

ES_trial = s^4 − 2s^2(m − z)^2 + (m − z)^4
         = (s^4 − 2s^2 m^2 + m^4) + z(4s^2 m − 4m^3) + z^2(6m^2 − 2s^2) − 4mz^3 + z^4    (B.2)
The expected value of the score can be calculated by assuming the verification follows the truth distribution:

E[ES_trial] = (s^4 − 2s^2 m^2 + m^4) + E[z](4s^2 m − 4m^3) + E[z^2](6m^2 − 2s^2) − 4m E[z^3] + E[z^4]    (B.3)
The stationary points of the score are calculated by differentiating with respect to the forecast
moments:
F := d E[ES_trial]/ds = 4s(s^2 − m^2) + 8sm E[z] − 4s E[z^2]    (B.4)

G := d E[ES_trial]/dm = −4m(s^2 − m^2) + (4s^2 − 12m^2) E[z] + 12m E[z^2] − 4 E[z^3]    (B.5)
Substituting the true moments, E[z] = µ, E[z^2] = σ^2 + µ^2, and E[z^3] = γσ^3 + 3µσ^2 + µ^3:

F = 4s(s^2 − σ^2 − (m − µ)^2)    (B.6)
G = 4(m − µ)^3 + 4(m − µ)(3σ^2 − s^2) − 4γσ^3    (B.7)

Setting F = 0 gives

s^2 = σ^2 + (m − µ)^2.    (B.8)

Setting G = 0, and substituting (B.8), gives
4γσ^3 = 4(m − µ)^3 + 4(m − µ)(3σ^2 − s^2) = 8σ^2(m − µ)
∴ m = µ + γσ/2.    (B.9)
Substituting (B.9) into (B.8) gives

s^2 = σ^2 + (µ + γσ/2 − µ)^2 = σ^2 (1 + γ^2/4).    (B.10)
Therefore, the trial Error-spread Score is not optimised if the mean and standard deviation of the true distribution are forecast. Instead of issuing his or her true belief, (m, s), the forecaster should predict a distribution with mean m_hedged = m + gs/2 and inflated standard deviation, s^2_hedged = s^2 (1 + g^2/4), in order to optimise their expected score.

To prevent a forecaster from hedging the forecast in this way, the substitutions m → m + gs/2 and s^2 → s^2(1 + g^2/4) can be made in the trial Error-spread score:

ES_trial := (s^2 − (m − z)^2)^2  →  ES := ( s^2(1 + g^2/4) − (m + gs/2 − z)^2 )^2,    (B.11)

which simplifies to

ES = (s^2 − e^2 − esg)^2.    (B.12)
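The impropriety of the trial score, and the hedged forecast derived above, can be checked numerically. A minimal sketch (Python), using a standardised gamma distribution as a skewed truth distribution:

import numpy as np

rng = np.random.default_rng(2)
k = 4.0                                              # gamma shape; skewness = 2/sqrt(k) = 1
z = (rng.gamma(k, 1.0, 1_000_000) - k) / np.sqrt(k)  # truth draws: mean 0, std 1, skew 1
mu, sigma, gamma = 0.0, 1.0, 2.0 / np.sqrt(k)

def expected_trial_score(m, s):
    # Monte-Carlo estimate of E[ES_trial] for a forecast with mean m, spread s
    return np.mean((s**2 - (m - z)**2)**2)

honest = expected_trial_score(mu, sigma)
hedged = expected_trial_score(mu + gamma * sigma / 2,
                              sigma * np.sqrt(1 + gamma**2 / 4))
print(honest, hedged)   # the hedged forecast attains the lower (better) expected score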
B.2 Confirmation of Propriety of the Error-spread Score
It is important to confirm that the Error-spread score is proper. Firstly, expand out the
brackets:
ES = (s^2 − (m − z)^2 − (m − z)sg)^2    (B.13)
   = (s^2 − m^2 + 2zm − z^2 − msg + zsg)^2
   = (s^4 − 2m^2 s^2 − 2ms^3 g + m^2 s^2 g^2 + 2m^3 sg + m^4)
     + z(4ms^2 + 2s^3 g − 6m^2 sg − 4m^3 − 2ms^2 g^2)
     + z^2(6m^2 + 6msg − 2s^2 + s^2 g^2)
     + z^3(−4m − 2sg) + z^4    (B.14)
Calculate the expectation of the score assuming the verification, z, follows the truth distribution (equations 4.2–4.4).
E[ES] = (s^4 − 2m^2 s^2 − 2ms^3 g + m^2 s^2 g^2 + 2m^3 sg + m^4)
      + E[z](4ms^2 + 2s^3 g − 6m^2 sg − 4m^3 − 2ms^2 g^2)
      + E[z^2](6m^2 + 6msg − 2s^2 + s^2 g^2)
      + E[z^3](−4m − 2sg) + E[z^4]    (B.15)
But

E[z] = µ,    (B.16)
E[z^2] = σ^2 + µ^2,    (B.17)
E[z^3] = σ^3 γ + 3µσ^2 + µ^3,    (B.18)
E[z^4] = σ^4 β + 4µσ^3 γ + 6µ^2 σ^2 + µ^4.    (B.19)
Therefore

E[ES] = (s^4 − 2m^2 s^2 − 2ms^3 g + m^2 s^2 g^2 + 2m^3 sg + m^4)
      + µ(4ms^2 + 2s^3 g − 6m^2 sg − 4m^3 − 2ms^2 g^2)
      + (σ^2 + µ^2)(6m^2 + 6msg − 2s^2 + s^2 g^2)
      + (σ^3 γ + 3µσ^2 + µ^3)(−4m − 2sg)
      + σ^4 β + 4µσ^3 γ + 6µ^2 σ^2 + µ^4.    (B.20)
Expanding and re-factorising, it can be shown that

E[ES] = ((σ^2 − s^2) + (µ − m)^2 − sg(µ − m))^2 + σ^2 (2(µ − m) + (σγ − sg))^2 + σ^4 (β − γ^2 − 1).    (B.21)
In order to be proper, the expected value of the scoring rule must be minimised when the
“truth” distribution is forecast. Let us test this here.
Differentiating with respect to m:

d E[ES]/dm = 2((σ^2 − s^2) + (µ − m)^2 − sg(µ − m))(sg − 2(µ − m)) − 4σ^2 (2(µ − m) + (σγ − sg)) = 0    (B.22)

at the optimum.
Differentiating with respect to s:

d E[ES]/ds = 2((σ^2 − s^2) + (µ − m)^2 − sg(µ − m))(−2s − g(µ − m)) − 2σ^2 g (2(µ − m) + (σγ − sg)) = 0    (B.23)

at the optimum.
Differentiating with respect to g:

d E[ES]/dg = −2s(µ − m)((σ^2 − s^2) + (µ − m)^2 − sg(µ − m)) − 2σ^2 s (2(µ − m) + (σγ − sg)) = 0    (B.24)

at the optimum.

Since d E[ES]/dv = 0 for v = m, s, g, the “truth” distribution corresponds to a stationary point of the score. The Hessian of the score at this point is given by

H = 2σ^2 \begin{pmatrix} γ^2 + 4 & 0 & 2σ \\ 0 & γ^2 + 4 & σγ \\ 2σ & σγ & σ^2 \end{pmatrix},
which has three eigenvalues ≥ 0. This stationary point is a minimum as required.
Additionally, a score, S, is proper if, for any two probability densities P(x) and Q(x) (Bröcker, 2009),

∫ S[P(x), z] Q(z) dz ≥ ∫ S[Q(x), z] Q(z) dz,    (B.25)
where the integral is over the possible verifications, z. This criterion can be tested for the
Error-spread score. The term on the left of (B.25) is the expectation of ES calculated earlier
if we identify P(x) with the issued forecast and Q(x) with the “truth” distribution.
∫ S[P(x), z] Q(z) dz = ((σ^2 − s^2) + (µ − m)^2 − sg(µ − m))^2 + σ^2 (2(µ − m) + (σγ − sg))^2 + σ^4 (β − γ^2 − 1)    (B.26)
Similarly,
∫ S[Q(x), z] Q(z) dz = σ^4 (β − γ^2 − 1)    (B.27)
Therefore,
∫ S[P(x), z] Q(z) dz − ∫ S[Q(x), z] Q(z) dz
   = ((σ^2 − s^2) + (µ − m)^2 − sg(µ − m))^2 + σ^2 (2(µ − m) + (σγ − sg))^2
   ≥ 0  ∀ m, s and g.    (B.28)
The Error-spread score is a proper score.
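The expansion (B.21) can also be verified symbolically. A minimal sketch using sympy (an independent check, not part of the derivation above):

import sympy as sp

m, s, g, mu, sigma, gam, beta, z = sp.symbols('m s g mu sigma gamma beta z', real=True)

# Raw moments of the verification implied by (B.16)-(B.19)
moment = {0: sp.Integer(1), 1: mu, 2: sigma**2 + mu**2,
          3: gam*sigma**3 + 3*mu*sigma**2 + mu**3,
          4: beta*sigma**4 + 4*mu*gam*sigma**3 + 6*mu**2*sigma**2 + mu**4}

# Expand ES as a polynomial in z and replace each z^k by E[z^k]
es = sp.expand((s**2 - (m - z)**2 - (m - z)*s*g)**2)
expected = sum(c * moment[k] for (k,), c in sp.Poly(es, z).terms())

# Closed form (B.21)
closed = ((sigma**2 - s**2) + (mu - m)**2 - s*g*(mu - m))**2 \
         + sigma**2*(2*(mu - m) + (sigma*gam - s*g))**2 \
         + sigma**4*(beta - gam**2 - 1)

print(sp.simplify(expected - closed))   # prints 0, confirming (B.21)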
B.3 Decomposition of the Error-spread Score
In this decomposition, the error between the ensemble mean and the verification, (m − z), is treated as if it were the verification itself. Consider transforming the forecast so that it is centred on the origin, and applying the same transformation to the verification. The statistics of the measured error are unchanged, and the error has an expected value of zero if the forecast is unbiased, i.e. if the mean of the forecast and the mean of the truth distribution are identical. This assumption is used in the following: the error can then be treated as the verification, since its value depends only on the measured true state of the atmosphere, the forecast ensemble mean being assumed perfectly accurate.
Assume that the predicted spread, $s_k$, can only take I discrete values $s_i$, where $i = 1, \ldots, I$. Assume the predicted skewness, $g_k$, can only take J discrete values $g_j$, where $j = 1, \ldots, J$. Bin the measured errors, $e_k$, according to the predicted spread, $s_i$, and the predicted skewness, $g_j$. Defining
\[
n = \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}, \tag{B.29}
\]
\[
\overline{e^2}_{i,j} = \frac{1}{N_{i,j}} \sum_{k \in N_{i,j}} e_k^2 \approx \mathrm{E}_{k \in N_{i,j}}[e_k^2], \tag{B.30}
\]
and
\[
\overline{e^2} = \frac{1}{n} \sum_{k=1}^{n} e_k^2 = \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\, \overline{e^2}_{i,j} \approx \mathrm{E}[e_k^2], \tag{B.31}
\]
where n is the total number of forecast–verification pairs and $N_{i,j}$ is the number of forecast–verification pairs in bin (i, j). $\overline{e^2}_{i,j}$ is the average squared error in each bin and $\overline{e^2}$ is the climatological (sample-wide) squared error; both are sample estimates of the expected values of these errors. The binning is conditioned on the forecast spread and skewness only because, for the reasons given above, the forecast mean is unimportant if the forecast is unbiased.
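The binning step itself is straightforward. The sketch below (hypothetical data; quantile-based bin edges are one reasonable choice, not prescribed here) groups the errors by predicted spread and skewness and checks the definitions (B.29)–(B.31):

    import numpy as np

    rng = np.random.default_rng(3)
    n = 10_000
    s_k = rng.uniform(0.5, 2.0, n)             # predicted spreads
    g_k = rng.uniform(-1.0, 1.0, n)            # predicted skewnesses
    e_k = rng.normal(0.0, s_k)                 # errors, here consistent with s_k

    I, J = 5, 4                                # number of spread / skewness bins
    i_idx = np.digitize(s_k, np.quantile(s_k, np.linspace(0, 1, I + 1))[1:-1])
    j_idx = np.digitize(g_k, np.quantile(g_k, np.linspace(0, 1, J + 1))[1:-1])

    N = np.zeros((I, J))                       # N_{i,j}, Eq. B.29
    e2_bin = np.zeros((I, J))                  # bin-mean squared error, Eq. B.30
    for i in range(I):
        for j in range(J):
            sel = (i_idx == i) & (j_idx == j)
            N[i, j] = sel.sum()
            e2_bin[i, j] = np.mean(e_k[sel] ** 2)
    print(N.sum() == n, (N * e2_bin).sum() / n, np.mean(e_k**2))  # Eq. B.31

Quantile edges give roughly equal occupancy per bin; fixed-width edges would serve equally well.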
The Error-spread score can be rewritten as a sum over the IJ bins as
\[
\begin{aligned}
\mathrm{ES} &= \frac{1}{n} \sum_{k=1}^{n} \left(s_k^2 - e_k^2 - e_k s_k g_k\right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} \left(s_i^2 - e_k^2 - e_k s_i g_j\right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} \Big[\, \underbrace{(s_i^2 - e_k^2)^2}_{A} + \underbrace{\left(-2 e_k s_i g_j (s_i^2 - e_k^2) + e_k^2 s_i^2 g_j^2\right)}_{B} \,\Big].
\end{aligned} \tag{B.32}
\]
Consider the first term, A, which evaluates the spread of the forecast:
\[
\begin{aligned}
A &= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} \left(s_i^2 - e_k^2\right)^2 \\
&= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \Big[ N_{i,j}\,(s_i^2)^2 - 2 s_i^2 \sum_{k \in N_{i,j}} e_k^2 + \sum_{k \in N_{i,j}} (e_k^2)^2 \Big].
\end{aligned} \tag{B.33}
\]
Here it has been recognised that $s_i$ is a discrete variable, constant within a bin, and so can be moved outside the summation. Using the definitions of $\overline{e^2}_{i,j}$ and $\overline{e^2}$, the square is completed twice to give:
\[
\begin{aligned}
A ={}& \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(s_i^2 - \overline{e^2}_{i,j}\right)^2 - \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(\overline{e^2}_{i,j} - \overline{e^2}\right)^2 \\
&+ \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \Big[ N_{i,j}\,(\overline{e^2})^2 - 2 N_{i,j}\, \overline{e^2}\, \overline{e^2}_{i,j} + \sum_{k \in N_{i,j}} (e_k^2)^2 \Big].
\end{aligned} \tag{B.34}
\]
Recalling the definition of $\overline{e^2}_{i,j}$ (Eq. B.30), and noting that $\sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\,(\overline{e^2})^2 = n(\overline{e^2})^2 = \sum_{k=1}^{n} (\overline{e^2})^2$,
\[
\begin{aligned}
A ={}& \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(s_i^2 - \overline{e^2}_{i,j}\right)^2 - \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(\overline{e^2}_{i,j} - \overline{e^2}\right)^2 \\
&+ \frac{1}{n} \sum_{k=1}^{n} (\overline{e^2})^2 - \overline{e^2}\,\frac{2}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} e_k^2 + \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} (e_k^2)^2.
\end{aligned} \tag{B.35}
\]
Since these terms do not depend on how the $e_k$ are sorted into the spread and skewness bins, the multiple summations can be replaced by a single summation over k:
\[
A = \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(s_i^2 - \overline{e^2}_{i,j}\right)^2 - \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left(\overline{e^2}_{i,j} - \overline{e^2}\right)^2 + \frac{1}{n} \sum_{k=1}^{n} \left(\overline{e^2} - e_k^2\right)^2. \tag{B.36}
\]
The first term, A, has been decomposed into a sum of squared terms.
Consider the second term, B, which evaluates the shape of the forecast. This can be written in terms of the expectations of the moments of the error, $e_k$:
\[
\begin{aligned}
B &= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \sum_{k \in N_{i,j}} \left[ -2 e_k s_i g_j (s_i^2 - e_k^2) + e_k^2 s_i^2 g_j^2 \right] \\
&= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \Big[ -2 N_{i,j}\, s_i^3 g_j\, \mathrm{E}_{k \in N_{i,j}}[e_k] + 2 N_{i,j}\, s_i g_j\, \mathrm{E}_{k \in N_{i,j}}[e_k^3] + N_{i,j}\, s_i^2 g_j^2\, \mathrm{E}_{k \in N_{i,j}}[e_k^2] \Big].
\end{aligned} \tag{B.37}
\]
Assume $\mathrm{E}_{k \in N_{i,j}}[e_k] = 0$. This is equivalent to assuming there is no systematic bias in the forecast (note that in the decomposition of A, this assumption was not required). The bias of a forecasting system should be checked before this decomposition is applied, and the forecast debiased if necessary. With the assumption of no bias, and using Eq. B.47,
\[
\mathrm{E}_{k \in N_{i,j}}[e_k^3]
= \gamma \left( \mathrm{E}_{k \in N_{i,j}}[e_k^2] - \big(\mathrm{E}_{k \in N_{i,j}}[e_k]\big)^2 \right)^{3/2}
+ 3\, \mathrm{E}_{k \in N_{i,j}}[e_k]\, \mathrm{E}_{k \in N_{i,j}}[e_k^2]
- 2 \big(\mathrm{E}_{k \in N_{i,j}}[e_k]\big)^3
= \gamma \left( \mathrm{E}_{k \in N_{i,j}}[e_k^2] \right)^{3/2}, \tag{B.38}
\]
where γ is the observed ("true") skewness of the error distribution. Define the measured shape factor,
\[
G_{i,j} = \frac{1}{N_{i,j}} \sum_{k \in N_{i,j}} e_k^3 \approx \mathrm{E}_{k \in N_{i,j}}[e_k^3], \tag{B.39}
\]
which is approximately equal to the third moment of the error distribution in each bin, estimated from a finite sample. Define also the climatological shape factor,
\[
\overline{G} = \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\, G_{i,j} = \frac{1}{n} \sum_{k=1}^{n} e_k^3. \tag{B.40}
\]
From Eq. 1.26 and Eq. B.38 it can be shown that, if our forecast standard deviation, $s_i$, and skewness, $g_j$, are accurate, the measured shape factor should obey
\[
G_{i,j} = -s_i g_j\, \overline{e^2}_{i,j}, \tag{B.41}
\]
where the negative sign arises because the verification enters with a negative sign in the definition of the error, m − z, so that $g_j = -\gamma$ for an accurate forecast. B can be written in terms of the shape factor, $G_{i,j}$, as
\[
\begin{aligned}
B &= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} \Big[ N_{i,j}\, s_i^2 g_j^2\, \overline{e^2}_{i,j} + 2 N_{i,j}\, s_i g_j\, G_{i,j} \Big] \\
&= \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\, \frac{\left(s_i g_j\, \overline{e^2}_{i,j} + G_{i,j}\right)^2}{\overline{e^2}_{i,j}} - \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j}\, \frac{G_{i,j}^2}{\overline{e^2}_{i,j}}.
\end{aligned} \tag{B.42}
\]
Completing the square by adding and subtracting $N_{i,j}\big(2 G_{i,j}\, \overline{G}/\overline{e^2} - \overline{e^2}_{i,j}\, (\overline{G}/\overline{e^2})^2\big)$,
\[
\begin{aligned}
B ={}& \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \left[ \frac{\left(s_i g_j\, \overline{e^2}_{i,j} + G_{i,j}\right)^2}{\overline{e^2}_{i,j}} - \overline{e^2}_{i,j} \left( \frac{G_{i,j}}{\overline{e^2}_{i,j}} - \frac{\overline{G}}{\overline{e^2}} \right)^2 \right] \\
&+ \frac{1}{n} \sum_{k=1}^{n} \left[ e_k^2 \left(\frac{\overline{G}}{\overline{e^2}}\right)^2 - 2 e_k^3\, \frac{\overline{G}}{\overline{e^2}} \right].
\end{aligned} \tag{B.43}
\]
Combining with Eq. B.36, the ES score can be decomposed into
\[
\begin{aligned}
\mathrm{ES} ={}& \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \Bigg[\, \underbrace{\left(s_i^2 - \overline{e^2}_{i,j}\right)^2}_{a} + \underbrace{\frac{\left(s_i g_j\, \overline{e^2}_{i,j} + G_{i,j}\right)^2}{\overline{e^2}_{i,j}}}_{b} \,\Bigg] \\
&- \frac{1}{n} \sum_{i=1}^{I}\sum_{j=1}^{J} N_{i,j} \Bigg[\, \underbrace{\left(\overline{e^2}_{i,j} - \overline{e^2}\right)^2}_{c} + \underbrace{\overline{e^2}_{i,j} \left( \frac{G_{i,j}}{\overline{e^2}_{i,j}} - \frac{\overline{G}}{\overline{e^2}} \right)^2}_{d} \,\Bigg] \\
&+ \underbrace{\frac{1}{n} \sum_{k=1}^{n} \left[ \left(\overline{e^2} - e_k^2\right)^2 + e_k^2 \left(\frac{\overline{G}}{\overline{e^2}}\right)^2 - 2 e_k^3\, \frac{\overline{G}}{\overline{e^2}} \right]}_{e}.
\end{aligned} \tag{B.44}
\]
Term a tests the reliability of the ensemble spread, and term b the reliability of the ensemble shape. Term c tests the resolution of the predicted spread, and term d the resolution of the predicted shape. The last term, e, is the uncertainty in the forecast, which is not a function of the binning process.
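As a consistency check, the five terms can be computed from synthetic data and recombined. In the sketch below (discrete forecast values chosen arbitrarily), the small residual between the two printed numbers comes from the finite-sample bin-mean errors neglected after (B.37):

    import numpy as np

    rng = np.random.default_rng(4)
    s_vals = np.array([0.8, 1.0, 1.5])         # the I discrete spreads s_i
    g_vals = np.array([-0.5, 0.0, 0.5])        # the J discrete skewnesses g_j
    n = 60_000
    i_idx = rng.integers(0, s_vals.size, n)
    j_idx = rng.integers(0, g_vals.size, n)
    s_k, g_k = s_vals[i_idx], g_vals[j_idx]
    e_k = rng.normal(0.0, 1.0, n)              # unbiased errors

    es = np.mean((s_k**2 - e_k**2 - e_k * s_k * g_k) ** 2)

    e2bar, Gbar = np.mean(e_k**2), np.mean(e_k**3)
    a = b = c = d = 0.0
    for i, si in enumerate(s_vals):
        for j, gj in enumerate(g_vals):
            sel = (i_idx == i) & (j_idx == j)
            Nij = sel.sum()
            e2ij, Gij = np.mean(e_k[sel]**2), np.mean(e_k[sel]**3)
            a += Nij * (si**2 - e2ij) ** 2 / n                      # term a
            b += Nij * (si * gj * e2ij + Gij) ** 2 / e2ij / n       # term b
            c += Nij * (e2ij - e2bar) ** 2 / n                      # term c
            d += Nij * e2ij * (Gij / e2ij - Gbar / e2bar) ** 2 / n  # term d
    e_term = np.mean((e2bar - e_k**2) ** 2                          # term e
                     + e_k**2 * (Gbar / e2bar) ** 2
                     - 2 * e_k**3 * (Gbar / e2bar))
    print(es, a + b - (c + d) + e_term)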
B.4 Mathematical Properties of Moments
For a random variable X which is drawn from a probability distribution, P(X), the moments of the distribution are defined as follows. The mean, μ:
\[
\mu = \mathrm{E}[X]. \tag{B.45}
\]
The variance, $\sigma^2$:
\[
\sigma^2 = \mathrm{E}[(X - \mu)^2] = \mathrm{E}[X^2] - \mu^2. \tag{B.46}
\]
The skewness, γ:
\[
\gamma = \mathrm{E}\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{1}{\sigma^3}\left(\mathrm{E}[X^3] - 3\mu\sigma^2 - \mu^3\right). \tag{B.47}
\]
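The identity in Eq. B.47 is easy to confirm numerically, for example with a synthetic gamma sample:

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.gamma(3.0, 1.0, size=1_000_000)
    mu = x.mean()
    var = np.mean(x**2) - mu**2                                  # Eq. B.46
    by_definition = np.mean(((x - mu) / np.sqrt(var)) ** 3)
    by_identity = (np.mean(x**3) - 3 * mu * var - mu**3) / var**1.5
    print(by_definition, by_identity)                            # identical expressions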
Bibliography
E. Anderson and A. Persson. User guide to ECMWF forecast products. ECMWF, Shinfield
Park, Reading, RG2 9AX, U.K., 1.1 edition, July 2013.
J. L. Anderson. The impact of dynamical constraints on the selection of initial conditions
for ensemble predictions: Low-order perfect model results. Mon. Weather Rev., 125(11):
2969–2983, 1997.
A. Arakawa and W. H. Schubert. Interaction of a cumulus cloud ensemble with the large scale
environment, part I. J. Atmos. Sci., 31(3):674–701, 1974.
H. M. Arnold, I. M. Moroz, and T. N. Palmer. Stochastic parameterizations and model uncertainty in the Lorenz’96 system. Phil. Trans. R. Soc. A, 371(1991), 2013.
J. V. Beck and K. J. Arnold. Parameter estimation in engineering and science. Wiley, New
York, USA, 1977.
L. Bengtsson, M. Steinheimer, P. Bechtold, and J.-F. Geleyn. A stochastic parametrization
for deep convection using cellular automata. Q. J. Roy. Meteor. Soc., 139(675):1533–1543,
2013.
J. Berner, G. J. Shutts, M. Leutbecher, and T. N. Palmer. A spectral stochastic kinetic energy
backscatter scheme and its impact on flow dependent predictability in the ECMWF ensemble
prediction system. J. Atmos. Sci., 66(3):603–626, 2009.
J. Berner, T. Jung, and T. N. Palmer. Systematic model error: The impact of increased
horizontal resolution versus improved stochastic and deterministic parameterizations. J.
Climate, 25(14):4946–4962, 2012.
N. E. Bowler, A. Arribas, K. R. Mylne, K. B. Robertson, and S. E. Beare. The MOGREPS
short-range ensemble prediction system. Q. J. Roy. Meteor. Soc., 134(632):703–722, 2008.
G. W. Brier. Verification of forecasts expressed in terms of probability. Mon. Weather Rev.,
78(1):1–3, 1950.
J. Bröcker. Reliability, sufficiency, and the decomposition of proper scores. Q. J. Roy. Meteor.
Soc., 135(643):1512–1519, 2009.
J. Bröcker, D. Engster, and U. Parlitz. Probabilistic evaluation of time series models: A
comparison of several approaches. Chaos, 19(4), 2009.
T. A. Brown. Probabilistic forecasts and reproducing scoring systems. Technical report, RAND
Corporation, Santa Monica, California, June 1970.
R. A. Bryson. The paradigm of climatology: An essay. B. Am. Meteorol. Soc., 78(3):449–455,
1997.
R. Buizza. Horizontal resolution impact on short- and long-range forecast error. Q. J. Roy.
Meteor. Soc., 136(649):1020–1035, 2010.
R. Buizza and T. N. Palmer. The singular-vector structure of the atmospheric global circulation. J. Atmos. Sci., 52(9):1434–1456, 1995.
R. Buizza, M. Miller, and T. N. Palmer. Stochastic representation of model uncertainties in the
ECMWF ensemble prediction system. Q. J. Roy. Meteor. Soc., 125(560):2887–2908, 1999.
R. Buizza, M. Leutbecher, and L. Isaksen. Potential use of an ensemble of analyses in the
ECMWF ensemble prediction system. Q. J. Roy. Meteor. Soc., 134(637):2051–2066, 2008.
B. G. Cohen and G. C. Craig. Fluctuations in an equilibrium convective ensemble. part II:
Numerical experiments. J. Atmos. Sci., 63(8):2005–2015, 2006.
S. Corti, F. Molteni, and T. N. Palmer. Signature of recent climate change in frequencies of
natural atmospheric circulation regimes. Nature, 398(6730):799–802, 1999.
G. C. Craig and B. G. Cohen. Fluctuations in an equilibrium convective ensemble. part I:
Theoretical formulation. J. Atmos. Sci., 63(8):1996–2004, 2006.
D. Crommelin and E. Vanden-Eijnden. Subgrid-scale parametrisation with conditional Markov
chains. J. Atmos. Sci., 65(8):2661–2675, 2008.
A. Dawson, T. N. Palmer, and S. Corti. Simulating regime structures in weather and climate
prediction models. Geophys. Res. Let., 39(21):L21805, 2012.
T. DelSole. Predictability and information theory. part I: Measures of predictability. J. Atmos.
Sci., 61(20):2425–2440, 2004.
F. J. Doblas-Reyes, A. Weisheimer, N. Keenlyside, M. McVean, J. M. Murphy, P. Rogel,
D. Smith, and T. N. Palmer. Addressing model uncertainty in seasonal and annual dynamical
ensemble forecasts. Q. J. Roy. Meteor. Soc., 135(644):1538–1559, 2009.
O. Donati, G. F. Missiroli, and G. Pozzi. An experiment on electron interference. Am. J.
Phys., 41(5):639–644, 1973.
J. Dorrestijn, D. T. Crommelin, A. P. Siebesma, and H. J. J. Jonker. Stochastic parameterization of shallow cumulus convection estimated from high-resolution model data. Theor.
Comp. Fluid Dyn., 27(1–2):133–148, 2012.
J. Dorrestijn, D. T. Crommelin, J. A. Biello, and S. J. Böing. A data-driven multi-cloud model
for stochastic parametrization of deep convection. Phil. Trans. R. Soc. A, 371(1991), 2013.
ECMWF. IFS documentation CY37r2. ECMWF, Shinfield Park, Reading, RG2 9AX, U.K.,
2012. http://www.ecmwf.int/research/ifsdocs/CY37r2/.
M. Ehrendorfer. Predicting the uncertainty of numerical weather forecasts: a review. Meteorol.
Z., 6(4):147–183, 1997.
T. H. A. Frame, J. Methven, S. L. Gray, and M. H. P. Ambaum. Flow-dependent predictability
of the North-Atlantic jet. Geophys. Res. Let., 40(10):2411–2416, 2013.
Y. Frenkel, A. J. Majda, and B. Khouider. Using the stochastic multicloud model to improve
tropical convective parametrisation: A paradigm example. J. Atmos. Sci., 69(3):1080–1105,
2012.
N. Gershenfeld, B. Schoner, and E. Metois. Cluster-weighted modelling for time-series analysis.
Nature, 397(6717):329–332, 1999.
T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. J.
Am. Stat. Assoc., 102(477):359–378, 2007.
M. Goldstein and D. Wooff. Bayes Linear Statistics, Theory and Methods. Wiley, Chichester,
UK, 2007.
W. M. Gray and R. W. Jacobson Jr. Diurnal variation of deep cumulus convection. Mon.
Weather Rev., 105(9):1171–1188, 1977.
E. Halley. An historical account of the trade winds, and monsoons, observable in the seas
between and near the tropicks, with an attempt to assign the phisical cause of the said
winds. Phil. Trans., 16(183):153–168, 1686.
J. A. Hansen and C. Penland. Efficient approximation techniques for integrating stochastic
differential equations. Mon. Weather Rev., 134(10):3006–3014, 2006.
J. A. Hansen and C. Penland. On stochastic parameter estimation using data assimilation.
Physica D, 230(1–2):88–98, 2007.
K. Hasselmann. Climate change — linear and nonlinear signatures. Nature, 398(6730):755–756,
1999.
L. Hermanson. Stochastic Physics: A Comparative Study of Parametrized Temperature Tendencies in a Global Atmospheric Model. PhD thesis, University of Reading, 2006.
H. Hersbach. Decomposition of the continuous ranked probability score for ensemble prediction
systems. Weather Forecast., 15(6):559–570, 2000.
P. Hess and H. Brezowsky. Katalog der Grosswetterlagen Europas. Berichte des Deutschen
Wetterdienstes in der US-Zone, 33:39, 1952.
P. Houtekamer, M. Charron, H. Mitchell, and G. Pellerin. Status of the global EPS at Environment Canada. In Workshop on Ensemble Prediction, 7–9 November 2007, pages 57–68,
Shinfield Park, Reading, 2007. ECMWF.
P. L. Houtekamer, L. Lefaivre, and J. Derome. A system simulation approach to ensemble
prediction. Mon. Weather Rev., 124(6):1225–1242, 1996.
G. J. Huffman, R. F. Adler, M. M. Morrissey, D. T. Bolvin, S. Curtis, R. Joyce, B. McGavock,
and J. Susskind. Global precipitation at one-degree daily resolution from multisatellite
observations. J. Hydrometeor., 2(1):36–50, 2001.
L. Isaksen, M. Bonavita, R. Buizza, M. Fisher, J. Haseler, M. Leutbecher, and L. Raynaud.
Ensemble of data assimilations at ECMWF. Technical Report 636, European Centre for
Medium-Range Weather Forecasts, Shinfield Park, Reading, 2010.
C. Jakob. Accelerating progress in global atmospheric model development through improved
parameterizations. B. Am. Meteorol. Soc., 91(7):869–875, 2010.
H. Järvinen, M. Laine, A. Solonen, and H. Haario. Ensemble prediction and parameter estimation system: the concept. Q. J. Roy. Meteor. Soc., 138(663):281–288, 2012.
P. Kållberg. Forecast drift in ERA-Interim. ERA Report Series 10, European Centre for
Medium-Range Weather Forecasts, Shinfield Park, Reading, 2011.
R. J. Keane and R. S. Plant. Large-scale length and time-scales for use with stochastic convective parametrization. Q. J. Roy. Meteor. Soc., 138(666):1150–1164, 2012.
B. Khouider and A. J. Majda. A simple multicloud parameterization for convectively coupled
tropical waves. part I: Linear analysis. J. Atmos. Sci., 63(4):1308–1323, 2006.
B. Khouider and A. J. Majda. A simple multicloud parameterization for convectively coupled
tropical waves. part II: Nonlinear simulations. J. Atmos. Sci., 64(2):381–400, 2007.
B. Khouider, A. J. Majda, and M. A. Katsoulakis. Coarse-grained stochastic models for tropical
convection and climate. P. Natl. Acad. Sci. U.S.A., 100(21):11941–11946, 2003.
B. Khouider, J. Biello, and A. J. Majda. A stochastic multicloud model for tropical convection.
Commun. Math. Sci., 8(1):187–216, 2010.
C. G. Knight, S. H. E. Knight, N. Massey, T. Aina, C. Christensen, D. J. Frame, J. A. Kettleborough, A. Martin, S. Pascoe, B. Sanderson, D. A. Stainforth, and M. R. Allen. Association of
parameter, software, and hardware variation with large-scale behavior across 57,000 climate
models. P. Natl. Acad. Sci. U.S.A., 104(30):12259–12264, 2007.
R. H. Kraichnan and D. Montgomery. Two-dimensional turbulence. Rep. Prog. Phys., 43:
547–619, 1980.
S. Kullback and R. A. Leibler. On information and sufficiency. Ann. Math. Stat., 22(1):79–86,
1951.
F. Kwasniok. Data-based stochastic subgrid-scale parametrisation: an approach using cluster-weighted modelling. Phil. Trans. R. Soc. A, 370(1962):1061–1086, 2012.
M. Laine, A. Solonen, H. Haario, and H. Järvinen. Ensemble prediction and parameter estimation system: the method. Q. J. Roy. Meteor. Soc., 138(663):289–297, 2012.
L. A. Lee, K. S. Carslaw, K. J. Pringle, and G. W. Mann. Mapping the uncertainty in global
CCN using emulation. Atmos. Chem. Phys., 12(20):9739–9751, 2012.
M. Leutbecher. Diagnosis of ensemble forecasting systems. In Seminar on Diagnosis of Forecasting and Data Assimilation Systems, 7–10 September 2009, pages 235–266, Shinfield
Park, Reading, 2010. ECMWF.
M. Leutbecher and T. N. Palmer. Ensemble forecasting. J. Comput. Phys., 227(7):3515–3539,
2008.
J. W.-B. Lin and J. D. Neelin. Influence of a stochastic moist convective parametrisation on
tropical climate variability. Geophys. Res. Let., 27(22):3691–3694, 2000.
J. W.-B. Lin and J. D. Neelin. Considerations for stochastic convective parameterization. J.
Atmos. Sci., 59(5):959–975, 2002.
J. W.-B. Lin and J. D. Neelin. Towards stochastic deep convective parameterization in general
circulation models. Geophys. Res. Let., 30(4), 2003.
E. N. Lorenz. Deterministic nonperiodic flow. J. Atmos. Sci., 20(2):130–141, 1963.
E. N. Lorenz. Predictability; does the flap of a butterfly’s wings in Brazil set off a tornado in
Texas? In American Association for the Advancement of Science, 139th Meeting, December
1972.
E. N. Lorenz. Predictability — a problem partly solved. In Proceedings, Seminar on Predictability, 4-8 September 1995, volume 1, pages 1–18, Shinfield Park, Reading, 1996. ECMWF.
E. N. Lorenz. Climate is what you expect. eaps4.mit.edu/research/Lorenz/publications, 1997.
52pp.
E. N. Lorenz. Regimes in simple systems. J. Atmos. Sci., 63(8):2056–2073, 2006.
P. Lynch. Richardson’s barotropic forecast: A reappraisal. B. Am. Meteorol. Soc., 73(1):35–47,
1992.
A. J. Majda and B. Khouider. Stochastic and mesoscopic models for tropical convection. P.
Natl. Acad. Sci. U.S.A., 99(3):1123–1128, 2002.
G. M. Martin, S. F. Milton, C. A. Senior, M. E. Brooks, and S. Ineson. Analysis and reduction
of systematic errors through a seamless approach to modeling weather and climate. J.
Climate, 23(22):5933–5957, 2010.
D. Masson and R. Knutti. Climate model genealogy. Geophys. Res. Let., 38, 2011.
J.-J. Morcrette. Radiation and cloud radiative properties in the European Centre for Medium Range Weather Forecasts forecasting system. J. Geophys. Res.-Atmos., 96(D5):9121–9132, 1991.
A. H. Murphy. A note on the utility of probabilistic predictions and the probability score in
the cost-loss ratio decision situation. J. Appl. Meteorol., 5(4):534–537, 1966.
A. H. Murphy. A new vector partition of the probability score. J. Appl. Meteorol., 12(4):
595–600, 1973.
A. H. Murphy. The value of climatological, categorical and probabilistic forecasts in the cost-loss ratio situation. Mon. Weather Rev., 105(7):803–816, 1977.
A. H. Murphy. A new decomposition of the Brier score: Formulation and interpretation. Mon.
Weather Rev., 114(12):2671–2673, 1986.
A. H. Murphy and M. Ehrendorfer. On the relationship between the accuracy and value of
forecasts in the cost-loss ratio situation. Weather Forecast., 2(3):243–251, 1987.
J. M. Murphy, D. M. H. Sexton, D. N. Barnett, G. S. Jones, M. J. Webb, M. Collins, and
D. A. Stainforth. Quantification of modelling uncertainties in a large ensemble of climate
change simulations. Nature, 430(7001):768–772, 2004.
G. D. Nastrom and K. S. Gage. A climatology of atmospheric wavenumber spectra of wind
and temperature observed by commercial aircraft. J. Atmos. Sci., 42(9):950–960, 1985.
F. Nebeker. Calculating the Weather: Meteorology in the 20th Century. Academic Press, Inc.,
San Diego, California, U.S.A., 1995.
A. Oort and J. Yienger. Observed interannual variability in the Hadley circulation and its
connection to ENSO. J. Climate, 9(11):2751–2767, 1996.
T. N. Palmer. A nonlinear dynamical perspective on climate change. Weather, 48(10):314–326,
1993.
T. N. Palmer. A nonlinear dynamical perspective on climate prediction. J. Climate, 12(2):
575–591, 1999.
T. N. Palmer. A nonlinear dynamical perspective on model error: A proposal for non-local
stochastic-dynamic parametrisation in weather and climate prediction models. Q. J. Roy.
Meteor. Soc., 127(572):279–304, 2001.
T. N. Palmer. The economic value of ensemble forecasts as a tool for risk assessment: From
days to decades. Q. J. Roy. Meteor. Soc., 128(581):747–774, 2002.
T. N. Palmer. Towards the probabilistic earth-system simulator: A vision for the future of
climate and weather prediction. Q. J. Roy. Meteor. Soc., 138(665):841–861, 2012.
T. N. Palmer, A. Alessandri, U. Andersen, P. Cantelaube, M. Davey, P. Délécluse, M. Déqué,
E. Dı́ez, F. J. Doblas-Reyes, H. Feddersen, R. Graham, S. Gualdi, J.-F. Guérémy, R. Hagedorn, M. Hoshen, N. Keenlyside, M. Latif, A. Lazar, E. Maisonnave, V. Marletto, A. P.
Morse, B. Orfila, P. Rogel, J.-M. Terres, and M. C. Thomson. Development of a European
multimodel ensemble system for seasonal-to-interannual prediction (DEMETER). B. Am.
Meteorol. Soc., 85(6):853–872, 2004.
T. N. Palmer, F. J. Doblas-Reyes, A. Weisheimer, and M. J. Rodwell. Towards seamless prediction: Calibration of climate change projections using seasonal forecasts. B. Am. Meteorol.
Soc., 89(4):459–470, 2008.
T. N. Palmer, R. Buizza, F. Doblas-Reyes, T. Jung, M. Leutbecher, G. J. Shutts, M. Steinheimer, and A. Weisheimer. Stochastic parametrization and model uncertainty. Technical
Report 598, European Centre for Medium-Range Weather Forecasts, Shinfield Park, Reading, 2009.
C. Pennell and T. Reichler. On the effective number of climate models. J. Climate, 24(9):
2358–2367, 2011.
K. Peters, C. Jakob, L. Davies, B. Khouider, and A. J. Majda. Stochastic behavior of tropical
convection in observations and a multicloud model. J. Atmos. Sci., 2013. In press.
R. S. Plant and G. C. Craig. A stochastic parameterization for deep convection based on
equilibrium statistics. J. Atmos. Sci., 65(1):87–104, 2008.
B. Pohl and N. Fauchereau. The southern annular mode seen through weather regimes. J.
Climate, 25(9):3336–3354, 2012.
D. Pollard. A User’s Guide to Measure Theoretic Probability. Cambridge University Press,
Cambridge, U.K. and New York, NY, U.S.A., 2002.
L. Ricciardulli and R. R. Garcia. The excitation of equatorial waves by deep convection in the
NCAR community climate model (CCM3). J. Atmos. Sci., 57(21):3461–3487, 2000.
D. S. Richardson. Measures of skill and value of ensemble prediction systems, their interrelationship and the effect of ensemble size. Q. J. Roy. Meteor. Soc., 127(577):2473–2489,
2001.
L. F. Richardson. Weather Prediction by Numerical Process. Cambridge University Press, The
Edinburgh Building, Cambridge, CB2 8RU, England, 2nd edition, 2007.
M. J. Rodwell and T. N. Palmer. Using numerical weather prediction to assess climate models.
Q. J. Roy. Meteor. Soc., 133(622):129–146, 2007.
J. Rougier, D. M. H. Sexton, J. M. Murphy, and D. Stainforth. Analyzing the climate sensitivity
of the HadSM3 climate model using ensembles from different but related experiments. J.
Climate, 22:3540–3557, 2009.
M. S. Roulston and L. A. Smith. Evaluating probabilistic forecasts using information theory.
Mon. Weather Rev., 130(6):1653–1660, 2002.
M. S. Roulston and L. A. Smith. The boy who cried wolf revisited: The impact of false alarm
intolerance on cost-loss scenarios. Weather Forecast., 19(2):391–397, 2004.
F. Sanders. On subjective probability forecasting. J. Appl. Meteorol., 2(2):191–201, 1963.
B. M. Sanderson. A multimodel study of parametric uncertainty in predictions of climate
response to rising greenhouse gas concentrations. J. Climate, 24(5):1362–1377, 2011.
B. M. Sanderson, C. Piani, W. J. Ingram, D. A. Stone, and M. R. Allen. Towards constraining
climate sensitivity by linear analysis of feedback patterns in thousands of perturbed-physics
GCM simulations. Clim. Dynam., 30(2–3):175–190, 2008.
G. Shutts. A kinetic energy backscatter algorithm for use in ensemble prediction systems. Q.
J. Roy. Meteor. Soc., 131(612):3079–3102, 2005.
G. J. Shutts. Stochastic backscatter in the Unified Model. Met Office Scientific Advisory
Committee Paper 14.5, U.K. Met Office, FitzRoy Road, Exeter, 2009.
G. J. Shutts and M. E. B. Gray. A numerical modelling study of the geostrophic adjustment
process following deep convection. Q. J. Roy. Meteor. Soc., 120(519):1145–1178, 1994.
G. J. Shutts and T. N. Palmer. Convective forcing fluctuations in a cloud-resolving model:
Relevance to the stochastic parameterization problem. J. Climate, 20(2):187–202, 2007.
J. Slingo and T. N. Palmer. Uncertainty in weather and climate prediction. Phil. Trans. R.
Soc. A, 369(1956), 2011.
S. Solomon, D. Qin, M. Manning, Z. Chen, M. Marquis, K. B. Averyt, M. Tignor, and H. L.
Miller. Summary for policymakers. In Climate Change 2007: The Physical Science Basis.
Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental
Panel on Climate Change, Cambridge, United Kingdom and New York, NY, USA, 2007.
Cambridge University Press.
D. A. Stainforth, T. Aina, C. Christensen, M. Collins, N. Faull, D. J. Frame, J. A. Kettleborough, S. Knight, A. Martin, J. M. Murphy, C. Piani, D. Sexton, L. A. Smith, R. A. Spicer,
A. J. Thorpe, and M. R. Allen. Uncertainty in predictions of the climate response to rising
levels of greenhouse gases. Nature, 433(7024):403–406, 2005.
D. J. Stensrud. Upscale effects of deep convection during the North American monsoon. J.
Atmos. Sci., 70(9):2681–2695, 2013.
D. J. Stensrud, J.-W. Bao, and T. T. Warner. Using initial condition and model physics
perturbations in short-range ensemble simulations of mesoscale convective systems. Mon.
Weather Rev., 128(7):2077–2107, 2000.
E. M. Stephens, T. L. Edwards, and D. Demeritt. Communicating probabilistic information
from climate model ensembles — lessons from numerical weather prediction. WIREs: Clim.
Change, 3(5):409–426, 2012.
D. B. Stephenson, A. Hannachi, and A. O’Neill. On the existence of multiple climate regimes.
Q. J. Roy. Meteor. Soc., 130(597):583–605, 2004.
D. M. Straus, S. Corti, and F. Molteni. Circulation regimes: Chaotic variability versus SST-forced predictability. J. Climate, 20(10):2251–2272, 2007.
K. E. Taylor, R. J. Stouffer, and G. A. Meehl. An overview of CMIP5 and the experiment
design. B. Am. Meteorol. Soc., 93(4):485–498, 2012.
M. Tiedtke. A comprehensive mass flux scheme for cumulus parameterization in large-scale
models. Mon. Weather Rev., 117(8):1779–1800, 1989.
M. Tiedtke. Representation of clouds in large-scale models. Mon. Weather Rev., 121(11):
3040–3061, 1993.
J. Tödter and B. Ahrens. Generalisation of the ignorance score: Continuous ranked version
and its decomposition. Mon. Weather Rev., 140(6):2005–2017, 2012.
N. P. Wedi. The numerical coupling of the physical parametrizations to the “dynamical”
equations in a forecast model. Technical Report 274, European Centre for Medium-Range
Weather Forecasts, Shinfield Park, Reading, 1999.
A. P. Weigel, M. A. Liniger, and C. Appenzeller. Can multi-model combination really enhance
the prediction skill of probabilistic ensemble forecasts? Q. J. Roy. Meteor. Soc., 134(630):
241–260, 2008.
A. Weisheimer, T. N. Palmer, and F. J. Doblas-Reyes. Assessment of representations of model
uncertainty. Geophys. Res. Let., 38, 2011.
D. S. Wilks. Effects of stochastic parametrizations in the Lorenz ’96 system. Q. J. Roy. Meteor.
Soc., 131(606):389–407, 2005.
D. S. Wilks. Statistical Methods in the Atmospheric Sciences, volume 91 of International
Geophysics Series. Elsevier, second edition, 2006.
K.-M. Xu, A. Arakawa, and S. K. Krueger. The macroscopic behavior of cumulus ensembles
simulated by a cumulus ensemble model. J. Atmos. Sci., 49(24):2402–2420, 1992.
T. Yokohata, M. J. Webb, M. Collins, K. D. Williams, M. Yoshimori, J. C. Hargreaves, and
J. D. Annan. Structural similarities and differences in climate responses to CO2 increase
between two perturbed physics ensembles. J. Climate, 23(6):1392–1410, 2010.
Y. Zhu, Z. Toth, R. Wobus, D. Richardson, and K. Mylne. The economic value of ensemble-based weather forecasts. B. Am. Meteorol. Soc., 83(1):73–83, 2002.