Between and beyond: Irregular series, interpolation, variograms, and smoothing
Nicholas J. Cox
Mind the gap!
Repeated reminder, London Underground.
2
Executive summary
A new program mipolate for several kinds of
interpolation is now available.
It can be downloaded from SSC (3 September 2015).
Variograms are useful for examining dependence structure
in time and spatial series.
Work is in progress on a new program vgram for
variograms.
3
Irregular series
Irregular series are series in which non-missing values are
not all equally spaced.
Special case: Values would be equally spaced (every day,
every year, …), but there are some gaps with missing values,
for human or inhuman reasons.
General case: Values are just at known times or points with
no necessary rules about spacing.
Irregular series often seem to invite interpolation.
4
Luke Howard (1772–1864)
Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth).
Here we use as a sandbox some of his temperature data from Plaistow, near London, in 1807.
5
Howard, Luke. 1818.
The Climate of London, Deduced from Meteorological
Observations, Made at Different Places in the
Neighbourhood of the Metropolis.
Volume I.
London: W. Phillips, etc.
6
[Figure: two panels, maximum (°F) and minimum (°F) temperature against date, 7 May to 4 June.]
7
Series of events
N.B. We are not talking here about series of events,
or realisations of point processes.
In such series occurrences are typically irregularly spaced,
but the gaps are inherent in the process,
not a failing of our data.
Examples range from eruptions to elections.
8
[Figure: three timelines of events: eruptions, Mt Adams WA (axis from -8000 to 2000); elections of black Presidents, USA (1789 to 2016); elections of women Presidents, USA (1789 to 2016).]
9
Interpolation
Interpolation is the art of reading between the lines.
Historically, it is a deterministic process, often a matter of
going beyond printed tables of functions
(logarithmic, trigonometric, and so forth).
In principle, we should worry about the statistical
properties of interpolation. It is local prediction.
In practice, imputation now appears better known among
statistical researchers.
10
Interpolation in (official) Stata
The ipolate command for linear interpolation (and
extrapolation) was added in Stata 3.1 (1993).
The Mata functions spline3() and spline3eval()
were added in Stata 9.0 (2005).
11
User-written programs on SSC
Programs (NJC) have been available from SSC for
cubic interpolation: cipolate (2002)
cubic spline interpolation: csipolate (2009)
piecewise cubic Hermite interpolation:
pchipolate (2012)
nearest neighbour interpolation:
nnipolate (2012)
A combined and extended program mipolate is now
available too.
12
Two dimensions too
Note also bipolate (Joseph Canner, SSC) (2014).
By default it uses quintic polynomials.
Other available methods include thin plate splines
and Shepard’s method.
Note also twoway contour.
13
mipolate generalises ipolate
Interpolation is of yvar with respect to specified xvar.
Prior tsset or xtset is not assumed.
Regular spacing is not assumed.
Multiple values of yvar at the same xvar are averaged first.
Groupwise operations using by: are supported.
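The averaging of repeated values can be sketched in a few lines. This is an illustration in Python, not mipolate's own code; the function name average_ties is hypothetical, and missing values are represented here by None.

```python
from collections import defaultdict

def average_ties(xs, ys):
    """Collapse repeated x values to one point each, averaging their y values.

    Missing y values (None) are ignored. Returns (x, mean y) pairs sorted by x.
    """
    sums = defaultdict(lambda: [0.0, 0])   # x -> [running sum, count]
    for x, y in zip(xs, ys):
        if y is not None:
            sums[x][0] += y
            sums[x][1] += 1
    return sorted((x, s / n) for x, (s, n) in sums.items())

# Two readings at x = 1 are averaged before any interpolation is attempted.
print(average_ties([1, 1, 2, 3], [2.0, 4.0, 5.0, None]))
```

The result is a set of distinct positions, which is what every interpolation method below needs as input.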
14
Linear and cubic
Linear interpolation just uses previous and following
known values (only).
This is done by ipolate, and also mipolate by default.
Cubic interpolation is another classic method, using two
previous and two following known values (only).
This is done by mipolate, cubic.
The default of mipolate with either method
(as with ipolate) is not to extrapolate.
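A minimal Python sketch of the two ideas, under stated assumptions: it is not the code of ipolate or mipolate, the function names are hypothetical, and known x values are assumed distinct and sorted. How mipolate treats points with fewer than two known neighbours on one side is not stated here, so the cubic sketch simply returns None in that case.

```python
from bisect import bisect_left

def linear_interp(xs, ys, x):
    """Linear interpolation from the known values either side of x.

    Returns None outside the range of known x (no extrapolation).
    """
    if x < xs[0] or x > xs[-1]:
        return None
    j = bisect_left(xs, x)               # first index with xs[j] >= x
    if xs[j] == x:
        return ys[j]
    x0, x1, y0, y1 = xs[j - 1], xs[j], ys[j - 1], ys[j]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def cubic_interp(xs, ys, x):
    """Lagrange cubic through two known points before and two after x.

    Returns None when that is not possible (an assumption of this sketch).
    """
    if x < xs[0] or x > xs[-1]:
        return None
    j = bisect_left(xs, x)
    if xs[j] == x:
        return ys[j]
    if j < 2 or j + 1 >= len(xs):
        return None
    px, py = xs[j - 2:j + 2], ys[j - 2:j + 2]
    total = 0.0
    for i in range(4):
        term = py[i]
        for k in range(4):
            if k != i:
                term *= (x - px[k]) / (px[i] - px[k])
        total += term
    return total
```

With known values of x cubed at x = 0, 1, 2, 3, the cubic reproduces the curve at 1.5 (up to rounding), while the straight line between 1 and 8 gives 4.5.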
15
Un peu d’histoire
Cubic interpolation, as a particular kind of polynomial
interpolation, is often attributed to Joseph-Louis Lagrange
(1736–1813) but was proposed earlier by Edward Waring
(1735?–1798).
In fact there is a long history of work with contributions by
many outstanding mathematicians, not least Isaac Newton
(1643–1727) and Leonhard Euler (1707–1783).
16
[Portraits: Lagrange; Waring.]
17
Cubic splines
As before, we are using cubic polynomials locally, but they
are constrained to join smoothly.
The syntax is mipolate, spline.
This is merely a wrapper for the official Mata functions.
As before, the default of mipolate with this option is not
to extrapolate.
18
Linear extrapolation
As with ipolate, linear extrapolation is available as an
option in mipolate to fill in missing values at the ends of series.
What your teachers told you is true:
extrapolation is dangerous.
“Don’t point that straight line: It can go off anywhere.”
(Allude here to Mark Twain on the Mississippi.)
19
Piecewise cubic Hermite
interpolation
This method also uses piecewise cubics joining smoothly.
The syntax is mipolate, pchip.
The interpolant is shape-preserving and cannot
overshoot locally.
Sections in which yvar is increasing, decreasing or constant
with xvar remain so after interpolation.
Hence local maxima and minima also remain so.
This interpolation method also extrapolates.
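The shape-preserving property can be seen in a short Python sketch. It follows the general recipe of setting interior slopes to a weighted harmonic mean of neighbouring secant slopes, and to zero where the data change direction. It is an illustration only, not mipolate's code: the function names are hypothetical, and the end slopes are simply set to the end secants, which is a simplification assumed here.

```python
from bisect import bisect_right

def pchip_slopes(xs, ys):
    """Slopes at the known points for a shape-preserving Hermite cubic."""
    n = len(xs)
    h = [xs[i + 1] - xs[i] for i in range(n - 1)]
    delta = [(ys[i + 1] - ys[i]) / h[i] for i in range(n - 1)]
    d = [0.0] * n
    d[0], d[-1] = delta[0], delta[-1]      # simplified end slopes (assumption)
    for k in range(1, n - 1):
        if delta[k - 1] * delta[k] > 0:    # same direction on both sides
            w1 = 2 * h[k] + h[k - 1]
            w2 = h[k] + 2 * h[k - 1]
            d[k] = (w1 + w2) / (w1 / delta[k - 1] + w2 / delta[k])
        # otherwise the slope stays 0: a local extremum or flat section
    return d

def pchip_eval(xs, ys, x):
    """Evaluate the piecewise cubic Hermite interpolant at x."""
    d = pchip_slopes(xs, ys)
    k = min(max(bisect_right(xs, x) - 1, 0), len(xs) - 2)
    h = xs[k + 1] - xs[k]
    t = (x - xs[k]) / h
    h00 = (1 + 2 * t) * (1 - t) ** 2       # Hermite basis functions
    h10 = t * (1 - t) ** 2
    h01 = t * t * (3 - 2 * t)
    h11 = t * t * (t - 1)
    return h00 * ys[k] + h10 * h * d[k] + h01 * ys[k + 1] + h11 * h * d[k + 1]
```

For a step such as y = 0, 0, 1, 1, 1 at x = 0, ..., 4, the interpolated values stay between 0 and 1 and never decrease.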
20
Charles Hermite (1822–1901)
21
Inverse distance weighting
Interpolation can use a weighted average of known values,
the weights being inverse powers of distance d from
the unknown value.
If I don’t know the value at 42, 41 and 43 are distance 1
away, 40 and 44 distance 2, and so on.
For weights $d^{-p}$, the limiting case p = 0 makes all weights equal,
and so the interpolant is the overall mean, while p very
large means that only the very nearest values have effect.
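The two limiting cases are easy to check numerically. The sketch below is Python written for illustration, not mipolate's code; the function name idw and the default power of 2 are assumptions, and the data values are made up.

```python
def idw(xs, ys, x, power=2.0):
    """Inverse distance weighted average of known ys, evaluated at x."""
    num = den = 0.0
    for xi, yi in zip(xs, ys):
        d = abs(x - xi)
        if d == 0:
            return yi                 # x coincides with a known point
        w = d ** -power
        num += w * yi
        den += w
    return num / den

xs = [40, 41, 43, 44]                 # the value at 42 is unknown
ys = [10.0, 12.0, 16.0, 30.0]         # made-up values
print(idw(xs, ys, 42, power=0))       # all weights equal: the overall mean
print(idw(xs, ys, 42, power=50))      # dominated by the neighbours at 41 and 43
```

With p = 0 the result is the mean of all four values; with p = 50 it is, to many decimal places, the mean of the two values at distance 1.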
22
Other methods
mipolate adds forward, backward, nearest neighbour and
groupwise interpolation:
Use the previous, next or the nearest known value. Or
extend the single non-missing value in a group to all others.
Using the last known value is often dubious statistically,
but it is a very common request in data management.
The other methods are provided partly for completeness.
There is small print (option choices) about how to break ties when two values are equally near.
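These four rules are simple enough to state as code. The Python below is a sketch for illustration, not mipolate's implementation; all function names are hypothetical, missing values are None, and the prefer argument is an invented stand-in for whatever tie-breaking options mipolate offers.

```python
def fill_forward(ys):
    """Carry the last known value forward."""
    out, last = [], None
    for y in ys:
        if y is not None:
            last = y
        out.append(last)
    return out

def fill_backward(ys):
    """Carry the next known value backward."""
    return fill_forward(ys[::-1])[::-1]

def fill_nearest(xs, ys, prefer="previous"):
    """Use the nearest known value; prefer settles ties (hypothetical option)."""
    known = [(x, y) for x, y in zip(xs, ys) if y is not None]
    out = []
    for x, y in zip(xs, ys):
        if y is not None or not known:
            out.append(y)
            continue
        best = min(abs(x - kx) for kx, _ in known)
        cands = [(kx, ky) for kx, ky in known if abs(x - kx) == best]
        pick = min(cands) if prefer == "previous" else max(cands)
        out.append(pick[1])
    return out

def fill_groupwise(ys):
    """Spread a group's single distinct known value to every member."""
    distinct = {y for y in ys if y is not None}
    if len(distinct) == 1:
        return [next(iter(distinct))] * len(ys)
    return list(ys)
```

For x = 1, ..., 5 with known values 10 at x = 1 and 50 at x = 5, the point at x = 3 is equally near both, so the answer there depends entirely on the tie rule.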
23
mipolate summary
Nine methods, and whether linear extrapolation is available:

Method            Linear extrapolation?
linear            yes
cubic             yes
(cubic) spline    yes
pchip             no
idw               no
forward           no
backward          no
nearest           no
groupwise         no
24
[Figure: maximum (°F), 7 May to 4 June, interpolated by the linear, cubic, spline and pchip methods.]
25
[Figure: minimum (°F), 7 May to 4 June, interpolated by the linear, cubic, spline and pchip methods.]
26
Simple messages
There are many interpolation methods to choose from.
They will often disagree, even for simple-looking instances.
Disagreement gives a handle on uncertainty.
In a real problem, simulate missings and test how well
known values are estimated.
What makes most sense in your problem will reflect its
dependence structure.
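One way to act on the advice to simulate missings is a leave-one-out check: hide each known interior value in turn, interpolate it from the rest, and summarise the errors. The Python below is a sketch under assumptions: the function names are hypothetical, only two simple methods are compared, and the data are made up for illustration (they are not Howard's temperatures).

```python
import math

def linear(xs, ys, x):
    """Linear interpolation between the known points bracketing x."""
    for i in range(len(xs) - 1):
        if xs[i] <= x <= xs[i + 1]:
            t = (x - xs[i]) / (xs[i + 1] - xs[i])
            return ys[i] + t * (ys[i + 1] - ys[i])
    return None

def previous(xs, ys, x):
    """Last known value at or before x."""
    prior = [y for xi, y in zip(xs, ys) if xi <= x]
    return prior[-1] if prior else None

def loo_rmse(xs, ys, method):
    """Root mean square error when each interior value is hidden in turn."""
    errors = []
    for i in range(1, len(xs) - 1):
        kept_x = xs[:i] + xs[i + 1:]
        kept_y = ys[:i] + ys[i + 1:]
        guess = method(kept_x, kept_y, xs[i])
        if guess is not None:
            errors.append(guess - ys[i])
    return math.sqrt(sum(e * e for e in errors) / len(errors))

xs = [0, 1, 2, 3, 4, 5]
ys = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]   # made-up smooth series
print(loo_rmse(xs, ys, linear), loo_rmse(xs, ys, previous))
```

On this smooth series linear interpolation recovers hidden values much better than carrying the previous value forward; on a series with a different dependence structure the ranking could differ.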
27
We turn from a project that is done to one that is very much
in progress.
28
Variograms
Variograms (more properly semivariograms) are plots of
(mean) half squared difference between values
versus
separation, distance or lag.
By a tempting abuse of terminology, we often use the same
name for the underlying relationship as a function.
29
First known use of term ‘variogram’
Geoffrey H. Jowett (1922– ) in 1955:
The comparison of means of sets of observations from
sections of independent stochastic series.
Journal of the Royal Statistical Society. Series B
(Methodological) 17: 208–227.
30
Spatial and time series
Variograms are central to one approach to spatial statistics, in this context often known as geostatistics.
Georges Matheron (1930–2000) is most often mentioned here.
But variograms can be very useful for time series too.
31
Time series too
Variograms are prominent in these texts on time series and
longitudinal data:
Diggle, P.J. 1990. Time Series: A Biostatistical
Introduction. Oxford: Oxford University Press.
Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L.
2002. Analysis of Longitudinal Data. Oxford: Oxford
University Press.
32
User-written programs
Programs (NJC) are available from SSC for
variograms in one dimension: variog (2005)
variograms in two dimensions: variog2 (2005)
A combined and extended program vgram is under
development.
33
Generality of variograms
So, variograms are – without undue strain – defined
for time series and for spatial series,
whether regular or irregular,
as they just depend on separation being measured.
Plotting the mean for each distinct separation is
a common, but not compulsory, convention.
34
A simple example: webuse air2
[Figure: the air series (airline passengers) plotted against time, 1949 to 1960.]
35
Variograms
vgram air
vgram air, recast(connected) xla(0(12)72)
[Figure: two panels, each titled "Semi-variogram of Airline Passengers (1949-1960)", plotting semi-variance against lag. One has lag axis labels from 0 to 80 in steps of 20; the other, from the second command, has connected points and lag axis labels from 0 to 72 in steps of 12.]
36
Comparison at different lags
We are plotting half the mean squared differences between values
compared at lags 1, 2, 3, …
In this example, we have monthly data, so are comparing
values 1, 2, 3, … months apart.
Many readers may be familiar with the same idea for
calculating autocorrelation and cross-correlation.
The variogram – like the raw data plot – hints at a structure
of trend plus seasonality.
37
Variograms of residuals, not data
Here, as elsewhere, it is a good idea to work with residuals,
rather than the original data.
Time series modellers could have a happy time arguing
which model was best for the airline data, but we just use a
Poisson regression on time and look at its residuals.
On the versatility and virtuosity of Poisson regression,
check out Gould, William.
http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/
38
Sometimes, structure is this simple
Poisson regression: air = exp(-224.1 + .11747 time)
n = 144, R² = 85.5%, RMSE = 45.799
[Figure: two panels. One shows the Poisson regression, plotted against time (in months), 1950 to 1960. The other shows the residuals from Poisson: "Semi-variogram of response residual", semi-variance against lag.]
39
A little more formally
The semivariogram γ(h) for response z is given by
$2\gamma(h) = A\{ [z(i) - z(i + h)]^2 \}$
where A{} denotes averaging over pairs of values at lag h.
As emphasised, using a mean is a convention. The fuller
picture (literally!) is a plot of $[z(i) - z(i + h)]^2$ versus h.
This is often known as a variogram cloud.
I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.
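For a regularly spaced series the averaging formula above takes only a few lines. The Python below is a sketch for illustration, not the code of vgram; the function name semivariogram is hypothetical, and missing values are represented by None and skipped.

```python
def semivariogram(z, max_lag):
    """Empirical semivariogram of a regularly spaced series.

    Returns {lag: (gamma, number of pairs)} for lags 1..max_lag, where
    gamma is half the mean squared difference between values lag apart.
    """
    out = {}
    for h in range(1, max_lag + 1):
        sq = [(z[i] - z[i + h]) ** 2
              for i in range(len(z) - h)
              if z[i] is not None and z[i + h] is not None]
        if sq:
            out[h] = (0.5 * sum(sq) / len(sq), len(sq))
    return out

# A pure linear trend: differences at lag h are all h, so gamma(h) = h*h/2.
print(semivariogram([1, 2, 3, 4, 5], 3))
```

The number of pairs is returned alongside each value because it shrinks as the lag grows: a series of length n with no gaps has n − h pairs at lag h.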
40
Where does the 2 come from?
The units of the semivariogram are those of the response squared.
Adding the variance to the graph as a reference line underlines the connection.
A non-standard formula for the variance is, for any distinct i, j,
$\frac{1}{2} E\{ (z_i - z_j)^2 \}$,
provided that z_i and z_j are uncorrelated with the same mean and variance.
[Figure: "Semi-variogram of response residual", semi-variance against lag, with the variance shown as a horizontal reference line.]
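The sample analogue of this formula can be checked directly: half the mean squared difference over all distinct pairs equals the usual sample variance (divisor n − 1). The Python below is a small numerical check with made-up values; the function name is hypothetical.

```python
from itertools import combinations
from statistics import variance

def half_mean_sq_diff(z):
    """Half the mean squared difference over all distinct pairs of values."""
    pairs = list(combinations(z, 2))
    return 0.5 * sum((a - b) ** 2 for a, b in pairs) / len(pairs)

z = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]   # made-up values
print(half_mean_sq_diff(z), variance(z))        # the two agree up to rounding
```

The semivariogram applies the same calculation, but restricted to pairs a given lag apart, which is why the variance is a natural reference level.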
41
Back to vgram
vgram (not yet public) is already quite general.
We take possibilities one by one.
o With just one argument, the response, it checks for a
tsset or xtset time variable and uses it to define
separations if found. Note that panel data are supported
for free.
o With just one argument otherwise, the order of the
observations is taken to define position in time or space.
42
o With two arguments, the second variable is taken to
define position. A width() option is required to specify
the width of bins within which differences squared are
averaged. Equal and unequal spacing can thus both be
accommodated.
o With three arguments, the second and third variables
are taken to define position. A width() option is
required to specify the width of bins within which
differences squared are averaged. Distance is calculated
from coordinates using Pythagoras’ theorem.
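The binned calculation for irregular positions in one or two dimensions can be sketched as follows. This is illustrative Python, not vgram: the function name is hypothetical, and assigning each pair to the bin [k × width, (k + 1) × width) is an assumption about the binning convention.

```python
import math
from itertools import combinations

def binned_semivariogram(points, values, width):
    """Semivariogram with distances grouped into bins of the given width.

    points are coordinate tuples, such as (x,) or (x, y); distance is
    Euclidean (Pythagoras). Returns {bin index: (gamma, number of pairs)}.
    """
    sums = {}
    for (p, a), (q, b) in combinations(zip(points, values), 2):
        k = int(math.dist(p, q) // width)       # bin holding this separation
        s, n = sums.get(k, (0.0, 0))
        sums[k] = (s + (a - b) ** 2, n + 1)
    return {k: (0.5 * s / n, n) for k, (s, n) in sorted(sums.items())}

points = [(0, 0), (3, 4), (6, 8)]               # made-up coordinates
values = [1.0, 3.0, 7.0]                        # made-up responses
print(binned_semivariogram(points, values, width=5))
```

Here two pairs are 5 units apart and one pair is 10 units apart, so with a width of 5 they fall into bins 1 and 2. Passing one-element tuples such as (t,) gives the one-dimensional case.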
43
Why not just use autocorrelation?
Variograms are defined for a wider class of processes.
Autocorrelation functions require weak stationarity;
variograms are defined for processes with stationary
increments.
Variograms are more flexible in the face of irregular
spacing.
The very wide use of autocorrelation reflects custom and
familiarity as well as intrinsic merit.
44
A further example
We look at rainfalls for 8 May 1986 (a single day) for 467
stations in Switzerland.
45
[Figure: rainfall 8 May 1986 (mm), shown in classes with breaks at the 5, 25, 50, 75 and 95% percentiles: below 3.3, 3.3 to 9.9, 9.9 to 15.2, 15.2 to 26.3, 26.3 to 39.4, and above 39.4.]
46
[Figure: "Semi-variogram of rainfall 8 May 1986 (mm)"; lags are 10 km bands.]
47
How much information?
Optionally, vgram can save the semivariogram results
to new variables.
Keeping track of the number of pairs used at each lag is
important.
Here we exploit the feature that spikeplot can show
frequencies on a square root scale.
48
[Figure: spike plot of the number of pairs at each lag, with frequencies on a square root scale.]
49
To do list
variogram clouds
model fitting (valid functional forms)
robust estimators
more flexible binning
spherical distances too
direction as well as lag
use for interpolation (and smoothing) (kriging, Gaussian process regression)
50
Variogram virtues
Defined for time and spatial series.
Defined for regular and irregular series.
Can help identify and check for structure.
… even if you have no interest in their most mentioned use,
as a means towards the end of spatial interpolation.
51
This paper…
This paper fills a much needed gap in the literature.
See Jackson, A. 1997.
Chinese acrobatics, an old-time brewery,
and the “much needed gap”:
The life of Mathematical Reviews.
Notices of the American Mathematical Society
44: 330–337.
52
Acknowledgments
Historical portraits: Wikipedia.
MATLAB code for pchip:
Moler, C. 2004. Numerical Computing with MATLAB.
Philadelphia: SIAM. Chapter 3.
http://www.mathworks.com/moler/interp.pdf
The Swiss rainfall data can be found here:
http://www.aigeostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip
53
Leo Breiman (1928–2005)
The main thing to learn about statistics is what is sensible
and honest and possible.
Doubt and suspicion, as well as technical knowledge, are
indispensable tools in statistics.
1973.
Statistics: With a view towards applications.
Boston: Houghton Mifflin, pp.1, 18.
54