Between and beyond: Irregular series, interpolation, variograms, and smoothing
Nicholas J. Cox

Mind the gap!
Repeated reminder, London Underground.

Executive summary
A new program mipolate for several kinds of interpolation is now available. It can be downloaded from SSC (3 September 2015).
Variograms are useful for examining dependence structure in time and spatial series. Work is in progress on a new program vgram for variograms.

Irregular series
Irregular series are series in which non-missing values are not all equally spaced.
Special case: Values would be equally spaced (every day, every year, …), but there are some gaps with missing values, for human or inhuman reasons.
General case: Values are just at known times or points with no necessary rules about spacing.
Irregular series often seem to invite interpolation.

Luke Howard (1772–1864)
Best remembered for his nomenclature for clouds (cumulus, stratus, cirrus and so forth).
Here we use as sandbox some of his temperature data from Plaistow, near London, in 1807.

Howard, Luke. 1818. The Climate of London, Deduced from Meteorological Observations, Made at Different Places in the Neighbourhood of the Metropolis. Volume I. London: W. Phillips, etc.

[Figure: maximum (°F) and minimum (°F) temperature series, 7 May to 4 Jun]

Series of events
N.B. We are not talking here about series of events, or realisations of point processes.
In such series occurrences are typically irregularly spaced, but the gaps are inherent in the process, not a failing of our data.
Examples range from eruptions to elections.

[Figure: three event series: eruptions, Mt Adams WA (axis from -8000 to 2000); elections of black Presidents, USA (1789 to 2016); elections of women Presidents, USA (1789 to 2016)]

Interpolation
Interpolation is the art of reading between the lines.
Historically, it is a deterministic process, often a matter of going beyond printed tables of functions (logarithmic, trigonometric, and so forth).
In principle, we should worry about the statistical properties of interpolation. It is local prediction.
In practice, imputation now appears better known among statistical researchers.

Interpolation in (official) Stata
The ipolate command for linear interpolation (and extrapolation) was added in Stata 3.1 (1993).
The Mata functions spline3() and spline3eval() were added in Stata 9.0 (2005).

User-written programs on SSC
Programs (NJC) have been available from SSC for
cubic interpolation: cipolate (2002)
cubic spline interpolation: csipolate (2009)
piecewise cubic Hermite interpolation: pchipolate (2012)
nearest neighbour interpolation: nnipolate (2012)
A combined and extended program mipolate is now available too.

Two dimensions too
Note also bipolate (Joseph Canner, SSC) (2014). By default it uses quintic polynomials. Other available methods include thin plate splines and Shepard’s method.
Note also twoway contour.

mipolate generalises ipolate
Interpolation is of yvar with respect to specified xvar.
Prior tsset or xtset is not assumed.
Regular spacing is not assumed.
Multiple values of yvar at the same xvar are averaged first.
Groupwise operations using by: are supported.

Linear and cubic
Linear interpolation uses only the previous and the following known value. This is done by ipolate, and also by mipolate by default.
Cubic interpolation is another classic method, using only the two previous and the two following known values. This is done by mipolate, cubic.
The default of mipolate with either method (as with ipolate) is not to extrapolate.

Un peu d’histoire (a little history)
Cubic interpolation, as a particular kind of polynomial interpolation, is often attributed to Joseph-Louis Lagrange (1736–1813) but was proposed earlier by Edward Waring (1735?–1798).
In fact there is a long history of work with contributions by many outstanding mathematicians, not least Isaac Newton (1643–1727) and Leonhard Euler (1707–1783).
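Returning to the linear method described above: the stated behaviour (irregular spacing allowed, repeated values at the same x averaged first, no extrapolation unless requested) can be sketched in Python. This is an illustrative sketch only, not the code of ipolate or mipolate. The function name linear_fill, the use of None for missing values, and the choice to leave known values untouched are all assumptions made here.

```python
from bisect import bisect_left


def linear_fill(x, y, extrapolate=False):
    """Fill missing y values (None) by linear interpolation on x.

    Hypothetical helper for illustration; not part of any Stata program.
    """
    # Average multiple known y values that share the same x.
    sums = {}
    for xi, yi in zip(x, y):
        if yi is not None:
            total, count = sums.get(xi, (0.0, 0))
            sums[xi] = (total + yi, count + 1)
    kx = sorted(sums)
    ky = [sums[v][0] / sums[v][1] for v in kx]

    def estimate(xi):
        if xi in sums:                      # another observation knows this x
            return sums[xi][0] / sums[xi][1]
        if len(kx) < 2:                     # no line can be drawn
            return None
        pos = bisect_left(kx, xi)
        outside = pos == 0 or pos == len(kx)
        if outside and not extrapolate:
            return None
        # Inside the range: bracket xi. Outside: use the two end points.
        lo = 0 if pos == 0 else min(pos - 1, len(kx) - 2)
        hi = lo + 1
        slope = (ky[hi] - ky[lo]) / (kx[hi] - kx[lo])
        return ky[lo] + slope * (xi - kx[lo])

    # Known values are kept as they are (an assumption of this sketch).
    return [yi if yi is not None else estimate(xi) for xi, yi in zip(x, y)]


x = [1, 2, 4, 4, 7, 8]
y = [10, None, 20, 30, None, None]
inside_only = linear_fill(x, y)
with_ends = linear_fill(x, y, extrapolate=True)
```

In this toy series the two known values at x = 4 are averaged to 25 before interpolating, so the gap at x = 2 is filled from the points (1, 10) and (4, 25). The gaps at x = 7 and x = 8 lie beyond the last known value and stay missing unless extrapolation is requested.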
[Portraits: Lagrange; Waring]

Cubic splines
As before, we are using cubic polynomials locally, but they are constrained to join smoothly.
The syntax is mipolate, spline. This is merely a wrapper for the official Mata functions.
As before, the default of mipolate with this option is not to extrapolate.

Linear extrapolation
As with ipolate, linear extrapolation is available as an option in mipolate to fill in missings at the ends of series.
What your teachers told you is true: extrapolation is dangerous.
“Don’t point that straight line: It can go off anywhere.”
(Allude here to Mark Twain on the Mississippi.)

Piecewise cubic Hermite interpolation
This method also uses piecewise cubics joining smoothly.
The syntax is mipolate, pchip.
The interpolant is shape-preserving and cannot overshoot locally. Sections in which yvar is increasing, decreasing or constant with xvar remain so after interpolation. Hence local maxima and minima also remain so.
This interpolation method also extrapolates.

[Portrait: Charles Hermite (1822–1901)]

Inverse distance weighting
Interpolation can use a weighted average of known values, the weights being inverse powers of distance d from the unknown value.
If I don’t know the value at 42, then 41 and 43 are distance 1 away, 40 and 44 distance 2, and so on.
For weights $d^{-p}$, the limiting case p = 0 makes all weights equal, so the interpolant is the overall mean, while very large p means that only the very nearest values have any effect.

Other methods
mipolate adds forward, backward, nearest neighbour and groupwise interpolation:
Use the previous, next or the nearest known value.
Or extend the single non-missing value in a group to all others.
Using the last known value is often dubious statistically, but it is a very common request in data management. The other methods are provided partly for completeness.
There is small print (option choices) about how to break ties when two values are equally near.
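The inverse distance weighting described above, with weights proportional to distance to the power -p, can be sketched in a few lines of Python. This is a sketch under stated assumptions, not mipolate's implementation: the function name idw_fill, the use of None for missing values, and the default power of 2 are choices made here for illustration only.

```python
def idw_fill(x, y, power=2.0):
    """Fill missing y values (None) by inverse distance weighting on x.

    Hypothetical helper; each missing value becomes a weighted average of
    ALL known values, with weight distance ** -power.
    """
    known = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    filled = []
    for xi, yi in zip(x, y):
        if yi is not None or not known:
            filled.append(yi)               # keep known values; nothing to use
            continue
        exact = [ky for kx, ky in known if kx == xi]
        if exact:                           # distance 0: use the known mean
            filled.append(sum(exact) / len(exact))
            continue
        weights = [abs(xi - kx) ** -power for kx, _ in known]
        weighted_sum = sum(w * ky for w, (_, ky) in zip(weights, known))
        filled.append(weighted_sum / sum(weights))
    return filled


x = [40, 41, 42, 43, 44]
y = [10, 20, None, 40, 60]
mean_like = idw_fill(x, y, power=0)      # all weights equal: overall mean
moderate = idw_fill(x, y, power=2)
nearest_like = idw_fill(x, y, power=50)  # only 41 and 43 matter in practice
```

With p = 0 the value at 42 is the overall mean of the four known values (32.5). With p = 2 the neighbours at distance 1 get four times the weight of those at distance 2 (31.0). With a very large p the result is effectively the mean of the two nearest values (30).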
mipolate summary
Nine methods:

Method            Linear extrapolation?
linear            yes
cubic             yes
(cubic) spline    yes
pchip             no
idw               no
forward           no
backward          no
nearest           no
groupwise         no

[Figure: maximum (°F), 7 May to 4 Jun, with spline, cubic, pchip and linear interpolants]

[Figure: minimum (°F), 7 May to 4 Jun, with linear, pchip, cubic and spline interpolants]

Simple messages
There are many interpolation methods to choose from.
They will often disagree, even for simple-looking instances. Disagreement gives a handle on uncertainty.
In a real problem, simulate missings and test how well known values are estimated.
What makes most sense in your problem will reflect its dependence structure.

We turn from a project that is done to one that is very much in progress.

Variograms
Variograms (more properly semivariograms) are plots of (mean) half squared difference between values versus separation, distance or lag.
By a tempting abuse of terminology, we often use the same name for the underlying relationship as a function.

First known use of term ‘variogram’
Geoffrey H. Jowett (1922– ) in 1955: The comparison of means of sets of observations from sections of independent stochastic series. Journal of the Royal Statistical Society. Series B (Methodological) 17: 208–227.

Spatial and time series
Variograms are central to one approach to spatial statistics, in this context often known as geostatistics. Georges Matheron (1930–2000) is most often mentioned here.
But variograms can be very useful for time series too.

Time series too
Variograms are prominent in these texts on time series and longitudinal data:
Diggle, P.J. 1990. Time Series: A Biostatistical Introduction. Oxford: Oxford University Press.
Diggle, P.J., Heagerty, P.J., Liang, K-Y. and Zeger, S.L. 2002. Analysis of Longitudinal Data. Oxford: Oxford University Press.
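The definition given earlier (mean half squared difference between values, set against lag) can be sketched in Python for an equally spaced series. This is a minimal illustration, not the code of variog or vgram: the function name semivariogram, the use of None for missing values, and the returned dictionary layout are assumptions made here.

```python
def semivariogram(z, max_lag):
    """Empirical semivariogram of an equally spaced series.

    Hypothetical helper. Returns {lag: (semi-variance, number of pairs)},
    skipping any pair in which either value is missing (None).
    """
    result = {}
    for h in range(1, max_lag + 1):
        squared = [
            (z[i] - z[i + h]) ** 2
            for i in range(len(z) - h)
            if z[i] is not None and z[i + h] is not None
        ]
        if squared:                          # at least one usable pair
            result[h] = (0.5 * sum(squared) / len(squared), len(squared))
    return result


series = [1, 2, 4, 7, 11]
gamma = semivariogram(series, max_lag=2)
gappy = semivariogram([1, None, 4, 7, 11], max_lag=1)
```

At lag 1 the differences are 1, 2, 3 and 4, so the mean squared difference is 7.5 and the semi-variance is 3.75 from 4 pairs. Returning the pair count alongside each value matters because higher lags and gappier series rest on fewer pairs.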
User-written programs
Programs (NJC) are available from SSC for
variograms in one dimension: variog (2005)
variograms in two dimensions: variog2 (2005)
A combined and extended program vgram is under development.

Generality of variograms
So, variograms are, without undue strain, defined for time series and for spatial series, whether regular or irregular, as they just depend on separation being measured.
Plotting the mean for each distinct separation is a common, but not compulsory, convention.

A simple example: webuse air2
[Figure: line plot of the series, 1949 to 1960]

Variograms
vgram air, recast(connected) xla(0(12)72)
vgram air
[Figure: two panels, one per command, each titled "Semi-variogram of Airline Passengers (1949-1960)", showing semi-variance against lag]

Comparison at different lags
We are plotting mean squared differences between values compared at lags 1, 2, 3, …
In this example, we have monthly data, so are comparing values 1, 2, 3, … months apart.
Many readers may be familiar with the same idea for calculating autocorrelation and cross-correlation.
The variogram, like the raw data plot, hints at a structure of trend plus seasonality.

Variograms of residuals, not data
Here, as elsewhere, it is a good idea to work with residuals, rather than the original data.
Time series modellers could have a happy time arguing which model was best for the airline data, but we just use a Poisson regression on time and look at its residuals.
On the versatility and virtuosity of Poisson regression, check out Gould, William. http://blog.stata.com/2011/08/22/use-poisson-rather-than-regress-tell-a-friend/

Sometimes, structure is this simple
[Figure: Poisson regression fit, air = exp(-224.1 + .11747 time), $R^2$ = 85.5%, n = 144, RMSE = 45.799; residuals from Poisson against time (in months); semi-variogram of response residual, semi-variance against lag]

A little more formally
The semivariogram γ(h) for response z is given by
$2\gamma(h) = A\{[z(i) - z(i + h)]^2\}$
where A{} denotes averaging over pairs of values at lag h.
As emphasised, using a mean is a convention. The fuller picture (literally!) is a plot of $[z(i) - z(i + h)]^2$ versus h. This is often known as a variogram cloud.
I borrow the notation A() from Whittle, P. 1970. Probability. Harmondsworth: Penguin.

Where does the 2 come from?
The units of the semivariogram are those of the response squared.
Adding the variance to the graph as a reference line underlines the connection.
A non-standard formula for the variance is, for any i, j,
$\tfrac{1}{2} E\{(z_i - z_j)^2\}$.
[Figure: semi-variogram of response residual, semi-variance against lag, with the variance as a reference line]

Back to vgram
vgram (not yet public) is already quite general. We take possibilities one by one.
o With just one argument, the response, it checks for a tsset or xtset time variable and uses it to define separations if found. Note that panel data are supported for free.
o With just one argument otherwise, the order of the observations is taken to define position in time or space.
o With two arguments, the second variable is taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged. Equal and unequal spacing can thus both be accommodated.
o With three arguments, the second and third variables are taken to define position. A width() option is required to specify the width of bins within which differences squared are averaged.
Distance is calculated from coordinates using Pythagoras’ theorem.

Why not just use autocorrelation?
Variograms are defined for a wider class of processes. Autocorrelation functions require weak stationarity; variograms are defined for processes with stationary increments.
Variograms are more flexible in the face of irregular spacing.
The very wide use of autocorrelation reflects custom and familiarity as well as intrinsic merit.

A further example
We look at rainfalls for 8 May 1986 (a single day) for 467 stations in Switzerland.

[Figure: rainfall 8 May 1986 (mm), legend classes -3.3, 3.3-9.9, 9.9-15.2, 15.2-26.3, 26.3-39.4, 39.4-, at percentile breaks 5, 25, 50, 75, 95%]

[Figure: semi-variogram of rainfall 8 May 1986 (mm); lags are 10 km bands]

How much information?
Optionally, vgram can save the semivariogram results to new variables.
Keeping track of the number of pairs used at each lag is important.
Here we exploit the feature that spikeplot can show frequencies on a square root scale.

[Figure: spike plot of frequencies against lag]

To do list
variogram clouds
model fitting (valid functional forms)
robust estimators
more flexible binning
spherical distances too
direction as well as lag
use for interpolation (and smoothing) (kriging, Gaussian process regression)

Variogram virtues
Defined for time and spatial series.
Defined for regular and irregular series.
Can help identify and check for structure.
… even if you have no interest in their most mentioned use, as a means towards the end of spatial interpolation.

This paper…
This paper fills a much needed gap in the literature.
See Jackson, A. 1997. Chinese acrobatics, an old-time brewery, and the “much needed gap”: The life of Mathematical Reviews. Notices of the American Mathematical Society 44: 330–337.

Acknowledgments
Historical portraits: Wikipedia.
MATLAB code for pchip: Moler, C. 2004. Numerical Computing with MATLAB. Philadelphia: SIAM. Chapter 3. http://www.mathworks.com/moler/interp.pdf
The Swiss rainfall data can be found here: http://www.aigeostats.org/pub/AI_GEOSTATS/AI_GEOSTATSData/sic97data_01.zip

Leo Breiman (1928–2005)
The main thing to learn about statistics is what is sensible and honest and possible.
Doubt and suspicion, as well as technical knowledge, are indispensable tools in statistics.
1973. Statistics: With a view towards applications. Boston: Houghton Mifflin, pp.1, 18.