Model Based Geostatistics

Archie Clements

University of Queensland

School of Population Health

Overview

• Introduction to geostatistics

– Assumptions

– Variogram components

– Variogram models

– Kriging

– Assumptions

• Model-based geostatistics

– Principles

– Building the model

– Prediction

– Validation

• Applications: parasitic disease control in Africa

Spatial variation

[Figure: variable Z plotted over spatial coordinates X and Y]

First and second order variation

• First-order variation:

– Trend

– Large-scale variation

– Can be due to large-scale environmental drivers (e.g. temperature for vector-borne diseases)

• Second-order variation:

– Localised variation: clustering

– Modelled using geostatistics

Spatial dependence

• Observations close in space are more similar than observations far apart

• The variance of pairs of observations that are close together (small h) tends to be smaller than the variance of pairs far apart (large h)

• Basis of the semivariogram

– Spatial decomposition of the sample variance

Semivariance: statistical notation

Semivariance is half the average squared difference of values observed at locations separated by a given distance (and direction)

Semivariance is a function of distance (and direction): distances are grouped into bins and directions into compass sectors (“azimuth”)
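The definition above can be sketched in a few lines of NumPy. This is a minimal omnidirectional illustration only: the function name `empirical_semivariogram`, the bin edges and the toy transect are assumptions made for the example, not part of the original analysis.

```python
import numpy as np

def empirical_semivariogram(coords, values, bin_edges):
    """Half the mean squared difference of value pairs, grouped by distance bin."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    # Indices of every unordered pair of locations (i < j).
    i, j = np.triu_indices(len(values), k=1)
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sq_diff = (values[i] - values[j]) ** 2
    n_bins = len(bin_edges) - 1
    gamma = np.full(n_bins, np.nan)      # NaN where a bin holds no pairs
    counts = np.zeros(n_bins, dtype=int)
    for b in range(n_bins):
        in_bin = (dist > bin_edges[b]) & (dist <= bin_edges[b + 1])
        counts[b] = in_bin.sum()
        if counts[b] > 0:
            gamma[b] = 0.5 * sq_diff[in_bin].mean()
    return gamma, counts

# Hypothetical toy data: four points on a transect, one unit apart.
coords = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
values = np.array([1.0, 2.0, 4.0, 7.0])
gamma, counts = empirical_semivariogram(coords, values, np.array([0.0, 1.5, 2.5, 3.5]))
```

A directional version would additionally keep only pairs whose azimuth falls inside a chosen compass sector.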

Modelling spatial correlation: semivariogram

[Figure: semivariogram plotted against lag (h), annotated with the nugget, partial sill and sill]

Nugget

• Random variation (white noise); nonspatial measurement error

• Microvariation (spatial variation at a scale smaller than the smallest bin)

• If no spatial correlation:

– Nugget = sill (flat semivariogram)

Semivariogram: decisions to be made

• How many/what sized bins?

– Depends on density of data points

– For regularly spaced (grid-sampled) data, bin size = size of the grid cells

– For irregular sampling, modify according to the range of spatial correlation (big range, big bins; small range, small bins)

• What maximum lag (h) to use?

– The semivariogram should be estimated up to half the length of the shortest side of the study area

• Which parametric model to use?

– Visual fit

– Statistical fit
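To make the "statistical fit" option concrete, the sketch below writes two common parametric models (exponential and spherical) in terms of nugget, partial sill and a range parameter, and compares them by sum of squared differences from the empirical semivariances. The function names, the synthetic lags and the parameter values are illustrative assumptions; in practice the parameters would be estimated, for example by weighted least squares.

```python
import numpy as np

def exponential_model(h, nugget, partial_sill, range_param):
    """Exponential semivariogram; approaches the sill (nugget + partial sill) asymptotically."""
    h = np.asarray(h, dtype=float)
    return nugget + partial_sill * (1.0 - np.exp(-h / range_param))

def spherical_model(h, nugget, partial_sill, range_param):
    """Spherical semivariogram; reaches the sill exactly at h = range_param."""
    h = np.asarray(h, dtype=float)
    r = np.minimum(h / range_param, 1.0)
    return nugget + partial_sill * (1.5 * r - 0.5 * r ** 3)

def sum_of_squares(model, lags, gamma_hat, params):
    """Unweighted sum of squared differences between empirical and modelled semivariance."""
    return float(np.sum((gamma_hat - model(lags, *params)) ** 2))

# Synthetic "empirical" semivariances generated from an exponential model (hypothetical values).
lags = np.array([5.0, 10.0, 20.0, 40.0, 80.0])
params = (0.02, 0.05, 15.0)  # nugget, partial sill, range parameter
gamma_hat = exponential_model(lags, *params)

sse_exponential = sum_of_squares(exponential_model, lags, gamma_hat, params)
sse_spherical = sum_of_squares(spherical_model, lags, gamma_hat, params)
```

Here the exponential model fits exactly by construction, so the smaller sum of squares correctly points to it.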

Variogram models

Schistosoma mansoni, Uganda

Omnidirectional semivariograms

Anisotropy

• Spatial dependence is different in different directions

– Semivariogram calculated in one direction is different from semivariogram calculated in another direction

– Should check for anisotropy and, if present, accommodate it in interpolation

– Range or sill (or both) can differ

Schistosoma mansoni, Uganda: directional semivariograms

Direction          Range (km)   Sill    Nugget
Omnidirectional    43.4         7E-2    4E-2
(label missing)    39.4         1E-1    -3E-3
45˚                43.6         7E-2    2E-2
90˚                35.8         8E-2    3E-2
135˚               39.5         1E-1    2E-2

Schistosoma haematobium, Northwestern Tanzania

Direction          Range (km)   Sill    Nugget
Omnidirectional    36.0         5E-2    0
(label missing)    260.1        2E-2    3E-2
45˚                163.9        6E-3    3E-2
90˚                56.2         5E-2    0
135˚               97.7         3E-2    7E-3

Schistosoma haematobium, Northwestern Tanzania

Trended and skewed data

• Data should be de-trended

– Polynomials (regression on XY coordinates)

– Generalised linear models (regression on covariates)

– Generalised additive models (can over-fit)

– If directional variograms are calculated and the range in one direction is more than 3 times the range in the perpendicular direction, this is a sign of trend

• If skewed, consider transformation (e.g. log transformation, normal score transformation)

– Otherwise, extreme values overly influence interpolated map

– Have to back-transform interpolated values

– Called “disjunctive Kriging”
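As an illustration of the first de-trending option (a polynomial on the XY coordinates), the sketch below fits a first-order trend surface by least squares and keeps the residuals for variography. It also shows a log transformation and its back-transformation. The function name, the simulated coordinates and the coefficients are assumptions for the example; `log1p` is used here only so that zero values can be handled.

```python
import numpy as np

def fit_linear_trend(coords, values):
    """Least-squares fit of values = b0 + b1*x + b2*y; returns coefficients and residuals."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    design = np.column_stack([np.ones(len(coords)), coords[:, 0], coords[:, 1]])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    residuals = values - design @ coef  # de-trended values used for the semivariogram
    return coef, residuals

# Hypothetical data: a pure first-order trend with no noise.
rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 100.0, size=(50, 2))
values = 2.0 + 0.03 * coords[:, 0] - 0.01 * coords[:, 1]
coef, residuals = fit_linear_trend(coords, values)

# Skewed data: transform before interpolation, back-transform afterwards.
skewed = np.array([0.0, 1.0, 10.0, 100.0])
transformed = np.log1p(skewed)          # log(1 + x)
back_transformed = np.expm1(transformed)
```

Higher-order polynomials only need extra columns (x², xy, y²) in the design matrix.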

Non-stationarity

• Spatial correlation structure cannot be generalised to the whole study area

• Why does it occur?

– Different factors may operate in different parts of the study area

– Different ecological zones with different disease epidemiology

• Need to estimate the spatial correlation structure separately in each homogeneous zone

Kriging

\[ \hat{Z}(s_0) = \sum_{i=1}^{N} \lambda_i Z(s_i) \]

• Z(s_i) is the measured value at the ith location

• λ_i is the weight attributed to the measured value at the ith location (calculated using the semivariogram)

• s_0 is the prediction location

• N is the number of measured values

For formulae on how the weights are estimated using the variogram: http://en.wikipedia.org/wiki/Kriging

Prediction standard error/variance gives an indication of precision of the prediction
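A minimal sketch of how the weights and the prediction variance can be obtained, assuming ordinary kriging (unknown constant mean, weights constrained to sum to one) and an exponential semivariogram. The function names and the toy data are assumptions for illustration; other kriging variants lead to different systems of equations.

```python
import numpy as np

def exp_semivariogram(h, nugget, partial_sill, range_param):
    """Exponential semivariogram with gamma(0) = 0 by definition."""
    h = np.asarray(h, dtype=float)
    gamma = nugget + partial_sill * (1.0 - np.exp(-h / range_param))
    return np.where(h > 0, gamma, 0.0)

def ordinary_kriging(coords, values, target, semivariogram):
    """Solve the ordinary kriging system for one prediction location."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    pair_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    target_dist = np.linalg.norm(coords - np.asarray(target, dtype=float), axis=1)
    # Semivariances between data points, bordered by the sum-to-one constraint.
    system = np.ones((n + 1, n + 1))
    system[:n, :n] = semivariogram(pair_dist)
    system[n, n] = 0.0
    rhs = np.append(semivariogram(target_dist), 1.0)
    solution = np.linalg.solve(system, rhs)
    weights, lagrange = solution[:n], solution[n]
    prediction = float(weights @ values)
    variance = float(weights @ rhs[:n] + lagrange)  # prediction (kriging) variance
    return prediction, variance, weights

def model(h):
    # Hypothetical parameters: no nugget, sill 1, range parameter 10.
    return exp_semivariogram(h, 0.0, 1.0, 10.0)

coords = np.array([[0.0, 0.0], [2.0, 0.0]])
values = np.array([1.0, 3.0])
prediction, variance, weights = ordinary_kriging(coords, values, [1.0, 0.0], model)
at_datum, var_at_datum, _ = ordinary_kriging(coords, values, [0.0, 0.0], model)
```

With two points and a target midway between them, the weights are equal by symmetry; predicting at a sampled location returns the observed value with zero variance.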

Geostatistics summary

• Geostatistics involves 3 steps:

– Exploratory data analysis

– Definition of a variogram

– Using the variogram for interpolation (Kriging)

• Technique applicable for:

– Point-referenced data

– Spatially continuous processes:

• Disease risk

• Rainfall, elevation, temperature, other climate variables

• Wildlife, vegetation, geology (mineral deposits)

Bayesian model-based geostatistics

Seminal paper:

Diggle, Tawn and Moyeed (1998). Model-based geostatistics. Appl. Stat. 47(3): 299-350

Observed a need for addressing non-Gaussian observational error

Idea is “to embed linear Kriging methodology within a more general distributional framework”

Generalised linear models with an unobserved Gaussian process in the linear predictor

Implemented in a Bayesian framework

Advantages of the Bayesian approach

• Natural framework for incorporation of parameter uncertainty into spatial prediction

– Can build uncertainty into parameters using priors

• Non-informative

• Informative (based on exploratory analysis, additional sources of information)

• Convenient for modelling hierarchical data structures

Bayesian model-based geostatistics

Predictions

• Can predict at specified validation locations (with observed outcomes for comparison)

• Can predict at non-sampled locations, e.g. a prediction grid

• Might be interested in

– Outcome

– Spatial random effect

– Standard error of predicted outcome

Validation

• Jack-knifing (leave-one-out); each observation is removed in turn and then returned to the dataset

– Remove one observation, do prediction at that location and store predicted value

– Repeat for all observations

– Compare predicted to observed using statistical measures of fit (RMSE) and discriminatory performance (AUC)

– Not feasible with MBG other than with very small datasets (the model must be refitted for every observation)

• Cross-validation; sampling without replacement

– Set aside a subset for validation (ideally 50%)

– Use remaining data to “train” model

– Compare predicted and observed for the validation subset using statistical measures

– Can then recombine the validation and training subsets for final model build

• External validation: using other prospective or retrospective dataset
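The comparison step can be sketched as follows: a random split into training and validation subsets, RMSE for fit, and AUC for how well predictions discriminate locations above a threshold from those below it. All names, the 50% threshold and the toy prevalences are assumptions for the example; fitting the model itself is not shown.

```python
import numpy as np

def holdout_split(n_obs, validation_fraction, rng):
    """Randomly split indices 0..n_obs-1 into training and validation sets (no overlap)."""
    order = rng.permutation(n_obs)
    n_val = int(round(validation_fraction * n_obs))
    return order[n_val:], order[:n_val]

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))

def auc(scores, is_positive):
    """Probability that a random positive scores higher than a random negative (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    is_positive = np.asarray(is_positive, dtype=bool)
    diff = scores[is_positive][:, None] - scores[~is_positive][None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

rng = np.random.default_rng(42)
train_idx, val_idx = holdout_split(10, 0.5, rng)

# Hypothetical observed and predicted prevalence at four validation schools.
observed = np.array([0.05, 0.20, 0.60, 0.70])
predicted = np.array([0.10, 0.15, 0.50, 0.80])
fit_error = rmse(observed, predicted)
discrimination = auc(predicted, observed > 0.5)  # can predictions separate schools above 50%?
```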

Model-based geostatistics summary

Model-based geostatistics involves:

1. Visual and exploratory data analysis

2. Variography (to determine if there is second-order spatial variation)

3. Variable selection (for deterministic component)

4. Building model (e.g. in WinBUGS)

5. Model selection (e.g. using DIC)

6. Prediction and validation

Application: Schistosomiasis in Sub-Saharan Africa

Schistosomiasis

 779 million people at risk

 207 million infected

 Most in Africa

 Significant illness and mortality Two main forms in Africa:

 Urinary schistosomiasis caused by Schistosoma haematobium

 Intestinal schistosomiasis caused by S. mansoni

Life cycle of Schistosoma haematobium

[Figure: life cycle stages: adult worm in human bladder wall; eggs in urine; miracidia; sporocysts in snail; cercariae released]

Diagnosis of infection

• S. haematobium:

– Microscopic examination of urine slides: presence of eggs and egg counts

– Macrohaematuria (visible blood)

– Microhaematuria (invisible blood), tested using chemical reagent strips

– Blood in urine questionnaire

• S. mansoni and soil-transmitted helminths:

– Microscopic examination of stool samples

School-based control programmes

• School-aged children have the highest prevalence (proportion infected) and intensity (severity) of infection

• Education system is convenient for control; central location to access target population

• World Health Organisation guidelines: treat communities every two years where prevalence in school-age children is >10% and annually where prevalence is >50%

How do we determine which schools should be targeted?

• No surveillance

• Need to do surveys

Field survey: northwest Tanzania

[Map: surveyed schools; Lake Victoria labelled]

• 153 schools surveyed

• 60 children per school

• What about non-sampled locations? Need to predict (interpolate) values

MBG model for S. haematobium prevalence

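\[ Y_i \sim \mathrm{Binomial}(n_i, p_i) \]

\[ \mathrm{logit}(p_i) = \alpha + \beta_1\,\mathrm{rain}_i + \beta_2\,\mathrm{LST1}_i + \beta_3\,\mathrm{LST2}_i + \theta_i \]

\[ f(d_{ij}; \phi) = \exp(-\phi\, d_{ij}) \]

θ_i is the spatial random effect, with correlation function f; d_ij is the distance between locations i and j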
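To show how the pieces of this model fit together, the sketch below simulates survey data from a binomial model with a logit link and a Gaussian spatial random effect whose correlation decays as exp(−φ d). It is a forward simulation only, not the Bayesian fitting procedure. All parameter values, the indicator covariates and the function names are hypothetical; the variance parameter `sill` is an assumption added to scale the random effect.

```python
import numpy as np

def exponential_correlation(coords, phi):
    """Correlation matrix with entries exp(-phi * d_ij)."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return np.exp(-phi * dist)

def simulate_prevalence_surveys(coords, covariates, alpha, beta, phi, sill, n_children, rng):
    """Draw a spatial random effect, then binomial counts of infected children per school."""
    corr = exponential_correlation(coords, phi)
    theta = rng.multivariate_normal(np.zeros(len(coords)), sill * corr)
    linear_predictor = alpha + covariates @ beta + theta
    p = 1.0 / (1.0 + np.exp(-linear_predictor))  # inverse logit
    y = rng.binomial(n_children, p)
    return y, p, theta

rng = np.random.default_rng(7)
n_schools = 20
coords = rng.uniform(0.0, 100.0, size=(n_schools, 2))        # hypothetical locations (km)
covariates = rng.integers(0, 2, size=(n_schools, 3)).astype(float)  # rain, LST1, LST2 indicators
beta = np.array([-1.0, 0.4, 0.3])                             # hypothetical coefficients
corr = exponential_correlation(coords, phi=0.05)
y, p, theta = simulate_prevalence_surveys(
    coords, covariates, alpha=-0.5, beta=beta, phi=0.05, sill=1.0, n_children=60, rng=rng
)
```

Larger values of φ make the correlation die away over shorter distances, so neighbouring schools look less alike.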

S. haematobium model results

Variable            Coefficient           Odds Ratio
Intercept           1.9 (-2.3 - 10.3)
LST >35-39C         0.4 (-0.3 - 1.1)      1.5 (0.8 - 2.9)
LST >39C            0.3 (-1.5 - 2.2)      1.4 (0.2 - 8.6)
Rainfall >1050mm    -1.1 (-3.4 - 1.1)     0.3 (3.3 x 10^-2 - 3.1)
κ                   0.9 (0.6 - 1.3)
φ                   0.2 (0.1 - 1.0)

Clements et al. TMIH 2006
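Because the model uses a logit link, each odds ratio is the exponential of its coefficient. The short check below applies this to the rounded coefficients reported in the results; small differences from the tabulated odds ratios are expected, since those were presumably computed from unrounded posterior values (an assumption).

```python
import math

# Rounded coefficients as reported in the results table.
coefficients = {
    "LST >35-39C": 0.4,
    "LST >39C": 0.3,
    "Rainfall >1050mm": -1.1,
}

# Odds ratio = exp(coefficient) on the logit scale.
odds_ratios = {name: math.exp(value) for name, value in coefficients.items()}

for name, odds_ratio in odds_ratios.items():
    print(f"{name}: OR = {odds_ratio:.2f}")
```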

Uncertainty

Lower bound: 95% PI

Upper bound: 95% PI

Co-ordinated surveys in 3 contiguous countries

• 418 schools

• >26,000 children

Probability that prevalence is >50%

Clements et al. EID 2008
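Mapped quantities such as the 95% PI bounds and the probability that prevalence is >50% can be computed directly from posterior samples at each prediction location. The sketch below assumes an array of posterior draws of prevalence (rows are draws, columns are locations); the Beta-distributed draws stand in for real MCMC output and are purely hypothetical.

```python
import numpy as np

def summarise_posterior(samples, threshold=0.5):
    """Posterior mean, 95% interval bounds and exceedance probability per location."""
    samples = np.asarray(samples, dtype=float)
    lower, upper = np.percentile(samples, [2.5, 97.5], axis=0)
    return {
        "mean": samples.mean(axis=0),
        "lower": lower,
        "upper": upper,
        # Share of draws above the threshold estimates P(prevalence > threshold).
        "prob_exceeds": (samples > threshold).mean(axis=0),
    }

rng = np.random.default_rng(1)
# Hypothetical draws at two locations: one low-prevalence, one high-prevalence.
draws = np.column_stack([rng.beta(2, 18, 4000), rng.beta(12, 8, 4000)])
summary = summarise_posterior(draws, threshold=0.5)
```

A location can have a posterior mean above 50% and still carry a substantial probability of being below it, which is why the exceedance probability is mapped separately.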

Variable                                        Mean (95% CI)        SD
Sex: Female                                     0.70 (0.65, 0.76)    0.03
Age: 9–10 years                                 1.16 (1.00, 1.33)    0.08
Age: 11–12 years                                1.51 (1.31, 1.73)    0.10
Age: 13–16 years                                1.79 (1.53, 2.06)    0.14
Distance to perennial water body                0.34 (0.21, 0.54)    0.08
Land surface temperature                        0.80 (0.51, 1.21)    0.18
Land surface temperature 2                      1.10 (0.85, 1.40)    0.14
Rate of decay of spatial correlation            2.03 (1.48, 2.74)    0.32
Variance of the spatial random effect (sill)    7.03 (5.36, 9.31)    1.01

Other outcomes: co-infection

East Africa: Brooker and Clements, Int. J. Parasitol., in press

S. mansoni mono-infection: 7.9%

Hookworm mono-infection: 40.5%

Co-infection: 8.1%

[Figure: chart of S. mansoni mono-infection, hookworm mono-infection and co-infection; axis: % infected]

[Map: survey locations in Uganda, Kenya and Tanzania around Lake Albert and Lake Victoria, symbolised by infection status (no infection, S. mansoni monoinfection, hookworm monoinfection, coinfection), with country borders and large water bodies; scale bar in kilometers]

Model for co-infection

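\[ Y_{ijk} \sim \mathrm{Multinomial}(p_{ijk}, n_{ijk}), \qquad p_{ijk} = \frac{\eta_{ijk}}{\sum_{k} \eta_{ijk}} \]

\[ \log(\eta_{ijk}) = \alpha_k + \sum_{N=1}^{T} \beta_{N,k}\, x_{N,ijk} + \theta_{ik} \]

\[ f(d_{ij}; \phi) = \exp(-\phi\, d_{ij}) \]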
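In a multinomial model of this kind, category probabilities are obtained by exponentiating each category's linear predictor and normalising so they sum to one. The sketch below illustrates that step only. It assumes "no infection" is the reference category with its linear predictor fixed at 0, and uses just the three posterior mean intercepts (covariates at their reference levels, spatial effect set to zero), so the numbers are not model predictions.

```python
import numpy as np

def category_probabilities(linear_predictors):
    """Normalise exponentiated linear predictors into probabilities (numerically stable)."""
    linear_predictors = np.asarray(linear_predictors, dtype=float)
    shifted = linear_predictors - linear_predictors.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

categories = ["No infection", "S. mansoni monoinfection", "Hookworm monoinfection", "Co-infection"]
# Reference category fixed at 0 (assumption); remaining values are the reported intercepts.
intercepts = np.array([0.0, -3.8, -0.6, -4.4])
probabilities = category_probabilities(intercepts)

for name, prob in zip(categories, probabilities):
    print(f"{name}: {prob:.3f}")
```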

Posterior mean (95% posterior CI)

Variable                   S. mansoni monoinfection   Hookworm monoinfection   S. mansoni/hookworm co-infection
Intercept                  -3.8 (-4.7 - -2.9)         -0.6 (-1.1 - -0.3)       -4.4 (-5.0 - -3.7)
OR: Elevation              0.35 (0.22 - 0.58)         0.77 (0.65 - 0.89)       0.30 (0.20 - 0.47)
OR: DPWB                   0.23 (0.10 - 0.45)         0.94 (0.76 - 1.15)       0.30 (0.18 - 0.58)
OR: Rural vs urban         0.43 (0.21 - 0.79)         0.98 (0.68 - 1.37)       0.61 (0.36 - 1.02)
OR: Ext. rural vs urban    0.62 (0.23 - 1.44)         1.16 (0.82 - 1.81)       0.75 (0.31 - 1.62)
OR: LST                    0.88 (0.62 - 1.25)         0.60 (0.50 - 0.72)       0.57 (0.31 - 0.87)
OR: Female                 0.86 (0.76 - 0.96)         0.91 (0.86 - 0.97)       0.70 (0.63 - 0.77)
OR: Age (9-10 years)       1.67 (1.37 - 2.06)         1.17 (1.04 - 1.30)       1.82 (1.52 - 2.21)
OR: Age (11-13 years)      2.44 (2.06 - 2.89)         1.55 (1.39 - 1.71)       2.99 (2.55 - 3.52)
OR: Age (≥14 years)        2.87 (2.19 - 3.71)         1.88 (1.63 - 2.14)       3.83 (3.01 - 4.86)
Phi (rate of decay)        3.52 (1.73 - 7.21)         4.98 (3.38 - 7.33)       3.76 (2.10 - 7.36)
Sill                       6.39 (3.52 - 11.78)        1.31 (0.98 - 1.76)       6.34 (3.98 - 9.95)

Co-infection

[Maps: S. mansoni monoinfection; hookworm monoinfection; S. mansoni-hookworm coinfection]

Other outcomes: Intensity of infection

• Prevalence is used (currently) for disease control planning

• Intensity of infection (eggs per ml of urine or per g of faeces) is more indicative of:

– Morbidity (anaemia, urinary tract and hepatic pathology)

– Transmission

Model for intensity of infection

\[ Y_{ij} \sim \mathrm{NegBin}(\mu_{ij}) \]

\[ \log(\mu_{ij}) = \alpha + \beta_1\,\mathrm{girl}_{ij} + \beta_2\,\mathrm{dist}_j + \beta_3\,\mathrm{elev}_j + \theta_j \]

\[ f(d_{jk}; \phi) = \exp(-\phi\, d_{jk}) \]

Intensity of S. mansoni infection, East Africa

Clements et al. Parasitol 2006

Variable          Posterior Mean (95% CI)
Intercept         10.06 (5.77 - 13.22)
Female            -0.41 (-0.72 - -0.11)
Elevation (m)     -0.007 (-0.01 - -0.004)
DPWB (dec deg)    -5.36 (-7.51 - -3.30)
Sill              23.96 (19.06 - 32.07)
Range             0.134 (0.09 - 0.20)
Overdispersion    0.06 (0.058 - 0.062)


Conclusions

• In disease control we need an evidence-based framework for deciding where to allocate limited control resources

• Maps are useful tools for highlighting sub-national variation; targeting interventions; advocacy (national and local); integrated control programmes; estimating heterogeneities in disease burden

• Model-based geostatistics enables rich inference from spatial data; uncertainty