Stat 406 – Spring 2016
Exam 2 – Answers
Short answer questions:
1) “New” test for spatial correlation in areal data
a) (4 pts) p = 0.0952 = 2/21.
Notes: You are given 21 values (observed statistic and 20 simulated ones). 1 of the 20 simulated
values is more extreme (larger than 5.82), so P[as or more extreme] = (1+1)/21
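The counting rule in the notes can be sketched in a few lines. The simulated values below are hypothetical placeholders (the exam's 20 simulated statistics are not reproduced here); they are chosen only so that exactly one of them exceeds the observed 5.82. The function name is likewise an illustration, not part of the exam.

```python
def monte_carlo_p_value(observed, simulated):
    """One-sided Monte Carlo p-value: (1 + #{simulated >= observed}) / (1 + n_sim).

    The observed statistic is counted as one of the n_sim + 1 values.
    """
    n_extreme = sum(1 for s in simulated if s >= observed)
    return (1 + n_extreme) / (1 + len(simulated))


observed = 5.82
# Hypothetical simulated statistics (20 values, exactly one larger than 5.82).
simulated = [3.2, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2,
             4.3, 4.4, 4.5, 4.6, 4.8, 5.0, 5.2, 5.4, 5.6, 6.0]

p = monte_carlo_p_value(observed, simulated)
print(round(p, 4))  # 0.0952, i.e. 2/21
```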
b) (4 pts). No, because the distribution of the simulated test statistics is far from normal.
Notes: When talking about the p-value for a test, the normal approximation refers to the distribution
of the test statistic, i.e., computing the p-value assuming the test statistic has a normal distribution.
You do not need the raw data or any information about the observations to answer this question. In
most data analyses, all you have is the raw data, so the various rules of thumb exist to give insight into
the unobserved distribution of the test statistic from what you have (the data). When you resample, you
have the distribution of the test statistic itself (or an approximation to that distribution). These 20
simulated test statistics are far from normal, which you can see using a histogram, a normal QQ plot, or
various other diagnostics.
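One of the "various other diagnostics" is the correlation between the sorted values and the matching normal quantiles, which is what a normal QQ plot shows graphically. The sketch below uses only the standard library. Both samples are synthetic illustrations (not the exam's simulated statistics), and the plotting positions (i - 0.5)/n are one common convention among several.

```python
import math
from statistics import NormalDist, mean


def normal_qq_points(values):
    """Return (theoretical quantiles, sorted sample) for a normal QQ plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    # Plotting positions (i - 0.5) / n: an assumed, commonly used convention.
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return theoretical, ordered


def qq_correlation(values):
    """Pearson correlation of the QQ-plot points; near 1 suggests normality."""
    theoretical, ordered = normal_qq_points(values)
    mt, ms = mean(theoretical), mean(ordered)
    cov = sum((t - mt) * (s - ms) for t, s in zip(theoretical, ordered))
    var_t = sum((t - mt) ** 2 for t in theoretical)
    var_s = sum((s - ms) ** 2 for s in ordered)
    return cov / math.sqrt(var_t * var_s)


# Synthetic samples of size 20, for illustration only.
z = [NormalDist().inv_cdf((i - 0.5) / 20) for i in range(1, 21)]
roughly_normal = [4.5 + 0.5 * q for q in z]    # exactly linear in the normal quantiles
right_skewed = [3.0 + math.exp(q) for q in z]  # long right tail

print(qq_correlation(roughly_normal))  # essentially 1
print(qq_correlation(right_skewed))    # clearly below 1
```

A QQ correlation well below 1 plays the same role as a visibly curved QQ plot: it indicates that a normal approximation to the test statistic's distribution is questionable.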
[Figures: histogram of the test statistic under H0 (y-axis: frequency) and normal QQ plot (x-axis: theoretical quantiles; y-axis: sample quantiles).]
2) Spatial correlation in an ecological response
a) (4 pts) No evidence of spatial correlation.
Notes: The estimate is close to zero and the p-value is very large.
I took off 1 point if your conclusion was “no spatial correlation”. That’s because “no evidence of
correlation” and “no correlation” mean quite different things.
b) (4 pts) The magnitude and direction of the spatial correlation vary across the study region. It is
positive, and often significantly so, at the W and E edges, and negative in the center.
c) (4 pts) Observation 1 has a much larger sampling variance, so it is smoothed more than observation 2.
Notes: Although we talked about this concept and method in the context of spatial data, there is
actually nothing spatial done in this method to smooth areal data. The only information needed to
compute a smoothed value is the raw value and its variance.
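To make the role of the sampling variance concrete, here is a minimal sketch of one common shrinkage form, in which each raw value is pulled toward an overall mean with a weight that depends on its variance. The exam does not state the formula used, so the weight formula, the function name, and all numbers below are assumptions for illustration.

```python
def shrink(raw, sampling_var, overall_mean, between_var):
    """Shrink a raw value toward the overall mean.

    weight = between_var / (between_var + sampling_var), so a larger
    sampling variance gives a smaller weight on the raw value, i.e.
    more smoothing. (Assumed form; not taken from the exam.)
    """
    weight = between_var / (between_var + sampling_var)
    return overall_mean + weight * (raw - overall_mean)


# Hypothetical numbers: same raw value, very different sampling variances.
overall_mean, between_var = 6.0, 4.0
obs1 = shrink(raw=10.0, sampling_var=16.0, overall_mean=overall_mean, between_var=between_var)
obs2 = shrink(raw=10.0, sampling_var=1.0, overall_mean=overall_mean, between_var=between_var)

print(obs1, obs2)  # about 6.8 and 9.2: observation 1 moves much closer to the mean
```

Note that nothing in the calculation uses locations or neighbors, which is the point made in the notes.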
3) (3 pts) E(x) = πx², or E(x) = λπx² if you took the “average number of points …” literally
(the question should have said “proportional to average number of points …”). The derivation is similar to
the derivation of K(x). The average number of points in an area A is λA. The area of the specified circle
is πx², so the average number of points around an arbitrary point is λπx², and the analog of the K(x)
function is that divided by λ, i.e. πx².
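The steps of that derivation, written out as a chain of equalities (the wording of the labels is mine; the symbols are those used in the answer):

```latex
\begin{align*}
\mathbb{E}[\text{number of points in a region of area } A] &= \lambda A \\
A &= \pi x^{2} \quad \text{(circle of radius } x\text{)} \\
\mathbb{E}[\text{number of points within distance } x \text{ of an arbitrary point}] &= \lambda \pi x^{2} \\
E(x) = \frac{\lambda \pi x^{2}}{\lambda} &= \pi x^{2}
\end{align*}
```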
4) Interpretation of g(x) and L(x) plots
a) (3 pts) The plot of L2(x) shows evidence of clustering.
b) (4 pts) The plot of g(x) shows apparent clustering (estimated g(x) is above 1 for most distances up to
0.5). Individually, those deviations are within the range expected from random variation. But it is
uncommon for g(x) to be larger than 1 at all of those distances. Hence, the cumulative curve does exceed
its random variation.
Notes: the fact that L(x) or L2(x) is variance stabilized is not relevant; you could plot log g(x), which is a
variance stabilized transformation of g(x), and the observed curve would still fall within the envelope.
5) Data analysis problem
a) Produce a map of the estimated intensity of continuous corn farms throughout the study area.
[Figure: point process likelihood (axis range about -740 to -620).]
I used kernel density smoothing to estimate the intensity on a fine grid.
I chose a bandwidth of 4 by reducing the point process likelihood estimate of the bandwidth a little bit;
other values are fine. The map is more or less similar.
[Figure: map of the estimated intensity; one plot in the original is labeled "Diggle".]
Note: the dots on the plot are the locations of the CAFOs.
b) Are continuous corn farms clustered, segregated, or randomly located throughout the study area?
They are clustered. L(x) is above the pointwise simulation envelopes from ca 0.8km to 5km. g(x) is
above the pointwise simulation envelopes at most distances from ca 0.5km to 5km. The
Diggle-Cressie-Loosmore-Ford (integral) test on the interval (0, 5) gives p = 0.005 or smaller.
There are various tools you could use here,
and different reasons for choosing one or another. Any reasonable answers were accepted.
[Figures: log g(r) plotted against r, and L(r) plotted against r.]
Notes: I was looking for three things: an interpretation of the observed curve, the envelope or some
other info on pointwise variation, and an overall test.
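The "overall test" can be sketched as follows, in the spirit of the Diggle-Cressie-Loosmore-Ford integral test: summarize each curve by its integrated squared deviation from a reference curve, then rank the observed summary among the simulated ones. This is a simplified illustration, not the routine used for the exam: the reference is taken as the mean of the simulated curves, the integral is a plain Riemann sum, and the curves, the noise level, and the choice of 199 simulations are all made up.

```python
import random


def integral_test_p_value(observed, simulated, dr):
    """Monte Carlo p-value based on integrated squared deviation.

    observed  : list of curve values on a grid of distances
    simulated : list of curves (same grid) simulated under the null
    dr        : grid spacing
    """
    n_r, n_sim = len(observed), len(simulated)
    # Reference curve: pointwise mean of the simulated curves (a simplification).
    reference = [sum(curve[j] for curve in simulated) / n_sim for j in range(n_r)]

    def integrated_sq_dev(curve):
        return sum((curve[j] - reference[j]) ** 2 for j in range(n_r)) * dr

    u_obs = integrated_sq_dev(observed)
    n_extreme = sum(1 for curve in simulated if integrated_sq_dev(curve) >= u_obs)
    return (1 + n_extreme) / (1 + n_sim)


# Toy example on the interval (0, 5): all numbers here are invented.
rng = random.Random(0)
dr = 0.1
grid = [dr * (j + 1) for j in range(50)]
simulated = [[rng.uniform(-0.1, 0.1) for _ in grid] for _ in range(199)]
observed = [0.5 for _ in grid]  # consistently above the null curves

p_value = integral_test_p_value(observed, simulated, dr)
print(p_value)  # 0.005, the smallest value possible with 199 simulations
```

Because the test uses the whole curve at once, it avoids the multiple-comparison problem of reading a pointwise envelope at many distances.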
c) If they are clustered: estimate characteristics of the clusters.



There are an estimated 68 clusters, of radius 4.3km, each containing an average of 2.8 farms.
I fit a Matern cluster process from 0km to 5km. The estimated parameters are k = 0.0344, R =
4.29, and mu = 2.78. The estimated number of clusters is obtained as Area * intensity =
(100*20)*0.0344.
Choices include the type of cluster process (Matern or Thomas) and the range of distances to fit.
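The arithmetic behind those cluster characteristics can be checked directly. The sketch below uses the rounded parameter estimates quoted above; with the rounded k the product is about 68.8, so the reported 68 presumably reflects rounding somewhere in the calculation. The overall intensity k * mu is the standard result for a Matern cluster process (cluster intensity times mean points per cluster); the variable names are mine.

```python
# Rounded estimates quoted in the answer.
area_km2 = 100 * 20   # study area, km^2
k = 0.0344            # intensity of cluster centers, per km^2
R = 4.29              # cluster radius, km
mu = 2.78             # mean number of farms per cluster

expected_clusters = area_km2 * k            # about 68.8
overall_intensity = k * mu                  # farms per km^2 implied by the fit
expected_farms = area_km2 * overall_intensity

print(f"clusters: {expected_clusters:.1f}")
print(f"farms per km^2: {overall_intensity:.4f}")
print(f"implied number of farms: {expected_farms:.0f}")
```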
[Figure: K(r) plotted against r.]
Note: To see whether this fit is anywhere near reasonable, you could plot the fitted and observed L2(x)
functions. The fit looks a little off at short distances (< 2km), but an integral test of lack of fit is far
from significant (p = 0.67).
d) Association between distance to the nearest CAFO and log intensity.



Strong association (p < 0.0001). Estimated coefficients are intercept: -1.30 (or -1.297), slope = -0.153.
Each additional km of distance from a CAFO decreases the number of continuous corn farms per km² by 14%.
Notes: I was looking for a statement like my last sentence, which describes how intensity
changes. Saying it decreases is better than nothing, but I was hoping for an amount.
I fit an inhomogeneous Poisson process model with log intensity specified as log λ = β0 + β1 X,
where X is distance to the nearest CAFO. I estimated the effect of distance from a CAFO as exp(β1)
and expressed it as a percent drop, 100*(1 - exp(β1)).
Not too many choices to make here.
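The conversion from the slope to a percent drop is a one-line calculation. The sketch below assumes the slope is negative (β1 = -0.153), which is what a 14% decrease per km implies; the variable names are mine.

```python
import math

beta0 = -1.297   # intercept of log intensity
beta1 = -0.153   # slope per km of distance to nearest CAFO (assumed negative)

# Multiplicative change in intensity for each additional km.
ratio_per_km = math.exp(beta1)
percent_drop = 100 * (1 - ratio_per_km)

# Fitted intensity (farms per km^2) at a few distances.
for distance_km in (0, 1, 5):
    intensity = math.exp(beta0 + beta1 * distance_km)
    print(f"{distance_km} km: {intensity:.3f} farms per km^2")

print(f"drop per km: {percent_drop:.0f}%")  # 14%
```

Because the model is linear on the log scale, the drop is the same 14% for every additional km, regardless of the starting distance.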
e) Which model for log intensity (constant, linear function of distance, or within 5km/not) best fits the
data?


Linear function of distance. It has a substantially smaller AIC value than the other two models.
I fit all three models and calculated the AIC statistic for each. Those values are:
1281.2 for the constant model,
1209.9 for the linear distance model, and
1226.7 for the within 5km or not model.

The major choice is whether to use model selection statistics (AIC or relatives) or hypothesis testing.
Hypothesis testing allows you to compare the linear and within-5km models to the null of
constant intensity, but the comparison of those two alternatives with each other is indirect. AIC is a
much better way to compare fits among a collection of models.
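The AIC comparison amounts to finding the smallest value and looking at the differences from it. A minimal sketch using the three AIC values reported above (the dictionary keys are just labels):

```python
aic = {
    "constant": 1281.2,
    "linear distance": 1209.9,
    "within 5km or not": 1226.7,
}

best = min(aic, key=aic.get)
# Difference from the best model; larger means a worse fit by AIC.
delta = {name: value - aic[best] for name, value in aic.items()}

for name, d in sorted(delta.items(), key=lambda item: item[1]):
    print(f"{name}: AIC = {aic[name]}, delta = {d:.1f}")
```

The linear distance model is best, ahead of the within 5km model by 16.8 AIC units and the constant model by 71.3.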