STATISTICAL ASPECTS
OF
TO
SENIOR RESEARCH SCIENTIST
DEPARTMENT OF STATISTICS
COLORADO STATE UNIVERSITY
EMAP Affiliate
SPACE-TIME AQUATIC RESOURCES
MODELING and ANALYSIS PROGRAM
(STARMAP)
# 1
Space-Time Aquatic Resources Modeling and Analysis Program
The work reported here today was developed under the STAR Research Assistance Agreement CR-829095 awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of the presenter and STARMAP, the Program he represents. EPA does not endorse any products or commercial services mentioned in this presentation.
This research is funded by
U.S. EPA – Science To Achieve
Results (STAR) Program
Cooperative Agreement # CR-829095
# 2
CONTEXT: Environmental Monitoring and
Assessment Program (EMAP) + Academic
TOPICS TO CONSIDER:
What to Measure = Indicators
Other speakers will address this
Important things to consider in designing a survey
PLAN! PLAN! PLAN!
A National or Regional Survey is a
Substantial Undertaking
# 3
IMPORTANT THINGS TO CONSIDER IN
DESIGNING A SURVEY
1. Probability Surveys vs Judgment Collections
2. Population Definition
3. Evaluation Units – hives (colonies) or bees
4. Sampling Frames
5. Selecting the Sample Sites
6. Training
7. Collecting the Bees
8. Handling the Collected Bees
9. Quality Assurance
10. Data Management
11. Data Analysis
# 4
Specialists Usually Know a Tremendous
Amount About Limited Specific Situations
This is the way science accumulates knowledge.
But frequently specialists know a lot less about the overall situation than they think they do!
An illustration follows
Selection of stream segments for spawning studies by
Oregon Department of Fish and Wildlife
# 5
SELECTION OF STREAM SEGMENTS FOR
SPAWNING STUDIES
(OREGON DEPARTMENT OF FISH AND WILDLIFE)
OBJECTIVE: Estimate Number Of Coho
Salmon Spawning in Streams of Oregon’s
Coast Range
Stream Segments Were Stratified As Being
“Low,” “Moderate,” Or “High”, relative to quality of spawning habitat
Low was not sampled; high was sampled at three times the rate of moderate
Quality of spawning habitat was evaluated for each selected segment
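The unequal sampling rates above matter at the estimation stage: each sampled segment has to be weighted by the inverse of its selection rate. The following is a minimal sketch of that weighting (a Horvitz-Thompson total), not the procedure the agency actually used. The rates 0.05 and 0.15 and the spawner counts are hypothetical; the slide states only the 3-to-1 ratio.

```python
# Hypothetical inclusion probabilities: only the 3:1 ratio comes from the slide.
RATE = {"moderate": 0.05, "high": 0.15}

def estimate_total(sample):
    """Horvitz-Thompson total: divide each sampled count by its
    inclusion probability, then sum.

    The "low" stratum was not sampled, so it has no rate here and the
    estimate says nothing about spawners in those segments.
    """
    return sum(count / RATE[stratum] for stratum, count in sample)

# Made-up spawner counts on sampled segments: (stratum, count)
sample = [("moderate", 2), ("moderate", 0), ("high", 12), ("high", 9), ("high", 3)]
total = estimate_total(sample)
print(f"Estimated spawners in the moderate and high strata: {total:.0f}")
```

Because the low stratum has a selection rate of zero, no weighting can recover it; the estimate applies only to the segments that had a chance of being selected.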
# 6
SELECTION OF STREAM SEGMENTS FOR
SPAWNING STUDIES
(OREGON DEPARTMENT OF FISH AND WILDLIFE)
(continued)
ANTICIPATED QUALITY    OBSERVED LOW    OBSERVED MEDIUM    OBSERVED HIGH
LOW                    NOT SAMPLED     NOT SAMPLED        NOT SAMPLED
MEDIUM                 73%             17%                10%
HIGH                   53%             23%                24%
# 7
SELECTION OF STREAM SEGMENTS FOR
SPAWNING STUDIES
(OREGON DEPARTMENT OF FISH AND WILDLIFE) continued
EXAMPLE of “Sampling Where Investigators
Think Most of the Large Responses Are.”
Bad idea if “knowledge” isn’t quite right
Even a 10% error rate can make this a very inefficient sampling approach
The ODFW Classification Was Off by Far More Than 10%.
Many other such examples exist.
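The size of the classification error can be read off the percentages on the preceding slide. The short calculation below assumes those figures are row percentages, that is, 73% / 17% / 10% is the observed low / medium / high split for segments anticipated to be medium, and 53% / 23% / 24% is the split for segments anticipated to be high. That reading is an interpretation of the slide layout, not something the slide states.

```python
# Row percentages as read from the slide (assumed orientation: each row is an
# anticipated class, each entry the share observed in that class in the field).
observed = {
    "medium": {"low": 73, "medium": 17, "high": 10},
    "high": {"low": 53, "medium": 23, "high": 24},
}

def error_rate(anticipated):
    """Percent of sampled segments whose observed class differs from the anticipated one."""
    row = observed[anticipated]
    return 100 - row[anticipated]

for cls in observed:
    print(f"anticipated {cls}: {error_rate(cls)}% were classified differently in the field")
```

Under that reading, fewer than one in four segments in either sampled stratum turned out as anticipated.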
# 8
A Population is the Set of Objects of
Interest in a Survey
Commercial hives
Of cooperating beekeepers
All hives
All hives within 500m of a secondary road
Species
All
Two species of primary interest
# 9
POPULATION DEFINITION continued
So What?!
Major distinction
Target population = what you want to talk about
Sampled population = what you can talk about
You probably don’t want to talk about this sort of population:
All commercial hives owned by cooperating beekeepers within 100 miles of an EPA Regional Office, and within
500m of a paved secondary road in June, 2005.
Where you go to collect bees does make a difference!
# 10
CONCLUSIONS
ABOUT JUDGMENT SELECTED SITES
Ecologists’ “Typical Sites” Probably Are Much
More Homogeneous Than the Larger Context of Interest
Nonprobability Samples Can Be Rather Biased for No Apparent Reason
Typicalness for One Set of Responses Says
Nothing About Typicalness for Any Other
Response, i.e. Any Response Not Used in
Determining Typicalness
# 11
So what?
If Hives (or colonies) Are Your Evaluation
Units, You Must
Select hives in the sampling process
Have a response which can be attached to a selected hive
Give final answers in terms of hives
Ex: Proportion of hives (colonies) with yy > xx
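A hive-level summary of this kind can be sketched as a weighted proportion, with each sampled hive weighted by the inverse of its selection probability. This is one common estimator (a ratio, or Hajek-type, estimate), offered as an illustration rather than the method prescribed here. The function name, the responses, and the inclusion probabilities are all hypothetical.

```python
def proportion_exceeding(hives, threshold):
    """Weighted estimate of the proportion of hives whose response
    exceeds `threshold`.

    `hives` is a list of (response, inclusion_probability) pairs, one per
    sampled hive; the design weight of a hive is 1 / inclusion_probability.
    """
    weights = [1.0 / p for _, p in hives]
    exceeding = sum(w for (y, _), w in zip(hives, weights) if y > threshold)
    return exceeding / sum(weights)

# Made-up responses and inclusion probabilities for five sampled hives
hives = [(4.2, 0.10), (1.1, 0.10), (6.5, 0.30), (0.8, 0.30), (5.0, 0.30)]
print(f"Estimated proportion above 4.0: {proportion_exceeding(hives, 4.0):.3f}")
```

The unweighted share here would be 3 of 5; the weighted estimate differs because hives selected at the lower rate each represent more of the population.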
# 12
A Sample Frame Provides a Means to
Identify or Locate the Individual Units in the
Population
May be a list
The basis for most of the older sampling theory
Often is imperfect! Sometimes, badly so!
Many living things must be selected by their location
# 13
PLAUSIBLE SPATIAL SAMPLING FRAMES
(Courtesy of Tony Olsen, EMAP, US EPA)
Use 6th Field HUCs as Spatial Units.
Select a sample of HUCs, incorporating landcover/use into the probability of selection. Then find beekeepers within each HUC. Sample locations where hives are set up.
Same as Above, Except Ignore Beekeepers.
Go out and trap any bees at selected points within each HUC, possibly using landcover again within the HUC as the selection probability.
Use NRI Sample Points as Frame and Subsample Them.
Use NASS Spatial Frame Sample Points and Subsample Them.
Use NLCD (8 million pixels).
Select a GRTS sample of pixels based on landcover class. Either trap bees or use the selected pixels to identify (in some way) whether bee hives are present. An oversample is needed if most pixels are expected not to have hives.
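The idea of letting landcover drive the selection probability, with an oversample to allow for pixels without hives, can be illustrated with a much simpler design than GRTS. The sketch below uses Poisson sampling (each pixel is included independently with its own probability); it is a stand-in chosen for brevity and does not provide the spatial balance GRTS does. The landcover classes, probabilities, and toy frame are hypothetical.

```python
import random

# Hypothetical landcover classes and per-pixel selection probabilities.
SELECT_PROB = {"cropland": 0.02, "pasture": 0.01, "forest": 0.002}

def select_pixels(pixels, oversample=1.0, seed=0):
    """Poisson sampling: include each pixel independently with probability
    oversample * SELECT_PROB[landcover], capped at 1.

    Returns (pixel_id, inclusion_probability) pairs so the probability is
    kept with the site for later weighting.
    """
    rng = random.Random(seed)
    chosen = []
    for pixel_id, landcover in pixels:
        p = min(1.0, oversample * SELECT_PROB[landcover])
        if rng.random() < p:
            chosen.append((pixel_id, p))
    return chosen

# Toy frame of 30,000 pixels cycling through the three classes
classes = list(SELECT_PROB)
pixels = [(i, classes[i % 3]) for i in range(30_000)]
sample = select_pixels(pixels, oversample=3.0)
print(len(sample), "pixels selected; first few:", sample[:3])
```

Storing the inclusion probability alongside each selected pixel is the point of the sketch: whatever design is used, that number is needed again at the analysis stage.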
# 14
PLAUSIBLE SPATIAL SAMPLING FRAMES
(Courtesy of Tony Olsen, EMAP, US EPA)
JARGON!!! - means what?
HUC = Hydrologic Unit Code
NRI = National Resources Inventory – oriented toward soil erosion (Iowa State U)
NASS = National Agricultural Statistics Service
NLCD = National Land Cover Data
GRTS = Generalized Randomized Tessellation Stratified
VERY promising approach – provides an easy and defensible way to accommodate access denials, etc.
# 15
PLAUSIBLE SPATIAL SAMPLING FRAMES
(Courtesy of Tony Olsen, EMAP, US EPA)
Where to Find Info
JARGON!!! - where to find out more about the content the jargon represents
HUC: http://water.usgs.gov/GIS/huc.html
NRI: http://www.nrcs.usda.gov/technical/NRI/
NASS: http://www.usda.gov/nass/
NLCD: http://www.epa.gov/mrlc/nlcd.html
GRTS: http://oregonstate.edu/dept/statistics/epa_program/docs/spatial_balance_imperfect_frame.pdf
# 16
A PLAUSIBLE SPATIAL SAMPLING FRAME
Hydrologic Units
Level 1 – “Two digit”
21 major geographic areas, or regions
Level 2 – “Four Digit”
divides the 21 regions into 222 subregions
Level 3 – “Six Digit”
352 hydrologic accounting units
Level 4 – “Eight Digit”
There are 2150 Cataloging Units in the
Nation
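Because each level simply extends the code of the level above it by two digits, the hierarchy can be recovered from the code itself. The helper below is a small illustration covering only the four levels listed on this slide; the function name and the sample code are chosen for the example.

```python
# (level name, number of digits) for the four levels listed on the slide
LEVELS = [
    ("region", 2),
    ("subregion", 4),
    ("accounting_unit", 6),
    ("cataloging_unit", 8),
]

def split_huc(code):
    """Return the nested hydrologic unit codes contained in `code`."""
    if not code.isdigit() or len(code) % 2 or not 2 <= len(code) <= 8:
        raise ValueError(f"expected a 2-, 4-, 6- or 8-digit code, got {code!r}")
    return {name: code[:n] for name, n in LEVELS if n <= len(code)}

# An illustrative eight-digit code
print(split_huc("10190007"))
```

Grouping sampled sites by any of these prefixes gives a ready-made set of nested spatial units for selection or reporting.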
# 17
There are Lots of Ways to Select Collection
Sites – Depending On
Objectives
Sampling Frame
Units chosen (hives or bees)
Possible stratification factors
# 18
continued
One Which Has Come Out of the EMAP
Experience:
Generalized Randomized Tessellation Stratified
(GRTS) Sampling
It allows
Spatially distributed sites
Variable sampling rates – depending on factors of interest
A well-defined means for adding sites to accommodate problems like access denial
Implemented in several computational contexts
Using GIS, or
Statistical software
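The "well-defined means for adding sites" can be pictured as follows: a GRTS design is typically delivered as an ordered list of sites that includes an oversample, and sites are used strictly in list order, with the next site on the list taking the place of any site that has to be dropped. The sketch below illustrates only that replacement rule. It does not generate a GRTS ordering, and the site labels and the access check are hypothetical.

```python
def fill_sample(ordered_sites, n_needed, is_accessible):
    """Walk an ordered site list and keep the first `n_needed` accessible sites.

    Returns (used, skipped) so that every access denial stays on record.
    """
    used, skipped = [], []
    for site in ordered_sites:
        if len(used) == n_needed:
            break
        (used if is_accessible(site) else skipped).append(site)
    return used, skipped

# Hypothetical ordered design list (base sites followed by oversample sites)
ordered_sites = ["S01", "S02", "S03", "S04", "S05", "S06"]
denied = {"S02", "S04"}  # landowners who refused access
used, skipped = fill_sample(ordered_sites, 3, lambda s: s not in denied)
print("used:", used, "skipped:", skipped)
```

Keeping the skipped sites, rather than discarding them, preserves the information needed to describe what part of the target population the final sample actually represents.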
# 19
Data Cannot Be Combined Across Areas, etc.,
Unless It Is Comparable Across Those Same
Features
IMPLICATION: Good Training is Critical to
Assure Consistent Procedures
Various plausible contingencies must be identified in advance, and
Plans made for how they should be dealt with
# 20
Make Sure Field Crews Follow the Collection
Protocols
Be sure collection times don’t conflict with fair labor laws
Does a federal employee need to be a member of each field crew?
For safety purposes, crews may need to have at least two members
Collect the Bees As Planned
# 21
Ship the Collected Material to the
Appropriate Labs, According to Specified
Protocols
Need ice?
Consider crew logistics, like
housing, transportation, permits, location of shipping point, availability of shipping point by day of the week
Plan for custody of the collected material
# 22
EPA has Stringent Quality Assurance (QA)
Processes
Approval of a QA plan may take several months
Plan for that
Implication: Indicator(s) need to be chosen early in the process
# 23
This Will Be a Much Larger Effort Than You
May Expect
This has a QA component, too
20 – 30% of resources! Not 5%!
The collected information becomes part of the public record.
You need to plan to make it available to various interested parties!
# 24
Plan Intended Summaries from the Beginning
Record and keep track of all of the design information,
Like the rate at which various kinds of sites were selected
Consider making estimated cumulative distribution functions (cdf) a major part of the survey summary
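An estimated cdf uses the recorded design information directly: each observation contributes its design weight (the inverse of its selection rate) rather than a simple count. The following is a minimal sketch of a weighted empirical cdf, without the confidence bounds a full survey summary would include. The function name, response values, and weights are made up for illustration.

```python
def estimated_cdf(observations):
    """Design-weighted empirical cdf.

    `observations` is a list of (value, weight) pairs. Returns a list of
    (value, estimated proportion of the population at or below value),
    sorted by value.
    """
    total = sum(w for _, w in observations)
    cdf, running = [], 0.0
    for value, weight in sorted(observations):
        running += weight
        cdf.append((value, running / total))
    return cdf

# Made-up responses with design weights (weight = 1 / selection rate)
observations = [(1.5, 10), (0.8, 30), (3.2, 10), (2.1, 50)]
for value, proportion in estimated_cdf(observations):
    print(f"{value:4.1f}  {proportion:.2f}")
```

Reading across the output gives statements such as "an estimated 40% of the population has a response of 1.5 or less", which is where the tails of the distribution become visible.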
# 25
FOR
# 26
ESTIMATED CUMULATIVE DISTRIBUTION FUNCTION
(cdf) OF SECCHI DEPTH, EMAP AND “DIP-IN”
Use cdfs – tails often are of interest
Confidence bounds
Misinformation from convenience data
# 27
QUESTIONS ARE WELCOME
# 28