Appropriate Sampling
Ann Abbott
Rocky Mountain Research Station
Moscow Forestry Sciences Laboratory

Outline
  What is Appropriate Sampling
  How do we do it
  Questions to Ask
  Sampling Designs
  Sample Size
  Northern Region Protocol

What is appropriate sampling?
  Meets the objectives of the research question
  Representative of the population
  Feasible
  Cost effective

Appropriate Sampling is the RESULT
  Of answering a series of questions
  The answers to the appropriate questions lead naturally to the appropriate Sampling Design and Data Analysis/Interpretation

Questions to Ask
  Objectives of the Research
  Population for inferences
  Sampling Units
  Translation of the objectives
  Preliminary Information
  Choice of Sampling Design

Questions to Ask
  Determination of Sample Size
  Auxiliary Variables
  Randomization
  Recording Results
  Analysis

Stating the Objectives
  Have the objectives of the investigation been clearly and explicitly stated, along with the reasons for undertaking it?
  Have the objectives been translated into precise questions that sample determinations can be expected to answer?

Defining the Population
  Has the population about which inferences are to be made been carefully defined?
  What constraints are to be placed on the population?
  Are the units to be measured or counted representative of the population?
  If not, what changes must be made to ensure representativeness?

Defining the Population
  Is there a logical framework for the choice of sample units from the defined population?
  If not, what steps can be taken to impose a logical sampling frame?

Sampling Units
  A successful sampling scheme involves the selection of an appropriate sampling unit
    Quadrat
    Leaves of a plant
    Individual organism
    Belt transect
    Point

Sampling Units
  Are the sampling units naturally defined? If not, how will they be defined?
  Is the number of sampling units finite?
  If it is finite, is the total number of units in the population large enough to ignore finite sampling considerations?
  Is the definition of the sampling units appropriate to the objectives?

Choice of Sampling Unit
  Must be the unit upon which you wish to make inferences and estimates
  Defined to be “nonoverlapping collections of elements from the population that cover the entire population”
  Sampled without replacement

Choice of Sampling Unit: Point versus Area
  Point samples allow inferences based on the number of observations in the sample
    Inferences are made on means or percentages from the sample observations
  Area samples are generally measured with densities or percent of the area covered
    Inferences are made by extrapolating the sample density to the entire area

Choice of Sampling Unit: Point vs Area
  Point samples are quicker and can potentially give more cost-effective coverage of the area
  Area samples can yield more detailed information but may be more time consuming
  Area sampling assumes that counts are made without error

Translating the Objectives
  What exactly is to be estimated or tested?
  Are the required estimates proportions, totals, means, totals or means over subpopulations, or something else?
  Have blank data sheets been constructed?
  What is the smallest subset of data from which estimates are to be made?
  What precision is required of the estimates for the various subsets?

Preliminary Information
  Is information about the population available that may be helpful in designing the sampling scheme?
  Are estimates of the likely variability available?
  Is a pilot study feasible or desirable?
  Are there any known factors that help stratify the population?

Variability
  The variation that is inherent in soils data must be accounted for during the design phase of a soil sampling plan, including
    Sampling design
    Data collection procedures
    Analytical procedures
    Data analysis
  “One of the key characteristics of the soil system is its extreme variability.” (Mason 1992)
  Researchers have long been cautioned against ignoring sampling variability in any study of the soil system (e.g. Cline 1944).

Accounting for Variability
  Ensuring that the sample adequately covers the entire population
  Reporting variability estimates along with central tendency estimates
  Reporting interval estimates
  Using an interactive approach to balance the data quality needs and resources with designs that will either control variation, stratify to reduce variation, or reduce the influence of variation on the decision process

Precision, Bias and Accuracy
  Precision is a measure of the reproducibility of measurements of a particular soil condition or constituent
  The statistical techniques seen in soil sampling are designed to measure precision and not accuracy
  Bias is a systematic error that contributes to the difference between the mean of a large number of test results and an accepted reference value

Precision, Bias and Accuracy
  Accuracy is the correctness of the measurement and cannot be directly measured: it combines precision and bias
  Figure legend:
    Red dots are precise but biased
    Blue dots are unbiased but imprecise
    Yellow dots are biased and imprecise
    Green dots are unbiased, precise and therefore accurate

Sampling Designs
  Simple Random Sampling
  Stratified Random Sampling
  Systematic Random Sampling
  Cluster Sampling
  Other Combinations

Sampling Designs
  Can the population as defined be broken into naturally occurring groups, where the grouping variable affects the measured variable(s)?
  If it cannot, Simple Random Sampling or Systematic Sampling can be effective
  If it can, Stratified Random Sampling or Cluster Sampling can be effective

Simple Random vs Systematic
  Simple Random Sampling: if there is a “list” (sampling frame) of all sampling units in the population
    Randomly selects from units on the list
  Systematic Random Sampling: if there is no sampling frame available but there is an estimate of the total number of sampling units
    Randomly selects starting point

Simple Random Sampling
  Used when there is inadequate information for developing a conceptual model for a site or for stratifying a site
  Any sample in which the probabilities of selection are known
  Sampling units are chosen by some method that uses chance to determine selection
  Simple random sampling is the basis for all probability sampling techniques and is the point of reference from which modifications to increase sampling efficiency may be made
  Alone, simple random sampling may not give the desired precision

Simple Random Sampling
  Advantages
    Prior information about the population is not necessary
    Easy to perform, easy to analyze
  Disadvantages
    May not give desired precision
    Need a sampling frame

Computation: Simple Random Sample, continuous variable
  Mean: $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
  Variance: $\hat{V}(\bar{y}) = \frac{s^2}{n}$, with $s^2 = \frac{\sum_{i=1}^{n}(y_i - \bar{y})^2}{n-1}$
  Confidence Interval: $\bar{y} \pm t_{\alpha/2}\sqrt{\frac{s^2}{n}}$
  Sample Size: $n = \frac{t_{\alpha/2}^2 s^2}{w^2}$, where $w$ is the desired half-width of the confidence interval

Computation: Simple Random Sample, binomial variable
  Proportion: $\hat{p} = \frac{1}{n}\sum_{i=1}^{n} y_i$
  Variance: $\hat{V}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n}$
  Confidence Interval: $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$
  Sample Size: $n = \frac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{w^2}$

Systematic Random Sampling
  Attempts to provide better coverage of the study area or population than that provided by a simple random sample or a stratified random sample
  Is a simple random sample based on spatial distribution over the site
  Does not require a complete list of sampling units
  Can give better coverage than a simple random sample

Systematic Random Sampling
  Requires some estimate of the total number of sampling units in the population
  Required sample size must be calculated
  Determine sampling interval between units
  Randomly select starting point
  Transect sampling is a version of Systematic Random Sampling

Systematic Random Sampling
  Collects samples in a regular pattern over the area under investigation
    Grid
    Line transect
  Orientation of grid or transect starting point should be randomly selected

Systematic Random Sampling Considerations
  Sample size and population size estimates
  Some knowledge of the population to avoid sampling along periodicities

Stratified vs Cluster Sampling
  Used when the population can be broken into naturally occurring groups or segments
  Stratified Random Sampling: when there is more variability among groups than within groups
  Cluster Sampling: when there is more variability within groups than among groups

Stratified Random Sampling
  Prior knowledge of the sampling area and information obtained from background data may be used to reduce the number of observations necessary to attain specified precision
  Goal is to increase precision and control sources of variability in the data

Stratified Random Sampling
  Variability between strata must be larger than variability within strata for any benefit to be seen
  Sampling within each stratum is done with a Simple Random Sample

Stratified Random Sampling
  Advantages
    Gives estimates for subgroups
    Can be more precise than Simple Random Sampling
    Can be more convenient to implement
  Disadvantages
    Requires prior information about the population
    More complicated computation

Computation: Stratified Random Sample, continuous variable
  Mean: $\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{y}_i$
  Variance: $\hat{V}(\bar{y}_{st}) = \frac{1}{N^2}\sum_{i=1}^{L} N_i^2 \frac{s_i^2}{n_i}$
  Confidence Interval: $\bar{y}_{st} \pm t_{\alpha/2}\sqrt{\hat{V}(\bar{y}_{st})}$

Stratified Random Sample: Sample Size Calculation
  Requires information about the relationship between the individuals among strata
  Can be calculated by weighting strata
  Can allocate sampling based on minimizing the variance for a fixed cost
  Other ways to allocate sampling among strata (optimal, Neyman)

Post Stratification
  Can be used when stratification is appropriate for some key variable, but cannot be done until after the sample is selected
  Often appropriate when a simple random sample is not properly balanced according to major groupings

Post Stratification
  Mean: $\bar{y}_{st} = \frac{1}{N}\sum_{i=1}^{L} N_i \bar{y}_i$
  Variance: $\hat{V}_p(\bar{y}_{st}) = \frac{N-n}{Nn}\sum_{i=1}^{L} W_i s_i^2 + \frac{1}{n^2}\sum_{i=1}^{L}(1-W_i)\,s_i^2$, with stratum weight $W_i = N_i/N$

Cluster Sampling
  Used when there is more variability within groups than among groups
  Groups are randomly sampled
  Units within groups are sampled
    Can sample every element within the group
    Can take a second random sample within the group

Questions to Ask in Choosing a Sampling Design
  If there is no information on population groupings, will simple random or systematic random sampling better meet the objectives?
  Is Simple Random Sampling likely to be effective?
  If not, have the reasons for not using simple random sampling been clearly stated?

Questions to Ask in Choosing a Sampling Design
  If Systematic Random Sampling is chosen, what interval will separate units?
  Is there a likelihood that the interval will coincide with periodicity in the data?
  If so, what steps will be taken to avoid the resulting bias in the estimates?

Questions to Ask in Choosing a Sampling Design
  If there is a grouping in the population, will stratification improve the precision of the estimates?
  Has the efficiency of the stratification been calculated?
  What is the basis of the stratification?
  How will the sampling units be allocated?

Questions to Ask in Choosing a Sampling Design
  If there is a grouping in the population, is there an advantage to cluster sampling?
  Has the efficiency been calculated?
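
One way to make the question "Has the efficiency of the stratification been calculated?" concrete is to compare the estimated variance of the stratified mean with the variance a simple random sample of the same total size would give. The sketch below is only an illustration under stated assumptions: the function names and the data layout (a list of stratum size and observation pairs) are hypothetical, the pooled sample is treated as if it were a simple random sample, and finite population corrections are ignored.

```python
from statistics import mean, variance


def stratified_mean_and_variance(strata):
    """Estimate the stratified mean and its variance.

    strata: list of (N_i, observations) pairs, where N_i is the number of
    sampling units in stratum i and observations are the sampled values.
    Uses y_st = (1/N) * sum(N_i * ybar_i) and
    V(y_st) = (1/N^2) * sum(N_i^2 * s_i^2 / n_i).
    """
    total_units = sum(size for size, _ in strata)
    y_st = sum(size * mean(obs) for size, obs in strata) / total_units
    v_st = sum(size**2 * variance(obs) / len(obs) for size, obs in strata)
    return y_st, v_st / total_units**2


def srs_variance_of_mean(observations):
    """Variance of the mean under simple random sampling: s^2 / n."""
    return variance(observations) / len(observations)


def relative_efficiency(strata):
    """Ratio of the SRS variance to the stratified variance.

    Values well above 1 suggest stratification is paying off. The pooled
    sample is treated as a simple random sample (an approximation).
    """
    pooled = [y for _, obs in strata for y in obs]
    _, v_st = stratified_mean_and_variance(strata)
    return srs_variance_of_mean(pooled) / v_st


if __name__ == "__main__":
    # Hypothetical pilot data: two equal-sized strata with very different means.
    pilot = [(100, [2.0, 3.0, 4.0]), (100, [10.0, 11.0, 12.0])]
    print(stratified_mean_and_variance(pilot))
    print(relative_efficiency(pilot))
```

Because the two hypothetical strata differ far more between groups than within groups, the ratio comes out much larger than 1, which is the situation in which the slides recommend stratified random sampling.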

Sample Size
  Calculated based on variability (standard deviation) within the population and desired precision of the estimate (confidence level)
  Simple Random Sample and Systematic Random Sample: $n = \frac{t_{\alpha/2}^2 s^2}{w^2}$
  Stratified Random Sample: (complicated) but still needs variance

Sample Size
  Specific sampling design considerations
    Systematic: is the sample size required to uniformly cover the population consistent with the expected precision?
    Stratified: has the efficiency of the stratification been tested in reducing the sample size or in obtaining the largest number of observations from the part of the population of greatest interest?

Sample Size
  Sample design considerations, continued
    Multistage: has the efficiency of various combinations of sample units at different stages been tested?
    Cluster: has the efficiency of various size clusters been tested?

Sample Size
  Cost considerations
    Must the number of observations be modified to account for variation in cost in different parts of the sampling procedure?
    If so, can the design be improved for better cost efficiency?

Randomization
  Have the sampling units been selected by an explicit randomization procedure?
  Has the randomization procedure been documented?
  Were any constraints correctly applied?
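
An explicit, documented randomization procedure can be as simple as a seeded random number generator whose seed is recorded with the field data. The sketch below is a minimal illustration, not a prescribed procedure: the function names are hypothetical, a simple random sample is drawn from a list acting as the sampling frame, and a systematic sample uses a fixed interval with a random starting point.

```python
import random


def simple_random_sample(frame, n, seed):
    """Draw n units without replacement from a sampling frame (a list).

    Recording the seed documents the randomization and makes it repeatable.
    """
    rng = random.Random(seed)
    return rng.sample(frame, n)


def systematic_sample(population_size, n, seed):
    """Return n unit indices at a fixed interval from a random start.

    population_size is the (estimated) number of sampling units; the
    interval is population_size // n and the start is drawn from [0, interval).
    """
    if not 0 < n <= population_size:
        raise ValueError("n must be between 1 and population_size")
    rng = random.Random(seed)
    interval = population_size // n
    start = rng.randrange(interval)
    return [start + j * interval for j in range(n)]


if __name__ == "__main__":
    frame = list(range(1, 51))  # hypothetical frame of 50 numbered units
    print(simple_random_sample(frame, 5, seed=2024))
    print(systematic_sample(100, 10, seed=2024))
```

Rerunning either function with the same seed reproduces the same selection, which is one way to answer the question of whether the randomization procedure has been documented.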

Sample Design Example
  Northern Region Soil Monitoring Protocol
  Goal: Develop an easy, cost effective and statistically defensible monitoring protocol for disturbance
  Stating the objectives: Characterize the activity area in terms of management related disturbance

Northern Region Protocol
  Defining the population: all possible ‘points’ within the Activity Area
  Sampling units defined as ‘points’
  Infinite number of possible ‘points’ in the population, so finite sample correction factors do not need to be used

Northern Region Protocol: Sample Design
  Stratification may be desirable but variability information is unavailable
  Simple Random Sampling may not give the appropriate coverage
  Systematic Random Sampling (Transect) was chosen to give the best coverage of the area

Northern Region Protocol: Translating the objectives
  What exactly will be measured or tested:
    Forest floor depth
    Forest floor missing
    Topsoil displacement
    Mixed topsoil/subsoil
    Erosion
    Rutting (3 depths)
    Burning (light, moderate, severe)
    Compaction (3 depths)
    Platy/massive structure (3 depths)
    5 forest floor variables

Northern Region Protocol: Translating the objectives
  Blank data tables

Northern Region Protocol: Translating the objectives
  What exactly is to be estimated or tested?
    What proportion of points in the sample have the characteristic of the indicator variable?
    What is the variability associated with the proportion?

Northern Region Protocol: Translating the objectives
  What is the required precision of the estimates?
    Confidence intervals within ± 5% of the estimate
    Confidence levels are determined by the line officer, with a choice from 70% to 95%

Northern Region Protocol
  What preliminary information is available about the activity area?
    Approximate size and shape
    Harvest history
    Variability estimates generally unknown
    A pilot would be best
    Stratification potential exists

Northern Region Protocol
  Problem:
    Variability estimates are unavailable
    Pilot studies are not feasible due to time and cost constraints
    Statistically valid sample sizes are required

Sequential Sampling
  An alternative approach to sampling in which the sample size is not fixed in advance
  Observations are collected individually or in small batches
  After each observation or batch, the data are examined to determine whether or not a decision may be made from the accumulated data

Sequential Sampling
  Combines data collection and data analysis into a single process or sampling plan
  Can considerably reduce the sample size requirements and data processing overheads

Sequential Sampling
  Best used in situations where classification of a population is useful and where the emphasis is on decision making
  In the simplest and most frequently used form, it is used to make binary classifications but can be extended into other applications

Northern Region Protocol
  Use a combination of sequential and systematic random sampling to obtain variability information for sample size calculation at the same sampling visit as the full data collection trip
  First 30 observations are used to calculate initial sample size, then sample size is continually updated as sampling continues

Northern Region Protocol
  Indicator variables are binomial (0,1)
  The sampling distribution of a binomial proportion is approximately normal when n ≥ 30
  Attractive for sampling since the maximum variability can be computed (the variance of a proportion is largest at p = 0.5)

Northern Region Protocol
  When sampling is complete for the activity area, the estimates and confidence intervals are computed
  Protocol allows field crews to sample an activity area with a statistically valid sample size in one visit
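
The sequential idea described above can be sketched in a few lines of code. This is an illustration under assumptions, not the protocol's actual stopping rule: the function names, the `observe` callback, the `max_n` cap and the default 90% confidence level are all hypothetical. It applies the binomial sample size formula n = z² p̂(1 − p̂) / w² with w = 0.05, starts from 30 observations, and recomputes the required sample size after every new observation.

```python
import math
from statistics import NormalDist


def z_value(confidence):
    """Two-sided standard normal critical value, e.g. about 1.96 for 0.95."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def required_sample_size(p_hat, half_width=0.05, confidence=0.90):
    """Binomial sample size n = z^2 * p(1 - p) / w^2, rounded up."""
    z = z_value(confidence)
    return math.ceil(z**2 * p_hat * (1 - p_hat) / half_width**2)


def sequential_sample(observe, initial_n=30, half_width=0.05,
                      confidence=0.90, max_n=10_000):
    """Collect 0/1 observations until the current n meets the requirement.

    observe: callable returning the next indicator value (0 or 1), for
    example the next point along a transect. max_n is a safety cap.
    Returns (p_hat, (lower, upper), n).
    """
    data = [observe() for _ in range(initial_n)]
    while len(data) < max_n:
        p_hat = sum(data) / len(data)
        if len(data) >= required_sample_size(p_hat, half_width, confidence):
            break
        data.append(observe())
    n = len(data)
    p_hat = sum(data) / n
    margin = z_value(confidence) * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - margin, p_hat + margin), n


if __name__ == "__main__":
    # Worst case (p = 0.5) at the two ends of the 70% to 95% range.
    print(required_sample_size(0.5, confidence=0.70))
    print(required_sample_size(0.5, confidence=0.95))
```

Evaluating `required_sample_size` at p̂ = 0.5 gives the largest sample a crew could need at a given confidence level, which is the sense in which the maximum variability of a binomial indicator can be computed in advance.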