Line-Intercept Sampling (Chapter 19)

Most of the sampling methods discussed thus far have consisted of choosing a sample from a well-defined population according to some probabilistic mechanism. Consider now taking a sample in some area where there are stationary "objects" of interest (e.g., ponds, shrubs, patches of a certain habitat type, wolf tracks in the snow), where the objective is to estimate the total number or density of the objects, or the average or total value of some characteristic of the objects. One way to meet such objectives is to randomly choose "lines" (transects) through the area and sample all objects intercepted by these lines; hence the term line-intercept sampling.

Example: Reconsider the farms example, where a 25 × 25 unit² area (625 pixels) was partitioned into a number of farms of different sizes and with different numbers of workers per farm. We considered this example in the notes for Chapter 6 on Unequal Probability Sampling. The goal was to estimate the total number of farms, the total number of workers, the mean size of a farm, or the mean number of workers per farm. Recall that we sampled farms with probability proportional to size (PPS) by picking random points on the grid. We looked at Hansen-Hurwitz and Horvitz-Thompson estimators of the population parameters.

• For line-intercept sampling, consider taking line transects perpendicular to the base of the farms area. Since the length along this base is 25 units, we randomly select a number between 0 and 25 and sample the line drawn vertically from the chosen point.

• For each vertical transect chosen, we sample all farms intersected by the transect and record the variables of interest (the total number of farms is estimated by letting $y_i = 1$).

• Estimation of these quantities using line-intercept sampling will be demonstrated later in this handout using R.
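As a minimal sketch of the selection step described in the bullets above, the R code below draws one transect position along the baseline and lists the objects it intercepts. The data frame `farms`, its columns `left` and `right` (each object's extent along the baseline), and the helper `intercepted()` are hypothetical names with made-up values; they are not the farms from the handout's map.

```r
# Hypothetical objects: each row gives an object's extent along the baseline.
# (Illustrative values only; these are not the farms from the handout's map.)
farms <- data.frame(
  id    = 1:4,
  left  = c(0, 3, 10, 18),
  right = c(5, 9, 16, 25)
)
b <- 25  # baseline width of the study region

# Return the ids of all objects whose baseline extent contains position x.
intercepted <- function(x, objects) {
  objects$id[objects$left <= x & x < objects$right]
}

set.seed(1)                      # only so this sketch is reproducible
x <- runif(1, min = 0, max = b)  # random transect position along the baseline
hit <- intercepted(x, farms)     # objects sampled by this transect

# Selection probability of each object for one transect: p_k = w_k / b
farms$p <- (farms$right - farms$left) / b
```

An object's chance of being intercepted depends only on its width along the baseline, which is why wide objects are sampled more often and must be down-weighted by $1/p_k$ in the estimators.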
Notation for Line-Intercept Sampling:

Let:
  $K$ = the number of distinct objects in the population (# of farms),
  $y_k$ = the response variable for the $k$th unit, $k = 1, \ldots, K$,
  $\tau = \sum_{k=1}^{K} y_k$ = the population total (total # of farms, or total # of workers),
  $A$ = the total area of the study region (= 625 for the farms example),
  $b$ = the baseline width of the study region,
  $w_k$ = the width along the baseline of the $k$th object, $k = 1, \ldots, K$,
  $D = \tau / A$ = the density of the response variable per unit area (# of workers per unit²),
  $p_k = w_k / b$ = the probability the $k$th unit is intersected by a randomly chosen line transect (selection probability), $k = 1, \ldots, K$,
  $\pi_k$ = the probability that the $k$th unit is included in a sample of $n$ line transects (inclusion probability), $k = 1, \ldots, K$,
  $\pi_{kh}$ = the joint inclusion probability of the $k$th and $h$th units in a random sample of $n$ line transects, $k, h = 1, \ldots, K$.

Thompson presents two basic ways with line-intercept sampling to estimate population totals and means and calculate associated standard errors. For estimating a population total, these two ways are:

1. Separate transects estimator: calculate the Horvitz-Thompson estimate for each of the $n$ transects separately, then average these $n$ estimates to get an overall estimate of the population total.

2. Horvitz-Thompson estimator: calculate the Horvitz-Thompson estimate based on the distinct objects intercepted by the whole set of $n$ transects.

These two approaches are discussed in more detail below.

Separate Transects Estimator:

Use the information obtained from a single line transect (say the $i$th transect) to estimate the population total $\tau$. Let $C_i$ be the set of all objects in the region intersected by the $i$th transect and suppose we have measured some response $y_k$ on each of the units in $C_i$. Then an unbiased estimator of the population total is given by:

$$v_i = \sum_{k \in C_i} \frac{y_k}{p_k}.$$

This is the Horvitz-Thompson estimator for a single transect (note that we cannot calculate the Hansen-Hurwitz estimator because individual objects are not selected randomly with replacement).

• For $n$ such vertical line transects, we obtain $n$ estimates of $\tau$, given by $v_1, v_2, \ldots, v_n$, where these estimates are independent and identically distributed. Hence, an unbiased estimator of $\tau$ and the corresponding variance based on all $n$ transects are given by:

$$\hat{\tau}_p = \frac{1}{n} \sum_{i=1}^{n} v_i \quad \text{(the average of the } v_i\text{'s)},$$

$$\mathrm{Var}(\hat{\tau}_p) = \frac{1}{n} \mathrm{Var}(v_i) = \frac{\sigma_v^2}{n} \quad \text{(where } \sigma_v^2 \text{ is the variance of } v_i\text{)}.$$

• Using the sample variance of the $v_i$'s as an estimate of $\sigma_v^2$, an unbiased estimate of the variance of $\hat{\tau}_p$ is given by:

$$\widehat{\mathrm{Var}}(\hat{\tau}_p) = \frac{s_v^2}{n}, \quad \text{where: } s_v^2 = \frac{1}{n-1} \sum_{i=1}^{n} (v_i - \hat{\tau}_p)^2.$$

• Note that there is no finite population correction (fpc) here. Why?

• Note that with this estimator, a given object might be intersected by more than one transect and hence be included in the sample more than once.

Horvitz-Thompson Estimator:

Recall that the Horvitz-Thompson estimator gives a general way of obtaining an unbiased estimator of a population total in which each distinct selected unit is used only once. For line-intercept sampling, we use the whole set of objects intercepted by at least one of the $n$ transects. Letting $v$ be the number of distinct objects intercepted, recall that the general form of the Horvitz-Thompson estimator of the total and the corresponding variance and estimated variance are given by:

$$\hat{\tau}_\pi = \sum_{k=1}^{v} \frac{y_k}{\pi_k} \quad \text{(where } v = \text{the number of distinct units selected)},$$

$$\mathrm{Var}(\hat{\tau}_\pi) = \sum_{i=1}^{N} \left( \frac{1 - \pi_i}{\pi_i} \right) y_i^2 + \sum_{i=1}^{N} \sum_{j \neq i} \left( \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} \right) y_i y_j,$$

$$\widehat{\mathrm{Var}}(\hat{\tau}_\pi) = \sum_{i=1}^{v} \left( \frac{1 - \pi_i}{\pi_i^2} \right) y_i^2 + \sum_{i=1}^{v} \sum_{j \neq i} \left( \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} \right) \frac{y_i y_j}{\pi_{ij}},$$

where the $\pi_i$'s and $\pi_{ij}$'s are the inclusion and joint inclusion probabilities.
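The two estimators defined above can be sketched as small R functions. This is an illustrative sketch, not code from the text: the function names `separate_transects()` and `ht_total()` and their argument layouts are assumptions made here. The check at the end reuses the widths and overlap widths recorded in the farms example worked later in this handout (baseline $b = 25$).

```r
# Separate transects estimator (hypothetical helper).
# y, p: lists with one numeric vector per transect, holding the responses and
# the selection probabilities p_k = w_k / b of the objects that transect hit.
separate_transects <- function(y, p) {
  v <- mapply(function(yi, p_i) sum(yi / p_i), y, p)  # v_i for each transect
  n <- length(v)
  list(v = v, total = mean(v), se = sqrt(var(v) / n))
}

# Horvitz-Thompson estimator on the distinct intercepted objects (hypothetical
# helper). pik: inclusion probabilities; pikh: symmetric matrix of joint
# inclusion probabilities. Off-diagonal entries must be nonzero; the diagonal
# is ignored.
ht_total <- function(y, pik, pikh) {
  total <- sum(y / pik)
  c1 <- sum((1 - pik) / pik^2 * y^2)
  cross <- (pikh - outer(pik, pik)) / (outer(pik, pik) * pikh) * outer(y, y)
  diag(cross) <- 0                      # keep only the j != i terms
  list(total = total, var = c1 + sum(cross))
}

# Check against the farms example: three transects, y_k = 1 counts farms.
w <- list(c(1,4,2,8,5,5), c(3,3,4,4,2,4,3,2,6,3), c(5,2,2,4,3,2,5,6,2))
p <- lapply(w, function(wi) wi / 25)
y <- lapply(w, function(wi) rep(1, length(wi)))
st <- separate_transects(y, p)          # total about 72.29, SE about 7.74

# Single transect: with n = 1, pi_k = p_k and pi_kh = w_kh / b.
wkh <- matrix(0, 6, 6)
wkh[upper.tri(wkh)] <- c(1, 1,2, 1,4,2, 1,4,2,5, 1,4,2,5,4)  # overlap widths
pkh <- (wkh + t(wkh)) / 25
ht <- ht_total(rep(1, 6), p[[1]], pkh)  # total 56.875, SE about 52.52
```

Writing the double sum over all ordered pairs $j \neq i$ with a symmetric matrix gives the same value as twice the sum over pairs $h > k$, which is the form used in the handout's loop.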
• So, for example, if $y_k$ = the number of workers on farm $k$, and $v$ = the total number of distinct farms selected in the line transect samples, then an unbiased (H-T) estimate of the total number of workers in the farm region is given by:

$$\hat{\tau}_\pi = \sum_{k=1}^{v} \frac{y_k}{\pi_k}.$$

To use the Horvitz-Thompson estimator for line-intercept sampling, we need to compute the inclusion and joint inclusion probabilities for all objects intercepted (sampled) by the vertical transects. How do we do this? For $n$ independently selected transects, an object is missed by all of them with probability $(1 - p_k)^n$, so

$$\pi_k = 1 - P(\text{object } k \text{ is missed by all } n \text{ transects}) = 1 - (1 - p_k)^n.$$

For the joint inclusion probability, let $p_{kh} = w_{kh}/b$ be the probability that a single transect intersects both objects, where $w_{kh}$ is the width along the baseline over which objects $k$ and $h$ overlap. A single transect misses both objects with probability $1 - p_k - p_h + p_{kh}$, so

$$\pi_{kh} = \pi_k + \pi_h - P(k \text{ or } h \text{ is included}) = \pi_k + \pi_h - \left[ 1 - (1 - p_k - p_h + p_{kh})^n \right].$$

• One favorable aspect of the Horvitz-Thompson estimator over the separate transects estimator (as previously discussed) is that it relies only on the distinct units in the sample. The major drawback is the difficulty of computing the joint inclusion probabilities. In addition, the estimated variance of a Horvitz-Thompson estimator is not guaranteed to be positive, while the separate transects variance estimator is.

Estimating the Density using the Total:

With either of the two estimators for the population total just developed, an estimate of the density per unit area (mean) and the corresponding estimated variance are given by:

$$\hat{D} = \frac{\hat{\tau}}{A}, \qquad \widehat{\mathrm{Var}}(\hat{D}) = \frac{\widehat{\mathrm{Var}}(\hat{\tau})}{A^2},$$

where $A$ = the total area of the study region.

• If the study region is rectangular with base of width $b$ and constant transect length $l$, then $A = bl$. Note that the estimator of the total does not rely on the transect length in any way, and the density estimator uses it only through the known area $A$.

• If the study region is irregularly shaped, so that the transect lengths may vary from transect to transect, the estimators given are still unbiased estimators of the population quantities. However, if the lengths of the transects vary greatly, we might make use of any relationship between the response variable and the transect length via ratio or regression estimation to improve the estimators. This will be discussed later.

Example (Example 19.1 on pp. 247-250 in the text): Researchers were interested in estimating the abundance of wolverines in a certain region.
Aircraft were flown over selected transects looking for tracks in the snow. Each set of tracks encountered along the transect was mapped. The variable of interest in the study was $y_k$ = the number of wolverines associated with the $k$th set of tracks. A map of the survey area is shown below with the tracks mapped. The study area was 36 miles wide along the base and extended 20 miles from the base.

Four systematic samples of 3 transects each were taken, with random starting points in the first 12 miles given by A1, B1, C1, and D1 in the map. So, for the purposes of the study, there were four total transects, each consisting of three flights over the region. The 12 flight lines (four systematic transects of three flights each) intersected $v = 4$ distinct sets of tracks. Letting:

  $y_k$ = the number of wolverines on the $k$th set of tracks ($k = 1, 2, 3, 4$),
  $w_k$ = the width of the projection of the $k$th set of tracks onto the base of the region,
  $p_k = w_k / 12$ = the probability of intersecting the $k$th set of tracks for a given transect (each systematic transect is determined by a starting point chosen over a 12-mile interval),
  $\pi_k = 1 - (1 - p_k)^4$ = the inclusion probability of the $k$th set of tracks in the sample,

the following table summarizes the information on the tracks:

  Track #           1       2       3       4
  y_i               1       2       2       1
  w_i (miles)       5.25    7.50    2.40    7.05
  p_i = w_i/12      .4375   .6250   .2000   .5875
  π_i               .90     .98     .59     .97

Separate Transects Estimate of τ = Total # of Wolverines

The first transect (A1, A2, and A3) intersects the 1st, 2nd, and 4th sets of tracks, giving the following estimate of the total:

$$v_1 = \frac{y_1}{p_1} + \frac{y_2}{p_2} + \frac{y_4}{p_4} = \frac{1}{.4375} + \frac{2}{.6250} + \frac{1}{.5875} = 7.1878 \text{ wolverines},$$

and similarly, $v_2 = 7.1878$, $v_3 = 11.7021$, and $v_4 = 11.7021$. Hence, the estimated total number of wolverines in the region and the corresponding standard error are given by:

$$\hat{\tau}_p = \frac{1}{4} \sum_{i=1}^{4} v_i = \frac{1}{4} \left[ 7.1878 + 7.1878 + 11.7021 + 11.7021 \right] = 9.44 \text{ wolverines},$$

$$\widehat{\mathrm{SE}}(\hat{\tau}_p) = \sqrt{\frac{s_v^2}{n}} = \sqrt{\frac{1}{4-1} \sum_{i=1}^{4} (v_i - \hat{\tau}_p)^2 \Big/ 4} = \sqrt{\frac{6.7930}{4}} = \sqrt{1.698} = 1.30 \text{ wolverines}.$$

Horvitz-Thompson Estimate of τ = Total # of Wolverines

Using the inclusion probabilities in the above table, the estimated total number of wolverines in the region is:

$$\hat{\tau}_\pi = \sum_{k=1}^{v} \frac{y_k}{\pi_k} = \left[ \frac{1}{.90} + \frac{2}{.98} + \frac{2}{.59} + \frac{1}{.97} \right] = 7.57 \text{ wolverines}.$$

The joint inclusion probabilities are computed on page 250 of the text, with the standard error found to be 2.30 wolverines, which is nearly twice that of the separate transects estimate.

Example 2 (Farms Example): Recall the scenario given earlier where line transects are used to sample farms in the farms example. R will be used to estimate population parameters of interest under two settings. First, we consider estimation using a single transect, and second we consider estimation using three transects. The reason for looking at a single transect first is to illustrate clearly the computation of the joint inclusion probabilities, and to examine the gains in terms of the SEs from taking more than one transect.

    > runif(1,0,25)              # Select x-coordinate for transect
    [1] 10.78054

    # Record the widths of the 6 farms intersected
    # ============================================
    > wk <- c(1,4,2,8,5,5)
    > pk <- wk/25                # Prob. of inclusion for a single transect
    > pk                         # p_k = pi_k = w_k/b
    [1] 0.04 0.16 0.08 0.32 0.20 0.20

    # Record overlap of the farms for computation of joint inclusion probabilities
    # ============================================================================
    > wkh <- matrix(0,nrow=6,ncol=6)
    > wkh[1,2] <- 1; wkh[1,3] <- 1; wkh[1,4] <- 1; wkh[1,5] <- 1
    > wkh[1,6] <- 1; wkh[2,3] <- 2; wkh[2,4] <- 4; wkh[2,5] <- 4
    > wkh[2,6] <- 4; wkh[3,4] <- 2; wkh[3,5] <- 2; wkh[3,6] <- 2
    > wkh[4,5] <- 5; wkh[4,6] <- 5; wkh[5,6] <- 4
    > pkh <- wkh/25              # pi_kh = p_kh = w_kh/b
    > pkh
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,]    0 0.04 0.04 0.04 0.04 0.04
    [2,]    0 0.00 0.08 0.16 0.16 0.16
    [3,]    0 0.00 0.00 0.08 0.08 0.08
    [4,]    0 0.00 0.00 0.00 0.20 0.20
    [5,]    0 0.00 0.00 0.00 0.00 0.16
    [6,]    0 0.00 0.00 0.00 0.00 0.00

    # Estimate the total number of farms (N).
    # Here, yi=1 for all farms
    # =================================================================
    > yk <- rep(1,6)
    > yk
    [1] 1 1 1 1 1 1
    > tau.hat <- sum(yk/pk)      # tau.hat_pi = sum_{i=1}^{v} y_i/pi_i
    > tau.hat                    # H-T Estimate of N (N=79 here)
    [1] 56.875

    # Estimate the variance of the estimated total number of farms (Eq. 6, p.54)
    # ==========================================================================
    #   c1 = sum_{k=1}^{v} (1/pi_k^2 - 1/pi_k) * y_k^2
    #   c2 = 2 * sum_{k=1}^{v} sum_{h>k} (1/(pi_k*pi_h) - 1/pi_kh) * y_k * y_h
    > c1 <- sum((1/pk^2 - 1/pk)*yk^2)
    > c2 <- 0
    > for (k in 1:5){
        for (h in (k+1):6){
          c2 <- c2 + 2*((1/(pk[k]*pk[h]) - 1/pkh[k,h])*yk[k]*yk[h])
        }
      }
    > var.tau.hat <- c1 + c2
    > sqrt(var.tau.hat)          # Estimated SE of tau.hat
    [1] 52.51562

    # Estimate the total number of workers
    # ====================================
    > yk <- c(3,5,1,9,4,5)
    > tau.hat <- sum(yk/pk)
    > tau.hat                    # Estimated number of workers
    [1] 191.875
    > c1 <- sum((1/pk^2 - 1/pk)*yk^2)
    > c2 <- 0
    > for (k in 1:5){
        for (h in (k+1):6){
          c2 <- c2 + 2*((1/(pk[k]*pk[h]) - 1/pkh[k,h])*yk[k]*yk[h])
        }
      }
    > var.tau.hat <- c1 + c2
    > sqrt(var.tau.hat)          # Estimated SE of tau.hat
    [1] 172.0585

Now, suppose three transects are selected at random. We will use the transect selected above and just select two more below.

    > runif(2,0,25)              # Two more randomly selected x-coordinates
    [1] 7.020874 4.758928

    # Widths and Selection Probabilities for all 3 Transects
    # ======================================================
    > w1 <- c(1,4,2,8,5,5)
    > w2 <- c(3,3,4,4,2,4,3,2,6,3)
    > w3 <- c(5,2,2,4,3,2,5,6,2)
    > p1 <- w1/25; p2 <- w2/25; p3 <- w3/25

    # Estimate the total number of farms (separate transects)
    # =======================================================
    > y1 <- rep(1,6)
    > y2 <- rep(1,10)
    > y3 <- rep(1,9)
    > v1 <- sum(y1/p1)
    > v2 <- sum(y2/p2)
    > v3 <- sum(y3/p3)
    > c(v1,v2,v3)
    [1] 56.875 81.250 78.750
    > mean(c(v1,v2,v3))          # Estimated total number
    [1] 72.29167                 # of farms: tau.hat_p = (1/n) * sum_{i=1}^{n} v_i
    > sqrt(var(c(v1,v2,v3))/3)   # SE of the estimate: SE(tau.hat_p) = sqrt(s_v^2/n)
    [1] 7.742043

    # Estimate the total number of workers (separate transects)
    # =========================================================
    > y1 <- c(3,5,1,9,4,5)
    > y2 <- c(4,2,7,5,3,4,5,2,2,3)
    > y3 <- c(6,3,3,4,2,3,3,2,1)
    > v1 <- sum(y1/p1)
    > v2 <- sum(y2/p2)
    > v3 <- sum(y3/p3)
    > c(v1,v2,v3)
    [1] 191.875 287.500 220.000
    > mean(c(v1,v2,v3))          # Estimated total number of workers
    [1] 233.125                  # (true total: sum_{i=1}^{N} y_i = 249)
    > sqrt(var(c(v1,v2,v3))/3)   # SE of the estimate
    [1] 28.3739

    # Horvitz-Thompson Estimates: Use only DISTINCT farms
    # ===================================================
    > w1 <- c(1,4,2,8,5,5)
    > w2 <- c(3,3,4,4,2,4,3,2,6,3)
    > w3 <- c(5,2,2,3,2,5,2)
    > p1 <- w1/25
    > p2 <- w2/25
    > p3 <- w3/25

    # Estimate the total number of farms (H-T)
    # ========================================
    #   tau.hat_pi = sum_{k=1}^{v} y_k/pi_k = sum_{k=1}^{v} y_k/(1 - (1 - p_k)^n)
    > y1 <- rep(1,6)
    > y2 <- rep(1,10)
    > y3 <- rep(1,7)
    > p <- c(p1,p2,p3)
    > y <- c(y1,y2,y3)
    > sum(y/(1-(1-p)^3))
    [1] 77.26119

    # Estimate the total number of workers (H-T)
    # ==========================================
    #   tau.hat_pi = sum_{k=1}^{v} y_k/pi_k = sum_{k=1}^{v} y_k/(1 - (1 - p_k)^n)
    > y1 <- c(3,5,1,9,4,5)
    > y2 <- c(4,2,7,5,3,4,5,2,2,3)
    > y3 <- c(6,3,3,2,3,3,1)
    > p <- c(p1,p2,p3)
    > y <- c(y1,y2,y3)
    > sum(y/(1-(1-p)^3))
    [1] 253.6919

The standard errors for the Horvitz-Thompson estimates were not calculated here, as the code becomes increasingly lengthy with more than 1 transect.

Farms Map

The 25 × 25 grid map below represents the boundaries of farms in a certain area.

• The letters for each farm represent the "type" of farm.

• The numbers for each farm represent the number of workers on the farm.

• After adding the missing farm information from earlier in the course, there are $N = 79$ total farms and $\tau_y = 249$ total workers.
[Figure: Farms map. A 25 × 25 grid showing the farm boundaries, with each farm labeled by its type letter (A-D) followed by its number of workers.]
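The handout does not compute Horvitz-Thompson standard errors for more than one transect. As a hedged sketch of the first step that calculation would need, the R code below computes inclusion and joint inclusion probabilities for $n$ transects. It assumes the $n$ transects are selected independently and that `pkh` holds, for each pair of objects, the probability that a single transect intersects both (overlap width divided by $b$). The function names `incl_prob()` and `joint_incl_prob()` are hypothetical, and the joint formula follows from inclusion-exclusion rather than from code in the text.

```r
# Inclusion probability for n independently selected transects:
#   pi_k = 1 - (1 - p_k)^n
incl_prob <- function(p, n) 1 - (1 - p)^n

# Joint inclusion probability, by inclusion-exclusion:
#   pi_kh = pi_k + pi_h - [1 - (1 - p_k - p_h + p_kh)^n]
# p:   single-transect selection probabilities p_k = w_k / b
# pkh: symmetric matrix of single-transect joint selection probabilities
#      (overlap width / b); its diagonal is ignored.
joint_incl_prob <- function(p, pkh, n) {
  pik <- incl_prob(p, n)
  either <- 1 - (1 - outer(p, p, "+") + pkh)^n  # P(k or h is included)
  out <- outer(pik, pik, "+") - either
  diag(out) <- pik                              # pi_kk = pi_k
  out
}

# Wolverine example: p_k = w_k / 12 with n = 4 systematic transects.
p_wolv <- c(5.25, 7.50, 2.40, 7.05) / 12
pi_wolv <- round(incl_prob(p_wolv, 4), 2)       # 0.90 0.98 0.59 0.97

# With n = 1 the joint inclusion probability reduces to p_kh itself.
p_demo   <- c(0.2, 0.4)                         # made-up values
pkh_demo <- matrix(c(0, 0.1, 0.1, 0), nrow = 2) # made-up overlap probability
joint_demo <- joint_incl_prob(p_demo, pkh_demo, 1)
```

The resulting matrix could then be supplied, together with the responses of the distinct intercepted objects, to the estimated variance formula for $\hat{\tau}_\pi$; doing so for the three farm transects would also require the overlap widths between farms on different transects, which are not recorded in this handout.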