Ocean Dynamics
DOI 10.1007/s10236-010-0306-2

Lagrangian analysis by clustering

Inga Monika Koszalka · Joseph H. LaCasce

Received: 15 February 2010 / Accepted: 21 May 2010
© Springer-Verlag 2010

Abstract We propose a new method for obtaining average velocities and eddy diffusivities from Lagrangian data. Rather than grouping the drifter-derived velocities in geographical bins, we group them by nearest-neighbor distance using a clustering algorithm. This yields sets with approximately the same number of observations, covering unequal areas. A major advantage is that, because the number of observations is the same for the clusters, the statistical accuracy is more uniform than with geographical bins. We illustrate the technique using synthetic data from a stochastic model, employing a realistic mean flow. The latter represents the surface currents in the Nordic Seas and is strongly inhomogeneous in space. We use the clustering algorithm to extract the mean velocities and diffusivities and compare the results with the corresponding quantities from the stochastic model. We perform a similar comparison with the means and diffusivities obtained with geographical bins. Clustering is more successful at capturing the mean flow and improves convergence in the eddy diffusivity estimates. We discuss both the advantages and shortcomings of the new method.

Keywords Lagrangian analysis · Eddy diffusivity · Binning · Clustering

Responsible Editor: John Grue

I. M. Koszalka (✉) · J. H. LaCasce
Department of Geosciences, University of Oslo, P.O. Box 1022, Blindern, 0315 Oslo, Norway
e-mail: inga.koszalka@geo.uio.no

1 Introduction

Lagrangian instruments, surface drifters and subsurface floats, are widely used for measuring oceanic velocities. Their increased use in recent decades has resulted in coverage over large parts of the world oceans (e.g., http://www.aoml.noaa.gov/phod/dac/gdp.html).
Given the amount of data being generated, it is important to continually improve our analysis techniques, to extract as much information as possible from that data. There is a wide range of Lagrangian data analysis techniques (LaCasce 2008). The most common technique involves estimating Eulerian mean velocities and diffusivities. With these quantities, one can write an advection-diffusion equation describing the evolution of a tracer (Davis 1991):

$$\frac{\partial \theta}{\partial t} + \mathbf{U} \cdot \nabla \theta = \nabla \cdot (\mathbf{K} \nabla \theta) \tag{1}$$

Lagrangian data can be used to determine U and K, the time-mean velocity and the eddy diffusivity tensor, both of which can vary in space.

The method for calculating U and K is described by Davis (1991). Consider a data set covering a certain region. The drifter trajectories are used to calculate velocities along the drifter paths, by differencing. Then these velocities are grouped in geographical bins of a specified size to estimate the mean velocities in the bins (Fig. 1a). The means pertain to the period spanned by the data set. One assumes that the sampling in the bins is sufficient to capture the actual Eulerian means and that the statistics are stationary over this period. Examples of such calculations are found in Rossby et al. (1983), Owens (1991), Poulain et al. (1996), Swenson and Niiler (1996), and Fratantoni (2001).

Fig. 1 a A sketch showing Lagrangian observations grouped in geographical bins. b Lagrangian data partitioned by the clustering algorithm under the constraint of a prescribed number of members in a cluster

The diffusivity calculation stems from that of Taylor (1921). For example, in the zonal direction, this is:

$$\kappa_{xx}(t) \equiv \frac{1}{2} \frac{d}{dt} \langle x_L^2(t) \rangle = \langle x_L(t)\, u_L(t) \rangle = \int_0^t \langle u_L(t)\, u_L(\tau) \rangle \, d\tau = \int_0^t P_{xx}(\tau) \, d\tau \tag{2}$$

where x_L is the Lagrangian displacement, u_L the Lagrangian velocity, and P(τ) the time-lagged Lagrangian velocity covariance. Davis (1991) allows for the diffusivity to also vary in space.
To calculate this, one replaces the velocities above with "residual velocities", those with the mean removed, and the same with the displacements. The diffusivities are obtained for each bin as averages over all trajectories in the bin. As such, the diffusivity is a mixed Eulerian–Lagrangian measure. It is Lagrangian because it involves integrating along particle paths, but it is Eulerian because the integral is taken for drifters in a specified area and because it involves subtracting the Eulerian mean.

There are a number of practical issues with regard to binning (e.g., Mariano and Ryan 2007). One concerns the bin size. The bins should be small enough to resolve the mean flow but larger than the scale of the energy-containing eddies. They should also be large enough to yield a statistically significant estimate. The significance necessarily varies between bins, as the amount of data in each bin varies. Such variations can lead to bias errors (Davis 1991).

The diffusivities are similarly affected by the bin size. We assume that the diffusivity converges at long times, i.e., κ(x, t) → κ∞(x) as t → ∞. However, the integration time in Eq. 2 depends on the time a drifter spends in the bin, and this will generally differ between individual drifters in the same bin. As such, the mean autocorrelation derives from segments of differing lengths, and this can affect the convergence of the integral (see below). Using larger bins improves this, by allowing for longer individual segments, but some tracks will always be shorter than others.

The binning technique has been widely applied to ocean data, and different bin sizes and even different bin shapes and orientations have been explored (e.g., Swenson and Niiler 1996; Falco et al. 2000; Poulain 2001; Jakobsen et al. 2003; Lumpkin and Garraffo 2005; Davis 1998; Thompson et al. 2009). Improvements such as fitting the binned velocities with cubic splines (Bauer et al.
2002), using different sized bins for the means and diffusivities (e.g., Poulain et al. 1996; Swenson and Niiler 1996), using different asymptotic limits for the diffusivity integration (e.g., Poulain et al. 1996; Brink et al. 2000; Thompson et al. 2009), and using different equivalent formulations to Eq. 2 (e.g., Colin de Verdiere 1983; Zhurbas and Oh 2003) have all been explored.

Hereafter, we examine an alternate idea. Rather than grouping the velocities in bins of fixed size, we group a specified number of nearest-neighbor realizations together (Fig. 1b) using a clustering algorithm. Such algorithms are used in diverse fields, such as data mining, image processing, and bioinformatics (Lloyd 1982; Kanungo et al. 2002; MacKay 2003). Specifying the number of members in the cluster then determines the number and spatial extent of the clusters for the whole data set. The resulting mean velocities are on a nonuniform grid. However, the coverage is determined by the data; we do not obtain estimates where there are few or no measurements. A major advantage, though, is that there are approximately the same number of realizations in each cluster. As such, the standard error will depend only on the standard deviation of the velocity rather than also depending on the number of observations in the bin.

The calculation of the diffusivities also differs. First, we evaluate the velocity autocorrelation with Eq. 2 for a chosen fixed period of time. We assign a position to each autocorrelation (the midpoint along the trajectory segment) and then cluster those positions. We then average the autocorrelations in the cluster, with each cluster having a prescribed number of segments. The average is then integrated over the time interval equal to the segment length to obtain an estimate of κ∞(x). The length and number of contributing trajectories are thus the same, and these values can be adjusted to improve convergence.
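The segment-based calculation just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' code: the sample layout `(x, y, u_res)`, the function names, the unbiased lagged-product estimator, and the trapezoidal integration are all choices made here for the example. The clustering of the midpoints is omitted; the sketch simply averages whatever segments it is given.

```python
from typing import List, Sequence, Tuple

# Hypothetical sample layout: (x, y, residual velocity u')
Sample = Tuple[float, float, float]


def split_segments(track: Sequence[Sample], n: int) -> List[Sequence[Sample]]:
    """Cut a track into non-overlapping segments of n samples; drop the remainder."""
    return [track[i:i + n] for i in range(0, len(track) - n + 1, n)]


def midpoint(segment: Sequence[Sample]) -> Tuple[float, float]:
    """Position assigned to a segment: the sample halfway along it."""
    x, y, _ = segment[len(segment) // 2]
    return x, y


def autocovariance(u: Sequence[float]) -> List[float]:
    """Lagged covariance P[k]: mean of u[i] * u[i + k] over the available pairs."""
    n = len(u)
    return [sum(u[i] * u[i + k] for i in range(n - k)) / (n - k) for k in range(n)]


def mean_autocovariance(segments: Sequence[Sequence[Sample]]) -> List[float]:
    """Average the autocovariances of equal-length segments, lag by lag."""
    covs = [autocovariance([s[2] for s in seg]) for seg in segments]
    return [sum(c[k] for c in covs) / len(covs) for k in range(len(covs[0]))]


def diffusivity(P: Sequence[float], dt: float) -> List[float]:
    """kappa at each lag: running trapezoidal integral of P (Eq. 2)."""
    kappa = [0.0]
    for k in range(1, len(P)):
        kappa.append(kappa[-1] + 0.5 * (P[k - 1] + P[k]) * dt)
    return kappa
```

In practice, one would pass to `mean_autocovariance` only the segments whose midpoints fall in a given cluster, so that every diffusivity curve is built from the same number of equally long segments.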
The method of calculating the diffusivity is similar to that used previously by Garraffo et al. (2001), Lumpkin and Flament (2001), Lumpkin et al. (2002), and Rupolo (2007). These authors also used trajectory segments of a fixed length in calculating the diffusivity. In contrast though, most used mean velocities from individual trajectories rather than the interpolated Eulerian means estimated from the entire data set. And their estimates were grouped into geographical bins, yielding different numbers of data points in each bin.

We illustrate the clustering method hereafter using synthetic trajectories. The latter are generated with a first-order stochastic model, using mean velocities representative of the surface currents in the Nordic Seas. The result is a data set with known mean velocities and diffusivities, allowing us to test the accuracy of our estimates. In addition, we calculate corresponding estimates using bins and compare the results. The currents in the Nordic Seas are narrow and strongly inhomogeneous, so this is a fairly strenuous test. Using synthetic data also ensures that we are not limited by the size of the data set.

Previous authors have used stochastic models for Lagrangian analysis (e.g., Griffa 1996; Falco et al. 2000; Garraffo et al. 2001; Veneziani et al. 2004; Rupolo 2007; Sallee et al. 2008). The goal in these studies was to use the stochastic models to reproduce dispersion characteristics in observations. We are treating the stochastic trajectories as the observations, as was done, for example, by Bauer et al. (1998). Davis (1991) used synthetic trajectories in this way, to evaluate estimation errors under binning. However, he did not address the dependence on bin size, an issue addressed here.

The paper is organized as follows: The study region and simulated Lagrangian particles are described in Section 2. In Section 3, we consider mean velocities, and eddy diffusivities are addressed in Section 4.
We discuss the results in Section 5.

2 Data

For the synthetic trajectories, we employ a first-order stochastic model (e.g., Griffa 1996), for which the particle positions are given by:

$$
\begin{aligned}
dx_i &= (u_i + U(x, y))\,dt, & dy_i &= (v_i + V(x, y))\,dt, \\
du_i &= -\frac{1}{T_L}\, u_i\,dt + \sqrt{\frac{2}{T_L}}\,\nu\,dw, & dv_i &= -\frac{1}{T_L}\, v_i\,dt + \sqrt{\frac{2}{T_L}}\,\nu\,dw.
\end{aligned} \tag{3}
$$

The subscript refers to the particle, (U, V) is the background mean flow, ν is the square root of the eddy velocity variance, T_L is the Lagrangian integral time scale, and dw is a Wiener (normal) noise process. The two components of the velocity u and v are assumed independent. The velocity autocorrelation is given by:

$$P(\tau) = \langle u(t)\, u(t + \tau) \rangle = \nu^2 e^{-\tau / T_L}. \tag{4}$$

From Eq. 2, the diffusivities have the asymptotic value of κ∞ = ν² T_L.

As noted, we use estimates of the surface currents in the Nordic Seas for the mean velocities, (U, V). The dominant feature here is the Norwegian Atlantic Current, off the western Norwegian coast. This is 20–30 km wide in its core, a distance somewhat larger than the deformation-scale eddies (5–10 km) which are ubiquitous here (Poulain et al. 1996; Skagseth and Orvik 2002; LaCasce 2005; Koszalka et al. 2009). Our representation derives from a 1-year simulation with the 4-km MIPOM model of the Norwegian Meteorological Institute. This produces fairly realistic velocity fields (LaCasce and Engedahl 2005). The velocities were resampled on a regular grid of 0.25° × 0.25° and are contoured in Fig. 2a. The means were then interpolated onto the particles' instantaneous positions for advection.

The model also requires the root mean square (rms) velocity, ν, and the Lagrangian integral time scale, T_L. Based on earlier estimates (Poulain et al. 1996; LaCasce 2005; Andersson et al., submitted for publication), we assign values of ν = 20 cm/s and T_L = 1 day.¹ This yields an effective length scale L = νT_L = 18 km, comparable to the core width of the Norwegian Atlantic Current. For simplicity, we assume that the eddy statistics are isotropic and homogeneous. Koszalka et al. (2009) used a similar stochastic model for comparison with drifter trajectories in the same region.

¹ Poulain et al. (1996) found T_L = 1–3 days here, while Andersson et al. (submitted for publication) estimated T_L = 1.1 days. LaCasce (2005) found that the Eulerian integral time is 1 to 2 days, which implies an equal or shorter Lagrangian time.

Two thousand particles were deployed on a regularly spaced grid and advected for 60 days, yielding ca. 105,000 drifter days. This is comparable to the number of actual drifter days currently available in the Nordic Seas; however, the areal coverage in the synthetic set is much more uniform. Seeding on a uniform grid also reduces the "array bias", which can influence the diffusivities (Davis 1991). Some particles collided with the coast or islands, and we discarded the subsequent portions of those trajectories. The model time step was dt_o = 0.01 day, and the data were saved with a time step of dt = 0.1 day (one tenth of the integral time). The resulting trajectories are plotted in Fig. 2b. For comparison, we ran an additional simulation with 2,000 stochastic particles with zero mean flow (U = 0, V = 0), all other parameters being the same.

Fig. 2 a Magnitude of the mean velocity field |U(x, y)| = √(U² + V²) (centimeters per second) from a MIPOM model simulation of the Nordic Seas used to advance stochastic particles according to Eq. 3. b Trajectories from 2,000 synthetic particles evolved for 60 days with a first-order stochastic model embedded in this mean flow. Deployment positions are marked with circles

3 Mean velocities

We focus first on extracting the mean velocities from the drifters. The resulting estimates will be compared to the actual U, V values from the MIPOM simulation (used as input to generate the trajectories).
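For reference, the first-order stochastic model of Section 2 (Eq. 3) that generates these trajectories can be sketched as follows. This is a minimal illustration and makes several assumptions: an Euler–Maruyama time step is used, the eddy velocities are started from their stationary distribution, and `mean_flow` is a hypothetical user-supplied callable standing in for the interpolated model field. The paper does not state which integration scheme was used.

```python
import math
import random


def simulate_particle(x0, y0, mean_flow, nu, T_L, dt, n_steps, rng):
    """Euler-Maruyama integration of the first-order model for one particle.

    mean_flow(x, y) -> (U, V) is a hypothetical callable for the mean field;
    nu is the rms eddy velocity and T_L the Lagrangian integral time.
    Returns a list of (x, y, u, v) samples, one per time step.
    """
    x, y = x0, y0
    # Assumption: start the eddy velocities from the stationary N(0, nu^2).
    u, v = rng.gauss(0.0, nu), rng.gauss(0.0, nu)
    amp = nu * math.sqrt(2.0 * dt / T_L)  # noise amplitude per step
    path = [(x, y, u, v)]
    for _ in range(n_steps):
        U, V = mean_flow(x, y)
        # Positions are advanced with the mean flow plus the eddy velocity.
        x += (u + U) * dt
        y += (v + V) * dt
        # Eddy velocities relax toward zero over T_L and are forced by noise.
        u += -u / T_L * dt + amp * rng.gauss(0.0, 1.0)
        v += -v / T_L * dt + amp * rng.gauss(0.0, 1.0)
        path.append((x, y, u, v))
    return path
```

With positions in kilometers and time in days, ν = 20 cm/s corresponds to 17.28 km/day, so a zero-mean-flow run like the one used for comparison would call `simulate_particle(x0, y0, lambda x, y: (0.0, 0.0), 17.28, 1.0, 0.01, 6000, random.Random(0))`.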
We have velocities with a time step of dt = T_L/10, but we use only a subset of these for calculating the means, with dt = 2T_L. Then each observation is treated as independent.

3.1 Methods

For binning the velocities, we must first choose the bin sizes. The bins should be small enough to resolve the mean flow but larger than the eddy scale. They should also be large enough to yield statistically significant estimates. The Nordic Seas is problematic in this regard because the mean and the eddy scales are comparable. Previous authors used (2° × 1°) bins in this region (Poulain et al. 1996; Saetre 1999; Jakobsen et al. 2003).² Such bins have a length scale of roughly 100 km. We denote this as our "intermediate" bin size. In addition, we examine larger and smaller bins, with dimensions (4° × 2°) and (1° × 0.5°).

² The dimensions are listed (degrees longitude × degrees latitude). With (2° × 1°), the bins are close to square in the southern part of the domain but are more rectangular in the north.

For the clustering, we employ the "k-means" clustering algorithm (Lloyd 1982). The algorithm partitions the n_T observations (x_1, x_2, ..., x_n) into k subsets (clusters), S = S_1, S_2, ..., S_k, such that each observation is assigned to the nearest cluster in a way that minimizes the sum, over all clusters, of the squared distance between cluster members and the cluster center μ_i:

$$\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \| x_j - \mu_i \|^2. \tag{5}$$

As the cluster centers themselves depend on the positions of the observations, this is necessarily done iteratively, in a two-step assignment/update process. In the assignment step, each data point is assigned to the nearest center. In the update step, cluster centers are adjusted to match the sample means of their member data points. This is repeated until the assignments are unchanged. For more information on clustering algorithms, see, e.g., Kanungo et al. (2002) and MacKay (2003).
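The two-step assignment/update iteration can be sketched as below. This is the plain Lloyd algorithm on planar points, written with the standard library only. The random initialization from the data, the function name, and the Euclidean distance on raw coordinates are assumptions of this example, and it does not enforce clusters of equal size.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def kmeans(points: Sequence[Point], k: int, seed: int = 0,
           max_iter: int = 100) -> Tuple[List[Point], List[int]]:
    """Plain Lloyd iteration; returns (centers, labels)."""
    rng = random.Random(seed)
    # Assumption: initial centers are k data points chosen at random.
    centers = rng.sample(list(points), k)
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments unchanged: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels
```

For drifter positions in longitude and latitude, one would probably first convert to a local Cartesian projection so that the squared distance in Eq. 5 is a physical distance; that step is left out here.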
The main parameter to be specified is k, the number of clusters. If we wish to have clusters with m members, then k = n_T/m. As with the bins, we use three choices, ranging from coarser to finer resolution. We chose m so that the mean standard error among the clusters was the same as that in the corresponding bins. The error is defined as:

$$\langle \sigma \rangle = \left\langle \frac{\nu}{\sqrt{n}} \right\rangle, \tag{6}$$

where ν and n are the rms velocity and the number of realizations in the bin/cluster and the brackets indicate an average over all the bins/clusters. Alternately, we could have chosen m to match the mean number of observations in the bins, but the latter varies widely among bins, as will be seen. Matching mean errors yields clusters with m = 125, m = 75, and m = 45 members. To guarantee that all the clusters have approximately m observations, we modified the k-means algorithm (as described in "Appendix").

The various parameters for the bins and clusters are shown in Table 1. Note that the "coarse" bins are roughly twice as large as the coarse clusters and have nearly twice as many observations, on average. The "fine" bins and clusters are more comparable in both regards.

3.2 Results

Shown in Fig. 3 are the means obtained by bin-averaging (panels a–c) and by clustering (panels d–f). In the lower panels, the clustered means are linearly interpolated onto the same grid as for the input model field (panels g–i), for comparison with the actual mean flow, in Fig. 2a.

Consider the bins first (panels a–c). With the finest resolution (1° × 0.5°), the major structures in the surface current are recovered. These include the inflow north of Iceland and the inner and outer branches of the Norwegian Atlantic Current (e.g., Orvik and Niiler 2002). With the (2° × 1°) bins, we observe where the currents are stronger and weaker but lose much of the finer structure. The currents with the (4° × 2°) bins are hard to recognize.

The results from clustering are shown in panels d–f.
With m = 45, the means are about as well resolved as those in the finest resolution bins, with the exception of the currents along the northern periphery (which are not resolved here but marginally seen with the binned set). But the m = 75 and m = 125 clusters are also fairly successful at capturing the mean flow structure. The primary difference is that, with larger m, there are fewer clusters.

Of course, part of the difference between the clustered means and the actual field (Fig. 2a) is due to the uneven plotting with the former. Interpolating the clustered means onto the same (0.25° × 0.25°) grid as for the input mean flow yields the fields in the lower panels of Fig. 3. We see that the primary structures are captured in the clustered means, even with m = 125 (Fig. 3g). Interpolating the binned means on the other hand produces smoothed versions of those fields (not shown) and produces results comparable to the input field only with the (1° × 0.5°) bins.

Figure 4 shows further how the statistics vary between the two methods. In panels a and b, we plot the distributions of the number of observations in the bins and clusters, respectively. While the largest bins have many observations, the majority have far fewer. Thus, the distributions are skewed and the mean number of observations in the bins (Table 1) is not representative of the majority. The clusters on the other hand have nearly a delta-function distribution; all the clusters have approximately m observations, by design.

A second difference is seen in Fig. 4c, which shows the size of the bins and clusters as a function of the mean standard error. For a given error, the average bin covers a larger area than the corresponding cluster. Moreover, the area covered by the clusters is less sensitive to the mean error than with the bins. Both points follow from the differences in the numbers of observations. Since the clusters have roughly the same number of observations, it is easier to control the error.
But the errors in the bins vary widely, just as the numbers of observations do. Since the clusters cover smaller areas, they are more successful at capturing the finer-scale structures in the mean.

Table 1 Parameters of the binning and clustering assignments

Resolution | Bins: Long × Lat (°) | Length scale (km) | No. of bins | ⟨n⟩ | ⟨σ⟩ (cm/s) | Clusters: m | N_c | D_c (km) | ⟨σ⟩ (cm/s)
Coarse | 4 × 2 | 186 | 61 | 865 | 1.8 | 125 | 452 | 80 | 1.9
Medium | 2 × 1 | 93 | 225 | 235 | 2.5 | 75 | 775 | 54 | 2.5
Fine | 1 × 0.5 | 47 | 839 | 63 | 3.5 | 45 | 1,356 | 34 | 3.3

Bin size (long × lat), bin length scale in kilometers (square root of the area covered by the bin), number of bins, average number of observations in bins, mean standard error in the bins, number of members in cluster, number of clusters, mean cluster diameter, and mean standard error in clusters

Fig. 3 Pseudo-Eulerian estimate of the mean speed |U(x, y)| derived through averaging of the synthetic Lagrangian observations. Top Obtained by binning the data in grids with varying bin size: 4° × 2° (a), 2° × 1° (b), and 1° × 0.5° (c). Bins with no data are plotted in gray. Middle Obtained by clustering the data with different numbers of members: m = 125 (d), m = 75 (e), and m = 45 (f). Bottom Clustered estimates interpolated onto a regular grid of (long × lat) = 0.25° × 0.25°: m = 125 (g), m = 75 (h), and m = 45 (i)

The standard error determines the significance of the means in the bins/clusters. In Fig. 5, we examine where the calculated means differ significantly from the actual means, averaged over the same areas. Bins in which the means are not statistically different are shown in blue while the purple bins indicate a significant difference (panels a–c). Panels d–f show the corresponding fields for the clusters.

One might expect that, because increasing the bin/cluster area increases the number of observations in them, this would likewise reduce the errors. But the percentage of rejected bins is actually greater with larger bins and clusters than with smaller ones. The reason for this lies with the mean flow. Because the mean is so inhomogeneous, using a larger bin involves averaging over a wider range of U, V values. The standard error is smaller because the number of observations is greater, making it less likely that the two estimates are statistically the same. In a sense, the larger bins produce a more certain answer of an incorrect velocity. With smaller bins, the sampled mean is more homogeneous and the error larger, increasing the probability of reproducing the mean flow fields correctly in the bin area.

Fig. 4 a Distributions of the number of independent observations grouped in bins of different size. b Distributions of the number of independent observations in clusters obtained with different parameter m. c Mean length scale (square root of the area covered by the bin and cluster diameter) vs. mean sampling error for binning and clustering analyses

A larger proportion of bins than clusters are rejected for a given mean standard error (Fig. 5g). This is again because the clusters cover smaller areas. The three types of cluster used produce a rejection rate between 14% and 22%, while 25–60% of the bins are rejected. Thus, the clusters are more successful at capturing the actual means.

Fig. 5 Comparison of the means in the bins a–c and clusters d–f with the actual means, U, V, used in generating the particle trajectories. Purple color codes bins/clusters that have means which are different from the actual means at the 95% confidence level, and blue color indicates means that are the same. g Percentage of bins and clusters where the means are significantly different ("rejected")

We would obtain different results had we used a different metric for comparing the bins and clusters. For instance, if we match the mean number of observations, we obtain larger clusters and a less well-resolved mean. But, as noted earlier, the mean number of observations is not representative for the bins, due to their skewed distributions of observations. This is because the bins, unlike the clusters, are not necessarily where the data are.

4 Diffusivities

4.1 Diffusivities with zero mean flow

Now we turn to the eddy diffusivities. There are several technical issues to be addressed. First is how the diffusivity is actually calculated. Some compute it from the integral of the ensemble-mean velocity autocorrelation (Eq. 2; e.g., Poulain et al. 1996; Poulain 2001; Thompson et al. 2009).
Others prefer the product of the residual velocity and displacement (e.g., Swenson and Niiler 1996; Zhurbas and Oh 2003), while some compute half the derivative of the mean absolute dispersion (Colin de Verdiere 1983). The different approaches are not often compared (Zhurbas and Oh 2003). We will do so briefly here, using the stochastic trajectories with zero mean flow. Without a mean flow, the residual velocities are the same as the particle velocities and are homogeneous.

The diffusivity estimates, κ(t), from the three methods are shown in Fig. 6a. Also shown is the theoretical curve, obtained by integrating the exponential autocorrelation for a first-order stochastic process:

$$\kappa(t) = \nu^2 \int_{-t}^{0} \exp(-|t'| / T_L) \, dt' = \kappa^{\infty} \left[ 1 - \exp\left( -\frac{t}{T_L} \right) \right]. \tag{7}$$

With T_L = 1 day and ν = 20 cm/s, the asymptotic limit is κ∞ = ν² T_L = 3.46 × 10⁷ cm²/s.

The derivative of the absolute dispersion and the product of the velocity and displacement produce the same result, within the errors. The diffusivities asymptote to the theoretical limit after 3 to 4 days but exhibit significant oscillations thereafter. The integral of the autocorrelation on the other hand yields a smoother curve, and this lies near the theoretical curve. The reason this differs from the other two is that integrating the mean autocorrelation is a smoothing operation, so much of the variability seen in the other curves is removed; Davis (1991) concluded the same. We will use this method exclusively hereafter.

There are two additional points. First is that the curves in Fig. 6a derive from 2,000 particles, an enormous number in relation to most observational studies. Such experiments typically have at best an order of magnitude fewer, and this affects the convergence. Examples with fewer particles, using the integrated autocorrelation, are shown in Fig. 6b. With an ensemble of 100 particles, the diffusivity estimate is within 10% of the theoretical value.
The asymptote can be approximately correct with fewer particles, but the errors are larger.

Second, because the diffusivities should converge after 3 to 4 days, we require track segments of at least that length to obtain proper estimates. Shown in Fig. 6c are the integrals obtained with 100 trajectories of varying length. For tracks of 5 days or less, the curves asymptote to values below the theoretical limit. Evidently track lengths of 10 days, or ten times the integral time, are required to obtain reasonable estimates.

So even in the best case scenario with no mean flow, a meaningful estimate of the eddy diffusivity requires 100 track segments of at least 10 T_L duration. Knowing this helps interpret the subsequent results with the mean flow restored.

Fig. 6 a Diffusivity curves derived from 2,000 particles evolving for 60 days in a zero mean flow by mean sequences of the time derivative of the absolute dispersion (adisp), mean products of the single-particle velocity and its displacement (ud), and integration of the ensemble averages of the autocorrelation sequences (autocorr), compared to the theoretical value (theor). Estimation errors δκ are derived from errors on the autocorrelation given by the t-test at the 95% significance level. b Diffusivity curves from the autocorrelation method computed with a varying number of particles, each time series being 60 days long, compared to the theoretical curve drawn in black. c Diffusivity curves from the autocorrelation method computed for 100 particles with a varying length of the time series. The theoretical curve is drawn in black

4.2 Diffusivities with a mean flow

Now consider the diffusivities with the mean flow present. We perform the calculations using the three bin and cluster classes discussed previously. For the means, we use averages obtained in the fine resolution cases, i.e., from the (1° × 0.5°) bins and from the m = 45 clusters. Although the mean standard errors are larger with these cases, they best capture the detailed flow structure (Fig. 3). We linearly interpolated those means onto the instantaneous drifter positions to obtain the residual velocities.

For the bins, we use only those segments of drifter trajectories during which the drifters were in the bins. These were of varying length, as the drifters spent different times in the bins. We averaged the autocorrelations from the individual tracks to obtain the bin diffusivity (Davis 1991). We did this for each of the three bin classes (Table 1).

With the clusters, we essentially reverse the procedure. First, we break all trajectories into segments of a chosen, uniform time length. Then we calculate the autocorrelations for each segment. The segment is assigned to a position (the midpoint along the track), and those positions are clustered as in Section 3.2. Then the autocorrelations for all segments in the cluster are averaged and integrated. We chose the number of members in the cluster to be 100. With 10-day segments, this yielded 122 clusters, with a mean radius of 76 km. With 20-day segments, we obtained 62 clusters with a mean radius of 90 km.

Thus, with the bins and clusters, we obtain time series of the diffusivities. The question then is how to estimate the asymptotic value, κ∞. Ideally, we would take the value as t approaches infinity.
But this is impractical because the particles leave the bin after a finite period of time and also because the sampling error increases as t^(1/2) (Davis 1991). A number of authors take the first maximum value of the series, which is similar to integrating the autocorrelation to the first zero-crossing (e.g., Brink et al. 2000; Lumpkin et al. 2002; Rupolo 2007). However, the exponential autocorrelation obtained with the stochastic model theoretically has no zero crossing at finite lag. So instead, we average the diffusivity over a fixed period, from 4 to 8 days. If the mean autocorrelation is shorter than 8 days in a given bin, the integration terminates. If it is shorter than 4 days, no estimate of κ∞ is produced. The results do not change qualitatively when using other choices for the averaging period (e.g., from 5–10 days).

Fig. 7 Maps of eddy diffusivity, scaled by the target theoretical value, derived from the synthetic particles obtained by the binning method for different bin sizes, 4° × 2° (a), 2° × 1° (b), and 1° × 0.5° (c), and by the clustering method for different segment lengths, 20 days (d) and 10 days (e). All estimates were interpolated onto a regular grid of (long × lat) = 0.25° × 0.25° prior to plotting

The resulting estimates for κ∞ are mapped in Fig. 7a–c for bins of different size and in panels d and e for clusters with 100 track segments of 20 and 10 days, respectively. We normalize κ∞ by its theoretical value. For consistency, all estimates are interpolated onto a regular grid of 0.25° × 0.25° and contoured with the same range of values, from 0 to 1.5. The correct value is 1.0, which is contoured in yellow.
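The windowed estimate of κ∞ described above (the average of κ(t) over days 4 to 8) can be sketched as follows. The function name and the handling of short series reflect one reading of the text: a series that ends before day 8 is averaged up to its last available lag, and a series that ends before day 4 yields no estimate. Treat both as assumptions rather than the authors' exact procedure.

```python
from typing import Optional, Sequence


def asymptotic_diffusivity(kappa: Sequence[float], dt: float,
                           t_start: float = 4.0,
                           t_end: float = 8.0) -> Optional[float]:
    """Average kappa over the lags from t_start to t_end (same units as dt).

    kappa[i] is the diffusivity at lag i * dt. Returns None when the
    series stops before t_start, i.e. when no estimate can be formed.
    """
    i0 = round(t_start / dt)
    # A series shorter than t_end is averaged up to its last sample.
    i1 = min(round(t_end / dt), len(kappa) - 1)
    if i1 < i0:
        return None
    window = kappa[i0:i1 + 1]
    return sum(window) / len(window)
```

Dividing the returned value by the theoretical κ∞ = ν² T_L gives the normalized quantity that is mapped in Fig. 7, for which the correct value is 1.0.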
The normalized estimates with the (4° × 2°) bins span the range from near 0 to 1.3. Values that are too low are found near the borders of the domain, and values that are too large occur near the coasts. In the interior, the values are consistently low, with typical values of 0.8–0.9. With the (2° × 1°) bins, the diffusivity exhibits smaller-scale variations, and there are many regions in the interior where the values are too large. The variations are more marked with the (1° × 0.5°) bins, with pockets of high and low values.

Fig. 8 a A scatter plot showing the number of segments and mean segment length obtained in bins for different bin sizes, 4° × 2° (cyan), 2° × 1° (blue), and 1° × 0.5° (green), for the binning method and for different prescribed segment lengths (10 and 20 days, red) for the clustering method (axes: NO. SEGMENTS and MTL (DAYS)). The mean values of these parameters over all bins/clusters are marked with rectangles and circles, respectively. The number of segments refers to τ = 0, and it falls off thereafter due to the variable length of the tracks that occur in the bin. b The spread of estimates of eddy diffusivity κ∞, scaled with the "target" theoretical value, in binning and clustering assignments (axes: BINSIZE (DEGR) and SEGMENT LENGTH (DAYS)). c The error of the diffusivity estimate <δκ>, averaged over all bins/clusters (axes: BINSIZE (DEGR) and SEGMENT LENGTH (DAYS))

The diffusivities with the clusters are more uniform, both for the 20- and 10-day segments. The extreme low estimates found with the bins do not occur. Instead, the values vary between 0.8 and 1.2. There are larger values along the periphery, but also in the interior. A detailed comparison of the bin/cluster statistics is shown in Fig. 8.
Panel a shows the average length of the segments used in calculating the autocorrelation for each cluster or bin. The clusters have segment lengths of 10 and 20 days, by design. The bins have a range of values, but in most cases, the average length is below 7 days. None exceeds 10 days. The mean over all the bins is 5, 3, and 1.5 days, in order of decreasing bin size.

The second point concerns the number of segments. Again, the clusters have nearly the same number. There are small variations, as the clustering procedure could not always obtain 100 segments. Nevertheless, most clusters have 80–120 segments. The bins on the other hand exhibit a wide range of values. There are some (4° × 2°) bins with over 700 segments and others with fewer than 10. And there are some (1° × 0.5°) bins with only two or three segments. The average number of segments is 280, 131, and 64 for the bins, in order of decreasing area.

Based on the findings in Section 4.1, we expect that the binned estimates of κ∞ should vary more and be biased low because the segments are generally too short. This is the case. Shown in Fig. 8b are scatterplots of the diffusivities for the five cases. The bin estimates span the range from zero to 1.5 times the actual diffusivity. The spread is less with the larger bins but still pronounced. In all cases, the diffusivities are skewed toward low values. Thus, the average diffusivities for all bins are also low. The clusters on the other hand yield estimates from 0.8 to 1.2 times the actual diffusivity. The distributions are not skewed, so the averages over all the clusters, both with 10- and 20-day segments, agree with the actual diffusivity.

In Fig. 8c, we plot the diffusivity errors. These derive from the Student's t-test at the 95% significance level, averaged over 4–8 days and over all bins/clusters and normalized by the theoretical value of κ∞. The errors are the largest with the (1° × 0.5°) bins and decrease with increasing bin size.
However, both cluster examples have significantly smaller errors. The mean error is 0.36 times the actual diffusivity with 20-day segments, as compared with 1.29 times the diffusivity for the "best" binning case.

Other methods for determining the diffusivity yield similar results. Using the zero-crossing method for estimating κ∞ yields a similar range of estimates, albeit with slightly larger average diffusivities. The diffusivities are nevertheless skewed to smaller values.

The primary shortcoming with the binning calculation is that the segments are too short. With small bins, there are few particles which remain in any bin for periods longer than T_L. Thus, the mean autocorrelation curves do not reach the asymptotic period (Fig. 6c). An alternative approach, in line with that of Garraffo et al. (2001), Lumpkin and Flament (2001), Lumpkin et al. (2002), and Rupolo (2007), would be to break the trajectories into uniform segments and regroup them in bins. Then one could control the length of the autocorrelations, just as we have done for the clusters. But by grouping in bins, we would still obtain different numbers of observations in different bins, as we found with the mean velocities.

5 Summary and discussion

We considered a new method for calculating pseudo-Eulerian mean velocities and eddy diffusivities from Lagrangian data. This involves grouping a specified amount of data into spatially localized subsets using a "clustering" algorithm (e.g., Lloyd 1982; MacKay 2003). This is in contrast to the commonly used method in which the data are separated into geographical bins of a specified size. We compared the two approaches by analyzing a set of 2,000 trajectories generated with a first-order stochastic model, with a mean velocity representative of that at the surface in the Nordic Seas and with comparable eddy parameters.

Using bins yields Eulerian estimates on a uniform grid.
But as the number of observations varies greatly from bin to bin, so does the statistical significance. Clustering on the other hand produces sets with roughly the same number of observations and trajectory segments of the same length. The resulting means and diffusivities are not uniformly spaced but have much more uniform statistics.

In terms of the mean velocities, clustering produces regions of smaller areal extent than binning, for comparable mean standard errors. The bins have widely different numbers of observations but the clusters have nearly the same, allowing more control of the significance. With smaller areas, the clusters are better able to resolve details of the mean flow. Further, the accuracy is less dependent on the mean standard error with clusters than it is with bins. The means are more accurate with smaller bins, despite the smaller numbers of observations. Binning with a cell size of (2° × 1°), as done previously for the Nordic Seas (Poulain et al. 1996; Saetre 1999; Jakobsen et al. 2003), yields a smooth representation of the mean. Using smaller bins, however, increases the chances of individual bins being rejected for having too few observations (e.g., Poulain et al. 1996; Falco et al. 2000; Thompson et al. 2009). Clustering provides a way around this by allowing the number of observations to be specified a priori.

Diffusivities are a more Lagrangian measure than the means, involving an integral along drifter paths. With bins, these segments are of varying length, which impacts the averages. One often finds too many short segments, and this leads to an underestimate of the diffusivity. With clustering, one specifies a priori how long and how many trajectory segments are used for the averages. The resulting diffusivity estimates exhibit less variation than with the bins and moreover are not skewed toward low values.
Of course the mean and diffusivity calculations are closely related because the means are subtracted from the trajectories prior to calculating the diffusivities. If the means are calculated with bins which are too large, integrals of the resulting residual velocities may not converge (Swenson and Niiler 1996). With clusters, the areal coverage is typically less and the means apply where the trajectories are, so the residual velocities are better captured. We clustered data according to nearest-neighbor distance, but other choices are also possible. One could for instance group data according to distance along an isopycnal or to position vis a vis topography (LaCasce 2000). In addition, we treat each observation equally, but one can weight the observations, for instance with regard to errors on individual positions. Such alterations in the k-means algorithm are straightforward. A related issue is that of “array bias”, in which nonuniform deployments can produce errors in the diffusivities (Davis 1991). While this is often less of a problem than sampling error (Poulain et al. 1996; Garraffo et al. 2001), it is nevertheless an issue with in situ data. Here too, the clustering approach is preferable because diffusivities are determined locally, where the trajectories are. We do not map onto a uniform grid, introducing variations in coverage. However, this mapping onto an irregular grid may be seen as a shortcoming of the clustering approach. If the means and diffusivities are to be used in a model, they must necessarily be interpolated onto a regular grid. In the present case, this interpolation produced reasonable results (Fig. 3g–i) because the data coverage was uniform (Fig. 2). But this is not usually the case with in situ sets. Nevertheless, the procedure of mapping the nonuniform cluster averages onto a regular grid reminds the user of where the data actually is. With binned estimates, this can be less obvious. 
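To illustrate the mapping of cluster estimates onto a regular grid discussed above, the following Python sketch uses inverse-distance weighting. The interpolation scheme, the function name `idw_to_grid`, and the cutoff radius `max_dist` are assumptions made for this example; the text does not specify which interpolation was used. Distances are computed in degrees as if longitude and latitude were Cartesian coordinates, which is a simplification at high latitudes. Grid nodes farther than `max_dist` from every cluster center are left undefined, so that the map still shows where the data are.

```python
import numpy as np


def idw_to_grid(lon_c, lat_c, values, lon_g, lat_g, power=2.0, max_dist=2.0):
    """Map values at cluster centers onto a regular lon/lat grid.

    lon_c, lat_c : cluster-center coordinates (degrees)
    values       : one estimate (mean or diffusivity) per cluster
    lon_g, lat_g : 1-D grid axes (degrees)
    Returns an array of shape (len(lat_g), len(lon_g)); nodes with no
    cluster center within max_dist degrees are set to NaN.
    """
    lon_c = np.asarray(lon_c, dtype=float)
    lat_c = np.asarray(lat_c, dtype=float)
    values = np.asarray(values, dtype=float)
    glon, glat = np.meshgrid(lon_g, lat_g)
    # Distance from every grid node to every cluster center.
    dist = np.hypot(glon[..., None] - lon_c, glat[..., None] - lat_c)
    # Inverse-distance weights; the floor avoids division by zero at a center.
    w = 1.0 / np.maximum(dist, 1e-12) ** power
    field = (w * values).sum(axis=-1) / w.sum(axis=-1)
    # Mask nodes that are far from all data.
    field[dist.min(axis=-1) > max_dist] = np.nan
    return field
```

A call such as `idw_to_grid(lon_c, lat_c, kappa, np.arange(-15, 15.25, 0.25), np.arange(62, 76.25, 0.25))` would produce a field on a 0.25° × 0.25° grid, with gaps wherever no cluster lies nearby.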
A reviewer pointed out that we have avoided the question of time dependency in the mean flow. Indeed, the diffusivity is proportional to the lowest frequency in the Lagrangian spectrum (e.g., LaCasce 2008), and the mean velocity is ideally the component with zero frequency. In regions with pronounced seasonal and/or interannual variability, it is common to segregate the data into climatological groups of several months or years, often combined with filtering in the frequency domain (e.g., Swenson and Niiler 1996; Jakobsen et al. 2003; Sallee et al. 2008). More sophisticated techniques have also been proposed (e.g., Lumpkin 2003). Such processing would in any case be done prior to the proposed clustering, which is really a segregation in space.

In a coming study, we apply the clustering method to drifter data from the Nordic Seas. Preliminary calculations suggest that clustering yields a similarly improved representation of the mean flow and the diffusivities. The primary challenge with the in situ data, in comparison with the present stochastic set, is that the eddy field is also strongly inhomogeneous. So more care is required.

Acknowledgements The work is part of the Poleward project, funded by the Norwegian Research Council Norklima program (grant number 178559/S30). Details are found on http://www.iaoos.no/ and http://folk.uio.no/ingako/my_files/POLEWARD_WEBPAGE_MAIN.html. Harald Engedahl provided the MIPOM velocities. We appreciate useful comments from two anonymous reviewers.

Appendix: The clustering algorithm

We base our clustering procedure on a generalized version of Lloyd's (1982) algorithm for the problem described by Eq. 5. However, contrary to conventional applications of k-means (MacKay 2003), in our problem, the number of clusters k does not need to be guessed but is deduced from the total amount of data to match the desired number of cluster members m.
Hence, we have developed here a procedure to partition the data into clusters with the number of members being as close as possible to a prescribed value m. This heuristic numerical solution is possibly not an optimal one, but it performed well for the purpose of this study. The implementation is done with the MATLAB k-means toolbox, modified accordingly. The steps of the algorithm are as follows:

• Choose the desired number of members in a cluster, m
• Given the total number of independent observations n and m, compute the target number of clusters, k = n/m
• Start k-means procedure ("batch phase")
  – A set of k cluster centers is randomly seeded
  – Assign each point to the nearest cluster center, minimizing the squared Euclidean distance in geographical coordinates (Eq. 5)
  – Recompute the new cluster centers
  – The two previous steps are repeated until the convergence criterion is met (the assignment has not changed or the maximum number of iterations is reached, set to be 200 here)
  – The four previous steps are repeated 100 times (for 100 initial seedings, or "replicates") and the "best solution" (global minimum, that is, the lowest value of the sum of within-cluster distances, summed over all clusters) is the output
• End k-means procedure
• Clusters with the desired number of members are removed from consideration and stored, while the entire clustering procedure is repeated on the smaller data set. The process continues until all the data are grouped in clusters whose number of members lies in the interval (m − 5, m + 5), or until the maximum number of iterations, 400, is reached.

The requirement was not met in some subsets, which were typically clusters peripheral to the data-covered area. These were still included in the further analysis, making the distribution curves in Fig. 4b differ from delta functions. The large number of iterations and the requirement of uniform splitting of the data make the analysis computationally intensive.
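The steps listed above can be sketched in Python as follows. This is a minimal illustration written with NumPy only, not the modified MATLAB k-means code used for the study. The function names, the return format, the stopping rule for a single unsplittable group, and the final clustering of leftover points are assumptions made for the example. The defaults of 200 iterations, 100 replicates, 400 rounds, and a tolerance of 5 members follow the values quoted in the text.

```python
import numpy as np


def lloyd_kmeans(points, k, rng, max_iter=200, replicates=100):
    """Batch k-means (Lloyd's algorithm); returns labels of the best replicate."""
    best_labels, best_cost = None, np.inf
    for _ in range(replicates):
        # Seed the centers with k distinct, randomly chosen points.
        centers = points[rng.choice(len(points), size=k, replace=False)]
        labels = None
        for _ in range(max_iter):
            # Squared Euclidean distance of every point to every center.
            d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            new_labels = d2.argmin(axis=1)
            if labels is not None and np.array_equal(new_labels, labels):
                break  # assignment unchanged: converged
            labels = new_labels
            for j in range(k):
                members = points[labels == j]
                if len(members):
                    centers[j] = members.mean(axis=0)
        # Sum of within-cluster squared distances for this replicate.
        cost = d2[np.arange(len(points)), labels].sum()
        if cost < best_cost:
            best_cost, best_labels = cost, labels.copy()
    return best_labels


def cluster_by_size(points, m, tol=5, max_rounds=400, max_iter=200,
                    replicates=100, seed=0):
    """Partition points into clusters of about m members each.

    Returns (accepted, leftovers): two lists of index arrays into points.
    Accepted clusters have sizes strictly between m - tol and m + tol.
    """
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    remaining = np.arange(len(points))
    accepted = []
    for _ in range(max_rounds):
        if len(remaining) <= max(m - tol, 0):
            break  # too few points left to form an acceptable cluster
        k = max(1, int(round(len(remaining) / m)))
        labels = lloyd_kmeans(points[remaining], k, rng, max_iter, replicates)
        keep = np.ones(len(remaining), dtype=bool)
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if abs(len(idx) - m) < tol:
                accepted.append(remaining[idx])  # store and remove this cluster
                keep[idx] = False
        if k == 1 and keep.all():
            break  # a single group of the wrong size cannot be split further
        remaining = remaining[keep]
    leftovers = []
    if len(remaining):
        # Points that never met the size requirement are still clustered.
        k = max(1, int(round(len(remaining) / m)))
        labels = lloyd_kmeans(points[remaining], k, rng, max_iter, replicates)
        leftovers = [remaining[labels == j] for j in range(k) if np.any(labels == j)]
    return accepted, leftovers
```

For an array `positions` of shape (n, 2) holding longitude and latitude, `accepted, leftovers = cluster_by_size(positions, m=100)` returns the index sets of the clusters. With the default settings the run is slow, which is consistent with the computational cost noted in the text; smaller values of `replicates` and `max_rounds` are convenient for experimentation.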
Because of this computational cost, we do not perform a check for a "local minimum" (in terms of Eq. 5) by a series of reassignments of the points between clusters. Nevertheless, we found that repeated runs of the entire procedure described above led merely to a slightly different arrangement of clusters, while the reported results from the Z-test (Fig. 5) changed only within ±2%. The running time of the entire procedure was about 6 h on an x86_64 GNU/Linux machine with 32 GB RAM.

References

Bauer S, Swenson MS, Griffa A, Mariano AJ, Owens K (1998) Eddy mean flow decomposition and eddy diffusivity estimates in the tropical Pacific Ocean. J Geophys Res 103(C13):30855–30871
Bauer S, Swenson MS, Griffa A (2002) Eddy mean flow decomposition and eddy diffusivity estimates in the tropical Pacific Ocean: 2. Results. J Geophys Res 107(C10):3154
Brink KH, Beardsley RC, Paduan J, Limeburner R, Caruso M, Sires JG (2000) A view of the 1993–1994 California Current based on surface drifters, floats, and remotely sensed data. J Geophys Res 105(C4):8575–8604
Colin de Verdiere A (1983) Lagrangian eddy statistics from surface drifters in the eastern North Atlantic. J Mar Res 41:375–398
Davis RE (1991) Observing the general circulation with floats. Deep-Sea Res Suppl 38:S531–S571
Davis RE (1998) Preliminary results from directly measuring mid-depth circulation in the Tropical and South Pacific. J Geophys Res 103:24619–24639
Falco P, Griffa A, Poulain P-M, Zambianchi E (2000) Transport properties in the Adriatic Sea as deduced from drifter data. J Phys Oceanogr 30:2055–2071
Fratantoni DM (2001) North Atlantic surface circulation during the 1990's observed with satellite-tracked drifters. J Geophys Res 106(C10):22067–22093
Garraffo Z, Griffa A, Mariano AJ, Chassignet EP (2001) Lagrangian data in a high-resolution numerical simulation of the North Atlantic II. On the pseudo-Eulerian averaging of Lagrangian data. J Mar Syst 29:177–200
Griffa A (1996) Applications of stochastic particle models to oceanographical problems. In: Adler R, Muller P, Rozovskii B (eds) Stochastic modelling in physical oceanography. Birkhauser, Boston, pp 114–140
Jakobsen PK, Ribergaard MH, Quadfasel D, Schmith T, Hughes CW (2003) Near-surface circulation in the northern North Atlantic as inferred from Lagrangian drifters: variability from the mesoscale to interannual. J Geophys Res 108(C5):3251
Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892
Koszalka I, LaCasce JH, Orvik KA (2009) Relative dispersion in the Nordic Seas. J Mar Res 67:411–433
LaCasce J (2005) Statistics of low frequency currents over the western Norwegian shelf and slope I: current meters. Ocean Model 55:213–221
LaCasce J (2008) Statistics from Lagrangian observations. Prog Oceanogr 77(1):1–29
LaCasce J, Engedahl H (2005) Statistics of low frequency currents over the western Norwegian shelf and slope II: model. Ocean Model 55:222–237
LaCasce JH (2000) Floats and f/H. J Mar Res 58:61–95
Lloyd SP (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Lumpkin R (2003) Decomposition of surface drifter observations in the Atlantic Ocean. Geophys Res Lett 30(14):1753
Lumpkin R, Flament P (2001) Lagrangian statistics in the central North Pacific. J Mar Syst 29:141–155
Lumpkin R, Garraffo Z (2005) Evaluating the decomposition of Tropical Atlantic drifter observations. J Phys Oceanogr 22:1403–1415
Lumpkin R, Treguier A-M, Speer K (2002) Lagrangian eddy scales in the Northern Atlantic Ocean. J Phys Oceanogr 32:2425–2440
MacKay DJC (2003) Information theory, inference, and learning algorithms. Cambridge University Press, Cambridge
Mariano A, Ryan E (2007) Lagrangian analysis and prediction of coastal and ocean dynamics (LAPCOD review). In: Griffa A, Kirwan AD, Mariano AJ, Ozgokmen T, Rossby T (eds) Lagrangian analysis and prediction of coastal and ocean dynamics, Chapter 13. Cambridge University Press, Cambridge, pp 423–467
Orvik KA, Niiler P (2002) Major pathways of Atlantic Water in the northern North Atlantic and Nordic Seas towards Arctic. Geophys Res Lett 29(19):1896
Owens WB (1991) A statistical description of the mean circulation and eddy variability in the northwestern North Atlantic using SOFAR floats. Prog Oceanogr 28:257–303
Poulain P-M (2001) Adriatic Sea surface circulation as derived from drifter data between 1990 and 1999. J Mar Syst 29:3–32
Poulain P-M, Warn-Varnas A, Niiler PP (1996) Near-surface circulation of the Nordic Seas as measured by Lagrangian drifters. J Geophys Res 101:18237–18258
Rossby HT, Riser SC, Mariano AJ (1983) The western North Atlantic - a Lagrangian viewpoint. In: Robinson AR (ed) Eddies in marine science. Springer, Heidelberg, pp 66–91
Rupolo V (2007) Observing turbulence regimes and Lagrangian dispersal properties in the oceans. In: Griffa A, Kirwan AD, Mariano AJ, Ozgokmen T, Rossby T (eds) Lagrangian analysis and prediction of coastal and ocean dynamics, Chapter 9. Cambridge University Press, Cambridge, pp 231–274
Saetre R (1999) Features of the central Norwegian shelf circulation. Cont Shelf Res 19:1809–1831
Sallee JB, Speer K, Morrow R, Lumpkin R (2008) An estimate of Lagrangian eddy statistics and diffusion in the mixed layer of the Southern Ocean. J Mar Res 66:441–463
Skagseth Ø, Orvik KA (2002) Identifying fluctuations in the Norwegian Atlantic Slope Current by means of empirical orthogonal functions. Cont Shelf Res 22:547–563
Swenson MS, Niiler PP (1996) Statistical analysis of the surface circulation of the California Current. J Geophys Res 101(C10):22631–22645
Taylor GI (1921) Diffusion by continuous movements. Proc Lond Math Soc 20:196–212
Thompson A, Heywood KJ, Thorpe SE, Renner AH, Trasvina A (2009) Surface circulation at the tip of the Antarctic Peninsula from drifters. J Phys Oceanogr 39:3–25
Veneziani M, Griffa A, Reynolds AM, Mariano AJ (2004) Oceanic turbulence and stochastic models from subsurface Lagrangian data for the Northwest Atlantic Ocean. J Phys Oceanogr 34:1884–1906
Zhurbas V, Oh IS (2003) Lateral diffusivity and Lagrangian scales in the Pacific Ocean as derived from drifter data. J Geophys Res 108(C5):3141