Pelagic Regionalisation: DRAFT Classification

Background

At the October 2004 meeting of the Bioregionalisation Working Group (BWG), CSIRO presented preliminary classification results for Levels 1, 2 and 3 of the pelagic environment using physical and chemical variables. Three key issues arose in the discussion of the results:

1. The need to analyse and illustrate the 3D structure, in particular depth variations of classified regions.
2. The need to analyse and depict the variability in the boundaries of regions due to seasonal shifts.
3. The classification at the Ocean Basin scale was deemed too broad in extent and, without biological corroboration, it was to be described qualitatively.

To address the concerns regarding the 3D structure in a consistent manner, and considering the extreme variation in structure evident in the ocean, an attempt was made to extend Rick Smith's depth-layered analyses, presented at the last BWG workshop, to 3D. By implication, a depth-layered analysis requires a priori selection of depth layers, or strategies for dealing with depth variations in an aggregate analysis. This selection in turn influences the outcome of the classification analysis (and the subsequent analysis of the depth structuring).

Other considerations which guided our analyses included:

1. The volume of the data. Each variable was available as a netCDF file comprising 56 depth levels at the following depths (in metres):

0 10 20 30 40 50 60 70 75 80 90 100 110 125 150 175 200 225 250 275 300 350 400 450 500 550 600 650 700 750 800 850 900 950 1000 1100 1200 1300 1400 1500 1600 1750 2000 2250 2500 2750 3000 3250 3500 3750 4000 4250 4500 4750 5000 5500

Within each depth level there were 601 grid points in latitude spanning 0-60°S and 901 grid points in longitude spanning 90-180°E, and each layer had fields for mean and annual amplitude.
Even with the capabilities of our current research computing systems, it was not feasible to analyse such volumes of data concurrently using available statistical software.

2. The variables available for analysis. Temperature, salinity, nitrate, oxygen and silicate were of varying (underlying) spatial resolution, and there was high correlation amongst the variables. Visual analysis of the data shows that silicate is highly influenced by land masses and by current systems that impinge on them. On this basis it was not included in the analysis. Nitrate was highly correlated with oxygen and other physical variables, but its underlying resolution and/or variability caused the resultant distributions to contain what appeared to be spatial artefacts from sampling. Thus, it was also excluded from the analysis. Oxygen likewise was of poor but acceptable resolution (as it is now a relatively standard oceanographic measurement) and it was retained with the standard variables of temperature and salinity.

Analysis Strategy

Mean Analyses

To cope with the volume of data, a qualitative stratification was implemented to subsample the data in latitude/longitude/depth space. The 56 depth levels were reduced to 8:

0 m, 100 m, 250 m, 500 m, 1000 m, 2000 m, 3000 m, 4000 m

These were assumed to capture the bulk of the variability expected to be important at the Level 1/2 classification scale (broad zonal expanses of water masses). For each variable (temperature, salinity, oxygen), data for all grid cells at these levels were extracted, excluding those masked as land. These data comprised the fundamental input for analysis.

Given the volume of data, the classification algorithm chosen was the large-array algorithm "clara" (Clustering Large Applications), which uses subsamples to build up its classes. As reported in past workshops, the data were histogram-scaled in order to neutralise the adverse impact of frontal regions.
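The depth subsampling and scaling steps above can be sketched as follows. This is a minimal illustration only, not the code used for the report. It assumes that "histogram-scaled" means histogram equalisation, i.e. replacing each value by its empirical cumulative frequency. The function names (cells_per_field, select_levels, histogram_scale) are hypothetical.

```python
from bisect import bisect_right

# Full depth axis (metres) as listed in the report: 56 levels.
ALL_DEPTHS = [
    0, 10, 20, 30, 40, 50, 60, 70, 75, 80, 90, 100, 110, 125, 150, 175,
    200, 225, 250, 275, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750,
    800, 850, 900, 950, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 1750,
    2000, 2250, 2500, 2750, 3000, 3250, 3500, 3750, 4000, 4250, 4500,
    4750, 5000, 5500,
]

# The 8 levels retained for the Level 1/2 analysis.
SELECTED_DEPTHS = [0, 100, 250, 500, 1000, 2000, 3000, 4000]

N_LAT, N_LON = 601, 901  # grid points per depth level


def cells_per_field(n_levels):
    """Number of grid cells in one field, before land masking."""
    return n_levels * N_LAT * N_LON


def select_levels(all_depths, wanted):
    """Return the indices of the wanted depths within the full depth axis."""
    index_of = {depth: i for i, depth in enumerate(all_depths)}
    return [index_of[depth] for depth in wanted]


def histogram_scale(values):
    """Assumed scaling: map each value to its empirical cumulative
    frequency in (0, 1], so that values are spread evenly by rank."""
    ordered = sorted(values)
    n = len(ordered)
    return [bisect_right(ordered, v) / n for v in values]


full = cells_per_field(len(ALL_DEPTHS))          # 30,324,056 cells
reduced = cells_per_field(len(SELECTED_DEPTHS))  # 4,332,008 cells
level_indices = select_levels(ALL_DEPTHS, SELECTED_DEPTHS)
```

Under this assumed scaling, the values 10.0, 10.1, 10.2 and 25.0 map to 0.25, 0.5, 0.75 and 1.0, so a sharp jump such as one across a front no longer dominates the distances between records.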
In order to select an appropriate number of clusters, an initial run was conducted with the number of clusters ranging from 2 to 35 (using a sample size of 500 and 2 samples). Silhouette plots were then examined to assess the distinctiveness of the clusters. These showed local maxima of the mean silhouette value at 5 and 25 clusters. The 5-cluster solution gave very broad-scale, ocean-basin classes, so subsequent analyses concentrated on the 25-cluster solution. The clara classification was run with 10 samples of size 1000, and the result was then used to model the entire dataset with a nearest-neighbour class selection procedure. This involved randomly selecting 100 training data records for each class, which were then used in the fitting process.

Variability Analyses

Work on this aspect is still in progress. A number of preliminary analyses conducted to date on the annual-amplitude data (a measure of seasonal variability) show very patchy spatial patterns, which may reflect the inherent sampling variability. Coherent large-scale patterns are visible off eastern Australia, in the tropics and in the Southern Ocean, but silhouette analyses do not show any obvious groupings apart from groupings at 2-4 classes. An example analysis for 10 classes, which typifies the problem, is shown below.

Figure 1. Classification of annual amplitude for temperature and salinity using 10 classes.

Results

In conjunction with this report, a zip file is included showing sections through the classification at various depths and along longitude sections. From a management perspective, a considerable volume of information needs to be assimilated, but this simply reflects the nature of the problem at hand. Aggregations of the information may be possible depending on the intended use and, at this stage, it would be useful if the BWG could provide some guidance on future work.
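For readers unfamiliar with the two diagnostics used under Mean Analyses, the sketch below shows how a mean silhouette value is computed for a given labelling and how a nearest-neighbour rule assigns a class to a new record. It is an illustration under stated assumptions, not the procedure used for the report: it does not implement clara itself, it assumes Euclidean distance, and the function names (mean_silhouette, nearest_neighbour_class) and the toy data are hypothetical.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width of a labelling (needs at least 2 clusters).

    points: list of coordinate tuples; labels: one cluster id per point.
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it adds nothing to the sum)
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def nearest_neighbour_class(record, training):
    """Return the class of the closest training record (1-nearest-neighbour).

    training: list of (point, class_label) pairs.
    """
    closest = min(training, key=lambda item: math.dist(record, item[0]))
    return closest[1]


# Toy data: two well-separated pairs of points.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = mean_silhouette(points, [0, 0, 1, 1])  # compact, separated clusters
bad = mean_silhouette(points, [0, 1, 0, 1])   # clusters that straddle the gap

training = [((0, 0), "A"), ((10, 0), "B")]
assigned = nearest_neighbour_class((9, 0.4), training)
```

A higher mean silhouette indicates more distinct clusters, which is why local maxima across candidate cluster numbers (5 and 25 in the report) are natural choices. The nearest-neighbour step then lets a classification built on subsamples be extended to every grid cell.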