Background information and quality testing data for Kohonen's self-organizing maps

Background

Kohonen's self-organizing map (SOM) arranges multidimensional data across gradients of similarity through the iterative adjustment of variables in 'neurons', resulting in a distribution that asymptotically approaches that of the input data (Kohonen, 2001). The 'map' is typically organized on a two-dimensional lattice, where each node represents a single neuron, and each neuron is represented by a vector of variables with dimensionality equal to that of the input samples (the 'codebook vector'). The connectivity of each neuron is determined by the shape of the lattice, and the hexagonal configuration is favoured. The starting values of the variables in the codebook vectors can be either randomized or organized by linear initialization. In the latter approach, the x and y axes of the map are made to span the two eigenvectors of the sample autocorrelation matrix with the largest eigenvalues, and the initial values of the variables in the codebook vectors are set to cover this span linearly. Linear initialization standardizes map size by choosing the relative number of nodes along the x and y axes of the lattice in proportion to the corresponding eigenvalues (Kohonen, 2001). However, linear initialization is prone to sub-optimal fitting, particularly when the first and second eigenvectors do not capture all of the relevant variation. For this reason, it is advisable to check the fit by examining scree plots of the eigenvalues and by comparing quality parameters against maps made using random initialization (Figure S1, Table S2).

Linear initialization has the benefit of choosing a map size that is proportional to the variation across the first two latent variables. However, this means that extreme values, outliers, or disparate clusters of data may produce a map with a large number of empty neurons (i.e. neurons to which no samples are 'closest' in the finalized map). It is therefore advisable to assess fitting using maps of different sizes. In this experiment, samples with properties known to be linearly separable along the first two principal components (Cuss and Guéguen, 2013) were exposed to treatments across a gradient, so that the appropriate degree of departure from cluster centroids and the corresponding distances between clusters are difficult to ascertain. Edge effects (Kohonen, 2001) can also cause tight clustering at the edges of maps with sheet-like geometry, due to the lack of connected neighbours along the outer side. Consequently, experimentation with toroidal geometry is recommended. Again, however, an optimal fit may place samples with disparate properties at opposite edges of the map, making this effect difficult to assess when samples differ substantially.

Self-organization commences by introducing the data set to the map over multiple iterations, either by choosing individual samples randomly with replacement (sequential mode) or in batches of samples (batch mode). Each sample vector x_i is assigned to a neuron according to the following relationship:

\|\boldsymbol{x}_i - \boldsymbol{m}_c\| = \min_j d(\boldsymbol{x}_i, \boldsymbol{m}_j) \qquad [1],

where d(x_i, m_j) is some measure of distance between sample vector x_i and neuron m_j. In other words, each sample is assigned to the 'closest' neuron m_c, similar to the way in which a specific biological neuron might be activated by a specific stimulus. This closest neuron is called the best-matching unit (BMU). The Euclidean distance is the most common distance metric, and was used in this study.
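As a concrete illustration of Eq. [1], the following minimal Python/NumPy sketch finds the BMU for a single sample using Euclidean distance. The array names and lattice size are illustrative only and do not correspond to the software used in this study.

```python
# Minimal sketch of BMU assignment (Eq. 1) using Euclidean distance.
# Names and sizes are illustrative, not those of the software used here.
import numpy as np

def find_bmu(x, codebook):
    """Return the index of the best-matching unit for sample vector x.

    x        : (n_vars,) sample vector
    codebook : (n_neurons, n_vars) matrix of codebook vectors
    """
    distances = np.linalg.norm(codebook - x, axis=1)  # d(x_i, m_j) for all j
    return int(np.argmin(distances))                  # index of m_c

# Example: 20 neurons (e.g. a 5 x 4 lattice), 3 variables per codebook vector
rng = np.random.default_rng(0)
codebook = rng.random((20, 3))
sample = rng.random(3)
bmu_index = find_bmu(sample, codebook)
```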
Introducing samples in batch mode may also result in sub-optimal fitting, particularly when the dimensionality of the solution space is large (i.e. a large number of variables, such as excitation-emission wavelength pairs). Hence, it is advisable to check the fitting using both the batch and sequential modes of sample introduction (Table S2).

In the second stage of self-organization, the variables in the codebook vectors are adjusted to bring them 'closer' to the samples assigned to them, as follows:

\boldsymbol{m}_c(t+1) = \boldsymbol{m}_c(t) + \alpha(t) \cdot N_c(t) \cdot [\boldsymbol{x}_i(t) - \boldsymbol{m}_c(t)] \qquad [2].

Here, the variable t tracks the iterations, 0 < α ≤ 1 is the 'learning-rate factor', which controls the pace at which the neurons move closer to the samples, and N_c is the neighbourhood function, which determines the size of the radius around the central neuron (m_c) over which adjustments take place. Thus, the neighbourhood function 'smoothes' the map by extending the adjustment to the neurons around m_c, rather than making a point adjustment at a single neuron. Note that both N_c and α are functions of t: both are reduced as the number of iterations increases, so that changes become increasingly localized and adjustments become finer as the algorithm proceeds. The algorithm terminates when the map has stabilized, at which point further adjustments to the distance between samples and their respective BMU codebook vectors fall below a small threshold value.

Kohonen maps are highly sensitive to differences between inputs, and so are capable of sorting samples into clusters even when the differences between them are only marginal. As such, comparison of the finalized sample map and its quality parameters with fits obtained using random data as input is recommended. For further details on the operation and mathematical foundations of Kohonen maps, the reader is directed to the comprehensive work of their originator (Kohonen, 2001).

Map quality testing

Map quality was tested using three criteria: 1) mean quantization error (mqe), the average Euclidean distance between the samples and their respective BMUs on the finished map (a measure of fit quality or resolution); 2) topographic error (tge), the proportion of samples for which the second-closest neuron is not adjacent to the BMU (a measure of topology preservation); and 3) visual inspection and comparison of the changes in the map caused by varying the fitting parameters. The procedure of Bieroza et al. (2012) was set as the default condition (rows 1, 8, 15, 16; Table S2), and the fitting parameters were varied to test fit quality relative to this default. Table S2 (SOM Fit Table) displays these quality parameters for multiple map construction processes and configurations. Standard deviations are reported for 10 separate mappings, each conducted using a distinct random number seed. Figure S1 shows the scree plots for the different data formats.
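For illustration, the sketch below implements the codebook update of Eq. [2] and the two quality metrics in Python/NumPy. Several details are assumptions not specified in the text: a Gaussian form for the neighbourhood function, linear decay of α(t) and of the neighbourhood radius, and a rectangular lattice for the adjacency test in tge (the maps in this study used a hexagonal lattice, for which adjacency would instead follow the hexagonal neighbour offsets).

```python
# Hedged sketch of the sequential-mode update (Eq. 2) and of the mqe and
# tge quality metrics. Assumed details (not specified in the text): a
# Gaussian neighbourhood function, linearly decaying alpha(t) and radius,
# and a rectangular lattice for the adjacency test in tge.
import numpy as np

def update_step(x, codebook, grid, t, n_iter, alpha0=0.5, sigma0=2.0):
    """One sequential-mode update for a single sample x (Eq. 2).

    codebook : (n_neurons, n_vars) codebook vectors (modified in place)
    grid     : (n_neurons, 2) lattice coordinates of each neuron
    """
    frac = t / n_iter
    alpha = alpha0 * (1.0 - frac)              # learning-rate factor alpha(t)
    sigma = max(sigma0 * (1.0 - frac), 1e-3)   # shrinking neighbourhood radius
    c = np.argmin(np.linalg.norm(codebook - x, axis=1))  # BMU index (Eq. 1)
    # Gaussian neighbourhood N_c(t), centred on the BMU's lattice position,
    # extends the adjustment to the neurons around m_c
    lattice_d2 = np.sum((grid - grid[c]) ** 2, axis=1)
    nbh = np.exp(-lattice_d2 / (2.0 * sigma ** 2))
    codebook += alpha * nbh[:, None] * (x - codebook)
    return codebook

def mean_quantization_error(samples, codebook):
    """mqe: average Euclidean distance between samples and their BMUs."""
    d = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

def topographic_error(samples, codebook, grid):
    """tge: proportion of samples whose second-closest neuron is not
    adjacent to the BMU (adjacency = 8-neighbourhood on a square grid)."""
    d = np.linalg.norm(samples[:, None, :] - codebook[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    bmu, second = order[:, 0], order[:, 1]
    grid_dist = np.linalg.norm(grid[bmu] - grid[second], axis=1)
    return float(np.mean(grid_dist > np.sqrt(2) + 1e-9))
```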
Random initialization produced maps with clustering and structure similar to those obtained with linear initialization, with only small differences in mqe (mqe was higher for two of the three maps; compare rows 1 and 3, 8 and 10, and 16 and 18 in Table S2). Notably, tge was also higher under random initialization for the maps of components as a proportion of total fluorescence and of normalized proportions, suggesting that linear initialization resulted in maps with topographies that better represent the underlying sample distribution. Sequential sampling mode resulted in considerably higher mqe in all cases, and higher tge in two of three cases (rows 2, 9, 17). Both toroidal geometry and decreasing map size resulted in worse mqe and tge, with considerable instability in map structure over several fittings. As a result of the quality testing, linear initialization with batch sampling mode was chosen as the optimal method for generating all SOMs in this study. The map dimensions were therefore set proportional to the ratio of the first two eigenvalues, with the total number of nodes set according to the 'rule of thumb' of five times the square root of the number of samples (Kohonen, 2001); this sizing heuristic is illustrated in the sketch at the end of this section.

Randomized input data were fitted, using linear initialization and batch sampling mode, to a map with the same dimensions as that used for the proportional component fluorescence. The finalized map of the random data had an mqe more than five times that obtained for the component proportions, and also had a higher tge. It was therefore concluded that the clusters and gradients evident in the SOMs reflected valid underlying relationships rather than random associations.

Scree plots of the eigenvalues demonstrated that the first two latent variables adequately captured the primary sources of variation in all cases (Figure S1), reinforcing the validity of linear initialization. It is worth noting that the majority of the variation in the raw PARAFAC component loadings was spread across a greater number of eigenvectors than that in the loadings expressed as a proportion of total fluorescence (Figures S1 B, C), likely because of the correlation imposed by proportionality and the elimination of differences in fluorescence caused by differing DOC concentrations across the experiments. Consequently, proportional component loadings were chosen as the format for subsequent analyses.
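As a worked illustration of the sizing rule of thumb noted above, the following sketch computes lattice dimensions from the number of samples and the two largest eigenvalues. The rounding scheme and function name are hypothetical and are not taken from the software used in this study.

```python
# Hypothetical sketch of the map-sizing rule of thumb: about
# 5 * sqrt(n_samples) nodes in total, with the x:y dimensions
# proportional to the ratio of the two largest eigenvalues.
import numpy as np

def map_dimensions(n_samples, eigval1, eigval2):
    """Return (nx, ny) lattice dimensions under the rule of thumb."""
    n_nodes = 5 * np.sqrt(n_samples)   # total number of neurons
    ratio = eigval1 / eigval2          # ratio of the two largest eigenvalues
    ny = max(1, int(round(np.sqrt(n_nodes / ratio))))
    nx = int(round(n_nodes / ny))
    return nx, ny

# Example: 100 samples with an eigenvalue ratio of 2 give roughly a 10 x 5 map
print(map_dimensions(100, eigval1=2.0, eigval2=1.0))
```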