Background information and quality testing data for Kohonen's self-organizing maps
Background
Kohonen’s self-organizing map (SOM) arranges multidimensional data across
gradients of similarity through the iterative adjustment of variables in 'neurons', resulting
in a distribution that asymptotically approaches that of the input data (Kohonen, 2001).
The 'map' is typically organized on a two-dimensional lattice, where each node
represents a single neuron, and each neuron is represented by a vector of variables with
dimensionality equal to that of the input samples ('codebook vector'). The connectivity of
each neuron is determined by the shape of the lattice, and the hexagonal configuration is
favoured.
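For illustration, the structure described above can be sketched in Python as follows; the grid shape, feature count, and coordinate layout are hypothetical choices for the sketch, not those used in this study:

```python
import numpy as np

# Illustrative grid shape and feature count (hypothetical values).
n_rows, n_cols, n_features = 10, 7, 5

# Codebook: one vector of length n_features per lattice node.
codebook = np.random.default_rng(0).random((n_rows, n_cols, n_features))

# Hexagonal lattice positions: odd rows are offset by half a unit and rows
# are spaced sqrt(3)/2 apart, so each interior node has six equidistant
# neighbours.
coords = np.array([[c + 0.5 * (r % 2), r * np.sqrt(3) / 2]
                   for r in range(n_rows)
                   for c in range(n_cols)]).reshape(n_rows, n_cols, 2)
```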
The starting values of the variables in the codebook vectors for each neuron can
be either randomized, or organized by linear initialization. In the latter approach, the x
and y axes of the map are made to span the eigenvectors of the sample autocorrelation
matrix that have the largest eigenvalues, and the initial values of the variables in the
codebook vectors are set to cover this span linearly. Linear initialization standardizes
map size by choosing the relative number of nodes along the x and y axes of the lattice
proportional to the corresponding eigenvalues (Kohonen, 2001). However, linear
initialization is prone to sub-optimal fitting, particularly when the first and second
eigenvectors do not capture all of the relevant variation. For this reason, it is advisable to
check the fit by examining scree plots of the eigenvalues and by comparing quality
parameters with maps made using random initialization (Figure S1, Table S2).
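A minimal sketch of linear initialization along these lines, assuming a mean-centred covariance matrix for the 'sample autocorrelation matrix' and a span of one standard deviation along each principal direction (both illustrative assumptions):

```python
import numpy as np

def linear_init(X, n_rows, n_cols):
    """Sketch of linear initialization: span the lattice along the two
    principal eigenvectors of the sample covariance matrix."""
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X - mean, rowvar=False))
    v1, v2 = eigvecs[:, -1], eigvecs[:, -2]        # two largest eigenvectors
    s1, s2 = np.sqrt(eigvals[-1]), np.sqrt(eigvals[-2])
    # Codebook vectors cover +/- one standard deviation linearly along
    # each principal direction (the span width is an illustrative choice).
    a = np.linspace(-1.0, 1.0, n_cols)[None, :, None]
    b = np.linspace(-1.0, 1.0, n_rows)[:, None, None]
    return mean + s1 * a * v1 + s2 * b * v2        # (n_rows, n_cols, n_features)
```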
Linear initialization has the benefit of choosing a map size that is proportional to
variation across the first two latent variables. However, this means that a set of extreme
values, outliers, or disparate clusters of data may result in a map with a large number of
empty neurons (i.e. neurons that are not 'closest' to any sample in the finalized map).
Therefore, it is advisable to assess fitting using maps of different
sizes. In this experiment, samples with properties that are known to be linearly separable
along the first two principal components (Cuss and Guéguen, 2013) were exposed to
treatments across a gradient, so that the appropriate degree of departure from cluster
centroids and corresponding distances between clusters are difficult to ascertain. Edge
effects (Kohonen, 2001) can also cause tight clustering at the edges of maps with a
sheet-like geometry due to the lack of connected neighbours along the outer side.
Consequently, experimentation with toroidal geometry is recommended. Again, however,
an optimal fit places samples with disparate properties at opposite edges of the map,
making this edge effect difficult to assess when samples differ substantially.
Self-organization commences by introducing the data set to the map over multiple
iterations, either by choosing an individual sample randomly with replacement (sequential
mode) or in batches of samples (batch mode), and each sample vector xi is assigned to
a neuron mj according to the following relationship:
$\lVert \boldsymbol{x}_i - \boldsymbol{m}_c \rVert = \min_j d(\boldsymbol{x}_i, \boldsymbol{m}_j)$    [1],
where d(xi, mj) is some measure of distance between sample vector xi and neuron mj. In
other words, each sample is assigned to the 'closest' neuron mc, similar to the way in
which a specific biological neuron might be activated by a specific stimulus. This closest
neuron is called the best-matching unit (BMU). The Euclidean distance is the most
common distance metric and was used in this study. Introducing samples in batch
mode may also result in sub-optimal fitting, particularly when the dimensionality of the
solution space is large (i.e. large number of variables, such as excitation-emission
wavelength pairs). Hence, it is advisable to check the fitting using both the batch and
sequential modes of sample introduction (Table S2).
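For illustration, the BMU assignment in Eq. [1] amounts to a nearest-neighbour search over the codebook; a minimal sketch, assuming the codebook is stored as a (rows, cols, features) array as in the earlier snippet:

```python
import numpy as np

def find_bmu(x, codebook):
    """Return the lattice index (row, col) of the best-matching unit: the
    neuron whose codebook vector is nearest to sample x per Eq. [1]."""
    d = np.linalg.norm(codebook - x, axis=-1)   # Euclidean distance per node
    return np.unravel_index(np.argmin(d), d.shape)
```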
In the second stage of self-organization, variables in codebook vectors are
adjusted to make them 'closer' to the samples assigned to them as follows:
$\boldsymbol{m}_c(t+1) = \boldsymbol{m}_c(t) + \alpha(t) \cdot N_c(t) \cdot [\boldsymbol{x}_i(t) - \boldsymbol{m}_c(t)]$    [2].
Here, the variable t tracks the iterations, 0 < α ≤ 1 is the 'learning-rate factor' which
controls the pace at which the neurons move closer to the samples, and Nc is the
neighbourhood function, which determines the size of the radius around the central
neuron (mc) over which adjustments will take place. Thus, the neighbourhood function
'smoothes' the map by extending the distance adjustment to the neurons around mc,
rather than making a point adjustment solely at a single neuron. It should be noted that
both Nc and α are functions of t, so that they change as the number of iterations
increases. Specifically, both functions are reduced over time so that changes are
increasingly localized and adjustments become finer as the algorithm proceeds. The
algorithm terminates when the map has stabilized, at which point further adjustments to
the distance between samples and their respective BMU codebook vectors fall below a
small threshold value.
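A sketch of one sequential-mode step of Eq. [2], assuming a Gaussian neighbourhood function and linear decay of both α and the neighbourhood radius (common choices, though not necessarily those used in this study); coords holds the (rows, cols, 2) lattice positions of the neurons:

```python
import numpy as np

def sequential_update(codebook, coords, x, t, n_iter, alpha0=0.5, sigma0=3.0):
    """One sequential-mode step of Eq. [2]: pull the BMU and its lattice
    neighbours toward sample x, with both the learning rate and the
    neighbourhood radius shrinking as iterations proceed."""
    frac = t / float(n_iter)
    alpha = alpha0 * (1.0 - frac)             # decaying learning-rate factor
    sigma = sigma0 * (1.0 - frac) + 1e-3      # shrinking neighbourhood radius
    d = np.linalg.norm(codebook - x, axis=-1)
    bmu = np.unravel_index(np.argmin(d), d.shape)
    # Gaussian neighbourhood N_c(t), evaluated over lattice distance from
    # the BMU (not distance in feature space).
    lat2 = ((coords - coords[bmu]) ** 2).sum(axis=-1)
    h = np.exp(-lat2 / (2.0 * sigma ** 2))
    codebook += alpha * h[..., None] * (x - codebook)
    return codebook
```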
Kohonen maps are highly sensitive to differences between inputs, and so are
capable of sorting and creating clusters with samples in which the differences are only
marginal. As such, comparison of the finalized sample map and its quality parameters
with fits using random data as an input is recommended. For further details on the
operation and mathematical foundations of Kohonen maps, the reader is directed to the
comprehensive work of their originator (Kohonen, 2001).
Map quality testing
Map quality was tested using three criteria: 1) mean quantization error (mqe),
which is the average Euclidean distance between the samples and their respective
BMUs on the finished map (a measure of fit quality or resolution), 2) topographic error
(tge), which is the proportion of samples for which the second-closest neuron is not
adjacent to the BMU (a measure of topology preservation), and 3) visual inspection of
the map and comparison of the changes caused by varying the fitting parameters. The
procedure of Bieroza et al. (2012) was set as the default condition (rows 1, 8, 15, 16;
Table S2), and the fitting parameters were varied to test fit quality against this default.
Table S2 (SOM Fit Table) displays these quality parameters for multiple map
construction processes and configurations. Standard deviations are reported for 10
separate mappings, where each mapping was conducted using a distinct random
number seed. Figure S1 shows the scree plots for the different data formats.
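For illustration, the two numerical criteria can be computed as follows; the adjacency test for tge assumes unit spacing between neighbouring nodes on the hexagonal lattice, as in the earlier sketches:

```python
import numpy as np

def map_quality(X, codebook, coords):
    """Sketch of the two numerical criteria: mean quantization error (mqe)
    and topographic error (tge)."""
    flat = codebook.reshape(-1, codebook.shape[-1])
    pos = coords.reshape(-1, 2)
    # Distance from every sample to every neuron.
    d = np.linalg.norm(X[:, None, :] - flat[None, :, :], axis=-1)
    order = np.argsort(d, axis=1)
    bmu, second = order[:, 0], order[:, 1]
    # mqe: mean Euclidean distance between samples and their BMUs.
    mqe = d[np.arange(len(X)), bmu].mean()
    # tge: fraction of samples whose second-closest neuron is not adjacent
    # to the BMU; assumes unit spacing between neighbouring lattice nodes.
    gap = np.linalg.norm(pos[bmu] - pos[second], axis=-1)
    tge = (gap > 1.01).mean()
    return mqe, tge
```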
Random initialization produced maps with similar clustering and structure
compared to linear initialization, with only small differences in mqe (mqe was higher for
two of three maps; compare rows 1 and 3, 8 and 10, and 16 and 18 in Table S2).
Notably, tge was also higher for maps of components as a proportion of total
fluorescence and normalized proportions, suggesting that linear initialization resulted in
maps with topographies that better represent the underlying sample distribution.
Sequential sampling mode also resulted in considerably higher mqe in all cases, and
higher tge in 2/3 cases (rows 2, 9, 17). Both toroidal geometry and decreasing map size
resulted in worsening mqe and tge, with considerable instability in map structure over
several fittings. As a result of the quality testing, linear initialization with batch sampling
mode was chosen as the optimal method for the generation of all SOMs in this study.
Hence, the map dimensions were set proportional to the ratio of the first two
eigenvalues, with the total number of nodes set according to the 'rule of thumb': five
times the square root of the number of samples (Kohonen, 2001).
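A sketch of this sizing heuristic, assuming the side-length ratio is taken directly from the two largest covariance eigenvalues as described above:

```python
import numpy as np

def map_shape(X):
    """Sketch of the sizing heuristic: about 5*sqrt(n) nodes in total, with
    side lengths in the ratio of the two largest covariance eigenvalues."""
    n_nodes = 5.0 * np.sqrt(len(X))
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    ratio = eigvals[-1] / eigvals[-2]          # first : second eigenvalue
    n_rows = max(int(round(np.sqrt(n_nodes / ratio))), 1)
    n_cols = max(int(round(n_nodes / n_rows)), 1)
    return n_rows, n_cols
```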
Randomized input data were fitted using a map with the same dimensions as that for the
proportional component fluorescence, with linear initialization and batch sampling mode.
The finalized map of the random data had an mqe that was more than five times that for
the component proportions, and also had higher tge. It was therefore concluded that the
clusters and gradients evidenced in the SOMs reflect valid underlying relationships
rather than random associations.
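For illustration, one way to construct such a randomized input is to permute each variable independently, which preserves the marginal distributions while destroying between-variable structure; the exact randomization scheme used in the study may differ:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.random((60, 5))          # placeholder for the real sample matrix

# Shuffle each column independently: marginal distributions are preserved,
# but correlations between variables are destroyed.
X_random = np.column_stack([rng.permutation(col) for col in X.T])

# A map with identical dimensions and settings would then be fitted to
# X_random and its mqe/tge compared against the map of the real data.
```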
Scree plots of eigenvalues demonstrated that the first two latent variables
adequately captured the primary sources of variation in all cases (Figure S1), reinforcing
the validity of linear initialization. It is worth noting that the majority of variation in raw
PARAFAC component loadings was spread across a greater number of eigenvectors
compared to the loadings as a proportion of total fluorescence (Figures S1 B, C), likely
because of the correlation imposed by proportionality and the elimination of differences in
fluorescence caused by different DOC concentrations across the experiments.
Consequently, proportional component loadings were chosen as the format for
subsequent analyses.