bre12139-sup-0006

BRE-12139
Supplementary text 1
Quartz Trace Elements and Principal Component Analysis
The large number of elements measured in quartz creates a rich multivariate dataset from
which to infer detrital provenance. A common approach to visually evaluating populations in an n-dimensional dataset is principal component analysis (PCA). PCA transforms data by reducing
the dimensionality of a dataset and allowing it to be visualized in 2-D or allowing clustering
algorithms to be run directly on the transformed data. PCA treats the data as a set of
points (or vectors) in n-dimensional space and extracts a set of orthogonal vectors that can
be linearly combined to reproduce the dataset. The vectors are extracted in order of the amount
of variability that they explain, such that the first vector “explains” more of the variability than
later vectors. These vectors can be viewed as coefficients or “loadings,” which, when multiplied
by the (mean-centered) concentration of each element in a given grain, sum to a principal
component score (one score per grain, per basis vector). Scores of large magnitude, whether
positive or negative, indicate that a grain’s composition departs strongly from the dataset mean
along a given PC vector, whereas scores close to zero indicate that the grain lies near the mean
in that direction. Thus, when PC scores are plotted, grains with similar
compositions should plot together, even if the source of their similarity is unknown. The power
of PCA lies in the fact that a single PC vector can explain multiple co-varying elements, thereby
reducing the dimensionality of the dataset and preventing the need for redundant bivariate plots
that show the same population clusters visualized in different element pairings.
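The loading and score calculation described above can be sketched in a few lines. This is a minimal illustration in standard-library Python, not the Matlab routine used in this study: it assumes the data are mean-centered but not otherwise scaled, it finds the PC vectors by power iteration with deflation (chosen only to avoid external libraries), and the function name `pca` and the concentrations in `grains` are hypothetical values invented for the example.

```python
import math
import random

def pca(rows, n_components=2, iters=500, seed=0):
    """Power-iteration PCA on mean-centered rows (illustrative sketch).

    Returns (loadings, variances, scores): one loading vector and one
    variance per component, and one score per row per component.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix of the elements (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    loadings, variances = [], []
    for _comp in range(n_components):
        # Random unit start vector, then repeated multiplication by cov.
        v = [rng.uniform(-1.0, 1.0) for _j in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _it in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Variance explained by this vector (Rayleigh quotient, v is unit length).
        lam = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
        loadings.append(v)
        variances.append(lam)
        # Deflate: remove this component so the next pass finds the next vector.
        cov = [[cov[a][b] - lam * v[a] * v[b] for b in range(d)] for a in range(d)]
    # Score = sum over elements of loading * mean-centered concentration.
    scores = [[sum(c[j] * v[j] for j in range(d)) for v in loadings]
              for c in centered]
    return loadings, variances, scores

# Hypothetical concentrations (three elements, six grains), invented for
# illustration only. The first two elements co-vary strongly.
grains = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 7.8, 0.5],
    [5.0, 10.1, 0.3],
    [6.0, 12.0, 0.7],
]
loadings, variances, scores = pca(grains, n_components=2)
```

In this toy dataset the two co-varying elements are captured by the first PC vector, so a single score per grain stands in for what would otherwise be a bivariate plot.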
Another approach to identify populations within a dataset is cluster analysis, which
groups data points by the Euclidean distance (or other metrics) between points in n-dimensional
space. We employed the K-means algorithm, which begins with a set of “seed” values and
groups the points closest to each seed value into a cluster. It then computes the mean of all points
in each resultant cluster and reclassifies all the original points based on their distance from the
new set of means. This process is repeated until the cluster means no longer shift. We
evaluated the appropriate number of clusters using an F-test, plotting the ratio of the variance
between cluster means to the variance of the entire population as a function of the number of
clusters, and settled on n = 5 clusters (Fig. S3). Both PCA and cluster analysis were performed
using the Statistics Toolbox in Matlab.
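The K-means iteration and the variance ratio described above can be sketched as follows. This is a minimal standard-library Python illustration under stated assumptions, not the Matlab implementation used here: the function names `kmeans` and `variance_ratio` are hypothetical, the ratio is computed as the between-cluster sum of squares divided by the total sum of squares (one plausible reading of the description, since the exact formula is not given), and the points are invented for the example.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, seeds, max_iter=100):
    """Assign points to the nearest mean, recompute means, repeat until stable."""
    means = [list(s) for s in seeds]
    labels = None
    for _it in range(max_iter):
        new_labels = [
            min(range(len(means)), key=lambda k: sq_dist(p, means[k]))
            for p in points
        ]
        # Unchanged assignments imply the cluster means will not shift.
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(means)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the previous mean if a cluster empties
                means[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, means

def variance_ratio(points, labels, means):
    """Between-cluster sum of squares over total sum of squares (assumed form)."""
    n = len(points)
    grand = [sum(col) / n for col in zip(*points)]
    total = sum(sq_dist(p, grand) for p in points)
    between = sum(labels.count(k) * sq_dist(m, grand)
                  for k, m in enumerate(means))
    return between / total

# Hypothetical 2-D points (for example, PC1 and PC2 scores), invented for
# illustration only.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, means = kmeans(points, seeds=[points[0], points[3]])
ratio = variance_ratio(points, labels, means)
```

Repeating the last two calls for an increasing number of seeds and plotting `ratio` against the number of clusters gives a curve of the kind used to choose the cluster count; the result can depend on the seeds chosen.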