Atmosphere

Supporting Information for

Multivariate analysis of dim elves from ISUAL observations

Marc Offroy1, Thomas Farges1, Pierre Gaillard1, Cheng Ling Kuo2, Alfred Bing-Chih Chen3, Rue-Ron Hsu4, Yukihiro Takahashi5

1 CEA, DAM, DIF, 91297 Arpajon cedex, France, 2 Institute of Space Science, National Central University, Jhongli, Taiwan, 3 Institute of Space and Plasma Sciences, National Cheng Kung University, Tainan, Taiwan, 4 Department of Physics, National Cheng Kung University, Tainan, Taiwan, 5 Department of Cosmosciences, Hokkaido University, Japan

Contents of this file

Text S1 to S5

Introduction

This supporting information presents a brief description of the mathematical approaches discussed in the manuscript. The first approach tested is Principal Component Analysis (PCA). It is a standard starting point in data mining: a descriptive method that is not based on a probabilistic model of the data but simply aims to provide a geometric representation of it. Among the multivariate analysis techniques, the second approach is a soft-modeling method called PARallel FACtor analysis (PARAFAC).

1. Principal Component Analysis (PCA)

PCA decomposes a data matrix D (m × n) into a product of two matrices, a matrix of scores denoted T (m × k) and a matrix of loadings denoted P (n × k), plus a residual matrix E (m × n):

D = T P^T + E        (1)

The dimension of the new space defined by T is determined by the rank of the matrix D. The original n variables from the m observations are too complex to be interpreted directly from the raw data, which is why it is necessary to "reduce" the dimension of the space using k Principal Components (PCs) that explain the maximum amount of information. The scores represent the coordinates of the observations (or samples) on the axes of the selected PCs.
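For illustration, the decomposition of equation (1) can be sketched with NumPy by computing the scores and loadings from the singular value decomposition of a mean-centered data matrix. This is only a minimal sketch: the matrix sizes and rank used here are arbitrary, not those of the ISUAL data.

```python
import numpy as np

# Illustrative data matrix D (m observations x n variables) of known rank k
rng = np.random.default_rng(0)
m, n, k = 50, 10, 3
D = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))

Dc = D - D.mean(axis=0)                    # mean-center each variable
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)

T = U[:, :k] * s[:k]                       # scores (m x k)
P = Vt[:k].T                               # loadings (n x k), orthonormal columns
E = Dc - T @ P.T                           # residual matrix (here ~0)

explained = s**2 / np.sum(s**2)            # variance explained per PC, decreasing
```

The explained-variance proportions come out sorted in decreasing order, and the loading vectors are mutually orthogonal, reflecting the orthogonality requirement on the PCs.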
The loadings denote the contributions of the original variables on the same selected PCs. In other words, since the scores are representations of the observations in the space formed by the new axes defined by the PCs, the loadings are representations of the variables on these axes. Geometrically, this change of variables by linear combinations results in a set of new variables called PCs. The direction of each newly created axis describes a part of the global information from the original variables. The variance explained by the PCs is sorted in decreasing order: the proportion of variance explained by the first PC, which represents the main part of the information, is higher than that of the second PC, which represents a smaller amount of information, and so on. The same information cannot be shared between two PCs because PCA requires the PCs to be orthogonal to each other.

2. The PARAllel FACtor analysis (PARAFAC)

PARAFAC decomposes a three-way array D into a product of three matrices, one for each mode. Instead of having one score and one loading matrix as in PCA, each component consists of a score matrix denoted A and two loading matrices denoted B and C. In PARAFAC, it is common not to distinguish between the score and the loading matrices. In other words, the PARAFAC model of a three-dimensional array is given by three loading matrices A, B and C with elements a_if, b_jf and c_kf, as follows:

d_ijk = Σ_{f=1}^{F} a_if b_jf c_kf + e_ijk        (2)

The elements of D, which has size I × J × K, are denoted d_ijk. The trilinear model is found by minimizing the sum of squares of the residuals, denoted e_ijk. F is the number of factors extracted in each mode, chosen to describe the maximum amount of information contained in the array D.
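The trilinear model of equation (2) can be written compactly with an Einstein summation. The following sketch, with arbitrary mode sizes and factor number, builds a noise-free three-way array from hypothetical loading matrices A, B and C:

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, K, F = 6, 5, 4, 2                  # arbitrary mode sizes and factor number

A = rng.normal(size=(I, F))              # scores (first mode)
B = rng.normal(size=(J, F))              # loadings (second mode)
C = rng.normal(size=(K, F))              # loadings (third mode)

# d_ijk = sum over f of a_if * b_jf * c_kf (noise-free part of equation 2)
D = np.einsum('if,jf,kf->ijk', A, B, C)

# The same entry written as the explicit sum of equation (2)
d_000 = sum(A[0, f] * B[0, f] * C[0, f] for f in range(F))
```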
The model uses a cost function as follows:

L(A, B, C) = Σ_{i=1}^{I} Σ_{j=1}^{J} Σ_{k=1}^{K} ( d_ijk − Σ_{f=1}^{F} a_if b_jf c_kf )²        (3)

The advantage of this methodology is that it provides simple and robust models that can be easily interpreted [Harshman, 1970]. Furthermore, the solution of the PARAFAC model is unique [Kruskal, 1976]. Kruskal (1977) proposed even less restrictive conditions under which unique solutions can be expected. This latter author uses the k-rank of the loading matrices, showing that if k_A + k_B + k_C ≥ 2F + 2, then the PARAFAC solution is unique, with k_A being the k-rank of matrix A, k_B the k-rank of B and k_C the k-rank of C. F is the expected number of factors or components. There is a well-known problem for other bilinear decomposition methods, which arises from rotational and intensity ambiguities [Lawton and Sylvestre, 1971; Tauler et al., 1995]. For an estimated PARAFAC model, the mathematical meaning of uniqueness is that the model cannot be rotated without a large error, i.e. a loss of fit [Bro, 1997]. For other two-way methods based on loadings or scores, in contrast, these ambiguities do not lead to errors in the model. Ambiguities can be defined as the set of solutions that fulfill the applied constraints and fit the data equally well. Consequently, the difficulty for a decomposition method is to determine F. The problem of linear dependence poses a challenge to multivariate algorithms dealing with rank-deficient matrices. Therefore, it is necessary to determine the rank of the data matrix. Ideally, the rank of a data matrix is in agreement with the number of contributions in the studied system. In other words, the rank represents the number of eigenvectors needed to explain all the measurements in the data matrix. Each of the recorded signals is a linear combination of these eigenvectors.
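As a minimal illustration of this rank notion on noise-free synthetic data (the threshold used to count significant singular values is an arbitrary choice, not a general rule):

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic data matrix built from exactly 3 contributions
D = rng.normal(size=(40, 3)) @ rng.normal(size=(3, 20))

s = np.linalg.svd(D, compute_uv=False)       # singular values, sorted decreasing
rank = int(np.sum(s > s[0] * 1e-10))         # count the significant ones
```

On noise-free data this count recovers the number of contributions exactly; on real, noisy data every singular value is non-zero and the cut-off becomes a judgment call.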
It is therefore difficult to evaluate the rank of noisy data. While there are various approaches to estimating the rank of a matrix, there are no explicit rules [Bro, 1997]. In our case, the Core Consistency Diagnostic (CORCONDIA) approach developed by Bro and Kiers [2003] is used to determine the appropriate number of components for multiway models. The main idea is to compare the 'core' of the model estimated by PARAFAC with the 'core' of an ideal model. To understand this approach, a presentation of the Tucker3 model is necessary. Tucker3 is another modelling method for multiway arrays [Tucker, 1964; Rutledge and Bouveresse, 2007]. In addition to the loading matrices, a 'core' array is computed (equation 4). With Tucker3 models, the number of components or factors can differ on each mode; therefore, the loading matrices do not necessarily all have the same number of columns. If D is a three-dimensional array of dimensions I × J × K and the loading matrices on the three modes have dimensions I × P, J × Q and K × R, respectively, then the dimensions of the 'core' array will be P × Q × R. As shown by the element-wise definition of the Tucker3 model, interactions may exist between loadings of different orders in the different modes because of the 'core' array:

d_ijk = Σ_{p=1}^{P} Σ_{q=1}^{Q} Σ_{r=1}^{R} a_ip b_jq c_kr t_pqr + e_ijk        (4)

The element denoted t_pqr defines the 'core' array T(P, Q, R). By comparing equations (2) and (4), we note that the PARAFAC model is a restricted version of the Tucker3 model, where P = Q = R = F and T is the theoretical superidentity array, i.e. the superdiagonal entries are equal to 1 and all other entries are zero. The main idea of the core consistency approach is to compare this theoretical superidentity array, denoted here as T, with the 'core' array G derived from the matrices A, B, C and the data D [Bro and Kiers, 2003].
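This comparison can be sketched as follows. The estimated core G used here is purely hypothetical (a superidentity plus a small perturbation); in practice it is computed from the PARAFAC loadings and the data.

```python
import numpy as np

F = 3
# Theoretical superidentity core T: 1 on the superdiagonal, 0 elsewhere
T = np.zeros((F, F, F))
for f in range(F):
    T[f, f, f] = 1.0

# Hypothetical estimated core G: superidentity plus a small perturbation
rng = np.random.default_rng(3)
G = T + 0.05 * rng.normal(size=(F, F, F))

# Percentage of core consistency (equation 5)
core_consistency = 100.0 * (1.0 - np.sum((G - T) ** 2) / np.sum(T ** 2))
```

A core close to the superidentity gives a percentage close to 100; large off-superdiagonal values pull the percentage down.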
A simple way to assess whether T and G are similar is to monitor the distribution of the superdiagonal and off-superdiagonal elements of G. If the superdiagonal elements are all close to the corresponding elements of T and the off-superdiagonal elements are close to zero, then the model is appropriate. If this is not the case, then either too many components have been extracted, the model is mis-specified, or gross outliers disturb the model. The percentage of core consistency quantifies the similarity between G and T as:

%core consistency = 100 ( 1 − [ Σ_{p=1}^{F} Σ_{q=1}^{F} Σ_{r=1}^{F} (g_pqr − t_pqr)² ] / [ Σ_{p=1}^{F} Σ_{q=1}^{F} Σ_{r=1}^{F} t_pqr² ] )        (5)

The percentages derived from this approach give a fairly good approximation of the number of factors needed to describe the data matrix. If we gradually increase the number of factors in the three-way decomposition, the core consistency index decreases monotonically and slowly, because the influence of noise and other non-trilinear variations increases with the number of factors F. When the number of "true" factors is exceeded, the core consistency index decreases dramatically, because some directions in the model subspace mainly describe noise or some other variation, leading to high off-superdiagonal core values [Bro and Kiers, 2003].

The solution of the PARAFAC model can be found with the Alternating Least Squares (ALS) method, by successively assuming the loadings in two modes known and then estimating the unknown set of parameters of the last mode [Bro, 1997]. If the algorithm converges to a global minimum, which is most often the case for well-behaved problems, the least-squares solution of the model is found [Bro, 1997]. ALS is an attractive method because the solution of PARAFAC is certain to be improved at every iteration.
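The alternating scheme can be sketched as follows. This is a minimal, unconstrained illustration (random initialization, fixed iteration count, no acceleration, convergence test or constraints), not the full PARAFAC implementation used in this study:

```python
import numpy as np

def khatri_rao(U, V):
    # Column-wise Khatri-Rao product: rows indexed by (u, v), one column per factor
    return np.einsum('uf,vf->uvf', U, V).reshape(-1, U.shape[1])

def parafac_als(D, F, n_iter=200, seed=0):
    """Minimal unconstrained PARAFAC-ALS sketch."""
    I, J, K = D.shape
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(I, F))
    B = rng.normal(size=(J, F))
    C = rng.normal(size=(K, F))
    X0 = D.reshape(I, J * K)                      # mode-1 unfolding
    X1 = D.transpose(1, 0, 2).reshape(J, I * K)   # mode-2 unfolding
    X2 = D.transpose(2, 0, 1).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        # Each step solves an exact least-squares problem for one matrix,
        # keeping the other two fixed, so the fit never degrades.
        A = X0 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X1 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X2 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C

# Example: fit a noise-free rank-2 synthetic array
rng = np.random.default_rng(4)
At, Bt, Ct = rng.normal(size=(5, 2)), rng.normal(size=(4, 2)), rng.normal(size=(3, 2))
D = np.einsum('if,jf,kf->ijk', At, Bt, Ct)
A, B, C = parafac_als(D, F=2)
```

Because each update is an exact least-squares solution for one loading matrix with the other two fixed, the residual sum of squares can only decrease (or stay constant) from one iteration to the next.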
However, a major drawback of ALS is the time required to estimate the models, especially when the number of variables is high. Several strategies exist to reduce this cost. Firstly, the initialization step can be optimized: there are many methods for estimating the starting matrices, for example Singular Value Decomposition (SVD) [Golub and Van Loan, 1997] or the Direct TriLinear Decomposition / Generalized Rank Annihilation Method (DTLD/GRAM) [Sanchez and Kowalski, 1990]. In our case, we use the DTLD/GRAM method because it is quick and the fit of the model estimated by PARAFAC is better with this initialization. Secondly, the fitting of the PARAFAC model can be accelerated by applying suitable pre-processing to the studied system [Bro and Smilde, 2003; Massart et al., 1997; Martens and Naes, 1989]. Thirdly, the PARAFAC model can be constrained according to the data studied [Bro, 1997]. During the ALS steps, constraints are used to introduce prior information into the modelling of the A, B and C signal profiles. The main benefit of constraining the solutions of PARAFAC is that it can sometimes be helpful in terms of interpretability or stability of the model. These constraints are based on mathematical or physical properties of the studied system [Bro, 1997]. For example, with spectroscopic data it is general practice to use a non-negativity constraint, because the absorbance measurements should be positive if proper blanking is used. In our case, the reasoning is similar because the brightness measurements should be positive. A general method called Non-Negative Least Squares (NNLS) has been described by Lawson and Hanson [1995] and integrated into the PARAFAC procedure.

References

Bro, R. (1997), PARAFAC. Tutorial and applications, Chemometrics and Intelligent Laboratory Systems, 38, 149-171.

Bro, R., H. A. L.
Kiers (2003), A new efficient method for determining the number of components in PARAFAC models, Journal of Chemometrics, 17, 274-286, doi:10.1002/cem.801.

Bro, R., A. K. Smilde (2003), Centering and scaling in component analysis, Journal of Chemometrics, 17, 16-33.

Duponchel, L., S. Laurette, B. Hatirnaz, A. Treizebre, F. Affouard, B. Bocquet (2013), Terahertz microfluidic sensor for in situ exploration of hydration shell of molecules, Chemometrics and Intelligent Laboratory Systems, 123, 28-35, doi:10.1016/j.chemolab.2013.01.009.

Kruskal, J. B. (1976), More factors than subjects, tests and treatments: An indeterminacy theorem for canonical decomposition and individual differences scaling, Psychometrika, 41, 281.

Kruskal, J. B. (1977), Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics, Linear Algebra and its Applications, 18, 95.

Lawton, W. H., E. A. Sylvestre (1971), Self modeling curve resolution, Technometrics, 13, 617.

Lawson, C. L., R. J. Hanson (1995), Solving Least Squares Problems, Society for Industrial and Applied Mathematics, Philadelphia.

Malinowski, E. R. (2002), Factor Analysis in Chemistry, John Wiley & Sons, New York.

Martens, H., T. Naes (1989), Multivariate Calibration, John Wiley & Sons, Chichester.

Massart, D. L., B. G. M. Vandeginste, L. M. C. Buydens, S. de Jong, P. J. Lewi, J. Smeyers-Verbeke (1997), Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam.

Offroy, M., Y. Roggo, L. Duponchel (2012), Increasing the spatial resolution of near infrared chemical images (NIR-CI): The super-resolution paradigm applied to pharmaceutical products, Chemometrics and Intelligent Laboratory Systems, 117, 183-188.

Ruckebusch, C., L.
Blanchet (2013), Multivariate curve resolution: A review of advanced and tailored applications and challenges, Analytica Chimica Acta, 765, 28-36, doi:10.1016/j.aca.2012.12.028.

Rutledge, D. N., J.-R. Bouveresse (2007), Multi-way analysis of outer product arrays using PARAFAC, Chemometrics and Intelligent Laboratory Systems, 85(2), 170-178.

Sanchez, E., B. R. Kowalski (1990), Tensorial resolution: A direct trilinear decomposition, Journal of Chemometrics, 4, 29.

Tauler, R., B. Kowalski (1993), Multivariate curve resolution applied to spectral data from multiple runs of an industrial process, Analytical Chemistry, 65, 2040-2047.

Tauler, R., A. Smilde, B. Kowalski (1995), Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution, Journal of Chemometrics, 9, 31-58.

Tucker, L. R. (1964), Extension of factor analysis to three dimensional matrices, in N. Frederiksen, H. Gulliksen (Eds.), Contributions to Mathematical Psychology, Holt, Rinehart & Winston, New York, 110-182.