jgrd52327-sup-0001-supinfo

Atmosphere
Supporting Information for
Multivariate analysis of dim elves from ISUAL observations
Marc Offroy1, Thomas Farges1, Pierre Gaillard1, Cheng Ling Kuo2, Alfred Bing-Chih Chen3,
Rue-Ron Hsu4, Yukihiro Takahashi5
1 CEA, DAM, DIF, 91297 Arpajon cedex, France, 2 Institute of Space Science, National Central University, Jhongli, Taiwan, 3 Institute of Space and Plasma Sciences, National Cheng Kung University, Tainan, Taiwan, 4 Department of Physics, National Cheng Kung University, Tainan, Taiwan, 5 Department of Cosmosciences, Hokkaido University, Japan
Contents of this file

Text S1 to S5

Introduction
This supporting information presents a brief description of the mathematical approaches discussed in the manuscript. The first approach tested is Principal Component Analysis (PCA). It is a standard starting point in data mining: a descriptive method that does not rely on a probabilistic model of the data but simply aims to provide a geometric representation of it. The second approach, among the multivariate analysis techniques, is a soft-modeling method called Parallel FACtor analysis (PARAFAC).
1. Principal Component Analysis (PCA)
PCA decomposes a data matrix 𝐃 (π‘š × π‘›) into a product of two matrices, a matrix of scores denoted 𝐓 (π‘š × π‘˜) and a matrix of loadings denoted 𝐏 (𝑛 × π‘˜), plus a residual matrix 𝐄 (π‘š × π‘›):

𝐃 = 𝐓𝐏ᵀ + 𝐄    (1)
The dimension π‘˜ of the new space is bounded by the rank of the matrix 𝐃. The 𝑛 original variables from the π‘š observations are too complex to be interpreted directly from the raw data, which is why it is necessary to “reduce” the dimension of the space using π‘˜ Principal Components (PCs) that explain the maximum amount of information. The scores are the coordinates of the observations (or samples) on the axes of the selected PCs. The loadings are the contributions of the original variables to the same selected PCs. In other words, while the scores represent the observations in the space spanned by the new axes defined by the PCs, the loadings represent the variables on these axes. Geometrically, this change of variables by linear combinations results in a set of new variables called PCs. The direction of each newly created axis describes a part of the global
information from the original variables. The variance explained by each PC is sorted in decreasing order: the proportion of variance explained by the first PC, which represents the main part of the information, is higher than that of the second PC, which represents a smaller amount of information, and so on. The same information cannot be shared between two PCs because PCA requires the PCs to be orthogonal to each other.
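The decomposition of equation (1) can be illustrated with a minimal NumPy sketch based on the singular value decomposition (an assumed synthetic example, not the ISUAL data or the authors' code):

```python
import numpy as np

# Sketch of PCA on a data matrix D (m x n) via the SVD, yielding the
# decomposition D = T P^T + E of equation (1).
rng = np.random.default_rng(0)
D = rng.normal(size=(50, 8)) @ rng.normal(size=(8, 8))  # synthetic m x n data

Dc = D - D.mean(axis=0)                  # column-center before PCA
U, s, Vt = np.linalg.svd(Dc, full_matrices=False)

k = 3                                    # number of PCs kept
T = U[:, :k] * s[:k]                     # scores   (m x k)
P = Vt[:k].T                             # loadings (n x k), orthonormal columns
E = Dc - T @ P.T                         # residual matrix

# Fraction of variance explained by each kept PC, in decreasing order.
explained = s[:k] ** 2 / np.sum(s ** 2)
```

Because the singular values are sorted in decreasing order, the explained-variance fractions decrease from the first PC onward, as described above.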
2. The PARAllel FACtor analysis (PARAFAC)
PARAFAC decomposes a three-way array 𝐃 into a product of three matrices, one per mode. Instead of one score matrix and one loading matrix as in PCA, each component consists of a score matrix denoted 𝐀 and two loading matrices denoted 𝐁 and 𝐂. In PARAFAC, it is common not to distinguish between the score and the loading matrices. In other words, the PARAFAC model of a three-way array is given by three loading matrices 𝐀, 𝐁 and 𝐂 with elements π‘Žπ‘–π‘“, 𝑏𝑗𝑓 and π‘π‘˜π‘“, as follows:
π‘‘π‘–π‘—π‘˜ = Σ_{𝑓=1}^{𝐹} π‘Žπ‘–π‘“ 𝑏𝑗𝑓 π‘π‘˜π‘“ + π‘’π‘–π‘—π‘˜    (2)
The elements of 𝐃, of size 𝐼 × π½ × πΎ, are denoted π‘‘π‘–π‘—π‘˜. The trilinear model is found by minimizing the sum of squares of the residuals, denoted π‘’π‘–π‘—π‘˜. 𝐹 is the number of factors extracted in each mode, which describes the maximum amount of information contained in the array 𝐃. The model uses the following cost function:
𝐿(𝐀, 𝐁, 𝐂) = Σ_{𝑖=1}^{𝐼} Σ_{𝑗=1}^{𝐽} Σ_{π‘˜=1}^{𝐾} (π‘‘π‘–π‘—π‘˜ − Σ_{𝑓=1}^{𝐹} π‘Žπ‘–π‘“ 𝑏𝑗𝑓 π‘π‘˜π‘“)²    (3)
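The trilinear model of equation (2) and the cost of equation (3) can be written compactly with NumPy's einsum (a minimal sketch with assumed synthetic loadings, not the authors' implementation):

```python
import numpy as np

# Build a noiseless trilinear array from loading matrices A, B, C as in
# equation (2), and evaluate the least-squares cost of equation (3).
rng = np.random.default_rng(1)
I, J, K, F = 6, 5, 4, 2
A = rng.random((I, F))   # mode-1 loadings a_if
B = rng.random((J, F))   # mode-2 loadings b_jf
C = rng.random((K, F))   # mode-3 loadings c_kf

# d_ijk = sum_f a_if * b_jf * c_kf  (equation (2) with e_ijk = 0)
D = np.einsum('if,jf,kf->ijk', A, B, C)

def parafac_cost(D, A, B, C):
    """Cost function L(A, B, C) of equation (3): squared residual sum."""
    model = np.einsum('if,jf,kf->ijk', A, B, C)
    return np.sum((D - model) ** 2)
```

For the true loadings the residuals π‘’π‘–π‘—π‘˜ vanish and the cost is zero; perturbing any loading matrix increases it.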
The advantage of this methodology is that it provides simple and robust models that can be easily interpreted [Harshman, 1970]. Furthermore, the solution of the PARAFAC model is unique [Kruskal, 1976]. Kruskal (1977) proposed even less restrictive conditions under which unique solutions can be expected. Using the π‘˜-rank of the loading matrices, he showed that if π‘˜π΄ + π‘˜π΅ + π‘˜πΆ ≥ 2𝐹 + 2 then the PARAFAC solution is unique, with π‘˜π΄ the π‘˜-rank of matrix 𝐀, π‘˜π΅ the π‘˜-rank of 𝐁, π‘˜πΆ the π‘˜-rank of 𝐂, and 𝐹 the expected number of factors or components. A well-known problem with other bilinear decomposition methods arises from rotational and intensity ambiguities [Lawton and Sylvestre, 1971; Tauler et al., 1995]. For an estimated PARAFAC model, the mathematical meaning of uniqueness is that the model cannot be rotated without a large error, i.e. a loss of fit [Bro, 1997]. By contrast, for other two-way methods based on loadings or scores, the ambiguities do not degrade the fit of the model. Ambiguities can be defined as the set of solutions that fulfill the applied constraints and fit the data equally well. Consequently, the difficulty for a decomposition method is to determine 𝐹. Linear dependence poses a challenge to multivariate algorithms dealing with rank-deficient matrices. It is therefore necessary to determine the rank of the data matrix. Ideally, the rank of a data matrix agrees with the number of contributions in the studied system. In other words, the rank represents the number of eigenvectors needed to explain all the measurements in the data matrix.
Each of the recorded signals is a linear combination of these eigenvectors. It is therefore difficult to evaluate the rank on noisy data. Moreover, while there are various approaches for estimating the rank of the matrix, there are no explicit rules [Bro, 1997]. In our case, the “Core Consistency” diagnostic (CONCORDIA) approach developed by Bro and Kiers (2003) is used to determine the appropriate number of components for multiway models. The main idea is to compare the ‘core’ of the model estimated by PARAFAC with the ‘core’ of an ideal model. To understand this approach, a presentation of the Tucker3 model is necessary. Tucker3 is another modelling method for multiway arrays [Tucker, 1964; Rutledge and Bouveresse, 2007]. In addition to the loading matrices, a ‘core’ array is computed (equation 4). With the Tucker3 model, the number of components or factors can differ between modes, so the loading matrices do not all necessarily have the same number of columns. If 𝐃 is a three-way array of dimension 𝐼 × π½ × πΎ and the loading matrices of the three modes have dimensions 𝐼 × π‘ƒ, 𝐽 × π‘„ and 𝐾 × π‘…, respectively, then the dimension of the ‘core’ array is 𝑃 × π‘„ × π‘…. As shown by the element-wise definition of the Tucker3 model, interactions may exist between loadings of different order in the different modes because of the ‘core’ array:
π‘‘π‘–π‘—π‘˜ = Σ_{𝑝=1}^{𝑃} Σ_{π‘ž=1}^{𝑄} Σ_{π‘Ÿ=1}^{𝑅} π‘Žπ‘–π‘ π‘π‘—π‘ž π‘π‘˜π‘Ÿ π‘‘π‘π‘žπ‘Ÿ + π‘’π‘–π‘—π‘˜    (4)
The element denoted π‘‘π‘π‘žπ‘Ÿ defines the ‘core’ array 𝐓 (𝑃 × π‘„ × π‘…). By comparing equations (2) and (4), we note that the PARAFAC model is a restricted version of the Tucker3 model in which 𝑃 = 𝑄 = 𝑅 and 𝐓 is the theoretical superidentity array, i.e. the superdiagonal entries are equal to 1 and all other entries are zero. The main idea of the core consistency approach is to compare this theoretical superidentity array, denoted here as 𝐆, with the ‘core’ array 𝐓 derived from the matrices 𝐀, 𝐁, 𝐂 and 𝐃 [Bro and Kiers, 2003]. A simple way to assess whether 𝐆 and 𝐓 are similar is to monitor the distribution of the superdiagonal and off-superdiagonal elements of 𝐓. If the superdiagonal elements of 𝐓 are all close to the corresponding elements of 𝐆 and the off-superdiagonal elements are close to zero, then the model is appropriate. If this is not the case, then too many components have been extracted, the model is mis-specified, or gross outliers disturb the model. The percentage of core consistency quantifies the similarity between 𝐆 and 𝐓 as:
%π‘π‘œπ‘Ÿπ‘’ π‘π‘œπ‘›π‘ π‘–π‘ π‘‘π‘’π‘›π‘π‘¦ = 100 (1 −
∑𝐹𝑝=1 ∑πΉπ‘ž=1 ∑πΉπ‘Ÿ=1(π‘”π‘π‘žπ‘Ÿ − π‘‘π‘π‘žπ‘Ÿ )
2
∑𝐹𝑝=1 ∑πΉπ‘ž=1 ∑πΉπ‘Ÿ=1 π‘‘π‘π‘žπ‘Ÿ
2
)
(5)
The percentages derived from this approach give a fairly good approximation of the number of factors needed to describe the data matrix. If the number of factors of the three-way decomposition is gradually increased, the core consistency index decreases slowly and monotonically, because the influence of noise and other non-trilinear variations grows with the number of factors 𝐹. When the number of “true” factors is exceeded, the core consistency index drops dramatically, because some directions of the model subspace mainly describe noise or some other variation, leading to large off-superdiagonal core values [Bro and Kiers, 2003].
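The diagnostic of equation (5) can be sketched as follows (an assumed minimal NumPy example, not the authors' code; the least-squares Tucker3 core for fixed loadings is computed with pseudo-inverses, assuming the loading matrices have full column rank):

```python
import numpy as np

# Core-consistency diagnostic (CONCORDIA) of equation (5): compare the
# Tucker3 core computed from the PARAFAC loadings with the ideal
# superidentity core G (1 on the superdiagonal, 0 elsewhere).
def core_consistency(D, A, B, C):
    F = A.shape[1]
    # Least-squares Tucker3 core t_pqr for fixed loadings A, B, C.
    core = np.einsum('fi,gj,hk,ijk->fgh',
                     np.linalg.pinv(A), np.linalg.pinv(B), np.linalg.pinv(C), D)
    G = np.zeros((F, F, F))
    G[np.arange(F), np.arange(F), np.arange(F)] = 1.0  # superidentity array
    # Equation (5): 100 * (1 - sum (g_pqr - t_pqr)^2 / sum t_pqr^2)
    return 100.0 * (1.0 - np.sum((G - core) ** 2) / np.sum(core ** 2))
```

For a perfectly trilinear array built from 𝐀, 𝐁 and 𝐂, the computed core is the superidentity itself and the diagnostic equals 100 %; noise or extra factors push off-superdiagonal core values away from zero and lower the percentage.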
The solution of the PARAFAC model can be found with the Alternating Least Squares (ALS) method, by successively assuming the loadings of two modes to be known and estimating the unknown parameters of the remaining mode [Bro, 1997].
If the algorithm converges to a global minimum, which is most often the case for well-behaved problems, the least-squares solution to the model is found [Bro, 1997].
ALS is an attractive method because the PARAFAC solution is guaranteed to improve at every iteration. However, a major drawback of ALS is the time required to estimate the models, especially when the number of variables is high. Apart from the two acceleration schemes already implemented in the algorithm, it is necessary to optimize the initialization step. First, there are many methods for estimating the starting matrices, for example Singular Value Decomposition (SVD) [Golub and Van Loan, 1997] or Direct TriLinear Decomposition / Generalized Rank Annihilation Method (DTLD/GRAM) [Sanchez and Kowalski, 1990]. In our case, we use the DTLD/GRAM method because it is quick and the fit of the PARAFAC model estimated with this initialization is better. Second, the fitting of the PARAFAC model can be accelerated by suitable pre-processing of the studied system [Bro and Smilde, 2003; Massart et al., 1997; Martens and Naes, 1989]. Third, the PARAFAC model can be constrained according to the data studied [Bro, 1997]. During the ALS steps, constraints are used to introduce information into the modelling of the 𝐀, 𝐁 and 𝐂 signal profiles. The main benefit of constraining the solutions of PARAFAC is that it can sometimes help the interpretability or stability of the model. These constraints are based on mathematical or physical properties of the studied system [Bro, 1997]. For example, with spectroscopic data it is common practice to use a non-negativity constraint, because the absorbance measurements should be positive if proper blanking is used. In our case the situation is similar, because the brightness measurements should be positive. A general method called Non-Negative Least Squares (NNLS) has been described by Lawson and Hanson (1995) and is integrated into the PARAFAC procedure.
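The ALS scheme described above can be sketched in a few lines of NumPy (an assumed minimal example, not the authors' implementation: random initialization instead of DTLD/GRAM, no acceleration, and plain least squares instead of NNLS; each step fixes two loading matrices and solves for the third):

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Khatri-Rao product (JK x F) of B (J x F) and C (K x F)."""
    J, F = B.shape
    K = C.shape[0]
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, F)

def parafac_als(D, F, n_iter=500, seed=0):
    """Fit the trilinear model of equation (2) by Alternating Least Squares:
    unfold D along each mode and solve a linear least-squares problem for the
    corresponding loading matrix while holding the other two fixed."""
    I, J, K = D.shape
    rng = np.random.default_rng(seed)
    A, B, C = (rng.random((n, F)) for n in (I, J, K))
    for _ in range(n_iter):
        A = np.linalg.lstsq(khatri_rao(B, C),
                            D.reshape(I, J * K).T, rcond=None)[0].T
        B = np.linalg.lstsq(khatri_rao(A, C),
                            D.transpose(1, 0, 2).reshape(J, I * K).T, rcond=None)[0].T
        C = np.linalg.lstsq(khatri_rao(A, B),
                            D.transpose(2, 0, 1).reshape(K, I * J).T, rcond=None)[0].T
    return A, B, C
```

A non-negativity constraint would replace each `lstsq` call with a column-by-column non-negative solver in the spirit of Lawson and Hanson's NNLS.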
References

Bro, R. (1997), PARAFAC. Tutorial and applications, Chemometrics and Intelligent Laboratory Systems, 38, 149-171.

Bro, R., H. A. L. Kiers (2003), A new efficient method for determining the number of components in PARAFAC models, Journal of Chemometrics, 17, 274-286, doi:10.1002/cem.801.

Bro, R., A. K. Smilde (2003), Centering and scaling in component analysis, Journal of Chemometrics, 17, 16-33.

Duponchel, L., S. Laurette, B. Hatirnaz, A. Treizebre, F. Affouard, B. Bocquet (2013), Terahertz microfluidic sensor for in situ exploration of hydration shell of molecules, Chemometrics and Intelligent Laboratory Systems, 123, 28-35, doi:10.1016/j.chemolab.2013.01.009.

Harshman, R. A. (1970), Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multimodal factor analysis, UCLA Working Papers in Phonetics, 16, 1-84.

Kruskal, J. B. (1976), More factors than subjects, tests and treatments: An indeterminacy theorem for canonical decomposition and individual differences scaling, Psychometrika, 41, 281.

Kruskal, J. B. (1977), Three-way arrays: Rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics, Linear Algebra and its Applications, 18, 95.

Lawson, C. L., R. J. Hanson (1995), Solving Least Squares Problems, Society for Industrial and Applied Mathematics, Philadelphia.

Lawton, W. H., E. A. Sylvestre (1971), Self Modeling Curve Resolution, Technometrics, 13, 617.

Malinowski, E. R. (2002), Factor Analysis in Chemistry, John Wiley & Sons Inc, New York.

Martens, H., T. Naes (1989), Multivariate Calibration, John Wiley & Sons, Chichester.

Massart, D. L., B. G. M. Vandeginste, L. M. C. Buydens, S. de Jong, P. J. Lewi, J. Smeyers-Verbeke (1997), Handbook of Chemometrics and Qualimetrics: Part A, Elsevier, Amsterdam.

Offroy, M., Y. Roggo, L. Duponchel (2012), Increasing the spatial resolution of near infrared chemical images (NIR-CI): The super-resolution paradigm applied to pharmaceutical products, Chemometrics and Intelligent Laboratory Systems, 117, 183-188.

Ruckebusch, C., L. Blanchet (2013), Multivariate curve resolution: A review of advanced and tailored applications and challenges, Analytica Chimica Acta, 765, 28-36, doi:10.1016/j.aca.2012.12.028.

Rutledge, D. N., J.-R. Bouveresse (2007), Multi-way analysis of outer product arrays using PARAFAC, Chemometrics and Intelligent Laboratory Systems, 85 (2), 170-178.

Sanchez, E., B. R. Kowalski (1990), Tensorial resolution: A direct trilinear decomposition, Journal of Chemometrics, 4, 29.

Tauler, R., B. Kowalski (1993), Multivariate curve resolution applied to spectral data from multiple runs of an industrial process, Analytical Chemistry, 65, 2040-2047.

Tauler, R., A. Smilde, B. Kowalski (1995), Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution, Journal of Chemometrics, 9, 31-58.

Tucker, L. R. (1964), Extension of factor analysis to three-dimensional matrices, in N. Frederiksen, H. Gulliksen (Eds.), Contributions to Mathematical Psychology, Holt, Rinehart, & Winston, New York, 110-182.