CALIBRATION
Prof.Dr.Cevdet Demir
cevdet@uludag.edu.tr
LINKING TWO SETS OF DATA TOGETHER
• Peak height to concentration
• Spectra to concentrations
• Taste to chemical constituents
• Biological activity to structure
• Biological classification to chromatographic peak areas
NORMALLY WE ARE INTERESTED IN SOME
FUNDAMENTAL PARAMETER e.g. concentration
or biological classification
WE TAKE SOME MEASUREMENTS e.g. spectra or
chromatograms
WE WANT TO USE THESE MEASUREMENTS TO
GIVE US A PREDICTION OF THE
FUNDAMENTAL PARAMETER
UNIVARIATE CALIBRATION
One measurement e.g. a peak height
MULTIVARIATE CALIBRATION
Several measurements e.g. spectra
NOTATION
“x” block is measured data e.g. spectra, chromatograms,
GCMS of biological extract, structural parameters
“c” block is what we are trying to predict e.g. concentration,
species, acceptability of a product, taste
[Figure: notation in experimental design and in calibration.
Experimental design: X is the independent variable, e.g. concentration; Y is the response, e.g. spectroscopic.
Calibration: X is the measurement, e.g. spectroscopic; C is the predicted parameter, e.g. concentration.]
MULTIVARIATE CALIBRATION IN ANALYTICAL
CHEMISTRY
•Single component.
Example, concentration of chlorophyll a by uv/vis spectra.
•Mixture of components, all compounds known.
Example, mixture of pharmaceuticals, all pure compounds
known.
•Mixture of components, only some compounds
known.
Example, coal tar pitch volatiles in industrial waste
studied by spectroscopy, only some known.
•Statistical parameters.
Example, protein in wheat by NIR spectroscopy.
UNIVARIATE CALIBRATION
“x” and “c” blocks consist of single measurements.
Traditional analytical chemistry
CLASSICAL CALIBRATION
x ≈ c . s
Unknown : s
s ≈ c+ . x
where c+ is the pseudo-inverse of c
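A minimal sketch of classical calibration without an intercept, assuming a single peak height per sample. The concentrations and peak heights are invented, illustrative numbers and `predict_classical` is a hypothetical helper name.

```python
# Classical univariate calibration (no intercept): x ≈ c . s
# Illustrative, made-up data: concentrations c and peak heights x.
c = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1]

# For a single column vector c the pseudo-inverse is c' / (c'.c),
# so s = c+ . x reduces to sum(c_i * x_i) / sum(c_i ** 2).
s = sum(ci * xi for ci, xi in zip(c, x)) / sum(ci * ci for ci in c)


def predict_classical(x_new, slope=s):
    """Predict a concentration from a new measurement by inverting x = c . s."""
    return x_new / slope


c_hat = predict_classical(6.0)
print(f"s = {s:.4f}, predicted concentration = {c_hat:.4f}")
```

Note that the model is fitted with x as the response, and the prediction of c requires a second step (dividing by s).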
TREATMENT OF ERRORS IN CLASSICAL CALIBRATION
[Figure: x plotted against c.]
PROBLEMS
1. Modern lab : dilution and sample preparation errors (in
“c”) are probably bigger than spectroscopic errors (in
“x”). Spectra are more reproducible. This differs from
classical statistics, which places the errors in “x”.
2. We want to predict concentration from spectra etc., not
vice versa.
Most classical textbooks in analytical chemistry and most
spreadsheets incorrectly recommend classical calibration.
INVERSE CALIBRATION
c ≈ x . b
Unknown : b
b ≈ x+ . c
where x+ is the pseudo-inverse of x
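A minimal sketch of inverse calibration on the same kind of data, with the classical slope computed alongside for comparison. The numbers are invented for illustration only.

```python
# Inverse univariate calibration (no intercept): c ≈ x . b
# Illustrative, made-up data: concentrations c and peak heights x.
c = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [2.1, 3.9, 6.2, 7.8, 10.1]

# Pseudo-inverse of the column vector x is x' / (x'.x),
# so b = x+ . c reduces to sum(x_i * c_i) / sum(x_i ** 2).
b = sum(xi * ci for xi, ci in zip(x, c)) / sum(xi * xi for xi in x)

# Classical slope s for comparison (x ≈ c . s); predicting c uses 1 / s.
s = sum(ci * xi for ci, xi in zip(c, x)) / sum(ci * ci for ci in c)

x_new = 6.0
c_hat_inverse = x_new * b      # direct prediction of concentration
c_hat_classical = x_new / s    # prediction by inverting the classical model
print(c_hat_inverse, c_hat_classical)
```

The two predictions are close but not identical: b is never larger than 1/s, and the two agree only when the points lie exactly on a line through the origin.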
COMPARING FORWARD AND INVERSE CALIBRATION
[Figure: classical and inverse calibration lines fitted to the same data.]
INCLUDING THE INTERCEPT : first column of “x” is 1s
c ≈ b0 + b1 x
c ≈ X . b
b ≈ X+ . c
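A minimal sketch of inverse calibration with an intercept, assuming NumPy is available. The first column of X is set to 1s so that b holds (b0, b1); the data are invented for illustration.

```python
import numpy as np

# Illustrative, made-up data: peak heights x and known concentrations c.
x = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
c = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# First column of X is 1s (intercept), second column is the measurement.
X = np.column_stack([np.ones_like(x), x])

# b = X+ . c, using the Moore-Penrose pseudo-inverse.
b = np.linalg.pinv(X) @ c
b0, b1 = b

# Predict the concentration for a new measurement.
c_hat = b0 + b1 * 6.0
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, predicted c = {c_hat:.4f}")
```

Because X has full column rank here, the pseudo-inverse gives the ordinary least squares line of c on x.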
HOW WELL IS THE MODEL PREDICTED?
Huge number of approaches
• Root mean square error (divide by the degrees of freedom d :
number of samples minus 1 or 2, according to the number of
parameters in the model).
$E = \sqrt{\sum_{i=1}^{I} (x_i - \hat{x}_i)^2 / d}$
Often express as percentage either of the mean
measurement or the standard deviation of the
measurements
• Correlation coefficient of predicted versus true – has
problems if the number of samples is small.
• ANOVA and replicates analysis using lack-of-fit error,
as discussed in the experimental design lectures.
• Leaving samples out and predicting them : cross-validation and testing will be discussed later.
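A minimal sketch of the root mean square error with a degrees-of-freedom divisor, also expressed as a percentage of the mean measurement. The function name `rms_error`, its `n_params` argument and the data are illustrative assumptions.

```python
import math


def rms_error(observed, predicted, n_params=1):
    """Root mean square error, dividing by d = (number of samples - n_params)."""
    d = len(observed) - n_params
    if d <= 0:
        raise ValueError("need more samples than model parameters")
    ss = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(ss / d)


# Illustrative, made-up values.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.1, 1.9, 3.2, 3.8, 5.0]

# Two parameters (intercept and slope), so d = 5 - 2 = 3.
E = rms_error(observed, predicted, n_params=2)

# Express as a percentage of the mean measurement.
E_percent = 100.0 * E / (sum(observed) / len(observed))
print(f"E = {E:.4f} ({E_percent:.2f}% of the mean)")
```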
PROBLEMS
•Outliers can be a major difficulty. Graphical ways of
looking for outliers are a big area.
•Outliers have undue influence on least squares models.
MULTIWAVELENGTH
Example : four compounds, four wavelengths.
MULTIPLE LINEAR REGRESSION (MLR)
C ≈ X . B
Know
•X : a series of spectra
•C : concentrations
WAYS OF PERFORMING THE CALIBRATION
1. Producing a series of mixture spectra of known
concentrations by weighing different amounts and
adding together
2. Taking a series of spectra and calibrating against
an independent method e.g. HPLC.
[Figure: UV/vis spectra, wavelength axis from 220 to 400 nm.]
EXAMPLE : UV/VIS OF PAHs AT 4 WAVELENGTHS, NO
WAVELENGTH IS UNIQUE
B = X+ . C
estimated [pyrene] = -3.870 A330 + 8.609 A335 – 5.098 A340 + 1.848 A345
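A minimal sketch of B = X+ . C for four compounds at four wavelengths, assuming NumPy. The absorptivities in `S` and the concentrations are invented, noise-free numbers (not the PAH data behind the coefficients above), and the names `C_cal`, `X_cal` and `C_new` are illustrative.

```python
import numpy as np

# Hypothetical absorptivities: rows = 4 compounds, columns = 4 wavelengths.
# Every compound absorbs at every wavelength, so no wavelength is unique.
S = np.array([[1.0, 0.3, 0.2, 0.1],
              [0.2, 1.0, 0.3, 0.2],
              [0.1, 0.2, 1.0, 0.3],
              [0.3, 0.1, 0.2, 1.0]])

rng = np.random.default_rng(0)
C_cal = rng.uniform(0.1, 1.0, size=(10, 4))  # 10 calibration mixtures
X_cal = C_cal @ S                            # noise-free mixture spectra

# Inverse MLR: B = X+ . C. Column k of B holds the four coefficients that
# multiply the four absorbances to estimate the concentration of compound k.
B = np.linalg.pinv(X_cal) @ C_cal

# Predict the concentrations in a new mixture from its four absorbances.
C_new = np.array([[0.5, 0.2, 0.7, 0.4]])
x_new = C_new @ S
C_hat = x_new @ B
print(np.round(C_hat, 4))
```

With noise-free data the prediction reproduces the true concentrations; with real spectra the coefficients would also absorb measurement noise.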
Can also use classical methods
Ĉ = X . S+
This can be done by knowledge of the
pure spectra.
This differs from calibration, where a series of
mixtures is recorded.
MULTIPLE LINEAR REGRESSION
•Why use only 4 wavelengths?
•Why not 10 or 100 wavelengths?
More information – not arbitrary choice of
wavelengths.
•Number of wavelengths can be greater than number
of compounds.
Example
• 25 spectra
• 10 compounds
• 100 wavelengths
C = X . B
B = X+ . C
In this case
•B is a matrix of coefficients, 100 × 10
•X is a spectral matrix, 25 × 100
•C is a concentration matrix, 25 × 10
There are some technical problems using inverse
calibration in this case (more wavelengths than spectra,
so the coefficients in B are not uniquely determined),
and often it does not work.
Better approach
1. First predict the spectra S.
•Either they are known from the calibration of the
pure standards
•Or they can be predicted from the mixture spectra
S ≈ C+ . X
2. Then use these predictions in a model (e.g. of
unknowns)
C ≈ X . S+
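The two steps above can be sketched as follows, assuming NumPy and the dimensions of the example (25 spectra, 10 compounds, 100 wavelengths). The data are simulated and noise-free, so the estimates are exact here; real spectra contain noise. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_compounds, n_wavelengths = 25, 10, 100

# Simulated pure spectra and calibration concentrations (assumed, not real data).
S_true = rng.uniform(0.0, 1.0, size=(n_compounds, n_wavelengths))
C = rng.uniform(0.1, 1.0, size=(n_spectra, n_compounds))
X = C @ S_true                               # 25 x 100 mixture spectra

# Step 1: estimate the pure spectra from the mixtures, S ≈ C+ . X
S_hat = np.linalg.pinv(C) @ X                # 10 x 100

# Step 2: predict concentrations of unknowns, C ≈ X . S+
C_unknown = rng.uniform(0.1, 1.0, size=(5, n_compounds))
X_unknown = C_unknown @ S_true               # spectra of 5 unknown mixtures
C_hat = X_unknown @ np.linalg.pinv(S_hat)    # 5 x 10

print(S_hat.shape, C_hat.shape)
```

Using all 100 wavelengths is not a problem in this direction, because the matrices being pseudo-inverted (C and S) each have only 10 compounds to resolve.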
MLR effectively models a spectrum as a sum of spectra of
the components, e.g. for a 3-component model
Observed spectrum =
conc A × spectrum A +
conc B × spectrum B +
conc C × spectrum C
ENHANCEMENTS
• Selecting only certain variables, not all the wavelengths.
• Weighting of variables.
ERROR ANALYSIS
This now becomes more sophisticated.
In addition to errors in the “c” block (concentration
errors), now also errors in the “x” block
(reconstruction of spectra).
Discuss later.
LIMITATIONS AND PROBLEMS WITH MLR
• Number of experiments and number of wavelengths
must never be less than number of compounds
• All significant compounds must be known. If still
unknowns, then these are mixed up with the knowns.
Problems if no pure standards and no reliable reference
method. THIS IS THE BIGGEST LIMITATION.
•Sometimes extra wavelengths can be bad ones e.g.
noise or background.
• Assume that concentrations are perfectly known,
errors in only one variable, using classical approach.
However, if information on all the significant
compounds is known then MLR is a simple and
effective method.
PRINCIPAL COMPONENTS REGRESSION (PCR)
Do not need to know all components in advance,
simply "how many components", and the compounds
of interest.
Overcomes a major limitation of MLR
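[Diagram: X (samples × detector, e.g. wavelength) is decomposed by PCA
into scores T and loadings P; the concentration vector c (one value per
sample) is then obtained by regression on the scores, c ≈ T . r]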
The first step is to perform PCA.
Obtain a scores matrix, retaining A components
The value of A may be a guess of the number of
compounds in the mixture.
Then r = T+ . c
Can extend to more than one concentration :
C ≈ T . R
Example
25 spectra taken at 100 wavelengths
We know about and want to predict 4 compounds
We think there are around 10 compounds in the
mixture, 6 are unknown.
T is a matrix of dimensions 25 × 10
C is a matrix of dimensions 25 × 4
R is a matrix of dimensions 10 × 4
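A minimal sketch of PCR with these dimensions, assuming NumPy. PCA is done here by an SVD of the uncentred data, which is one possible choice and an assumption of this sketch; the spectra are simulated and noise-free, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_wavelengths, n_total, n_known, A = 25, 100, 10, 4, 10

# Simulated mixture of 10 compounds, of which only the first 4 are known.
S = rng.uniform(0.0, 1.0, size=(n_total, n_wavelengths))
C_all = rng.uniform(0.1, 1.0, size=(n_spectra, n_total))
X = C_all @ S                     # 25 x 100 spectra
C = C_all[:, :n_known]            # 25 x 4 known concentrations

# PCA by SVD (uncentred): scores T and loadings P, retaining A components.
U, sv, Vt = np.linalg.svd(X, full_matrices=False)
T = U[:, :A] * sv[:A]             # 25 x 10
P = Vt[:A]                        # 10 x 100

# Regression step: R = T+ . C, so that C ≈ T . R
R = np.linalg.pinv(T) @ C         # 10 x 4
C_hat = T @ R                     # fitted concentrations of calibration set

# Predict a new sample: project its spectrum onto the loadings, then use R.
c_new_all = rng.uniform(0.1, 1.0, size=(1, n_total))
x_new = c_new_all @ S
c_new_hat = x_new @ P.T @ R       # 1 x 4

print(T.shape, C.shape, R.shape)
```

Only the 4 known concentrations enter the regression; the other 6 compounds are accounted for through the scores.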
Example of the calculation of the concentration of
pyrene in a set of 25 UV/vis spectra containing 10
different PAHs.
How many PCA components to use? The
prediction gets better as the number of
components increases.
ERRORS – “x” block
Simply as in PCA, look at eigenvalues as more
principal components are calculated
[Figure: “x” block error against number of principal components (1 to 15), logarithmic vertical axis.]
ERRORS – “c” block
Look at errors in calculation of concentrations – often
different behaviour
[Figure: “c” block error against number of principal components (1 to 15), logarithmic vertical axis.]
[Figure: three plots comparing predicted concentration with observed concentration, both axes from 0 to 0.8.]
Predictions for pyrene concentration using 1, 5 and
10 principal components.
Why not use a large number of PCA components?
Then one can get perfect prediction?
FALLACY : the idea is to predict unknowns, after
the knowns have been modelled. Later PCs often
model noise.
Choose number of PCs equal to number of compounds in
the mixture? Methods for determining the number of
PCs when this is unknown are described later.
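A minimal sketch of why the calibration error alone is a misleading guide, assuming NumPy and simulated noisy spectra. It computes the error on the calibration samples and on a separate test set as the number of PCs increases; the helper `make_set`, the noise level and all other names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.uniform(0.0, 1.0, size=(10, 100))    # 10 simulated pure spectra


def make_set(n_samples, noise=0.01):
    """Simulate noisy mixture spectra; return X and the first compound's c."""
    C = rng.uniform(0.1, 1.0, size=(n_samples, 10))
    X = C @ S + rng.normal(0.0, noise, size=(n_samples, 100))
    return X, C[:, 0]


X_cal, c_cal = make_set(25)     # calibration samples
X_test, c_test = make_set(10)   # independent test samples

U, sv, Vt = np.linalg.svd(X_cal, full_matrices=False)

cal_err, test_err = [], []
for A in range(1, 16):
    T = U[:, :A] * sv[:A]                 # scores for A components
    P = Vt[:A]                            # loadings for A components
    r = np.linalg.pinv(T) @ c_cal         # r = T+ . c
    cal_err.append(np.sqrt(np.mean((T @ r - c_cal) ** 2)))
    test_err.append(np.sqrt(np.mean((X_test @ P.T @ r - c_test) ** 2)))

for A, (e_cal, e_test) in enumerate(zip(cal_err, test_err), start=1):
    print(f"A = {A:2d}  calibration error = {e_cal:.4f}  test error = {e_test:.4f}")
```

The calibration error can only fall (or stay the same) as components are added, which is why it cannot be used on its own to choose the number of PCs; the error on samples left out of the model is the more honest indicator.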
Advantage over MLR - only partial knowledge necessary.
Disadvantage : assumption that all errors are in the "x" block.
Practical situation.
•Modern instruments very reproducible.
•Volumetrics, measuring cylinders, syringes are inaccurate.
PARTIAL LEAST SQUARES (PLS)
This technique assumes that errors in both “x”
and “c” block are equally significant.
[Diagram: X = T . P + E and c = T . q + f]
What does this mean?
X = T.P + E
c = T.q + f
THERE IS A COMMON SCORES MATRIX FOR
BOTH “x” AND “c” BLOCKS.
In PCR we calculate the scores just for the “x” block
and then use a separate step for regression.
A big difference between PCR and PLS is that in PCR
there is only one scores matrix, whereas for PLS (using 1
column of “c”) there is a different scores matrix
for each compound.
The vector q is analogous to loadings.
PLS components have some analogies to PC
components.
In PCA, each component consists of a
•scores vector
•loadings vector
•eigenvalue.
In PLS, each component consists of a
•scores vector
• “x” loadings vector (p)
• “c” loadings vector (q) – a single number
• magnitude.
FOR THE TECHNICALLY MINDED.
•Unlike eigenvalues, the magnitudes of successive PLS
components do not necessarily decrease in size, although they
do model the overall datasets.
•Unlike loadings for PCA, loadings in PLS are not orthogonal.
•In most cases PLS loadings are not normal.
•There are many algorithms for PLS and it can be confusing.
ERROR ANALYSIS : similar principles to PCR
but different curves for different compounds.
Sometimes a different number of PLS components
is used to model different compounds in one
mixture.
[Figure: “c” errors and “x” errors against number of PLS components (1 to 20).]
For a dataset consisting of 25 spectra observed at 27
wavelengths, for which 8 PLS components are calculated,
there will be
•a T matrix of dimensions 25 × 8,
•a P matrix of dimensions 8 × 27,
•an E matrix of dimensions 25 × 27,
•a q vector of dimensions 8 × 1 and
•an f vector of dimensions 25 × 1.
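A minimal sketch of PLS1 that reproduces these dimensions, assuming NumPy. It uses one common NIPALS-style variant on mean-centred data; as noted above there are many PLS algorithms, so this is not necessarily the one used for the figures in these notes. The function name `pls1` and the simulated data are illustrative.

```python
import numpy as np


def pls1(X, c, n_components):
    """NIPALS-style PLS1 on already centred X (I x J) and c (I,).

    Returns T (I x A), P (A x J), q (A,), E (I x J), f (I,)
    such that X = T.P + E and c = T.q + f.
    """
    E = np.array(X, dtype=float)           # residual "x" block (copy)
    f = np.array(c, dtype=float)           # residual "c" block (copy)
    I, J = E.shape
    T = np.zeros((I, n_components))
    P = np.zeros((n_components, J))
    q = np.zeros(n_components)
    for a in range(n_components):
        w = E.T @ f                         # weights from x/c covariance
        w /= np.linalg.norm(w)
        t = E @ w                           # scores, common to x and c
        tt = t @ t
        p = E.T @ t / tt                    # "x" loadings
        qa = f @ t / tt                     # "c" loading (a single number)
        E -= np.outer(t, p)                 # remove component from x block
        f -= qa * t                         # remove component from c block
        T[:, a], P[a], q[a] = t, p, qa
    return T, P, q, E, f


# Simulated data: 25 spectra at 27 wavelengths (made-up, with noise).
rng = np.random.default_rng(0)
S = rng.uniform(0.0, 1.0, size=(10, 27))
C = rng.uniform(0.1, 1.0, size=(25, 10))
X = C @ S + rng.normal(0.0, 0.01, size=(25, 27))
c = C[:, 0]

Xc = X - X.mean(axis=0)                     # centre both blocks
cc = c - c.mean()

T, P, q, E, f = pls1(Xc, cc, n_components=8)
print(T.shape, P.shape, E.shape, q.shape, f.shape)
```

Running `pls1` again with a different column of C gives a different T, which is the sense in which PLS1 has a separate scores matrix for each compound.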
PLS2 – when more than one “c” variable
[Diagram: X = T . P + E and C = T . Q + F]
X = T.P + E
C = T.Q + F
Differences to PLS1
•C is now a matrix
•Q is also a matrix
•F is also a matrix
•A single scores matrix for all compounds in the mixture.
•Theoretically PLS2 should perform better than
PLS1 but in practice it often performs worse.
•Computationally faster, which was important 10 years ago.
•Useful for non-linear problems such as QSAR
where there are interactions, but not so useful in analytical
chemistry, which is very linear.
SUMMARY OF MAIN METHODS
• Univariate calibration
•Classical
•Inverse
•Multiple linear regression
•Principal components regression
•Partial least squares
•PLS1
•PLS2