Multivariate Data Analysis – In Practice

Multivariate Data Analysis
– In Practice
5th Edition
An Introduction to
Multivariate Data Analysis
and Experimental Design
Kim H. Esbensen
Ålborg University, Esbjerg
with contributions from
Dominique Guyot
Frank Westad
Lars P. Houmøller
www.camo.com
CAMO Software AS.
Nedre Vollgate 8,
N-0158,
Oslo,
NORWAY
Tel: (47) 223 963 00
Fax: (47) 223 963 22
CAMO Software Inc.
One Woodbridge Center,
Suite 319,
Woodbridge, NJ 07095,
USA
Tel: (732) 726 9200
Fax: (973) 556 1229
CAMO Software India Pvt. Ltd.
14 & 15, Krishna Reddy Colony
Domlur Layout,
Bangalore - 560 071,
INDIA
Tel: (91) 80 4125 4242
Fax: (91) 80 4125 4181
This book was produced using Doc-to-Help together with Microsoft Word. Visio and
Excel were used to make some of the illustrations. The screen captures were taken with
Paint Shop Pro.
Trademark Acknowledgments
Doc-To-Help is a trademark of WexTech Systems, Inc.
Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word
are trademarks of the Microsoft Corporation.
Paint Shop Pro is a trademark of JASC, Inc.
Visio is a trademark of the Shapeware Corporation.
Information in this book is subject to change without notice. No part of this document
may be reproduced or transmitted in any form or by any means, electronic or
mechanical, for any purpose, without the express written permission of CAMO Process
AS.
ISBN 82-993330-3-2
© 1994 – 2002 CAMO Process AS
All rights reserved.
5th edition. Re-print December 2004
Preface
October 2001
Learning to do multivariate data analysis is in many ways like learning
to drive a car: You are not let loose on the road without mandatory
training, theoretical and practical, as required by current concern for
traffic safety. As a minimum you need to know how a car functions and
you need to know the traffic code. On the other hand, everybody would
agree that it is only after having obtained your driver’s license that the
real practical learning begins. This is when your personal experience
really starts to accumulate. There is a strong interaction between the
theory absorbed and the practice gained in this secondary, personal
training period.
Please substitute “multivariate data analysis” for “driving a car” in all of
the above. In this context, too, you are not let out on the data analytical
road without mandatory training, theoretical and practical. The
analogy is actually very apt!
This book presents a basic theoretical foundation for bilinear
(projection-based) multivariate data modeling and gives a conceptual
framework for starting to do your own data modeling on the data sets
provided. There are some 25 data sets included in this training package.
By doing all exercises included you’re off to a flying start!
Driving your newly acquired multivariate data analysis car is very much
an evolutionary process: this introductory textbook is filled with
illustrative examples, many practical exercises and a full set of
self-examination real-world data analysis problems (with corresponding data
sets). If, after all of this, you are able to work confidently on your own
applications, you’ll have reached the goal set for this book.
This is the 5th revised edition of this book. The first three editions were
mainly reprints, the only major change being the inclusion of a
completely revised chapter on “Introduction to experimental design”,
which first appeared in the 3rd edition (CAMO). The 4th revised
edition (published March 2000), however, saw many major
extensions and improvements:
• Text completely rewritten by the senior author, based on five years of
extensive use in teaching at both university and dedicated course
levels. More than 5,500 copies in use.
• 30% new theory & text material added, reflecting extensive student
response, full integration of PCA, PLS1 & PLS2 NIPALS algorithms
and explanations.
• Text revised with an augmented self-learning objective throughout.
• Four new master data sets added (with extended self-exercise
potential):
1. Master violin data (PCA/PLS)
2. Norwegian car dealerships (PCA/PLS)
3. Vintages (PCA/PLS)
4. Acoustic chemometric calibration (PCR/PLS)
• Additional chapter on experimental design: new features include
mixture designs and D-optimal designs.
• New chapter on the powerful, novel “Martens’ Uncertainty Test”.
• Comprehensive glossary of terms.
This 5th edition also includes essential additional revisions and
improvements:
• Lars P. Houmøller, Ålborg University Esbjerg, has carried out a
complete work-through of all demonstrations and exercises. Many of
these had not been updated with respect to several of the intervening
UNSCRAMBLER software versions. We are happy to have finally
eliminated this most frustrating nuisance.
About the authors
Kim H. Esbensen, Ph.D., has more than 20 years of experience in
multivariate data analysis and applied chemometrics. He was professor
in chemometrics at the Norwegian Telemark Institute of Technology
(HIT/TF), Institute of Process Technology (PT) 1995-2001, where he
was also head of the Chemometrics Department Tel-Tek, Telemark
Industrial R&D Center, Porsgrunn. Between these institutions he
founded ACRG: the Applied Chemometrics Research Group, HIT/TF-Tel-Tek,
which among other things hosted SSC6, the 6th Scandinavian Symposium on
Chemometrics, August 1999 as well as numerous other international
courses, workshops and meetings.
July 1st, 2001 he moved to a position as research professor in Applied
Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he
is currently leading ACACSRG: the Applied Chemometrics, Analytical
Chemistry and Sampling Research Group. As the name implies, applied
chemometrics activities continue in Esbjerg while new activities are
added – most notably through close collaboration with assoc. prof. Lars
P. Houmøller, who independently built up the area of analytical
chemistry/chemometrics at AUE before Prof. Esbensen’s arrival. Most
recently the discipline of sampling (proper sampling) has been added, in
recognition of the immense importance of sampling in any data
analytical discipline, including chemometrics.
Kim H. Esbensen has published more than 60 papers and technical
reports on a wide range of chemical, geochemical, industrial,
technological, remote sensing, image analytic and acoustic chemometric
applications. Together with Paul Geladi he has been instrumental in
co-developing the concept of Multivariate Image Analysis (MIA); with
ACRG he pioneered the development of the novel area of acoustic
chemometrics.
His M.Sc. (1978; geology, geochemistry) is from the University of Aarhus,
Denmark, while a Ph.D. was conferred on him by the Technical
University of Denmark (DTH) in 1981 within the areas of metallurgy,
meteoritics and multivariate data analysis. He then did post-doctoral
work for two years with the Research Group for Chemometrics at the
University of Umeå 1980-1981, after which he worked in a Swedish
geochemical exploration company, Terra Swede, for two more years.
Moving to Norway, this was followed by eight years as data analytical
research scientist at the Norwegian Computing Center (NCC), Oslo,
after which he became a senior research scientist at SINTEF, the
Norwegian Foundation for Industrial and Technological Research for
four additional years. In between these two assignments he was a
visiting guest professor at Norsk Hydro’s Research Center in Bergen,
Norway. He also holds a position as Chercheur associé (now Chercheur
affilié) du Centre de Recherche en Géomatique, Université Laval,
Quebec. He is a member of the editorial board of Journal of
Chemometrics, Wiley Publishers, and is a member of ICS, AGU and
several other geological, data analytical and statistical associations.
Dominique Guyot, educated in Statistics, Economics and
Biomathematics (ENSAE and Université de Paris 7, France), has 15
years of experience in the field of chemometrics. She gained industrial
experience from her work in the pharmaceutical and cosmetic industries,
before joining CAMO from 1995 until 2000. With CAMO, Dominique
worked as a Senior Consultant, and was particularly involved in food
applications. She put together a practical strategy for efficient product
development, based on experimental design and multivariate data
analysis. This strategy was implemented in the Guideline®+ software
package, complemented by an integrated training course focusing on
multivariate methods for food product developers. Dominique is now
studying music and singing at the Conservatoire of Trondheim, Norway.
Frank Westad has an M.Sc. in physical chemistry from the University of
Trondheim, Norway. He has 13 years of experience in applied multivariate
data analysis, and he completed a Ph.D. in multivariate regression in
2000. Frank has given numerous courses in experimental design and
multivariate analysis for companies in Europe and in the U.S.A. His
main research fields include variable selection, shift modelling and
image analysis.
Lars P. Houmøller has an M.Sc. in chemistry and physics from the
University of Aarhus, Denmark. He has 12 years of experience in
analytical chemistry and has worked 5-7 years with chemometrics. His
teaching experiences include chemometrics, analytical chemistry,
spectroscopy, physical chemistry, general and technical chemistry,
organic and inorganic chemistry, unit operations and fluid dynamics. His
research field covers NIR spectroscopic applications over a very broad
industrial spectrum. He also has experience from working in the Danish
food production industry.
E-mail interaction with the authors:
Kim Esbensen: kes@aue.auc.dk
Dominique Guyot: dominique.guyot@camo.no
Frank Westad: fwestad@online.no
Lars P. Houmøller: lph@aue.auc.dk
About this book
Since 1986, when CAMO ASA first commercialized and started
marketing THE UNSCRAMBLER, many customers have asked for
basic, easy-to-understand literature on chemometrics. In 1993 a group of
data analysts at different competence levels was invited to a one-day
seminar at CAMO, Trondheim, for discussing their experience from
both learning and teaching chemometrics. The result was a blue-print
outline for what came to be this introductory book: the specifications
called for a comprehensive training-package, involving basic, practical,
easy-to-read, largely non-mathematical theory, with plenty of hands-on
examples and exercises on real-world data sets. CAMO contracted
SINTEF to write this book (first three editions), and the parties agreed to
cooperate on the completion of the full training package.
In the intervening years, this book was published in some 4,500 copies
and was used for the introductory basic training in some 15 universities
and in several hundred industrial companies; reactions were many and
largely constructive. We learned a lot from these criticisms; we thank all
who contributed!
By 1999, the time was ripe for a complete revision of the entire
package. This was undertaken by the senior author in the summer of 1999
with significant assistance from his then Ph.D. student Jun Huang (now
with CAMO, Norway); from Frank Westad (Matforsk), who wrote chapter 14
(Martens’ Uncertainty Test); from Dominique Guyot (CAMO), who wrote the
original version of the entirely new chapter 17 (Complex Experimental Design
Problems); and with further invaluable editorial and managerial
contributions from Michael Byström (CAMO) and Valérie Lengard
(CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO,
UK) for very effective linguistic streamlining of the 4th edition! The
authors and CAMO also take this opportunity to acknowledge Suzanne
Schönkopf’s (CAMO) contribution to editions previous to the 4th one.
The present edition of this book still bears the fruit of her very important
past efforts.
The publication of the 4th edition, in March 2000, was unfortunately
somewhat marred by a less than complete revision of the exercises and
illustrative UNSCRAMBLER runs in the book, which was not
considered fatal at the time. This soon proved to be a serious mistake;
disappointment and frustration from several generations of students, who
wanted to follow all the exercises closely, followed rapidly. A Danish
university teacher, who had himself experienced this frustration close up
when using the book for his own teaching, assoc. prof. Lars P.
Houmøller at the University of Ålborg, Esbjerg, voluntarily took it upon
himself to carry out a complete work-through of this essential didactic
aspect of the book. His very valuable demo and exercise revisions, as
well as a very thorough text consistency check, have now been included
in toto in the 5th edition.
Today, this book is a collaborative effort between the senior author and
CAMO Process AS; the tie with SINTEF is now defunct.
There is little academic glamour in writing an introductory level
textbook, as the senior author has well experienced - which was never
the goal anyway. But on the other hand, the introductory level is
definitely where the largest audience and potential market exist, as
CAMO has well experienced. The senior author has used the book for
six consecutive years teaching introductory chemometrics largely to
engineering (M.Sc.) students, as well as for extensive course work in
industrial and foreign university environments. The response from some
accumulated 500 students has made this author happy, while some 5500
sales have made CAMO equally satisfied.
Thus all is well with the training package! We hope that this revised 5th
edition will continue to meet the challenging demands of the market,
hopefully now in an improved form. Writing for precisely this
introductory audience/market constitutes the highest scientific and
didactic challenge, and is thus (still) irresistible!
Acknowledgements
The authors wish to thank the following persons, institutions and
companies for their very valuable help in the preparation of this training
package:
Hans Blom, Østlandskonsult AS, Fredrikstad, Norway
Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway
Rolf Carlson, Department of Chemistry, University of Tromsø, Norway
Chevron Research & Technology Co, Richmond, CA, USA
Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå,
Sweden (now with Umetrics, Inc.)
Professor Magni Martens, The Royal Veterinary & Agricultural
University, Denmark
Geological Survey of Greenland, Denmark
IKU, Institute for Petroleum Research, Trondheim, Norway
Norwegian Food Research Institute (MATFORSK), Ås, Norway
Norwegian Society of Process Control
Norwegian Chemometrics Society
International Chemometrics Society
UOP Guided Wave, CA, USA
Pierre Gy, Cannes, France (for a gentleman’s introduction to the finest
French wines)
Zander & Ingerström, Oslo, Norway
Tomas Öberg Konsult AB, Karlskoga, Sweden
KAPITAL (weekly Norwegian economic magazine), no 14/1994, p50-55
Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto “violin
no 9”)
Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto
oeuvre data)
Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology
entrepreneur extraordinaire; Evy: for innumerable occasions: warm
company, coffee and waffles, waffles, waffles)
Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen
a.o. (for enormous help in developing acoustic chemometrics)
“Anonymous wine importer”, Odense, Denmark.
Helpful wine assessors (partly anonymous), Manson, Wa, USA.
Finally the author(s) and CAMO wish to thank all THE
UNSCRAMBLER users during the last seven years for their close
relationships with us, which have given us so much added experience in
teaching multivariate data analysis. And thanks for all the constructive
criticism of the earlier editions of this book. Last, but certainly not least,
a warm thank you to all the students at HIT/TF, at Ålborg University,
Esbjerg and many, many others, who have been associated with the
teachings of the authors, nearly all of whom have been very constructive
in their ongoing criticism of the entire teaching system embedded in this
training package. We even learned from the occasional not-so-friendly
criticisms…
Communication
After a formative period of seven years, the training package has
come of age. By now we are actually beginning to
be rather satisfied with it!
And yet: The author(s) and CAMO always welcome all critical
responses to the present text. They are much needed if this
work is to keep improving.
Contents
1. Introduction to Multivariate Data Analysis - Overview
1.1 Indirect Observations and Correlation
1.2 Hidden Data Structures
1.3 Multivariate Data Analysis vs. Multivariate Statistics
1.4 Main Objectives of Multivariate Data Analytical Techniques
1.5 Multivariate Techniques as Projections
2. Getting Started - with Descriptive Statistics
2.1 Purpose
2.2 Data Set 1: Quality of Green Peas
2.3 Data Set 2: Economic Characteristics of Car Dealerships in Norway
3. Principal Component Analysis (PCA) – Introduction
3.1 Representing the Data as a Matrix
3.2 The Variable Space - Plotting Objects in p Dimensions
3.3 Plotting Objects in Variable Space
3.3.1 Exercise - Plotting Raw Data (People)
3.4 The First Principal Component
3.5 Extension to Higher-Order Principal Components
3.6 Principal Component Models - Scores and Loadings
3.6.1 Model Center
3.6.2 Loadings - Relations Between X and PCs
3.6.3 Scores - Coordinates in PC Space
3.6.4 Object Residuals
3.7 Objectives of PCA
3.8 Score Plot - “Map of Samples”
3.9 Loading Plot - “Map of Variables”
3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
3.11 PC-Models
3.11.1 The PC Model: X = TP^T + E = Structure + Noise
3.11.2 Residuals - The E-Matrix
3.11.3 How Many PCs to Use?
3.11.4 Variable Residuals
3.11.5 More about Variances - Modeling Error Variance
3.12 Exercise - Interpreting a PCA Model (Peas)
3.13 Exercise - PCA Modeling (Car Dealerships)
3.14 PCA Modeling – The NIPALS Algorithm
4. Principal Component Analysis (PCA) - In Practice
4.1 Scaling or Weighting
4.2 Outliers
4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
4.3 PCA Step by Step
4.3.1 The Unscrambler and PCA
4.4 Summary of PCA
4.4.1 Interpretation of PCA-Models
4.4.2 Interpretation of Score Plots – Look for Patterns
4.4.3 Summary - Interpretation of Score Plots
4.4.4 Summary - Interpretation of Loading Plots
4.5 PCA - What Can Go Wrong?
4.6 Exercise - Detecting Outliers (Troodos)
5. PCA Exercises – Real-World Application Examples
5.1 Exercise - Find Clusters (Iris Species Discrimination)
5.2 Exercise - PCA for Experimental Design (Lewis Acids)
5.3 Exercise - Mud Samples
5.4 Exercise - Scaling (Troodos)
6. Multivariate Calibration (PCR/PLS)
6.1 Multivariate Modeling (X,Y): The Calibration Stage
6.2 Multivariate Modeling (X,Y): The Prediction Stage
6.3 Calibration Set Requirements (Training Data Set)
6.4 Introduction to Validation
6.5 Number of Components (Model Dimensionality)
6.6 Univariate Regression (y|x) and MLR
6.6.1 Univariate Regression (y|x)
6.6.2 Multiple Linear Regression, MLR
6.7 Collinearity
6.8 PCR - Principal Component Regression
6.8.1 Exercise - Interpretation of Jam (PCR)
6.8.2 Weaknesses of PCR
6.9 PLS-Regression (PLS-R)
6.9.1 PLS - A Powerful Alternative to PCR
6.9.2 PLS (X,Y): Initial Comparison with PCA(X), PCA(Y)
6.9.3 PLS2 – NIPALS Algorithm
6.9.4 Interpretation of PLS Models
6.9.5 The PLS1 NIPALS Algorithm
6.9.6 Exercise - Interpretation of PLS1 (Jam)
6.9.7 Exercise - Interpretation of PLS2 (Jam)
6.10 When to Use Which Method?
6.10.1 Exercise - Compare PCR and PLS1 (Jam)
6.11 Summary
7. Validation: Mandatory Performance Testing
7.1 The Concept of Test Set Validation
7.1.1 Calculating the Calibration Variance (Modeling Error)
7.1.2 Calculating the Validation Variance (Prediction Error)
7.1.3 Studying the Calibration and Validation Variances
7.2 Requirements for the Test Set
7.3 Cross Validation
7.4 Leverage Corrected Validation
8. How to Perform PCR and PLS-R
8.1 PLS and PCR - Step by Step
8.2 Optimal Number of Components in Modeling
8.3 Information in Later PCs
8.4 Exercises on PLS and PCR: the Heart-of-the-Matter!
8.4.1 Exercise - PLS2 (Peas)
8.4.2 Exercise - PLS1 or PLS2? (Peas)
8.4.3 Exercise - Is PCR better than PLS? (Peas)
9. Multivariate Data Analysis – in Practice: Miscellaneous Issues
9.1 Data Constraints
9.1.1 Data Matrix Dimensions
9.1.2 Missing Data
9.2 Data Collection
9.2.1 Use Historical Data
9.2.2 Monitoring Data from an On-Going Process
9.2.3 Data Generated by Planned Experiments
9.2.4 Perform Experiments or Collect Data - Always by Careful Reflection
9.2.5 The Random Design – A Powerful Alternative
9.3 Selecting from Abundant Data
9.3.1 Selecting a Calibration Data Set from Abundant Training Data
9.3.2 Selecting a Validation Data Set
9.4 Error Sources
9.5 Replicates - A Means to Quantify Errors
9.6 Estimates of Experimental - and Measurement Errors
9.6.1 Error in Y (Reference Method): Reproducibility
9.6.2 Stability over Consecutive Measurements: Repeatability
9.7 Handling Replicates in Multivariate Modeling
9.8 Validation in Practice
9.8.1 Test Set
9.8.2 Cross Validation
9.8.3 Leverage Correction
9.8.4 The Multivariate Model – Validation Alternatives
9.9 How Good is the Model: RMSEP and Other Measures
9.9.1 Residuals
9.9.2 Residual Variances (Calibration, Prediction)
9.9.3 Correction for Degrees of Freedom
9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
9.9.5 RMSEP, SEP and Bias
9.9.6 Comparison Between Prediction Error and Measurement Error
9.9.7 Compare RMSEP for Different Models
9.9.8 Compare Results with Other Methods
9.9.9 Other Measures of Errors
9.10 Prediction of New Data
9.10.1 Getting Reliable Prediction Results
9.10.2 How Does Prediction Work?
9.10.3 Prediction Used as Validation
9.10.4 Uncertainty at Prediction
9.10.5 Study Prediction Objects and Training Objects in the Same Plot
9.11 Coding Category Variables: PLS-DISCRIM
9.12 Scaling or Weighting Variables
9.13 Using the B- and the Bw-Coefficients
9.14 Calibration of Spectroscopic Data
9.14.1 Spectroscopic Data: Calibration Options
9.14.2 Interpretation of Spectroscopic Calibration Models
9.14.3 Choosing Wavelengths
10. PLS (PCR) Exercises: Real-World Application Examples - I
10.1 Exercise - Prediction of Gasoline Octane Number
10.2 Exercise - Water Quality
10.3 Exercise - Freezing Point of Jet Fuel
10.4 Exercise - Paper
11. PLS (PCR) Multivariate Calibration – In Practice
11.1 Outliers and Subgroups
11.1.1 Scores
11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
11.1.3 Residuals
11.1.4 Dangerous Outliers or Interesting Extremes?
11.2 Systematic Errors
11.2.1 Y-Residuals Plotted Against Objects
11.2.2 Residuals Plotted Against Predicted Values
11.2.3 Normal Probability Plot of Residuals
11.3 Transformations
11.3.1 Logarithmic Transformations
11.3.2 Spectroscopic Transformations
11.3.3 Multiplicative Scatter Correction
11.3.4 Differentiation
11.3.5 Averaging
11.3.6 Normalization
11.4 Non-Linearities
11.4.1 How to Handle Non-Linearities?
11.4.2 Deleting Variables
11.5 Procedure for Refining Models
11.6 Precise Measurements vs. Noisy Measurements
11.7 How to Interpret the Residual Variance Plot
11.8 Summary: The Unscrambler Plots Revealing Problems
12. PLS (PCR) Exercises: Real-World Applications - II
12.1 Exercise - Log-Transformation (Dioxin)
12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
12.3 Exercise – “Dirty Data” (Geologic Data with Severe Uncertainties)
12.4 Exercise - Spectroscopy Calibration (Wheat)
12.5 Exercise - QSAR (Cytotoxicity)
13. Master Data Sets: Interim Examination
13.1 Sgarabotto Master Violin Data Set
13.2 Norwegian Car Dealerships - Revisited
13.3 Vintages
13.4 Acoustic Chemometrics (a. c.)
14. Uncertainty Estimates, Significance and Stability (Martens’ Uncertainty Test)
14.1 Uncertainty Estimates in Regression Coefficients, b
14.2 Rotation of Perturbed Models
14.3 Variable Selection
14.4 Model Stability
14.4.1 Introduction
14.4.2 An Example Using the Paper Data
14.5 Exercise - Paper - Uncertainty Test and Model Stability
15. SIMCA: An Introduction to Classification
15.1 SIMCA - Fields of Use
15.2 How to Make SIMCA Class-Models?
15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
15.3 How Do We Classify New Samples?
15.4 Classification Results
15.4.1 Statistical Significance Level and its Use: An Introduction
15.5 Graphical Interpretation of Classification Results
15.5.1 The Coomans Plot
15.5.2 The Si vs. Hi Plot (Distance vs. Leverage)
15.5.3 Si/S0 vs. Hi
15.5.4 Model Distance
15.5.5 Variable Discrimination Power
15.5.6 Modeling Power
15.6 SIMCA-Exercise – IRIS Classification
16. Introduction to Experimental Design
16.1 Experimental Design
16.2 Screening Designs
16.2.1 Full Factorial Designs
16.2.2 Fractional Factorial Designs
16.2.3 Plackett-Burman Designs
16.3 Analyzing a Screening Design
16.3.1 Significant Effects
16.3.2 Using F-Test and P-Values to Determine Significant Effects
16.3.3 Exercise - Willgerodt-Kindler Reaction
16.4 Optimization Designs
16.4.1 Central Composite Designs
16.4.2 Box-Behnken Designs
16.5 Analyzing an Optimization Design
16.5.1 Exercise - Optimization of Enamine Synthesis
16.6 Practical Aspects of Making an Experimental Design
16.7 Extending a Design
16.8 Validation of Designed Data Sets
16.9 Problems in Designed Data Sets
16.9.1 Detect and Interpret Effects
16.9.2 How to Separate Confounded Effects?
16.9.3 Blocking and Repeated Response Measurements
16.9.4 Fold-Over Designs
16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?
16.9.6 A “Random Design”
16.9.7 Modeling Uncoded Data
16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)
16.11 Experimental Design Procedure in The Unscrambler
17. Complex Experimental Design Problems
17.1 Introduction to Complex Experimental Design Problems
17.1.1 Constraints Between the Levels of Several Design Variables
17.1.2 A Special Case: Mixture Situations
17.1.3 Alternative Solutions
17.2 The Mixture Situation
17.2.1 An Example of Mixture Design
17.2.2 Screening Designs for Mixtures
17.2.3 Optimization Designs for Mixtures
17.2.4 Designs that Cover a Mixture Region Evenly
17.3 How To Deal With Constraints
17.3.1 Introduction to the D-Optimal Principle
17.3.2 Non-Mixture D-Optimal Designs
17.3.3 Mixture D-Optimal Designs
17.3.4 Advanced Topics
17.4 How To Analyze Results From Constrained Experiments
17.4.1 Use of PLS Regression For Constrained Designs
17.4.2 Relevant Regression Models
17.4.3 The Mixture Response Surface Plot
17.5 Exercise - Build a Mixture Design - Wines
18. Comparison of Methods for Multivariate Data Analysis - And their Validation
18.1 Comparison of Selected Multivariate Methods
18.1.1 Principal Component Analysis (PCA)
18.1.2 Factor Analysis (FA)
18.1.3 Cluster Analysis (CA)
18.1.4 Linear Discriminant Analysis (LDA)
18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
18.1.6 Multiple Linear Regression (MLR)
18.1.7 Principal Component Regression (PCR)
18.1.8 Partial Least Squares Regression (PLS-R)
18.1.9 Increasing Projection Dimensionality in Regression Modeling
18.2 Choosing Multivariate Methods Is Not Optional!
18.2.1 Problem Formulation
18.3 Unsupervised Methods
18.4 Supervised Methods
18.5 A Final Discussion about Validation
18.5.1 Test Set Validation
18.5.2 Cross Validation
18.5.3 Leverage Corrected Validation
18.5.4 Selecting a Validation Approach in Practice
18.6 Summary of Basic Rules for Success
18.7 From Here – You Are on Your Own. Good Luck!
19. Literature
20. Appendix: Algorithms
20.1 PCA
20.2 PCR
20.3 PLS1
20.4 PLS2
21. Appendix: Software Installation and User Interface
21.1 Welcome to The Unscrambler
21.2 How to Install and Configure The Unscrambler
21.3 Problems You Can Solve with The Unscrambler
21.4 The Unscrambler Workplace
21.4.2 The Editor
21.4.3 The Viewer
21.4.4 Dockable Views
21.4.5 Dialogs
21.4.6 The Help System
21.4.7 Tooltips
21.5 Using The Unscrambler Efficiently
21.5.1 Analyses
21.5.2 Some Tips to Make Your Work Easier
Glossary of Terms
Index