Multivariate Data Analysis – In Practice
5th Edition
An Introduction to Multivariate Data Analysis and Experimental Design

Kim H. Esbensen
Ålborg University, Esbjerg
with contributions from Dominique Guyot, Frank Westad and Lars P. Houmøller

www.camo.com

CAMO Software AS, Nedre Vollgate 8, N-0158 Oslo, NORWAY. Tel: (47) 223 963 00, Fax: (47) 223 963 22
CAMO Software Inc., One Woodbridge Center, Suite 319, Woodbridge, NJ 07095, USA. Tel: (732) 726 9200, Fax: (973) 556 1229
CAMO Software India Pvt. Ltd., 14 & 15, Krishna Reddy Colony, Domlur Layout, Bangalore - 560 071, INDIA. Tel: (91) 80 4125 4242, Fax: (91) 80 4125 4181

This book was produced using Doc-To-Help together with Microsoft Word. Visio and Excel were used to make some of the illustrations. The screen captures were taken with Paint Shop Pro.

Trademark Acknowledgments
Doc-To-Help is a trademark of WexTech Systems, Inc. Microsoft is a registered trademark and Windows 95, Windows NT, Excel and Word are trademarks of the Microsoft Corporation. Paint Shop Pro is a trademark of JASC, Inc. Visio is a trademark of the Shapeware Corporation.

Information in this book is subject to change without notice. No part of this document may be reproduced or transmitted in any form or by any means, electronic or mechanical, for any purpose, without the express written permission of CAMO Process AS.

ISBN 82-993330-3-2
1994 – 2002 CAMO Process AS. All rights reserved.
5th edition. Re-print December 2004

Preface
October 2001

Learning to do multivariate data analysis is in many ways like learning to drive a car: you are not let loose on the road without mandatory training, both theoretical and practical, as current concern for traffic safety requires. As a minimum you need to know how a car functions and you need to know the traffic code. On the other hand, everybody would agree that it is only after you have obtained your driver's license that the real practical learning begins.
This is when your personal experience really starts to accumulate. There is a strong interaction between the theory absorbed and the practice gained in this secondary, personal training period.

Please substitute “multivariate data analysis” for “driving a car” in all of the above. In this context too, you are not let out on the data analytical road without mandatory training, theoretical and practical. The analogy is actually very apt!

This book presents a basic theoretical foundation for bilinear (projection-based) multivariate data modeling and gives a conceptual framework for starting to do your own data modeling on the data sets provided. There are some 25 data sets included in this training package. By doing all the exercises included, you’re off to a flying start!

Driving your newly acquired multivariate data analysis car is very much an evolutionary process: this introductory textbook is filled with illustrative examples, many practical exercises and a full set of self-examination real-world data analysis problems (with corresponding data sets). If, after all of this, you are able to work confidently on your own applications, you’ll have reached the goal set for this book.

This is the 5th revised edition of this book. The first three editions were mainly reprints, the only major change being the inclusion of a completely revised chapter on “Introduction to experimental design”, which first appeared in the 3rd edition (CAMO). The 4th revised edition (published March 2000), however, saw many major extensions and improvements:
• Text completely rewritten by the senior author, based on five years of extensive use in teaching at both university and dedicated course levels. More than 5,500 copies in use.
• 30% new theory and text material added, reflecting extensive student response; full integration of PCA, PLS1 and PLS2 NIPALS algorithms and explanations.
• Text revised with an augmented self-learning objective throughout.
• Four new master data sets added (with extended self-exercise potential):
1. Master violin data (PCA/PLS)
2. Norwegian car dealerships (PCA/PLS)
3. Vintages (PCA/PLS)
4. Acoustic chemometric calibration (PCR/PLS)
• Additional chapter on experimental design: new features include mixture designs and D-optimal designs.
• New chapter on the powerful, novel “Martens’ Uncertainty Test”.
• Comprehensive glossary of terms.

This 5th edition also includes essential additional revisions and improvements:
• Lars P. Houmøller, Ålborg University Esbjerg, has carried out a complete work-through of all demonstrations and exercises. Many of these had not been updated with respect to several of the intervening UNSCRAMBLER software versions. We are happy to have finally eliminated this most frustrating nuisance.

About the authors

Kim H. Esbensen, Ph.D., has more than 20 years of experience in multivariate data analysis and applied chemometrics. He was professor in chemometrics at the Norwegian Telemark Institute of Technology (HIT/TF), Institute of Process Technology (PT), 1995-2001, where he was also head of the Chemometrics Department, Tel-Tek, Telemark Industrial R&D Center, Porsgrunn. Between these institutions he founded ACRG: the Applied Chemometrics Research Group, HIT/TF/Tel-Tek, which among other things hosted SSC6, the 6th Scandinavian Symposium on Chemometrics, in August 1999, as well as numerous other international courses, workshops and meetings. On July 1st, 2001 he moved to a position as research professor in Applied Chemometrics at Ålborg University, Esbjerg, Denmark (AUE), where he is currently leading ACACSRG: the Applied Chemometrics, Analytical Chemistry and Sampling Research Group. As the name implies, applied chemometrics activities continue in Esbjerg while new activities are added, most notably through close collaboration with assoc. prof. Lars P. Houmøller, who independently built up the area of analytical chemistry/chemometrics at AUE before Prof. Esbensen’s arrival. Most recently the discipline of sampling (proper sampling) has been added, in recognition of the immense importance of sampling in any data analytical discipline, including chemometrics.

Kim H. Esbensen has published more than 60 papers and technical reports on a wide range of chemical, geochemical, industrial, technological, remote sensing, image analytic and acoustic chemometric applications. Together with Paul Geladi he has been instrumental in co-developing the concept of Multivariate Image Analysis (MIA); with ACRG he pioneered the development of the novel area of acoustic chemometrics.

He received his M.Sc. (geology, geochemistry) from the University of Aarhus, Denmark, in 1978, and was conferred a Ph.D. by the Technical University of Denmark (DTH) in 1981 within the areas of metallurgy, meteoritics and multivariate data analysis. He then did post-doctoral work for two years with the Research Group for Chemometrics at the University of Umeå 1980-1981, after which he worked in a Swedish geochemical exploration company, Terra Swede, for two more years. After he moved to Norway, this was followed by eight years as a data analytical research scientist at the Norwegian Computing Center (NCC), Oslo, after which he became a senior research scientist at SINTEF, the Norwegian Foundation for Industrial and Technological Research, for four additional years. In between these two assignments he was a visiting guest professor at Norsk Hydro’s Research Center in Bergen, Norway. He also holds a position as Chercheur associé (now Chercheur affilié) du Centre de Recherche en Géomatique, Université Laval, Quebec. He is a member of the editorial board of Journal of Chemometrics, Wiley Publishers, and is a member of ICS, AGU and several other geological, data analytical and statistical associations.
Dominique Guyot, educated in Statistics, Economics and Biomathematics (ENSAE and Université de Paris 7, France), has 15 years of experience in the field of chemometrics. She gained industrial experience from her work in the pharmaceutical and cosmetic industries, before working at CAMO from 1995 until 2000. With CAMO, Dominique worked as a Senior Consultant, and was particularly involved in food applications. She put together a practical strategy for efficient product development, based on experimental design and multivariate data analysis. This strategy was implemented in the Guideline®+ software package, complemented by an integrated training course focusing on multivariate methods for food product developers. Dominique is now studying music and singing at the Conservatoire of Trondheim, Norway.

Frank Westad has an M.Sc. in physical chemistry from the University of Trondheim, Norway. He has 13 years of experience in applied multivariate data analysis, and he completed a Ph.D. in multivariate regression in 2000. Frank has given numerous courses in experimental design and multivariate analysis for companies in Europe and in the U.S.A. His main research fields include variable selection, shift modelling and image analysis.

Lars P. Houmøller has an M.Sc. in chemistry and physics from the University of Aarhus, Denmark. He has 12 years of experience in analytical chemistry and has worked 5-7 years with chemometrics. His teaching experience includes chemometrics, analytical chemistry, spectroscopy, physical chemistry, general and technical chemistry, organic and inorganic chemistry, unit operations and fluid dynamics. His research field covers NIR spectroscopic applications over a very broad industrial spectrum. He also has experience from working in the Danish food production industry.

E-mail interaction with the authors:
Kim Esbensen: kes@aue.auc.dk
Dominique Guyot: dominique.guyot@camo.no
Frank Westad: fwestad@online.no
Lars P. Houmøller: lph@aue.auc.dk

About this book

Since 1986, when CAMO ASA first commercialized and started marketing THE UNSCRAMBLER, many customers have asked for basic, easy-to-understand literature on chemometrics. In 1993 a group of data analysts at different competence levels was invited to a one-day seminar at CAMO, Trondheim, to discuss their experience from both learning and teaching chemometrics. The result was a blueprint outline for what came to be this introductory book: the specifications called for a comprehensive training package, involving basic, practical, easy-to-read, largely non-mathematical theory, with plenty of hands-on examples and exercises on real-world data sets.

CAMO contracted SINTEF to write this book (first three editions), and the parties agreed to cooperate on the completion of the full training package. In the intervening years, this book was published in some 4,500 copies and was used for introductory basic training in some 15 universities and in several hundred industrial companies; reactions were many and largely constructive. We learned a lot from these criticisms; we thank all who contributed!

By 1999, the time was ripe for a complete revision of the entire package. This was undertaken by the senior author in the summer of 1999 with significant assistance from his then Ph.D. student Jun Huang (now with CAMO, Norway); Frank Westad (Matforsk), who wrote chapter 14 (Martens’ Uncertainty Test); Dominique Guyot (CAMO), who wrote the original, entirely new chapter 17 (Complex Experimental Design Problems); and with further invaluable editorial and managerial contributions from Michael Byström (CAMO) and Valérie Lengard (CAMO). A most sincere thank you goes to Peter Hindmarch (CAMO, UK) for very effective linguistic streamlining of the 4th edition!

The authors and CAMO also take this opportunity to acknowledge Suzanne Schönkopf’s (CAMO) contribution to editions previous to the 4th one.
The present edition of this book still bears the fruit of her very important past efforts.

The publication of the 4th edition, in March 2000, was unfortunately somewhat marred by a less than complete revision of the exercises and illustrative UNSCRAMBLER runs in the book, which was not considered fatal at the time. This soon proved to be a serious mistake; disappointment and frustration from several generations of students, who wanted to follow all the exercises closely, followed rapidly. A Danish university teacher, who had himself experienced this frustration close up when using the book for his own teaching, assoc. prof. Lars P. Houmøller at the University of Ålborg, Esbjerg, voluntarily took it upon himself to carry out a complete work-through of this essential didactic aspect of the book. His very valuable demo and exercise revisions, as well as a very thorough text consistency check, have now been included in toto in the 5th edition.

Today, this book is a collaborative effort between the senior author and CAMO Process AS; the tie with SINTEF is now defunct.

There is little academic glamour in writing an introductory level textbook, as the senior author has well experienced, and glamour was never the goal anyway. But on the other hand, the introductory level is definitely where the largest audience and potential market exist, as CAMO has well experienced. The senior author has used the book for six consecutive years teaching introductory chemometrics largely to engineering (M.Sc.) students, as well as for extensive course work in industrial and foreign university environments. The response from an accumulated 500 or so students has made this author happy, while some 5,500 sales have made CAMO equally satisfied. Thus all is well with the training package! We hope that this revised 5th edition will continue to meet the challenging demands of the market, hopefully now in an improved form. Writing for precisely this introductory audience/market constitutes the highest scientific and didactic challenge, and is thus (still) irresistible!

Acknowledgements

The authors wish to thank the following persons, institutions and companies for their very valuable help in the preparation of this training package:
Hans Blom, Østlandskonsult AS, Fredrikstad, Norway
Frode Brakstad, Norsk Hydro F-Center, Porsgrunn, Norway
Rolf Carlson, Department of Chemistry, University of Tromsø, Norway
Chevron Research & Technology Co, Richmond, CA, USA
Lennart Eriksson, Dept. of Organic Chemistry, University of Umeå, Sweden (now with Umetrics, Inc.)
Professor Magni Martens, The Royal Veterinary & Agricultural University, Denmark
Geological Survey of Greenland, Denmark
IKU, Institute for Petroleum Research, Trondheim, Norway
Norwegian Food Research Institute (MATFORSK), Ås, Norway
Norwegian Society of Process Control
Norwegian Chemometrics Society
International Chemometrics Society
UOP Guided Wave, CA, USA
Pierre Gy, Cannes, France (for a gentleman’s introduction to the finest French wines)
Zander & Ingerström, Oslo, Norway
Tomas Öberg Konsult AB, Karlskoga, Sweden
KAPITAL (weekly Norwegian economic magazine), no 14/1994, p50-55
Hlif Sigurjonsdottir, Reykjavik, Iceland (owner of G. Sgarabotto “violin no 9”)
Birgitta Spur, LSO, Reykjavik, Iceland (permission to use the Sgarabotto oeuvre data)
Sensorteknikk A/S, Bærum, Oslo (Bjørn Hope: sensor technology entrepreneur extraordinaire; Evy: for innumerable occasions: warm company, coffee and waffles, waffles, waffles)
Thorbjørn T. Lied, Maths Halstensen, Tore Gravermoen, Rune Mathisen and others (for enormous help in developing acoustic chemometrics)
“Anonymous wine importer”, Odense, Denmark
Helpful wine assessors (partly anonymous), Manson, WA, USA
Finally the author(s) and CAMO wish to thank all THE UNSCRAMBLER users during the last seven years for their close relationships with us, which have given us so much added experience in teaching multivariate data analysis. And thanks for all the constructive criticism of the earlier editions of this book.

Last, but certainly not least, a warm thank you to all the students at HIT/TF, at Ålborg University, Esbjerg and many, many others, who have been associated with the teachings of the authors, nearly all of whom have been very constructive in their ongoing criticism of the entire teaching system embedded in this training package. We even learned from the occasional not-so-friendly criticisms…

Communication

After a formative period of seven years, the training package has come of age. By now we are actually beginning to be rather satisfied with it! And yet: the author(s) and CAMO always welcome all critical responses to the present text. They are seriously needed for this work to keep improving.

Contents

1. Introduction to Multivariate Data Analysis - Overview
1.1 Indirect Observations and Correlation
1.2 Hidden Data Structures
1.3 Multivariate Data Analysis vs. Multivariate Statistics
1.4 Main Objectives of Multivariate Data Analytical Techniques
1.5 Multivariate Techniques as Projections

2. Getting Started - with Descriptive Statistics
2.1 Purpose
2.2 Data Set 1: Quality of Green Peas
2.3 Data Set 2: Economic Characteristics of Car Dealerships in Norway

3. Principal Component Analysis (PCA) – Introduction
3.1 Representing the Data as a Matrix
3.2 The Variable Space - Plotting Objects in p Dimensions
3.3 Plotting Objects in Variable Space
3.3.1 Exercise - Plotting Raw Data (People)
3.4 The First Principal Component
3.5 Extension to Higher-Order Principal Components
3.6 Principal Component Models - Scores and Loadings
3.6.1 Model Center
3.6.2 Loadings - Relations Between X and PCs
3.6.3 Scores - Coordinates in PC Space
3.6.4 Object Residuals
3.7 Objectives of PCA
3.8 Score Plot - “Map of Samples”
3.9 Loading Plot - “Map of Variables”
3.10 Exercise: Plotting and Interpreting a PCA-Model (People)
3.11 PC-Models
3.11.1 The PC Model: X = TPᵀ + E = Structure + Noise
3.11.2 Residuals - The E-Matrix
3.11.3 How Many PCs to Use?
3.11.4 Variable Residuals
3.11.5 More about Variances - Modeling Error Variance
3.12 Exercise - Interpreting a PCA Model (Peas)
3.13 Exercise - PCA Modeling (Car Dealerships)
3.14 PCA Modeling – The NIPALS Algorithm

4. Principal Component Analysis (PCA) - In Practice
4.1 Scaling or Weighting
4.2 Outliers
4.2.1 Scaling, Transformation and Normalization are Highly Problem Dependent Issues
4.3 PCA Step by Step
4.3.1 The Unscrambler and PCA
4.4 Summary of PCA
4.4.1 Interpretation of PCA-Models
4.4.2 Interpretation of Score Plots – Look for Patterns
4.4.3 Summary - Interpretation of Score Plots
4.4.4 Summary - Interpretation of Loading Plots
4.5 PCA - What Can Go Wrong?
4.6 Exercise - Detecting Outliers (Troodos)

5. PCA Exercises – Real-World Application Examples
5.1 Exercise - Find Clusters (Iris Species Discrimination)
5.2 Exercise - PCA for Experimental Design (Lewis Acids)
5.3 Exercise - Mud Samples
5.4 Exercise - Scaling (Troodos)

6. Multivariate Calibration (PCR/PLS)
6.1 Multivariate Modeling (X,Y): The Calibration Stage
6.2 Multivariate Modeling (X,Y): The Prediction Stage
6.3 Calibration Set Requirements (Training Data Set)
6.4 Introduction to Validation
6.5 Number of Components (Model Dimensionality)
6.6 Univariate Regression (y|x) and MLR
6.6.1 Univariate Regression (y|x)
6.6.2 Multiple Linear Regression, MLR
6.7 Collinearity
6.8 PCR - Principal Component Regression
6.8.1 Exercise - Interpretation of Jam (PCR)
6.8.2 Weaknesses of PCR
6.9 PLS-Regression (PLS-R)
6.9.1 PLS - A Powerful Alternative to PCR
6.9.2 PLS (X,Y): Initial Comparison with PCA(X), PCA(Y)
6.9.3 PLS2 – NIPALS Algorithm
6.9.4 Interpretation of PLS Models
6.9.5 The PLS1 NIPALS Algorithm
6.9.6 Exercise - Interpretation of PLS1 (Jam)
6.9.7 Exercise - Interpretation PLS2 (Jam)
6.10 When to Use which Method?
6.10.1 Exercise - Compare PCR and PLS1 (Jam)
6.11 Summary

7. Validation: Mandatory Performance Testing
7.1 The Concept of Test Set Validation
7.1.1 Calculating the Calibration Variance (Modeling Error)
7.1.2 Calculating the Validation Variance (Prediction Error)
7.1.3 Studying the Calibration and Validation Variances
7.2 Requirements for the Test Set
7.3 Cross Validation
7.4 Leverage Corrected Validation

8. How to Perform PCR and PLS-R
8.1 PLS and PCR - Step by Step
8.2 Optimal Number of Components in Modeling
8.3 Information in Later PCs
8.4 Exercises on PLS and PCR: the Heart-of-the-Matter!
8.4.1 Exercise - PLS2 (Peas)
8.4.2 Exercise - PLS1 or PLS2? (Peas)
8.4.3 Exercise - Is PCR better than PLS? (Peas)

9. Multivariate Data Analysis – in Practice: Miscellaneous Issues
9.1 Data Constraints
9.1.1 Data Matrix Dimensions
9.1.2 Missing Data
9.2 Data Collection
9.2.1 Use Historical Data
9.2.2 Monitoring Data from an On-Going Process
9.2.3 Data Generated by Planned Experiments
9.2.4 Perform Experiments or Collect Data - Always by Careful Reflection
9.2.5 The Random Design – A Powerful Alternative
9.3 Selecting from Abundant Data
9.3.1 Selecting a Calibration Data Set from Abundant Training Data
9.3.2 Selecting a Validation Data Set
9.4 Error Sources
9.5 Replicates - A Means to Quantify Errors
9.6 Estimates of Experimental - and Measurement Errors
9.6.1 Error in Y (Reference Method): Reproducibility
9.6.2 Stability over Consecutive Measurements: Repeatability
9.7 Handling Replicates in Multivariate Modeling
9.8 Validation in Practice
9.8.1 Test Set
9.8.2 Cross Validation
9.8.3 Leverage Correction
9.8.4 The Multivariate Model – Validation Alternatives
9.9 How Good is the Model: RMSEP and Other Measures
9.9.1 Residuals
9.9.2 Residual Variances (Calibration, Prediction)
9.9.3 Correction for Degrees of Freedom
9.9.4 RMSEP and RMSEC - Average, Representative Errors in Original Units
9.9.5 RMSEP, SEP and Bias
9.9.6 Comparison Between Prediction Error and Measurement Error
9.9.7 Compare RMSEP for Different Models
9.9.8 Compare Results with Other Methods
9.9.9 Other Measures of Errors
9.10 Prediction of New Data
9.10.1 Getting Reliable Prediction Results
9.10.2 How Does Prediction Work?
9.10.3 Prediction Used as Validation
9.10.4 Uncertainty at Prediction
9.10.5 Study Prediction Objects and Training Objects in the Same Plot
9.11 Coding Category Variables: PLS-DISCRIM
9.12 Scaling or Weighting Variables
9.13 Using the B- and the Bw-Coefficients
9.14 Calibration of Spectroscopic Data
9.14.1 Spectroscopic Data: Calibration Options
9.14.2 Interpretation of Spectroscopic Calibration Models
9.14.3 Choosing Wavelengths

10. PLS (PCR) Exercises: Real-World Application Examples - I
10.1 Exercise - Prediction of Gasoline Octane Number
10.2 Exercise - Water Quality
10.3 Exercise - Freezing Point of Jet Fuel
10.4 Exercise - Paper

11. PLS (PCR) Multivariate Calibration – In Practice
11.1 Outliers and Subgroups
11.1.1 Scores
11.1.2 X-Y Relation Outlier Plots (T vs. U Scores)
11.1.3 Residuals
11.1.4 Dangerous Outliers or Interesting Extremes?
11.2 Systematic Errors
11.2.1 Y-Residuals Plotted Against Objects
11.2.2 Residuals Plotted Against Predicted Values
11.2.3 Normal Probability Plot of Residuals
11.3 Transformations
11.3.1 Logarithmic Transformations
11.3.2 Spectroscopic Transformations
11.3.3 Multiplicative Scatter Correction
11.3.4 Differentiation
11.3.5 Averaging
11.3.6 Normalization
11.4 Non-Linearities
11.4.1 How to Handle Non-Linearities?
11.4.2 Deleting Variables
11.5 Procedure for Refining Models
11.6 Precise Measurements vs. Noisy Measurements
11.7 How to Interpret the Residual Variance Plot
11.8 Summary: The Unscrambler Plots Revealing Problems

12. PLS (PCR) Exercises: Real-World Applications - II
12.1 Exercise - Log-Transformation (Dioxin)
12.2 Exercise - Multiplicative Scatter Correction (Alcohol)
12.3 Exercise – “Dirty Data” (Geologic Data with Severe Uncertainties)
12.4 Exercise - Spectroscopy Calibration (Wheat)
12.5 Exercise - QSAR (Cytotoxicity)

13. Master Data Sets: Interim Examination
13.1 Sgarabotto Master Violin Data Set
13.2 Norwegian Car Dealerships - Revisited
13.3 Vintages
13.4 Acoustic Chemometrics (a. c.)

14. Uncertainty Estimates, Significance and Stability (Martens’ Uncertainty Test)
14.1 Uncertainty Estimates in Regression Coefficients, b
14.2 Rotation of Perturbed Models
14.3 Variable Selection
14.4 Model Stability
14.4.1 Introduction
14.4.2 An Example Using the Paper Data
14.5 Exercise - Paper - Uncertainty Test and Model Stability

15. SIMCA: An Introduction to Classification
15.1 SIMCA - Fields of Use
15.2 How to Make SIMCA Class-Models?
15.2.1 Basic SIMCA Steps: A Standard Flow-Sheet
15.3 How Do we Classify new Samples?
15.4 Classification Results
15.4.1 Statistical Significance Level and its Use: An Introduction
15.5 Graphical Interpretation of Classification Results
15.5.1 The Coomans Plot
15.5.2 The Si vs. Hi Plot (Distance vs. Leverage)
15.5.3 Si/S0 vs. Hi
15.5.4 Model Distance
15.5.5 Variable Discrimination Power
15.5.6 Modeling Power
15.6 SIMCA-Exercise – IRIS Classification

16. Introduction to Experimental Design
16.1 Experimental Design
16.2 Screening Designs
16.2.1 Full Factorial Designs
16.2.2 Fractional Factorial Designs
16.2.3 Plackett-Burman Designs
16.3 Analyzing a Screening Design
16.3.1 Significant Effects
16.3.2 Using F-Test and P-Values to Determine Significant Effects
16.3.3 Exercise - Willgerodt-Kindler Reaction
16.4 Optimization Designs
16.4.1 Central Composite Designs
16.4.2 Box-Behnken Designs
16.5 Analyzing an Optimization Design
16.5.1 Exercise - Optimization of Enamine Synthesis
16.6 Practical Aspects of Making an Experimental Design
16.7 Extending a Design
16.8 Validation of Designed Data Sets
16.9 Problems in Designed Data Sets
16.9.1 Detect and Interpret Effects
16.9.2 How to Separate Confounded Effects?
16.9.3 Blocking and Repeated Response Measurements
16.9.4 Fold-Over Designs
16.9.5 What Do We Do if We Cannot Keep to the Planned Variable Settings?
16.9.6 A “Random Design”
16.9.7 Modeling Uncoded Data
16.10 Exercise - Designed Data with Non-Stipulated Values (Lacotid)
16.11 Experimental Design Procedure in The Unscrambler

17. Complex Experimental Design Problems
17.1 Introduction to Complex Experimental Design Problems
17.1.1 Constraints Between the Levels of Several Design Variables
17.1.2 A Special Case: Mixture Situations
17.1.3 Alternative Solutions
17.2 The Mixture Situation
17.2.1 An Example of Mixture Design
17.2.2 Screening Designs for Mixtures
17.2.3 Optimization Designs for Mixtures
17.2.4 Designs that Cover a Mixture Region Evenly
17.3 How To Deal With Constraints
17.3.1 Introduction to the D-Optimal Principle
17.3.2 Non-Mixture D-Optimal Designs
17.3.3 Mixture D-Optimal Designs
17.3.4 Advanced Topics
17.4 How To Analyze Results From Constrained Experiments
17.4.1 Use of PLS Regression For Constrained Designs
17.4.2 Relevant Regression Models
17.4.3 The Mixture Response Surface Plot
17.5 Exercise - Build a Mixture Design - Wines

18. Comparison of Methods for Multivariate Data Analysis - And their Validation
18.1 Comparison of Selected Multivariate Methods
18.1.1 Principal Component Analysis (PCA)
18.1.2 Factor Analysis (FA)
18.1.3 Cluster Analysis (CA)
18.1.4 Linear Discriminant Analysis (LDA)
18.1.5 Comparison: Projection Dimensionality in Multivariate Data Analysis
18.1.6 Multiple Linear Regression (MLR)
18.1.7 Principal Component Regression (PCR)
18.1.8 Partial Least Squares Regression (PLS-R)
18.1.9 Increasing Projection Dimensionality in Regression Modeling
18.2 Choosing Multivariate Methods Is Not Optional!
18.2.1 Problem Formulation
18.3 Unsupervised Methods
18.4 Supervised Methods
18.5 A Final Discussion about Validation
18.5.1 Test Set Validation
18.5.2 Cross Validation
18.5.3 Leverage Corrected Validation
18.5.4 Selecting a Validation Approach in Practice
18.6 Summary of Basic Rules for Success
18.7 From Here – You Are on Your Own. Good Luck!

19. Literature

20. Appendix: Algorithms
20.1 PCA
20.2 PCR
20.3 PLS1
20.4 PLS2

21. Appendix: Software Installation and User Interface
21.1 Welcome to The Unscrambler
21.2 How to Install and Configure The Unscrambler
21.3 Problems You Can Solve with The Unscrambler
21.4 The Unscrambler Workplace
21.4.2 The Editor
21.4.3 The Viewer
21.4.4 Dockable Views
21.4.5 Dialogs
21.4.6 The Help System
21.4.7 Tooltips
21.5 Using The Unscrambler Efficiently
21.5.1 Analyses
21.5.2 Some Tips to Make Your Work Easier

Glossary of Terms

Index