NC STATE UNIVERSITY
Program for North American Mobility in Higher Education
Introducing Process Integration for Environmental Control in Engineering Curricula
MODULE 17: “Introduction to Multivariate Analysis”
Created at: Ecole Polytechnique de Montreal & North Carolina State University, 2003.
NAMP Module 17: “Introduction to Multivariate Analysis” Tier 1, Part 1, Rev.: 0

Purpose of Module 17
What is the purpose of this module?
This module provides a basic introduction to multivariate analysis (“MVA”) as it is applied to chemical engineering. After completing this module, the student should have sufficient understanding to apply this statistical method to real data.
The target audience for this module is:
• Upper-year engineering students, and
• Practising engineers, particularly those in an industrial setting.

Prerequisites for Module 17
What are the prerequisites for this module?
Before starting this module, the student must have first completed Module 8, “Introduction to Process Integration”. Module 8 covers basic concepts not repeated here, notably those related to data quality. Applying MVA to real data without an understanding of data quality is a recipe for disaster. The software will generate results, but they could be totally meaningless and misleading.
It is further assumed that students already have an introductory-level background in statistics, such as would normally be part of any undergraduate engineering curriculum.

Structure of Module 17
What is the structure of this module?
Module 17 is divided into 3 “tiers”, each with a specific goal:
• Tier 1: Basic introduction
• Tier 2: Worked example
• Tier 3: Open-ended problem
These tiers are intended to be completed in order. Students are quizzed at various points, to measure their degree of understanding, before proceeding.
Each tier contains a statement of intent at the beginning, and a quiz at the end.

TIER 1: Basic Introduction

Tier 1: Statement of Intent
The goal of Tier 1 is to familiarise the student with the basic concepts of multivariate analysis (MVA). At the end of Tier 1, the student should be able to answer the following questions:
• What is the difference between univariate and multivariate statistics?
• Why is MVA used in a process integration context?
• How does MVA fit into the bigger picture?
• What are the specific types of MVA analysis?
Tier 1 also includes some selected readings, to help the student acquire a deeper understanding of this subject. It is impossible to “spoon-feed” someone about a technique as complex as MVA. The student must begin to delve into this topic independently right from the start.

Tier 1: Contents
Tier 1 is broken down into two sections:
1.1 What is MVA used for?
1.2 How does MVA work?
At the end of Tier 1 there is a short multiple-answer quiz.

1.1: What is MVA used for?

Process Integration Challenge: Make sense of masses of data
Drowning in data!
Many organisations today are faced with the same challenge: TOO MUCH DATA.
These include:
– Business - customer transactions
– Communications - website use
– Government - intelligence
– Science - astronomical data
– Pharmaceuticals - molecular configurations
– Industry - process data
It is the last item that is of interest to us as chemical engineers…

Too Much Process Data…
A typical industrial plant has hundreds of control loops, and thousands of measured variables, many of which are updated every few seconds. This situation generates tens of millions of new data points each day, and billions of data points each year.
Obviously, this is far too much for a human brain to absorb. Because of the way we visualise things, we are basically limited to looking at only one or two variables at a time.
[Figure: example two-dimensional plot]

Data-Rich but Knowledge-Poor
As a result, we have become “data-rich but knowledge-poor”. The biggest problem is that interesting, useful patterns and relationships which are not intuitively obvious lie hidden inside enormous, unwieldy databases. Also, many variables are correlated.
This has led to the creation of “data-mining” techniques, aimed at extracting this useful knowledge.
Some examples are:
• Neural Networks
• Multiple Regression
• Decision Trees
• Genetic Algorithms
• Clustering
• MVA (subject of this module)
[Figure: “Mining” data]

Data → Information → Knowledge
The aim of data-mining can be illustrated graphically as follows:
• Data – unrelated facts
• Information – facts plus relations
• Knowledge – information plus patterns
[Figure: data-to-knowledge diagram. Axes: Connectedness, Understanding. Levels: DATA (raw numbers); + relations gives INFORMATION (observed associations); + patterns gives KNOWLEDGE (scientific principles)]

Process Modelling from First Principles
[Figure: INSIDE OUT, Theoretical Model]
Chemical engineers create two types of models to simulate an industrial process. The first of these is a theoretical model, which uses First Principles to mimic the inner workings of the process.
These models are based on a process flowsheet, and each unit operation is modelled separately: reactors, tanks, mixers, heat exchangers, and so forth. Heat and mass balances are calculated, along with other thermodynamic factors. Chemical reactions are accounted for explicitly, as are the physical properties of the various gas, liquid and solid streams.

Data-Driven Process Modelling
[Figure: OUTSIDE IN, Empirical Model]
The second type of model created by chemical engineers is the empirical or “black-box” model. This approach uses the plant process data directly, to establish mathematical correlations.
Unlike the theoretical models, empirical models do NOT take the process fundamentals into account. They only use pure mathematical and statistical techniques. MVA is one such method, because it reveals patterns and correlations independently of any pre-conceived notions.
Obviously this approach is very sensitive to “garbage in, garbage out”, which is why validation of the model is so important.

What is MVA?
Multivariate analysis (MVA) is defined here as the simultaneous analysis of more than five variables. Some people use the term “megavariate” analysis to denote cases where there are more than a hundred variables.
MVA uses ALL available data to capture the most information possible. The basic principle is to boil hundreds of variables down to a mere handful.

Multivariate Analysis is Based on “Ockham’s Razor”
Pluralitas non est ponenda sine necessitate.
Rough translation: “Don’t make things more complicated than they need to be.”
William of Ockham (1285-1347) was an English monk who laid one of the cornerstones of the Scientific Method with his famous “razor” (so named because it serves to cut out the unnecessary parts of a scientific theory).
Essentially, Ockham realised back in the 14th century that deep down, Nature is simple…

Example: Apples and Oranges
A good example of these ideas is “Apple versus Orange”. Clever scientists could easily come up with hundreds of different things to measure on apples and oranges, to tell them apart:
– Colour, shape, firmness, reflectivity,…
– Skin: smoothness, thickness, morphology,…
– Juice: water content, pH, composition,…
– Seeds: colour, weight, size distribution,…
– etc.
[Figure: apple and orange, labelled +1 and -1]
However, there will never be more than one difference: is it an apple or an orange? In MVA parlance, we would say that there is only one latent attribute.

Graphical Representation of MVA
The main element of MVA is the reduction in dimensionality. Taken to its extreme, this can mean going from hundreds of dimensions (variables) down to just two, allowing us to create a 2-dimensional graph.
Using these graphs, which our eyes and brains can easily handle, we are able to “peer” into the database and identify trends and correlations. This is illustrated on the next page…
[Figure: “Peering” into the data]

Graphical representation of MVA
[Figure: raw data (hundreds of columns, thousands of rows) → statistical model (internal to software) → 2-D visual outputs showing X trends and Y trends]
Raw data: impossible to interpret

Tmt  X1  X4  X5  Rep  Y with  Y without
1    -1  -1  -1  1    2.51    2.74
1    -1  -1  -1  2    2.36    3.22
1    -1  -1  -1  3    2.45    2.56
2    -1   0   1  1    2.63    3.23
2    -1   0   1  2    2.55    2.47
2    -1   0   1  3    2.65    2.31
3    -1   1   0  1    2.45    2.67
3    -1   1   0  2    2.6     2.45
3    -1   1   0  3    2.53    2.98
4     0  -1   1  1    3.02    3.22
4     0  -1   1  2    2.7     2.57
4     0  -1   1  3    2.97    2.63
5     0   0   0  1    2.89    3.16
5     0   0   0  2    2.56    3.32
5     0   0   0  3    2.52    3.26
6     0   1  -1  1    2.44    3.1
6     0   1  -1  2    2.22    2.97
6     0   1  -1  3    2.27    2.92

Illustrative Data Set: Food Consumption in European Countries
To illustrate these concepts, we take an easy-to-understand example involving food. Data on food preferences in 16 different European countries are considered, involving the consumption patterns for 18 different food groups.
Look at the table on the following page. Can you tell anything from the raw numbers? Of course not. No one could.

Data Table: Food Consumption in European Countries
[Table: food consumption by country and food group. Courtesy of Umetrics corp.]
Note that MVA can handle up to 10-20% missing data.

Score Plot
The MVA software generates two main types of plots to represent the data: Score plots and Loadings plots.
The first of these, the Score plot, shows all the original data points (observations) in a new set of coordinates or components.
Each score is the value of that data point on one of the new component dimensions.
[Figure: data points projected onto new component axes]
The Score Plot is the projection of the original data points onto a plane defined by two new components. A score plot shows how the observations are arranged in the new component space. The score plot for the food data is shown on the next page. Note how similar countries cluster together…

Score Plot for Food Example
[Figure: score plot for the food data (Score Plot = observations), with 95% confidence interval (analogous to t-test)]

Loadings Plot
The second type of data plot generated by the MVA software is the Loadings plot. This is the equivalent of the score plot, only from the point of view of the original variables.
Each component has a set of loadings or weights, which express the projection of each original variable onto each new component. Loadings show how strongly each variable is associated with each new component.
The loadings plot for the food example is shown on the next page. The further from the origin, the more significant the correlation.
Note that the quadrants are the same on each type of plot. Sweden and Denmark are in the top-right corner; so are frozen fish and vegetables. Using both plots, variables and observations can be correlated with one another.

Use of loadings (illustration)
[Figure: loadings plot (Loadings Plot = variables): projection of old variables onto new components]

To MVA, Data Overload is Good!
One great advantage of MVA is that the more data are available, the less noise matters (assuming that the noise is normally distributed). This is one of the reasons MVA is used to mine huge amounts of data.
This is analogous to NMR measurements in a laboratory.
The more trials there are, the clearer the spectrum becomes.
[Figure: NMR spectra in three panels (1, 2, 3). Labels: “Looks random”; “After 1500 trials”; “Not random at all”; “+ve and –ve noise cancels out”]

Too Much Data is Good!
Another analogy is the toy compass that used to be given as a prize in a box of Cracker Jack. One of these compasses alone was next to useless. However, if somebody had a thousand compasses and took an average, a useful result might be obtained.
Dictionary time: Look up the definitions of “induction” and “deduction”…

Multivariate Analysis: Benefits
What is the point of doing MVA?
The first potential benefit is to explore the inter-relationships between different process variables. It is well known that simply creating a model can provide insight into the process itself (“Learn by modelling”).
Once a representative model has been created, the engineer can perform “what if?” exercises without affecting the real process. This is a low-cost way to investigate options.
Some important parameters, like final product quality, cannot be measured in real time. They can, however, be inferred from other variables that are measured on-line. When incorporated in the process control system, this inferential controller or “soft sensor” can greatly improve process performance.

MVA is Different to Neural Networks
• Both are data-driven “black box” models
• Both “learn” using real data
• However, what is inside the black box is totally different (NN is non-linear)
• Neural Networks seek to reproduce the neuron-to-neuron linkages in the brain
  – Much as “genetic algorithms” seek to reproduce Darwinian evolution

Reading List
There is no “paint-by-numbers” way to learn MVA.
Students are strongly encouraged to read the following papers, in order to begin to develop an independent understanding of what MVA is used for and how it works.
After doing this on-line course, reading the references and playing around with real data, the student should at some point experience a “Eureka!” moment when suddenly MVA makes sense. Unfortunately, there is no shortcut to achieving this insight.

Broderick, G., J. Paris, J.L. Valade and J. Wood. Applying Latent Vector Analysis to Pulp Characterization. Paperi ja Puu, 77 (6-7): 410-419.
Saltin, J. F., and B. C. Strand. Analysis and Control of Newsprint Quality and Paper Machine Operation Using Integrated Factor Networks. Pulp and Paper Canada, 96 (7): 48-51.
Kooi, S. Adaptive Inferential Control of Wood Chip Refiner. Tappi Journal, 77 (11): 185-194.
Kresta, J. V., T. E. Marlin and J. F. MacGregor (1994). Development of Inferential Process Models Using PLS. Computers and Chemical Engineering, 18 (7): 597-611.
Marklund, A. Prediction of Strength Parameters for Softwood Kraft Pulps. Nordic Pulp & Paper Research Journal, 13 (3): 211-219.
Tessier, P., G. Broderick, P. Plouffe (2001). Competitive Analysis of North American Newsprint Producers Using Composite Statistical Indicators of Product and Process Performance. TAPPI Journal, 84 (3).

1.2: How does MVA work?

Basic Statistics
It is assumed that the student is familiar with the following basic statistical concepts:
• Mean / median / mode
• Standard deviation / variance
• Normality / symmetry
• Degree of association
  – Correlation coefficients
• Degree of explanation
  – R², F-test
• Significance of differences
  – t-test, Chi-square
If not, or if it’s been a while, it is advisable to consult an introductory statistics text and do a cursory review.

Statistical Tests
Classical statistics is severely hampered by certain assumptions about data:
– All values are accurate
– All variables are uncorrelated
– There are no missing data
For real process data, such assumptions are totally unrealistic.
Statistical tests help characterise an existing dataset. They do NOT enable you to make predictions about future data. For this we must turn to regression techniques…

Regression
Regression can be summarised as follows:
• Take a set of data points, each described by a vector of values (y, x1, x2, …, xn)
• Find an algebraic equation
  y = b1x1 + b2x2 + … + bnxn + e
  that “best expresses” the relationship between y and the xi’s.
• This equation can be used to predict a new y-value given new xi’s.

Independent vs. Dependent Variables
• The xi’s in the preceding equation are called independent variables. They are used to predict y.
• y is called the dependent variable, because the way the equation is written, its value depends on the xi’s.
[Figure: X (independent) and Y (dependent) variables]

Simple vs. Multiple Regression
• Simple regression has only one x:
  y = bx + e
• Multiple regression has more than one x:
  y = b1x1 + b2x2 + … + bnxn + e

Linear vs. Nonlinear Regression
• Linear regression involves no powers of xi (square, cube etc.) and no cross-product terms of the form xixj.
• If such terms are present, the equation is nonlinear in the xi’s and we are dealing with nonlinear regression.
[Figure: example nonlinear terms: xixj, x², x³]

The Error Term e
• The error term expresses the uncertainty in an empirical predictive equation derived from imperfect observations.
• Factors contributing to the error term include:
  – measurement error
  – measurement noise
  – unaccounted-for natural variations
  – disturbances to the process being measured

The Least Squares Principle
• Regression tries to produce a “best fit equation”, but what is “best”?
• Criterion: minimize the sum of squared deviations of data points from the regression line.
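
The least squares criterion can be illustrated with a short sketch. This is only a minimal example under stated assumptions: the data are synthetic (generated here, not taken from this module), the names `X`, `b_true`, `b_hat` and `x_new` are hypothetical, and NumPy is used as a convenient tool rather than the MVA software referred to in this module. The model has the same form as above, y = b1x1 + b2x2 + e, with no intercept.

```python
import numpy as np

# Synthetic data (an assumption for illustration only):
# 200 observations of two independent variables x1, x2.
rng = np.random.default_rng(0)
n_obs = 200
X = rng.normal(size=(n_obs, 2))

# Hypothetical "true" coefficients b1, b2 and a normally
# distributed error term e, as in y = b1*x1 + b2*x2 + e.
b_true = np.array([1.5, -0.7])
e = rng.normal(scale=0.1, size=n_obs)
y = X @ b_true + e

# Least squares: find the b that minimises sum((y - X b)^2).
b_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sum of squared deviations at the least squares solution.
sse = float(np.sum((y - X @ b_hat) ** 2))

# Any other choice of coefficients gives a larger sum of squares.
b_other = b_hat + np.array([0.05, -0.05])
sse_other = float(np.sum((y - X @ b_other) ** 2))

# Use the fitted equation to predict y for new x-values.
x_new = np.array([0.5, 1.0])
y_pred = float(x_new @ b_hat)

print("estimated coefficients:", b_hat)
print("sum of squares at best fit:", sse)
print("sum of squares at perturbed b:", sse_other)
print("predicted y for x_new:", y_pred)
```

Comparing `sse` with `sse_other` shows what “best” means here: moving the coefficients away from the least squares solution can only increase the sum of squared deviations.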