Exam notes in TKJ4175

Contents
- Foundations
  - What is chemometrics?
    - General steps in analysis
    - Hard vs. soft modelling (chemometrics is soft modelling)
    - Representations
    - Method and model selection
  - Linear regression
- Experimental Design
  - Full factorial design
    - Yates algorithm
  - Fractional factorial design
    - Effects and regression
    - Returning to the original variables
  - Multilevel and constrained designs
    - D-optimal design
    - Simplex
    - Response surface
- Signal processing (preprocessing)
  - Centering and scaling
  - Normalization
  - Autoscaling
  - The time domain
    - Numerical differentiation
  - The Fourier domain
- Unsupervised analysis
  - PCA
    - Data analysis with PCA
  - Cluster analysis
    - k-means clustering
- Supervised analysis
  - Latent variable based regression
    - PCR (Principal component regression)
    - PLSR (Partial least squares regression)
  - Validation
    - Resampling methods
  - Classification methods
    - Fisher's Linear Discriminant Analysis (LDA)
    - Prototype classification
    - Decision trees

Foundations

What is chemometrics?
Definition of chemometrics: chemometrics uses mathematical, statistical and artificial intelligence methods to:
- Design or select optimal experimental procedures
- Provide maximum chemical information by analyzing chemical data
- Obtain knowledge about chemical systems

General steps in analysis
- Plan experiments: use experimental design to set up the experiments in a systematic way
- Examine data: look at the raw data with various plots
- Pre-process: is there systematic variation in the data that should not be there?
  o Noise removal, correction for non-linearity
- Estimate model: inspect plots and diagnostics to find outliers etc.
  o PCA
- Examine results and validate model: what does the result tell us? Is the model valid for future samples?
- Prediction: use the model on new data. Examine the results to see whether the predictions are in accordance with the expectations.

Hard vs. soft modelling (chemometrics is soft modelling)
Hard modelling is based on existing physical theories, while soft modelling is based on finding structures in the data using statistical/AI methods. The computer calibrates from the data and generates a model.
Hard modelling
- Advantages: better extrapolations; easier to understand and interpret; deeper understanding of the system
- Disadvantages: needs a physical description of the system; high complexity

Soft modelling
- Advantages: higher prediction ability than hard models; data driven model; does not need much information about the inner workings of a system; easier to make than hard models
- Disadvantages: poor extrapolation capabilities; needs more data than hard models; does not provide as deep an understanding as hard models

Representations
- Notation – columns vs. rows: columns contain variables, rows contain objects/samples.
- Comparability: it is important that a variable is comparable across objects (that it has the same meaning for different objects).
- Sampling point representation (SPR)
  o Works fine as long as there is no confusion about whether point i in one curve has the same meaning as point i in another curve
  o Ex: problem if one profile is shifted or deformed with respect to the other profile

Method and model selection
- Pre/post-processing: remove noise and non-linearity
- Unsupervised: looks for naturally occurring patterns in the data
- Supervised: finds a relationship between external information (response) and input data. Models are created such that the external information may be predicted
  o Regression: the external information consists of real values (for example a concentration)
  o Classification: the external information consists of categorical variables (yes/no, cancer/non-cancer)

Linear regression
Definition – linear equations: a model is linear if it can be written on the form q = β_0 + β_1 f_1(x_1) + ... + β_n f_n(x_n), where f_j(x_j) itself can be non-linear (but must not depend on the coefficients β_j). For example, the equation q = β_0 + β_1 x_1^2 + β_2 log(x_2) is linear, but q = β_0 + log(x_1 + β_1) is not (because it cannot be written as a sum of terms β_j f_j(x_j)).
The idea behind the least squares method is to minimize the squared errors, i.e. to make R = e^T e as small as possible, where e = y - ŷ and ŷ = Xb. This can be done by solving ∂R/∂b = 0 (just remember: if y = x^T A x, then ∂y/∂x = 2Ax). This gives b = (X^T X)^{-1} X^T y, and thus ŷ = Xb = X (X^T X)^{-1} X^T y = Hy. Alternatively, we could use the geometric argument that X^T e = 0 and derive the same equation from that.
The method also works for several y-variables (when we have a Y matrix instead of a y vector), and we then get multiple linear regression (MLR): B = (X^T X)^{-1} X^T Y.
We make several assumptions about the residuals e_i = y_i - ŷ_i:
- Normally distributed with zero mean
- Independent
- Same variance
- Homoscedasticity: the residuals are independent of the magnitude of the y-response (the opposite is called heteroscedasticity)
  o Plot ŷ against e_i to look for heteroscedasticity
A normal probability plot of the residuals can be used to check whether they are normally distributed. If not, something may be wrong.
Linear regression cannot be used if we have collinearity, which happens when the columns of X are linearly dependent; the determinant of X^T X is then 0 and X^T X cannot be inverted. This is problematic because chemical data are often correlated, but PCR and PLS come to the rescue.
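As a small numerical illustration of the least-squares formula above, here is a hedged sketch using NumPy; the data and variable names are made up for illustration, and the explicit normal-equation inverse simply mirrors the formula (in practice a solver such as numpy.linalg.lstsq is preferred).

```python
import numpy as np

# Hypothetical example data: 5 objects, 2 x-variables (made up for illustration)
X = np.array([[1.0, 2.0],
              [2.0, 1.5],
              [3.0, 3.5],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([3.1, 4.0, 6.9, 7.8, 10.2])

# Add a column of ones so the model includes an intercept b0
X1 = np.column_stack([np.ones(len(X)), X])

# Normal equations: b = (X^T X)^-1 X^T y
b = np.linalg.inv(X1.T @ X1) @ X1.T @ y

y_hat = X1 @ b          # fitted values, y_hat = X b = H y
e = y - y_hat           # residuals, should scatter around zero with equal variance
print("b =", b)
print("residuals =", e)
```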
Experimental Design

Methods to get maximum information about our system using a minimum of experiments.
- Simultaneous design: create the experimental settings before performing the experiments.
- Sequential design: optimize the properties of an ongoing process – use information from the last experiment(s) to decide on the next.

Alternatives to experimental design:
- Ad hoc experimentation
  o Based on the outcome of one experiment, the next is decided
  o Uses expertise to decide the best conditions for testing out the problem
  o Problems with understanding the system
  o Variability can interfere with the interpretation from one experiment to the next
  o Most likely: the optimal solution is not found
- One variable at a time (OVAT)
  o Vary each variable separately
  o Assumption: all variables are independent – often not the case, so the true optimum may not be found
  o Many more experiments than necessary are used

Full factorial design
Number of experiments: k^N, where we have N factors in a k-level design (ex: k = 2; high and low). The effect observed comes from one effect only. In addition to the main effects, a full factorial design also lets us estimate the effects of all interactions.
To find the effect of a factor: observe how the response changes when going from a low to a high value of the factor (given that all other factors do not change, otherwise the effect will be contaminated).
Remember: always perform the experiments in a random order, never in the order shown in the design matrix. This minimizes the effect of unknown factors being confounded with the ordering.

Yates algorithm
Needs the experiments in standard order. Notation:
- (1): [-,-,-] (A=-, B=-, C=-)
- ab: [+,+,-] (A=+, B=+, C=-); this row corresponds to the effect AB
The sequence of experiments (standard order) is built as follows:
- Start with (1)
- Then a, b and ab
- Multiply the previous ones with c: c×(1) = c, c×a = ac, etc.
- Multiply the previous ones with d: d×(1) = d, d×a = ad, etc.
- Continue for all factors
For 4 factors, the standard ordering looks like this: (1), a, b, ab, c, ac, bc, abc, d, ad, bd, abd, cd, acd, bcd, abcd.
The Yates algorithm calculates all effects (main and interaction) for N factors. The input is a vector of response values in standard order. The algorithm fills in a matrix with N + 2 columns; the first column contains the response values and the last column contains the calculated effects. The "arrow procedure" (adding and subtracting successive pairs of values) has to be done N times, followed by a division by 2^(N-1) (except for the first element, which is divided by 2^N). How to remember: start the "arrow procedure" from the bottom.

Fractional factorial design
Number of experiments: k^(N-p), where p is the number of factors removed (the design is a 1/k^p fraction of the full factorial). Idea: not all higher-order interactions are important. Consequence: confounding – the effect we observe may be contaminated by other effects. Columns in the design matrix that are confounded with each other are called aliases; these columns have the same design values. If p increases, the number of aliases also increases.
Rule: a main effect should never be confounded with anything less than a 3-factor interaction. For example, in a 2^(4-1) design, D should be confounded with ABC; D = ABC is called a generator. From this, the defining contrast ABCD = 1 (which can be used to find which effects are confounded) may be deduced, since we know from the design matrix that A^2 = B^2 = C^2 = D^2 = ABCD = 1.
Resolution is the length of the shortest defining contrast. From a 2^(5-2) design we get three defining contrasts, e.g. ABCD = ADE = BCE = 1; thus the resolution is R = III.

Effects and regression
Instead of using the Yates algorithm, one can use regression where the X matrix is based on the design matrix. We then get the effects from the MLR equation b = (X^T X)^{-1} X^T y. The values are scaled versions of those calculated with the Yates algorithm:
Regression coefficient = Factor effect / Factor range
In a 2-level design the factor range is 2 (from -1 to +1). The mean response, however, is the same (remember the division by 2^N for the mean in the Yates algorithm).
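To make the relation between regression coefficients and factor effects concrete, here is a hedged sketch for a 2^3 full factorial with coded ±1 variables; the response values are invented for illustration.

```python
import numpy as np
from itertools import product

# 2^3 full factorial in standard order (A varies fastest), coded -1/+1
levels = [-1, 1]
design = np.array([[a, b, c] for c, b, a in product(levels, levels, levels)])
A, B, C = design.T

# Model matrix with intercept, main effects and all interactions
X = np.column_stack([np.ones(8), A, B, C, A*B, A*C, B*C, A*B*C])

# Hypothetical responses in standard order (made up for illustration)
y = np.array([60., 72., 54., 68., 52., 83., 45., 80.])

b = np.linalg.inv(X.T @ X) @ X.T @ y   # MLR on the coded design matrix
effects = 2 * b                        # effect = coefficient * factor range (range = 2)
effects[0] = b[0]                      # the first entry is the mean response, not doubled

for name, val in zip(["mean", "A", "B", "C", "AB", "AC", "BC", "ABC"], effects):
    print(f"{name}: {val:.2f}")
```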
Returning to the original variables
Problem: we have a regression model expressed in the coded design variables. How can we transform it back into the original variables? Assume that the coded variable is y = ax + b, where x is the original variable. Inserting the low (L) and high (H) values of x must give the low (-1) and high (+1) coded values:
-1 = a·L + b
+1 = a·H + b
Solving these equations gives a = 2/(H - L) and b = -(H + L)/(H - L), which can be inserted into y = ax + b.

Multilevel and constrained designs
Two-level designs are only able to capture linear relationships (lines, planes), while multilevel designs can model polynomial relationships. Usually only models up to second order (quadratic models) are used. A multilevel design is performed to estimate the quadratic regression coefficients in the following model (for two factors x1 and x2):
ŷ = b0 + b1 x1 + b2 x2 + b11 x1^2 + b22 x2^2 + b12 x1 x2
Sometimes various constraints limit the experiments we can perform. Idea: we want to find the optimal experiments given the constraints.
Example: cooking meat
- Marinating time: [6, 18] (hours)
- Steaming time: [5, 15] (min)
- Frying time: [5, 15] (min)
- However, frying + steaming time must lie in [16, 24] min (a multilinear constraint).
Constraint problems:
- The shape of the experimental region can be very complex.
- Constrained designs are not orthogonal
  o Orthogonality ensures that all effects can be studied independently
  o Orthogonality ensures minimal error on the estimated regression coefficients

D-optimal design
How can we select the smallest number of experiments that are suitable given a constrained variable space? The D-optimality criterion can be used: select the subset of n objects from the N possible objects in the experimental space that produces a model matrix X for which the determinant of the dispersion matrix, det[(X^T X)^{-1}], is minimized. This is the same as maximizing det(X^T X). For a constant design size, one can say that the higher det(X^T X) is, the closer the design is to orthogonality. It is important that the same number n is extracted each time when designs are compared. Since (X^T X)^{-1} does not contain any information about the response, this quality measure depends only on the design.

Simplex
Uses n + 1 data points in a simplex structure, where n is the number of dimensions. The simplex is moved through a series of mirror reflections and contractions. The main idea is to reflect the simplex away from the point with the worst response. If the new point becomes the worst point in the new simplex, we get oscillation; a solution is then to reflect away from the second worst point in the original simplex instead.
Rules (a sketch of the reflection step follows after the list):
1. Except at initialization, only one vertex is added (and one removed) at each stage
2. Reflect away from the worst point w to create the new point r
3. If the new point r is the worst point in the new simplex, we get oscillations
   a. Reflect from the second worst point instead (or contract)
4. If the same vertex has been kept for k + 1 simplex moves without being discarded, re-evaluate the response at that vertex
5. Punish with a bad response value if the simplex moves out of range
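A minimal sketch of rules 2 and 3 above for a basic fixed-size simplex, assuming a made-up response function to be maximized (a real application would evaluate an experiment instead of calling a function).

```python
import numpy as np

def response(x):
    # Hypothetical response to be maximized (stand-in for running an experiment)
    return -((x[0] - 3.0)**2 + (x[1] - 2.0)**2)

# Initial simplex: n + 1 = 3 vertices for n = 2 factors
simplex = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

for step in range(30):
    values = np.array([response(v) for v in simplex])
    order = np.argsort(values)                 # order[0] = worst vertex
    worst = order[0]
    centroid = simplex[np.arange(3) != worst].mean(axis=0)
    r = 2 * centroid - simplex[worst]          # rule 2: mirror the worst point
    if response(r) < values[order[1]]:         # rule 3: r would be worst in the new simplex
        worst = order[1]                       # reflect from the second worst point instead
        centroid = simplex[np.arange(3) != worst].mean(axis=0)
        r = 2 * centroid - simplex[worst]
    simplex[worst] = r

# A fixed-size simplex ends up circling the optimum; contraction steps would refine it.
best = simplex[np.argmax([response(v) for v in simplex])]
print("best vertex found:", best)
```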
Response surface
After a fractional factorial design has been performed to find which factors are important, AND a full factorial design has been performed on the most important factors, multilevel designs and response surfaces are used for the final optimization.
The type of critical point is determined from the sign and magnitude of the eigenvalues λ (of the matrix of quadratic coefficients):
- If all λ's are negative, we have a maximum point
- If all λ's are positive, we have a minimum point
- If the signs of the λ's differ, we have a saddle point
  o There can be N - 1 different saddle point types, where N is the number of factors
- If one or more of the λ's are zero, we do not have a critical point

Signal processing (preprocessing)

Centering and scaling
The data we want to analyze may contain noise etc., and we want to remove this unwanted variation. Centering is based on the idea that there are offsets. We column-wise center a matrix X by subtracting the column means: X_c = X - 1 m_c^T, where m_c = (1/n) X^T 1 is the vector of column means and n is the number of rows in X. Alternatively, it can be written X_c = X - Z = X - (1/n) 1 1^T X, where Z is the offset matrix.
We want to center if:
- It increases the fit of the data
- It reduces numerical problems for the analysis algorithms

Normalization
Sometimes intensity shifts are undesirable, for example when different amounts of sample are used. Normalization can solve this problem. If X is the data matrix, then the (row-wise) normalized matrix Z is Z = [diag(k)]^{-1} X, where k is a vector with the normalization factors. If we want to normalize the columns instead, we apply the same operation to the transpose of the data matrix, Z_t = [diag(k)]^{-1} X^T, and then transpose back: Z = Z_t^T.

Autoscaling
Autoscaling gives each column an equal chance to participate in the modelling process. Autoscaling means subtracting the column mean and dividing by the column standard deviation (as when standardizing a normal distribution). Be careful though: for spectra the intensities themselves carry information, so autoscaling is not always appropriate. We want to autoscale when the variables are of very different kinds.

The time domain
Convolution is when you "slide" one vector over another. Depending on the shape of the filter, convolution can perform:
- Smoothing
- Deformation
- Differentiation
Discrete formula: g(t) = f(t) * h(t) = Σ_m f(m) h(t - m)
Continuous: f(t) * h(t) = ∫ f(τ) h(t - τ) dτ
A function repeatedly convoluted with itself will converge to a Gaussian function.
Different kinds of filtering:
- Mean smoother / running mean smoother (moving average): g(i) = (1/(2m+1)) Σ_{j=-m}^{m} f(i + j), where the window contains 2m + 1 points.
- Problem with the moving average: broadening of spikes. Solution: the running median.

Numerical differentiation
Savitzky-Golay (SG) filtering is a way to numerically smooth and differentiate signals. Idea: fit a polynomial of k'th order to N data points within a convolving window which moves along the signal, and use the fitted model (polynomial) to predict the value at position z = 0 in the scaled coordinates (if N is odd, z = 0 lies in the middle of the window).
The polynomial has the form y = a0 + a1 z + a2 z^2 + ... + ak z^k, so y(z = 0) = a0. From this, the filter vector to be convoluted with the signal is found.
For differentiation: dy/dz = a1 + 2 a2 z + ... + k ak z^(k-1), so (dy/dz) at z = 0 equals a1. This gives a filter vector that can be convoluted with the signal to perform first-order numerical differentiation.
What window size should be used? For first-order differentiation: little noise → small window.
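A brief sketch of Savitzky-Golay smoothing and first-derivative filtering, assuming SciPy is available; the noisy peak below is synthetic.

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic noisy peak (stand-in for a measured signal)
t = np.linspace(0, 10, 200)
signal = np.exp(-(t - 5)**2) + 0.05 * np.random.randn(t.size)

# Fit a 2nd-order polynomial in an 11-point moving window and keep y(z = 0)
smoothed = savgol_filter(signal, window_length=11, polyorder=2)

# Same filter, but return the fitted polynomial's first derivative at z = 0
derivative = savgol_filter(signal, window_length=11, polyorder=2,
                           deriv=1, delta=t[1] - t[0])
```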
The Fourier domain
Idea: describe the signal in terms of different frequency components and their amplitudes.
Fourier series: f(t) = a_0 + Σ_{n=1}^{∞} [ a_n cos(nωt) + b_n sin(nωt) ], where ω = 2π/T is the angular frequency for a period T = 2L. The coefficients are:
a_0 = (1/(2L)) ∫_{-L}^{L} f(t) dt
a_n = (1/L) ∫_{-L}^{L} f(t) cos(nωt) dt
b_n = (1/L) ∫_{-L}^{L} f(t) sin(nωt) dt
Fourier transform: F(ω) = (1/√(2π)) ∫ f(t) e^{-iωt} dt
Inverse FT: f(t) = (1/√(2π)) ∫ F(ω) e^{iωt} dω
The observed signal can be seen as composed of the underlying signal and noise. We assume that the frequency distribution of the noise is different from that of the chemical signal; it is often assumed that the chemical signal is dominated by low frequencies. Noise can then be removed with a low-pass filter L(ν). The ideal low-pass filter cuts off all frequencies above a certain threshold and lets the remaining lower frequencies pass unchanged (in the frequency domain it is a box function). The filtering is (usually) performed as a multiplication in the frequency domain (where the values are complex): F_smooth(ν) = F(ν) L(ν). The smoothed function in the time domain is obtained by running the inverse FT on F_smooth(ν).
The ideal high-pass filter is the opposite of the low-pass filter: it only lets through frequencies that are higher than a threshold.

Unsupervised analysis

PCA
Idea: project from a higher dimension down to a lower one to be able to visualize the data. PCA is unsupervised in the sense that it does not care about the response, only about the data itself (the X matrix, not the y). From the book: unsupervised methods are used for data exploration, where one is looking for naturally occurring patterns in the data.
We want to make latent variables, which are linear combinations of the original variables; for example, the latent variable "overall body size" from the original variables "height", "weight" and "shoe size". The new latent variable axis points in the direction of maximum variance. PC1 is the first component; PC2 is the second and is orthogonal to PC1. PC2 points in the direction of maximum variance not explained by PC1. We need not use all the components; we use as many as needed until we are satisfied with the explained variance (the last components are taken to be related to noise).
As before, in a data matrix the columns contain variables and the rows contain objects or samples.
PCA model: X = T P^T + E (bilinear, since the matrix is written as a product of two other matrices). The PCA equation can be derived using Lagrange multipliers, where we want to find a vector t = Xp such that the variance of t is maximized.
Scores (T) are the coordinates of the objects along the new latent axes (the principal component axes). A score plot can be used to:
- Detect clusters
- Detect (possible) outliers (plotting PC1 vs. PC2, PC1 vs. PC3, etc.)
  o Can also plot leverage versus residual X-variance
- Find patterns/trends
- See what the different PC axes separate (tumor vs. non-tumor etc.)
Loadings (P) determine how much each variable influences a latent variable: t_i = Σ_{j=1}^{M} p_j x_ij. Loadings can be used to detect:
- Which variables contribute to each PC
- Correlated variables (by looking for clusters in the loading plot)
- Negatively correlated variables (which lie opposite each other along a PC). If the angle between two variables, seen from the origin, is close to 180°, they are strongly negatively correlated
- Unimportant variables (they cluster around the origin)
Singular value decomposition (SVD): gives the same numerical results as PCA, but the algorithm is a bit different. Here we write X = U S V^T, where U is a column-orthonormal matrix, S is a diagonal matrix and V^T is a row-orthonormal matrix.
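A small sketch of PCA via the SVD of the column-centered data matrix, assuming NumPy; the scores are T = U S and the loadings are the columns of V.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))          # hypothetical data: 20 objects, 5 variables

Xc = X - X.mean(axis=0)               # column-wise centering
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

T = U * s                             # scores (same as Xc @ Vt.T)
P = Vt.T                              # loadings, one column per PC
explained = s**2 / np.sum(s**2)       # fraction of variance explained by each PC

print("explained variance per PC:", np.round(explained, 3))
print("scores of first object on PC1, PC2:", T[0, :2])
```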
NIPALS: the NIPALS algorithm (non-linear iterative partial least squares) starts with a trial vector t for the scores and iteratively finds the loading vector. The steps are (for i = 1:A_max, where A_max is the maximum number of PCs to extract):
1. Project E onto t to find the vector w
2. Scale w to length 1
3. Project E onto w to find a new vector t
4. Check for convergence
5. Set the scores vector equal to the newest vector t, and the loadings vector p equal to w
6. Remove the estimated PC component from E

Data analysis with PCA
1. Plot the raw data
2. Initial exploration (with PCA)
   a. Make sure to use a sufficient number of PCs
3. Further exploration
   a. Search for outliers, groups, trends etc.
   b. Look for patterns in the score plot
   c. Try to estimate the true number of components in the data by using validation methods
4. Interpretation
   a. Combine the information from the previous stages with our external information about the problem

Cluster analysis
Goal: find natural patterns/clusters in a data set. Different ways to calculate proximity:
- Euclidean: d_ij^(E) = [ Σ_{k=1}^{N} (x_ik - x_jk)^2 ]^(1/2)
- Manhattan ("taxi"): d_ij^(M) = Σ_{k=1}^{N} |x_ik - x_jk|
- Minkowski: d_ij^(M(p)) = [ Σ_{k=1}^{N} |x_ik - x_jk|^p ]^(1/p) (if p = 2, we have the Euclidean distance)
- Mahalanobis: (d_KB)^2 = (x_K - x_B)^T C^{-1} (x_K - x_B), where C^{-1} is the inverse covariance matrix (if C = I, this reduces to the Euclidean distance)
The typical illustration: even though the Euclidean distance between two points C and D is shorter than between A and B, the Mahalanobis distance A-B can be smaller than C-D when A and B are oriented along the main direction of variation in the data.

Ward's method
Idea: use the sum of squares between the cluster centers and the members of the clusters as the criterion for merging clusters (instead of using cluster distances).
Total within-cluster error sum of squares for cluster m: E_m = Σ_{i=1}^{n_m} Σ_{j=1}^{M} ( x_ij^(m) - x̄_j^(m) )^2, where x̄^(m) is the center of cluster m.
Adding the within-cluster error sums of squares for all clusters gives E_tot = Σ_{m=1}^{K} E_m, where K is the number of clusters. If we merge two clusters A and B, we are left with K - 1 clusters.
Ward criterion: we want the difference between E_tot^(1) (after merging) and E_tot^(0) (before merging) to be as small as possible: ΔE = E_tot^(1) - E_tot^(0). The difference has to be checked for every possible merge of two clusters, and the pair that gives the lowest value is merged. The number of possible cluster pairs is K(K - 1)/2.

k-means clustering
Algorithm:
- Select the number of clusters K ≤ K_max to look for
- Start by creating K random cluster centers m_k
- For each object x_j, assign it to the nearest cluster center
- Re-compute the center points m_k for the new clusters and iterate until convergence
This procedure minimizes the within-cluster variance. Problem: the number of clusters to look for has to be decided beforehand. Solution: gap statistics.
To estimate the optimal number of clusters K*, we do the following:
- Compute k-means for K = 1, 2, …, K_max
- Compute the mean within-cluster variance W_K for each choice of K
- W_K will generally decrease with increasing K
- When K < K*, we expect a significant decrease of the within-cluster variance: W_{K+1} << W_K. When K > K*, the decrease is less evident, so the W_K curve flattens. A sharp drop may therefore be used to find the optimal number of clusters – this is the basis for gap statistics.
Gap statistics compares the curves log(W_K^simul) and log(W_K^data), where "simul" represents simulated data over the given data region. The optimal number of clusters is where the gap between these two curves is largest.
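A hedged sketch of k-means plus a rough gap-style comparison, assuming scikit-learn is available; the blobs, the candidate range of K and the single uniform reference set are made up for illustration (the real gap statistic averages over several reference sets).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical data: three well-separated blobs in 2D
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2))
               for c in ([0, 0], [3, 0], [0, 3])])

# Reference data simulated uniformly over the data region
X_ref = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)

for K in range(1, 7):
    w_data = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).inertia_
    w_ref = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X_ref).inertia_
    gap = np.log(w_ref) - np.log(w_data)
    print(f"K={K}: log(W_ref) - log(W_data) = {gap:.2f}")
# The K with the largest gap is taken as the estimate of the optimal number of clusters.
```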
Supervised analysis

Latent variable based regression
Multivariate calibration involves finding a relation between two data matrices X and Y, where X contains the independent variables (samples with variables we can choose as we like) and Y contains the dependent variables (which depend on what has been chosen in X). X and Y are related through a regression relationship.
The regression process consists of two main steps: calibration (learning, training) and prediction (testing). In general, at least 50% of the data should be used for calibration, maybe even 80%. Be aware that prediction is only valid for objects which resemble those in the calibration set (for example, the fraction of cancerous cells should be similar in the training and test sets).

PCR (Principal component regression)
Problem with MLR: with more variables than objects, X^T X cannot be inverted. Idea: use regression on latent variables instead of the original variables (thus reducing the number of variables). Project the samples X onto a new basis W and use the scores from these projections: T = XW. If T is equal to the PCA scores matrix, we have PCR.
The PCR process is as follows. For an optimal number of PC components: X = T P^T + E and Y = T Q^T + F, where Q^T = (T^T T)^{-1} T^T Y is the regression coefficient matrix. We know that T^T T is invertible (and that T is orthogonal), which solves the initial inversion problem. However, we want regression coefficients expressed in terms of the original variables:
Ŷ = T Q^T = (XP) Q^T = X P (T^T T)^{-1} T^T Y = X B_PCR, where B_PCR = P (T^T T)^{-1} T^T Y = P P^T B_MLR = H_P B_MLR.

PLSR (Partial least squares regression)
Problem: in PCR we perform the latent variable projection in X whether or not it is relevant for the prediction of Y. Idea: in PLSR we find latent variables for X that are directly relevant for the prediction of Y.
PLSR uses two different NIPALS blocks (one for X and one for Y). In the X-block:
1. Use a column in Y as the starting vector u
2. Compute the w vector and scale it to length 1
3. Calculate t = Xw
4. Calculate p^T = t^T X / (t^T t)
In the Y-block:
1. Start from the score vector t computed in the X-block
2. Compute the q vector and scale it to length 1
3. Calculate u = Yq
4. Use the new u-vector in the X-block
This process is repeated until convergence is reached. The X and Y spaces are then updated (deflated).
Some properties:
- The scores matrix T is orthogonal and (in general) not equal to the PCA scores matrix
- The loadings matrix P is not orthogonal, unlike in PCA
Equations for PLSR with A factors (T = T_A etc.):
X = T P^T + E
Y = U Q^T + F
where T is the scores matrix for X and U is the scores matrix for Y. There is an inner relation between these scores: U = TG, where G is a diagonal matrix. We can then write X = T P^T + E and Y = T C^T + F, where C^T = G Q^T. We can find the regression coefficients by using that EW = 0:
X = T P^T + E  =>  XW = T P^T W + EW  =>  T = XW (P^T W)^{-1}
Inserting this into Y = T C^T gives Y = XW (P^T W)^{-1} C^T = X B_PLS.
Remark: if we extract the maximum number of possible factors, A = A_max, then B_PLSR = B_MLR (because we are then using the full X-space to create the model, and not a well-suited subspace of X).
Lagrange multiplier approach: we want to find a vector t in the column space of X, t = Xw, and a vector u in the column space of Y, u = Yq, such that the squared covariance between t and u is maximized.
Different PLS algorithms:
- PLS1: the Y matrix has only 1 column vector
  o No iterations for each PLS factor extracted
- PLS2: the Y matrix has more than 1 column vector
  o Several iterations for each PLS factor extracted
When to use PCR, PLS1 and PLS2?
- PCR: it is a subset of PLS, so you do not have to use it
  o PLSR will in general produce models with fewer factors
- One Y-variable: use PLS1
- Multiple Y-variables:
  o If there is much covariance between the Y variables, use PLS2
  o Otherwise, use a separate PLS1 model for each Y variable
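A short sketch of fitting a PLS1 model with scikit-learn's NIPALS-based PLSRegression; the data and the choice of A = 2 factors are made up for illustration (in practice A is chosen by validation, see the next section).

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))                            # hypothetical spectra-like data
y = X[:, :3].sum(axis=1) + 0.1 * rng.normal(size=30)     # response driven by a few variables

pls = PLSRegression(n_components=2)   # A = 2 PLS factors (assumed, not optimized)
pls.fit(X, y)

y_hat = pls.predict(X).ravel()
print("regression coefficient matrix shape:", pls.coef_.shape)
print("first few predictions:", np.round(y_hat[:3], 2))
```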
Validation
Prediction error: assume we have a validation set X_val and y_val, and the model from the calibration is applied to obtain ŷ_val. The root mean squared error of prediction (RMSEP) is then
RMSEP = sqrt( (1/n) Σ_{i=1}^{n} (y_val,i - ŷ_val,i)^2 )
A plot of RMSEP can be used to find the number of PLS factors needed (too many PLS factors result in overfitting and modelling of noise, which is unwanted). The optimal number of PLS factors is where the RMSEP is at its minimum.

Resampling methods
Cross-validation and bootstrapping: use the same data many times in different ways. Typically used when we are not able to obtain a large enough validation/test set (which could be the case if experiments are expensive or time consuming). Idea: remove a part of the data to simulate an independent validation/test set. Then the data are inserted again and another part is extracted (in bootstrapping, the same object can be drawn multiple times).
Steps in cross-validation (CV):
- Take out k objects from the data set of n objects
- Create a calibration model on the remaining n - k objects
- Use the k objects as a validation set
- Record the prediction error
- Put the k objects back and draw another k objects
- Repeat the process
If k = 1 we have Leave One Out (LOO) cross-validation.
Problem with CV: when the model constructed for a CV segment is based on few objects, it performs worse than a model based on all objects.
Permutation tests: the idea is to perform a random permutation (randomization) of the dependent variable Y in order to destroy the relationship between the X- and Y-space. We then have a data set where we know there is no relationship between X and Y, i.e. no structure except noise. The prediction error using the real Y matrix should then be much lower than the error obtained with the permuted sets. We can then see how many PLSR factors (or PCA components) we can use while still performing significantly better than the permuted "noise" models at a significance level α.
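A sketch of choosing the number of PLS factors with leave-one-out cross-validation, assuming scikit-learn; the data and the candidate range of factors are synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 8))
y = 2 * X[:, 0] - X[:, 1] + 0.2 * rng.normal(size=25)

for a in range(1, 6):                         # candidate numbers of PLS factors
    y_cv = cross_val_predict(PLSRegression(n_components=a), X, y,
                             cv=LeaveOneOut()).ravel()
    rmsecv = np.sqrt(np.mean((y - y_cv) ** 2))
    print(f"A={a}: RMSECV = {rmsecv:.3f}")
# Pick the number of factors at (or just before) the minimum of this curve.
```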
Classification methods

Fisher's Linear Discriminant Analysis (LDA)
Idea: find a new coordinate system that minimizes the within-class variance and maximizes the between-class variance. In other words, we want new coordinates which produce tight clusters that are far from each other. These are the latent variables best suited to separate the classes: z_i = Σ_{j=1}^{M} X_ij r_j, i.e. z = Xr, where r is a direction in the new DFA coordinate system (analogous to a PC). We want to find
F_max = arg max_r ( r^T B r / r^T W r ),
where r^T B r is the between-class variance and r^T W r is the within-class variance, with the constraint r_i^T W r_j = δ_ij.
Characteristics of LDA:
- No assumption about the distribution is needed
- Can also be used as a post-processing step

Prototype classification
Does not form a model, but keeps a store of objects with known classes (memory based methods).
k-means clustering: the k-means clustering method from unsupervised analysis can also be used for classification by providing prototype objects. The prototypes are the cluster centers provided by the method:
- Assume we have a data set containing K classes with N_j objects in each class
- Perform k-means clustering with L_k clusters for each class data set, giving L_k cluster centers per class
- The L_k cluster centers for each class are prototype objects which are used to perform the classification
- When a new object is to be classified, compute the distance from this object to each of the L_k cluster centers of each class
- The closest prototypes are inspected, and the object is given the class membership of the class the nearest prototype belongs to
k-nearest neighbor classification (k-NN): for k = 1, the new object is assigned the class label of the closest prototype; in general, the k nearest neighbors of the new object are used in the assignment process. Example: with k = 5, if the majority of the 5 nearest neighbors belong to the red class, the new (unknown) object is assigned to the red class.
Advantages and disadvantages of k-NN:
- Advantages
  o Simple to understand
  o Can handle complex decision boundaries (if k-NN cannot classify the object, other methods will most likely also fail)
  o Simple to implement
- Disadvantages
  o Does not create a decision boundary model
  o Difficult to generalize the result, since no mathematical model is created
  o Difficult to extract which attributes are important for the classification

Decision trees
Idea: find if-then-else rules. In order to construct a decision tree, we want to find which attribute/variable is the best for classification (this will decide the distribution of the classes). To measure "purity" (low randomness), the entropy function is used:
Entropy(p) = E(p) = - Σ_{i=1}^{K} p_i log2(p_i), where p_i = (# objects in class i) / (# all objects).
Lower entropy means lower uncertainty, and this is the criterion used for attribute selection. When building decision trees, we look for the attribute that gives the maximum reduction in entropy (the starting entropy can be found from the initial distribution of the classes).
Short trees are better than long trees (to avoid statistical coincidences) – Occam's razor.
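A small sketch of the entropy criterion for choosing a split attribute, assuming binary class labels and a made-up binary attribute; the information gain is the reduction in entropy after the split.

```python
import numpy as np

def entropy(labels):
    # E(p) = -sum p_i log2 p_i over the classes present in `labels`
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical data set: class labels and one candidate binary attribute
y = np.array(["cancer", "cancer", "normal", "normal",
              "normal", "cancer", "normal", "normal"])
attr = np.array([1, 1, 0, 0, 0, 1, 1, 0])    # e.g. "marker above threshold?"

E_start = entropy(y)
# Weighted entropy of the branches produced by splitting on `attr`
E_split = sum((attr == v).mean() * entropy(y[attr == v]) for v in np.unique(attr))
gain = E_start - E_split                     # information gain of this attribute
print(f"start entropy {E_start:.3f}, after split {E_split:.3f}, gain {gain:.3f}")
```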