Exam notes TKJ4175

Contents
Foundations
    What is chemometrics?
        General steps in analysis
        Hard vs. soft modelling (chemometrics is soft modelling)
        Representations
        Method and model selection
    Linear regression
Experimental Design
    Full factorial design
        Yates algorithm
    Fractional factorial design
        Effects and regression
        Returning to the original variables
    Multilevel and constrained designs
        D-optimal design
        Simplex
        Response surface
Signal processing (preprocessing)
    Centering and scaling
    Normalization
    Autoscaling
    The time domain
        Numerical differentiation
    The Fourier domain
Unsupervised analysis
    PCA
        Data analysis with PCA
    Cluster analysis
        k-means clustering
Supervised analysis
    Latent variable based regression
        PCR (Principal component regression)
        PLSR (Partial least squares regression)
    Validation
        Resampling methods
    Classification methods
        Fisher's Linear Discriminant Analysis (LDA)
        Prototype classification
        Decision trees
Foundations
What is chemometrics?
Definition of chemometrics: Chemometrics uses mathematical, statistical and artificial intelligence methods to:
• Design or select optimal experimental procedures
• Provide maximum chemical information by analyzing chemical data
• Obtain knowledge about chemical systems
General steps in analysis
• Plan experiments: Use experimental design to set up experiments in a systematic way
• Examine data: Look at the raw data with various plots
• Pre-process: Is there systematic variation in the data that should not be there?
  o Noise removal, correction for non-linearity
• Estimate model: Inspect plots and diagnostics to find outliers etc.
  o PCA
• Examine results and validate model: What do the results tell us? Is the model valid for future samples?
• Prediction: Use the model on new data. Examine the results to see how well the predictions agree with the expectations
Hard vs. soft modelling (chemometrics is soft modelling):
Hard modelling is based on existing physical theories, while soft modelling is based on
finding structures in the data using statistical/AI methods. The computer calibrates from the
data and generates a model.
Hard modelling
• Advantages: Better extrapolations; easier to understand and interpret; deeper understanding of the system
• Disadvantages: Need a physical description of the system; high complexity
Soft modelling
• Advantages: Higher prediction ability than hard models; data driven model; does not need much information about the inner workings of a system; easier to make than hard models
• Disadvantages: Poor extrapolating capabilities; needs more data than hard models; does not provide as deep an understanding as hard models
Representations
Notation – Columns vs. rows: Columns contain variables, rows contain objects/samples.
Comparability: It is important that a variable is comparable across objects (that it has the same meaning for different objects).
• Sampling point representation (SPR)
  o Works fine as long as there is no confusion about whether point i in one curve has the same meaning as point i in another curve
    ▪ Ex: Problem if one profile is shifted or deformed with respect to the other profile
Method and model selection
• Pre/post-processing: Remove noise and non-linearity
• Unsupervised: Looks for naturally occurring patterns in the data
• Supervised: Find a relationship between external information (response) and input data. Models are created such that prediction of the external information may be done
  o Regression: External information consists of real values (for example concentration)
  o Classification: External information consists of categorical variables (yes/no, cancer/non-cancer)
Linear regression
Definition – linear equations: A model is linear if it can be written on the form
$q = \beta_0 + \beta_1 f_1(x_1) + \dots + \beta_n f_n(x_n)$, where $f_j(x_j)$ itself can be non-linear (it has to be independent of $\beta_j$ though). For example, the equation $q = \beta_0 + \beta_1 x_1^2 + \beta_2 \log(x_2)$ is linear, but $q = \beta_0 \log(x + \beta_1)$ is not (because we are not able to write it as $\beta_j f_j(x_j)$).
The idea behind the least squares method is to minimize the squared errors, i.e. make $R = e^T e$ as small as possible, where $e = y - \hat{y}$ and $\hat{y} = Xb$. This can be done by solving $\partial R / \partial b = 0$ (just remember: if $y = x^T A x$, then $\partial y / \partial x = 2 A x$). This gives us $b = (X^T X)^{-1} X^T y$, and thus $\hat{y} = X b = X (X^T X)^{-1} X^T y = H y$. Alternatively, we could use the geometric argument that $X^T e = 0$ and derive the same equation. The method also works for several y-variables (when we have a Y matrix instead of a y vector), and we then get multiple linear regression (MLR): $B = (X^T X)^{-1} X^T Y$.
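A minimal numpy sketch of the normal-equation solution above (the data are made up for illustration, and X is assumed to already contain a column of ones for the intercept):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])    # intercept + 2 variables
y = X @ np.array([1.0, 2.0, -0.5]) + 0.1 * rng.normal(size=20)  # noisy response

b = np.linalg.solve(X.T @ X, X.T @ y)   # b = (X^T X)^{-1} X^T y
y_hat = X @ b                           # fitted values, y_hat = H y
residuals = y - y_hat                   # inspect these against the assumptions below
```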
We make several assumptions about the residuals $e_i = y_i - \hat{y}_i$:
• Normally distributed with zero mean
• Independent
• Same variance
• Homoscedasticity: Residuals are independent of the magnitude of the y-response (the opposite is called heteroscedasticity)
  o Plot $e_i$ against y to look for heteroscedasticity
A normal probability plot of the residuals can be used to see if the residuals are normally distributed. If not, something may be wrong.
We cannot use linear regression if we have collinearity, which happens when the columns in X are linearly dependent. The reason is that the determinant of $X^T X$ is then 0, so it cannot be inverted. This is problematic because chemical data are often correlated, but PCR and PLS come to the rescue.
Experimental Design
Methods to get maximum information about our system using a minimum of experiments.
Simultaneous design: Create experimental settings before performing the experiment.
Sequential design: Optimize properties of an ongoing process – obtain information from the
last experiment(s) to decide on the next.
Alternatives to experimental design:
• Ad hoc experimentation
  o Based on the outcome of one experiment, the next is decided
  o Uses expertise to decide the best conditions for testing out the problem
  o Problems with understanding the system
  o Variability can interfere with the interpretation from one experiment to the next
  o Most likely: The optimal solution is not found
• One variable at a time (OVAT)
  o Vary each variable separately
  o Assumption: All variables are independent
    ▪ Often not the case, so the true optimum may not be found
  o Many more experiments than necessary are used
Full factorial design
No. of experiments: k^N, where we have N factors in a k-level design (e.g. k = 2: high and low). Each observed effect then comes from one factor only. In addition to the main effects, we can also estimate the effects of all interactions in a full factorial design.
To find the effect of a factor: Observe how the response changes when going from a low to
high value of the factor (given that all other factors do not change, otherwise it will be
contaminated)
Remember: Always perform experiments in a random order, never in the actual order as
shown in the design matrix. This is to minimize the effect of unknown factors confounded
with the ordering.
Yates algorithm
Need the experiments in standard order. Notation:
• (1): [-,-,-] (A=-, B=-, C=-)
• ab: [+,+,-] (A=+, B=+, C=-); this row corresponds to the AB effect
The sequence of experiments in standard order is built as follows:
• Start with (1)
• Then a, b and ab
• Multiply the previous runs with c: c x (1) = c, c x a = ac etc.
• Multiply the previous runs with d: d x (1) = d, d x a = ad etc.
• Continue for all factors
For 4 factors, standard ordering looks like this: (1), a, b, ab, c, ac, bc, abc, d, ad, bd, abd, cd,
acd, bcd, abcd.
Yates algorithm calculates all effects (main and interaction) for N factors. Input is a vector of response values in standard order. The algorithm fills in a matrix with N + 2 columns; the first column contains the response values and the last column the calculated effects. The pairwise sum-and-difference ("arrow") procedure has to be done N times, followed by a division by 2^(N-1) (except the first element, which is divided by 2^N).
How to remember: Start the "arrow" procedure from the bottom.
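A minimal sketch of the Yates algorithm as described above (pairwise sums in the top half of each new column, pairwise differences in the bottom half, repeated N times):

```python
import numpy as np

def yates(y, n_factors):
    """Yates algorithm: y holds the responses in standard order (length 2**n_factors).
    Returns the grand mean followed by the effects in standard order (A, B, AB, C, ...)."""
    col = np.asarray(y, dtype=float)
    for _ in range(n_factors):
        pairs = col.reshape(-1, 2)                         # consecutive pairs
        col = np.concatenate([pairs.sum(axis=1),           # sums on top
                              pairs[:, 1] - pairs[:, 0]])  # differences below
    effects = col / 2 ** (n_factors - 1)                   # main and interaction effects
    effects[0] = col[0] / 2 ** n_factors                   # first element is the mean
    return effects
```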
Fractional factorial design
No. of experiments: k^(N-p), where p is the number of generators (the full design is halved p times). Idea: Not all higher order interactions are important. Consequence: Confounding – the effect we observe may be contaminated by other effects. Columns in the design matrix that are confounded with each other are aliases. These columns have the same design values. If p increases, the number of aliases also increases.
Rule: A main effect should never be confounded with anything less than a 3-factor interaction. For example, for a 2^(4-1) design, D should be confounded with ABC; D = ABC is called a generator. From this, the defining contrast ABCD = 1 (which can be used to find which effects are confounded) may be deduced. From the design matrix, we know that A² = B² = C² = D² = ABCD = 1.
Resolution is the length of the shortest defining contrast. From a 2^(5-2) design, we get three defining contrasts: ABCD = ADE = BCE = 1. Thus the resolution is R = III.
Effects and regression
Instead of using the Yates algorithm, one can use regression where the X matrix is based on the design matrix. We then get the effects from the MLR equation $b = (X^T X)^{-1} X^T y$. The values are scalable to the ones calculated from the Yates algorithm:

Regression coefficient = Factor effect / Factor range

In a 2-level design, the factor range is 2 (from -1 to +1). The mean response, however, is the same (remember the division by 2^N for the mean in the Yates algorithm).
Returning to the original variables
Problem: Have a regression model from a design matrix. How can we transform it into the
original variables?
Assume that the coded variable y is related to the original variable x by y = ax + b. Inserting the low (L) and high (H) values of x should give the coded levels -1 and +1:

+1 = aH + b
-1 = aL + b

Solving these equations gives a = 2/(H - L) and b = -(H + L)/(H - L). These values can be inserted into y = ax + b.
Multilevel and constrained designs
Two-level designs are only able to capture linear relationships (lines, planes), while multilevel designs can also model polynomial relationships. Usually, only models up to second order (quadratic models) are used.
A multilevel design is performed to estimate the quadratic regression coefficients in the following model (for two factors x1 and x2): $\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + b_{11} x_1^2 + b_{22} x_2^2 + b_{12} x_1 x_2$.
Sometimes various constraints limit the possible experiments we can perform. Idea: We want to find the optimal experiments given the constraints. Example: Cooking meat
• Marinating time: [6, 18] (hours)
• Steaming time: [5, 15] (min)
• Frying time: [5, 15] (min)
However, frying + steaming time must lie in [16, 24] min (which is a multilinear constraint).
Constraint problems: The shape of the experimental region can be very complex, and constrained designs are not orthogonal.
• Orthogonality ensures that all effects can be studied independently
• Orthogonality ensures minimal error on the estimated regression coefficients
D-optimal design
How can we select the smallest number of experiments that are suitable given a constrained
variable space? The D-optimal condition can be used.
A subset of n objects from the N possible objects in the experimental space is selected such that it produces a model matrix X where the determinant of the dispersion matrix, det[(X^T X)^{-1}], is minimized. This is the same as maximizing the determinant of the covariance matrix, det(X^T X). For a constant design size, one can say that the higher det(X^T X) is, the closer to orthogonality the dispersion matrix (X^T X)^{-1} is. It is important that the same number n is extracted each time.
Since (X^T X)^{-1} does not contain any information about the response, the quality of the regression model depends only on the design.
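A brute-force sketch of the D-optimal criterion (practical implementations use exchange algorithms; `candidates` is a hypothetical matrix of allowed experiments, one row per candidate point in model-matrix form):

```python
import numpy as np
from itertools import combinations

def d_optimal_subset(candidates, n):
    """Pick the n candidate rows that maximize det(X^T X),
    i.e. minimize the determinant of the dispersion matrix (X^T X)^{-1}."""
    best_det, best_idx = -np.inf, None
    for idx in combinations(range(len(candidates)), n):
        X = candidates[list(idx)]
        d = np.linalg.det(X.T @ X)
        if d > best_det:
            best_det, best_idx = d, list(idx)
    return best_idx, best_det
```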
Simplex
Uses n + 1 data points in a simplex structure, where n is the number of dimensions. The
simplex structure is moved through a series of mirror reflections and contractions. The main
idea is to mirror reflect the simplex structure away from the point with the worst response.
We want to reflect away from the worst point. If the new point is the worst point in the new
simplex, we get oscillation. A solution is to try to reflect from the second worst point in the
original simplex.
Rules:
1. Except at initialization, only one vertex is added (and one removed) at each stage
2. Reflect from the worst point w to make the new point r
3. If the new point r is the worst point in the new simplex, we get oscillations
   a. Reflect from the second worst point instead
4. If the same vertex is kept in k + 1 simplex moves without being discarded, re-evaluate the response at that vertex
5. Punish with a bad value if the simplex moves out of range
Response surface
After a fractional factorial design has been performed to find what factors are important AND
a full factorial design has been performed on the most important factors, then multilevel
design and response surfaces are used for final optimization.
Based on the sign and magnitude of all eigenvalues λ of the fitted quadratic model, the critical point is classified (a small numerical check is sketched below):
• If all λ's are negative, we have a maximum point
• If all λ's are positive, we have a minimum point
• If the signs of the λ's differ, we have a saddle point
  o Can have N – 1 different saddle points, where N is the no. of factors
• If one or more of the λ's are zero, we do not have a critical point
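A minimal numpy check of these rules; the B matrix below is a made-up example holding the quadratic coefficients for two factors ($b_{11}$, $b_{22}$ on the diagonal, $b_{12}/2$ off the diagonal):

```python
import numpy as np

B = np.array([[-2.0, 0.3],     # hypothetical fitted quadratic coefficients
              [ 0.3, -1.0]])
eigenvalues = np.linalg.eigvalsh(B)

if np.any(np.isclose(eigenvalues, 0)):
    kind = "no critical point"
elif np.all(eigenvalues < 0):
    kind = "maximum"
elif np.all(eigenvalues > 0):
    kind = "minimum"
else:
    kind = "saddle point"
print(eigenvalues, kind)
```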
Signal processing (preprocessing)
Centering and scaling
The data we want to analyze may contain noise etc. We want to remove this unwanted variation. Centering is based on the idea that there are offsets. We column-wise center a matrix X by subtracting the column means: $X_c = X - 1 m_c^T$, where $m_c = \frac{1}{n} X^T 1$ and n is the number of rows in X. Alternatively, it can be written $X_c = X - Z = X - \frac{1}{n} 1 1^T X$, where Z is the offset matrix. We want to center if:
• It increases the fit of the data
• It reduces numerical problems for the analysis algorithms
Normalization
Sometimes intensity shifts are undesirable, for example when different amounts of sample are used. Normalization can solve this problem. If X is the data matrix, then the normalized matrix is $Z = [\mathrm{diag}(k)]^{-1} X$, where k is a vector with the normalization factors (one per row). If we want to column normalize, we use the transpose of the data matrix, $Z_t = [\mathrm{diag}(k)]^{-1} X^T$, and then transpose back: $Z = Z_t^T$.
Autoscaling
Autoscaling gives each column an equal chance to participate in the modelling process. Autoscaling means subtracting the mean and dividing by the standard deviation (as when standardizing a normal distribution). Be careful though; for spectra the intensities themselves carry information, so autoscaling may not be appropriate. We want to autoscale when we have very different variables.
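A minimal numpy sketch of the three preprocessing operations; using the row sums as normalization factors is just one possible choice of k:

```python
import numpy as np

def center(X):
    """Column-wise centering: X_c = X - 1 m_c^T."""
    return X - X.mean(axis=0)

def row_normalize(X):
    """Z = diag(k)^{-1} X, here with the row sums as normalization factors."""
    k = X.sum(axis=1)
    return X / k[:, None]

def autoscale(X):
    """Center each column and divide by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
```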
The time domain
Convolution is when you "slide" one vector over another (see figure). Depending on the shape of the filter, convolution can perform:
• Smoothing
• Deformation
• Differentiation

Discrete formula: $g(t) = f(t) * h(t) = \sum_{m=-\infty}^{\infty} f(m)\, h(t - m)$

Continuous: $f(t) * h(t) = \int_{-\infty}^{\infty} f(\tau)\, h(t - \tau)\, d\tau$

A function convolved with itself repeatedly will converge to a Gaussian function.
Different kinds of filtering:
Mean smoother
Running mean smoother (moving average): $g(i) = \frac{1}{2m+1} \sum_{j=-m}^{m} f(i + j)$, where 2m + 1 is the number of points in the window.
Problem with the moving average: Broadening of spikes.
Solution: Running median.
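A small sketch of both smoothers, using numpy convolution for the running mean and scipy's median filter for the running median:

```python
import numpy as np
from scipy.ndimage import median_filter

def moving_average(signal, m):
    """Running mean with a window of 2m + 1 points (broadens spikes)."""
    window = np.ones(2 * m + 1) / (2 * m + 1)
    return np.convolve(signal, window, mode="same")

def running_median(signal, m):
    """Running median with a window of 2m + 1 points (robust to spikes)."""
    return median_filter(signal, size=2 * m + 1)
```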
Numerical differentiation
Savitzky-Golay (SG) is a way to numerically smooth and differentiate signals. Idea: Fit a polynomial of k'th order to N data points within a convolving window that moves along the signal. Find the estimated y-value at z = 0 in the scaled coordinates (if N is odd, z = 0 lies in the middle of the window). In other words, we use the model (polynomial) to predict the value at position z = 0.
The polynomial has the form $y = a_0 + a_1 z + a_2 z^2 + \dots + a_k z^k$, so $y(z = 0) = a_0$. From this, the filter vector to be convolved with the signal is found.
For differentiation: $\frac{dy(z)}{dz} = a_1 + 2 a_2 z + \dots + k a_k z^{k-1}$, so $\left(\frac{dy}{dz}\right)_{z=0} = a_1$. This gives a filter vector that can be convolved with the signal to perform first order numerical differentiation.
What size of window? For first order differentiation: little noise → small window.
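A short example using scipy's Savitzky-Golay implementation for smoothing and first-order differentiation (the window length and polynomial order are arbitrary choices for illustration):

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0, 10, 200)
signal = np.exp(-(x - 5) ** 2) + 0.02 * np.random.randn(200)   # made-up noisy peak

smoothed = savgol_filter(signal, window_length=11, polyorder=3)
derivative = savgol_filter(signal, window_length=11, polyorder=3,
                           deriv=1, delta=x[1] - x[0])          # first derivative
```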
The Fourier domain
Idea: Describe the signal in terms of different frequency components and their amplitudes.

Fourier series: $f(t) = a_0 + \sum_{n=1}^{\infty} \left[ a_n \cos(n\omega t) + b_n \sin(n\omega t) \right]$, where ω is the angular frequency, $\omega = \frac{2\pi}{T}$, for a period T = 2L. We have that:

$a_0 = \frac{1}{2L} \int_{-L}^{L} f(t)\, dt$

$a_n = \frac{1}{L} \int_{-L}^{L} f(t) \cos(n\omega t)\, dt$

$b_n = \frac{1}{L} \int_{-L}^{L} f(t) \sin(n\omega t)\, dt$

Fourier transform: $f(\omega) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$

Inverse FT: $f(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} f(\omega)\, e^{i\omega t}\, d\omega$
The observed signal can be composed of the underlying chemical signal and noise. We assume that the frequency distribution of the noise is different from that of the chemical signal; it is often assumed that the chemical signal is dominated by low frequencies.
Noise can be removed with a low pass filter L(ν). The ideal low pass filter cuts off all frequencies above a certain threshold and lets the remaining lower frequencies pass unchanged. From the figure, we see that lowpass filter 1 is a box. The filtering is (usually) performed as a multiplication in the frequency domain (where complex values are used): $F_{smooth}(\nu) = F(\nu) L(\nu)$. The smoothed function in the time domain is obtained by running the inverse FT on $F_{smooth}(\nu)$.
The ideal high pass filter is the opposite of the low pass filter: it only lets through frequencies that are higher than a threshold.
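A minimal sketch of ideal ("box") low-pass filtering with the FFT; the cutoff value is an arbitrary illustration choice:

```python
import numpy as np

def lowpass_smooth(signal, cutoff):
    """Multiply the spectrum by a box low-pass filter and transform back."""
    spectrum = np.fft.rfft(signal)                 # F(nu), complex values
    freqs = np.fft.rfftfreq(len(signal))           # frequencies in cycles per sample
    spectrum[freqs > cutoff] = 0.0                 # remove the high frequencies
    return np.fft.irfft(spectrum, n=len(signal))   # smoothed signal in the time domain
```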
Unsupervised analysis
PCA
Idea: Project from higher dimension to lower to be able to visualize. It is unsupervised in the
sense that it does not care about the response, only the data itself (the X matrix, not the y).
From book: Unsupervised are methods used for data exploration where one is looking for
naturally occurring patterns in the data.
We want to make latent variables, which are linear combinations of the original variables. For example, the latent variable "overall body size" could be formed from the original variables "height", "weight" and "shoe size".
The new latent variable axis points in the direction of maximum variance. PC1 is the first component; PC2 is the second and is orthogonal to PC1. PC2 points in the direction of maximum variance not explained by PC1. We need not use all the components; we use as many as needed until we are satisfied with the explained variance (the last components are said to be related to noise).
Originally: In a data matrix, it is often the case that:
• Columns contain variables
• Rows contain objects or samples
PCA models: $X = T P^T + E$ (bilinear, since the matrix is written as a product of two other matrices).
The PCA equation above can be derived using Lagrange multipliers, where we want to find a vector t = Xp such that the variance of t is maximized (with p constrained to unit length).
Scores (T) are the coordinates of the objects in the new latent axes (principal component axes in this case). The score plot can be used to:
• Detect clusters
• Detect (possible) outliers (plotting PC1 vs. PC2, PC1 vs. PC3 etc.)
  o Can also plot leverage versus residual X-variance
• Find patterns/trends
• See what the different PC-axes separate (tumor vs. non-tumor etc.)
Loadings (P) determine how much each variable influences a latent variable: $t_i = \sum_{j=1}^{M} p_j x_j$.
Loadings can be used to detect:
• Which variables contribute to each PC
• Correlated variables (by looking for clusters)
• Negatively correlated variables (which lie opposite each other along a PC). If the angle between two variables (seen from the origin) is close to 180°, they are strongly negatively correlated
• Unimportant variables (clustered around the origin)
Singular value decomposition (SVD): Gives the same numerical results as PCA, but the algorithm is a bit different. Here we write $X = U S V^T$, where U is a column orthonormal matrix, S is a diagonal matrix and $V^T$ is a row orthonormal matrix.
NIPALS: NIPALS (non-linear iterative partial least squares) starts with a trial vector t for the scores and iteratively finds the loading vector. The steps are (for i = 1:Amax, where Amax is the maximum number of PCs extracted; a minimal sketch follows the list):
1. Project E onto t to find the vector w
2. Scale w to length 1
3. Project E onto w to find a new vector t
4. Check for convergence
5. Set the scores vector equal to the newest vector t, and the loadings vector p equal to w
6. Remove the estimated PC component from E
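A minimal NIPALS sketch following the steps above (X is assumed to be column-centered):

```python
import numpy as np

def nipals_pca(X, n_components, max_iter=500, tol=1e-10):
    """Return score and loading matrices T and P for a centered matrix X."""
    E = X.astype(float).copy()
    T, P = [], []
    for _ in range(n_components):
        t = E[:, 0].copy()                       # trial score vector
        for _ in range(max_iter):
            w = E.T @ t / (t @ t)                # project E onto t
            w /= np.linalg.norm(w)               # scale w to length 1
            t_new = E @ w                        # project E onto w
            if np.linalg.norm(t_new - t) < tol:  # convergence check
                t = t_new
                break
            t = t_new
        T.append(t)
        P.append(w)
        E -= np.outer(t, w)                      # deflate: remove the estimated PC
    return np.array(T).T, np.array(P).T
```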
Data analysis with PCA
1. Plot the raw data
2. Initial exploration (with PCA)
a. Make sure to use a sufficient number of PCs
3. Further exploration
a. Search for outliers, groups, trends etc.
b. Look for patterns in the score plot
c. Should try to estimate the true no. of components in the data by using
validation methods
4. Interpretation
a. Combine the information from the previous stage and our external information
about the problem
Cluster analysis
Goal: Finding natural patterns/clusters in a data set. Different ways to calculate proximity:

• Euclidean: $d_{ij}^{(E)} = \left( \sum_{k=1}^{N} (x_{ik} - x_{jk})^2 \right)^{1/2}$

• Manhattan ("taxi"): $d_{ij}^{(M)} = \sum_{k=1}^{N} |x_{ik} - x_{jk}|$

• Minkowski: $d_{ij}^{(M(p))} = \left( \sum_{k=1}^{N} |x_{ik} - x_{jk}|^p \right)^{1/p}$ (if p = 2, we have the Euclidean distance)

• Mahalanobis: $(d_{KB}^{(Mahal)})^2 = (x_K - x_B)^T C^{-1} (x_K - x_B)$, where $C^{-1}$ is the inverse covariance matrix (if C = I, we have the Euclidean distance)

Even though the Euclidean distance from C to D is shorter than from A to B, the Mahalanobis distance A-B is smaller than C-D because A and B are oriented along the same direction.
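The distance measures above as small numpy functions (x and y are observation vectors, C an estimated covariance matrix):

```python
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, C):
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(C) @ d))
```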
Ward's method:
Idea: Use the sum of squares between cluster centers and members of the clusters as the criterion to merge clusters (instead of using cluster distance).

Total within-cluster error sum of squares for cluster m: $E_m = \sum_{i=1}^{n_m} \sum_{j=1}^{M} \left( x_{ij}^{(m)} - \bar{x}_j^{(m)} \right)^2$

Adding the within-cluster error sum of squares over all clusters, we get $E_{tot} = \sum_{m=1}^{K} E_m$, where K is the number of clusters. If we merge two clusters A and B, we are left with K - 1 clusters.

Ward criterion: We want the difference between $E_{tot}^{(1)}$ (after merging) and $E_{tot}^{(0)}$ (before merging) to be as small as possible: $\Delta = E_{tot}^{(1)} - E_{tot}^{(0)}$. We need to check the difference for every possible merging of two clusters; the pair that gives the lowest value is merged. The number of possible pairs of clusters is K(K - 1)/2.
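Ward's method is rarely coded by hand; a short example using scipy's hierarchical clustering with the Ward criterion (the data matrix is made up for illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.randn(30, 4)                        # made-up data, 30 objects
Z = linkage(X, method="ward")                     # merge history, smallest SSE increase first
labels = fcluster(Z, t=3, criterion="maxclust")   # cut the dendrogram into 3 clusters
```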
k-means clustering
Algorithm:
• Select the number of clusters K ≤ Kmax to look for
• Start by creating K random cluster centers, m_k
• For each object x_j, assign it to the cluster center it is nearest to
• Re-compute the center points m_k for the new clusters and iterate to convergence
This procedure minimizes the within-cluster variance (a minimal sketch is given below).
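A minimal k-means sketch following the algorithm above (empty clusters are not handled):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Return cluster labels and centers for the rows of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]        # random initial centers
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # nearest center for each object
        new_centers = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centers, centers):                 # converged
            break
        centers = new_centers
    return labels, centers
```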
Problem: We have to decide the number of clusters to look for beforehand. Solution: gap statistics.
If we want to estimate the optimal number of clusters K*, we do the following:
• Compute k-means for K = 1, 2, ..., Kmax
• Compute the mean within-cluster variance, W_K, for each choice of K
• The variance W_K will generally decrease with increasing K
When K < K*, we expect a significant decrease of the within-cluster variance: W_{K+1} << W_K. When K > K*, the decrease in variance will be less evident, which means there will be a flattening of the W_K curve. A sharp drop may therefore be used to find the optimal number of clusters – this is the basis for gap statistics.
Gap statistics compare the curves $\log(W_K^{simul})$ and $\log(W_K^{data})$, where "simul" represents simulated (reference) data over the given data region. The optimal number of clusters is where the gap between these two curves is largest.
Supervised analysis
Latent variable based regression
Multivariate calibration involves finding a relation between two data matrices X and Y, where
X contains the independent variables (samples with variables we choose as we like) and Y
contains the dependent variables (dependent on what has been chosen in X).
X and Y are related through a regression relationship. The regression process consists of two
main steps: Calibration (learning, training) and prediction (testing). In general, at least 50% of
the data should be used for calibration, maybe even 80%.
Be aware that predictions are only valid for objects which resemble those in the calibration set (for example, the fraction of cancerous cells should be equal in both the training and the test set).
PCR (Principal component regression)
Problem with MLR: More variables than objects means that $X^T X$ cannot be inverted.
Idea: Use regression on latent variables instead of the original variables (thus reducing the number of variables).
Project the samples X onto a new basis W such that we can use the scores from these projections: T = XW. If T equals the PCA scores matrix, we have PCR. The PCR process is as follows:
For an optimal number of PC components: $X = T P^T + E$ and $Y = T Q^T + F$, where $Q^T = (T^T T)^{-1} T^T Y$ is the regression coefficient matrix. We know that $T^T T$ is invertible (and also that T is orthogonal), which solves our initial problem with inversion. However, we want regression coefficients expressed in terms of the original variables:
$Y = T Q^T = (X P) Q^T = X P (T^T T)^{-1} T^T Y = X B_{PCR}$, where $B_{PCR} = P (T^T T)^{-1} T^T Y = P P^T B_{MLR} = H_P B_{MLR}$.
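A minimal PCR sketch (PCA via the SVD, then regression on the scores; X and Y are assumed to be column-centered):

```python
import numpy as np

def pcr(X, Y, n_components):
    """Return the PCR regression coefficients B_PCR in the original variables."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    P = Vt[:n_components].T                     # PCA loadings
    T = X @ P                                   # scores, T = XP
    Qt = np.linalg.solve(T.T @ T, T.T @ Y)      # Q^T = (T^T T)^{-1} T^T Y
    return P @ Qt                               # B_PCR = P (T^T T)^{-1} T^T Y

# Prediction for new, centered data: Y_hat = X_new @ B_pcr
```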
PLSR (Partial least squares regression)
Problem: In PCR we performed latent variable projection in X whether or not it is relevant for
the prediction in Y.
Idea: In PLSR we find latent variables for X that are directly relevant for prediction of Y
Use two different NIPALS blocks (one for X and one for Y). In the X-block:
1. Use a column in Y as the starting vector u
2. Compute the weight vector w (by projecting X onto u) and scale it to length 1
3. Calculate t = Xw
4. Calculate p^T = t^T X / (t^T t)
In the Y-block:
1. Use the score vector t from the X-block
2. Compute the q vector (by projecting Y onto t) and scale it to length 1
3. Calculate u = Yq
4. Use the new u-vector in the X-block
This process is repeated until convergence is reached. The X and Y spaces are then updated (deflated).
Some properties:
• The scores matrix T is orthogonal and (in general) not equal to the PCA scores matrix
• The loadings matrix P is not orthogonal, unlike in PCA
Equations for PLSR with A factors (T = T_A etc.):
$X = T P^T + E$
$Y = U Q^T + F$
where T is the scores matrix for X and U is the scores matrix for Y. There is an inner relation between these scores: U = TG, where G is a diagonal matrix. We can then write $X = T P^T + E$ and $Y = T C^T + F$, where $C^T = G Q^T$.
We can find the regression coefficients by using that EW = 0:
$X = T P^T + E \Rightarrow X W = T P^T W + E W \Rightarrow T = X W (P^T W)^{-1}$
Inserting this into $Y = T C^T$, we get $Y = X W (P^T W)^{-1} C^T = X B_{PLS}$.
Remark: If we have extracted the maximum number of possible factors, A = Amax, then B_PLSR = B_MLR (because then we are using the full X-space to create the model, and not a well-suited subspace of X).
Lagrange multiplier approach: Want to find a vector t in the column space of X, t = Xw,
and a vector u in the column space of Y, u = Yq, such that the squared covariance between t
and u is maximized.
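A minimal PLS1 sketch (one y-variable, so no inner iterations are needed; X and y are assumed to be centered). It uses the regression-coefficient formula above, with the y-loadings collected in a vector c:

```python
import numpy as np

def pls1(X, y, n_factors):
    """Return the PLS1 regression coefficients b for centered X and y."""
    E, f = X.astype(float).copy(), y.astype(float).copy()
    W, P, c = [], [], []
    for _ in range(n_factors):
        w = E.T @ f
        w /= np.linalg.norm(w)          # weight vector, scaled to length 1
        t = E @ w                       # X-scores
        p = E.T @ t / (t @ t)           # X-loadings
        coef = (f @ t) / (t @ t)        # inner (y) coefficient for this factor
        E -= np.outer(t, p)             # deflate X
        f -= coef * t                   # deflate y
        W.append(w); P.append(p); c.append(coef)
    W, P, c = np.array(W).T, np.array(P).T, np.array(c)
    return W @ np.linalg.inv(P.T @ W) @ c   # b = W (P^T W)^{-1} c
```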
Different PLS algorithms:
• PLS1: The Y matrix has only 1 column vector
  o No iterations for each PLS factor extracted
• PLS2: The Y matrix has more than 1 column vector
  o Several iterations for each PLS factor extracted
When to use PCR, PLS1 and PLS2?
• PCR: It is a subset of PLS, so you do not have to use it
  o PLSR will in general produce models with fewer factors
• One Y-variable: Use PLS1
• Multiple Y-variables:
  o If there is much covariance between the Y variables, use PLS2
  o Otherwise, use a separate PLS1 model for each Y variable
Validation
Prediction error: Assume we have a validation set $X_{val}$ and $y_{val}$, and the model from the calibration is applied to obtain $\hat{y}_{val}$. The root mean squared error of prediction (RMSEP) is then

$RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (y_{val,i} - \hat{y}_{val,i})^2}{n}}$

A plot of RMSEP can be used to find the number of PLS factors needed (too many PLS factors results in overfitting and modelling of noise, which is unwanted). The optimal number of PLS factors is where the RMSEP is at its minimum.
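RMSEP as a small helper function:

```python
import numpy as np

def rmsep(y_true, y_pred):
    """Root mean squared error of prediction."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# Typical use: compute RMSEP on a validation set for models with 1..A_max
# PLS factors and choose the number of factors at the minimum of the curve.
```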
Resampling methods
Cross-validation and bootstrapping: Use the same data many times in different ways. Typically used when we are not able to set aside a large enough validation/test set (which may be the case if experiments are expensive or time consuming).
Idea: Remove a part of the data to simulate an independent validation/test set. Then the data is put back and another part is extracted (in bootstrapping, the same object can be drawn multiple times).
Steps in cross-validation (CV):
• Take out k objects from a data set with n objects
• Create a calibration model on the remaining n - k objects
• Use the k objects as a validation set
• Record the prediction error
• Put the k objects back and draw another k objects
• Repeat the process
If k = 1 we have Leave One Out (LOO) cross-validation.
Problem with CV: When the model constructed for a CV segment is based on few objects, it performs worse than a model based on all objects.
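A minimal segmented cross-validation sketch; `fit` and `predict` are placeholders for whichever calibration method is used (for example the PLS1 sketch above):

```python
import numpy as np

def cv_rmsep(X, y, fit, predict, k=1):
    """Leave out k objects at a time, refit, and return the CV prediction error."""
    n = len(y)
    indices = np.arange(n)
    sq_errors = []
    for start in range(0, n, k):
        val = indices[start:start + k]         # validation segment
        cal = np.delete(indices, val)          # calibration objects
        model = fit(X[cal], y[cal])
        y_hat = predict(model, X[val])
        sq_errors.extend((y[val] - y_hat) ** 2)
    return np.sqrt(np.mean(sq_errors))
```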
Permutation tests: The idea is to perform a random permutation (randomization) of the dependent variable Y in order to destroy the relationship between the X- and Y-spaces. We then have a data set where we know there is no relationship between X and Y, so there is no structure except noise. The prediction error using the real Y matrix should then be much lower than the error obtained with the permuted set. We can then see how many PLSR factors (or PCA components) we should use such that the improvement over the "noise value" is significant at a significance level α.
Classification methods
Fisher’s Linear Discriminant Analysis (LDA)
Idea: We want to find a new coordinate system that minimizes the within-class variance and maximizes the between-class variance. In other words, we want to find new coordinates which produce tight clusters that are far from each other. These will be the latent variables best suited to separate the classes, $z_i = \sum_{j=1}^{M} X_{ij} r_j$, i.e. $z = X r$, where r is a direction in the new DFA coordinate system (analogous to a PC). We want to find $F_{max} = \arg\max_r \left( \frac{r^T B r}{r^T W r} \right)$, where $r^T B r$ is the between-class variance and $r^T W r$ is the within-class variance, with the constraint that $r_i^T W r_j = \delta_{ij}$.
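A sketch of Fisher LDA via the generalized eigenvalue problem $B r = \lambda W r$, using between- and within-class scatter matrices as the variance measures (labels is a numpy array of class labels):

```python
import numpy as np
from scipy.linalg import eigh

def fisher_lda(X, labels, n_directions=1):
    """Return the discriminant directions r that maximize r^T B r / r^T W r."""
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    n_vars = X.shape[1]
    B = np.zeros((n_vars, n_vars))              # between-class scatter
    W = np.zeros((n_vars, n_vars))              # within-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        B += len(Xc) * np.outer(mc - overall_mean, mc - overall_mean)
        W += (Xc - mc).T @ (Xc - mc)
    eigvals, eigvecs = eigh(B, W)               # generalized symmetric eigenproblem
    order = np.argsort(eigvals)[::-1]           # largest variance ratio first
    return eigvecs[:, order[:n_directions]]
```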
Characteristics of LDA:
• No assumption about the distribution is needed
• Can also be used as a postprocessing step
Prototype classification
These methods do not form a model, but keep a storage of objects with known classes (memory based methods).
k-means clustering: The k-means clustering method from unsupervised analysis can also be used for classification by providing prototype objects. The prototypes will be the cluster centers provided by the method.
• Assume we have a data set containing K classes with N_j objects in each class
• Perform k-means clustering with L_k clusters for each class data set, giving L_k cluster centers
• The L_k cluster centers for each class are prototype objects which are used to perform the classification
• When a new object is to be classified, we compute the distance from this object to each of the L_k cluster centers for each class
• The closest prototypes are inspected, and the new object is given the class membership corresponding to the class the nearest prototype belongs to
k-nearest neighbor classification (k-NN):
With k = 1, a new object is assigned the class label of the closest prototype.
k-NN: The k nearest neighbors of the new object are used in the assignment process. In the picture, k = 5 and the dotted lines indicate that the majority of the 5 nearest neighbors are red, and thus the new object (yellow) is assigned to the red class.
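A minimal k-NN sketch using Euclidean distance and a majority vote:

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, labels_train, x_new, k=5):
    """Assign x_new the majority class among its k nearest training objects."""
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(labels_train[i] for i in nearest)
    return votes.most_common(1)[0][0]
```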
Advantages and disadvantages of k-NN:
Advantages:
• Simple to understand
• Can handle complex decision boundaries (if k-NN can't classify the object, other methods will most likely also fail)
• Simple to implement
Disadvantages:
• Does not create a decision boundary model
• Difficult to generalize the result, since no mathematical model is created
• Difficult to extract which attributes are important for the classification
Decision trees
Idea: We want to find if-then-else rules.
In order to construct a decision tree, we want to find which attribute/variable is best for classification (this will decide the distribution of classes). To measure "purity" (less randomness), the entropy function is used: $Entropy(p) = E(p) = -\sum_{i=1}^{K} p_i \log_2 p_i$, where $p_i = \frac{\#\text{objects in class } i}{\#\text{all objects}}$. Lower entropy means lower uncertainty, which is the criterion for attribute selection. When building decision trees, we look for the attribute that gives the maximum reduction in entropy (the starting entropy can be found from the initial distribution of the classes).
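Small sketches of the entropy and of the entropy reduction (information gain) used for attribute selection; the inputs are assumed to be arrays of class labels and categorical attribute values:

```python
import numpy as np

def entropy(labels):
    """Entropy of a class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(labels, attribute):
    """Reduction in entropy from splitting on one categorical attribute."""
    labels, attribute = np.asarray(labels), np.asarray(attribute)
    n = len(labels)
    split_entropy = sum(
        (np.sum(attribute == v) / n) * entropy(labels[attribute == v])
        for v in np.unique(attribute)
    )
    return entropy(labels) - split_entropy
```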
Short trees are better than long trees (to avoid statistical coincidences) – Occam's razor.