Principal Component Analysis
Application of PCA to the study of gait
William Rose
It is interesting but difficult to compare EMG or kinematic patterns between groups, because
EMGs & kinematic data are complex, or “high-dimensional”, i.e. they have a lot of numbers: they
vary with time and by site. Pattern analysis tries to find patterns in complex data, and thus to reduce
complex data to a smaller set of numbers that can be quantitatively compared. Principal component
analysis (PCA) is one kind of pattern analysis.
Methods used by Hubley-Kozey et al., 2006
Data was collected from three quadriceps muscles, two hamstrings, and R & L gastrocnemii.
The analysis is described on p. 368. EMG data from each muscle is normalized by the EMG
amplitude from MVIC in that subject in that muscle. EMG data is time-normalized to 100% of one
gait cycle, thus there are 101 time points per EMG. Normalized EMGs from 5 walking trials per
subject were used to create an “average” normalized EMG for each muscle from each subject.
There were 78 subjects (38 normal, 40 with OA).
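The normalization steps above can be sketched in Matlab as follows. This is only a minimal sketch under stated assumptions: the MVIC value, the trial lengths, and the synthetic envelope are all made up for illustration, and linear interpolation with interp1 is my choice for the time normalization, not necessarily what the authors did.

```matlab
% Hypothetical inputs: five trials of one muscle's EMG envelope in one subject,
% each covering one gait cycle (sample counts differ), plus that muscle's MVIC
% amplitude. All values are synthetic.
mvic = 2.0;                               % assumed MVIC amplitude
lens = [480 510 495 502 520];             % samples per gait cycle in each trial
nPts = 101;                               % 0%..100% of the gait cycle
emgNorm = zeros(nPts, numel(lens));
for k = 1:numel(lens)
    tRaw = linspace(0, 100, lens(k))';                % percent of gait cycle
    raw  = (0.5 + 0.02*k) + 0.4*sin(2*pi*tRaw/100);   % synthetic envelope
    ampNorm = raw/mvic;                               % amplitude normalization by MVIC
    emgNorm(:,k) = interp1(tRaw, ampNorm, (0:100)');  % time normalization to 101 points
end
xj = mean(emgNorm, 2);    % "average" normalized EMG: one column of the data matrix
```

The vector xj plays the role of one column of the data matrix described in these notes.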
Pattern recognition was done on the quads as a group, so patterns were found to describe the
EMGs from all three quads muscles, assuming that the same pattern would fit each of the three
quads equally well. Patterns were also found for the two hamstrings grouped together and the two
gastrocs grouped together.
A matrix X is made which has 101 rows (one row per each time point) and 234 columns
(3x78) for quads, and a matrix with 156 columns (2x78) for hams, and a matrix with 156 columns
for gastrocs. The columns xj of each matrix are the average normalized EMGs from each subject for
those 2 or 3 muscles (separate column for each of the 2 or 3 muscles).
We will consider matrix X for quads, which is an m=101 row x n=234 column matrix. Matrix
$$C = X X^{T} / (n-1)$$
is the unbiased estimate of the covariance matrix if the rows of X are “de-meaned” first (mean of
each row subtracted from the elements in that row). This could be written
$$C = (X - \mu_{row}) (X - \mu_{row})^{T} / (n-1)$$
where $\mu_{row}$ represents the row means. (This is not the usual definition of the covariance matrix of a
matrix, because the order of transpose is reversed from usual.)
We find in Matlab that C1==C2, where C1=cov(X') and C2=Xzmr*Xzmr'/(M-1), where
Xzmr=X-repmat(mnXrows,1,M) is X with the mean of each row subtracted, mnXrows=mean(X,2), and
M=# columns of X, which equals # of subjects or # of trials. We use cov(X') instead of cov(X) in
order to get the NxN (time point by time point) covariance matrix instead of the MxM matrix. We
also find in Matlab that we get essentially the same results with
>> C1=cov(X'); [coefs1,latent1,pc_exp]=pcacov(C1);
and with
>> [coefs2,d]=eig(C2); coefs2=fliplr(coefs2);   % or eig(C1), since C1==C2
>> latent2=flipud(diag(d)); scores2=(coefs2'*Xzmr)';
and with
>> [coefs3,scores3,latent3]=princomp(X')
except latent3=constant*latent2. (However, latent1=latent2.)
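The equivalence between cov(X') and the hand-computed matrix can be checked without any toolbox functions. The sketch below uses synthetic data (an assumption), with N time points in rows and M observations in columns. It sorts the eigenvalues explicitly instead of relying on fliplr/flipud, since I would rather not assume the order in which eig returns them.

```matlab
% Synthetic data (an assumption): N time points in rows, M observations in columns.
N = 101; M = 12;
t = (0:N-1)'/(N-1);
X = zeros(N, M);
for j = 1:M
    X(:,j) = j*sin(2*pi*t) + cos(2*pi*j*t/3) + 0.1*j^2*t;
end
mnXrows = mean(X, 2);                     % mean across observations at each time point
Xzmr    = X - repmat(mnXrows, 1, M);      % zero-mean rows
C1 = cov(X');                             % N x N covariance, rows of X' are observations
C2 = Xzmr*Xzmr'/(M-1);                    % same matrix computed by hand
maxDiff = max(abs(C1(:) - C2(:)));        % should be at round-off level

[V, D] = eig((C2 + C2')/2);               % symmetrize to guard against round-off
[latent, order] = sort(diag(D), 'descend');
coefs  = V(:, order);                     % columns are the principal components
scores = (coefs'*Xzmr)';                  % M x N, one row of scores per observation
```

The eigenvalues in latent should sum to trace(C1), which is a quick sanity check on the decomposition.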
C is NxN=101x101. It will be symmetric, i.e. C(i,j)=C(j,i). The value of C(i,j) tells us the
covariance of the normalized EMG at time i with itself at time j, based on analysis of all 234 “input”
normalized EMGs. [I think the following sentence is wrong as written because it fails to take into
account the de-meaning; it holds if “power” is read as the variance about the mean at that time point:]
Another way of understanding the covariance matrix is that
C(i,j)=sqrt[average normalized EMG power at time i] × sqrt[average power at time j] × r ,
where r=correlation between EMG at time i and EMG at time j. (Now we can see why C is
symmetric: the covariance of EMG at time i with itself at time j will equal the covariance of the EMG
at time j with itself at time i.) If the EMGs are very reliable and repeatable across subjects, we will
expect the absolute values of covariance to be relatively large. For example, suppose the amplitude of
the normalized EMGs at time t=25% varies across subjects, sometimes positive, sometimes negative,
sometimes large, sometimes small, and suppose also that the EMG at t=75% in each subject is
always exactly the negative of the value at time t=25% in the same subject. Then we’ll get a
relatively large negative value for C(25,75), because the correlation r will = -1. Now consider a
different situation, in which the values of the normalized EMGs at times t=25 and t=75 are always
large, but sometimes they are large positive and sometimes they are large negative, and they aren’t
correlated. In this case the value of C(25,75) will be zero, because the correlation r = 0. Each
element along the main diagonal of C, i.e. the value C(i,i), is the variance (across subjects and across 3
muscles) of the 234 normalized EMGs at time i, i.e. their mean power after de-meaning. (The
correlation of a signal with itself at the same time point is unity.)
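The two situations described above can be tried with a few made-up numbers. In this sketch the eight values per time point are synthetic (an assumption) and already have zero mean across EMGs. The two time points are stacked as columns so that cov and corrcoef return 2x2 matrices.

```matlab
% Synthetic, already de-meaned values (assumptions) of the normalized EMG at two
% time points across 8 EMGs.
a25 = [ 3 -3  1 -1  2 -2  4 -4];         % values at t=25%
b75 = -a25;                              % case 1: exact negative at t=75%
c75 = [ 2  2 -1 -1 -3 -3  2  2];         % case 2: large values, uncorrelated with a25

Cneg  = cov([a25' b75']);                % 2x2 covariance matrix, case 1
Czero = cov([a25' c75']);                % 2x2 covariance matrix, case 2
covNeg  = Cneg(1,2);                     % large and negative
covZero = Czero(1,2);                    % zero
R = corrcoef([a25' b75']);
r = R(1,2);                              % correlation in case 1
varA = Cneg(1,1);                        % diagonal element: variance at t=25%
```

In case 1 the off-diagonal element equals minus the variance at t=25%, which is what sqrt[variance] × sqrt[variance] × r gives with r = -1.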
Now that we have the covariance matrix, we use eigenvector decomposition to find the
principal components. The idea is that we want to find the normalized EMG pattern t1 (a set of 101
numbers, corresponding to values at each % of the cycle) which, when scaled up or down by a
multiplicative factor (which will be set differently for each subject and each of the 3 quads), will
give the best fit to all 234 normalized EMGs. The first column of the “transform matrix” T is this
“best pattern”, where T is given by
$$C = T \Lambda T^{T} .$$
All the matrices in the equation above are 101x101. The second column of T (t2, another 101 point
long vector) is the second best pattern, i.e. the EMG pattern which, when appropriately scaled for
each particular EMG, makes the greatest improvement in the predictions of the EMGs, after t1.
Altogether there are 101 patterns (101 columns in T), each of which is the next best pattern, after
those which have come before. T is also called the matrix of eigenvectors. The authors call them
“orthonormal eigenvectors”, which is redundant, since the eigenvectors of a symmetric matrix such as
C are orthogonal to each other (or can be chosen to be), and are typically normalized to have
length=1. The matrix $\Lambda$ is diagonal, i.e. it is zero except along the main diagonal. The diagonal
elements of $\Lambda$ are the eigenvalues. The first eigenvalue, i.e. the 1,1 element of $\Lambda$, tells us the
variance, or power, in the normalized EMGs due to the first pattern, t1. Since t1 is by definition the
best pattern, it accounts for more of the power or variance than any other pattern, and so its
eigenvalue, $\lambda_{11}$, will be the largest eigenvalue. $\lambda_{22}$, which is the power accounted for by the next
best pattern, is the next largest eigenvalue, then $\lambda_{33}$, and so on.
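The properties of the decomposition can be checked numerically on a toy matrix. The sketch below uses a made-up 6x5 data matrix (an assumption) and verifies that C is reproduced by the eigenvectors and eigenvalues, that the eigenvectors are orthonormal, and that the sorted eigenvalues give each pattern's share of the total variance.

```matlab
% Small synthetic example (an assumption): 6 time points in rows, 5 EMGs in columns.
X = [ 1 2 0 3 1;
      2 1 1 4 0;
      4 3 2 6 1;
      3 5 1 5 2;
      1 2 3 2 0;
      0 1 1 1 2];
n    = size(X, 2);
Xzmr = X - repmat(mean(X, 2), 1, n);      % subtract the row means
C    = Xzmr*Xzmr'/(n-1);                  % 6 x 6 covariance matrix

[V, D] = eig((C + C')/2);                 % eigenvalue order is not assumed here
[lambda, order] = sort(diag(D), 'descend');
T      = V(:, order);                     % t1 = T(:,1), t2 = T(:,2), ...
Lambda = diag(lambda);                    % diagonal eigenvalue matrix

errRecon = max(max(abs(C - T*Lambda*T')));        % checks C = T*Lambda*T'
errOrtho = max(max(abs(T'*T - eye(size(T,2)))));  % checks orthonormal columns
fracVar  = lambda/trace(C);               % share of total variance per pattern
```

Because there are only 5 columns, at most 4 eigenvalues are nonzero here; the remaining ones are zero to within round-off.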
For each particular normalized EMG (i.e. for each column xj of the matrix X) we can
compute the scaling factors that patterns t1, t2, etc. should be multiplied by to predict normalized
EMG xj. We will call this vector of scaling factors yj. (Don’t get confused. y1 is not a single scaling
factor; it is a whole set of scaling factors for EMG x1. The scaling factor for pattern t1, EMG x1 is
the first element of y1. The scaling factor for t2, EMG x1 is the second element of y1, etc.) The
vector of pattern scaling factors for EMG xj is
$$y_j = T^{T} x_j .$$
The authors call this the vector of scores. (Matlab’s PCA (princomp) uses the term “scores” the
same way.) We can reconstruct xj by adding up the patterns, with each pattern scaled by the
appropriate y-value:
$$x_j = \sum_i y_{ij} t_i ,$$
where yij = the ith element of vector yj. We can probably get a pretty good approximation without
using all 101 patterns. For example, we could just use the first k patterns. The eigenvalue matrix $\Lambda$
tells us how much better the approximation gets as you add patterns. The sum of all the diagonal
elements in C (trace(C), by definition of “trace”) is the total power in the normalized EMGs. The
sum of the first k eigenvalues tells us how much power is accounted for by the first k patterns. So if
the sum of the first 3 eigenvalues is 90% of trace(C), we might decide to just analyze the first 3 patterns.
Patterns that account for only a small fraction of the variance are likely to just be “noise” anyway,
not “real” patterns, so going out to too high a value of k, for example choosing k high enough to
account for 99% of the variance, may cause us to analyze patterns that aren’t important or “real”.
The authors choose k large enough to account for (at least?) 90% of total power, and/or (which?)
they exclude patterns if the eigenvalue is < 1% of total variance (i.e. if $\lambda_{kk}$ < trace(C)/100). This
leads them to select k=3 for all three muscle groups (quads, hams, gastrocs). The authors use “PP1,
PP2, PP3” to refer to principal patterns 1, 2, 3, which are equal to the pattern vectors t1, t2, and t3.
They use the scores (the first k elements of each yj) for the k principal patterns for statistical analysis
of changes in patterns.
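Scores, truncated reconstruction, and the two selection rules for k can be sketched as follows. The data are synthetic (an assumption). I compute the scores from the de-meaned data and add the row means back when reconstructing, which is my assumption about how the reconstruction should be done when the covariance matrix is built from de-meaned data.

```matlab
% Synthetic data (an assumption): 101 time points, 20 EMGs built from a few shapes.
t = (0:100)'/100;
n = 20;
X = zeros(101, n);
for j = 1:n
    X(:,j) = (1 + 0.2*j)*sin(pi*t) + 0.05*(-1)^j*j*sin(2*pi*t) ...
             + 0.01*cos(3*pi*j*t);
end
mu   = mean(X, 2);                        % row means (mean EMG at each time point)
Xzmr = X - repmat(mu, 1, n);
C    = Xzmr*Xzmr'/(n-1);
[V, D] = eig((C + C')/2);
[lambda, order] = sort(diag(D), 'descend');
T = V(:, order);

Y = T'*Xzmr;                              % column j holds the score vector y_j
cumFrac = cumsum(lambda)/trace(C);        % variance explained by the first k patterns
k90 = find(cumFrac >= 0.90, 1);           % smallest k reaching 90% of trace(C)
k1  = sum(lambda >= trace(C)/100);        % patterns with at least 1% of variance each

Xk = T(:,1:k90)*Y(1:k90,:) + repmat(mu, 1, n);    % reconstruction from k90 patterns
relErr = norm(X - Xk, 'fro')/norm(Xzmr, 'fro');   % relErr^2 = 1 - cumFrac(k90)
Xfull  = T*Y + repmat(mu, 1, n);                  % all patterns: exact reconstruction
```

The two rules (k90 and k1) need not agree, which is one reason it matters which rule the authors actually used.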
Summary of results of Hubley-Kozey et al., 2006
The authors use principal component analysis (PCA) to compare EMG patterns in normals
and subjects with osteoarthritis (OAs). The PCA done here assumes that there are k principal
patterns for the 234 (or 156) EMGs, and that the patterns (but not the weights of the patterns) are
the same in all 3 (or 2) muscles of a group, and in normals and OAs. The “best” such patterns are
found for each muscle group. Three patterns (k=3) are enough to account for >90% of the variance
in EMG patterns in each of the groups. The authors analyze whether the scores (also known as
weights, or y vectors) are different for normals vs. OAs, and whether the weights are different for
different muscles in a group (VL vs. VM vs. RF in the quads group, for example), and whether
weights differ by muscle and group (VL in normals vs. VL in OAs, etc.). They find that there are
differences between normals and OAs for some muscles and groups of muscles. The large number
of comparisons made (scores for PP1, PP2, PP3, various muscles and groups, pairwise and not, etc.)
makes it hard (for me) to tell what’s really important and not important. There is no a priori
hypothesis about scores that is tested here. A lot of scores are measured and compared with each
other. Some statistically significant score differences are found between normals and OAs.
Brief notes on Astephen et al., 2008
Analyze kinematic & EMG data from asymp, moderate, & severe OA patients. Forty-five
PCs were determined and 45 ANOVAs were done to compare those PCs between the three groups
(asymp, moderate, & severe OA). (Three PCs each for the 12 kinetic/kinematic variables and three
PCs each for EMGs from 3 muscle groups.)
Fig 1A shows knee IR moment waveforms for asymp & moderate OA subjects. The
waveforms are not from actual subjects. Both waveforms are built from the 3 PCs for knee IR, with
(maybe - they're not totally clear I think) the "mean values" for the weights of PC1 and PC2, and, for
PC3, the weighting is at the 5%ile for the asymp trace and at 95%ile for the mod OA trace. Fig 1B
shows RF EMG waveforms for same 2 groups. Each waveform is built from the 3 PCs for RF
EMG, with (I think) the "mean values" for the weights of PC2 and PC3, and, for PC1, the weighting
is at the 5%ile for the asymp trace and at 95%ile for the mod OA trace. Likewise, Figs 1C, 1D, 1E
show waveforms built using 5%ile and 95%ile values for PC2, PC3, and PC2 respectively of their
traces. Of the 16 PCs (out of 45) which were significantly different between asymp & moderate OA, traces
using the 5% and 95% levels of these particular 5 PCs are shown in Fig 1, because these 5 PCs (but
not the other 11, evidently) were found to be important in the linear discriminant analysis of asymp
& moderate OA.
General notes on use of PCA in gait analysis
Some authors (Raptopoulos et al. 2006; Schutte et al. 2000; Deluzio et al. 1997) use corr(X)
instead of Cov(X). Hubley-Kozey et al. don’t indicate in their papers that they subtract the means.
Raptopoulos et al. 2006 and Hall et al. 2006 suggest PCA and KLT are the same thing. The clearest
article comparing PCA to KLT is Gerbrands (1981). Gerbrands says transform T is same for PCA
and KLT and is the matrix of eigenvectors of the covariance matrix. In KLT this is applied to the
“raw” data vectors:
$$y_j^{(KLT)} = T^{T} x_j .$$
In PCA, the “de-meaned” data vectors are used:
$$y_j^{(PCA)} = T^{T} (x_j - E(x)) ,$$
where E(x)=mean vector.
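The relation between the two sets of scores can be checked on a toy example. In this sketch the data matrix is made up (an assumption); variables are in rows and observations in columns. Since the transform T is the same in both cases, the KLT and PCA scores should differ only by the constant vector T'*E(x).

```matlab
% Synthetic data (an assumption): 4 variables in rows, 6 observations in columns.
X = [ 5 6 7 5 6 7;
      1 3 2 4 2 0;
      9 8 9 7 8 7;
      2 2 3 3 4 4];
n  = size(X, 2);
Ex = mean(X, 2);                          % E(x), the mean vector
Xc = X - repmat(Ex, 1, n);                % de-meaned data vectors
C  = Xc*Xc'/(n-1);                        % covariance matrix
[V, D] = eig((C + C')/2);
[lambda, order] = sort(diag(D), 'descend');
T = V(:, order);                          % same transform for KLT and PCA

Yklt = T'*X;                              % KLT scores: raw data vectors
Ypca = T'*Xc;                             % PCA scores: de-meaned data vectors
offset  = T'*Ex;                          % constant difference between the two
maxDiff = max(max(abs(Yklt - Ypca - repmat(offset, 1, n))));
```

The PCA scores have zero mean across observations, while the KLT scores carry the offset T'*E(x) in every observation.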
Multi-muscle PCA for gait analysis: 2010-12-15 discussion with Shraddha Srivastava.
Shraddha Srivastava & John Scholz have collected EMG data from 10 muscles during
walking. By analogy with the earlier analysis, matrix X has m=100 rows (one row per each time
point of the gait cycles) and n=10 columns. The elements are xij, where i=time index and j=muscle
number. Here each column is a different muscle: vector xj (whose elements are xij, i=1..100) is the
EMG from muscle j.
In Hubley-Kozey et al. (2006), and in Astephen et al. (2008), each column is a different
individual, and data from each muscle is analyzed separately. Here,
$$C = X_{dm}^{T} X_{dm} / (m-1)$$
is the unbiased estimate of the covariance matrix, where Xdm = “de-meaned” version of X: the data
in each column have had the mean for that column subtracted. Therefore, each column of Xdm is the
EMG for one muscle, adjusted to have zero mean for the full gait cycle.
$$X_{dm} = X - \mu$$
where µ = matrix of the column-means of X.
In Matlab:
>> Xdm=X-repmat(mean(X,1),m,1); C=Xdm'*Xdm/(m-1);
The same result can be obtained more easily by
>> C=cov(X);
C is a symmetric n by n, i.e. 10 by 10, matrix. The value of cij tells us the covariance of the EMG
of muscle i with the EMG of muscle j, based on all m=100 times through the gait cycle.
Vijaya Krishnamoorthy, John Scholz, et al. (2003) used PCA to study postural adjustments in
response to perturbations. They measured EMGs from eleven postural muscles during postural
perturbations. Each individual repeated the perturbations 50 times for some tasks, 22 times for other
tasks. The EMG was integrated for 100 ms around the start of each perturbation. After
normalization (which is somewhat complicated; see the paper), this yields one number per muscle
per trial, in one individual: normalized integrated EMG activity for 100 ms. Thus the matrix which
was PCA’d was
$$X = [x_{ij}]$$
where i=1..11=muscle number and j=1..50=repetition number (m=11, n=50). PCA was done on X;
“the correlations were computed among the [muscles].” Note that, unlike PCA of Hubley-Kozey
and others described earlier, time does not appear as a variable or parameter in this data or model:
all the data comes from a single time point. Each principal component is a “muscular” pattern (the
levels of simultaneous activation of all 11 muscles), instead of a temporal pattern. For H-K et al.,
different columns were different individuals. For Krishnamoorthy et al., different columns are
different trials in the same individual. For H-K et al., each individual got a “score” vector, the
elements of which indicate how much of PC1, PC2, etc. was used by that individual. For
Krishnamoorthy et al., each trial gets a score vector, the elements of which indicate how much of
PC1, PC2, etc. was used in that trial. As far as I can tell, each individual is analyzed separately, and
there is no requirement that the PCs in one individual look like the PCs in another individual.
Krishnamoorthy et al. call each PC an “M-mode” (muscle mode). They used the three largest M-modes
for their next stage of analysis, uncontrolled manifold (UCM) analysis, which we will not discuss in
detail here. Krishnamoorthy et al. distinguish between M-modes (PCs) and “synergies”: “Muscle
synergies are defined as co-variations of control variables (M-modes) that stabilize a particular
value of COP shift.” The authors compute a UCM “in the M-mode space corresponding to a certain
average (across trials) shift of the COP”. In other words, they looked for, and found, a
subspace of M-mode space corresponding to a particular COP shift. This subspace is the UCM. It is
a plane or hyperplane of “allowable tradeoffs” between M-modes that will not disturb the COP
position. If one moves orthogonally (in M-mode space) to that plane, the COP will change.
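A PCA of this kind, based on correlations among muscles across trials, might look like the sketch below. The data are synthetic (an assumption), and this is not a reproduction of Krishnamoorthy et al.'s procedure: their EMG normalization and any further processing of the PCs are not included. The UCM step is not attempted.

```matlab
% Synthetic data (an assumption): rows = 11 muscles, columns = 50 trials of
% normalized integrated EMG from one individual.
nMus = 11; nTrials = 50;
j  = 1:nTrials;
g1 = sin(2*pi*j/nTrials);                 % hypothetical trial-to-trial drive 1
g2 = cos(6*pi*j/nTrials);                 % hypothetical trial-to-trial drive 2
X  = (1:nMus)'*g1 + (nMus:-1:1)'.^2*g2/10 + 5;

R = corrcoef(X');                         % 11 x 11 correlations among muscles
[V, D] = eig((R + R')/2);
[lambda, order] = sort(diag(D), 'descend');
Mmodes  = V(:, order(1:3));               % three largest PCs ("M-modes")
fracVar = lambda(1:3)/sum(lambda);        % sum(lambda) = 11 for a correlation matrix

% Standardize each muscle across trials, then project each trial onto the M-modes.
Z = (X - repmat(mean(X, 2), 1, nTrials))./repmat(std(X, 0, 2), 1, nTrials);
scores = Mmodes'*Z;                       % 3 x 50: one score vector per trial
```

Each column of Mmodes is an 11-element pattern across muscles, and each column of scores belongs to one trial, matching the description above. The synthetic data contain only two independent drives, so the third M-mode carries essentially no variance here.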
Shraddha Srivastava analyzes EMGs from different individuals separately. Each column in the
data matrix is a different muscle. This type of PCA finds EMG versus muscle patterns (PCs). The
patterns do not change with time. They are used with different weights at different times. Each PC
is a unit vector with n=10 elements. Each PC represents a pattern of activation across all 10
muscles. Each PC is a vector of EMG versus muscle whose components correspond to different
muscles. The PCs are ranked according to their eigenvalues. The eigenvector with largest
eigenvalue is the first principal component. By using a different “weight” of this vector at each time
point, we can account for the biggest fraction of the variability of the “EMG versus muscle
relationship” from different times. The second PC is an EMG versus muscle vector which is
orthogonal to the first. Each time point has a different set of weights for PCs (i.e. for the “EMG
versus muscle” functions). This PCA does not compare across individuals, since all the data is from
one individual. The PCA as described so far does not compare across gait cycles, since all the EMG
data is from one gait cycle, or is cycle-averaged across gait cycles.
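The muscle-by-muscle PCA described above can be sketched with synthetic data (an assumption): two made-up muscle patterns whose weights vary over the gait cycle. The PCs come out as unit 10-vectors, and the weights form a time series for each PC.

```matlab
% Synthetic data (an assumption): m=100 time points, n=10 muscles, built from
% two muscle-activation patterns whose weights vary over the cycle.
m = 100; n = 10;
t  = (1:m)'/m;
p1 = [1 1 1 1 1 0 0 0 0 0];               % hypothetical pattern: muscles 1-5
p2 = [0 0 0 0 0 1 1 1 1 1];               % hypothetical pattern: muscles 6-10
X  = (1 + sin(2*pi*t))*p1 + 0.5*(1 + cos(4*pi*t))*p2;

C = cov(X);                               % n x n, muscles by muscles
[V, D] = eig((C + C')/2);
[lambda, order] = sort(diag(D), 'descend');
PCs = V(:, order);                        % each column is a unit 10-vector
Xdm = X - repmat(mean(X, 1), m, 1);       % de-mean each muscle over the cycle
W   = Xdm*PCs;                            % m x n: weight of each PC at each time point
pc1 = PCs(:,1);                           % first principal component
```

Column 1 of W is the weighting-versus-time vector for PC1. In this example PC1 recovers the first made-up pattern up to sign and scaling.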
In PCA as done by Wooten et al. (1990), Deluzio et al. (1997), Hubley-Kozey et al. (2006),
Astephen et al. (2008), and others, EMGs from different muscles are analyzed separately. Each
column in the data matrix is a different individual. These authors use PCA to find PCs which are EMG
versus time functions, i.e. temporal patterns that persist across individuals. The PCs are the same for
all individuals, and are used with different weights by different individuals. Each PC is a unit vector
with m=100 elements and represents an EMG versus time vector whose components correspond to
successive time points. By using a different weight of the first PC for each individual, we can
account for the biggest fraction of the variability of the “EMG versus time relationship” from different
individuals. The second PC is orthogonal to the first. Each individual has a different set of weights for
the “EMG versus time” functions. The studies mentioned do not compare one stride to another, since
data from different strides are averaged together to make the input data for PCA.
Shraddha S. wonders whether and how the patterns of muscle activation are changing from one gait
cycle to the next. The PCA as described above does not make it easy to do such a comparison. One
may do the PCA as described for each gait cycle individually, and there is no mathematical
requirement that the PCs be the same or even related for different cycles. One could then
investigate cycle-to-cycle changes by analyzing how PC1 changes from cycle to cycle, and how PC2
changes from cycle to cycle, etc. It may be hard to visually compare 10-vectors which all have unit
length. One could compare how the weighting-versus-time vector (which has m=100 elements) for
PC1 changes from cycle 1 to cycle 2 to cycle 3, and do likewise for the weighting-versus-time
vectors for PC2, PC3, etc.
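One possible way to compare PC1 across cycles numerically, rather than visually, is sketched below. The per-cycle data are synthetic (an assumption), and the similarity measure (absolute dot product of unit vectors, or the corresponding angle) is my suggestion, not something taken from the discussion. The sign of an eigenvector is arbitrary, so it has to be fixed or ignored before comparing.

```matlab
% Synthetic data (an assumption): three gait cycles, each m=100 time points by
% n=10 muscles, with a muscle pattern that drifts slightly from cycle to cycle.
m = 100; n = 10; nCyc = 3;
t = (1:m)'/m;
pc1 = zeros(n, nCyc);
for c = 1:nCyc
    pat = [1 1 1 1 1 0.2*c 0 0 0 0];      % hypothetical main pattern for cycle c
    alt = [0 0 0 0 0 0 1 1 0 0];          % hypothetical secondary pattern
    X   = (1 + sin(2*pi*t))*pat + 0.3*cos(4*pi*t)*alt;
    C   = cov(X);                         % PCA done separately for each cycle
    [V, D] = eig((C + C')/2);
    [lambdaMax, idx] = max(diag(D));
    v = V(:, idx);
    [vmax, imax] = max(abs(v));
    pc1(:,c) = v*sign(v(imax));           % make the largest element positive
end
sim12 = abs(pc1(:,1)'*pc1(:,2));          % 1 means identical direction
sim13 = abs(pc1(:,1)'*pc1(:,3));
ang13 = acosd(min(sim13, 1));             % angle between PC1 of cycles 1 and 3 (deg)
```

The same comparison could be repeated for PC2, PC3, etc., with the caveat that PCs with nearly equal eigenvalues can swap order between cycles.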
Multidimensional alternative to PCA
A stride-by-stride analysis of changes in gait assessed by EMG was done by Jansen et al. (2003).
They did not do PCA; they introduced a novel method. They recorded EMGs from 4 muscles and
computed a trajectory in 4-space for each stride. Template trajectories are found. Each stride is
assigned to the most similar template.
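The assignment step can be illustrated with a generic nearest-template rule. This is only a sketch: the templates and the stride are synthetic, and the similarity measure used here (mean point-by-point Euclidean distance in 4-space) is an assumption of mine, not necessarily the measure used by Jansen et al.

```matlab
% Synthetic data (an assumption): each trajectory is 50 samples x 4 muscles.
s = (1:50)'/50;
templA = [sin(pi*s), cos(pi*s), s, 1 - s];        % hypothetical template A
templB = [cos(pi*s), sin(pi*s), 1 - s, s];        % hypothetical template B
templates = {templA, templB};
stride = templB + 0.05*sin(7*pi*s)*[1 -1 1 -1];   % a stride resembling template B

dist = zeros(1, numel(templates));
for k = 1:numel(templates)
    d = stride - templates{k};
    dist(k) = mean(sqrt(sum(d.^2, 2)));   % mean point-by-point distance in 4-space
end
[dmin, best] = min(dist);                 % index of the most similar template
```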
PCA code in Matlab
I have written Matlab programs to do PCA.
Disregard PCA01.m.
PCA02.m uses pcacov() which operates on the covariance matrix C. I calculate the covariance
matrix C as cov(X’). I use X’ instead of X because I assume the data matrix X (with N rows x M
columns) has different observations in different columns. Column 1 is all the data from one
observation (e.g. EMG in one individual during 1 stride). (The usual PCA, which would use
cov(X), assumes each observation is a separate row.) Therefore each principal component will
have N elements since that is the length of each observation. If data is gait EMG or similar, N=100
is typical, for a cycle divided into 100 parts. Sample synthetic data that have been analyzed include
EMG_set2.txt and EMG_set3.txt and EMG_set4.txt, all made with Excel spreadsheet
PCA02A.m uses princomp(), which operates on the raw data (as opposed to the covariance matrix).
I use princomp(X’) rather than princomp(X), for the reasons described above.
PCA03zm.m does the PCA by eigenvector analysis of the covariance matrix of the data. This was
done to make sure I fully understood how princomp() and pcacov() worked, because such an
understanding is necessary to do “non-standard” PCA with data that had not been “de-meaned”.
(See discussion above, under “General notes on use of PCA in gait analysis.”) PCA03zm.m uses
“zero-mean-value” rows. In other words, the raw data is adjusted by having the mean value for a
row (i.e. for one time point) subtracted from all the observations values at that time point. The
resulting data matrix Xzmr has zero mean value for every row. I demonstrated with PCA03zm.m that
I could use Xzmr matrix to compute PC’s and weights and reconstruct the data with the same results
as with princomp() and pcacov(). Details: compute C=Xzmr*Xzmr’/(M-1), since this = cov(X’). Do
eigenvector analysis on C.
PCA03.m is like PCA03zm.m except it does not “de-mean” the data, so the rows do not have zero
mean value. The eigenvector analysis is done on matrix D=X*X’/(M-1), yielding PCs. Note D is
not the covariance matrix. Compute scores by dotting each PC with X (instead of dotting with
Xzmr). Plots show that this method sometimes fails significantly to reconstruct the data.
This may not be how to do the KLT as described by Gerbrands (1981).
ICA versus PCA
ICA is, like PCA, a method for analyzing high-dimensional data. It is not uncommon in EEG
analysis. Gael Varoquaux gives a summary of the difference between ICA and PCA at http://gaelvaroquaux.info/scientific_computing/ica_pca/index.html (retrieved 2012-03-27). Varoquaux presents
a situation in which ICA and PCA give very different results, and the ICA results are in some sense
better. A key aspect of the difference is that the axes in ICA are not necessarily orthogonal, and
ICA does not assume normally distributed data. The author says that ICA finds axes that yield
maximally non-normal scatter. Varoquaux does not discuss multiple channels of time-varying data.
This ICA summary http://www.cs.helsinki.fi/u/ahyvarin/papers/NN00new.pdf by Hyvarinen and
Oja includes description of their “FastICA” algorithm and has links to a Matlab ICA package. See
also their paper in Neural Networks (2000). They state that ICA is a technique for solving the
“cocktail party problem” of extracting separate (independent) speakers’ voices from multiple
recordings, each of which is a different mixture of all the speakers. This is obviously directly
applicable to EEG analysis, in which the multiple electrodes are presumed to represent different
mixtures of underlying independent processes. The independent processes are assumed to be (and
must be) non-Gaussian. (This is not equivalent to assuming that the “measurement noise” is non-Gaussian. Hyvarinen & Oja state that their derivation of ICA assumes there is no measurement
noise.) Hyvarinen and Oja point out that ICA has the following ambiguities: there is no way to
separate the amplitude of a signal from the strength of its coupling coefficients; the sign (+ or -) of
each coupling coefficient is arbitrary; there is no meaningful way to rank the independent
components by strength or importance. H&O discuss “whitening” as a pre-processing step, and they
say whitening can be combined with data dimensionality reduction. “Whitening” is done by PCA.
They say that whitening is a linear transformation of the data that yields data whose covariance
matrix is the identity matrix. PCA is a linear transform of the data to a form in which the covariance
matrix is diagonal. However, the diagonal elements of the covariance matrix of data after PCA are
not all equal. Thus, Hyvarinen and Oja appear to recommend something like PCA as a precursor to
ICA. Also see citations in the above document for more information. These include articles by
Terry Sejnowski.
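The difference between PCA and whitening can be seen in a few lines. The sketch below uses synthetic mixed signals (an assumption): PCA makes the covariance matrix diagonal with unequal entries, and dividing each PC by the square root of its eigenvalue then gives the identity covariance. This only illustrates the whitening step; it is not ICA and not the FastICA algorithm.

```matlab
% Synthetic data (an assumption): 3 channels in rows, 200 samples in columns.
k = 1:200;
S = [sin(2*pi*k/50); sign(sin(2*pi*k/23)); mod(k, 17)/17];   % three source shapes
A = [1 0.5 0.2; 0.3 1 0.4; 0.6 0.1 1];                       % hypothetical mixing matrix
X = A*S;                                   % each channel is a mixture of the sources
n  = size(X, 2);
Xc = X - repmat(mean(X, 2), 1, n);
C  = Xc*Xc'/(n-1);
[V, D] = eig((C + C')/2);
[lambda, order] = sort(diag(D), 'descend');
T = V(:, order);

Ypca = T'*Xc;                              % PCA: covariance becomes diagonal
Cpca = Ypca*Ypca'/(n-1);                   % equals diag(lambda), unequal entries
Zwh  = diag(1./sqrt(lambda))*Ypca;         % whitening: rescale each PC to unit variance
Cwh  = Zwh*Zwh'/(n-1);                     % equals the identity matrix
```

Dimensionality reduction could be combined with this step by keeping only the first few rows of Zwh.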
Tresch et al (J Neurophysiol 2006) compared various methods for identifying “synergies”, using
simulated and real data. Their definition of synergy differs from that of Krishnamoorthy et al. They
write “the definition of a synergy used here is identical to the “muscle modes” used [by
Krishnamoorthy et al., 2003]”. (Krishnamoorthy et al. defined a muscle mode as a principal
component.) Tresch et al. concluded that the best balance between computational complexity and
robust results was obtained by using “PCAICA”, meaning PCA followed by ICA. They used PCA
to reduce the data to the four largest principal components. (How did they choose four? Four was
the number of synergies in the simulated data, which they should not have known beforehand. Did
that bias the results in favor of PCAICA?) This is similar or identical to the recommendation of
Hyvarinen and Oja, discussed above. They did PCA with Matlab 6.13, with the princomp function.
They did ICA with the function runica in the EEGLAB package (v4.1; Bell and Sejnowski 1995;
Makeig et al. 1996; http://sccn.ucsd.edu/eeglab/). The “model order determination problem” in this
context is the problem of determining the number of synergies in a data set. Tresch et al. reported
that the most robust method for finding the correct number of synergies in simulated data with
signal-intensity-dependent noise was the likelihood ratio obtained from factor analysis. (Since the
data was simulated, the true number of synergies was known.) The plot of likelihood ratio versus
number of assumed synergies should and often did “flatten out” above the correct number of
synergies. This makes sense: we expect little further improvement in the log likelihood, if the
number of synergies exceeds the true number of synergies. They also tested other methods for
determining the number of synergies, including AIC and plot of explained variance versus number
of assumed synergies. The plot of log likelihood determined with the factor analysis was clearly
better than other measures and methods for correctly identifying the number of synergies. Another
part of their study showed that the synergies identified by PCAICA (and some other methods) were
very similar even if the number of synergies was one larger or smaller than the true number of
synergies. They conclude that "a slightly incorrect estimate of the number of synergies does not
lead to a drastically incorrect estimate of the underlying synergies, but that features of the estimated
synergies are preserved."
