Multidimensional visualisation of time series
and the construction of acceptance regions in a
PCA biplot
Sugnet Gardner and Niël J le Roux
Department of Statistics and Actuarial Science, Stellenbosch University, Private
Bag X1, Matieland, 7602, South Africa njlr@sun.ac.za
Summary. It is shown how to mimic an ordinary scatterplot in a principal component analysis biplot by constructing calibrated axes representing p > 2 variables.
The methodology is extended to represent a multidimensional target in the biplot as
well as to equip the biplot with acceptance regions. These acceptance regions enable
researchers to evaluate how well their data compare to a specified target together
with an understanding of the roles played by all variables. Finally it is demonstrated
how changes over time can be visually displayed in the biplot.
Key words: acceptance regions, biplot, industrial quality control, principal component analysis, trend lines.
1 Introduction
The biplot as presented by [G71] is widely applied in practice as a graphical aid
for visualisation of multidimensional data. [GH96] introduced a new philosophy in
biplot methodology, viewing biplots as the multivariate analogues of scatterplots.
This allows the visual appraisal of the structure of data in a few dimensions. Biplot
axes are used to relate the plotted points to the original variables as with ordinary
scatterplots. By extending the scatterplot principles to multivariate displays these
biplots ease interpretation and are accessible to non-statistical audiences.
The data set used in this paper originates from a manufacturing industry. The
quality of the final product is influenced by a number of different variables. Measurements of these variables are made over time on a daily basis and averaged to obtain
monthly values. For each of the variables a target or specification value is fixed by
management. The quality of the product can therefore be considered as a multidimensional attribute that is to be compared to a multidimensional target. A principal
component analysis (PCA) biplot is a valuable graphical aid when monitoring such
processes.
After a brief introduction to the construction of a PCA biplot in the next section,
we show by example how to represent a multidimensional target in the biplot as well
as how to equip the biplot with acceptance regions to evaluate how well the product meets target
values.
2 Principal component analysis biplots
Several types of biplot can be constructed depending on the nature of the data and
the distance metric used. Probably the most widely used and well known type of
biplot is the PCA biplot that is based on ordinary Euclidean distances. PCA aims
to optimally represent the variation in a data set in a few dimensions. Although
the computations can be performed algebraically for any dimension r ≤ min(n, p), where
n is the number of samples or observations and p is the dimension of the original
data set, PCA biplots are usually constructed in r = 2 (or 3) dimensions.
A data matrix X: n × p gives the co-ordinates of n samples in the space R^p.
The principal component method chooses the r-dimensional subspace L that is best
fitting in the least squares sense. If X is the centred data matrix, as will be assumed
unless stated otherwise, it follows from Huygens’ Principle that L passes through
the origin.
Let the singular value decomposition of X′X be given by X′X = VΛV′ with
V′V = I. [GH96] shows that the first r columns of V, denoted by Vr, form an
orthonormal basis for L. These principal axes define a natural set of orthogonal
co-ordinate axes that could be used as scaffolding for plotting the sample points in
the biplot. Relative to the principal axes, the co-ordinates of the projections of the
sample points are given by Z = XVr . This process of finding the representation
of the sample points in terms of the biplot scaffolding in the space L is called
interpolation. The above formula for interpolating the original sample points can also
be used for algebraically interpolating a new point onto the biplot: let x∗′: 1 × p
in the space R^p denote the co-ordinates of a new point to be interpolated. It then
follows that the interpolated position of the latter point in the biplot is given by
z∗′ = x∗′Vr.
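As an illustration, the interpolation step might be sketched in Python with NumPy as follows. The data are randomly generated placeholders, not the industrial data, and the names X_raw, Vr and z_new are chosen here for illustration only. The sketch assumes that a new point is centred with the sample means before it is interpolated, since X is taken to be centred.

```python
import numpy as np

rng = np.random.default_rng(0)
X_raw = rng.normal(size=(15, 5))      # placeholder data: n = 15 samples, p = 5 variables
mean = X_raw.mean(axis=0)
X = X_raw - mean                      # centred data matrix X

# Decompose X'X = V Lambda V'. eigh returns eigenvalues in ascending order,
# so the columns of V are reordered by decreasing eigenvalue.
eigvals, V = np.linalg.eigh(X.T @ X)
V = V[:, np.argsort(eigvals)[::-1]]

r = 2
Vr = V[:, :r]                         # first r columns of V: orthonormal basis of L
Z = X @ Vr                            # interpolated sample points, Z = X Vr (n x r)

# A new point is interpolated with the same formula (assumption: it is
# centred with the sample means of the original data first).
x_new = rng.normal(size=5)
z_new = (x_new - mean) @ Vr
```

The rows of Z are the plotting positions of the samples relative to the two scaffolding axes, and z_new is the position of the additional point.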
The terms interpolation and prediction refer to the relationships between the
scaffolding and the original variables. Prediction is the inverse of interpolation, inferring the values of the original variables for any point in the biplot space L.
Inverting the above interpolation formula provides the prediction of z∗ as z∗′V′r. In
ordinary scatterplots, a single set of Cartesian axes is used for both interpolation and
prediction. Since in general p > r, biplot axes are not orthogonal, and interpolation
and prediction necessitate two different sets of axes. [GH96] shows that the k-th
interpolation axis is defined by τe′kVr, where −∞ < τ < ∞ and ek denotes a vector
of zeros except with unity in the k-th position.
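The relationship between interpolation and prediction can be checked numerically. The following sketch, on placeholder data and with illustrative function names, obtains Vr from the singular value decomposition of the centred X (whose right singular vectors are the eigenvectors of X′X), predicts the p values for a point in the biplot, and confirms that interpolating the prediction returns the same point.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 4))          # placeholder data: n = 20, p = 4
X = X - X.mean(axis=0)                # centred data matrix

# Rows of Vt are the principal axes, ordered by decreasing singular value.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:2].T                         # p x r matrix with r = 2

def interpolate(x):
    """Map a centred point in R^p to its position in the biplot: z' = x' Vr."""
    return x @ Vr

def predict(z):
    """Map a biplot position back to predicted (centred) variable values: z' Vr'."""
    return z @ Vr.T

z = np.array([0.5, -1.0])             # an arbitrary point in the biplot space L
x_hat = predict(z)                    # predicted values of all p variables
z_back = interpolate(x_hat)           # equals z because Vr'Vr = I

# Direction of the k-th interpolation axis, tau * e_k' Vr: the k-th row of Vr.
k = 2
e_k = np.zeros(4)
e_k[k] = 1.0
axis_direction = e_k @ Vr
```

Note that the reverse composition, predicting an interpolated sample, generally gives the orthogonal projection of the sample onto L rather than the sample itself.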
For constructing biplot axes analogous to the Cartesian axes of scatterplots, the
biplot axes need to be calibrated with markers. Since X is the centred data matrix,
the point τe′kVr where τ = 0 corresponds to the mean value x̄k of the k-th variable.
Placing markers where τ = µ for µ = . . . , −2, −1, 0, 1, 2, . . . will correspond to
the markers x̄k − 2, x̄k − 1, x̄k, x̄k + 1, x̄k + 2, etc. However, the mean value is not
necessarily a sensibly rounded value, resulting in axis markers not being particularly
useful. [GH96] suggests that a ‘sensible’ scale value φ with x̄k ≤ φ < x̄k + 1 be identified
and markers be placed where τ = µ for values µ = . . . , −2 + φ − x̄k, −1 + φ − x̄k,
φ − x̄k, 1 + φ − x̄k, 2 + φ − x̄k, . . . for intervals of one unit. The interval size can
easily be adjusted according to the unit of measurement.
It is well known that PCA is not scale invariant. However, in the data set used
in this paper large differences in units of measurements occur, necessitating some
scaling of the data matrix. The scaling applied was to divide each variable by its standard deviation. For positioning markers on the k -th biplot axis, assume that some
scaling is performed on the k -th variable by dividing by a value s. Placing markers at
µ = τ for µ = . . . , −2, −1, 0, 1, 2, . . . will result in the markers . . . , −2, −1, 0, 1, 2, . . .,
etc. being placed at standard deviation units from the mean. In using the biplot
as a multivariate scatterplot this is not very helpful, since sample points should be
related to the original unit of measurement. To place ‘sensible’ scale markers on the
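biplot axis, the markers should be placed where τ = µ for values
$$\mu = \ldots,\ \frac{-2+\phi-\bar{x}_k}{s},\ \frac{-1+\phi-\bar{x}_k}{s},\ \frac{\phi-\bar{x}_k}{s},\ \frac{1+\phi-\bar{x}_k}{s},\ \frac{2+\phi-\bar{x}_k}{s},\ \ldots,$$
where x̄k denotes the mean of the k-th variable.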
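The marker calculation for a scaled variable can be written out in a few lines of Python. In this sketch the function name marker_positions and the numerical values (a mean of 30.2 and a scaling value of 0.8) are hypothetical and serve only to illustrate the formula τ = (label − x̄k)/s, with the labels taken as round multiples of the chosen interval size.

```python
import math

def marker_positions(mean_k, s, lo, hi, step=1.0):
    """Return (label, tau) pairs for 'sensible' markers on the k-th biplot axis.

    label: value in the original unit of measurement (a multiple of `step`
           lying in [lo, hi]);
    tau:   position parameter on the axis, tau = (label - mean_k) / s.
    """
    first = math.ceil(lo / step)
    last = math.floor(hi / step)
    markers = []
    for j in range(first, last + 1):
        label = round(j * step, 10)   # guard against floating-point noise in the label
        markers.append((label, (label - mean_k) / s))
    return markers

# Hypothetical variable with mean 30.2, divided by s = 0.8, to be calibrated
# with markers at 29, 30, 31 and 32 in the original units.
markers = marker_positions(mean_k=30.2, s=0.8, lo=29.0, hi=32.0, step=1.0)
for label, tau in markers:
    print(f"marker {label:g} is placed at tau = {tau:.3f}")
```

Here φ = 31 is the ‘sensible’ value satisfying x̄k ≤ φ < x̄k + 1, and its marker is placed at τ = (φ − x̄k)/s, in agreement with the expression above.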
When inspecting any graphical representation, the axes are generally used to read
off the values of the variables. This is closely related to Gower and Hand’s [GH96]
concept of prediction, that is, inferring the values of the original variables for points in L.
In general, the interpolation and prediction biplot axes have different directions but
[GN05] shows that in the case of an additive distance metric, as is the case with PCA
biplots, the directions of the interpolation and prediction biplot axes will coincide
because the distances for each variable separately are Euclidean. The interpolation
axes can therefore be used as prediction axes, fitted with a different set of prediction
markers. The relationship between the interpolation markers (orthogonal projection
onto the biplot axis βk ) and prediction markers (orthogonal back projection onto
the k -th Cartesian axis ξk ) is illustrated in Figure 1.
Fig. 1. Relationship between the interpolation (I) and prediction (P) markers on
PCA biplot axis βk. The axis βk is the representation in L of Cartesian axis ξk in
R^p, with N denoting a (p − 1)-dimensional hyperplane perpendicular to ξk passing
through the marker µ. (After [GH96])
The markers on the k-th prediction biplot axis are located at
$$\frac{\mu}{e_k' V_r V_r' e_k}\, e_k' V_r$$
for values of µ as discussed above.
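A short numerical check may help to see why the prediction markers are placed in this way. The sketch below uses placeholder data and a hypothetical helper prediction_marker; it places markers for several values of µ on the k-th prediction axis and verifies that predicting the k-th variable at each marker position returns µ.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 5))          # placeholder data: n = 30, p = 5
X = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:2].T                         # p x 2

def prediction_marker(mu, k, Vr):
    """Biplot position of the marker mu on the k-th prediction axis."""
    b = Vr[k]                         # e_k' Vr, the direction of the k-th axis
    return mu * b / (b @ b)           # mu / (e_k' Vr Vr' e_k) * e_k' Vr

k = 3
mus = np.arange(-2, 3)                # markers -2, -1, 0, 1, 2 (centred units)
positions = np.array([prediction_marker(mu, k, Vr) for mu in mus])

# Predicting the k-th variable at each marker position recovers mu.
predicted = (positions @ Vr.T)[:, k]

# Reading off a value: project a biplot point orthogonally onto the axis.
z = np.array([0.7, -0.4])
b = Vr[k]
foot = (z @ b) / (b @ b) * b          # foot of the orthogonal projection
value_read_off = (foot @ Vr.T)[k]     # marker value at the foot
```

The value read off at the foot of the projection equals the algebraic prediction of the k-th variable for z, which is what makes the calibrated axis usable like a scatterplot axis.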
3 Representing a multidimensional target in a PCA
biplot
In Figure 2 a PCA biplot of the 15 dimensional industrial data set is shown. The multivariate target is also interpolated onto the biplot using the interpolation formula
described in the previous section. It is stressed that the biplot axes are prediction
axes and cannot be used for interpolation. The simultaneous representation of the
variables and sample points allows for inferring the values of all 15 variables from the
display. This is done by orthogonal projection onto an axis and reading off the value
of the associated variable. Thus there is very little difference between the values of
A2 for June and July 2004, but June 2004 has a value of approximately 30.45 for
C8 while the corresponding July 2004 value is approximately 32.1.
[Figure 2: PCA biplot with calibrated axes for the variables A1, A2, A3, A4, A5, B5, C4, C5, C6, C7, C8, D4, D6, D7 and E5, showing the monthly sample points Jan04 to Mar05 and the point labelled TARGET.]
Fig. 2. PCA biplot of the 15 dimensional industrial data set
Interpolating the multivariate target onto the biplot leads to considering the
quality performance of an observation. For instance, June 2004 was relatively close
to the target, compared to July 2004, which is much more distant. The axes are
used to judge the differences between June and July 2004: these differences are
largely negligible for variables A1, A2, A5, B5, D6 and D7 but considerable for
the other variables. The centroid of the data, where the biplot axes are concurrent,
does not correspond to the target value. This shows that the attribute values tend
to differ from target, e.g. C8 tends to be too low with a mean value close to 30
and a target of close to 31. In order to obtain a more detailed assessment of the
product quality, the display space needs to be ‘quality-calibrated’ reflecting the level
of quality corresponding to the associated combinations of variables.
4 Acceptance regions
Any quality management system for assessing multivariate quality can be used with
the PCA biplot methodology. The only requirement for computing acceptance regions is that the system must result in a single score describing the multivariate
quality. In the example considered here an in-house system calculating a quality
index value between 0 and 100 based on all 15 variables’ values is used. The aim is
to visually represent the 15 variables included in the calculation of the quality index
and the multivariate target to assess quality performance.
The calibration method combines PCA biplot prediction with the quality index calculations or any other system in use for evaluating quality. The following
algorithm can be used to construct quality regions:
• Construct a two-dimensional m × m grid in the biplot space and represent all
cells formed by the grid in a matrix, say E: m² × 2. Choose m large enough to
allow accurate representation of a cell by a single point.
• Let z∗′ be the i-th row of E, sequentially for i = 1, 2, . . . , m².
– Find the predicted values of z∗ either by graphical prediction using the
prediction axes or, preferably, algebraic prediction where X̂ = EV′r.
– Calculate the quality index value based on the 15 predicted values for each
row of X̂.
• Label the cell in the biplot space according to the corresponding quality index
(QI) value:
- POOR: 0 ≤ QI < 50 (LIGHT GREY)
- SATISFACTORY: 50 ≤ QI < 80 (DARK GREY)
- GOOD: 80 ≤ QI ≤ 100 (WHITE).
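The steps of the algorithm can be sketched in Python as follows. The in-house quality index is not described in this paper, so the function quality_index below is a purely hypothetical stand-in (100 at the target, decreasing with the scaled distance from it). The data, the target and the tolerance values are placeholders as well. The sketch also assumes that the predictions must be transformed back to the original units, by undoing the scaling and centring, before the index is evaluated.

```python
import numpy as np

def quality_index(x_hat, target, tolerance):
    """HYPOTHETICAL index in [0, 100]; not the in-house system of the case study."""
    d = np.sqrt(np.mean(((x_hat - target) / tolerance) ** 2, axis=1))
    return np.clip(100.0 * (1.0 - d), 0.0, 100.0)

def label_cells(Vr, mean, scale, target, tolerance, extent=3.0, m=200):
    """Label an m x m grid of biplot cells: 0 = poor, 1 = satisfactory, 2 = good."""
    g = np.linspace(-extent, extent, m)
    z1, z2 = np.meshgrid(g, g)
    E = np.column_stack([z1.ravel(), z2.ravel()])   # m^2 x 2 matrix of cell points
    X_hat = E @ Vr.T * scale + mean                 # algebraic prediction, original units
    qi = quality_index(X_hat, target, tolerance)
    labels = np.digitize(qi, [50.0, 80.0])          # QI < 50, 50 <= QI < 80, QI >= 80
    return E, qi.reshape(m, m), labels.reshape(m, m)

# Placeholder data: 15 monthly values of 6 variables, scaled by standard deviation.
rng = np.random.default_rng(3)
X_raw = rng.normal(loc=10.0, scale=2.0, size=(15, 6))
mean, scale = X_raw.mean(axis=0), X_raw.std(axis=0, ddof=1)
X = (X_raw - mean) / scale
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Vr = Vt[:2].T

target = mean.copy()          # placeholder target, set equal to the sample means
tolerance = 2.0 * scale       # placeholder tolerances
E, qi, labels = label_cells(Vr, mean, scale, target, tolerance, m=50)
```

The array labels can then be drawn as a shaded background image behind the biplot, with one shade per category. Any other scoring system can be substituted for quality_index, provided it returns a single score per row of predicted values.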
The above quality ratings are those used by the industrial process considered
in the case study but any user-specified system of categories can be applied. These
quality regions are added to the PCA biplot in Figure 2 leading to Figure 3.
The large white area surrounding the target value represents good quality product. The satisfactory quality region is a narrow dark grey band around the white
area. This indicates that a certain amount of deviation around the target is tolerated
but once this limit is exceeded, the quality index is severely penalised. The light grey
area on the outside denotes the unacceptable values resulting in poor quality.
Using the colour-coded quality regions indicating levels of acceptance of the
product quality, it can be seen at a glance that October 2004 is flagged for not having
the desired quality level. Furthermore November 2004 and March 2005 produced
satisfactory quality, but it could be improved upon. Months such as March 2004, May
2004, September 2004 and January 2005 border on the danger zone with December
2004 just outside the good quality region. The quality regions are more effective
in assessing acceptance level of quality than merely considering the distance an
observation is away from the target. For instance, July 2004 was singled out as an
observation situated far away from the target, yet the product quality region shows
the quality still to be good.
[Figure 3: PCA biplot with calibrated axes for the variables A1, A2, A3, A4, A5, B5, C4, C5, C6, C7, C8, D4, D6, D7 and E5, showing the monthly sample points Jan04 to Mar05, the point labelled TARGET, and shaded regions labelled Poor quality, Satisfactory quality and Good quality.]
Fig. 3. PCA biplot of the multivariate quality data with QI quality regions
Furthermore, the shape of the quality regions conveys important information: in
this case the quality regions are not circular around the point of concurrency of the axes.
This shows that product quality is not equally sensitive to all variables. The means
of variables C4, C5 and E5 are according to target specification but the means of
A3, A4, C6 and C8 are below target specifications. Therefore any month showing
below average values for one or more of the latter variables will quickly result in
poor quality product. On the other hand, variables A1, A2, A5, B5, D6 and D7 are
almost parallel with the first principal component (horizontal scaffolding axis) and
also correspond to the wider section of the white area. Relatively large deviations
from the mean values of these variables will still result in good quality product. It
follows that the QI biplot allows judgements that do not rely on distance
alone or on evaluating attributes individually: the portion of each axis in the
good quality region, as well as the position of a data point relative to all attributes,
contributes to a decision regarding remedial actions to be adopted. If a more detailed
visual display of the quality performance of the process is needed, only the number
of descriptive categories needs to be increased.
It is to be noted that there seems to be some time effect associated with variables
A1, A2, A5, B5, D6 and D7: two distinct clusters of QI values are associated with
these variables, the first half of 2004 appearing to the left of the point of concurrency
and August 2004 to February 2005 to its right. A proposal to address this aspect in
a biplot is discussed in the next section.
5 Identifying trends in quality performance over time
From the discussion in the previous section it is clear that there is evidence in the
biplot of systematic changes over time. Also, it was pointed out that June 2004
was characterised by a high quality product while October 2004 to March 2005
showed product of lower quality. Since the mean values form a multivariate time
series the sample points can be sequentially connected to trace the performance
over time. A smoother can be applied to these points to filter out short-term disturbances. Since the scaffolding consists of two orthogonal axes, the smoothing is
performed by smoothing each dimension separately. The resulting biplot is shown in
Figure 4. This biplot clearly shows systematic changes over time. A local smoothing technique (in this example a loess procedure) is applied to the sample pairs
(z1j, t1), (z2j, t2), . . . , (znj, tn) for j = 1, 2 and ti = i. The resulting fitted values
form the co-ordinates (ẑ11, ẑ12), (ẑ21, ẑ22), . . . , (ẑn1, ẑn2), which are then connected
to trace a smooth path over time, leading to the smooth trend line shown in the
biplot in Figure 4. The amount of smoothing can easily be controlled interactively
by adjusting the smoothing parameters.
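The smoothing of the two scaffolding dimensions can be illustrated with a small, self-contained smoother. The function local_linear_smooth below is a simplified loess-type procedure (tricube-weighted local linear fits without robustness iterations) and is an assumption of this sketch, not necessarily the loess implementation used for Figure 4. The biplot co-ordinates are placeholders.

```python
import numpy as np

def local_linear_smooth(t, y, span=0.5):
    """Simplified loess-type smoother: tricube-weighted local linear fits."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(t)
    q = min(n, max(3, int(np.ceil(span * n))))   # neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(t - t[i])
        h = np.sort(d)[q - 1]                    # distance to the q-th nearest point
        w = np.clip(1.0 - (d / h) ** 3, 0.0, None) ** 3   # tricube weights
        A = np.column_stack([np.ones(n), t - t[i]])       # local line centred at t[i]
        sw = np.sqrt(w)
        coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
        fitted[i] = coef[0]                      # local intercept = fitted value at t[i]
    return fitted

# Placeholder biplot co-ordinates for n = 15 months, t_i = i.
rng = np.random.default_rng(4)
n = 15
t = np.arange(1, n + 1)
Z = np.column_stack([np.cos(t / 3.0), np.sin(t / 3.0)]) + 0.1 * rng.normal(size=(n, 2))

# Smooth each scaffolding dimension separately against time.
trend = np.column_stack([local_linear_smooth(t, Z[:, j], span=0.5) for j in range(2)])
```

Connecting consecutive rows of trend gives the trend line. A larger span produces a smoother path, while a smaller span follows the monthly points more closely.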
[Figure 4: PCA biplot with calibrated axes for the variables A1, A2, A3, A4, A5, B5, C4, C5, C6, C7, C8, D4, D6, D7 and E5, showing the monthly sample points Jan04 to Mar05, the point labelled TARGET, shaded regions labelled Poor quality, Satisfactory quality and Good quality, and a smooth trend line.]
Fig. 4. PCA biplot of the multivariate quality data with smooth trend line superimposed
Perusal of Figure 4 reveals a clockwise progression around the target value during
2004: Initially good quality but some distance away from the target towards the edge
of the white area. This is followed by good quality product approximating the target
in the winter (southern hemisphere) of 2004 but declining quality during the spring
of 2004. After unacceptable quality in late spring a pronounced switch back to good
quality occurred in the summer of 2005.
6 Conclusion
Several novel extensions of the biplot methodology suggested by [GH96] are proposed
in this paper. The usefulness of these extensions has been illustrated by an example
data set from industry. This example demonstrates potential usages of PCA biplots equipped with acceptance regions and smoothed trend lines in industry. These
acceptance regions provide the user with information as to which samples and combination of values of the variables yield the most desirable results. Since the feature
space is p-dimensional only approximated acceptance regions can be represented in
the r -dimensional biplot. Several methods of approximation can be considered. In
this paper the prediction approach is followed.
References
[G71] Gabriel, K.R.: The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453–467 (1971)
[GH96] Gower, J.C. and Hand, D.J.: Biplots. Chapman and Hall, London (1996)
[GN05] Gower, J.C. and Ngouenet, R.F.: Nonlinearity effects in multidimensional scaling. J. Multivariate Anal., 94, 344–365 (2005)