Multidimensional visualisation of time series and the construction of acceptance regions in a PCA biplot

Sugnet Gardner and Niël J le Roux

Department of Statistics and Actuarial Science, Stellenbosch University, Private Bag X1, Matieland, 7602, South Africa
njlr@sun.ac.za

Summary. It is shown how to mimic an ordinary scatterplot in a principal component analysis biplot by constructing calibrated axes representing p > 2 variables. The methodology is extended to represent a multidimensional target in the biplot as well as to equip the biplot with acceptance regions. These acceptance regions enable researchers to evaluate how well their data compare to a specified target together with an understanding of the roles played by all variables. Finally it is demonstrated how changes over time can be visually displayed in the biplot.

Key words: acceptance regions, biplot, industrial quality control, principal component analysis, trend lines.

1 Introduction

The biplot as presented by [G71] is widely applied in practice as a graphical aid for visualisation of multidimensional data. [GH96] introduced a new philosophy in biplot methodology, viewing biplots as the multivariate analogues of scatterplots. This allows the visual appraisal of the structure of data in a few dimensions. Biplot axes are used to relate the plotted points to the original variables as with ordinary scatterplots. By extending the scatterplot principles to multivariate displays these biplots ease interpretation and are accessible to non-statistical audiences.

The data set used in this paper originates from a manufacturing industry. The quality of the final product is influenced by a number of different variables. Measurements of these variables are made over time on a daily basis and averaged to obtain monthly values. For each of the variables a target or specification value is fixed by management.
The quality of the product can therefore be considered as a multidimensional attribute that is to be compared to a multidimensional target. A principal component analysis (PCA) biplot is a valuable graphical aid when monitoring such processes. After a brief introduction to the construction of a PCA biplot in the next section, we show by example how to represent a multidimensional target in the biplot and how to equip the biplot with acceptance regions for evaluating how well the product meets the target values.

2 Principal component analysis biplots

Several types of biplot can be constructed depending on the nature of the data and the distance metric used. Probably the most widely used and best known type is the PCA biplot, which is based on ordinary Euclidean distances. PCA aims to optimally represent the variation in a data set in a few dimensions. Although the computations can be performed algebraically for any dimension $r \le \min(n, p)$, where $n$ is the number of samples or observations and $p$ is the dimension of the original data set, PCA biplots are usually constructed in $r = 2$ (or 3) dimensions.

A data matrix $X: n \times p$ gives the co-ordinates of $n$ samples in the space $\mathbb{R}^p$. The principal component method chooses the $r$-dimensional subspace $L$ that is best fitting in the least squares sense. If $X$ is the centred data matrix, as will be assumed unless stated otherwise, it follows from Huygens’ Principle that $L$ passes through the origin. Let the singular value decomposition of $X'X$ be given by $X'X = V \Lambda V'$ with $V'V = I$ and the diagonal elements of $\Lambda$ arranged in non-increasing order. [GH96] shows that the first $r$ columns of $V$, denoted by $V_r$, form an orthonormal basis for $L$. These principal axes define a natural set of orthogonal co-ordinate axes that can be used as scaffolding for plotting the sample points in the biplot. Relative to the principal axes, the co-ordinates of the projections of the sample points are given by $Z = X V_r$.
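The construction of the scaffolding can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name pca_scaffolding and the synthetic data are hypothetical and are not taken from the paper, and the symmetric eigendecomposition routine is used for $X'X$ because its eigenvectors coincide with the columns of $V$.

```python
import numpy as np

def pca_scaffolding(data, r=2):
    """Return (mean, V_r, Z) for a PCA biplot of an n x p data array."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    X = data - mean                        # centred data matrix
    # X'X is symmetric, so eigh yields X'X = V Lambda V' with V'V = I
    eigvals, V = np.linalg.eigh(X.T @ X)
    order = np.argsort(eigvals)[::-1]      # eigh returns ascending eigenvalues
    V_r = V[:, order[:r]]                  # first r principal axes (p x r)
    Z = X @ V_r                            # co-ordinates of the samples in L
    return mean, V_r, Z

# Synthetic stand-in data (hypothetical, not the industrial data of the paper)
rng = np.random.default_rng(0)
data = rng.normal(size=(15, 5)) @ rng.normal(size=(5, 5))
mean, V_r, Z = pca_scaffolding(data, r=2)
```

The rows of Z are the points that would be plotted on the two scaffolding axes.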
This process of finding the representation of the sample points in terms of the biplot scaffolding in the space $L$ is called interpolation. The above formula for interpolating the original sample points can also be used for algebraically interpolating a new point onto the biplot: let $x^{*\prime}: 1 \times p$ denote the co-ordinates of a new point in $\mathbb{R}^p$ that is to be interpolated, expressed like the rows of $X$ as deviations from the variable means. The interpolated position of this point in the biplot is then given by $z^{*\prime} = x^{*\prime} V_r$.

The terms interpolation and prediction refer to the relationships between the scaffolding and the original variables. Prediction is the inverse of interpolation: it infers the values of the original variables for any point in the biplot space $L$. Inverting the above interpolation formula provides the prediction for $z^*$ as $\hat{x}^{*\prime} = z^{*\prime} V_r'$. In ordinary scatterplots, a single set of Cartesian axes is used for both interpolation and prediction. Since in general $p > r$, biplot axes are not orthogonal, and interpolation and prediction necessitate two different sets of axes. [GH96] shows that the $k$-th interpolation axis is defined by $\tau e_k' V_r$, where $-\infty < \tau < \infty$ and $e_k$ denotes a vector of zeros except for unity in the $k$-th position.

For constructing biplot axes analogous to the Cartesian axes of scatterplots, the biplot axes need to be calibrated with markers. Since $X$ is the centred data matrix, the point $\tau e_k' V_r$ with $\tau = 0$ corresponds to the mean value $\bar{x}_k$ of the $k$-th variable. Markers placed where $\tau = \mu$ for $\mu = \ldots, -2, -1, 0, 1, 2, \ldots$ correspond to the values $\bar{x}_k - 2, \bar{x}_k - 1, \bar{x}_k, \bar{x}_k + 1, \bar{x}_k + 2$, etc. However, the mean value is not necessarily a sensibly rounded value, so such axis markers are not particularly useful. [GH96] suggests that a ‘sensible’ scale value $\phi$ with $\bar{x}_k \le \phi < \bar{x}_k + 1$ be identified and that markers be placed where $\tau = \mu$ for values $\mu = \ldots, -2 + \phi - \bar{x}_k, -1 + \phi - \bar{x}_k, \phi - \bar{x}_k, 1 + \phi - \bar{x}_k, 2 + \phi - \bar{x}_k, \ldots$ for intervals of one unit.
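As an illustration of interpolation, prediction and the placement of interpolation markers at sensibly rounded values, a minimal sketch is given here. The helper names interpolate, predict and interpolation_markers, the choice of the next multiple of the step above the mean as the sensible value, and the synthetic data are all assumptions made for the example; unscaled data are assumed.

```python
import numpy as np

# Synthetic stand-in data (hypothetical), n = 20 samples, p = 3 variables
rng = np.random.default_rng(1)
data = rng.normal(loc=[10.3, 52.7, 3.4], scale=[2.0, 5.0, 1.0], size=(20, 3))
mean = data.mean(axis=0)
X = data - mean
# Right singular vectors of X are the eigenvectors of X'X
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_r = Vt[:2].T                                  # p x r with r = 2

def interpolate(x_new):
    """Biplot position of a point given in original units."""
    return (np.asarray(x_new, dtype=float) - mean) @ V_r

def predict(z):
    """Predicted values, in original units, for a point of the biplot."""
    return np.asarray(z, dtype=float) @ V_r.T + mean

def interpolation_markers(k, half_range=2, step=1.0):
    """Marker labels and positions on the k-th interpolation axis."""
    # 'Sensible' value phi with mean <= phi < mean + step (assumed rounding rule)
    phi = np.ceil(mean[k] / step) * step
    values = phi + step * np.arange(-half_range, half_range + 1)
    tau = values - mean[k]
    positions = np.outer(tau, V_r[k])           # tau * e_k' V_r
    return values, positions

values, positions = interpolation_markers(k=0)
```

Each row of positions is where the marker labelled with the corresponding entry of values would be drawn on the axis of the first variable.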
The interval size can easily be adjusted according to the unit of measurement.

It is well known that PCA is not scale invariant, and in the data set used in this paper large differences in units of measurement occur, necessitating some scaling of the data matrix. The scaling applied was to divide each variable by its standard deviation. For positioning markers on the $k$-th biplot axis, assume that some scaling is performed on the $k$-th variable by dividing by a value $s$. Placing markers where $\tau = \mu$ for $\mu = \ldots, -2, -1, 0, 1, 2, \ldots$ then results in markers at $\ldots, -2, -1, 0, 1, 2, \ldots$ standard deviation units from the mean. In using the biplot as a multivariate scatterplot this is not very helpful, since sample points should be related to the original unit of measurement. To place ‘sensible’ scale markers on the $k$-th biplot axis, the markers should be placed where $\tau = \mu$ for values

$$\mu = \ldots, \frac{-2 + \phi - \bar{x}_k}{s}, \frac{-1 + \phi - \bar{x}_k}{s}, \frac{\phi - \bar{x}_k}{s}, \frac{1 + \phi - \bar{x}_k}{s}, \frac{2 + \phi - \bar{x}_k}{s}, \ldots$$

When inspecting any graphical representation, the axes are generally used to read off the values of the variables. This is closely related to the concept of prediction of [GH96], that is, inferring the values of the original variables for points in $L$. In general, the interpolation and prediction biplot axes have different directions, but [GN05] shows that in the case of an additive distance metric, as is the case with PCA biplots, the directions of the interpolation and prediction biplot axes coincide because the distances for each variable separately are Euclidean. The interpolation axes can therefore be used as prediction axes, fitted with a different set of prediction markers. The relationship between the interpolation markers (orthogonal projection onto the biplot axis $\beta_k$) and prediction markers (orthogonal back projection onto the $k$-th Cartesian axis $\xi_k$) is illustrated in Figure 1.

Fig. 1. Relationship between the interpolation (I) and prediction (P) markers on PCA biplot axis $\beta_k$. The axis $\beta_k$ is the representation in $L$ of Cartesian axis $\xi_k$ in $\mathbb{R}^p$ with $N$ denoting a $p - 1$ dimensional hyperplane perpendicular to $\xi_k$ passing through the marker $\mu$. (After [GH96])

The markers on the $k$-th prediction biplot axis are located at

$$\frac{\mu}{e_k' V_r V_r' e_k} \, e_k' V_r$$

for values of $\mu$ as discussed above.

3 Representing a multidimensional target in a PCA biplot

In Figure 2 a PCA biplot of the 15 dimensional industrial data set is shown. The multivariate target is also interpolated onto the biplot using the interpolation formula described in the previous section. It is stressed that the biplot axes are prediction axes and cannot be used for interpolation. The simultaneous representation of the variables and sample points allows for inferring the values of all 15 variables from the display. This is done by orthogonal projection onto an axis and reading off the value of the associated variable. For example, there is very little difference between the values of A2 for June and July 2004, but June 2004 has a value of approximately 30.45 for C8 while the corresponding July 2004 value is approximately 32.1.

Fig. 2. PCA biplot of the 15 dimensional industrial data set [calibrated axes A1, A2, A3, A4, A5, B5, C4, C5, C6, C7, C8, D4, D6, D7 and E5; plotted points Jan04 to Mar05 and TARGET]

Interpolating the multivariate target onto the biplot leads to considering the quality performance of an observation. For instance, June 2004 was relatively close to the target, compared to July 2004, which is much more distant.
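Reading a value off a prediction axis by orthogonal projection, and interpolating a target, can be mimicked numerically. The sketch below assumes each variable is centred and divided by its standard deviation; the helper names prediction_marker and read_off, the target vector and the data are hypothetical and serve only to show that the value read at the foot of the projection agrees with algebraic prediction.

```python
import numpy as np

# Synthetic stand-in data (hypothetical), n = 15 months, p = 4 variables
rng = np.random.default_rng(2)
data = rng.normal(loc=[30.0, 5.0, 60.0, 1.2],
                  scale=[1.0, 0.3, 3.0, 0.1], size=(15, 4))
mean, sd = data.mean(axis=0), data.std(axis=0, ddof=1)
X = (data - mean) / sd                      # centred and scaled data matrix
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_r = Vt[:2].T                              # p x 2

def prediction_marker(k, value):
    """Biplot position of the prediction marker labelled `value` on axis k."""
    mu = (value - mean[k]) / sd[k]          # marker in scaled, centred units
    b = V_r[k]                              # e_k' V_r, direction of axis k
    return mu / (b @ b) * b                 # mu / (e_k' V_r V_r' e_k) * e_k' V_r

def read_off(z, k):
    """Value of variable k read from axis k by orthogonal projection of z."""
    b = V_r[k]
    foot = (z @ b) / (b @ b) * b            # foot of the perpendicular on axis k
    mu = foot @ b                           # marker whose position is `foot`
    return mean[k] + sd[k] * mu

# Hypothetical target, interpolated with the same centring and scaling
target = np.array([31.0, 5.0, 62.0, 1.1])
z_target = ((target - mean) / sd) @ V_r
```

Because the display is two-dimensional, the values read off for z_target are in general only approximations of the entries of target.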
The axes are used to judge the differences between June and July 2004: these differences are largely negligible for variables A1, A2, A5, B5, D6 and D7 but considerable for the other variables. The centroid of the data, where the biplot axes are concurrent, does not correspond to the target value. This shows that the attribute values tend to differ from target, e.g. C8 tends to be too low with a mean value close to 30 and a target of close to 31. In order to obtain a more detailed assessment of the product quality, the display space needs to be ‘quality-calibrated’ reflecting the level of quality corresponding to the associated combinations of variables.

4 Acceptance regions

Any quality management system for assessing multivariate quality can be used with the PCA biplot methodology. The only requirement for computing acceptance regions is that the system must result in a single score describing the multivariate quality. In the example considered here an in-house system calculating a quality index value between 0 and 100 based on all 15 variables’ values is used. The aim is to visually represent the 15 variables included in the calculation of the quality index and the multivariate target to assess quality performance. The calibration method combines PCA biplot prediction with the quality index calculations or any other system in use for evaluating quality. The following algorithm can be used to construct quality regions:

• Construct a two-dimensional $m \times m$ grid in the biplot space and represent all cells formed by the grid in a matrix, say $E: m^2 \times 2$. Choose $m$ large enough to allow accurate representation of a cell by a single point.
• Let $z^{*\prime}$ be the $i$-th row of $E$, sequentially for $i = 1, 2, \ldots, m^2$.
  – Find the predicted values for $z^*$ either by graphical prediction using the prediction axes or, preferably, by algebraic prediction where $\hat{X} = E V_r'$.
  – Calculate the quality index value based on the 15 predicted values for each row of $\hat{X}$.
• Label the cell in the biplot space according to the corresponding quality index (QI) value:
  - POOR: $0 \le \mathrm{QI} < 50$ (LIGHT GREY)
  - SATISFACTORY: $50 \le \mathrm{QI} < 80$ (DARK GREY)
  - GOOD: $80 \le \mathrm{QI} \le 100$ (WHITE).

The above quality ratings are those used by the industrial process considered in the case study but any user-specified system of categories can be applied. These quality regions are added to the PCA biplot in Figure 2 leading to Figure 3. The large white area surrounding the target value represents good quality product. The satisfactory quality region is a narrow dark grey band around the white area. This indicates that a certain amount of deviation around the target is tolerated but once this limit is exceeded, the quality index is severely penalised. The light grey area on the outside denotes the unacceptable values resulting in poor quality.

Using the colour-coded quality regions indicating levels of acceptance of the product quality, it can be seen at a glance that October 2004 is flagged for not having the desired quality level. Furthermore, November 2004 and March 2005 produced satisfactory quality, but it could be improved upon. Months such as March 2004, May 2004, September 2004 and January 2005 border on the danger zone, with December 2004 just outside the good quality region.

The quality regions are more effective in assessing acceptance level of quality than merely considering the distance an observation is away from the target. For instance, July 2004 was singled out as an observation situated far away from the target, yet the product quality region shows the quality still to be good. Furthermore, the shape of the quality regions conveys important information: in this case the quality regions are not circular around the point of concurrency of the axes. This shows that product quality is not equally sensitive to all variables.
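The grid algorithm for the quality regions can be sketched as follows. The in-house quality index is not specified in the paper, so toy_quality_index below is a purely hypothetical stand-in; the function name quality_regions, the grid limits, the made-up means and standard deviations, and the back-transformation from scaled predictions to original units are likewise assumptions of this example.

```python
import numpy as np

def toy_quality_index(x, target, tolerance):
    """Hypothetical index in [0, 100]; NOT the in-house system of the paper."""
    penalty = np.sum(((x - target) / tolerance) ** 2)
    return float(np.clip(100.0 - 10.0 * penalty, 0.0, 100.0))

def quality_regions(V_r, mean, sd, quality_index, z_min, z_max, m=50):
    """Label the m^2 cells of a grid in the biplot plane by quality category."""
    g1 = np.linspace(z_min[0], z_max[0], m)
    g2 = np.linspace(z_min[1], z_max[1], m)
    E = np.array([(a, b) for a in g1 for b in g2])    # m^2 x 2 cell centres
    X_hat = E @ V_r.T * sd + mean                     # predictions, original units
    qi = np.array([quality_index(row) for row in X_hat])
    names = np.array(["POOR", "SATISFACTORY", "GOOD"])
    # digitize: QI < 50 -> 0, 50 <= QI < 80 -> 1, QI >= 80 -> 2
    labels = names[np.digitize(qi, [50.0, 80.0])]
    return E, labels

# Tiny demonstration with made-up numbers (p = 4 variables, r = 2)
V_r = np.linalg.qr(np.random.default_rng(3).normal(size=(4, 2)))[0]  # orthonormal
mean = np.array([30.0, 5.0, 60.0, 1.2])
sd = np.array([1.0, 0.3, 3.0, 0.1])
target = mean.copy()                                  # target at the mean, for simplicity
E, labels = quality_regions(
    V_r, mean, sd,
    lambda x: toy_quality_index(x, target, tolerance=sd),
    z_min=(-4.0, -4.0), z_max=(4.0, 4.0), m=41)
```

The labels can then be reshaped to an m by m array and drawn as a shaded background behind the biplot; any other scoring function returning a single value per predicted row could be passed in place of the toy index.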
The means of variables C4, C5 and E5 are according to target specification, but the means of A3, A4, C6 and C8 are below target specifications. Therefore any month showing below average values for one or more of the latter variables will quickly result in poor quality product. On the other hand, variables A1, A2, A5, B5, D6 and D7 are almost parallel with the first principal component (horizontal scaffolding axis) and also correspond to the wider section of the white area. Relatively large deviations from the mean values of these variables will still result in good quality product.

Fig. 3. PCA biplot of the multivariate quality data with QI quality regions [regions labelled Poor quality, Satisfactory quality and Good quality]

It follows that the QI biplot allows for judgements to be made not relying on distance only or evaluating attributes individually, but that the portion of each axis in the good quality region as well as the position of a data point relative to all attributes contribute to a decision regarding remedial actions to be adopted. If a more detailed visual display of the quality performance of the process is needed, only the number of descriptive categories needs to be increased.

It is to be noted that there seems to be some time effect associated with variables A1, A2, A5, B5, D6 and D7: two distinct clusters of QI values are associated with these variables, the first half of 2004 appearing to the left of the point of concurrency and August 2004 to February 2005 to its right. A proposal to address this aspect in a biplot is discussed in the next section.
5 Identifying trends in quality performance over time

From the discussion in the previous section it is clear that there is evidence in the biplot of systematic changes over time. Also, it was pointed out that June 2004 was characterised by a high quality product while October 2004 to March 2005 showed product of lower quality. Since the mean values form a multivariate time series, the sample points can be sequentially connected to trace the performance over time. A smoother can be applied to these points to filter out short term disturbances. Since the scaffolding consists of two orthogonal axes, the smoothing is performed by smoothing each dimension separately. The resulting biplot is shown in Figure 4. This biplot clearly shows systematic changes over time.

A local smoothing technique (in this example a loess-procedure) is applied to the sample pairs $(z_{1j}, t_1), (z_{2j}, t_2), \ldots, (z_{nj}, t_n)$ for $j = 1, 2$ and $t_i = i$. The resulting fitted values form the co-ordinates $(\hat{z}_{11}, \hat{z}_{12}), (\hat{z}_{21}, \hat{z}_{22}), \ldots, (\hat{z}_{n1}, \hat{z}_{n2})$, which are then connected to trace a smooth path over time, leading to the smooth trend line shown in the biplot in Figure 4. The amount of smoothing can easily be interactively controlled by adaptive adjustment of the smoothing parameters.

Fig. 4. PCA biplot of the multivariate quality data with smooth trend line superimposed

Perusal of Figure 4 reveals a clockwise progression around the target value during 2004: initially good quality but some distance away from the target towards the edge of the white area. This is followed by good quality product approximating the target in the winter (southern hemisphere) of 2004 but declining quality during the spring of 2004. After unacceptable quality in late spring a pronounced switch back to good quality occurred in the summer of 2005.

6 Conclusion

Several novel extensions of the biplot methodology suggested by [GH96] are proposed in this paper. The usefulness of these extensions has been illustrated by an example data set from industry. This example demonstrates potential usages of PCA biplots equipped with acceptance regions and smoothed trend lines in industry. These acceptance regions provide the user with information as to which samples and combination of values of the variables yield the most desirable results. Since the feature space is $p$-dimensional, only approximated acceptance regions can be represented in the $r$-dimensional biplot. Several methods of approximation can be considered. In this paper the prediction approach is followed.

References

[G71] Gabriel, K.R.: The biplot graphical display of matrices with application to principal component analysis. Biometrika, 58, 453–467 (1971)
[GH96] Gower, J.C. and Hand, D.J.: Biplots. Chapman and Hall, London (1996)
[GN05] Gower, J.C. and Ngouenet, R.F.: Nonlinearity effects in multidimensional scaling. J. Multivariate Anal., 94, 344–365 (2005)