Descriptive statistics for boxplot variables

Carlos Maté (1) and Javier Arroyo (2)

(1) Departamento de Organización Industrial, ETSI (ICAI), Universidad Pontificia Comillas, Alberto Aguilera 25, 28015 Madrid, Spain. cmate@upcomillas.es
(2) Departamento de Sistemas Informáticos, Universidad Complutense, Profesor García-Santesmases s/n, 28040 Madrid, Spain. javier.arroyo@fdi.ucm.es

Summary. Boxplots (BPs) are exploratory charts used to extract meaningful information from batches of data at a quick glance. However, they possess an untapped potential: they can be considered as a kind of symbolic variable, since they summarize data while retaining the key information. This paper introduces BP variables and proposes a set of descriptive statistics for them, based on the principles that guide the statistics already proposed for other symbolic variables (i.e. interval variables and interval-valued modal variables). A stock market example shows that the approach efficiently summarizes the original data set while preserving its knowledge.

Key words: boxplot, symbolic data analysis, aggregation, exploratory data analysis.

1 Introduction

Symbolic data analysis (SDA) proposes an alternative approach to deal with large and complex data sets [BD03]. It allows the summarization of these data sets into smaller and more manageable ones that retain the key knowledge. In these data sets, items are described by symbolic variables, which provide more information than classical ones, where only a single number or category is allowed as a value. Types of symbolic variables include lists of values, intervals, histograms and distributions [Boc00], but new types can be considered.

In this article, BPs are proposed as a new kind of symbolic variable. BPs [Tuk77] are an extremely useful exploratory tool. They allow the description of distributions by a condensed display which has a clear graphical representation and which can be easily explained to non-statisticians [Ben88].
These features allow BPs to be proposed as a new kind of symbolic variable which can play a very interesting role in SDA. BP variables represent batches of data by reporting, in a condensed way, their location, spread, skewness and normality. They offer a good compromise between the well-defined structure and simplicity of interval variables and the detailed information provided by histogram variables. Descriptive statistics for BP variables, based on the statistics proposed for other symbolic variables, will also be proposed.

2 Definition of Boxplot Variable

Let $Z$ be a variable defined for all items of a finite set $E = \{1, \ldots, N\}$. $Z$ is termed a BP variable with domain of values $Y$ if $Z$ is a mapping from $E$ to a range $B(Y)$ such that, given $u \in E$, $Z(u) = \{m_u, q_u, Me_u, Q_u, M_u\}$, where $-\infty < m_u \le q_u \le Me_u \le Q_u \le M_u < \infty$, and $m_u$ represents the minimum, $q_u$ the lower quartile, $Me_u$ the median, $Q_u$ the upper quartile, and $M_u$ the maximum of the values of the variable for the item $u$.

BP variables can be considered as a particular case of interval-valued modal variables. An interval-valued modal variable is a multistate variable with a weight attached to each specific interval in the data (histograms are another particular case of this kind of variable). Thus, $Z$ can be represented as an interval-valued modal variable composed of four consecutive intervals: the lower-whisker interval $\psi_{u,1} = [m_u, q_u)$, the lower-mid-box interval $\psi_{u,2} = [q_u, Me_u)$, the upper-mid-box interval $\psi_{u,3} = [Me_u, Q_u)$, and the upper-whisker interval $\psi_{u,4} = [Q_u, M_u]$, each one with frequency (or probability) $p_{u,k} = 0.25$, $k = 1, \ldots, 4$; i.e. $Z(u) = \{(\psi_{u,1}, 0.25), (\psi_{u,2}, 0.25), (\psi_{u,3}, 0.25), (\psi_{u,4}, 0.25)\}$.
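As a minimal sketch of the definition above, the following Python code builds the five-number value $Z(u)$ from a raw batch and rewrites it as four consecutive intervals with weight 0.25 each. The function names (`boxplot_value`, `as_modal_intervals`) are hypothetical, and the quartile rule (the "inclusive" method of `statistics.quantiles`) is an assumption, since the text does not state which quartile definition is used. For simplicity every interval is stored as a plain (lower, upper) pair, without tracking open or closed endpoints.

```python
from statistics import quantiles


def boxplot_value(batch):
    """Return (m, q, Me, Q, M) for a non-empty batch of numbers.

    Assumption: quartiles follow the 'inclusive' interpolation rule;
    other quartile conventions would give slightly different values.
    """
    data = sorted(batch)
    if len(data) == 1:
        # A single observation is a degenerate boxplot (a point value).
        x = data[0]
        return (x, x, x, x, x)
    q, me, big_q = quantiles(data, n=4, method="inclusive")
    return (data[0], q, me, big_q, data[-1])


def as_modal_intervals(bp):
    """Rewrite a boxplot as four consecutive intervals, each weighted 0.25."""
    return [((bp[k], bp[k + 1]), 0.25) for k in range(4)]


# Illustrative batch (made-up numbers, not taken from the paper).
bp = boxplot_value([2, 4, 4, 5, 7, 9, 10, 12, 15])
intervals = as_modal_intervals(bp)
print(bp)
print(intervals)
```

The upper bound of each interval equals the lower bound of the next one, which is the consecutiveness property used by the statistics in the following section.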
3 Basic Statistics for BP Variables

As BP variables can be considered as a particular case of interval-valued modal symbolic variables, the statistics already proposed for interval variables and for interval-valued modal symbolic variables ([BG00], [BD02], [BD03]) can be adapted to BP variables.

In the next definitions, the considered BP variable $Z$ is defined for all objects $u \in E$, where $E = \{1, \ldots, N\}$ and $Z(u) = \{(\psi_{u,k}, 0.25)\}$, with $k = 1, \ldots, 4$. If $k = 1, 2, 3$, then $\psi_{u,k} = [a_{u,k}, b_{u,k})$, and if $k = 4$, then $\psi_{u,k} = [a_{u,k}, b_{u,k}]$. As the four intervals are consecutive, $b_{u,k} = a_{u,k+1}$, $k = 1, 2, 3$. It is assumed that, within each interval, values are uniformly distributed.

3.1 Univariate Statistics

Histogram

Let $I = [\min_{u \in E} a_{u,1}, \max_{u \in E} b_{u,4}]$ be the interval that spans all of the observed values of $Z$, and let $I$ be partitioned into $r$ subintervals, $I_g = [\xi_{g-1}, \xi_g)$, $g = 1, \ldots, r-1$, and $I_r = [\xi_{r-1}, \xi_r]$. Then, the observed frequency for the interval $I_g$ is

\[ O_Z(g) = \sum_{u \in E} \sum_{\psi_{u,k} \in Z(g)} \frac{\|\psi_{u,k} \cap I_g\|}{\|\psi_{u,k}\|} \, 0.25, \tag{1} \]

where $Z(g)$ represents the set of all those intervals $\psi_{u,k} = [a_{u,k}, b_{u,k})$ that overlap with $I_g$, for a given $u$, and where $\|A\|$ is the length of the interval $A$. If the ratio $\|\psi_{u,k} \cap I_g\| / \|\psi_{u,k}\|$ has the form $0/0$, this term takes the value 1; this happens when $\psi_{u,k}$ has length zero, i.e. when it is a classical point value. The relative frequency for the interval $I_g$ is $R_Z(g) = O_Z(g)/N$.

Empirical density function

\[ f(\xi) = \frac{1}{N} \sum_{u \in E} \sum_{k=1}^{4} \frac{T_{uk}(\xi)}{\|\psi_{u,k}\|} \, 0.25, \quad \xi \in \mathbb{R}, \tag{2} \]

where $T_{uk}(\xi)$ is the indicator function that $\xi$ is in the interval $\psi_{u,k}$.

Percentiles and boxplot

Given (2), the percentiles can be defined as $c_k = \xi_k$ given that $P(f(\xi) < f(\xi_k)) = k$, where $k \in [0, 1]$. If the minimum, the quartiles and the maximum are computed, the boxplot of the considered BP variable can be displayed.

Mean

\[ \overline{Z} = \frac{1}{N} \sum_{u \in E} \sum_{k=1}^{4} \frac{b_{u,k} + a_{u,k}}{2} \, 0.25 = \frac{1}{8N} \sum_{u \in E} \sum_{k=1}^{4} (b_{u,k} + a_{u,k}). \tag{3} \]

This measure can be interpreted as the center of gravity of the centers of gravity of each considered BP. It is easy to show that $\overline{Z - \overline{Z}} = 0$.

Variance

There are two possible definitions of the symbolic variance for data from a BP variable. The first one, based on the variance for interval-valued modal variables in [BD03], is discarded because, for a BP variable $Z$ that takes the same value $Z(u) = \{m, q, Me, Q, M\}$ for all $u$, it can yield a variance greater than 0. The second proposal, based on the variance for histogram variables [BD02], is defined as

\[ S_Z^2 = \frac{1}{64N} \sum_{u \in E} \Big\{ \sum_{k=1}^{4} (b_{u,k} + a_{u,k}) \Big\}^2 - \frac{1}{64N^2} \Big\{ \sum_{u \in E} \sum_{k=1}^{4} (b_{u,k} + a_{u,k}) \Big\}^2. \tag{4} \]

3.2 Bivariate Statistics

The joint density function, the covariance and the correlation for interval-valued modal variables and histogram variables are defined in [BD02] and [BD03]. Here, we adapt them for BP variables, where $Z_1$ and $Z_2$ are BP variables defined for all $u \in E = \{1, \ldots, N\}$ by $Z_i(u) = \{(\psi^i_{u,k}, 0.25)\}$, with $k = 1, \ldots, 4$ and $i = 1, 2$.

Empirical joint density function

\[ f(\xi_1, \xi_2) = \frac{1}{16N} \sum_{u \in E} \Big\{ \sum_{k_1=1}^{4} \sum_{k_2=1}^{4} \frac{T_{u k_1,k_2}(\xi_1, \xi_2)}{\|Z(k_1, k_2 : u)\|} \Big\}, \tag{5} \]

where $\|Z(k_1, k_2 : u)\|$ is the area of the rectangle $Z(k_1, k_2 : u) = \psi^1_{u,k_1} \times \psi^2_{u,k_2}$ and $T_{u k_1,k_2}(\xi_1, \xi_2)$ is the indicator function that the point $(\xi_1, \xi_2)$ is in the rectangle $Z(k_1, k_2 : u)$.

Joint histogram

Analogously to (1), the joint histogram for $Z_1$ and $Z_2$ is found by plotting $\{R_{g_1 g_2}, \pi_{g_1 g_2}\}$ over the rectangles $R_{g_1 g_2} = \{[\xi_{1,g_1-1}, \xi_{1,g_1}) \times [\xi_{2,g_2-1}, \xi_{2,g_2})\}$, $g_1 = 1, \ldots, r_1$, $g_2 = 1, \ldots, r_2$, with

\[ \pi_{g_1 g_2} = \frac{1}{N} \sum_{u \in E} \sum_{\psi^1_{u,k_1} \in Z(g_1)} \sum_{\psi^2_{u,k_2} \in Z(g_2)} 0.25^2 \, \frac{\|Z(k_1, k_2 : u) \cap R_{g_1 g_2}\|}{\|Z(k_1, k_2 : u)\|}, \tag{6} \]

where $Z(g_j)$ represents the set of all the intervals $\psi^j_{u,k_j}$ that overlap with the interval $[\xi_{j,g_j-1}, \xi_{j,g_j})$, with $j = 1, 2$, for each given value of $u$.

Covariance

\[ \mathrm{cov}(Z_1, Z_2) = \frac{1}{4N} \sum_{u \in E} \Big\{ \sum_{k_1=1}^{4} \sum_{k_2=1}^{4} 0.25^2 \, (b^1_{u,k_1} + a^1_{u,k_1})(b^2_{u,k_2} + a^2_{u,k_2}) \Big\} - \overline{Z_1} \cdot \overline{Z_2}, \tag{7} \]

where $\overline{Z_j}$, $j = 1, 2$, is obtained from (3).
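The mean (3), variance (4) and covariance (7) all depend on each item only through the quantity $\frac{1}{8}\sum_k (b_{u,k} + a_{u,k})$, the center of gravity of that item's boxplot. The following Python sketch makes this explicit. It assumes that each BP value is stored as a tuple (m, q, Me, Q, M) and that the two lists passed to the covariance describe the same items in the same order; the function names are hypothetical and the sample numbers are made up for illustration.

```python
def item_center(bp):
    """Center of gravity of one boxplot: (1/8) * sum_k (b_uk + a_uk)."""
    return sum(bp[k] + bp[k + 1] for k in range(4)) / 8.0


def bp_mean(values):
    """Equation (3): average of the item centers."""
    return sum(item_center(bp) for bp in values) / len(values)


def bp_variance(values):
    """Equation (4): mean of squared item centers minus squared mean."""
    n = len(values)
    centers = [item_center(bp) for bp in values]
    return sum(c * c for c in centers) / n - (sum(centers) / n) ** 2


def bp_covariance(values1, values2):
    """Equation (7): mean product of item centers minus product of means."""
    n = len(values1)
    cross = sum(item_center(x) * item_center(y)
                for x, y in zip(values1, values2)) / n
    return cross - bp_mean(values1) * bp_mean(values2)


# Two items per variable (illustrative values only).
z1 = [(1, 2, 3, 4, 5), (3, 4, 5, 6, 7)]
z2 = [(10, 20, 30, 40, 50), (0, 1, 2, 3, 4)]

print(bp_mean(z1), bp_variance(z1), bp_covariance(z1, z2))

# Identical boxplots for every item give zero variance under (4).
print(bp_variance([(1, 2, 3, 4, 5)] * 3))
```

Writing the statistics through `item_center` is only a rearrangement of the double sums: the factor $0.25^2/(4N)$ in (7) equals $1/(64N)$, which is the product of two $1/8$ factors averaged over the $N$ items.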
Pearson correlation coefficient

\[ r(Z_1, Z_2) = r_{1,2} = \frac{\mathrm{cov}(Z_1, Z_2)}{\sqrt{S^2_{Z_1} S^2_{Z_2}}}. \tag{8} \]

The behavior of this coefficient is similar to that of the Pearson correlation coefficient for classical data.

Partial correlation coefficients

The first-order partial correlation coefficient between $Z_j$ and $Z_i$ controlling for $Z_k$ is defined by

\[ r_{j,i \bullet k} = \frac{r_{j,i} - r_{j,k} \cdot r_{i,k}}{\sqrt{(1 - r^2_{j,k})(1 - r^2_{i,k})}}. \tag{9} \]

The higher-order partial correlation coefficient is recursively defined by

\[ r_{j,i \bullet k_1,k_2,\ldots,k_s} = \frac{r_{j,i \bullet k_1,k_2,\ldots,k_{s-1}} - r_{j,k_s \bullet k_1,k_2,\ldots,k_{s-1}} \cdot r_{i,k_s \bullet k_1,k_2,\ldots,k_{s-1}}}{\sqrt{(1 - r^2_{j,k_s \bullet k_1,k_2,\ldots,k_{s-1}})(1 - r^2_{i,k_s \bullet k_1,k_2,\ldots,k_{s-1}})}}, \tag{10} \]

where $i, j \neq k_s$.

These partial correlation coefficients evaluate the behavior of two BP variables while controlling for the effect of other BP variables. They can be useful to detect false relations, to identify relevant variables or to reveal hidden relations between BP variables.

It is worth mentioning an interesting property that supports the statistics presented. Let $Z_1$ and $Z_2$ be a pair of BP variables defined on a set $E$ such that $Z_i(k) = \{a_{ik}, a_{ik}, a_{ik}, a_{ik}, a_{ik}\}$, $a_{ik} \in \mathbb{R}$, $i = 1, 2$, for all $k \in E$. Then the values of the BP statistics defined in this paper equal the values of their respective classical statistics for the classical variable $Y_i(k) = a_{ik}$, for all $k \in E$, $i = 1, 2$.

Fig. 1. Quarterly close value of the FTSE 100 (top left), Nikkei 225 (top right), Dow Jones (bottom left) and IBEX 35 (bottom right) indexes during 2002-2004

4 An example

The behavior of the daily close value of four stock market indexes is analyzed over a period of three years (from 2002-1-1 to 2004-12-31). The considered indexes are the Dow Jones (the US share index), the Nikkei 225 (from the Japanese market), the FTSE 100 (the index which tracks the performance of the 100 biggest companies on the London market) and the IBEX 35 (the official index of the Spanish Continuous Market).
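Before turning to the data, the recursive definition of the partial correlation coefficients in (9) and (10), which this example uses to control for the Dow Jones index, can be sketched in Python as follows. The function name `partial_corr`, the representation of the correlations as a nested list `R`, and the numbers in the matrix are all illustrative assumptions; they are not the values of the example.

```python
from math import sqrt


def partial_corr(R, j, i, controls=()):
    """Partial correlation between variables j and i given `controls`.

    R is a symmetric matrix of Pearson correlations (nested lists).
    With no controls this is R[j][i]; otherwise the last control k_s is
    removed through equation (10), which reduces to (9) for one control.
    """
    if not controls:
        return R[j][i]
    *rest, ks = controls
    r_ji = partial_corr(R, j, i, rest)
    r_jk = partial_corr(R, j, ks, rest)
    r_ik = partial_corr(R, i, ks, rest)
    return (r_ji - r_jk * r_ik) / sqrt((1 - r_jk ** 2) * (1 - r_ik ** 2))


# Hypothetical correlation matrix for four variables (made-up numbers).
R = [
    [1.0, 0.6, 0.8, 0.0],
    [0.6, 1.0, 0.5, 0.0],
    [0.8, 0.5, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]

first_order = partial_corr(R, 0, 1, (2,))      # controls for variable 2
second_order = partial_corr(R, 0, 1, (2, 3))   # also controls for variable 3
print(first_order, second_order)
```

In this made-up matrix variable 3 is uncorrelated with all the others, so adding it as a second control leaves the coefficient unchanged, which gives a simple sanity check of the recursion.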
Data have been obtained from the web site http://finance.yahoo.com/. The daily values have been aggregated quarterly, leading to 12 elements (i.e. 12 quarters) described by four BP variables (one BP variable for each index in the considered period). The value of each BP variable for each element has been obtained by computing the BP of the original data for that quarter. Outliers have not been eliminated because, in financial data, the extreme behavior of variables usually must be taken into account.

The values of the univariate statistics for the BP variables are shown in Table 1. They agree with the visual impression obtained from the charts. Table 2 shows the same statistics for the classical data set. The results in both tables are quite similar, which supports the proposed statistics and the use of BP variables as a tool for summarizing quantitative data while retaining key information. However, it is important to keep in mind that BP statistics reflect the behavior of classes of individuals (quarters, in this case), while classical statistics reflect the behavior of individuals (days, in this case); the information provided in both cases is similar but not equivalent.

Table 1. Univariate statistics for the BP variables representing quarterly values of four stock market indexes

Index        Mean       Variance         Variat. Coef.    Ptle. 25%   Ptle. 50%   Ptle. 75%
Dow Jones    9510.38    689925.3         8.73 · 10^-2     8707.36     9783.77     10252.92
Nikkei 225   10205.08   12019.5 · 10^2   10.74 · 10^-2    9158.94     10522.98    11133.31
FTSE 100     4387.25    177225.7         9.6 · 10^-2      4083.34     4373.32     4596.18
IBEX 35      7317.31    726206.9         11.65 · 10^-2    6516.06     7421.27     8079.59

Table 2. Univariate statistics for the classical variables representing daily values of four stock market indexes

Index        Mean       Variance         Variat. Coef.    Ptle. 25%   Ptle. 50%   Ptle. 75%
Dow Jones    9512.33    767814.33        9.21 · 10^-2     8711.92     9781.02     10242.43
Nikkei 225   10205.81   13821.7 · 10^2   11.53 · 10^-2    9190.95     10505.05    11138.33
FTSE 100     4386.51    188805.18        9.91 · 10^-2     4081.3      4371.2      7387.6
IBEX 35      7320.85    795342.31        12.18 · 10^-2    6503.1      7387.6      8076.5

Table 3 shows the correlation matrix for the symbolic data set and, in parentheses, for the classical data set. Both sets of values are again quite similar. The partial correlation coefficients controlling for the effect of the Dow Jones index, computed for the BP data set and, in parentheses, for the classical data set, are shown in Table 4. This table reveals the importance of the Dow Jones index for the other indexes considered. It is worth mentioning that the high correlation between Nikkei and IBEX shown in Table 3 is spurious, because it is due to the influence of the Dow Jones, as Table 4 shows. Anybody knowing the basics of stock markets would not be surprised by this conclusion.

Table 3. Correlation matrix for the symbolic (resp. classical) data set described by BP (resp. classical) variables.

Correlation   Dow Jones   Nikkei 225    FTSE 100      IBEX 35
Dow Jones     1 (1)       0.91 (0.86)   0.77 (0.77)   0.97 (0.96)
Nikkei 225                1 (1)         0.76 (0.73)   0.89 (0.84)
FTSE 100                                1 (1)         0.81 (0.81)
IBEX 35                                               1 (1)

Table 4. Partial correlation matrix for BP (resp. classical) variables controlling the effect of the Dow Jones BP (resp. classical) variable.

Correlation   Nikkei 225   FTSE 100      IBEX 35
Nikkei 225    1 (1)        0.25 (0.21)   0.01 (0.08)
FTSE 100                   1 (1)         0.41 (0.39)
IBEX 35                                  1 (1)

The great similarity between the values of the classical and symbolic statistics shows that the features of the daily data are preserved in the quarterly aggregated data. However, in other cases aggregation can uncover hidden relations between classes or, in the worst case, it can mask the knowledge of the disaggregated data. It depends on the data set and on the appropriateness of the aggregation.

The example also shows that aggregation by means of BP variables allows an efficient reduction of the data set: while the original data set has 781 elements described by four variables, the symbolic data set has 12 elements described by four BP variables (and each BP variable can be seen as composed of five values), so the size of the symbolic data set is 7.7% of the size of the original one. Obviously, in massive data sets the reduction could be even more drastic, e.g. in the case of intra-daily values of stock market indexes (which are recorded several times per minute) that are aggregated daily.

5 Conclusions

In [BD03], Billard and Diday state that even in large data sets where in theory available methodology might seem to apply, routine use of such statistical techniques is often inappropriate. They suggest the use of SDA in these situations. This paper shows that aggregation by means of BP variables, and its analysis by symbolic statistics, fits in the SDA context, as it allows an efficient summarization of and knowledge extraction from quantitative data sets.

This approach is needed when the amount of data is overwhelming and needs to be cut down (e.g. sensors providing continuous real-time data) or when the interest lies in the classes and not in the individuals (e.g. the analysis of a feature measured in individuals across different regions). In those cases, summarization by a single value such as the mean seems highly inappropriate. In addition, in some cases, the analysis of the disaggregated data is not possible. For example, the correlation analysis between the intra-daily values of IBEX 35 and Dow Jones cannot be carried out due to the time lag between them; in this case, aggregation is required for analyzing the correlation of the data.

As has been shown, BPs possess a great potential that can be exploited by considering them as a type of symbolic variable. Their peculiar nature allows them to be considered as special cases of other symbolic variables, e.g.
as discretized distributions, as histogram variables with some constraints, or, as in the proposed approach, as a particular case of interval-valued modal variables. A set of descriptive statistics for BP variables has been presented, but new data analysis methods for BP variables can be proposed, e.g. principal component analysis or least squares regression. These developments will result in a promising alternative, far from negligible, for dealing with massive quantitative data sets.

References

[Ben88] Benjamini, Y.: Opening the Box of a Boxplot. American Statistician, 42, 257–262 (1988)
[BG00] Bertrand, P., Goupil, F.: Descriptive statistics for symbolic data. In: Bock, H.-H., Diday, E. (eds.) Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin (2000)
[BD02] Billard, L., Diday, E.: Symbolic regression analysis. In: Jajuga, K. et al. (eds.) Classification, Clustering and Data Analysis. Springer-Verlag, Berlin (2002)
[BD03] Billard, L., Diday, E.: From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association, 98, 991–999 (2003)
[Boc00] Bock, H.-H.: Symbolic data. In: Bock, H.-H., Diday, E. (eds.) Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin (2000)
[Tuk77] Tukey, J. W.: Exploratory Data Analysis. Addison-Wesley, Reading, MA (1977)