Descriptive statistics for boxplot variables

advertisement
Descriptive statistics for boxplot variables
Carlos Maté1 and Javier Arroyo2
1
2
Departamento de Organización Industrial, ETSI (ICAI), Universidad Pontificia
Comillas, Alberto Aguilera 25, 28015 Madrid, Spain. cmate@upcomillas.es
Departamento de Sistemas Informáticos, Universidad Complutense, Profesor
Garcı́a-Santesmases s/n, 28040 Madrid, Spain. javier.arroyo@fdi.ucm.es
Summary. Boxplots (BPs) are exploratory charts used to extract meaningful information from batches of data at a quick glance. However, they possess an untapped
potential: they can be considered as a kind of symbolic variable as they can summarize data retaining the key information. This paper introduces BP variables and
proposes a set of descriptive statistics for BP variables based on the principles that
guide the statistics already proposed for other symbolic variables (i.e. interval variables and interval-valued modal variables). A stock market example shows that the
approach efficiently summarizes the original data set preserving its knowledge.
Key words: boxplot,symbolic data analysis, aggregation,exploratory data analysis.
1 Introduction
Symbolic data analysis (SDA) proposes an alternative approach to deal with large
and complex data sets [BD03]. It allows the summarization of these data sets into
smaller and more manageable ones retaining the key knowledge. In these data sets,
items are described by symbolic variables, which provide more information than
classical ones, where only one single number or category is allowed as value. Types
of symbolic variables include lists of values, intervals, histograms and distributions
[Boc00], but new types can be considered.
In this article, BPs are proposed as a new kind of symbolic variable. BPs [Tuk77]
are an extremely useful exploratory tool. They allow the description of distributions
by a condensed display with has a clear graphical representation and which can be
easily explained to non-statisticians [Ben88]. These features allow BPs to be proposed as a new kind of symbolic variable which can play a very interesting role in
SDA. BP variables allow the representation of batches of data reporting, in a shortened way, about their location, spread, skewness and normality. They offer a good
compromise between the well-defined structure and simplicity of interval variables
and the detailed information provided by histogram variables. Descriptive statistics
for BP variables based on the statistics proposed for other symbolic variables will
also be proposed.
1550
Carlos Maté and Javier Arroyo
2 Definition of Boxplot Variable
Let Z be a variable defined for all items of a finite set E = {1, ..., N }, Z is termed
a BP variable with domain of values Y , if Z is a mapping from E to a range B(Y)
such as given u ∈ E, Z(u) = {mu , qu , M eu , Qu , Mu } , where −∞ < mu ≤ qu ≤
M eu ≤ Qu ≤ Mu < ∞, and mu represents the minimum, qu the lower quartile,
M eu the median, Qu the upper quartile, and Mu the maximum of the values of the
variable for the item u.
BP variables can be considered as a particular case of interval-valued modal
variables. An interval-valued modal variable is a multistate variable with a weight
attached to each specific interval in the data (histograms are other particular case of this kind of variable). Thus, Z can be represented as an intervalvalued modal variable composed by four consecutive intervals: lower-whisker interval ψu,1 = [mu , qu ), lower-mid-box interval ψu,2 = [qu , M eu ), upper-midbox interval ψu,3 = [M eu , Qu ), and upper-whisker interval ψu,4 = [Qu , Mu ],
each one with frequency (or probability) pu,k = 0.25, k = 1, ..., 4; i.e. Z(u) =
{(ψu,1 , 0.25), (ψu,2 , 0.25), (ψu,3 , 0.25), (ψu,4 , 0.25)}.
3 Basic Statistics for BP Variables
As BP variables can be considered as a particular case of interval-valued modal symbolic variables, the statistics already proposed for interval variables and for intervalvalued modal symbolic variables ( [BG00] [BD02] [BD03]) can be adapted to BP
variables. In the next definitions, the considered BP variable Z is defined for all
objects u ∈ E, where E = {1, ..., N }, Z(u) = {(ψu,k , 0.25)}, with k = 1, ..., 4. If
k = 1, 2, 3, ψu,k = [au,k , bu,k ) and if k = 4, ψu,k = [au,k , bu,k ]. As the four intervals are consecutive, then bu,k = au,k+1 , k = 1, 2, 3. It is assumed that within each
interval, values are uniformly distributed.
3.1 Univariate Statistics
Histogram
Let I = [minu∈E au,1 , maxu∈E bu,4 ] be the interval that spans all of the observed
values of Z and let I be partitioned into r subintervals, Ig = [ξg−1 , ξg ), g = 1, ..., r−1,
and Ir = [ξr−1 , ξr ]. Then, the observed frequency for the interval Ig is
OZ (g) =
X X
u∈E ψu,k ∈Z(g)
k ψu,k ∩ Ig k
0.25,
k ψu,k k
(1)
where Z(g) represents the set of all those intervals ψu,k = [auk , buk ) that overlap with
kψu,k ∩Ig k
k0k
Ig , for a given u, and where k A k is the length of the interval A. If kψ
= k0k
u,k k
this term takes the value 1; it happens when ψu,k has length zero, i.e. when it is a
classical point value. The relative frequency for the interval Ig is RZ (g) = OZ (g)/N .
Descriptive statistics for boxplot variables
1551
Empirical density function
f (ξ) =
1
N
XX
4
u∈E k=1
Tuk (ξ)
0.25,
k ψu,k k
ξ ∈ R,
(2)
where Tuk (ξ) is the indicator function that ξ is in the interval ψu,k .
Percentiles and boxplot
Given (2), the percentiles can be defined as ck = ξk given that P (f (ξ) < f (ξk )) = k,
where k ∈ [0, 1]. If minimum, quartiles and maximum are computed, the boxplot of
the considered BP variable can be displayed.
Mean
Z=
1
N
XX
4
u∈E k=1
1
buk + auk
0.25 =
2
8N
XX
4
(buk + auk ).
(3)
u∈E k=1
This measure can be interpreted as the center of gravity of the centers of gravity of
each considered BP. It is easy to show that Z − Z = 0.
Variance
There are two possible definitions for the symbolic variance for data from a BP
variable. The first one, based on the variance for interval-valued modal variables in [BD03], is discarded because given a BP variable Z such as Z(u) =
{m, q, M e, Q, M }∀u, the variance could be greater than 0. The second proposal,
based on the variance for histogram variables [BD02], is defined as
2
SZ
=
1
64N
XX
4
{
(buk + auk )}2 −
u∈E k=1
XX
1
{
64N 2 u∈E
4
(buk + auk )}2 .
(4)
k=1
3.2 Bivariate Statistics
The joint density function, the covariance and the correlation for modal intervalvalued and histogram variables are defined in [BD02] [BD03]. Here, we adapt them
for BP variables, being Z1 and Z2 BP variables defined ∀u ∈ E = {1, ..., N }, Zi (u) =
i
{(ψu,k
, 0.25)}, with k = 1, ..., 4 and i = 1, 2.
Empirical joint density function
f (ξ1 , ξ2 ) =
1
16N
X XX
4
4
{
u∈E k1 =1 k2 =1
Tuk1 ,k2 (ξ1 , ξ2 )
},
kZ(k1 , k2 : u)k
(5)
1
2
where kZ(k1 , k2 : u)k is the area of the rectangle Z(k1 , k2 : u) = ψu,k
× ψu,k
and
1
2
Tuk1 ,k2 (ξ1 , ξ2 ) is the indicator function that the point (ξ1 , ξ2 ) is in the rectangle
Z(k1 , k2 : u).
1552
Carlos Maté and Javier Arroyo
Joint histogram
Analogously to (1), the joint histogram for Z1 and Z2 is found by plotting
{Rg1 g2 , πg1 g2 } over the rectangles Rg1 g2 = {[ξ1,g1 −1 , ξ1,g1 ) × [ξ2,g2 −1 , ξ2,g2 )}, g1 =
1, ..., r1 , g2 = 1, ..., r2 with
πg1 ,g2 =
1
N
X X
u∈E ψ 1
u,k1
X
2
∈Z(g1 ) ψu,k
∈Z(g2 )
0.252
kZ(k1 , k2 : u) ∩ Rg1 g2 k
,
kZ(k1 , k2 : u)k
(6)
2
j
where Z(gj ) represents the set of all the intervals ψu,k
that overlap with the interval
j
[ξj,gj −1 , ξj,gj ) with j = 1, 2, for each given u value.
Covariance
cov(Z1 , Z2 ) =
1
4N
X XX
4
4
{
0.252 (b1uk1 +a1uk1 )(b2uk2 +a2uk2 )}−Z1 ·Z2 (7)
u∈E k1 =1 k2 =1
where Zj , j = 1, 2, is obtained from (3).
Pearson correlation coefficient
r(Z1 , Z2 ) = r1,2 =
q
cov(Z1 , Z2 )
2
SZ
S2
1 Z2
.
(8)
The behavior of this coefficient is similar to the behavior of the Pearson correlation
coefficient for classical data.
Partial correlation coefficients
The first-order partial correlation coefficient between Zj and Zi controlling Zk is
defined by
rj,i•k =
q
rj,i − rj,k ∆ri,k
2
2
(1 − rj,k
)(1 − ri,k
)
.
(9)
The higher-order partial correlation coefficient is recursively defined by
rj,i•k1 ,k2 ,...,ks =
q
rj,i•k1 ,k2 ,...,ks−1 − rj,ks •k1 ,k2 ,...,ks−1 ∆ri,ks •k1 ,k2 ,...,ks−1
2
2
(1 − rj,k
)(1 − rj,k
)
s •k1 ,k2 ,...,ks−1
s •k1 ,k2 ,...,ks−1
,
(10)
where i, j 6= ks . These partial correlation coefficients evaluate the behavior of two
BP variables controlling for the effect of another BP variables. They can be useful
in order to detect false relations, to identify relevant variables or to reveal hidden
relations between BP variables.
It is worth to mention an interesting property that supports the statistics presented. Let Z1 and Z2 be a pair of BP variables defined on a set E such that
Zi (k) = {aik , aik , aik , aik , aik }, aik ∈ R, i = 1, 2, ∀k ∈ E. Then the values of the BP
statistics defined in this paper equal the values of their respective classical statistics
for the classical variable Yi (k) = aik , ∀k ∈ E, i = 1, 2.
Descriptive statistics for boxplot variables
1553
Fig. 1. Quarterly close value of the FTSE 100 (up-left), Nikkei 225 (up-right), Dow
Jones (down-left) and IBEX 35 (down-right) index during 2002-2004
4 An example
The behavior of the daily close value of four stock market indexes is analyzed during
a period of three years (from 2002-1-1 to 2004-12-31). The considered indexes are
Dow Jones (the US share index), Nikkei 225 (from the Japanese Market), FTSE 100
(the index which tracks the performance of the biggest 100 companies on the London
Market) and IBEX 35 (the official index of the Spanish Continuous Market). Data
have been obtained from the web site http://finance.yahoo.com/. The daily values
have been aggregated quarterly, leading to 12 elements (i.e. 12 quarters) described
by four BP variables (one BP variable for each index in the considered period).
The BP variables values for each element have been obtained computing the BPs
of the original data for each quarter. Outliers have not been eliminated, because in
financial data the extreme behavior of variables usually must be taken into account.
The univariate statistics values for BP variables are shown in Table 1. They
fit with the visual impression that can be obtained analyzing the charts. Table 2
1554
Carlos Maté and Javier Arroyo
shows the same statistics for the classical data set. The results in both tables are
quite similar, which reinforces the statistics proposed and the use of BP variables
as a tool for summarizing quantitative data retaining key information. However, it
is important to keep in mind that BP statistics reflect the behavior of classes of
individuals (quarters, in this case), while classical statistics reflect the behavior of
individuals (days, in this case); information provided in both cases is similar but not
equivalent.
Table 1. Univariate statistics for the BP variables representing quarterly values of
four stock market indexes
Index
Dow Jones
Nikkei 225
FTSE 100
IBEX 35
Mean
Variance Variat. Coef. Ptle. 25% Ptle. 50% Ptle. 75%
9510.38
689925.3 8.73 · 10−2
10205.08 12019.5 · 102 10.74 · 10−2
4387.25
177225.7
9.6 · 10−2
7317.31
726206.9 11.65 · 10−2
8707.36 9783.77 10252.92
9158.94 10522.98 11133.31
4083.34 4373.32 4596.18
6516.06 7421.27 8079.59
Table 2. Univariate statistics for the classical variables representing daily values of
four stock market indexes
Index
Dow Jones
Nikkei 225
FTSE 100
IBEX 35
Mean
Variance Variat. Coef. Ptle. 25% Ptle. 50% Ptle. 75%
9512.33
767814.33 9.21 · 10−2
10205.81 13821.7 · 102 11.53 · 10−2
4386.51
188805.18 9.91 · 10−2
7320.85
795342.31 12.18 · 10−2
8711.92 9781.02 10242.43
9190.95 10505.05 11138.33
4081.3
4371.2
7387.6
6503.1
7387.6
8076.5
Table 2 shows the correlation matrix for the symbolic data set and, in parenthesis, for the classical data set. Both tables are quite similar again. The partial
correlation coefficients controlling the effect of the Dow Jones index computed for
the BP data set and, in parenthesis, for the classical data set are shown in Table
3. This table reveals the importance of the Dow Jones index in the other indexes
considered. It is worth mentioning that the high correlation between Nikkei and
IBEX shown in Table 2 is absolutely spurious because is due to the influence of Dow
Jones, as Table 3 proves. Anybody knowing the basics of stock markets would not
be surprised by this conclusion.
The great similarity between the values of the classical and symbolic statistics
shows that the features of the daily data are preserved in the quarterly aggregated
data. However, in other cases aggregation can uncover hidden relations between
classes or, in the worst case, it can mask the knowledge of the disaggregated data.
It depends on the data set and on the appropriateness of the aggregation.
The example also shows that aggregation by means of BP variables allows an
efficient data set reduction: while the original data set has 781 elements described by
four variables, the symbolic data set has 12 elements described by four BP variables
Descriptive statistics for boxplot variables
1555
Table 3. Correlation matrix for the symbolic (resp. classical) data set described by
BP (resp. classical) variables.
Correlation Dow Jones Nikkei 225 FTSE 100 IBEX 35
Dow Jones
1 (1) 0.91 (0.86) 0.77 (0.77) 0.97 (0.96)
Nikkei 225
1 (1) 0.76 (0.73) 0.89 (0.84)
1 (1) 0.81 (0.81)
FTSE 100
IBEX 35
1 (1)
Table 4. Partial correlation matrix for BP (resp. classical) variables controlling the
effect of the Dow Jones BP (resp. classical) variable.
Correlation Nikkei 225 FTSE 100 IBEX 35
Nikkei 225
1 (1) 0.25 (0.21) 0.01 (0.08)
FTSE 100
1 (1) 0.41 (0.39)
IBEX 35
1 (1)
(and each BP variable can be seen as composed by five values), it means that the
size of the symbolic data set is the 7.7% of the size of the original one. Obviously, in
massive data sets the reduction could be even more drastic; e.g. in the case of the
intra-daily values (which are recorded several times in a minute) of stock market
indexes that are daily aggregated.
5 Conclusions
In [BD03], Billard and Diday state that even in large data sets where in theory available methodology might seem to apply, routine use of such statistical techniques is
often inappropriate. They suggest the use of SDA in these situations. This paper
shows that aggregation by means of BP variables and its analysis by symbolic statistics fits in the SDA context as it allows an efficient summarization and knowledge
extraction from quantitative data sets. This approach is needed when the amount of
data is overwhelming and needs to be cut down (e.g. sensors providing continuous
real-time data) or when the interest lies in the classes and not in the individuals (e.g.
the analysis of a feature measured in individuals across different regions). In those
cases, summarization by a single value such as the mean seems highly inappropriate.
In addition, in some cases, the analysis of the disaggregated data is not possible.
For example, the correlation analysis between the intra-daily values of IBEX 35 and
Dow Jones cannot be carried out due to the time lag between them; in this case,
aggregation is required for analyzing the correlation of the data.
As has been shown, BPs possess a great potential that can be exploited considering them as a type of symbolic variable. Their peculiar nature allow them to be be
considered as special cases of other symbolic variables, e.g. as discretized distributions, as histogram variables with some constraints, or, as in the proposed approach,
as a particular case of interval-valued modal variables. A set of descriptive statistics
for BP variables has been presented but new data analysis methods for BP variables
can be proposed, e.g. principal component analysis or least square regression. These
1556
Carlos Maté and Javier Arroyo
developments will result in a promising alternative, far from negligible, to deal with
massive quantitative data sets.
References
[Ben88] Benjamini, Y.: Opening the Box of a Boxplot. American Statistician, 42,
257–262 (1988)
[BG00] Bertrand, P., Goupil, F.: Descriptive statistics for symbolic data. In: Bock
H. -H., Diday E. (eds.) Analysis of Symbolic Data: Exploratory Methods for
Extracting Statistical Information from Complex Data. Springer-Verlag, Berlin
(2000)
[BD02] Billard, L., Diday, E.: Symbolic regression analysis. In: Jajuga, K. et al. (eds)
Classification, Clustering and Data Analysis. Springer-Verlag, Berlin (2002)
[BD03] Billard, L., Diday, E.: From the statistics of data to the statistics of knowledge: symbolic data analysis. Journal of the American Statistical Association,
98, 991–999 (2003)
[Boc00] Bock, H. -H.: Symbolic data. In: Bock H. -H., Diday E. (eds.) Analysis of
Symbolic Data: Exploratory Methods for Extracting Statistical Information
from Complex Data. Springer-Verlag, Berlin (2000)
[Tuk77] Tukey, J. W.: Exploratory Data Analysis. Addison-Wesley, Reading, MA
(1977)
Download