Expected convex hull trimming of a data set

Ignacio Cascos
Department of Statistics, Universidad Carlos III de Madrid
Av. Universidad 30, E-28911 Leganés (Madrid), Spain
ignacio.cascos@uc3m.es

Summary. Given a data set in $\mathbb{R}^d$, we study regions of central points built by elementwise addition of the convex hulls of all its subsets with a fixed number of elements. An algorithm to compute such central regions for a bivariate data set is described.

Key words: convex hull, depth-trimmed regions, Minkowski sum

1 Introduction

The lack of a natural order in the multivariate Euclidean space makes it difficult to translate many univariate data analytic tools to the multivariate setting. The approach based on depth functions and depth-trimmed regions was first proposed by Tukey [Tuk75] and has been extensively developed in recent years; see Liu et al. [LPS99] for statistical applications of depth functions. Depth functions assign to a point its degree of centrality with respect to a multivariate probability distribution or a data cloud and thus provide a center-outward ordering of points, see Tukey [Tuk75] or Zuo and Serfling [ZS00a]. Depth-trimmed regions, meanwhile, are sets of central points; they are usually obtained as the level sets of a depth function. Among others, Koshevoy and Mosler [KM97] and Cascos and López-Díaz [CLD05] have proposed and studied particular families of central regions.

We propose a new family of central regions. Given a data set $x_1, \ldots, x_n \in \mathbb{R}^d$, its mean value $\overline{x}$ is the sum of all the points of the data set multiplied by $n^{-1}$. Imagine we want to obtain a set that contains $\overline{x}$ and is central with respect to the data cloud. A natural candidate is built by taking all the segments that join two distinct points of the data set and averaging them. We average them by taking their Minkowski (or elementwise) sum weighted by the inverse of the number of segments, that is, we multiply the sum by $\binom{n}{2}^{-1}$.
Such a procedure can be generalized from the set of all pairs of points to the set of all $k$-tuples for any $1 \le k \le n$.

In Section 2 we briefly introduce the expected convex hull trimmed regions of a data set. In Section 3 we use the classical circular sequence technique to develop an algorithm of complexity $O(n^2 \log n)$ that computes arbitrary expected convex hull trimmed regions of a bivariate data set. Sections 4 and 5 present an example and an alternative representation of the expected convex hull trimmed regions. Finally, an application of these trimmed regions to the measurement of multivariate scatter is discussed in Section 6, and some concluding remarks are presented in Section 7.

2 The expected convex hull trimming

Before defining the expected convex hull trimmed regions, we introduce some necessary notation and tools. Given $x_1, \ldots, x_n \in \mathbb{R}^d$, we denote their convex hull by $\mathrm{co}\{x_1, \ldots, x_n\}$. For two sets $F, G \subset \mathbb{R}^d$, their Minkowski or elementwise sum is given by $F + G := \{x + y : x \in F,\, y \in G\}$. For $\lambda \in \mathbb{R}$, we define $\lambda F := \{\lambda x : x \in F\}$.

Definition 1. Given $x_1, \ldots, x_n \in \mathbb{R}^d$ and $1 \le k \le n$, the expected convex hull trimmed region of level $k$ is given by
\[
CD^k(x_1, \ldots, x_n) := \binom{n}{k}^{-1} \sum_{1 \le i_1 < \ldots < i_k \le n} \mathrm{co}\{x_{i_1}, \ldots, x_{i_k}\},
\]
where the summation is in the Minkowski or elementwise sense and is performed over all the subsets of $k$ elements of $\{x_1, \ldots, x_n\}$.

We name these sets expected convex hull trimmed regions because, through the Minkowski sum and the multiplication by a scalar, we are averaging (finding expectations of) convex hulls. Whenever possible, we write $CD^k$ instead of $CD^k(x_1, \ldots, x_n)$.

In Figure 1, we have plotted the expected convex hull region of level $k = 2$ of a data set. The three frames respectively depict $n = 5$ points in the plane, the 10 segments joining them, and $CD^2$.

Fig. 1. Building $CD^2$.
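Definition 1 can be evaluated naively for tiny data sets: represent each summand $\mathrm{co}\{x_{i_1}, \ldots, x_{i_k}\}$ by its vertex set, form the Minkowski sum by adding every combination of one vertex per summand, and scale by $\binom{n}{k}^{-1}$. The following Python sketch (our own illustration, not the algorithm of Section 3; the candidate set grows exponentially) makes the construction concrete:

```python
from itertools import combinations
from math import comb

def cd_k_candidates(points, k):
    """Minkowski-average the convex hulls of all k-subsets (Definition 1).
    Returns a finite set of points whose convex hull is CD^k; it contains
    all extreme points of CD^k plus redundant interior candidates.
    Exponential cost -- for illustration on very small data sets only."""
    d = len(points[0])
    sums = {(0.0,) * d}                     # running Minkowski sum, starts at the origin
    for subset in combinations(points, k):
        # Minkowski sum with co{subset}: add each of its vertices in turn
        sums = {tuple(a + b for a, b in zip(acc, p))
                for acc in sums for p in subset}
    scale = 1.0 / comb(len(points), k)      # multiply by binom(n, k)^{-1}
    return {tuple(scale * c for c in s) for s in sums}
```

For $k = 1$ every summand is a single point, so the routine returns the singleton containing the mean; for $k = n$ there is a single summand and the routine returns the original points, whose convex hull is $\mathrm{co}\{x_1, \ldots, x_n\}$, in agreement with the definition.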
2.1 Properties of the expected convex hull trimmed regions

We briefly summarize the main properties of the expected convex hull trimmed regions.

Theorem 1. For $x_1, \ldots, x_n \in \mathbb{R}^d$ we have:
(i) affine equivariance: for any matrix $A$ and any point $b$, $CD^k(Ax_1 + b, \ldots, Ax_n + b) = A\,CD^k(x_1, \ldots, x_n) + b$;
(ii) nesting: if $k_1 \le k_2$, then $CD^{k_1} \subset CD^{k_2}$;
(iii) convexity: $CD^k$ is convex;
(iv) compactness: $CD^k$ is compact;
(v) if $k = 1$: $CD^1 = \{\overline{x}\}$;
(vi) if $k = n$: $CD^n = \mathrm{co}\{x_1, \ldots, x_n\}$.

Properties (i)–(iv) in Theorem 1 are similar to those studied for depth-trimmed regions in Theorem 3.1 of [ZS00b], except that the inclusion in the nesting property (ii) is here reversed with respect to the order relation on the parameter $k$. Properties (iii) and (iv) follow from the fact that $CD^k$ is the Minkowski sum of a finite number of polytopes and is thus itself a polytope.

It is possible to define a depth function in terms of these regions. For a fixed data set $\{x_1, \ldots, x_n\}$ and any $x \in \mathbb{R}^d$, we define
\[
CD(x) :=
\begin{cases}
\left(\min\{k : x \in CD^k\}\right)^{-1} & \text{if } x \in \mathrm{co}\{x_1, \ldots, x_n\}, \\
0 & \text{otherwise.}
\end{cases}
\]
The function $CD$ is a bounded, non-negative mapping, and from Theorem 1 it follows that it satisfies the properties of affine invariance, maximality at $\overline{x}$, monotonicity along rays from the mean value, and vanishing at infinity. Therefore $CD$ is a depth function, see Definition 2.1 in [ZS00a].

2.2 Extreme points

Since the set $CD^k$ is a polytope, there is a finite set of points on its boundary whose convex hull is $CD^k$. The minimal such set consists of its extreme points, which can be expressed in terms of sums of the extreme points of the summands $\mathrm{co}\{x_{i_1}, \ldots, x_{i_k}\}$.

For any set $F \subset \mathbb{R}^d$, its support function is defined as $h(F, u) := \sup\{\langle x, u\rangle : x \in F\}$ for any $u \in \mathbb{R}^d$. Since the support function is additive under Minkowski sums and $h(\mathrm{co}\{x_1, \ldots, x_n\}, u) = \max\{\langle x_1, u\rangle, \ldots, \langle x_n, u\rangle\}$, we obtain
\[
h(CD^k, u) = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \ldots < i_k \le n} \max\{\langle x_{i_1}, u\rangle, \ldots, \langle x_{i_k}, u\rangle\}.
\]

Let $x_1, \ldots, x_n$ be pairwise distinct. Directions in the $d$-dimensional Euclidean space can be interpreted as elements of $S^{d-1}$, the unit sphere in $\mathbb{R}^d$. If $u \in S^{d-1} \setminus L$, where $L$ is the set of directions that are orthogonal to the line through some two points $x_i$ and $x_j$ with $i \ne j$, then there is a permutation $\pi_u$ of $\{1, \ldots, n\}$ with
\[
\langle x_{\pi_u(1)}, u\rangle < \langle x_{\pi_u(2)}, u\rangle < \ldots < \langle x_{\pi_u(n)}, u\rangle.
\]

We now characterize the extreme point of $CD^k$ in the direction $u$. Given $j \ge k$, the number of sets of $k$ points such that $\langle x_{\pi_u(j)}, u\rangle$ is greater than any other $\langle x_i, u\rangle$ in the set is $\binom{j}{k} - \binom{j-1}{k}$, where $\binom{k-1}{k} = 0$, or equivalently $\binom{j-1}{k-1}$. Consequently the point
\[
x(u) = \binom{n}{k}^{-1} \sum_{j=k}^{n} \binom{j-1}{k-1}\, x_{\pi_u(j)}
\]
is the extreme point of $CD^k$ such that $h(CD^k, u) = \langle x(u), u\rangle$. It can alternatively be written as
\[
x(u) = \frac{k}{n_{(k)}} \sum_{j=k}^{n} (j-1)_{(k-1)}\, x_{\pi_u(j)}, \qquad (1)
\]
where $n_{(k)} = \prod_{j=0}^{k-1}(n-j)$ for $n \ge k$; in particular $k_{(k)} = k!$.

Theorem 2. Let $x_1, \ldots, x_n \in \mathbb{R}^d$ be pairwise distinct; the set of extreme points of $CD^k$ is $\{x(u) : u \in S^{d-1} \setminus L\}$.

By considering all possible orderings of $\{x_1, \ldots, x_n\}$, we obtain
\[
CD^k = \mathrm{co}\left\{\frac{k}{n_{(k)}} \sum_{j=k}^{n} (j-1)_{(k-1)}\, x_{\pi(j)} : \pi \in \mathcal{P}_n\right\}, \qquad (2)
\]
where $\mathcal{P}_n$ is the set of permutations of $\{1, \ldots, n\}$. Unfortunately, formula (2) is not an efficient way to compute the extreme points of $CD^k$, since many of the points involved are not extreme. In $\mathbb{R}^2$, the number of directions involved in Theorem 2 is at most $n(n-1)$, and we take advantage of this to build an algorithm in the next section.

3 Algorithm

Given a bivariate data set, it is possible to consider all the orderings of its one-dimensional projections in an efficient way. This can be done through a circular sequence, see Chapter 2 in [Ede87].
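Before turning to the circular-sequence algorithm, formula (1) can be checked directly: sort the points by their projection on $u$ and weight them binomially, then compare $\langle x(u), u\rangle$ against the support function of $CD^k$ evaluated by brute force over all $k$-subsets. A Python sketch (function names are ours, not from the paper; $u$ is assumed to lie outside the exceptional set $L$, so the projections have no ties):

```python
from itertools import combinations
from math import comb

def extreme_point(points, k, u):
    """Extreme point x(u) of CD^k in direction u: sort the points by <x, u>
    ascending and weight the j-th point by binom(j-1, k-1) / binom(n, k)."""
    n, d = len(points), len(points[0])
    order = sorted(points, key=lambda p: sum(a * b for a, b in zip(p, u)))
    w = [comb(j - 1, k - 1) / comb(n, k) for j in range(1, n + 1)]
    return tuple(sum(w[j] * order[j][c] for j in range(n)) for c in range(d))

def support_cd_k(points, k, u):
    """h(CD^k, u): average of the maximal projection over all k-subsets."""
    vals = [max(sum(a * b for a, b in zip(p, u)) for p in s)
            for s in combinations(points, k)]
    return sum(vals) / comb(len(points), k)
```

For $k = 1$ all weights equal $1/n$ and `extreme_point` returns the mean, matching $CD^1 = \{\overline{x}\}$; for a generic direction, $\langle x(u), u\rangle$ agrees with `support_cd_k` up to floating-point error.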
Ruts and Rousseeuw [Rut96] and Dyckerhoff [Dyc00] have used circular sequences to build algorithms that compute the extreme points of the halfspace and zonoid trimmed regions of a bivariate data set. The algorithm we present here is based on that of Dyckerhoff [Dyc00], where we have modified Step 5 to obtain our family of central regions.

Step 1. Store the x-coordinates of the data points in an array X and the y-coordinates in an array Y, so that each point of the data cloud can be written as (X[i], Y[i]) for some i. These arrays must be ordered so that X[i] ≤ X[i+1] and, if X[i] = X[i+1], then Y[i] > Y[i+1]. Initialize the arrays ORD and RANK with the values from 1 to n. Initialize the arrays EXT1 and EXT2, which will store the extreme points.

Step 2. In an array ANGLE of length $\binom{n}{2}$, store the angle between the first coordinate axis and the normal vector of the line through the points (X[i], Y[i]) and (X[j], Y[j]), always taking angles in [0, π). Finally, sort this array in increasing order.

Step 3. Find all successive entries of the array ANGLE with the same value; these correspond to sets of collinear points.

Step 4. Reverse the order of each set of collinear points in the arrays ORD and RANK. If (X[i], Y[i]) and (X[j], Y[j]) are collinear, interchange the values of ORD[RANK[i]] and ORD[RANK[j]] and afterwards RANK[i] and RANK[j]. If (X[p], Y[p]), (X[q], Y[q]) and (X[r], Y[r]) with p < q < r are collinear, then ORD[RANK[p]], ORD[RANK[q]] and ORD[RANK[r]] are interchanged with ORD[RANK[r]], ORD[RANK[q]] and ORD[RANK[p]] respectively, and RANK[p], RANK[q] and RANK[r] with RANK[r], RANK[q] and RANK[p].

Step 5. If any RANK[i] ≥ k has been modified in Step 4, append
\[
\sum_{j=k}^{n} (j-1)_{(k-1)}\, \bigl(X[\mathrm{ORD}[j]],\, Y[\mathrm{ORD}[j]]\bigr) \qquad (3)
\]
at the end of the array EXT1. If any RANK[i] ≤ n − k + 1 has been modified in Step 4, append
\[
\sum_{j=k}^{n} (j-1)_{(k-1)}\, \bigl(X[\mathrm{ORD}[n+1-j]],\, Y[\mathrm{ORD}[n+1-j]]\bigr) \qquad (4)
\]
at the end of the array EXT2.
Notice that each new entry of EXT1 or EXT2 can be computed from the previous one and the entries of ORD modified in Step 4, so it is not necessary to perform the full summation of n − k + 1 terms.

Step 6. If there is any angle left in the array ANGLE, continue with Step 3.

Step 7. Multiply the entries of EXT1 and EXT2 by $k/n_{(k)}$ and concatenate both arrays to obtain the sequence of extreme points in counterclockwise order.

The sorting of the array ANGLE in Step 2 has complexity $O(n^2 \log n)$ and determines the overall complexity of the algorithm.

4 Example

We have applied the algorithm described in Section 3 to real data; the outcome is depicted in Figure 2. The data correspond to the results of 25 heptathletes in the 1988 Olympics: their times in the 100 meters hurdles race (in seconds, x-axis) and their shot put distances (in meters, y-axis), obtained from Hand et al. [HDL94]. Dyckerhoff [Dyc00] has plotted the zonoid trimmed regions for this same data set; we have chosen the heptathlon data so that the interested reader can compare the central regions presented here with the ones plotted in [Dyc00].

Fig. 2. Contour plots of the expected convex hull trimmed regions of the heptathlon data set with k ranging from 1 to 25.

5 Alternative trimmed regions

The main idea of this paper is to build central regions of a data set by averaging convex hulls of combinations of its points through Minkowski summation. The population counterpart of these central regions is discussed by Cascos [Cas06a]; it is defined in terms of the selection expectation of random sets, see Molchanov [Mol05].

The set $CD^k$ is built as a U-statistic, i.e., we consider all subsets of the data set with a fixed size. An alternative family of central regions is obtained if the points of the data set are selected at random with replacement, that is, for $k \ge 1$ we consider all combinations of $k$ integers $1 \le i_1, i_2, \ldots, i_k \le n$:
\[
\frac{1}{n^k} \sum_{i_1=1}^{n} \cdots \sum_{i_k=1}^{n} \mathrm{co}\{x_{i_1}, \ldots, x_{i_k}\}. \qquad (5)
\]
The extreme points are now
\[
x(u) = \frac{1}{n^k} \sum_{j=1}^{n} \bigl(j^k - (j-1)^k\bigr)\, x_{\pi_u(j)}. \qquad (6)
\]
If we replace $(j-1)_{(k-1)}$ with $j^k - (j-1)^k$ in equations (3) and (4) of Step 5, and $k/n_{(k)}$ with $n^{-k}$ in Step 7, the algorithm of Section 3 can be used to compute the extreme points of (5). These changes reflect the differences between the extreme points in equations (1) and (6).

6 Multivariate scatter estimates

The expected convex hull central regions provide information about the scatter of the data: the bigger the regions, the more scattered the data set. Consequently, the volume of the expected convex hull trimmed regions can be used to define multivariate scatter estimates.

Take k = 2; then the set $CD^2$ is a sum of segments, i.e., a zonotope. It is, in fact, a translate of the zonotope of a symmetrization of the empirical probability associated with the data set, see [Ms02]. Applications of zonotopes in data analysis have been discussed by Koshevoy and Mosler [KM97] and Mosler [Ms02]. Using the relation
\[
CD^2 = \frac{1}{n(n-1)} \sum_{i \ne j} \mathrm{co}\{x_i, x_j\} \qquad (7)
\]
and the formula for the volume of a zonotope, see Mosler [Ms02], we obtain the d-dimensional volume of $CD^2$,
\[
\mathrm{vol}_d\, CD^2 = \frac{1}{d!\, n^d (n-1)^d} \sum_{i_1=1}^{n} \sum_{j_1=1}^{n} \cdots \sum_{i_d=1}^{n} \sum_{j_d=1}^{n} \bigl|\det[x_{i_1} - x_{j_1}, \ldots, x_{i_d} - x_{j_d}]\bigr|,
\]
where $\det[x_{i_1} - x_{j_1}, \ldots, x_{i_d} - x_{j_d}]$ is the determinant of the matrix whose columns are $x_{i_1} - x_{j_1}, \ldots, x_{i_d} - x_{j_d}$.

A consequence of the expected convex hull trimmed regions of level 2 being zonotopes, see equation (7), is that they are always centrally symmetric. Nevertheless, we believe $CD^2$ has interest of its own as a descriptive statistic even when working with skewed data. Notice that, in such a case, the trimmed regions of higher levels will not be symmetric.
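In the plane ($d = 2$) the volume formula reduces to a quadruple sum of absolute $2 \times 2$ determinants. A short self-check in Python (the function name is ours; $O(n^4)$ as written, adequate for small $n$):

```python
def area_cd2(points):
    """Area of CD^2 for a bivariate data set via the zonotope volume
    formula: sum |det[x_{i1}-x_{j1}, x_{i2}-x_{j2}]| over all index
    quadruples, divided by 2! * n^2 * (n-1)^2."""
    n = len(points)
    total = 0.0
    for i1 in range(n):
        for j1 in range(n):
            for i2 in range(n):
                for j2 in range(n):
                    ax = points[i1][0] - points[j1][0]
                    ay = points[i1][1] - points[j1][1]
                    bx = points[i2][0] - points[j2][0]
                    by = points[i2][1] - points[j2][1]
                    total += abs(ax * by - ay * bx)
    return total / (2 * n ** 2 * (n - 1) ** 2)
```

For the three vertices of the unit right triangle, $CD^2$ is the Minkowski average of the three edges, a hexagonal zonotope of area $1/3$, smaller than the triangle's area $1/2$ in accordance with the nesting $CD^2 \subset CD^3$.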
7 Concluding remarks

Our effort in this paper has focused on the description of the expected convex hull regions and the development of an algorithm for efficiently finding their extreme points for a bivariate data set. However, the description of the set of extreme points in Theorem 2 holds in any dimension. It is thus possible to build an approximate algorithm for dimensions greater than 2 by considering only a random subset of all directions, as Dyckerhoff [Dyc00] briefly discusses.

Acknowledgements

The financial support of grant MTM2005-02254 of the Spanish Ministry of Science and Technology is acknowledged.

References

[Cas06a] Cascos, I.: Depth functions based on a number of observations of a random vector. In preparation (2006)
[CLD05] Cascos, I., López-Díaz, M.: Integral trimmed regions. J. Multivariate Anal., 96, 404–424 (2005)
[Dyc00] Dyckerhoff, R.: Computing zonoid trimmed regions of bivariate data sets. In: Bethlehem, J., van der Heijden, P. (eds) COMPSTAT 2000. Proceedings in Computational Statistics, 295–300. Physica-Verlag, Heidelberg (2000)
[Ede87] Edelsbrunner, H.: Algorithms in Combinatorial Geometry. Springer-Verlag, Berlin (1987)
[HDL94] Hand, D.J., Daly, F., Lunn, A.D., McConway, K.J., Ostrowski, E. (eds): A Handbook of Small Data Sets. Chapman and Hall, London (1994)
[KM97] Koshevoy, G., Mosler, K.: Zonoid trimming for multivariate distributions. Ann. Statist., 25, 1998–2017 (1997)
[LPS99] Liu, R.Y., Parelius, J.M., Singh, K.: Multivariate analysis by data depth: descriptive statistics, graphics and inference. Ann. Statist., 27, 783–858 (1999)
[Mol05] Molchanov, I.: Theory of Random Sets. Springer-Verlag, London (2005)
[Ms02] Mosler, K.: Multivariate Dispersion, Central Regions and Depth. The Lift Zonoid Approach. Lecture Notes in Statistics 165. Springer-Verlag, Berlin (2002)
[Rut96] Ruts, I., Rousseeuw, P.J.: Computing depth contours of bivariate point clouds. Comput. Stat. Data Anal., 23, 153–168 (1996)
[Tuk75] Tukey, J.W.: Mathematics and the picturing of data. In: Proceedings of the International Congress of Mathematicians, Vancouver, 2, 523–531 (1975)
[ZS00a] Zuo, Y., Serfling, R.: General notions of statistical depth function. Ann. Statist., 28, 461–482 (2000)
[ZS00b] Zuo, Y., Serfling, R.: Structural properties and convergence results for contours of sample statistical depth functions. Ann. Statist., 28, 483–499 (2000)