A tree structured classifier for symbolic class description

Suzanne Winsberg¹, Edwin Diday², and M. Mehdi Limam²
¹ Predisoft, San Pedro, Costa Rica, Suzanne.Winsberg@predisoft.com
² Universite de Paris Dauphine, Paris, France, diday@ceremade.dauphine.fr
Summary. Consider a class of statistical units from a population for which the
data table contains symbolic data, that is, the entries of the table are multivalued,
say an interval of values, as well as single-valued entries, for each variable describing
each statistical unit. Our aim is to describe this class, E, of statistical units by partitioning it. Each class of the partition is described by a conjunction of characteristic
properties, and the class to describe, E, is described by a disjunction of these conjunctions. We use a stepwise top-down binary tree method. At each step we select
the best variable and its optimal splitting to optimize simultaneously a homogeneity
criterion and a discrimination criterion given by a prior partition of the population,
thus combining unsupervised and supervised learning. We illustrate the method on
a real data set.
Key words: classification, clustering, discrimination, binary tree, decision tree,
symbolic data analysis, class description
1 Introduction
We want to describe a class, C, of statistical units in a population of units. So we must
find the properties that characterize the class, and a way to do that is to partition
the class. Classification methods are often designed to split a class of statistical
units, yielding a partition into L subclasses, or clusters, where each cluster may
then be described by a conjunction of properties, and the class, C, is described by a
disjunction of the L conjunctions. Most partitioning methods are one of two types:
clustering methods optimizing an intra-class homogeneity criterion, or decision trees
optimizing an inter-class discrimination criterion. We partition the class using a top-down binary divisive method. Such divisive methods are referred to as tree structured
classifiers. It is of prime importance that the subdivisions of C be homogeneous with
respect to the selected group of variables from the data base. If in addition, we need
to explain an external criterion, giving rise to an a priori partition of the population,
or some part of it encompassing C, we need to consider a discrimination criterion
based on that a priori partition as well. Our technique arrives at a description of
C by producing a partition of C into L subclasses or clusters satisfying both a
homogeneity criterion and a discrimination criterion with respect to some given a
priori partition. So unlike other classification methods which are either unsupervised
or supervised, this method has both unsupervised and supervised aspects.
Not only does our method combine supervised and unsupervised learning, itself
an innovation, but we are also able to treat symbolic data as well as the classical
single-valued data treated by classical algorithms such as CART ([BR84]) and ID3
([QU86]). Symbolic data are richer, potentially possessing more information than
classical single-valued data. Symbolic data may be encountered when dealing with
more complex, aggregated statistical units found when analyzing huge data sets. For
example it may be more interesting to deal with aggregated statistical units such as
teams rather than with the individual players on the team. Then the resulting data
set after aggregation will most likely contain symbolic data rather than classical data
values. By symbolic data we mean that rather than having a specific single value
for an observed variable, an observed value for an aggregated statistical unit may
be multivalued, e.g., an interval of values. For a more extensive treatment of symbolic
data analysis see Bock and Diday ( [BO00]).
Consider an example, with a symbolic data table relating to professional basketball teams in the United States. From the data table we choose a class to describe, C,
consisting of teams belonging to the NBA. The prior partition, that is, the discriminatory categorical variable, is winning teams, those that have won more than 70% of
their games, versus the rest. The variables available in the symbolic data table might
be interval variables such as the range of the height of the players on the team, the
range of the age of the players on the team, the range of the weight of the players
on the team, the range for the team of the total number of points scored by each
player on the team over a given season, the range for the team of the total number
of assists made by each player on the team over a given season, and the range for
the team of the average number of minutes of play for each player per game over a
given season. Note that since each statistical unit is a team with many players on it,
numerical variables such as height cover a range, that is they would be interval-type
variables not single-valued classical variables. If we use only a homogeneity criterion
we would get subclasses or clusters of teams, each of which would be homogeneous
with respect to the variables such as number of assists, number of points, height
etc., and these clusters would of course be well discriminated from one another with
respect to those same variables. Here, in addition they will be discriminated from
one another with respect to the prior partition, i.e., the winning teams versus the rest
of the teams.
Moreover, our aim here is to produce a description of a class C of units coming from
a population of units. Naturally, the description includes all the units of the class C
because we induce this description from all the units of this class. However, we want
to find a description such that the number of units not belonging to this class but included in this
description is minimized. So, to refer back to our example, if we want to describe the
teams in the NBA which are men’s professional basketball teams, we prefer that the
extent of this description not cover women’s professional basketball teams. Below,
we present a stopping rule that we have integrated into our method, which limits as
much as possible what we call the overflow of the description, that is, units not in C
but which are in the extent of the description.
This work is an extension of previous work by [VR02], and [LI03], which combine
supervised and unsupervised learning. Here we focus on interval type symbolic data
and we extend [LI03] by examining the prediction quality of the algorithm. We also
present a choice of stopping rules. Others have developed divisive algorithms for
specific data types encountered when dealing with symbolic data, considering either
a homogeneity criterion ( [CH97]) or a discrimination criterion ( [PE99]) based on
an a priori partition, but not both simultaneously.
Our method yields a description for each final cluster. In addition each final
cluster can be assigned with a given power to a class of the prior partition, so
we can induce rules for the whole population under study. Below we describe our
method: we define a cutting or split for interval data; we define a cutting point and
assignment rules for this type of data; we outline the approach used to combine the
two criteria, and a data driven approach to weight each criterion optimally; we also
present the stopping rules. We illustrate the algorithm with a real example. Our
method has also been extended to handle weighted categorical variables (histogram
symbolic data), classical data and any combination of the above. However in this
paper we focus on interval type symbolic data.
2 The Underlying Population and the Data
We consider a population Ω = {1, ..., K} with K units. Ω is partitioned into
S known disjoint classes G1, ..., Gs, ..., GS and also into T other known disjoint classes
C1, ..., Ct, ..., CT, which may be the same as or different from the partition
G1, ..., Gs, ..., GS. The class to describe is C ≡ Ct. Each unit k ∈ Ω of the population is described by three categories of variables:
- G : the a priori partition variable, which is a nominal variable defined on Ω
with S categories 1, ..., s, ..., S;
- yC : the class to describe variable, which is a nominal variable defined on Ω
with T categories 1, ...t, ..., T ;
- y1 , ..., yj , ..., yJ : J independent interval type variables.
G(k) is the index of the class of the unit k ∈ Ω, and G is a mapping defined on
Ω taking values in {1, ..., s, ..., S} which generates the a priori partition into S classes to
discriminate.
yC(k) is the index of the class of the unit k ∈ Ω, and yC is a mapping defined on
Ω taking values in {1, ..., t, ..., T} which generates the second partition into T classes
C1, ..., Ct, ..., CT. One of the above classes, Ct, will be chosen to be described. We
denote this class to describe as C.
To each unit k and to each independent, or explanatory, variable yj is associated
a symbolic description (with imprecision or variation). This description, denoted
by yj(k), is quantitative; that is, the description associated by yj with a unit k ∈ Ω
is an interval yj(k) ⊂ Yj, with Yj the set of possible values for yj. Naturally, the case
of a single-valued quantitative variable is a special case of this type of variable. For
example, if yj is the size of a unit k, we might have yj(k) = [18, 27]. Summarizing, each cell of
the symbolic data table contains an interval of R.
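To make this data layout concrete, the following minimal Python sketch shows one way such an interval-valued symbolic data table could be represented; the class and field names are hypothetical and not part of the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (y_jmin(k), y_jmax(k))

@dataclass
class SymbolicUnit:
    """One aggregated statistical unit k of the population Omega."""
    name: str                        # unit identifier
    g: int                           # index s of its a priori class G_s
    y_c: int                         # index t of the class C_t (class-to-describe variable)
    intervals: Dict[str, Interval]   # one interval y_j(k) per explanatory variable y_j

# A tiny illustrative table: two aggregated units described by two interval variables.
units = [
    SymbolicUnit("team_A", g=1, y_c=1, intervals={"height": (1.85, 2.10), "age": (22.0, 35.0)}),
    SymbolicUnit("team_B", g=2, y_c=1, intervals={"height": (1.80, 2.05), "age": (20.0, 31.0)}),
]
```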
3 The Method
Five inputs are required for our algorithm : 1) the data, consisting of K statistical
units, each described by J symbolic or classical variables; 2) the prior partition of
either some defined part of the population or the entire population; 3) the class,
C, the user aims to describe, consisting of KC statistical units coming from
the population of K statistical units; 4) a coefficient, α, which gives more or less
importance to the discriminatory power of the prior partition or to the homogeneity
of the description of the given class, C (alternatively, instead of specifying this last
coefficient, the user may choose to determine an optimum value of this coefficient,
using this algorithm); 5) the choice of a stopping rule, (see the section on the stopping
rule below).
The method uses a monothetic hierarchical descending approach working by
division of a node into two nodes. At each step l ( l nodes corresponding to a
partition into l classes), one of the nodes of the tree is cut into two nodes in order
to optimize a quality criterion Q for the constructed partition into l + 1 classes. The
division of a node N sends a proportion of units to the left node N1 and the other
proportion to the right node N2 . This division is done by a “cutting” (y, c), where y
is called the cutting variable and c the cutting point.
The algorithm always generates two kinds of output. The first is a graphical
representation, in which the class to describe, C, is represented by a binary tree.
The final leaves are the clusters constituting the class and each branch represents
a cutting (y, c). The second is a description: each final leaf is described by the
conjunction of the cutting points from the top of the tree to this final leaf. The
class, C, is then described by a disjunction of these conjunctions. The class, C, and
each final cluster may be reified into a concept which may be described by all of the
variables. If the user wishes to choose an optimal value of α using our data driven
method, a graphical representation enabling this choice is also generated as output.
Let H(N) and h(N1, N2) be, respectively, the homogeneity criterion of a node N
and of a couple of nodes (N1, N2). Then we define ∆H(N) = H(N) − h(N1, N2).
Similarly, we define ∆D(N) = D(N) − d(N1, N2) for the discrimination criterion. The
quality Q of a node N (respectively q of a couple of nodes (N1 ; N2 )) is the weighted
sum of the two criteria, namely Q(N ) = αH(N ) + βD(N ) (respectively q(N1 ; N2 ) =
αh(N1 ; N2 ) + βd(N1 ; N2 )) where α + β = 1. So the quality variation induced by the
splitting of N into (N1 ; N2 ) is ∆Q(N ) = Q(N ) − q(N1 ; N2 ). We maximize ∆Q(N ).
Note that since we are optimizing two criteria the criteria must be normalized. The
user can modulate the values of α and β so as to weight the importance that he
gives to each criterion.
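As a small illustration of this weighting, here is a hedged Python sketch of the quality variation ∆Q(N) computed from already-normalized criterion values; the function and argument names are ours, not the authors'.

```python
def quality_variation(H_N, h_split, D_N, d_split, alpha):
    """Delta Q(N) = alpha*(H(N) - h(N1,N2)) + beta*(D(N) - d(N1,N2)), with beta = 1 - alpha.

    All criterion values are assumed to be already normalized to [0, 1]."""
    beta = 1.0 - alpha
    delta_H = H_N - h_split   # gain in homogeneity obtained by splitting N into (N1, N2)
    delta_D = D_N - d_split   # gain in discrimination with respect to the prior partition
    return alpha * delta_H + beta * delta_D
```

The node and split maximizing this quantity are the ones actually cut, as described next.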
To determine the cutting value (y; c) and the node to split: first, for each node N
select the cutting variable and its cutting point minimizing q(N1 ; N2 ); second, select
and split the node N which maximizes the difference between the quality before the
cutting and the quality after the cutting, max ∆Q(N) = max[α∆H(N) + β∆D(N)].
We are working with interval variables, so we must define what constitutes a cutting
and what constitutes a cutting value or cutting point for this type of data. For an
interval variable yj , we define the cutting point c using the mean of the interval.
We order the means of the intervals for all units; a cutting point is then the mean
of two consecutive means of intervals. The mean of the interval is just one possible
choice on which to base this definition. We could have used the min or max of the
interval or both. In further extensions of the algorithm, we might consider other
possibilities.
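The following sketch illustrates this choice of cutting points for one interval variable, using interval means as described above; it is an illustrative implementation, not the authors' code.

```python
def candidate_cut_points(intervals):
    """Candidate cutting points for one interval variable y_j.

    Each unit's interval is summarized by its mean; a cutting point is the
    midpoint of two consecutive sorted means."""
    means = sorted((lo + hi) / 2.0 for lo, hi in intervals)
    return [(a + b) / 2.0 for a, b in zip(means, means[1:]) if a != b]

# Example: three units observed on one interval variable.
print(candidate_cut_points([(18, 27), (10, 20), (25, 40)]))  # [18.75, 27.5]
```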
Each cutting (yj, c) is associated with a binary question q of the form "Is yj ≤ c?" when
yj is an interval variable (for a weighted categorical variable the question takes the form
"Is the weight of (yj ∈ V) ≤ cV?"). Since yj is here an interval, that is, a multivalued
continuous variable, we must define a rule to assign to
N1 and N2 (respectively, the left and the right node of N) a unit whose observed value
is an interval, with respect to a cutting point c. Here we use a pure approach in which a
unit is assigned either to the left node N1 or to the right node N2 with certainty, that is,
with a probability of 1 or 0 for assignment to N1 and to N2. We define pwi, the probability
of assigning a unit wi with yj(wi) = [yjmin(wi), yjmax(wi)] to N1, by:

pwi = (c − yjmin(wi)) / (yjmax(wi) − yjmin(wi))   if c ∈ [yjmin(wi), yjmax(wi)],
pwi = 0   if c < yjmin(wi),
pwi = 1   if c > yjmax(wi).

We use the following rule to assign a unit to the left or right node:
wi belongs to N1 if pwi ≥ 1/2, else it belongs to N2.
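A minimal sketch of this assignment probability and the pure (1/2-threshold) rule, with hypothetical function names:

```python
def p_left(c, y_min, y_max):
    """Probability p_wi of assigning a unit with interval [y_min, y_max] to the left node N1 (y_j <= c)."""
    if c < y_min:
        return 0.0
    if c > y_max:
        return 1.0
    return (c - y_min) / (y_max - y_min)

def goes_left(c, y_min, y_max):
    """Pure assignment rule: the unit belongs to N1 iff p_wi >= 1/2."""
    return p_left(c, y_min, y_max) >= 0.5
```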
The probabilities calculated above are used to calculate the homogeneity and the
discrimination criteria. We shall now define the homogeneity criterion for interval
type data. The clustering or homogeneity criterion we use is an inertia criterion.
This criterion is used in Chavent ( [CH97]). The inertia of a node t which could be
N1 or N2 is

H(t) = Σ_{wi ∈ t} Σ_{wk ∈ t} (pi pk / (2µ)) ∆(wi, wk),

where pi is the weight of the individual wi (in our case pi = pwi(t)), µ = Σ_{wi ∈ t} pi is
the weight of the node t, and ∆ is a distance between individuals defined as
∆(wi, wk) = Σ_{j=1,...,J} δj²(wi, wk), with J the number of variables and

δj²(wi, wk) = [ ((yjmin(wi) + yjmax(wi))/2 − (yjmin(wk) + yjmax(wk))/2) / mj ]²,

where [yjmin(wi), yjmax(wi)] is the interval value of the variable yj for the unit wi
and mj = max_{wi} yjmax(wi) − min_{wi} yjmin(wi), which represents the maximum range of the variable
yj. We remark that δj² falls in the interval [0, 1]. Moreover, the homogeneity criterion
must be normalized to fall in the interval [0, 1].
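An illustrative Python rendering of this inertia criterion (before normalization), assuming each unit is given as a list of (min, max) intervals, one per variable; the names and data layout are our assumptions:

```python
def delta(unit_i, unit_k, ranges):
    """Delta(w_i, w_k): sum over variables of the squared normalized distance between interval midpoints.

    ranges[j] = max_i y_jmax - min_i y_jmin for variable j, computed over the whole data set."""
    total = 0.0
    for j, ((lo_i, hi_i), (lo_k, hi_k)) in enumerate(zip(unit_i, unit_k)):
        mid_i, mid_k = (lo_i + hi_i) / 2.0, (lo_k + hi_k) / 2.0
        total += ((mid_i - mid_k) / ranges[j]) ** 2
    return total

def inertia(node_units, weights, ranges):
    """H(t) = sum over pairs (w_i, w_k) in node t of p_i * p_k / (2*mu) * Delta(w_i, w_k)."""
    mu = sum(weights)  # weight of the node t
    return sum(weights[i] * weights[k] * delta(u_i, u_k, ranges) / (2.0 * mu)
               for i, u_i in enumerate(node_units)
               for k, u_k in enumerate(node_units))
```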
The discrimination criterion we choose is an impurity criterion, Gini’s index,
which we denote as D. This criterion was introduced by Breiman ( [BR84]) for
the CART algorithm and measures the impurity of a node (which could be N1 or
N2 ) with respect to the prior partition (G1 , G2 , ..., GS ). The Gini measure of node
impurity is a measure which reaches a value of zero when only one class is present
at a node (with priors estimated from class sizes, the Gini measure is computed as
the sum of products of all pairs of class proportions for classes present at the node;
it reaches its maximum value when class sizes at the node are equal). The Gini
index has an interesting interpretation. Instead of using the plurality rule to classify
objects in a node n, use the rule that assigns an object selected at random from
the node to class Gs with probability ps. The Gini index is also simple and quickly
computed. It is defined as

D(t) = Σ_{l ≠ i} pl pi = 1 − Σ_{i=1,...,S} pi²,

with pi = Pt(Gi), the proportion of the class Gi at the node t.
D(N) must be normalized to fall in the interval [0, 1].
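A small sketch of the Gini impurity of a node with respect to the prior partition, under the assumption that class proportions are estimated from class counts at the node:

```python
from collections import Counter

def gini(prior_labels):
    """D(t) = 1 - sum_i p_i^2, where p_i is the proportion of prior class G_i present at the node."""
    counts = Counter(prior_labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([1, 1, 2, 2]))  # 0.5 -> maximal impurity for two balanced classes
print(gini([1, 1, 1, 1]))  # 0.0 -> pure node
```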
To assign each object k with vector description (y1, ..., yJ) to one class Gi of the
prior partition, the following rule is applied:

G(k) = Gi ⟺ pk(Gi) > pk(Gj) for all j ≠ i.

Here the membership probability pk(Gi) is calculated as the weighted average
(over all terminal nodes) of the conditional probabilities Pt(Gi), with weights pk(t):

pk(Gi) = Σ_t Pt(Gi) × pk(t),

where the sum runs over the terminal nodes t. In other words, it consists of summing the
probabilities that an individual belongs to the class Gi over all terminal nodes t, these
probabilities being weighted by the probabilities that this object reaches the different
leaves. The set of individuals assigned to a terminal node t on the basis of the majority
rule is node(t) = {k | pk(t) > pk(s), ∀s ≠ t}. Therefore, we can calculate the size of the
class Gi inside a terminal node t: nt(Gi) = |{k | k ∈ node(t) and k ∈ Gi}|.
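As a sketch, the membership probability and the resulting assignment rule might look as follows; terminal-node probabilities are assumed to be given as dictionaries, and the names are illustrative:

```python
def membership_probability(p_k_t, P_t_G):
    """p_k(G_i) = sum over terminal nodes t of P_t(G_i) * p_k(t).

    p_k_t : {t: probability that unit k reaches terminal node t}
    P_t_G : {t: {G_i: conditional probability of prior class G_i at node t}}"""
    classes = {g for probs in P_t_G.values() for g in probs}
    return {g: sum(P_t_G[t].get(g, 0.0) * p for t, p in p_k_t.items()) for g in classes}

def assign_prior_class(p_k_t, P_t_G):
    """Assign unit k to the prior class G_i with the largest membership probability."""
    pk = membership_probability(p_k_t, P_t_G)
    return max(pk, key=pk.get)
```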
Our aim here is to produce a description of a class C of units coming from a
population of K units. Naturally, the description includes all the units of the class
C because we induce this description from all the units of this class. However, the number
of units not belonging to this class but included in this description should be minimized. For
example, suppose the class C to describe consists of the districts of Great Britain with
more than 500,000 inhabitants and the population is all the districts of
Great Britain. Then the description should include as few districts with fewer than
500,000 inhabitants as possible. So it is of interest to consider the overflow
of the class to describe when making a stopping rule which stops the division of a node
N into two nodes N1 and N2.
First, we define the overflow rate of a node N:

OR(N) = n(C̄N) / nN,

with n(C̄N) the number of units belonging to the complement C̄ of the class C
which verify the current description of the node N, and nN the total number of
units verifying the description of the node N.
The overflow rate OR of a couple of nodes (N1, N2) is then

OR(N1, N2) = (n(C̄N1) + n(C̄N2)) / (nN1 + nN2).
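The overflow rates can be computed directly from counts, as in this brief sketch with hypothetical names:

```python
def overflow_rate(n_complement, n_total):
    """OR(N): share of units satisfying the description of node N that do not belong to C."""
    return n_complement / n_total if n_total else 0.0

def overflow_rate_split(n_comp_left, n_left, n_comp_right, n_right):
    """OR(N1, N2) for the two child nodes of N."""
    return (n_comp_left + n_comp_right) / (n_left + n_right)
```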
The variation of the overflow rate of a node N due to its being split, ∆OR(N) =
OR(N) − OR(N1, N2), might be a useful indicator to stop splitting this node. A small
variation shows that there is not much improvement in the overflow of the class to
describe when the node is split.
A node N is considered as terminal (a leaf) if: the variation of its overflow rate
∆OR(N) is less than a threshold fixed by the user or a default value, say
10%; or the difference between the discrimination before the cutting and the discrimination
after the cutting, ∆D(N), is less than a threshold fixed by the user or a default value,
say 10%; or its size is not large enough (fewer than 2 units); or the split would create an
empty son node. We stop the division when all the nodes are terminal, or alternatively
when we reach a number of divisions set by the user, in order to provide a quick and
simple description.
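Putting the stopping conditions together, a terminal-node test might be sketched as follows; the 10% thresholds are the default values mentioned above, and the argument names are ours:

```python
def is_terminal(delta_or, delta_d, node_size, n_left, n_right,
                or_threshold=0.10, d_threshold=0.10):
    """Return True when a node should no longer be split."""
    return (delta_or < or_threshold       # overflow rate barely improves
            or delta_d < d_threshold      # discrimination barely improves
            or node_size < 2              # node too small to split
            or min(n_left, n_right) < 1)  # best split would create an empty son node
```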
The user may select one of two ways to choose the value of α. One way is for the
user to fix the value of α. The motivation to select this option is that the user may
have some particular aim motivating the choice of α, that is the relative weights
he/she wishes to place on the homogeneity and discrimination criteria. Then he/she
may choose α as any value between 0 and 1.
The second way to choose the value of α is to allow the algorithm to optimize the
value of the coefficient α, using a data driven approach. This data driven approach
allows the user to examine the effect of varying α on both the inertia and the misclassification rate by examining a given graph. To implement this data driven approach,
one must fix the threshold for the overflow rate. The influence of the coefficient α can
be determinant both in the construction of the tree and in its prediction qualities.
The variation of α (or of β since α + β = 1) from 0 to 1 increases the importance
of the homogeneity and decreases the importance of discrimination. This variation
influences splitting and consequently results in different terminal nodes. We need to
find the inertia of the terminal nodes and the rate of misclassification as we vary
α. Then we can determine the value of α which optimizes both the inertia and
the rate of misclassification i.e. the homogeneity and discrimination simultaneously.
Note that if the user fixes α = 0, considering only the discrimination criterion, and
in addition the data are classical, the algorithm functions just as CART. So CART
is a special case of this algorithm.
For each terminal node l of the set L of terminal nodes of the tree, associated with the
majority class Gs, we can calculate the corresponding misclassification rate

R(s/l) = Σ_{r ≠ s} P(r/l), with P(r/l) = nr(l)/nl,

where nr(l) is the number of units of the node l belonging to the class Gr and nl is the
number of units of the node l. R(s/l) represents the proportion of units allocated to the
class Gs but belonging to the other classes.
The misclassification rate MR of the tree L is the weighted sum over all terminal nodes,
i.e.,

MR(L) = Σ_{l ∈ L} (nl/n) R(s/l) = Σ_{l ∈ L} Σ_{r ≠ s} nr(l)/n,

where n is the total number of units. We use MR
to measure the discrimination because this rate is simple to interpret. For each
terminal node of the tree L we can calculate the corresponding inertia, H(l), and we
can calculate the total inertia by summing over all the terminal nodes. So, the total
inertia of L is I(L) = Σ_{l ∈ L} H(l).
For each value of α between 0 and 1, we build a tree from the population under
consideration and calculate our two parameters (inertia and misclassification rate).
Varying α from 0 to 1 (say with a stepsize of 0.1) gives us 11 couples of values
of inertia and misclassification rate. In order to visualize the variation of the two
parameters, we display a curve showing the inertia and a curve showing the misclassification rate as a function of α. The best value of α is the one which minimizes the
normalized sum of the two parameters, I_Norm + MR_Norm.
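A hedged sketch of this data driven choice: rebuild the tree for each α on a grid, record the total inertia and misclassification rate, normalize both, and keep the α minimizing their sum. The function names and the min-max normalization are our assumptions, not the authors' exact procedure.

```python
def best_alpha(results):
    """results: list of (alpha, total_inertia, misclassification_rate) triples,
    one per tree built for alpha = 0.0, 0.1, ..., 1.0."""
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    inertias = normalize([i for _, i, _ in results])
    mrs = normalize([m for _, _, m in results])
    scores = [i + m for i, m in zip(inertias, mrs)]
    return results[scores.index(min(scores))][0]

# Hypothetical usage with three alpha values (illustrative numbers only):
print(best_alpha([(0.0, 0.111, 0.00), (0.5, 0.040, 0.00), (1.0, 0.038, 0.07)]))  # -> 0.5
```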
4 Example
Our example deals with real symbolic data with a population of 16 units. Each
unit represents a musical instrument; the original data base consists of 1323 sounds
from 16 different musical instruments. The 16 instruments are: the guitar, harp,
violin, viola, cello, double bass, flute, oboe, clarinet, bassoon, soprano saxophone,
horn, trombone, trumpet, tuba, and accordion. The class to describe, C, consists of the
instruments with sustained sounds; this class includes all of the instruments except
the guitar and the harp, which have unsustained sounds. We thus have 14 units in
the class to describe. Because we have aggregated the data for the sounds for each
musical instrument we are not dealing with classical data with a single value for each
variable for each statistical unit, here the musical instrument. Each unit is described
by J = 54 explanatory variables, all interval variables. The variables represent all the
interesting features of sound: some of them are temporal, such as the log attack time;
some relate to energy, such as the total energy in the signal and the signal-to-noise
ratio; some are perceptual such as the perceptual loudness; some relate to the entire
spectrum of the sound such as the spectral center of gravity, the spectral spread, the
spectral kurtosis, and the spectral slope; some relate to just the harmonic part of
the spectrum such as the harmonic spectral center of gravity, the harmonic spectral
spread, the harmonic kurtosis. Some of the variables are the mean value of these
indices; others are their standard deviations over the time window. These variables
have been shown to be the most relevant in describing sounds (see [PR99]). The
data set obtained from the above study contained 100 variables but many of them
were the same variable measured on a different scale such as the logarithmic scale.
To make the data set manageable we excluded variables which differed only on the
scale used to calculate their values (e.g., many variables were calculated on some or
all of the linear, log, and power scales); 54 variables were retained. Each variable covers
an interval of values for each unit as each unit represents the aggregate of many
sounds. The a priori partition we wish to explain is the blown instruments, i.e., those
using air to produce the sound, versus the non-blown instruments. In our class to
describe, the non-blown instruments are the strings, i.e., the violin, viola, cello and
double bass. We want to explain the factors which discriminate the blown from the
non-blown instruments in the class of sustained instruments. But we also wish to
have good descriptors of the resultant clusters due to their homogeneity. The class
to describe, C, contains the sustained instruments and we do not want the extent
of our resultant descriptions to cover the non-sustained instruments, i.e., the guitar
and the harp. We stopped the division when we reached 8 terminal nodes. We show
the results for three values of α: α = 1, α = 0, and α optimized with the data driven
method, α = 0.6. The results are shown in Table 1; OR(T) is the overflow rate over
all terminal nodes.
Table 1. Results for the musical example

α     I(T)    MR(T)%   OR(T)%
1     0.038   7        7
0     0.111   0        0
0.6   0.038   0        0
From Table 1 we see that using the data driven value of α = 0.6 we obtain
homogeneity that matches that when we use only a homogeneity criterion and a
misclassification rate of 0 which is what we obtain when we only use a discrimination
criterion. Moreover our overflow rate is 0 so we do not describe the unsustained
instruments. In this example we have unusually good results enabling us to attain for
practical purposes both optimum homogeneity and discrimination simultaneously.
The tree we obtain for the data driven value of α = 0.6 is shown in Figure 1
below. Let us consider the eight terminal nodes of the tree, which are respectively:
node 14 with 1 unit, the soprano saxophone; node 13 with 2 units, the flute and the clarinet; node
11 with 1 unit, the accordion; node 8 with 2 units, the bassoon and the trombone;
node 7 with 2 units, the tuba and the horn; node 6 with 2 units, the violin and the
viola; and node 5 with 2 units, the cello and the double bass. So the strings are well
separated from the blown instruments and each node is very homogeneous containing
very similar sounding instruments. We remind the reader that the homogeneity is
measured on all the input variables.
Each node can be described by the conjunction of cuts. In the tree the variables
used for the cuts are denoted by Champ j, which means field j, and corresponds to
variable j − 1, (Champs 1 is the unit identifier, here the instrument name). Notice
that we describe the class with only 5 descriptors, namely, variable 1 (Champs2),
the mean value of the kurtosis of the signal spectrum, variable 39 (Champs40), the
temporal variation measured on Bark’s perceptual scale, variable 78 (Champs79),
the odds-evens ratio of the harmonics of the spectrum, variable 90 (Champs91),
the standard deviation of the kurtosis of the spectrum filtered in Bark bands, and
variable 93 (Champs94) a measure of the perceptual spectral temporal fluctuation.
Using a classical algorithm with unaggregated data, that is, the individual sounds,
Peeters ([PR03]) needed many more descriptors and did not obtain as good a
misclassification rate. Here we have achieved a much more parsimonious description
with better homogeneity and misclassification rate. Note also that it is variable 39, the
temporal variation measured on Bark's perceptual scale, which separates the blown
from the non-blown (the strings) instruments in the class. This variable has been
demonstrated to be one of three dimensions recovered in MDS studies of musical
timbre (see [MC95] and [CA05]).
Each node can be described by a conjunction of the cuts to the top of the tree.
For example node 6 containing the violin and the viola can be described as follows:
[Var39 ∈ [0.012, 0.022]] ∧ [Var90 ∈ [0.860, 1.33]]. The class C consisting of the 14
sustained instruments can be described by a disjunction of these eight conjunctions
obtained for each of the eight terminal nodes.
To examine the prediction capability of this method we considered as the population the sustained instruments listed above. Using the uniform distribution from 0 to
1, we randomly selected 25% of the blown and 25% of the non-blown instruments to
be in the test group, and the rest of the 14 sustained instruments were assigned to the
learn group. The cello (non-blown), the trumpet (blown), and the tuba (blown)
were thus randomly chosen to be in the test group. The other 11 instruments were
in the learn group. We remark here that we have an unusually high ratio for the
number of objects in the test group to the number of objects in the learn group. We
performed the algorithm on the learn group and obtained the rule to discriminate
the blown from the non blown instruments. We obtained a misclassification rate of
0% on the learn group, but most importantly we also obtained a misclassification
rate of 0% on the test group, thus demonstrating the excellent predictive capability
of the algorithm in this example. In the future, methods specially adapted to investigate the robustness and predictive quality of this method, dealing with symbolic
data, (as distinguished from classical data), need to be developed. For example, in
the symbolic data analysis framework, one is often dealing with a relatively small
finite set of aggregated statistical units, (say the units are the countries in the European union). In such cases it might be more interesting to look at the stability and
robustness of the aggregating process. In other cases, it would be useful to look into
the stability of the results by bootstrapping and bagging the aggregated units.

Fig. 1. Tree for musical instruments. (The binary tree obtained for α = 0.6; each internal node is labeled with its cutting variable, Champ j, and cutting interval, and each leaf shows the proportions of non-blown (nb) and blown (b) instruments it contains.)
5 Conclusion
In this paper we present a new approach to obtaining a description of a class. Our approach is based on a divisive top-down tree method, restricted to recursive binary
partitions, until a suitable stopping rule prevents further divisions. This method is
applicable to most types of data, that is, classical numerical and categorical data,
symbolic data, including interval type data and histogram type data, and any mixture of these types of data. In this paper we have focused on, and applied the method
to, interval type data. The idea is to combine a homogeneity criterion and a discrimination criterion to describe a class and explain an a priori partition. The class
to describe can be a class from a prior partition, the whole population or any class
from the population. Having chosen this class, the interest of the method is that
the user can choose the weights α and thus β = 1 − α he/she wants to put on the
homogeneity and discrimination criteria respectively, depending on the importance
of these criteria to reach a desired goal. Alternatively, the user can optimize both
criteria simultaneously, choosing a data driven value of α. A data driven choice can
yield an almost optimal discrimination, while improving homogeneity, leading to
improved class description. The user may also select to use a stopping rule yielding
a description which overflows the class as little as possible and which is as pure as
possible. We also show that our method has excellent prediction capability for the
musical instrument example. Further tests of the algorithm's predictive capabilities
will be conducted.
References
[BO00] Bock, H.-H., Diday, E.: Analysis of Symbolic Data. Springer, Berlin (2000)
[BR84] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression
Trees. Wadsworth, CA (1984)
[CA05] Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic correlates of
timbre space: A confirmatory study using synthetic tones. J. Acoust. Soc.
Am., 116, 471–482 (2005)
[CH97] Chavent, M.: Analyse de données symboliques, une méthode divisive de
classification. Thèse de Doctorat, Université Paris IX Dauphine (1997)
[LI03] Limam, M., Diday, E., Winsberg, S.: Symbolic class description with interval
data. Journal of Symbolic Data Analysis, 1 (2003)
[MC95] McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research, 58, 177–192 (1995)
[PR99] Peeters, G., Herrera, P.: M4884 Cuidad generic audio description scheme.
In: Proceedings ISO/IEC JTC1/SC29/WG11 MPEG99 (1999)
[PR03] Peeters, G.: Automatic classification of large musical instrument databases
using hierarchical clustering with inertia ratio maximization. In: Proceedings
of AES 115 convention (2003)
[PE99] Périnel, E.: Construire un arbre de discrimination binaire à partir de données
imprécises. Revue de Statistique Appliquée., 47, 5–30 (1999)
[QU86] Quinlan, J.R.: Induction of decision trees. Machine Learning., 1, 81–106
(1986)
[VR02] Vrac, M., Limam, M., Diday, E., Winsberg, S.: Symbolic class description.
In: Jajuga, K., Sokolowski, A., Bock, H.-H. (eds) Classification,
Clustering, and Data Analysis. Springer, 329–337 (2002)