A tree structured classifier for symbolic class description

Suzanne Winsberg¹, Edwin Diday², and M. Mehdi Limam²
¹ Predisoft, San Pedro, Costa Rica, Suzanne.Winsberg@predisoft.com
² Universite de Paris Dauphine, Paris, France, diday@ceremade.dauphine.fr
Summary. Consider a class of statistical units from a population for which the
data table contains symbolic data, that is, the entries of the table are multivalued,
say an interval of values, as well as single-valued entries, for each variable describing
each statistical unit. Our aim is to describe this class, E, of statistical units by partitioning it. Each class of the partition is described by a conjunction of characteristic
properties, and the class to describe, E, is described by a disjunction of these conjunctions. We use a stepwise top-down binary tree method. At each step we select
the best variable and its optimal splitting to optimize simultaneously a homogeneity
criterion and a discrimination criterion given by a prior partition of the population,
thus combining unsupervised and supervised learning. We illustrate the method on
a real data set.
Key words: classification, clustering, discrimination, binary tree, decision tree,
symbolic data analysis, class description
1 Introduction
We want to describe a class, C, of statistical units in a population of units. So we must
find the properties that characterize the class, and a way to do that is to partition
the class. Classification methods are often designed to split a class of statistical
units, yielding a partition into L subclasses, or clusters, where each cluster may
then be described by a conjunction of properties, and the class, C, is described by a
disjunction of the L conjunctions. Most partitioning methods are one of two types:
clustering methods optimizing an intra-class homogeneity criterion, or decision trees
optimizing an inter-class discrimination criterion. We partition the class using a top-down binary divisive method. Such divisive methods are referred to as tree structured
classifiers. It is of prime importance that the subdivisions of C be homogeneous with
respect to the selected group of variables from the data base. If in addition, we need
to explain an external criterion, giving rise to an a priori partition of the population,
or some part of it encompassing C, we need to consider a discrimination criterion
based on that a priori partition as well. Our technique arrives at a description of
C by producing a partition of C into L subclasses or clusters satisfying both a
homogeneity criterion and a discrimination criterion with respect to some given a
priori partition. So unlike other classification methods which are either unsupervised
or supervised, this method has both unsupervised and supervised aspects.
Not only does our method combine supervised and unsupervised learning, itself
an innovation, but we are also able to treat symbolic data as well as the classical
single-valued data treated by classical algorithms such as CART ([BR84]) and ID3
([QU86]). Symbolic data are richer, potentially possessing more information than
classical single-valued data. Symbolic data may be encountered when dealing with
more complex, aggregated statistical units found when analyzing huge data sets. For
example it may be more interesting to deal with aggregated statistical units such as
teams rather than with the individual players on the team. Then the resulting data
set after aggregation will most likely contain symbolic data rather than classical data
values. By symbolic data we mean that rather than having a specific single value
for an observed variable, an observed value for an aggregated statistical unit may
be multivalued, e.g., an interval of values. For a more extensive treatment of symbolic
data analysis see Bock and Diday ( [BO00]).
Consider an example, with a symbolic data table relating to professional basketball teams in the United States. From the data table we choose a class to describe, C,
consisting of teams belonging to the NBA. The prior partition, that is, the discriminatory categorical variable, is winning teams, those that have won more than 70% of
their games, versus the rest. The variables available in the symbolic data table might
be interval variables such as the range of the height of the players on the team, the
range of the age of the players on the team, the range of the weight of the players
on the team, the range for the team of the total number of points scored by each
player on the team over a given season, the range for the team of the total number
of assists made by each player on the team over a given season, and the range for
the team of the average number of minutes of play for each player per game over a
given season. Note that since each statistical unit is a team with many players on it,
numerical variables such as height cover a range, that is they would be interval-type
variables not single-valued classical variables. If we use only a homogeneity criterion
we would get subclasses or clusters of teams, each of which would be homogeneous
with respect to the variables such as number of assists, number of points, height
etc., and these clusters would of course be well discriminated from one another with
respect to those same variables. Here, in addition they will be discriminated from
one another with respect to the prior partition, i.e., the winning teams versus the rest
of the teams.
Moreover, our aim here is to produce a description of a class C of units coming from
a population of units. Naturally, the description includes all the units of the class C
because we induce this description from all the units of this class. However, we want
to find a description such that the number of units not belonging to this class but included in this
description is minimized. So, to refer back to our example, if we want to describe the
teams in the NBA which are men’s professional basketball teams, we prefer that the
extent of this description not cover women’s professional basketball teams. Below,
we present a stopping rule that we have integrated into our method, which limits as
much as possible what we call the overflow of the description, that is, units not in C
but which are in the extent of the description.
This work is an extension of previous work by [VR02], and [LI03], which combine
supervised and unsupervised learning. Here we focus on interval type symbolic data
and we extend [LI03] by examining the prediction quality of the algorithm. We also
present a choice of stopping rules. Others have developed divisive algorithms for
specific data types encountered when dealing with symbolic data, considering either
a homogeneity criterion ( [CH97]) or a discrimination criterion ( [PE99]) based on
an a priori partition, but not both simultaneously.
Our method yields a description for each final cluster. In addition each final
cluster can be assigned with a given power to a class of the prior partition, so
we can induce rules for the whole population under study. Below we describe our
method: we define a cutting or split for interval data; we define a cutting point and
assignment rules for this type of data; we outline the approach used to combine the
two criteria, and a data driven approach to weight each criterion optimally; we also
present the stopping rules. We illustrate the algorithm with a real example. Our
method has also been extended to handle weighted categorical variables (histogram
symbolic data), classical data and any combination of the above. However in this
paper we focus on interval type symbolic data.
2 The Underlying Population and the Data
We consider a population Ω = {1, ..., K} with K units. Ω is partitioned into
S known disjoint classes G1, ..., Gs, ..., GS and also into T other known disjoint classes
C1, ..., Ct, ..., CT, which may be the same as or different from the partition
G1, ..., Gs, ..., GS. The class to describe is C ≡ Ct. Each unit k ∈ Ω of the population is described by three categories of variables:
- G : the a priori partition variable, which is a nominal variable defined on Ω
with S categories 1, ..., s, ..., S;
- yC : the class to describe variable, which is a nominal variable defined on Ω
with T categories 1, ...t, ..., T ;
- y1 , ..., yj , ..., yJ : J independent interval type variables.
G(k) is the index of the class of the unit k ∈ Ω, and G is a mapping defined on
Ω taking values in {1, ..., s, ..., S} which generates the a priori partition into S classes to
discriminate.
yC(k) is the index of the class of the unit k ∈ Ω, and yC is a mapping defined on
Ω taking values in {1, ..., t, ..., T} which generates the second partition into T classes
C1, ..., Ct, ..., CT. One of the above classes, Ct, will be chosen to be described. We
denote this class to describe as C.
To each unit k and to each independent, or explanatory, variable yj is associated
a symbolic description (with imprecision or variation). This description, denoted
by yj(k), is quantitative; that is, the description associated by yj with a unit k ∈ Ω
is an interval yj(k) ⊂ Yj, with Yj the set of possible values for yj. Naturally, the case
of a single-valued quantitative variable is a special case of this type of variable. For
example, if yj is the size of a unit k, we might have yj(k) = [18, 27]. Summarizing, each cell of
the symbolic data table contains an interval of R.
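To make this data layout concrete, the following minimal Python sketch shows one way such an interval-valued symbolic data table could be represented; the class and field names are hypothetical and not part of the authors' implementation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Interval = Tuple[float, float]  # (y_jmin(k), y_jmax(k))

@dataclass
class SymbolicUnit:
    """One aggregated statistical unit k of the population Omega."""
    name: str                        # unit identifier
    g: int                           # index s of its a priori class G_s
    y_c: int                         # index t of the class C_t (class-to-describe variable)
    intervals: Dict[str, Interval]   # one interval y_j(k) per explanatory variable y_j

# A tiny illustrative table: two aggregated units described by two interval variables.
units = [
    SymbolicUnit("team_A", g=1, y_c=1, intervals={"height": (1.85, 2.10), "age": (22.0, 35.0)}),
    SymbolicUnit("team_B", g=2, y_c=1, intervals={"height": (1.80, 2.05), "age": (20.0, 31.0)}),
]
```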
3 The Method
Five inputs are required for our algorithm : 1) the data, consisting of K statistical
units, each described by J symbolic or classical variables; 2) the prior partition of
either some defined part of the population or the entire population; 3) the class,
C, the user aims to describe, consisting of KC statistical units coming from
the population of K statistical units; 4) a coefficient, α, which gives more or less
importance to the discriminatory power of the prior partition or to the homogeneity
of the description of the given class, C (alternatively, instead of specifying this last
coefficient, the user may choose to determine an optimum value of this coefficient,
using this algorithm); 5) the choice of a stopping rule, (see the section on the stopping
rule below).
The method uses a monothetic hierarchical descending approach working by
division of a node into two nodes. At each step l ( l nodes corresponding to a
partition into l classes), one of the nodes of the tree is cut into two nodes in order
to optimize a quality criterion Q for the constructed partition into l + 1 classes. The
division of a node N sends a proportion of units to the left node N1 and the other
proportion to the right node N2 . This division is done by a “cutting” (y, c), where y
is called the cutting variable and c the cutting point.
The algorithm always generates two kinds of output. The first is a graphical
representation, in which the class to describe, C, is represented by a binary tree.
The final leaves are the clusters constituting the class and each branch represents
a cutting (y, c). The second is a description: each final leaf is described by the
conjunction of the cutting points from the top of the tree to this final leaf. The
class, C, is then described by a disjunction of these conjunctions. The class, C, and
each final cluster may be reified into a concept which may be described by all of the
variables. If the user wishes to choose an optimal value of α using our data driven
method, a graphical representation enabling this choice is also generated as output.
Let H(N) and h(N1, N2) be, respectively, the homogeneity criterion of a node N
and of a couple of nodes (N1, N2). Then we define ∆H(N) = H(N) − h(N1, N2).
Similarly, we define ∆D(N) = D(N) − d(N1, N2) for the discrimination criterion. The
quality Q of a node N (respectively q of a couple of nodes (N1 ; N2 )) is the weighted
sum of the two criteria, namely Q(N ) = αH(N ) + βD(N ) (respectively q(N1 ; N2 ) =
αh(N1 ; N2 ) + βd(N1 ; N2 )) where α + β = 1. So the quality variation induced by the
splitting of N into (N1 ; N2 ) is ∆Q(N ) = Q(N ) − q(N1 ; N2 ). We maximize ∆Q(N ).
Note that since we are optimizing two criteria the criteria must be normalized. The
user can modulate the values of α and β so as to weight the importance that he
gives to each criterion.
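As a small illustration of this weighting, here is a hedged Python sketch of the quality variation ∆Q(N) computed from already-normalized criterion values; the function and argument names are ours, not the authors'.

```python
def quality_variation(H_N, h_split, D_N, d_split, alpha):
    """Delta Q(N) = alpha*(H(N) - h(N1,N2)) + beta*(D(N) - d(N1,N2)), with beta = 1 - alpha.

    All criterion values are assumed to be already normalized to [0, 1]."""
    beta = 1.0 - alpha
    delta_H = H_N - h_split   # gain in homogeneity obtained by splitting N into (N1, N2)
    delta_D = D_N - d_split   # gain in discrimination with respect to the prior partition
    return alpha * delta_H + beta * delta_D
```

The node and split maximizing this quantity are the ones actually cut, as described next.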
To determine the cutting value (y; c) and the node to split: first, for each node N
select the cutting variable and its cutting point minimizing q(N1 ; N2 ); second, select
and split the node N which maximizes the difference between the quality before the
cutting and the quality after the cutting, max ∆Q(N) = max[α∆H(N) + β∆D(N)].
We are working with interval variables, so we must define what constitutes a cutting
and what constitutes a cutting value or cutting point for this type of data. For an
interval variable yj , we define the cutting point c using the mean of the interval.
We order the means of the intervals for all units; a cutting point is then the mean
of two consecutive means of intervals. The mean of the interval is just one possible
choice on which to base this definition. We could have used the min or max of the
interval or both. In further extensions of the algorithm, we might consider other
possibilities.
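The following sketch illustrates this choice of cutting points for one interval variable, using interval means as described above; it is an illustrative implementation, not the authors' code.

```python
def candidate_cut_points(intervals):
    """Candidate cutting points for one interval variable y_j.

    Each unit's interval is summarized by its mean; a cutting point is the
    midpoint of two consecutive sorted means."""
    means = sorted((lo + hi) / 2.0 for lo, hi in intervals)
    return [(a + b) / 2.0 for a, b in zip(means, means[1:]) if a != b]

# Example: three units observed on one interval variable.
print(candidate_cut_points([(18, 27), (10, 20), (25, 40)]))  # [18.75, 27.5]
```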
Each cutting (yj, c) is associated with a binary question q of the form "Is yj ≤ c?" when
yj is an interval variable (for a weighted categorical variable the question takes the form
"Is the weight of (yj ∈ V) ≤ cV?"). Since yj is here an interval, that is, a multivalued
continuous variable, we must define a rule to assign to
N1 and N2 (respectively, the left and the right node of N) a unit whose observed value
is an interval, with respect to a cutting point c. Here we use a pure approach in which a
unit is assigned either to the left node N1 or to the right node N2 with certainty, that is,
with a probability of 1 or 0 for assignment to N1 and to N2. We define pwi, the probability
of assigning a unit wi with yj(wi) = [yjmin(wi), yjmax(wi)] to N1, by:

pwi = (c − yjmin(wi)) / (yjmax(wi) − yjmin(wi))   if c ∈ [yjmin(wi), yjmax(wi)],
pwi = 0   if c < yjmin(wi),
pwi = 1   if c > yjmax(wi).

We use the following rule to assign a unit to the left or right node:
wi belongs to N1 if pwi ≥ 1/2, else it belongs to N2.
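A minimal sketch of this assignment probability and the pure (1/2-threshold) rule, with hypothetical function names:

```python
def p_left(c, y_min, y_max):
    """Probability p_wi of assigning a unit with interval [y_min, y_max] to the left node N1 (y_j <= c)."""
    if c < y_min:
        return 0.0
    if c > y_max:
        return 1.0
    return (c - y_min) / (y_max - y_min)

def goes_left(c, y_min, y_max):
    """Pure assignment rule: the unit belongs to N1 iff p_wi >= 1/2."""
    return p_left(c, y_min, y_max) >= 0.5
```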
The probabilities calculated above are used to calculate the homogeneity and the
discrimination criteria. We shall now define the homogeneity criterion for interval
type data. The clustering or homogeneity criterion we use is an inertia criterion.
This criterion is used in Chavent ( [CH97]). The inertia of a node t which could be
N1 or N2 is

H(t) = Σ_{wi ∈ t} Σ_{wk ∈ t} (pi pk / (2µ)) ∆(wi, wk),

where pi is the weight of the individual wi (in our case pi = pwi(t)), µ = Σ_{wi ∈ t} pi is
the weight of the node t, and ∆ is a distance between individuals defined as
∆(wi, wk) = Σ_{j=1,...,J} δj²(wi, wk), with J the number of variables and

δj²(wi, wk) = [ ((yjmin(wi) + yjmax(wi))/2 − (yjmin(wk) + yjmax(wk))/2) / mj ]²,

where [yjmin(wi), yjmax(wi)] is the interval value of the variable yj for the unit wi
and mj = max_{wi} yjmax(wi) − min_{wi} yjmin(wi), which represents the maximum range of the variable
yj. We remark that δj² falls in the interval [0, 1]. Moreover, the homogeneity criterion
must be normalized to fall in the interval [0, 1].
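An illustrative Python rendering of this inertia criterion (before normalization), assuming each unit is given as a list of (min, max) intervals, one per variable; the names and data layout are our assumptions:

```python
def delta(unit_i, unit_k, ranges):
    """Delta(w_i, w_k): sum over variables of the squared normalized distance between interval midpoints.

    ranges[j] = max_i y_jmax - min_i y_jmin for variable j, computed over the whole data set."""
    total = 0.0
    for j, ((lo_i, hi_i), (lo_k, hi_k)) in enumerate(zip(unit_i, unit_k)):
        mid_i, mid_k = (lo_i + hi_i) / 2.0, (lo_k + hi_k) / 2.0
        total += ((mid_i - mid_k) / ranges[j]) ** 2
    return total

def inertia(node_units, weights, ranges):
    """H(t) = sum over pairs (w_i, w_k) in node t of p_i * p_k / (2*mu) * Delta(w_i, w_k)."""
    mu = sum(weights)  # weight of the node t
    return sum(weights[i] * weights[k] * delta(u_i, u_k, ranges) / (2.0 * mu)
               for i, u_i in enumerate(node_units)
               for k, u_k in enumerate(node_units))
```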
The discrimination criterion we choose is an impurity criterion, Gini’s index,
which we denote as D. This criterion was introduced by Breiman ( [BR84]) for
the CART algorithm and measures the impurity of a node (which could be N1 or
N2 ) with respect to the prior partition (G1 , G2 , ..., GS ). The Gini measure of node
impurity is a measure which reaches a value of zero when only one class is present
at a node (with priors estimated from class sizes, the Gini measure is computed as
the sum of products of all pairs of class proportions for classes present at the node;
it reaches its maximum value when class sizes at the node are equal). The Gini
index has an interesting interpretation. Instead of using the plurality rule to classify
objects in a node n, use the rule that assigns an object selected at random from
the node to class Gs with probability ps. The Gini index is also simple and quickly
computed. It is defined as

D(t) = Σ_{l ≠ i} pl pi = 1 − Σ_{i=1,...,S} pi²,

with pi = Pt(Gi), the proportion of the class Gi at the node t.
D(N) must be normalized to fall in the interval [0, 1].
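A small sketch of the Gini impurity of a node with respect to the prior partition, under the assumption that class proportions are estimated from class counts at the node:

```python
from collections import Counter

def gini(prior_labels):
    """D(t) = 1 - sum_i p_i^2, where p_i is the proportion of prior class G_i present at the node."""
    counts = Counter(prior_labels)
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

print(gini([1, 1, 2, 2]))  # 0.5 -> maximal impurity for two balanced classes
print(gini([1, 1, 1, 1]))  # 0.0 -> pure node
```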
To assign each object k with vector description (y1, ..., yJ) to one class Gi of the
prior partition, the following rule is applied:

G(k) = Gi ⟺ pk(Gi) > pk(Gj) for all j ≠ i.

Here the membership probability pk(Gi) is calculated as the weighted average
(over all terminal nodes) of the conditional probabilities Pt(Gi), with weights pk(t):

pk(Gi) = Σ_t Pt(Gi) × pk(t),

where the sum runs over the terminal nodes t. In other words, it consists of summing the
probabilities that an individual belongs to the class Gi over all terminal nodes t, these
probabilities being weighted by the probabilities that this object reaches the different
leaves. The set of individuals assigned to a terminal node t on the basis of the majority
rule is node(t) = {k | pk(t) > pk(s), ∀s ≠ t}. Therefore, we can calculate the size of the
class Gi inside a terminal node t: nt(Gi) = |{k | k ∈ node(t) and k ∈ Gi}|.
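As a sketch, the membership probability and the resulting assignment rule might look as follows; terminal-node probabilities are assumed to be given as dictionaries, and the names are illustrative:

```python
def membership_probability(p_k_t, P_t_G):
    """p_k(G_i) = sum over terminal nodes t of P_t(G_i) * p_k(t).

    p_k_t : {t: probability that unit k reaches terminal node t}
    P_t_G : {t: {G_i: conditional probability of prior class G_i at node t}}"""
    classes = {g for probs in P_t_G.values() for g in probs}
    return {g: sum(P_t_G[t].get(g, 0.0) * p for t, p in p_k_t.items()) for g in classes}

def assign_prior_class(p_k_t, P_t_G):
    """Assign unit k to the prior class G_i with the largest membership probability."""
    pk = membership_probability(p_k_t, P_t_G)
    return max(pk, key=pk.get)
```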
Our aim here is to produce a description of a class C of units coming from a
population of K units. Naturally, the description includes all the units of the class
C because we induce this description from all the units of this class. However, the number
of units not belonging to this class but included in this description should be minimized. For
example, suppose the class C to describe consists of the districts of Great Britain with
more than 500,000 inhabitants and the population is all the districts of
Great Britain. Then the description should include as few districts with fewer than
500,000 inhabitants as possible. So it is of interest to consider the overflow
of the class to describe when making a stopping rule which stops the division of a node
N into two nodes N1 and N2.
First, we define the overflow rate of a node N:

OR(N) = n(C̄N) / nN,

with n(C̄N) the number of units belonging to the complement C̄ of the class C
which verify the current description of the node N, and nN the total number of
units verifying the description of the node N.
The overflow rate OR of a couple of nodes (N1, N2) is then

OR(N1, N2) = (n(C̄N1) + n(C̄N2)) / (nN1 + nN2).
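The overflow rates can be computed directly from counts, as in this brief sketch with hypothetical names:

```python
def overflow_rate(n_complement, n_total):
    """OR(N): share of units satisfying the description of node N that do not belong to C."""
    return n_complement / n_total if n_total else 0.0

def overflow_rate_split(n_comp_left, n_left, n_comp_right, n_right):
    """OR(N1, N2) for the two child nodes of N."""
    return (n_comp_left + n_comp_right) / (n_left + n_right)
```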
The variation of the overflow rate of a node N due to its being split, ∆OR(N) =
OR(N) − OR(N1, N2), might be a useful indicator to stop splitting this node. A small
variation shows that there is not much improvement in the overflow of the class to
describe when the node is split.
A node N is considered as terminal (a leaf) if: the variation of its overflow rate
∆OR(N) is less than a threshold fixed by the user or a default value, say
10%; or the difference between the discrimination before the cutting and the discrimination
after the cutting, ∆D(N), is less than a threshold fixed by the user or a default value,
say 10%; or its size is not large enough (fewer than 2 units); or the split would create an
empty son node. We stop the division when all the nodes are terminal, or alternatively
when we reach a number of divisions set by the user, in order to provide a quick and
simple description.
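Putting the stopping conditions together, a terminal-node test might be sketched as follows; the 10% thresholds are the default values mentioned above, and the argument names are ours:

```python
def is_terminal(delta_or, delta_d, node_size, n_left, n_right,
                or_threshold=0.10, d_threshold=0.10):
    """Return True when a node should no longer be split."""
    return (delta_or < or_threshold       # overflow rate barely improves
            or delta_d < d_threshold      # discrimination barely improves
            or node_size < 2              # node too small to split
            or min(n_left, n_right) < 1)  # best split would create an empty son node
```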
The user may select one of two ways to choose the value of α. One way is for the
user to fix the value of α. The motivation to select this option is that the user may
have some particular aim motivating the choice of α, that is the relative weights
he/she wishes to place on the homogeneity and discrimination criteria. Then he/she
may choose α as any value between 0 and 1.
The second way to choose the value of α is to allow the algorithm to optimize the
value of the coefficient α, using a data driven approach. This data driven approach
allows the user to examine the effect of varying α on both the inertia and the misclassification rate by examining a given graph. To implement this data driven approach,
one must fix the threshold for the overflow rate. The influence of the coefficient α can
be determinant both in the construction of the tree and in its prediction qualities.
The variation of α (or of β since α + β = 1) from 0 to 1 increases the importance
of the homogeneity and decreases the importance of discrimination. This variation
influences splitting and consequently results in different terminal nodes. We need to
find the inertia of the terminal nodes and the rate of misclassification as we vary
α. Then we can determine the value of α which optimizes both the inertia and
the rate of misclassification i.e. the homogeneity and discrimination simultaneously.
Note that if the user fixes α = 0, considering only the discrimination criterion, and
in addition the data are classical, the algorithm functions just as CART. So CART
is a special case of this algorithm.
For each terminal node l of the set L of terminal nodes of the tree, associated with the
majority class Gs, we can calculate the corresponding misclassification rate

R(s/l) = Σ_{r ≠ s} P(r/l), with P(r/l) = nr(l)/nl,

where nr(l) is the number of units of the node l belonging to the class Gr and nl is the
number of units of the node l. R(s/l) represents the proportion of units allocated to the
class Gs but belonging to the other classes.
The misclassification rate MR of the tree L is the weighted sum over all terminal nodes,
i.e.,

MR(L) = Σ_{l ∈ L} (nl/n) R(s/l) = Σ_{l ∈ L} Σ_{r ≠ s} nr(l)/n,

where n is the total number of units. We use MR
to measure the discrimination because this rate is simple to interpret. For each
terminal node of the tree L we can calculate the corresponding inertia, H(l), and we
can calculate the total inertia by summing over all the terminal nodes. So, the total
inertia of L is I(L) = Σ_{l ∈ L} H(l).
For each value of α between 0 and 1, we build a tree from the population under
consideration and calculate our two parameters (inertia and misclassification rate).
Varying α from 0 to 1 (say with a stepsize of 0.1) gives us 11 couples of values
of inertia and misclassification rate. In order to visualize the variation of the two
parameters, we display a curve showing the inertia and a curve showing the misclassification rate as a function of α. The best value of α is the one which minimizes the
normalized sum of the two parameters, I_Norm + MR_Norm.
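A hedged sketch of this data driven choice: rebuild the tree for each α on a grid, record the total inertia and misclassification rate, normalize both, and keep the α minimizing their sum. The function names and the min-max normalization are our assumptions, not the authors' exact procedure.

```python
def best_alpha(results):
    """results: list of (alpha, total_inertia, misclassification_rate) triples,
    one per tree built for alpha = 0.0, 0.1, ..., 1.0."""
    def normalize(values):
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

    inertias = normalize([i for _, i, _ in results])
    mrs = normalize([m for _, _, m in results])
    scores = [i + m for i, m in zip(inertias, mrs)]
    return results[scores.index(min(scores))][0]

# Hypothetical usage with three alpha values (illustrative numbers only):
print(best_alpha([(0.0, 0.111, 0.00), (0.5, 0.040, 0.00), (1.0, 0.038, 0.07)]))  # -> 0.5
```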
4 Example
Our example deals with real symbolic data with a population of 16 units. Each
unit represents a musical instrument; the original data base consists of 1323 sounds
from 16 different musical instruments. The 16 instruments are: the guitar, harp,
violin, viola, cello, double bass, flute, oboe, clarinet, bassoon, soprano saxophone,
horn, trombone, trumpet, tuba, and accordion. The class to describe, C, consists of the
instruments with sustained sounds; this class includes all of the instruments except
the guitar and the harp, which have unsustained sounds. We thus have 14 units in
the class to describe. Because we have aggregated the data for the sounds for each
musical instrument we are not dealing with classical data with a single value for each
variable for each statistical unit, here the musical instrument. Each unit is described
by J = 54 explanatory variables, all interval variables. The variables represent all the
interesting features of sound: some of them are temporal, such as the log attack time;
some relate to energy, such as the total energy in the signal and the signal-to-noise
ratio; some are perceptual such as the perceptual loudness; some relate to the entire
spectrum of the sound such as the spectral center of gravity, the spectral spread, the
spectral kurtosis, and the spectral slope; some relate to just the harmonic part of
the spectrum such as the harmonic spectral center of gravity, the harmonic spectral
spread, the harmonic kurtosis. Some of the variables are the mean value of these
indices; others are their standard deviations over the time window. These variables
have been shown to be the most relevant in describing sounds (see [PR99]). The
data set obtained from the above study contained 100 variables but many of them
were the same variable measured on a different scale such as the logarithmic scale.
To make the data set manageable we excluded variables which differed only on the
scale used to calculate their values (e.g., many variables were calculated on some or
all of the linear, log, and power scales); 54 variables were retained. Each variable covers
an interval of values for each unit as each unit represents the aggregate of many
sounds. The a priori partition we wish to explain is the blown instruments, i.e., those
using air to produce the sound, versus the non-blown instruments. In our class to
describe, the non-blown instruments are the strings, i.e., the violin, viola, cello and
double bass. We want to explain the factors which discriminate the blown from the
non-blown instruments in the class of sustained instruments. But we also wish to
have good descriptors of the resultant clusters due to their homogeneity. The class
to describe, C, contains the sustained instruments and we do not want the extent
of our resultant descriptions to cover the non-sustained instruments, i.e., the guitar
and the harp. We stopped the division when we reached 8 terminal nodes. We show
the results for three values of α: α = 1, α = 0, and α optimized with the data driven
method, α = 0.6. The results are shown in Table 1; OR(T) is the overflow rate over
all terminal nodes.
Table 1. Results for the musical example

α     I(T)    MR(T)%   OR(T)%
1     0.038   7        7
0     0.111   0        0
0.6   0.038   0        0
From Table 1 we see that using the data driven value of α = 0.6 we obtain
homogeneity that matches that when we use only a homogeneity criterion and a
misclassification rate of 0 which is what we obtain when we only use a discrimination
criterion. Moreover our overflow rate is 0 so we do not describe the unsustained
instruments. In this example we have unusually good results enabling us to attain for
practical purposes both optimum homogeneity and discrimination simultaneously.
The tree we obtain for the data driven value of α = 0.6 is shown in Figure 1
below. Let us consider the eight terminal nodes of the tree, which are respectively:
node 14 with 1 unit, the soprano saxophone; node 13 with 2 units, the flute and the clarinet; node
11 with 1 unit, the accordion; node 8 with 2 units, the bassoon and the trombone;
node 7 with 2 units, the tuba and the horn; node 6 with 2 units, the violin and the
viola; and node 5 with 2 units, the cello and the double bass. So the strings are well
separated from the blown instruments and each node is very homogeneous containing
very similar sounding instruments. We remind the reader that the homogeneity is
measured on all the input variables.
Each node can be described by the conjunction of cuts. In the tree the variables
used for the cuts are denoted by Champ j, which means field j, and corresponds to
variable j − 1, (Champs 1 is the unit identifier, here the instrument name). Notice
that we describe the class with only 5 descriptors, namely, variable 1 (Champs2),
the mean value of the kurtosis of the signal spectrum, variable 39 (Champs40), the
temporal variation measured on Bark’s perceptual scale, variable 78 (Champs79),
the odds-evens ratio of the harmonics of the spectrum, variable 90 (Champs91),
the standard deviation of the kurtosis of the spectrum filtered in Bark bands, and
variable 93 (Champs94) a measure of the perceptual spectral temporal fluctuation.
Using a classical algorithm with unaggregated data, that is, the individual sounds,
Peeters ([PR03]) needed many more descriptors and did not obtain as good a
misclassification rate. Here we have achieved a much more parsimonious description
with better homogeneity and misclassification rate. Note also that it is variable 39, the
temporal variation measured on Bark's perceptual scale, which separates the blown
from the non-blown (the strings) instruments in the class. This variable has been
demonstrated to be one of three dimensions recovered in MDS studies of musical
timbre (see [MC95] and [CA05]).
Each node can be described by a conjunction of the cuts to the top of the tree.
For example node 6 containing the violin and the viola can be described as follows:
[Var39 ∈ [0.012, 0.022]] ∧ [Var90 ∈ [0.860, 1.33]]. The class C consisting of the 14
sustained instruments can be described by a disjunction of these eight conjunctions
obtained for each of the eight terminal nodes.
To examine the prediction capability of this method we considered as the population the sustained instruments listed above. Using the uniform distribution from 0 to
1, we randomly selected 25% of the blown and 25% of the non-blown instruments to
be in the test group, and the rest of the 14 sustained instruments were assigned to the
learn group. The cello (non-blown), the trumpet (blown), and the tuba (blown)
were thus randomly chosen to be in the test group. The other 11 instruments were
in the learn group. We remark here that we have an unusually high ratio for the
number of objects in the test group to the number of objects in the learn group. We
performed the algorithm on the learn group and obtained the rule to discriminate
the blown from the non blown instruments. We obtained a misclassification rate of
0% on the learn group, but most importantly we also obtained a misclassification
rate of 0% on the test group, thus demonstrating the excellent predictive capability
of the algorithm in this example. In the future, methods specially adapted to investigate the robustness and predictive quality of this method, dealing with symbolic
data, (as distinguished from classical data), need to be developed. For example, in
the symbolic data analysis framework, one is often dealing with a relatively small
finite set of aggregated statistical units, (say the units are the countries in the European union). In such cases it might be more interesting to look at the stability and
robustness of the aggregating process. In other cases, it would be useful to look into
the stability of the results by bootstrapping and bagging the aggregated units.

Fig. 1. Tree for musical instruments. (The binary tree obtained for α = 0.6; each internal node is labeled with its cutting variable, Champ j, and cutting interval, and each leaf shows the proportions of non-blown (nb) and blown (b) instruments it contains.)
5 Conclusion
In this paper we present a new approach to obtaining a description of a class. Our approach is based on a divisive top-down tree method, restricted to recursive binary
partitions, until a suitable stopping rule prevents further divisions. This method is
applicable to most types of data, that is, classical numerical and categorical data,
symbolic data, including interval type data and histogram type data, and any mixture of these types of data. In this paper we have focused on, and applied the method
to, interval type data. The idea is to combine a homogeneity criterion and a discrimination criterion to describe a class and explain an a priori partition. The class
to describe can be a class from a prior partition, the whole population or any class
from the population. Having chosen this class, the interest of the method is that
the user can choose the weights α and thus β = 1 − α he/she wants to put on the
homogeneity and discrimination criteria respectively, depending on the importance
of these criteria to reach a desired goal. Alternatively, the user can optimize both
criteria simultaneously, choosing a data driven value of α. A data driven choice can
yield an almost optimal discrimination, while improving homogeneity, leading to
improved class description. The user may also select to use a stopping rule yielding
a description which overflows the class as little as possible and which is as pure as
possible. We also show that our method has excellent prediction capability for the
musical instrument example. Further tests of the algorithm's predictive capabilities
will be conducted.
References
[BO00] Bock, H.-H., Diday, E.: Analysis of Symbolic Data. Springer, Berlin (2000)
[BR84] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression
Trees. Wadsworth, CA (1984)
[CA05] Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic correlates of
timbre space: A confirmatory study using synthetic tones. J. Acoust. Soc.
Am., 116, 471–482 (2005)
[CH97] Chavent, M.: Analyse de données symboliques, une méthode divisive de
classification. Thèse de Doctorat, Université Paris IX Dauphine (1997)
[LI03] Limam, M., Diday, E., Winsberg, S.: Symbolic class description with interval
data. Journal of Symbolic Data Analysis, 1 (2003)
[MC95] McAdams, S., Winsberg, S., Donnadieu, S., De Soete, G., Krimphoff, J.: Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research, 58, 177–192 (1995)
[PR99] Peeters, G., Herrera, P.: M4884 Cuidad generic audio description scheme.
In: Proceedings ISO/IEC JTC1/SC29/WG11 MPEG99 (1999)
[PR03] Peeters, G.: Automatic classification of large musical instrument databases
using hierarchical clustering with inertia ratio maximization. In: Proceedings
of AES 115 convention (2003)
[PE99] Périnel, E.: Construire un arbre de discrimination binaire à partir de données
imprécises. Revue de Statistique Appliquée., 47, 5–30 (1999)
[QU86] Quinlan, J.R.: Induction of decision trees. Machine Learning., 1, 81–106
(1986)
[VR02] Vrac, M., Limam, M., Diday, E., Winsberg, S.: Symbolic class description.
In: Jajuga, K., Sokolowski, A., Bock, H.-H. (eds) Classification,
Clustering, and Data Analysis. Springer, 329–337 (2002)