Decision-tree Induction from Time-series Data Based on a Standard-example Split Test

Yuu Yamada
yuu@slab.dnj.ynu.ac.jp
Einoshin Suzuki
suzuki@ynu.ac.jp
Electrical and Computer Engineering, Yokohama National University, 79-5, Tokiwadai, Hodogaya, Yokohama,
240-8501, Japan
Hideto Yokoi
yokoih@telemed.ho.chiba-u.ac.jp
Katsuhiko Takabayashi
takaba@ho.chiba-u.ac.jp
Division of Medical Informatics, Chiba University Hospital, 1-8-1 Inohana, Chuo-ku, Chiba, 260-8677 Japan
Abstract
This paper proposes a novel decision tree for
a data set with time-series attributes. Our
time-series tree has a value (i.e. a time sequence) of a time-series attribute in its internal node, and splits examples based on dissimilarity between a pair of time sequences.
Our method selects, for a split test, a time
sequence which exists in data by exhaustive
search based on class and shape information.
Experimental results confirm that our induction method constructs comprehensible and
accurate decision trees. Moreover, a medical
application shows that our time-series tree is
promising for knowledge discovery.
1. Introduction
A decision tree (Breiman, Friedman, Olshen, & Stone,
1984; Murthy, 1998; Quinlan, 1993) represents a tree-structured classifier which performs a split test in its
internal node and predicts a class of an example in
its leaf node. Various algorithms for decision-tree induction have been successful in numerous application
domains. A basic split test assumes a nominal attribute, and allocates examples according to their attribute values to their corresponding child nodes. In
order to apply decision-tree induction methods to various problems, the basic split test has been extended for
numerical attributes, tree-structured attributes (Almuallim, Akiba, & Kaneda, 1996), and set-valued attributes (Takechi & Suzuki, 2002). A tree-structured
attribute takes a value in a hierarchy, and a set-valued
attribute takes a set as its value.
Time-series data consist of a set of time sequences each
of which represents a list of values sorted in chronological order, and are abundant in various domains
(Keogh, 2001). Conventional classification methods
for such data can be classified into a transformation
approach and a direct approach. The former maps a
time sequence to another representation. The latter,
on the other hand, typically relies on a dissimilarity
measure between a pair of time sequences. They are
further divided into those which handle time sequences
that exist in data (Rodríguez, Alonso, & Boström,
2000) and those which rely on abstracted patterns
(Drücker et al., 2002; Geurts, 2001; Kadous, 1999).
Comprehensiveness of a classifier is highly important
in various domains including medicine. The direct approach, which explicitly handles real time sequences,
has an advantage over other approaches. In our
chronic hepatitis domain, we have found that physicians tend to prefer real time sequences instead of
abstracted time sequences which can be meaningless.
However, conventional methods such as (Rodríguez, Alonso, & Boström, 2000) rely on sampling, and the problems related to the extensive computation in such methods have remained unknown.
In this paper, we propose, for decision-tree induction,
a split test which finds the “best” time sequence that
exists in data with exhaustive search. A time-series
tree represents a novel decision tree which employs this
sort of split test and dynamic time warping (DTW)
(Sakoe & Chiba, 1978) in our dissimilarity measure.

Figure 1. Data set which consists of time-series attributes (plots of GPT, ALB, and PLT time sequences for LC and non-LC examples; figure omitted)

Figure 2. Correspondence of a pair of time sequences in the Euclidean distance and the DTW-based measure (Artwork courtesy of Eamonn Keogh)
2. Classification from Time-series Data
2.1. Definition of the Problem
A time sequence A represents a list of values
α1 , α2 , · · · , αI sorted in chronological order. For simplicity, this paper assumes that the values are obtained
or sampled with an equivalent interval (= 1) [1].
A data set D consists of n examples e1 , e2 , · · · , en ,
and each example ei is described by m attributes
a1 , a2 , · · · , am and a class attribute c. We assume
that an attribute aj represents a time-series attribute
which takes a time sequence as its value [2]. The class attribute c represents a nominal attribute and its value
is called a class. We show an example of a data set
which consists of time-series attributes in Figure 1.
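For concreteness, the code sketches added below assume the following minimal representation of such a data set; the type and variable names are ours and are not part of the original formulation.

from typing import List

TimeSequence = List[float]      # values of one time-series attribute, in chronological order
Example = List[TimeSequence]    # one time sequence per attribute a_1, ..., a_m
examples: List[Example] = []    # the data set D = e_1, ..., e_n
labels: List[str] = []          # the class attribute c, parallel to `examples`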
In classification from time-series data, the objective
is the induction of a classifier, which predicts the
class of an example e, given a training data set D. This
paper mainly assumes a decision tree (Breiman et al.,
1984; Murthy, 1998; Quinlan, 1993) as a classifier.
2.2. Dissimilarity Measure Based on Dynamic
Time Warping
In this section, we define two dissimilarity measures for a pair of time sequences. The first measure represents a Manhattan distance, which is calculated by taking all the vertical differences between a pair of data points of the time sequences, and then summing their absolute values together [3]. It should be noted that Manhattan distance as well as Euclidean distance cannot be applied to a pair of time sequences with different numbers of values, and the results can be counter-intuitive (Keogh & Pazzani, 2000). The main reason lies in the fact that these measures assume a fixed vertical correspondence of values, while a human being can flexibly recognize the shape of a time sequence.

[1] This restriction is not essential in DTW.
[2] Though it is straightforward to include other kinds of attributes in the definition, we exclude them for clarity.
[3] We selected this instead of Euclidean distance hoping for its robustness against outliers. Experiments, however, showed that there is no clear winner in accuracy.
The second measure is based on Dynamic Time Warping (DTW) (Sakoe & Chiba, 1978). DTW can afford
non-linear distortion along the time axis since it can
assign multiple values of a time sequence to a single
value of the other time sequence. This measure, therefore, can not only be applied to a pair of time sequences
with different numbers of values, but also fits human
intuition. Figure 2 shows examples of correspondence
in which the Euclidean distance and the dissimilarity
measure based on DTW are employed. From the Figure, we see that the right-hand side seems more natural
than the other.
Now we define the DTW-based measure G(A, B) between a pair of time sequences A = α1 , α2 , · · · , αI and
B = β1 , β2 , · · · , βJ . The correspondence between A
and B is called a warping path, and can be represented as a sequence of grids F = f1 , f2 , . . . , fK on an
I × J plane as shown in Figure 3.
Let the distance between two values αik and βjk be d(fk) = |αik − βjk|; then an evaluation function ∆(F) is given by

  ∆(F) = (1/(I + J)) Σ_{k=1}^{K} d(fk) wk,

where wk is a positive weight for fk, wk = (ik − ik−1) + (jk − jk−1), and i0 = j0 = 0. The smaller the value of ∆(F), the more similar A and B are. In order to prevent excessive distortion, we assume an adjustment window (|ik − jk| ≤ r), and consider minimizing ∆(F) in terms of F. The minimization can be resolved without checking all possible F, since dynamic programming, of which the complexity is O(IJ), can be employed. The minimum value of ∆(F) gives the value of G(A, B).
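To make the minimization concrete, the following is a minimal sketch of the DTW-based measure, assuming NumPy and the symmetric step weights wk given above; it illustrates the dynamic-programming recurrence and is not the authors' implementation.

import numpy as np

def dtw_dissimilarity(A, B, r=None):
    """Sketch of G(A, B): the minimum of Delta(F) = (1/(I+J)) * sum_k d(f_k) w_k
    over warping paths F, computed by dynamic programming.  r is the
    adjustment-window width |i_k - j_k| <= r (None = no constraint)."""
    I, J = len(A), len(B)
    if r is None:
        r = max(I, J)
    INF = float("inf")
    # g[i, j]: minimum weighted cost of a warping path ending at grid (i, j);
    # indices are 1-based, with the virtual start g[0, 0] = 0 (i_0 = j_0 = 0).
    g = np.full((I + 1, J + 1), INF)
    g[0, 0] = 0.0
    for i in range(1, I + 1):
        for j in range(max(1, i - r), min(J, i + r) + 1):
            d = abs(A[i - 1] - B[j - 1])
            # admissible moves and their weights w_k = (i_k - i_{k-1}) + (j_k - j_{k-1})
            g[i, j] = min(g[i - 1, j] + d,          # vertical move,   w = 1
                          g[i, j - 1] + d,          # horizontal move, w = 1
                          g[i - 1, j - 1] + 2 * d)  # diagonal move,   w = 2
    return g[I, J] / (I + J)

For instance, dtw_dissimilarity([1, 2, 3, 4], [1, 2, 2, 3, 4], r=2) returns 0.0, since the repeated value can be absorbed by the warping path, whereas a fixed vertical correspondence cannot even be applied to sequences of different lengths.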
Figure 3. Example of a warping path (Artwork courtesy of Eamonn Keogh)

2.3. Problems of Conventional Methods

In (Kadous, 1999), a feature extraction approach is adopted to handle time-series data with a conventional classification algorithm. As we stated in section 1, we believe that the transformation approach produces less comprehensible classifiers than the direct approach. (Drücker et al., 2002; Geurts, 2001) generate abstracted time sequences, each of which is employed in a classifier. As we will see in section 4.3, real time sequences which exist in data are favored in our chronic hepatitis domain. (Rodríguez, Alonso, & Boström, 2000) selects real time sequences in a decision list based on class information, but relies on a sampling method. As we will see in section 3.1, exhaustive search improves the simplicity of a classifier, and the shapes of time sequences should also be considered in the learning process.

Naive methods for classification from time-series data include a decision-tree learner which replaces a time sequence with its average value in pre-processing, and a nearest neighbor method with the DTW-based dissimilarity measure. The former neglects the structure of a time sequence, and might possibly recognize a pair of largely different time sequences as similar. The latter, as a lazy learner, outputs no classification model, and thus its learning results lack comprehensibility.

3. Time-series Tree

3.1. Standard-example Split Test

In order to circumvent the aforementioned problems, our time-series tree has a time sequence which exists in data and an attribute in its internal node, and splits a set of examples according to the dissimilarity of their corresponding time sequences to that time sequence. The use of a time sequence which exists in data in its split node contributes to the comprehensibility of the classifier, and each time sequence is obtained by exhaustive search. The dissimilarity measure is based on DTW due to the reasons described in section 2.2.
We call this split test a standard-example split test. A standard-example split test σ(e, a, θ) consists of a standard example e, an attribute a, and a threshold θ. Let the value of an example e in terms of a time-series attribute a be e(a); then a standard-example split test divides a set of examples e1, e2, ..., en into a set S1(e, a, θ) of examples whose values ei(a) satisfy G(e(a), ei(a)) < θ, and the rest S2(e, a, θ). We also call this split test a θ-guillotine cut.
As the goodness of a split test, we have selected gain
ratio (Quinlan, 1993) since it is frequently used in
decision-tree induction. Since at most n−1 split points
are inspected for an attribute in a θ-guillotine cut and
we consider each example as a candidate of a standard
example, it frequently happens that several split points
exhibit the largest value of gain ratio. We assume
that shapes of time sequences are essential in classification, thus, in such a case, we define that the best
split test exhibits the largest gap between the sets of
time sequences in the child nodes. The gap gap(e, a, θ)
of σ(e, a, θ) is equivalent to G(e′(a), e′′(a)), where e′ and e′′ represent the example ei in S1(e, a, θ) with the largest G(e(a), ei(a)) and the example ej in S2(e, a, θ) with the smallest G(e(a), ej(a)), respectively.
When several split tests exhibit the largest value of
gain ratio, the split test with the largest gap(e, a, θ)
among them is selected.
Below we show the procedure standardExSplit which
obtains the best standard-example split test, where
ω.gr and ω.gap represent the gain ratio and the gap of
a split test ω respectively.
Procedure: standardExSplit
Input: Set of examples e1, e2, ..., en
Return value: Best split test ω
1  ω.gr = 0
2  Foreach(example e)
3    Foreach(time-series attribute a)
4      Sort examples e1, e2, ..., en in the current node using G(e(a), ei(a)) as a key, to e′1, e′2, ..., e′n
5      Foreach(θ-guillotine cut ω′ of e′1, e′2, ..., e′n)
6        If ω′.gr > ω.gr
7          ω = ω′
8        Else If ω′.gr == ω.gr And ω′.gap > ω.gap
9          ω = ω′
10 Return ω
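As an illustration of this search, the following Python sketch enumerates every example as a candidate standard example, every time-series attribute, and every θ-guillotine cut, scoring each split by gain ratio and breaking ties by the gap. The representation (examples[i][a] as the time sequence of attribute a for example ei) and the helper names are ours; G can be any dissimilarity between two time sequences, e.g. the dtw_dissimilarity sketch above.

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(parent, left, right):
    """C4.5-style gain ratio of a binary split of `parent` into `left`/`right`."""
    n = len(parent)
    info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    gain = entropy(parent) - info
    split_info = entropy(["L"] * len(left) + ["R"] * len(right))
    return gain / split_info if split_info > 0 else 0.0

def standard_example_split(examples, labels, n_attrs, G):
    """Exhaustive search for the best standard-example split test σ(e, a, θ).
    Returns (gain ratio, gap, (standard-example index, attribute index, θ))."""
    best = (0.0, 0.0, None)
    n = len(examples)
    for e in range(n):                          # every example is a candidate standard example
        for a in range(n_attrs):
            # sort examples by dissimilarity to the standard example on attribute a
            dv = sorted((G(examples[e][a], examples[i][a]), i) for i in range(n))
            dvals = [d for d, _ in dv]
            order = [i for _, i in dv]
            for k in range(1, n):               # θ-guillotine cuts between consecutive examples
                if dvals[k - 1] == dvals[k]:
                    continue                    # no threshold separates these two examples
                theta = (dvals[k - 1] + dvals[k]) / 2
                gr = gain_ratio(labels,
                                [labels[i] for i in order[:k]],
                                [labels[i] for i in order[k:]])
                # gap: dissimilarity between the boundary examples of the two child sets
                gap = G(examples[order[k - 1]][a], examples[order[k]][a])
                if (gr, gap) > (best[0], best[1]):
                    best = (gr, gap, (e, a, theta))
    return best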
3.2. Cluster-example Split Test
From the standard-example split test, one can easily invent a split test with two standard examples e′, e′′ and an attribute a. A cluster-example split test σ′(e′, e′′, a) divides a set of examples e1, e2, ..., en into a set U1(e′, e′′, a) of examples ei whose values satisfy G(e′(a), ei(a)) < G(e′′(a), ei(a)), and the rest U2(e′, e′′, a). The evaluation criterion for the goodness of a split test is equivalent to that of the standard-example split test without θ.
Below we show the procedure clusterExSplit which obtains the best cluster-example split test.
Procedure: clusterExSplit
Input: Set of examples e1, e2, ..., en
Return value: Best split test ω
1  ω.gr = 0
2  Foreach(pair of examples e′, e′′)
3    Foreach(time-series attribute a)
4      ω′ = σ′(e′, e′′, a)
5      If ω′.gr > ω.gr
6        ω = ω′
7      Else If ω′.gr == ω.gr And ω′.gap > ω.gap
8        ω = ω′
9  Return ω
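A corresponding sketch of the cluster-example search is given below; it reuses gain_ratio from the previous sketch and assigns each example to the nearer of the two candidate standard examples under G. The tie-breaking by the gap, whose exact form for this split test is not spelled out above, is omitted from this sketch.

def cluster_example_split(examples, labels, n_attrs, G):
    """Exhaustive search over pairs of standard examples and attributes for the
    cluster-example split test σ'(e', e'', a); uses gain_ratio from the
    standard-example sketch above.  Returns (gain ratio, (e', e'', a))."""
    best_gr, best_split = 0.0, None
    n = len(examples)
    for e1 in range(n):
        for e2 in range(e1 + 1, n):
            for a in range(n_attrs):
                left, right = [], []
                for i in range(n):
                    # assign e_i to the nearer standard example on attribute a
                    if G(examples[e1][a], examples[i][a]) < G(examples[e2][a], examples[i][a]):
                        left.append(labels[i])
                    else:
                        right.append(labels[i])
                if not left or not right:
                    continue                    # degenerate split
                gr = gain_ratio(labels, left, right)
                if gr > best_gr:
                    best_gr, best_split = gr, (e1, e2, a)
    return best_gr, best_split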
4. Experimental Evaluation
4.1. Conditions of Experiments
In order to investigate the effectiveness of the proposed
method, we perform experiments with real-world data
sets. Chronic hepatitis data have been donated by
Chiba University Hospital in Japan, and have been
employed in (Berka, 2002). The Australian sign language data [4] and the EEG data belong to the benchmark data sets in the UCI KDD archive (Hettich & Bay, 1999).
For the chronic hepatitis data, we have settled a classification problem of predicting whether a patient has liver cirrhosis (the degree of fibrosis is F4, or the result of the corresponding biopsy [5] is LC in the data). Since the patients underwent different numbers of medical tests, we have employed time sequences each of which has more than 9 test values during the period from 500 days before to 500 days after a biopsy. As a consequence, our data set consists of 30 LC patients and 34 non-LC patients. Since the intervals of medical tests differ, we have employed linear interpolation between two adjacent values and transformed each time sequence into a time sequence of 101 values with a 10-day interval. One of us, who is a physician, suggested using in classification 14 attributes (GOT, GPT, ZTT, TTT, T-BIL, I-BIL, D-BIL, T-CHO, TP, ALB, CHE, WBC, PLT, HGB) which are important in hepatitis. We considered that the change of medical test values might be as important as the values themselves, and have generated 14 novel attributes as follows. Each value e(a, t) of an attribute a at time t is transformed to e′(a, t) by e′(a, t) = e(a, t) − {l(e, a) + s(e, a)}/2, where l(e, a) and s(e, a) represent the maximum and the minimum value of a for e respectively. As a consequence, another data set with 28 attributes was used in the experiments.

[4] The original source is http://www.cse.unsw.edu.au/~waleed/tml/data/.
[5] We will explain a biopsy in section 4.3.
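The preprocessing just described, linear interpolation onto a 101-point grid and the mid-range shift that yields the 14 derived attributes, can be sketched as follows; NumPy is assumed, and the function names and grid parameters are ours.

import numpy as np

def resample_series(days, values, start=-500, stop=500, step=10):
    """Linearly interpolate an irregularly sampled test-value series onto a
    101-point grid with a 10-day interval over [-500, 500] days around the biopsy."""
    grid = np.arange(start, stop + step, step)       # -500, -490, ..., 500
    return grid, np.interp(grid, days, values)

def center_by_midrange(values):
    """Derived attribute e'(a, t) = e(a, t) - {l(e, a) + s(e, a)}/2, where l and s
    are the maximum and minimum of the sequence, emphasising change over level."""
    mid = (max(values) + min(values)) / 2
    return [v - mid for v in values]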
The Australian sign language data represent a record of the hand position and movement for 95 words which were uttered several times by five people. In the experiments, we have randomly chosen five words, "Norway", "spend", "lose", "forget", and "boy", as classes, and employed 70 examples for each class. Among the 15 attributes, 9 (x, y, z, roll, thumb, fore, index, ring, little) are employed, and the number of values in a time sequence is 50 [6]. The EEG data set is a record of brain waves represented as a set of 255-value time sequences obtained from 64 electrodes placed on scalps. We have agglomerated three situations for each patient and omitted examples with missing values, and obtained a data set of 192 attributes for 77 alcoholic people and 43 non-alcoholic people.

[6] For each time sequence, 50 points with an equivalent interval were generated by linear interpolation between two adjacent points.
In the experiments, we tested two decision-tree learners with the standard-example split test (SE-split) and the cluster-example split test (CE-split) presented in section 3. Throughout the experiments, we employed the pessimistic pruning method (Mingers, 1989), and the adjustment window r of DTW was set to 10% of the total length [7].

[7] A value less than 1 is rounded down.

For comparative purposes, we have chosen the methods presented in section 2.3. In order to investigate the effect of our exhaustive search and use of gaps, we tested modified versions of SE-split which select each standard example from 1 or 10 randomly chosen examples, without using gaps, as Random 1 and Random 10 respectively. Intuitively, the former corresponds to a decision-tree version of (Rodríguez, Alonso, & Boström, 2000). In our implementation of (Geurts, 2001), the maximum number of segments was set to 3 since it outperformed the cases of 5, 7, and 11. In our implementation of (Kadous, 1999), we used the average, maximum, minimum, median, and mode as global features, and Increase, Decrease, and Flat with 3 clusters and 10-equal-length discretization as its "parametrised event primitives". Av-split and 1-NN represent the split test on the average values of time sequences, and the nearest neighbor method with the DTW-based measure, respectively.
It is widely known that the performance of a nearest neighbor method largely depends on its dissimilarity measure. As a result of trial and error, the following dissimilarity measure H(ei, ej) has been chosen since it often exhibits the highest accuracy:

  H(ei, ej) = Σ_{k=1}^{m} G(ei(ak), ej(ak)) / q(ak),

where q(ak) is the maximum value of the dissimilarity measure for a time-series attribute ak, i.e. q(ak) ≥ G(ei(ak), ej(ak)) for all i and j.
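As a sketch, this 1-NN dissimilarity can be computed from the per-attribute DTW values and the per-attribute maxima q(ak) as below; the representation and names follow the earlier sketches and are ours.

def nn_dissimilarity(ei, ej, q, G):
    """H(e_i, e_j): per-attribute DTW dissimilarities normalised by the maximum
    value q[a] observed for attribute a, summed over all m attributes."""
    return sum(G(ei[a], ej[a]) / q[a] for a in range(len(ei)))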
In the standard-example split test, the value of a gap gap(e, a, θ) largely depends on its attribute a. In the experiments, we have also tested substituting gap(e, a, θ)/q(a) for gap(e, a, θ). We omit the results in the following sections since this normalization does not necessarily improve accuracy.
4.2. Results of the Experiments
We show the results of the experiments in Tables 1 and
2 where the evaluation methods are leave-one-out and
20 × 5-fold cross validation respectively. In the Tables, the size represents the average number of nodes
in a decision tree, and the time is equal to {(the time
for obtaining the DTW values of time sequences for
all attributes and all pairs of examples) + (the time
needed for leave-one-out or 20 times 5-fold cross validation)}/(the number of learned classifiers) in order
to compare eager learners and a lazy learner. A PC
with a 3-GHz Pentium IV CPU and 1.5 Gbytes of memory was used in each trial. H1 and H2 represent the
data sets with 14 and 28 attributes respectively both
obtained from the chronic hepatitis data described in
section 4.1.
First, we briefly summarize the comparison of our SE-split with the other non-naive methods. From the tables, our exhaustive search and use of gaps are justified since SE-split draws with Random 1 and 10 in accuracy but outperforms them in terms of tree size. Our SE-split draws with (Geurts, 2001) in terms of these criteria but is much faster [8]. We have noticed that (Geurts, 2001) might be vulnerable to outliers, but this should be confirmed by further investigation. We attribute the poor performance of our implementation of (Kadous, 1999) to our choice of parameter values, and consider that we need to tune them to obtain satisfactory results.
In terms of accuracy, Random 1, (Kadous, 1999), and Av-split suffer in the EEG data set, and the latter two in the Sign data set. This would suggest the effectiveness of exhaustive search, a small number of parameters, and explicit handling of time sequences. 1-NN exhibits high accuracy in the Sign data set and relatively low accuracy in the EEG data set. These results might come from the fact that all attributes are relevant in the Sign data set while many attributes are irrelevant in the EEG data set.
CE-split, although it handles the structure of a time sequence explicitly, almost always exhibits lower accuracy than SE-split. We consider that this is due to the fact that CE-split rarely produces pure child nodes [9], since it mainly divides a set of examples based on their shapes. We have observed that SE-split often produces nearly pure child nodes due to its use of the θ-guillotine cut.
Experimental results show that DTW is sometimes
time-inefficient. It should be noted that time is less
important than accuracy in our problem. Recent advances such as (Keogh, 2002), however, can significantly speed up DTW.
In terms of the size of a decision tree, SE-split, CE-split, and (Geurts, 2001) constantly exhibit good performance. It should be noted that SE-split and CE-split show similar tendencies, though the latter method often produces slightly larger trees, possibly due to the pure-node problem. As we have described in section 2.3, a nearest neighbor method, being a lazy learner, has a deficiency in the comprehensibility of its learned results. This deficiency can be considered crucial in application domains such as medicine, where interpretation of learned results is highly important.
[8] We could not include the results of the latter for two data sets since we estimate that they would require several months with our current implementation. Reuse of intermediate results, however, would significantly shorten this time.
[9] A pure child node represents a node with examples belonging to the same class.
Table 1. Results of experiments with the leave-one-out method

            accuracy (%)               time (s)                       size
method      H1    H2    Sign  EEG      H1    H2    Sign    EEG        H1    H2    Sign  EEG
SE-split    79.7  85.9  86.3  70.0     0.8   1.4   63.3    96.8       9.0   7.1   38.7  16.6
Random 1    71.9  68.8  85.7  54.2     0.2   0.5   1.2     50.9       15.6  12.9  69.5  33.7
Random 10   79.7  82.8  86.3  67.5     0.4   0.8   3.4     59.9       10.2  9.8   50.8  20.6
Geurts      75.0  78.1  -     -        26.2  41.1  -       -          10.2  9.5   -     -
Kadous      68.8  68.8  36.9  10.8     2.0   4.9   10.6    1167.0     9.2   10.8  39.5  40.0
CE-split    65.6  73.4  85.4  63.3     1.3   1.3   1876.4  1300.5     9.4   7.2   42.8  23.4
Av-split    73.4  70.3  35.7  52.5     0.0   0.1   0.2     2.9        10.9  11.4  47.4  61.9
1-NN        82.8  84.4  97.8  60.8     0.2   0.4   0.1     47.5       N/A   N/A   N/A   N/A

("-" indicates results that could not be included; see footnote [8].)
Table 2. Results of experiments with 20 × 5-fold cross validation

            accuracy (%)               time (s)                      size
method      H1    H2    Sign  EEG      H1   H2    Sign   EEG         H1    H2    Sign  EEG
SE-split    71.1  75.6  85.9  63.8     0.5  0.8   28.3   51.1        8.3   7.5   28.3  13.4
Random 1    67.3  69.3  81.0  55.5     0.1  0.3   0.8    48.1        12.1  10.7  59.0  24.6
Random 10   73.8  72.2  84.2  60.5     0.2  0.5   2.6    64.5        9.0   8.4   41.1  16.5
Geurts      68.7  72.1  -     -        7.7  14.5  -      -           7.9   8.1   -     -
Kadous      68.1  67.7  36.2  41.1     2.1  4.7   10.4   642.9       7.9   8.7   33.5  26.8
CE-split    69.1  69.0  86.2  58.5     0.6  1.5   966.9  530.0       7.9   7.4   36.0  19.9
Av-split    73.0  70.7  34.9  51.3     0.0  0.0   0.2    2.5         10.1  10.0  41.1  40.1
1-NN        80.9  81.5  96.6  61.8     2.2  4.4   9.3    1021.9      N/A   N/A   N/A   N/A

("-" indicates results that could not be included; see footnote [8].)
The difference between Tables 1 and 2 in time mainly comes from the number of examples in the training set and the test set. The execution time of 1-NN, being a lazy learner, is equivalent to the time of the test phase and is roughly proportional to the number of test examples. As a result, it runs faster with leave-one-out than with 20 × 5-fold cross validation, which is the opposite tendency of a decision-tree learner.
The accuracy of SE-split often degrades substantially with 20 × 5-fold cross validation compared with leave-one-out. We attribute this to the fact that a "good" example, which is selected as a standard example if it is in the training set, belongs to the test set in 20 × 5-fold cross validation more frequently than in leave-one-out. In order to justify this assumption, we show the learning curve [10] of each method for the EEG data, which consists of the largest number of examples among the four data sets, in Figure 4. In order to increase the number of examples, we counted a situation of a patient as an example. Hence an example is described by 64 attributes, and we have randomly picked 250 examples from each class. We omitted several methods due to their performance in Tables 1 and 2. From the Figure, the accuracy of the decision-tree learner with the standard-example split test degrades heavily when the number of examples is small. These experiments show that our standard-example split test is appropriate for a data set with a relatively large number of examples.

[10] Each accuracy represents an average of 20 trials.
Figure 4. Learning curve of each method for the EEG data (accuracy against the ratio of training examples, for the standard-example split test, 1-nearest neighbor, and the average-value split test)
4.3. Knowledge Discovery from the Chronic
Hepatitis Data
Chronic hepatitis represents a disease in which liver cells become inflamed and harmed by virus infection. If the inflammation lasts for a long period, the disease reaches an end stage called liver cirrhosis. During the progression to liver cirrhosis, the degree of fibrosis serves as an index of the progress, and it consists of five stages ranging from F0 (no fiber) to F4 (liver cirrhosis). The degree of fibrosis can be inspected by a biopsy, which picks liver tissue by inserting an instrument directly into the liver. A biopsy, however, cannot be performed frequently since it requires a short-term admission to a hospital and involves dangers such as hemorrhage. Therefore, if a conventional medical test such as a blood test could predict the degree of fibrosis, it would be highly beneficial in medicine since it could replace a biopsy.
Figure 5. Time-series tree learned from H0 (chronic hepatitis data of the first biopsies). The internal nodes split on the standard examples CHE (patient 278), GPT (patient 404), and ALB (patient 281); the leaves cover 13 LC patients, 5 LC patients, 30 non-LC patients (with 1 LC patient), and 2 LC patients.

Table 3. Experimental results with H0 using leave-one-out

method      accuracy (%)   time (s)   size
SE-split    88.2           0.5        6.9
Random 1    74.5           0.1        8.4
Random 10   78.4           0.3        7.4
Geurts      84.3           12.5       7.4
Kadous      78.4           1.6        5.1
CE-split    62.7           0.5        8.5
Av-split    82.4           0.0        10.1
1-NN        74.5           0.1        N/A
The classifier obtained below might realize such a
story. We have prepared another data set, which we
call H0, from the chronic hepatitis data by dealing
with the first biopsies only. H0 consists of 51 examples (21 LC patients, and 30 non-LC patients) each of
which is described with 14 attributes. Since a patient
who underwent a biopsy is typically subject to various
treatments such as interferon, the use of the first biopsies only enables us to analyze a more natural stage of
the disease than in H1 and H2. We show the decision
tree with the standard-example split test learned from
H0 in Figure 5. In the Figure, the number in parentheses following an attribute represents a patient ID, and a leaf node predicts its majority class. A horizontal dashed line in a graph represents the border value between two categories (e.g. normal and high) of the corresponding medical test.
The decision tree and the time sequences employed in the Figure have attracted the interest of physicians, and were recognized as an important discovery. The time-series tree investigates the potential capacity of the liver by CHE and predicts patients with low capacity as liver cirrhosis. The tree then investigates the degree of inflammation of the other patients by GPT, and predicts patients with heavy inflammation as liver cirrhosis. For the remaining patients, the tree investigates another sort of potential capacity of the liver by ALB and predicts liver cirrhosis based on that capacity. This procedure agrees closely with the routine interpretation of blood tests by physicians. Our proposed method was highly appreciated by them since it discovered results which are highly consistent with the knowledge of physicians while using only medical knowledge about the relevant attributes. They consider that we need to verify the plausibility of this classifier against as much information as possible from various sources, and then eventually move to biological tests.
During this quest, there was an interesting debate among the authors. Yamada and Suzuki, as machine learning researchers, proposed abstracting the time sequences in Figure 5. Yokoi and Takabayashi, who are physicians, however, insisted on using time sequences that exist in the data set [11]. The physicians are afraid that such abstracted time sequences could be meaningless, and claim that the use of real time sequences is appropriate from the medical point of view.
We show the results of the learning methods with H0 in Table 3. Our time-series tree outperforms the other methods in accuracy [12], and in tree size except for (Kadous, 1999). A test of significance based on the two-tailed t-distribution shows that the differences in tree sizes are statistically significant. We can safely conclude that our method outperforms the other methods when both accuracy and tree size are considered. Moreover, inspection of the mis-predicted patients in leave-one-out revealed that most of them can be considered exceptions. This shows that our method is also effective in detecting exceptional patients.

[11] Other physicians supported Yokoi and Takabayashi. They are, however, not reluctant to see the results of abstracted patterns as well.
[12] By a binary test with correspondence, however, we cannot reject, for example, the hypothesis that SE-split and Av-split differ in accuracy. We need more examples to show the superiority of our approach in terms of accuracy.
5. Conclusions
In applying a machine learning algorithm to real data,
one often encounters the problems of quantity, quality, and form. Data are massive (quantity), noisy
(quality), and in various shapes (form). Recently,
the third problem has motivated several researchers
to propose classification from structured data including time-series, and we believe that this tendency will
continue in the near future.
Information technology in medicine has spread its target from string data and numerical data to multimedia data such as time-series data and image data in the last decade, and aims to support the whole process of medicine, which handles various types of data (Tanaka, 2001). Our decision-tree induction method, which handles the shape of a time sequence explicitly, has succeeded in discovering results which were highly appreciated by physicians. We anticipate that there is still a long way to go toward an effective classification method which handles the various structures of multimedia data explicitly, but our method can be regarded as an important step toward this objective.
Acknowledgement
This work was partially supported by the grant-in-aid
for scientific research on priority area “Active Mining” from the Japanese Ministry of Education, Culture, Sports, Science and Technology. We are grateful
to Eamonn Keogh who permitted our use of his figures
in this paper.
References
Almuallim, H., Akiba, Y., & Kaneda, S. (1996). An
Efficient Algorithm for Finding Optimal Gain-ratio
Multiple-split Tests on Hierarchical Attributes in
Decision Tree Learning. Proceedings of the Thirteenth National Conference on Artificial Intelligence
(pp. 703–708). Menlo Park, CA: AAAI Press.
Berka, P. (2002).
ECML/PKDD 2002 Discovery Challenge, Download Data about Hepatitis.
http://lisp.vse.cz/challenge/ecmlpkdd2002/ .
Breiman, L., Friedman, J., Olshen, R., & Stone, C.
(1984). Classification and Regression Trees. Belmont, CA: Chapman & Hall.
Drücker, C. et al. (2002). “As Time Goes by”
- Using Time Series Based Decision Tree Induction to Analyze the Behaviour of Opponent Players.
RoboCup 2001: Robot Soccer World Cup V, LNAI
2377 (pp. 325–330). Berlin: Springer-Verlag.
Geurts, P. (2001). Pattern extraction for time series classification. Principles of Data Mining and
Knowledge Discovery, LNAI 2168 (pp. 115–127).
Berlin: Springer-Verlag.
Hettich, S. and Bay, S. D. (1999). The UCI KDD
Archive. http://kdd.ics.uci.edu. Irvine, CA: University of California, Department of Information and
Computer Science.
Kadous, M. W. (1999). Learning comprehensible descriptions of multivariate time series. Proceedings
of the Sixteenth International Conference on Machine Learning (pp. 454–463). San Francisco: Morgan Kaufmann.
Keogh, E. J. and Pazzani, M. J. (2000). Scaling up
Dynamic Time Warping for Datamining Applications. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining (pp. 285–289). New York: ACM.
Keogh, E. J. (2001). “Mining and Indexing Time
Series Data”. Tutorial at the 2001 IEEE International Conference on Data Mining. http://www.cs.
ucr.edu/%7Eeamonn/tutorial on time series.ppt.
Keogh, E. (2002). Exact Indexing of Dynamic Time
Warping. Proceedings of the 28th International
Conference on Very Large Data Bases (pp. 406–
417). St. Louis: Morgan Kaufmann.
Mingers, J. (1989). An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine
Learning, 4 , 227–243.
Murthy, S. K. (1998). Automatic Construction of Decision Trees from Data: A Multi-disciplinary Survey.
Data Mining and Knowledge Discovery, 2 , 345–389.
Quinlan, J. R. (1993). C4.5: Programs for Machine
Learning, San Mateo, CA: Morgan Kaufmann.
Rodríguez, J. J., Alonso, C. J., and Boström, H.
(2000).
Learning First Order Logic Time Series Classifiers, Proceedings of the Work-in-Progress
Track at the Tenth International Conference on Inductive Logic Programming, (pp. 260–275).
Sakoe, H. and Chiba, S. (1978). Dynamic Programming Algorithm Optimization for Spoken Word
Recognition.
IEEE Transactions on Acoustics,
Speech, and Signal Processing, ASSP-26 , 43–49.
Takechi, F. and Suzuki, E. (2002). Finding an Optimal Gain-ratio Subset-split Test for a Set-valued
Attribute in Decision Tree Induction. Proceedings
of the Nineteenth International Conference on Machine Learning (pp. 618–625). San Francisco: Morgan Kaufmann.
Tanaka, H. (2001). Electronic Patient Record and IT
Medical Treatment, Tokyo: MED (in Japanese).