Project Report
Patient Clustering in Healthcare Monitors
LIS429 Implementation of Information Storage and Retrieval
Spring 2004
Xiao Hu
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
Abstract: As the demand for healthcare services increases, it is desirable to group patients automatically by computer before sending them to physicians [1]. This project develops several clustering mechanisms and reports pilot experiments on data collected from the online questionnaire monitor [2] as well as on synthetic data. The results of the preliminary experiments validate the feasibility of patient clustering and, at the same time, indicate directions for further work.
1. Introduction
As the demand for higher quality healthcare services increases steadily, information technology is called upon to improve the effectiveness and efficiency of healthcare services. A new model of healthcare monitor based on an online questionnaire mechanism has been proposed [1] and developed [2]. The backend of the monitor connects to a comprehensive healthcare database which records all the answers to the questionnaires. Based on those records, similarity matching among the population can be conducted, and cohorts of patients with similar features can be identified. Such recognition of cohorts is valuable both for providing a patient with references to similar cases and for surveying the public health situation of the population.
Clustering is the process of grouping a set of objects into classes of similar objects [3]. It is an unsupervised learning method in that the class label of each object is unknown and the data are grouped simply by their mutual similarity, which makes clustering well suited to the problem of cohort formation. For example, the simplest but most important case is to distinguish "sick" patients from "healthy" ones, so as to urge sick people to see doctors. Distributing patients into groups of different levels of sickness can also help assign doctors and nurses appropriately, and thus improve logistic efficiency. As far as disease diagnosis is concerned, patient clustering can largely identify groups of people with different diseases according to their answers to the questionnaire. The intuition is the same as that of physician interviews, but this novel information system can greatly reduce the cost and increase the productivity. Therefore, patient clustering together with the online questionnaire infrastructure is a good means to ease the conflict between huge demand and limited resources in the healthcare industry.
Another use of patient clustering is exploratory analysis, an important activity in any processing of patient data. It helps us get a basic feeling for what the data look like and discard misleading assumptions about the data. Exploratory analysis is therefore the first step whenever we face a new data set and want to understand its basic characteristics.
The remainder of this paper is organized as follows. Section 2 gives an overview of classical clustering algorithms. Section 3 reviews the three representative similarity-based algorithms we applied in patient clustering: the K-means, agglomerative, and graph theoretic algorithms. Section 4 describes the experiments and results. Discussion and conclusions are summarized in Section 5. Finally, Section 6 presents our thoughts on further work.
2. Clustering Algorithms
Clustering is a classical topic in statistical data analysis and machine learning, and there is a large body of research discussing clustering methods [3,4]. Generally, clustering algorithms can be categorized in several ways. According to the structure of the clusters, there are hierarchical methods and partitioning methods. The former build a hierarchical decomposition over the data objects, while the latter simply partition the data. Most partitioning methods are iterative: they start with a set of initial clusters and improve them by repeating a reallocation operation that reassigns the objects.
Another distinction between clustering algorithms, according to how objects are assigned to clusters, is whether they perform soft or hard clustering. In a hard assignment, each object goes to exactly one cluster. Soft assignments allow degrees of membership: one object can be assigned to multiple clusters with certain membership values. In hierarchical clustering, assignment is usually hard; in partitioning clustering, both types of assignment are common. Actually, most soft assignment models still assume that an object belongs to only one cluster; the difference from hard clustering is that there is uncertainty about which cluster is the correct one. By contrast, there are true multiple-assignment models in which an object can belong to several clusters, which is called disjunctive clustering [6]. In patient clustering, all of these assignment schemes are useful, depending on the specific application.
Clustering algorithms can also be divided into similarity-based and model-based ones. Similarity-based methods need similarity functions to measure how similar the objects are. Common algorithms such as agglomerative clustering and K-means clustering are in this category. The best-known similarity measures are based on distances, such as the Euclidean distance and the Manhattan distance. Model-based methods assume the data objects are generated by some latent model, and use probabilistic or algebraic approaches to estimate that model; once the model is estimated, clustering based on it is straightforward. The EM algorithm is a typical example of a model-based method. Similarity-based methods rely on similarity measures and thus may not be applicable to some data sets, for example nominal data such as disease type, and they typically produce hard clusterings. Model-based methods, by contrast, can accommodate various definitions of clusters and allocate objects according to complex probabilistic models which inherently support soft clustering. Due to time constraints, we focus only on similarity-based algorithms in this study.
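For concreteness, the two distance measures mentioned above can be written as follows; this is a minimal Python sketch, and the function names are ours rather than part of any system described here.

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))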
3. Similarity-based Algorithms
Many clustering algorithms are based on similarity measures. In this section, we discuss
three representative similarity-based algorithms.
3.1 K-means Clustering
K-means is a hard clustering algorithm that defines clusters by the centroids of their members. It starts from a set of initial cluster centers and then goes through several iterations, each consisting of assigning every object to the cluster whose center is closest, and re-computing the center of each cluster based on the new assignment of the objects. The iteration stops when the centers converge. Figure 1 is the skeleton of the K-means algorithm:


Select k initial centers f_1, ..., f_k
while stopping criterion is not true do
    for all clusters c_j do
        c_j := {x_i | d(x_i, f_j) <= d(x_i, f_l) for all l}
    end
    for all means f_j do
        f_j := μ(c_j)
    end
end

Figure 1: K-means clustering
The time complexity of K-means is O(n), since both steps of an iteration are O(n) and only a constant number of iterations is performed. One implementation issue is how to break ties when several centers are at the same distance from an object. In such cases, we can either assign the object randomly to one of the candidate clusters or perturb it slightly so that its new position no longer causes a tie.
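To make Figure 1 concrete, the following is a minimal Python sketch of the K-means loop. It is an illustration under our own naming, not the exact code used in our experiments; the Euclidean distance and first-closest tie-breaking are example choices.

import random

def euclidean(a, b):  # same as the sketch in Section 2
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def kmeans(vectors, k, max_iter=100):
    # Select k initial centers at random from the data set.
    centers = random.sample(vectors, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the cluster whose center is closest
        # (ties are broken by taking the first closest center).
        new_assignment = [min(range(k), key=lambda j: euclidean(v, centers[j]))
                          for v in vectors]
        if new_assignment == assignment:
            break  # the assignment, and hence the centers, have converged
        assignment = new_assignment
        # Re-compute each center as the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centers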
3.2 Agglomerative Clustering
The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar ones, or top-down, by starting with all the objects and dividing them into groups. The bottom-up algorithm is called agglomerative clustering, while the top-down one is called divisive clustering. The two ideas are quite similar, except that the former merges similar clusters while the latter splits dissimilar clusters.
Agglomerative clustering is a greedy algorithm that starts with a separate cluster for each object. In each step, the two most similar clusters are determined and merged into a new cluster. The algorithm stops when a certain stopping criterion is met; usually, it stops when one large cluster containing all objects has been formed. Figure 2 gives the pseudocode of agglomerative clustering.
Given: a set X = {x_1, ..., x_n} of objects
       a similarity function sim: P(X) × P(X) → R
for i := 1 to n do
    c_i := {x_i}
end
C := {c_1, ..., c_n}
j := n + 1
while |C| > 1 do        # or other stopping criterion
    (c_n1, c_n2) := argmax over (c_u, c_v) in C × C, c_u ≠ c_v, of sim(c_u, c_v)
    c_j := c_n1 ∪ c_n2
    C := (C \ {c_n1, c_n2}) ∪ {c_j}
    j := j + 1
end

Figure 2: agglomerative hierarchical clustering
Besides the algorithm itself, there is the question of how to compute group similarity from individual object similarities. Three similarity functions are commonly used, as shown in Table 1.
Function         Definition
single link      similarity of the closest pair
complete link    similarity of the farthest pair
average link     average similarity between all pairs

Table 1: similarity functions used in clustering
In single link clustering, the similarity between two clusters is the similarity of the two closest objects in the clusters. Clusters based on this function have good local coherence, since the similarity function is locally defined, but the global quality can be poor because the function has no way to take the global context into account. As opposed to the locally coherent clusters of single link clustering, complete link clustering uses a similarity function that focuses on global cluster quality. The resulting clusters can also be regarded as "tight", since each merge is based on the similarity of the two most dissimilar members. Both of these functions depend on individual objects and are therefore sensitive to outliers, whereas the average link function is much less sensitive since it is based on a group decision. As we can see, these functions have their own advantages and suit different applications; nevertheless, in most patient clustering applications, global coherence is preferable to local coherence. In terms of computational complexity, single link clustering is O(n^2), but complete link is O(n^3). The average link similarity can be computed efficiently in some cases, so that the complexity of the algorithm is only O(n^2). The average link function can therefore be an efficient alternative to the complete link function while avoiding the poor global coherence of the single link function.
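To make the linkage functions of Table 1 concrete, here is a naive Python sketch of agglomerative clustering, written in terms of distances so that the most similar pair is the one with the smallest linkage value. The names are ours, and no efficiency tricks are attempted.

def single_link(c1, c2, dist):
    # Distance of the closest pair across the two clusters.
    return min(dist(a, b) for a in c1 for b in c2)

def complete_link(c1, c2, dist):
    # Distance of the farthest pair across the two clusters.
    return max(dist(a, b) for a in c1 for b in c2)

def average_link(c1, c2, dist):
    # Average distance over all cross-cluster pairs.
    return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(objects, dist, linkage, num_clusters=1):
    clusters = [[x] for x in objects]       # start with one cluster per object
    while len(clusters) > num_clusters:     # or any other stopping criterion
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]], dist))
        # Merge the pair into a single new cluster.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters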
3.3 Graph Theoretic Algorithm
The graph theoretic algorithm is a technique developed from graph theory [5]. Its purpose is to take advantage of the simplicity of the tree structure, which facilitates efficient implementations of much more sophisticated clustering algorithms. It is widely used in the field of computer vision, where the data live in very high dimensional spaces. The computational burden grows heavy as dimensionality increases, the well-known "curse of dimensionality". In patient clustering, as the number of questions grows larger, as it will in reality, we will face the same dimensionality problem, so we try this approach in this study.
There are many variations in the family of graph theoretic algorithms, such as Minimal Spanning Tree (MST) based methods, the Cut algorithm, and Normalized Cut/Spectral methods [5]. In general, the idea is the following: first, construct a weighted graph over the points in the high-dimensional space, with each point being a node and the similarity/distance between any two points being the weight of the edge connecting them. Then, decompose the graph into connected components in some way and treat those components as clusters.
We mainly focus on an MST-based clustering algorithm, as it does not depend on the detailed geometric shape of a cluster and thus overcomes many of the problems faced by classical clustering algorithms. For example, Euclidean distance based K-means clustering tends to form spherical clusters regardless of the true geometric characteristics of the data. The MST-based graph theoretic clustering algorithm is illustrated in Figure 3:
This algorithm is quite efficient, since it only needs to compute the similarities/distances between every pair of points once (when forming the graph). This is very desirable in high-dimensional spaces where similarity computation is expensive. However, the cost of constructing the minimal spanning tree is not trivial when there are a large number of points. Another advantage is that the algorithm can also yield hierarchical clusters during the clustering process.
1. Construct a weighted graph G(V, E):
   V is the set of points;
   the weight of an edge in E represents the distance between its two vertices
2. Construct a Minimal Spanning Tree (MST) T from G
3. Identify the edges with the highest weights (distances) in the MST
4. Remove the identified edges to form connected components (clusters)
Figure 3: MST-based graph theoretic algorithm
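The following Python sketch is one way to realize the steps of Figure 3; it is not our exact implementation. It uses Kruskal's algorithm with a union-find structure for the cycle test, and the simplest cut criterion: stopping after n - k edges have been accepted, which leaves the same components as building the full MST and removing its k - 1 heaviest edges.

def mst_clusters(vectors, k, dist):
    n = len(vectors)
    # Step 1: build the complete weighted graph over the points.
    edges = sorted((dist(vectors[i], vectors[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    # Union-find with path compression keeps the cycle test cheap.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    # Steps 2-4: accept the lightest edges that do not form a cycle;
    # stopping after n - k acceptances is equivalent to removing the
    # k - 1 heaviest edges of the full MST.
    accepted = 0
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            accepted += 1
            if accepted == n - k:
                break
    # Group point indices by the root of their component.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())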
4. Experiments and Results
In order to validate the proposed application of patient clustering in healthcare monitors, we ran a set of experiments with the algorithms described in the previous section, on both synthetic data and real data recorded by the healthcare monitor system [2].
4.1 Data Set
1. Data recorded by a healthcare monitor
An online healthcare survey system was developed in our previous studies [2]. This system can adaptively select questions according to a patient's previous answers, and it records all answers from all patients. Although a comprehensive health monitor system would probably have 100K questions yielding 1000 cohorts, as a preliminary study our system starts from the smaller scale of 123 questions about chronic disease and quality of life (QoL) [1]. The recorded answers form a set of structured vectors with one dimension per question. The questions are all multiple choice with five choices each, so the answers are categorical data with a numerical weight assigned to each choice; for this study, we simply assign weights 1-5 to the five choices. Since each question may be answered multiple times and the time interval between two answers differs across users, we use only the latest answer of each user, to simplify the problem for now. In this way we lose the dynamic evolution over time, but studying the static situation is our first step.
Due to the adaptive nature of the online survey system, a patient may not have answered all the questions in the questionnaire, which results in missing values in these vectors. We fill the missing values with the middle choice, which is not distinctive between the two clusters of interest, healthy and sick.
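A small Python sketch of this filling policy; the dictionary representation of a patient's recorded answers is an assumption made for illustration.

MIDDLE_CHOICE = 3  # weight of the neutral middle choice on the 1-5 scale

def fill_missing(answers, num_questions=123):
    # `answers` maps question index -> weight 1..5 for the questions the
    # patient actually answered; unseen questions get the middle choice.
    return [answers.get(q, MIDDLE_CHOICE) for q in range(num_questions)]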
Because this system has not yet been applied to real patients, we collected data from the students and the instructor of the LIS429 course. Each surveyor conducted the survey twice, yielding 14 vectors to be clustered.
2. Synthetic Data Set
In order to examine the scalability of the algorithms, we generated synthetic data sets with an arbitrary number of vectors and an arbitrary number of features per vector. The values in each vector are randomly generated but constrained by the vector's cluster, which is itself randomly chosen. For example, if cluster 1 is chosen for a vector, the value of each feature in that vector can be 1, 2, or 3, but cannot be 4 or 5. This procedure limits the number of clusters to at most three, because otherwise the values generated for different clusters would be too similar. The assumption does not hold in practice, where different clusters should be differentiated by a subset of the features rather than all of them. However, the real distribution of patients' answers is too complicated to mimic, especially now that we have not yet collected real patient data, so we test on this simplified situation first. If it works, we will create more sophisticated data generators for further exploration.
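A Python sketch of such a generator follows; the value band for cluster 1 matches the description above, while the bands for the other clusters are our own illustrative assumptions.

import random

def generate(num_vectors, num_features, num_clusters=3):
    # Value band allowed for each cluster; cluster 1's band {1, 2, 3} follows
    # the text, the bands for clusters 2 and 3 are assumptions.
    bands = {1: (1, 3), 2: (2, 4), 3: (3, 5)}
    data, labels = [], []
    for _ in range(num_vectors):
        c = random.randint(1, num_clusters)  # choose the generating cluster
        lo, hi = bands[c]
        data.append([random.randint(lo, hi) for _ in range(num_features)])
        labels.append(c)
    return data, labels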
4.2 Experimental Results
The experimental results on the data recorded by the healthcare monitor are quite exciting: the clusters are discovered with good accuracy. Figure 4 illustrates the results of two different runs of the K-means algorithm:
[Figure 4: results of K-means — the cluster 1 and cluster 0 memberships of the 14 survey accounts for two runs, Run1 and Run2]
Due to different initial settings, the K-means algorithm outputs different results across runs. However, these two results are quite similar except for one account, "mcheney1", which was placed in different groups in the two runs. For the other accounts, the grouping was correct according to the surveyors: only two of them pretended to be sick in both of their surveys (schatz and sandersb), while the others all split their two surveys between being healthy and being sick.
By examining the database, we found that the answers of the one inconsistent account, "mcheney1", are rather neutral, indicating a slightly sick patient who belongs to neither of the groups, while the other surveyors pretended to be either healthy or very sick. Taking question 41, "Does it hurt when you exercise?", as an example, "mcheney1" answered "maybe" while all the others answered "yes" or "no". So the algorithm is able to differentiate the most important groups: the very sick and the healthy.
The three agglomerative algorithms showed quite different performance. Unsurprisingly, the single link algorithm suffers from its "chain effect". The average link algorithm was expected to achieve a good result, but it turned out to produce one large cluster and a few small ones. Other researchers have also reported this phenomenon for the average link algorithm, but the reason still needs to be investigated.
When the expected number of clusters is set to 2, the complete link algorithm outputs one small cluster with only one member and a large one containing all the others. But if the number of clusters is set to 3, it outputs the following:
[Figure 5: results of the agglomerative algorithm with complete link — three clusters; cluster 0 contains only the account tdonohu1, while clusters 1 and 2 split the remaining 13 accounts]
If clusters 0 and 1 are combined, the result coincides completely with the result of K-means. Interestingly, the account "tdonohu1" remains an outlier across all three agglomerative algorithms. By looking into the database, we found that this account is the only one whose answers are all either As or Es ("A" is assigned the smallest weight while "E" is assigned the largest weight in our system).
The result of the graph theoretic algorithm is shown in Figure 6. Compared to the result of K-means, this algorithm mistakenly placed three more patients in the "sick" group, which suggests that the MST-based graph theoretic algorithm is not very well suited to this application.
The experiments on synthetic data test the scalability of the algorithms. K-means takes seconds to accomplish a clustering of 1000 vectors. In the case of 300 vectors, the accuracy is between 0.71 and 1.00, depending on the initial settings. When the number of vectors is increased to 1000, the accuracy is only slightly reduced, ranging from 0.68 to 1.00. This shows that the K-means algorithm, although simple, is both effective and efficient.
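One way to score an assignment against the generator's true labels, sketched below in Python, is to take the best one-to-one relabeling of cluster ids, since cluster numbering is arbitrary; this is an illustrative choice, not necessarily the exact evaluation script we used.

from itertools import permutations

def accuracy(assignment, true_labels):
    # Cluster numbering is arbitrary, so score under the best one-to-one
    # mapping of cluster ids onto true labels and keep the maximum.
    ids = sorted(set(assignment))
    labels = sorted(set(true_labels))
    best = 0.0
    for perm in permutations(labels, len(ids)):
        mapping = dict(zip(ids, perm))
        correct = sum(1 for a, t in zip(assignment, true_labels)
                      if mapping[a] == t)
        best = max(best, correct / len(true_labels))
    return best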
[Figure 6: results of the graph theoretic algorithm with minimal spanning tree — the cluster 1 and cluster 0 memberships of the 14 accounts]
The agglomerative algorithms are much slower: it took 50 minutes to accomplish a single link run on 1000 vectors, and the complete link and average link functions are known to have higher computational complexity (O(n^3) and O(n^2) respectively). The accuracy is always 1.00 across the algorithms and data sets. Although one reason is that the synthetic data are very clean, the result demonstrates the robustness of the agglomerative algorithms.
Our results for the graph theoretic algorithm are not exciting, mainly because it runs too slowly. We tested it on a data set of 300 vectors, each with 400 features; after running for 8 hours, only about 250 edges of the minimal spanning tree had been constructed. We used Kruskal's algorithm to construct the minimal spanning tree, which searches for cycles in the graph when adding edges, and as the tree grows larger this search becomes more and more time-consuming. Finally, we ran an experiment on 100 vectors with 400 features; the accuracy was 0.62, which is quite low considering that the synthetic data are very clean.
5. Discussion and Conclusion
The above results indicate that the K-means algorithm performs best, but this is not always the case: K-means depends heavily on the initial setting of the centers, and in this implementation, where the initial centers are randomly chosen, the results may differ across runs. The most remarkable advantages of K-means are its conceptual simplicity and its time complexity, which is linear in the number of objects, so it is preferable when efficiency is a consideration or data sets are very large.
Among the three agglomerative algorithms, complete link performs best, which validates the comparison between single link and complete link in Section 3. It is unexpected that average link did not perform well; it seems that average link clustering tends to produce one large cluster and a few small clusters.
The result of the graph theoretic algorithm does not look good either. Possible reasons are: 1. The procedure makes greedy, local decisions when cutting the tree, and thus cannot guarantee a globally optimal clustering. 2. The criterion for identifying inconsistent edges might not be the best for this case: the statistics are calculated over the directly connected neighbors of an edge, and statistics over more neighbors, or even all the edges, should be tested in further study. Theoretically, the graph theoretic algorithm should be efficient in high-dimensional cases, because it only needs to compute the similarities once; by comparison, the agglomerative algorithms have to update the similarities after every merge. However, an efficient way of constructing the MST is needed for the graph theoretic algorithm to realize this advantage.
6. Thoughts on Further Work
First of all, we need to experiment on more data: both more real patients' answers and more variations of synthetic data are needed.
The three methods studied here are all similarity based, and hence sensitive to the choice of similarity measure. Moreover, they all belong to the hard clustering methods, in which an object can belong to only one cluster. In the context of patient clustering, soft clustering is sometimes necessary; in addition, it is very useful for determining the correlations between questions and different diseases. Since model-based methods can accommodate various definitions of clusters and allocate objects according to complex probabilistic models which inherently support soft clustering, we may try model-based clustering methods such as the EM algorithm.
Another line of work is to treat the cohort identification problem as a classification problem rather than a clustering problem. By training on real data, a proper set of weights for the different features can be learned and applied to classify new patients into appropriate classes.
References
[1] R. Berlin and B. Schatz, Internet Health Monitors for Outcomes of Chronic Illness, Medscape General Medicine (MedGenMed), 6 sections (12pp), Sept 2, 1999.
[2] Xiao Hu, An Online Health Questionnaire Monitor, project report in LIS450FCC, GSLIS, UIUC, Dec 2003.
[3] A. Jain and M. Murty, Data Clustering: A Review, ACM Computing Surveys, 31(3), pp. 264-323, 1999.
[4] Brian Everitt, Cluster Analysis, Wiley, New York, 1974.
[5] J. Shi and J. Malik, Normalized Cuts and Image Segmentation, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 731-737, 1997.
[6] Chris Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, Cambridge, MA, May 1999.