Patient Clustering in Healthcare Monitors

Project Report
LIS429: Implementation of Information Storage and Retrieval, Spring 2004
Xiao Hu (xiaohu@uiuc.edu)
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign

Abstract: With the increasing demand for healthcare services, it is desirable to group patients automatically by computer before sending them to physicians [1]. This project develops several clustering mechanisms and conducts pilot experiments on data collected from the online questionnaire monitors [2] as well as on synthetic data. The results of the preliminary experiments validate the feasibility of patient clustering and, at the same time, indicate directions for further work.

1. Introduction

As the demand for higher quality healthcare services grows steadily, information technology is called upon to improve the effectiveness and efficiency of healthcare services. A new model of healthcare monitor based on an online questionnaire mechanism has been proposed [1] and developed [2]. The backend of the monitor connects to a comprehensive healthcare database which records all answers to the questionnaires. Based on those records, similarity matching across the population can be conducted, and cohorts of patients with similar features can be identified. Recognizing such cohorts is valuable both for providing a patient with references to similar cases and for surveying the public health situation of the population.

Clustering is the process of grouping a set of objects into classes of similar objects [3]. It is an unsupervised learning method, in that the class label of each object is unknown and the data are grouped simply by their mutual similarity, which makes it well suited to the problem of cohort formation. For example, the simplest but most important case is to distinguish "sick" patients from "healthy" ones, so as to urge sick people to see a doctor. Distributing patients into groups of different levels of sickness can also help assign doctors and nurses appropriately and thus improve logistic efficiency. As far as disease diagnosis is concerned, patient clustering can largely identify groups of people with different diseases according to their answers to the questionnaire. The intuition is the same as that of a physician interview, but this novel information system can greatly reduce cost and increase productivity. Patient clustering together with the online questionnaire infrastructure is therefore a good means of easing the conflict between huge demand and limited resources in the healthcare industry.

Another use of patient clustering is exploratory analysis, an important activity in any processing of patient data. It helps us develop a basic feeling for what the data look like and discard misleading assumptions about them. Exploratory analysis is therefore the first step whenever we face a new data set and want to understand its basic characteristics.

The remainder of this paper is organized as follows. Section 2 gives an overview of classical clustering algorithms. Section 3 reviews the three representative similarity-based algorithms we applied to patient clustering: K-means, agglomerative, and graph theoretic clustering. Section 4 describes the experiments and results. Discussion and conclusions are summarized in Section 5.
Finally, Section 6 presents our thoughts on further work.

2. Clustering Algorithms

Clustering is a classical topic in statistical data analysis and machine learning, and there is a large body of research on clustering methods [3,4]. Generally, clustering algorithms can be categorized in several ways. According to the structure of the clusters, there are hierarchical methods and partitioning methods: the former build a hierarchical decomposition over the data objects, while the latter simply partition the data. Most partitioning methods are iterative: they start with a set of initial clusters and improve them by repeatedly applying a reallocation operation that reassigns objects.

Another distinction between clustering algorithms is whether they perform soft or hard clustering, that is, how they assign objects to clusters. In a hard assignment, each object goes to exactly one cluster. Soft assignments allow degrees of membership in multiple clusters: one object can be assigned to several clusters with certain membership values. In hierarchical clustering, assignment is usually hard; in partitioning clustering, both types are common. In fact, most soft assignment models still assume that an object belongs to only one cluster; the difference from hard clustering is that there is uncertainty about which cluster is the correct one. In contrast, there are true multiple-assignment models in which an object can belong to several clusters at once, which is called disjunctive clustering [6]. In patient clustering, all of these assignment schemes are useful, depending on the specific application.

Clustering algorithms can also be divided into similarity-based and model-based ones. Similarity-based methods need a similarity function to measure how similar two objects are; common algorithms such as agglomerative clustering and K-means fall into this category. The best-known similarity measures are based on distances, such as the Euclidean and Manhattan distances. Model-based methods assume the data objects are generated by some latent model and use probabilistic or algebraic approaches to estimate that model; once the model is computed, clustering is straightforward. The EM algorithm is a typical example of a model-based method.

Similarity-based methods rely on similarity measures and thus may not be applicable to some data sets, for example nominal data such as disease categories, and they typically produce hard clusterings. Model-based methods, by contrast, can accommodate various definitions of clusters and allocate objects according to complex probabilistic models, which inherently support soft clustering. Due to time constraints, we focus on similarity-based algorithms in this study.

3. Similarity-based Algorithms

Many clustering algorithms are based on similarity measures. In this section, we discuss three representative similarity-based algorithms.

3.1 K-means Clustering

K-means is a hard clustering algorithm that defines clusters by the centroids of their members. It needs a set of initial cluster centers at the beginning, and then iterates over two steps: assigning each object to the cluster whose center is closest, and re-computing the center of each cluster based on the new assignment of objects. The iteration stops when the centers converge. Figure 1 shows the skeleton of the K-means algorithm:

    Select k initial centers f_1, ..., f_k
    while stopping criterion is not true do
        for all clusters c_j do
            c_j := { x_i | for all f_l: d(x_i, f_j) <= d(x_i, f_l) }
        end
        for all means f_j do
            f_j := mean(c_j)
        end
    end
    Figure 1: K-means clustering

The time complexity of K-means is O(n), since both steps of an iteration are O(n) and only a constant number of iterations is performed. In implementation, one problem is how to break ties when several centers are equally distant from an object. In such cases, we can either assign the object randomly to one of the candidate clusters or perturb the objects slightly so that their new positions do not cause ties.
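To make the loop concrete, here is a minimal runnable sketch of K-means in Python with Euclidean distance (an illustration only, not the implementation used in our experiments; the function name kmeans and the random initialization are our own choices):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # Cluster the rows of X into k groups; returns (labels, centers).
        rng = np.random.default_rng(seed)
        # Select k initial centers by sampling k distinct objects.
        centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
        for _ in range(max_iter):
            # Assignment step: each object goes to the closest center
            # (ties are broken arbitrarily by argmin, as discussed above).
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: recompute each center as the mean of its members;
            # an empty cluster keeps its previous center.
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            # Stopping criterion: the centers have converged.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers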
3.2 Agglomerative Clustering

The tree of a hierarchical clustering can be produced either bottom-up, by starting with the individual objects and grouping the most similar ones, or top-down, by starting with all the objects and dividing them into groups. The bottom-up approach is called agglomerative clustering, while the top-down one is called divisive clustering. The two ideas are quite similar, except that the former merges similar clusters while the latter splits dissimilar clusters.

Agglomerative clustering is a greedy algorithm that starts with a separate cluster for each object. In each step, the two most similar clusters are determined and merged into a new cluster. The algorithm stops when a certain stopping criterion is met; usually, it stops when one large cluster containing all objects is formed. Figure 2 gives the pseudo code of agglomerative clustering.

    Given: a set X = {x_1, ..., x_n} of objects
           a similarity function sim: P(X) x P(X) -> R
    for i := 1 to n do
        c_i := {x_i}
    end
    C := {c_1, ..., c_n}
    j := n + 1
    while |C| > 1 do    # or another stopping criterion
        (c_n1, c_n2) := argmax over (c_u, c_v) in C x C of sim(c_u, c_v)
        c_j := c_n1 ∪ c_n2
        C := (C \ {c_n1, c_n2}) ∪ {c_j}
        j := j + 1
    end
    Figure 2: Agglomerative hierarchical clustering

Besides the algorithm itself, there is the question of how to compute group similarity from individual object similarity. Three similarity functions are commonly used, as shown in Table 1.

    Function        Definition
    single link     similarity of the closest pair
    complete link   similarity of the farthest pair
    average link    average similarity between all pairs
    Table 1: Similarity functions used in clustering

In single link clustering, the similarity between two clusters is the similarity of the two closest objects in the clusters. Clusters based on this function have good local coherence, since the similarity function is locally defined; however, global quality can be poor, since the function has no way to take the global context into account. By contrast, complete link clustering uses a similarity function that focuses on global cluster quality, and the resulting clusters can be regarded as "tight", since cluster similarity is determined by the two most dissimilar members. Both of these functions depend on individual objects and thus are sensitive to outliers, whereas the average link function is much less sensitive, since it is based on all pairs of objects. These functions each have their own advantages and suit different applications; nevertheless, in most patient clustering applications, global coherence is preferable to local coherence.

In terms of computational complexity, single link clustering is O(n^2), but complete link is O(n^3). The average link similarity can be computed efficiently in some cases, so that the complexity of the algorithm is only O(n^2). The average link function can therefore be an efficient alternative to the complete link function while avoiding the poor global coherence of the single link function.
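As a concrete illustration of the procedure in Figure 2, the following is a naive sketch of complete-link agglomerative clustering in Python (our own illustrative code, assuming Euclidean distance and a brute-force O(n^3) search rather than an optimized implementation):

    import numpy as np

    def complete_link_clustering(X, num_clusters):
        # Start with a separate singleton cluster for each object.
        clusters = [[i] for i in range(len(X))]
        # Pairwise Euclidean distances between individual objects.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        while len(clusters) > num_clusters:
            best = (0, 1, np.inf)
            # Complete link: cluster distance is that of the farthest pair,
            # so merge the pair of clusters with the smallest such distance.
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                    if d < best[2]:
                        best = (a, b, d)
            a, b, _ = best
            # Merge the two most similar clusters and discard the second.
            clusters[a].extend(clusters[b])
            del clusters[b]
        return clusters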
3.3 Graph Theoretic Algorithm

Graph theoretic algorithms are techniques developed from graph theory [5]. The purpose is to take advantage of the simplicity of the tree structure, which facilitates efficient implementation of much more sophisticated clustering algorithms. They are widely used in computer vision, where the data live in very high-dimensional spaces. The computational burden grows heavy as dimensionality increases, the well-known "curse of dimensionality". In patient clustering, as the number of questions grows larger and larger, as it will in reality, we will face the same dimensionality problem, so we would like to try this approach in this study.

There are many variations in the family of graph theoretic algorithms, such as Minimal Spanning Tree (MST) based methods, the Cut algorithm, and Normalized Cut / spectral methods [5]. In general, the idea of graph theoretic algorithms is the following: first, construct a weighted graph over the points in the high-dimensional space, with each point being a node and the similarity/distance value between any two points being the weight of the edge connecting them; then decompose the graph into connected components in some way and treat those components as clusters.

We mainly focus on an MST-based clustering algorithm, as it does not depend on the detailed geometric shape of a cluster and thus overcomes many of the problems faced by classical clustering algorithms. For example, Euclidean distance based K-means tends to form sphere-shaped clusters regardless of the true geometric characteristics of the data. The MST-based graph theoretic clustering algorithm is illustrated in Figure 3:

    1. Construct a weighted graph G(V, E): V is the set of points; the weight of an edge in E represents the distance between its two vertices.
    2. Construct a minimal spanning tree (MST) T from G.
    3. Identify the edges with the highest weights (distances) in the MST.
    4. Remove the identified edges to form connected components (clusters).
    Figure 3: MST-based graph theoretic algorithm

This algorithm is quite efficient, since it computes the similarity/distance between each pair of points only once (when forming the graph), which is very desirable in high-dimensional spaces where similarity computation is expensive. However, the cost of constructing the minimal spanning tree is not trivial when there are a large number of points. Another advantage is that the algorithm can also yield hierarchical clusters during the clustering process.
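A small sketch of the four steps of Figure 3 in Python, using SciPy's spanning-tree and connected-components routines rather than our own MST code (illustrative only; it assumes Euclidean distance, distinct points, and num_clusters >= 2):

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

    def mst_clustering(X, num_clusters):
        # Step 1: fully connected weighted graph as a dense distance matrix.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
        # Step 2: minimal spanning tree of the graph.
        mst = minimum_spanning_tree(D).toarray()
        # Step 3: identify the (num_clusters - 1) highest-weight MST edges
        # (argwhere and boolean indexing enumerate edges in the same order).
        edges = np.argwhere(mst > 0)
        order = np.argsort(mst[mst > 0])
        heaviest = edges[order[-(num_clusters - 1):]]
        # Step 4: remove them; the remaining connected components are clusters.
        for i, j in heaviest:
            mst[i, j] = 0.0
        _, labels = connected_components(mst, directed=False)
        return labels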
4. Experiments and Results

In order to validate the proposed application of patient clustering in healthcare monitors, we ran a set of experiments with the algorithms described in the previous section, both on synthetic data and on real data recorded by the healthcare monitor system [2].

4.1 Data Sets

1. Data recorded by a healthcare monitor

An online healthcare survey system was developed in our previous studies [2]. The system adaptively selects questions according to a patient's previous answers and records all answers from all patients. Although a comprehensive health monitor system would probably have 100K questions yielding 1000 cohorts, as a preliminary study our system starts from a smaller scale of 123 questions about chronic disease and quality of life (QoL) [1].

The recorded answers form a set of structured vectors, with each dimension corresponding to a question. The questions are all multiple choice with five choices each, so the answers are categorical data with a numerical weight assigned to each choice; for this study, we simply assign weights 1-5 to the five choices. Since each question may be answered multiple times, and the time interval between two answers differs across users, we use only the latest answer of each user to simplify the problem for now. In this way we lose the dynamic evolution over time, but studying the static situation is our first step.

Due to the adaptive feature of the online survey system, a patient may not have answered all the questions in the questionnaire, which results in missing values in these vectors. We fill the missing values with the middle choice, which is not distinctive in the case of two clusters (healthy vs. sick).

Because this system has not yet been applied to real patients, we collected data from the students and instructor of the LIS429 course. Each participant conducted the survey twice, yielding 14 vectors to be clustered.

2. Synthetic data set

In order to examine the scalability of the algorithms, we generated synthetic data sets with arbitrary numbers of vectors and of features per vector. The values in each vector are randomly generated but constrained by the vector's cluster, which is itself randomly chosen. For example, if cluster 1 is chosen for a vector, each feature of that vector can take value 1, 2, or 3, but not 4 or 5. This procedure limits the number of clusters to at most three, because otherwise the values of vectors from different clusters would be nearly identical. This assumption does not hold in practice, where different clusters should be differentiated by a subset of features rather than by all features; however, the real distribution of patients' answers is too complicated to mimic, especially now that we have not yet collected real patient data. We therefore test on this simplified situation first; if it works, we will build more sophisticated data generators for further exploration.
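For reproducibility, here is a minimal sketch of the generator just described (our own illustration; it assumes, generalizing the example above, that cluster k may use only the choice values {k, k+1, k+2}, which is why at most three clusters fit in the 1-5 range):

    import numpy as np

    def make_synthetic(n_vectors, n_features, n_clusters=3, seed=0):
        # Returns (X, labels): answer vectors and their generating clusters.
        assert 1 <= n_clusters <= 3, "values 1-5 only accommodate 3 windows"
        rng = np.random.default_rng(seed)
        # Randomly choose a generating cluster (1-based) for each vector.
        labels = rng.integers(1, n_clusters + 1, size=n_vectors)
        # Draw every feature from the cluster's window {k, k+1, k+2}
        # (the upper bound of integers() is exclusive).
        X = rng.integers(labels[:, None], labels[:, None] + 3,
                         size=(n_vectors, n_features))
        return X, labels

The accuracies reported in the next section can then be computed by matching the discovered clusters against the generating labels.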
4.2 Experimental Results

The experimental results on the data recorded by the healthcare monitor are quite encouraging: the clusters are discovered with good accuracy. Figure 4 shows the results of two different runs of the K-means algorithm.

    Figure 4: Results of K-means - cluster assignments of the 14 survey vectors in two runs with two clusters each; the runs agree on every account except "mcheney1"

Due to different initial settings, the K-means algorithm produces different results across runs. These two results, however, are quite similar: they differ only in the account "mcheney1", which was placed in different groups in the two runs. For the other accounts, the grouping was correct according to the surveyors: only two of them pretended to be sick in both of their surveys (schatz and sandersb), while all others split their two surveys between being healthy and being sick.

By examining the database, we found that the answers of the one inconsistent account, "mcheney1", are largely neutral, indicating a slightly sick patient who belongs to neither group, while the other surveyors pretended to be either healthy or very sick. Take question 41, "Does it hurt when you exercise?", as an example: "mcheney1" answered "maybe", while all the others answered "yes" or "no". The algorithm is thus able to differentiate the most important groups: the very sick and the healthy.

The three agglomerative algorithms showed quite different performance. Unsurprisingly, the single link algorithm suffers from its "chaining effect". The average link algorithm was expected to achieve good results, but it turned out to produce one large cluster and a few small ones; other researchers have reported the same phenomenon for average link clustering, and the reason remains to be investigated. When the expected number of clusters is set to 2, the complete link algorithm outputs one cluster with a single member and another containing all the rest; with the number of clusters set to 3, it outputs the result shown in Figure 5.

    Figure 5: Results of the agglomerative algorithm with complete link, three clusters; cluster 0 contains only the account "tdonohu1"

If cluster 0 and cluster 1 are combined, the result coincides exactly with that of K-means. Interestingly, the account "tdonohu1" remains an outlier across all three agglomerative algorithms. Looking into the database, we found that this account is the only one whose answers are all either As or Es ("A" is assigned the smallest weight and "E" the largest in our system).

The result of the graph theoretic algorithm is shown in Figure 6. Compared with the K-means result, this algorithm mistakenly placed three more patients in the "sick" group, which suggests the MST-based graph theoretic algorithm is not very good for this application.

    Figure 6: Results of the MST-based graph theoretic algorithm, two clusters over the 14 survey vectors

The experiments on synthetic data test the scalability of the algorithms. K-means takes seconds to cluster 1000 vectors. With 300 vectors, the accuracy ranges from 0.71 to 1.00 depending on the initial settings; when the number of vectors increases to 1000, the accuracy is only slightly reduced, ranging from 0.68 to 1.00. This shows that the K-means algorithm, although simple, is both effective and efficient.

The agglomerative algorithms are much slower: a single link run on 1000 vectors took 50 minutes, and complete link and average link are known to have higher computational complexity (O(n^3) and O(n^2) respectively). The accuracy is always 1.00 across these algorithms and data sets. Although one reason is that the synthetic data are very clean, the result demonstrates the robustness of the agglomerative algorithms.

Our results for the graph theoretic algorithm are less encouraging, mainly because it runs too slowly. We tested it on a data set of 300 vectors, each with 400 features; after running for 8 hours, only about 250 edges of the minimal spanning tree had been constructed. We used Kruskal's algorithm to build the tree: it searches for cycles in the graph when adding edges, and as the tree grows the search becomes more and more time-consuming. In the end, we ran an experiment on 100 vectors with 400 features, and the accuracy was 0.62, which is quite low considering that the synthetic data are very clean.
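For reference, this is the standard shape of Kruskal's algorithm in Python (an illustrative sketch; unlike the implementation described above, it uses a union-find structure for the cycle test, which keeps that test nearly constant-time instead of growing with the tree):

    def kruskal_mst(n, edges):
        # edges: list of (weight, u, v) tuples over nodes 0..n-1.
        parent = list(range(n))

        def find(x):
            # Follow parent pointers to the root, compressing the path.
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x

        mst = []
        for w, u, v in sorted(edges):   # consider edges by increasing weight
            ru, rv = find(u), find(v)
            if ru != rv:                # edge joins two components: no cycle
                parent[ru] = rv
                mst.append((w, u, v))
                if len(mst) == n - 1:   # a spanning tree has n-1 edges
                    break
        return mst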
5. Discussion and Conclusion

The above results indicate that the K-means algorithm performs best, but this need not always be the case: K-means depends strongly on the initial choice of centers, and in our implementation, where the initial centers are chosen randomly, the results may differ across runs. The most remarkable advantage of K-means is its conceptual simplicity. Its time complexity is linear in the number of objects, so it is preferable when efficiency is a consideration or data sets are very large.

Among the three agglomerative algorithms, complete link performs best, which validates the comparison between single link and complete link in Section 3. It is unexpected that average link did not perform well; it seems that average link clustering tends to produce one large cluster and a few small clusters.

The result of the graph theoretic algorithm does not look good either. Possible reasons are:
1. Although Kruskal's algorithm is greedy, it does produce an exact minimal spanning tree; the tree itself, however, reflects only local pairwise distances, so removing edges from it may not yield a globally optimal partition.
2. The criterion for identifying inconsistent edges might not be the best for this case. The statistics are calculated over the directly connected neighbors of an edge; statistics over more neighbors, or even over all edges, should be tested in further study.

Theoretically, the graph theoretic algorithm should be efficient in high-dimensional cases, because it needs to compute similarities only once, whereas the agglomerative algorithms must update similarities after every merge. However, an efficient way of constructing the MST is needed.

6. Thoughts on Further Work

First of all, we need to experiment with more data: both more real patients' answers and more variations of synthetic data are needed.

The three methods studied here are all similarity-based and therefore sensitive to the choice of similarity measure. Moreover, they are all hard clustering methods, in which an object can belong to only one cluster. In the context of patient clustering, soft clustering is sometimes necessary; it is also very useful for determining the correlations between questions and different diseases. Model-based methods can accommodate various definitions of clusters and allocate objects according to complex probabilistic models, which inherently support soft clustering. We may therefore try model-based clustering methods such as the EM algorithm; a small illustrative sketch of EM-based soft clustering is given at the end of this section.

Another line of work is to treat cohort identification as a classification problem rather than a clustering problem. By training on real data, a proper set of weights over the features can be learned and applied to classify new patients into appropriate classes.
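To illustrate the kind of soft assignment the EM algorithm would provide, here is a minimal sketch of EM for a spherical Gaussian mixture (entirely illustrative; the function and variable names are our own, and a real patient model would need a distribution better suited to categorical answers):

    import numpy as np

    def em_soft_clustering(X, k, n_iter=50, seed=0):
        # Returns an n-by-k matrix of soft memberships (rows sum to 1).
        rng = np.random.default_rng(seed)
        n, d = X.shape
        means = X[rng.choice(n, size=k, replace=False)].astype(float)
        var = X.var() + 1e-6          # single shared spherical variance
        prior = np.full(k, 1.0 / k)
        for _ in range(n_iter):
            # E-step: responsibility of each cluster for each object.
            sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
            # Subtracting the row minimum inside exp() avoids underflow;
            # it cancels in the normalization below.
            resp = prior * np.exp(-0.5 * (sq - sq.min(axis=1, keepdims=True)) / var)
            resp /= resp.sum(axis=1, keepdims=True)
            # M-step: re-estimate priors, means, and the shared variance.
            nk = resp.sum(axis=0)
            prior = nk / n
            means = (resp.T @ X) / nk[:, None]
            var = (resp * sq).sum() / (n * d) + 1e-6
        return resp

An object whose memberships come out near 0.5/0.5 would be exactly the kind of borderline case that the account "mcheney1" represents in our data.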
References

[1] R. Berlin and B. Schatz. Internet Health Monitors for Outcomes of Chronic Illness. Medscape General Medicine (MedGenMed), Sept 2, 1999.
[2] X. Hu. An Online Health Questionnaire Monitor. Project report, LIS450FCC, GSLIS, University of Illinois at Urbana-Champaign, Dec 2003.
[3] A. Jain, M. Murty, and P. Flynn. Data Clustering: A Review. ACM Computing Surveys, 31(3):264-323, 1999.
[4] B. Everitt. Cluster Analysis. Wiley, New York, 1974.
[5] J. Shi and J. Malik. Normalized Cuts and Image Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 731-737, 1997.
[6] C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999.