Interactive Clustering for Labeling Groups in Phones

Figure 1: A visualization of how the clusters may change at each interaction of our algorithm. A) The initial unlabeled groups. B) The clusters after we query the user on their labels and re-cluster. C) How the clusters may change after querying the user on the most unlikely instance in a cluster and re-clustering.
Abstract
Groups of contacts created within mobile phone interfaces enable users to easily interact with a specific subset of
people from their social networks. However, manually organizing contacts may become difficult for users as their
contact lists grow. We introduce a method that attempts to determine a user’s mobile groups by interactively
clustering contacts from their cell phone. We show that by actively querying the user about likely mislabeled
contacts, and then re-clustering the contacts after a correct label is received, we reduce the number of contacts that
must be manually labeled.
1. Introduction
Organized groups within mobile interfaces give users a convenient mechanism for searching for contacts and
contacting multiple people in their phones. Unfortunately, the convenience of having access to these groups is
often overshadowed by the daunting task of manually organizing phone contacts. Additionally, the groups may not
be a full representation of the groups users have in their phone. For instance, users may not remember to place
contacts from their phone into certain groups. It also may be difficult for users to determine which groups they have
in their phone [9].
While methods exist for automatically grouping contacts within a user’s online social network [6,9], the user may be
tasked with removing any incorrectly placed contact. Additionally, these groups are not easily translated back to
mobile phones.
In our work, we attempt to automatically determine the social groups within a user’s mobile phone. We aim to
remove any difficulties that the user may face when determining which groups their contacts belong to. By
automatically estimating the number of groups and generating initial clusters, we remove the need for the user to
enumerate their groups up front. Still, we expect that the initial clusters
will have errors. However, rather than having the user look through the groups to determine which contacts were
incorrectly placed, we actively query them on likely mislabeled contacts. We simplify the task of grouping contacts
together by having the user give labels to people that are the most likely to belong to a cluster, and then assigning
that group label to the cluster.
In the remaining sections of this paper, we first describe the related work of determining social groups for users,
interactive clustering, and simplifying tasks for users in general. Then, we describe our approach in detail. After, we
introduce our experimental design, and the resulting analysis of our approach. We conclude with a discussion of our
methodologies and ideas for future work.
2. Related Work
Our algorithm uses methods from active learning and interactive clustering to determine the social groups of the
people in a user’s phone.
Much work has been done on attempting to automatically determine a user’s social groups.
In [6], the authors introduce a method that attempts to find a contact that the user may not remember. The system
queries the user on details that will help the algorithm find the contact. This method is not necessarily used for
grouping contacts together. Rather, it attempts to group people who may match the person that the user is searching
for. In [9], a method is introduced that uses information from emails to determine social topologies in a user’s social
network. This method generates groups and displays them in an interface where the user can create new groups or
modify existing ones. Errors generated by the algorithm can be manually removed by the user. However, obtaining
labels for groups may be expensive, depending on how many errors have occurred. For example, users may tire of
the task of manually finding misplaced contacts. To alleviate this issue, we use active learning [4] to
query the user on any mistakes we think were made by our algorithm. Furthermore, we use the answers to these
queries for clustering the data. A somewhat similar approach involving interactive clustering is described in [3]. In
this method, users are queried on what the best features are for text documents, as opposed to giving the labels for
them. Still, users may need to manually fix errors made by the algorithm.
In general, we attempt to simplify the task of manually grouping contacts for a user. In [7], a method for determining
tasks that a user wishes to accomplish is introduced. The authors mention that tasks vary for different users, and
thus, a user specific model is used. Similarly, we believe that although users may have similar groups, these groups
may have different meanings, and so we also design our algorithm to be specific to one user.
3. Approach
Our method uses text messages and phone calls that have been collected from cell phones. From this data, we create
feature vectors that represent each contact that a user has interacted with. The features for each contact are
represented by the time of day a call or text took place (morning, afternoon, night, late night), the duration of calls
or the length of messages, and whether the call or text was incoming, outgoing, or missed (for calls). We obtain these features
for each day of the week.
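The paper does not specify the exact encoding, so the following is a minimal sketch of how such per-contact vectors might be aggregated; the event tuple format, field names, and hour boundaries for the time-of-day buckets are our assumptions:

```python
from collections import defaultdict

# Assumed event format: (contact, day_of_week 0-6, hour 0-23,
# kind in {"call", "text"}, direction in {"in", "out", "missed"}, amount),
# where amount is a call duration in seconds or a message length in characters.

def time_bin(hour):
    """Map an hour of day to one of the four coarse time-of-day buckets."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "night"
    return "late_night"

def build_feature_vectors(events):
    """Aggregate per-contact totals keyed by (day, time bin, kind, direction)."""
    vectors = defaultdict(lambda: defaultdict(float))
    for contact, day, hour, kind, direction, amount in events:
        key = (day, time_bin(hour), kind, direction)
        vectors[contact][key] += amount
    return {c: dict(v) for c, v in vectors.items()}

events = [
    ("Mom", 0, 9, "call", "in", 120.0),
    ("Mom", 0, 10, "call", "in", 60.0),
    ("Boss", 2, 14, "text", "out", 40.0),
]
vecs = build_feature_vectors(events)
```

In practice the sparse dictionaries would be flattened into fixed-length vectors (one slot per day/bin/kind/direction combination) before clustering.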
Once we retrieve this data, we load it into our interface so that it can be used for interactive clustering.
Interactive Clustering
The first part of our algorithm aims to determine how many groups a user has in their phone. We use the X-Means
algorithm [5] to approximate the number of clusters that are in the data.
Once we obtain these initial clusters, we determine their labels. We first retrieve the contact with the smallest
Euclidean distance to each cluster center. Then we query the user on their relationship with each of these contacts
and assign the given label to the corresponding cluster.
We denote the clusters containing labeled contacts as C_true. Initially, C_true will have N clusters, each containing
the single contact that minimized the Euclidean distance to its cluster center. It is possible for two initial contacts to
have the same relationship, yielding two clusters with the same label. When this is the case, the clusters are joined into one group.
After we get the initial group labels, we re-cluster all of the contacts.
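The same-label merge described above can be sketched as follows; the function name and the (label, members) pair representation are ours:

```python
def merge_same_label(labeled_clusters):
    """Join initial clusters whose queried contacts gave the same relationship label.

    labeled_clusters is a list of (label, members) pairs, one per initial cluster.
    """
    merged = {}
    for label, members in labeled_clusters:
        merged.setdefault(label, []).extend(members)
    return merged

groups = merge_same_label([
    ("Family", ["Mom"]),
    ("Work", ["Boss"]),
    ("Family", ["Dad"]),  # same relationship as the first cluster, so joined
])
```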
Hierarchical Clustering with Constraints
We must make sure that the instances in C_true are never separated when we re-cluster. Additionally, we must make
sure that contacts within different groups in C_true are never clustered together. We thus use an approach similar to
the one described in [1, 10] to ensure that these constraints always hold. This algorithm uses
Agglomerative Hierarchical Clustering with constraints to make sure instances that must-link are always placed
together during clustering and instances that cannot-link are never placed together. We generate must-link
constraints between any two contacts that have the same relationship with the user. cannot-link constraints are
generated between two instances that have different relationships with the user. For example, a user may have a
group called “Family” that contains the contacts “Mom” and “Dad,” and a group called “Work” that contains their
“Boss.” The contacts “Mom” and “Dad” will have a must-link constraint because they both have the same
relationship, i.e., “Family,” with the user. A cannot-link constraint will be created between “Mom” and “Boss” and
“Dad” and “Boss” because they have different relationships.
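The constraint generation described above can be sketched as a simple pass over the labeled contacts; the function name and dict representation are ours:

```python
from itertools import combinations

def build_constraints(labels):
    """Derive must-link pairs (same relationship) and cannot-link pairs
    (different relationships) from a dict {contact: relationship}."""
    must, cannot = set(), set()
    for a, b in combinations(sorted(labels), 2):
        pair = frozenset((a, b))
        (must if labels[a] == labels[b] else cannot).add(pair)
    return must, cannot

must, cannot = build_constraints({"Mom": "Family", "Dad": "Family", "Boss": "Work"})
```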
Agglomerative Hierarchical Clustering begins with N clusters, where N is equal to the number of instances. Each
cluster is initialized to a single instance. The algorithm iteratively merges the least dissimilar clusters until only one
cluster remains [10].
In our method, we set our initial clusters to every cluster in C_true and all of the remaining unlabeled contacts.
Utilizing C_true in the initialization satisfies the must-link constraints because all instances that must-link are placed
into a cluster in C_true. We initialize the remaining clusters by placing each unlabeled contact into its own cluster.
At each iteration of the algorithm, we attempt to merge the two least dissimilar clusters. No contact in a merging
cluster may violate a cannot-link constraint [10]. The authors in [10] note that it is possible to reach a point where
none of the clusters can be merged because each cluster contains an instance that cannot-link with an instance in
another cluster. We attempt to reduce this issue by first finding the least
dissimilar cluster as normal. Then, we find all of the instances that are causing the cannot-link constraint to be
violated. We remove these instances and place them into a new cluster. Once these instances are removed, we are
able to merge the two initial clusters because they no longer violate any constraints.
For example, suppose we have the following cannot-link constraints:
a cannot-link with b
a cannot-link with c
a cannot-link with d
and the current iteration in hierarchical clustering has the following clusters
Cluster 1: a, e, f, g; Cluster 2: b, c, d
We will be unable to join these clusters because an instance in Cluster 1, a, cannot link with any of the instances in
Cluster 2. Therefore, we remove a from Cluster 1 and place it in a new cluster, Cluster 3, and merge clusters 1 and
2:
Cluster 1_2: b, c, d, e, f, g; Cluster 3: a
We continue to merge the clusters until we reach k clusters, where k is equal to the number of unique labels that we
have received from the user. Then we return C_all, which are the clusters containing all of the instances.
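The constraint-repair step from the worked example above can be sketched as follows; the selection of which pair of clusters to merge and the dissimilarity computation are omitted, and all names are ours:

```python
def violating_instances(c1, c2, cannot):
    """Members of c1 that have a cannot-link constraint with some member of c2."""
    return {a for a in c1 for b in c2 if frozenset((a, b)) in cannot}

def merge_with_repair(c1, c2, cannot):
    """Merge two clusters, splitting any offending instances of c1 off into a
    new cluster so the merge no longer violates the cannot-link constraints."""
    bad = violating_instances(c1, c2, cannot)
    merged = sorted((set(c1) - bad) | set(c2))
    return merged, sorted(bad)

# The worked example: a cannot-link with b, c, and d.
cannot = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("a", "d")]}
merged, repaired = merge_with_repair(["a", "e", "f", "g"], ["b", "c", "d"], cannot)
```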
Active Learning
It is likely that some contacts will be incorrectly grouped together during the clustering phase. Instead of having the
user find these errors on their own, we use active learning [4] to query them about the contacts that we are the most
uncertain about. In active learning, there is often a large pool of data that is unlabeled, and a small number of
instances that are labeled. Membership queries request a label for one of the unlabeled instances. However, one must
determine which instance to query on. We do not want to query about contacts that are in the correct cluster, as that
would be a waste of resources and we are attempting to minimize the number of queries that are asked. Therefore,
we query on the contact that we suspect will reduce the error of the clusters the most if it were relabeled. We believe
these contacts are those that are located between clusters, as they are the instances that are the most likely to be
mislabeled.
The Silhouette Coefficient [2] is a metric used for determining how well an instance is clustered, i.e., whether it was
clustered well or lies between two clusters. The silhouette of an instance i, which has been placed in cluster A from
C_all, is defined as:
s(i) = (b(i) − a(i)) / max{a(i), b(i)},
where a(i) is the average dissimilarity of i to all of the other instances in cluster A. The term b(i) is defined as
min_C d(i, C), where d(i, C) is the average dissimilarity of i to all of the instances in cluster C, for any C ≠ A.
The values of s(i) lie between -1 and 1 inclusive. The intuition is that if i is mislabeled, then a(i) will be somewhat
large. If a(i) is larger than b(i), the silhouette is negative, meaning that the neighboring cluster C attaining b(i)
could be a better fit for i. If a(i) is less than b(i), the silhouette is positive, meaning that A is probably a good fit for i.
Values near 1 represent a well clustered instance and values near -1 represent a poorly clustered instance. When s(i)
is close to 0, the silhouette is considered neutral. Therefore, when A contains only one instance, the contact is
assigned a silhouette of 0 because there are no other instances in A to compare i to. Now that we have
introduced this metric, we can formalize our method for determining which instances to query on:
We first find the instance that minimizes s(i) in each cluster. Then, we pick the instance with the smallest
silhouette overall to query on. Once we obtain this contact, we determine their label:
Contact_label = Answer(“What is your relationship with contact i?”)
This query gives us insight into which group this contact belongs to and whether or not there is a new cluster that we
haven't considered. When we ask the user what their relationship with this contact i is, if the named relationship,
Contact_label, does not match A's label, then the contact was mislabeled. We can easily determine which cluster the
contact does belong to by simply assigning it to the cluster whose label matches Contact_label. Furthermore, if there
are no clusters with this name, then we know that there is a new cluster. If this is the case, we add a new cluster,
c_new, to C_true, which contains instance i and has the label Contact_label.
Then, we re-cluster with k equal to the number of unique relationships we have received. We repeat the entire
process of clustering and querying the user about contacts until a preset number of the user's contacts have been labeled.
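Under the definitions above, the query-selection step can be sketched as follows; Euclidean dissimilarity is assumed, and the function names and data layout are ours:

```python
def dissimilarity(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def silhouette(i, clusters, points):
    """s(i) = (b - a) / max(a, b), per Rousseeuw [2]; 0 for singleton clusters."""
    home = next(c for c in clusters if i in c)
    if len(home) == 1:
        return 0.0
    a = sum(dissimilarity(points[i], points[j]) for j in home if j != i) / (len(home) - 1)
    b = min(
        sum(dissimilarity(points[i], points[j]) for j in c) / len(c)
        for c in clusters if i not in c
    )
    return (b - a) / max(a, b)

def next_query(clusters, points):
    """Pick the contact with the smallest silhouette across all clusters."""
    return min(
        (i for c in clusters for i in c),
        key=lambda i: silhouette(i, clusters, points),
    )

points = {"p1": (0, 0), "p2": (0, 1), "x": (5, 5), "q1": (5, 4), "q2": (5, 6)}
clusters = [["p1", "p2", "x"], ["q1", "q2"]]  # "x" sits far from its own cluster
```

Here "x" receives a negative silhouette because it lies much closer to the second cluster, so it would be the next contact queried.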
Interface Design
To ensure that users do not forget the relationships, i.e., groups, that they have created, we populate a list consisting
of each of the labels that they have already given.
Once users create all of their groups, we allow them to change a group name and add and remove contacts to the
groups through our interface.
4. Experimental Design
Study 1: We tested our algorithm with text message and call data received from the phones of 10 participants. We
used Funf [11] to collect the data from the participants for 2 weeks. We asked the participants to give us the ground
truth for all of the contacts that were in their data logs. The true labels were placed into an artificial oracle. We ran
simulated experiments for each participant, where the oracle gave the true response to each query.
We chose to use an oracle for the experiments so that we could automatically determine the best parameters for our
algorithms.
Study 2: We also had a separate study with 8 participants from Samsung in Korea. We were interested in seeing if
the features that we chose for our algorithm would work well across different cultures. Data was collected from
these participants over a period of one month. Some of the features used in these experiments are different (we
remove the outgoing text messages feature and add a rejected call feature), although we use the same simulated
experiments.
We should note that to fully observe how effective our algorithm was, it was important to be able to use the best
features. We had 133 features total. A subset of them could likely represent each user’s set of contacts well.
However, we would face the curse of dimensionality if we used all of these features in our algorithm. Therefore, we
used supervised feature selection to determine the subset of features that our algorithm performed best with. In
particular, we used a Ranker Algorithm that chose the top n features that gave the most information gain. We used
the features that the algorithm gave for n = 1, …, 133, and determined which subset gave the best performance for
each user. However, this step was only used for preprocessing the data. Our algorithm only uses the labels that it
acquires after querying the user about a contact.
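The ranking step might look like the following, assuming discretized feature values; the paper does not specify the ranker's implementation, so the function names and the entropy-based formulation are our assumptions:

```python
from collections import defaultdict, Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    by_value = defaultdict(list)
    for v, y in zip(feature_values, labels):
        by_value[v].append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional

def rank_features(X, y):
    """Feature indices sorted by decreasing information gain (rows of X are instances)."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: -gains[j])

X = [[0, 1], [0, 1], [1, 1], [1, 1]]   # feature 0 predicts y; feature 1 is constant
y = ["Family", "Family", "Work", "Work"]
```

Taking the top n features for n = 1, ..., 133 and evaluating each subset reproduces the sweep described above.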
In our experiments, we calculated the error, or number of incorrectly grouped instances, after each query. This
calculation is not straightforward. For instance, suppose the user gave us the ground truth of groups A and B:
TrueCluster A: Contact 1, Contact 2, Contact 3
TrueCluster B: Contact 4, Contact 5
Let’s say that the current iteration of the algorithm has the following groups:
A: Contact 1, Contact 4, Contact 5
?: Contact 2, Contact 3
We have received the label A for a cluster. However, the unknown cluster “?” is a better representation of A because
it contains more of A's contacts. Furthermore, the cluster labeled A actually contains all of the contacts from B, so
those contacts should not be penalized, since they were correctly grouped with each other.
Therefore, we define the error over all of the clusters we have to be:
error = 1 − (Σ_C max_TrueCluster score(C, TrueCluster)) / N,
where score(C, TrueCluster) is the number of instances in C that also belong to TrueCluster, and N is the total
number of contacts. So in our example, Cluster
‘A’ would get a score of 2 because it has 2 contacts from the TrueCluster B, as opposed to 1 contact from A. Cluster
‘?’ would receive a score of 2 because it has 2 contacts from TrueCluster A, as opposed to 0 contacts from
TrueCluster B. Thus, the error will be 1 − (2 + 2)/5 = .2, so 20% of the contacts are incorrectly grouped. This makes
sense, because Contact 1 should have been grouped with Contacts 2 and 3.
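The error computation above can be sketched directly from its definition, using the worked example; the function name is ours:

```python
def clustering_error(clusters, true_clusters):
    """1 minus the sum, over predicted clusters, of each cluster's best overlap
    with any true cluster, divided by the total number of contacts."""
    n = sum(len(c) for c in clusters)
    total = sum(max(len(set(c) & set(t)) for t in true_clusters) for c in clusters)
    return 1 - total / n

true_clusters = [["Contact 1", "Contact 2", "Contact 3"], ["Contact 4", "Contact 5"]]
clusters = [["Contact 1", "Contact 4", "Contact 5"], ["Contact 2", "Contact 3"]]
```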
5. Results
Figure 2: Results from Both Studies
This figure shows the average error that occurred after a number of queries were asked in Study 1 and Study 2.
The average only takes into account the participants who were asked each number of queries. For example, if a
participant was only queried about 4 contacts, then they are not used in the average for 10 queries. We also show the
results from querying about a random instance at each iteration.
Figure 3: Results from Study 1
Figure 4: Results from Study 2
Figure 3 shows how much the error was reduced after 0, 10, 20, 30, and 40 queries, for each participant in Study 1,
as a function of the total number of contacts in their phone. Figure 4 shows the same results for Study 2, with an
additional plot for 50 queries.
6. Discussion
In Figure 2, we see that the average error decreases as more queries are asked. On average, our algorithm also
outperforms a baseline that asks the user to label a randomly chosen instance at each iteration, suggesting that it is
more effective than labeling instances one at a time. We also see that the results from Study 1 and Study 2 are
similar, suggesting that our algorithm performs comparably across the two cultures studied.
Figure 3 and Figure 4 show that the error of the clusters is reduced as we increase the number of questions that are
asked. Furthermore, we see that it only takes about 10 queries to reduce the error of the clusters to around .5,
regardless of how many contacts there are.
7. Conclusion and Future Work
We have shown that our algorithm is able to effectively reduce the number of contacts that a user needs to manually
label in their phone. Our interface allows users to save their groups into a format that can be placed into their
phones.
In the future, we would like to see if the groups that the user created can be used for predicting the groups of new
contacts in their phones. We are also interested in seeing if similar groups that have been created by different users can be
combined and used for predicting the label of unlabeled instances for any given user. For instance, if two users, 1
and 2, both have a family group and a friend group, then we could combine these groups, and use them for
predicting whether, say, user 3’s contacts belong to one of these groups.
We also note that our approach currently only allows one group per contact. However, it is likely that some contacts
belong to multiple groups (for example, roommate and friend). To allow contacts to be placed into
multiple groups, we would need to use a clustering algorithm that allows instances to be placed in more than one
cluster.
Finally, because our interface was created to be used for creating groups in mobile phones, a reasonable next step
would be to place our interface into phones.
References
[1] Wagstaff, Kiri, et al. “Constrained K-means Clustering with Background Knowledge.”
[2] Rousseeuw, Peter. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.”
[3] Bekkerman, Ron, et al. “Interactive Clustering of Text Collections According to a User-Specified Criterion.”
[4] Settles, Burr. “Active Learning Literature Survey.”
[5] Pelleg, Dan, et al. “X-means: Extending K-means with Efficient Estimation of the Number of Clusters.”
[6] Zhou, Michelle, et al. “Finding Someone in My Social Directory Whom I Do Not Fully Remember or Barely Know.”
[7] Isbell, Charles, et al. “From Devices to Tasks: Automatic Task Prediction for Personalized Appliance Control.”
[8] Hurst, Amy, et al. “Automatically Identifying Targets Users Interact with During Real World Tasks.”
[9] MacLean, Diana, et al. “Groups without Tears: Mining Social Topologies from Email.”
[10] Davidson, Ian, et al. “Agglomerative Hierarchical Clustering with Constraints.”
[11] Funf Open Sensing Framework. http://funf.org/