Interactive Clustering for Labeling Groups in Phones

Figure 1: A visualization of how the clusters may change at each interaction of our algorithm. A) The initial unlabeled groups. B) The clusters after we query the user on their labels and re-cluster. C) How the clusters may change after querying the user on the most unlikely instance in a cluster and re-clustering.
Abstract
Groups of contacts created within mobile phone interfaces enable users to easily interact with a specific subset of
people from their social networks. However, manually organizing contacts may become difficult for users as their
contact lists grow. We introduce a method that attempts to determine a user’s mobile groups by interactively
clustering contacts from their cell phone. We show that by actively querying the user about likely mislabeled
contacts, and then re-clustering the contacts after a correct label is received, we reduce the number of contacts that
must be manually labeled.
1. Introduction
Organized groups within mobile interfaces give users a convenient mechanism for searching for contacts and
contacting multiple people in their phones. Unfortunately, the convenience of having access to these groups is
often overshadowed by the daunting task of manually organizing phone contacts. Additionally, the groups may not
be a full representation of the groups users have in their phone. For instance, users may not remember to place
contacts from their phone into certain groups. It also may be difficult for users to determine which groups they have
in their phone [9].
While methods exist for automatically grouping contacts within a user’s online social network [6,9], the user may be
tasked with removing any incorrectly placed contact. Additionally, these groups are not easily translated back to
mobile phones.
In our work, we attempt to automatically determine the social groups within a user’s mobile phone. We aim to
remove any difficulties that the user may face when determining which groups their contacts belong to. By
automatically estimating the number of groups and generating initial clusters, we remove the need for the user to
enumerate their groups up front. Still, we expect that the initial clusters
will have errors. However, rather than having the user look through the groups to determine which contacts were
incorrectly placed, we actively query them on likely mislabeled contacts. We simplify the task of grouping contacts
together by having the user give labels to people that are the most likely to belong to a cluster, and then assigning
that group label to the cluster.
In the remaining sections of this paper, we first describe the related work of determining social groups for users,
interactive clustering, and simplifying tasks for users in general. Then, we describe our approach in detail. After, we
introduce our experimental design, and the resulting analysis of our approach. We conclude with a discussion of our
methodologies and ideas for future work.
2. Related Work
Our algorithm uses methods from active learning and interactive clustering to determine the social groups of the
people in a user’s phone.
Much work has been done on attempting to automatically determine a user’s social groups.
In [6], the authors introduce a method that attempts to find a contact that the user may not remember. The system
queries the user on details that will help the algorithm find the contact. This method is not necessarily used for
grouping contacts together. Rather, it attempts to group people who may match the person that the user is searching
for. In [9], a method is introduced that uses information from emails to determine social topologies in a user’s social
network. This method generates groups and displays them in an interface where the user can create new groups or
modify existing ones. Errors generated by the algorithm can be manually removed by the user. However, obtaining
labels for groups may be expensive, depending on how many errors have occurred. For example, users may tire of
the task of manually finding misplaced contacts. To alleviate this issue, we use active learning [4] to
query the user on any mistakes we think were made by our algorithm. Furthermore, we use the answers to these
queries for clustering the data. A somewhat similar approach involving interactive clustering is described in [3]. In
this method, users are queried on what the best features are for text documents, as opposed to giving the labels for
them. Still, users may need to manually fix errors made by the algorithm.
In general, we attempt to simplify the task of manually grouping contacts for a user. In [7], a method for determining
tasks that a user wishes to accomplish is introduced. The authors mention that tasks vary for different users, and
thus, a user specific model is used. Similarly, we believe that although users may have similar groups, these groups
may have different meanings, and so we also design our algorithm to be specific to one user.
3. Approach
Our method uses text messages and phone calls that have been collected from cell phones. From this data, we create
feature vectors that represent each contact that a user has interacted with. The features for each contact are
represented by the time of day a call or text took place (morning, afternoon, night, late night), the duration of calls
or the length of messages, and whether the call or text was incoming, outgoing, or missed (for calls). We obtain these features
for each day of the week.
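The paper does not specify the exact encoding, so the following is a minimal sketch of how such per-contact vectors might be aggregated; the event tuple format, field names, and hour boundaries for the time-of-day buckets are our assumptions:

```python
from collections import defaultdict

# Assumed event format: (contact, day_of_week 0-6, hour 0-23,
# kind in {"call", "text"}, direction in {"in", "out", "missed"}, amount),
# where amount is a call duration in seconds or a message length in characters.

def time_bin(hour):
    """Map an hour of day to one of the four coarse time-of-day buckets."""
    if 6 <= hour < 12:
        return "morning"
    if 12 <= hour < 18:
        return "afternoon"
    if 18 <= hour < 24:
        return "night"
    return "late_night"

def build_feature_vectors(events):
    """Aggregate per-contact totals keyed by (day, time bin, kind, direction)."""
    vectors = defaultdict(lambda: defaultdict(float))
    for contact, day, hour, kind, direction, amount in events:
        key = (day, time_bin(hour), kind, direction)
        vectors[contact][key] += amount
    return {c: dict(v) for c, v in vectors.items()}

events = [
    ("Mom", 0, 9, "call", "in", 120.0),
    ("Mom", 0, 10, "call", "in", 60.0),
    ("Boss", 2, 14, "text", "out", 40.0),
]
vecs = build_feature_vectors(events)
```

In practice the sparse dictionaries would be flattened into fixed-length vectors (one slot per day/bin/kind/direction combination) before clustering.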
Once we retrieve this data, we load it into our interface so that it can be used for interactive clustering.
Interactive Clustering
The first part of our algorithm aims to determine how many groups a user has in their phone. We use the X-Means
algorithm [5] to approximate the number of clusters that are in the data.
Once we obtain these initial clusters, we determine their labels. We first retrieve the contact with the smallest
Euclidean distance to each cluster center. Then we query the user on their relationship with each of these contacts
and assign the given label to the corresponding cluster.
We denote the clusters containing labeled contacts as C_true. Initially, C_true will have N clusters, each containing
the single contact that minimized the Euclidean distance to its cluster center. It is possible for two initial contacts to
have the same relationship, yielding two clusters with the same label. When this is the case, the clusters are joined into one group.
After we get the initial group labels, we re-cluster all of the contacts.
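The same-label merge described above can be sketched as follows; the function name and the (label, members) pair representation are ours:

```python
def merge_same_label(labeled_clusters):
    """Join initial clusters whose queried contacts gave the same relationship label.

    labeled_clusters is a list of (label, members) pairs, one per initial cluster.
    """
    merged = {}
    for label, members in labeled_clusters:
        merged.setdefault(label, []).extend(members)
    return merged

groups = merge_same_label([
    ("Family", ["Mom"]),
    ("Work", ["Boss"]),
    ("Family", ["Dad"]),  # same relationship as the first cluster, so joined
])
```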
Hierarchical Clustering with Constraints
We must make sure that the instances in C_true are never separated when we re-cluster. Additionally, we must make
sure that contacts within different groups in C_true are never clustered together. We thus use an approach similar to
the one described in [1, 10] to ensure that these constraints always hold. This algorithm uses
Agglomerative Hierarchical Clustering with constraints to make sure instances that must-link are always placed
together during clustering and instances that cannot-link are never placed together. We generate must-link
constraints between any two contacts that have the same relationship with the user. cannot-link constraints are
generated between two instances that have different relationships with the user. For example, a user may have a
group called “Family” that contains the contacts “Mom” and “Dad,” and a group called “Work” that contains their
“Boss.” The contacts “Mom” and “Dad” will have a must-link constraint because they both have the same
relationship, i.e., “Family,” with the user. A cannot-link constraint will be created between “Mom” and “Boss” and
“Dad” and “Boss” because they have different relationships.
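The constraint generation described above can be sketched as a simple pass over the labeled contacts; the function name and dict representation are ours:

```python
from itertools import combinations

def build_constraints(labels):
    """Derive must-link pairs (same relationship) and cannot-link pairs
    (different relationships) from a dict {contact: relationship}."""
    must, cannot = set(), set()
    for a, b in combinations(sorted(labels), 2):
        pair = frozenset((a, b))
        (must if labels[a] == labels[b] else cannot).add(pair)
    return must, cannot

must, cannot = build_constraints({"Mom": "Family", "Dad": "Family", "Boss": "Work"})
```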
Agglomerative Hierarchical Clustering begins with N clusters, where N is equal to the number of instances. Each
cluster is initialized to a single instance. The algorithm iteratively merges the least dissimilar clusters until only one
cluster remains [10].
In our method, we set our initial clusters to every cluster in C_true and all of the remaining unlabeled contacts.
Utilizing C_true in the initialization satisfies the must-link constraints because all instances that must-link are placed
into a cluster in C_true. We initialize the remaining clusters by placing each unlabeled contact into its own cluster.
At each iteration of the algorithm, we attempt to merge the two least dissimilar clusters. No contact in a merging
cluster may violate a cannot-link constraint [10]. The authors in [10] note that it is possible to reach a point where
none of the clusters can be merged because each cluster contains an instance that cannot-link with an instance in
another cluster. We attempt to reduce this issue by first finding the least
dissimilar cluster as normal. Then, we find all of the instances that are causing the cannot-link constraint to be
violated. We remove these instances and place them into a new cluster. Once these instances are removed, we are
able to merge the two initial clusters because they no longer violate any constraints.
For example, suppose we have the following cannot-link constraints:
a cannot-link with b
a cannot-link with c
a cannot-link with d
and the current iteration in hierarchical clustering has the following clusters
Cluster 1: a, e, f, g; Cluster 2: b, c, d
We will be unable to join these clusters because an instance in Cluster 1, a, cannot link with any of the instances in
Cluster 2. Therefore, we remove a from Cluster 1 and place it in a new cluster, Cluster 3, and merge clusters 1 and
2:
Cluster 1_2: b, c, d, e, f, g; Cluster 3: a
We continue to merge the clusters until we reach k clusters, where k is equal to the number of unique labels that we
have received from the user. Then we return C_all, which are the clusters containing all of the instances.
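The constraint-repair step from the worked example above can be sketched as follows; the selection of which pair of clusters to merge and the dissimilarity computation are omitted, and all names are ours:

```python
def violating_instances(c1, c2, cannot):
    """Members of c1 that have a cannot-link constraint with some member of c2."""
    return {a for a in c1 for b in c2 if frozenset((a, b)) in cannot}

def merge_with_repair(c1, c2, cannot):
    """Merge two clusters, splitting any offending instances of c1 off into a
    new cluster so the merge no longer violates the cannot-link constraints."""
    bad = violating_instances(c1, c2, cannot)
    merged = sorted((set(c1) - bad) | set(c2))
    return merged, sorted(bad)

# The worked example: a cannot-link with b, c, and d.
cannot = {frozenset(p) for p in [("a", "b"), ("a", "c"), ("a", "d")]}
merged, repaired = merge_with_repair(["a", "e", "f", "g"], ["b", "c", "d"], cannot)
```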
Active Learning
It is likely that some contacts will be incorrectly grouped together during the clustering phase. Instead of having the
user find these errors on their own, we use active learning [4] to query them about the contacts that we are the most
uncertain about. In active learning, there is often a large pool of data that is unlabeled, and a small number of
instances that are labeled. Membership queries request a label for one of the unlabeled instances. However, one must
determine which instance to query on. We do not want to query about contacts that are in the correct cluster, as that
would be a waste of resources and we are attempting to minimize the number of queries that are asked. Therefore,
we query on the contact that we suspect will reduce the error of the clusters the most if it were relabeled. We believe
these contacts are those that are located between clusters, as they are the instances that are the most likely to be
mislabeled.
The Silhouette Coefficient [2] is a metric used for determining how well an instance is clustered, i.e., whether it was
clustered well or lies between two clusters. The silhouette of an instance i, which has been placed in cluster A from
C_all, is defined as:
s(i) = (b(i) − a(i)) / max{a(i), b(i)},
where a(i) is the average dissimilarity of i to all of the other instances in cluster A. The term b(i) is defined as
min_C d(i, C), where d(i, C) is the average dissimilarity of i to all of the instances in cluster C, for any C ≠ A.
The values of s(i) lie between -1 and 1 inclusive. The intuition is that if i is mislabeled, then a(i) will be somewhat
large. If a(i) is larger than b(i), the silhouette is negative, meaning that the neighboring cluster C attaining b(i)
could be a better fit for i. If a(i) is less than b(i), the silhouette is positive, meaning that A is probably a good fit for i.
Values near 1 represent a well clustered instance and values near -1 represent a poorly clustered instance. When s(i)
is close to 0, the silhouette is considered neutral. Therefore, when A contains only one instance, the contact is
assigned a silhouette of 0 because there are no other instances in A to compare i to. Now that we have
introduced this metric, we can formalize our method for determining which instances to query on:
We first find the instance that minimizes s(i) in each cluster. Then, we pick the instance with the smallest
silhouette overall to query on. Once we obtain this contact, we determine their label:
Contact_label = Answer(“What is your relationship with contact i?”)
This query gives us insight into which group this contact belongs to and whether or not there is a new cluster that we
haven't considered. When we ask the user what their relationship with this contact i is, if the named relationship,
Contact_label, does not match A's label, then the contact was mislabeled. We can easily determine which cluster the
contact does belong to by simply assigning it to the cluster whose label matches Contact_label. Furthermore, if there
are no clusters with this name, then we know that there is a new cluster. If this is the case, we add a new cluster,
c_new, to C_true, which contains instance i and has the label Contact_label.
Then, we re-cluster with k equal to the number of unique relationships we have received. We repeat the entire
process of clustering and querying the user about contacts until a preset number of the user's contacts have been labeled.
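Under the definitions above, the query-selection step can be sketched as follows; Euclidean dissimilarity is assumed, and the function names and data layout are ours:

```python
def dissimilarity(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def silhouette(i, clusters, points):
    """s(i) = (b - a) / max(a, b), per Rousseeuw [2]; 0 for singleton clusters."""
    home = next(c for c in clusters if i in c)
    if len(home) == 1:
        return 0.0
    a = sum(dissimilarity(points[i], points[j]) for j in home if j != i) / (len(home) - 1)
    b = min(
        sum(dissimilarity(points[i], points[j]) for j in c) / len(c)
        for c in clusters if i not in c
    )
    return (b - a) / max(a, b)

def next_query(clusters, points):
    """Pick the contact with the smallest silhouette across all clusters."""
    return min(
        (i for c in clusters for i in c),
        key=lambda i: silhouette(i, clusters, points),
    )

points = {"p1": (0, 0), "p2": (0, 1), "x": (5, 5), "q1": (5, 4), "q2": (5, 6)}
clusters = [["p1", "p2", "x"], ["q1", "q2"]]  # "x" sits far from its own cluster
```

Here "x" receives a negative silhouette because it lies much closer to the second cluster, so it would be the next contact queried.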
Interface Design
To ensure that users do not forget the relationships, i.e., groups, that they have created, we populate a list consisting
of each of the labels that they have already given.
Once users create all of their groups, we allow them to change a group name and add and remove contacts to the
groups through our interface.
4. Experimental Design
Study 1: We tested our algorithm with text message and call data received from the phones of 10 participants. We
used Funf [11] to collect the data from the participants for 2 weeks. We asked the participants to give us the ground
truth for all of the contacts that were in their data logs. The true labels were placed into an artificial oracle. We ran
simulated experiments for each participant, where the oracle gave the true response to each query.
We chose to use an oracle for the experiments so that we could automatically determine the best parameters for our
algorithms.
Study 2: We also had a separate study with 8 participants from Samsung in Korea. We were interested in seeing if
the features that we chose for our algorithm would work well across different cultures. Data was collected from
these participants over a period of one month. Some of the features used in these experiments are different (we
remove the outgoing text messages feature and add a rejected call feature), although we use the same simulated
experiments.
We should note that to fully observe how effective our algorithm was, it was important to be able to use the best
features. We had 133 features total. A subset of them could likely represent each user’s set of contacts well.
However, we would face the curse of dimensionality if we used all of these features in our algorithm. Therefore, we
used supervised feature selection to determine the subset of features that our algorithm performed best with. In
particular, we used a Ranker Algorithm that chose the top n features that gave the most information gain. We used
the features that the algorithm gave for n = 1, …, 133, and determined which subset gave the best performance for
each user. However, this step was only used for preprocessing the data. Our algorithm only uses the labels that it
acquires after querying the user about a contact.
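The ranking step might look like the following, assuming discretized feature values; the paper does not specify the ranker's implementation, so the function names and the entropy-based formulation are our assumptions:

```python
from collections import defaultdict, Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(labels) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    by_value = defaultdict(list)
    for v, y in zip(feature_values, labels):
        by_value[v].append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - conditional

def rank_features(X, y):
    """Feature indices sorted by decreasing information gain (rows of X are instances)."""
    gains = [information_gain([row[j] for row in X], y) for j in range(len(X[0]))]
    return sorted(range(len(gains)), key=lambda j: -gains[j])

X = [[0, 1], [0, 1], [1, 1], [1, 1]]   # feature 0 predicts y; feature 1 is constant
y = ["Family", "Family", "Work", "Work"]
```

Taking the top n features for n = 1, ..., 133 and evaluating each subset reproduces the sweep described above.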
In our experiments, we calculated the error, or number of incorrectly grouped instances, after each query. This
calculation is not straightforward. For instance, suppose the user gave us the ground truth of groups A and B:
TrueCluster A: Contact 1, Contact 2, Contact 3
TrueCluster B: Contact 4, Contact 5
Let’s say that the current iteration of the algorithm has the following groups:
A: Contact 1, Contact 4, Contact 5
?: Contact 2, Contact 3
We have received the label A for a cluster. However, the unknown cluster “?” is a better representation of A because
it contains more of A's contacts. Furthermore, the cluster labeled A actually contains all of the contacts from B, so
those contacts should not be penalized, since they were correctly grouped with each other.
Therefore, we define the error over all of the clusters we have to be:
error = 1 − (Σ_C max_TrueCluster score(C, TrueCluster)) / N,
where score(C, TrueCluster) is the number of instances in C that also belong to TrueCluster, and N is the total
number of contacts. So in our example, Cluster
‘A’ would get a score of 2 because it has 2 contacts from the TrueCluster B, as opposed to 1 contact from A. Cluster
‘?’ would receive a score of 2 because it has 2 contacts from TrueCluster A, as opposed to 0 contacts from
TrueCluster B. Thus, the error will be 1 − (2 + 2)/5 = .2, so 20% of the contacts are incorrectly grouped. This makes
sense, because Contact 1 should have been grouped with Contacts 2 and 3.
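The error computation above can be sketched directly from its definition, using the worked example; the function name is ours:

```python
def clustering_error(clusters, true_clusters):
    """1 minus the sum, over predicted clusters, of each cluster's best overlap
    with any true cluster, divided by the total number of contacts."""
    n = sum(len(c) for c in clusters)
    total = sum(max(len(set(c) & set(t)) for t in true_clusters) for c in clusters)
    return 1 - total / n

true_clusters = [["Contact 1", "Contact 2", "Contact 3"], ["Contact 4", "Contact 5"]]
clusters = [["Contact 1", "Contact 4", "Contact 5"], ["Contact 2", "Contact 3"]]
```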
5. Results
Figure 2: Results from Both Studies
This figure shows the average error that occurred after a number of queries were asked in Study 1 and Study 2.
The average only takes into account the participants who were asked each number of queries. For example, if a
participant was only queried about 4 contacts, then they are not used in the average for 10 queries. We also show the
results from querying about a random instance at each iteration.
Figure 3: Results from Study 1
Figure 4: Results from Study 2
Figure 3 shows how much the error was reduced after 0, 10, 20, 30, and 40 queries, for each participant in Study 1,
as a function of the total number of contacts in their phone. Figure 4 shows the same results for Study 2, with an
additional plot for 50 queries.
6. Discussion
In Figure 2, we see that the average error decreases as more queries are asked. On average, our algorithm also
outperforms a baseline that asks the user to label a randomly chosen instance at each iteration, suggesting that it is
more effective than labeling instances one at a time. We also see that the results from Study 1 and Study 2 are
similar, suggesting that our algorithm performs comparably across the two cultures studied.
Figure 3 and Figure 4 show that the error of the clusters is reduced as we increase the number of questions that are
asked. Furthermore, we see that it only takes about 10 queries to reduce the error of the clusters to around .5,
regardless of how many contacts there are.
7. Conclusion and Future Work
We have shown that our algorithm is able to effectively reduce the number of contacts that a user needs to manually
label in their phone. Our interface allows users to save their groups into a format that can be placed into their
phones.
In the future, we would like to see if the groups that the user created can be used for predicting the groups of new
contacts in their phones. We are also interested in seeing if similar groups that have been created by different users can be
combined and used for predicting the label of unlabeled instances for any given user. For instance, if two users, 1
and 2, both have a family group and a friend group, then we could combine these groups, and use them for
predicting whether, say, user 3’s contacts belong to one of these groups.
We also note that our approach currently only allows one group per contact. However, it is likely that some contacts
belong to multiple groups (for example, roommate and friend). To allow contacts to be placed into
multiple groups, we would need to use a clustering algorithm that allows instances to be placed in more than one
cluster.
Finally, because our interface was created to be used for creating groups in mobile phones, a reasonable next step
would be to place our interface into phones.
References
[1] Wagstaff, Kiri, et al. “Constrained K-means Clustering with Background Knowledge.”
[2] Rousseeuw, Peter. “Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis.”
[3] Bekkerman, Ron, et al. “Interactive Clustering of Text Collections According to a User-Specified Criterion.”
[4] Settles, Burr. “Active Learning Literature Survey.”
[5] Pelleg, Dan, et al. “X-means: Extending K-means with Efficient Estimation of the Number of Clusters.”
[6] Zhou, Michelle, et al. “Finding Someone in My Social Directory Whom I Do Not Fully Remember or Barely Know.”
[7] Isbell, Charles, et al. “From Devices to Tasks: Automatic Task Prediction for Personalized Appliance Control.”
[8] Hurst, Amy, et al. “Automatically Identifying Targets Users Interact with During Real World Tasks.”
[9] MacLean, Diana, et al. “Groups without Tears: Mining Social Topologies from Email.”
[10] Davidson, Ian, et al. “Agglomerative Hierarchical Clustering with Constraints.”
[11] Funf Open Sensing Framework. http://funf.org/