Collaborative and Structural Recommendation of Friends using Weblog-based Social Network Analysis William H. Hsu, Andrew L. King, Martin S. R. Paradesi, Tejaswi Pydimarri, Tim Weninger Department of Computing and Information Sciences, Kansas State University 234 Nichols Hall, Manhattan, KS 66506 {bhsu | aking | pmsr | tejaswi | weninger}@ksu.edu http://www.kddresearch.org Abstract social networks such as citation and collaboration networks. In this paper, we address the problem of link recommendation in weblogs and similar social networks. First, we present an approach based on collaborative recommendation using the link structure of a social network and content-based recommendation using mutual declared interests. Next, we describe the application of this approach to a small representative subset of a large real-world social network: the user/community network of the blog service LiveJournal. We then discuss the ground features available in LiveJournal’s public user information pages and describe some graph algorithms for analysis of the social network. These are used to identify candidates, provide ground truth for recommendations, and construct features for learning the concept of a recommended link. Finally, we compare the performance of this machine learning approach to that of the rudimentary recommender system provided by LiveJournal. We investigate the problem of link recommendation in such weblog-based social networks and describe an annotated graph-based representation for such networks. Our approach uses graph feature analysis to recommend links (u, v) given structural features of individual vertices and joint features of the start and end points of a candidate link, such as distance between them. We present a hybrid system that combines analysis of link structure with analysis of content, such as shared interests. This framework supports more in-depth analyses of structure combined with content, for normalization, feature construction, learning of constraints, clustering of a user’s friends and communities. Such capabilities in turn support more sophisticated recommendations such as the security level of new and existing friendships. Keywords: social networks, collaborative recommendation, LiveJournal, machine learning, data mining, graph algorithms 1 INTRODUCTION This paper presents a recommender system for links in a social network. Such links have different meanings depending on the start and end points: between users of a weblog service denote friendship or trust; between users and communities, they denote subscribership and requested privileges such as posting access; between communities and users they denote accepted members and moderator privileges. Specific recommendation targets for weblogs include links (new friends and subscriberships), security levels (link strength), requests for reciprocal links (trust, membership and moderatorship applications), and definition of security levels (filters). There are analogous applicatons of this functionality in In this paper, we describe how this hybrid approach was used to develop LJMiner, a recommender system for the popular weblog service LiveJournal. LJMiner differentiates friends from non-friends in a connected group of users with greater accuracy than the recommender system actually used by LiveJournal – namely, ranking users and communities by decreasing count of mutual interests. This task is similar to the friend recommendation task given candidates within a specified radius, so the result is a strong positive indication that LJMiner can generate better recommendations than interest-based or simple graph-based recommendation in a fielded application. 2 BACKGROUND 2.1 Social Networks in Weblogs Social network services such as Denga’s LiveJournal, Google’s Orkut, Friendster, and The Facebook allow users to list interests and link to friends, sometimes annotating these links by designating trust levels or qualitative ratings for selected friends. Among the most popular of these is the weblog service LiveJournal, a highly customizable and flexible personal publishing tool used by several million users. In this work, we focus on LiveJournal and derivative services such as GreatestJournal, DeadJournal, and JournalFen based on the same open-source server code. At the time of this writing, there are over 8.5 million LiveJournal accounts, of which over 2.5 million are active; these are either user accounts, associated with one or a small number of individuals, or communities (each a forum for multiple users similar to MSN Communities or Yahoo! Groups). The embeddability, syndication, OpenID integration, and metadata features of LiveJournal make it a rich source of structured data about these users and communities, and the interrelationships among them. Friendship in LiveJournal is an asymmetric relation between two accounts and which can be represented as an edge in a directed graph. Either the start vertex u or the end vertex v may denote either a user account or a community account, though community-to-community links are not used. Table 1 lists the categories of links and specific link types. Community-to-user links are of three independent types: “member”, “posting access”, and “maintainer” (post and membership moderation). Of these relations, only membership is requested by users or invited by maintainers; the rest are privileges granted by maintainers. Table 1. Types of links in the blog service LiveJournal. Start User User End User Community Community User Community Community Link Denotes Trust or friendship Readership or subscribership Membership, posting access, maintainer Obsolete Thus, a reciprocal link between a user and a community means that the user subscribes to the community and is an accepted member of the community. Subscriptions are listed in the “Friends: Communities” section of the user’s page and in a list titled “Watched By” in the community’s page. Links from user u to v are listed in the “Friends” list of u and in an optionally displayed “Friends Of” list of v. This list can be partitioned into reciprocal and nonreciprocal sublists for a user u: Mutual Friends: { v | (v, u) ∈E ∧ (u, v) ∈ E } Also Friend Of: { v | (v, u) ∈ E ∧ (u, v) ∉ E } The social network for the LiveJournal user base consists of many connected components. There are a few source vertices corresponding to users that link to friends but have no reciprocated friendships. Many of these are aggregator accounts created for reading RSS or other users’ blog entries. Additionally, there are sink vertices corresponding to users or communities watched by others, but who have named no friends. Some of these are channels for announcement or dissemination of creative work. 2.2 Collaborative, Structural, and Content-Based Link Recommendation We now discuss the link recommendation problem, the available data, and some previous approaches. One social function of many weblog services is to introduce people to new friends and communities and to provide content aggregators and communication media among people who know each other. The basis for these introductions is often the list of interests reported by a user or community maintainer. LiveJournal collects all of the abovementioned information on the social network structure, along with user interests, self-reported personal information, and descriptive statistics about posting history in a user information page for each account. We seek to mine this data in order to provide improved link recommendations. Our hypothesis is that recommendations based only on shared interests can be greatly improved using information about the graph structure. For instance, local structural features such as whether a link already exists from the candidate friend to the recommender system user, how many mutual friends of the user and candidate there are, and the degree of user and candidate all provide some supporting evidence for a link recommendation. Additionally, search-based graph analysis can yield information about the shortest alternate path in friendships from the user to the candidate, and vice versa. The long-term goal of this research is to explore ways in which contextual information can be combined with graph structure or descriptive graph features to obtain an enriched model for making weblog-based link recommendations. Examples of this information include user interests, preferences and constraints (e.g., desired ranges or limits for number of friends). Mechanisms for combining structural and contextual information include filtering candidate sets by graph proximity, counting number of mutual friends sharing certain interests, normalizing weights of shared interests based on dynamic itemset frequency within a certain graph radius. Initially, we consider a predominantly collaborative and structural approach to recommendation: we hypothesize that users are likely to prefer links similar to extant ones and therefore generate candidates in this paper from within a specified radius in the social network. This is a form of collaboration in that the paths are formed by other users’ choices of friends. Statistics such as the indegree of a vertex, denoting length of the users “Frends Of” list, are similarly collaborative in nature. We also use counts of mutual interests and mutual friends (structural recommendation). In the next section, we discuss the acquisition of data and experiment design for this recommendation problem. 2.3 Methodologies for link mining Getoor and Diehl [GD05] recently surveyed techniques for link mining, focusing on statistical relational learning approaches and emphasizing graphical models representations of link structure. Ketkar et al. [KHC05] compare data mining techniques over graph-based representations of links to first-order and relational representations and learning techniques that are based upon inductive logic programming (ILP). Sarkar and Moore [SM05] extend the analysis of social netwoks into the temporal dimension by modeling change in link structure across discrete time steps, using latent space models and multidimensional scaling. One of the challenges in collecting time series data from LiveJournal is the slow rate of data acquisition, just as spatial annotation data (such as that found in LJ maps and the “plot your friends on a map meme) is relatively incomplete. 2.4 Other applications using graph mining Popescul and Ungar [PU03] learn a kind of entiryrelational model from data in order to predict links. Hill [Hi03] and Bhattacharya and Getoor [BG04] similarly use statistical relational learning from data in order to resolve identity uncertainty, particularly coreferences and other redundancies (also called deduplication). Resig et al. [RDHT04] use a large (200000-user) crawl of LiveJournal to annotate a social network of instant messaging users, and explore the approach of predicting online times as a function of friends graph degree. There have been numerous recent applications of social network mining based on the text and headers of e-mail. One notable research project by McCallum et al. [MCW05] uses the Enron e-mail corpus and infers roles and topic categories based on link analysis A primary goal of this work is to extend the graph mining approach beyond link prediction and recommendation towards link explanation and annotation. It may be much more useful to explain why a group of friends in a blog service created accounts en masse or added one another as friends than to recommend relationship sets that are already extant or structured according to a preexistent social group. For example, high school classmates often create accounts and encourage their peers to join the same service. In a few cases, this is encouraged or facilitated by a teacher, for a class project. Solving the problem of link prediction is not particularly useful in this case, because the user decisions have already been made or strongly constrained; however, it may be very useful to link other classmates not working on the same project to the same relationship set (perhaps they were encouraged to join the blog service by students who continued to use it after the class project). Large groups such as web comic subscriberships, community co-members, etc. are also somewhat identifiable, and relating members of a blog service to one another through relationship sets is a typical entityrelational data modeling operation that can be made more robust and efficient through graph feature extraction. 3 EXPERIMENT DESIGN 3.1 LJCrawler To acquire the graph structure and attributes describe in the previous section, we developed an HTTP-based spider called LJCrawler to harvest user information from LiveJournal This multithreaded program collects an average of 5 records per second, traversing the social network depth-first and archiving the results in a master index file. Because LiveJournal’s functionality for looking up users by user number is only available to administrators, we decided to compile a list of seeds for a disjoint-set representation of the disconnected social network. For purposes of this experiment, however, starting from just one seed (the first author’s LiveJournal ID) and restricting the crawl to one connected oomponent was sufficient. Using LJCrawler, we compiled an adjacency list and the following ground features for each user: • • • • Account type (user, community) Paid status (free, paid, permanent) Dates of creation and last update Interest list 3.2 Feature Analyzers We define a single example to be a candidate edge (u, v) in the underlying directed graph of the social network, along with a set of descriptive features calculated from the annotated graph recorded by LJCrawler: Graph features: 1. 2. 3. 4. 5. 6. 7. Indegree of u: popularity of the user Indegree of v: popularity of the candidate Outdegree of u: number of other friends besides the candidate; saturation of friends list Outdegree of v: number of existing friends of the candidate besides the user; correlates loosely with likelihood of a reciprocal link Number of mutual friends w such that u → w ∧ w→v “Forward deleted distance“: minimum alternative distance from u to v in the graph without the edge (u, v) Backward distance from v to u in the graph The degree attributes can be enumerated in time linear in the number of users, as can the mutual friends count for each pair of users. Deleted distance requires one iteration over w for shortest path algorithm (re-relaxation). In a graph (V, E), backward distance requires Θ(|V|3) using the brute-force dynamic programming implementation produced for this experiment, Θ(|E| lg |E|) using a simple heap, and Θ(|V| lg |V| + |E|) using Fibonacci heaps. [CLRS02] Interest-based features: 8 9 10 11 Number of mutual interests between u and v Number of interests listed by u Number of interests listed by v Ratio of the number of mutual interests to the number listed by u 12 Ratio of the number of mutual interests to the number listed by v Using a straightforward string pair enumeration and comparison algorithm, the mutual interest counts are stored in matrix of |V|2 elements, each requiring constant time to check (given a maximum of 150 interests). 2. 3. From a query distribution modeled on the frequency of vertices at a given graph distance Exhaustively within a specified radius The first technique is likely to be unscalable, as the number of candidates is |V|2. The second requires having a representatively large sample of the full LiveJournal social network, in order to fit the distribution parameters accurately. The third was the most straightforward to implement. Two calls to the all pairs shortest path algorithm provided cost matrix, and one pass at each radius up to a maximum of 10 yielded the data shown in Table 2. To simplify the initial experiments, we defined the classification problem to be classification of d(u, v) as 1 or 2. Other features: Additional planned features for continuing experiments include dates (update frequencies when taken differentially), user options such as maximum friends count, and content descriptors of LiveJournal entries and comments (average post length, word frequency, etc.). This task is actually useful for social network recommender systems because discrimination of a direct friend from a “friend of a friend” (FOAF) is functionally similar to recommending FOAFs to link to directly. There are more detailed classification targets, such as placement, promotion, and demotion of linked friends within strata of trust (setting, increasing, and decreasing the security level), but choosing a user’s friends to begin with is the more fundamental decision. 3.3 Graph Search Algorithms for Computing Features Table 2. Number of candidate edges for the 941-node LiveJournal graph. Computing the minimum forward and backward distances can be done more efficiently by using breadth-first search. A straightforward implementation that we are currently using performs Θ(|V|) BFS passes to accumulate the edge set of the entire LiveJournal friends graph, which contains multiple connected components. This requires Θ(|V| (|V| + |E|)) time. Currently, a Java implementation of this algorithm requires about 70 seconds to process a 3000node graph. Since |E| is about 20 times |V| on average (that is, the average outdegree or “Friend Of” cardinality is 20), an algorithm that is more efficient in practice is possible. We note that the amortized cost of running BFS to precompute all-pairs shortest paths (APSP) with the actual edge deleted (which is necessary to avoid knowing the prediction target in link predicton) is Θ(|E| (|V| + |E|)). This is prohibitively large even for our “mid-sized” subgraphs of 10-50K nodes; when |V| is about 9 million, |E| is about 200 million, enumerating APSP is completely infeasible. However, we do not typically consider all of E, so the bottleneck is typically the first step plus a constant number of calls to BFS, requiring running time in Θ(k (|V| + |E|)). 3.4 Generating Candidates We considered several alternative ways to generate candidate edges (u, v): 1. Uniform at random Distance d 1 2 3 4 5 6 7 8 9 10 4 Frequency (= d) 5934 45042 69013 101256 87683 51040 29981 13230 4808 1022 Cumulative (≤ d) 5934 50976 119989 221245 308928 359968 389949 403179 407987 409009 RESULTS 4.1 Small Experiment Using the 941-node annotated graph summarized in Table 2, we generated 50976 candidate edges. Note that all forward distances are greater than 1: when u and v are actually connected, we erase (u, v) and find the length of the shortest alternative path. The complete listing of all twelve features is given in Section 3. The numerical types of all of the network features – both the ones describing the graph and those measuring and interests and ratios – makes data set amenable to logistic regression. We defined the concept IsFriendOf and trained three types of inducers with: 1. 2. 3. 4. 5. all attributes all graph attributes excluding the forward and backward distances the backward distances alone the backward and forward distances alone interest-related attributes alone. Table 3. Percent accuracy for predicting all classes using the 941-node graph. Inducer J48 OneR Logistic All 98.2 95.8 91.6 NoDist 94.8 92.0 90.9 BkDist 95.8 95.8 88.3 Dist 97.6 95.8 88.9 Interest 88.5 88.5 88.4 Table 4. Percent accuracy for predicting edges (d = 1, IsFriendOf = TRUE) using the 941-node graph. Inducer J48 OneR Logistic All 89.5 67.7 38.3 NoDist 65.7 41.1 33.3 BkDist 67.7 67.7 0 Dist 83.0 67.7 4.5 Interest 5.4 4.5 4.5 Tables 3 and 4 show the results for three inducers: the J48 decision tree inducer, the 1R inducer, and the Logistic regression inducer. All accuracy measures were collected over 10-fold cross-validated runs. The J48 output wth all features achieves a significant boost over the next highest (distance only). 4.2 Data Acquisition and Larger Experiments The crawler has been improved with several servicespecific optimizations for fetching user info pages. Presently these do not use LiveJournal’s BML feed of user data, which is incomplete for our purposes (that is, not all ground attributes in our initial relations are provided). At press time, this crawler processes about 20000 user records per hour and thus would require over a week to crawl LiveJournal. The current bottleneck is the Θ(|V| (|V| + |E|)) step described in Section 3.3. This is the dominant term, because the constant k denoting the number of candidate edges is usually much smaller than n, e.g., 100-1000, so that Θ(k (|V| + |E|)) is not only in Θ (|V| + |E|), but actually just a few hundred times the cost of a single BFS. 4.3 Interpretation Using mutual interests alone, even with normalization based on the number of interests in u and v, results in very poor prediction accuracy using all inducers with which we experimented. Intermediate results are achieved using mutual friends count and degree (NoDist: 65.7% on predicting edges) and using forward deleted distance and backward distance (Dist: 67.7%). Using all 12 computed graph and annotation features resulted in the highest prediction accuracy (All: 89.5%). We note that LiveJournal once used a variant of normalized mutual interests to produce a list of potential friends, arranged in decreasing order of match quality. Although this was not the same type of recommender system as LJMiner supports, it shows that the state of the art user matching systems have a lot of room for improvement. The results in Table 4 indicate that features produced by LJMiner, used with a good inducer, can generate collaborative and structural recommendations. 5 CONTINUING WORK Scaling up: Our current research focuses on scaling up to tens of thousands and eventually millions of users. Crawling over 9 million records is at least technically feasible, but scaling up the graph analyzers is a challenge that may best be met with heuristic search. Learning relational models: A promising area of research is the recovery of relational graphical models, including class-level (membership and reference slot) uncertainty. [GFKT02] LJMiner has yielded a ready source of semistructured data for both structure learning and distribution learning. Another potentially useful approach is to organize users and communities into clusters using this relational model. We have developed schemas for blog posts (entries, threads, comments) and for users and dynamic groups of users. This is related to previous preliminary work on relational data mining for personalization of web portals, especially computational grid portals. [HBJ03]. Much of the relational metadata in the bioinformatics domain comes from description languages for workflows and workflow components [Hs04]. The next step in our experimental plan is to use schemas such as our detailed ones for blog sevice users and bioinformatics information and computational grid users [Hs05] to learn a richer predictive model. Finally, modeling relational data as it persists or changes across time is an important challenge. 6 ACKNOWLEDGEMENTS We thank Todd Easton and Kirsten Hildrum for helpful discussions and Jason Li for assistance with implementations. 7 REFERENCES [BG04] I. Bhattacharya & L. Getoor. Deduplication and group detection using links. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) Workshop on Link Analysis and Group Detection (LinkKDD2004), Seattle, WA, USA, August 22-25, 2004. [CLRS02] T. H. Cormen, C. E. Leiserson, R. L. Rivest, & C. Stein. Introduction to Algorithms, Second Edition. Cambridge, MA: MIT Press, 2002. [GD05] L. Getoor & C. P. Diehl. Link mining: a survey. SIGKDD Explorations, Special Issue on Link Mining, 7(2):3-12. [GFKT02] L. Getoor, N. Friedman, D. Koller, & B. Taskar. Learning Probabilistic Models of Link Structure. Journal of Machine Learning Research, 2002. [HBJ03] W. H. Hsu, P. Boddhireddy, & R. Joehanes. Using probabilistic relational models for collaborative filtering. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Statistical Learning of Relational Models (SRL), Acapulco, MEXICO, August, 2003. [Hi03] S. Hill. Social network relational vectors for anonymous identity matching In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Statistical Learning of Relational Models (SRL), Acapulco, MEXICO, August, 2003. [Hs04] W. H. Hsu. Relational graphical models of computational workflows for data mining. In Proceedings of the International Conference on Semantics of a Networked World: Semantics for Grid Databases (ICSNW-2004), p. 309-310, Paris, FRANCE, June, 2004. [Hs05] W. H. Hsu. Relational graphical models for collaborative filtering and recommendation of computational workflow components. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Multi-Agent Information Retrieval and Recommender Systems, Edinburgh, UK, July 31, 2005. [KHC05] N. S. Ketkar, L. B. Holder, & D. J. Cook. Comparison of graph-based and logic-based multirelational data mining. SIGKDD Explorations, Special Issue on Link Mining, 7(2):64-71. [MCW05] A. McCallum, A. Corrada-Emmanuel, & X. Wang. Topic and role discovery in social networks. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Edinburgh, UK, August, 2005. [MH04] M. Mukherjee & L. B. Holder. Graph-based data mining on social networks. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) Workshop on Link Analysis and Group Detection (LinkKDD2004), Seattle, WA, USA, August 22-25, 2004. [PU03] A. Popescul & L. H. Ungar. Statistical relational learning for link prediction. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) Workshop on Statistical Learning of Relational Models (SRL), Acapulco, MEXICO, August, 2003. [RDHT04] J. Resig, S. Dawara, C. M. Homan, & A. Teredesai. Extracting social networks from instant messaging populations. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) Workshop on Link Analysis and Group Detection (LinkKDD2004), Seattle, WA, USA, August 22-25, 2004. [SM05] P. Sarkar & A. Moore. Dynamic social network analysis using latent space models. SIGKDD Explorations, Special Issue on Link Mining, 7(2):31-40.