Semantics for Big Data AAAI Technical Report FS-13-04 Large Scale Social Network Analysis Using Semantic Web Technologies Dominic DiFranzo, Qingpeng Zhang, Kristine Gloria, and James Hendler Tetherless World Constellation, Rensselaer Polytechnic Institute difrad@rpi.edu Abstract We focused on characteristics and behavioral patterns associated with “trust networks”. Traditional analyses of these networks concentrate on the trust formed between people who connect under a certain context or to achieve a certain goal (Golbeck 2008). In studying trust-based relationships, we can better understand the multi-level complexities of trust and how it relates to social phenomena like mentoring, expertise, clandestine behaviors etc (Ahmad 2012, 2010). We, like many other researchers, also recognize that online trust uniquely exploits the qualitative effects of our social contexts and environments. Thus, to explore trust in online communities, also presents an ideal litmus test for the need of mixed methodology and tools. The TWC’s response dissects trust further by leveraging YarcData’s uRiKA technology to perform analysis on large-scale non-partitionable networks like our EQ2 data. We situate our own assumptions of trust networks by citing the results of two separate works: “The Formation of TaskOriented Groups” (Huang, et.al., 2010) and “Trust Among Rogues? A Hypergraph Approach for Comparing Clandestine Trust Networks in MMOGs” (Ahmad, et.al., 2010). The first study examines team formation and collaboration among users in EQ2 as they accomplish combat activities while the second looks at housing permissions granted between players. In EQ2, “housing permissions” refer to the ability of a players have to grant other players permission to enter their in-game houses, move objects around in them, or even remove objects from the house (ibid). The owner of the house has the ability to assign access level to other users in the game. These levels are: Trustee (all rights that an owner has), Friend (can access the house and move objects in the house), Visitor (can only enter the house, can’t move things), None (can not enter the house). In fall 2012, YarcData issued the Graph Analytics Challenge, which called for the best submission of unpartitionable, big data graph problems. The team at the Tetherless World Constellation (TWC) at Rensselaer Polytechnic Institute (RPI) took up this challenge and submitted an aggressive proposal to unpack the complexities of human online social behavior. In particular, TWC’s contribution explored notions of trust networks in EverQuest II (EQ2), a Massively Multiplayer Online Role Playing Game. The following presents: what social concepts were explored, how we pursued its discovery, and why the YarcData system best suited our needs. We conclude with a brief discussion of the lessons learned and future projects. Introduction The marriage between computer science and social science continues to yield fruitful, unique results significantly influencing traditional social discourse. Particularly, the Web has provided a concentrated space for social scientists to examine human interaction at various levels and scenarios. These studies point to human behaviors that are present both online and offline making a strong case for these findings as indicative of general human behavior (Turkle 1995, 2005, 2011; boyd 2008, 2009; Rheingold 2002). However, we dispute that despite these achievements there remains significant limitations to the studies. Specifically, we argue that most existing works focus on the topological features of social networks formed by interactions. Due to the lack of data and the multidimensional nature of these interactions, the more detailed social behaviors have largely been ignored. We suggest that this lack of data leads to a gap in meaningful research from perspectives of social and organizational sciences. Copyright © 2013, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 20 Ahmad highlights that the housing permission network “is a ready proxy for the level of trust amongst characters” (ibid, p. 2). Both works are significant in their claims that behaviors exhibited offline materialize online as well. For example, in accomplishing team-oriented tasks either online or offline, individuals who lack the ability or skill set to accomplish a task are more likely to seek group help. While these findings are important, we infer that much more can be determined. EverQuest II allows players to create multiple characters, and each character can only play and interact in one server (a server in this case acts as a parallel version of the game, each with their own specialties). The four servers in our records are Nagafen (Player vs Player server. This means players are allowed to fight and kill each other), Guk (Player vs Environment. This means players can’t fight each other, they only fight computer controlled characters), Antonia Bayle (Also a Player vs Environment, but encourages users to role play as their character and act like them in game), and The Bazaar (a Player vs Environment that allows users to train and sell in game commodities). The differences in these servers let us see how these different rules effect the interaction and trust in the game. We emphasize that the two works mentioned above serve primarily as a frame of reference in evaluating the value of our data collected. Our results build upon these findings. Moreover, as described later in the submission, we leverage linked data technologies to strip away limitations such as size and schema. For unlike previous studies, this study examined cross server, multi-world interactions over various months at a time. The total size of our dataset is the largest of our knowledge featuring over 35 billion triples (4.5 terabytes of data). The detail results include (but are not limited to) the following: activity type, housing permissions held and granted, frequency of play, etc. Instead of just reviewing the types of tasks completed by a sampling of EQ2 players, we asked: Has this player developed a relationship with anyone else in his task grou and if so, for how long? Does trust flow between networks and how does this materialize? the amount of data that we used and queried. For this study, we looked only at the housing network, the mentor network, and one month (September 2006) of the experience network. This in total was 1,270,497,287 triples (about 220 GB of data). In our main analysis, we looked at the trust levels in the housing network, and explored the connections between the housing network and the mentorship network. We wanted to explore the sequence and causality between the trust in housing network and the mentorship relationship. In order to answer this question, we queried the data for all mentors that also owned a house, and had set a house access level to their mentee. This returned back a list of all mentors and mentees that also have a housing relationship, along with attributes for each player. We found that the majority of the mentors and mentees have set the highest house access level (Trustee), which indicates the strong trust between them. Next, we wanted to see if there is any difference in the trust level and the time they established the mentorship relationship? In other words, does the trust level tend to be higher if it was built after they establishment of the mentorship relationship? Whether the trust was built before or after their mentorship relationship? Figure 1: Proportion of trust under three conditions. We constructed the proportion of trust levels under three conditions as seen in Figure 1: (a) housing trust first, (b) mentorship first, and (c) the same time. We could find that more than half of the mentor-mentee pairs set the highest trust (Trustee) in their housing relationship under conditions (a) and (b), while when the two relationships were built the same time, nearly 40% of the mentormentee pairs set to the second highest trust (Friend). We hypothesize that under condition (a), when housing trust was built first, they should have had other connections (other than mentorship relationship) already; under condition (b), when mentor-mentee pairs established the housing trust, they had been working together as mentor and mentee, a strong relationship in virtual world, thus they were more likely to build a strong trust in housing Method As we outlined, our project highlights the power that large-scale social network analysis can have, and how it can be done using the new tools and technologies created by YarcData at scale. We used data collected from EverQuest II (EQ2), a Massively Multiplayer Online Role Playing Game. Our data set includes over 35 billion triples representing over 2 million players and over a billion recorded interactions within ten months of play. Due to limitations of the YarcData servers, we had to scale down 21 relationship as well. Under condition (c), they tended to have no previous relationships before, which decreased the average level of trust they established. More research needs to be done to validate our hypothesis and will be some of our future work. We further studied the distribution of the time intervals between the establishment of housing trust and mentorship 600 300 500 Count 200 150 Housing trust first 10 400 300 1 1 1 10 100 Time interval (# of days) 1000 1 10 100 Time interval (# of days) 1000 100 50 0 -300 -200 -100 0 100 Time interval (# of days) 200 0 -300 300 -200 -100 0 100 Time interval (# of days) 200 300 C D 350 450 400 250 300 Count 250 200 D 1000 1000 100 300 350 Count Mentorship first 100 10 200 100 C Housing first Mentorship first Housing trust first Mentorship first 10 10 200 150 1 1 1 150 100 Housing trust first Count Count 250 Mentorship first 100 Count B 700 350 B 1000 1000 Count 400 A Count A in time complexity with respect to the number of nodes and edges in a graph. In addition to this, graphs are difficult to cut and send to different processes running in parallel without a heavy amount of communication between these processes, which increase the complexity of a given task. Moreover, what parallel computing gains in power and 100 100 10 100 Time interval (# of days) 1000 1 10 100 Time interval (# of days) 1000 50 50 0 -300 -200 -100 0 100 Time interval (# of days) 200 300 0 -300 -200 -100 0 100 Time interval (# of days) 200 Figure 3: Distribution of the absolute time interval between building housing trust and mentorship under two conditions (housing trust established first and mentorship established first) in four servers: A. Guk; B. Antonia Bayle; C. The Bazaar; D. Nagafen. 300 Figure 2: Distribution of the time interval between building housing trust and mentorship. A. Guk; B. Antonia Bayle; C. The Bazaar; D. Nagafen. execution, it lacks in preserving context. Our decision to use linked data and RDF technology is to demonstrate the advantage of retaining such information. By standardizing and connecting the data, we can unlock potential relational qualities at an impressively large scale while keeping the context of the data intact. This is particularly important when studying human activity and task completion as neither is accomplished context-free, relationship-free or independent from the social situation (Feld 1981). relationship, as shown in Figure 2. We found that the distributions on both sides followed similar patterns. Most of the time intervals were very small (less than ten days). The distribution roughly followed a power-law distribution when the time interval is less than 100 days. Then the probability dropped very fast after 100 days, following a power-law with a much steeper slope (Figure 3). However, there are still a small number of housing trust or mentorship relationships established half a year later than the other one. In all four servers, the distributions under both conditions are very similar to each other, without distinct difference observed. Therefore we could not identify a clear causality between the establishment of housing trust and mentorship relationship. Instead, both relationships have been making impact upon each other at a macroscopic level. From a microscopic view, in some pairs, housing trust was built first, and in some pairs, vice versa. It is with enthusiasm that we accepted the opportunity to use the YarcData uRIKA technology in order to accomplish our study. The system’s powerful graph analytics hardware platform, large shared memory and massive multithreading allows us to query large amounts of our data without writing or creating special purpose tools. By just knowing SPARQL, we can instantly start to explore and analyze our data. Without this, we would waste large amounts of time testing, writing, and using special purpose programs which would have to recreated for each and ever query we had. Many of the queries we wished to ask would not be possible on distributed memory architecture, as there is no real way to partition the data. Why Semantics and YarcData? From a computational standpoint, it is very difficult to create algorithms and software that can scale for extremely large graphs. Many basic graph algorithms are exponential 22 Conclusion and Further Discussion References While we had great success using the YarcData uRIKA system, we still only got to use a small fraction of the data we had access to. It was impossible to do any real queries on the experience network (the largest network in the data we have) as was our original plan. Even on the subset of data we chose, many queries we tried or wanted to use were impossible to run, as the machine ran out of memory, or simply crashed. We wished to understand more about the actually users behind the characters (like where the lived, what sex they were, etc), but any queries that included this demographic network failed to run. We also had more queries that would find the average money, experience, etc of characters we studied, but these queries as well failed to run. Even with the subset of our data we used, we ran against the limits of the system. Likewise, some of the social network analysis we wanted to run was impossible to write or do in the SPARQL query language. Things like finding the shortest path between two nodes, or finding the centrality of a network. We feel that having extensions to the YarcData uRIKA system that goes beyond or extends the SPARQL query language would be greatly used and needed. In future we would like to continue this work over more data, and further test the hypotheses’ we created in this study. As said before, this dataset has only been explored in small pieces, and we believe there is much more to be found in it. Ahmad, M. A., Poole, M. S., & Srivastava, J. (2010, August). Network Exchange in Trust Networks. In Social Computing (SocialCom), 2010 IEEE Second International Conference on (pp. 341-346). IEEE. Ahmad, M. A., Keegan, B., Williams, D., Srivastava, J., & Contractor, N. (2011, May). Trust amongst rogues? a hypergraph approach for comparing clandestine trust networks in mmogs. In Proceedings of Fifth International AAAI Conference on Weblogs and Social Media (ICWSM 2011) (pp. 17-21). Ahmad, M. A. (2012). Computational Trust in Multiplayer Online Games(Doctoral dissertation, University of Minnesota, 2012. Major: Computer science.). Boyd, D. (2008). How can qualitative internet researchers define the boundaries of their projects: A response to Christine Hine. Internet inquiry: Conversations about method, 26-32. Boyd, D., Marwick, A., Aftab, P., & Koeltl, M. (2009). The conundrum of visibility: Youth safety and the Internet. Emirbayer, M., & Goodwin, J. (1994). Network analysis, culture, and the problem of agency. American journal of sociology, 14111454. Feld, S. L. (1982). Social structural determinants of similarity among associates. American Sociological Review, 797-801. Golbeck, J. 2008. Weaving a web of trust. Science, 321(5896), 1640-1641. Acknowledgements Huang, Y., Zhu, M., Wang, J., Pathak, N., Shen, C., Keegan, B., ... & Contractor, N. (2009, August). The formation of taskoriented groups: Exploring combat activities in online games. In Computational Science and Engineering, 2009. CSE'09. International Conference on (Vol. 4, pp. 122-127). IEEE. We would like to thank the researchers at Northwestern University’s Science of Networks in Communities (SONIC) research group and the other members of the Virtual Worlds Observatory. The Virtual Worlds Observatory is funded in part by the National Science Foundation (Grant No. CNS-1010904, OCI-0904356, & IIS-0841583) and the Army Research Lab (W911NF-0902-0053). In particular we would like to thank Dora Cai at the University of Illinois at Urbana-Champaign for her help with the EQ2 datasets. The work in this paper was also supported in part by the DARPA SMISC program and in part by the Army Research Laboratory's Network Science Collaborative Technology Alliance. The opinions in this paper do not necessarily reflect the views of these sponsors. Ratan, R. A., Chung, J. E., Shen, C., Williams, D., & Poole, M. S. (2010). Schmoozing and smiting: Trust, social institutions, and communication patterns in an MMOG. Journal of Computerā Mediated Communication, 16(1), 93-114. Rheingold, H. (2002). Smart mobs: The new social revolution. Perseus Publishing. Turkle, S. (2011). Life on the Screen. Simon & Schuster. 23