A Framework for Computing the Privacy Scores of Users in Online Social Networks KUN LIU Yahoo! Labs and EVIMARIA TERZI Boston University A large body of work has been devoted to address corporate-scale privacy concerns related to social networks. Most of this work focuses on how to share social networks owned by organizations without revealing the identities or the sensitive relationships of the users involved. Not much attention has been given to the privacy risk of users posed by their daily information-sharing activities. In this article, we approach the privacy issues raised in online social networks from the individual users’ viewpoint: we propose a framework to compute the privacy score of a user. This score indicates the user’s potential risk caused by his or her participation in the network. Our definition of privacy score satisfies the following intuitive properties: the more sensitive information a user discloses, the higher his or her privacy risk. Also, the more visible the disclosed information becomes in the network, the higher the privacy risk. We develop mathematical models to estimate both sensitivity and visibility of the information. We apply our methods to synthetic and real-world data and demonstrate their efficacy and practical utility. Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications— Data mining General Terms: Algorithms, Experimentation, Theory Additional Key Words and Phrases: Social networks, item-response theory, expectation maximization, maximum-likelihood estimation, information propagation ACM Reference Format: Liu, K. and Terzi, E. 2010. A framework for computing the privacy score of users in online social networks. ACM Trans. Knowl. Discov. Data 5, 1, Article 6 (December 2010), 30 pages. DOI: 10.1145/1870096.1870102. http://doi.acm.org/10.1145/1870096.1870102. A shorter version of this article appeared in the Proceedings of the 2009 International Conference on Data Mining (ICDM). This work was done while K. Liu was with the IBM Almaden Research Center. Authors’ addresses: K. Liu, Yahoo! Labs, 4401 Great America Parkway, Santa Clara, CA 95054; email: kyn@yahoo-inc.com; E. Terzi (contact author), Computer Science Department, Boston University, 111 Cummington Street, Boston, MA 02215; email: evimaria@cs.bu.edu. Permission to make digital or hard copies part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from the Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org. c 2010 ACM 1556-4681/2010/12-ART6 $10.00 DOI: 10.1145/1870096.1870102. http://doi.acm.org/10.1145/1870096.1870102. ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. 6 6: 2 · K. Liu and E. Terzi 1. INTRODUCTION In recent years, online social networking has moved from niche phenomenon to mass adoption. 
As we are writing this article, the two largest social-networking Web sites in the U.S., Facebook and MySpace, each already has more than 110 million monthly active users [Owyang 2008]. The goal of users upon entering a social network is to contact or be contacted by others, meet new friends or dates, find new jobs, receive or provide recommendations, and much more. Liu and Maes [2005] estimated that well over a million self-descriptive personal profiles are available across different Web-based social networks in the United States. According to Leonard [2004], already in 2004, seven million people had accounts on Friendster, two million users were registered with MySpace, while a whopping sixteen million users were registered on Tickle for a chance to take a personality test. Facebook, for example, has spread to millions of users [Gross and Acquisti 2005], that span of various educational institutions, high-school, undergraduate, and graduate students, and faculty members, staff, and alumni—not to mention the vast media attention it has received. As the number of users of these sites and the number of sites themselves explode, securing individuals’ privacy to avoid threats such as identity theft and digital stalking becomes an increasingly important issue. Unfortunately, even experienced users who are aware of their privacy risks are sometimes willing to compromise their privacy in order to improve their digital presence in the virtual world. That is, they prefer being popular and “cool” to being conservative with respect to their privacy settings. They know that loss of control over their personal information poses a long-term threat, but they cannot assess the overall and long-term risk accurately enough to compare it to the shortterm gain. Even worse, setting the privacy controls in online services is often a complicated and time-consuming task that many users feel confused about and usually skip. Past research on privacy and social networks (e.g., Backstrom et al. [2007]; Hay et al. [2008]; Liu and Terzi [2008]; Ying and Wu [2008]; Zhou and Pei [2008]) has mainly focused on corporate-scale privacy concerns, that is, how to share a social network has owned by an organization without revealing the identities of or sensitive relationships among the registered users. Not much attention has been given to the privacy risk for individual users posed by their information-sharing activities. In this article, we address the privacy issue from the user’s perspective: we propose a framework that estimates a privacy score for each user. This score measures the user’s potential privacy risk due to his or her online informationsharing behavior. With this score, we can achieve the following. —Privacy risk monitoring. The score serves as an indicator of the user’s potential privacy risk. The system can estimate the sensitivity of each piece of information the user has shared, and send alert to the user if the sensitivity of some information is beyond the predefined threshold. —Privacy setting recommendation. The user can compare his or her privacy score with the rest of the population to know where he or she stands. In the ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. 
Computing Privacy Scores of Users in Online Social Networks · 6: 3 case where the overall privacy score of a user’s social graph is lower than that of the user himself or herself, the system can recommend stronger privacy settings based on information from the user’s social neighbors. —Social study. As a byproduct, the system can estimate the inherent attitude of each individual. This psychometric measure can help sociologists study the online behavior of users. The overall objective of our work is to enhance public awareness of privacy, and to reduce the complexity of managing information sharing in social networks. From the technical point of view, our definition of privacy score satisfies the following intuitive properties: The score increases with (i) the sensitivity of the information being revealed and (ii) the visibility of the revealed information within the network. We develop mathematical models to estimate both the sensitivity and visibility of the information, and we show how to combine these two factors in the calculation of the privacy score. —Contribution. To the best of our knowledge, we are the first to provide an intuitive and mathematically sound methodology for computing users’ privacy scores in online social networks. The two principles stated above are rather general, and many models would be able to satisfy them. In addition, the specific model we propose in this article exhibits two extra advantages: (i) it is container independent, meaning that scores calculated for users belonging to different social networks (e.g., Facebook, LinkedIn, and MySpace) are comparable, and (ii) it fits the real data. Finally, we give algorithms for the computation of privacy scores that scale well and indicative experimental evidence of the efficacy of our framework. Our models draw inspiration from the Item Response Theory (IRT) [Baker and Kim 2004] and Information Propagation (IP) models [Kempe et al. 2003]. —Overview of our framework. For a social-network user, j, we compute the privacy score as a combination of the partial privacy scores of each one of his or her profile items, for example, the user’s real name, email, hometown, mobile-phone number, relationship to status, sexual orientation, IM screen name, etc. The contribution of each profile item to the total privacy score depends on the sensitivity of the item and the visibility it gets due to j’s privacy settings and j’s position in the network. Here, we assume that all N users specify their privacy settings for the same n profile items. These settings are stored in an n × N response matrix R. The profile setting of user j for item i, R (i, j), is an integer value that determines how willing j is to disclose information about i. The higher the value, the more willing j is to disclose information about item i. In general, large values in R imply higher visibility. On the other hand, small values in the privacy settings of an item are an indication of high sensitivity; it is the highly sensitive items that most people try to protect. Therefore, the privacy settings of users for their profile items stored in the response matrix R have lots of valuable information about users’ privacy behavior. Our first approach uses exactly this information ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. 6: 4 · K. Liu and E. Terzi to compute the privacy score of users. We do so by employing notions from Item Response Theory (IRT) [Baker and Kim 2004]. 
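To make the data model in the overview concrete, the following small sketch (Python/NumPy) builds a toy dichotomous response matrix R with profile items as rows and users as columns, as described above. The specific items and values are our own illustration, not data from the paper.

```python
import numpy as np

# Rows = profile items, columns = users (an n x N response matrix R).
# R[i, j] = 1 means user j disclosed item i; 0 means the item is kept private.
# The items and entries below are illustrative only.
items = ["real name", "employer", "mobile-phone number"]
R = np.array([
    [1, 1, 1, 0],   # real name: disclosed by 3 of 4 users
    [1, 0, 1, 0],   # employer: disclosed by 2 of 4 users
    [0, 0, 1, 0],   # mobile-phone number: disclosed by 1 of 4 users
])

n, N = R.shape
print(n, "items,", N, "users")
# Fraction of users disclosing each item; items disclosed by few users tend to be sensitive.
print(dict(zip(items, R.mean(axis=1))))
```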
The position of every user in the social network also affects his or her privacy score. The visibility setting of the profile items is enhanced (or silenced) depending on the user’s role in the network. For example, the privacy risk of a completely isolated individual is much lower than the privacy risk of a popular individual, even if both have the same privacy settings in their profiles. In our extended version of privacy-score computation, we take into account the social-network structure and use models and algorithms from information-propagation and viral marketing studies [Kempe et al. 2003]. —Remarks. In this article, we do not consider how to conduct inference attacks to derive hidden information about a user based on his or her publicly disclosed data. We deem this inference problem as important, albeit orthogonal, to our work. Some profile items such as hobbies are composite since they may contain many different kinds of sensitive information. We decompose these kinds of items into primitive ones. Again, determining the granularity of the profile items is considered an orthogonal issue to the problem we study here. Although the privacy scores computed by a single method are all comparable (i.e., they are on the same scale), the scale across different methods varies. Adjusting the scales of measures that have totally different ranges can sometimes be tricky, and crude normalization can lead to misinterpretations of the results. In this article, we do not adjust the scales of different scores. Instead, we emphasize the properties of these scores and the ranking of users with respect to their privacy scores. —Organization of the material. After the presentation of the related work in Section 2 and the description of the notational conventions in Section 3, we present our definitions of privacy score and present algorithms for computing it. Our initial privacy score definition ignores the structure of the social network; its computation is only based on the privacy settings users associate with their profile items. The models and algorithmic solutions associated with this definition of privacy score are given in Sections 4, 5, 6, and 7. We extend our definition of privacy score to take into account the social-network structure in Section 8. Experimental comparison of our methods is given in Section 9. We conclude the article in Section 10. 2. RELATED WORK To the best of our knowledge, we are the first to present a framework that formally quantifies the privacy score of online social-network users. None of the previous work on privacy-preserving social-network analysis [Backstrom et al. 2007; Hay et al. 2008; Liu and Terzi 2008; Ying and Wu 2008; Zhou and Pei 2008] has addressed privacy concerns from this perspective. Past work has mostly considered anonymization methods and threats to users’ privacy once an anonymized social network is released. What we consider as the most relevant work is that on scoring systems for measuring popularity, ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. Computing Privacy Scores of Users in Online Social Networks · 6: 5 creditworthiness, trustworthiness, and identity verification. We briefly describe these scores here. —QDOS score. Garlik, a UK-based company, launched a system called QDOS1 for measuring people’s digital presence. 
The QDOS score is determined by four factors: (1) popularity, that is, who and how many people know you; (2) impact, that is, the extent to which people are influenced by what you say; (3) activity, that is, what you do online; and (4) individuality, that is, how easily you can be located online. Although QDOS can be potentially used to measure one’s privacy risk, the primary purpose of this system as of today is the opposite; it encourages people to enhance their digital presence. More importantly, QDOS uses a different mathematical model based on spectral analysis of the input social network. Our model, on the other hand, exploits item response theory and information-propagation models. —Credit score. A credit score is used to estimate the likelihood that a person will default on a loan. The most famous one, the FICO score,2 was originally developed by Fair Isaac Corporation in 1956. Nowadays, this ubiquitous three-digit number is used to evaluate the creditworthiness of a person. The credit score is different from our privacy score, not only because it serves different purposes but also because the input data used for estimating the two scores as well as the estimation methods themselves are different. —Trust score. A trust score is a measure of how much one a member of a group is trusted by the others. There is a large body of applications and research on this topic, see for example, eBay’s sellers and buyers rating system, trust management for the Semantic Web [Richardson et al. 2003], etc. Trust scores could be used by social-network users to determine who can view their personal information. However, our system is used to quantify the privacy risk after the information has been shared. Ahmad [2006] described a method for managing the release of private information. When a request is received, the information provider calculates a score to serve as a confidence level for authentication and authorization associated with the request. The provider releases the information only when the score is above a predefined threshold. —Identity score. An identity score3 is used for tagging and verifying the legitimacy of a person’s public identity. It was originally developed by financialservice firms to measure the fraud risk of new customers. Our privacy score is different from an identity score since it serves a different purpose. 1 http://www.qdos.com 2 http://www.myfico.com/ 3 http://en.wikipedia.org/wiki/Identity score ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. 6: 6 · K. Liu and E. Terzi 3. PRELIMINARIES We assume there exists a social network G that consists of N nodes, every node j ∈ {1, . . . , N} being associated with a user of the network. Users are connected through links that correspond to the edges of G. In principle, the links are unweighted and undirected. However, for generality, we assume that G is directed and we have converted undirected networks into directed ones by adding two directed edges ( j → j ) and ( j → j ) for every input undirected edge ( j, j ). Every user has a profile consisting of n profile items. For each profile item, users set a privacy level that determines their willingness to disclose information associated with this item. The privacy levels picked by all N users for the n profile items are stored in an n × N response matrix R. The rows of R correspond to profile items and the columns correspond to users. 
We use R (i, j) to refer to the entry in the ith row and jth column of R; R (i, j) refers to the privacy setting of user j for item i. If the entries of the response matrix R are restricted to take values in {0, 1}, we say that R is a dichotomous response matrix. If entries in R take any nonnegative integer values in {0, 1, . . . , }, we say that matrix R is a polytomous response matrix. In a dichotomous response matrix R, R (i, j) = 1 means that user j has made the information associated with profile item i publicly available. If user j has kept information related to item i private, then R (i, j) = 0. The interpretation of values appearing in polytomous response matrices is similar: R (i, j) = 0 means that user j keeps profile item i private; R (i, j) = 1 means that j discloses information regarding item i only to his or her immediate friends. In general, R (i, j) = k (with k ∈ {0, 1, . . . , }) means that j discloses information related to item i to users that are at most k links away in G. In general, R (i, j) ≥ R i , j means that j has more conservative privacy settings for item i than item i. The ith row of R, denoted by Ri, represents the settings of all users for profile item i. Similarly, the jth column of R, denoted by R j, represents the profile settings of user j. In most of the social media sites (e.g., Facebook, Flickr, LinkedIn, etc.), there is a response matrix where the user is asked to determine his or her choice of privacy levels for different information items. In some cases, the privacy levels are dichotomous or polytomous. It can be the case that some of the users do not set levels in all their profile items, for some of them leave the default settings. Therefore, the response matrix is always complete for every user. Recent research [Fang and LeFevre 2010] has also aimed at helping users set their personalized levels for all items, based on their selected levels at a subset of those items. We often consider users’ settings for different profile items as random variables described by a probability distribution. In such cases, the observed response matrix R is just a sample of responses that follow this probability distribution. For dichotomous response matrices, we use Pij to denote the probability that user j selects R (i, j) = 1. That is, Pij = Prob R (i, j) = 1 . In the probability that user j sets the polytomous case, we use Pijk to denote R (i, j) = k. That is, Pijk = Prob R (i, j) = k . ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. Computing Privacy Scores of Users in Online Social Networks · 6: 7 In order to allow the readers to build intuition, we start by defining the privacy score for dichotomous response matrices. Once this intuition is built, we extend our definitions to polytomous settings. 4. PRIVACY SCORE IN DICHOTOMOUS SETTINGS The privacy score of a user is an indicator of his or her potential privacy risk; the higher the privacy score of a user, the higher the threat to his or her privacy. Naturally, the privacy risk of a user depends on the privacy level he or she picks for the his or her profile items. The basic premises of our definition of privacy score are the following. —The more sensitive information a user reveals, the higher his or her privacy score. —The more people know some piece of information about a user, the higher his or her privacy score. The following two examples illustrate these two premises. Example 1. 
Assume user j and two profile items, i = {mobile-phone number} and i′ = {employer}. R(i, j) = 1 is a much more risky setting for j than R(i′, j) = 1; even if a large group of people knows j's employer, this cannot be as intrusive a scenario as the one where the same set of people knows j's mobile-phone number.

Example 2. Assume again user j and let i = {mobile-phone number} be a single profile item. Naturally, setting R(i, j) = 1 is a more risky behavior than setting R(i, j) = 0; making j's mobile phone publicly available increases j's privacy risk.

In order to capture the essence of the preceding examples, we define the privacy score of user j to be a monotonically increasing function of two parameters: the sensitivity of the profile items and the visibility these items get.

4.1 Sensitivity of a Profile Item

Examples 1 and 2 illustrate that the sensitivity of an item depends on the item itself. Therefore, we define the sensitivity of an item as follows.

Definition 1. The sensitivity of item i ∈ {1, . . . , n} is denoted by βi and depends on the nature of the item i.

Some profile items are, by nature, more sensitive than others. In Example 1, the {mobile-phone number} is considered more sensitive than {employer} for the same privacy level.

4.2 Visibility of a Profile Item

The visibility of a profile item i due to j captures how widely known j's value for i becomes in the network; the more it spreads, the higher the item's visibility. Naturally, visibility, denoted by V(i, j), depends on the value R(i, j), as well as on the particular user j and his or her position in the social network G. The simplest possible definition of visibility is V(i, j) = I(R(i,j)=1), where I(condition) is an indicator variable that becomes 1 when "condition" is true. We call this the observed visibility for item i and user j. In general, one can assume that R is a sample from a probability distribution over all possible response matrices. Then the true visibility, or simply the visibility, is computed based on this assumption.

Definition 2. If Pij = Prob(R(i, j) = 1), then the visibility is V(i, j) = Pij × 1 + (1 − Pij) × 0 = Pij.

Probability Pij depends both on the item i and the user j.

4.3 Privacy Score of a User

The privacy score of individual j due to item i, denoted by PR(i, j), can be any combination of sensitivity and visibility. That is, PR(i, j) = βi ⊗ V(i, j). Operator ⊗ is used to represent any arbitrary combination function that respects the fact that PR(i, j) is monotonically increasing with both sensitivity and visibility. For simplicity, throughout our discussion we use the product operator to combine sensitivity and visibility values.

In order to evaluate the overall privacy score of user j, denoted by PR(j), we can combine the privacy scores of j due to the different items. Again, any combination function can be employed to combine the per-item privacy scores. For simplicity, we use a summation operator here. That is, we compute the privacy score of individual j as follows:

PR(j) = Σ_{i=1}^{n} PR(i, j) = Σ_{i=1}^{n} βi × V(i, j).    (1)

In the above, the privacy score can be computed using either the observed visibility or the true visibility. For the rest of the discussion, we use the true visibility, and we refer to it simply as visibility. This is because we believe that the specific privacy settings of a user are just an instance of his or her possible settings, described by the probability distribution over R(i, j). In the next sections we show how to compute the privacy score of a user in a social network based on the privacy settings of his or her profile items.
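The short sketch below (Python/NumPy) spells out Equation (1) with the product operator for ⊗ and a sum over items. It is our own illustration, not the authors' implementation; the sensitivity and visibility values are assumed to be supplied by the estimation procedures of the following sections, and the toy numbers are made up.

```python
import numpy as np

def privacy_scores(beta, V):
    """Equation (1): PR(j) = sum_i beta_i * V(i, j).

    beta : (n,) item sensitivities
    V    : (n, N) visibility of item i for user j
    Returns a length-N vector of privacy scores, one per user.
    """
    return beta @ V   # equivalent to (beta[:, None] * V).sum(axis=0)

# Toy example (values are illustrative, not estimated from data).
R = np.array([[1, 1, 0],
              [0, 1, 0],
              [0, 1, 1]])            # dichotomous response matrix, 3 items x 3 users
beta = np.array([0.2, 0.5, 0.9])     # sensitivities, e.g., phone number most sensitive
V_observed = (R == 1).astype(float)  # observed visibility I(R(i,j)=1)
print(privacy_scores(beta, V_observed))  # the user in column 1 discloses everything -> highest score
```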
5. IRT-BASED COMPUTATION OF PRIVACY SCORE: DICHOTOMOUS CASE

In this section we show how to compute the privacy score of users using concepts from Item Response Theory (IRT). We start the section by introducing some basic concepts from IRT. We then show how these concepts are applicable in our setting.

Fig. 1. Item characteristic curves (ICCs). y axis: Pij = Pi(θj) for different β values (Figure 1(a)) and α values (Figure 1(b)); x axis: ability level θj.

5.1 Introduction to IRT

IRT has its origins in psychometrics, where it is used to analyze data from questionnaires and tests. The goal there is to measure the abilities of the examinees, the difficulty of the questions, and the probability of an examinee correctly answering a given question. In this article, we consider the two-parameter IRT model. In this model, every question qi is characterized by a pair of parameters ξi = (αi, βi). Parameter βi, βi ∈ (−∞, ∞), represents the difficulty of qi. Parameter αi, αi ∈ (−∞, ∞), quantifies the discrimination power of qi. The intuitive meaning of these two parameters will become clear shortly. Every examinee j is characterized by his or her ability level θj, θj ∈ (−∞, ∞). The basic random variable of the model is the response of examinee j to a particular question qi. If this response is marked either "correct" or "wrong" (dichotomous response), then the probability that j answers qi correctly is given by

Pij = 1 / (1 + e^(−αi(θj − βi))).    (2)

Thus, Pij is a function of parameters θj and ξi = (αi, βi). For a given question qi with parameters ξi = (αi, βi), the plot of the above equation as a function of θj is called the item characteristic curve (ICC). (We can also write Pij as Pi(θj) to indicate the dependency on θj; in general, we use Pij and Pi(θj) interchangeably.) The ICCs obtained for different values of parameters ξi = (αi, βi) are given in Figures 1(a) and 1(b). These illustrations make the intuitive meaning of parameters αi and βi easier to explain. Figure 1(a) shows the ICCs obtained for two questions q1 and q2 with parameters ξ1 = (α1, β1) and ξ2 = (α2, β2) such that α1 = α2 and β1 < β2. Parameter βi, the item difficulty, is defined as the point on the ability scale at which Pij = 0.5. We can observe that IRT places βi and θj on the same scale (see the x axis of Figure 1(a)) so that they can be compared. If an examinee's ability is higher than the difficulty of the question, then he or she has a better chance to get the answer right, and vice versa. This also indicates a very important feature of IRT called group invariance; that is, the item's difficulty is a property of the item itself, not of the people that responded to the item. We will elaborate on this in the experiments section. Figure 1(b) shows the ICCs obtained for two questions q1 and q2 with parameters ξ1 = (α1, β1) and ξ2 = (α2, β2) such that α1 > α2 and β1 = β2.
Parameter αi, the item discrimination, is proportional to the slope of Pij = Pi(θj) at the point where Pij = 0.5; the steeper the slope, the higher the discriminatory power of a question, meaning that this question can well differentiate among examinees whose abilities are below and above the difficulty of this question.

In our IRT-based computation of the privacy score, we estimate the probability Prob(R(i, j) = 1) using Equation (2). However, we do not have examinees and questions; rather, we have users and profile items. Thus, each examinee is mapped to a user, and each question is mapped to a profile item. The ability of an examinee corresponds to the attitude of a user: for user j, his or her attitude θj quantifies how concerned j is about his or her privacy; low values of θj indicate a conservative/introverted user, while high values of θj indicate a careless/extroverted user. We use the difficulty parameter βi to quantify the sensitivity of profile item i. In general, parameter βi can take any value in (−∞, ∞). In order to maintain the monotonicity of the privacy score with respect to items' sensitivity, we need to guarantee that βi ≥ 0 for all i ∈ {1, . . . , n}. This can be easily handled by shifting all items' sensitivity values by a big constant value.

In the preceding mapping, parameter αi is ignored. Naturally, the need for the two-parameter model (qi is characterized by αi and βi) is questioned. One could argue that for our purposes it is enough to use the one-parameter model (qi is only characterized by βi), which is also known as the Rasch model (see http://en.wikipedia.org/wiki/Rasch model). In the Rasch model, each item is described by parameter βi and αi = 1 for all i ∈ {1, . . . , n}. However, as shown in Birnbaum [1968] (and discussed in Baker and Kim [2004], Chapter 5), the Rasch model is unable to distinguish users that disclose the same number of profile items but with different sensitivities. We believe that a finer-grained analysis of users' attitudes is necessary, and this is the reason we pick the two-parameter model.

For computing the privacy score, we need to compute the sensitivity βi for all items i ∈ {1, . . . , n} and the probabilities Pij = Prob(R(i, j) = 1), using Equation (2). For the latter computation, we need to know all the parameters ξi = (αi, βi) for 1 ≤ i ≤ n and θj for 1 ≤ j ≤ N. In the next three sections, we show how we can estimate these parameters using as input the response matrix R and employing maximum-likelihood estimation (MLE) techniques. All these techniques exploit the following three independence assumptions inherent in IRT models: (i) independence between items; (ii) independence between users; and (iii) independence between users and items.

The independence assumptions are necessary for devising a relatively simple and intuitive model. Also, they help in the design of efficient algorithms for computing the privacy scores. Modeling dependencies between users or items would significantly increase the computational complexity of our methods and would make them incapable of handling large datasets in real scenarios. Further, our experiments in Section 9 show that parameters learned based on these assumptions fit the real-world data very well. We refer to the privacy score computed using these methods as the Pr IRT score.
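As a quick illustration of Equation (2), the sketch below (Python/NumPy; our own example, with arbitrary parameter values) evaluates the two-parameter logistic function. In the privacy setting it gives the probability that a user with attitude θj discloses item i, and it shows how βi shifts the curve while αi controls its steepness.

```python
import numpy as np

def icc(theta, alpha, beta):
    """Equation (2): P_ij = 1 / (1 + exp(-alpha * (theta - beta)))."""
    return 1.0 / (1.0 + np.exp(-alpha * (theta - beta)))

theta = np.linspace(-4, 4, 9)            # attitude / ability scale
print(icc(theta, alpha=1.0, beta=0.0))   # P = 0.5 exactly at theta = beta
print(icc(theta, alpha=1.0, beta=1.5))   # larger beta (more sensitive item): curve shifts right
print(icc(theta, alpha=2.5, beta=0.0))   # larger alpha: steeper curve, higher discrimination
```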
5.2 IRT-Based Computation of Sensitivity

In this section, we show how to compute the sensitivity βi of a particular item i. (The value of αi for the same item is obtained as a byproduct of this computation.) Since items are independent, the computation of parameters ξi = (αi, βi) is done separately for every item; thus all methods are highly parallelizable. In Section 5.2.1 we first show how to compute ξi assuming that the attitudes of the N individuals, θ = (θ1, . . . , θN), are given as part of the input. The algorithm for the computation of item parameters when attitudes are not known is discussed in Section 5.2.2.

5.2.1 Item-Parameter Estimation. The maximum-likelihood estimation of ξi = (αi, βi) sets as our goal to find ξi such that the likelihood function

Π_{j=1}^{N} Pij^{R(i,j)} (1 − Pij)^{1−R(i,j)}

is maximized. Recall that Pij is evaluated as in Equation (2) and depends on αi, βi, and θj. The above likelihood function assumes a different attitude per user. In practice, online social-network users form a grouping that partitions the set of users {1, . . . , N} into K nonoverlapping groups {F1, . . . , FK} such that ∪_{g=1}^{K} Fg = {1, . . . , N}. All users within each partition have a similar "attitude." Let θg be the attitude of group Fg (all members of Fg share the same attitude θg) and fg = |Fg|. Also, for each item i, let rig be the number of people in Fg that set R(i, j) = 1, that is, rig = |{ j | j ∈ Fg and R(i, j) = 1}|. Given such a grouping, the likelihood function can be written as

Π_{g=1}^{K} (fg choose rig) Pi(θg)^{rig} (1 − Pi(θg))^{fg − rig}.

After ignoring the constants, the corresponding log-likelihood function is

L = Σ_{g=1}^{K} [ rig log Pi(θg) + (fg − rig) log(1 − Pi(θg)) ].    (3)

The partitioning of the users into groups is made for two reasons: (1) the partitioning reduces the computational complexity of the algorithm, since we only have to consider the groups of users rather than the individual users; and (2) partitioning the users into groups that share the same attitude parameter allows us to get better estimates of the parameters of the groups, as well as of the sensitivity values of the items, since we have more observations per parameter value.

Our goal is now to find item parameters ξi = (αi, βi) that maximize the log-likelihood function given in Equation (3). For this we use the Newton-Raphson method [Ypma 1995]. The Newton-Raphson method is a numerical algorithm that, given the partial derivatives

L1 = ∂L/∂αi,  L2 = ∂L/∂βi  and  L11 = ∂²L/∂αi²,  L22 = ∂²L/∂βi²,  L12 = L21 = ∂²L/∂αi∂βi,

estimates parameters ξi = (αi, βi) iteratively. At iteration (t + 1), the estimates of the parameters αi and βi are computed from the corresponding estimates at iteration t as follows:

(αi, βi)ᵀ_{t+1} = (αi, βi)ᵀ_t − [L11 L12; L21 L22]⁻¹_t × (L1, L2)ᵀ_t.    (4)

5.2.1.1 Discussion. Algorithm 1 shows the overall process for computing ξi = (αi, βi) for all items i ∈ {1, . . . , n}. The overall process starts with the partitioning of the set of N users into K groups based on the users' attitudes. The partitioning is done using the PartitionUsers routine. This routine implements a one-dimensional clustering, and can be done optimally using dynamic programming in O(N²K) time. The result of this procedure is a grouping of users into K groups {F1, . . . , FK}, with group attitudes θg, 1 ≤ g ≤ K. Given this grouping, the values of fg and rig for 1 ≤ i ≤ n and 1 ≤ g ≤ K are computed.
These computations take time O (nN). Given these values, the Newton-Raphson estimation is performed for each one of the n items. This takes O (nIK ) time in total, where I is the number of iterations for the estimation of one item. Therefore, the total running time of the item-parameters estimation is O N2 K + nN + nIK . Note that the N2 K complexity can be reduced to linear using heuristics-based clustering, though the optimality is not guaranteed. Moreover, since items are independent of each other, the Newton-Raphson estimation of each item can be done in parallel, which makes the computation much more efficient than the theoretical complexity would suggest. 5.2.2 The EM Algorithm for Item-Parameter Estimation. In the previous section, we showed how to estimate item parameters ξi = (αi, βi) assuming that user attitudes θ = (θ1 , . . . , θ N ) were part of the input. In this section, we show how we can do the same computation without knowing θ . In this case, the input only consists of the response matrix R. If ξ = (ξ1 , . . . , ξn ) is the vector of parameters for all items, then our goal is to estimate ξ given response matrix R, with θ being the hidden and unobserved variables. We perform this task using the Expectation-Maximization (EM) procedure. The EM procedure is an iterative method that consists of two steps: expectation and maximization. We describe these two steps below. ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. Computing Privacy Scores of Users in Online Social Networks · 6: 13 Algorithm 1. Item-parameter estimation of ξi = (αi, βi) for all items i ∈ {1, . . . , n}. Input: Response matrix R, users attitudes θ = (θ1 , . . . , θ N ) and the number K of users’ attitude groups. Output: Item parameters α = (α1 , . . . , αn) and β = (β1 , . . . , βn). K 1: Fg , θg g=1 ← PartitionUsers(θ , K) 2: for g = 1 to K do 3: fg ← Fg 4: for i = 1 to n do 5: rig ← j | j ∈ Fg and R (i, j) = 1 6: for i = 1 to n do K 7: (αi, βi) ←NR Item Estimation Ri, fg , rig , θg g=1 5.2.2.1 Expectation Step. In this step, we calculate the expected grouping of users using the previously ξ . In other words, for 1 ≤ i ≤ n and estimated 1 ≤ g ≤ K, we compute E fg and E rig as follows: N E fg = f g = P θg | R j, ξ (5) j=1 and N E rig = rig = P θg | R j, ξ × R (i, j) . (6) j=1 The computation relies on the posterior probability distribution of a user’s j attitude P θg | R , ξ . Assume for now that we know how to compute these probabilities. It is easy to observe that the membership of a user in a group is probabilistic. That is, every individual belongs to every group with some probability; the sum of these membership probabilities is equal to 1. 5.2.2.2 Maximization Step. Knowing the values of f g and rig for all groups and all items allows us to compute a new estimate of ξ by invoking the Newton-Raphson item-parameters estimation procedure (NR Item Estimation) described in Section 5.2.1. The pseudocode for the EM algorithm is given in Algorithm 2. Every iteration of the algorithm consists of an Expectation and a Maximization step. 5.2.2.3 The Posterior Probability of Attitudes. By the definition of probability, this posterior probability is P R j | θ j, ξ g θ j . (7) P θ j | R j, ξ = P R j | θ j, ξ g θ j dθ j Function g θ j is the probability density function of attitudes in the population of users. It is used to model our prior knowledge about user attitudes and its called the prior distribution of users attitude. 
Following standard conventions [Mislevy and Bock 1986], we assume that the prior distribution g(.) is ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. · 6: 14 K. Liu and E. Terzi Algorithm 2. The EM algorithm for estimating item parameters ξi = (αi, βi) for all items i ∈ {1, . . . , n}. Input: Response matrix R and the number K of user groups. Users in the same group have the same attitude. Output: Item parameters α = (α1 , . . . , αn), β = (β1 , . . . , βn). 1: for i = 1 to n do 2: αi ←initial values 3: βi ←initial values 4: ξi ← (αi, βi) 5: ξ ← (ξ1 , . . . , ξn) 6: repeat // Expectation step 7: for g = 1 to K do 8: Sample θg on the ability scale 9: Compute f g using Equation (6) 10: for i = 1 to n do 11: Compute rig using Equation (6). // Maximization step 12: for i = 1 to n do K 13: (αi, βi) ←NR Item Estimation Ri, f g , rig , θg ξi ← (αi, βi) 15: until convergence g=1 14: Gaussian and is the same for all users. Our results indicate that this prior fits the data well. The term P R j | θ j, ξ in the numerator is the likelihood of the vecThis tor of observations R j given items’ parameters and user j’s attitude. j term can be computed using the standard likelihood function P R | θ j, ξ = 1−R(i, j) n R(i, j) . 1 − Pij i=1 Pij The evaluation of the posterior probability of every attitude θ j requires the evaluation of an integral. We bypass this problem as follows: since we assume the existence of K groups, we only need to sample K points X 1 , . . . , X K on the ability scale. Each of these points serves as the common attitude of a user group. For each t ∈ {1, . . . , K}, we compute g ( X t), the density of the attitude function at attitude value X t. Then, we let A ( X t) be the area of the rectangle defined by the points (X t − 0.5, 0), (X t + 0.5, 0), (X t − 0.5, g ( X t)), and (X t + 0.5, K g ( X t)). Then the A ( X t) values are normalized such that t=1 A ( X t) = 1. In that way, we can obtain the posterior probabilities of X t: P R j | X t, ξ A ( X t) . (8) P X t | R j, ξ = K j t=1 P R | X t, ξ A ( X t) 5.2.2.4 Discussion. The estimation of the privacy score using the IRT model requires as input the number of groups of users K. In our implementation, we follow standard conventions [Mislevy and Bock 1986] and set K = 10. However, we have found that other values of K fit the data as well. The estimation of the ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. Computing Privacy Scores of Users in Online Social Networks · 6: 15 “correct” number of groups is an interesting model-selection problem for IRT models, which is not the focus of this work. The running time of the EM algorithm is O ( I R (TEXP + TMAX )), where I R is the number of iterations of the repeat statement, and TEXP and TMAX the running times of the Expectation and the Maximization steps, respectively. Lines 9 and 11 require O( Nn) time each. Therefore, the total time of the expectation step is TEXP = O K Nn2 . From the preceding discussion in Section 5.2.1 we know that TMAX = O (nIK ), where I is the number of iterations of Equation (4). Again, Steps 12, 13, and 14 can be done in parallel due to the independence assumption of items. 5.3 IRT-Based Computation of Visibility The computation of visibility requires the evaluation of Pij = Prob R (i, j) = 1 , given in Equation (2). Apparently if vectors θ = (θ1 , . . . , θ N ), α = (α1 , . . . , αn), and β = (β1 , . . . 
, βn) are known, then computing Pij, for every i and j, is trivial. Here, we describe the NR Attitude Estimation algorithm, which is a Newton-Raphson procedure for computing the attitudes of individuals, given the item parameters α = (α1 , . . . , αn) and β = (β1 , . . . , βn). These item parameters could be given as input or they can be computed using the EM algorithm (Algorithm 2.). For each individual j, the NR Attitude Estimation computes 1−R(i, j) R(i, j) 1 − Pij , or the corresponding θ j that maximizes likelihood ni=1 Pij log-likelihood L = n R (i, j) log Pij + (1 − R (i, j)) log 1 − Pij . i=1 Since α and β are part of the input, the only variable to maximize over is θ j. The θ j, is obtained iteratively using again the Newtonestimate of θ j, denoted by Raphson method. More specifically, the estimate θ j at iteration (t + 1), θ j t+1 , is computed using the estimate at iteration t, θ j t, as follows: −1 ∂L ∂2L θj t − θ j t+1 = . ∂θ j t ∂θ 2j t 5.3.1 Discussion. For I iterations of the Newton-Raphson method, the running time for estimating a single user’s attitude θ j is O (nI). Due to the independence of users, each user’s attitude is estimated separately; thus the estimation for N users requires O ( NnI) time. Once again, this computation can be parallelized due to the independence assumption of users. 5.4 Putting It All Together The sensitivity βi computed in Section 5.2 and the visibility Pij computed in Section 5.3 can be applied to Equation (1) to compute the privacy score of a user. The advantages of the IRT framework can be summarized as follows: (1) the quantities IRT computes, that is, sensitivity, attitude, and visibility, have ACM Transactions on Knowledge Discovery from Data, Vol. 5, No. 1, Article 6, Pub. date: December 2010. 6: 16 · K. Liu and E. Terzi Fig. 2. True item characteristic curves (ICCs) and estimated ICCs using three different groups of users. y axis: Pij = Pi θ j ; x axis: ability level θ j. an intuitive interpretation. For example, the sensitivity of information can be used to send early alerts to users when the sensitivities of their shared profile items are out of the comfortable region. (2) Due to the independence assumptions, many of the computations can be parallelized, which makes the computation very efficient in practice. (3) As our experiments will demonstrate later, the probabilistic model defined by IRT in Equation (2) can be viewed as a generative model, and it fits the real response data very well in terms of χ 2 goodness-of-fit test. (4) Most importantly, the estimates obtained from the IRT framework satisfy the group invariance property. We will further discuss this property in the experimental section. At an intuitive level, this property means that the sensitivity of the same profile item estimated from different social networks is close to the same true value, and consequently, the privacy scores computed across different social networks are comparable. The following example illustrates this group invariance property of IRT. Example 3. In Figure 2 the dotted line is the true ICC for item i with (known) parameters ξi = (αi, βi). The data plotted by the markers in this figure consists of 30 groups; each marker depicts the proportion of people in a group that disclose item i. Since ICC spans the whole spectrum of attitude values, in order to estimate ξi from the data we need to consider all 30 groups. 
Instead of that, we estimate three pairs of item parameters ξiL, ξiM, and ξiH using three nonoverlapping clusters of users with low, medium, and high attitudes, respectively; each cluster consists of 10 groups. We thus obtain three different ICCs: CL, CM, and CH. In the figure we only draw (with solid lines) the fragments of these curves projected on the users that were actually used for estimating the corresponding item parameters. The group invariance property says that, as long as the three estimations consider responses to the same item, the fragments of CL, CM, and CH should belong to the same (in this case the true) curve. Therefore, the estimated parameters ξiL, ξiM, and ξiH should all be very close to the true value ξi. This is exactly what happens in Figure 2.

6. POLYTOMOUS SETTINGS

In this section, we show how the definitions and methods described before can be extended to handle polytomous response matrices. Recall that, in polytomous matrices, every entry R(i, j) = k with k ∈ {0, 1, . . . , ℓ}. The smaller the value of R(i, j), the more conservative the privacy setting of user j with respect to profile item i. The definitions of sensitivity and visibility of items in the polytomous case can be generalized as follows.

Definition 3. The sensitivity of item i ∈ {1, . . . , n} with respect to privacy level k ∈ {0, . . . , ℓ} is denoted by βik. Function βik is monotonically increasing with respect to k; the larger the privacy level k picked for item i, the higher its sensitivity.

Similarly, the visibility of an item becomes a function of its privacy level.

Definition 4. The visibility of item i that belongs to user j at level k is denoted by V(i, j, k). The observed visibility is computed as V(i, j, k) = I(R(i,j)=k) × k. The true visibility is computed as V(i, j, k) = Pijk × k, where Pijk = Prob(R(i, j) = k).

Given Definitions 3 and 4, we compute the privacy score of user j using the following generalization of Equation (1):

PR(j) = Σ_{i=1}^{n} Σ_{k=0}^{ℓ} βik × V(i, j, k).    (9)

Again, in order to keep our framework more general, in the following sections we will discuss true rather than observed visibility for the polytomous case.
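The sketch below (Python/NumPy; our own illustration with made-up numbers) spells out Equation (9) using the true visibility V(i, j, k) = Pijk × k. It assumes the sensitivities βik and the probabilities Pijk have already been estimated, for example by the IRT-based or Naive procedures described next.

```python
import numpy as np

def polytomous_privacy_scores(beta, P):
    """Equation (9): PR(j) = sum_i sum_k beta[i, k] * P[i, j, k] * k.

    beta : (n, L+1) sensitivity of item i at privacy level k
    P    : (n, N, L+1) probability that user j sets item i to level k
    Returns a length-N vector of privacy scores.
    """
    n, N, L1 = P.shape
    levels = np.arange(L1)                      # k = 0, 1, ..., L
    V = P * levels                              # true visibility P_ijk * k
    return np.einsum('ik,ijk->j', beta, V)      # sum over items i and levels k

# Toy example with n = 2 items, N = 2 users, L = 2 (levels 0, 1, 2); values are made up.
beta = np.array([[0.1, 0.3, 0.6],
                 [0.2, 0.5, 0.9]])
P = np.full((2, 2, 3), 1.0 / 3.0)               # uniform level probabilities, for illustration
print(polytomous_privacy_scores(beta, P))
```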
6.1 IRT-Based Privacy Score: Polytomous Case

Computing the privacy score in this case boils down to a transformation of the polytomous response matrix R into (ℓ + 1) dichotomous response matrices R*0, R*1, . . . , R*ℓ. Each matrix R*k, k ∈ {0, 1, . . . , ℓ}, is constructed so that R*k(i, j) = 1 if R(i, j) ≥ k, and R*k(i, j) = 0 otherwise. Let P*ijk be the probability of setting R*k(i, j) = 1, that is, P*ijk = Prob(R*k(i, j) = 1) = Prob(R(i, j) ≥ k). When k = 0, matrix R*k has all its entries equal to 1, and P*ijk = 1 for all users. When k ∈ {1, . . . , ℓ}, P*ijk is given as in Equation (2). That is,

P*ijk = 1 / (1 + e^(−α*ik(θj − β*ik))).    (10)

By construction, for every k, k′ ∈ {1, . . . , ℓ} with k < k′, matrix R*k′ contains only a subset of the 1-entries appearing in matrix R*k. Therefore, P*ijk ≥ P*ijk′, and the ICC curves P*ijk of the same profile item i at different privacy levels k ∈ {1, . . . , ℓ} do not cross, as shown in Figure 3(a). This observation results in the following corollary.

Fig. 3. (a) y axis: probability P*ijk = Prob(R(i, j) ≥ k) for k ∈ {0, 1, 2, 3}; x axis: attitude θj of user j. (b) y axis: probability Pijk = Prob(R(i, j) = k) for k ∈ {0, 1, 2, 3}; x axis: attitude θj of user j.

Corollary 1. For item i and privacy levels k ∈ {1, . . . , ℓ}, we have that β*i1 < · · · < β*ik < · · · < β*iℓ. Moreover, since the curves P*ijk do not cross, we also have that α*i1 = · · · = α*ik = · · · = α*iℓ = α*i. For k = 0, P*ijk = 1, and α*i0 and β*i0 are not defined.

The computation of the privacy score in the polytomous case, however, requires computing βik and Pijk = Prob(R(i, j) = k) (see Definition 4 and Equation (9)). These parameters are different from β*ik and P*ijk, since the latter are defined on dichotomous matrices. Now the question is: given estimates of β*ik and P*ijk, how do we transform them into βik and Pijk? Fortunately, since by definition P*ijk is the cumulative probability P*ijk = Σ_{k′=k}^{ℓ} Pijk′, we have that

Pijk = P*ijk − P*ij(k+1),  when k ∈ {0, . . . , ℓ − 1};
Pijk = P*ijk,  when k = ℓ.    (11)

Figure 3(b) shows the ICCs for Pijk, which are obtained by the above equation. Also, following Baker and Kim [2004], we have the following proposition for βik.

Proposition 1 (Baker and Kim [2004]). For k ∈ {1, . . . , ℓ − 1} it holds that βik = (β*ik + β*i(k+1)) / 2. Also, βi0 = β*i1 and βiℓ = β*iℓ.

From Proposition 1 and Corollary 1 we have the following.

Corollary 2. For k ∈ {0, . . . , ℓ}, it holds that βi0 < βi1 < · · · < βiℓ.

Corollary 2 verifies our intuition that the sensitivity of an item is a monotonically increasing function of the privacy level k.
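The sketch below (Python/NumPy; our own illustration) spells out the two transformations just described: Equation (11) turns the cumulative probabilities P*ijk into the level probabilities Pijk, and Proposition 1 turns the dichotomous sensitivities β*ik into the polytomous ones βik. The inputs are assumed to come from the dichotomous IRT estimation of Section 5 applied to the matrices R*k; the toy values are made up.

```python
import numpy as np

def level_probabilities(P_star):
    """Equation (11): P_ijk = P*_ijk - P*_ij(k+1) for k < L, and P_ijL = P*_ijL.

    P_star : (..., L+1) cumulative probabilities, with P*_..0 = 1 by construction.
    """
    P = P_star.copy()
    P[..., :-1] = P_star[..., :-1] - P_star[..., 1:]
    return P

def level_sensitivities(beta_star):
    """Proposition 1: beta_ik = (beta*_ik + beta*_i(k+1)) / 2 for 1 <= k <= L-1,
    with beta_i0 = beta*_i1 and beta_iL = beta*_iL.

    beta_star : (n, L) sensitivities beta*_i1, ..., beta*_iL (level 0 has no beta*).
    Returns an (n, L+1) array of beta_i0, ..., beta_iL.
    """
    first = beta_star[:, :1]                               # beta_i0 = beta*_i1
    middle = (beta_star[:, :-1] + beta_star[:, 1:]) / 2.0  # interior averages
    last = beta_star[:, -1:]                               # beta_iL = beta*_iL
    return np.hstack([first, middle, last])

# Toy check with L = 3 levels beyond 0 (values are made up).
P_star = np.array([[1.0, 0.7, 0.4, 0.1]])
print(level_probabilities(P_star))       # [[0.3, 0.3, 0.3, 0.1]] -- sums to 1
beta_star = np.array([[0.5, 1.0, 2.0]])
print(level_sensitivities(beta_star))    # [[0.5, 0.75, 1.5, 2.0]] -- increasing in k
```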
6.1.1 IRT-Based Sensitivity for Polytomous Settings. The sensitivity of item i with respect to privacy level k, βik, is obtained from the sensitivity parameters of the P*ijk curves: we first compute β*ik and β*i(k+1), and then use Proposition 1 to compute βik.

The goal here is therefore to compute the sensitivity parameters β*i1, . . . , β*iℓ for each item i. As in Section 5, we consider two cases: one where the users' attitudes θ are given as part of the input along with the response matrix R, and one where the input consists only of R. We devote the rest of this section to discussing the algorithm for the first case. The second case can be solved using the same EM principles described in Section 5.2.2.

Given the attitude vector θ = (θ1, . . . , θN) as input, one could argue that for every k = 1, . . . , ℓ one could use Algorithm 1 with input R*k to compute the item parameters (α*ik, β*ik) for each level k and item i. Such a solution would give wrong results for the following reasons: first, for each value of k, a different value for the discrimination parameter α*ik would be found. Second, the dependency between the P*ijk functions would not be taken into consideration. These problems can be eliminated by simultaneously computing all (ℓ + 1) unknown parameters α*i and β*ik for 1 ≤ k ≤ ℓ.

Again assume that the set of N individuals can be partitioned into K groups, such that all the individuals in the gth group have the same attitude θg. Also let Pik(θg) be the probability that an individual j in group g sets R(i, j) = k. Finally, denote by fg the total number of users in the gth group and by rgk the number of people in the gth group that set R(i, j) = k. Given this grouping, the likelihood of the data in the polytomous case can be written as

Π_{g=1}^{K} [ fg! / (rg1! rg2! · · · rgℓ!) ] Π_{k=1}^{ℓ} Pik(θg)^{rgk}.

After ignoring the constants, the corresponding log-likelihood function is

L = Σ_{g=1}^{K} Σ_{k=1}^{ℓ} rgk log Pik(θg).    (12)

To evaluate Equation (12), we use Equations (11) and (10). This substitution transforms L into a function where the only unknowns are the (ℓ + 1) parameters (α*i, β*i1, . . . , β*iℓ). The computation of these parameters is done using, again, an iterative Newton-Raphson procedure. The algorithm is similar to the one described in Section 5.2.1. The difference here is that there are more unknown parameters with respect to which we need to compute the partial derivatives of the log-likelihood L given in Equation (12). Details can also be found in Baker and Kim [2004], Chapter 8.

6.1.2 IRT-Based Visibility for Polytomous Settings. Computing the visibility values in the polytomous case requires the computation of the attitudes θ for all individuals. Given the item parameters α*i, β*i1, . . . , β*iℓ, this can be done independently for each user, using a procedure similar to NR Attitude Estimation (see Section 5.3). The only difference here is that the likelihood function used for the computation is the one given in Equation (12).

6.1.3 Putting It All Together. The IRT-based computations of sensitivity and visibility for polytomous response matrices give a privacy score for every user. This score is computed by applying the IRT-based sensitivity and visibility values to Equation (9). As in the dichotomous IRT computations, we refer to the score thus obtained as the Pr IRT score. The distinction between polytomous and dichotomous IRT scores becomes clear from the context.

7. NAIVE PRIVACY-SCORE COMPUTATION

In this section we describe a simple way of computing the privacy score of a user. We call this approach Naive and it serves as a baseline methodology for computing privacy scores. We also demonstrate some of its disadvantages.

7.1 Naive Computation of Sensitivity

Intuitively, the higher the sensitivity of an item i, the fewer the people willing to disclose it. So, if |Ri| denotes the number of users who set R(i, j) = 1, then the sensitivity βi for dichotomous matrices can be computed as the proportion of users that are reluctant to disclose item i. That is,

βi = (N − |Ri|) / N.    (13)

The higher the value of βi, the more sensitive the item i. For the polytomous case, the above equation generalizes as follows:

β*ik = ( N − Σ_{j=1}^{N} I(R(i,j)≥k) ) / N.    (14)

In order to be symmetric with the IRT-based computations for the polytomous settings, we compute the sensitivity value associated with level k, βik, by combining the β*ik and β*i(k+1) values as in Proposition 1. Note that, in this way, we guarantee that, for k ∈ {0, . . . , ℓ}, βi0 < βi1 < · · · < βiℓ, as required by Definition 3.
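A short sketch of the Naive sensitivity computation (Python/NumPy; our own illustration of Equations (13) and (14) with made-up data). The per-level βik values would then be obtained with the Proposition 1 averaging shown earlier.

```python
import numpy as np

def naive_sensitivity_dichotomous(R):
    """Equation (13): beta_i = (N - |R_i|) / N for a 0/1 response matrix R of shape (n, N)."""
    N = R.shape[1]
    return (N - R.sum(axis=1)) / N

def naive_sensitivity_polytomous(R, L):
    """Equation (14): beta*_ik = (N - #{j : R(i, j) >= k}) / N for k = 1, ..., L."""
    N = R.shape[1]
    return np.stack([(N - (R >= k).sum(axis=1)) / N for k in range(1, L + 1)], axis=1)

# Toy polytomous matrix with levels 0..2 (values are made up).
R = np.array([[0, 1, 2, 2],
              [0, 0, 0, 1]])
beta_star = naive_sensitivity_polytomous(R, L=2)
print(beta_star)   # item 2 is disclosed less often, so its beta*_ik values are larger
```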
7.2 Naive Computation of Visibility

The computation of visibility in the dichotomous case requires an estimate of the probability Pij = Prob(R(i, j) = 1). Assuming independence between items and individuals, we can compute Pij as the product of the probability of a 1 in row Ri times the probability of a 1 in column R^j. That is, if |R^j| is the number of items for which j sets R(i, j) = 1, we have

Pij = (|Ri| / N) × (|R^j| / n).    (15)

Probability Pij is higher for less sensitive items and for users that have the tendency/attitude to disclose lots of their profile items. The visibility in the polytomous case requires the computation of the probability Pijk = Prob(R(i, j) = k). By assuming independence between items and users, this probability can be computed as follows:

Pijk = ( Σ_{j=1}^{N} I(R(i,j)=k) / N ) × ( Σ_{i=1}^{n} I(R(i,j)=k) / n ).    (16)

The Naive computation of the privacy score requires applying Equations (13) and (15) to Equation (1). For the polytomous case, we use Equation (9) to combine the βik and Pijk values computed as described above. We refer to the privacy score computed in this way as the Pr Naive score.

7.3 Discussion

The Naive computation can be done efficiently in O(Nn) time. But the disadvantage is that the sensitivity values obtained are significantly biased by the user population contained in R. If the users happen to be quite conservative and rarely share anything, then the estimated sensitivity values can be very high; conversely, the values can be very low if the users are very extroverted. Therefore, the Naive approach does not exhibit the nice group invariance property. Moreover, as we will show in the experimental section, the probability model defined by Equations (15) and (16), though simple and intuitive, fails to fit the real-world response matrices R (in terms of the χ² goodness-of-fit test).

8. NETWORK-BASED PRIVACY SCORE

So far, we have defined the visibility of an item so that it only depends on the privacy setting picked by a user. In fact, the visibility of a profile item also depends on the position of the user within the social network. That is, if a popular user makes one of the items in his or her profile accessible by everyone, then this item becomes much more visible compared to the corresponding item of a more isolated user that is also publicly available. Formally, so far we assumed that for user j the visibility of item i at level k is simply the product Prob(R(i, j) = k) × k, with k ∈ {0, . . . , ℓ}. We can generalize this definition to

V(i, j, k) = Prob(R(i, j) = k) × fj(k),    (17)

where fj(·) is a monotonically increasing function of k with fj(0) = 0. Note that the form of fj(·) depends on user j. For example, given a social network G, fj(·) can depend on the position of j in G. Equation (17) assumes that function fj(k) takes the same value for all items i. A more general setting would be one where this function is different not only for every user j, but also for every item i. For simplicity of exposition, we assume the former scenario.

Given G, we evaluate fj(k) by exploiting notions from information-propagation models used in social-network analysis [Kempe et al. 2003]. In this setting, fj(k) should be interpreted as the fraction of nodes in the network that know the value of item i for user j, given that R(i, j) = k. For R(i, j) = 0, fj(0) = 0; naturally, a piece of information that is not released cannot spread in the network. For k = 1, information about item i propagates from j to j's friends in G, and from them to other users of G. Let P be a propagation model that determines how information propagates from one node to its neighbors in G.
8. NETWORK-BASED PRIVACY SCORE

So far, we have defined the visibility of an item so that it depends only on the privacy setting picked by a user. In fact, the visibility of a profile item also depends on the position of the user within the social network. That is, if a popular user makes one of the items in his or her profile accessible by everyone, then this item becomes much more visible than the corresponding item of a more isolated user, even if the latter is also publicly available. Formally, so far we assumed that, for user j, the visibility of item i at level k is simply the product Prob[R(i, j) = k] × k, with k ∈ {0, ..., ℓ}. We can generalize this definition to

V(i, j, k) = Prob[R(i, j) = k] × f_j(k),   (17)

where f_j(·) is a monotonically increasing function of k with f_j(0) = 0. Note that the form of f_j(·) depends on user j. For example, given a social network G, f_j(·) can depend on the position of j in G. Equation (17) assumes that function f_j(k) takes the same value for all items i. A more general setting would be one where this function is different not only for every user j, but also for every item i. For simplicity of exposition, we assume the former scenario.

Given G, we evaluate f_j(k) by exploiting notions from information-propagation models used in social-network analysis [Kempe et al. 2003]. In this setting, f_j(k) should be interpreted as the fraction of nodes in the network that know the value of item i for user j, given that R(i, j) = k. For R(i, j) = 0, f_j(0) = 0; naturally, a piece of information that is not released cannot spread in the network. For k = 1, information about item i propagates from j to j's friends in G, and from them to other users of G. Let P be a propagation model that determines how information propagates from one node to its neighbors in G. Also let P(j, G) be the fraction of nodes in G that know a piece of information about j once j releases it (this can refer either to the actual or to the expected fraction, depending on whether the propagation model P is deterministic or probabilistic). We define f_j(1) to be P(j, G). In order to compute f_j(k) for k ≥ 2, we extend the original graph G to G^k by adding directed links from j to all the nodes in G that are within distance k from j. We then perform propagation of information from j over the graph G^k and let f_j(k) = P(j, G^k).

In our experiments, we set the propagation model P to be the Independent Cascade (IC) model (see Kempe et al. [2003] for a more thorough description). In this model, propagation proceeds in discrete steps. When node v gets a piece of information for the first time at time t, it is given a single chance to pass the information to each of its currently oblivious immediate neighbors. Node v succeeds in passing the information to node w with probability p_{v,w}. If v succeeds, w gets to know the piece of information at time t + 1. Independently of whether v succeeds, it cannot make any further attempts to pass the information to w in subsequent rounds. For our experiments, we assumed that p_{v,w} is the same for all neighboring nodes. Alternatively, one could use the information about the attitudes of users and the IRT model to determine these probabilities. From the implementation point of view, one can compute f_j(k) = P(j, G^k) by sampling every edge (v → w) of graph G^k with probability p_{v,w}. Implementation details on how to compute f_j(k) for IC can be found in Kempe et al. [2003].

Having computed f_j(k) using the IC model, we can then compute the visibility V(i, j, k) using Equation (17). Combining this visibility with the appropriate sensitivity values, we can estimate the privacy score of users using Equation (9). When Prob[R(i, j) = k] and \beta_{ik} are computed using the Naive model described in Section 7, we refer to the obtained score as the Pr Naive IC privacy score. When the computation of Prob[R(i, j) = k] and \beta_{ik} is done using the IRT model described in Section 6.1, we refer to the obtained score as Pr IRT IC.

Note that our model is not restricted to the information-propagation model described above. In fact, any of the other information-propagation models described in Kempe et al. [2003] could be used to compute the visibility of a node as well.
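A minimal Monte Carlo sketch of the IC-based computation of f_j(k): simulate the cascade from j many times and average the fraction of informed nodes. The uniform propagation probability, the number of samples, and the use of an undirected networkx graph (the article adds directed links from j when building G^k) are our assumptions for illustration.

```python
import random
import networkx as nx

def ic_spread_fraction(G, j, p=0.1, samples=200, seed=0):
    """Estimate P(j, G): expected fraction of nodes that learn an item
    released by j, under the Independent Cascade model with uniform p."""
    rnd = random.Random(seed)
    n = G.number_of_nodes()
    total = 0.0
    for _ in range(samples):
        informed, frontier = {j}, [j]
        while frontier:
            nxt = []
            for v in frontier:                   # each newly informed node ...
                for w in G.neighbors(v):         # ... gets one chance per oblivious neighbor
                    if w not in informed and rnd.random() < p:
                        informed.add(w)
                        nxt.append(w)
            frontier = nxt
        total += len(informed) / n
    return total / samples

def extended_graph(G, j, k):
    """G^k: link j directly to every node within distance k of j."""
    Gk = G.copy()
    near = nx.single_source_shortest_path_length(G, j, cutoff=k)
    Gk.add_edges_from((j, u) for u in near if u != j)
    return Gk

# f_j(1) and f_j(2) for node j = 0 in a toy graph
G = nx.erdos_renyi_graph(100, 0.05, seed=1)
f1 = ic_spread_fraction(G, 0)
f2 = ic_spread_fraction(extended_graph(G, 0, 2), 0)
```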
9. EXPERIMENTS

The purpose of the experimental section is to illustrate the properties of the different methods for computing users' privacy scores and to pinpoint their advantages and disadvantages. From the data-analysis point of view, our experiments with real data reveal interesting facts about users' behavior.

9.1 Datasets

We start by giving a brief description of the synthetic and real-world datasets we used for our experiments.

—Dichotomous synthetic dataset. This dataset consists of a dichotomous n × N response matrix R_S, where the rows correspond to items and the columns correspond to users. The response matrix R_S was generated as follows: for each item i, of a total of n = 30 items, we picked parameters α_i and β_i uniformly at random from the intervals (0, 2) and [6, 14], respectively. We assumed that the items were sorted based on their β_i values, that is, β_1 < β_2 < ... < β_n. Next, K = 30 different attitude values were picked uniformly at random from the real interval [6, 14]. Each such attitude value θ_g was associated with a group of 200 users (all 200 users in a group had attitude θ_g). Let the groups be sorted so that θ_1 < θ_2 < ... < θ_K. For every group F_g, user j ∈ F_g, and item i, we set R_S(i, j) = 1 with probability Prob[R(i, j) = 1] = 1 / (1 + e^{−α_i(θ_g − β_i)}). (A small generation sketch is given after this list.)

—Survey dataset. This dataset consists of the data we collected by conducting an online survey. The goal of the survey was to collect users' information-sharing preferences. Given a list of profile items that span a large spectrum of one's personal life (e.g., name, gender, birthday, political views, interests, address, phone number, degree, job, etc.), the users were asked to specify the extent to which they wanted to share each item with others. The privacy levels a user could allocate to items were {0, 1, 2, 3, 4}: 0 means that a user wanted to share this item with no one, 1 with some immediate friends, 2 with all immediate friends, 3 with all immediate friends and friends of friends, and 4 with everyone. This setting simulates most of the privacy-setting options used in real online social networks. Along with users' privacy settings, we also collected information about their locations, educational backgrounds, ages, etc. The survey spans 49 profile items. We received 153 complete responses from 18 countries/political regions. Among the participants, 53.3% are male and 46.7% are female, 75.4% are in the age range of 23 to 39, 91.6% hold a college degree or higher, and 76.0% spend 4 hours or more every day surfing online. From the Survey dataset we constructed a polytomous response matrix R (with ℓ = 4). This matrix contains the privacy levels picked by the 153 respondents for each one of the 49 items. We also constructed four dichotomous matrices R*_k with k ∈ {1, 2, 3, 4} as follows: R*_k(i, j) = 1 if R(i, j) ≥ k, and 0 otherwise. We conducted the survey on SurveyMonkey (http://www.surveymonkey.com/) for 3 months in order to obtain the users' answers to the questions. However, due to privacy concerns and IBM's policy, we are currently not allowed to make the dataset publicly available.
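The generation sketch referenced in the first bullet: sample item and attitude parameters in the stated ranges and draw R_S(i, j) from the two-parameter logistic model; the dichotomization helper mirrors the construction of the R*_k matrices. Variable names and the random seed are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_dichotomous(n_items=30, n_groups=30, group_size=200):
    """R_S(i, j) = 1 with probability 1 / (1 + exp(-alpha_i * (theta_g - beta_i)))."""
    alpha = rng.uniform(0.0, 2.0, size=n_items)
    beta = np.sort(rng.uniform(6.0, 14.0, size=n_items))      # beta_1 < ... < beta_n
    theta_groups = np.sort(rng.uniform(6.0, 14.0, size=n_groups))
    theta = np.repeat(theta_groups, group_size)                # one attitude per user
    prob = 1.0 / (1.0 + np.exp(-alpha[:, None] * (theta[None, :] - beta[:, None])))
    R_S = (rng.random(prob.shape) < prob).astype(int)
    return R_S, alpha, beta, theta

def dichotomize(R, k):
    """R*_k(i, j) = 1 iff R(i, j) >= k, for a polytomous response matrix R."""
    return (R >= k).astype(int)

R_S, alpha, beta, theta = generate_dichotomous()               # 30 x 6000 matrix
```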
9.2 Experiments with Dichotomous Synthetic Data

The goal of the experiments described in this section was to demonstrate the group-invariance property of the IRT model. For the experiments, we used the Dichotomous Synthetic dataset. We conducted the experiments as follows: first, we clustered the 6000 users into three groups, F_L = ∪_{g=1...10} F_g, F_M = ∪_{g=11...20} F_g, and F_H = ∪_{g=21...30} F_g. That is, the first cluster consists of the users in the 10 lowest-attitude groups F_1, ..., F_10, the second consists of all users in the 10 medium-attitude groups, and the third consists of all users in the 10 highest-attitude groups. Given the users' attitudes assigned in the data generation, we estimated item parameters ξ_i^L = (α_i^L, β_i^L), ξ_i^M = (α_i^M, β_i^M), and ξ_i^H = (α_i^H, β_i^H) for every item i. The estimation was done using Algorithm 1 with an input response matrix that contained only the columns of R_S associated with the users in F_L, F_M, and F_H, respectively. We also used Algorithm 1 to compute estimates ξ_i^all = (α_i^all, β_i^all) using the whole response matrix R_S.

Figure 4(a) shows the estimated sensitivity values of the items. Since the data was generated using the IRT model, the true parameters ξ_i = (α_i, β_i) for each item were also known (and plotted). The x axis of the figure shows the different items sorted in increasing order of their true β_i values. It can be seen that, for the majority of the items, the estimated sensitivity values β_i^L, β_i^M, β_i^H, and β_i^all are all very close to the true β_i value. This illustrates one of the interesting features of IRT: the item parameters do not depend on the attitude level of the users responding to the item. Thus, the item parameters exhibit what is known as group invariance. The validity of this property is demonstrated in Frank Baker's book [Baker and Kim 2004] and in an online tutorial (http://echo.edres.org:8080/irt/baker/chapter3.pdf). At an intuitive level, since the same item was administered to all groups, each of the three parameter-estimation processes was dealing with a segment of the same underlying item characteristic curve (see Figure 1). Consequently, the item parameters yielded by the three estimations should be identical.

It should be noted that, even though the item parameters are group invariant, this does not mean that in practice the values of the same item parameter estimated from different groups of users will always be exactly the same. The obtained values are subject to variation due to group size and the goodness-of-fit of the ICC curve to the data. Nevertheless, the estimated values should be in "the same ballpark." This explains why, in Figure 4(a), there are some items for which the estimated parameters deviate more from the true ones.

We repeated the same experiment for the Naive model. That is, for each item we estimated sensitivities β_i^L, β_i^M, β_i^H, and β_i^all using the Naive approach (Section 7). Figure 4(b) shows the obtained estimates.

[Fig. 4. Testing the group-invariance property of item-parameter estimation using the IRT (Figure 4(a)) and Naive (Figure 4(b)) models.]

The plot demonstrates that the Naive computation of sensitivity does not have the group-invariance property. For most of the items, the sensitivities β_i^L obtained from users with low attitude levels (i.e., conservative, introverted users) were much higher than the β_i^all estimates, since these users rarely shared anything, whereas the β_i^H obtained from users with high attitude levels (i.e., careless, extroverted users) were much lower than β_i^all. Note that, since the sensitivities estimated by the Naive and IRT models are not on the same scale, one should consider the relative rather than the absolute error when comparing the results in Figures 4(a) and 4(b).
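To make the Figure 4(b) observation concrete, here is a small sketch that recomputes the Naive sensitivities of Equation (13) on the low- and high-attitude thirds of the synthetic users and compares them with the full-population estimates. It reuses the generate_dichotomous and naive_sensitivity sketches shown earlier and is illustrative only.

```python
import numpy as np

R_S, alpha, beta, theta = generate_dichotomous()   # columns ordered by group attitude
order = np.argsort(theta, kind="stable")
low, mid, high = np.array_split(order, 3)          # F_L, F_M, F_H (2000 users each)

beta_all = naive_sensitivity(R_S)
beta_L = naive_sensitivity(R_S[:, low])            # conservative users only
beta_H = naive_sensitivity(R_S[:, high])           # extroverted users only

# Naive estimates drift with the population: higher for F_L, lower for F_H
print((beta_L - beta_all).mean(), (beta_H - beta_all).mean())
```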
9.3 Experiments with the Survey Data

The goal of the experiments in this section was to show (1) that IRT is a good model for the real-world data, whereas Naive is not; and (2) that IRT provides an interesting estimate of the sensitivity of the information being shared in online social networks.

9.3.1 Testing χ² Goodness-of-Fit. We start by illustrating that the IRT model fits the real-world data very well, whereas the Naive model does not. For that we use the χ² goodness-of-fit test, a commonly used test for accepting or rejecting the null hypothesis that a data sample comes from a specific distribution. Our input data consisted of the dichotomous matrices R*_k (k ∈ {1, 2, 3, 4}) constructed from the Survey data.

First we tested whether the IRT model is a good model for the data in R*_k. We tested this hypothesis as follows: first we used the EM algorithm (Algorithm 2) to estimate both the items' parameters and the users' attitudes. Then, we used a one-dimensional dynamic-programming algorithm to group the users based on their estimated attitudes. The mean attitude of a group F_g serves as the group attitude θ_g. Also let the size of F_g be f_g. Next, for each item i we computed

\chi^2 = \sum_{g=1}^{K} \left[ \frac{(f_g \tilde{p}_{ig} - f_g p_{ig})^2}{f_g p_{ig}} + \frac{(f_g \tilde{q}_{ig} - f_g q_{ig})^2}{f_g q_{ig}} \right].

In this equation, f_g is the number of users in group F_g; p_{ig} (respectively, p̃_{ig}) is the expected (respectively, observed) proportion of users in F_g that set R*_k(i, j) = 1. Finally, q_{ig} = 1 − p_{ig} (and q̃_{ig} = 1 − p̃_{ig}). For the IRT model, p_{ig} = P_i(θ_g), and it is computed using Equation (2) with group attitude θ_g and the item parameters estimated by EM. For IRT, the test statistic follows, approximately, a χ² distribution with (K − 2) degrees of freedom, since there were two estimated parameters.

For testing whether the responses in R*_k can be described by the Naive model, we followed a similar procedure. First, we computed, for each user, the proportion of items that the user set equal to 1 in R*_k. This value served as the user's "pseudoattitude." Then we constructed K groups of users F_1, ..., F_K using a one-dimensional dynamic-programming algorithm based on these attitude values. Given this grouping, the χ² statistic was computed again. The only difference here was that

p_{ig} = \frac{|R^*_{ki}|}{N} \times \frac{1}{f_g} \sum_{j \in F_g} \frac{|R^{*j}_{k}|}{n},   (18)

where |R*_{ki}| denotes the number of users who shared item i in R*_k, and |R*_k^j| denotes the number of items shared by user j in R*_k. For Naive, the test statistic approximately follows a χ² distribution with (K − 1) degrees of freedom.

Table I shows the number of items for which the null hypothesis that their responses follow the IRT or the Naive model was rejected. We show results for all dichotomous matrices R*_1, R*_2, R*_3, and R*_4 and for K ∈ {6, 8, 10, 12, 14}. In all cases, the null hypothesis that items follow the Naive model was rejected for all 49 items. On the other hand, the null hypothesis that items follow the IRT model was rejected for only a small number of items in all configurations. This indicates that the IRT model fits the real data better. All results reported here are for a confidence level of .95.

Table I. χ²-Goodness-of-Fit Tests: the Number of Rejected Hypotheses (Out of a Total of 49) With Respect to the Number of Groups K

             K = 6   K = 8   K = 10   K = 12   K = 14
IRT, R*_1      4       4       5        5        5
IRT, R*_2      3       3       5        3        3
IRT, R*_3      6       4       7        5        3
IRT, R*_4     11       8       8        7        7
Naive         49      49      49       49       49
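A sketch of the per-item χ² computation for the IRT case, assuming the group attitudes θ_g, the group assignments, and the item's two-parameter logistic parameters have already been estimated (Equation (2) is taken to be the 2PL model used for the synthetic data); scipy supplies the critical value at the .95 level with K − 2 degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def item_chi_square_irt(item_responses, group_ids, theta_g, alpha_i, beta_i):
    """chi^2 for one item: for every group, compare the observed and expected
    counts of users who did and did not disclose the item."""
    stat = 0.0
    for g, theta in enumerate(theta_g):
        in_group = (group_ids == g)
        f_g = in_group.sum()
        p_tilde = item_responses[in_group].mean()                  # observed proportion
        p = 1.0 / (1.0 + np.exp(-alpha_i * (theta - beta_i)))      # expected (2PL)
        q, q_tilde = 1.0 - p, 1.0 - p_tilde
        stat += (f_g * p_tilde - f_g * p) ** 2 / (f_g * p)
        stat += (f_g * q_tilde - f_g * q) ** 2 / (f_g * q)
    return stat

# reject the IRT hypothesis for an item at the .95 level if its statistic
# exceeds chi2.ppf(0.95, df=K - 2)
```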
9.3.2 Sensitivity of Profile Items. In Figure 5 we visualize, using a tag cloud, the sensitivity of the profile items used in our survey. The sensitivity values were evaluated using the EM algorithm (Algorithm 2) with the dichotomous response matrix R*_2 as input. The larger the font used to represent a profile item in the tag cloud, the higher its estimated sensitivity value. It is easily observed that Mother's Maiden Name was the most sensitive item, while Gender, which is located just above the letter "h" of "Mother," has the lowest sensitivity, too small to be visually identified.

[Fig. 5. Sensitivity of the profile items computed using the IRT model with input the dichotomous matrix R*_2. Larger fonts mean higher sensitivity.]

9.4 Comparison of Privacy Scores

The goal of this experiment was to compare the privacy scores obtained using the different scoring schemes. Since scores obtained using different methods are not on the same scale, we compared them using the Pearson correlation coefficient. We show that the IRT model produces more robust privacy scores than the Naive approach.

For this experiment we used the Survey dataset. Using as input the polytomous response matrix R and the methods from Sections 6.1 and 7, we obtained privacy scores Pr Naive and Pr IRT. Also, using as input the dichotomous matrix R*_2 and the Naive and IRT methods, we obtained scores Pr Naive* and Pr IRT*, respectively.

We also computed privacy scores that take into account information about the structure of the users' social networks, using the methodology described in Section 8. Unfortunately, the Survey data consists of the responses of individuals to a set of survey questions, and we are not aware of the underlying social-network structure. However, since we wanted to compare the privacy scores obtained using all the proposed scoring schemes, we constructed an artificial social network G among the respondents of our survey. The network G was constructed as follows: first we formed five clusters of users based on the respondents' geographic locations. These five clusters corresponded to users in North America—West Coast, North America—East Coast, Europe, Asia, and Australia, and consisted of 71, 30, 29, 12, and 11 users, respectively. We added connections between respondents in the same cluster so as to generate a power-law graph among them; for this we used the graph-generation model described in Barabási and Albert [1999]. Finally, we connected the power-law subgraphs that correspond to the clusters by adding random links between nodes in different subgraphs with probability p = 0.01. (A construction sketch is given at the end of this section.) Using graph G, the response matrix R (respectively R*_2), and the methods from Section 8, we computed privacy scores Pr Naive IC and Pr IRT IC (respectively Pr Naive* IC and Pr IRT* IC).

[Fig. 6. Survey data: comparison of privacy scores using the correlation coefficient; darker colors correspond to higher values of the correlation coefficient.]

[Fig. 7. Survey data: average privacy scores (Pr IRT) (a) and average users' attitudes (b) per geographic region.]

Figure 6 shows the values of the Pearson correlation between the privacy scores obtained using the eight aforementioned schemes; darker colors correspond to higher correlation coefficients. Note that the 4×4 submatrix in the top left, which contains the correlation coefficients between the privacy scores computed using IRT, has consistently high correlation values. Thus, the IRT model produced more robust privacy scores. On the other hand, the Naive model was not as consistent; for example, the Pr Naive IC scores seem to be significantly different from all the rest.
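The construction sketch referenced above, using networkx. The Barabási–Albert attachment parameter m and the random seed are our choices for illustration; the text specifies only the power-law model, the cluster sizes, and the inter-cluster link probability p = 0.01.

```python
import itertools
import random
import networkx as nx

def build_survey_graph(cluster_sizes=(71, 30, 29, 12, 11), m=2, p_between=0.01, seed=7):
    """One Barabasi-Albert subgraph per geographic cluster, plus random links
    between nodes of different clusters with probability p_between."""
    rnd = random.Random(seed)
    G = nx.Graph()
    clusters, start = [], 0
    for size in cluster_sizes:
        sub = nx.barabasi_albert_graph(size, m, seed=seed)
        mapping = {v: start + v for v in sub.nodes()}
        G.update(nx.relabel_nodes(sub, mapping))
        clusters.append(list(mapping.values()))
        start += size
    for c1, c2 in itertools.combinations(clusters, 2):
        for u in c1:
            for v in c2:
                if rnd.random() < p_between:
                    G.add_edge(u, v)
    return G

G = build_survey_graph()   # 153 nodes, one per survey respondent
```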
9.4.1 Geographic Distribution of Privacy Scores. Here we present some interesting findings obtained by further analyzing the Survey dataset. We computed the privacy scores of the 153 respondents using the polytomous IRT-based computations (Section 6.1). After evaluating the privacy scores of the individuals using the whole response matrix R as input, we grouped the respondents based on their geographic locations. Figure 7(a) shows the average values of the users' Pr IRT scores per location. The results indicate that people from North America and Europe had higher privacy scores (higher risk) than people from Asia and Australia. Figure 7(b) shows the average users' attitudes per geographic region. The privacy scores and the attitude values are highly correlated. This experimental finding indicates that people from North America and Europe are more comfortable revealing personal information on the social networks in which they participate. This can be a result of either inherent attitude or social pressure. Since online social networking is more widespread in these regions, one can assume that people in North America and Europe succumb to the social pressure to reveal things about themselves online in order to appear "cool" and become popular.

10. CONCLUSIONS

We have presented models and algorithms for computing the privacy scores of users in online social networks. Our methods take into account the privacy settings of users with respect to their profile items as well as their positions in the social network. Our framework uses notions from item response theory and information-propagation models. We described the mathematical underpinnings of our methods and presented a set of experiments on synthetic and real data that highlight the properties of our models and the current trends in users' behavior. We believe that our framework tackles the issue of privacy in online social networking from a new, user-centered perspective and can prove useful in raising users' awareness.

REFERENCES

Ahmad, O. 2006. Privacy management method and apparatus. Patent application U.S. 2006/0047605.
Backstrom, L., Dwork, C., and Kleinberg, J. M. 2007. Wherefore art thou R3579X? Anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on the World Wide Web (WWW). 181–190.
Baker, F. B. and Kim, S.-H. 2004. Item Response Theory: Parameter Estimation Techniques. Marcel Dekker, New York, NY.
Barabási, A.-L. and Albert, R. 1999. Emergence of scaling in random networks. Science 286, 5439, 509–512.
Birnbaum, A. 1968. Some latent trait models and their use in inferring an examinee's ability. In Statistical Theories of Mental Test Scores, F. Lord and M. Novick, Eds. Addison-Wesley, Reading, MA, 397–479.
Fang, L. and LeFevre, K. 2010. Privacy wizards for social media sites. In Proceedings of the International Conference on the World Wide Web (WWW).
Gross, R. and Acquisti, A. 2005. Information revelation and privacy in online social networks. In Proceedings of the ACM Workshop on Privacy in the Electronic Society. 71–80.
Hay, M., Miklau, G., Jensen, D., Towsley, D., and Weis, P. 2008. Resisting structural reidentification in anonymized social networks. Proc. VLDB Endow. 1, 1, 102–114.
Kempe, D., Kleinberg, J. M., and Tardos, É. 2003. Maximizing the spread of influence through a social network. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD). 137–146.
Leonard, A. 2004. You are who you know.
http://dir.salon.com/tech/feature/2004/06/15/social_software_one/index.html.
Liu, H. and Maes, P. 2005. InterestMap: Harvesting social network profiles for recommendations. In Proceedings of the Beyond Personalization Workshop.
Liu, K. and Terzi, E. 2008. Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 93–106.
Mislevy, R. and Bock, R. 1986. PC-BILOG: Item Analysis and Test Scoring with Binary Logistic Models. Scientific Software, Mooresville, IN.
Owyang, J. 2008. Social network stats: Facebook, MySpace, Reunion. http://www.web-strategist.com/blog/2008/01/09/.
Richardson, M., Agrawal, R., and Domingos, P. 2003. Trust management for the Semantic Web. In Proceedings of the International Semantic Web Conference. 351–368.
Ying, X. and Wu, X. 2008. Randomizing social networks: A spectrum preserving approach. In Proceedings of the SIAM International Conference on Data Mining (SDM). 739–750.
Ypma, T. J. 1995. Historical development of the Newton-Raphson method. SIAM Rev. 37, 4, 531–551.
Zhou, B. and Pei, J. 2008. Preserving privacy in social networks against neighborhood attacks. In Proceedings of the 24th International Conference on Data Engineering (ICDE). 506–515.

Received October 2009; revised March 2010; accepted April 2010