IJRISE_Paper_pran

advertisement
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
FINDING ‘K’ INFLUENCE SEED USERS IN MULTIPLE ONLINE
SOICAL NETOWRK USING CLUSTERING
Praneetha G N, Post Graduate Student, Dept of CS&E, SCE Bangalore, India.
Mrs Ragini Krishna, Assistant. Prof , Dept of CS&E , SCE Bangalore, India.
Mrs Kavitha S, Assistant. Prof , Dept of CS&E , SCE Bangalore, India.
Dr. Prashanth C M, Prof & HOD, Dept of CS&E, SCE Bangalore, India.
ABSTRACT
The advent of Online Social Network (OSN) has been one of the most exciting events in this decade. Many
popular OSN such as Facebook, Twitter, Linkedln and Flickr have become popular. People are becoming more
interested in online social network and they depend on the social network for many purposes such as finding
opinions of other people about any product, movie etc. Influence propagation has become an important
mechanism for viral marketing. This further motivates the researchers to carry out extensive studies on various
aspects of the influence propagation problem. Influence Maximization is the problem of finding a small set of
influential users in the online social network so that their influence in the social network is maximized. There are
many diffusion models like Linear Threshold Model and Independent Cascade Model that are used to find the
maximum influential user in online social network. And many modification and upgradation are made to these
diffusion models based on scalability and efficiency issues. But, clustering technique has not been used to find
the influential user. In this paper, we present a clustering approach in order to find the minimum seed users for
Influence Maximization. Fuzzy k means algorithm is used to find out the influential users in multiple online
social networks. This approach reduces processing time when compared to diffusion models. Diffusion model
processing time increase exponentially as network size increases. Fuzzzy k means algorithm processing time
increases linearly as network size increases.
Keywords: Influence Maximization, Multiple Online Social Network, Fuzzy k means algorithm..
----------------------------------------------------------------------------------------------------------------------------1. INTRODUCTION
The advent of Online Social Network (OSN) has been one of the most exciting events in this decade.
Many popular OSN such as Facebook, Twitter, Linkedln and Flickr have become increasingly popular. These
networks are extremely rich in content and linkage data which can be analyzed. The linkage data is essentially the
graph structure of social network and the communication between nodes, whereas the content data contains the text,
images and other multimedia data in social network. The richness of this network provides opportunities for data
analysis in context of Online Social Network.
There are several factors due to which the OSN has gained importance by researchers[1]. Some of the
factors are availability of social data that are vast, distributed, noisy and dynamic. There are some research issues
with respect to mining the social network sites using data mining techniques. One of the issues is Influence
Propagation.
In many markets, customers are strongly influenced by the opinions of their friends. Influence propagation
has become an important mechanism for viral marketing. This further motivates the researchers to carry out
extensive studies on various aspects of the influence propagation problem. Influence Maximization problem is a
problem of finding a small set of nodes that maximizes the spread of influence.
Influence Maximization problem was first studied by Domingo’s and Richardson[2] and proposed first
algorithm for influence propagation. Then, Kempe et al.[3] gave two fundamental propagation models, named
IJRISE| www.ijrise.org|editor@ijrise.org
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
Independent Cascade (IC) Model and Linear Threshold (LT) Model. Many other researchers extended this basic
propagation models in terms of scalability and efficiency. But most of the works focussed on a single online social
network whereas users now often are found in more than one social network. Dung T. Nguyen et. Al [10] proposed
an algorithm to handle this problem. We introduce a new approach to solve Influence Maximization problem. Fuzzy
k means algorithm is used to find the minimum seed users in multiplex online social network so that the influence is
maximized.
2. RELATED WORK
2.1 Diffusion Models
Then, Kempe et al. [3] studied influence propagation by focusing on two fundamental propagation models,
named Independent Cascade (IC) Model and Linear Threshold (LT) Model. Here, a social network is modelled as a
graph with nodes representing individuals and edges representing connections or relationship between two
individuals. Influences are propagated in the network according to the model, such as the independent cascade (IC)
model. Kempe et al. proved that the optimization problem is NP-hard, and present a greedy approximation
algorithm guaranteeing that the influence spread is within the optimal influence spread.
In LT model, a node v is influenced by each neighbour w according to a weight bv, w such that
∑
b v,w ≤ 1
……. Eq.1
w active neighbor of v
Step 1: Each node chooses a threshold from the interval [0,1].
Step 2: Initially A0 is a set of nodes that are active.
Step 3: All nodes that were active in step 2 remains active, and these active nodes tries to activate its neighbour
nodes. This is according to the equation give below
∑
b v,w ≥ θv
………..Eq.2
w active neighbor of v
In IC Model, we again start with an initial set of active nodes A0, and the process unfolds in discrete steps
according to the following randomized rule. When node v first becomes active in step t, it is given a single chance to
activate each currently inactive neighbour w. If v succeeds, then w will become active in step t+1; but whether or
not v succeeds, it cannot make any further attempts to activate w in subsequent rounds. Again, the process runs until
no more activation is possible.
The main disadvantage of the algorithm is that it was less efficient.
2.2 Heuristic Algorithm
Chen et al. proposed a new propagation model similar to the greedy algorithm[4,5] but with a better
efficient result. Scalability problem of IC Model is addressed by proposing a new heuristic algorithm. The main idea
of heuristic scheme is to use local arborescence structures of each node to approximate the influence propagation.
Maximum influence paths (MIP) are computed between every pair of nodes in the network via a Dijkstra shortestpath algorithm, and ignore MIPs with probability smaller than an influence threshold. The union of the MIPs
starting or ending at each node into the arborescence structures, which represent the local influence regions of each
node, is constructed. Influence propagated through this local arborescence is only considered, and this model is
referred as the maximum influence arborescence (MIA) model.
Chen et al. proposed the first scalable heuristic algorithm for influence maximization in the LT model which
was referred as LDAG algorithm [6]. They constructed a local DAG for every node in the network and restricted the
influence for the node to be within the local DAG structure. To construct local DAG, they proposed a fast greedy
algorithm where the nodes were added one by one to local DAG such that the influence of these individual nodes is
greater than θ.
IJRISE| www.ijrise.org|editor@ijrise.org
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
2.3 Multiple Online Social Networks
Existing works focused only on single online social network while users nowadays are found in several
OSNs. Thus, it is important to study the influence in multiple social networks. Dung T. Nguyen et.al [10] proposed
a method to handle this problem. Multiple OSN network are combined into one network while preserving the
influential properties of the original network. After coupling the networks, LT model was being used to find the
influential users. A new metric named Influence Relay was introduced which was used to study the flow of
influence between the social networks.
2.4 Interest matching users
Most of the related works focused only on the topology of the network but ignored the factors such as users
interest in the information that is being propagated. Yilin Shen et.al [9] proposed a new method Total Seed Nodes
Learning (TSNL) to capture these interest-matching users whose interest match with each other. First, a new
network was constructed from k multiple networks using network coupling. Then, interest matching users were
found out using semi-supervising learning and then minimum influential users were found out based on idea called
Iterative Semi-Supervised Learning.
In this paper, we try to solve the Influence Maximization problem from different perspective i.e, using
clustering technique. Fuzzy k-means algorithm is used to find the minimum seed users for influence maximization
in multiple online social networks. LT and IC Model processing time increases exponentially as social network size
increases. But, Fuzzy k-means algorithm reduces the processing time. Processing time increases linearly as network
size increases.
3. METHODOLOGY
1.
2.
3.
The solution to the Influence Maximization using clustering includes four modules:
OSN Combiner
Data Aggregator
Fuzzy k-means Clusterer
1. OSN Combiner:
The aim of combining the social network is to reinforce the effect of crossing users. Crossing users refers
to those users who appear in more than one network. If a link appears in more than one network, then the weight of
the link in coupled network will be sum of all its weights on these networks.
2. Data Aggregator
After combining multiple online social networks into one, for each node in the network, following things
are calculated: Total outgoing edge, total outgoing weight, total incoming edge, total incoming weight and last
active time. Time Score for each user in the network is calculated to know how much s/he is active in the network.
Time Score can be calculated using the formula:
Time Score= c*[(LA-RMT) / (Now-RMT)]
Where, LA refers to Last Active time of the user, RMT refers to Recent Message Timestamp and NOW refers to
the current time. The value of Time Score will be between 0 and 1, 0 means the user is not active for long time and 1
means the user is very active. Since the value is relatively small compared to other variables like number of edges,
weights etc of the user, this value can be multiplied by constant ‘c’ to all users.
3. Fuzzy k-means clusterer
In k-means clustering, data element belongs to only one cluster. In fuzzy k-means clustering, data elements
can belong to more than one cluster. Each data element will be associated with a set of membership matrix which
indicates the strength of the association between data element and a particular cluster. Fuzzy k means algorithm
works as following:
Step 1: Choose a number of clusters
IJRISE| www.ijrise.org|editor@ijrise.org
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
Step 2: Assign randomly to each point coefficients for being the cluster.
Step 3: Repeat the algorithm until the co-efficient change between two iteration does not exceed sensitivity
threshold ε.
Step 4: Compute the centroid of cluster.
Step 5: Compute the membership matrix
Initially, number of overlapping users will be number of clusters. Here, co-efficient will be time score that
is being calculated. Considering overlapping users as cluster head, the association
between the users and overlapping user is calculated. Membership matrix gives the association between users and
overlapping users.
Then, membership values with respect to each cluster are calculated by summing up the membership value
of each user in that cluster. Then top k clusters are selected having high membership value. This represents the top k
overlapping user who has high influence in the network. Figure 2 shows the system architecture.
OSN #1
OSN #2
Relationship
Data
OSN #N
………………
Relationship
Data
Relationship
Data
OSN Combiner
Combined
OSN
Relationship
Data
Common User
List
Data
Aggregator
Data
Seed
Fuzzy K-Means
Clustering
Iterations
(Usually 1)
Degree of Association Matrix:
Nodes
Cluster A
Cluster B
n
a
0.3
0.2
b
0.5
0.05
1
0.1
0.3
…
…
Total
5.3
7.9
Required no.of Seed
Users
Cluster D
0.2
0.04
0.6
0.1
0.3
0.1
0.2
0.21
0.0
4.6
3.1
1.8
Get Top n Clusters having
High Degree of Associativity
IJRISE| www.ijrise.org|editor@ijrise.org
…
Cluster C
Cluster
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
4.RESULTS
Snapshot 1: User to Cluster Associativity Matrix
The above snapshot represents the User to Cluster Associativity Matrix. Here, the number of seed users to be found
from OSN is entered as 10. Hence 10 clusters are formed. With respect to each cluster, the user to cluster
associativity is found out as shown in the above figure.
Snapshot 2: Finding Seed User
The above snapshot represents finding seed users. With respect to each cluster, the overlapping user who is
almost near to the cluster center is considered as seed user for that particular cluster. Seed User ID for each cluster
and its User Association with respect to that cluster is shown. Cluster Average Association for each cluster is also
calculated to rank the seed Users based on this parameter.
5. CONCLUSION
The study of Influence Maximization is being studied from years and solutions like Independent Cascade
(IC) model and Linear Threshold (LT) Model was proposed. But this model neglected the effect of active users in
the network and processing time of these model increased exponentially as network size increases. Hence, we
proposed a new approach towards Influence Maximization in multiple online social networks i.e, using clustering
technique. Fuzzy k means clustering is being used to find the influential seed user from multiple online social
networks. Processing time increases linearly as network size increases. Hence, processing time is reduced when
compared to the diffusion models like Independent Cascade (IC) model and Linear Threshold (LT) model. A new
metric called ‘Timescore’ is introduced which specifies the how much user is active in the network.
IJRISE| www.ijrise.org|editor@ijrise.org
International Journal of Research In Science & Engineering
Volume: 1 Issue: 1
e-ISSN: 2394-8299
p-ISSN: 2394-8280
ACKNOWLEDGEMENT
I am thankful to “Mrs. Ragini Krishna” for their valuable advice and support extended to us without which i
could not have been able to complete the paper. I express deep thanks to Dr. Prashanth C M, Head of Department
(CS&E) for warm hospitality and affection towards me. I thank the anonymous referees for their reviews that
significantly improved the presentation of this paper. Words cannot express our gratitude for all those people who
helped directly or indirectly in my endeavor. I take this opportunity to express my sincere thanks to all staff
members of CS&E department of SCE for the valuable suggestion.
REFERENCES
[1]. Kempe, J. M. Kleinberg, and ´E. Tardos, “Maximizing the spread of influence through a social network,”
in Proc. Of the 9th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’03), 2003,
pp.137-146.
[2]. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. S. Glance, “Cost-effective outbreak
detection in networks,” in Proc. of the 13th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining (KDD’07), 2007, pp. 420–429
[3]. W. Chen, Y. Wang, and S. Yang, “Efficient influence maximization in social networks”, in Proc. of the 15th
ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’09), 2009.
[4]. W. Chen, Y. Yuan, and L. Zhang, “Scalable influence maximization in social networks under the linear
threshold model”, in Proc. of the 10th IEEE Int. Conf. on Data Mining (ICDM’10), 2010..
[5]. Y.Shen, T.N. Dinh, H. Zhang, and M. T. Thai. “Interest-matching information propagation in multiple
online social networks in CIKM, 2012.
[6]. Dung T. Nguyen, Huiyuan Zhang, Soham Das, My T. Thai, “Least cost influence in multiplex social
networks: Model Representation and Analysis, 2012 IEEE 13 th International Conference on Data Mining.
IJRISE| www.ijrise.org|editor@ijrise.org
Download