Design and Implementation of Distributed Facebook Crawler Based on Interaction Simulation

advertisement
International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 2 - Apr 2014
Design and Implementation of Distributed Facebook Crawler
Based on Interaction Simulation
B.S Satpute, Raj Ambani, RohitRai, JatinMalviya, Vikash Raj
D.Y Patil Inst. Of Engg.& Tech.
Pune University
Abstract- The increasing use of Online Social
Networks (OSNs) has attracted such wide
attention of people that OSNs have become a hot
topic. There are already many systems previously
designed as Facebook crawler. But these systems
have many limitations on their working. These
systems were able to extract only friend upto 400.
This drawback has been overcome in thecurrently
designed system. On the basis of interaction
simulation, we have designed and implemented a
Facebook crawler, which can obtain the complete
friend list of a Facebook user and overcome the
drawback in the previously designed Facebook
Crawlers. Two popular approaches that can
produce approximately uniform samples are the
Metropolis-Hasting random walk (MHRW) and a
re-weighted random walk (RWRW). Both have
their own pros and cons depending on their use.
In comparison, if we use Breadth-First-Search
(BFS) or an unadjusted Random Walk (RW) leads
to substantially biased results.We make an
analysis and visualization of the crawled dataset,
which implies that the awareness of privacy
protection of Facebook users has been greatly
improved.
Keywords:
Facebook,
Social
Networking,
Clustering, Crawler, Analysis, Privacy, URL.
I. INTRODUCTION
Today online social networking sites (OSN) has
become an important part of life. People are using
these sites to connect with their friends and families.
So this is obvious that people mentally and
emotionally feel that they are very close to each other
in spite they are sitting far away from each other.
Now a person can get registered on the same site as
their friends and can upload his videos or photos
which can be seen by their friend and family
members. Not only that they can also share their
emotion related to those pictures.People use these
facilities to come closer even when they are sitting in
different countries or regions.
The power of online networks lies in the fact that
they are organized around their users, thus making it
possible to utilize user interconnectedness in order to
ISSN: 2231-5381
reach large audiences at a relatively low cost. The
social relations that exist between people can be
established and maintained through social networking
sites such as MySpace and Facebook [2]. Among all
OSN Facebook is the most famous OSN. Facebook
reported having 1.2 billion monthly active users on
January 2014, globally. This has aroused a great deal
of interest in social media, and especially in
Facebook as a marketing tool: in 2011 76 percent of
companies reported that they planned to strengthen
their presence on Facebook [1].
In January 2013, the countries with the most
Facebook users were:
United States with 168.8 million members
Brazil with 64.6 million members
India with 62.6 million members
Indonesia with 51.4 million members
Mexico with 40.2 million members
Facebook is having most complex privacy policy and
also it is updating its updating frequently. Many
researches have been made to check through its
privacy policy and users motivations but as it
changes frequently, those research no longer suits for
current timeline model of Facebook. Thus we need
regular research for studying this famous OSN. So
we have chosen this as our object for study.
Previously for study of Facebook no distributed
crawler has been used. They implemented crawler in
very simple way. No graph generation algorithm was
used to show the mutuality between friends. And also
they didn't make any observation of level of
mutuality between two friends.
Our paper will make following contribution to the
study of OSN:
First, we will design a crawler which is no longer
limited to maximum of 400 friends.
Second, we are using distributed crawler for
extracting data with a combination of interaction
simulation to extract more data in less time.
Third, we will collect data from Facebook and will
create a database for our analysis. We will make
following analysis from the database:- We will check user's privacy setting to check
whether what % of user is aware of their privacy.
http://www.ijettjournal.org
Page 85
International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 2 - Apr 2014
- We will crawler the complete friend list of users
and will check mutuality level of friends and will
create a graph for that.
We are also discussing following points in next
sections
Literature Survey.
Proposed methodology.
Result.
Future scope & conclusion.
II. LITERATURE SURVEY
Bonneau et al. has researched on Facebook crawler
he also took a survey and used many methods for
abstracting personal data from Facebook using fake
account, phishing, infected application and query
language. Where asin terms of data collection
aspects, the work by Gjoka et al. [3] on social
networking (in particular on Facebook) is closely
related to our work.
As we all know the security and access
control mechanisms is improved continuously in
Facebook, due to that getting personal data of any
other account in Facebook has been more difficult
than ever. Korolova et al has examined attacks on
privacy on social networks where all personal
information about the friends of a user is not made
accessible.
User dataset mining in [6] was done through public
listing in [4]. However data procurement of users has
become very tough due to continuously enhanced
access control mechanisms and privacy policies of
Facebook. The method of public listings in [4] has
failed in present scenario. Although crawlers and API
can be used to still obtain user data from OSNs sites.
But API proves to be unable to obtain the friend list
of users’ friends by unauthorized developers for
Facebook, due to its strict access restrictions to use
API. The Facebook Query Language in [4] also failed
as Facebook Query Language is one of the Facebook
API. Although the crawlers can be used to obtain
user data for scientific research despite strict access
and privacy policies. This paper is aimed to study
privacy issues and network structure of OSNs and
using crawlers as the main approach to obtain user
data from Facebook.
Further studies were conducted by [5, 3, 6] on
crawlers for OSNs. Two new unbiased crawl
strategies: Metropolis- Hasting random walk
(MHRW) and a re-weighted random walk (RWRW)
were proposed by [3] and social network parallel
crawler was modeled by [6]. Although the work were
innovative they lacked the detailed implementation
and technical challenges of the crawler. However [5]
described as well as drew comparisons on the
detailed implementation of a social network crawler
using two crawling strategies: the breadth-first search
ISSN: 2231-5381
and uniform sampling to run the crawler on
Facebook. However, the crawler in [5] could only
retrieve at most 400 friends of a user as Facebook
limited at most 400 friends per friend webpage which
was a major drawback. Nowadays, Facebook has
strengthened access control policy and modified its
privacy policy as a step to curb access by crawlers,
limiting each friend webpage to return at most 60
friends. In this paper, based on interactive simulation,
we have designed and implemented the crawler
algorithm, which can obtain all the friend list of a
Facebook user.
III. PROPOSED METHODOLOGY
OSNs can be represented as graphs where nodes
represent users and edges represent connections.
Facebook,in particular, is characterized by a simple
friendship schema, so as it is possible to represent its
structure through an unweighted,undirected graph.In
order to understand if it is possible to study OSNs
without having access to their complete dataset (e.g.
Facebook),we adopted two different methodologies
to gather partial data (i.e. a sub-graph of the
Facebook complete graph).Our purpose is to acquire
comparable samples of the Facebook graph when
applying different sampling methodologies.Once
collected, informationare analyzed using tools and
techniques provided by the SNA. We alsoinvestigate
the scalability of this problem, evaluating several
properties of different graphs and sub-graphs,
collected with different sampling approaches. Then,
we compare measures calculated applying SNA
techniques with statistical information provided by
Facebook itself in order to validate the reliability of
collected data. Moreover, highlighting similarities
and differences between results obtained through this
study and similar ones, conducted in past years, is
helpful to understand the evolution of the
socialnetwork over the time. Finally, this
methodology starts from the assumption that, to gain
an insightinto an OSN, we take a snapshot of (part of)
its structure. Process of collection of data require
time,during whichitsstructure could slightly change,
but, assuming the small amount of time, with respect
to the whole life of the social network, required for
gathering information, this evolution can be ignored.
Even though, during the data cleaning step we take
care of possible information discrepancies.
We design and implement a multi-threaded
crawler using Java based on the interaction
simulation. The crawler can obtain the complete
friend list of users. The crawler overcomes the
drawback that it can get only 400friends of users. The
detailed algorithm is shown in Algorithm 1. The
crawler uses the strategy of breadth-first search and
works as follows: First, the crawler logs in
automatically with a real Facebook user credential,
http://www.ijettjournal.org
Page 86
International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 2 - Apr 2014
and then visits the seed profile and its friend
webpage. Second, friend lists from the webpage are
extracted through regular expression and are inserted
in the queue to be visited. Finally, the crawler visits
the profile in the queue to be visited in turn by FIFO
(First In and First Out), and cycle the process. In
Facebook due to the access control mechanism, each
friend webpage returns at most 60 friends. Therefore,
in order to visit the complete friend list, the crawler
must simulate the normal interaction behavior of a
user with Facebook. Through analyzing the traffic
between the browser and Facebook captured by
HttpFox, a plugin for Firefox, we find that the
interaction of the Ajax code embedded in the user's
friend page. By simulating the interactive behavior of
the Ajaxcode with Facebook, we can get the
complete list of friends, which is done through
multiple extractions and get a list of at most 60
friends each time. The experiment result in section
4.2 proves the algorithm correctness. In this paper,
the crawler focuses on obtaining the social graph, i.e.,
the friend links. User privacy is protected by
collecting only the user ID and by not collecting user
name, photos, other personal information or
interaction information among users.
1. Open the Facebook webpage;
2. Enter real user login and password ;
3. Consider this account as a seed profile,Visit the profile
and its friend webpage;
4. Step 1:
5. get the friendNo; {Extratct the number of friends of a
user from its friends web page;}
6. If friendNo % 60 != 0 NoofExtract = (friendNo friendNo %
60 + 60) / 60;
7. For (i = 2; i <NoofExtract; i++)
8. Create a URI; {directing to a friend webpage
including friend between 60*(i-1) and 60*i
}
9. Insert the URI into the queue to be visited;
10. Go to step 3;
11. Step 2:
12. Visit the URI; {directing to a friend webpage
including maximum of 60 friends in queue which are to be
visited}
13. Extract the friend list through regular
expression;
14. Insert the profile in the queue to be visited;
15. Step 3:
16. Visit the next item in the queue to be visited;
17. If the item is a profile
18. Visit the profile and his friend webpage;
19. Go to step 1 until the crawl ends;
20. If the item is a URI {directing to a friend list
webpage including at most 60 friends}
21. Go to step 2 until the crawl ends;
ISSN: 2231-5381
clustering algorithm
Step 0: Start
Step 1: Get crawled tree
Step 2: Identify the all parent node
Step 3: create cluster bucket for parent node
Step 4: add all the parent node into each cluster
bucket vector
Step 5: then add all the children node of the parent
node into its Respective vector
Step 6: Stop
IV. RESULT
A. This paper used a notebook with Intel i5
CPU, 6G RAM, and broadband Internet access at a
rate of 1Mb/s to validate the algorithm, and had been
running the crawler from March 10, 2014 to March
31, 2014. A dataset of a total of 1000 users was
logged into. We encrypted the user ID with MD5
function in order to protect the user privacy after data
collection.
B. we make visualization on the dataset
crawled from Facebook after initial data analysis.
First, we use the open source software, NodeXL, to
make visualization on a subset of the collected
dataset, including a visualization of 1000 users,
10000 users, 2000 users and 3000 users subgraph.
Formally, a social network is an undirected graph G
= <V, E>, where V is the set of nodes (users) and E is
the set of edges (links). General metrics on OSN
structure include degree, clustering coefficient,
diameter etc. In this paper, we use the functions of
SNAP (Stanford Network Analysis Platform) for the
data analysis.
According to the Facebook statistics, each
user in Facebook has more than 120 friends on an
average. However, the result in Table 1 is only 2.59.
Moreover, the average clustering coefficient in Table
1 is 0.04, but the result in [8] is 0.16. Thus, we can
conclude that the dataset crawled from Facebook is
biased. The main reason is that breadth-first search
used in this paper is a biased crawling strategy [18],
which always skews towards the users with larger
degree.
No. of distinct visited user
No. of distinct dis. Neighbors
No. of distinct edges
Avg. diameter
544,456
4,144,526
5,244,548
4.65
Crawling period
10-03-2014 to
31-03-2014
TABLE 1: THE RESULT OF DATA ANALYSIS
http://www.ijettjournal.org
Page 87
International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 2 - Apr 2014
C. in the Privacy Setting we found that only
500 users' friend list could be seen by public in the
total 1000 users crawled, which shows that 36.5% of
the Facebook users changed the default privacy
setting and set the user's friend list invisible to
strangers.,The statistics has increased by up to 20.5%
in comparison to 26.6% and 16%, the results
respectively, thus indicating that the privacy
protection amongst the Facebook users has been
greatly improved.
Advertisement companies can use these data in
different manner. They can advertise on Facebook
according to choice of a user. Based on which a
production company can change their product
specification and can add new features.
Crawled data can also be used for mood detection by
Government. Thus the can prevent bad activities
going to happen in society.
D. Summary In the 1000 users whose friend
lists are visible to the public, we find that 700 users
have more than 50 friends, accounting for 8.1%; 172
users have more than 130 friends, accounting for
5.0%. Two facts have been proven: First, the
complete friend list can be obtained by our crawler;
second, most of the user's friend list is not crawled
completely and is still in the queue to be visited, due
to the biased crawl strategy and the crawl time
limitation.
1: Social Media Examiner, “Social Media Marketing
Report 2011”, 2011. Retrieved November 8th 2011, from
http://www.socialmediaexaminer.com/SocialMedia
MarketingReport2011.pdf
REFERENCES:
2. J. J. S. Huang, S. J. H. Yang, Y. M. Huang, and I. Y. T. Hsiao,
Social learning networks: build mobile learning networks based on
collaborative services, Educational Technology & Society,
vol.3(3), Jul. 2010, pp. 78–92, doi: 10.1.1.174.1095
3. M. Gjoka, M. Kurant, C. Butts, and A. Markopoulou.
Walking in facebook: a case study of unbiased sampling of OSNs.
In Proceedings of the 29thconference on Information
communications, pages 2498{2506. IEEE Press, 2010.
4. Joseph Bonneau, Jonathan Anderson, George Danezis, Prying
Data out of a Social Network, Proceedings of the 2009
International Conference on Advances in Social Network Analysis
and
Mining,
p.249-254,
July
20-22,
2009
[doi>
10.1109/ASONAM .2009.45]
V. CONCLUSION
In conclusion, we implement a crawler for Facebook
based on interaction simulation and distributed
crawler, which can extract friends list of Facebook at
increased rate as previous crawler. We also overcome
the drawback in [4] that crawler cannot extract
friends list of more than 400 friends of a user. We
extracted data from Facebook and stored on cloud.
We hosted our crawler on google app engine so that
it can be invoked from any place in the world. To
gather data. We implemented distributed crawler
which can accept more than one seed URL and can
extract data using multi-crawling and gather data in
single data base. We analyzed gather data to check
awareness of users and found that people are more
conscious today. We also gather data about frequent
users and generated graph for crawled users.
5. Catanese, S., Meo, PD, Ferrara, E., Fiumara, G., and Provetti, A.
Crawling Facebook for Social Network Analysis Purposes. WISM
2011.
6. D. Chau, S. Pandit, S. Wang, and C. Faloutsos. Parallel crawling
for online social networks.In Proceedings of the 16th international
conference on World Wide Web, pages 1283 {1284.ACM, 2007.
7. http://wikipedia.org
8. https://google.com
VI. FUTURE SCOPE
Facebook crawler can be used for collecting user data
from Facebook. Analysis of these information can
provide us various types of information.
We can find information about one’s family
members. Crawling through a user profile can
provide us his area of interest and choices.
An organization can analyze data for their own
purpose. They can collect user’s qualification
information, whether he is employed or unemployed.
Based on which they can make a job request to user.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 88
Download