International Journal of Engineering Trends and Technology (IJETT) – Volume 10 Number 2 - Apr 2014

Design and Implementation of Distributed Facebook Crawler Based on Interaction Simulation

B.S Satpute, Raj Ambani, Rohit Rai, Jatin Malviya, Vikash Raj
D.Y Patil Inst. Of Engg. & Tech., Pune University

Abstract- The growing use of Online Social Networks (OSNs) has attracted so much attention that OSNs have become a hot topic. Several Facebook crawlers have already been designed, but they have significant limitations: they could extract at most 400 friends of a user. The system presented here overcomes this drawback. On the basis of interaction simulation, we have designed and implemented a Facebook crawler that can obtain the complete friend list of a Facebook user. Two popular approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and the re-weighted random walk (RWRW); each has its own pros and cons depending on its use. In comparison, Breadth-First Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. We analyze and visualize the crawled dataset, and the results imply that Facebook users' awareness of privacy protection has greatly improved.

Keywords: Facebook, Social Networking, Clustering, Crawler, Analysis, Privacy, URL.

I. INTRODUCTION

Today, online social networking (OSN) sites have become an important part of life. People use these sites to connect with their friends and families, and as a result they feel mentally and emotionally close to each other even when they are far apart. A person can register on the same site as their friends and upload videos or photos that can be seen by friends and family members.
They can also share the emotions related to those pictures. People use these facilities to stay close even when they live in different countries or regions. The power of online networks lies in the fact that they are organized around their users, thus making it possible to utilize user interconnectedness in order to reach large audiences at a relatively low cost. The social relations that exist between people can be established and maintained through social networking sites such as MySpace and Facebook [2].

ISSN: 2231-5381

Among all OSNs, Facebook is the best known. Facebook reported having 1.2 billion monthly active users globally in January 2014. This has aroused a great deal of interest in social media, and especially in Facebook as a marketing tool: in 2011, 76 percent of companies reported that they planned to strengthen their presence on Facebook [1]. In January 2013, the countries with the most Facebook users were:
- United States with 168.8 million members
- Brazil with 64.6 million members
- India with 62.6 million members
- Indonesia with 51.4 million members
- Mexico with 40.2 million members

Facebook has a very complex privacy policy, which it updates frequently. Many studies have examined its privacy policy and its users' motivations, but because the policy changes so often, those studies no longer match the current timeline model of Facebook. Regular research is therefore needed to study this OSN, which is why we have chosen it as our object of study.

Previous studies of Facebook did not use a distributed crawler; their crawlers were implemented in a very simple way. No graph generation algorithm was used to show the mutuality between friends, and no observation was made of the level of mutuality between two friends. Our paper makes the following contributions to the study of OSNs: First, we design a crawler that is no longer limited to a maximum of 400 friends.
Second, we use a distributed crawler combined with interaction simulation to extract more data in less time. Third, we collect data from Facebook and build a database for our analysis. We make the following analyses from the database:
- We check users' privacy settings to determine what percentage of users are aware of their privacy.
- We crawl the complete friend lists of users, check the mutuality level of friends, and create a graph of it.

http://www.ijettjournal.org Page 85

The following sections cover the literature survey, the proposed methodology, the results, and the conclusion and future scope.

II. LITERATURE SURVEY

Bonneau et al. [4] studied Facebook crawling; they conducted a survey and used several methods for extracting personal data from Facebook, including fake accounts, phishing, infected applications, and the query language. In terms of data collection, the work by Gjoka et al. [3] on social networking (in particular on Facebook) is closely related to our work. Because Facebook continuously improves its security and access control mechanisms, obtaining the personal data of another account is more difficult than ever. Korolova et al. examined privacy attacks on social networks in which not all personal information about a user's friends is accessible. The user dataset in [6] was mined through the public listings described in [4]. However, obtaining user data has become very difficult because of Facebook's continuously enhanced access control mechanisms and privacy policies, and the public-listing method in [4] no longer works in the present scenario. Crawlers and the API can still be used to obtain user data from OSN sites.
However, the Facebook API does not allow unauthorized developers to obtain the friend lists of a user's friends, owing to its strict access restrictions. The Facebook Query Language approach in [4] also fails, because the Facebook Query Language is part of the Facebook API. Crawlers, by contrast, can still be used to obtain user data for scientific research despite the strict access and privacy policies. This paper aims to study the privacy issues and network structure of OSNs, using crawlers as the main approach to obtain user data from Facebook.

Further studies on crawlers for OSNs were conducted in [5, 3, 6]. Two unbiased crawl strategies, the Metropolis-Hastings random walk (MHRW) and the re-weighted random walk (RWRW), were proposed in [3], and a parallel social network crawler was modeled in [6]. Although this work was innovative, it did not describe the detailed implementation and technical challenges of the crawler. In contrast, [5] described and compared the detailed implementation of a social network crawler run on Facebook with two crawling strategies: breadth-first search and uniform sampling. However, the crawler in [5] could retrieve at most 400 friends of a user, because Facebook limited each friend webpage to at most 400 friends; this was a major drawback. Facebook has since strengthened its access control policy and modified its privacy policy to curb access by crawlers, limiting each friend webpage to return at most 60 friends. In this paper, based on interaction simulation, we have designed and implemented a crawler algorithm that can obtain the complete friend list of a Facebook user.

III. PROPOSED METHODOLOGY

OSNs can be represented as graphs where nodes represent users and edges represent connections.
Facebook, in particular, is characterized by a simple friendship schema, so its structure can be represented as an unweighted, undirected graph. In order to understand whether it is possible to study OSNs (e.g. Facebook) without having access to their complete dataset, we adopted two different methodologies to gather partial data (i.e. a sub-graph of the complete Facebook graph). Our purpose is to acquire comparable samples of the Facebook graph when applying different sampling methodologies. Once collected, the information is analyzed using tools and techniques provided by SNA. We also investigate the scalability of this problem, evaluating several properties of different graphs and sub-graphs collected with different sampling approaches. Then, we compare measures calculated by applying SNA techniques with statistical information provided by Facebook itself in order to validate the reliability of the collected data. Moreover, highlighting similarities and differences between the results of this study and those of similar studies conducted in past years helps to understand the evolution of the social network over time.

Finally, this methodology starts from the assumption that, to gain insight into an OSN, we take a snapshot of (part of) its structure. Collecting the data takes time, during which the structure could change slightly; however, because the time required for gathering information is small compared with the whole life of the social network, this evolution can be ignored. Even so, during the data cleaning step we take care of possible information discrepancies.

We design and implement a multi-threaded crawler in Java based on interaction simulation. The crawler can obtain the complete friend list of users, overcoming the drawback of earlier crawlers, which could get only 400 friends of a user. The detailed algorithm is shown in Algorithm 1.
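The paging arithmetic behind a complete friend list can be illustrated with a small Java sketch. It assumes the limit of 60 friends per friend webpage noted above, replaces real Facebook requests with an in-memory map, and combines ceiling-division paging with a FIFO breadth-first queue. All names (`PagedBfsCrawlerSketch`, `fetchPage`, `completeFriendList`, `crawl`) are hypothetical; this is an illustration under those assumptions, not the crawler's actual code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

/** Sketch of paged friend-list extraction combined with breadth-first traversal. */
class PagedBfsCrawlerSketch {
    static final int PAGE_SIZE = 60; // per-page limit reported in the text

    /** Number of page requests needed to cover friendCount friends (ceiling division). */
    static int pagesNeeded(int friendCount) {
        return (friendCount + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    /** Hypothetical stand-in for one page request: friends in [60*(page-1), 60*page). */
    static List<String> fetchPage(Map<String, List<String>> network, String user, int page) {
        List<String> all = network.getOrDefault(user, List.of());
        int from = Math.min(all.size(), PAGE_SIZE * (page - 1));
        int to = Math.min(all.size(), PAGE_SIZE * page);
        return all.subList(from, to);
    }

    /** Collects the complete friend list of one user by requesting every page in turn. */
    static List<String> completeFriendList(Map<String, List<String>> network, String user) {
        int friendCount = network.getOrDefault(user, List.of()).size();
        List<String> friends = new ArrayList<>();
        for (int page = 1; page <= pagesNeeded(friendCount); page++) {
            friends.addAll(fetchPage(network, user, page));
        }
        return friends;
    }

    /** Breadth-first crawl from a seed; visits at most maxUsers profiles in FIFO order. */
    static List<String> crawl(Map<String, List<String>> network, String seed, int maxUsers) {
        Queue<String> toVisit = new ArrayDeque<>();
        Set<String> seen = new LinkedHashSet<>();
        List<String> visited = new ArrayList<>();
        toVisit.add(seed);
        seen.add(seed);
        while (!toVisit.isEmpty() && visited.size() < maxUsers) {
            String user = toVisit.remove();
            visited.add(user);
            for (String friend : completeFriendList(network, user)) {
                if (seen.add(friend)) { // enqueue each profile only once
                    toVisit.add(friend);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> manyFriends = new ArrayList<>();
        for (int i = 0; i < 130; i++) manyFriends.add("u" + i);
        Map<String, List<String>> network =
                Map.of("seed", manyFriends, "u0", List.of("seed", "u1"));
        System.out.println(pagesNeeded(130));
        System.out.println(completeFriendList(network, "seed").size());
        System.out.println(crawl(network, "seed", 3));
    }
}
```

The `seen` set keeps a profile from entering the queue twice, which matters on an undirected friendship graph where every link is discovered from both ends.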
The crawler uses the breadth-first search strategy and works as follows. First, the crawler logs in automatically with a real Facebook user credential and then visits the seed profile and its friend webpage. Second, friend lists are extracted from the webpage through regular expressions and inserted into the queue to be visited. Finally, the crawler visits the profiles in the queue in FIFO (First In, First Out) order and repeats the process.

Owing to Facebook's access control mechanism, each friend webpage returns at most 60 friends. Therefore, in order to visit the complete friend list, the crawler must simulate the normal interaction behavior of a user with Facebook. By analyzing the traffic between the browser and Facebook captured with HttpFox, a plugin for Firefox, we identified the interaction between Facebook and the Ajax code embedded in the user's friend page. By simulating the interactive behavior of this Ajax code with Facebook, we can get the complete list of friends through multiple extractions, each of which returns a list of at most 60 friends. The experimental results in Section IV support the correctness of the algorithm. In this paper, the crawler focuses on obtaining the social graph, i.e., the friend links. User privacy is protected by collecting only the user ID and not collecting user names, photos, other personal information, or interaction information among users.

Algorithm 1:
Open the Facebook webpage;
Enter real user login and password;
Consider this account as a seed profile; visit the profile and its friend webpage;
Step 1:
    Get friendNo; {extract the number of friends of a user from its friend webpage}
    If friendNo % 60 != 0
        NoofExtract = (friendNo - friendNo % 60 + 60) / 60;
    Else
        NoofExtract = friendNo / 60;
    For (i = 2; i <= NoofExtract; i++)
        Create a URI; {directing to a friend webpage including friends between 60*(i-1) and 60*i}
        Insert the URI into the queue to be visited;
    Go to Step 3;
Step 2:
    Visit the URI; {directing to a friend webpage including a maximum of 60 friends}
    Extract the friend list through regular expression;
    Insert the profiles into the queue to be visited;
Step 3:
    Visit the next item in the queue to be visited;
    If the item is a profile
        Visit the profile and its friend webpage;
        Go to Step 1 until the crawl ends;
    If the item is a URI {directing to a friend list webpage including at most 60 friends}
        Go to Step 2 until the crawl ends;

Clustering algorithm:
Step 0: Start
Step 1: Get the crawled tree
Step 2: Identify all the parent nodes
Step 3: Create a cluster bucket for each parent node
Step 4: Add each parent node to its cluster bucket vector
Step 5: Add all the children nodes of each parent node to its respective vector
Step 6: Stop

IV. RESULT

A. To validate the algorithm, we used a notebook with an Intel i5 CPU, 6 GB of RAM, and broadband Internet access at a rate of 1 Mb/s, and ran the crawler from March 10, 2014 to March 31, 2014. A dataset of a total of 1000 users was logged. After data collection, we hashed the user IDs with the MD5 function in order to protect user privacy.

B. We visualize the dataset crawled from Facebook after an initial data analysis. First, we use the open source software NodeXL to visualize subsets of the collected dataset, including subgraphs of 1000 users, 10000 users, 2000 users and 3000 users. Formally, a social network is an undirected graph G = <V, E>, where V is the set of nodes (users) and E is the set of edges (links). General metrics on OSN structure include degree, clustering coefficient, diameter etc.
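Two of the metrics just named, degree and clustering coefficient, can be made concrete with a short Java sketch over an undirected graph G = <V, E>. It assumes the standard local clustering coefficient (the fraction of a node's neighbor pairs that are themselves linked), averaged over all nodes. The class and method names are hypothetical, and this is neither the code used for the analysis nor any library's implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch: undirected graph with average degree and average clustering coefficient. */
class SocialGraphSketch {
    private final Map<String, Set<String>> adjacency = new HashMap<>();

    /** Adds an undirected friendship link; self-loops are ignored. */
    void addEdge(String a, String b) {
        if (a.equals(b)) return;
        adjacency.computeIfAbsent(a, k -> new HashSet<>()).add(b);
        adjacency.computeIfAbsent(b, k -> new HashSet<>()).add(a);
    }

    /** Degree of a node: the number of distinct friends. */
    int degree(String v) {
        return adjacency.getOrDefault(v, Set.of()).size();
    }

    /** Mean degree over all nodes that appear in at least one edge. */
    double averageDegree() {
        if (adjacency.isEmpty()) return 0.0;
        int total = 0;
        for (Set<String> neighbors : adjacency.values()) total += neighbors.size();
        return (double) total / adjacency.size();
    }

    /** Local clustering coefficient: fraction of neighbor pairs that are linked. */
    double clusteringCoefficient(String v) {
        List<String> nbrs = new ArrayList<>(adjacency.getOrDefault(v, Set.of()));
        int k = nbrs.size();
        if (k < 2) return 0.0; // no neighbor pairs to compare
        int links = 0;
        for (int i = 0; i < k; i++) {
            for (int j = i + 1; j < k; j++) {
                if (adjacency.get(nbrs.get(i)).contains(nbrs.get(j))) links++;
            }
        }
        return 2.0 * links / (k * (k - 1));
    }

    /** Mean of the local clustering coefficients over all nodes. */
    double averageClusteringCoefficient() {
        if (adjacency.isEmpty()) return 0.0;
        double sum = 0.0;
        for (String v : adjacency.keySet()) sum += clusteringCoefficient(v);
        return sum / adjacency.size();
    }

    public static void main(String[] args) {
        SocialGraphSketch g = new SocialGraphSketch();
        g.addEdge("a", "b");
        g.addEdge("b", "c");
        g.addEdge("a", "c");
        g.addEdge("a", "d");
        System.out.println(g.averageDegree());
        System.out.println(g.averageClusteringCoefficient());
    }
}
```

The pairwise loop is quadratic in a node's degree, which is acceptable for a sketch but would need a more careful approach on high-degree nodes in a large crawl.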
In this paper, we use the functions of SNAP (Stanford Network Analysis Platform) for the data analysis. According to Facebook statistics, each Facebook user has more than 120 friends on average. However, the result in Table 1 is only 2.59. Moreover, the average clustering coefficient in Table 1 is 0.04, but the result in [8] is 0.16. Thus, we can conclude that the dataset crawled from Facebook is biased. The main reason is that the breadth-first search used in this paper is a biased crawling strategy [18], which always skews towards users with larger degree.

TABLE 1: THE RESULT OF DATA ANALYSIS
No. of distinct visited users: 544,456
No. of distinct dis. neighbors: 4,144,526
No. of distinct edges: 5,244,548
Avg. diameter: 4.65
Crawling period: 10-03-2014 to 31-03-2014

C. In the privacy settings, we found that the friend lists of only 500 of the 1000 users crawled could be seen by the public, which shows that 36.5% of Facebook users changed the default privacy setting and made their friend list invisible to strangers. This figure is up to 20.5 percentage points higher than the 26.6% and 16% reported in earlier results, indicating that privacy protection among Facebook users has greatly improved. Advertising companies can use these data in different ways. They can advertise on Facebook according to a user's preferences, and a production company can change its product specifications and add new features on that basis. Crawled data can also be used by the Government for mood detection, which can help prevent harmful activities in society.

D. Summary

Among the 1000 users whose friend lists are visible to the public, we find that 700 users have more than 50 friends, accounting for 8.1%; 172 users have more than 130 friends, accounting for 5.0%.
Two facts have been established: first, the complete friend list can be obtained by our crawler; second, most users' friend lists are not crawled completely and are still in the queue to be visited, owing to the biased crawl strategy and the crawl time limitation.

V. CONCLUSION

In conclusion, we implemented a crawler for Facebook based on interaction simulation and distributed crawling, which can extract Facebook friend lists at a higher rate than previous crawlers. We also overcame the drawback in [5] that the crawler cannot extract more than 400 friends of a user. We extracted data from Facebook and stored it in the cloud. We hosted our crawler on Google App Engine so that it can be invoked from anywhere in the world to gather data. We implemented a distributed crawler that can accept more than one seed URL, extract data using multi-crawling, and gather the data in a single database. We analyzed the gathered data to check users' awareness and found that people are more privacy-conscious today. We also gathered data about frequent users and generated a graph of the crawled users.

VI. FUTURE SCOPE

A Facebook crawler can be used for collecting user data from Facebook, and analysis of these data can provide various types of information. We can find information about a user's family members. Crawling through a user profile can reveal the user's areas of interest and choices. An organization can analyze the data for its own purposes: it can collect a user's qualification information and employment status, and on that basis make a job offer to the user.

REFERENCES:
1. Social Media Examiner, "Social Media Marketing Report 2011", 2011. Retrieved November 8th 2011, from http://www.socialmediaexaminer.com/SocialMediaMarketingReport2011.pdf
2. J. J. S. Huang, S. J. H. Yang, Y. M. Huang, and I. Y. T. Hsiao, Social learning networks: build mobile learning networks based on collaborative services, Educational Technology & Society, vol.3(3), Jul. 2010, pp. 78–92, doi: 10.1.1.174.1095
3. M. Gjoka, M. Kurant, C. Butts, and A. Markopoulou. Walking in Facebook: a case study of unbiased sampling of OSNs. In Proceedings of the 29th Conference on Information Communications, pages 2498–2506. IEEE Press, 2010.
4. Joseph Bonneau, Jonathan Anderson, George Danezis, Prying Data out of a Social Network, Proceedings of the 2009 International Conference on Advances in Social Network Analysis and Mining, p. 249–254, July 20-22, 2009, doi: 10.1109/ASONAM.2009.45
5. Catanese, S., Meo, PD, Ferrara, E., Fiumara, G., and Provetti, A. Crawling Facebook for Social Network Analysis Purposes. WISM 2011.
6. D. Chau, S. Pandit, S. Wang, and C. Faloutsos. Parallel crawling for online social networks. In Proceedings of the 16th International Conference on World Wide Web, pages 1283–1284. ACM, 2007.
7. http://wikipedia.org
8. https://google.com