International Journal of Engineering Trends and Technology (IJETT), Volume 23, Number 7, May 2015 (ISSN: 2231-5381, http://www.ijettjournal.org)

A Predictor System for Social Network with Privacy Protection

Irine Mary Babu 1, Laya Devadas 2
1 M.Tech Scholar, 2 Assistant Professor
1, 2 College of Engineering, Munnar, Kerala, India

Abstract: Online social networks (OSNs) such as YouTube, Facebook, and Twitter have become an essential part of online activity and one of the most influential media, and they are used by a growing number of people. These networks let users expose their own information to others and communicate, interact, and socialize with one another. They make regular data sharing and inter-user communication readily available. Privacy is one of the major concerns when disclosing or sharing social network data for social science research and business analysis. Recently, researchers have developed privacy models to prevent node re-identification through structure information. However, even when these privacy models are enforced, an attacker may still be able to infer a person's private information if a group of nodes largely shares the same sensitive attributes. This paper identifies friend relationships in a social network and shares videos according to users' interests, friend range, and friend prediction. The network is modeled as a graph in which a node represents a user and a link between two nodes represents the relationship between those users.

Keywords: Data mining, predictor system, privacy preserving, social network, text categorization, video sharing.

I. INTRODUCTION
Data mining is the process of automatically discovering useful knowledge in large data repositories. Data mining techniques are used to search large databases for novel and useful patterns that might otherwise remain unknown.
Online social networks are growing rapidly and are changing the way people interact with each other. Networks such as YouTube, Facebook, and Twitter have become one of the most prominent activities on the web, and many business owners actively use them as part of their marketing. Users can communicate, connect, and interact with each other. In addition, every user has a profile in which they can reveal personal information if they choose. On the positive side, this information offers researchers tremendous opportunities for analysis; on the negative side, it poses a risk to users' privacy. Privacy is one of the most important concerns when revealing one's identity or information, and users are increasingly focused on securing their data. Because social networks are updated daily and grow rapidly, researchers increasingly collect information such as community growth, user behavior, and shared videos; at the same time, these networks should not reveal private information. The challenge is therefore to develop methods for publishing social network information with privacy protection.

The social network can be modeled as a graph in which nodes represent users, labels represent the information that users provide, and links represent the relationships between users. Researchers have proposed many techniques to prevent both information leakage and attacks by adversaries on these networks. These methods mostly concentrate on preventing identity disclosure. A collection of anonymization techniques and privacy models has been developed, such as k-anonymity [10], l-diversity [14], and t-closeness [17]. When a social network is published, its graph structure is published along with the corresponding social relationships.
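The graph model described above can be sketched as a small data structure. This is a minimal illustration, not the authors' implementation: the class name `SocialGraph`, the helper `users_with_degree`, and the sample users and labels are all hypothetical. It also shows that structural information such as a node's degree can be read directly from a published graph.

```python
from dataclasses import dataclass, field


@dataclass
class SocialGraph:
    """Undirected graph: nodes are users, labels hold user-provided
    information, and edges are relationships between users."""
    labels: dict = field(default_factory=dict)     # user -> set of labels
    neighbors: dict = field(default_factory=dict)  # user -> set of adjacent users

    def add_user(self, user, labels=()):
        # Create the node if needed and attach any labels to it.
        self.labels.setdefault(user, set()).update(labels)
        self.neighbors.setdefault(user, set())

    def add_relationship(self, u, v):
        # A relationship is stored symmetrically on both endpoints.
        for user in (u, v):
            self.add_user(user)
        self.neighbors[u].add(v)
        self.neighbors[v].add(u)

    def degree(self, user):
        return len(self.neighbors[user])


def users_with_degree(graph, d):
    """All users whose degree equals d (a purely structural query)."""
    return sorted(u for u in graph.neighbors if graph.degree(u) == d)


# Hypothetical sample data.
g = SocialGraph()
g.add_user("alice", {"music"})
g.add_user("bob", {"sports"})
g.add_relationship("alice", "bob")
g.add_relationship("alice", "carol")
print(g.degree("alice"))        # 2
print(users_with_degree(g, 1))  # ['bob', 'carol']
```

In this sample, only one user has degree 2, so the degree alone singles out that node even if its name were removed.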
As a result, the graph structure may be exploited as a new means to compromise privacy. A structure attack uses graph structure data, such as the degree or the subgraph of a node, to identify that node. To prevent structure attacks, a published graph should satisfy k-anonymity [3], [9]. Current approaches to protecting graph privacy fall into two categories [6]: clustering [3], [9] and edge editing [1], [10], [9]. Clustering merges a subgraph into one super node, which is unsuitable for graphs with sensitive labels, because when a group of nodes is merged into one super node, the node-label relations are lost. Edge-editing methods keep the nodes of the original graph unchanged and only add, delete, or swap edges. However, edge editing may largely destroy the properties of a graph: it can change distance properties substantially by connecting two faraway nodes or by deleting the bridge link between two communities.

Privacy breaches in social network data can be grouped into three categories [3]:
1) Identity disclosure: the identity of the individual associated with a node is revealed.
2) Link disclosure: sensitive relationships between two individuals are disclosed.
3) Content disclosure: the privacy of the data associated with each node is breached, e.g., the email messages sent and/or received by the individuals in an email communication graph.

We believe that an ideal privacy protection system should consider all of these issues. However, protecting against each of the above breaches may require different techniques. For instance, for content disclosure, standard privacy-preserving data mining techniques [4] such as data perturbation and k-anonymization can help.
For link disclosure, the various techniques studied by the link-mining community [5], [7] can be useful.

In this paper, we concentrate on video sharing, an activity in which users are increasingly interested. The paper addresses privacy in video sharing by comparing users' interests. The comparison uses a text categorization technique, so that a video is shared according to each user's interests. In text categorization, a text or word may partially match many categories, and we have to find the best-matching category. The term frequency/inverse document frequency (TF-IDF) approach is commonly used to weight each word in a text document according to how unique it is.

The rest of the paper is organized as follows. Section II reviews previous work in the area. Section III describes text categorization. Section IV defines the problem, Section V presents the proposed technique, and Section VI describes the algorithm. Section VII presents the experimental results, and Section VIII concludes the paper.

II. RELATED WORK
A number of recent studies aim to protect private data in social networks. Most of this work concerns the graph structure, the labels on the graph, and the privacy of those labels.

L. Sweeney proposed the k-anonymity model for protecting privacy [10]. The problem considered is how a data holder can share with researchers a version of data that contains private information, with scientific guarantees that the individuals who are the subjects of the data cannot be identified. The solution is a formal protection model called k-anonymity: when it is applied, the released data look anonymous, so that the subjects of the data cannot be identified. However, k-anonymity can still be vulnerable to attacks.

A. Machanavajjhala, J. Gehrke, D. Kifer, and M.
Venkitasubramaniam, in "l-diversity: privacy beyond k-anonymity" [14], showed that k-anonymity does not guarantee privacy against attackers. They demonstrated attacks on k-anonymity that leak information because of a lack of diversity in the sensitive attributes. To overcome this, the authors proposed a novel and powerful privacy definition called l-diversity and showed that it is practical and can be implemented efficiently. Under the l-diversity principle, each group of tuples that are indistinguishable on their quasi-identifier values contains at least l well-represented values of the sensitive attribute. Different adversaries can have different background knowledge, and l-diversity protects against all of them simultaneously.

X. Ying and X. Wu [8] proposed a spectrum-preserving approach to randomizing social networks. The authors investigated how various properties of networks are affected by randomization. They studied how randomly deleting and swapping edges changes graph properties and proposed an eigenvalue-oriented random graph perturbation algorithm. All edge-editing-based models aim to produce a published graph with as few edge changes as possible.

L. Zou, L. Chen, and M. T. Özsu proposed k-automorphism, a general framework for privacy-preserving network publication [12]. With the growing number of social network applications, privacy concerns have become increasingly important, since social networks usually contain personal information. Simply removing all identifiable personal information (such as names and social security numbers) before releasing the data is insufficient, because an attacker can easily identify a target by performing different structural queries. The authors proposed k-automorphism to protect against multiple structural attacks and developed an algorithm called KM that ensures k-automorphism. They also discussed an extension of KM to handle "dynamic" releases of the data and showed that the algorithm performs well in terms of the protection it provides.

J. Cheng, A. W.-C. Fu, and J.
Liu proposed k-isomorphism for privacy-preserving network publication against structural attacks [2]. They identified a new problem of enforcing k-security to protect sensitive information about the nodes and links in a published network dataset. Their investigation led to k-isomorphism, where the choice of anonymization algorithm depends on the adversary's knowledge and the targets of protection. The authors observe that NodeInfo and LinkInfo are two basic sources of sensitive information in network datasets that call for special security efforts, and they also address protection against structural attacks when the target is only NodeInfo.

L. Liu, J. Wang, J. Liu, and J. Zhang, in their work on privacy preservation in social networks against sensitive edge disclosure [11], treated the weights on edges as sensitive labels and proposed a method that preserves the shortest paths between most pairs of nodes in the graph. They considered preserving the privacy of the weights of certain edges while keeping shortest-path lengths close to the originals and keeping exactly the same shortest paths for certain pairs of nodes. The authors developed two privacy-preserving strategies for this application: the first is based on Gaussian randomization multiplication, and the second is a greedy perturbation algorithm based on graph theory.

M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis, in "Resisting structural re-identification in anonymized social networks" [13], considered networks of entities connected by relations such as friendship, communication, or a shared activity. They quantified the privacy risks associated with three classes of attacks on the privacy of individuals in networks, based on the adversary's knowledge. They showed that the risks of these attacks depend largely on network structure and size.
They also proposed a novel approach to anonymizing network data that models aggregate network structure and then allows samples to be drawn from that model. The approach guarantees anonymity for network entities while preserving the ability to estimate a wide variety of network measures with little bias.

III. TEXT CATEGORIZATION
With the rapid growth of online information, how to process these huge amounts of text has become a hot research topic, and text categorization is one of the key tasks. The goal of text categorization is the classification of documents into a fixed number of predefined categories. The working definition used throughout this paper assumes that each document of the user is assigned to exactly one category. More formally, given a set of categories C and a set of training documents (interests) I, there is a target concept T: I -> C that maps each document to a category of interest. T(I) is known for the documents in the training set. Through supervised learning, the information contained in the training examples can be used to find a model that approximates T. A fundamental problem in text categorization is how to improve classification accuracy; the goal is to find a model that maximizes accuracy.

Text categorization is the process of grouping documents into different categories or classes. With the amount of online information growing rapidly, the need for reliable automatic categorization has increased. For this text categorization task, the TF-IDF algorithm not only enables a better theoretical understanding but also performs well in practice without being conceptually or computationally more complex.

A. Text Categorization Steps
Text categorization generally includes five main steps: document preprocessing, document reduction, stemming, model training, and testing and evaluation [15].
1) Document Preprocessing: This step removes HTML tags, rare words, and stop words, and may also apply some stemming.
2) Document Reduction: Documents contain a very large number of words. If all of them were chosen as features, classification would become infeasible because of the amount of data to process. We therefore select the most meaningful and representative features for classification. Stop words are a part of natural language that carries little meaning in a retrieval system. They are removed because they make the text heavier and less informative for analysis; removing them reduces the dimensionality of the term space.
3) Stemming: Stemming techniques are used to find the root (stem) of a word. Stemming converts words to their stems, which involves a good deal of language-dependent linguistic knowledge. The hypothesis behind stemming is that words with the same stem or root mostly describe the same or closely related concepts, so such words can be conflated by using their stems.
4) Model Training: This is the most important part of text categorization. It selects documents from the corpus to form the training set, performs learning on the training set, and then generates the model.
5) Testing and Evaluation: This step performs classification on the testing set.

IV. PROBLEM DEFINITION
In social networks, privacy is the most important concern. Publishing social network data entails a privacy threat for its users, so sensitive information about users should be protected. Most privacy algorithms concentrate on the graph structure and on protecting the labels, which hold either sensitive or non-sensitive information about the user; for this purpose, authors have proposed techniques such as k-anonymity, l-diversity, and many more. Video sharing in social networks is somewhat different. In this paper, we focus on video sharing on the basis of text categorization.

A. Objective
The objectives are:
1) To develop a new method that provides privacy and security for social network information in a data-sharing environment by examining user interests.
2) To help publishers publish unified data together while ensuring privacy.
3) To publish non-sensitive data to everyone in the network.
4) To keep overhead low.

V. PROPOSED TECHNIQUE
Here we propose a method to provide privacy for video information. Fig. 1 shows the proposed system architecture. A URL is input to the search engine, which retrieves the web page; preprocessing is then performed on the content of the page. The preprocessed page is categorized, and its category is compared with the user's interests. According to those interests, the data are shared to user profiles. For this purpose, we create a network consisting of nodes, which represent users; labels, which represent the data belonging to each user; and links, which represent the relationships between users.

A. Preprocessing
The main objective of preprocessing is to obtain the key features or key terms from stored text and to enhance the relevance between word and document and between word and category. Preprocessing is crucial in determining the quality of the next stage, the classification stage. It is important to select the significant keywords that carry the meaning and to discard the words that do not contribute to distinguishing between documents.
The preprocessing phase converts the original textual data into a data-mining-ready structure. In the end, it removes all HTML tags, rare words, stop words, and so on.

B. Categorizing
In this step, the words that result from preprocessing are categorized, with words of nearly the same meaning falling into the same category. The resulting category is then compared with the categories listed in the user profile, and the video is automatically shared to the profiles of other users who have an interest in the same category.

VI. ALGORITHMIC STRATEGY
A. TF-IDF Classifier
To share videos according to users' interests, we use the TF-IDF algorithm. The TF-IDF weight is often used in information retrieval and text mining, and it can be used effectively for stop-word filtering in various subject fields, including text summarization and classification. The three main design choices are:
1) The word weighting method.
2) The text length normalization, which is done using the number of words.
3) The similarity measure, which is the inner product.

Algorithm Procedure
1) Enter the URL.
2) Get the meta content.
3) Remove the stop words.
4) Perform stemming.
5) Categorize the video.
6) Check whether friends with the same video interest are present.
7) If so, share the video with the corresponding users; otherwise, the video is not shared.

The algorithm first takes the URL, gets its meta content, and deletes all stop words, so that the resulting content consists only of root words. It then categorizes the video and finds the friends with the same interest or category by examining the videos collected by those users. After this, the URL is automatically shared to the user's profile and, at the same time, to those who have the same interest.

Fig. 1 System architecture

VII. EXPERIMENTAL RESULTS
We have developed a network with a user interface.
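The procedure of Section VI, as exercised by such a network, can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the implementation used in the paper: the stop-word list, the suffix-stripping stemmer, the smoothed IDF formula log(N/df) + 1, the sample category texts, and all function and user names are hypothetical. Fetching the meta content from the URL is omitted; the meta content is assumed to be available as a string. The sketch follows the three stated design choices: TF-IDF word weights, length normalization by the number of words, and the inner product as the similarity measure.

```python
import math
from collections import Counter

# Hypothetical stop-word list and suffix rules; a real system would use
# fuller resources and a proper stemmer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with"}
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Crude suffix stripping, standing in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stop words, then stem."""
    tokens = [t for t in text.lower().split() if t.isalpha()]
    return [stem(t) for t in tokens if t not in STOP_WORDS]


def tf_idf(tokens, idf):
    """Term frequency normalized by the number of words, times IDF."""
    counts = Counter(tokens)
    n = len(tokens) or 1
    return {term: (c / n) * idf.get(term, 0.0) for term, c in counts.items()}


def train(category_docs):
    """Build one TF-IDF prototype vector per category from sample text."""
    processed = {cat: preprocess(doc) for cat, doc in category_docs.items()}
    n_docs = len(processed)
    doc_freq = Counter(t for toks in processed.values() for t in set(toks))
    # Assumed smoothing: log(N / df) + 1 keeps every known term's weight positive.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    return idf, {cat: tf_idf(toks, idf) for cat, toks in processed.items()}


def categorize(meta_content, idf, prototypes):
    """Return the category whose prototype has the largest inner product."""
    vec = tf_idf(preprocess(meta_content), idf)

    def score(proto):
        return sum(w * proto.get(term, 0.0) for term, w in vec.items())

    return max(prototypes, key=lambda cat: score(prototypes[cat]))


def share_targets(owner, category, friends, interests):
    """Friends of `owner` whose profile interests include `category`."""
    return sorted(f for f in friends.get(owner, ())
                  if category in interests.get(f, ()))


# Hypothetical sample data.
category_docs = {
    "sports": "football match goals and cricket scores",
    "music": "guitar songs and live concert recordings",
    "cooking": "recipes for baking bread and cooking pasta",
}
friends = {"alice": ["bob", "carol", "dave"]}
interests = {"bob": {"music", "sports"}, "carol": {"cooking"}, "dave": {"music"}}

idf, prototypes = train(category_docs)
category = categorize("acoustic guitar songs from a live concert", idf, prototypes)
print(category)                                              # music
print(share_targets("alice", category, friends, interests))  # ['bob', 'dave']
```

If no friend lists the video's category among their interests, `share_targets` returns an empty list and the video is not shared. Note that when the meta content matches no known term, every score is zero and `max` simply returns the first category, so a production system would need an explicit "no match" threshold.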
In this network, users can add their interests to their profiles, and data are shared according to these interests. The algorithm takes the URL, removes all stop words, and reduces the content to root words. It then compares the category of the URL with the user's interest categories. After this, the URL is automatically shared to the user's profile and, at the same time, to those who have the same interest.

VIII. CONCLUSION
The use of social networking increases day by day. Social networks contain valuable information as well as worthless information, and users are increasingly concerned with securing this information. In this paper, we designed a privacy protection framework that allows users to add their interests to their profiles and provides privacy according to those interests. The algorithm compares the keywords obtained from the URL with the user's interests, so that videos are shared only with interested people.

REFERENCES
[1] Mingxuan Yuan, Lei Chen, Philip S. Yu, and Ting Yu, "Protecting Sensitive Labels in Social Network Data Anonymization," IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 3, March 2013.
[2] J. Cheng, A. W.-C. Fu, and J. Liu, "K-Isomorphism: Privacy Preserving Network Publication against Structural Attacks," Proc. Int'l Conf. Management of Data, pp. 459-470, 2010.
[3] K. Liu and E. Terzi, "Towards identity anonymization on graphs," in Proc. of the 2008 ACM SIGMOD Intl. Conf. on Management of Data, New York, USA: ACM, pp. 93-106, 2008.
[4] C. C. Aggarwal and P. S. Yu, Privacy-Preserving Data Mining: Models and Algorithms, vol. 34 of Advances in Database Systems, Springer, 2008.
[5] L. Getoor and C. P. Diehl, "Link mining: a survey," ACM SIGKDD Explorations Newsletter, 7(2), pp. 3-12, 2005.
[6] Gaurav P. R. and Gururaj T., "Anonymization: Enhancing Privacy and Security of Sensitive Data of Online Social Networks," International Journal of Computer Science and Information Technologies, Vol. 5 (4), pp. 5995-6000, 2014.
[7] E. Zheleva and L. Getoor, "Preserving the privacy of sensitive relationships in graph data," in Proceedings of the International Workshop on Privacy, Security, and Trust in KDD (PinKDD'07), San Jose, CA, August 2007.
[8] X. Ying and X. Wu, "Randomizing social networks: a spectrum preserving approach," in SDM, 2008.
[9] B. Zhou and J. Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks," Proc. IEEE 24th Int'l Conf. Data Eng. (ICDE'08), pp. 506-515, 2008.
[10] L. Sweeney, "K-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems, 2002.
[11] L. Liu, J. Wang, J. Liu, and J. Zhang, "Privacy preserving in social networks against sensitive edge disclosure," in SIAM International Conference on Data Mining, 2009.
[12] L. Zou, L. Chen, and M. T. Özsu, "K-automorphism: a general framework for privacy-preserving network publication," PVLDB, 2(1), 2009.
[13] M. Hay, G. Miklau, D. Jensen, D. Towsley, and P. Weis, "Resisting structural re-identification in anonymized social networks," PVLDB, 1(1), 2008.
[14] A. Machanavajjhala et al., "l-diversity: Privacy beyond k-anonymity," ACM Trans. Knowledge Discovery from Data, 2007.
[15] Mingyong Liu and Jiangang Ya, "An improvement of TFIDF weighting in text categorization," 2012 International Conference on Computer Technology and Science (ICCTS 2012), IPCSIT vol. 47, IACSIT Press, Singapore, 2012, DOI: 10.7763/IPCSIT.2012.V47.9.
[16] Suvidha and Ravikishan, "A study on the Architecture for Text Categorization and Summarization," International Journal of Computer Trends and Technology, vol. 3, Issue 4, 2012.
[17] N. Li and T. Li, "t-Closeness: Privacy beyond k-Anonymity and l-Diversity," IEEE 23rd Int'l Conf. Data Eng., 2007.