International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 4 - Jun 2014

An Implementation of Predictive Hotset Identification and Analysis in Social Networks

Miss. Gayatri Kabra#1, Prof. Mangesh Wanjari#2
#1 M.Tech Scholar, #2 Assistant Professor, Department of Computer Science, Shri Ramdeobaba College of Engg. & Management, Nagpur, India

Abstract: Social networks are platforms where people build connections and share interests, activities, links, resources, tweets, and reviews. They are web-based services that let users interact over the internet through instant messaging and resource sharing. Because a single platform provides so many services, the data scattered across a social network is a mix of structured, unstructured, and semi-structured content. It is often difficult for a user to find exact information in such a large dataset. To manage this data cost-effectively in terms of storage and network usage, it must be analysed. For this we build a hotset: a precisely pre-processed subset of social network data that excludes repetitive and meaningless content and retains only useful data. There are many kinds of social networking sites, such as job sites, matrimonial sites, product review sites, blogs, and social connecting sites. Some of these are meant to be informative, providing product reviews written by different users. A common problem on these sites is finding the required information among the reviews and assessing the effectiveness of those reviews through statistical analysis.
The contribution of this paper is an approach for searching mobile product information, in the form of reviews, based on the user's query, and for estimating the trustworthiness of review quality from the replies of users available on the social network, calculated on a statistical basis.

Keywords: Social network, hotset, trustworthiness, reviews, stopwords, statistical analysis.

I. INTRODUCTION

Social networks consist of many social actors who play different roles in the utilities of the network. The data these actors work on can be structured, unstructured, or semi-structured, and includes images, videos, quick links, uploads, reviews, tweets, and so on. Working with this variety of structures on a single platform is difficult, so it is easier to work on one kind of data at a time. Sometimes more than one kind of data can be considered together, such as reviews and quick links. Product review data is especially useful among social networking data because it contains comments about a product from the product's users. Processing this data into a hotset helps a user analyse the product. Furthermore, the trustworthiness calculated from the generated hotset can serve as a measure of review quality.

Data collection is a major task for analysis. The more data one has, the better the purpose of analysing large volumes of data, in the sense of big data analytics, is served. Data can be taken directly from the internet or collected manually, and it can be of different types. For convenience, our project considers only one type: product review data. After collection, this large amount of data must be stored in an appropriate format so that it can be accessed easily for analysis.

ISSN: 2231-5381
After collecting data, the first step is pre-processing: cleaning the data by removing unnecessary elements such as extra symbols and stop words. Pre-processing is mandatory when working with reviews, because unprocessed data leads to errors in later processing. It is not performed on the whole dataset at once: the system's data is updated dynamically, so re-processing everything at every update would be impractical. Instead, pre-processing is applied to the set of data from which a hotset is to be generated. Pre-processing is also applied to the user's query, so that it can be mapped to the exact dataset to be analysed.

After pre-processing, the data on which the hotset is to be created is obtained by index mapping based on the user's requirement. Mapping an index to the dataset makes it easier to fetch the required data. The index also records the user's social connection, i.e. which user each review in the dataset comes from, which is useful later for social-based analysis.

After fetching this data, the predictive-social method is applied to predict hotsets for the user's query. The predictive-social method takes into account the frequency count of words within the common pattern of the dataset. Data with a minimal common pattern is discarded, and the rest is handled on the basis of social connection. The user's social connections are used for the trust factor, where reviews from connected people are given priority. The remaining reviews are kept for analysis based on the users' social connections. Social connections are used to predict a factor called "trustworthiness", which is obtained by considering different social connection groups. This factor affects the predicted review quality and indicates what percentage of trust to place in the predicted review.
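The selection step outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name select_hotset, the group labels, the rank order in GROUP_RANK, and the min_matches threshold are all hypothetical. It scores each review by how often the query keywords occur in it, discards reviews below the threshold, and lists reviews from closer social connections first.

```python
# Assumed ordering of social groups: a lower rank means a closer connection.
GROUP_RANK = {"friend": 0, "friend_of_friend": 1, "non_friend": 2}

def select_hotset(keywords, reviews, min_matches=1):
    """Pick reviews for a hotset.

    keywords: list of query keywords (already pre-processed).
    reviews:  list of (author_group, tokens) pairs, tokens being a list of words.
    Returns the kept reviews, closest social group first and, within a group,
    highest keyword frequency first.
    """
    keyset = set(keywords)
    scored = []
    for group, tokens in reviews:
        # Word frequency count: occurrences of query keywords in the review.
        score = sum(1 for token in tokens if token in keyset)
        if score >= min_matches:  # discard reviews with too little in common
            scored.append((GROUP_RANK[group], -score, group, tokens))
    scored.sort(key=lambda item: (item[0], item[1]))
    return [(group, tokens) for _, _, group, tokens in scored]
```

With this ordering a friend's review that matches once is listed before a non-friend's review that matches three times, which reflects the stated priority for connected users; a different trade-off between frequency and connection would need a different sort key.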
http://www.ijettjournal.org

II. RELATED WORK

Predictive content management for the web follows two kinds of strategies, static and dynamic, depending on the content. Static content management strategies consider historical data that has arrived in the system over a fixed time period, whereas dynamic strategies deal with data that changes and grows on the web over time. Static methods take static data in the form of web log files or click-through history.

Some strategies for static content management are as follows. Predicting sequences of events for prefetching web pages is one static predictive strategy, defined in [3]. There, event sequences are formed by the Markov LxS (length times support) algorithm, which forms episodes of event sequences that occur within a particular time; that is, time is the deciding factor. Historical data from click-through history and log files can also be used for prediction [7], [9]. In both of these, static historical data from web log files is processed to predict events or sequences of events occurring within a predefined time interval. Web log data clearly defines the user's web access sequence. For structured data spaces in social networks, the work in [6] predicts category accesses. Grouping accesses into categories over a predefined time interval makes prediction easier, because the accesses by particular users within a given time phase tend to be of a similar kind.

Among dynamic content management strategies, there are adaptive algorithms for hotset prediction that combine predictive and social-based estimation through rank merging techniques [1]. Rank merging techniques in turn differ between dynamic and static rank merging. Both strategies are clearly defined in [2].
For rank merging, social connection information can be extracted by different techniques [5]. Strategies for managing very large content rely basically on big data analytics techniques. Strategies for managing this huge amount of dynamically updating web content are discussed in [13]. Web content varies in form and features and is very large, which makes it difficult to extract. Extraction techniques for social networks are given in [12].

In our approach to analysing product reviews, reviews are filtered on the basis of word frequency count. This measure suits review data because the data has a similar structure built from common words. The basic procedure is simple: discard the reviews with the minimum matching word frequency count. The matching word frequency count criterion comes from the concept of tokenization, where each separate word left after removing stopwords from a review is treated as a 'token' whose frequency is matched. If, in addition, the matching words appear in the same order, the probability of keeping the review increases further. Because the data is cleaned and analysed on the basis of word frequency count, this can be called predictive analytics.

Extracting social connection information about a user's contacts and using it to decide hotsets by rank merging is done by the adaptive algorithm in [1]. Social connection information can also be used to determine trustworthiness, based on the weights given to reviews from different groups, i.e. reviews from friends, friends of friends, and non-friends. Clearly the weight is highest for reviews from friends. Trustworthiness depends on the weights given to the different groups of reviews, so the weights should be chosen carefully.

III. PROPOSED APPROACH

The goal of big data predictive analytics in social networks is to analyse data according to its usage, identify the predictive hotset, and lower the cost of data in terms of storage, network usage, and computational processing. On a social networking product review site it is often not possible to get exact information from the product reviews. This can be addressed by generating hotsets from the review dataset based on the user's query. Analysing the reviews in the generated hotset against statistical criteria then gives the user a measure of review quality in terms of trustworthiness. Both predictive and social connection based techniques are used to generate the hotset. Our approach consists of the following phases:

1. Data collection and preprocessing
2. Mapping of the user's query to the required review set, and hotset generation
3. Analysis of the generated hotset in terms of trustworthiness

All the phases are discussed in detail below.

A. Data Collection and Pre-processing

Data collection is an important aspect of our project because it is based on big data analytics. Big data, as the name indicates, consists of large amounts of data (possibly terabytes or petabytes) growing rapidly on the internet. When analysing data at this scale, the system needs a workload that makes it respond as it would to big data. While gathering data for our project we found it difficult to obtain bulk data of a similar kind. To resolve this, one can collect data oneself by hosting a forum on a web server. Different people upload reviews, short comments, links and so on to this server, which generates data automatically in bulk. A further advantage of hosting a forum is access to the users' social connection information, which is useful later. Social connections are simply the population of users connected to each other through the social network.
Data collected through a forum or by any other means is unstructured and contains misspelled words, irrelevant words, extraneous comments, dots, and so on. To generate hotsets without error, all of these irrelevant elements must be removed. Consider an example. A user might put a question about a product as follows:

hiii, is there any body who can explain me that whatsapp is working on this new nokia 207 or not?

This question contains "hiii", which is of no use in predicting hotsets. For this reason the first operation on the data is the removal of stopwords. To do this, a dictionary containing all the stopwords to be removed from the data is created. For a review question of this type, reviews in the form of relevant answers from different users might be:

nikky123: whats app wasn't pre istall but i find solution. solution is in the link below: http://discussions.nokia.com/t...td-p/2144653 You need a pc to do it. It worked for me. You are welcome….

Sohan11: http://discussions.nokia.com/whattsapp_sol/2345 112

Again, the answer from username "nikky123" contains multiple dots ("…"), which are inappropriate for hotset creation. These must be avoided or removed, which is accomplished by preprocessing. Working on raw data without removing irrelevant content makes it difficult to form a proper predictive hotset, because the dataset is erroneous. Hence preprocessing is a mandatory task in hotset identification.

B. Mapping of user's query to required review set and Hot set generation

The next step after preprocessing is mapping the user's query to a review set.
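Since both the stored questions and reviews and the user's query pass through the same cleaning step, a minimal sketch of that step may help. It is an illustration under assumptions rather than the system's actual code: the STOPWORDS set stands in for the stopword dictionary, whose contents the paper does not list, and the function name preprocess and the choice to return links separately are hypothetical.

```python
import re

# Hypothetical stopword dictionary; the real list used by the system is not given.
STOPWORDS = {
    "hi", "hii", "hiii", "is", "there", "any", "who", "can", "me", "that",
    "on", "this", "or", "not", "how", "to", "in", "a", "the", "if", "my",
    "what", "do",
}

URL_RE = re.compile(r"https?://\S+")        # links such as forum URLs
NON_WORD_RE = re.compile(r"[^a-z0-9\s]+")   # dots, commas and other symbols

def preprocess(text):
    """Clean one question, review or query.

    Returns (tokens, links): the remaining keyword tokens in order, and any
    URLs found in the text (kept aside rather than tokenised).
    """
    links = URL_RE.findall(text)
    text = URL_RE.sub(" ", text).lower()    # drop links, normalise case
    text = NON_WORD_RE.sub(" ", text)       # remove repeated dots and symbols
    tokens = [word for word in text.split() if word not in STOPWORDS]
    return tokens, links
```

Applied to the example question above, this sketch returns the tokens body, explain, whatsapp, working, new, nokia and 207, with no links. Lowercasing here also covers the case normalisation that the mapping step relies on.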
A user might enter a query in the search box as:

1. "how to solve Heating problem in Grand 2"
2. "Over heating in grand 2"
3. "what to do if my galaxy grand 2 gets heating problem"

All three queries clearly have the same answer set, so the hotsets generated for them will be the same. To search for reviews matching the entered query and to generate predictive hotsets, the first step is mapping.

Mapping: To map the entered query, all stopwords in it are removed, because stopwords play no role in the mapping process. For example, in query 1 above, 'How', 'to', and 'in' are the three stopwords. After removing them, what remains is the keyword set "solve Heating problem Grand 2". This keyword set is compared with each question in our dataset, and the query is mapped to the question with the maximum number of matching words. A user may sometimes enter "grand2" instead of "Grand 2", which makes it difficult to map to the appropriate question. One option for reducing such mismatches is to convert the entire string to lowercase before matching.

Hot set generation: After the entered query has been mapped to a question, all of that question's reviews can be fetched automatically from storage. During hotset generation the user's social connections are given preference: if any of the required reviews were written by a connected user, they are automatically preferred and displayed at the top, on the basis of the trust factor from the user's social connections. In the remaining review set one can observe a particular pattern of reviews containing common words, which are underlined in the example below.
The mapped review set for queries 1, 2 and 3 above is:

- heating is common problem for just 10-15 day after that its not a big issue
- heating is common in all phones but in this phone heating issue is very very because my brother using the galaxy s4 in that also having heating issue in sony also suffer from heating issue so if you don’t see another phone then don’t talk here
- all phone have pros and cons if u will be see such type common problem then u cnt buy any smart phone in world. I recommend to u buy nokia 112 or 1100 coz u can’t see smart phone have no exhast fan for coolant

A common pattern based on word frequency can be observed here. The reviews with the maximum matching pattern based on word frequency are extracted and kept as a set called the hotset. Social connection also plays an important role in the analysis of the generated hotset.

C. Analysis of generated hotsets

The hotsets generated in the previous step are based purely on prediction. How can a user know the trust factor by which the quality of the analysis can be judged? For this, the user needs to know the trustworthiness of the reviews contained in the hotset. The best way to determine it is to use the user's social connections. Trustworthiness is derived by assigning different weights to different sets of reviews. Different weights can be chosen, and they must be chosen carefully to avoid a wrong trust factor. Trustworthiness is given by the following formula:

\[
\text{Trustworthiness in \%} = \frac{\text{No. of reviews from particular group} \times \text{Weight of particular group reviews}}{\text{Total no. of reviews} \times \text{Maximum weight}} \times 100
\]

where

- No. of reviews from particular group is the count of reviews from the first social connection (friends of the user), from the second connection (friends of friends), and from non-friends.
- Weight of particular group reviews is the weight given to the particular group, which differs from group to group.
- Total no. of reviews is the total count of reviews in a hotset.
- Maximum weight is the highest weight given to any group.

IV. PERFORMANCE EVALUATION

In this section, we evaluate the hotsets we create against the user's query. Queries vary in the number of words entered, and the results of manual and system-based execution vary with those words. We carry out different sets of experiments to demonstrate the effectiveness of the predictive hotset against trustworthiness by means of formulae. Results indicate that the model responds differently to different input sizes, varying from 40% to 70%. We have also studied the performance of the end result by varying the offset introduced by the error in the recommended formula.

Our system has a provision for a user query, which the user can enter in the form of keywords. We tested the system on two criteria: manual and system-based. The results of hotset generation vary with the number of keywords entered. The results for queries on an increasing number of reviews, under both criteria, are shown in the graphs that follow.

[Graph for query of 1 word]

As the number of reviews increases, the correctness of manual detection falls: among 50 reviews there is always a chance of finding reviews with a unique index, but as the dataset grows the probability of success over a unique index decreases. This is clearly shown in the graph for the one-word query. One more fact must be considered here: a query of more than one word may contain stopwords.
So for a two-word query, entering a stopword in the search box gives results similar to those of a one-word query. The graph for the two-word query is shown in Fig. 2. It does not cover the case where one or both of the two entered words are stopwords. The graph shows that the manual and system-based search results vary in the same way as for the one-word query.

Fig. 1 Results for analysis of query with one word (x-axis: No. of reviews, 50 to 200; series: Manual, System Based)

Fig. 2 Results for analysis of query with two words (x-axis: No. of reviews, 50 to 200; series: Manual, System Based)

For a one-word query, manual hotset generation gives correct results for the first 50 reviews.

Experimental results for the three-word query are shown in Fig. 3. Here an increase in the dataset does not affect the manual results much, and system-based query execution shows similar variation. Up to a review set of 150, the difference between system-based and manual results stays similar; from 150 onwards the difference in accuracy grows. For a four-word query, manual results give the maximum probability of a successful search; that is, a query of four or more words is enough for manual execution to succeed. Similarly, for system-based execution the success rate is higher for a small review set than for a large one. Overall, performance for the four-word query is better than for the other three.
Fig. 3 Results for analysis of query with three words (x-axis: No. of reviews, 50 to 200; series: Manual, System Based)

Fig. 4 Results for analysis of query with four words

Manual results for query execution are correct. To calculate the accuracy of the system-based results, we therefore take the correctness of the manual results as the base. The accuracy of the system-based results is calculated as:

\[
\text{Accuracy in \%} = \frac{\text{No. of system-based reviews in scenario}}{\text{No. of manual reviews in scenario}} \times 100
\]

The graph for accuracy is given in Fig. 5.

Fig. 5 Accuracy of system based results

V. CONCLUSION AND FUTURE WORK

On a product review site, a social network serves to reveal actual users' opinions. The problem with this kind of site is that one may not find what is desired and cannot predict the trustworthiness of reviews. For this purpose our project provides a facility to search for the required information in the review dataset. The dataset is very large, so to identify the useful reviews we generate predictive hotsets on the basis of the user's query. To form these predictive hotsets we use word frequency, because the review dataset consists of similar kinds of data. After discarding data with the minimum match in word frequency count, the social preferences of the data are also considered. Because social connections are considered in hotset generation, this can be called the Predictive-Social method for generating hotsets. Another application of social connections is deciding trustworthiness. Trustworthiness is a factor determined by the weights given to different groups of data. These weights can be varied depending on the groups of users posting reviews.
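The effect of varying the weights can be illustrated with a small sketch of the trustworthiness formula from Section III-C. It assumes that the numerator is summed over the three groups (friends, friends of friends, non-friends), which the normalisation by total reviews times maximum weight suggests but the text does not state explicitly. The function name, the review counts, and the weights are hypothetical and are not taken from the experiments.

```python
def trustworthiness(counts, weights):
    """Trustworthiness of a hotset in percent.

    counts:  number of reviews in the hotset per social group.
    weights: weight assigned to each social group.
    Assumption: the weighted counts are summed over all groups before
    dividing by (total reviews x maximum weight).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty hotset carries no trust information
    weighted = sum(counts[group] * weights[group] for group in counts)
    return 100.0 * weighted / (total * max(weights.values()))

# Hypothetical hotset of 10 reviews.
counts = {"friend": 4, "friend_of_friend": 3, "non_friend": 3}
print(trustworthiness(counts, {"friend": 3, "friend_of_friend": 2, "non_friend": 1}))  # 70.0
print(trustworthiness(counts, {"friend": 5, "friend_of_friend": 2, "non_friend": 1}))  # 58.0
```

For the same hotset, raising the friends' weight from 3 to 5 lowers the score from 70% to 58%, because the maximum weight in the denominator grows faster than the weighted sum. This is one reason the weights have to be chosen with care.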
Trustworthiness also indicates the trust factor of reviews; that is, it is a concluding measure for the analysis of review quality.

There can be other statistical measures for determining the quality of reviews in a generated hotset. These measures can be based on review ratings: one can consider the different ratings users give to a review, and statistical analysis can determine the range by which quality can be predicted.

REFERENCES

[1] Claudia Canali, Michele Colajanni, Riccardo Lancellotti, "Adaptive Algorithms for Efficient Content Management in Social Network Services", University of Modena and Reggio Emilia.
[2] Claudia Canali, Michele Colajanni, Riccardo Lancellotti, "Hot Set Identification for Social Network Applications", Department of Information Engineering, University of Modena and Reggio Emilia.
[3] Yonatan Aumann, Oren Etzioni, Ronen Feldman, Mike Perkowitz, Tomer Shmiel, "Predicting Event Sequences: Data Mining for Prefetching Web-pages".
[4] K. Lerman and L. Jones, "Social Browsing on Flickr", in Proc. of ICWSM Conference, March 2007.
[5] Aron Culotta, Ron Bekkerman, Andrew McCallum, "Extracting social networks and contact information from email and the Web", University of Massachusetts, Amherst.
[6] Mao Chen, Andrea S. LaPaugh, Jaswinder Pal Singh, "Predicting Category Accesses for a User in a Structured Information Space", Department of Computer Science, Princeton University, Princeton.
[7] Benjamin Piwowarski, Hugo Zaragoza, "Predictive User Click Models Based on Clickthrough History", Yahoo! Research, Barcelona, Spain.
[8] R. Zhang, Y. Chang, Z. Zheng, D. Metzler, and J.-Y. Nie, "Search result re-ranking by feedback control adjustment for time-sensitive query", in Proc. of Human Language Technologies Conference (HLT'09), June 2009.
[9] Mehrdad Jalali, Norwati Mustapha, Md. Nasir Sulaiman, Ali Mamat, "WebPUM: A Web-based recommendation system to predict user future movements", Department of Software Engineering, Faculty of Engineering, Islamic Azad University of Mashhad, Mashhad, Iran; Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia.
[10] Melanie Hartmann and Daniel Schreiber, "Prediction Algorithms for User Actions", Telecooperation Group, Darmstadt University of Technology, D-64289 Darmstadt, Germany, {melanie,schreiber}@tk.informatik.tudarmstadt.de
[11] Sitaram Asur and Bernardo A. Huberman, "Predicting the Future With Social Media", Social Computing Lab, HP Labs, Palo Alto, California, sitaram.asur@hp.com, bernardo.huberman@hp.com
[12] Peter Mika, "Flink: Semantic Web Technology for the Extraction and Analysis of Social Networks", Department of Computer Science, Vrije Universiteit Amsterdam (VUA), De Boelelaan 1081, 1081HV Amsterdam, Netherlands.
[13] Dr. JSR Subrahmanyam, "Future Trends of Content Management Systems (CMS) for e-Learning: A Tool Based Database Oriented Approach", Executive Director (Technical), Quantum Softech Limited, Hyderabad, India, jsrs@hd1.vsnl.net.in, ed_tech@quantumsoftech.com

BIOGRAPHY

Gayatri Kabra received her B.E. degree in Information Technology from DKTES Textile and Engineering Institute, Ichalkaranji, Shivaji University, in 2011. She is pursuing an M.Tech in Computer Science and Engineering at Shri Ramdeobaba College of Engineering and Management (Autonomous), Nagpur. Her research interests include big data predictive analytics on social networks.

Mangesh Wanjari received his B.E. in Computer Technology from Nagpur University in 2002 and his Master of Technology (M.Tech) in Computer Science and Engineering from VNIT, Nagpur, in 2009. After gaining some industrial experience he joined the teaching field. He is an associate professor at Ramdeobaba College of Engineering and Management, Nagpur. His research interests include Database Technologies, Query Optimization and Semantic Analysis.