An Implementation of Predictive Hotset Identification and Analysis in Social Networks

International Journal of Engineering Trends and Technology (IJETT) – Volume 12 Number 4 - Jun 2014
Miss. Gayatri Kabra#1, Prof. Mangesh Wanjari#2
#1 M.Tech Scholar, #2 Assistant Professor,
Department of Computer Science,
Shri Ramdeobaba College Of Engg. & Management, Nagpur, India
Abstract— Social networks are platforms where people build connections and share interests, activities, links, resources, tweets, and reviews. They are web-based services that provide a means of interaction over the internet through instant messaging and resource sharing. Because a single platform provides so many different services, structured, unstructured, and semi-structured data are scattered across social networks. It is often difficult for a user to find exact information in such a large dataset. To manage this data cost-effectively in terms of storage and network usage, we need to analyse it. For this we build a hotset: precisely pre-processed social network data that excludes repetitive and meaningless content, so that only useful data remains. There are many kinds of social networking sites, such as job sites, matrimonial sites, product review sites, blogs, and social connecting sites. Some of these sites are meant to be informative and provide users with product reviews written by other users. A common problem on these sites is finding the required information in the reviews and judging the reliability of the reviews through statistical analysis. The contribution of this paper is an approach for searching mobile product information in the form of reviews based on the user’s query, and for estimating the trustworthiness of review quality, calculated on a statistical basis from the users’ replies available on the social network.
Keywords— Social network, hotset, statistical analysis, trustworthiness, reviews, stopwords.
I. INTRODUCTION
Social networks consist of many social actors, which play different roles in the utilities of social networks. The data these actors work on can be structured, unstructured, or semi-structured, and includes images, videos, quick links, uploads, reviews, tweets, etc. Working with this variety of structures on a single platform is quite difficult, so it is easier to work on one kind of data. Sometimes more than one kind of data can be considered together, such as reviews and quick links. Product review data is especially useful among all kinds of social networking data because it contains comments about a product from the product's users. Processing this data into a hotset helps a product user analyse the product. Furthermore, trustworthiness calculated on the basis of the generated hotset can be a measure of review quality.

Data collection is a major task for analysis. The more data one has, the better the purpose of analysing huge amounts of data in terms of big data analytics is served. Data can be taken directly from the internet or collected manually, and it can be of different types. For convenience, our project considers only one type of data, i.e. product review data. After collection, this large amount of data must be stored in an appropriate format to enable easy access for analysis.

After collecting the data, the first step is pre-processing: cleaning the data by removing unnecessary items such as extra symbols and stop words. Pre-processing is mandatory when working with reviews, because unprocessed data leads to errors in further processing. Pre-processing is not done on the whole dataset at once: system data is updated dynamically, so it would be costly to pre-process the whole dataset at every update. Instead, it is performed on the set of data from which the hotset is to be generated. The user-entered query is also pre-processed, so that it can be mapped to the exact dataset that has to be analysed further.

After pre-processing, the data on which the hotset is to be created is obtained by index mapping based on the user’s requirement. Mapping an index to the dataset makes it easier to fetch the required data. Index mapping also records the user’s social connection, i.e. which user each review in a dataset comes from, which is useful for social-based analysis. After this data is fetched, a predictive-social method is applied to predict the hotset for the user’s query. The predictive-social method takes into account the frequency count of words within the common pattern of the dataset. Datasets with a minimal common pattern are discarded, and the rest are considered on the basis of social connection. The user's social connections are taken into account for the trust factor, where reviews from connected people are given priority. The remaining reviews are kept for analysis based on the users' social connections. Social connections are used to predict a factor named “trustworthiness”, which is obtained by considering different social connection groups. This factor affects the predicted review quality: it tells the user what percentage of trust to place in the predicted review.
ISSN: 2231-5381
http://www.ijettjournal.org
II. RELATED WORK
Predictive content management for the web has two kinds of strategies, based on the content involved: static and dynamic. Static content management strategies consider historical data that has arrived in the system over a fixed time period, whereas dynamic strategies deal with data that changes dynamically on the web over time. Static methods take into account static data abstracted in the form of web log files or click-through history. One static predictive strategy, defined in [3], predicts sequences of events on prefetched web pages. There, event sequences are formed on the basis of the Markov LxS (length times support) algorithm, which forms episodes of event sequences that occurred during a particular time; i.e. time is a deciding factor.
Historical data from click-through history and log files can also be used for predictions [7], [9]. In both of these works, static historical data from web log files is processed to predict events or sequences of events that occur in some predefined time interval. Web log data clearly defines the web access sequence of a user. Moving towards structured data spaces in social networks, the work in [6] can be considered for prediction over category accesses. Forming categories of accesses in a predefined time interval makes prediction easier, because in a given time phase the accesses from particular users tend to be of a similar kind.
Moving to dynamic content management strategies, there are adaptive algorithms for hotset prediction that combine predictive and social-based estimation through rank merging techniques [1]. Rank merging techniques also differ between dynamic and static rank merging; both strategies are clearly defined in [2]. For the purpose of rank merging, social connection information can be extracted by different techniques [5]. Strategies for managing huge content rely basically on big data analytics techniques. Strategies for managing this huge amount of dynamically updating web content are discussed in [13]. Web content has different forms and features and is huge in volume, so it is difficult to extract. Extraction technology for social networks is described in [12].
In our approach to analysing product reviews, reviews are filtered on the basis of word frequency count. This measure is useful for review data because such data has a similar structure and shares common words. The basis of the procedure is simple: discard the reviews with the minimum matching word frequency count. The matching word frequency count criterion comes from the concept of tokenization, where each separate word that remains after removing stopwords from a review is treated as a ‘token’ whose frequency is matched. Beyond the frequency of separate words, if the matched words also appear in order, the probability of keeping the review increases further. Because the data is cleaned and analysed on the basis of word frequency count, this can be called predictive analytics. Extracting social connection information about a user’s contacts and using it to decide hotsets through a rank merging technique is done by the adaptive algorithm in [1]. Social connection information can also be used to determine trustworthiness on the basis of the weight given to reviews from different groups, i.e. reviews from friends, friends of friends, and non-friends. Clearly, the weight is highest for reviews from friends. Since trustworthiness depends on the weights given to the different groups of reviews, the weights should be chosen carefully.
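The token-matching idea described above can be illustrated with a small sketch. It is not the authors' implementation: the names `tokenize` and `match_score`, the sample stopword list, and the choice to add one bonus point for each pair of consecutive query keywords that also appears consecutively in the review are all assumptions made for illustration.

```python
# Hypothetical sample stopword list; the paper does not publish its dictionary.
STOPWORDS = {"is", "in", "the", "a", "to", "for"}

def tokenize(text):
    """Lowercase the text, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def match_score(query, review):
    """Score a review against a query by matching word frequency.

    Assumption: every occurrence of a query keyword in the review counts
    once, and each adjacent keyword pair found in the same order adds one
    bonus point.
    """
    q, r = tokenize(query), tokenize(review)
    frequency = sum(r.count(token) for token in set(q))
    review_pairs = set(zip(r, r[1:]))
    order_bonus = sum(1 for pair in zip(q, q[1:]) if pair in review_pairs)
    return frequency + order_bonus
```

Reviews whose score is lowest would then be the candidates for discarding; where to place that cut-off is a tuning decision the paper leaves open.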
III. PROPOSED APPROACH
The goal of big data predictive analytics in social networks is to analyse data according to its usage, identify the predictive hotset, and lower the cost of data in terms of storage, network usage, and computational processing. On social networking product review sites it is often not possible to get exact information from the product reviews. This can be addressed by generating hotsets of the review dataset based on the user-entered query. Analysing the reviews in the generated hotset on statistical criteria then gives the user the review quality in terms of trustworthiness. Both predictive and social connection based techniques are used to generate the hotset.
Our approach consists of the following phases:
1. Data Collection and Preprocessing
2. Mapping of user's query to required review set and Hot set generation
3. Analysis of generated hotset in terms of trustworthiness
All the phases are discussed in detail below.
A. Data Collection and Pre-processing
Data collection is an important aspect of our project as it is based on big data analytics. Big data, as the name indicates, comprises large amounts of data (possibly terabytes or petabytes) growing explosively over the internet. To analyse such data, the system needs a workload that makes it respond as it would to big data. While gathering data for our project we had difficulty obtaining bulk data of a similar kind. To resolve this, one can collect data oneself by hosting a forum on a web server. Different people upload reviews, short comments, links, and so on to this server, which generates data automatically in bulk. Another advantage of hosting a forum is that one obtains the users' social connection information, which is useful later. Social connections are simply the people connected to each other via the social network.
Data collected through a forum or any other means will be unstructured and will contain misspelled words, irrelevant words, extraneous comments, dots, etc. To generate hotsets without error it is necessary to remove all these irrelevant items.
Consider an example. A user may pose a question about a product as follows:
hiii, is there any body who can explain me that whatsapp is working on this new nokia 207 or not?
This question contains “hiii”, which is of no use in predicting hotsets. For this reason the first operation on the data is the removal of stopwords. To do this, a dictionary containing all the stopwords to be removed from the data is created. For this type of question, reviews in the form of relevant answers from different users might be:
nikky123:whats app wasn't pre istall but i find solution. solution is in the link below:
http://discussions.nokia.com/t...td-p/2144653
You need a pc to do it. It worked for me.
You are welcome….
Sohan11:http://discussions.nokia.com/whattsapp_sol/2345112
Again, one can observe that the answer from username “nikky123” contains multiple dots (“….”), which are inappropriate for hotset creation. These need to be removed, which is accomplished by preprocessing. If one works on raw data without removing irrelevant data, it is difficult to form a proper predictive hotset because the dataset is erroneous. Hence preprocessing is a mandatory task in hotset identification.
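The cleaning step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `preprocess` and the contents of the stopword dictionary (including the greeting variants) are hypothetical, since the paper does not list its dictionary.

```python
import re

# Hypothetical stopword dictionary; the entries are chosen only to cover
# the sample question and answer quoted in the text.
STOPWORDS = {
    "hi", "hii", "hiii", "is", "there", "any", "body", "who", "can", "me",
    "that", "on", "this", "or", "not", "the", "in", "to", "how", "a", "it",
    "you", "are",
}

def preprocess(text, stopwords=STOPWORDS):
    """Return the cleaned tokens of one question or review.

    Lowercases the text, keeps only runs of letters, digits and apostrophes
    (which discards dots, ellipses and other symbols), and removes stopwords.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return [t for t in tokens if t not in stopwords]

question = ("hiii, is there any body who can explain me that whatsapp is "
            "working on this new nokia 207 or not?")
print(preprocess(question))
```

Note that this simple tokenizer would also split URLs into fragments; a system that wants to keep quick links intact would need to extract them before tokenizing.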
B. Mapping of user's query to required review set and Hot set generation
The next step after preprocessing is mapping the user's query to a review set. The user might enter a query in the search box as:
1. “how to solve Heating problem in Grand 2”
2. “Over heating in grand 2”
3. “what to do if my galaxy grand 2 gets heating problem”
All three queries above have the same answer set, so the hotsets generated for them will be the same. To search for the reviews of the entered query and generate predictive hotsets, the first step is “Mapping”.
Mapping: To map the entered query, one first removes all the stopwords in it, because stopwords play no role in the mapping process.
E.g. in query 1 above, ‘How’, ‘to’, and ‘in’ are the three stopwords. After removing them, the remaining set of keywords is “solve Heating problem Grand 2”. This set of keywords is compared with each question in our dataset, and the query is mapped to the question with the maximum number of matching words. A user may sometimes enter grand2 instead of Grand 2, which makes it difficult to map to the appropriate question. Converting the entire string to lowercase before matching is one option to reduce such mismatches.
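A possible shape for the mapping step is sketched below. The names `keywords` and `map_query`, the stopword list, and the stored questions are hypothetical. Lowercasing follows the text; splitting letter/digit boundaries so that "grand2" yields the same keywords as "grand 2" is an extra assumption, because lowercasing alone does not remove that difference.

```python
import re

# Hypothetical stopword list covering the three sample queries.
STOPWORDS = {"how", "to", "in", "what", "do", "if", "my", "the", "is", "a"}

def keywords(text):
    """Return the set of keywords left after normalisation and stopword removal."""
    text = text.lower()
    # Assumption: insert a space between a letter and a digit ("grand2" -> "grand 2").
    text = re.sub(r"(?<=[a-z])(?=[0-9])|(?<=[0-9])(?=[a-z])", " ", text)
    return {t for t in re.findall(r"[a-z0-9]+", text) if t not in STOPWORDS}

def map_query(query, questions):
    """Return the stored question sharing the most keywords with the query.

    Returns None when no question shares any keyword.
    """
    query_keys = keywords(query)
    best, best_score = None, 0
    for question in questions:
        score = len(query_keys & keywords(question))
        if score > best_score:
            best, best_score = question, score
    return best

# Hypothetical stored questions.
questions = ["Heating problem in Galaxy Grand 2", "WhatsApp on Nokia 207"]
print(map_query("Over heating in grand2", questions))
```

With these sample questions, all three queries quoted earlier map to the same stored question, which is the behaviour the text expects.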
Hot set Generation: After the entered query is mapped to a question, all of its reviews can be fetched automatically from storage. During hot set generation the user’s social connections are also given preference: if any of the required reviews was written by a connected user, it is automatically preferred and displayed at the top on the basis of the trust factor from the user's social connection.
In the remaining review set one can observe a particular pattern of reviews containing common words.
E.g. the mapped review set for queries 1, 2 and 3 above is:
- heating is common problem for just 10-15 day after that its not a big issue
- heating is common in all phones but in this phone heating issue is very very because my brother using the galaxy s4 in that also having heating issue in sony also suffer from heating issue so if you don’t see another phone then don’t talk here
- all phone have pros and cons if u will be see such type common problem then u cnt buy any smart phone in world. I recommend to u buy nokia 112 or 1100 coz u can’t see smart phone have no exhast fan for coolant
We can thus observe a common pattern based on the frequency of words. The reviews with the maximum matching pattern based on word frequency are extracted and kept as a set called the hotset. Social connection also plays an important role in the analysis of the generated hotset.
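One way to read the hot set generation step is sketched below. It is an illustration under assumptions rather than the authors' algorithm: the function name `generate_hotset`, treating a word as "common" when it occurs in at least two reviews, and the `min_score` threshold are all hypothetical choices.

```python
from collections import Counter

def generate_hotset(reviews, friends, stopwords, min_score=2):
    """Build a hotset from (username, text) reviews of one mapped question.

    Reviews from connected users (friends) are always kept and listed first.
    Other reviews are kept only if they contain at least `min_score`
    occurrences of words that are common across the review set.
    """
    tokenized = [
        (user, text, [w for w in text.lower().split() if w not in stopwords])
        for user, text in reviews
    ]
    # Count in how many reviews each word appears.
    doc_freq = Counter(w for _, _, tokens in tokenized for w in set(tokens))
    common = {w for w, n in doc_freq.items() if n >= 2}

    kept = []
    for user, text, tokens in tokenized:
        score = sum(1 for w in tokens if w in common)
        if user in friends or score >= min_score:
            kept.append((user in friends, score, user, text))
    # Friends first, then by descending pattern score.
    kept.sort(key=lambda item: (item[0], item[1]), reverse=True)
    return [(user, text) for _, _, user, text in kept]
```

A review that shares no common words with the rest of the set is dropped unless it comes from a connected user, which mirrors the priority the text gives to social connections.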
C. Analysis of generated hotsets
The hotsets generated in the previous step are based purely on prediction. How can a user know the trust factor by which the quality of the analysis can be judged? For this, one needs to know the trustworthiness of the reviews contained in the hotset. The best way to determine it is to use the user’s social connections. Trustworthiness is derived by assigning different weights to different sets of reviews. Different weights can be chosen, and they must be chosen carefully to avoid a wrong trust factor. Trustworthiness is given by the following formula:

$$\text{Trustworthiness in \%} = \frac{\text{No of reviews from particular group} \times \text{Weight of particular group reviews}}{\text{Total No of reviews} \times \text{Maximum Weight}} \times 100$$

Where,
- No of reviews from particular group is the count of reviews from one group: the first social connection, i.e. friends of the user; the second connection, i.e. friends of friends; or non-friends.
- Weight of particular group reviews is the weight given to that particular group, which differs from group to group.
- Total no of reviews is the total count of reviews in the hotset.
- Maximum weight is the maximum weight given to any group.
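The formula can be checked with a short worked sketch. The function name `trustworthiness`, the group labels, and the counts and weights in the example are hypothetical values chosen for illustration, not data from the paper. Summing the per-group values into one overall figure is also an assumption, since the formula as printed is stated for a single group.

```python
def trustworthiness(counts, weights):
    """Return the per-group trustworthiness in percent.

    counts:  number of hotset reviews contributed by each group
    weights: weight assigned to each group
    """
    total = sum(counts.values())
    max_weight = max(weights.values())
    if total == 0 or max_weight == 0:
        return {group: 0.0 for group in counts}
    return {
        group: 100.0 * n * weights[group] / (total * max_weight)
        for group, n in counts.items()
    }

# Hypothetical example: 10 reviews in the hotset, friends weighted highest.
counts = {"friends": 4, "friends_of_friends": 3, "non_friends": 3}
weights = {"friends": 3, "friends_of_friends": 2, "non_friends": 1}
per_group = trustworthiness(counts, weights)
overall = sum(per_group.values())  # assumption: overall value is the sum over groups
print(per_group, overall)
```

With these numbers the friends group contributes 4 × 3 / (10 × 3) × 100 = 40%, and the overall value reaches 100% only when every review comes from the group with the maximum weight.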
IV. PERFORMANCE EVALUATION:
In this section we evaluate the generated hotsets against the user-entered query. Queries vary in the number of words entered, and the results of manual and system-based execution vary with it. We carry out different sets of experiments to demonstrate the effectiveness of the predictive hotset against trustworthiness by means of the formulae. Results indicate that the model responds differently to different input sizes, varying from 40% to 70%. We have also studied the performance of the end result by varying the offset introduced by the error in the recommended formula.
Our system provides for a user-entered query, where the user can enter the query in the form of keywords. We tested the system under two criteria: manual and system-based. The results of hotset generation vary with the number of keywords entered. The results for queries of different lengths on review sets of increasing size, under both criteria, are shown in the graphs that follow.
For a one-word query, manual hotset generation gives the correct result for the first 50 reviews. As the review set grows, the correctness of manual detection falls: within 50 reviews there is always a chance of finding the reviews with a unique index, but as the dataset grows the probability of success over a unique index decreases. This can be seen in Fig. 1. Note also that a query of more than one word may contain stopwords, so a two-word query that includes a stopword gives results similar to those of a one-word query.

[Figure: Graph for query of 1 word. x-axis: No of Reviews (50, 100, 150, 200); series: Manual, System Based.]
Fig. 1 Results for analysis of query with one word

The graph for a two-word query is shown in Fig. 2. It does not consider the case where one or both of the two entered words are stopwords. One can observe from the graph that the manual and system-based search results vary as they do for the one-word query.

[Figure: Graph for the query of 2 words. x-axis: No of Reviews (50, 100, 150, 200); series: Manual, System Based.]
Fig. 2 Results for analysis of query with two words
Experimental results for a three-word query are shown in Fig. 3. Here an increase in the dataset does not affect the manual results much, and system-based query execution shows similar variation. Review sets of up to 150 show similar differences between system-based and manual execution results, but as the dataset grows beyond 150 the difference in accuracy increases.
For a four-word query, manual results give the maximum probability of a successful search; i.e. one can conclude that a query of four or more words is enough for manual execution to succeed. For system-based execution, the success rate is higher for a small review set than for a large one. Overall performance for a four-word query is better than for the three cases above.
[Figure: Graph for the query of 3 words. x-axis: No of Reviews (50, 100, 150, 200); series: Manual, System Based.]
Fig. 3 Results for analysis of query with three words

Fig. 4 Results for analysis of query with four words

Manual results for query execution are taken to be correct. The accuracy of system-based results can therefore be calculated using the correctness of manual results as the base:

$$\text{Accuracy in \%} = \frac{\text{No of System based reviews in scenario}}{\text{No of Manual reviews in scenario}} \times 100$$

The graph for accuracy is given in Fig. 5.

Fig. 5 Accuracy of system based results

V. CONCLUSION AND FUTURE WORK
The application of social networks to product review sites is to learn the actual users' opinions. The problem with this kind of site is that one may not find what is desired and cannot judge the trustworthiness of reviews. For this purpose our project provides a facility to search for required information in a review dataset. The review dataset is huge, so to identify the useful reviews we generate predictive hotsets on the basis of the user-entered query. To form these hotsets we use word frequency, because the review dataset consists of data of a similar kind. After discarding data with a minimum match in word frequency count, the social preferences of the data are also considered. Because social connections are considered in hotset generation, this can be called a predictive-social method for generating hotsets. Another application of social connections is deciding trustworthiness. Trustworthiness is a factor determined by the weights given to different groups of data; these weights can be varied depending on the groups of users posting reviews. Trustworthiness also expresses the trust factor of reviews, i.e. it is a concluding measure for the analysis of review quality.

In future work, different statistical analysis measures could be used to determine the quality of the reviews in the generated hotset. These measures can be based on review rating: one can consider the different ratings given to a review by users, and statistical analysis can determine the range by which quality can be predicted.
REFERENCES
[1] Claudia Canali, Michele Colajanni, Riccardo Lancellotti, “Adaptive Algorithms for Efficient Content Management in Social Network Services,” University of Modena and Reggio Emilia.
[2] Claudia Canali, Michele Colajanni, Riccardo Lancellotti, “Hot Set Identification for Social Network Applications,” Department of Information Engineering, University of Modena and Reggio Emilia.
[3] Yonatan Aumann, Oren Etzioni, Ronen Feldman, Mike Perkowitz, Tomer Shmiel, “Predicting Event Sequences: Data Mining for Prefetching Web-pages.”
[4] K. Lerman and L. Jones, “Social Browsing on Flickr,” in Proc. of ICWSM Conference, March 2007.
[5] Aron Culotta, Ron Bekkerman, Andrew McCallum, “Extracting social networks and contact information from email and the Web,” University of Massachusetts – Amherst.
[6] Mao Chen, Andrea S. LaPaugh, Jaswinder Pal Singh, “Predicting Category Accesses for a User in a Structured Information Space,” Department of Computer Science, Princeton University, Princeton.
[7] Benjamin Piwowarski, Hugo Zaragoza, “Predictive User Click Models Based on Clickthrough History,” Yahoo! Research, Barcelona, Spain.
[8] R. Zhang, Y. Chang, Z. Zheng, D. Metzler, and J.-Y. Nie, “Search result re-ranking by feedback control adjustment for time-sensitive query,” in Proc. of Human Language Technologies Conference (HLT’09), June 2009.
[9] Mehrdad Jalali, Norwati Mustapha, Md. Nasir Sulaiman, Ali Mamat, “WebPUM: A Web-based recommendation system to predict user future movements,” Department of Software Engineering, Faculty of Engineering, Islamic Azad University of Mashhad, Mashhad, Iran; Department of Computer Science, Faculty of Computer Science and Information Technology, Universiti Putra Malaysia, Selangor, Malaysia.
[10] Melanie Hartmann and Daniel Schreiber, “Prediction Algorithms for User Actions,” Telecooperation Group, Darmstadt University of Technology, D-64289 Darmstadt, Germany.
[11] Sitaram Asur and Bernardo A. Huberman, “Predicting the Future With Social Media,” Social Computing Lab, HP Labs, Palo Alto, California.
[12] Peter Mika, “Flink: Semantic Web Technology for the Extraction and Analysis of Social Networks,” Department of Computer Science, Vrije Universiteit Amsterdam (VUA), De Boelelaan 1081, 1081HV Amsterdam, Netherlands.
[13] Dr. JSR Subrahmanyam, “Future Trends of Content Management Systems (CMS) for e-Learning: A Tool Based Database Oriented Approach,” Quantum Softech Limited, Hyderabad, India.
BIOGRAPHY
Gayatri Kabra received her B.E. degree in Information Technology from DKTES Textile and Engineering Institute, Ichalkaranji, Shivaji University, in 2011. She is pursuing an M.Tech in Computer Science and Engineering at Shri Ramdeobaba College of Engineering and Management (Autonomous), Nagpur. Her research interests include big data predictive analytics on social networks.
Mangesh Wanjari received his B.E. in Computer Technology from Nagpur University in 2002 and his Master of Technology (M.Tech) in Computer Science and Engineering from VNIT, Nagpur, in 2009. After gaining some industrial experience he joined the teaching field. He is an associate professor at Ramdeobaba College of Engineering and Management, Nagpur. His research interests include Database Technologies, Query Optimization and Semantic Analysis.