Finding High-Quality Content in Social Media ,

advertisement
International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 6 – March 2015
Finding High-Quality Content in Social Media
Ms. Aparna Todwal1, Prof. M. Wanjari2
1
2
M.Tech Scholar, CSE Department,SRCOEM, Nagpur, India1
Associate Professor, CSE Department, SRCOEM, Nagpur, India 2
Abstract— The quality of user-generated content differs
drastically from excellent to abuse. As the availability of such
content grows, the task of identifying high-quality content in sites
based on user contributions in social media sites which becomes
drastically important. Social media in general serves a rich variety
of information sources: in addition to the content itself, there is a
broad range of non-content information available, such as
connectivity between items and explicit quality ratings from teamworker of the community. This paper investigates methods for
exploiting such community feedback to automatically identify high
quality content. As a test case, the concentration is on Yahoo!
Answers system, a large community question/answering portal. In
particular, for the community question/answering domain, this
paper shows that our system is able to separate high-quality items
from the rest.
Keywords— Social Media; Community Question Answering;
User Interaction.
I. INTRODUCTION
From the early 2000s, user-generated contents has
become increasingly popular on www; so majority of users
participate in content Creation, rather than just consumption.
Popular social media domain contains blogs, social
bookmarking sites, photo and video sharing communities and
social networking platform such as MySpace and Facebook,
which offers a combination of all these and relationship
among users. Community-driven question/answering portals
are the specific form of user-generated content that is gaining
a large audience in recent years. These systems provide an
alternative way. For gaining information on the web. An
important difference between user-generated content and
traditional content that is particularly significant for
knowledge-based media such as question/answering portals is
the difference in the quality of the content. The main
challenge posed by content in social media sites is the fact that
the distribution of quality has high variance: from very highquality items to not concerning quality items, sometimes bad
content. This makes the tasks of filtering and ranking in such
systems more complex than in other domains.
1. What are the elements of social media that can be used to
facilitate automated discovery of high-quality con-tent? In
addition to the content itself, there is a heavy range of noncontent information present, from links between items to
explicit and implicit quality rating from members of the
community. What is the utility of each source of information
to the task of estimating quality?
2. How are these different factors co- related? Is content
alone sufficient for identifying high-quality items?
3. Can community feedback approximate judgments of
specialists?
A. Yahoo! Answers
Yahoo! Answers1 is a question/answering system where
people ask and answer questions on any topic. A user can vote
for answers of other users, mark interesting questions and
answers. Thus, overall, each user has a three role: asker,
answerer and evaluator.
The central element of the Yahoo! Answers system are
basically questions. Each question has its lifecycle. It starts in
an “open" state where it receives the answers. Then at a point
the question is considered closed and can receive no answers.
At this point, a “best answer" is selected either by the asker or
through a voting procedure from other users; once a better
answer is selected, the question is solved. Yahoo! Answers is
a very popular service as a result, it hosts a very large amount
of questions and answers in a wide variety of areas, making it
a particularly useful domain for evaluating content quality in
social media.
II. BASIC ELEMENTS OF QUESTION/ANSWERING
PORTAL
In this paper we address the task of identifying highquality content in community-driven question/answering sites.
As a test case, we focus on Yahoo! Answers, a large portal
that is particularly rich in the amount and types of content and
social interaction available in it. We focus on the following
research questions:
Fig:-1 Entity relationship diagram of answers of question.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 321
International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 6 – March 2015
(Adopted from [1] )
Now, in such a system users mark stars to the question and
asks the question. Again user can give answer to the question,
can vote for the answers and also can evaluate the answer.
The relationships between questions, users asking and
answering questions, and answers can be captured by a
tripartite graph given in Figure 2, edge represents an explicit
relationship between various node types.
Fig:- Interrelation of users-question-answers modeled as tri-partite graph.
(Adopted from [1])
As given in figure 2, a user is not allowed to answer his/her
own questions, there are no triangles in the graph, so in fact all
cycles in the graph have their length at least 6.[1]
III. RELATED WORK
Link analysis in social media. In particular, link-based
ranking algorithms that were triumphant in estimating the
quality of web pages have been applied in this context. Linkbased methods have been shown to be triumphant for several
tasks in social media [2].Two of the most prominent linkbased ranking algorithms are Page Rank[4] and HITS [3]
Consider a graph G = (V;E) with vertex set V with
respect to the users of a question/answer system and contains
a directed edge e = (u; v) є E from a user u є V to a user v є V
if user u has answered to at least one question of user v.
Expertise Rank [8] respect to Page Rank over the transposed
graph Gʹ= (V;Eʹ), that is, a score is cultivated from the
person receiving the answer to the person giving the answer.
The recursion implies that if person u was able to give an
answer to person v, and person v was able to provide an
answer to person w, then u should receive some extra points
given that he/she was able to provide an answer to a person
with a certain degree of expertise.
Propagating reputation. Guha et al. [5] study the problem of
propagating trust and distrust among Epinions users, who may
assign positive (trust) and negative (distrust) ratings to each
other. The authors study ways of integrating trust and distrust
and observe that, while appraising trust as a transitive property
makes sense, distrust cannot be considered transitive.
ISSN: 2231-5381
Ziegler and Lausen [7] has also studied the models for
propagation of trust. They present a anotomy of trust metrics
and discuss ways of incorporating information about distrust
into the rating scores.
Question/answering portals and forums. The particular
framework of question/answering communities we focus on in
this paper has been the object of some study in recent years.
According to Su et al. [6], the quality of answers in
question/answering portals is good on average, but the quality
of specific answers varies considerably. In particular, in a
study of the answers to a set of questions in Yahoo! Answers,
the authors found that the fragment of correct answers to
specific questions asked by the authors of the study, varied
from 17% to 45%. The fragment of questions in their sample
with atleast one good answer was much higher, varying from
65% to 90%, meaning that a method for finding high-quality
answers can have a remarkable effect in the user's satisfaction
with the system.
Expert Finding. Zhang et al. [8] examine data from an online
forum, seeking to identify users with high expertise. They
research the user answers graph in which there is association
between users u and v if u answers a question by v, pertaining
both Expertise Rank and HITS to identify users with high
expertise. Their results gives high interrelationship between
link-based metrics and the answer quality. The authors also
develop synthetic models that record some of the
characteristics of the interactions among users in their dataset.
The HITS algorithm is run on the user-answer graph.
Jurczyk and Agichtein [9] presents an application of the HITS
algorithm [3] to a question/answering portal. The results finds
that HITS is a promising approach, as the obtained authority
score is better interrelated with the number of votes that the
items receive, than simply counting the number of answers the
answerer has given in the past. Dom et al. [10] studied the
interpretation of several link-based algorithms to rank people
by expertise on a network of e-mail exchanges.
Text analysis for content quality. Most work on evaluating the
quality of text has been in the field of Automated Essay
Grading (AES), where writings of students are sorted by
machines on several aspects, including compositionality, style,
precision, and soundness. AES systems are typically shown to
correlate very well with human judgments.
Although simplistic and disputable, these methods are
widely-used and provide a rough estimation of the difficulty
of text.
Implicit feedback for ranking. Implicit feedback from millions
of web users has been shown to be very important source of
result quality and ranking information. Authors had
incorporate the results on click interpretation on web search
results from these studies, as a source of quality information
in social media. In particular, clicks on results and methods
for interpreting the clicks have been studied in references [11].
http://www.ijettjournal.org
Page 322
International Journal of Engineering Trends and Technology (IJETT) – Volume 21 Number 6 – March 2015
IV. PROPOSED APPROACH
Using the literature survey of Question/Answering
portals in Social Media we can say can that the finding high
quality contents in such a systems becomes efficient by
sorting answers with respect to that of the question or the
standard answer. Sorting of the answers can be based on
scoring each answers and sorting them on the basis of scores.
Scoring mechanism can be accomplished with the help of
word net. Word net is the lexical data base, which contains
synonyms for the English words. After scoring answers
positioning them in descending order scores. Now in the first
approach the question will get compared with that of the
answers and for each answer for a question the score will get
generated and positioning of answers will be done.
In the second approach, the users rating will come into an
account and the scoring will done on the basis of users rating.
And positioning of answers will be done on the basis of scores
come out of user rating.
The third approach is to compare each answers of a
question to that of the standard answer. The standard answer
will get generated by Wikipedia and each answer will get its
score and positioning of answers will done.
The main perspective of this approach are:
i.
Efficient way of getting answers on the basis of
scores.
ii.
Useful for the users those who are seeking answers
on the question answering portals.
[3] J. M. Kleinberg. Authoritative sources in a hyperlinked environment.
Journal of the ACM, 46(5):604{632, 1999.
[4] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation
ranking: bringing order to the Web. Technical report, Stanford Digital Library
Technologies Project, 1998.
[5] R. Guha, R. Kumar, P. Raghavan, and A. Tomkins. Propagation of trust
and distrust. In WWW '04: Proceedings of the 13th international conference
on World Wide Web, pages 403{412, New York, NY, USA, 2004. ACM
Press
[6] Q. Su, D. Pavlov, J.-H. Chow, and W. C. Baker. Internet-scale collection
of human-reviewed data. In WWW '07: Proceedings of the 16th international
conference on World Wide Web, pages 231{240, New York, NY, USA, 2007.
ACM Press.
[7] C.-N. Ziegler and G. Lausen. Propagation models for trust and distrust in
social networks. Information Systems Frontiers, 7(4-5):337{358, December
2005.
[8] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online
communities: structure and algorithms. In WWW '07: Proceedings of the 16th
international conference on World Wide Web, pages 221{230, New York,
NY, USA, 2007. ACM Press.
[9] P. Jurczyk and E. Agichtein. HITS on question answer portals: an
exploration of link analysis for author ranking. In SIGIR (posters). ACM,
2007.
[10] B. Dom, I. Eiron, A. Cozzi, and Y. Zhang. Graph-based ranking
algorithms for e-mail expertise analysis. In Proceedings of Workshop on Data
Mining and Knowledge Discovery, pages 42{48, San Diego, CA, USA, 2003.
ACM Press.
[11] T. Joachims, L. A. Granka, B. Pan, H. Hembrooke, and G. Gay.
Accurately interpreting click through data as implicit feedback. In SIGIR,
pages 154{161, 2005.
[12] K. Ali and M. Scarr. Robust methodologies for modeling web click
distributions. In WWW, pages 511{520, 2007.
[13] C. Anderson. The Long Tail: Why the Future of Business Is Selling Less
of More. Hyperion, July 2006.
[14] Y. Attali and J. Burstein. Automated essay scoring with e-rater v.2.
Journal of Technology, Learning, and Assessment, 4(3), February 2006.
[15] A. L. Berger, V. J. D. Pietra, and S. A. D. Pietra. A maximum entropy
approach to natural language processing. Computational Linguistics,
22(1):39{71, 1996.
V. CONCLUSION
Community Question Answering Portal is a popular
information seeking paradigm. Variance of contents in such a
sites is drastically high. So the task of quality estimation in
such a sites or portals is important for the users those who
require relevant answer to their question. So the proposed
approach could be the efficient way for finding high quality
contents in social media by calculating scores for each
answers and positioning them accordingly. User will get ease
of getting answers by having highly ranked answer on top and
relevancy in answer at certain level.
REFERENCES
[1] Carlos Castillo, Debora Donato et al., “FINDING HIGH QUALITY
CONTENTS IN SOCIAL MEDIA”Yahoo! Research Barcelona, Spain,
WSDM'08, February 11.12, 2008.
[2] J. P. Scott. Social Network Analysis: A Handbook.SAGE Publications,
January 2000.
ISSN: 2231-5381
http://www.ijettjournal.org
Page 323
Download