BLOG 2.0

CS 8803
Advanced Internet Application Development
BLOG 2.0
Project Report
INDEX
1. Keywords
2. Terminology
3. Motivation
4. Objectives
5. Related Work:
6. System Scope
7. Implementation Modules
8. System Architecture:
9. Implementation Details
10. Plan of Action
10.1 Resources
10.2 Schedule
11. Testing
12. Bibliography
Blog 2.0
Enhancing the Blog Experience
Hardik Chheda, hardik.chheda@gatech.edu, Georgia Institute of Technology
Ashwin Paranjpe, ashwin.p@gatech.edu, Georgia Institute of Technology
Ketan Kalgaonkar, ketan@gatech.edu, Georgia Institute of Technology
1. Keywords
Blog 2.0, Web 2.0, RSS 2.0, RSS Feeds, Enhanced Blogs, Blogger Gadgets, Blogger
API, Location API
2. Terminology
Blog: A Web site, usually maintained by an individual, with regular entries of
commentary, descriptions of events, or other material such as graphics or video.
Blog Owner: The individual who has created a personal web page using a hosted
blogging service.
Blog Users: Individuals who read the content published on Blogs. The Blogs may
belong to other users located anywhere in the world.
RSS: An acronym for Really Simple Syndication. RSS is a family of Web feed
formats used to publish frequently updated works (such as blog entries, news
headlines, audio, and video) in a standardized format.
3. Motivation
A recent study by Ball State University concludes that blogs have, in fact, done
very little to increase the quality of dialogue with the public. So what can we
expect next? What could be the future of our very own, favorite Blogs? There is
probably a need for a more decisive quest for better contribution, not through
software filters and algorithms but through human, professional judgment. Our
project aims to answer some of these questions.
We propose a set of features and implement several enhancements to
existing Blog hosting services to offer more personalized services and create an
enhanced user experience. Web 2.0 describes the changing trends in the use of
World Wide Web technology and web design that aim to enhance creativity,
communication, secure information sharing, collaboration and the functionality of
the web. Blogs are an important part of the Web 2.0 revolution.
Recently, there has been an exponential increase in the number of Blog users
and this trend is expected to continue in the near future. A popular term
“Blogosphere” has been coined to imply the virtual network of blogs connecting
people and facilitating information dissemination for social interaction. Users feel
the urge to express themselves through Blogs and there is a constant effort to
personalize the blogs as much as possible.
The Blog 2.0 Project aims to provide features and enhancements in this direction
so as to make the Blogging experience more personalized. This will help the
readers of a Blog gather more information about the Blog owner and will
enable better social interaction.
4. Objectives
1. To gather requirements about the enhancements and additional features that
existing Blog Users would expect from a Blog 2.0
2. To find whether these requirements have already been implemented in order to
filter them out; or design a way to enhance them
3. To enable collection of relevant posts across Blogs and render them together
4. Automatic Tagging
5. Sorting based on degree of relevance
6. Sorting based on User Ratings
7. To enable the user to continuously publish his location information on the Blog
8. To enable the user to show his current status information
9. To enable the readers of the Blog to post Voice comments
10. To add Analytics information to the Blog to better track the popularity of the posts
and enable dynamic re-organization of posts based on ranking information
5. Related Work:
To the best of our knowledge, the following efforts are directed at improving the
existing Blogs and Blog-hosting landscape:
1. Blogonize - Includes a hot page that displays popular blog entries. This work is
aligned with our effort to introduce ranking of Blog posts. However, we could
not find a full-fledged working implementation to compare the approaches, and
the web site is still in Beta.
2. CallinSearch - CallinSearch has two components: a database that associates
URLs with Click-to-Call links, RSS feeds, video/audio streaming content and
location data; and plugins that allow searchers to link to those files. In some ways,
the Blog 2.0 project extends the ideas used in CallinSearch by using RSS
feeds to share real-time location updates.
3. ChinSwing - Chinswing.com is a new voice-based message board that combines
features of podcasting, text message boards, and live voice chat. However, we
do not propose to evolve our Blog 2.0 model into a complete voice-based Blog,
because voice posts cannot be easily indexed by search
engines. Unlike ChinSwing, Blog 2.0 would add only voice-based comments
to existing textual posts. Having textual posts helps search engines index
the content. Since the voice comments beneath a post would be related to the
content of the post, we expect them to be indexed indirectly through the textual
content.
4. Findory - Personalized news and weblog reader. Findory learns from the articles
you read and recommends other interesting articles.
5. LoudBlog - Loudblog is a sleek and easy-to-use Content Management System
(CMS) for publishing media content on the web. It automatically generates a
skinnable website and an RSS feed for podcasting.
6. System Scope
• To enable the detection of relevant content across multiple Blogs
With millions of Blog posts being published daily, there is a need for a better way to find
relevant content across multiple Blogs. This will help the user refer to relevant
content posted by other users in the Blogosphere. It can lead to better
awareness of the latest happenings in the user's social network and would also help
the user learn how his/her friends feel about a particular
subject or a common topic. This will help the user articulate his thoughts and
create references to other posts conveniently.
• Automatic Tagging
Current systems like “Blogger.com” make the user select the Labels for his post that
seem most relevant to him. The assumption that the user will correctly tag his
post with the most relevant keywords may not always hold. The user might
also have malicious intent and tag his post with multiple popular tags
just to increase the revenue of his Blog. Moreover, users may find it inconvenient
to manually tag every post. In practice, we
observed that most users do not bother to tag their posts, so
this valuable feature of the Blog is hardly ever used.
• Sorting Posts Based on Relevance
After surveying the current blogging systems, we observed that hardly any
system provides the functionality of finding relevance between posts. Of the
very few Blog 2.0 systems that provide this feature, almost none sort the
posts by degree of relevance. We think that when our system scales
up to serve thousands of Bloggers, any post will have links to
hundreds of relevant posts found by our system. A list of
hundreds of relevant posts is far more useful if the system
shows the posts in descending order of relevance.
• Sorting Posts based on User Ratings
This is another feature that becomes extremely important when the system scales up to
serve thousands of Bloggers. User Ratings form an important part of the
feedback mechanism for blogs. They can help track the popularity of the Blog
and might help in tuning the content of the Blog for better results. We think that if
a reader visits a Blog and is interested in knowing more about the
owner of the Blog, it will be most useful to display the owner's most popular
posts.
• To enable the user to continuously publish his location information on the Blog
Today, there is a plethora of plug-ins that show the location of the user of
a web site or the Blog reader. Such location information can be gathered from the IP
address, GPS location tracking, Wi-Fi triangulation techniques and so on.
However, most of these efforts have been targeted toward displaying the location
of the user who is currently viewing the web page. In contrast, we plan to
offer a location-based service that shows the
whereabouts of the Blog owner. This will let readers know where they
can meet the owner at that moment and would thus help in better networking or
interaction with the owner.
• To enable the user to show his current status information
Similar to the location information, there is scope for sharing several other
aspects of the status of the Blog owner. Currently, instant messengers have a
feature wherein a user can display a status such as available, busy or a
custom message that is visible to other users. Taking it a step further,
we would like to enable the Blog owner to post his status on the Blog, such as his
current work-related update, his latest fascination or even the track that he is
listening to in his music player. This would help the Blog reader know whether it is a
good time to contact the Blog owner.
• To enable the readers of the Blog to post Voice comments
During the requirement analysis phase of this project, we spoke with several Blog
users and took their feedback on several aspects of the usability of the
Blog. We observed that many Blog readers do not post comments on the
posts. Users spend some time reading an
interesting post but tend to avoid writing comments in the short
time they have. This restricts the views expressed on the Blog to those submitted
by the owner, which might reduce the credibility of the views expressed in the
post. Voice comments are meant to lower this barrier by letting readers respond
without typing. Having more comments from a spatially and culturally diverse set
of users helps readers become aware of views that align with or
contradict the ones expressed by the Blog owner.
• To add Analytics information to the Blog to better track the popularity of the posts
Analytics has been a focus of much Web-related research.
Measurements of the number of visitors and their geographical
distribution can help the owner of a web site analyze such data
and publish targeted content to attract more users. Such analytical
information can be extremely useful for re-organizing the Blog posts
or linking the most popular posts together.
• Group Blogs with Location Information
Several Group Blogs exist today in which a group of
people collaborate on specific topics and write Blog posts to express their
ideas. Group Blogs are a good example of a distributed information sharing system
that can be used extensively for sharing technical information or for helping other
people by pooling the expertise of many contributors. Group Blogs are
convenient and are expected to out-perform various technical and non-technical
mailing lists.
Imagine how a Group Blog could replace the COC-MS mailing list used at Georgia
Tech. Instead of sending numerous e-mails daily to share information about
events or seminars on campus, a simple Group Blog can be maintained. Multiple
people can post to this Blog, allowing hundreds of students to view the posts.
Convenient sections can be formed within the Blog so that a student
can look for updates only in the types of posts he is interested in.
In fact, using an RSS-feed-based mechanism, a student can subscribe to the particular
type of post update he is interested in and visit the Blog web site only when
desired. This is a tremendous leap over conventional mailing-list-based systems,
which result in hundreds of e-mails being sent daily and which can also feel
intrusive.
There are several Group Blogs already functioning on the World Wide Web.
However, we propose an enhancement to such Group Blogs by adding location
information. Consider a technical Blog in which a group of experienced software
developers share their knowledge about particular topics, e.g. Kernel
Development, Security, Web 2.0, Networking and so on. Generally we can see a
list of collaborators on the Group Blog web site, which includes their contact
information.
However, it would be really helpful to know that a particular
collaborator is currently in our city to deliver a lecture at a conference. We
propose to include the current, real-time location information of every
collaborator of the Group Blog on the Blog web site.
• Ranking Blogs for reading quality content
It is frustrating to read a lot of content only to realize that the opinion expressed
in the blog is biased or does not cover the topic fully. For
example, if a user wants a review of a particular vacation destination
and the author rambles on for 4 to 5 pages about insignificant details of
his trip, the content is of no use to the user and he has lost
precious time reading the blog.
There should be a mechanism by which the user can filter or rank blogs and
easily identify a quality blog that gives him more information
for the time invested in reading.
We propose a ranking scheme that sorts the blogs according to the
ratings that users supply for each blog.
7. Implementation Modules
• Detection of relevant content across multiple Blogs
• Automatic tagging
• Sorting posts based on Degree of Relevance
• Sorting posts based on User Ratings
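The two sorting modules listed above can be illustrated with a minimal sketch. It assumes each post is available as an array that already carries a count of keywords shared with the post being viewed and a user rating. The function name `sort_posts_desc`, the field names `matches` and `rating`, and the sample data are hypothetical and are not taken from the project code.

```php
<?php
// Hypothetical helper: order posts by a numeric field, highest value first.
function sort_posts_desc(array $posts, $field) {
    usort($posts, function ($a, $b) use ($field) {
        return $b[$field] - $a[$field];
    });
    return $posts;
}

// Sample data (made up for illustration).
$posts = array(
    array("title" => "Post A", "matches" => 1, "rating" => 4),
    array("title" => "Post B", "matches" => 4, "rating" => 2),
    array("title" => "Post C", "matches" => 2, "rating" => 5),
);

// Degree of relevance, assumed here to be the number of shared keywords.
$byRelevance = sort_posts_desc($posts, "matches");
// User rating, assumed to be stored with each post.
$byRating = sort_posts_desc($posts, "rating");

echo implode(", ", array_column($byRelevance, "title")) . "\n"; // Post B, Post C, Post A
echo implode(", ", array_column($byRating, "title")) . "\n";    // Post C, Post A, Post B
```

Keeping the sort key as a parameter lets one helper serve both modules; a real system would probably apply the ordering in the database query instead.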
8. System Architecture:
I) System Architecture for finding Relevant Content and inter-blog and intra-blog Ranking
II) System Architecture for Location Based Services
9. Implementation Details
Code Snippets of keyword extraction engine:
function keyword_extract($text){
    // Show the post text that is being analysed.
    echo $text;
    // Normalize: drop commas, periods and semicolons, then lower-case.
    $text = str_replace(array(",", ".", ";"), "", $text);
    $text = strtolower($text);
    // Replace the remaining punctuation characters with spaces.
    $punc = ". , : ; ' ? ! ( ) \" \\";
    $punc = explode(" ", $punc);
    foreach ($punc as $value) {
        $text = str_replace($value, " ", $text);
    }
    // Common words that must never become keywords.
    $commonWords = "about,that's,this,that,than,then,them,there,their,they,it's,with,which,were,where,whose,when,what,her's,he's,have";
    $commonWords = explode(",", strtolower($commonWords));
    $words = explode(" ", $text);
    // Keep words longer than three characters that are not common words.
    $keywords = array();
    foreach ($words as $value) {
        if (strlen($value) > 3 && !in_array($value, $commonWords)) {
            $keywords[] = $value;
        }
    }
    // Count occurrences and sort by descending frequency.
    $arr = array();
    $keywords = array_count_values($keywords);
    arsort($keywords);
    foreach ($keywords as $key => $value) {
        // Only words used more than three times qualify as keywords.
        if ($value > 3) {
            echo "<p><strong>" . ucfirst($key) . "</strong> is used <strong>" . $value
                . "</strong> times. | <a href=\"http://www.google.com/search?q=" . urlencode($key)
                . "\" target=\"_blank\">Search Google</a></p>";
            // Return at most the five most frequent keywords.
            if (count($arr) < 5) {
                $arr[] = $key;
            }
        }
    }
    // Empty array when no word qualifies.
    return $arr;
}
The above piece of code extracts keywords from the text of a particular post in
the blog and uses them to tag the post automatically. Of the words that occur more
than three times, the engine returns up to five of the most frequent as tags for the post.
Below is the code that forms the core functionality of the project. It finds
similar posts across different blogs. Based on the tags that match those of the given post,
our engine searches the database for similar posts and displays the posts whose number
of matching tags is above the threshold. The threshold is a parameter that sets how many
matching tags make a post eligible as a similar post.
// $con (the database connection) and $resultA (the query result holding the
// row of the current post) are assumed to be set up earlier in the script.
// Note: the mysql_* extension was removed in PHP 7; mysqli or PDO replaces it.
$resultB = mysql_query("select * from blog_data where post_id != " . intval($blog_id_param));
if (!$resultB)
{
    die('B: Query failed: ' . mysql_error());
}
$rowA = mysql_fetch_array($resultA);
$base_cnt = 0;
echo "<br><br>Keywords for " . $rowA[1] . " : " . $rowA[2] . " " . $rowA[3] . " " . $rowA[4] . " " . $rowA[5] . " " . $rowA[6] . "<br><br>";
// Compare the current post (row A) against every other post (row B).
while ($rowB = mysql_fetch_array($resultB))
{
    $countA = 2;      // columns 2..6 hold KEY1..KEY5
    $match_val = 0;   // number of keywords shared with row B
    while ($countA < 7)
    {
        if ($rowA[$countA] != "")
        {
            if ($rowA[$countA] == $rowB[2] or
                $rowA[$countA] == $rowB[3] or
                $rowA[$countA] == $rowB[4] or
                $rowA[$countA] == $rowB[5] or
                $rowA[$countA] == $rowB[6])
            {
                $match_val = $match_val + 1;
                echo $rowA[$countA] . " ";
            }
            // Threshold: link the post once more than two keywords match.
            if ($match_val > 2)
            {
                echo "<br> <a href=\"http://localhost/view2.php?title=" . $rowB[0] . "\">" . $rowB[1] . "</a><br><br>";
                break;
            }
        }
        $countA = $countA + 1;
    }
    $base_cnt = $base_cnt + 1;
}
mysql_close($con);
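The matching rule in the listing above can also be exercised without a database. The following is a minimal sketch under that assumption: keywords are passed in as plain arrays, and the names `count_matching_keywords`, `is_similar_post` and the `$threshold` parameter are hypothetical, introduced only for illustration.

```php
<?php
// Hypothetical helper: count the non-empty keywords of post A that also
// appear among the keywords of post B.
function count_matching_keywords(array $keysA, array $keysB) {
    $matches = 0;
    foreach ($keysA as $key) {
        // Empty keyword slots are skipped; strict comparison avoids
        // accidental matches between numeric-looking strings.
        if ($key !== "" && in_array($key, $keysB, true)) {
            $matches = $matches + 1;
        }
    }
    return $matches;
}

// Two posts are treated as similar when the match count exceeds the
// threshold (the listing above uses a fixed threshold of 2).
function is_similar_post(array $keysA, array $keysB, $threshold = 2) {
    return count_matching_keywords($keysA, $keysB) > $threshold;
}

// Sample keyword sets (made up for illustration).
$a = array("kernel", "linux", "driver", "patch", "");
$b = array("linux", "kernel", "patch", "module", "security");
$c = array("linux", "theme", "", "", "");

var_dump(is_similar_post($a, $b)); // bool(true): three keywords match
var_dump(is_similar_post($a, $c)); // bool(false): only one keyword matches
```

Separating the comparison from the query and the HTML output makes the threshold easy to vary and the rule easy to test.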
The table below stores the metadata about the posts present in the various blogs. This
table can be extended to implement tf-idf in order to improve the relevance of the
search results.
FIELD       TYPE          NULL
POST_ID     INT           NO
TITLE       VARCHAR(50)   NO
KEY1        VARCHAR(20)   YES
KEY2        VARCHAR(20)   YES
KEY3        VARCHAR(20)   YES
KEY4        VARCHAR(20)   YES
KEY5        VARCHAR(20)   YES
DATA        TEXT          NO
BLOG_NAME   VARCHAR(1)    NO
RATING      INT           NO
The tf–idf weight (term frequency–inverse document frequency) is a weight often used in
information retrieval and text mining. This weight is a statistical measure used to
evaluate how important a word is to a document in a collection or corpus. The
importance increases proportionally to the number of times a word appears in the
document but is offset by the frequency of the word in the corpus. Variations of the tf–idf
weighting scheme are often used by search engines as a central tool in scoring and
ranking a document's relevance given a user query.
One of the simplest ranking functions is computed by summing the tf-idf for each query
term; many more sophisticated ranking functions are variants of this simple model.
Suppose we have a set of English text documents and wish to determine which
document is most relevant to the query "the brown cow." A simple way to start out is by
eliminating documents that do not contain all three words "the," "brown," and "cow," but
this still leaves many documents. To further distinguish them, we might count the
number of times each term occurs in each document and sum them all together; the
number of times a term occurs in a document is called its term frequency. However,
because the term "the" is so common, this will tend to incorrectly emphasize documents
which happen to use the word "the" more, without giving enough weight to the more
meaningful terms "brown" and "cow". Also the term "the" is not a good keyword to
distinguish relevant and non-relevant documents and terms like "brown" and "cow" that
occur rarely are good keywords to distinguish relevant documents from the non-relevant
documents. Hence an inverse document frequency factor is incorporated which
diminishes the weight of terms that occur very frequently in the collection and increases
the weight of terms that occur rarely.
The term count in the given document is simply the number of times a given term
appears in that document. This count is usually normalized to prevent a bias towards
longer documents (which may have a higher term count regardless of the actual
importance of that term in the document) to give a measure of the importance of the
term $t_i$ within the particular document $d_j$. Thus we have the term frequency, defined as
follows:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where $n_{i,j}$ is the number of occurrences of the considered term $t_i$ in document $d_j$, and the
denominator is the sum of the number of occurrences of all terms in document $d_j$.
The inverse document frequency is a measure of the general importance of the term
(obtained by dividing the number of all documents by the number of documents
containing the term, and then taking the log of that quotient):

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

with
$|D|$: total number of documents in the corpus
$|\{d : t_i \in d\}|$: number of documents where the term $t_i$ appears (that is, $n_{i,j} \neq 0$). If the term is
not in the corpus, this will lead to a division by zero. It is therefore common to
use $1 + |\{d : t_i \in d\}|$ instead.
Then

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$
A high weight in tf–idf is reached by a high term frequency(in the given document) and a
low document frequency of the term in the whole collection of documents; the weights
hence tend to filter out common terms.
For example, consider a document containing 100 words in which the word "cow" appears 3 times.
Following the previously defined formulas, the term frequency (TF) for "cow" is then 0.03
(3 / 100). Now, assume we have 10 million documents and "cow" appears in one
thousand of these. Then, the inverse document frequency is calculated as
ln(10,000,000 / 1,000) = 9.21. The TF-IDF score is the product of these quantities: 0.03 * 9.21 =
0.28.
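The arithmetic of the worked example above can be checked with a short sketch. The function names `term_frequency` and `inverse_document_frequency` are hypothetical; the natural logarithm is used because the example is computed with ln.

```php
<?php
// Term frequency: occurrences of the term divided by the document length.
function term_frequency($termCount, $totalTerms) {
    return $termCount / $totalTerms;
}

// Inverse document frequency: natural log of
// (number of documents in the corpus / number of documents containing the term).
function inverse_document_frequency($totalDocs, $docsWithTerm) {
    return log($totalDocs / $docsWithTerm);
}

// Values from the example: "cow" appears 3 times in a 100-word document
// and occurs in 1,000 of 10,000,000 documents.
$tf = term_frequency(3, 100);
$idf = inverse_document_frequency(10000000, 1000);
$tfidf = $tf * $idf;

printf("tf = %.2f, idf = %.2f, tf-idf = %.2f\n", $tf, $idf, $tfidf);
// prints: tf = 0.03, idf = 9.21, tf-idf = 0.28
```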
System Screenshots
Figure 1: Display List of Blogs.
Figure 2: Display Similar Blog Link
Figure 3: Display Similar Blog Link
Figure 4: Display List of Similar Blogs
10. Plan of Action
10.1 Resources
Software Requirements:
• Apache Server - Available at apache.org
• Eclipse IDE - Not mandatory, but preferred for developing Android applications.
Available at eclipse.org
Hardware Requirements:
A PC that will act as a server, with the following minimum configuration:
• Processor Speed: 1.8 GHz Dual Core
• Memory: 1 GB RAM
Our laptops satisfy the above requirements. Hence we are going to use one of the
laptops as a server.
10.2 Schedule
Week   Work Proposed
1      Project Proposal
2      Read project-related documents
3      Create prototype of the design
4      Implementation of selected features
5      Implementation of selected features
6      Implementation of selected features
7      Testing
8      Report writing
9      Demo
11. Testing
The testing needs to be performed at the various levels of application:
• Test the Gadget component of the Blog Owner for proper retrieval of status and
location information
• Test the RSS writer to verify whether the location RSS feed has been generated
properly
• Test for any network-related issues while information is transferred
between the Gadget and the RSS Reader on the Blog web site
• Test whether the information retrieved from the RSS reader is displayed
correctly on the Blog API Gadget
• Test whether the Browser Gadgets at the reader's end properly
display the status, music track, location and other information in the appropriate fields
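To make the RSS writer test above concrete, the following is a minimal sketch of a location feed and of what a generated item could look like. The report does not specify the feed layout, so everything here is an assumption: the function name `location_feed`, the channel link `http://example.com/blog`, the sample coordinates, and the use of a GeoRSS-Simple `georss:point` element ("latitude longitude") to carry the position.

```php
<?php
// Hypothetical helper: build an RSS 2.0 feed with one item that carries the
// blog owner's current position as a GeoRSS-Simple point ("lat lon").
function location_feed($owner, $place, $lat, $lon, $timestamp) {
    $ownerXml = htmlspecialchars($owner, ENT_QUOTES, "UTF-8");
    $placeXml = htmlspecialchars($place, ENT_QUOTES, "UTF-8");
    return '<?xml version="1.0" encoding="UTF-8"?>' . "\n"
        . '<rss version="2.0" xmlns:georss="http://www.georss.org/georss">' . "\n"
        . "<channel>\n"
        . "<title>" . $ownerXml . " location</title>\n"
        . "<link>http://example.com/blog</link>\n"   // placeholder URL
        . "<description>Current location of the blog owner</description>\n"
        . "<item>\n"
        . "<title>" . $ownerXml . " is at " . $placeXml . "</title>\n"
        . "<pubDate>" . gmdate(DATE_RSS, $timestamp) . "</pubDate>\n"
        . "<georss:point>" . $lat . " " . $lon . "</georss:point>\n"
        . "</item>\n"
        . "</channel>\n"
        . "</rss>\n";
}

// Sample values (made up for illustration).
echo location_feed("Blog owner", "Atlanta, GA", 33.7756, -84.3963, time());
```

A test of the RSS writer could then check that the generated document contains the expected `georss:point` and `pubDate` values for a known input.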
12. Bibliography
• Wikipedia, explanation of the term “Web 2.0”: http://en.wikipedia.org/wiki/Web_2.0
• Wikipedia, explanation of the term “RSS”: http://en.wikipedia.org/wiki/RSS_(file_format)
• Wikipedia, explanation of the term “Blog”: http://en.wikipedia.org/wiki/Blog