CS 8803 Advanced Internet Application Development
BLOG 2.0 Project Report

INDEX
1. Keywords
2. Terminology
3. Motivation
4. Objectives
5. Related Work
6. System Scope
7. Implementation Modules
8. System Architecture
9. Implementation Details
10. Plan of Action
10.1 Resources
10.2 Schedule
11. Testing
12. Bibliography

Blog 2.0
Enhancing the Blog Experience

Hardik Chheda, hardik.chheda@gatech.edu, Georgia Institute of Technology
Ashwin Paranjpe, ashwin.p@gatech.edu, Georgia Institute of Technology
Ketan Kalgaonkar, ketan@gatech.edu, Georgia Institute of Technology

1. Keywords
Blog 2.0, Web 2.0, RSS 2.0, RSS Feeds, Enhanced Blogs, Blogger Gadgets, Blogger API, Location API

2. Terminology
Blog: A Web site, usually maintained by an individual, with regular entries of commentary, descriptions of events, or other material such as graphics or video.
Blog Owner: The individual who has created a personal web page using a hosted blogging service.
Blog Users: Individuals who read the content published on Blogs, which may belong to other users located anywhere in the world.
RSS: An acronym for Really Simple Syndication. RSS is a family of Web feed formats used to publish frequently updated works (such as blog entries, news headlines, audio, and video) in a standardized format.

3. Motivation
A recent study by Ball State University concludes that blogs have, in fact, done very little to increase the quality of dialogue with the public. So what can we expect next? What could be the future of our favorite Blogs? There is probably a need for a more decisive quest for better contribution, not through software filters and algorithms but through human, professional judgment. Our project aims to answer some of these questions.
“Blogs have done very little to enhance the quality of dialogue with the public.”

We propose a set of features and implement a few enhancements to existing Blog hosting services to offer more personalized services and create an enhanced user experience.

Web 2.0 describes the changing trends in the use of World Wide Web technology and web design that aim to enhance creativity, communications, secure information sharing, collaboration and functionality of the web. Blogs are an important part of the Web 2.0 revolution. Recently, there has been an exponential increase in the number of Blog users, and this trend is expected to continue in the near future. The term “Blogosphere” has been coined for the virtual network of blogs that connects people and facilitates information dissemination for social interaction.

Users feel the urge to express themselves through Blogs, and there is a constant effort to personalize blogs as much as possible. The Blog 2.0 Project aims to provide features and enhancements in this direction so as to make the Blogging experience more personalized. This will help the readers of a Blog gather more information about the Blog owner and will lead to better social interaction.

4. Objectives
1. To gather requirements about the enhancements and additional features that existing Blog Users would expect from a Blog 2.0
2. To determine whether these requirements have already been implemented, so that they can be filtered out or enhanced
3. To enable collection of relevant posts across Blogs and render them together
4. To tag posts automatically
5. To sort posts based on degree of relevance
6. To sort posts based on User Ratings
7. To enable the user to continuously publish his location information on the Blog
8. To enable the user to show his current status information
9. To enable the readers of the Blog to post Voice comments
10. To add Analytics information to the Blog to better track the popularity of the posts and enable dynamic re-organization of posts based on ranking information

5. Related Work
To the best of our knowledge, the following efforts are directed at improving the existing Blogs and Blog-hosting landscape:
1. Blogonize: Includes a hot page to display popular blog entries. This work is aligned with our effort to introduce ranking of Blog posts. However, we could not find a full-fledged working implementation to compare the approaches, and the web site is still in Beta.
2. CallinSearch: CallinSearch has two components: a database that associates URLs with Click-to-Call links, RSS feeds, video/audio streaming content and location data; and plugins that allow searchers to link to those files. In some ways, the Blog 2.0 project extends the ideas used in CallinSearch by using RSS feeds to share real-time location updates.
3. ChinSwing: Chinswing.com is a new voice-based message board that combines features of podcasting, text message boards, and live voice chat. However, we do not propose to evolve our Blog 2.0 model into a complete Voice-based Blog, because Voice-Posts cannot be easily indexed by search engines. Unlike ChinSwing, Blog 2.0 would add only Voice-based comments to existing textual posts. Having textual posts helps the search engines index the content. Since the voice comments beneath a Post would be related to the content of the Post, we expect them to be indexed indirectly through the textual content.
4. Findory: Personalized news and weblog reader. Findory learns from the articles you read and recommends other interesting articles.
5. LoudBlog: Loudblog is a sleek and easy-to-use Content Management System (CMS) for publishing media content on the web. It automatically generates a skinnable website and an RSS feed for Podcasting.

6. System Scope

To enable the detection of relevant content across multiple Blogs
With millions of Blogs being posted daily, there is a need for a better way to find relevant content across multiple Blogs. This will help the user refer to relevant content posted by other users in the Blogosphere. It can lead to better awareness of the latest happenings in the user's social network and would also help the user learn how his or her friends feel about a particular thing or a common topic. This will help the user articulate his thoughts better and create references to other Posts conveniently.

Automatic Tagging
Current systems like “Blogger.com” make the user select the Labels that sound most relevant to his post. The assumption that the user will correctly tag his post with the most relevant keywords may not always hold. The user might also have malicious intent and attach multiple popular tags to his post just to increase the revenue of his Blog. Moreover, the user may find it inconvenient to tag each and every post manually. In practice, most users do not bother to tag their posts, so this valuable feature of the Blog is hardly ever used.

Sorting Posts Based on Relevance
After surveying the current blogging systems, we observed that hardly any system provides the functionality of finding relevance between posts. Of the very few systems that provide this feature, almost none sort the Posts by degree of relevance. We expect that when our system scales up to serve thousands of Bloggers, any post will be linked to hundreds of relevant posts found by our system. Such a long list is far more useful if the system shows the posts in descending order of relevance.
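A minimal sketch of this ordering follows, written in Python rather than the PHP used by the project. It assumes that the degree of relevance is simply the number of keywords a candidate post shares with the current post; the report does not fix a formula at this point, and the post titles and keywords below are invented for illustration.

```python
def rank_by_relevance(current_keywords, candidates):
    """Return (title, shared_count) pairs, most relevant first.

    Relevance is assumed to be the number of shared keywords; posts
    with no keyword in common are dropped.
    """
    current = set(current_keywords)
    scored = [(title, len(current & set(keywords)))
              for title, keywords in candidates.items()]
    relevant = [(title, score) for title, score in scored if score > 0]
    # Sort by descending score; ties are broken alphabetically by title.
    return sorted(relevant, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical posts, each tagged with up to five keywords.
candidates = {
    "Kernel hacking 101": ["kernel", "linux", "driver"],
    "Campus seminar": ["seminar", "campus"],
    "Linux networking": ["linux", "network", "socket", "kernel"],
    "Writing a driver": ["driver", "kernel", "linux", "module"],
}
ranked = rank_by_relevance(["kernel", "linux", "driver", "module"], candidates)
for title, score in ranked:
    print(score, title)
```

Dropping posts with a score of zero keeps unrelated posts out of the list, and the alphabetical tie-break only serves to make the order deterministic.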
Sorting Posts based on User Ratings
This is another feature that becomes extremely important when the system scales up to serve thousands of Bloggers. User Ratings form an important part of the feedback mechanism for blogs. They help track the popularity of a Blog and can help the owner tune its content for better results. When a reader visits a Blog and wants to know more about its owner, it is most useful to display the owner's most popular posts.

To enable the user to continuously publish his location information on the Blog
Today, there is a plethora of plug-ins that show the location of the visitor to a website or Blog. Such location information can be gathered from the IP address, GPS location tracking, Wi-Fi triangulation techniques and so on. However, most of these efforts target displaying the location of the user who is currently viewing the web page. In contrast, we plan to offer a location-based service that shows the whereabouts of the Blog owner. This will let readers know where they can meet the owner at that moment and would thus help in better networking or interaction with the owner.

To enable the user to show his current status information
Similar to the location information, there is scope for sharing several other aspects of the status of the Blog owner. Instant messengers currently let a user show a status such as available, busy or a custom message, which is visible to other users. Taking this a step further, we would like to enable the blog owner to post his status on the Blog, such as his current work-related update, his latest fascination or even the track he is listening to in his music player. This would help the blog reader know whether it is a good time to contact the blog owner.
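One way the status and location described above could be published is as an item in an RSS 2.0 feed, which the Blog page can then read. The sketch below is in Python rather than the project's PHP and is only an illustration: the function name `build_status_feed`, the channel link, the coordinates and the status text are all assumptions, and the location is placed in the item description instead of a dedicated location element.

```python
import xml.etree.ElementTree as ET
from email.utils import formatdate


def build_status_feed(owner, status, latitude, longitude, timestamp):
    """Build a one-item RSS 2.0 feed carrying the owner's status and location."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # title, link and description are the required channel elements in RSS 2.0.
    ET.SubElement(channel, "title").text = f"{owner}: current status"
    ET.SubElement(channel, "link").text = "http://example.com/blog"  # placeholder URL
    ET.SubElement(channel, "description").text = "Status and location of the blog owner"

    item = ET.SubElement(channel, "item")
    ET.SubElement(item, "title").text = status
    ET.SubElement(item, "description").text = (
        f"Location: {latitude:.4f}, {longitude:.4f}"
    )
    # RSS dates use the RFC 822 format, which formatdate produces.
    ET.SubElement(item, "pubDate").text = formatdate(timestamp, usegmt=True)
    return ET.tostring(rss, encoding="unicode")


# Hypothetical values; timestamp 0 is used so the output is reproducible.
feed = build_status_feed("Blog owner", "Busy: preparing the project demo",
                         33.7756, -84.3963, 0)
print(feed)
```

A reader-side gadget would fetch this feed periodically and display the newest item, so the owner only has to update a single document.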
To enable the readers of the Blog to post Voice comments
During the requirement analysis phase of this project, we spoke with several Blog users and collected their feedback on the usability of Blogs. We observed that many Blog readers do not post comments on Posts. Users spend some time reading an interesting Post, but they tend to avoid writing comments in the little time that remains. This restricts the views expressed on the Blog to those submitted by the Owner, which might reduce the credibility of the views expressed in the Post. Having more comments from a spatially and culturally diverse set of users helps readers become aware of views that align with or contradict those expressed by the Blog Owner. Voice comments are meant to lower this barrier, since speaking a comment takes less effort than typing one.

To add Analytics information to the Blog to better track the popularity of the posts
Analytics has been the focus of much Web-related research. Measurements of the number of visitors and their geographical distribution can help the owner of a web site analyze such data and offer targeted content to attract more users. Such analytical information can be extremely useful for re-organizing the Blog Posts or for networking the most popular Posts.

Group Blogs with Location Information
There are several Group Blogs in existence today, in which a group of people collaborate on specific topics and write Blog Posts to express their ideas. Group Blogs are an epitome of a distributed information sharing system and can be used extensively for sharing technical information or for helping other people by pooling the expertise of various contributors. Group Blogs are convenient and are expected to outperform various technical and non-technical mailing lists. Imagine how a Group Blog could replace the COC-MS mailing list used at Georgia Tech.
Instead of sending numerous e-mails daily to share information about events or seminars on campus, a simple Group Blog can be maintained. Multiple people can post to this Blog, and hundreds of students can view the posts. Convenient sections can be formed within the Blog so that a student can look up updates only in the types of posts he is interested in. In fact, using an RSS-feed based mechanism, a student can subscribe to a particular type of post update and visit the Blog web site only when desired. This is a considerable improvement over conventional mailing lists, which result in hundreds of e-mails being sent daily and can feel intrusive.

There are several Group Blogs already functioning on the World Wide Web. However, we propose an enhancement to such Group Blogs by adding Location information. Consider a technical Blog in which a group of experienced software developers share their knowledge about particular topics, e.g. Kernel Development, Security, Web 2.0, Networking and so on. Generally, the Group Blog web site shows a list of collaborators that includes their contact information. However, it would be really helpful to know that a particular collaborator is currently in our city to deliver a lecture at a conference. We propose to include the current, real-time location of each collaborator of the Group Blog on the Blog web site.

Ranking Blogs for reading quality content
It is frustrating to read a lot of content only to realize that the opinion expressed in the blog is biased or does not cover the topic completely. For example, if a user wants a review of a particular vacation destination and the author rambles on for 4 to 5 pages about insignificant details of his trip, the content is of no use to the user, and he has lost precious time reading the blog.
There should be a mechanism by which the user can filter or rank blogs and easily identify quality blogs that give more information for the time invested in reading. We propose a ranking scheme that sorts blogs according to the ratings that users supply for each blog.

7. Implementation Modules
• Detection of relevant content across multiple Blogs
• Automatic tagging
• Sorting posts based on Degree of Relevance
• Sorting posts based on User Ratings

8. System Architecture
I) System Architecture for finding Relevant Content and inter-blog and intra-blog Ranking
II) System Architecture for Location Based Services

9. Implementation Details

Code snippet of the keyword extraction engine:

    // Extract up to five keywords from the text of a post.
    function keyword_extract($text) {
        // Echo the raw post text.
        echo $text;

        // Normalize: lowercase the text and replace punctuation with spaces.
        $text = str_replace(array(",", ".", ";"), "", $text);
        $text = strtolower($text);
        $punc = explode(" ", ". , : ; ' ? ! ( ) \" \\");
        foreach ($punc as $value) {
            $text = str_replace($value, " ", $text);
        }

        // Common words that must never be reported as keywords.
        $commonWords = explode(",", "about,that's,this,that,than,then,them,there,their,they,it's,with,which,were,where,whose,when,what,her's,he's,have");
        $words = explode(" ", $text);

        // Keep words longer than three characters that are not common words.
        $keywords = array();
        foreach ($words as $value) {
            if (strlen($value) > 3 && !in_array($value, $commonWords)) {
                $keywords[] = $value;
            }
        }

        // Count occurrences and sort by descending frequency.
        $arr = array();
        $keywords = array_count_values($keywords);
        arsort($keywords);
        foreach ($keywords as $key => $value) {
            // Report only words that are used more than three times.
            if ($value > 3) {
                echo "<p><strong>" . ucfirst($key) . "</strong> is used <strong>" . $value . "</strong> times. | <a href=\"http://www.google.com/search?q=" . urlencode($key) . "\" target=\"_blank\">Search Google</a></p>";
                // Keep at most five keywords for tagging.
                if (count($arr) < 5) {
                    $arr[] = $key;
                }
            }
        }
        return $arr;
    }

The code above extracts keywords from the text of a post and uses them to tag the post automatically. The engine returns up to five of the most frequent keywords in the post, considering only words longer than three characters that occur more than three times.

The code below is the core functionality of the project. It finds similar posts across different blogs. Using the tags of the given post, the engine searches the database for posts with matching tags and displays those whose number of matches exceeds a threshold. The threshold is a parameter that determines when a post qualifies as similar.

    // Fetch every post except the one being viewed.
    $resultB = mysql_query("select * from blog_data where post_id != " . intval($blog_id_param));
    if (!$resultB) {
        die('B: Query failed: ' . mysql_error());
    }
    // $resultA is assumed to come from an earlier query for the current post (not shown).
    $rowA = mysql_fetch_array($resultA);
    // A post is listed once it shares more than this many keywords.
    $threshold = 2;
    echo "<br><br>Keywords for " . $rowA[1] . " : " . $rowA[2] . " " . $rowA[3] . " " . $rowA[4] . " " . $rowA[5] . " " . $rowA[6] . "<br><br>";
    while ($rowB = mysql_fetch_array($resultB)) {
        $match_val = 0;
        // Columns 2..6 hold KEY1..KEY5.
        for ($countA = 2; $countA < 7; $countA++) {
            if ($rowA[$countA] == "") {
                continue;
            }
            if ($rowA[$countA] == $rowB[2] or $rowA[$countA] == $rowB[3] or $rowA[$countA] == $rowB[4] or $rowA[$countA] == $rowB[5] or $rowA[$countA] == $rowB[6]) {
                $match_val++;
                echo $rowA[$countA] . " ";
            }
            if ($match_val > $threshold) {
                // Enough keywords match: link to the similar post.
                echo "<br> <a href=\"http://localhost/view2.php?title=" . $rowB[0] . "\">" . $rowB[1] . "</a><br><br>";
                break;
            }
        }
    }
    mysql_close($con);

The table below stores the metadata about the posts present in the various blogs. This table can be extended to implement tf-idf in order to improve the search.

    FIELD      TYPE         NULL
    POST_ID    INT          NO
    TITLE      VARCHAR(50)  NO
    KEY1       VARCHAR(20)  YES
    KEY2       VARCHAR(20)  YES
    KEY3       VARCHAR(20)  YES
    KEY4       VARCHAR(20)  YES
    KEY5       VARCHAR(20)  YES
    DATA       TEXT         NO
    BLOG_NAME  VARCHAR(1)   NO
    RATING     INT          NO

The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start is to eliminate documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To distinguish them further, we might count the number of times each term occurs in each document and sum the counts; the number of times a term occurs in a document is called its term frequency. However, because the term "the" is so common, this will tend to incorrectly emphasize documents that happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow".
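The bias described above can be seen in a small sketch, written here in Python rather than the project's PHP. The three documents and the whitespace tokenizer are made up for illustration and are not part of the Blog 2.0 engine; the code simply ranks documents by the summed raw counts of the query terms.

```python
from collections import Counter


def raw_count_score(document, query_terms):
    """Sum of the raw occurrence counts of each query term in the document."""
    counts = Counter(document.lower().split())
    return sum(counts[term] for term in query_terms)


# Hypothetical toy corpus; every document contains all three query terms.
documents = {
    "doc_a": "the brown cow ate the grass in the field near the barn",
    "doc_b": "the brown cow is a brown cow",
    "doc_c": "the cow is brown",
}
query = ["the", "brown", "cow"]

scores = {name: raw_count_score(text, query) for name, text in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

Here `doc_a` ranks first only because it repeats "the" four times, although `doc_b` mentions "brown" and "cow" twice as often.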
Moreover, the term "the" is not a good keyword for distinguishing relevant from non-relevant documents, whereas terms like "brown" and "cow", which occur rarely, are. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.

The term count in a given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document), giving a measure of the importance of the term t_i within the particular document d_j. Thus we have the term frequency, defined as follows:

$$\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

where n_{i,j} is the number of occurrences of the considered term t_i in document d_j, and the denominator is the sum of the number of occurrences of all terms in document d_j.

The inverse document frequency is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the log of that quotient):

$$\mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|}$$

with
|D|: total number of documents in the corpus
|{d : t_i ∈ d}|: number of documents in which the term t_i appears (that is, n_{i,j} ≠ 0).

If the term is not in the corpus, this will lead to a division by zero. It is therefore common to use 1 + |{d : t_i ∈ d}| as the denominator. Then

$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \mathrm{idf}_i$$

A high tf-idf weight is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms.

For example, consider a document containing 100 words in which the word cow appears 3 times. Following the formulas defined above, the term frequency (TF) for cow is then 0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one thousand of these.
Then, the inverse document frequency is calculated as ln(10 000 000 / 1 000) = 9.21. The TF-IDF score is the product of these quantities: 0.03 * 9.21 = 0.28.

System Screenshots
Figure 1: Display List of Blogs
Figure 2: Display Similar Blog Link
Figure 3: Display Similar Blog Link
Figure 4: Display List of Similar Blogs

10. Plan of Action

10.1 Resources
Software Requirements:
• Apache Server - Available at apache.org
• Eclipse IDE - Not mandatory but preferred for developing Android applications. Available at eclipse.org
Hardware Requirements:
A PC that will act as a server, with the following minimum configuration:
• Processor Speed: 1.8 GHz Dual Core
• Memory: 1 GB RAM
Our laptops satisfy the above requirements, so we are going to use one of them as the server.

10.2 Schedule

    Week  Work Proposed
    1     Project Proposal
    2     Read project-related documents
    3     Create prototype of the design
    4     Implementation of selected features
    5     Implementation of selected features
    6     Implementation of selected features
    7     Testing
    8     Report writing
    9     Demo

11. Testing
Testing needs to be performed at the various levels of the application:
• Test the Gadget component of the Blog Owner for proper retrieval of status and location information
• Test the RSS writer to verify that the location RSS feed is generated properly
• Test for any network-related issues while information is transferred between the Gadget and the RSS Reader on the Blog web site
• Test whether the information retrieved from the RSS reader is displayed correctly on the Blog API Gadget
• Test whether the Browser Gadgets at the reader's end properly display the status, music track, location and other information in the appropriate fields

12. Bibliography
Wikipedia, explanation of the term “Web 2.0”: http://en.wikipedia.org/wiki/Web_2.0
Wikipedia, explanation of the term “RSS”: http://en.wikipedia.org/wiki/RSS_(file_format)
Wikipedia, explanation of the term “Blog”: http://en.wikipedia.org/wiki/Blog