This is the published version:

Saha, Subrata, Sajjanhar, Atul, Gao, Shang, Dew, Robert and Zhao, Ying 2010, Delivering categorized news items using RSS feeds and web services, in CIT 2010 : 10th IEEE International Conference on Computer and Information Technology Proceedings, IEEE Computer Society, Los Alamitos, Calif., pp. 698-702.

Available from Deakin Research Online:

©2010 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

Copyright: 2010, IEEE

2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010)

Delivering Categorized News Items Using RSS Feeds and Web Services

Subrata Saha, Atul Sajjanhar, Shang Gao, Robert Dew
School of Information Technology
Deakin University
Burwood, VIC 3125, Australia
{ssaha, atuls, shang, rad}

Ying Zhao
School of Information Science and Technology
Beijing University of Chemical Technology
Beijing, 100029, P.R. China

978-0-7695-4108-2/10 $26.00 © 2010 IEEE
DOI 10.1109/CIT.2010.136

Abstract— In the past decade the massive growth of the Internet has brought huge changes to the way people live their daily lives; however, the biggest concern with the rapid growth of digital information is how to manage it efficiently and filter out unwanted data. In this paper, we propose a method for managing RSS feeds from various news websites. A Web service was developed to provide filtered news items extracted from RSS feeds, and these were categorized using classical text categorization algorithms. A client application consuming this Web service retrieves and displays the filtered information. A prototype was implemented using RapidMiner 4.3 as the data mining tool and SVM as the classification algorithm. Experimental results suggest that the proposed method is effective and saves a significant amount of user processing time.

Keywords-text classification, text categorization, web services, Support Vector Machines (SVM), Really Simple Syndication (RSS)

I. INTRODUCTION

With the rapid development of the web for distributing information and business functions over the Internet, Really Simple Syndication (RSS) and Web services are increasing in popularity. RSS uses XML to deliver updated content on the web. According to a survey, 52% of RSS users use RSS to get updated news from local and international news providers and 23% use it for blogging [1]. RSS provides a way to deliver dynamically changing content, but its users still suffer from information overload because of the huge volume of online updates. A good solution is to filter RSS feeds according to some rules before they are retrieved by the users.

Web services provide functionality similar to approaches such as the Object Management Group's (OMG) Common Object Request Broker Architecture (CORBA), Microsoft's Distributed Component Object Model (DCOM) or Sun Microsystems's Java Remote Method Invocation (RMI), but they support interoperable interaction over a network using HTTP and XML in conjunction with other Web-related standards. Web services are Internet Application Programming Interfaces (APIs) that can be accessed over a network, such as the Internet, and executed on a remote system hosting the requested services [2]. Since Web services support interoperable interaction and can be accessed via HTTP, using Web services to manage RSS feeds after they have been processed and filtered by data-mining techniques may be a wise choice. In this paper, the classical text categorization technique is adopted to demonstrate the feasibility of this method.

Text classification, also known as text categorization, is one of the most popular and widely used machine learning techniques for categorizing and filtering data. Applications of text classification include spam filtering, text document filtering, text document indexing, picture classification, and survey document classification. The information overload problem with RSS can be overcome by filtering and grouping data into predefined categories, and text categorization has the potential to categorize feeds in this way.

A significant amount of work has been done in the area of text document categorization and text indexing using machine learning. In this paper, we extend the text categorization approach and apply it to the processing of RSS news feeds. The whole picture is as follows: a Web service processes RSS news feeds using text categorization techniques and then delivers categorized news items to a client application. The client application requests specific news items from the Web service based on predefined categories and displays the retrieved items in a user-friendly interface.

The rest of the paper is organized as follows: Section 2 describes the proposed method, which is divided into two parts: the categorization of news items in RSS feeds and the development of a Web service delivering categorized news items; Section 3 describes the experiments and results; future work and conclusions are discussed in Section 4.

II. PROPOSED METHOD

The proposed method has two distinct stages: first, classification of RSS feeds; second, delivery of classified news items to the user.

At the classification stage, RSS feeds are collected from various websites (e.g. BBC, Yahoo, CNN, News and 20 Newsgroups). The collected data is separated into two data sets, one for training and another for testing. Using classification tools and algorithms, a classification model is created, trained, and then used to classify the collected feeds into predefined categories. Details are given in Section A.
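As a rough illustration of the first stage, collecting news items from an RSS feed and separating them into training and testing sets might look like the following Python sketch. The sample feed, the function names `extract_items` and `split_train_test`, and the simple positional split are assumptions made for illustration only; the paper does not state how the feeds were gathered or how the split was performed.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, made up for this sketch.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <item>
      <title>Shares rise</title>
      <link>http://example.com/1</link>
      <description>Markets closed higher.</description>
    </item>
    <item>
      <title>Cup final tonight</title>
      <link>http://example.com/2</link>
      <description>Two teams meet in the final.</description>
    </item>
    <item>
      <title>Bank cuts rates</title>
      <link>http://example.com/3</link>
      <description>The central bank lowered its rate.</description>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return one dict per <item>, keeping only title, link and description."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


def split_train_test(items, n_train):
    """Assumed split: the first n_train items train the model, the rest test it."""
    return items[:n_train], items[n_train:]


items = extract_items(SAMPLE_RSS)
train_set, test_set = split_train_test(items, n_train=2)
print(len(train_set), "training items,", len(test_set), "testing items")
```

In practice the split would be made per category so that each class is represented in both sets, as the data set sizes reported later in the paper suggest.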
At the second stage, a Web service delivers the categorized news items to its users. A user reads classified news items from the Web service's repository on request. The Web service and the client application are described in Sections B and C, respectively.

A. Classification of Feeds

There are different types of classification methods and algorithms available for text classification. For categorization, TFIDF (term frequency–inverse document frequency) weighting and the SVM (Support Vector Machines) algorithm are used to classify feeds, together with some data preprocessing techniques. The SVM algorithm is selected for classification because of its high generalization ability and its performance across a range of applications [3]. It is also regarded as one of the most effective methods in a comprehensive comparison of supervised machine learning methods for text classification [4]. SVM is particularly effective for high-dimensional data such as text because of its strong theoretical foundation and good generalization performance. Although SVM training requires solving a quadratic programming (QP) problem, which becomes expensive for larger training sets, this problem can be mitigated by reducing the number of support vectors.

Data preprocessing is an important part of data mining. Before knowledge can be discovered from a data set, the data must be processed and filtered in such a way that the data mining tool can read it for testing and training purposes. For machine learning based text classification, it is necessary to extract features from the given dataset and produce a knowledge model that can be used for classification. In addition, feature reduction is important and necessary when dealing with a large volume of data, because reducing the data size improves classification accuracy, processing time and performance.

In this paper, RapidMiner, a data mining and text mining package available in both free and commercial editions, is used. It can be easily integrated with third-party applications. The String Tokenization, English Stop-words Filter and Token Length Filter methods are used for data preprocessing, and TFIDF is adopted to calculate the word vectors for the RSS feeds.

The TFIDF (term frequency–inverse document frequency) algorithm was introduced by Salton and Buckley in 1988 [5]. In this algorithm, each document is denoted by d and each class by j in the vector space. The term frequency TF(w_i, d) is the number of times the word w_i appears in document d, while the inverse document frequency IDF(w_i) measures how common the word is across the collection: the number of training documents is divided by the number of documents containing the word, and the logarithm of the quotient is taken. IDF can be calculated in the following way [6]:

$$
IDF(w_i) = \log\left(\frac{N_f}{N_f(w_i)}\right)
$$

where N_f is the number of training documents and N_f(w_i) is the number of training documents in which the word w_i appears. The weight of word w_i in training document d can therefore be denoted by

$$
W(w_i, d) = TF(w_i, d) \cdot IDF(w_i) = N(w_i, d) \cdot \log\left(\frac{N_f}{N_f(w_i)}\right)
$$

where N(w_i, d) is the number of occurrences of word w_i in document d.

For SVM binary classification, a data set S is given as $S = \{x_i, y_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$. SVM seeks the optimal hyperplane by solving a quadratic programming problem. The classifier has the form

$$
y_i = \operatorname{sign}\left[w^T \varphi(x_i) + b\right] \tag{1}
$$

and the optimal hyperplane is found from the primal problem

$$
\min_{w,\,b} J(w) = \frac{1}{2} w^T w + c \sum_{k=1}^{n} \varepsilon_k
\quad \text{subject to} \quad
y_k\left[w^T \varphi(x_k) + b\right] \ge 1 - \varepsilon_k, \quad k = 1, \ldots, n \tag{2}
$$

where $\varepsilon_k \ge 0$, $k = 1, \ldots, n$, are slack variables that deal with misclassification and $c > 0$. This is equivalent to a quadratic programming problem whose dual, with Lagrange multipliers $\alpha_k \ge 0$, is

$$
\max_{\alpha} J(\alpha) = -\frac{1}{2} \sum_{k,j=1}^{n} y_k y_j K(x_k, x_j)\, \alpha_k \alpha_j + \sum_{k=1}^{n} \alpha_k
\quad \text{subject to} \quad
\sum_{k=1}^{n} \alpha_k y_k = 0, \quad 0 \le \alpha_k \le c \tag{3}
$$

Here $K(x_k, x_j) = \varphi(x_k)^T \varphi(x_j)$ is known as a Mercer kernel, which must satisfy Mercer's conditions.
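The two ingredients defined so far, the TFIDF weight W(w_i, d) and the kernel K(x_k, x_j), can be tried out together on a toy corpus. The Python sketch below is only an illustration under stated assumptions: the three tokenized documents are made up, N_f(w_i) is read as the number of documents containing w_i, the function names are hypothetical, and a linear kernel (a plain dot product, which satisfies Mercer's conditions) is used because the paper does not say which kernel was chosen in RapidMiner.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each word by N(w_i, d) * log(N_f / N_f(w_i)).

    docs is a list of token lists; N_f(w_i) is taken to be the number
    of documents that contain w_i (an assumption of this sketch).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            word: count * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors


def linear_kernel(u, v):
    """Dot product of two sparse word-weight vectors stored as dicts."""
    return sum(weight * v[word] for word, weight in u.items() if word in v)


# Made-up, already tokenized documents.
toy_docs = [
    ["team", "wins", "match"],
    ["market", "shares", "fall"],
    ["team", "shares", "prize"],
]

vectors = tfidf_vectors(toy_docs)
# Kernel (Gram) matrix with entries K(x_k, x_j), as it appears in the dual problem.
gram = [[linear_kernel(a, b) for b in vectors] for a in vectors]
for row in gram:
    print(["%.4f" % value for value in row])
```

Documents that share no words have a kernel value of zero, while words that occur in every document receive zero weight and so never contribute to the kernel.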
The resulting solution is

$$
y(x) = \operatorname{sign}\left[\sum_{k \in V} \alpha_k y_k K(x_k, x) + b\right] \tag{4}
$$

Many of the $\alpha_k$ that solve (3) are zero, so the solution vector is sparse and the sum is taken only over the nonzero $\alpha_k$. An $x_k$ that corresponds to a nonzero $\alpha_k$ is called a support vector (SV). In equation (5), V is the index set of the support vectors, and the optimal hyperplane is

$$
\sum_{k \in V} \alpha_k y_k K(x_k, x) + b = 0 \tag{5}
$$

The resulting classification is

$$
y(x) = \operatorname{sign}\left[\sum_{k \in V} \alpha_k y_k K(x_k, x) + b\right] \tag{6}
$$

In equation (6), b is determined by the Kuhn-Tucker conditions:

$$
\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{k=1}^{n} \alpha_k y_k \varphi(x_k)
$$

$$
\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{k=1}^{n} \alpha_k y_k = 0
$$

$$
\frac{\partial L}{\partial \varepsilon_k} = 0 \;\Rightarrow\; c - \alpha_k - v_k = 0 \quad (c - \alpha_k \ge 0)
$$

$$
\alpha_k \left\{ y_k\left[w^T \varphi(x_k) + b\right] - 1 + \varepsilon_k \right\} = 0, \quad k = 1, \ldots, n
$$

where $v_k \ge 0$ are the Lagrange multipliers of the constraints $\varepsilon_k \ge 0$.

In (2), the size of w is fixed and does not depend on the number of data points. In (3), the size of the solution vector α grows with the number of data points. In a high-dimensional space it is therefore advantageous to solve the dual problem, whereas for a large data set it is advantageous to solve the primal problem [6].

To create the proposed classification model for feeds, data is gathered from various news websites and from 20 Newsgroups for testing and training. The training dataset contains 50 business news articles and 50 sports news articles. In addition, 1200 files are collected from the newsgroup dataset and from the BBC, News and CNN websites for testing (TABLE I).

TABLE I. DATA SET TABLE FOR TRAINING AND TESTING

Dataset     Training    Testing
Business    50          200 (CNN, BBC)
Sports      50          1000 (20 newsgroups)
Total       100         1200

After the data sets for training and testing have been created, classification proceeds as follows. In the first step, the training data is preprocessed to remove stop words and tags. The dataset is filtered using the Stop word Filter and Token Length Filter operators and also by pruning the data.

The filtered documents must still be converted into vectors before features can be selected. The TFIDF algorithm is used to create the vectors, and 95 support vectors are created from the given training datasets. A weight table (TABLE II) is also created, which contains words and the weights associated with them. Once the vectors are created, the next step is to save the word vectors for later use in classification (testing). In addition, RapidMiner creates a model file which holds all the specified parameters.

TABLE II. SOME SELECTED ATTRIBUTES AND WEIGHTS

Attribute    Weight
Team         0.008143169
takes        0.008143169
primarily    0.008143169
fired        0.008143169
fantastic    0.008143169
appeared     0.008143169
opposite     0.008143169
forcing      0.008143169
believes     0.008642927

TABLE III. SUPPORT VECTOR TABLE

Label       (abs) Alpha    Attr. 1    Attr. 2
Sports      0.42527807     0          0.04508112
Sports      0.68267482     0          0
Sports      0.58932045     0          0
Sports      0.55646904     0          0
Sports      0.29432025     0          0
Sports      0.52408548     0          0
Sports      0.18381303     0          0
Business    1.04166666     0          0
Business    1.04166666     0          0
Business    1.04166666     0          0
Business    0.88637830     0          0
Business    0.49816087     0          0
Business    0.23411624     0          0.02232141
Business    0.22118449     0          0
Business    0.28132026     0          0
Business    0.47983781     0          0

In the testing phase, a dataset is supplied and preprocessed using the same techniques as in training. The word vectors and the model are then supplied. RapidMiner compares the word vectors created from the testing dataset with the word vectors supplied from training, and a prediction is made, based on confidence, of which category a given RSS feed item belongs to (TABLE III).

B. Web service

A Web service is developed using Java and Apache Axis (an open source Web services container that allows users to develop and publish Java Web services) on Apache Tomcat (an application server that runs on top of the Apache server to host Java and JSP web applications). The interaction between the client application and the Web service is as follows:

1. On the selection of a button (operation 1, Figure 1), the client makes a connection to the server application and sends the desired request to the server.
2. Once the server receives the request, it first initializes the global variables and then a document builder variable to parse the retrieved XML-based RSS feed documents. The server then reads the XML RSS feeds from the repository (operation 2, Figure 1), keeping the <item>, <title>, <link> and <description> tags and ignoring the others: the other tags are not considered as important as these four at this stage, and parsing all channel tags would complicate the process and affect application performance.

3. The server creates a virtual XML file and saves it at a specific location.

4. The filtered XML (<title>, <link> and <description>) data is read from the directory into a buffer memory and reproduced as a single string (operation 4, Figure 1).

5. The string is sent to the client application, as shown in Figure 1 (operation 5).

6. The client writes the XML file to local memory (operation 6, Figure 1).

7. The user accesses the retrieved information locally whenever he/she wants, without sending a request to the server every time, which speeds up the retrieval process.

8. The XML file is read from local memory and a corresponding row is created in the list view tray interface with <title> and <link>. The user clicks on one of the links to view the full article.

Figure 1. Interaction between Client and Web service.

C. Client Application

The client application (Figure 2) provides the user with an interface to access the content retrieved from the Web service. Buttons are provided that allow the user to select news items from predefined categories, such as news about business, sports and/or entertainment. The client application is developed using C# .Net. It displays all available news headlines in its list view component.

Figure 2. Client Application displays news retrieved from Web service.

III. RESULTS AND DISCUSSION

A training model is created from the training data sets by applying the classification algorithms to them. The model is then used for prediction. For feature selection, some data pre-processing techniques and TFIDF are used to convert the data into vectors. Pruning is an important part of this preprocessing in text mining, because it improves the efficiency of the classifier by removing infrequent words from documents; only then is the data ready for feature selection. TFIDF is selected because of its higher prediction accuracy compared to Binary Occurrence, Term Frequency and Term Occurrence. The accuracy of prediction can be improved further by creating a more efficient training model with more training data. TABLE IV shows the accuracies obtained with the different vector creation methods.

TABLE IV. VECTOR ACCURACY

Vector (Training)    Accuracy (Testing)
TFIDF                96.00
Binary Occurrence    93.00
Term Frequency       93.00
Term Occurrence      84.00

Prediction is done by calculating the average confidence of a file with respect to the categories, such as business and sports. If the confidence is higher for business then the file is labeled as business news; otherwise, it is labeled as sports. TABLE V lists some random prediction results generated by the classifier for a specific training file.

TABLE V. SOME RANDOM PREDICTION RESULTS

ID      Prediction Label    Confidence Business    Confidence Sports
996     Sports              0.34765210             0.65234789
997     Sports              0.36646943             0.63353056
998     Sports              0.41855054             0.58144945
999     Sports              0.39448818             0.60551181
1001    Business            0.73105696             0.26894303
1002    Business            0.73107281             0.26892718
1003    Business            0.73108733             0.26891266
1004    Business            0.73303367             0.26696632

The proposed Web services-based RSS feed categorizing architecture is efficient because users do not have to subscribe to any websites for feeds. The only thing they need to do is subscribe to the Web service for categorized news. The Web service processes the client requests and sends the results back to the clients, which saves users a significant amount of time by avoiding superfluous feeds. The client application downloads the feed to the local machine, making the whole process quicker because users do not need to keep sending requests to the Web service for the feeds.

IV. FUTURE WORK AND CONCLUSION

Currently this prototype only delivers sports and business news; however, it can be improved by further categorizing news items into sub-categories (e.g. cricket, football, formula one, golf, motorsports and horse racing). In addition, in the future it is possible to develop an automated feed categorization system and integrate that system with the Web services. The client application can also be improved by adding a bookmarking system which allows users to save their favorite feeds locally. Furthermore, the client application can be equipped with a special search request for feeds, allowing a specific set of feeds to be retrieved (e.g. request by title).

Feed management and categorization is a problem with current RSS technologies. In this paper, a novel approach is presented for delivering news items from RSS feeds, based on existing text categorization and Web service techniques. News feeds are collected from various news websites and stored in folders (for training and testing purposes) for categorization, using RapidMiner 4.3 as the data mining tool and SVM as the classification algorithm, with some data preprocessing techniques. A Web service and a client application are developed to deliver and display categorized feeds. The Web service also filters feeds by stripping unnecessary tags to improve categorizing performance. The client application provides a user-friendly interface to display retrieved feeds.

The proposed Web-service based RSS feed categorizing approach manages and delivers feeds in an efficient way and overcomes the feed categorization problem. The prototype enables the user to get a specific type of news without subscribing to news websites and/or being flooded by unnecessary information, which saves time and effort. Further enhancement of the categorizing algorithms and testing data sets is necessary to improve the prototype.

[1] J. Grossnickle, T. Board, B. Pickens and M. Bellmont, RSS Crossing into the Mainstream, Yahoo White Paper, October 2005 [Available Online]
[2] Web services, Access on 30th July, [Available Online]
[3] J. Cervantes, X. Li and W. Yu, Support Vector Classification for Large Data Sets by Reducing Training Data with Change of Classes, IEEE International Conference on Systems, Man and Cybernetics, 2008.
[4] Y. Yang and X. Liu, A re-examination of text categorization methods, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, pp. 42-49, 1999.
[5] Q. R. Zhang, L. Zhang, S. B. Dong and J. H. Tan, Document Indexing in Text Categorization, International Conference on Machine Learning and Cybernetics, 2005.
[6] K. Chen and C. Zong, A New Weighting Algorithm For Linear Classifier, International Conference on Natural Language Processing and Knowledge Engineering, 2003.