Delivering Categorized News Items Using RSS Feeds and

This is the published version:
Saha, Subrata, Sajjanhar, Atul, Gao, Shang, Dew, Robert and Zhao, Ying 2010, Delivering
categorized news items using RSS feeds and web services, in CIT 2010 : 10th IEEE
International Conference on Computer and Information Technology Proceedings, IEEE
Computer Society, Los Alamitos, Calif., pp. 698-702.
Available from Deakin Research Online:
http://hdl.handle.net/10536/DRO/DU:30031989
©2010 IEEE. Personal use of this material is permitted.
However, permission to reprint/republish this material for
advertising or promotional purposes or for creating new
collective works for resale or redistribution to servers or lists, or
to reuse any copyrighted component of this work in other works
must be obtained from the IEEE.
Copyright: 2010,IEEE
2010 10th IEEE International Conference on Computer and Information Technology (CIT 2010)
Delivering Categorized News Items Using RSS Feeds and Web Services
Subrata Saha, Atul Sajjanhar, Shang Gao, Robert Dew
School of Information Technology
Deakin University
Burwood, VIC 3125, Australia
{ssaha, atuls, shang, rad}@deakin.edu.au
Ying Zhao
School of Information Science and Technology
Beijing University of Chemical Technology
Beijing, 100029, P.R. China
zhaoy@mail.buct.edu.cn
978-0-7695-4108-2/10 $26.00 © 2010 IEEE
DOI 10.1109/CIT.2010.136
Abstract: In the past decade the massive growth of the Internet has brought huge changes to the way humans live their daily lives; however, the biggest concern with the rapid growth of digital information is how to manage it efficiently and filter out unwanted data. In this paper, we propose a method for managing RSS feeds from various news websites. A Web service was developed to provide filtered news items extracted from RSS feeds, categorized using classical text categorization algorithms. A client application consuming this Web service retrieves and displays the filtered information. A prototype was implemented using Rapidminer 4.3 as the data mining tool and SVM as the classification algorithm. Experimental results suggest that the proposed method is effective and saves a significant amount of user processing time.
Keywords: text classification, text categorization, web services, Support Vector Machines (SVM), Really Simple Syndication (RSS)
I. INTRODUCTION
With the rapid development of the web for distributing information and business functions over the Internet, Really Simple Syndication (RSS) and Web services are increasing in popularity.
RSS uses XML to deliver updated content on the web. According to a survey, 52% of RSS users use RSS to get updated news from local and international news providers and 23% use it for blogging [1]. RSS provides a way to deliver dynamically changing content, but its users still suffer from information overload because of the huge volume of online updates. A good solution is to filter RSS feeds based on some rules before they are retrieved by the users.
Web services provide functionality similar to approaches such as the Object Management Group's (OMG) Common Object Request Broker Architecture (CORBA), Microsoft's Distributed Component Object Model (DCOM) or Sun Microsystems's Java Remote Method Invocation (RMI), but they support interoperable interaction over a network using HTTP and XML in conjunction with other Web-related standards. Web services are Internet Application Programming Interfaces (APIs) that can be accessed over a network, such as the Internet, and executed on a remote system hosting the requested services [2].
Since Web services support interoperable interaction and can be accessed via HTTP, using Web services to manage RSS feeds after they are processed and filtered by data-mining techniques is a sensible choice. In this paper, the classical text categorization technique is adopted to demonstrate the feasibility of this method.
Text classification, also known as text categorization, is one of the popular and widely used machine learning techniques for categorizing and filtering data. Applications of text classification include spam filtering, text document filtering, text document indexing, picture classification, and survey document classification.
The information overload problem with RSS can be overcome by filtering and grouping data into predefined categories, and text categorization has the potential to categorize feeds in this way. A significant amount of work has been done in the area of text document categorization and text indexing using machine learning. In this paper, we extend the text categorization approach and apply it to the processing of RSS news feeds.
Based on the above discussion, the whole picture is as follows: a Web service processes RSS news feeds using text categorization techniques, and then delivers categorized news items to a client application. The client application requests specific news items from the Web service based on predefined categories, and displays the retrieved items on a user-friendly interface.
The rest of the paper is organized as follows: Section 2 describes the proposed method, which is divided into two parts: categorization of news items in RSS feeds and development of a Web service delivering categorized news items; Section 3 describes the experiments and results; future work and conclusions are discussed in Section 4.
II. PROPOSED METHOD
The proposed method has two distinct stages: first, classification of RSS feeds; second, delivery of classified news items to the user.
At the classification stage, RSS feeds are collected from various sources (e.g. BBC, Yahoo, CNN, News and 20 Newsgroup). The collected data is separated into two data sets, one for training and another for testing. Using classification tools and algorithms, a classification model is created, which is then trained and used to classify the collected feeds into predefined categories. Details are given in Section A.
At the second stage, a Web service delivers categorized news items to its users. A user reads classified news items from the Web service's repository on request. The Web service and the client application are described in Sections B and C, respectively.
A. Classification of Feeds
There are different types of classification methods and algorithms available for text classification. Here, TFIDF (term frequency–inverse document frequency) weighting and the SVM (Support Vector Machines) algorithm are used to classify feeds, together with some data preprocessing techniques.
The SVM algorithm is selected for classification because of its high generalization ability and its performance across a range of applications [3]. It is also considered one of the most effective classification methods in a comprehensive comparison of supervised machine learning methods for text classification [4]. SVM is particularly effective for high-dimensional data such as text, owing to its strong theoretical foundation and good generalization performance. Although SVM training requires solving a quadratic programming (QP) problem, which becomes expensive for larger training sets, this problem can be addressed by reducing the number of support vectors.
Data preprocessing is an important part of data mining. Before knowledge can be discovered from a data set, the data must be processed and filtered in such a way that the data mining tool can read it for training and testing purposes. For text classification, it is necessary to extract features from the given dataset and produce a knowledge model which can be used for classification. In addition, feature reduction is important and necessary when dealing with a large volume of data, as reducing the data size improves classification accuracy, processing time and performance.
In this paper, RapidMiner, a data mining and text mining software package available in both free and commercial editions, is used. It can be easily integrated with third-party applications. The String Tokenization, English Stop-words Filter and Token Length Filter methods are used for data preprocessing, and TFIDF is adopted to calculate the word vectors of the RSS feeds.
The TFIDF (term frequency–inverse document frequency) algorithm was introduced by Salton and Buckley in 1988 [5]. In this algorithm, each document $d$ is represented as a vector in the vector space. The term frequency $\mathrm{TF}(w_i, d)$ is the number of times the word $w_i$ appears in document $d$, while the inverse document frequency $\mathrm{IDF}(w_i)$ measures how common the word is across the training documents: it is the logarithm of the number of training documents divided by the number of training documents in which the word appears. IDF can be calculated in the following way [6].
$$\mathrm{IDF}(w_i) = \log\left(\frac{N_f}{N_f(w_i)}\right)$$
$N_f$ = number of training documents
$N_f(w_i)$ = number of training documents in which the word $w_i$ appears
So word $i$ in training document $d$ can be denoted by
$$W(w_i, d) = \mathrm{TF}(w_i, d) \cdot \mathrm{IDF}(w_i) = N(w_i, d) \cdot \log\left(\frac{N_f}{N_f(w_i)}\right)$$
$N(w_i, d)$ = number of occurrences of word $i$ in document $d$.
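As an illustration of the preprocessing and weighting described above, the following Python sketch tokenizes documents, applies a stop-word filter and a token length filter, and computes W(w_i, d) = TF(w_i, d) * IDF(w_i). It is not the RapidMiner implementation used in this paper: the stop-word list, the length bounds and the function names `preprocess` and `tfidf_vectors` are assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and length bounds; the exact RapidMiner
# operator settings are not given in the paper.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "for", "on"}
MIN_LEN, MAX_LEN = 3, 25


def preprocess(text):
    """Tokenize, then apply a stop-word filter and a token length filter."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS and MIN_LEN <= len(t) <= MAX_LEN]


def tfidf_vectors(documents):
    """Return one {word: weight} dict per document, with W = TF * IDF."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)              # N_f: number of training documents
    doc_freq = Counter()                 # N_f(w_i): documents containing w_i
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)         # N(w_i, d): occurrences of w_i in d
        vectors.append({w: n * math.log(n_docs / doc_freq[w])
                        for w, n in counts.items()})
    return vectors
```

Under this definition a word that appears in every training document receives the weight log(1) = 0, so it contributes nothing to the document vector.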
For SVM binary classification, a data set $S = \{x_i, y_i\}_{i=1}^{n}$ is given, where $x_i \in \mathbb{R}^N$ and $y_i \in \{-1, +1\}$. SVM seeks the optimal hyper-plane, which is found by solving a quadratic programming problem. The classifier is
$$y_i = \mathrm{sign}\left[w^T \varphi(x_i) + b\right] \qquad (1)$$
The optimal hyper-plane is obtained from the primal problem
$$\min_{w,b} J(w) = \frac{1}{2} w^T w + c \sum_{k=1}^{n} \varepsilon_k \qquad (2)$$
$$\text{subject to: } y_k\left[w^T \varphi(x_k) + b\right] \ge 1 - \varepsilon_k, \quad k = 1, \ldots, n$$
Here $\varepsilon_k$ is a slack variable that deals with misclassification, with $\varepsilon_k \ge 0$, $k = 1, \ldots, n$, and $c > 0$. This is a quadratic programming problem, and its dual problem, with Lagrange multipliers $\alpha_k \ge 0$, is
$$\max_{\alpha} J(\alpha) = -\frac{1}{2} \sum_{k=1}^{n} \sum_{j=1}^{n} y_k y_j K(x_k, x_j)\, \alpha_k \alpha_j + \sum_{k=1}^{n} \alpha_k \qquad (3)$$
$$\text{subject to: } \sum_{k=1}^{n} \alpha_k y_k = 0, \quad 0 \le \alpha_k \le c$$
$K(x_k, x_j)$ is known as a Mercer kernel, which must satisfy Mercer's conditions. The resulting solution is
$$y(x) = \mathrm{sign}\left[\sum_{k \in V} \alpha_k y_k K(x_k, x) + b\right] \qquad (4)$$
where $x$ is the sample to be classified. Many of the $\alpha_k$ in the solution of (3) are zero, so the solution vector is sparse and the sum is taken only over the non-zero $\alpha_k$. An $x_k$ that corresponds to a non-zero $\alpha_k$ is called a support vector (SV).
In the following equation (5), $V$ is the index set of the support vectors and the optimal hyper-plane is
$$\sum_{k \in V} \alpha_k y_k K(x_k, x) + b = 0 \qquad (5)$$
The resulting classification is
$$y(x) = \mathrm{sign}\left[\sum_{k \in V} \alpha_k y_k K(x_k, x) + b\right] \qquad (6)$$
In equation (6), $b$ is determined by the Kuhn-Tucker conditions,
$$\frac{\partial L}{\partial w} = 0: \quad w = \sum_{k=1}^{n} \alpha_k y_k \varphi(x_k)$$
$$\frac{\partial L}{\partial b} = 0: \quad \sum_{k=1}^{n} \alpha_k y_k = 0$$
$$\frac{\partial L}{\partial \varepsilon_k} = 0: \quad c - \alpha_k - v_k = 0 \quad (c - \alpha_k \ge 0)$$
$$\alpha_k \left\{ y_k\left[w^T \varphi(x_k) + b\right] - 1 + \varepsilon_k \right\} = 0$$
where $v_k \ge 0$ are the multipliers associated with the constraints $\varepsilon_k \ge 0$.
In (2), the size of $w$ is fixed and does not depend on the number of data points. In (3), the size of the solution vector $\alpha$ increases with the number of data points. In a high-dimensional space it is generally preferable to solve the dual problem, whereas for a large data set it is advantageous to solve the primal problem [6].
To create the proposed classification model for feeds, data is gathered from various news websites and the 20 Newsgroups dataset for testing and training. The training dataset contains 50 business news articles and 50 sports news articles. In addition, 1200 files are collected from the 20 Newsgroups dataset and the BBC, News and CNN websites for testing (TABLE I).
TABLE I. DATA SET TABLE FOR TRAINING AND TESTING
Dataset    Training   Testing
Business   50         200 (CNN, BBC)
Sports     50         1000 (20 newsgroups)
Total      100        1200
After creating the data sets for training and testing, classification proceeds as follows. The first step is to preprocess the data in order to remove stop words and tags from the training datasets.
Before features can be selected, the documents must be transformed into vectors. The TFIDF algorithm is used to create the vectors, and 95 support vectors are created from the given training datasets. A weight table (TABLE II) is also created, which contains words and the weights associated with them. Once the vectors are created, the next step is to save the word vectors for later use in classification (testing). In addition, Rapidminer creates a model file which holds all the specified parameters.
TABLE II. SOME SELECTED ATTRIBUTES AND WEIGHTS
Attribute   Weight
Team        0.008143169
takes       0.008143169
primarily   0.008143169
fired       0.008143169
fantastic   0.008143169
appeared    0.008143169
opposite    0.008143169
forcing     0.008143169
believes    0.008642927
TABLE III. SUPPORT VECTOR TABLE
Label      (abs) Alpha   Attr. 1   Attr. 2
Sports     0.42527807    0         0.04508112
Sports     0.68267482    0         0
Sports     0.58932045    0         0
Sports     0.55646904    0         0
Sports     0.29432025    0         0
Sports     0.52408548    0         0
Sports     0.18381303    0         0
Business   1.04166666    0         0
Business   1.04166666    0         0
Business   1.04166666    0         0
Business   0.88637830    0         0
Business   0.49816087    0         0
Business   0.23411624    0         0.02232141
Business   0.22118449    0         0
Business   0.28132026    0         0
Business   0.47983781    0         0
In the testing phase, a dataset is supplied which is preprocessed using the same techniques as in training. Then the word vectors and the model are supplied. Rapidminer compares the word vectors created from the testing dataset with the word vectors supplied from training. After the comparison, a prediction is made, based on the confidence, as to which category the given RSS feed item belongs to (TABLE III).
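The classification rule in equation (6) and the confidence-based labelling used in the testing phase can be sketched in Python as follows. This is a minimal illustration and not the RapidMiner model used in the paper: the paper does not state which kernel was used, so the Gaussian (RBF) kernel, the value of `gamma`, and all function names here are assumptions.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel, one common Mercer kernel; gamma is illustrative."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_value(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Sum over the support vectors of alpha_k * y_k * K(x_k, x), plus b."""
    return sum(a * y * kernel(sv, x)
               for sv, a, y in zip(support_vectors, alphas, labels)) + b


def classify(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """Equation (6): the sign of the decision value, returned as +1 or -1."""
    value = decision_value(x, support_vectors, alphas, labels, b, kernel)
    return 1 if value >= 0 else -1


def label_from_confidence(conf_business, conf_sports):
    """Assign the category with the higher confidence; ties go to sports."""
    return "Business" if conf_business > conf_sports else "Sports"
```

Only the support vectors enter the sum, which is why a sparse solution of the dual problem keeps prediction cheap: training points with a zero multiplier never need to be stored.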
The dataset is filtered using the Stop-word Filter and Token Length Filter operators and by pruning the data. Pruning is important in text mining because it improves the efficiency of the classifier by removing infrequent words from the documents. After this filtering, the data is ready for feature selection.
B. Web service
A Web service is developed using Java and Apache Axis (an open source Web services container that allows users to develop and publish Java Web services) on Apache Tomcat (an application server that hosts Java web applications and JSP web applications).
The interaction between the client application and the Web service is as follows:
1. On the selection of a button (operation 1, Figure 1), the client makes a connection to the server application and sends the desired request to the server.
2. Once the server receives the request, it first initializes the global variables and then a document builder variable to parse the retrieved XML-based RSS feed documents. The server then reads the XML RSS feeds from the repository (operation 2, Figure 1), keeping the <item>, <title>, <link> and <description> tags and ignoring the other tags, as they are not considered as important as these four tags at this stage, and parsing all channel tags complicates the process and affects application performance.
3. The server creates a virtual XML file and saves it at a specific location.
4. The filtered XML data (<title>, <link> and <description>) is read from the directory into a memory buffer and reproduced as a single string (operation 4, Figure 1).
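The server-side filtering in steps 2 to 4 can be sketched as follows. The Web service in the paper is written in Java; this Python version, based on the standard `xml.etree.ElementTree` module, is only an illustration of the idea, and the function name `filter_rss`, the wrapper element `<items>` and the sample feed are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, used only to exercise the sketch.
SAMPLE_FEED = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item>
    <title>Team wins final</title>
    <link>http://example.com/sports/1</link>
    <description>Match report.</description>
    <pubDate>Mon, 05 Jul 2010 10:00:00 GMT</pubDate>
    <guid>1</guid>
  </item>
</channel></rss>"""


def filter_rss(rss_xml):
    """Keep only <title>, <link> and <description> of each <item>."""
    root = ET.fromstring(rss_xml)
    filtered = ET.Element("items")
    for item in root.iter("item"):
        kept = ET.SubElement(filtered, "item")
        for tag in ("title", "link", "description"):
            child = item.find(tag)
            if child is not None:
                ET.SubElement(kept, tag).text = child.text
    # Reproduce the filtered XML as a single string, as in step 4.
    return ET.tostring(filtered, encoding="unicode")
```

Calling `filter_rss(SAMPLE_FEED)` returns one string in which tags such as `<pubDate>` and `<guid>`, as well as the channel-level `<title>`, have been dropped, so less data has to be sent to the client.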
The interaction then continues on the client side:
5. The string is sent to the client application, as shown in Figure 1 (operation 5).
6. The client writes the XML file to local memory (operation 6, Figure 1).
7. The user accesses the retrieved information locally whenever he/she wants, without sending a request to the server every time, which speeds up the retrieval process.
8. The XML file is read from local memory and a corresponding row is created in the list view interface with <title> and <link>. The user clicks on one of the links to view the full article.
Figure 1. Interaction between Client and Web service.
C. Client Application
The client application (Figure 2) provides the user with an interface to access the content retrieved from the Web service. Buttons are provided that allow the user to select news items from predefined categories, such as news about business, sports and/or entertainment.
Figure 2. Client Application displays news retrieved from Web service.
The client application is developed using C# .NET. It displays all available news headlines in its list view component.
III. RESULTS AND DISCUSSION
A training model is created from the training data sets by applying the classification algorithms to them. The model is then used for prediction. For feature selection, some data pre-processing techniques and TFIDF are used to convert the data into vectors. TFIDF is selected because of its higher prediction accuracy compared to Binary Occurrence, Term Frequency and Term Occurrence. The accuracy of prediction can be improved further by creating a more efficient training model with more training data. TABLE IV shows the accuracies obtained with the different vector creation methods.
TABLE IV. VECTOR ACCURACY
Training            Testing
TFIDF               96.00
Binary Occurrence   93.00
Term Frequency      93.00
Term Occurrence     84.00
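To make the difference between the four vector creation methods in TABLE IV concrete, the Python sketch below computes each weighting for the same tokenized documents. The definitions follow the usual meaning of the names (raw count, 0/1 presence, count relative to document length, and count scaled by inverse document frequency); RapidMiner's exact normalization is not described in the paper, so these formulas and the function name `weight_vectors` are assumptions.

```python
import math
from collections import Counter


def weight_vectors(token_lists, scheme):
    """Build one {word: weight} dict per document under the given scheme."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        if scheme == "term_occurrence":      # raw count of the word
            vec = dict(counts)
        elif scheme == "binary_occurrence":  # 1 if the word occurs at all
            vec = {w: 1 for w in counts}
        elif scheme == "term_frequency":     # count relative to document length
            vec = {w: n / total for w, n in counts.items()}
        elif scheme == "tfidf":              # count scaled by inverse document frequency
            vec = {w: n * math.log(n_docs / doc_freq[w])
                   for w, n in counts.items()}
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        vectors.append(vec)
    return vectors
```

Only the TFIDF scheme uses information from the other documents, so it is the only one of the four that down-weights words shared by both categories.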
Prediction is done by calculating the average confidence of a file with respect to each category, such as business and sports. If the confidence is higher for business, the file is labeled as business news; otherwise, it is labeled as sports. TABLE V lists some random prediction results generated from the classifier for a specific training file.
TABLE V. SOME RANDOM PREDICTION RESULTS
ID     Prediction Label   Confidence Business   Confidence Sports
996    Sports             0.34765210            0.65234789
997    Sports             0.36646943            0.63353056
998    Sports             0.41855054            0.58144945
999    Sports             0.39448818            0.60551181
1001   Business           0.73105696            0.26894303
1002   Business           0.73107281            0.26892718
1003   Business           0.73108733            0.26891266
1004   Business           0.73303367            0.26696632
The proposed Web service-based RSS feed categorizing architecture is efficient, as users do not have to subscribe to any websites for feeds. The only thing they need to do is subscribe to the Web service for categorized news. The Web service processes client requests and sends the results back to the clients, which saves users a significant amount of time by avoiding superfluous feeds. The client application downloads the feed to the local machine, making the whole process quicker, as there is no need for users to keep sending requests to the Web service for the feeds.
IV. FUTURE WORK AND CONCLUSION
Currently this prototype only delivers sports and business news; however, it can be improved by further categorizing news items into sub-categories (e.g. cricket, football, formula one, golf, motorsports and horse racing). In addition, in the future it is possible to develop an automated feed categorization system and integrate that system with the Web service. The client application can also be improved by adding a bookmarking system which allows users to save their favorite feeds locally. Furthermore, the client application can be equipped with a special search request for feeds, allowing a specific set of feeds to be retrieved (e.g. request by title).
Feed management and categorization is a problem with current RSS technologies. In this paper, a novel approach is presented for delivering news items from RSS feeds, based on existing text categorization and Web service techniques. News feeds are collected from various news websites and stored in folders (for training and testing purposes) for categorization, using Rapidminer 4.3 as the data mining tool and SVM as the classification algorithm, with some data preprocessing techniques. A Web service and a client application are developed to deliver and display categorized feeds. The Web service also filters feeds by stripping unnecessary tags to improve categorizing performance. The client application provides a user-friendly interface to display retrieved feeds.
The proposed Web service-based RSS feed categorizing approach manages and delivers feeds in an efficient way and overcomes the feed categorization problem. The prototype enables the user to get a specific type of news without subscribing to news websites and/or being flooded by unnecessary information, which saves time and effort. Further enhancement of the categorizing algorithms and testing data sets is necessary to improve the prototype.
REFERENCES
[1] J. Grossnickle, T. Board, B. Pickens and M. Bellmont, RSS Crossing into the Mainstream, Yahoo White Paper, October 2005 [Available Online].
[2] Web services, Accessed on 30th July, [Available Online] http://en.wikipedia.org/wiki/Web_service.
[3] J. Cervantes, X. Li and W. Yu, Support Vector Classification for Large Data Sets by Reducing Training Data with Change of Classes, IEEE International Conference on Systems, Man and Cybernetics, 2008.
[4] Y. Yang and X. Liu, A re-examination of text categorization methods, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, pp. 42-49, 1999.
[5] Q. R. Zhang, L. Zhang, S. B. Dong and J. H. Tan, Document Indexing in Text Categorization, International Conference on Machine Learning and Cybernetics, 2005.
[6] K. Chen and C. Zong, A New Weighting Algorithm For Linear Classifier, International Conference on Natural Language Processing and Knowledge Engineering, 2003.