International Journal of Engineering Trends and Technology (IJETT) – Volume 31 Number 5- January 2016
Statistical Topic Modeling for News Articles
Suganya C#1, Vijaya M S*2
# M.Phil Research Scholar, Computer Science, PSGR Krishnammal College for Women, Coimbatore, India
* Associate Professor, Department of Computer Science, PSGR Krishnammal College for Women, Coimbatore, India
Abstract—Topic modeling is an impressive text mining approach for organizing knowledge with different content under particular topics. It is widely used in sectors such as library science, search engines, and statistical language modeling. Discovering a topic from a text corpus is a tough task from the text mining perspective, and the topics in a collection of documents have to be recognized in an efficient manner. In this paper, the topic modeling technique Latent Dirichlet Allocation (LDA) is implemented to identify topics from text corpora. Variants of LDA, namely Labeled LDA and Partially Labeled LDA, are also used to label the documents and to identify sub topics. This work focuses on predicting the topic of a given document using observed key words. News articles of four different categories, namely science, sports, business, and weather, are collected and a corpus is generated; 100 instances are collected from each of the four categories, forming a total of 400 news articles. The three models are implemented using the Stanford Topic Modeling Toolbox, experimental results are generated, and observations are made.
Keywords — Topic modeling, Generative model, LDA, LLDA, PLDA.
I. INTRODUCTION
Topic modeling is a form of text mining, a way of identifying patterns in a corpus. It is a type of statistical model for discovering the abstract topics that occur in a collection of documents. Topic models treat each document as a weighted mixture of topics, where each topic is a probability distribution over the words of the entire corpus. A topic model operates by examining a set of documents and, based on the weights assigned to each word, automatically groups topically related words into "topics" and associates tokens and documents with those topics. Topic modeling is mainly applied to digitized and stored content such as news, blogs, web pages, scientific articles, sound, video, and social networks. In topic modeling, a 'topic' is a probability distribution over words or, put more simply, a group of words that often co-occur with each other in the same documents. Generally these groups of words are semantically related and interpretable; in other words, a theme, issue, or genre can often be identified simply by examining
the most common words in a topic. Beyond identifying these words, a topic model provides the proportions in which topics appear in each document, yielding quantitative data that can be used to locate documents on a particular topic, or documents that combine multiple topics, and to produce a variety of revealing visualizations of the corpus as a whole. Various methods are used to find topics in a collection of documents. Topic modeling algorithms are mainly used to develop models for searching, browsing, and summarizing large corpora of texts. The problem considered is, given a large set of emails, news articles, journal papers, or reports, to understand the key information contained in the set of documents. Generative models such as the Hidden Markov Model (HMM), Gaussian models, Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA) are used in topic modeling to extract topics from large corpora. LDA is the most common algorithm for finding latent topics.
Researchers have increasingly recognized the importance of topic modeling in recent years. Various methods have been proposed to identify the concepts or themes when large numbers of text documents are analyzed.
Ralf Krestel et al. [3] developed a tag recommendation approach using Latent Dirichlet Allocation (LDA) to improve search. Resources annotated by many users, and thus equipped with a fairly stable and complete tag set, are used to elicit latent topics to which new resources with only a few tags are mapped. Based on this, other tags belonging to a topic can be recommended for the new resource. The authors showed that the approach achieves better precision and recall than the use of association rules and also recommends more specific tags.
Daniel Ramage et al. [4] developed a method based on Labeled LDA for multi-labeled corpora. Labeled LDA improved on traditional LDA in the visualization of a corpus of tagged web pages. The new model improves upon LDA for labeled corpora by gracefully incorporating user supervision in the form of a one-to-one mapping between topics and labels. A Delicious corpus of four thousand documents was used as the dataset, and the model was compared with standard LDA. The topics from the unsupervised variant were matched to the Labeled LDA topic of highest cosine similarity, and a total of 20 topics were learned. It was concluded that Labeled LDA outperforms SVM when extracting tag-specific documents. Girish Maskeri
et al. [5] proposed an approach based on LDA for topic extraction from source code.
Christopher D. Manning et al. [6] proposed partially labeled topic models for interpretable text mining. These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. The many tags present in the del.icio.us dataset were used to quantitatively establish the new models' higher correlation with human understanding scores over several strong baselines. PhD dissertation abstracts and the Delicious dataset were used in this work. For the PhD dissertations, the models captured latent content such as variables that increase or decrease, rates of change, and structural details regarding requirements, issues, and objectives. PLDA outputs a rich topic space with identified groups of latent dimensions. Comparing PLDA, LDA, and L-LDA on the same dataset, PLDA was observed to perform consistently well.
In [7] the authors proposed a semi-supervised hierarchical topic model (SSHLDA) which aims to explore new topics automatically. SSHLDA is a probabilistic graphical model that describes a process for generating a hierarchical labeled document collection. The authors compared hierarchical Latent Dirichlet Allocation (hLDA) with hierarchical Labeled Latent Dirichlet Allocation (hLLDA) and also proved that hLDA and hLLDA are special cases of SSHLDA. They used a Yahoo! question and answer dataset containing the topics Computers & Internet and Health. A Gibbs sampling approach with 1000 iterations was used for the experiments.
From the above literature survey it is understood that various LDA techniques have been applied to find topics in journal abstracts, science and social documents, and Yahoo! questions and answers. In the proposed work, a model is designed to identify the topics of news articles collected from various sources. Three models, LDA, LLDA, and PLDA, are employed for topic modeling and topic discovery. The LDA-based approaches CVB0 and GS are utilized and the models are generated.
II. BACKGROUND STUDY
This section provides a brief description of LDA, LLDA, and PLDA and their use in extracting topics from text documents.
A. LDA
Latent Dirichlet Allocation (LDA) is a generative model which allows sets of observations to be described by unobserved groups that explain why some parts of the data are similar. In LDA, each document is perceived as a mixture of several themes. This is comparable to probabilistic Latent Semantic Analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Given a corpus of documents, LDA attempts to discover the following:
• It identifies a set of topics
• It associates a set of words with a topic
• It defines a specific mixture of these topics for each document in the corpus.
The process of generating a corpus is as follows:
1. Randomly choose a distribution over topics
2. For each word in the document
   a. randomly choose a topic from the distribution over topics
   b. randomly choose a word from the corresponding topic

Fig. 1 Plate notation for LDA

Fig. 1 shows how the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents and N the number of words in a document. Thus:
1. α is the parameter of the Dirichlet prior on the per-document topic distributions,
2. β is the parameter of the Dirichlet prior on the per-topic word distribution,
3. θ_i is the topic distribution for document i,
4. φ_k is the word distribution for topic k,
5. z_{ij} is the topic for the jth word in document i, and
6. w_{ij} is the specific word.

A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex and has the following probability density on this simplex:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}    (1)

where the parameter α is a k-vector with components α_i > 0, and where Γ(x) is the Gamma function. The Dirichlet is a convenient distribution
on the simplex: it is in the exponential family, has finite dimensional sufficient statistics, and is conjugate to the multinomial distribution. These properties facilitate the development of inference and parameter estimation algorithms for LDA. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)    (2)

where p(z_n \mid \theta) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z yields the marginal distribution of a document.
B. Labeled LDA
Labeled LDA is a probabilistic graphical model that describes a process for generating a labeled document collection. Like Latent Dirichlet Allocation, Labeled LDA models each document as a mixture of underlying topics and generates each word from one topic. Unlike LDA, L-LDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document's (observed) label set.

Fig. 2 Plate notation for LLDA

In the plate notation of Fig. 2:
1. α is the parameter of the Dirichlet prior on the per-document topic distributions,
2. θ_i is the topic distribution for document i,
3. φ_k is the word distribution for topic k,
4. z_{ij} is the topic for the jth word in document i,
5. w_{ij} is the specific word,
6. the label prior Φ is d-separated from the rest of the model given Λ, and
7. the set of possible topics is simply constrained to the observed labels.

The collapsed Gibbs sampling update for L-LDA is

P(z_i = j \mid \mathbf{z}_{-i}) \propto \frac{n_{-i,j}^{w_i} + \eta_{w_i}}{n_{-i,j}^{(\cdot)} + \eta^{\top}\mathbf{1}} \times \frac{n_{-i,j}^{(d)} + \alpha_j}{n_{-i,\cdot}^{(d)} + \alpha^{\top}\mathbf{1}}

where n_{-i,j}^{w} is the count of word w in topic j that does not include the current assignment z_i, a missing subscript or superscript indicates a summation over that dimension, and 1 is a vector of 1's of appropriate dimension. The equation looks exactly like that of the LDA model, with the important distinction that the target topic j is restricted to belong to the set of labels, that is j ∈ λ^(d).

C. Partially Labeled LDA
Partially Labeled Dirichlet Allocation (PLDA) is a generative model for a collection of labeled documents, extending the generative process of LDA to incorporate labels, and that of Labeled LDA to incorporate per-label latent topics. Formally, PLDA assumes the existence of a set of L labels (indexed by 1..L), each of which has been assigned some number of topics K_l (indexed by 1..K_l), where each topic is represented as a multinomial distribution over all terms in the vocabulary V drawn from a symmetric Dirichlet prior. One of these labels may optionally denote a shared global latent topic class, which can be interpreted as a label "latent" present on every document d. PLDA assumes that each topic takes part in exactly one label.

Fig. 3 Plate notation for PLDA
Each document's words w and labels Λ are observed, with the per-document label distribution ψ, the per-document-label topic distributions θ, and the per-topic word distributions φ as hidden variables. Because each document's label set is observed, its sparse binary vector prior is unused and is included only for completeness.
Each document d is generated by first drawing a document-specific subset of the available label classes, represented as a sparse binary vector drawn from a sparse binary vector prior. A document-specific mixture over topics 1..K_j is drawn from a symmetric Dirichlet prior for each label j present in the document. Then, a document-specific mixture of observed labels ψ_d is drawn as a multinomial from a Dirichlet prior, with each element corresponding to the document's probability of using label j when selecting a latent topic for each word. For derivational simplicity, the element at position j of this prior is defined in terms of K_j, so it is not a free parameter. Each word w in document d is drawn from some label's topic's word distribution, i.e. it is drawn by first picking a label j from ψ_d, a topic z from θ_{d,j}, and then a word w from φ_z. Ultimately, this word will be picked in proportion to how much the enclosing document prefers the label l, how much that label prefers the topic z, and how much that topic prefers the word w [6].
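As a sketch of this label-then-topic-then-word step, the following self-contained Scala fragment mirrors the description above; all distributions and values are illustrative assumptions, not values from the paper.

import scala.util.Random

object PldaWordSketch {
  val rng = new Random(7)

  // Draw an index from a discrete distribution p (entries sum to 1).
  def pick(p: Array[Double]): Int = {
    val u = rng.nextDouble()
    var acc = 0.0
    var i = 0
    while (i < p.length - 1) {
      acc += p(i)
      if (u < acc) return i
      i += 1
    }
    p.length - 1
  }

  def main(args: Array[String]): Unit = {
    val vocab = Array("rain", "gene", "market", "match")
    // Two labels, each owning one topic (in PLDA a topic belongs to
    // exactly one label); phi holds per-topic word distributions.
    val topicsOfLabel = Array(Array(0), Array(1))
    val phi = Array(Array(0.7, 0.1, 0.1, 0.1),
                    Array(0.1, 0.1, 0.1, 0.7))
    val psi = Array(0.5, 0.5)                  // per-document label mixture
    val theta = Array(Array(1.0), Array(1.0))  // per-label topic mixtures

    // Pick a label j from psi, a topic z within that label from
    // theta(j), then a word from phi(z).
    val j = pick(psi)
    val z = topicsOfLabel(j)(pick(theta(j)))
    println(vocab(pick(phi(z))))
  }
}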
III. PROPOSED WORK
A topic modeling system for news articles is built using LDA, LLDA, and PLDA with the Collapsed Variational Bayes and Gibbs Sampling approaches. The proposed framework includes four phases: data collection, pre-processing, training, and topic prediction. In the data collection phase, news articles are collected and stored in a CSV file for processing. In the pre-processing stage, the news articles are processed to remove stop words, case-fold words, filter words and numbers, and select the model parameters. During training, the words and their corresponding weights are extracted. The final phase predicts the topic of a new article based on probability. The architecture of the proposed model is shown in Fig. 4.
Fig. 4 Architecture of the proposed topic model (corpus of news articles → pre-processing → LDA/LLDA/PLDA training → predictive models → predicted topic, label, and subtopic for a new article)
A. Data Collection
Data are collected manually from online news websites: 400 instances across four categories, namely science, sports, business, and weather, with 100 news articles each. Science news articles, on genes and diseases, are collected from the Science Daily website1. Sports news, covering cricket, football, and tennis, is collected from the BBC Sport website2. Business news, covering the stock market and finance, is collected from The Times of India website3. Weather news is collected from the AccuWeather news website4.
1 http://www.sciencedaily.com/
2 http://www.bbc.com/sport/0/
3 http://timesofindia.indiatimes.com/
4 http://www.accuweather.com/en/weather-news
B. Pre-Processing
The Scala NLP API is used to pre-process the news documents. The Scala program takes the text as a CSV file, with one news article per record. The CSV file is given as input and the pre-processing steps are carried out before learning the models using LDA, LLDA, and PLDA. The pre-processing tasks, namely tokenization, case folding, stop word removal, and parameter selection, are discussed below.
1. Tokenization
The first step in pre-processing is tokenization. Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing in text mining. This is accomplished with the following functions.
SimpleEnglishTokenizer() removes punctuation from the end of words and then splits the input text on whitespace characters such as tabs, spaces, and carriage returns.
CaseFolder() lower-cases each word so that "THE", "The", and "the" all look like "the". Case folding reduces the number of distinct words seen by the model by turning all characters to lowercase.
WordsAndNumbersOnlyFilter() removes tokens that consist entirely of punctuation or other non-word, non-number characters from the generated lists of tokenized documents.
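A hedged sketch of the tokenizer pipeline assembled from the functions described above, modeled on the Stanford Topic Modeling Toolbox example scripts (the import paths follow those scripts and may vary by toolbox version):

import scalanlp.io._;
import scalanlp.stage._;
import scalanlp.stage.text._;
import scalanlp.text.tokenize._;
import edu.stanford.nlp.tmt.stage._;
import edu.stanford.nlp.tmt.model.lda._;

val tokenizer = {
  SimpleEnglishTokenizer() ~>      // trim punctuation, split on whitespace
  CaseFolder() ~>                  // "THE", "The", "the" -> "the"
  WordsAndNumbersOnlyFilter() ~>   // drop punctuation-only tokens
  MinimumLengthFilter(3)           // drop tokens shorter than 3 characters
}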
2. Finding meaningful words
The use of very common words like 'the' does not indicate the kind of similarity between documents in which one is interested, and single letters or other short sequences are rarely useful for understanding content. Terms that appear very few times in the corpus should therefore be removed, because very rare words tell little about the similarity of documents; so should the most common words in the corpus, because ubiquitous words likewise tell little about the similarity of documents. The following functions are used to achieve this task.
MinimumLengthFilter() removes terms that are shorter than a minimum number of characters.
TermStopListFilter() removes a supplied list of stop words from the corpus; the list is given as TermStopListFilter(List("term1", "term2", ..., "termN")).
Removing Empty Documents:
Some documents in the dataset may be missing or empty because their words were filtered out during tokenization. These documents can be disregarded by applying the DocumentMinimumLengthFilter(length) function, which removes documents from the dataset that are shorter than the specified length.
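Combining the filters above into one document pipeline, again modeled on the toolbox's example scripts; the CSV file name, column indices, and threshold values here are assumptions for illustration:

val source = CSVFile("news-articles.csv") ~> IDColumn(1);

val text = {
  source ~>
  Column(2) ~>                          // column holding the article text
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>                      // collect corpus-wide term counts
  TermMinimumDocumentCountFilter(4) ~>  // drop very rare terms
  TermDynamicStopListFilter(30) ~>      // drop the 30 most common terms
  DocumentMinimumLengthFilter(5)        // drop near-empty documents
}

val dataset = LDADataset(text);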
3. Parameter selection for the three models
The number of topics is specified through the parameter selection process, and the Dirichlet term and topic smoothing parameters used by the LDA model can also be provided as arguments; CVB0 and Gibbs sampling are run under the same fixed hyper-parameters. The constructor used for LDA is given below.

val modelParams = LDAModelParams(numTopics = 4, dataset = dataset, topicSmoothing = 0.01, termSmoothing = 0.01);

Labeled LDA uses the LabeledLDAModelParams constructor; the numTopics parameter need not be specified here because the topics are already given as labels. The constructor used for the LLDA model is given below.

val modelParams = LabeledLDAModelParams(dataset);

The PLDA model is configured similarly to LDA, with additional parameters specifying the number of background topics and the number of subtopics for each label: numBackgroundTopics and numTopicsPerLabel respectively. PLDA identifies hidden topics in a particular news domain. The constructor used for PLDA is given below.

val numTopicsPerLabel = SharedKTopicsPerLabel(2);
val modelParams = PLDAModelParams(dataset, numBackgroundTopics, numTopicsPerLabel, termSmoothing = 0.01, topicSmoothing = 0.01);
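A hedged sketch of the corresponding training call, following the toolbox's example scripts; the output path and iteration count are illustrative assumptions:

val modelPath = file("lda-" + dataset.signature + "-" + modelParams.signature);

TrainCVB0LDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// Gibbs-sampled variant:
// TrainGibbsLDA(modelParams, dataset, output = modelPath, maxIterations = 1000);
// The toolbox's examples provide analogous trainers for the labeled
// models (names assumed from those scripts, e.g. TrainCVB0LabeledLDA,
// TrainCVB0PLDA).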
IV. EXPERIMENTS AND RESULTS
Three experiments have been carried out using LDA, L-LDA, and PLDA to identify the topics of various articles. News articles related to the domains science, sports, business, and weather are collected from multiple sources and a text corpus is created. The pre-processing tasks described earlier have been carried out to facilitate training. The experiments are run with the Stanford Topic Modeling Toolbox using the Scala programming language. The Stanford Topic Modeling Toolbox is a free collection of topic modeling tools and part of the Stanford natural language processing toolset. The experiments for LDA, LLDA, and PLDA are discussed in this section.
A. Latent Dirichlet Allocation (LDA)
The learning model is generated using CVB0 and GS, the two LDA approaches. The term and topic smoothing parameters are set to 0.01 and the model is built for the selected parameter values. The number of topics is set as numTopics = 4, indicating that four topics are chosen. The model is trained for 1000 iterations, and after every 50th iteration the summary of the topics with weight values is computed and displayed.
The resulting terms using the CVB0 approach are given in Table I for topics 0 through 3. The related terms in the news articles are clustered: topic 0 contains weather-related terms, topic 1 contains science-related terms, topic 2 contains business-related terms, and topic 3 contains sports-related terms.
Gibbs sampling is similar to the CVB0 approach but yields different topic clusters. In Gibbs sampling, the topics are clustered as weather, sports, science, and business.
TABLE I. TERMS FOR LDA MODEL

Topic 0 (weather)  Topic 1 (science)  Topic 2 (business)  Topic 3 (sports)
Weather            Researcher         Growth              World
Weekend            Cancer             Market              Cup
Rain               Scientists         Stock               Win
Patients           Disease            Bank                Goals
Common             Research           Economic            Scored
Percent            Gene               Business            Victory
Climate            People             Exchange            Football
Temperature        Cells              Companies           League
Severe             Identified         Rate                Team
Winter             Dna                Sensex              Cricket
Season             Genetic            Interest            Australia
Dry                Genes              Markets             Match
Heavy              Therapy            Demand              Cricket
Thunderstorms      Brain              Global              Player
During testing, a news article of category 'weather' is chosen from one of the four topics considered for training. Testing is performed using both the CVB0 and GS models. The probability values of the given news article for the four topics, inferred while testing using CVB0, are given below.

News id = 1
Topic 0 (weather) = 0.756
Topic 1 (science) = 0.00094
Topic 2 (business) = 0.214
Topic 3 (sports) = 0.03

The above results show that the probability of the news article belonging to weather (topic 0) is the highest, which confirms that the topic predicted by the model is correct.
Similarly, the LDA topic model with Gibbs sampling is tested by passing a news article of category weather. The test results, in terms of the probabilities that the article belongs to each topic, are given below.

News id = 1
Topic 0 (weather) = 0.85
Topic 1 (sports) = 0.088
Topic 2 (business) = 0
Topic 3 (science) = 0.063
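A hedged sketch of scoring a held-out article with a trained model, following the toolbox's example inference script (the function names are assumed from that script and the test file name is hypothetical):

val model = LoadCVB0LDA(modelPath);

val testText = {
  CSVFile("test-news.csv") ~> IDColumn(1) ~>
  Column(2) ~>
  TokenizeWith(model.tokenizer.get)     // reuse the training tokenizer
}
val testDataset = LDADataset(testText, termIndex = model.termIndex);

// Per-document topic distributions, i.e. the probabilities listed above.
val perDocTopicDists = InferCVB0DocumentTopicDistributions(model, testDataset);
CSVFile("test-topic-distributions.csv").write(perDocTopicDists);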
Comparison of CVB0 and GS in LDA:
The comparison of prediction results in LDA with CVB0 and GS is summarized in Table II and illustrated in Fig. 5.
TABLE II. PREDICTIVE PERFORMANCE OF LDA

Approach  Topic 0  Topic 1  Topic 2  Topic 3
CVB0      0.75     0.0009   0.214    0.03
GS        0.85     0.088    0        0.063
Fig. 5 Comparative results of LDA
From the above two inferences it is observed that the Gibbs sampling based LDA model produces a higher topic probability than CVB0.
B. Labeled Latent Dirichlet Allocation (LLDA)
In this experiment, labels are assigned to the news articles and the training dataset is created in CSV file format. The Labeled LDA (LLDA) model is intended to produce the label, and this is achieved through the terms extracted for each topic. Words are extracted from the text in LLDA training just as in LDA training. LLDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document's label set. During testing, the trained LLDA model is tested on its prediction of the labels appropriate for a new document.
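A hedged sketch of building the labeled dataset for LLDA, modeled on the toolbox's example scripts and reusing the source and text stages sketched in Section III; the label column index is an assumption:

val labels = {
  source ~>
  Column(3) ~>                           // column holding the label(s)
  TokenizeWith(WhitespaceTokenizer()) ~> // labels are whitespace-separated
  TermCounter()
}

val dataset = LabeledLDADataset(text, labels);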
Training the LLDA model using the CVB0 approach produces the terms along with the corresponding labels: science (0), sports (1), business (2), and weather (3). These terms are used to predict the label of a new document based on probability. The terms extracted during LLDA training are given in Table III.
TABLE III. TERMS FOR LLDA MODEL

Science (0)   Sports (1)  Business (2)  Weather (3)
Cancer        World       Markets       Weather
Researchers   Cup         Economic      Climate
Scientists    Victory     Rates         Severe
Dna           Win         Companies     Change
Cells         Sport       Global        Rain
Treatment     Scored      Top           Risk
Time          Team        Demand        Global
Patients      Life        Time          Areas
Vaccine       Teams       Bank          Parts
Bacteria      Life        Lead          Increase
Liver         Time        Time          Days
Brain         Games       Sensex        Wave
During testing, the LLDA model determines the probability of an article belonging to a topic based on the weights of the terms in the article, and predicts the label for the given article. As in the LDA experiment, the model is tested with a weather news article. It is inferred that the probability for the topic weather is much higher than for science, sports, and business. The predictive results for LLDA with CVB0 are given below.

News id = 1
Science (0) = 0.0007
Sports (1) = 0.003
Business (2) = 0.0673
Weather (3) = 0.789
The LLDA model training with Gibbs sampling and its testing are carried out in the same way as LLDA with CVB0. The test results of the LLDA model with GS are given below.

News id = 1
Science (0) = 0.0086
Sports (1) = 0.003
Business (2) = 0.0012
Weather (3) = 0.889
Comparison of CVB0 and GS in LLDA:
The comparison of prediction results in LLDA with CVB0 and GS is summarized in Table IV and illustrated in Fig. 6.
TABLE IV. PREDICTIVE PERFORMANCE OF LLDA

Approach  Science  Sports   Business  Weather
CVB0      0.0007   0.05663  0.067     0.78
GS        0.0086   0.038    0.0286    0.89
Fig. 6 Comparative Results of LLDA
From the above two inferences it is observed that the Gibbs sampling based LLDA model produces a higher label probability for the given news article than CVB0.
C. Partially Labeled Dirichlet Allocation (PLDA)
Partially Labeled Dirichlet Allocation (PLDA) is employed to recognize unseen sub-groupings within a particular theme. PLDA offers considerable flexibility in describing the space of latent topics, learning latent topics both within labels and in a common background space. The same dataset as in the second experiment is used here. However, PLDA supports a new parameter for each label, Kl, which corresponds to the number of topics available within each label's topic class. During PLDA training, the terms are extracted as in the previous models, and the hidden sub topics are discovered for all the topics. For example, for science articles, the sub topics 'gene' (science 0) and 'disease' (science 1) are identified by the model. The terms extracted for each sub topic are given in Table V.
TABLE V. TERMS FOR PLDA MODEL

Science 0      Science 1
Scientists     Cancer
Study          Disease
Human          Researcher
Dna            Brain
Genes          Time
Virus          Therapy
Vaccine        Heart
Genome         Common
Genetic        Research
Researchers    Patients
Identified     Study
International  Cells
Research       Brain
Sequence       Health
Mutation       Bacteria
Discovered     Treatment
During testing, news articles from each sub topic are chosen, and the PLDA model predicts the label and the probability. For example, two articles, one from each sub topic of the science category, are taken and tested. The probability values of the given news articles for the science category, inferred while testing using CVB0, are given below.
News id = 1: Science (0) = 0.879, Science (1) = 0.0056
News id = 2: Science (0) = 0.0034, Science (1) = 0.88
The PLDA training with GS is done in the same way as the PLDA training with CVB0, and the probability values of the given news articles for the science category, inferred while testing using GS, are obtained.

News id = 1: Science (0) = 0.99763, Science (1) = 0.00005
News id = 2: Science (0) = 0.002, Science (1) = 0.99
Comparison of CVB0 and GS in PLDA:
The comparison of prediction results in PLDA with CVB0 and GS for the category 'science' is summarized in Table VI and illustrated in Fig. 7.
TABLE VI. PREDICTIVE PERFORMANCE OF PLDA

Approach  Science 0  Science 1
CVB0      0.879      0.002
GS        0.99       0.0453

Fig. 7 Comparative Results of PLDA
From the above two inferences it is observed that the Gibbs sampling based PLDA model produces a higher subtopic probability for the given news article than CVB0.
Findings and Discussion
In this work, the topics of news articles have been predicted using LDA and its variants LLDA and PLDA. As far as topic modeling is concerned, LDA is intended to predict the topic of an article, whereas LLDA is intended to determine the label of a given article. PLDA is used to find hidden sub topics within the collection across all categories. LDA and LLDA help to find the topics and labels of documents in a large collection and can generate the key words for the document collection. PLDA enables extraction of keywords for the subtopics of a particular domain. From the comparative analysis of the three models, it is inferred that the Gibbs sampling approach integrated with LDA, LLDA, and PLDA outperforms the collapsed variational approach.
V. CONCLUSIONS
This paper demonstrates a model for identifying topics from a collection of news articles. LDA, LLDA, and PLDA have been applied to generate terms and weight values. The performance of each model has been evaluated using two approaches, namely CVB0 and GS. From the results it has been found that Gibbs sampling outperforms the collapsed variational approach. As a future task, the system can be implemented in a distributed environment using various topic modeling methods.
REFERENCES
[1] X. Liu and W. B. Croft (2004), "Cluster-based retrieval using language models", in Proceedings of SIGIR '04, pp. 186-193.
[2] A. McCallum (1999), "Multi-label text classification with a mixture model trained by EM", in AAAI Workshop on Text Learning.
[3] R. Krestel, P. Fankhauser, and W. Nejdl (2009), "Latent Dirichlet Allocation for Tag Recommendation", ACM.
[4] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning (2009), "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora", in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing.
[5] G. Maskeri, S. Sarkar, and K. Heafield (2008), "Mining Business Topics in Source Code using Latent Dirichlet Allocation", ACM.
[6] D. Ramage, C. D. Manning, and S. Dumais (2011), "Partially Labeled Topic Models for Interpretable Text Mining", San Diego, California, USA.
[7] D. Ramage and E. Rosen (2011), "Stanford Topic Modeling Toolbox". [Online]. Available: http://nlp.stanford.edu/software/tmt/tmt-0.4
[8] L. Yao, D. Mimno, and A. McCallum (2009), "Efficient Methods for Topic Model Inference on Streaming Document Collections", June 28-July 1, Paris, France, ACM.
[9] Y. W. Teh, D. Newman, and M. Welling (2007), "A collapsed variational Bayesian inference algorithm for latent dirichlet allocation", NIPS 19, pp. 1353-1360.
[10] O. Asfari, L. Hannachi, F. Bentayeb, and O. Boussaid (2013), "Ontological Topic Modeling to Extract Twitter Users' Topics of Interest", The 8th International Conference on Information Technology and Applications.
[11] D. M. Blei and J. D. Lafferty (2006), "Correlated topic models", Advances in Neural Information Processing Systems 18, MIT Press, Cambridge, MA.