Statistical Topic Modeling for News Articles

Suganya C #1, Vijaya M S *2
# M.Phil Research Scholar, Department of Computer Science, PSGR Krishnammal College for Women, Coimbatore, India
* Associate Professor, Department of Computer Science, PSGR Krishnammal College for Women, Coimbatore, India

Abstract—Topic modeling is an effective text mining approach for organizing documents with different content under particular topics. It is widely used in areas such as library science, search engines and statistical language modeling. Discovering topics in a text corpus is a difficult task from the text mining perspective, and the topics in a collection of documents have to be recognized efficiently. In this paper, the topic modeling technique Latent Dirichlet Allocation (LDA) is implemented to identify topics in a text corpus. Variants of LDA, namely Labeled LDA and Partially Labeled LDA, are also used to label the documents and to identify sub-topics. This work focuses on predicting the topic of a given document using the observed keywords. News articles from four categories, science, sports, business and weather, are collected and a corpus is generated; 100 instances are collected from each category, giving a total of 400 news articles. The three models are implemented using the Stanford Topic Modeling Toolbox, the experimental results are generated and observations are made.

Keywords—Topic modeling, Generative model, LDA, LLDA, PLDA.

I. INTRODUCTION
Topic modeling is a form of text mining, a way of identifying patterns in a corpus. It is a type of statistical model for discovering the abstract topics that occur in a collection of documents. Topic models treat each document as a weighted mixture of topics, where each topic is a probability distribution over the words of the entire corpus. A topic model operates by examining a set of documents and, based on the weights assigned to each word, automatically groups topically related words into "topics" and associates tokens and documents with those topics. Topic modeling is mainly applied to digitized content such as news, blogs, web pages, scientific articles, sound, video and social network data. In topic modeling, a 'topic' is a probability distribution over words or, put more simply, a group of words that often co-occur with each other in the same documents. Generally these groups of words are semantically related and interpretable; in other words, a theme, issue or genre can often be identified simply by examining the most common words in a topic. Beyond identifying these words, a topic model provides the proportions in which topics appear in each document, yielding quantitative data that can be used to locate documents on a particular topic or documents that combine multiple topics, and to produce a variety of revealing visualizations of the corpus as a whole. Various methods are used to find topics in a collection of documents, and topic modeling algorithms are mainly used to build models for searching, browsing and summarizing large corpora of text. The problem considered is, given a large set of emails, news articles, journal papers or reports, to understand the key information contained in the set of documents.
Generative models such as the Hidden Markov Model (HMM), Gaussian models, Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are used in topic modeling to extract topics from a large corpus. LDA is the most common algorithm for finding latent topics. Researchers have increasingly recognized the importance of topic modeling in recent years, and various methods have been proposed to identify the concepts or themes present when large numbers of text documents are analyzed. Ralf Krestel et al. [3] developed a tag recommendation approach based on Latent Dirichlet Allocation (LDA) to improve search. Resources annotated by many users, and thus equipped with a fairly stable and complete tag set, are used to elicit latent topics to which new resources with only a few tags are mapped. Based on this, other tags belonging to a topic can be recommended for the new resource. The authors showed that the approach achieves better precision and recall than association rules and also recommends more specific tags. Daniel Ramage et al. [4] developed a method based on Labeled LDA for multi-labeled corpora. Labeled LDA improves on traditional LDA for the visualization of a corpus of tagged web pages. The new model improves upon LDA for labeled corpora by gracefully incorporating user supervision in the form of a one-to-one mapping between topics and labels. A Delicious corpus of four thousand tagged documents is used as the dataset, and the model is compared with standard LDA. The topics learned by the unsupervised variant were matched to the Labeled LDA topics with the highest cosine similarity; in total 20 topics are learned. It is concluded that Labeled LDA outperforms SVM when extracting tag-specific documents. Girish Maskeri et al. [5] proposed an approach based on LDA for topic extraction from source code. Christopher D. Manning et al. [6] proposed partially labeled topic models for interpretable text mining. These models make use of the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. The many tags present in the del.icio.us dataset are used to quantitatively establish the new models' higher correlation with human understanding scores over several strong baselines. PhD dissertation abstracts and the Delicious dataset are used in that work. In the PhD dissertations, the latent content captured within a common framework includes basic objects such as variables that increase or decrease, rates of change, and structural details regarding requirements, issues and objectives. PLDA outputs a rich topic space with identified groups of latent dimensions per label. Comparing PLDA, LDA and L-LDA on the same dataset, PLDA was observed to be a consistently high-performing framework. In [7] the authors proposed a semi-supervised hierarchical topic model (SSHLDA) which aims to explore new topics automatically. SSHLDA is a probabilistic graphical model that describes a process for generating a hierarchically labeled document collection. They compared hierarchical Latent Dirichlet Allocation (hLDA) with hierarchical Labeled Latent Dirichlet Allocation (hLLDA), and also proved that hLDA and hLLDA are special cases of SSHLDA. The authors used the Yahoo! question-and-answer dataset, containing topics such as Computers & Internet and Health. A Gibbs sampling approach was used for the experiments, with 1000 iterations.
From the above literature survey it is understood that various LDA techniques have been applied to find topics in journal abstracts, documents from science and social domains, and Yahoo! question-and-answer data. In the proposed work, a model is designed to identify the topics of news articles collected from various sources. Three models, LDA, LLDA and PLDA, are employed for topic modeling and topic discovery. The LDA-based inference approaches CVB0 (collapsed variational Bayes) and GS (Gibbs sampling) are utilized and the models are generated.

II. BACKGROUND STUDY
This section provides a brief description of LDA, LLDA and PLDA and their use in extracting topics from text documents.

A. LDA
Latent Dirichlet Allocation (LDA) is a generative model which allows sets of observations to be described by unobserved groups that explain why some parts of the data are similar. In LDA, each document is perceived as a mixture of numerous themes. This is comparable to probabilistic Latent Semantic Analysis (pLSA), except that in LDA the topic distribution is assumed to have a Dirichlet prior. The basic idea is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. Given a corpus of documents, LDA attempts to discover the following:
• It identifies a set of topics
• It associates a set of words with a topic
• It defines a specific mixture of these topics for each document in the corpus.
The process of generating a corpus is as follows:
1. Randomly choose a distribution over topics.
2. For each word in the document:
   a. randomly choose a topic from the distribution over topics;
   b. randomly choose a word from the corresponding topic.

Fig. 1 Plate notation for LDA

Fig. 1 describes how the dependencies among the many variables can be captured concisely. The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. M denotes the number of documents and N the number of words in a document. Thus:
1. α is the parameter of the Dirichlet prior on the per-document topic distributions,
2. β is the parameter of the Dirichlet prior on the per-topic word distribution,
3. θ_i is the topic distribution for document i,
4. φ_k is the word distribution for topic k,
5. z_{ij} is the topic for the jth word in document i, and
6. w_{ij} is the specific word.
A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex and has the following probability density on this simplex:

p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\,\theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}    (1)

where the parameter α is a k-vector with components α_i > 0, and Γ(x) is the Gamma function. The Dirichlet is a convenient distribution on the simplex: it is in the exponential family, has finite-dimensional sufficient statistics, and is conjugate to the multinomial distribution. These properties facilitate the development of inference and parameter estimation algorithms for LDA. Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha)\prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)    (2)

where p(z_n \mid \theta) is simply θ_i for the unique i such that z_n^i = 1. Integrating over θ and summing over z gives the marginal distribution of a document.
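The generative process described above can be made concrete with a small, self-contained simulation. The following sketch is an illustrative assumption rather than code from the toolbox used later in the paper: for simplicity the Dirichlet priors are symmetric with concentration 1 (uniform over the simplex), which allows a draw to be obtained by normalizing exponential samples, and the topic, vocabulary and document sizes are arbitrary.

import scala.util.Random

object LdaGenerativeSketch extends App {
  val rng = new Random(42)
  val numTopics = 4      // K
  val vocabSize = 10     // |V|
  val docLength = 20     // N, words in one document

  // Draw a point on the (k-1)-simplex from Dirichlet(1, ..., 1)
  // by normalizing Exp(1) samples.
  def dirichletUniform(k: Int): Array[Double] = {
    val g = Array.fill(k)(-math.log(1.0 - rng.nextDouble()))
    val s = g.sum
    g.map(_ / s)
  }

  // Draw an index from a discrete probability distribution.
  def draw(p: Array[Double]): Int = {
    val u = rng.nextDouble()
    var acc = 0.0
    p.indices.find { i => acc += p(i); u < acc }.getOrElse(p.length - 1)
  }

  // Per-topic word distributions phi_k.
  val phi = Array.fill(numTopics)(dirichletUniform(vocabSize))

  // Step 1: choose the document's distribution over topics, theta.
  val theta = dirichletUniform(numTopics)

  // Step 2: for each word, (a) choose a topic z from theta,
  //         then (b) choose a word w from phi_z.
  val document = Seq.fill(docLength) {
    val z = draw(theta)
    val w = draw(phi(z))
    (z, w)
  }
  println(document.mkString(" "))
}

Running the sketch prints (topic, word) pairs for one simulated document; inference in LDA works in the opposite direction, recovering θ and φ from the observed words alone.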
B. Labeled LDA
Labeled LDA is a probabilistic graphical model that describes a process for generating a labeled document collection. Like Latent Dirichlet Allocation, Labeled LDA models each document as a mixture of underlying topics and generates each word from one topic. Unlike LDA, L-LDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document's (observed) label set.

Fig. 2 Plate notation for LLDA

1. α is the parameter of the Dirichlet prior on the per-document topic distributions,
2. θ_i is the topic distribution for document i,
3. φ_k is the word distribution for topic k,
4. z_{ij} is the topic for the jth word in document i,
5. w_{ij} is the specific word,
6. Λ_i is the observed label set of document i, and
7. the label prior Φ is d-separated from the rest of the document given Λ_i.
The effect of the labels is simply to constrain the set of possible topics to the observed labels. During collapsed Gibbs sampling, the topic assignment of each word is resampled according to

P(z_i = j \mid \mathbf{z}_{-i}) \propto \frac{n_{-i,j}^{w_i} + \eta_{w_i}}{n_{-i,j}^{(\cdot)} + \boldsymbol{\eta}^{\top}\mathbf{1}} \times \frac{n_{-i,j}^{(d)} + \alpha_j}{n_{-i,\cdot}^{(d)} + \boldsymbol{\alpha}^{\top}\mathbf{1}}

where n_{-i,j}^{w_i} is the count of word w_i in topic j that does not include the current assignment z_i, a missing subscript or superscript indicates a summation over that dimension, and 1 is a vector of 1's of appropriate dimension. The equation looks exactly like that of the LDA model, with the important distinction that the target topic j is restricted to belong to the document's set of labels, that is j ∈ λ^{(d)}.

C. Partially LDA
Partially Labeled Dirichlet Allocation (PLDA) is a generative model for a collection of labeled documents, extending the generative process of LDA to incorporate labels, and that of Labeled LDA to incorporate per-label latent topics. Formally, PLDA assumes the existence of a set of L labels (indexed by 1..L), each of which has been assigned some number of topics K_l (indexed by 1..K_l), where each topic is represented as a multinomial distribution over all terms in the vocabulary V drawn from a symmetric Dirichlet prior. One of these labels may optionally denote a shared global latent topic class, which can be interpreted as a label "latent" present on every document d. PLDA assumes that each topic takes part in exactly one label.

Fig. 3 Plate notation for PLDA

Each document's words w and labels Λ are observed, with the per-document label distribution ψ, the per-document-label topic distributions θ, and the per-topic word distributions φ as hidden variables. Because each document's label set is observed, its sparse binary vector prior is unused; it is included for completeness. Each document d is generated by first drawing a document-specific subset of the available label classes, represented as a sparse binary vector Λ_d drawn from the sparse binary vector prior. A document-specific mixture over topics 1..K_j is drawn from a symmetric Dirichlet prior for each label j present in the document. Then, a document-specific mixture of observed labels ψ_d is drawn as a multinomial of size |Λ_d| from a Dirichlet prior, with each element corresponding to the document's probability of using label j when selecting a latent topic for each word. For derivational simplicity, the element at position j of this prior is defined in proportion to K_j, so it is not a free parameter. Each word w in document d is drawn from some label's topic's word distribution, i.e. it is drawn by first picking a label j from ψ_d, a topic z from θ_{d,j}, and then a word w from φ_{j,z}. Ultimately, this word is picked in proportion to how much the enclosing document prefers the label j, how much that label prefers the topic z, and how much that topic prefers the word w [6].
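The per-word draw just described (a label, then a topic within that label, then a word) is what distinguishes PLDA from plain LDA. The sketch below extends the earlier LDA simulation to illustrate it; the label names, topic counts per label and priors are assumptions made only for illustration, not values from the paper.

import scala.util.Random

object PldaGenerativeSketch extends App {
  val rng = new Random(7)
  val vocabSize = 10

  // Hypothetical label set: each label owns its own topics (K_l),
  // plus a shared "latent" background label present on every document.
  val topicsPerLabel = Map("science" -> 2, "weather" -> 2, "latent" -> 1)

  def dirichletUniform(k: Int): Array[Double] = {
    val g = Array.fill(k)(-math.log(1.0 - rng.nextDouble()))
    val s = g.sum
    g.map(_ / s)
  }
  def draw(p: Array[Double]): Int = {
    val u = rng.nextDouble()
    var acc = 0.0
    p.indices.find { i => acc += p(i); u < acc }.getOrElse(p.length - 1)
  }

  // One word distribution per (label, topic) pair.
  val phi = topicsPerLabel.map { case (l, k) =>
    l -> Array.fill(k)(dirichletUniform(vocabSize))
  }

  // A document whose observed labels are "science" and "latent":
  // a label mixture psi_d and, per label, a topic mixture theta_{d,l}.
  val docLabels = Vector("science", "latent")
  val psi = dirichletUniform(docLabels.size)
  val theta = docLabels.map(l => l -> dirichletUniform(topicsPerLabel(l))).toMap

  // Per-word draw: label from psi, topic from theta_{d,label}, word from phi.
  val words = Seq.fill(15) {
    val label = docLabels(draw(psi))
    val z = draw(theta(label))
    val w = draw(phi(label)(z))
    (label, z, w)
  }
  println(words.mkString(" "))
}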
III. PROPOSED WORK
A topic modeling system for news articles is built using LDA, LLDA and PLDA with the Collapsed Variational Bayes and Gibbs Sampling approaches. The proposed framework includes different phases: data collection, pre-processing, training and topic prediction. In the data collection phase, news articles are collected and stored in a CSV file for processing. In the pre-processing stage the news articles are processed to remove stop words, case-fold words, filter words and numbers, and to select the model parameters. During training the words and their corresponding weights are extracted. The final phase predicts the topic of a new article based on probability. The architecture of the proposed model is shown in Fig. 4.

Fig. 4 Architecture of the proposed topic model (corpus of news articles, pre-processing, training of the LDA, LLDA and PLDA models, and the resulting predictive models which output the predicted topic, label and sub-topic for a new article)

A. Data Collection
Data are collected manually from online news websites. 400 instances related to science, sports, business and weather are collected from four different news websites, with 100 news articles in each category. Science news articles are collected from the Science Daily website1; gene and disease related news articles are taken from this site. Sports news is collected from the BBC Sport website2, covering the cricket, football and tennis categories. Business news is collected from The Times of India website3, covering the stock market and finance categories. Weather news is collected from the AccuWeather news website4.
1 http://www.sciencedaily.com/
2 http://www.bbc.com/sport/0/
3 http://timesofindia.indiatimes.com/
4 http://www.accuweather.com/en/weather-news

B. Pre-Processing
The ScalaNLP API is used to pre-process the news documents. The Scala program takes the text as a CSV file, with one news article placed in each cell of a column. The CSV file is given as input and the pre-processing steps are carried out before the models are learned using LDA, LLDA and PLDA. The pre-processing tasks, namely tokenization, case folding, stop word removal and parameter selection, are discussed below.
1. Tokenization
The first step in pre-processing is tokenization. Tokenization is the process of breaking a stream of text up into words, phrases, symbols or other meaningful elements called tokens. The list of tokens becomes the input for further processing in text mining. This is accomplished with the following functions. SimpleEnglishTokenizer() removes punctuation from the end of words and then splits the input text on whitespace characters such as tabs, spaces and carriage returns. CaseFolder() lower-cases each word so that "THE", "The" and "the" all look alike; case folding reduces the number of distinct words seen by the model by turning all characters to lowercase. WordsAndNumbersOnlyFilter() removes tokens that are entirely punctuation or other non-word, non-number characters from the generated lists of tokenized documents. A sketch of this tokenization stage is given below.
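The following sketch shows how these functions compose in the toolbox's stage pipeline. It follows the style of the example scripts distributed with the Stanford Topic Modeling Toolbox (version 0.4); the CSV file name and the column holding the article text are assumptions for this corpus rather than details given in the paper.

import scalanlp.io._
import scalanlp.stage._
import scalanlp.stage.text._
import scalanlp.text.tokenize._
import scalanlp.pipes.Pipes.global._

// Hypothetical input: one news article per row, with an id in column 1.
val source = CSVFile("news-articles.csv") ~> IDColumn(1)

val tokenizer = {
  SimpleEnglishTokenizer() ~>     // trim punctuation, split on whitespace
  CaseFolder() ~>                 // "THE", "The", "the" -> "the"
  WordsAndNumbersOnlyFilter()     // drop tokens that are not words or numbers
}

The term and document filters introduced in the next subsection are chained onto this tokenizer before the dataset is built.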
2. Finding Meaningful Words
The use of very common words like "the" does not indicate the kind of similarity between documents in which one is interested, and single letters or other very short sequences are also rarely useful for understanding content. There is therefore a need to remove terms that appear very few times in the documents, because very rare words tell little about the similarity of documents, as well as the most common words in the corpus, because ubiquitous words likewise tell little about the similarity of documents. The following functions are used to achieve this task. MinimumLengthFilter() removes terms that are shorter than a minimum number of characters. TermStopListFilter() adds a list of stop words to be removed from the corpus; it can be invoked as TermStopListFilter(List("term1", "term2", ..., "termN")).
Removing Empty Documents: Some documents in the dataset may become empty because their words are filtered out during tokenization. These documents can be disregarded by applying DocumentMinimumLengthFilter(length), which removes documents shorter than the specified length.
3. Parameter Selection for the Three Models
The number of topics is specified through parameter selection, and the Dirichlet term and topic smoothing parameters used by the LDA model are also provided as arguments; CVB0 and Gibbs sampling are both run under the same fixed hyper-parameters. The constructor used for LDA is given below.
val params = LDAModelParams(numTopics = 4, dataset = dataset, topicSmoothing = 0.01, termSmoothing = 0.01);
Labeled LDA uses the LabeledLDAModelParams constructor; the numTopics parameter need not be specified because the topics are already given by the labels. The constructor used for the LLDA model is given below.
val modelParams = LabeledLDAModelParams(dataset);
The PLDA model is used similarly to LDA, with additional parameters specifying the number of background topics and the number of sub-topics for each label, numBackgroundTopics and numTopicsPerLabel respectively. PLDA is used to identify hidden topics within a particular news domain. The constructor used for PLDA is given below.
val numTopicsPerLabel = SharedKTopicsPerLabel(2);
val modelParams = PLDAModelParams(dataset, numBackgroundTopics, numTopicsPerLabel, termSmoothing = 0.01, topicSmoothing = 0.01);
A sketch showing how these constructors fit into a complete training run is given below.
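The end-to-end training step can be assembled from the pipeline and constructors above. The sketch below follows the toolbox's example training scripts; the term and document filter thresholds, the stop-word list and the output folder naming are assumptions for illustration, and source and tokenizer refer to the stages defined in the earlier sketch.

import edu.stanford.nlp.tmt.stage._
import edu.stanford.nlp.tmt.model.lda._

// Build the filtered term stream from the article text (column 2 assumed).
val text = {
  source ~>
  Column(2) ~>
  TokenizeWith(tokenizer) ~>
  TermCounter() ~>                                  // term counts for the filters below
  TermMinimumDocumentCountFilter(4) ~>              // drop very rare terms
  TermStopListFilter(List("said", "according")) ~>  // hypothetical stop words
  DocumentMinimumLengthFilter(5)                    // drop near-empty documents
}

val dataset = LDADataset(text)

val params = LDAModelParams(numTopics = 4, dataset = dataset,
  topicSmoothing = 0.01, termSmoothing = 0.01)

// Output folder named from the dataset and parameter signatures.
val modelPath = file("lda-" + dataset.signature + "-" + params.signature)

// Collapsed variational Bayes (CVB0) training for 1000 iterations ...
TrainCVB0LDA(params, dataset, output = modelPath, maxIterations = 1000)
// ... or Gibbs sampling under the same hyper-parameters:
// TrainGibbsLDA(params, dataset, output = modelPath, maxIterations = 1000)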
IV. EXPERIMENTS AND RESULTS
Three experiments have been carried out using LDA, L-LDA and PLDA to identify the topics of various articles. News articles related to domains such as science, sports, business and weather are collected from multiple sources and a text corpus is created. The pre-processing tasks described earlier are carried out to facilitate training. The experiments are performed with the Stanford Topic Modeling Toolbox using the Scala programming language. The Stanford Topic Modeling Toolbox is a free collection of topic modeling tools and is part of the Stanford natural language processing toolset. The experiments for LDA, LLDA and PLDA are discussed in this section.

A. Latent Dirichlet Allocation (LDA)
The learning model is generated using CVB0 and GS, the two LDA inference approaches. The term and topic smoothing parameters are set to 0.01 and the model is built for the selected parameter values. The number of topics chosen for this experiment is set as numTopics = 4, indicating that four topics are learned. The model is trained for 1000 iterations and, after every 50th iteration, a summary of the topics with weight values is computed and displayed. The resulting terms obtained with the CVB0 approach are given in Table I for topics 0 to 3. The related terms in the news articles are clustered: topic 0 contains weather related terms, topic 1 contains science related terms, topic 2 contains business related terms, and topic 3 contains sports related terms. Gibbs sampling behaves similarly to the CVB0 approach but the topic clusters are different; with Gibbs sampling, the topics are clustered as weather, sports, science and business.

TABLE I. TERMS FOR LDA MODEL
Topic 0 (weather): Weather, Weekend, Rain, Patients, Common, Percent, Climate, Temperature, Severe, Winter, Season, Dry, Heavy, Thunderstorms
Topic 1 (science): Researcher, Cancer, Scientists, Disease, Research, Gene, People, Cells, Identified, Dna, Genetic, Genes, Therapy, Brain
Topic 2 (business): Growth, Market, Stock, Bank, Economic, Business, Exchange, Companies, Rate, Sensex, Interest, Markets, Demand, Global
Topic 3 (sports): World, Cup, Win, Goals, Scored, Victory, Football, League, Team, Cricket, Australia, Match, Cricket, Player

During testing, a news article of category 'weather' is chosen from one of the four topics considered during training with LDA. Testing is performed using both the CVB0 and GS models. The probability values of the given news article for the four topics inferred while testing with CVB0 are given below.
News id = 1
Topic 0 (weather) = 0.756
Topic 1 (science) = 0.00094
Topic 2 (business) = 0.214
Topic 3 (sports) = 0.03
The above results show that the probability of the news article belonging to weather (topic 0) is the highest, which indicates that the topic predicted by the model is correct. Similarly, the LDA topic model with Gibbs sampling is tested by passing a news article of category weather. The test result, in terms of the probabilities that the article belongs to each topic, is given below.
News id = 1
Topic 0 (weather) = 0.85
Topic 1 (sports) = 0.088
Topic 2 (business) = 0
Topic 3 (science) = 0.063
Comparison of CVB0 and GS in LDA: The comparison of prediction results in LDA with CVB0 and GS is summarized in Table II and illustrated in Fig. 5.

TABLE II. PREDICTIVE PERFORMANCE OF LDA
Approaches | Topic 1 | Topic 2 | Topic 3 | Topic 4
CVB0       | 0.75    | 0.0009  | 0.214   | 0.03
GS         | 0.85    | 0.088   | 0       | 0.063

Fig. 5 Comparative results of LDA (topic probabilities for CVB0 and GS)

From the above two inferences it has been observed that the Gibbs sampling based LDA model produces a higher probability for the correct topic than CVB0. A sketch of this topic prediction step using the toolbox is given below.
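Topic prediction for a new article corresponds to the toolbox's inference step: the trained model is loaded and per-document topic distributions are computed for the unseen text. The sketch below follows the style of the toolbox's example inference script; the model folder name, test file name and column number are placeholders assumed for illustration.

import edu.stanford.nlp.tmt.stage._
import edu.stanford.nlp.tmt.model.lda._

// Load the trained CVB0 model (folder name is a placeholder).
val model = LoadCVB0LDA(file("lda-trained-model"))
// For a Gibbs-sampled model: val model = LoadGibbsLDA(file("lda-trained-model"))

// Tokenize the test article(s) with the tokenizer stored inside the model.
val testSource = CSVFile("test-articles.csv") ~> IDColumn(1)
val testText = {
  testSource ~>
  Column(2) ~>
  TokenizeWith(model.tokenizer.get)
}

// Reuse the model's term index so words map to the same vocabulary.
val testDataset = LDADataset(testText, termIndex = model.termIndex)

// Per-document topic distributions, e.g. topic 0 (weather) = 0.756, ...
val perDocTopics = InferCVB0DocumentTopicDistributions(model, testDataset)
CSVFile("document-topic-distributions.csv").write(perDocTopics)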
B. Labeled Latent Dirichlet Allocation (LLDA)
In this experiment labels are assigned to the news articles and the training dataset is created in CSV file format. The Labeled LDA (LLDA) model is intended to produce the label, and this is achieved through the terms extracted for each topic. Words are extracted from the text during LLDA training in the same way as during LDA training. LLDA incorporates supervision by simply constraining the topic model to use only those topics that correspond to a document's label set. During testing, the trained LLDA model is tested for its prediction of the labels appropriate for a new document. Training the LLDA model with the CVB0 approach produces the terms along with the corresponding labels, namely science (0), sports (1), business (2) and weather (3). These terms are used to predict the label of a new document based on probability. The terms extracted during LLDA training are given in Table III.

TABLE III. TERMS FOR LLDA MODEL
Science (0): Cancer, Researchers, Scientists, Dna, Cells, Treatment, Time, Patients, Vaccine, Bacteria, Liver, Brain
Sports (1): World, Cup, Victory, Win, Sport, Scored, Team, Life, Teams, Life, Time, Games
Business (2): Markets, Economic, Rates, Companies, Global, Top, Demand, Time, Bank, Lead, Time, Sensex
Weather (3): Weather, Climate, Severe, Change, Rain, Risk, Global, Areas, Parts, Increase, Days, Wave

During testing, the LLDA model determines the probability of the article belonging to each topic based on the weights of the terms in the article and predicts the label for the given article. As in the LDA experiment, the model is tested with a weather news article. It is inferred that the probability for the label weather is much higher than for science, sports and business. The predictive results for LLDA with CVB0 are given below.
News id = 1
Science (0) = 0.0007
Sports (1) = 0.003
Business (2) = 0.0673
Weather (3) = 0.789
The LLDA model training with Gibbs sampling and the corresponding testing are carried out in the same way as for LLDA with CVB0. The test result of the LLDA model with GS is given below.
News id = 1
Science (0) = 0.0086
Sports (1) = 0.003
Business (2) = 0.0012
Weather (3) = 0.889
Comparison of CVB0 and GS in LLDA: The comparison of prediction results in LLDA with CVB0 and GS is summarized in Table IV and illustrated in Fig. 6.

TABLE IV. PREDICTIVE PERFORMANCE OF LLDA
Approach | Science | Sports  | Business | Weather
CVB0     | 0.0007  | 0.05663 | 0.067    | 0.78
GS       | 0.0086  | 0.038   | 0.0286   | 0.89

Fig. 6 Comparative results of LLDA (label probabilities for CVB0 and GS)

From the above two inferences it has been observed that the Gibbs sampling based LLDA model produces a higher probability for the correct label of the given news article than CVB0. A sketch of LLDA training with the toolbox is given below.
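LLDA training with the toolbox differs from the LDA script mainly in that a label column accompanies the text. The sketch below follows the style of the toolbox's Labeled LDA example; the label column number is an assumption for this corpus, and source and text refer to the stages defined in the earlier sketches.

import edu.stanford.nlp.tmt.stage._
import edu.stanford.nlp.tmt.model.llda._

// Labels are read from a separate column (assumed column 3) holding one of
// science, sports, business or weather for each article.
val labels = {
  source ~>
  Column(3) ~>
  TokenizeWith(WhitespaceTokenizer()) ~>   // the label field becomes a token array
  TermCounter() ~>
  TermMinimumDocumentCountFilter(10)       // ignore labels appearing in <10 docs
}

val lldaDataset = LabeledLDADataset(text, labels)

// No numTopics parameter: the topics are the observed labels.
val lldaParams = LabeledLDAModelParams(lldaDataset)
val lldaPath = file("llda-cvb0-" + lldaDataset.signature + "-" + lldaParams.signature)

TrainCVB0LabeledLDA(lldaParams, lldaDataset, output = lldaPath, maxIterations = 1000)
// or: TrainGibbsLabeledLDA(lldaParams, lldaDataset, output = lldaPath, maxIterations = 1000)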
News id = 1 News id = 2 Science (0) = 0.879 Science (0) = 0.0034 Science (1) = 0.0056 Science (1) = 0.88 The PLDA training with GS is done similar to PLDA training with CVB0 and the probability values of the given news article for the science category inferred while testing using GS are obtained. News id = 1 News id = 2 Science(0) = 0.99763 Science (0) = 0.002 Science (1) = 0.00005 Science (1) = 0.99 Comparison of CVB0 and GS in PLDA: The comparison of prediction results in PLDA with CVB0 and GSfor category „science‟are summarized in Table VI and illustrated in Fig. 4.3. http://www.ijettjournal.org Page 238 International Journal of Engineering Trends and Technology (IJETT) – Volume 31 Number 5- January 2016 TABLE VI PREDICTIVE PERFORMANCE OF PLDA [2] Approach CVBO GS [3] Science 0 0.879 0.99 Science 1 0.002 0.0453 1 0.8 0.6 0.4 0.2 0 [4] [5] CVB0 [6] GS [7] 1 2 [8] Fig. 7 Comparative Results of PLDA [9] From the abovetwo inferences it has been observed that Gibbs sampling based PLDA model produces higher probability of subtopic for the given news article than CVB0. Findings and Discussion [10] [11] In this work, the topics for news articles have been predicted using LDA and its variants LLDA, PLDA. As far as the topic modeling is concerned, LDA is intended to predict the topic of the article whereas LLDA is intended to determine the labels of given topic. PLDA is used to find hidden sub topics from collection of all categories. LDA and LLDA help to find topics and labels of the documents in a large collection of documents and can generate the key words for document collection. PLDA enables to extract keywords from particular domain of subtopics. From the comparative analysis of three models, it is inferred that Gibbs sampling approach integrated with LDA, LLDA and PLDA outperforms Collaborative variational approach. McCallum (1999), A. “Multi-label text classification with a mixture model trained by EM”. In AAAI workshop on Text Learning. Ralf Krestel, Peter Fankhauser, and Wolfgang Nejdl (2009), “Latent Dirichlet Allocation for Tag Recommendation”, ACM. Denial Ramage, David Hall, Ramesh Nallapati and Christopher D. Manning, (2009) “Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora”, proceeding of the 2009 Conference on Empirical Methods in Natural Language Processing. GirishMaskeri, SantonuSarkar, Kenneth Heafield (2008), “Mining Business Topics in Source Code using Latent Dirichlet Allocation”, ACM Daniel Ramage, Christopher D. Manning, and Susan Dumais (2011), “Partially Labeled Topic Models for Interpretable Text Mining”,San Diego, California, USA. D. Ramage and E. Rosen, “Stanford Topic modeling Toolbox”, Dec 2011. [Online]. Available:http://nlp.stanford.edu/software/tmt/tmt-0.4 Limin Yao, David Mimno, and Andrew McCallum (2009), “Efficient Methods for Topic Model Inference on Streaming Document Collections” June 28–July 1, Paris, France, ACM. Y.W. The, D. Newman, and M. Welling (2007), “A collapsed variational Bayesian inference algorithm for latent dirichlet allocation”, NIPS 19, pages 13531360. OunasAsfari, Lilia Hannachi, FadilaBentayeb and Omar Boussaid (2013), “Ontological Topic Modeling to Extract Twitter users' Topics of Interest”, The 8th International Conference on Information Technology and Applications. Blei, D. M. and Lafferty, J. D., (2006), “Correlated topic models”, Advances in Neural Information Processing Systems 18. MIT Press, Cambridge, MA. V. 
V. CONCLUSIONS
This paper demonstrates a model for identifying topics from a collection of news articles. LDA, LLDA and PLDA have been applied to generate terms and weight values. The performance of each model has been evaluated using two approaches, namely CVB0 and GS. From the results it has been found that Gibbs sampling outperforms the collapsed variational approach. As future work, the system can be implemented in a distributed environment using various topic modeling methods.

REFERENCES
[1] X. Liu and W. B. Croft, "Cluster-based retrieval using language models," in Proceedings of SIGIR '04, 2004, pp. 186-193.
[2] A. McCallum, "Multi-label text classification with a mixture model trained by EM," in AAAI Workshop on Text Learning, 1999.
[3] R. Krestel, P. Fankhauser, and W. Nejdl, "Latent Dirichlet Allocation for tag recommendation," ACM, 2009.
[4] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.
[5] G. Maskeri, S. Sarkar, and K. Heafield, "Mining business topics in source code using Latent Dirichlet Allocation," ACM, 2008.
[6] D. Ramage, C. D. Manning, and S. Dumais, "Partially labeled topic models for interpretable text mining," San Diego, California, USA, 2011.
[7] D. Ramage and E. Rosen, "Stanford Topic Modeling Toolbox," Dec. 2011. [Online]. Available: http://nlp.stanford.edu/software/tmt/tmt-0.4
[8] L. Yao, D. Mimno, and A. McCallum, "Efficient methods for topic model inference on streaming document collections," June 28-July 1, Paris, France, ACM, 2009.
[9] Y. W. Teh, D. Newman, and M. Welling, "A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation," in NIPS 19, 2007, pp. 1353-1360.
[10] O. Asfari, L. Hannachi, F. Bentayeb, and O. Boussaid, "Ontological topic modeling to extract Twitter users' topics of interest," in The 8th International Conference on Information Technology and Applications, 2013.
[11] D. M. Blei and J. D. Lafferty, "Correlated topic models," in Advances in Neural Information Processing Systems 18, MIT Press, Cambridge, MA, 2006.