Text mining and its association analysis on topic modeling

A text mining and association analysis: Exploring text data for creating topic models

Kyuson Lim
Department of Mathematics & Statistics, McMaster University
E-mail: limk15@mcmaster.ca
April 26, 2022
STATS 771

Contents

1 Abstract
2 Motivation
  2.1 Background
    2.1.1 About text data
  2.2 Outline and goal
3 Text mining and data visualization using wordcloud
  3.1 Literature review on concept of wordcloud and support analysis
    3.1.1 Wordcloud
    3.1.2 Association rule: support analysis
  3.2 Complex data visualization: Wordcloud and co-occurrences
4 Machine learning in data visualization 1: Hierarchical Clustering analysis
  4.1 Literature review on concept of hierarchical clustering analysis
  4.2 Hierarchical clustering and correlation analysis
5 Machine learning in data visualization 2: K-medoid clustering
  5.1 Literature review on concept of K-medoid clustering analysis
  5.2 k-medoids and determination for number of clusters
6 Time series data analysis: local smoothing regression
  6.1 Literature review on concept of local smoothing regression
  6.2 Result of local smoothing regression
7 Gaussian Graphical Models: application in unstructured text data
  7.1 Literature review on the Gaussian Graphical Model
  7.2 Basic undirected network analysis: Undirected Gaussian Graphical Model
    7.2.1 Partial correlation: Gaussian Graphical Models
    7.2.2 Interpretation of the Gaussian graphical model
8 Network Graph in topic clustering: BTM (Biterm Topic Modelling)
  8.1 Literature review on the biterm topic model (BTM)
  8.2 Topic clustering: result of BTM
9 Conclusion and discussion
  9.1 Discussion
  9.2 Conclusion

Chapter 1 Abstract

With about 182 official news headlines collected between January 2020 and December 2020 on COVID-19 pandemic issues in Canada, the topics and keywords of the contents are statistically analyzed with graphical models and topic models, and the output is presented as a collection of keywords and their connections. The co-occurrences of words within sentences can be investigated using Gaussian graphical models to learn how keywords are connected. Associations between words are captured by the Gaussian graphical model and portrayed visually. Topic modelling in text mining is an efficient tool to explore and summarize massive collections of words. For clustering words by their appearance, the data can be shown visually with a wordcloud and analyzed by hierarchical clustering analysis. Similarity between words and the grouping effect arising from sentence structure are addressed by correlation analysis and k-means clustering. Topic clustering has been a great success in both industrial and academic research when applied to formal texts. As an extension of machine learning algorithms, the Biterm Topic Model (BTM) is used to analyze the co-occurrences of words over a massive set of documents and to produce simple network graphs.
To address the statistical inference problem in text mining data, the BTM is presented to visually interpret its superior result in topic clustering analysis.

Chapter 2 Motivation

2.1 Background

In 2020, the coronavirus disease (COVID-19) pandemic began, caused by the coronavirus known as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (WHO, 2019). COVID-19 stands for coronavirus disease, and the virus is also referred to as the 2019 novel coronavirus or "2019-nCoV" (Bender, 2020). The fast transmission of the COVID-19 virus hindered traditional face-to-face communication between people and changed the educational delivery system of courses (Liu et al., 2020). This new virus can be transmitted within minutes through droplets, or even by touching metal surfaces or other materials contaminated by an infected person with respiratory symptoms. Even though the elderly and very young children are most easily affected, nobody is immune to this new infectious disease, so all people are susceptible to its devastating effects (Bender, 2020; Meng et al., 2020). The COVID-19 pandemic led to lockdowns and social distancing policies in more than 200 countries, which inevitably plunged the global economy into its worst recession since World War II (World Bank, 2020; Nayak et al., 2021). This recession has created widely different experiences across regions, demographic groups, and industries (Congressional Research Service, 2021; Gilfillan, 2020). Mass hysteria was bound to ensue in the early days of the pandemic due to uncertainty surrounding the authenticity of information sources.
In times like these, when the general public was advised to stay home except for essential purposes and in-person social gatherings were restricted, many people relied on the latest news offered on the public internet and on newspaper headlines. In the era of big data, how much data we can analyze is a question of interest for many people. Data mining techniques address this question by crawling text from certified archives, which makes big data comprehensible almost instantaneously. Not only does data mining enable effective modelling of unique qualitative data through text mining (using NLP: natural language processing), but it also addresses quantitative questions by creating new and rapid integrative models.

2.1.1 About text data

Statistics Canada is an official agency of the Government of Canada, commissioned with producing statistics to help better understand Canada and its sociological and political issues. Its publicly available data can be accessed and used by anyone in the world. The website works as a platform, allowing users to read and download official government notices and news. The data used in this article consist of news articles on various subjects which explore the impact of COVID-19 on sociological and economic issues in 2020. Although the news headlines are updated periodically, the analysis is restricted to all 2020 headlines on the website (https://www150.statcan.gc.ca/n1/pub/45-28-0001/452800012020001-eng.htm) so that the analysis is stable and its interpretation unique. The data, posted by Statistics Canada from January to December 2020 as part of the third iteration of the Canadian Perspectives Survey Series (CPSS), are intended to explore the re-opening of economic and social activity during quarantine times.
The news articles are intended to provide families and the general public with an update on living conditions and lifestyles during the confinement of 2020, the year in which the COVID-19 pandemic was most prominent. The data mostly cover living and lifestyle issues of people aged over 20 in Canada at the time of the survey, and so represent Canadian issues during the COVID-19 pandemic in 2020. By web crawling articles on Canadian lifestyle and living conditions, we hope to learn what interests and social ideas people had, how they navigated their lives, and which issues were problematic in particular periods. Because Statistics Canada is an official government agency, analyzing its text data helps us understand the issues we face in 2022 by learning from the past. Because news headlines give a representative and efficient summary of the overall contents and issues of the articles, the statistical analysis considers only the headlines, in order to identify public issues, interests and relationships between words, and to show the reader visually how they are connected and what conclusions can be drawn.

2.2 Outline and goal

The goal is to explore the data with various statistical models, including machine learning clustering algorithms, topic clustering, and graphical models, to identify the associations between keywords that best suit the text mining data of COVID-19 news headlines and descriptions published on the official website of Statistics Canada. First, the wordcloud identifies the most frequently used words in the news headlines and defines keywords based on the ranking. The purpose is to portray the words that appear in the headlines overall, which is an effective form of data visualization.
Second, hierarchical clustering with a correlation plot is used to identify statistical combinations of keywords based on dissimilarity and similarity between words. By grouping keywords through hierarchical clustering, we can construct groups of keywords in the same cluster and look for latent variables. We can also inspect the grouped keywords to construct topics that best represent the clusters, which relates to the topic clustering analysis. Then, k-medoid clustering, a variant of k-means clustering, is applied to identify quantitative groups of clusters among all words in the dataset. The model is expected to recognize significant relationships between keyword variables and to build topics of relevant words that account for as much variability as possible. Third, the undirected Gaussian graphical model (GGM) describes statistical relationships through parametric analysis in the framework of multivariate statistics, providing a concise and clear structure between the words of text sentences. Combined with the time series of headline publication dates, analyzed by local polynomial regression fitting, it is expected to show why the relationship between keywords cannot be directed as a causal relationship. Lastly, the Biterm topic model (BTM), a leading model in short text analysis, is applied to identify topics and to compare various models. Traditional models for simple statistical analysis are not suitable for short texts, because few keywords are available and co-occurrences are extremely sparse, whereas BTM models global term co-occurrences rather than co-occurrences at the document level.
Due to the nature of data exploration and computational statistics, this paper deals only with exploratory data analysis (EDA), not confirmatory data analysis (CDA), in order to efficiently produce the output and interpret the results.

Chapter 3 Text mining and data visualization using wordcloud

3.1 Literature review on concept of wordcloud and support analysis

3.1.1 Wordcloud

A wordcloud, also known as a tag cloud, is a visual representation of text data, often used to depict keywords on websites or to visualize free-form text (Halvey & Keane, 2007). Following the early example of Douglas Coupland (1992), more weight is given to more frequently used terms, as a non-parametric way to portray the data visually in a graph. Popularized mainly by Web 2.0 websites and blogs, various modifications of the weights that set the size of the text floating in the graph have been applied (Guattari & Deleuze, 1992). The idea of using the wordcloud in this paper is to identify keywords by the frequency with which they appear in the data, and to portray the words effectively so that important words can be defined as keywords. From the author's previous experience working as a researcher with professionals, attractive colors and visually influential factors are emphasized by those in authority to convey practical understanding and outcomes to the public.

3.1.2 Association rule: support analysis

Introduced by Rakesh Agrawal, Tomasz Imielinski and Arun Swami (1994), association rule learning is a rule-based machine learning method for discovering interesting relations between variables or target responses in a large dataset. In the context of association rule learning in data mining, lift is a basic frequency-based measure of co-occurrence for the target data.
Commonly known as market basket analysis, the lift is a basic tool applied to text mining data in many machine learning papers to find relationships between keywords from the frequencies of their joint appearances (Wong & Thomas, 1999). However, the association rule algorithm is mostly applied within other parametric contexts and less often as a technique in its own right (Garcia, et al., 2007). As the simplest method is often the best in machine learning practice, the support gives a direct representation of the interconnections in the data for inference on classification (Blumer et al., 1987). By the support analysis, we hope to find the magnitude of the relationship between keywords as concisely as possible, and to give a clear notion of the relationship between words so that they can be connected into a phrase. This provides a comparison for the later methods of clustering and BTM in identifying relationships.

3.2 Complex data visualization: Wordcloud and co-occurrences

Before the visualization of the wordcloud, the top 6 most frequently used words are shown in Table 1. As the topic of interest, the word "covid" is used 173 times, followed by "pandemic" (100 times), "article", "Canada", "Canadian" and "health". From Table 1, we can confirm that the topics of the headlines are restricted to Canada's COVID-19 articles and that the topics of interest are restricted to Canadian issues. The support analysis of the top 6 ranked keywords under the association rule is portrayed graphically in Figure 1. A general wordcloud (Figure 1) was constructed for an overview of the terms contained in the data. Because a wordcloud tends to be messy, with floating parts in arbitrary directions and ambiguous magnitudes (Figure 1), only the top 12 most frequently used words are portrayed in the wordcloud, sized by their frequencies.
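The support and lift measures from Section 3.1.2 can be illustrated with a minimal Python sketch (the paper itself works in R, so this is a stand-in, not the author's code). The four toy headlines, the whitespace tokenizer, and the function names `support_tables` and `lift` are all assumptions made for illustration; support is taken here as the fraction of headlines containing a word or a pair of words.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy headlines; the paper's 182 headlines are not reproduced here.
headlines = [
    "covid pandemic impact on canadian health",
    "covid pandemic and business in canada",
    "health survey of canadians during covid",
    "impact of pandemic on business",
]

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def support_tables(docs):
    """Return (word support, pair support) as fractions of documents."""
    n = len(docs)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in docs:
        words = sorted(set(tokenize(doc)))  # count each word once per headline
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))  # alphabetically ordered pairs
    word_support = {w: c / n for w, c in word_counts.items()}
    pair_support = {p: c / n for p, c in pair_counts.items()}
    return word_support, pair_support

def lift(pair, word_support, pair_support):
    """Lift = joint support divided by the product of the single supports."""
    a, b = pair
    return pair_support[pair] / (word_support[a] * word_support[b])

word_support, pair_support = support_tables(headlines)
covid_pandemic_lift = lift(("covid", "pandemic"), word_support, pair_support)
print(word_support["covid"], pair_support[("covid", "pandemic")], covid_pandemic_lift)
```

A lift above 1 would suggest that two words co-occur more often than expected if they appeared independently; with only four toy headlines the numbers carry no substantive meaning.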
Rank  Word      Frequency
1     covid     173
2     pandemic  100
3     article   54
4     Canada    54
5     Canadian  48
6     health    48

Table 1. Top 6 ranked most frequently used words in 2020 Statistics Canada articles.

From Figure 1, we can identify some distinctive words such as "data", "differences", "survey", "statistics" and "study", which suggest that the majority of people rely on statistical and quantitative analysis of people's opinions to learn about sociological and economic issues, rather than on qualitative and educational topics. However, most words concern economic and sociological topics, including "price", "mental", "concerns", "home", and "workers". This indicates that the majority of people are interested in daily living during the pandemic and in how people live to get through the COVID-19 period, when many are isolated in their homes. Hence, the words extracted by frequency represent well the issues and understanding we had during COVID-19 in 2020.

Figure 1. (a) The bar graph shows the support analysis of the top 6 most frequently used words, with the top 3 ranked words noted in the title. (b) The wordcloud shows the frequency-based representation of the top 12 most often used words in the data.

Note that "ggwordcloud" was used to draw the output, which was preprocessed by "tokenizer" in "quanteda", to produce a hybrid graph with the "ggplot" package (Le & Slowikowski, 2019; Benoit et al., 2018). The idea is the author's own, and no identical result exists elsewhere, which supports the copyright claim.

Chapter 4 Machine learning in data visualization 1: Hierarchical Clustering analysis

4.1 Literature review on concept of hierarchical clustering analysis

Data visualization is an interdisciplinary area that deals with the graphic representation of data and is applied to machine learning output. It is a particularly efficient way of communicating when the dataset is huge and abstract, as in many machine learning applications, because its main focus is to reinforce human cognition (Knaflic, 2015). From an academic point of view, data visualization is a mapping between the output of the data analysis and graphical elements. Therefore, the output of a clustering analysis, with its roots in the field of statistics, is presented graphically to deliver the conclusion of the statistical analysis efficiently (Gershon & Page, 2001). Hierarchical clustering is a method of classification analysis which seeks to construct a hierarchy of clusters. The method classifies the words into groups based on the dissimilarity of the words. Here, dissimilarity is measured by the Euclidean distance ($L_2$ norm). The number of times each word is used gives its coordinates in Euclidean space (Charu et al., 2012). The distance between any two words is then calculated as a measure of dissimilarity: the larger the distance, the more dissimilar the two words. With the distance matrix, we can then cluster words (Becue-Bertaut, 2019). The method used in the analysis is complete-linkage clustering, also called farthest-neighbor clustering. At each step, the two clusters separated by the shortest distance are combined. For two clusters, the distance is the maximum distance among any pair of elements from the two clusters (Becue-Bertaut, 2019).
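The complete-linkage procedure described above can be sketched in plain Python (a stand-in for the R workflow the paper uses). The count vectors below are invented for illustration, and the helper names `euclidean`, `complete_linkage` and `agglomerate` are hypothetical; the sketch merges the closest pair of clusters until the requested number remains, with cluster distance defined as the largest pairwise Euclidean distance.

```python
import math

# Hypothetical word-count vectors: each word is a point whose coordinates
# are its counts in three made-up groups of headlines.
counts = {
    "covid":    [30, 28, 25],
    "pandemic": [20, 18, 15],
    "health":   [5, 6, 4],
    "survey":   [4, 5, 5],
    "data":     [3, 4, 7],
}

def euclidean(a, b):
    """Euclidean (L2) distance between two count vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def complete_linkage(c1, c2, points):
    """Cluster distance = maximum distance over all cross-cluster pairs."""
    return max(euclidean(points[a], points[b]) for a in c1 for b in c2)

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[name] for name in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = complete_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

result = agglomerate(counts, 2)
print(result)
```

Stopping at a fixed number of clusters plays the role of the manual cut-off on the dendrogram; a full implementation would also record the merge heights needed to draw the tree.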
Note that "ggcorrplot" was used to visualize the correlation analysis and "ggdendro" to visualize the clustering output based on the frequencies with which words appear in the data (Galili, 2015; Wickham & Wickham, 2007).

4.2 Hierarchical clustering and correlation analysis

As the purpose is to compare all pairs of points in 2 dimensions, the Euclidean measure with complete linkage is applied in the hierarchical clustering model to identify two clusters of words in Figure 2. Such a plot of hierarchical structure is called a "dendrogram", which has a tree structure (James et al., 2013). The hierarchical classification model has 2 clusters: one group with 18 words and the other with 3 words. Based on the structure of the dendrogram, the cut-off is manually set at 2 clusters because of the inter-connected relationship among the 18 words in the tree structure. Using 3 or more clusters yields inappropriate groupings from the dendrogram. Second, the correlation analysis in Figure 2 shows that the top 6 keywords are mostly positively correlated with each other, with fewer negatively correlated pairs, which is consistent with the interpretation. From the analysis, "covid" and "pandemic" have a strong correlation (0.35) and form a cluster that can be interpreted as a combination. There is also a moderate correlation between "health" and "covid" (0.13) and between "pandemic" and "health" (0.12), forming a group of words that shows the importance of health in the COVID-19 pandemic for Canadians. On the other hand, the word "impact" has a negative correlation with the word "health" (reported value 0.27), indicating that their co-occurrence is not due to health issues for "Canadians". Similarly, the hierarchical clustering shows that "Canadian", "Canada" and "business" are strongly connected with "impact" in the same hierarchy, indicating that the "impact" at issue is on "business".

Figure 2. (a) A circular hierarchical clustering plot of the top 21 most frequently used words in news headlines. (b) The correlogram of correlations between the top 6 most frequently used words.

Consistent with the wordcloud analysis in Figure 1, the words "examines", "study", "using", and "survey" are grouped together in the same hierarchy, forming a cluster which suggests that the majority of people are interested in quantitative results and analysis of the COVID-19 impact on Canadians. This raises the questions of how much variability the clustering accounts for and how the interpretation compares with other clustering algorithms, which leads to the second part of the analysis using K-medoid clustering.

Chapter 5 Machine learning in data visualization 2: K-medoid clustering

5.1 Literature review on concept of K-medoid clustering analysis

The k-medoids algorithm partitions the data into groups and attempts to minimize the distance between points by defining a center point for each cluster and labeling points accordingly. Moreover, k-medoids chooses actual data points as centers for enhanced interpretability, in contrast to k-means, where the center of a cluster is not necessarily one of the input data points and can be the average of the points in the cluster (Schubert & Rousseeuw, 2019). The k-medoids method was introduced by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm (Partitioning Around Medoids), which is also the name of the function used to apply k-medoids (Kaufman & Rousseeuw, 2008). The greedy algorithm in k-medoids, a heuristic for identifying the clusters, differs from the hierarchical clustering algorithm in that many solutions and iterations are possible when the algorithm is run (Schubert & Rousseeuw, 2019).

5.2 k-medoids and determination for number of clusters

Various methods can be used to determine the optimal number of clusters. One of the most commonly used is the "elbow" method, which calculates how much variability in the data is explained by the clustering. Although the explained variability increases with the number of clusters, we take the point of sharpest increase as the cut-off for the number of clusters used in the algorithm (Kumar & Paul, 2016). For example, Figure 3 shows the explained variance for 1 to 7 clusters. From Figure 3, the increase in explained variance slows after 3 clusters and is sharpest between 2 and 3 clusters. Hence, k = 3 is chosen. First of all, no clusters overlap, indicating an adequate fit to the data. Also, the fit with 3 clusters is reported to account for 45.64% of the variability in the data, which is about half (Figure 3).

Figure 3. (a) The elbow plot of explained variability against the number of clusters for k-medoids. (b) The k-medoids clustering with k=3 applied to the data.

Similar to the hierarchical clustering analysis, the words "covid" and "pandemic" are separated from the major cluster (cluster 2), which contains 18 words in the data (Table 3).
However, the word "health" is contained in the other cluster (cluster 2), indicating that the socio-economic issues include health problems as well. From the perspective of headlines, "covid" and "pandemic" act as a title while the other 18 words act as subtitles and specific issues, which suggests a hierarchy of topics. While the subtopics are the actual impacts on the living and social life of Canadians in 2020, the impact of the COVID-19 pandemic was severe and changed many things, so that issues almost always contain the words "covid" and "pandemic".

PC                  1     2     3     4     5     6     7
Variance explained  0     0.14  0.27  0.38  0.46  0.5   0.56

Table 2. Variance explained by increasing number of clusters

Word       Cluster number
covid      1
pandemic   3
Canada     2
impact     2
Canadians  2
business   2
impacts    2
people     2
health     3
Canadian   2
economics  2
survey     2
article    2
data       2
examine    2

Table 3: Table of words classified by clusters in k-medoids

From Figure 2 and Figure 3, the k-medoids and hierarchical clustering results yield similar interpretations: there are certain topics and subtopics of issues, matching what was observed in the wordcloud. The statistical results show meaningful relations between words, reflecting how people perceived living conditions and the issues they dealt with during the COVID-19 pandemic. So far, we have not analyzed any causal relationship in the time series of publication dates, to investigate whether words differ statistically in when they appear. Hence, we apply local smoothing regression to the annual time series to observe how the trends of words over 12 months differ, and to investigate whether one word has an impact on the appearance of another.
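The k-medoids idea from Section 5.1, where each cluster center must be an actual data point, can be sketched in plain Python. This is a simplified alternating scheme (assign points to the nearest medoid, then re-pick the medoid inside each cluster), not the full PAM swap search and not the R function the paper refers to. The 2-D coordinates, the choice of the first k points as starting medoids, and the names `dist` and `k_medoids` are assumptions for illustration.

```python
import math

# Hypothetical 2-D coordinates for six words (for example, positions
# derived from frequency profiles); not taken from the paper's data.
points = {
    "health":   (1, 1),
    "survey":   (1, 2),
    "data":     (2, 1),
    "covid":    (8, 8),
    "pandemic": (8, 9),
    "impact":   (9, 8),
}

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def k_medoids(points, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids stop changing."""
    names = list(points)
    medoids = names[:k]  # deterministic start: first k points (an assumption)
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for n in names:
            nearest = min(medoids, key=lambda m: dist(points[n], points[m]))
            clusters[nearest].append(n)
        # Update step: in each cluster, pick the member with the smallest
        # total distance to the other members as the new medoid.
        new_medoids = [
            min(members, key=lambda c: sum(dist(points[c], points[o]) for o in members))
            for members in clusters.values()
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids, clusters

medoids, clusters = k_medoids(points, k=2)
print(medoids, clusters)
```

Because the medoids are data points, each cluster can be summarized by a real keyword, which is the interpretability advantage the text attributes to k-medoids. The result of this simplified scheme can depend on the starting medoids.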
Chapter 6 Time series data analysis: local smoothing regression

6.1 Literature review on concept of local smoothing regression

Local smoothing regression (also known as loess: locally estimated scatterplot smoothing), which is closely related to the Savitzky-Golay filter, is a non-parametric regression method that combines a generalized moving average (MA) with polynomial regression (Press & Teukolsky, 1990). It is a non-linear regression in which a smooth curve is fitted to the actual data points (Hardle et al., 2012). For different classes of words, we may fit the local smoothing regression to identify trend changes and points of increase and decrease, and to see whether one word has an impact on another in 2020. By analyzing headlines, which directly reflect the issues, we can check whether a causal relation can be identified and applied to the graphical interpretation of statistical models.

6.2 Result of local smoothing regression

Figure 4. A time series analysis by local smoothing regression of the top 7 most frequently used words.

Note that the paper uses the "ggplot" function with the loess option for the local smoothing regression to plot the result. Overall, the result shows that causality or time-series-based reasoning cannot be applied for inference, as the trend of increase and decrease is the same for all 7 words. Nevertheless, the word "Canadian" lags slightly, with a higher number of appearances between July and August compared with other words such as "covid" and "pandemic".
Even though the word "Canada" follows the trend of the other words exactly, "Canadian" issues appear more frequently in the headlines from July to August, as the word "health" does, indicating that the impact on the health of Canadians was most severe in July to August 2020. Furthermore, we can argue that the decline of the word "Canada" after June shifted to the word "Canadian", as people aimed to describe more specific interests or issues. We are still unable to draw a conclusive argument on causal inference for words, but we can distinguish some minor differences between topics to identify popular ideas among people in 2020.

Chapter 7 Gaussian Graphical Models: application in unstructured text data

A graph in applied mathematics is defined by a pair of sets: vertices (or nodes) and edges. It is applied in a statistical model to represent the dependency and structural connections between words, which are the vertices (Hojsgaard, Edwards & Lauritzen, 2012). By applying an undirected graph and parametric modeling to the text data, we aim to clearly define the structural dependencies and relationships in the headline sentences that make up the data.

7.1 Literature review on the Gaussian Graphical Model

For massive, complicated text data, which requires characterizing the relationships among a large number of variables, Gaussian graphical models explicitly capture the statistical dependency between the variables of interest in the form of a network graph. As a consequence of the central limit theorem (CLT), quantities derived from unstructured data can be approximately modelled by a Gaussian distribution. Hence, assuming Gaussianity specified by the first and second moments, the graph can be described by the sparsity pattern of the concentration matrix (Uhler, 2017).
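To make the link between the concentration matrix and the graph explicit, the standard relationship for a multivariate normal vector can be written as follows. The symbol $p$ for the number of keywords and the name $K$ for the concentration matrix are notational assumptions; $k_{uv}$ and $V$ are used in the same sense as in Section 7.2.

```latex
y = (y_1, \dots, y_p)^\top \sim N_p(\mu, \Sigma), \qquad K = \Sigma^{-1} = (k_{uv})_{u,v \in V},
\qquad
k_{uv} = 0 \iff y_u \perp y_v \mid y_{V \setminus \{u,v\}} .
```

A zero off-diagonal entry of $K$ therefore corresponds to a missing edge between the two keywords in the graph.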
First, each node in the graph corresponds to one of the variables in the text data. Missing edges in the graph correspond to conditional independence relations in the corresponding Gaussian graphical model. In searching for the dependencies and constructing the network graph, a popular method is to take a stepwise approach (Scutari & Denis, 2021). We start from the complete graph (where all keywords are connected) and run a backward selection method. That is, we cycle through the possible edges and remove an edge if doing so decreases some criterion, which is the BIC value in our model. This criterion penalizes the likelihood according to the model complexity (Scutari & Denis, 2021). A drawback of backward stepwise selection is its limited coverage of the search space, since it iterates naively over the edges and scales poorly to big data. Hence, we apply a specific threshold to the partial correlations and remove all edges whose partial correlation is less than the given threshold (Uhler, 2017).

Note that the paper uses the "cmod" function in the "gRim" package in R, together with the "stepwise" function, to construct the network graph. The partial correlation is computed with the function "cov2pcor" from the R package "gRbase" (Højsgaard, 2009). The analysis follows the textbook "Graphical Models with R" (Højsgaard, Edwards & Lauritzen, 2012).

7.2 Basic indirected network analysis: Undirected Gaussian Graphical Model

A graphical model is constructed from the multivariate data of keywords extracted from the text. With a Gaussian graphical model, the dependency structure between mutually related keywords can be shown clearly, providing a supplementary interpretation of the co-occurrence of the keywords and the meaning behind the issues.

The keywords $y_1, \ldots, y_7$ are assumed to follow a multivariate normal distribution $N_7(\mu, \Sigma)$. The inverse of the covariance matrix (Table 4),

\[
\Sigma^{-1} =
\begin{pmatrix}
k_{11} & \cdots & k_{17} \\
\vdots & \ddots & \vdots \\
k_{71} & \cdots & k_{77}
\end{pmatrix},
\]

is used to compute the partial correlation between two variables $y_u$ and $y_v$, which is

\[
\rho_{uv \mid V \setminus \{u,v\}} = \frac{-k_{uv}}{\sqrt{k_{uu}\, k_{vv}}}
\]

(Højsgaard, Edwards & Lauritzen, 2012). Together with the threshold used to disconnect keywords, a stepwise backward model selection procedure using the BIC criterion, starting from the saturated model, yields the model in Figure 5. Three keywords, "impact", "pandemic", and "article", each have 4 connections and heavily influence the other words. In other words, these words are important to the other words through their influence on the topics. This relationship is shown quantitatively by the partial correlation matrix:

7.2.1 Partial correlation: Gaussian Graphical Models

            covid   pandemic  article   Canada  Canadian   health   impact
covid      100.00     73.00     2.00   -16.00     58.00    -8.00    46.00
pandemic    73.00    100.00    40.00    60.00    -13.00   -28.00   -69.00
article      2.00     40.00   100.00   -72.00    -15.00    75.00    68.00
Canada     -16.00     60.00   -72.00   100.00    -11.00    52.00    84.00
Canadian    58.00    -13.00   -15.00   -11.00    100.00    39.00     9.00
health      -8.00    -28.00    75.00    52.00     39.00   100.00   -51.00
impact      46.00    -69.00    68.00    84.00      9.00   -51.00   100.00

Table 4. Partial correlation in Gaussian graphical models (entries appear to be scaled by 100, given the diagonal of 100.00)

Some insignificant connections can be confirmed from the partial correlations (Table 4). The words "covid" and "article" are treated as conditionally independent (entry 2), as are "Canadian" and "impact" (entry 9). Hence, there are no edges between them, and there is less emphasis on issues connecting these words.

7.2.2 Interpretation of the Gaussian graphical model

The interest of the Gaussian graphical model lies in its structural interpretation. Since each edge indicates dependence between two vertices, we can read off conditional independence relationships between sets of words as a practical inference (Kim et al., 2019).
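The partial-correlation formula and the thresholding step of Section 7.2 can be sketched in a few lines. The paper itself relies on the R function "cov2pcor"; the Python below is only an illustrative stand-in. The function names, the 3 x 3 concentration matrix, the threshold of 0.1, and the use of the absolute value when thresholding are all assumptions made for this example, not values or choices taken from the paper.

```python
from math import sqrt


def partial_correlations(K):
    """Turn a concentration matrix K (inverse covariance) into partial correlations.

    Off-diagonal entries follow rho_uv = -k_uv / sqrt(k_uu * k_vv);
    the diagonal is set to 1 by convention.
    """
    p = len(K)
    rho = [[1.0] * p for _ in range(p)]
    for u in range(p):
        for v in range(p):
            if u != v:
                rho[u][v] = -K[u][v] / sqrt(K[u][u] * K[v][v])
    return rho


def edges_above_threshold(rho, names, threshold):
    """Keep the edge u-v only when |rho_uv| reaches the threshold (assumed rule)."""
    edges = []
    for u in range(len(names)):
        for v in range(u + 1, len(names)):
            if abs(rho[u][v]) >= threshold:
                edges.append((names[u], names[v]))
    return edges


# Hypothetical concentration matrix for three placeholder words (not the paper's data).
K = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -0.1],
    [0.0, -0.1, 1.0],
]
names = ["a", "b", "c"]

rho = partial_correlations(K)
edges = edges_above_threshold(rho, names, threshold=0.1)
print(edges)
```

In this toy matrix the zero entry between "a" and "c" gives a partial correlation of zero, and the small entry between "b" and "c" falls below the assumed threshold, so only the edge between "a" and "b" survives. This mirrors how the near-zero entries in Table 4 lead to missing edges.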
For example, consider the reduced structure in which the words "Canadian" and "covid" have no direct edges to "article" and "Canada" and are connected to them only through the nodes "impact" and "pandemic". This gives the conditional independence relationship (Canadian, covid) ⊥ (article, Canada) | (impact, pandemic). The interesting result is that "article" does not indicate a "Covid-19" related issue unless the words "impact" and "pandemic" are present.

Similarly, the "health" issues are not related to the words "impact" and "Canada" without the word between them, "article". In summary, the relationship is (health) ⊥ (impact, Canada) | (article), by the connected edges that indicate the correlation between nodes.

Another relationship we can recognize from Figure 5 is (Canadian) ⊥ (impact, pandemic) | (covid), meaning that "Canadian" issues necessarily combine with the word "covid" to describe the issues of "pandemic" and "impact". Also, the "pandemic-Canada-impact" relationship is conditionally independent of "health" given the word "article". In other words, almost all articles that deal with "health" issues are relevant to "pandemic" and "impact" in 2020.

Alternatively, we can recognize from Figure 5 the relationship (health, article) ⊥ (Canadian, covid) | (impact, pandemic), meaning that "health" and "article" necessarily combine with the words "impact" and "pandemic" to describe the issues of "Canadian" and "covid".

Figure 5. Undirected Gaussian graphical model showing the dependency structure of the top 7 keywords and how they are related to each other.
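Statements of the form A ⊥ B | S are read from an undirected graph by checking separation: the statement holds when every path from A to B passes through S. The sketch below illustrates that check with a breadth-first search. The function name and the toy edge list are hypothetical; the edge list is not the edge set of Figure 5, which cannot be recovered from the text alone.

```python
from collections import deque


def separates(edges, A, B, S):
    """Return True if every path from a node in A to a node in B passes through S."""
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    blocked = set(S)
    targets = set(B) - blocked
    start = [a for a in A if a not in blocked]
    seen = set(start)
    queue = deque(start)

    # Breadth-first search that is never allowed to step onto a node in S.
    while queue:
        node = queue.popleft()
        if node in targets:
            return False  # found a path from A to B that avoids S
        for nxt in adj.get(node, ()):
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Hypothetical toy graph: w1 - w2 - w3 - w4, plus the extra edge w2 - w4.
toy_edges = [("w1", "w2"), ("w2", "w3"), ("w3", "w4"), ("w2", "w4")]

print(separates(toy_edges, {"w1"}, {"w4"}, {"w2"}))  # w2 lies on every path
print(separates(toy_edges, {"w1"}, {"w4"}, {"w3"}))  # the path w1-w2-w4 avoids w3
```

With the actual edge list of the fitted model, the same function could be used to verify each of the independence statements quoted above instead of reading them off the figure by eye.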
Furthermore, it is easy to notice that the words "article", "impact", "pandemic", and "Canada" are mutually connected in formulating the issues. The Gaussian graphical model is therefore an interesting model that gives interpretable results when applied to text mining data. Although some journal papers apply the Gaussian graphical model to industrial data, the method is not as widely used as other well-known techniques such as LDA (latent Dirichlet allocation) or topic clustering (Kim & Jun, 2015). One limitation of the model is that the variables analyzed are solely phrases or words, meaning that the analysis is carried out not at a global document level but on split words.

Chapter 8 Network Graph in topic clustering: BTM (Biterm Topic Modelling)

We apply the BTM for statistical interpretation rather than covering the fundamentals of the modeling, given the limits of our understanding and the purpose of this paper. Because the words are short and extremely sparse in a single document made up of titles or text data, it is difficult to capture co-occurrences and relationships when some significant words appear less often than other words in the document. Although it may be possible to pre-process or reduce the size of the document, subjective aggregation and preprocessing take considerable developer effort and introduce difficulties of their own. Hence, we use the BTM, a cutting-edge topic clustering technique that has been identified in several papers as giving the most precise results (Pietsch & Lessmann, 2018).
8.1 Literature review on the biterm topic model (BTM)

The biterm topic model (BTM) was first introduced in 2013 to address the inadequacy of topic models on short documents by modelling co-occurrences globally, across the whole corpus, rather than at the document level. The BTM is now regarded as a leading method for topic clustering of short texts, as it is a probabilistic generative model of the biterms (Du et al., 2017). In the algorithm, biterms are extracted from each document using a window size, where biterms are unordered term pairs that occur within the prespecified window. Because the term pairs are unordered, the biterm quantities can be summarized by the integers assigned to them.

The first step in learning the latent topic components from co-occurrence is to model the generation of biterms. Assuming that the topics are sampled from a mixture model, each biterm is sampled from its assigned topic independently of the others. The biterms are then randomly re-assigned to topics, and the assignments are updated sequentially over iterations, with the number of times each term is assigned to each topic counted to form the posterior estimates. More details are given in Gao, Kim and Sakurai (2016).

The main advantage of BTM lies in topic coherence. In assessing topic quality, the coherence measured for BTM topics is reported to be higher than that of competing algorithms, which supports the superiority of BTM in text mining analysis. The R package BTM was used to perform the biterm topic modeling (Wijffels, 2020). The analysis follows the guideline in the R package documentation (Wijffels, 2020). The steps for applying the BTM to text data are as follows:

1. Crawl the plain-text data, then pre-process and tokenize the inputs. This process requires the R package "ctv" to process the document terms (Zeileis, Hornik & Zeileis, 2022). After this step, the output gives a unique tag for each sentence and the characteristics of each word.

2. Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs within a distance of 3 words. This process requires the R package "udpipe", which tokenizes the text and vectorizes it for sampling, to extract the biterms (Wijffels, Straka & Straková, 2018). After this step, the output gives the terms to put into the documents.

3. Build the biterm topic model with 5 topics and provide the set of biterms to cluster. This can be done with the R package BTM, which draws the topics and clusters of words to complete the topic clustering algorithm (Wijffels, 2020). This is the important step, where the tuning parameters for analyzing the data are set.

4. Visualize the biterm topic clusters. The R package "ggraph" is used to turn the topic clustering data automatically into a graph, with subjective topic names supplied by the user (Pedersen et al., 2017).

8.2 Topic clustering: result of BTM

Figure 6 shows 5 topics, defined as "economics issues", "global influence and educational issues", "public health issues", "main keywords", and "sociological issues". A larger node (term) corresponds to a higher topic-term probability (Figure 6). Words such as "protective", "international", and "service" have a high probability of appearing in their corresponding topics (Figure 6). Thicker and darker edges (links) indicate higher co-occurrence within the topic. As identified previously, the words "covid" and "pandemic" have a high co-occurrence in the main keywords (issue) topic.
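To make the notion of a biterm concrete, the sketch below extracts unordered term pairs from one tokenized headline. It is written in Python for illustration only and is not the "udpipe" or BTM code used in the paper. The function name, the sample tokens, and the convention that two tokens form a biterm when both fit inside a sliding window of the given length are assumptions of this example.

```python
from collections import Counter


def extract_biterms(tokens, window):
    """Return unordered term pairs whose positions are at most window - 1 apart.

    Each pair is stored as a sorted tuple so that (a, b) and (b, a) count
    as the same biterm. Pairs of identical terms are skipped.
    """
    biterms = []
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window]:
            if first != second:
                biterms.append(tuple(sorted((first, second))))
    return biterms


# Hypothetical tokenized headline (not taken from the paper's data).
headline = ["covid", "pandemic", "impact", "health"]
biterms = extract_biterms(headline, window=3)

# Biterm counts pooled over a corpus are the input to the topic model;
# here the "corpus" is just this one headline.
counts = Counter(biterms)
print(counts)
```

With a window of 3, "covid" pairs with "pandemic" and "impact" but not with "health", which sits outside the window. A larger window produces more biterms and links words that are further apart in the original text.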
As biterms were computed using a window size of 10, adjacent nodes may not necessarily have appeared adjacently in the original text document; for example, the words "outlook" and "postsecondary" rarely appear in the data (Figure 6).

Figure 6. Biterm clusters for 5 topics. Each cluster is an undirected graph, as biterms are unordered pairs of terms. Words are connected by their co-occurrences within the topics and together formulate the topics.

Some of the unique words not observed earlier, such as "international", "postsecondary", and "student", did not appear in the frequency table (Table 1). As the topic model accounts for words that appear less often in the data, we are now able to detect the words related to the topic of global influence and education issues. Also, the economic issues and sociological issues are somewhat separated, which gives a slightly different result from the one provided before.

In the topic of sociological issues, we are able to identify some natural consequences of the covid-19 pandemic, including the words "medical", "protective", "business" and "personal", which appear to be of interest to many people. With the added words "outlook", "price" and "service", we can see the worries and living conditions of Canadians in 2020 under the impact of Covid-19.

Similarly, the public health issues form their own topic, with the words "mental", "group", "health" and "visible". From the result in Figure 6, "mental" and "health" are closely connected, showing that the issue of public health includes the mental health of Canadians. Note that this is consistent with Figure 5, where "health" is separated from the other keywords and is connected only with particularly relevant words, with weak relations to the rest.

From the analysis of topics, we can clearly observe that some words are drawn from the text because of the nature of biterm analysis.
While the most frequent words were easy to identify in all of the previous analyses, the BTM yields more comprehensive, grouped topics of words through the application of mixture models.

Chapter 9 Conclusion and discussion

9.1 Discussion

Across the 5 different analyses of keywords, there are some variations and minor differences between methods. Although the wordcloud is regarded as the most effective tool for portraying frequencies in text data, its unstructured and qualitative form is a matter of controversy. Therefore, the author provided a unique but efficient sub-plot of the co-occurrence frequencies, as well as a table of the 7 most frequently used words, to give a better understanding of the data. The graph together with the table provides a clear result for the text data crawled from Statistics Canada in 2020 on the Covid-19 pandemic.

The hierarchical clustering result shows only 2 clusters, but the k-medoids clustering result shows 3 clusters to be the optimal choice for classification. There was also a minor difference compared with the correlation analysis, but the results were generally consistent. For the purpose of grouping keywords, the goal of differentiating the topic keywords from the sub-topic keywords is achieved by comparing the hierarchical clustering and the k-medoids clustering. Furthermore, the results consistently separated the major cluster from the minor clusters, where the minor cluster contained the same keywords, "health" and "pandemic".

Lastly, the BTM result for topic clustering is presented without detailing its basis in statistical sampling from mixture models.
Unfortunately, because the purpose is topic clustering and the presentation of its result, a concise account of the individual steps in BTM is skipped, as it deviates from the topics of this paper. The goal is to construct a coherent and consistent topic clustering model that tells readers what major issues can be drawn from the BTM analysis and gives guidance on the data we analyzed. The model succeeded in delivering specific words and topics that were not observed among the most frequently used words, resulting in a complete overview of the 5 major topics and the co-occurrences within each cluster.

9.2 Conclusion

The paper examines the results of 5 different methods for text mining data and 1 method for time series data on the publication dates. Each method differs in its mathematical and statistical foundation, leading us to explore the data and work through the different results of the analysis. This included analyzing term frequencies and term co-frequencies, clustering, and formulating topic models in order to better understand the topics of the extracted keywords throughout the covid-19 pandemic in Canada.

First, we used the wordcloud to find the keywords and their co-occurrences in the data. Then, we applied the simple clustering algorithms of hierarchical clustering and k-medoids clustering to group the keywords and investigate their correlation. In wrangling and observing the data, we found 2 groups of keywords. The first group, the main topics, corresponds to the most frequent words, "covid", "pandemic" and "health", which are the general form of the topics. The other group, the sub-topics, includes words such as "Canadians", "data", "survey", "economics" and "business", in which we can find the sociological difficulties and issues that many people confront.
This result shows the importance of knowing how to cope with socio-economic problems in the future if a similar type of pandemic occurs.

Second, we applied local smoothing regression to the time series data to investigate whether a causal relationship could be drawn from the different trends in the appearance of issues in the keyword data. We found that the trend is similar for the top 7 most frequent words, indicating that differences in trend are hard to analyze. However, we are able to draw a minor conclusion that the issue of "Canada" for "covid" becomes the problem of "Canadian" around July to August, the period in which the pandemic was most severe.

Third, we applied the Gaussian graphical model to draw conditional independence relationships between the top 7 ranked words. Through the structural form of the undirected Gaussian graphical model, we were able to differentiate how the co-occurrences of words are written into the headlines, using conditional independence. One of the statistically significant results shows that the word "health" is not relevant to "covid" or "pandemics" without the words "impact" or "pandemic". This result was consistent with the BTM (Biterm Topic Models for Short Text), where the biterm is "mental" and "health" for the topic of public health issues.

With regard to topic clustering, it was concluded that the topics learned by BTM were more concise and specific than the clusters learned by hierarchical clustering and k-medoids clustering. There were hidden keywords that shaped the resulting topics, and the specific topics clearly differentiated the relevant keywords.
We are able to find 5 topics, each distinct from the others, that show what problems and issues Canadians faced during the Covid-19 pandemic in 2020.

Bibliography

[1] Liu, S., Yang, L., Zhang, C., Xiang, Y. T., Liu, Z., Hu, S., & Zhang, B. (2020). Online mental health services in China during the COVID-19 outbreak. The Lancet Psychiatry, 7(4), e17-e18. https://www.thelancet.com/journals/lanpsy/article/PIIS2215-0366(20)30077-8/fulltext
[2] World Health Organization. (2021). Coronavirus disease (covid-19). https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19
[3] Agrawal, R., Imieliński, T., & Swami, A. (1993, June). Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD international conference on Management of data (pp. 207-216).
[4] Wijffels, J. (2020). BTM: Biterm topic models for short text. R package version 0.3.1. https://CRAN.R-project.org/package=BTM
[5] Halvey, M. J., & Keane, M. T. (2007, May). An assessment of tag presentation techniques. In Proceedings of the 16th international conference on World Wide Web (pp. 1313-1314).
[6] Guattari, F., & Deleuze, G. (1992). Tausend Plateaus. Kapitalismus und Schizophrenie.
[7] Deleuze, G., & Guattari, F. (1992). Tausend Plateaus. Kapitalismus und Schizophrenie. ISBN 978-3-88396-094-4.
[8] Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association rules. In Proc. 20th int. conf. very large data bases, VLDB (Vol. 1215, pp. 487-499).
[9] Wong, P. C., Whitney, P., & Thomas, J. (1999, October). Visualizing association rules for text mining. In Proceedings 1999 IEEE Symposium on Information Visualization (InfoVis'99) (pp. 120-123). IEEE.
[10] Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377-380.
[11] Garcia, E. (2007). Drawbacks and solutions of applying association rule mining in learning management systems (PDF). Sci2s. Archived (PDF) from the original on 2009-12-23.
[12] Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation Techniques for Social and Mobile Data. Springer Nature.
[13] Congressional Research Service. (2021). Unemployment rates during the COVID-19 pandemic. https://crsreports.congress.gov/product/pdf/R/R46554
[14] Gilfillan, G. (2020). Covid-19: Labour market impacts on key demographic groups, industries and regions. Department of Parliamentary Services Australia, Parliament of Australia. https://www.voced.edu.au/content/ngv:90977
[15] World Bank. (2020). The COVID-19 crisis response. https://doi.org/10.1596/34571
[16] Nayak, J., Mishra, M., Naik, B., Swapnarekha, H., Cengiz, K., & Shanmuganathan, V. (2021). An impact study of COVID-19 on six different industries: Automobile, energy and power, agriculture, education, travel and tourism and consumer electronics. Expert Systems. https://doi.org/10.1111/exsy.12677
[17] Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business professionals. John Wiley & Sons.
[18] Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
[19] Aggarwal, C. C., & Zhai, C. (2012). Mining Text Data.
[20] Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
[21] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 112, p. 18). New York: Springer.
[22] Kaufman, L., & Rousseeuw, P. J. (2008). Clustering large applications (Program CLARA). Finding groups in data: an introduction to cluster analysis, 126-146.
[23] Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International conference on similarity search and applications (pp. 171-187). Springer, Cham.
[24] Kumar, A., & Paul, A. (2016). Mastering text mining with R. Packt Publishing Ltd.
[25] Press, W. H., & Teukolsky, S. A. (1990). Savitzky-Golay smoothing filters. Computers in Physics, 4(6), 669-672.
[26] Hardle, W., Muller, M., Sperlich, S., & Werwatz, A. (2012). Nonparametric and semiparametric models (Vol. 2). Berlin: Springer.
[27] Højsgaard, S., Edwards, D., & Lauritzen, S. (2012). Graphical models with R. Springer Science & Business Media.
[28] Uhler, C. (2017). Gaussian graphical models: An algebraic and geometric perspective. arXiv preprint arXiv:1707.04345.
[29] Scutari, M., & Denis, J. B. (2021). Bayesian networks: with examples in R. Chapman and Hall/CRC.
[30] Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent Keyword Analysis Using Time Series and Copula Models. Applied Sciences, 9(19), 4071.
[31] Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression model for apple keywords by text mining. Advanced Engineering Informatics, 29(4), 918-929.
[32] Du, D., Li, L., Zhu, E., & He, K. (Eds.). (2017). Theoretical Computer Science: 35th National Conference, NCTCS 2017, Wuhan, China, October 14-15, 2017, Proceedings (Vol. 768). Springer.
[33] Gao, H., Kim, J., & Sakurai, Y. (Eds.). (2016). Database Systems for Advanced Applications: DASFAA 2016 International Workshops: BDMS, BDQM, MoI, and SeCoP, Dallas, TX, USA, April 16-19, 2016, Proceedings (Vol. 9645). Springer.
[34] Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A Word Cloud Geom for 'ggplot2'. R package version 0.5.0.
[35] Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). quanteda: An R package for the quantitative analysis of textual data. Journal of Open Source Software, 3(30), 774.
[36] Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
[37] Wickham, H., & Wickham, M. H. (2007). The ggplot package. https://cran.r-project.org/web/packages/ggplot2/index.html
[38] Højsgaard, S. (2009). On the usage of the gRim package.
[39] Zhu, Q., Feng, Z., & Li, X. (2018, January). GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. In Conference on Empirical Methods in Natural Language Processing (EMNLP 2018).
[40] Pietsch, A. S., & Lessmann, S. (2018). Topic modeling for analyzing open-ended survey responses. Journal of Business Analytics, 1(2), 93-116.
[41] Wijffels, J. (2020). BTM: Biterm topic models for short text. R package version 0.3.1. https://CRAN.R-project.org/package=BTM
[42] Zeileis, A., Hornik, K., & Zeileis, M. A. (2022). Package 'ctv'.
[43] Wijffels, J., Straka, M., & Straková, J. (2018). Package 'udpipe'.
[44] Pedersen, T. L., Pedersen, M. T. L., LazyData, T. R. U. E., Rcpp, I., & Rcpp, L. (2017). Package 'ggraph'. Retrieved January, 1, 2018.
