A text mining and association analysis:
Exploring text data for creating topic models
Kyuson Lim
Department of Mathematics & Statistics,
McMaster University, E-mail: limk15@mcmaster.ca
April 26, 2022
STATS 771
Kyuson Lim
Contents
1 Abstract
2 Motivation
  2.1 Background
    2.1.1 About text data
  2.2 Outline and goal
3 Text mining and data visualization using wordcloud
  3.1 Literature review on concept of wordcloud and support analysis
    3.1.1 Wordcloud
    3.1.2 Association rule: support analysis
  3.2 Complex data visualization: Wordcloud and co-occurrences
4 Machine learning in data visualization 1: Hierarchical Clustering analysis
  4.1 Literature review on concept of hierarchical clustering analysis
  4.2 Hierarchical clustering and correlation analysis
5 Machine learning in data visualization 2: K-medoid clustering
  5.1 Literature review on concept of K-medoid clustering analysis
  5.2 k-medoids and determination for number of clusters
6 Time series data analysis: local smoothing regression
  6.1 Literature review on concept of local smoothing regression
  6.2 Result of local smoothing regression
7 Gaussian Graphical Models: application in unstructured text data
  7.1 Literature review on the Gaussian Graphical Model
  7.2 Basic undirected network analysis: Undirected Gaussian Graphical Model
    7.2.1 Partial correlation: Gaussian Graphical Models
    7.2.2 Interpretation of the Gaussian graphical model
8 Network Graph in topic clustering: BTM (Biterm Topic Modelling)
  8.1 Literature review on the biterm topic model (BTM)
  8.2 Topic clustering: result of BTM
9 Conclusion and discussion
  9.1 Discussion
  9.2 Conclusion
Chapter 1
Abstract
Using about 182 official news headlines related to the COVID-19 pandemic in Canada,
collected between January 2020 and December 2020, this paper statistically analyzes the
topics and keywords of the headlines with graphical models and topic models, and presents
the output as collections of connected keywords. The co-occurrences of words within
sentences are investigated with Gaussian graphical models to learn how keywords are
connected to one another, and the resulting associations between words are portrayed
visually.
Topic modelling in text mining is an efficient tool for exploring and summarizing
massive collections of words. To cluster words by their appearance, the data are shown
visually with a wordcloud and analyzed by hierarchical clustering. The similarity between
words and the grouping effect arising from the sentence structure are examined through
correlation analysis and k-medoids clustering.
Topic clustering has been a great success in both industrial and academic research
when applied to formal texts. As an extension of machine learning algorithms, the Biterm
Topic Model (BTM) is used to analyze the co-occurrences of words over a massive set of
documents and to produce simple network graphs. To address the statistical inference
problem in text mining data, the BTM is presented to visually interpret the result of the
topic clustering analysis.
Chapter 2
Motivation
2.1 Background
In 2020, the coronavirus disease (COVID-19) pandemic began, caused by a virus known
as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) (WHO, 2019). COVID-19
stands for coronavirus disease 2019 and is also referred to as the 2019 novel coronavirus
or "2019-nCoV" (Bender, 2020). The fast transmission of the virus hindered traditional
face-to-face communication between people and changed how courses are delivered in
education (Liu et al., 2020).
This new virus can be transmitted within minutes through droplets, or by touching
metal or other surfaces contaminated by a person with respiratory symptoms. Although the
elderly and very young children are most easily affected, nobody is immune to this new
infectious disease, so all people are susceptible to its devastating effects (Bender,
2020; Meng et al., 2020).
The coronavirus disease (COVID-19) pandemic led to lockdowns and social distancing
policies in more than 200 countries, which inevitably plunged the global economy into its
worst recession since World War II (World Bank, 2020; Nayak et al., 2021). This recession
has been experienced very differently across regions, demographic groups, and industries
(Congressional Research Service, 2021; Gilfillan, 2020).
Mass hysteria was bound to ensue in the early days of the pandemic because of
uncertainty about the authenticity of information sources. At a time when the general
public was advised to stay home except for essential purposes and in-person social
gatherings were restricted, many people relied on the latest news offered on the public
internet and on newspaper headlines.
In the era of big data, how much data we can actually analyze is a question of interest
for many people. Data mining techniques help answer it: text crawled from a certified
archive can be processed almost instantaneously, which makes big data comprehensible.
Text mining (using NLP: natural language processing) not only lets analysts build
effective models from unique qualitative data, but also addresses quantitative questions
by enabling new, rapidly built integrative models.
2.1.1 About text data
Statistics Canada is an official agency of the Government of Canada, commissioned to
produce statistics that help better understand Canada and its sociological and political
issues. Its publicly available data can be accessed and used by anyone in the world. The
website works as a platform where users can read, and also download, official government
notices and news.
The data used in this article consist of news articles on various subjects that explore
the impact of COVID-19 on sociological and economic issues in 2020. Although the news
headlines are updated periodically, the analysis is restricted to all 2020 headlines on
the website
(https://www150.statcan.gc.ca/n1/pub/45-28-0001/452800012020001-eng.htm)
so that the analysis is stable and its interpretation is unambiguous.
The data were posted by Statistics Canada from January to December 2020, in part as the
third iteration of the Canadian Perspectives Survey Series (CPSS), whose purpose is to
explore the re-opening of economic and social activity during quarantine times. The news
articles are intended to give families and the general public an update on living
conditions, and to examine lifestyles during the confinement of 2020, the year in which
the COVID-19 pandemic was widespread.
The data mostly cover the living and lifestyle issues of people aged over 20 in Canada
at the time of the survey, and so represent the Canadian issues that were prominent
during the COVID-19 pandemic in 2020. By web crawling these articles on Canadian
lifestyle and living, we hope to learn which interests and social ideas were present, how
people navigated their lives, and which issues were problematic in particular time
periods. Because Statistics Canada is an official government agency, analyzing its text
data should help us understand the issues we face in 2022 by learning from the past.
Text mining of news headlines is meant to provide a representative and efficient
understanding of the overall contents and issues that the news covers. The statistical
analysis therefore considers only the headlines, in order to learn about public issues,
interests, and the relationships between words, and to show the reader visually how the
words are connected and what conclusions can be drawn.
2.2 Outline and goal
The goal is to explore the data with various statistical models, including machine
learning clustering algorithms, topic clustering, and graphical models, to identify the
associations between keywords and to find which models best suit the text data of
COVID-19 news headlines and descriptions published on the official website of Statistics
Canada.
First, a wordcloud is used to identify the most frequently used words in the news
headlines and to define keywords based on their ranking. The purpose is to portray the
words that appear in the headlines overall, which is an effective form of data
visualization.
Second, hierarchical clustering with a correlation plot is used to identify statistical
groupings of keywords based on the dissimilarity and similarity between words. By
grouping keywords hierarchically, we can construct groups of keywords in the same cluster
to look for latent variables. We can also inspect the grouped keywords to construct the
topic that best represents each cluster, which relates to the topic clustering analysis.
Then, k-medoids clustering, a more robust variant of k-means clustering, is applied to
identify quantitative groups of clusters among all words in the dataset. The model is
expected to recognize significant relationships between keywords and to build topics of
relevant words that account for as much variability as possible.
Third, the undirected Gaussian graphical model (GGM) provides statistical relationships
based on parametric analysis within the framework of multivariate statistics, giving a
concise and clear structure between the words of the headlines. Combined with the time
series of headline publication dates, analyzed by local polynomial regression fitting, it
is expected to explain why the relationship between keywords cannot be given a causal
direction.
Lastly, the Biterm Topic Model (BTM), a model designed for short text analysis, is
applied to identify topics and to compare the various models. Traditional models for
simple statistical analysis are not suitable for short texts, because few keywords are
available and co-occurrences are extremely sparse; BTM instead models term co-occurrences
globally, over the whole corpus, rather than at the document level.
Due to the nature of data exploration and computational statistics, this paper deals
only with exploratory data analysis (EDA), not with confirmatory data analysis (CDA), to
efficiently produce and interpret the results.
Chapter 3
Text mining and data visualization using wordcloud
3.1 Literature review on concept of wordcloud and support analysis
3.1.1 Wordcloud
A wordcloud, also known as a tag cloud, is a visual representation of text data that is
often used to depict keywords on websites or to visualize free-form text (Halvey & Keane,
2007). Following the early example of Douglas Coupland (1992), more frequently used terms
are given a larger weight, a non-parametric way to portray the data visually in a graph.
Popularized mainly by Web 2.0 websites and blogs, wordclouds have since seen various
modifications of the weights that set the size of the text floating in the graph
(Guattari & Deleuze, 1992).
In this paper, the wordcloud is used to identify keywords by how frequently they appear
in the data and to portray the important words effectively. In the author's previous
experience working as a researcher with professionals, people in authority emphasized
attractive color and visually influential factors when practical findings and outcomes
were shown to the public.
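The frequency ranking that drives a wordcloud can be sketched in a few lines. The paper's own preprocessing was done in R; the Python sketch below is only an illustration, and the headlines, the stop-word list, and the helper names `tokenize` and `top_words` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical headlines for illustration only; not the Statistics Canada data.
headlines = [
    "COVID-19 pandemic and the health of Canadians",
    "Impact of the COVID-19 pandemic on Canadian business",
    "Mental health of Canadians during the pandemic",
]

# A tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "of", "and", "on", "during"}


def tokenize(text):
    """Lowercase a headline and keep alphabetic tokens that are not stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def top_words(docs, n):
    """Return the n most frequent tokens with their counts."""
    counts = Counter(token for doc in docs for token in tokenize(doc))
    return counts.most_common(n)


ranking = top_words(headlines, 3)
for rank, (word, freq) in enumerate(ranking, start=1):
    print(rank, word, freq)
```

In a wordcloud, each word in such a ranking would then be drawn with a size proportional to its count.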
3.1.2 Association rule: support analysis
Introduced by Rakesh Agrawal, Tomasz Imielinski and Arun Swami (1994), association rule
learning is a rule-based machine learning method for discovering interesting relations
between variables or target responses in a large dataset. In association rule learning,
lift is a basic measure of co-occurrence based on frequencies in the target data.
Commonly known through market basket analysis, lift is a basic tool that many machine
learning papers apply to text data to find relationships between keywords from the
frequencies with which they appear together (Wong & Thomas, 1999).
However, association rule algorithms are mostly applied as a component within other
parametric settings and less often as a technique in their own right (Garcia et al.,
2007). Since the simplest method is often regarded as the best in machine learning
practice, the support gives a direct representation of how items in the data are
interconnected, for inference on classification (Blumer et al., 1987).
With the support analysis, we hope to measure the magnitude of the relationships
between keywords as concisely as possible, and to give a clear notion of which words
connect so that they can be joined into a phrase. This also provides a point of
comparison for the later clustering and BTM methods.
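As a minimal sketch of the two measures discussed above, support can be taken as the share of headlines containing a set of words, and lift as the ratio of the joint support to the product of the individual supports. The tokenized headlines and the function names `support` and `lift` are hypothetical and only illustrate the definitions.

```python
from itertools import combinations

# Hypothetical tokenized headlines, for illustration only.
docs = [
    {"covid", "pandemic", "health"},
    {"covid", "pandemic", "business"},
    {"covid", "health"},
    {"pandemic", "impact"},
]


def support(docs, words):
    """Fraction of documents that contain every word in `words`."""
    words = set(words)
    return sum(words <= doc for doc in docs) / len(docs)


def lift(docs, a, b):
    """support(a, b) / (support(a) * support(b)); 1.0 means independence."""
    return support(docs, [a, b]) / (support(docs, [a]) * support(docs, [b]))


# Support of every pair of words in the (sorted) vocabulary.
vocabulary = sorted(set().union(*docs))
pair_support = {
    (a, b): support(docs, [a, b]) for a, b in combinations(vocabulary, 2)
}
```

Here `pair_support[("covid", "pandemic")]` is 0.5, because two of the four toy headlines contain both words.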
3.2 Complex data visualization: Wordcloud and co-occurrences
Before the wordcloud is visualized, the 6 most frequently used words are shown in
Table 1. As the topic of interest, the word "covid" is used 173 times, followed by
"pandemic" (100 times), "article", "Canada", "Canadian" and "health". From Table 1, we
can confirm that the headlines concern Canada's COVID-19 articles and that the topics of
interest are restricted to Canadian issues. The support analysis for the top 6 ranked
keywords is portrayed graphically in Figure 1.
A general wordcloud (Figure 1) was constructed for an overview of the terms contained
in the data. Because a wordcloud tends to look messy, with words floating in different
directions and with ambiguous magnitudes, only the 12 most frequently used words are
portrayed, each sized by its frequency.
Rank  Word      Frequency
1     covid     173
2     pandemic  100
3     article   54
4     Canada    54
5     Canadian  48
6     health    48
Table 1. Top 6 ranked most frequently used words in 2020 Statistics Canada articles.
From Figure 1, we can identify some distinctive words such as "data", "differences",
"survey", "statistics" and "study", which suggest that the headlines rely on statistical
and quantitative analysis of people's opinions to address sociological and economic
issues, rather than on qualitative or educational topics. Most of the remaining words,
however, concern economic and sociological topics, including "price", "mental",
"concerns", "home", and "workers". This indicates that the majority of people were
interested in daily living during the pandemic: how to sustain their lives and how others
coped with the COVID-19 period, when many people were isolated in their homes. Hence, the
words extracted by frequency represent well the issues and understanding we had during
COVID-19 in 2020.
Figure 1. (a) The bar graph shows the support analysis of the top 6 ranked most
frequently used words, with the top 3 ranked words denoted in the title. (b) The
wordcloud shows a frequency-based representation of the 12 most often used words in the
data.
Note that the "ggwordcloud" package was used to draw the output, after preprocessing
with the tokenizer in "quanteda", to produce a hybrid graph with the "ggplot2" package
(Le & Slowikowski, 2019; Benoit et al., 2018). The idea for this hybrid graph is the
author's own; no identical result is known for which copyright could be claimed.
Chapter 4
Machine learning in data visualization 1: Hierarchical Clustering analysis
4.1 Literature review on concept of hierarchical clustering analysis
Data visualization is an interdisciplinary area that deals with the graphic
representation of data and is applied to machine learning output. It is a particularly
efficient way of communicating when the dataset is huge and abstract, as in many machine
learning applications, because its main purpose is to reinforce human cognition (Knaflic,
2015). From an academic point of view, data visualization is a mapping between the output
of the data analysis and graphical elements. Therefore, the output of a clustering
analysis, which has its roots in statistics, is presented graphically to deliver the
conclusions of the statistical analysis efficiently (Gershon & Page, 2001).
Hierarchical clustering is a method of classification analysis that seeks to construct
a hierarchy of clusters. The method classifies words into groups based on the
dissimilarity between them. Here, dissimilarity is measured by the Euclidean distance
(L2 norm), where the numbers of times a word is used serve as its coordinates in
Euclidean space (Charu et al., 2012). The distance between any two words is then
calculated as their dissimilarity: the larger the distance, the more dissimilar the two
words. With the resulting distance matrix, we can then cluster the words (Becue-Bertaut,
2019).
The method used in the analysis is complete-linkage (farthest-neighbor) clustering. At
each step, the two clusters separated by the shortest distance are combined, where the
distance between two clusters is the maximum distance over all pairs of elements, one
from each cluster (Becue-Bertaut, 2019).
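The merging rule described above can be sketched as follows. This is an illustration in plain Python rather than the R workflow used in the paper; the word-count vectors and the helper names `complete_linkage` and `agglomerate` are assumptions made for the example.

```python
import math

# Hypothetical word-count vectors: each word is a point whose coordinates are
# its counts in three illustrative groups of headlines (not the paper's data).
points = {
    "covid": [9.0, 8.0, 9.0],
    "pandemic": [8.0, 9.0, 8.0],
    "health": [2.0, 1.0, 2.0],
    "survey": [1.0, 2.0, 1.0],
    "study": [1.0, 1.0, 2.0],
}


def euclidean(p, q):
    """L2 distance between two coordinate vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def complete_linkage(c1, c2):
    """Distance between clusters: the maximum over all cross-cluster pairs."""
    return max(euclidean(points[a], points[b]) for a in c1 for b in c2)


def agglomerate(words, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[w] for w in words]
    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: complete_linkage(clusters[ij[0]],
                                                          clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


clusters = agglomerate(list(points), 2)
print(clusters)
```

Recording the distance at which each merge happens would give the heights of a dendrogram; the sketch only keeps the final grouping.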
Note that "ggcorrplot" was used to visualize the correlation analysis and "ggdendro" to
visualize the clustering output, both based on the frequencies with which words appear in
the data (Galili, 2015; Wickham & Wickham, 2007).
4.2 Hierarchical clustering and correlation analysis
Since the purpose is to compare all pairs of points in 2 dimensions, the Euclidean
distance with complete linkage is applied in the hierarchical clustering model, which
identifies two clusters of words in Figure 2. Such a plot of hierarchical structure is
called a "dendrogram" and has a tree structure (James et al., 2013). The hierarchical
classification model has 2 clusters, one with 18 words and the other with 3 words.
Based on the structure of the dendrogram, the cut-off was set manually at 2 clusters,
because of the inter-connected relationship that the tree structure establishes among the
18 words. Cutting into 3 or more clusters yields inappropriate groups.
Second, the correlation analysis in Figure 2 shows that the top 6 keywords are mostly
positively correlated with each other, with few negative correlations, which keeps the
interpretation consistent. From the analysis, "covid" and "pandemic" have a comparatively
strong correlation (0.35) and form a cluster that can be interpreted as a combination.
There are also weaker positive correlations between "health" and "covid" (0.13) and
between "pandemic" and "health" (0.12), which form a group of words showing the
importance of health in the COVID-19 pandemic for Canadians.
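A correlogram of this kind is built from pairwise correlations between word-occurrence vectors. The sketch below assumes Pearson correlation on per-headline indicators, which the paper does not state explicitly; the indicator values and the name `pearson` are illustrative only.

```python
import math

# Hypothetical document-term indicators: one entry per headline, 1 if the word
# appears in that headline. Illustrative values only.
occurs = {
    "covid":    [1, 1, 1, 0, 1, 0],
    "pandemic": [1, 1, 0, 0, 1, 0],
    "impact":   [0, 0, 1, 1, 0, 1],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Full correlation table, keyed by ordered word pairs.
words = list(occurs)
corr = {(a, b): pearson(occurs[a], occurs[b]) for a in words for b in words}
```

In this toy data "impact" appears exactly where "pandemic" does not, so their correlation is -1, while "covid" and "pandemic" are positively correlated.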
On the other hand, the word "impact" is negatively correlated with the word "health",
indicating that the co-occurring words do not stem from health issues for "Canadians"
(reported correlation value: 0.27). Similarly, the hierarchical clustering shows that
"Canadian", "Canada" and "business" are strongly connected with "impact" in the same
hierarchy, indicating that the "impact" in question is on "business".
Figure 2. (a) A circular hierarchical clustering plot of the top 21 most frequently
used words in the news headlines. (b) A correlogram of the correlations between the top 6
ranked most frequently used words.
Consistent with the wordcloud analysis in Figure 1, the words "examines", "study",
"using", and "survey" are grouped together in the same hierarchy, forming a cluster which
suggests that the majority of people are interested in quantitative results and analysis
of the COVID-19 impact on Canadians.
This raises the questions of how much variability the clustering accounts for and how
its interpretation compares with other clustering algorithms, which leads to the second
part of the analysis, using k-medoids clustering.
Chapter 5
Machine learning in data visualization 2: K-medoid clustering
5.1 Literature review on concept of K-medoid clustering analysis
The k-medoids algorithm partitions the data into groups and attempts to minimize the
distance between the points in a cluster and the point designated as the center of that
cluster. Unlike k-means, where the center of a cluster is not necessarily one of the
input data points but the average of the points in the cluster, k-medoids chooses actual
data points as centers, which enhances interpretability (Schubert & Rousseeuw, 2019).
The k-medoids algorithm was introduced by Leonard Kaufman and Peter J. Rousseeuw with
their PAM (Partitioning Around Medoids) algorithm, which is also the name of the function
used to apply k-medoids (Kaufman & Rousseeuw, 2008). The greedy algorithm in k-medoids, a
heuristic for identifying the clusters, differs from the hierarchical clustering
algorithm in that many solutions exist and many iterations may be needed in practice
(Schubert & Rousseeuw, 2019).
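To make the idea of medoids concrete, the following sketch uses a simple alternating heuristic: assign each point to its nearest medoid, then re-pick each cluster's medoid as the member with the smallest total distance to the others. This is deliberately simpler than PAM's swap search and is not the implementation used in the paper; the data, the starting medoids, and the function names are assumptions.

```python
import math


def dist(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign(points, medoids):
    """Label each point with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda m: dist(p, points[medoids[m]]))
            for p in points]


def k_medoids(points, initial_medoids, max_iter=100):
    """Alternate between assigning points and re-picking each cluster's medoid."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(points, medoids)
        new_medoids = []
        for m in range(len(medoids)):
            members = [i for i, lab in enumerate(labels) if lab == m]
            # The medoid is the member with the smallest total distance
            # to the other members of its cluster.
            best = min(members,
                       key=lambda i: sum(dist(points[i], points[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids)


# Hypothetical 2-D points forming two obvious groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
medoids, labels = k_medoids(data, initial_medoids=[0, 1])
```

The returned medoids are indices into `data`, so every cluster center is an actual observation, which is the property that distinguishes k-medoids from k-means. The result of such a heuristic can depend on the starting medoids.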
5.2 k-medoids and determination for number of clusters
Various methods can be used to determine the optimal number of clusters. One of the
most commonly used is the "elbow" method, which calculates how much of the variability in
the data is explained by the clustering. Although the explained variability always
increases with the number of clusters, we take the point after which the increase slows
sharply as the optimal cut-off for the number of clusters to use in the algorithm (Kumar
& Paul, 2016).
For example, Figure 3 shows the explained variance for 1 to 7 clusters. After 3
clusters the increase in explained variance becomes slower, with the sharpest increase
occurring between 2 and 3 clusters. Hence, we choose k = 3.
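The quantity plotted in an elbow curve can be sketched as below. The paper does not state its exact formula, so this example assumes the common definition of explained variance as one minus the within-cluster sum of squares over the total sum of squares, both taken about means; the points and the hand-written labelings are hypothetical stand-ins for fitted clusterings.

```python
def explained_variance(points, labels):
    """Between-cluster share of the total sum of squares: 1 - WSS / TSS."""
    def centroid(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    def sum_sq(rows, center):
        return sum((a - c) ** 2 for row in rows for a, c in zip(row, center))

    total = sum_sq(points, centroid(points))
    within = 0.0
    for lab in set(labels):
        members = [p for p, l in zip(points, labels) if l == lab]
        within += sum_sq(members, centroid(members))
    return 1.0 - within / total


# Hypothetical 2-D points and hand-written labelings for k = 1, 2, 3.
data = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (20.0, 0.0), (20.0, 2.0)]
labelings = {
    1: [0, 0, 0, 0, 0, 0],
    2: [0, 0, 0, 0, 1, 1],
    3: [0, 0, 1, 1, 2, 2],
}
curve = {k: explained_variance(data, labs) for k, labs in labelings.items()}
# Gain in explained variance from adding one more cluster.
gains = {k: curve[k] - curve[k - 1] for k in curve if k - 1 in curve}
```

With a single cluster the explained variance is 0, and the elbow is read off where the entries of `gains` drop sharply.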
First of all, no clusters overlap with each other, indicating an adequate fit to the
data. Also, with 3 clusters, 45.64% of the variability in the data, about half, is
accounted for (Figure 3).
Figure 3. (a) The elbow-method plot of explained variability against the number of
clusters for k-medoids. (b) The k-medoids clustering of the data with k = 3.
Similar to the hierarchical clustering analysis, the words "covid" and "pandemic" are
separated from the major cluster (cluster 2), which contains 18 words in the data (Table
3). However, the word "health" is contained in the major cluster (cluster 2), indicating
that the socio-economic issues include health problems as well. From the perspective of
headlines, "covid" and "pandemic" act as the title, while the other 18 words act as
subtitles and specific issues, which suggests a hierarchy of topics. The subtopics are the
actual impacts on the living and social life of Canadians in 2020; those impacts were
severe because of the COVID-19 pandemic, which changed so many things that the issues
almost always contain the words "covid" and "pandemic".
Number of clusters   1     2     3     4     5     6     7
Variance explained   0     0.14  0.27  0.38  0.46  0.5   0.56
Table 2. Variance explained by increasing number of clusters
Word     covid  pandemic  Canada  impact  Canadians  business  impacts
Cluster  1      3         2       2       2          2         2

Word     people  health  Canadian  economics  survey  article  data  examine
Cluster  2       3       2         2          2       2        2     2
Table 3. Table of words classified by clusters in k-medoids
From Figure 2 and Figure 3, we see that k-medoids and hierarchical clustering yield
similar interpretations: there are certain topics and subtopics of issues that match what
was observed in the wordcloud. The statistical results show meaningful relations between
words, reflecting how people perceived their living conditions and the issues they
actually dealt with during the COVID-19 pandemic.
So far, we have not analyzed any causal relationship using the time series of
publication dates, to investigate whether words differ statistically in when they appear.
Hence, we apply local smoothing regression to the one-year time series to observe how the
trends of words over the 12 months differ and whether the appearance of one word affects
the appearance of another.
Chapter 6
Time series data analysis: local smoothing regression
6.1 Literature review on concept of local smoothing regression
Local smoothing regression (also known as loess: locally estimated scatterplot
smoothing), which is closely related to the Savitzky-Golay filter, is a non-parametric
regression method that combines a generalized moving average (MA) with polynomial
regression (Press & Teukolsky, 1990). It fits a smooth, generally non-linear curve to the
actual data points (Hardle et al., 2012).
For different classes of words, we fit the local smoothing regression to identify
changes in trend and the points of increase and decrease, in order to see whether one word
has an impact on another and so caused some issue in 2020. Because headlines directly
reflect the issues, this analysis lets us check whether a causal relation can be
identified and applied to the graphical interpretation of the statistical models.
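The core of local smoothing regression is a weighted least-squares fit repeated at every target point. The sketch below uses tricube weights and a local straight line (degree 1) for brevity, whereas R's loess fits local quadratics by default; the monthly counts and the function names are hypothetical.

```python
def tricube(u):
    """Tricube kernel weight for a scaled distance u in [0, 1]."""
    return (1.0 - abs(u) ** 3) ** 3 if abs(u) < 1.0 else 0.0


def local_linear_fit(xs, ys, x0, span=0.75):
    """Fit a weighted straight line around x0 and return its value at x0."""
    n = len(xs)
    k = max(2, int(round(span * n)))           # number of neighbours used
    dists = sorted(abs(x - x0) for x in xs)
    bandwidth = dists[k - 1] or 1.0            # distance to the k-th neighbour
    w = [tricube((x - x0) / bandwidth) for x in xs]

    # Weighted least squares for y = a + b * (x - x0); the fit at x0 is a.
    sw = sum(w)
    sx = sum(wi * (x - x0) for wi, x in zip(w, xs))
    sxx = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    sy = sum(wi * y for wi, y in zip(w, ys))
    sxy = sum(wi * (x - x0) * y for wi, x, y in zip(w, xs, ys))
    det = sw * sxx - sx * sx
    if det == 0.0:
        return sy / sw                         # fall back to a weighted mean
    return (sxx * sy - sx * sxy) / det


# Hypothetical monthly counts of one keyword (months 1..12).
months = list(range(1, 13))
counts = [2, 5, 9, 14, 16, 15, 12, 11, 9, 8, 6, 5]
smooth = [local_linear_fit(months, counts, m) for m in months]
```

Fitting one such curve per keyword and overlaying them is what allows the monthly trends of different words to be compared.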
6.2 Result of local smoothing regression
Figure 4. A time series analysis by local smoothing regression of the top 7 ranked most
frequently used words. Note that the paper uses the "ggplot" function with the loess
option for local smoothing regression to plot the result.
Overall, the result shows that causal or time-series-based reasoning cannot be used for
inference, as the trend of increase and decrease is the same for all 7 words.
Nevertheless, the word "Canadian" lags slightly, reaching its higher number of appearances
between July and August, compared with other words such as "covid" and "pandemic".
Even though the word "Canada" follows the trend of the other words exactly, "Canadian"
issues appear more frequently in the headlines from July to August, as does the word
"health", indicating that the impact on Canadian people's health was most severe in July
to August of 2020. Furthermore, we can argue that the decline in the word "Canada" after
June shifted to the word "Canadian", as people aimed to describe more specific interests
or issues.
We are still unable to draw a conclusive argument about causal inference for the words,
but we can distinguish some minor differences between topics and so identify the ideas
that were popular among people in 2020.
Chapter 7
Gaussian Graphical Models: application in unstructured text data
A graph, in applied mathematics, is defined by a pair of sets: vertices (or nodes) and
edges. It is applied in a statistical model to represent the dependencies and structural
connections between words, which are the vertices (Hojsgaard, Edwards & Lauritzen, 2012).
By applying an undirected graph and parametric modeling to the text data, we aim to define
clearly the structural dependencies and relationships within the headline sentences that
make up the data.
7.1 Literature review on the Gaussian Graphical Model
For massive, complicated text data, which require characterizing the relationships
among a large number of variables, Gaussian graphical models explicitly capture the
statistical dependencies between the variables of interest in the form of a network graph.
As a consequence of the central limit theorem (CLT), quantities derived from unstructured
data can be approximated by a Gaussian distribution. Hence, assuming Gaussianity, the
distribution is determined by its first and second moments, and the graph can be described
by the sparsity pattern of the concentration matrix (Uhler, 2017).
First, each node in the graph corresponds to one of the variables in the text data.
Missing edges in the graph correspond to conditional independence relations in the
corresponding Gaussian graphical model. To search for the dependencies and construct the
network graph, a popular method is to take a stepwise approach (Scutari & Denis, 2021).
We start with the complete graph (where all keywords are connected) and run a backward
selection: we cycle through the possible edges and remove an edge if doing so decreases
some criterion, which is the BIC value in our model. This criterion penalizes the
likelihood according to the model complexity (Scutari & Denis, 2021).
The problem with backward stepwise selection is its limited coverage of the search
space, since it iterates naively over big data. Hence, we also apply a threshold to the
partial correlations and remove every edge whose partial correlation is less than the
given threshold (Uhler, 2017).
Note that the paper uses the "cmod" function in the "gRim" package in R together with
the "stepwise" function to construct the network graph. The partial correlations are
computed with the function "cov2pcor" from the R package "gRbase" (Højsgaard, 2009). The
analysis follows the textbook "Graphical Models with R" (Højsgaard, Edwards & Lauritzen,
2012).
7.2 Basic undirected network analysis: Undirected Gaussian Graphical Model
A graphical model is constructed from the multivariate data of the extracted keywords.
With the Gaussian graphical model, the dependency structure between mutually related
keywords can be shown clearly, providing a supplementary interpretation of the
co-occurrence of the keywords and the meaning behind the issues.
For the keywords $y_1, \dots, y_7$, which are assumed to follow a multivariate normal
distribution $N_7(\mu, \Sigma)$, the inverse of the covariance matrix (Table 4)
$$
\Sigma^{-1} =
\begin{pmatrix}
k_{11} & \cdots & k_{17} \\
\vdots & \ddots & \vdots \\
k_{71} & \cdots & k_{77}
\end{pmatrix}
$$
is used to compute the partial correlation between two variables $y_u$ and $y_v$, that is
$$
\rho_{uv \mid V \setminus \{u,v\}} = -\frac{k_{uv}}{\sqrt{k_{uu}\, k_{vv}}}
$$
(Højsgaard, Edwards & Lauritzen, 2012).
Along with the threshold set to disconnect keywords, a stepwise backward model selection
procedure using the BIC criterion, starting from the saturated model, yields the
model in Figure 5. Three keywords, "impact", "pandemic", and "article", each have
4 connections and thus heavily influence other words. In other
words, these words are important to other words through their influence on the topics. This
relationship is shown quantitatively by the partial correlation matrix:
7.2.1 Partial correlation: Gaussian Graphical Models
               covid  pandemic   article    Canada  Canadian    health    impact
covid         100.00     73.00      2.00    -16.00     58.00     -8.00     46.00
pandemic       73.00    100.00     40.00     60.00    -13.00    -28.00    -69.00
article         2.00     40.00    100.00    -72.00    -15.00     75.00     68.00
Canada        -16.00     60.00    -72.00    100.00    -11.00     52.00     84.00
Canadian       58.00    -13.00    -15.00    -11.00    100.00     39.00      9.00
health         -8.00    -28.00     75.00     52.00     39.00    100.00    -51.00
impact         46.00    -69.00     68.00     84.00      9.00    -51.00    100.00
Table 4. Partial correlation in Gaussian graphical models (values scaled by 100)
Some insignificant connections can be confirmed from the partial correlations (Table
4). The words "covid" and "article" are treated as conditionally independent (partial
correlation of 2), and "Canadian" and "impact" are also treated as conditionally
independent (partial correlation of 9). Hence, there are no edges between them, which
reflects a weak association between these words in the issues.
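The thresholding step can be illustrated on the values of Table 4. The sketch below lists the keyword pairs whose partial correlation falls below a cutoff in absolute value. Both the cutoff of 10 and the use of the absolute value are assumptions made for illustration, since the paper does not report the threshold it used.

```python
from itertools import combinations

words = ["covid", "pandemic", "article", "Canada", "Canadian", "health", "impact"]

# Partial correlations scaled by 100, copied from Table 4.
pcor = [
    [100, 73, 2, -16, 58, -8, 46],
    [73, 100, 40, 60, -13, -28, -69],
    [2, 40, 100, -72, -15, 75, 68],
    [-16, 60, -72, 100, -11, 52, 84],
    [58, -13, -15, -11, 100, 39, 9],
    [-8, -28, 75, 52, 39, 100, -51],
    [46, -69, 68, 84, 9, -51, 100],
]


def weak_pairs(matrix, labels, cutoff):
    """Return (label, label, value) for pairs with |partial correlation| < cutoff."""
    return [
        (labels[i], labels[j], matrix[i][j])
        for i, j in combinations(range(len(labels)), 2)
        if abs(matrix[i][j]) < cutoff
    ]


CUTOFF = 10  # hypothetical cutoff; not a value reported in the paper
dropped = weak_pairs(pcor, words, CUTOFF)
```

With this hypothetical cutoff, `dropped` contains the two pairs discussed above, covid-article (2) and Canadian-impact (9), as well as covid-health (-8). The edges of Figure 5 also depend on the BIC-based stepwise selection, so this list need not match the figure exactly.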
7.2.2 Interpretation of the Gaussian graphical model
The appeal of the Gaussian graphical model lies in its structural interpretation. Since
an edge indicates dependence between two vertices, we are able to read off
conditional independence relationships between sets of words as a practical inference
(Kim et al., 2019).
For example, consider the reduced structure in which the words "Canadian" and "covid"
have no edges to "article" and "Canada" and are connected to them only through the
nodes "impact" and "pandemic". This gives the conditional independence relationship
(Canadian, covid) ⊥ (article, Canada) | (impact, pandemic).
This gives the interesting result that "article" does not indicate that an issue is related to
"Covid-19" unless the words "impact" and "pandemic" are present.
Similarly, the "health" issues are not related to the words "impact" and "Canada"
except through the word between them, "article". In summary,
the relationship is (health) ⊥ (impact, Canada) | (article), as read from the connecting
edges, which indicate the correlation between nodes.
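Statements of this kind can be read off an undirected graph by checking separation: two sets of nodes are conditionally independent given a third set if every path between them passes through that third set. The sketch below implements this check with a breadth-first search. The edge list is a hypothetical fragment chosen to mirror the "health" example and is not the edge set of Figure 5.

```python
from collections import deque


def separated(edges, A, B, S):
    """True if every path between node sets A and B passes through the set S."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    blocked = set(S)
    seen = {a for a in A if a not in blocked}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in B:
            return False  # reached B without crossing S
        for neighbour in adjacency.get(node, ()):
            if neighbour not in blocked and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return True


# Hypothetical edges (an assumption for illustration, not read from Figure 5).
edges = [
    ("health", "article"),
    ("article", "impact"),
    ("article", "Canada"),
    ("impact", "Canada"),
]
given_article = separated(edges, {"health"}, {"impact", "Canada"}, {"article"})
given_nothing = separated(edges, {"health"}, {"impact", "Canada"}, set())
```

In this toy graph, "health" is separated from "impact" and "Canada" once "article" is given, but not when nothing is conditioned on, which is the pattern described for (health) ⊥ (impact, Canada) | (article).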
Another relationship we can read from Figure 5 is
(Canadian) ⊥ (impact, pandemic) | (covid), meaning that "Canadian" issues necessarily
combine with the word "covid" to refer to the issues of "pandemic" and "impact".
Also, the relationship "pandemic-Canada-impact" is conditionally independent of
"health" given the word "article". In other words, almost all articles that deal
with "health" issues are relevant to "pandemic" and "impact" in 2020.
Alternatively, we can read from Figure 5 the relationship (health,
article) ⊥ (Canadian, covid) | (impact, pandemic), meaning that "health" and "article"
necessarily combine with the words "impact" and "pandemic" to refer to the issues of
"Canadian" and "covid".
Figure 5. Undirected Gaussian graphical model showing the dependency
structure of the top 7 keywords and how they are related to each other.
Furthermore, it is easy to notice that the words "article", "impact", "pandemic", and
"Canada" are connected to one another in formulating the issues.
Therefore, the Gaussian graphical model is an interesting model that yields interpretable
results when applied to text mining data. Although some published journal papers apply
the Gaussian graphical model to industrial data, the method is not as widely used as other
well-known techniques such as LDA (latent Dirichlet allocation) or topic clustering
(Kim & Jun, 2015). One limitation of the model is that the variables analyzed are solely
phrases or words, meaning that the analysis is not done at the global document level but
rather on split words.
Chapter 8
Network Graph in topic clustering:
BTM (Biterm Topic Modelling)
First of all, we apply the biterm topic model (BTM), an advanced model in text analysis,
with a focus on statistical interpretation rather than on the fundamentals of the modeling,
given the limited scope and the purpose of this paper.
As titles and other short texts contain few words and are extremely sparse within a single
document, co-occurrences and relationships are difficult to detect
when some significant words appear less often than other words
in the document. Although it may be possible to pre-process the documents or reduce
their size, subjective aggregation and preprocessing take considerable developer effort
and can run into difficulties.
Hence, we use BTM, a cutting-edge technique for topic clustering that formulates
topics and has been reported to give the most precise results in several papers
(Pietsch & Lessmann, 2018).
8.1 Literature review on the biterm topic model (BTM)
The biterm topic model (BTM) was first introduced in 2013 to address
the inadequacies of topic modelling on short documents by modelling co-occurrences
over the whole corpus rather than at the document level. BTM is now regarded as one of
the best methods for topic clustering of short texts, as it is a probabilistic generative
model of the generation of biterms (Du et al., 2017).
In the algorithm, biterms are extracted from each document using a
window size: biterms are unordered term pairs that occur within the pre-specified
window. As the term pairs are unordered, the biterm quantities can be
summarized by integer counts of each pair.
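The extraction described above can be sketched in a few lines of Python. The window convention used here (two tokens form a biterm when their positions are fewer than `window` apart) is an assumption for illustration and may differ from the convention of the R packages used in this paper; the tokenized title is also hypothetical.

```python
from collections import Counter


def extract_biterms(tokens, window):
    """Count unordered term pairs whose positions are fewer than `window` apart."""
    counts = Counter()
    for i, first in enumerate(tokens):
        # Pair the token with the next (window - 1) tokens only.
        for second in tokens[i + 1 : i + window]:
            # Sorting makes the pair unordered: (a, b) and (b, a) share one key.
            counts[tuple(sorted((first, second)))] += 1
    return counts


# Hypothetical tokenized title (an assumption for illustration).
title = ["impact", "covid", "pandemic", "health"]
biterms = extract_biterms(title, window=3)
```

With a window of 3, "impact" and "health" are too far apart to form a biterm, while every other pair in the title is counted once. Summing such counters over all titles gives the integer biterm counts that the model is fitted to.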
The first step in learning the latent topic components from co-occurrences is to model
the generation of biterms. Assuming that the topics are sampled from a mixture
model, each biterm is sampled from its assigned topic independently of the
others. Then, all biterms are randomly assigned to topics and the assignments are updated
sequentially over iterations, with the number of times each term is assigned to each topic
counted for the posterior estimates. More details are given in Gao, Kim and Sakurai (2016).
The main advantage of BTM lies in topic coherence. In assessing topic quality, the
coherence measured for BTM topics is higher than for the competing algorithms, which
supports the superiority of BTM in text mining analysis.
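The sampling scheme described above can be sketched as a small collapsed Gibbs sampler. This is a simplified illustration rather than the implementation in the R package BTM: the update weight is the commonly stated conditional for BTM as we understand it, and the hyperparameters `alpha` and `beta`, the function name `btm_gibbs`, and the toy biterms are assumptions.

```python
import random


def btm_gibbs(biterms, n_topics, vocab_size, alpha=1.0, beta=0.01, n_iter=50, seed=0):
    """Assign each biterm (pair of integer word ids) to a topic by Gibbs sampling."""
    rng = random.Random(seed)
    n_z = [0] * n_topics                                # biterms per topic
    n_wz = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    z = []
    # Random initial assignment of every biterm to a topic.
    for w1, w2 in biterms:
        k = rng.randrange(n_topics)
        z.append(k)
        n_z[k] += 1
        n_wz[k][w1] += 1
        n_wz[k][w2] += 1
    for _ in range(n_iter):
        for b, (w1, w2) in enumerate(biterms):
            # Remove the biterm from its current topic before resampling.
            k = z[b]
            n_z[k] -= 1
            n_wz[k][w1] -= 1
            n_wz[k][w2] -= 1
            # Unnormalized conditional weight of each topic for this biterm.
            weights = []
            for t in range(n_topics):
                total = 2 * n_z[t] + vocab_size * beta
                weights.append(
                    (n_z[t] + alpha)
                    * (n_wz[t][w1] + beta)
                    * (n_wz[t][w2] + beta)
                    / (total * total)
                )
            k = rng.choices(range(n_topics), weights=weights)[0]
            z[b] = k
            n_z[k] += 1
            n_wz[k][w1] += 1
            n_wz[k][w2] += 1
    return z, n_z, n_wz


# Toy biterms over a 4-word vocabulary (hypothetical data for illustration).
toy_biterms = [(0, 1), (0, 1), (2, 3), (2, 3), (0, 1), (2, 3)]
z, n_z, n_wz = btm_gibbs(toy_biterms, n_topics=2, vocab_size=4)
```

After sampling, the counts give the posterior estimates mentioned above. For example, a topic-term probability can be estimated as `(n_wz[k][w] + beta) / (2 * n_z[k] + vocab_size * beta)`, since each biterm contributes two words to its topic.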
The R package BTM was used to perform the biterm topic modeling, and the
analysis follows the guidelines in the package documentation
(Wijffels, 2020). The steps for analyzing text data with BTM are as follows:
1. Crawl the plain-text data, then pre-process and tokenize the inputs. This step uses the
R package "ctv" to process the document terms (Zeileis, Hornik & Zeileis,
2022). The output gives a unique tag for each sentence and the
characteristics of the words.
2. Perform tagging on the titles and extract co-occurrences of nouns, adjectives and verbs
within a distance of 3 words. This step uses the R package "udpipe", which tokenizes
the text and vectorizes it for sampling, to extract the
biterms (Wijffels, Straka & Straková, 2018). The
output gives the terms of each document.
3. Build the biterm topic model with 5 topics and provide the set of biterms to cluster.
This step uses the R package BTM to draw the topics and the clusters of
words that complete the topic clustering algorithm (Wijffels, 2020). This is the
important step in which the tuning parameters are set for the analysis.
4. Visualize the biterm topic clusters. The R package "ggraph" is used to turn the topic
clustering data into a graph automatically, with topic names chosen subjectively by
the user (Pedersen et al., 2017).
8.2 Topic clustering: result of BTM
Figure 6 shows 5 topics, which are labelled "economics issues",
"global influence and educational issues", "public health issues", "main keywords", and
"sociological issues". A larger node (term) corresponds to a higher topic-term
probability (Figure 6). Words such as "protective", "international", and "service" have
a high probability of appearing in their corresponding topics (Figure 6). Also, thicker
and darker edges (links) indicate more co-occurrences within the topic.
As identified previously, the words "covid" and "pandemic" have a high
co-occurrence in the main keywords (issue) topic. As biterms were computed using a window
size of 10, adjacent nodes may not necessarily have appeared adjacently in the original
text. For example, the words "outlook" and "postsecondary" rarely appear
in the data (Figure 6).
Figure 6. Biterm clusters for 5 topics. Each cluster is an undirected graph, as biterms
are unordered pairs of terms. Words are connected by their co-occurrences within a topic
and together form the topic.
Some unique words not observed before, including "international", "postsecondary",
and "student", do not appear in the frequency table (Table 1). As the topic model accounts
for words that appear less often in the data, we are now able to detect words related to
the topic of global influence and educational issues. Also, the economic issues and the
sociological issues are somewhat separated, which differs slightly from the results provided before.
Within the sociological issues topic, we are able to identify some natural
consequences of the Covid-19 pandemic, with words such as "medical", "protective",
"business" and "personal" that appear to be of interest to many people. With the added
words "outlook", "price" and "service", we can see the worries and living conditions of
Canadians in 2020 under the impact of Covid-19.
Similarly, the public health issues form their own topic with the words
"mental", "group", "health" and "visible". In Figure 6,
"mental" and "health" are closely connected, showing that the public health issues include
the mental health of Canadians. Note that this is consistent with Figure 5, where
"health" is separated from the other keywords and is connected only to particularly
relevant words, with weak relations to other words. From the analysis of topics,
we can clearly observe that some words are drawn from the text because of the nature
of biterm analysis. While the previous analyses readily identified the most frequent
words, BTM yields more comprehensive and grouped topics
of words through the application of mixture models.
Chapter 9
Conclusion and discussion
9.1 Discussion
Across the 5 different analyses of keywords, there are some variations and minor differences
between methods. Although the wordcloud is regarded as the most effective tool for portraying
frequencies in text data, its unstructured and qualitative form is
controversial. Therefore, the author provided a unique and efficient
sub-plot of the co-occurrence frequencies, as well as a table of the 7 most frequently used
words, to give a better understanding of the data. Hence, the
graph together with the table provides a clear result for the text data crawled
from Statistics Canada in 2020 on the issue of the Covid-19 pandemic.
The hierarchical clustering result shows only 2 clusters, but the k-medoids clustering
result shows 3 clusters to be the optimal choice for classification. Also, there was a
minor difference compared with the correlation analysis, but the results were generally
consistent. For the purpose of grouping keywords,
the goal of differentiating topic keywords from sub-topic keywords is achieved by
comparing the hierarchical clustering with the k-medoids clustering.
Furthermore, the results consistently separated the major cluster from
the minor clusters, where the minor cluster contained the same keywords,
"health" and "pandemic".
Lastly, the BTM result is shown for the topic clustering without the underlying basis of
statistical sampling in mixture models. Unfortunately, for the purpose of presenting the
topic clustering and its result, a detailed account of the steps in BTM is skipped
because it deviates from the topics of this paper. The goal is to construct a coherent
and consistent topic clustering model that informs readers of the major issues we can
draw from the BTM analysis of the data. The analysis succeeded in delivering specific
words and topics that were not observed among the most frequently used words, resulting
in a complete overview of the 5 major topics and the co-occurrences in each cluster.
9.2 Conclusion
The paper examines the results of 5 different methods for text mining data and 1 method
for time series data on the published dates. Each method differs in its
mathematical and statistical foundation, leading us to explore the data and to
different results of the analysis. This included analyzing term frequencies and term
co-frequencies, clustering, and formulating topic models in order to better understand
the topics of the extracted keywords throughout the Covid-19 pandemic in Canada.
First, we investigated the wordcloud to identify the keywords and their co-occurrences
in the data. Then, we applied the simple clustering algorithms of hierarchical
clustering and k-medoids clustering to group the keywords and investigate their correlation.
While wrangling and observing the data, we were able to
observe 2 groups of keywords, where the first group of main topics corresponds to the
most frequent words, "covid", "pandemic" and "health", which are general forms of the
topics.
The other group of sub-topics includes words such as "Canadians", "data", "survey",
"economics" and "business", from which we can find the sociological difficulties and issues
that many people confront. This result shows the importance of knowing how to cope
with socio-economic problems in the future when a similar type of pandemic occurs.
Second, we applied local smoothing regression to the time series data to investigate
whether there is a causal relationship behind the different trends in the appearance of
issues in the keyword data. From the data, we found that the trend
is similar for the top 7 most frequent words, indicating that differences in trend
are hard to analyze. However, we are able to draw the minor conclusion that
the issue of "Canada" for "covid" becomes the problem of "Canadian" around
July to August, the period in which the pandemic was most severe.
Third, we applied the Gaussian graphical model to draw conditional independence
relations between the top 7 words. Through the structural form of the undirected Gaussian
graphical model, we were able to differentiate how the co-occurrences of words in the
headlines are structured by conditional independence. One statistically significant
result shows that the word
"health" is not relevant to "covid" or "pandemics" without the words "impact" or
"pandemic". This result is consistent with the BTM (Biterm Topic Models
for Short Text), where "mental" and "health" form a biterm in the topic of public
health issues.
In regard to topic clustering, it was concluded that the topics learned by BTM were
more concise and specific than the clusters learned by hierarchical clustering and
k-medoids clustering. Hidden keywords contributed to the resulting topics, and the
specific topics clearly differentiated the relevant keywords.
We are able to find 5 topics, each distinct from the others, that show what
problems and issues Canadians faced during the Covid-19 pandemic
in 2020.
Bibliography
[1] Liu, S., Yang, L., Zhang, C., Xiang, Y. T., Liu, Z., Hu, S., & Zhang, B. (2020). Online
mental health services in China during the COVID-19 outbreak. The Lancet Psychiatry, 7(4), e17-e18. https://www.thelancet.com/journals/lanpsy/article/PIIS2215-0366(20)30077-8/fulltext
[2] World Health Organization. Coronavirus disease (covid-19). (2021). URL
https://www.who.int/news-room/q-a-detail/coronavirus-disease-covid-19.
[3] Agrawal, R., Imieliński, T., & Swami, A. (1993, June). Mining association rules
between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD
international conference on Management of data (pp. 207-216). Chicago
[4] Wijffels, J. (2020). Btm: Biterm topic models for short text. URL:
https://CRAN.R-project.org/package=BTM. R package version 0.3, 1.
[5] Halvey, M. J., & Keane, M. T. (2007, May). An assessment of tag presentation
techniques. In Proceedings of the 16th international conference on World Wide Web
(pp. 1313-1314).
[6] Guattari, F., & Deleuze, G. (1992). Tausend Plateaus. Kapitalismus und Schizophrenie.
[7] Gilles Deleuze, Felix Guattari (1992). Tausend Plateaus. Kapitalismus und
Schizophrenie. ISBN 978-3-88396-094-4.
[8] Agrawal, R., & Srikant, R. (1994, September). Fast algorithms for mining association
rules. In Proc. 20th int. conf. very large data bases, VLDB (Vol. 1215, pp. 487-499).
[9] Wong, P. C., Whitney, P., & Thomas, J. (1999, October). Visualizing association
rules for text mining. In Proceedings 1999 IEEE Symposium on Information Visualization (InfoVis’99) (pp. 120-123). IEEE.
[10] Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam’s
razor. Information processing letters, 24(6), 377-380.
[11] Garcia, Enrique (2007). "Drawbacks and solutions of applying association rule
mining in learning management systems" (PDF). Sci2s. Archived (PDF) from the
original on 2009-12-23.
[12] Akerkar, R. (Ed.). (2020). Big Data in Emergency Management: Exploitation
Techniques for Social and Mobile Data. Springer Nature.
[13] Congressional Research Service (2021). Unemployment rates during the COVID-19 pandemic. https://crsreports.congress.gov/product/pdf/R/R46554
[14] Gilfillan, G. (2020). Covid-19: Labour market impacts on key demographic groups,
industries and Regions. Department of Parliamentary Services Australia, Parliament
of Australia. https://www.voced.edu.au/content/ngv:90977
[15] World Bank. (2020). The COVID-19 crisis response. https://doi.org/10.1596/34571
[16] Nayak, J., Mishra, M., Naik, B., Swapnarekha, H., Cengiz, K., & Shanmuganathan,
V. (2021). An impact study of COVID-19 on six different industries: Automobile,
energy and Power, agriculture, education, travel and Tourism and Consumer Electronics. Expert Systems. https://doi.org/10.1111/exsy.12677
[17] Knaflic, C. N. (2015). Storytelling with data: A data visualization guide for business
professionals. John Wiley & Sons.
[18] Gershon, N., & Page, W. (2001). What storytelling can do for information visualization. Communications of the ACM, 44(8), 31-37.
[19] Charu, C. Aggarwal, ChengXiang Zhai. (2012). Mining Text Data.
[20] Becue-Bertaut, M. (2019). Textual data science with R. CRC Press.
[21] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to
statistical learning (Vol. 112, p. 18). New York: Springer.
[22] Kaufman, L., & Rousseeuw, P. J. (2008). Clustering large applications (Program
CLARA). Finding groups in data: an introduction to cluster analysis, 126-146.
[23] Schubert, E., & Rousseeuw, P. J. (2019, October). Faster k-medoids clustering: improving the PAM, CLARA, and CLARANS algorithms. In International conference
on similarity search and applications (pp. 171-187). Springer, Cham.
[24] Kumar, A., & Paul, A. (2016). Mastering text mining with R. Packt Publishing Ltd.
[25] Press, W. H., & Teukolsky, S. A. (1990). Savitzky-Golay smoothing filters. Computers in Physics, 4(6), 669-672.
[26] Hardle, W., Muller, M., Sperlich, S., & Werwatz, A. (2012). Nonparametric and
semiparametric models (Vol. 2). Berlin: Springer.
[27] Højsgaard, S., Edwards, D., & Lauritzen, S. (2012). Graphical models with R.
Springer Science & Business Media.
[28] Uhler, C. (2017). Gaussian graphical models: An algebraic and geometric perspective. arXiv preprint arXiv:1707.04345.
[29] Scutari, M., & Denis, J. B. (2021). Bayesian networks: with examples in R.
Chapman and Hall/CRC.
[30] Kim, J. M., Yoon, J., Hwang, S. Y., & Jun, S. (2019). Patent Keyword Analysis
Using Time Series and Copula Models. Applied Sciences, 9(19), 4071.
[31] Kim, J. M., & Jun, S. (2015). Graphical causal inference and copula regression
model for apple keywords by text mining. Advanced Engineering Informatics, 29(4),
918-929.
[32] Du, D., Li, L., Zhu, E., & He, K. (Eds.). (2017). Theoretical Computer Science:
35th National Conference, NCTCS 2017, Wuhan, China, October 14-15, 2017,
Proceedings (Vol. 768). Springer.
[33] Gao, H., Kim, J., & Sakurai, Y. (Eds.). (2016). Database Systems for Advanced
Applications: DASFAA 2016 International Workshops: BDMS, BDQM, MoI, and
SeCoP, Dallas, TX, USA, April 16-19, 2016, Proceedings (Vol. 9645). Springer.
[34] Le Pennec, E., & Slowikowski, K. (2019). ggwordcloud: A Word Cloud Geom
for ‘ggplot2’. R package version 0.5.0.
[35] Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A.
(2018). quanteda: An R package for the quantitative analysis of textual data. Journal
of Open Source Software, 3(30), 774.
[36] Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
[37] Wickham, H., & Wickham, M. H. (2007). The ggplot package. URL:
https://cran.r-project.org/web/packages/ggplot2/index.html.
[38] Højsgaard, S. (2009). On the usage of the gRim package.
[39] Zhu, Q., Feng, Z., & Li, X. (2018, January). GraphBTM: Graph enhanced autoencoded variational inference for biterm topic model. In Conference on Empirical
Methods in Natural Language Processing (EMNLP 2018).
[40] Pietsch, A. S., & Lessmann, S. (2018). Topic modeling for analyzing open-ended
survey responses. Journal of Business Analytics, 1(2), 93-116.
[41] Wijffels, J. (2020). Btm: Biterm topic models for short text. URL:
https://CRAN.R-project.org/package=BTM. R package version 0.3, 1.
[42] Zeileis, A., Hornik, K., & Zeileis, M. A. (2022). Package ‘ctv’.
[43] Wijffels, J., Straka, M., & Straková, J. (2018). Package ‘udpipe’.
[44] Pedersen, T. L., Pedersen, M. T. L., LazyData, T. R. U. E., Rcpp, I., & Rcpp, L.
(2017). Package ‘ggraph’. Retrieved January, 1, 2018.