Uploaded by Kevin Kyuson Lim

Summary of Big Data Analysis on Canadian News about the Covid-19 Toward Educational Impact

ABSTRACT
Covid-19 has changed many aspects of daily life in the more than 23 months since it spread across Canada. One of the major changes, in Canada as in the rest of the world, is the impact on education caused by the public health crisis. People have been isolated, and fear has spread through educational systems as attention has concentrated on public health issues. The purpose of this summary is to apply text mining to current Canadian news headlines about Covid-19 and its educational impact. The text mining methods include wordcloud visualization, association rules, and LDA topic modeling based on Gibbs sampling.
Keywords: wordcloud, LDA topic modelling, Covid-19, Canada, text mining
INTRODUCTION
COVID-19 has posed one of the most challenging situations for academics and students around the world. Because students are the biggest stakeholders in the education process, the challenges they face during this pandemic are especially important. These challenges vary in intensity and type depending on multiple factors. In this paper, we analyze and document the issues and keywords surrounding Covid-19 in Canadian educational systems, treated as a public health issue. News headlines are tokenized, and the resulting keywords are analyzed to identify the factors that featured in the coverage during COVID-19. The graph analysis showed a relationship between covid-19, government action, and educational lockdown during 2021.

The World Health Organization (WHO), in its media briefing on 11 March 2020, declared the spread of the contagious coronavirus disease, also referred to as COVID-19, a pandemic. COVID-19 stands for coronavirus disease 2019, and the virus has also been referred to as the 2019 novel coronavirus or ‘2019-nCoV’ (Bender, 2020). COVID-19 is linked to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) and can be similarly fatal (Meng, Hua & Bian, 2020). The virus can be transmitted within minutes through droplets, or by touching metal or other surfaces contaminated by an infected person with respiratory symptoms. Although the elderly and very young children are most easily affected, nobody is immune to this new infectious disease, so all people are susceptible to its devastating effects (Meng, Hua & Bian, 2020).

Since social distancing was a key recommendation to check the spread of COVID-19, educational institutions across the world were suddenly faced with the unprecedented challenge of continuing to provide some form of educational assistance to students despite the pandemic (Jeanne, Leonie & Parlo, 2020). Furthermore, institutional responses had to address not only educational needs but also economic and sustainability factors. The emerging responses varied with the level of education, demographics, infrastructure facilities, and the local impact of the pandemic (Mulenga & Marbán, 2020; Dhawan, 2020).

However, most educational systems decided to rely on some form of technology as a substitute modality. In 2020, the OECD Rapid Response report by Reimers and Schleicher surveyed 330 respondents across 98 countries about their educational responses to COVID-19. It was interesting to see many countries using radio and television to supplement and support online learning or learning at home: Finland, Argentina, and Russia (radio and television); and Austria, Belgium, Costa Rica, Croatia, the Czech Republic, Georgia, Indonesia, Israel, Latvia, Malaysia, Romania, and Saudi Arabia (television).

Reports conflict on how quickly the higher education sector responded. Some reported a rapid transition by 2 February (Leung & Sharma, 2020); others reported that the transition did not occur until mid to late February (McKenzie, 2020). As observed in the Global Survey Report of the International Association of Universities (IAU, 2020), COVID-19 affected teaching and learning at almost all higher education institutions, with two-thirds of them reporting that classroom teaching had been replaced by distance teaching and learning. Distance teaching is broadly categorized as online learning, blended learning, and remote learning. Given the nature of the pandemic, the categories adopted during this period have mainly been online learning and remote learning. Although the majority of educationalists often use the terms online learning and distance learning interchangeably, we define online learning as the delivery of course or program material through some form of digital device using the internet. In contrast, we identify the creation of online virtual synchronous classrooms as remote learning.

In 2020, the OECD Rapid Response report by Reimers and Schleicher pointed out that “The domains for which most people considered that an education response involved the most challenges were the availability of technological infrastructure, addressing student emotional health, addressing the right balance between digital and screen free activities and managing the technological infrastructure.”

The other challenge faced by institutions was readiness and preparedness, in terms of both public health and the students, for the entire distance teaching process. As Bao (2020) pointed out, moving all existing courses to an online environment in a matter of days was a massive, disruptive shift. In this paper, we try to delve deeper into these issues, particularly for higher education. This was achieved by analyzing news headlines that represent concurrent issues during the COVID-19 pandemic. The resulting keywords highlight the factors that are vital for public health during pandemics. Documenting such results can provide helpful insight for online higher education. These results can also be beneficial in the event of any future sudden disruption of higher education due to unprecedented circumstances.

METHODS
PART 1. WEB CRAWLING

Steps for crawling Canadian news about covid-19 and education:
1. First, on the Yahoo Canada website, search for the relevant topic, ‘Covid education’.
2. Second, inspect the HTML source code to find the markup for the headlines, which represent the content of interest on the impact of Covid-19 on education.
3. Third, using a program, crawl 50 pages of results, for a total of 260 news headlines.

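The second and third crawling steps can be illustrated with a small sketch. The paper's analysis was carried out in R; the example below uses Python's standard library purely for illustration. The tag name `h4`, the class name `title`, and the sample page fragment are hypothetical, since the actual markup of the results page is not given in the text.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collect the text of every element with a given tag and CSS class."""

    def __init__(self, tag, css_class):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.depth = 0        # greater than 0 while inside a matching element
        self.buffer = []      # text fragments of the current headline
        self.headlines = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth:
            if tag == self.tag:
                self.depth += 1   # nested element with the same tag name
        elif tag == self.tag and self.css_class in classes:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                text = " ".join("".join(self.buffer).split())
                if text:
                    self.headlines.append(text)
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


# Hypothetical page fragment; a real results page will use different markup.
SAMPLE_PAGE = """
<div><h4 class="title"><a href="#">Ontario schools closed amid lockdown</a></h4>
<p class="summary">Not a headline</p>
<h4 class="title"><a href="#">B.C. students move online</a></h4></div>
"""

parser = HeadlineParser(tag="h4", css_class="title")
parser.feed(SAMPLE_PAGE)
parser.close()
headlines = parser.headlines
print(headlines)
```

In a real crawl the same parser would be fed the HTML of each of the result pages in turn, and the collected headlines would be concatenated into one list.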
Frequency analysis of news headlines and wordcloud analysis (macroscopic outline)
The tokenized words of the headlines are compiled into a frequency table. The R output shows the 50 most frequently used words in the headlines of 507 news items (the first and second crawls combined) returned by a search for ‘covid education’ in Canadian news.

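As a rough illustration of how tokenized headlines become a frequency table, the following Python sketch may help (the paper itself used R). The tokenization rule and the three sample headlines are assumptions made for the example, not the paper's data.

```python
import re
from collections import Counter


def tokenize(headline):
    """Lowercase a headline and split it into word tokens.

    Internal dots, hyphens and apostrophes are kept, so that 'B.C.' becomes
    'b.c' and 'covid-19' stays one token (an assumed rule for illustration).
    """
    return re.findall(r"[a-z0-9]+(?:[.'-][a-z0-9]+)*", headline.lower())


# Made-up headlines, standing in for the crawled data.
sample_headlines = [
    "Ontario schools closed amid lockdown",
    "B.C. students move online as schools close",
    "New covid-19 cases in Ontario schools",
]

freq = Counter()
for headline in sample_headlines:
    freq.update(tokenize(headline))

# The most frequent words; the paper reports the top 50.
for word, count in freq.most_common(5):
    print(word, count)
```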
PART 2. WORDCLOUD VISUALIZATION
Wordclouds (also known as tag clouds or word art) are used to visualize and summarize all sorts of data, from feedback to academic papers and everything in between.

There are many word cloud generators to choose from, each with its own design and customization capabilities.

Function words in the covid-19 headlines (‘to’, ‘in’, ‘for’, ‘of’, ‘and’, ‘on’, and ‘as’), referred to here as adverbs, are removed as unnecessary detail, and only the remaining commonly used words are carried forward for further investigation in the topic modeling section.

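The filtering step and the idea that word size follows frequency can be sketched as follows. This is a Python illustration only; the linear scaling between `min_size` and `max_size` is an assumption, and actual wordcloud packages may scale sizes differently.

```python
from collections import Counter

# The words listed in the text as removed before drawing the wordcloud.
STOP_WORDS = {"to", "in", "for", "of", "and", "on", "as"}


def wordcloud_sizes(tokens, min_size=10, max_size=60):
    """Map each remaining word to a font size proportional to its frequency.

    The most frequent word gets max_size; the scaling rule is an assumption.
    """
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    if not counts:
        return {}
    top = max(counts.values())
    return {
        word: min_size + (max_size - min_size) * count / top
        for word, count in counts.items()
    }


# Made-up tokens for illustration.
tokens = ["schools", "in", "ontario", "schools", "closed", "for", "lockdown"]
sizes = wordcloud_sizes(tokens)
print(sizes)
```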
PART 3. WORD CONNECTION (VISUALIZATION)
Social network analysis (SNA) is a process of quantitative and qualitative analysis of a social network. SNA measures and maps the flow of relationships and relationship changes between knowledge-possessing entities. Simple and complex entities include websites, computers, animals, humans, groups, organizations and nations.

SNA (social network analysis) for intersected words (microscopic outline)

Based on the bags of words extracted, the words are analyzed for how often they are used together.

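One way to picture the "intersected usage" of words is as a co-occurrence count: two words are connected each time they appear in the same headline. The Python sketch below is an illustration under that assumption, with made-up token lists; the paper's own matrix was built in R.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(token_lists):
    """Count, for every pair of words, the number of headlines containing both.

    Pairs are stored as alphabetically sorted tuples, so (a, b) and (b, a)
    are the same edge of the word network.
    """
    pairs = Counter()
    for tokens in token_lists:
        for a, b in combinations(sorted(set(tokens)), 2):
            pairs[(a, b)] += 1
    return pairs


# Made-up tokenized headlines.
token_lists = [
    ["ontario", "schools", "lockdown"],
    ["ontario", "schools", "online"],
    ["students", "online"],
]

pairs = cooccurrence(token_lists)

# Edges of the word network: pairs seen together in at least two headlines.
edges = [(pair, n) for pair, n in pairs.items() if n >= 2]
print(edges)
```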
PART 4. TOPIC MODELING (LDA: LATENT DIRICHLET ALLOCATION ANALYSIS)
What are some of the real-world uses of topic modelling? Historians can use LDA to identify important events in history by analyzing text based on year. Web-based libraries can use LDA to recommend books based on your past readings. News providers can use topic modelling to understand articles quickly or cluster similar articles.

Latent Dirichlet Allocation (LDA) is a tool and technique for topic modeling. It classifies or categorizes text into topics per document and words per topic, modeled on Dirichlet distributions and processes.

The LDA makes two key assumptions:
1. Documents are a mixture of topics, and
2. Topics are a mixture of tokens (or words)

These topics generate the words according to a probability distribution. In statistical language, documents are probability distributions over topics, and topics are probability distributions over words.

The end goal of LDA is to find the best representation of the Document-Topic matrix and the Topic-Word matrix, that is, the most likely Document-Topic distribution and Topic-Word distribution.

Because LDA assumes that documents are a mixture of topics and topics are a mixture of words, it backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.

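To make the two assumptions concrete, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python for illustration. It is not the paper's implementation (the paper cites the R package topicmodels); the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling.

    Returns the vocabulary, the Document-Topic distributions (theta) and the
    Topic-Word distributions (phi).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    n_words = len(vocab)

    doc_topic = [[0] * n_topics for _ in docs]             # tokens of doc d in topic k
    topic_word = [[0] * n_words for _ in range(n_topics)]  # times word v is in topic k
    topic_total = [0] * n_topics                           # tokens in topic k
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                k = assignments[d][i]
                # Remove the token, then resample its topic given all others.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + n_words * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
        for row, doc in zip(doc_topic, docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + n_words * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return vocab, theta, phi


# Made-up tokenized headlines.
docs = [
    ["students", "online", "news", "education"],
    ["schools", "lockdown", "may", "pandemic"],
    ["shooting", "death", "funerals", "new"],
    ["students", "online", "education", "cases"],
]
vocab, theta, phi = lda_gibbs(docs, n_topics=3)

# Print the highest-probability words of each topic.
for k, row in enumerate(phi):
    top = sorted(range(len(vocab)), key=lambda v: row[v], reverse=True)[:3]
    print(k, [vocab[v] for v in top])
```

Each row of `theta` is one document's mixture of topics and each row of `phi` is one topic's mixture of words, which corresponds to the Document-Topic and Topic-Word matrices described above. With so few documents the recovered topics are not meaningful; the sketch only shows the mechanics.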
RESULTS
The overall results present a general analysis of news keywords. The word connections and ideas come from several statistically significant factors, discussed below, which need to be either maintained or improved in any future public health crisis or spread of respiratory disease.

WORDCLOUD VISUALIZATION 1
The R wordcloud shows the statistically significant words through their frequencies, with adverbs deleted. The size of each word in the graph makes it easy to find the significant words related to ‘covid’ and ‘education’. Some inferences can be drawn from the wordcloud.

Some words, including ‘pandemic’, ‘season’, ‘schools’, ‘online’, ‘lockdown’, ‘closed’ and ‘students’, show the educational perspective of the covid-19 impact in Canada. The words ‘Ontario’ and ‘B.C’ show which locations attract interest.

The wordcloud also shows the recent terror of the ‘shooting’ in ‘Nova’ ‘Scotia’, along with ‘funeral’ and ‘victims’, which reflects what else held people’s interest during the covid-19 pandemic.

Other words that attract people’s attention include ‘mass’, ‘measures’, ‘ramping’, ‘says’, ‘news’, and ‘season’.

It can also be inferred that ‘cases’, ‘week’, ‘during’ and ‘amid’ show the periodic intervals of the covid-19 impact, which come from people’s cognitive understanding.

In these words, the news reflects people’s trends and ideas about ‘how’ influential the covid-19 pandemic is, as the mass media does in helping people understand the current situation.

WORD CONNECTION VISUALIZATION
SNA (social network analysis) for intersected words (microscopic outline)

Based on the bags of words extracted, the words are analyzed for how often they are used together.

• The top 10 words are found from their connections within the sentences of the headlines, and these are made into a matrix.

• This is cross-checked with the lift and the support of the cross table (matrix) formed by the ‘apriori’ package.

The lift is simply the ratio of these values: target response divided by average response. Hence, a high lift means that A and B are more likely to occur together than would be expected by chance.

Also, the
. How connected the words in the headlines are can be confirmed with the co-occurrence frequencies together with the lift and support.

• Then, this is formed into a network graph, which shows the connections between words as representative of the most used headlines.

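The support and lift used in the cross-check above can be illustrated with the standard association-rule definitions: the support of a word set is the share of headlines containing all of its words, and lift(A, B) = support(A, B) / (support(A) × support(B)). The Python sketch below applies these definitions to made-up headlines; it is an illustration, not the R ‘apriori’ computation itself.

```python
def support(transactions, items):
    """Share of transactions (headlines) that contain every word in items."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def lift(transactions, a, b):
    """Observed co-occurrence of a and b relative to independence.

    A value above 1 means the two words appear together more often than
    their individual frequencies alone would suggest.
    """
    expected = support(transactions, [a]) * support(transactions, [b])
    return support(transactions, [a, b]) / expected if expected else 0.0


# Made-up tokenized headlines.
transactions = [
    ["ontario", "schools", "lockdown"],
    ["ontario", "schools"],
    ["students", "online"],
    ["ontario", "cases"],
]

pair_support = support(transactions, ["ontario", "schools"])
pair_lift = lift(transactions, "ontario", "schools")
print(pair_support, pair_lift)
```

In this toy data ‘ontario’ appears in 3 of 4 headlines and ‘schools’ in 2 of 4, and they appear together in 2 of 4, so the lift is 0.5 / (0.75 × 0.5), which is above 1.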
The analysis is based on pairs of words commonly used together in headlines. The words that commonly show up include ‘Ontario’, ‘new’, ‘lockdown’, ‘B.C’, ‘cases’, ‘schools’, ‘may’, ‘students’ and ‘week’. The heatmap shows that covid is highly correlated with adverbs.

Hierarchical clustering analysis is based on the lift of the matrix. It lets us identify groups of similar words and connect them with other groups of clustered words in a stepwise graph, which provides a structural understanding of the clustered words.

Although the number of clusters should be justified statistically, the graph clearly shows covid-19 clustering with adverbs and with education-related words. The R network graph below shows the final ranking relationships of the words in the headlines. The words are based on the correlation frequencies, and whether a relation is close enough is determined by the hierarchical clustering and the probability of the support.

First, people are interested in the news because of social distancing, as seen in the ‘news’ of the populated areas ‘Ontario’ and ‘B.C’, to which ‘measures’ is related.

Also, the decision of the ‘minister’ and the action of the government are related to the students’ current situation under covid-19.

The search analysis also reveals some distinctive relationships: covid-19 is strongly related to ‘cases’ of ‘new’ individuals, and ‘students’ is relevant to ‘online’ learning, while the ‘lockdown’ of ‘schools’ also mattered.

TOPIC MODELING (LDA): 5 TOPICS, WITH 20 KEYWORDS
Although further evidence should be added to show that 5 topics is an adequate choice, the table of topics, under the common topic of ‘covid-19’, is deduced from the two results by Gibbs sampling. It can be summarized into 3 different topics. First, students receiving online teaching and being sensitive to news is noticed, consistent with the previous analysis.

Second, schools and education during the May lockdown is a popular topic as the lockdown became reality.

Third, terror is still related to the covid-19 pandemic situation, as people fear shootings and death; this topic is new and reflects what people are afraid of.

The LDA topic modeling confirms that the topics are easily recognized from the related words summarized under each. It confirms the relations found in the previous analysis and combines them with a data sampling method to fit appropriate words together.

DISCUSSION
Problems and solutions:
- Some of the headlines are duplicates of other headlines.
- The limited data available per query for the headlines is solved through a second crawl of the data.

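The duplicate-headline problem can be handled with a simple filter when the results of two crawls are combined. The Python sketch below is an illustration only; the normalization rule (ignoring case and extra whitespace) is an assumption, and the text does not state how duplicates were treated in the paper.

```python
def normalize(headline):
    """Lowercase a headline and collapse runs of whitespace."""
    return " ".join(headline.lower().split())


def deduplicate(headlines):
    """Keep the first occurrence of each headline, preserving order."""
    seen = set()
    unique = []
    for headline in headlines:
        key = normalize(headline)
        if key not in seen:
            seen.add(key)
            unique.append(headline)
    return unique


# Made-up results of a first and a second crawl.
first_crawl = ["Ontario schools closed", "B.C. cases rise"]
second_crawl = ["ontario  schools closed ", "Students move online"]

combined = deduplicate(first_crawl + second_crawl)
print(combined)
```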
CONCLUSION
The main aim of the study was to document the keywords and topics of this pandemic in higher education across the provinces of Canada, for better understanding in the future. What was achieved was the identification of the significant factors, in terms of public health matters, for higher education institutions. Apart from this, the importance of public health and related issues during a pandemic is well known. During pandemics, students and many other people often face problems such as lack of interaction, reliance on public news for communication, or concurrent issues while self-isolated at home. Ensuring successful online education requires public health communities to implement techniques that minimize the negative public health impact on students and that encourage isolation and reliable news in public media. The following 3 key points summarize the overall statistical analysis of the methods used.

Finally, with the 3 methods of analysis, the educational impact of the covid-19 pandemic can be understood as follows:

- Online learning, where schools are in lockdown during May:
From this, the keywords ‘during’ and ‘up to’ show people’s curiosity about the minister’s decisions and what the news says about them.

- New cases and death from terror in Nova Scotia:
Although these keywords relate to a more general understanding of the covid-19 pandemic situation, the mass media and people grow more fearful of terror as the duration of social distancing becomes longer.

- Students and the locational importance of Ontario and B.C news:
In populated areas, university students and children are sensitive to local news, especially about the educational impact on students. Many schools and colleges are located in B.C and Ontario, and the topic keywords reflect this popular interest.

REFERENCES
Jeanne, A., Leonie, R., & Parlo, S. (2020). Teaching and teacher education in the time of COVID-19. Asia-Pacific Journal of Teacher Education, 48(3), 233-236.
https://doi.org/10.1080/1359866X.2020.1752051

Mulenga, E., & Marbán, J. (2020). Is COVID-19 the Gateway for Digital Learning in Mathematics Education? Contemporary Educational Technology, 12(2), ep269.
https://doi.org/10.30935/cedtech/7949

Mehta, A., & Bura, D. (2021). Mining of Association Rules in R Using Apriori Algorithm. In Advances in Communication and Computational Technology (pp. 181-188). Springer, Singapore.
https://link.springer.com/chapter/10.1007/978-981-15-5341-7_14

Hornik, K., & Grün, B. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1-30.
https://epub.wu.ac.at/3987/

Lang, D., & Chien, G. T. (2018). Package ‘wordcloud2’.
https://cran.r-project.org/web/packages/wordcloud2/wordcloud2.pdf

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1-9.
https://www.researchgate.net/profile/GaborCsardi/publication/221995787_The_Igraph_Software_Package_for_Complex_Network_Research/links/0c96051d301a30f265000000/The-Igraph-Software-Package-for-Complex-NetworkResearch.pdf

Khalil, S., & Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106.
https://www.sciencedirect.com/science/article/pii/S2352711017300110

Zhao, S., Guo, Y., Sheng, Q., & Shyr, Y. (2014). Heatmap3: an improved heatmap package with more powerful and convenient features. BMC Bioinformatics, 15(10), 1-2.
https://link.springer.com/article/10.1186/1471-2105-15-S10-P16

Suzuki, R., & Shimodaira, H. (2006). Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22(12), 1540-1542.
https://academic.oup.com/bioinformatics/article/22/12/1540/207339?login=false

Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720.
https://academic.oup.com/bioinformatics/article/31/22/3718/240978?login=false

Dhawan, S. (2020). Online Learning: A Panacea in the Time of COVID-19 Crisis. Journal of Educational Technology Systems, 49(1), 5–22.
https://doi.org/10.1177/0047239520934018

Leung, M., & Sharma, Y. (2020). Online classes try to fill education gap during epidemic. University World News.
https://www.universityworldnews.com/post.php?story=2020022108360325

McKenzie, L. (2020). Coronavirus forces universities online. Inside Higher Education.
https://www.insidehighered.com/news/2020/02/25/coronavirus-forcesus-universities-onlinechina

Marinoni, G., Land, H. V., & Jensen, T. (2020). The impact of COVID-19 on higher education around the world: IAU Global Survey Report. International Association of Universities.
https://www.iauaiu.net/IMG/pdf/iau_covid19_and_he_survey_report_final_may_2020.pdf

Reimers, F., & Schleicher, A. (2020). A framework to guide an education response to the COVID-19 pandemic of 2020. OECD.
https://iccittadichiari.edu.it/wpcontent/uploads/2020/07/COVID19_LG-settore-istruzione_OCSE_maggio2020_ESTRATTO.pdf

Bao, W. (2020). COVID-19 and online teaching in higher education: A case study of Peking University. Human Behavior and Emerging Technologies, 2, 113–115.
https://doi.org/10.1002/hbe2.191

Hodges, C., Moore, S., Lockee, B., Trust, T., & Bond, A. (2020). The difference between emergency remote teaching and online learning. Educause Review, 27.
https://medicine.hofstra.edu/pdf/faculty/facdev/facdev-article.pdf

WHO media briefing.
https://www.who.int/dg/speeches/detail/who-director-general-s-openingremarks-at-the-media-briefing-on-covid-19---11-march-2020

FIGURES AND TABLES
Figure 1. Steps for crawling Canadian news about covid-19 and education. First, on the Yahoo Canada website, search for the relevant topic, ‘Covid education’.

Figure 2. Inspect the HTML source code to find the markup for the headlines, which represent the content of interest on the impact of Covid-19 on education.

Figure 3. Using a program, crawl 50 pages of results, for a total of 260 news headlines.

Figure 4. The R wordcloud shows the statistically significant words through their frequencies, with adverbs deleted.

Figure 5. The heatmap shows that covid is highly correlated with adverbs.

Figure 6. Hierarchical clustering analysis based on the lift of the matrix. It lets us identify groups of similar words and connect them with other groups of clustered words in a stepwise graph, which provides a structural understanding of the clustered words.

Figure 7. The R graph below shows the final ranking relationships of the words in the headlines. The words are based on the correlation frequencies, and whether a relation is close enough is determined by the hierarchical clustering and the probability of the support.

Table 1. Table of topics under the common topic of ‘covid-19’, deduced from the two results by Gibbs sampling. Most likely, 5 topics is an adequate choice for the number of topics.

Topic 1      Topic 2      Topic 3
student      schools      shooting
online       during       death
cases        lockdown     new
news         pandemic     Nova Scotia
education    May          funerals
B.C          up to        says