Summary of Big Data Analysis on Canadian News about the Covid-19 Toward Educational Impact

ABSTRACT

COVID-19 has changed many aspects of daily life in the more than 23 months since it spread across Canada. One of the major changes, in Canada as in the rest of the world, is the impact of this public health problem on education. People have been isolated, and fear has spread through educational systems as attention has concentrated on public health issues. The purpose of this summary is to apply text mining analysis to current Canadian news headlines about COVID-19 and its educational impact. The text mining methods used include wordcloud, association rules, and LDA topic modeling based on Gibbs sampling.

Keywords: wordcloud, LDA topic modelling, Covid-19, Canada, text mining

INTRODUCTION

COVID-19 has posed one of the most challenging situations for academics as well as students all around the world. Because students are the biggest stakeholders in the education process, their challenges during this pandemic are even more pivotal. These challenges vary in intensity and type depending on multiple factors. In this paper, we analyze and document the issues and keywords of COVID-19 as a public health issue affecting Canadian educational systems. The analysis is performed on tokenized news headlines and keywords in order to examine the factors that stand out statistically during COVID-19. The results of the graph analysis showed a relationship between COVID-19 and government, as well as the educational lockdown, during 2021.

The World Health Organization (WHO), in its media briefing on 11 March 2020, declared the spread of the contagious coronavirus disease, also referred to as COVID-19, a pandemic. COVID-19 stands for coronavirus disease and is also referred to as the 2019 novel coronavirus or ‘2019-nCoV’ (Bender, 2020).
The COVID-19 virus is linked to Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), which can similarly be fatal (Meng, Hua & Bian, 2020). This new virus can be transmitted within minutes through droplets, or even by touching metal or other surfaces contaminated by an infected person with respiratory problems. Although the elderly and very young children are the most easily affected, nobody is immune to this new infectious disease, so all people are susceptible to its devastating effects (Meng, Hua & Bian, 2020).

Since social distancing was a key recommendation to check the spread of COVID-19, educational institutions across the world suddenly faced the unprecedented challenge of continuing to provide some form of educational assistance to students despite the pandemic (Jeanne, Leonie & Parlo, 2020). Furthermore, the response of institutions globally had to address not only the challenges related to educational needs but also economic and sustainability factors. The emerging responses varied depending on the level of education, demographics, infrastructure facilities and the local impact of the pandemic (Mulenga & Marbán, 2020; Dhawan, 2020).

However, most educational systems decided to rely on some form of technology as a substitute modality. In 2020, the OECD Rapid Response report by Reimers and Schleicher surveyed 330 respondents across 98 countries about their educational responses to COVID-19. It was interesting to see many countries, such as Finland, Argentina and Russia (radio and television) and Austria, Belgium, Costa Rica, Croatia, Czech Republic, Georgia, Indonesia, Israel, Latvia, Malaysia, Romania and Saudi Arabia (television), using radio and television media to supplement and support online learning or learning at home.

Reports conflict on how quickly the higher education sector responded.
While some reported a rapid transition by 2 February (Leung & Sharma, 2020), others reported that the transition did not occur until mid to late February (McKenzie, 2020). As observed in the global survey report by the International Association of Universities (IAU, 2020), COVID-19 affected teaching and learning at almost all higher education institutions, with two-thirds of them reporting that classroom teaching had been replaced by distance teaching and learning. Distance teaching is broadly categorized as online learning, blended learning and remote learning. Given the nature of the pandemic, the distance teaching categories adopted during this period have been mainly online learning and remote learning. Although the terms online learning and distance learning are quite often used interchangeably, as the majority of educationalists believe, we define online learning as the delivery of course or program material through some form of digital device using the internet. On the other hand, we identify the creation of online virtual synchronous classrooms as remote learning.

In 2020, the OECD Rapid Response report by Reimers and Schleicher pointed out that “The domains for which most people considered that an education response involved the most challenges were the availability of technological infrastructure, addressing student emotional health, addressing the right balance between digital and screen free activities and managing the technological infrastructure.”

Another challenge faced by the institutions was readiness and preparedness, with respect both to public health issues and to the students, for the entire distance teaching process. As Bao (2020) pointed out, it was a massive, disruptive shift to move all existing courses to an online environment in a matter of days. In this paper, we try to delve deeper into these issues, particularly for higher education.
This was achieved by analyzing news headlines that represent concurrent issues during the COVID-19 pandemic. The resulting keywords highlighted the factors vital to public health issues during pandemics. Documenting such results can provide helpful insight as we end up with online higher education. These results can also be beneficial in the event of any future sudden disruption of higher education due to unprecedented circumstances.

METHODS

PART 1. WEB CRAWLING

1. Steps for crawling Canadian news about COVID-19 on topics relevant to education. First, on the Yahoo Canada website, search for the relevant topic, ‘Covid education’.

2. Second, using the HTML source code, identify the code for the headlines, which represent the content of interest on the impact of COVID-19 on education.

3. Third, using a program, crawl the data: 50 pages, for a total of 260 news headlines.

The frequency analysis of news headlines, wordcloud analysis (macroscopic outline)
The tokenized words of the headlines are made into a frequency table. The R output shows the table of the 50 most frequently used words in the headlines of 507 news items (the first and second crawls combined) from the search for ‘covid education’ in Canadian news.

PART 2. WORDCLOUD VISUALIZATION

Wordclouds (also known as tag clouds or word art) are used to visualize and summarize all sorts of data, from feedback to academic papers, and everything in between.

There are many word cloud generators to choose from, each one with its own unique design and customization capabilities.
The function words ‘to’, ‘in’, ‘for’, ‘of’, ‘and’, ‘on’, and ‘as’ (referred to in this paper as adverbs) add unnecessary detail to the COVID-19 subject and are eliminated, so that only the commonly used content words are carried forward for further investigation in the topic modeling section.

PART 3. WORD CONNECTION (VISUALIZATION)

Social network analysis (SNA) is a process of quantitative and qualitative analysis of a social network. SNA measures and maps the flow of relationships and relationship changes between knowledge-possessing entities. Simple and complex entities include websites, computers, animals, humans, groups, organizations and nations.

SNA (social network analysis) for intersected words (microscopic outline)
Based on the bags of words extracted, an analysis is made of the intersecting usage of words.

PART 4. TOPIC MODELING (LDA: LATENT DIRICHLET ALLOCATION ANALYSIS)

What are some real-world uses of topic modelling? Historians can use LDA to identify important events in history by analyzing text by year. Web-based libraries can use LDA to recommend books based on your past reading. News providers can use topic modelling to understand articles quickly or to cluster similar articles.

Latent Dirichlet Allocation (LDA) is a tool and technique for topic modeling that classifies the text of each document into topics and the words into each topic; both are modeled with Dirichlet distributions and processes.

LDA makes two key assumptions:
1. Documents are a mixture of topics, and
2. Topics are a mixture of tokens (or words)

These topics then generate the words according to a probability distribution. In statistical language, each document is a probability distribution over topics, and each topic is a probability distribution over words.

The end goal of LDA is to find the optimal representation of the Document-Topic matrix and the Topic-Word matrix, that is, the most optimized Document-Topic distribution and Topic-Word distribution.
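As a rough illustration of the two assumptions above, the following Python sketch implements a small collapsed Gibbs sampler for LDA. It is not the code behind the results reported here, which were produced in R; the function names `lda_gibbs` and `top_words`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy headlines are all assumptions made for illustration only.

```python
import random
from collections import Counter


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * n_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w in topic k
    topic_total = [0] * n_topics                        # tokens assigned to topic k
    assign = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic j: (doc-topic count + alpha) times
                # (topic-word count + beta) / (topic size + V * beta).
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words of each topic."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


# Toy tokenized headlines (invented, not the crawled data).
docs = [
    ["students", "online", "learning", "students"],
    ["schools", "lockdown", "may", "schools"],
    ["online", "students", "news"],
    ["lockdown", "schools", "pandemic"],
]
doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(k, words)
```

The `doc_topic` counts play the role of the Document-Topic matrix and `topic_word` the Topic-Word matrix; normalizing each row (after adding `alpha` or `beta`) would give the corresponding distributions.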
Because LDA assumes that documents are a mixture of topics and topics are a mixture of words, it backtracks from the document level to identify which topics would have generated these documents and which words would have generated those topics.

RESULTS

The overall results showed a general analysis of news keywords. The word connections and ideas come from the statistically significant factors discussed below, which need to be either maintained or improved in any future public health issue or spread of respiratory disease.

WORDCLOUD VISUALIZATION

The R wordcloud (Figure 4) shows the statistically significant words according to their frequencies, after deletion of adverbs. The size of the words in the graph makes it easy to find the significant words related to ‘covid’ and ‘education’. Some inferences can be drawn from the result of the wordcloud.

Some words, including ‘pandemic’, ‘season’, ‘schools’, ‘online’, ‘lockdown’, ‘closed’ and ‘students’, show the educational perspective of the COVID-19 impact in Canada. The words ‘Ontario’ and ‘B.C’ show the locations of greatest interest.

The wordcloud also shows the recent terror of the ‘shooting’ in ‘Nova’ ‘Scotia’, with ‘funeral’ and ‘victims’, which reflects what people were interested in during the COVID-19 pandemic situation.

Interest is also given to words that attract people, including ‘mass’, ‘measures’, ‘ramping’, ‘says’, ‘news’, and ‘season’.

It can also be inferred that ‘cases’, ‘week’, ‘during’ and ‘amid’ show the periodic intervals of the COVID-19 impact, which come from people's cognitive understanding.

In these words, the news reflects people's trends and ideas about ‘how’ influential the COVID-19 pandemic is, as the mass media does for people in terms of understanding the current situation.
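The frequency table that drives the word sizes in a wordcloud can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the R code used for the analysis: the tokenizing pattern, the function name `word_frequencies`, and the three toy headlines are hypothetical, while the stopword list is the set of words the paper reports removing.

```python
import re
from collections import Counter

# Function words the paper reports removing before building the wordcloud.
STOPWORDS = {"to", "in", "for", "of", "and", "on", "as"}

# Assumed token pattern: letters/digits, optionally joined by '.' or '-'
# so that tokens such as 'covid-19' and 'b.c' stay in one piece.
TOKEN = re.compile(r"[a-z0-9]+(?:[.\-][a-z0-9]+)*")


def word_frequencies(headlines, stopwords=STOPWORDS):
    """Lowercase and tokenize each headline, drop stopwords, count the rest."""
    counts = Counter()
    for headline in headlines:
        tokens = TOKEN.findall(headline.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Toy headlines (invented, not the crawled data).
headlines = [
    "Ontario schools closed as COVID-19 cases rise",
    "B.C. students move to online learning in lockdown",
    "Ontario students and schools adapt to online classes",
]
freq = word_frequencies(headlines)
print(freq.most_common(4))
```

A wordcloud then simply draws each word with a size proportional to its count in `freq`.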
WORD CONNECTION VISUALIZATION

SNA (social network analysis) for intersected words (microscopic outline)

Based on the bags of words extracted, an analysis is made of the intersecting usage of words.

• The top 10 words are found from their connections within the sentences of the headlines, and these are made into a matrix.

• This is cross-checked with the lift and the support of the cross table (matrix) formed by the ‘apriori’ package.

Support is the proportion of headlines in which two words appear together. Lift is simply the ratio of two values, target response divided by average response; that is, lift(A, B) = support(A and B) / (support(A) × support(B)). Hence, a lift above 1 means that A and B occur together more often than would be expected if they were independent.

The purpose, finding how connected words are in headline writing, can be confirmed with the frequencies of co-occurrence and with the lift and support.

• Then, this is formed into a network graph, which shows the connections of words as representative of the most used headlines.

The analysis is based on the common usage of pairs of words in headlines. The words that commonly show up include ‘Ontario’, ‘new’, ‘lockdown’, ‘B.C’, ‘cases’, ‘schools’, ‘may’, ‘students’ and ‘week’. The heatmap (Figure 5) shows that ‘covid’ has a high correlation with adverbs.

Hierarchical clustering analysis is based on the lift of the matrix (Figure 6). This lets us identify groups of similar words and connect them with other groups of clustered words in a stepwise graph, which provides a structural understanding of the clustered words.

Although there should be statistical justification for how the number of clusters is determined, the graph clearly shows the clusters of covid-19 with adverbs and with words related to education. The R network graph (Figure 7) shows the ultimate ranking relationships of words in the headlines.
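To make the support and lift measures used in this section concrete, the following Python sketch computes them for every pair of words that co-occur in at least one headline. It is an illustrative assumption, not the ‘apriori’ computation used for the results; the function name `pair_measures` and the four toy headlines are hypothetical.

```python
from itertools import combinations


def pair_measures(transactions):
    """Return {(a, b): (support, lift)} for each co-occurring word pair.

    support(a, b) = share of headlines containing both words
    lift(a, b)    = support(a, b) / (support(a) * support(b))
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for words in transactions:
        unique = sorted(set(words))  # each word counts once per headline
        for w in unique:
            item_count[w] = item_count.get(w, 0) + 1
        for pair in combinations(unique, 2):
            pair_count[pair] = pair_count.get(pair, 0) + 1

    measures = {}
    for (a, b), count in pair_count.items():
        support = count / n
        lift = support / ((item_count[a] / n) * (item_count[b] / n))
        measures[(a, b)] = (support, lift)
    return measures


# Toy tokenized headlines (invented, not the crawled data).
transactions = [
    ["ontario", "schools", "lockdown"],
    ["ontario", "schools", "students"],
    ["students", "online"],
    ["ontario", "cases"],
]
measures = pair_measures(transactions)
for pair, (support, lift) in sorted(measures.items()):
    print(pair, round(support, 2), round(lift, 2))
```

In this toy data, ‘ontario’ and ‘schools’ appear together in 2 of 4 headlines (support 0.5), while ‘ontario’ alone appears in 3 and ‘schools’ in 2, so the lift is 0.5 / (0.75 × 0.5), about 1.33: the pair co-occurs somewhat more often than independence would predict.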
The connections between words are based on their correlation frequencies, and whether a relation is close enough is determined by the hierarchical clustering and the probability of the support.

First, because of social distancing, people are interested in the ‘news’ of the populated areas ‘Ontario’ and ‘B.C’, to which ‘measures’ is related.

Also, the decision of the ‘minister’ and the action of government relate to the students' current situation under COVID-19.

The analysis also finds some distinctive relationships: covid-19 is strongly related to ‘new’ ‘cases’, and ‘students’ is related to ‘online’ learning, while the ‘lockdown’ of ‘schools’ also matters.

TOPIC MODELING (LDA): 5 TOPICS, WITH 20 KEYWORDS

Although a justification that 5 topics is the most adequate choice should still be added, the table of topics under the common topic of ‘covid-19’ (Table 1) is deduced from the two results by Gibbs sampling. It can be summarized into 3 different topics. First, students receiving online teaching are sensitive to news, which is consistent with the previous analysis.

Second, schools and education during the May lockdown is a popular topic, as the lockdown became reality.

Third, terror is still related to the COVID-19 pandemic situation: people fear shooting and death, which is a new finding and a reflection of what people are afraid of.

The LDA topic modeling confirms that the topics are easily recognized from the related words that are summarized. It confirms the relations found in the previous analysis and combines them with the data sampling method to fit appropriate words together.

DISCUSSION

Problems and solutions:
- Some headlines are duplicates of other headlines.
- The limited data available per query for the headlines is addressed through a second crawl of the data.

CONCLUSION

The main aim of the study was to study and document the keywords and topics of this pandemic in higher education in relation to each province in Canada, for better understanding in the future. What was achieved was the identification of the significant factors in terms of public health matters for higher education institutions. Apart from this, the importance of public health and related issues during a pandemic period is well known. During pandemics, students and many other people often have problems such as lack of interaction, reliance on public news for communication, and concurrent issues while they are self-isolated at home. Ensuring successful online education requires techniques to be implemented by public health communities to minimize the negative public health impact on students and to encourage isolation and reliability in the news of the public media. The following 3 key points summarize the overall statistical analysis of the methods used.

Finally, with 3 methods of analysis, the educational impact of the COVID-19 pandemic can be understood as follows:

- Online learning, where schools are in lockdown during May:
The keywords ‘during’ and ‘up to’ show people's curiosity about the minister's decisions and what the news says about them.

- New cases and death of terror in Nova Scotia:
Although these keywords relate to a more general understanding of the COVID-19 pandemic situation, the mass media and people fear terror as the duration of social distancing becomes longer.

- Students, locational importance of Ontario and B.C news:
In the populated areas, university students and children are sensitive to local news, especially about the educational impact on students. There are many schools and colleges located in B.C and Ontario, and the keywords of the topics reflect this popular interest.
REFERENCES

Jeanne, A., Leonie, R. & Parlo, S. (2020). Teaching and teacher education in the time of COVID-19. Asia-Pacific Journal of Teacher Education, 48(3), 233-236. https://doi.org/10.1080/1359866X.2020.1752051

Mulenga, E. & Marbán, J. (2020). Is COVID-19 the Gateway for Digital Learning in Mathematics Education? Contemporary Educational Technology, 12(2), ep269. https://doi.org/10.30935/cedtech/7949

Mehta, A., & Bura, D. (2021). Mining of Association Rules in R Using Apriori Algorithm. In Advances in Communication and Computational Technology (pp. 181-188). Springer, Singapore. https://link.springer.com/chapter/10.1007/978-981-15-5341-7_14

Hornik, K., & Grün, B. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1-30. https://epub.wu.ac.at/3987/

Lang, D., & Chien, G. T. (2018). Package ‘wordcloud2’. https://cran.r-project.org/web/packages/wordcloud2/wordcloud2.pdf

Csardi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695(5), 1-9. https://www.researchgate.net/profile/GaborCsardi/publication/221995787_The_Igraph_Software_Package_for_Complex_Network_Research/links/0c96051d301a30f265000000/The-Igraph-Software-Package-for-Complex-NetworkResearch.pdf

Khalil, S., & Fakir, M. (2017). RCrawler: An R package for parallel web crawling and scraping. SoftwareX, 6, 98-106. https://www.sciencedirect.com/science/article/pii/S2352711017300110

Zhao, S., Guo, Y., Sheng, Q., & Shyr, Y. (2014). Heatmap3: an improved heatmap package with more powerful and convenient features. BMC Bioinformatics, 15(10), 1-2. https://link.springer.com/article/10.1186/1471-2105-15-S10-P16

Suzuki, R., & Shimodaira, H. (2006). Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics, 22(12), 1540-1542. https://academic.oup.com/bioinformatics/article/22/12/1540/207339?login=false

Galili, T. (2015). dendextend: an R package for visualizing, adjusting and comparing trees of hierarchical clustering. Bioinformatics, 31(22), 3718-3720. https://academic.oup.com/bioinformatics/article/31/22/3718/240978?login=false

Dhawan, S. (2020). Online Learning: A Panacea in the Time of COVID-19 Crisis. Journal of Educational Technology Systems, 49(1), 5–22. https://doi.org/10.1177/0047239520934018

Leung, M., & Sharma, Y. (2020). Online classes try to fill education gap during epidemic. University World News. https://www.universityworldnews.com/post.php?story=2020022108360325

McKenzie, L. (2020). Coronavirus forces universities online. Inside Higher Education. https://www.insidehighered.com/news/2020/02/25/coronavirus-forcesus-universities-onlinechina

Marinoni, G., Land, H. V. & Jensen, T. (2020). The impact of COVID-19 on higher education around the world: IAU Global Survey Report. International Association of Universities. https://www.iauaiu.net/IMG/pdf/iau_covid19_and_he_survey_report_final_may_2020.pdf

Reimers, F., & Schleicher, A. (2020). A framework to guide an education response to the COVID-19 pandemic of 2020. OECD. https://iccittadichiari.edu.it/wpcontent/uploads/2020/07/COVID19_LG-settore-istruzione_OCSE_maggio2020_ESTRATTO.pdf

Bao, W. (2020). COVID-19 and online teaching in higher education: A case study of Peking University. Human Behavior and Emerging Technologies, 2, 113–115. https://doi.org/10.1002/hbe2.191

Hodges, C., Moore, S., Lockee, B., Trust, T., & Bond, A. (2020). The difference between emergency remote teaching and online learning. Educause Review, 27. https://medicine.hofstra.edu/pdf/faculty/facdev/facdev-article.pdf

WHO media briefing. https://www.who.int/dg/speeches/detail/who-director-general-s-openingremarks-at-the-media-briefing-on-covid-19---11-march-2020

FIGURES AND TABLES

Figure 1. Steps for crawling Canadian news about COVID-19 on topics relevant to education. First, on the Yahoo Canada website, search for the relevant topic, ‘Covid education’.

Figure 2. Using the HTML source code, identify the code for the headlines, which represent the content of interest on the impact of COVID-19 on education.

Figure 3. Using a program, crawl the data: 50 pages, for a total of 260 news headlines.

Figure 4. The R wordcloud shows the statistically significant words according to their frequencies, after deletion of adverbs.

Figure 5. The heatmap shows that ‘covid’ has a high correlation with adverbs.

Figure 6. Hierarchical clustering analysis based on the lift of the matrix. This lets us identify groups of similar words and connect them with other groups of clustered words in a stepwise graph, which provides a structural understanding of the clustered words.

Figure 7. The R graph shows the ultimate ranking relationships of words in the headlines. The connections between words are based on their correlation frequencies, and whether a relation is close enough is determined by the hierarchical clustering and the probability of the support.

Table 1. Topics under the common topic of ‘covid-19’, deduced from the two results by Gibbs sampling; 5 topics are most likely adequate for the choice of topics.

First topic    Second topic    Third topic
student        schools         shooting
online         during          death
cases          lockdown        new
news           pandemic        Nova Scotia
education      May             funerals
B.C            up to           says