UEH – COLLEGE OF BUSINESS
FINANCE DEPARTMENT

DATA SCIENCE FINAL EXAM PROJECT

TOPIC: APPLICATION OF THE NEURAL NETWORK MODEL IN FORECASTING FINANCIAL DISTRESS OF COMPANIES IN THE MANUFACTURING AND WHOLESALE INDUSTRIES IN 2022 AND 2023 USING THE ORANGE PROGRAM

CLASS: 22C1INF50905912
INSTRUCTOR: Thái Kim Phụng
GROUP MEMBERS:
1. Đoàn Thị Kim Anh - MSSV: 31201022021
2. Phạm Công Hoàng - MSSV: 31201020324
3. Trần Minh Tuyết Mai - MSSV: 31201022425
4. Nguyễn Bích Ngọc - MSSV: 31201020646
5. Nguyễn Thu Trang - MSSV: 31201022820

HO CHI MINH CITY - 2022

GROUP REPORT
SUBJECT: DATA SCIENCE
• Plagiarism check: 10%
• Group leader: Trần Minh Tuyết Mai
• Group members: 5

Student's Name / MSSV / Participation (/100)
1. Đoàn Thị Kim Anh / 31201022021 / 80
2. Phạm Công Hoàng / 31201020324 / 100
3. Trần Minh Tuyết Mai / 31201022425 / 100
4. Nguyễn Bích Ngọc / 31201020646 / 60
5. Nguyễn Thu Trang / 31201022820 / 70

ABSTRACT

According to the General Statistics Office, Vietnam's GDP in 2021 grew by 2.58% compared with 2020, with the manufacturing and wholesale industries posting a large increase of 6.37% and contributing 1.61 percentage points to GDP growth. However, due to the severity of the COVID-19 pandemic, a large wave of firms experienced financial stress, resulting in bankruptcy; among them, manufacturing and wholesale accounted for the largest proportion, around 36.9%. The main objective of this study is to predict the financial distress of manufacturing and wholesale firms in 2022 and 2023 by interpreting Altman's Z-score after applying the most suitable classification model selected via the Orange program. Data were collected on 627 manufacturing and wholesale companies listed on three stock exchanges (HNX, HOSE, and UPCOM) in 2021. This sample was divided into two datasets: a training dataset (439 companies) and a forecast dataset (188 companies). The dependent variable is Results, which takes three values: Safe zone, Gray zone, and Distress zone. The five independent variables are: Net working capital/Total assets (NWCTA), Retained earnings/Total assets (RETA), Earnings before interest and taxes/Total assets (EBITTA), Market value of equity/Book value of total liability (MVETD), and Sales/Total assets (NRTA). After removing extreme outliers, the training dataset contains 415 observations. Among the four classification methods tested (Tree, SVM, Neural Network, and Logistic Regression), the Neural Network model is rated highest on all five indexes (AUC, CA, F1, Precision, and Recall) and has the highest proportion of correctly predicted values in the confusion matrix. The team then predicted the bankruptcy probability of the remaining 188 listed companies in the two industries and found that 72 listed companies are not at risk of bankruptcy, 59 companies are at risk of bankruptcy, and 57 companies are at high risk of bankruptcy, with a predictive accuracy of 94% within one year (2022) and 74% within two years (2023). This research can serve as a reference for a specific view of financial stress among listed manufacturing and wholesale companies in 2022 and 2023, helping not only managers but also governments consider strategies and solutions. However, the model applies only to Vietnam's listed firms in the manufacturing and wholesale industries in general in 2021. Therefore, the group will continue researching each specific industry group and adding other micro and macro factors to build a more general model for each particular industry group.
TABLE OF CONTENTS

ABSTRACT
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS
CHAPTER 1: INTRODUCTION
1.1. Reason for doing the topic
1.2. Objectives of the study
1.3. Research questions
1.4. Research subjects and scopes
1.5. Overall research methodology
1.6. Practical meanings of the topic
1.7. Research layout
CHAPTER 2: LITERATURE REVIEW
2.1. Literature review about financial distress
2.1.1. Definition of financial distress
2.1.2. Some causes of financial distress
2.1.2.1. Financial factors
2.1.2.2. Nonfinancial factors
2.1.2.3. Macroeconomic factors
2.1.3. Financial distress costs
2.2. Literature review about data mining
2.2.1. Definition of data mining
2.2.2. The key properties of data mining
2.2.3. Data mining processing
2.2.4. Data mining methods
2.2.5. Data mining tool used in the study – Orange
2.3. Literature review about data classification
2.3.1. Definition of data classification
2.3.2. Data classification process
2.3.3. Data classification methods
2.3.3.1. Logistic Regression
2.3.3.2. Support Vector Machine
2.3.3.3. Decision Tree
2.3.3.4. Neural Network
2.3.4. Methods to evaluate classification models
2.3.4.1. Confusion Matrix, Accuracy, ROC, AUC, and Precision/Recall
2.3.4.2. Cross Validation: Holdout and K-fold cross validation
2.4. Previous empirical evidence applying data mining in forecasting financial distress
2.4.1. Empirical evidence with foreign research subjects
2.4.2. Empirical evidence with Vietnamese research subjects
CHAPTER 3: METHODOLOGY
3.1. Research process
3.2. Research model
3.3. Variable measurements
3.3.1. Dependent variable: Z-scores
3.3.2. Independent variables
3.3.2.1. Net working capital/Total assets (NWCTA)
3.3.2.2. Retained earnings/Total assets (RETA)
3.3.2.3. Net revenues/Total assets (NRTA)
3.3.2.4. EBIT/Total assets (EBITTA)
3.3.2.5. Equity-to-debt ratio (MVETD)
3.3.3. Summary of variable measurements
3.4. Data collection methods and descriptive statistics before preprocessing
CHAPTER 4: RESULTS
4.1. Results of preprocessing data
4.2. Descriptive statistics after processing the training dataset
4.3. Results of choosing and evaluating the most suitable classification method
4.4. Results of forecasting data by using the Neural Network model
CHAPTER 5: DISCUSSIONS, LIMITATIONS, AND RECOMMENDATIONS
5.1. Discussions
5.2. Recommendations
5.3. Limitations
5.4. Directions
REFERENCES
APPENDIX 1: TRAINING SET BEFORE PROCESSING DATA
APPENDIX 2: TRAINING SET AFTER PROCESSING DATA
APPENDIX 3: TEST DATA BEFORE FORECASTING
APPENDIX 4: TEST DATA AFTER FORECASTING BY USING THE NEURAL NETWORK MODEL

LIST OF TABLES

Table 2.1: Common formulas for calculating short-term solvency
Table 2.2: Common formulas for calculating long-term solvency
Table 2.3: Common formulas for calculating assets management
Table 2.4: Common formulas for calculating profitability
Table 2.5: Formulas for calculating market value
Table 3.1: Interpretation of the Z-score
Table 3.2: Five independent variables selected in this study
Table 3.3: Descriptive statistics of quantitative variables before preprocessing

LIST OF FIGURES

Figure 2.1: Outcomes of Financial Distress
Figure 2.2: Determinants of financial distress
Figure 2.3: Five main groups of financial factors
Figure 2.4: Illustration of the data mining process
Figure 2.5: Illustration of building a classification model
Figure 2.6: Illustration of classifying new data and estimating the accuracy
Figure 2.7: Illustration of logistic regression
Figure 2.8: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes
Figure 2.9: A simple decision tree with the tests on attributes X and Y
Figure 2.10: Simple neural network
Figure 2.11: Confusion matrix for binary classification
Figure 2.12: Outcomes of a confusion matrix
Figure 2.13: Sensitivity and Specificity
Figure 2.14: Area under the ROC Curve
Figure 2.15: Hold-out method
Figure 2.16: K-fold cross-validation
Figure 3.1: Overall framework of using data mining techniques for prediction of financial distress
Figure 3.2: Statistical results in the training dataset before preprocessing
Figure 4.1: Process of preprocessing data in Orange
Figure 4.2: Training dataset of 20 listed companies before processing
Figure 4.3: Distributions of our sample in five variables before preprocessing
Figure 4.4: Illustration of removing outliers
Figure 4.5: Training data of 20 listed companies after preprocessing
Figure 4.6: Distributions of our sample in five variables after preprocessing
Figure 4.7: Statistical results in the training dataset after preprocessing
Figure 4.8: Procedure for selecting and evaluating data classification methods
Figure 4.9: Roles of the variables in the training dataset
Figure 4.10: Result of the layered evaluation model by Cross Validation
Figure 4.11: Neural Network's Confusion Matrix
Figure 4.12: ROC analysis
Figure 4.13: Forecasting dataset of 20 listed companies
Figure 4.14: Neural Network forecasting process
Figure 4.15: Properties of the variables in the forecast dataset
Figure 4.16: Forecast results using the Neural Network model
Figure 4.17: Statistical forecasting results by the Neural Network model

LIST OF ACRONYMS

1. ANN: Artificial Neural Network
2. AUC: Area Under the Curve
3. CA: Classification Accuracy
4. CART: Classification and Regression Tree
5. CPI: Consumer Price Index
6. EBIT: Earnings Before Interest and Tax
7. EBITTA: Earnings before interest and taxes/Total assets
8. FPR: False Positive Rate
9. GDP: Gross Domestic Product
10. HNX: Hanoi Stock Exchange
11. HOSE: Ho Chi Minh City Stock Exchange
12. KDD: Knowledge Discovery in Databases
13. MLP: Multi-Layer Perceptrons
14. MVETD: Market value of equity/Book value of total liability
15. NRTA: Sales/Total assets
16. NWCTA: Net working capital/Total assets
17. RETA: Retained earnings/Total assets
18. ROC: Receiver Operating Characteristic
19. SMO: Sequential Minimal Optimization
20. SVM: Support Vector Machine
21. TPR: True Positive Rate
22. UPCoM: The Unlisted Public Company Market
CHAPTER 1: INTRODUCTION

1.1. Reason for doing the topic

2020 and 2021 witnessed stagnation in the whole world's economic development due to the impact of COVID-19. Vietnam's GDP reached a strong growth rate of 7.02% in 2019, but GDP growth decreased significantly in 2020 and 2021 (2.91% and 2.58%, respectively). In particular, GDP in the second quarter of 2020 increased by only 0.36% compared to the same period in 2019, and GDP in the third quarter of 2021 decreased by 6.02% compared to the same period in 2020. By 2022, the economy had entered a recovery phase: GDP in the first six months of 2022 increased by 6.42% over the same period in 2021. Notably, the average CPI in the first six months of 2022 increased by 2.44% compared to the previous year, showing an encouraging recovery in consumer demand. However, inflation has become a significant problem due to four main factors: CPI rising significantly with the recovery of domestic demand; high energy prices as travel recovers; sharply rising input prices feeding into consumer goods prices; and fiscal support causing the money supply to increase sharply in 2022. Associate Professor Nguyễn Bá Minh, Director of the Institute of Economics and Finance, along with other organizations and financial experts, forecasts that inflation in 2022 will increase strongly compared to 2021 (to about 4%), which is more than expected. Rising inflation causes the State Bank of Vietnam (SBV) to tighten monetary policy, leading to higher interest rates and to the risk that businesses fall into financial distress because they cannot finance production.

In the financial sector, financial distress is the top concern of companies seeking to maintain their business operations. Financial distress indicates a condition in which promises to a company's creditors are broken or honored with difficulty; if it cannot be relieved, it can lead to bankruptcy. Financial distress is usually associated with costs to the company, known as costs of financial distress. The term "financial distress" is used by many researchers to describe the difficult period an enterprise goes through before it declares bankruptcy (Altman and Hotchkiss, 2006; Li Jiming and Du Weiwei, 2011; Tinoco and Nick Wilson, 2013). If an enterprise falls into a state of financial distress, it falls into one of the following situations: its securities are placed under control or warning status, its securities are delisted, or the enterprise goes bankrupt (Vietnam Bankruptcy Law, 2014; Decree No. 58/2012/ND-CP of the Government). Financial distress can cause lasting damage to a firm's creditworthiness and is often a harbinger of bankruptcy. For investors, creditors, and managers, the risks and losses when a business goes bankrupt are not small. Therefore, financial analysts always put effort into detecting financial distress and signs of bankruptcy as early as possible.

The financial situation of businesses in the manufacturing and wholesale industries was significantly affected by the COVID-19 pandemic in 2021, with a series of serious losses: production and commerce were disrupted, thousands of businesses struggled under the burden of large expenses, business activities became extremely volatile, liquidity was low, and bad debt was high.
According to the General Statistics Office's announcement on the socio-economic situation in 2021, 43,200 enterprises temporarily suspended business (an increase of 25.9%), including 20,267 enterprises in the wholesale and retail sectors (accounting for 13.8%) and 6,558 enterprises in the processing and manufacturing sectors (accounting for 11.9%). A further 48,127 enterprises were waiting for dissolution procedures, an increase of 27.8% compared to 2020, with wholesale and retail (17,178 enterprises, accounting for 35.7%) and processing and manufacturing (5,794 enterprises, accounting for 12.0%) making up the majority. The financial difficulties of these enterprises will continue or even worsen, especially in the current unstable domestic and foreign economy.

Despite being heavily affected by COVID-19 in 2020 and 2021, manufacturing and wholesale are two industries that account for a large proportion of, and contribute greatly to, the positive growth of Vietnam's GDP (General Statistics Office of Vietnam annual reports, 2020 and 2021). In 2020, the processing and manufacturing sectors played an important role in the economy's growth with an increase of 5.82%, contributing 1.25 percentage points to GDP growth; wholesale and retail increased by 5.53% compared to 2019. GDP in 2021 increased by 2.58%, and the processing and manufacturing sector continued to be the main driving force with an increase of 6.37%, contributing 1.61 percentage points to GDP growth. It can be seen that manufacturing and wholesale are two important industries that contribute significantly to GDP growth. However, listed companies in the manufacturing and wholesale industries are facing financial distress, especially given the current unstable domestic and foreign economy, which brings high risks of financial distress. The team therefore chose companies in these two industries as the research subjects.

Finding a way to detect the warning signs of bankruptcy is always one of the primary concerns of market regulators, analysts, and shareholders. Many models have been built by researchers to assess and forecast firms' financial distress based on the published financial information of enterprises. Among them, the Z-score model of Altman (1968) is considered the original and most widely recognized model, used by both academics and practitioners around the world. This model shows that the Z-index predicts the bankruptcy risk of enterprises within the next two years with a high level of confidence. Although developed more than 40 years ago, Altman's model has remained highly accurate to this day and is a popular tool among analysts for assessing corporate health. In addition, more than 20 countries around the world use this index with high reliability (Altman & Hotchkiss, 2006).

Quite a lot of empirical evidence on the Altman model in Vietnam has been produced by analyzing the financial distress of companies through the Z-score index. According to the study by Hoàng Thị Hồng Vân (2020), accuracy in predicting financial distress one year before bankruptcy is 76.67%, and two years before bankruptcy is 70%, which are fairly good predictive results. The results of Võ Văn Nhị and Hoàng Cẩm Trang (2013) show a positive relationship between bankruptcy risk and earnings management of listed companies in Vietnam through the Altman model.
Lê Cao Hoàng Anh and Nguyễn Thu Hằng (2012) retested Altman's Z-score in predicting the failure of 293 companies listed on HOSE; the results show that the Z-score correctly predicted 91% of cases one year before the company went into financial distress, falling to 72% within two years. Trần Việt Hải (2017) applied Altman's model to identify fraudulent financial statements of companies listed on HOSE; the results show that the model classified companies with fraud at an accuracy rate of 68.7%. In short, the forecast rates are quite high, showing that the Z-score is a reliable indicator that is suitable for the Vietnamese market. Therefore, our group used this model to predict bankruptcy for Vietnamese enterprises in the manufacturing and wholesale industries listed on the three stock exchanges, for 2022 and 2023.

In recent years, society has witnessed an explosion of information technology, which has caused the data warehouses of management information systems to grow rapidly. That is the premise for the birth of data mining techniques, making processing techniques smarter and more efficient in collecting, storing, and analyzing data to improve work productivity. Numerous organizations, such as Johnson & Johnson, GE Capital, Fingerhut, Procter & Gamble, and Harrah's Casino, have acknowledged the value of data mining in accounting and finance (Calderon, Thomas G., et al., 2003). Data mining has been named one of the top ten technologies for the future by the American Institute of Certified Public Accountants and one of the four research priorities by the Institute of Internal Auditors (Koh, 2004). The use of data mining techniques on financial data can aid the decision-making process and help solve classification and prediction issues. Corporate bankruptcy, credit risk assessment, going-concern reporting, financial distress, and corporate performance forecasting are common instances of financial classification issues (Naveen, 2018).

In Vietnam, there have been many articles predicting the bankruptcy risk of enterprises. During the research period, we did not find any specific articles on this topic for listed companies in the manufacturing and wholesale industries, especially given the instability of the current domestic and foreign economies. Therefore, building on the advantages of previous studies, this study will use data mining to fill the gap in the prediction of bankruptcy risk of listed companies in the manufacturing and wholesale industries in Vietnam in the years 2022 and 2023.

1.2. Objectives of the study

Originating from the instability of the economy, the importance of forecasting financial stress, the popularity of Altman's model, and the lack of papers relating to data mining techniques in Vietnam's corporate finance sector, especially for predicting the financial distress of manufacturing and wholesale firms, the study was conducted with the following goals:
• Increasing the applicability of data mining by selecting a suitable model to forecast the possibility of financial distress of listed companies in the manufacturing and wholesale industries in Vietnam.
• Describing and identifying the probability of financial distress of enterprises through the interpretation of Altman's Z-score, after applying the suitable model that the Orange program selects.
• Giving useful information not only to investors but also to managers and policymakers.
1.3. Research questions

In order to achieve the overall goal of the study, the research questions posed are:
• How do internal factors affect the criterion for assessing the company's possibility of bankruptcy (the Z-score), as presented through descriptive statistics?
• Given the training dataset, which model provided by the Orange software should be used to predict financial distress with a high level of confidence?
• Using the selected model, what is the likelihood of financial distress of the companies in 2022 and 2023?
• What policies and strategies can investors, policymakers, and business managers adopt to identify and minimize the possibility of financial distress now and in the next business period?

1.4. Research subjects and scopes

a. Research subjects
The research in this topic focuses mainly on the prediction of financial distress of listed companies in the manufacturing and wholesale industries in Vietnam in the years 2022 and 2023.

b. Research scopes
The total dataset is collected from the financial statements of 627 companies in the manufacturing and wholesale industries in Vietnam on three stock exchanges: HOSE, HNX, and UPCOM. The sources used by the authors are the audited consolidated financial statements and annual reports of the enterprises. In particular:
• The training dataset includes 439 companies.
• The forecast dataset includes 188 companies.
The data were taken in 2021, when the COVID-19 epidemic was taking place and gradually showing signs of coming under control. However, the current domestic and foreign economies are still complicated and tense.

1.5. Overall research methodology

First, the team collected 627 observations from manufacturing and wholesale companies listed on three stock exchanges in Vietnam (HNX, HOSE, and UPCOM) in 2021. These 627 companies were then divided into two datasets: 439 companies in the training dataset and 188 companies in the forecast dataset. The data are secondary data from CafeF and Vietstock. The dependent variable in this research is Results, which has three values: Safe zone, Distress zone, and Gray zone. The five independent variables, based on Altman (1968), are: Net working capital/Total assets (NWCTA), Retained earnings/Total assets (RETA), Earnings before interest and taxes/Total assets (EBITTA), Market value of equity/Book value of total liability (MVETD), and Sales/Total assets (NRTA). The team then preprocessed the data by removing extreme outliers for each independent variable, leaving only 415 observations in the training dataset. Next, the team assigned the attributes to the variables and used four data classification methods to learn the training dataset: logistic regression, decision tree induction, support vector machines, and neural networks. The team then evaluated the four methods through five indicators (F1-score, CA, Precision, Recall, and AUC) and the confusion matrix to choose the most suitable model. After finding the most effective classification method (the Neural Network model), the team predicted the bankruptcy probability of the remaining 188 listed companies in the two industries of manufacturing and wholesale.
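To make this workflow concrete, the sketch below reproduces its main steps in Python with scikit-learn, whose learners are comparable to the Orange widgets used in the study. It is a minimal illustration, not the study's actual pipeline: the file name and column names are hypothetical placeholders, and the labeling uses Altman's original 1968 coefficients and the standard cut-offs (Z > 2.99 Safe zone, 1.81 to 2.99 Gray zone, Z < 1.81 Distress zone); the paper's exact Z-score specification is given in Chapter 3.

```python
# A minimal sketch of the workflow in Section 1.5 (assumed file/column names).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

df = pd.read_csv("training_dataset.csv")  # hypothetical file of 2021 financials

# Label each firm with Altman's (1968) Z-score and the standard zone cut-offs.
df["Z"] = (1.2 * df["NWCTA"] + 1.4 * df["RETA"] + 3.3 * df["EBITTA"]
           + 0.6 * df["MVETD"] + 1.0 * df["NRTA"])
df["Results"] = pd.cut(df["Z"],
                       bins=[float("-inf"), 1.81, 2.99, float("inf")],
                       labels=["Distress zone", "Gray zone", "Safe zone"])

X = df[["NWCTA", "RETA", "EBITTA", "MVETD", "NRTA"]]
y = df["Results"].astype(str)

# Compare the four classifiers with 10-fold cross-validation, as in the study.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Tree": DecisionTreeClassifier(),
    "SVM": SVC(),
    "Neural Network": MLPClassifier(max_iter=2000),
}
for name, model in models.items():
    pred = cross_val_predict(model, X, y, cv=10)
    print(name)
    print(classification_report(y, pred))  # per-class precision, recall, F1, CA
```

AUC, which Orange also reports, could be obtained similarly from cross-validated class probabilities (e.g., `cross_val_predict(..., method="predict_proba")` combined with `roc_auc_score`).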
1.6. Practical meanings of the topic

Given the uncertainty of the economic situation in Vietnam and the world at the end of 2022, with many complicated changes coming from the conflict between Russia and Ukraine and a high chance of interest rate increases by the Fed to control inflation after the COVID-19 pandemic, our study has high practical value for investors, policymakers, and administrators because it:
• helps investors avoid investing in public companies that are in poor financial condition or even at high risk of bankruptcy.
• gives policymakers more useful information for decision-making when they plan to adjust regulations and monetary policy to reduce the chance of financial distress in the manufacturing and wholesale industries.
• provides managers with useful information about the current health of their company so they can promptly plan quick-response strategies and consider rebuilding their capital structure.

1.7. Research layout

The research is divided into 5 chapters:
Chapter 1: Introduction. The authors indicate the reason for choosing the topic, objectives of the study, research questions, research subjects, scopes, and overall research methodology. In addition, the authors give the practical meanings and research layout.
Chapter 2: Literature review. The authors describe the current situation of enterprises in the manufacturing and wholesale industries in Vietnam in recent years. After classifying the data according to the Z-score index to forecast company bankruptcy, the authors select the appropriate model, provide the prediction results, and determine the financial distress of the companies.
Chapter 3: Research methodology. The authors present the research process and build the research models, and explain the data collection and processing methods.
Chapter 4: Results. The authors apply data mining to fill a gap in the prediction of bankruptcy risk of listed businesses in the manufacturing and wholesale industries in Vietnam in the years 2022 and 2023.
Chapter 5: Discussions and conclusions. The authors give conclusions, recommendations, limitations, and directions for further research. In addition, recommendations are drawn from the research results for macro policymakers, business managers, and investors in Vietnam.

CHAPTER 2: LITERATURE REVIEW

2.1. Literature review about financial distress

2.1.1. Definition of financial distress

Firms that are having financial difficulties are said to be in financial distress. These situations are most frequently described using the words "failure", "insolvency", "default", and "bankruptcy". Financial distress can be partly explained by relating it to insolvency, as Black's Law Dictionary defines it: "Inability to pay one's debts; lack of means of paying one's debts. Such a condition of assets and liabilities that the former made immediately available would be insufficient to discharge the latter." According to "Corporate Finance, 10th edition" by Ross, Westerfield, and Jaffe, this definition has two general themes: balance-sheet and cash-flow. Balance-sheet insolvency occurs when a firm has a negative net worth, that is, the value of its assets is less than the value of its debts (for example, assets worth 80 against debts of 100), which means the company does not have enough assets to meet its obligations to lenders.
Cash-flow insolvency occurs when the firm's current and long-term assets are sufficient to cover its debt obligations to creditors, but payment cannot be made in liquid form such as cash. The term also describes a situation where a firm lacks liquid assets on hand to match the financial requirements of creditors (for example, a firm whose assets exceed its debts but which holds too little cash to pay the portion due today). However, financial distress has a broader definition than bankruptcy, which helps researchers grow their sample sizes. In contrast, bankruptcy is a specific type of financial difficulty, and studies on it tend to have smaller samples (Altas, 1993).

There are four stages in a business bankruptcy. In Stage 1, the firm's financial problems are incubating. In Stage 2, the firm's financial trouble, often known as financial embarrassment, becomes known to management. In Stage 3, financial insolvency, the company lacks the resources to meet its debt obligations. In Stage 4, insolvency is finally proven: the firm's bankruptcy is made official by a court determination, and its assets must be sold to pay creditors (Poston et al., 1994). Therefore, financial distress is distinct from bankruptcy. It happens when a company's business operations cannot meet its financial obligations and its assets are becoming less liquid. Financial distress may be identified before the business enters bankruptcy, beginning from Stage 2. If an enterprise falls into a state of financial distress, it falls into one of the following situations: its securities are placed under control or warning status, its securities are delisted, or the enterprise goes bankrupt (Vietnam Bankruptcy Law, 2014; Decree No. 58/2012/ND-CP of the Government). However, financial distress does not always progress to bankruptcy. Figure 2.1 illustrates how public firms may undergo different paths of financial distress, and their final destination could be a private workout rather than declared bankruptcy. Interestingly, approximately half of financial restructurings have been done via private workouts (Wruck, 1990).

Figure 2.1: Outcomes of Financial Distress. Source: Five empirical studies - Wruck (1990).

Some firms may actually benefit from financial distress by restructuring their assets. For example, a levered recapitalization can change a firm's behavior and force it to dispose of unrelated businesses. A firm going through a levered recapitalization will add a great deal of debt and, as a consequence, its cash flow may not be sufficient to cover required payments, so it may be forced to sell its noncore businesses. For some firms, financial distress may bring about new organizational forms and new operating strategies. Financial distress can serve as a firm's "early warning" system for trouble. Firms with more debt will experience financial distress earlier than firms with less debt, but they will also have more time for private workouts and reorganization. Firms with low leverage will experience financial distress later and, in many instances, be forced to liquidate.

2.1.2. Some causes of financial distress

The financial management literature divides the causes of a firm's financial distress into two categories: internal and external factors. The external elements are the macroeconomic factors, while the internal components are further separated into financial and nonfinancial factors. Each of these elements has an impact on how the firm operates.
As a result, if they are not adequately handled, they may endanger the organization's ability to continue existing.

Figure 2.2: Determinants of financial distress. Source: Fredrick Ikpesu et al. (2019).

2.1.2.1. Financial factors

There is broad agreement that financial factors have been among the major predictors of financial distress (Turetsky & McEwen, 2001; Nahar, 2006; Chancharat, 2008; Honjo, 2010; Thim et al., 2011; Parker et al., 2011; Kristanti et al., 2016; Devji & Suprabha, 2016; Wesa & Otinga, 2018; Idrees & Qayyum, 2018). When businesses fail to manage the financial aspect, they typically fail to fulfill their debt obligations by the deadline, which is a sign of financial distress. Based on "Corporate Finance, 10th edition" by Ross, Westerfield, and Jaffe, there are five groups of financial ratios, as shown in the figure below.

Figure 2.3: Five main groups of financial factors. Source: Corporate Finance - Ross, Westerfield, Jaffe.

a. Short-term solvency

The purpose of short-term solvency ratios, also known as liquidity measures, is to reveal information about a firm's liquidity. The main concern is the ability of the company to pay its short-term bills without undue stress; these ratios therefore concentrate on current assets and current liabilities.

Table 2.1: Common formulas for calculating short-term solvency
1. Current ratio = Current assets / Current liabilities. Measures the firm's ability to pay its short-term obligations by converting all its current assets into cash.
2. Quick (or acid-test) ratio = (Current assets - Inventory) / Current liabilities. Looks more deeply into the firm's short-term liquidity by subtracting inventories from current assets, because inventories are often the least liquid assets and some may later turn out to be damaged, obsolete, or lost.
3. Cash ratio = Cash / Current liabilities. Shows a company's ability to cover its short-term obligations using only cash and cash equivalents.
Source: Corporate Finance - Ross, Westerfield, Jaffe.

According to Chow et al. (2011), a company is in financial distress when its operating cash flows are insufficient to cover its present obligations to creditors, forcing restructuring, mergers and acquisitions, the issuance of new capital, and renegotiation of loan arrangements. Several other well-known studies (Elloumi & Gueyee, 2001; Nahar, 2006; Thim et al., 2011; Wesa & Otinga, 2018) have confirmed that firms with low levels of liquidity are more likely to experience financial distress because they are unable to pay their recurring debts when due. These studies have shown that liquidity is one of the financial factors affecting a firm's financial distress.

b. Long-term solvency

Long-term solvency ratios are used to address the firm's long-run ability to meet its obligations, or to measure its financial leverage; they are often called financial leverage ratios. When a company frequently uses debt to finance its operations, it may be more susceptible to financial trouble, especially if it becomes challenging for the company to satisfy ongoing obligations (Wesa & Otinga, 2018).

Table 2.2: Common formulas for calculating long-term solvency
1. Total debt ratio = (Total assets - Total equity) / Total assets. Takes into account all debts of all maturities to all creditors.
2. Times interest earned = EBIT / Interest. Measures how well a company has its interest obligations covered; it is often called the interest coverage ratio.
3. Cash coverage = (EBIT + Depreciation and amortization) / Interest. A basic measure of the firm's ability to generate cash from operations, frequently used as a measure of cash flow available to meet financial obligations.
Source: Corporate Finance - Ross, Westerfield, Jaffe.

c. Assets management

These are the specific ratios that measure the efficiency with which a firm uses its assets, also used as measures of turnover. A firm needs to use its assets efficiently, or intensively, to generate sales.

Table 2.3: Common formulas for calculating assets management
1. Inventory turnover = Cost of goods sold / Inventory. As long as the firm is not running out of stock and thereby forgoing sales, the higher this ratio, the more efficiently inventory is being managed.
2. Days' sales in inventory = 365 days / Inventory turnover. Figures out how long, on average, it took the firm to turn its inventory over.
3. Receivables turnover = Sales / Accounts receivable. Calculates how fast the firm collects on its sales.
4. Days' sales in receivables = 365 days / Receivables turnover. Can give insight into how a business generates cash flow.
5. Total asset turnover = Sales / Total assets. Measures the efficiency of a company's assets in generating revenue or sales.
Source: Corporate Finance - Ross, Westerfield, Jaffe.

d. Profitability

Another financial factor that causes financial distress is profitability (Thim et al., 2011; Baimwera & Murinki, 2014; Campbell et al., 2015). A company with low profitability is usually weak at generating sufficient cash flows, which can leave the firm with a low level of liquidity. This could make it more difficult for the company to satisfy its obligations and expose it to a distressing situation.

Table 2.4: Common formulas for calculating profitability
1. Profit margin = Net income / Sales. Expresses the percentage of revenue that the company keeps as profit.
2. EBITDA margin = EBITDA / Sales. Looks more directly at operating cash flows than net income does, and does not include the effect of capital structure or taxes.
3. Return on assets = Net income / Total assets. Shows how much profit a company is generating relative to the value of everything it owns.
4. Return on equity = Net income / Total equity. An indicator of how the stockholders did over the course of the year. ROE is, in an accounting sense, the genuine bottom-line metric of success, because managers' intention is to benefit shareholders.
Source: Corporate Finance - Ross, Westerfield, Jaffe.

e. Market value

Another financial factor that determines financial distress is the share price (Devji & Suprabha, 2016; Idrees & Qayyum, 2018). Share price and financial distress have an inverse relationship, according to numerous studies: a fall in a company's share price may raise the likelihood that it will experience financial trouble, and a consistent decline in an organization's share price is a symptom of impending financial trouble.

Table 2.5: Formulas for calculating market value
1. Price-earnings ratio = Price per share / Earnings per share. Higher P/Es typically indicate that the company has substantial chances for future growth, because they assess how much investors are prepared to pay per dollar of current earnings.
2. Market-to-book ratio = Market value per share / Book value per share. Reflects historical costs; a value less than 1 may indicate that the company has not done a good job overall of generating value for its owners.
3. Market capitalization = Price per share x Shares outstanding. The total dollar market value of a company's outstanding shares of stock.
4. Enterprise value = Market capitalization + Market value of interest-bearing debt - Cash. Calculates the market value of the outstanding stock plus the market value of the outstanding interest-bearing debt, less the cash on hand.
5. Enterprise value multiple = EV / EBITDA. Takes into account a company's debt and cash levels in addition to its stock price, and relates that value to the firm's cash profitability.
Source: Corporate Finance - Ross, Westerfield, Jaffe.
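As a quick illustration of how the formulas in Tables 2.1 to 2.5 are applied, the sketch below implements one ratio from each of the five groups in Python. The input figures are invented for the example and are not data from the study.

```python
# Illustrative implementations of one ratio per group from Tables 2.1-2.5.
# The figures passed in below are made-up numbers, not the study's data.
def current_ratio(current_assets, current_liabilities):
    return current_assets / current_liabilities          # Table 2.1, short-term solvency

def total_debt_ratio(total_assets, total_equity):
    return (total_assets - total_equity) / total_assets  # Table 2.2, long-term solvency

def total_asset_turnover(sales, total_assets):
    return sales / total_assets                          # Table 2.3, assets management

def return_on_equity(net_income, total_equity):
    return net_income / total_equity                     # Table 2.4, profitability

def market_capitalization(price_per_share, shares_outstanding):
    return price_per_share * shares_outstanding          # Table 2.5, market value

# Example: a hypothetical firm.
print(current_ratio(500, 400))           # 1.25: current assets cover short-term debts
print(total_debt_ratio(1000, 600))       # 0.40: 40% of assets financed by debt
print(total_asset_turnover(1500, 1000))  # 1.50: each unit of assets generates 1.5 of sales
print(return_on_equity(90, 600))         # 0.15: 15% accounting return to stockholders
print(market_capitalization(20, 50))     # 1000: total market value of equity
```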
2.1.2.2. Nonfinancial factors

Financial distress among businesses may also be caused by non-financial issues (Dun & Bradstreet, 1986). According to the paper, the non-financial factors include the customer cause, the sales cause, the experience cause, and the disaster cause. The customer cause develops when a company has cash flow issues and few regular clients. The sales cause results from a company's location, low sales, inventory problems, and tough competition for its products, which may lead to low demand for the company's goods. An ineffective management team, the board of directors' lack of participation, and subpar leadership are behind the experience cause. The disaster cause arises from sudden and unpredictable events such as burglary, strikes, fire, and the sudden death of the owner.

2.1.2.3. Macroeconomic factors

According to Ikpesu, Vincent, and Dakare (2020), the operations and performance of businesses are also impacted by macroeconomic factors. If firms fail to strategically identify and manage these factors, the result can be financial distress. The macroeconomic factors are inflation, the interest rate, the exchange rate, instability in government policy, and political unrest.

A nation's rate of inflation can affect a company's operations. When the country's inflation rate is constant and low, the majority of enterprises are likely to perform better. High inflation increases a company's cost of production over time and reduces its ability to compete in the global market when exporting goods, thus reducing the firm's net income and hurting its profitability.

Based on Corporate Governance Models and Applications in Developing Economies, an upturn in interest rates frequently works as a deterrent to investment because firms are put off by high borrowing costs. They may reject potential projects that would not generate positive cash flows in the short term at the prevailing interest rate, and they tend to invest less in working capital and fixed assets, which has a detrimental influence on the company's profitability in the long run. Additionally, an increase in interest rates severely hampers companies' capacity to meet their borrowing obligations in terms of principal and interest repayments.

Firms that depend on imported raw materials or technology may be negatively impacted by exchange rate unpredictability. The cost of production rises as import prices rise due to currency devaluation. When the high cost of production keeps the company from breaking even, it suffers from limited liquidity and losses, and is unable to meet its contractual obligations on time.
Political unrest and instability of government policies are other macroeconomic factors. Political unrest may impede business activities, which could have a negative impact on the organization by endangering its long-term survival. Government laws are frequently changed, which may affect an organization's sales, distribution, supply chain, reputation in the worldwide market, expansion plans, and decision-making process. As a result, a company's inability to weather political turbulence and unpredictable governmental policy may put it in financial peril.

2.1.3. Financial distress costs

Financial distress is very costly when conflicts of interest hinder sound decisions about operations, investments, and financing. This incurs the costs of financial distress, which include the specific costs below:

• Direct costs: According to "Corporate Finance, 10th edition" by Ross, Westerfield, and Jaffe, the direct costs of financial distress are the legal and administrative costs of liquidation or reorganization. During bankruptcy, with fees from hiring lawyers often in the hundreds of dollars an hour, these costs can add up quickly. In addition, administrative and accounting fees can substantially add to the total bill, and if a trial takes place, each side may hire a number of witnesses to testify about the fairness of a proposed settlement, whose fees can easily rival those of lawyers or accountants. A number of academic studies have measured the direct costs of financial distress (J. B. Warner, 1977; M. J. White, 1983; E. I. Altman, 1984; Lawrence A. Weiss, 1990; Stephen J. Lubben, 2000; Arturo Bris et al., 2006). Although large in absolute amount, these costs are actually small as a percentage of firm value.

• Indirect costs: Bankruptcy hampers relations with customers and suppliers. Sales are frequently lost because of both fear of impaired service and loss of trust; indirect costs of financial distress may be the culprit. Unfortunately, although indirect costs seem to play a significant role here, there is no decent quantitative method to estimate their effects.

• Agency costs: When a business is in financial distress, both creditors and shareholders are affected. Both want the business to recover, but in other respects their interests may conflict, and they tend to play their own "games" to protect their interests. Agency costs arise from this conflict of interest between bondholders and shareholders when the business encounters difficulties. Stockholders employ three types of self-serving tactics to harm bondholders and benefit themselves, specifically:

• In selfish investment strategy 1 (incentive to take large risks), the company is in such bad shape that, should a recession strike, it will come dangerously close to bankruptcy with one project and actually go into bankruptcy with the other. The important point is that, in comparison to the low-risk project, the high-risk project boosts company value during a boom and depresses it during a downturn. Thus, financial economists argue that stockholders expropriate value from bondholders by selecting high-risk projects.

• Selfish investment strategy 2 (incentive toward underinvestment) shows that stockholders of a firm with a significant probability of bankruptcy often find that new investment helps the bondholders at the stockholders' expense. The simplest case might be a real estate owner facing imminent bankruptcy.
• In selfish investment strategy 3 (milking the property), an alternative method is to pay out additional dividends or other distributions during difficult financial times, leaving less money in the company for the bondholders.

These "games" make the financial distress of the business more and more serious and may lead to bankruptcy. It is worth noting that the costs associated with financial distress are more severe for firms with many intangible assets, because intangible assets tied to corporate health lose value if the company falls into bankruptcy; bankruptcy forecasting is therefore especially important for such companies. In the end, the expense of selfish investment strategies is paid by the stockholders. Bondholders, who cannot reasonably anticipate assistance from stockholders about to face financial hardship and who expect investment decisions that lower the bonds' value, fortify themselves by increasing the interest rate they demand on the bonds. Because stockholders must pay these high rates, they are the ultimate losers from self-serving tactics. Businesses that confront these debt distortions will keep their leverage ratios low.

• Economic costs: Economic issues cause performance decline, failure, insolvency, and default by affecting the economy as a whole. Although liquidity is the primary cause of insolvency and default, a reduction in performance and failure affect the firm's profitability. Due to economic bailout packages, the government may run a national budget deficit at a time of financial difficulty. Newly established policies must at the same time be optimized to prevent financial distress from worsening and sending the nation into further financial disaster.

2.2. Literature review about data mining

2.2.1. Definition of data mining

Data mining is the process of sorting through large data sets to find patterns and relations that can be used to solve business problems. Data mining is typically an interactive and iterative discovery process, according to Mohammed J. Zaki and Limsoon Wong (2003). This procedure aims to extract patterns, associations, changes, anomalies, and statistically significant structures from big data sets, and the outputs of the mining process should be reliable, original, practical, and clear. Thus, businesses can foresee future trends and make better-qualified business decisions thanks to data mining techniques and tools. Data mining is a crucial component of data analytics and one of the fundamental fields in data science, which makes use of modern and recently developed analytics methods to discover valuable information in data sets. According to Koti Neha and M Yogi Reddy (2020), descriptive data mining tasks categorize features of data in a target data set based on past or recent events. At a more detailed level, data mining is a step in the knowledge discovery in databases (KDD) procedure, a data science approach for collecting, processing, and evaluating data. Although they are often used interchangeably, data mining and KDD are more frequently understood to be separate concepts.

2.2.2. The key properties of data mining

There are many important parameters in data mining, such as classification and clustering rules. Referring to the research of Mehmed Kantardzic (2011), the key properties of data mining are:
• Measurable Quality. With a smaller data set, the accuracy of approximations may be properly assessed.
• Recognizable Quality. The quality of approximations can be simply assessed during the data-reduction algorithm's run time, before any data-mining techniques are applied.
• Monotonicity. The algorithms are usually iterative, and the quality of the results is a nondecreasing function of the computation time and the quality of the input data.
• Consistency. The quality of results is correlated with computation time and the quality of the input data.
• Diminishing Returns. The improvement in the solution is large in the initial computation stages (iterations) and grows smaller as time goes on.
• Interruptibility. The algorithm can be halted at any time and still output some result.
• Preemptability. The algorithm can be stopped and restarted with little extra work.
Data mining can answer questions that cannot be addressed through simple query and reporting techniques.

2.2.3. Data mining processing

According to "Discovering Knowledge in Data: An Introduction to Data Mining" by Daniel T. Larose (2005) and "Data Mining: Concepts, Models, Methods and Algorithms" by Mehmed Kantardzic (2011), the data mining process includes five steps, as follows:

The first step is recognizing the problem and constructing the hypothesis. Finding the cause of the problem is one of the main steps in data mining. The application experts then use their knowledge and experience to develop hypotheses relating to those root causes, which helps them come up with meaningful problem statements. They usually identify a set of independent and dependent variables based on the hypotheses and work with the modelers, the data-mining experts, to build an appropriate model.

The second step is collecting the data. There are generally two options. The first approach is referred to as a designed experiment, which is managed by an expert (modeler). The second is the observational approach, used when the expert has no power to change how the data are generated. Most samples in data-mining applications are assumed to come from random data generation. This assumption is necessary because it makes the final results more accurate and keeps the data collection process completely objective, giving additional evidence to support the ultimate outcomes. In addition, it is important to confirm that the data used to estimate a model and the data used later to test and apply the model come from the same sample distribution; if this is not true, the estimated model cannot be applied correctly.

The third step is preprocessing, or cleaning, the data. In this paper, we used a method called detection and removal of outliers. Outliers frequently originate from measurement errors and coding and recording problems, and occasionally they are just naturally anomalous results. Such unrepresentative samples have a significant impact on the final model. There are two common ways to deal with outliers: either create robust modeling methods that are insensitive to outliers, or detect and remove them (a minimal sketch of one common removal rule follows this process description).

The fourth step is estimating the model. The primary job in this phase is to choose and put into practice the best data-mining model, which is not simple. Implementation is typically based on several models, and choosing the most appropriate one is an extra necessity.

The final step is interpreting the final results and concluding. Data mining models usually play a crucial role in supporting decision-making. Thus, for such models to be useful, they must be interpretable, because it is unlikely that humans will make decisions based on complex "black-box" models. There is a trade-off between the accuracy of a model and the precision of its interpretation: simple models are typically easier to construct and understand but less precise, while complicated models generate extremely precise findings whose outputs are normally unreadable. Therefore, the issue of reading the final results coming from these models is treated as a separate job, with particular methods for validating the outcomes.

Figure 2.4: Illustration of the data mining process. Source: Mehmed Kantardzic (2011).
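The sketch below illustrates the outlier-removal step with one common rule, the 1.5 x IQR fences. This is only one possible detector; the study's Orange workflow may use a different method, and the file and column names here are hypothetical.

```python
# One common outlier rule (1.5 x IQR fences), shown as an illustration of the
# "detection and removal of outliers" step; the study's Orange workflow may
# use a different detector. File and column names are assumptions.
import pandas as pd

def remove_iqr_outliers(df: pd.DataFrame, columns) -> pd.DataFrame:
    """Drop rows where any listed column falls outside the 1.5 x IQR fences."""
    mask = pd.Series(True, index=df.index)
    for col in columns:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        mask &= df[col].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df[mask]

df = pd.read_csv("training_dataset.csv")  # hypothetical file
clean = remove_iqr_outliers(df, ["NWCTA", "RETA", "EBITTA", "MVETD", "NRTA"])
print(len(df), "->", len(clean), "observations after removing extreme outliers")
```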
Thus, for such models to be useful, they must be interpretable, because humans are unlikely to base decisions on complex "black-box" models. There is a trade-off between the accuracy of a model and the interpretability of its output. Simple models are typically easier to construct and understand, but they are also less precise. Complicated models generate highly precise findings, but their outputs are normally hard to read. Therefore, interpreting the final results of such models is treated as a separate task with its own methods for validating the outcomes.

Figure 2.4: Illustration of the data mining process (Source: Mehmed Kantardzic, 2011)

2.1.4. Data mining methods

There are many data mining methods. Classification is a technique used by data scientists to categorize data into a given number of classes. Festim Halili and Avni Rustemi (2016) claim that applications in credit risk management and fraud detection are especially well suited to this kind of analysis. This method usually relies on classification algorithms based on decision trees or neural networks. It can be performed on structured or unstructured data, and its main goal is to identify the category or class under which new data will fall.

Regression is a statistical model that predicts continuous output values from a set of input values. This technique is discussed by Breiman et al. (1984), Steinberg and Colla (1995), and Yohannes and Webb (1998). The main purpose of the regression method is to explore and map data.

Third is clustering, the process of grouping unlabeled objects or data with similar characteristics to make data description easier. According to Fan Cai (2016), clustering also has its drawbacks: traditional clustering, such as K-means, can only handle numerical attributes and is weak at computing an accurate behavior-response mapping, since training is unsupervised and drops the targets.

Summarization is the presentation of data in a comprehensible and informative manner. A carefully performed summary conveys trends and patterns from the dataset in a simplified form. According to Daniel T. Larose (2005), this method is used to compare records after subsampling.

Another method is dependency modeling, which finds a local model that describes significant dependencies between variables. Change and deviation detection, in turn, builds a model that describes the most significant changes in the data relative to previously measured or normative values.

2.1.5. Data mining tool used in the study – Orange

Orange is open-source data mining software programmed in Python, with an intuitive interface and easy interaction. With its many functions, it can analyze data from simple to complex, create attractive and interesting graphics, and make data mining and machine learning easier for both novice and expert users. Orange is software aimed at automation: it is a handy data mining tool that is easy to use thanks to its compact interface and coherently, reasonably arranged toolboxes. Its tools (widgets) provide basic functions such as reading data, displaying tabular data, selecting data properties, training data for prediction, comparing machine learning algorithms, and visualizing data elements. One of the simplest data mining tools is Orange, according to Janez Demšar and Blaž Zupan (2012). It works on Windows, Linux, and OS X, and numerous machine learning, preprocessing, and data visualization methods are included in the basic installation. Therefore, Orange is the software that the research team decided to use in this research paper.
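Beyond its widget interface, Orange also exposes a Python scripting API, which is one way to reproduce workflow steps like those in this paper programmatically. The following is a minimal sketch only; it assumes the Orange3 package is installed, and the file name "training.xlsx" is a hypothetical placeholder, not the actual project file.

```python
# Minimal sketch: loading a dataset through Orange's Python API.
# Assumes the Orange3 package is installed; "training.xlsx" is a
# hypothetical file name used only for illustration.
import Orange

data = Orange.data.Table("training.xlsx")  # Orange reads .xlsx/.csv/.tab
print(len(data), "rows")
print(data.domain)  # features, class variable, and meta attributes
```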
2.3. Literature review about data classification

2.3.1. Definition of data classification

Data classification is one of the main research directions of data mining. Data mining classification is a method that divides data points into various classes. Data mining techniques such as classification allow pattern discovery, forecasting, knowledge discovery, etc., in different business sectors, supporting decisions about future business trends, claim Dimitrios Papakyriakou and Ioannis S. Barbounakis (2022). The quality of the classification can be improved by utilizing supervised learning techniques based on historical data.

2.3.2. Data classification process

Data mining classification includes two steps: a learning phase and a classification phase. According to Adelaja Oluwaseun Adebayo and Mani Shanker Chaubey (2019), one of the main goals of the learning process is to develop models with high generalization capability, that is, models that reliably predict the class labels of previously unseen records. The main focus of this phase is building the classification model using the various techniques available. For the model to learn in this step, a training set is necessary. Using the target dataset as its basis, the trained model produces correct results, and incorporating test data makes the generated classification model more accurate.

Figure 2.5: Illustration of building a classification model
Source: Mohamed Osman Ali Hegazi, et al (2016)

The second stage is estimating the accuracy of the model and classifying new data (classification). The main focus of this phase is evaluating the model built in the learning phase on test data with known labels, and then using the validated model to assign class labels to new records. The fundamental goal of classification algorithms, according to Koti Neha and M Yogi Reddy (2020), is to forecast the target class by examining the training dataset, that is, to categorize the data into a predetermined number of classes.

Figure 2.6: Illustration of classifying new data and estimating the accurateness
Source: Mohamed Osman Ali Hegazi, et al (2016)

2.3.3. Data classification methods

Commonly used methods for data prediction include Logistic Regression, SVM (Support Vector Machine), Decision Tree, Neural Network, etc.

2.3.3.1. Logistic Regression

Linear regression is generally employed to model continuous-valued functions. Theoretically, modeling categorical response variables can be handled through generalized linear models, of which logistic regression is one popular kind. Logistic regression models the likelihood of an event occurring using a set of independent variables: rather than attempting to forecast the value of the dependent variable, it seeks to determine its likelihood. We use logistic regression only when the model's output variable is categorical and binary; however, there are no restrictions on the types of the explanatory variables, so this kind of model supports a broad range of input data. A common statistical method that identifies the best linear logistic regression model is SimpleLogistic (Cox, 1958).
It is similar to the LogitBoost method with simple regression functions. This algorithm, which depends on the logistic function, models the log odds of the outcome rather than the outcome itself. SimpleLogistic explains how one or more independent variables relate to the categorical dependent variable. The logistic regression model is used to predict a categorical variable from one or more continuous independent variables. The dependent variable can be ordinal or discrete; the independent variables can be interval, scale, or discrete. We can represent the formula as follows:

z = Σᵢ₌₀ᵈ wᵢ xᵢ

P(y) = sigmoid(z) = 1 / (1 + e⁻ᶻ)

In which:
• d is the number of features of the data,
• xᵢ is the value of feature i,
• wᵢ is the weight, which is initialized at first and then adjusted during training.

Since the outcome is a probability, the dependent variable is bounded between 0 and 1. Therefore, if the predicted probability of y given x were to exceed 1 or fall below 0, the interpretation of the logistic regression coefficients would be meaningless.

Figure 2.7: Illustration of logistic regression (Source: Synthesis by the author)
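The weighted sum and sigmoid above can be made concrete in a few lines of Python. This is a minimal numerical sketch; the weights and feature values are made up for illustration and are not estimated from this study's data.

```python
import math

def sigmoid(z: float) -> float:
    # Squashes z into a probability strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, features) -> float:
    # z = sum_i w_i * x_i, with weights[0] acting as the intercept (x_0 = 1).
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return sigmoid(z)

# Hypothetical weights and feature values, for illustration only.
p = predict_proba([0.1, 0.8, -0.5], [1.2, 0.3])
print(f"P(y = 1) = {p:.3f}")  # always bounded in (0, 1), as noted above
```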
2.3.3.2. Support Vector Machine

The foundations of SVMs were laid by Vladimir Vapnik and Alexey Chervonenkis (1963), and SVMs are becoming more and more popular because of their many appealing qualities and promising empirical performance; they embody the structural risk minimization (SRM) principle. Although SVMs were first created to address the classification challenge, they have lately been extended to regression problems (the prediction of continuous variables). SVMs can be used to solve regression problems by introducing a different loss function that incorporates a distance measure. Vapnik et al. (1996) introduced an SVM variant that performs regression rather than classification, known as Support Vector Regression (SVR). An SVM is a supervised learning method that generates learning functions from a collection of labeled training data. It has a strong theoretical underpinning and needs only a small number of samples to train; studies have revealed that it is largely unaffected by the dimensionality of the sample. The algorithm addresses the general problem of learning to distinguish between members of two classes represented by n-dimensional vectors. The learned function can be a generic regression or a classification function (where the output is binary). SVM has been found effective in many classification and regression problems. Z. Erdem et al. (2005) discuss how the use of SVM ensembles has opened up new possibilities for optical character recognition. When new training data sets arrive in batches, ensemble-based methods are also used with a dynamic weighting scheme, building additional training models over the earlier ones in an incremental approach (X. Yang et al., 2009). To handle concept drift, R. Elwel et al. (2011) concentrate on ensemble methods.

The standard form of SVM takes input data, treats the observations as vectors in space, and classifies them into two different classes by constructing a hyperplane in multidimensional space as the interface between the data layers. The key idea is that the decision boundary should be as far away from the data points of both classes as possible: the optimal hyperplane is the one that maximizes the distance to the closest data point of each class. The margin is intuitively understood as the space, or gap, between the two classes as determined by the hyperplane. Geometrically, the margin is the shortest distance from the hyperplane to the closest data points of the two classes.

SMO (Sequential Minimal Optimization) (John C. Platt, 1998) is an improved training method for SVMs that has demonstrated good performance across a variety of problems. The complexity of SVM training and implementation had previously constrained its use; SMO improves on this by being conceptually straightforward, simple to implement, and generally faster than standard SVM training.

Figure 2.8: Maximum-margin hyperplane and margins for an SVM trained with samples from two classes (Source: Synthesis by the author)

In figure 2.8, the blue and green points lying on the two boundary lines (black dashed) are called support vectors, because they have the task of helping to find the hyperplane (red line).

2.3.3.3. Decision Tree

One appealing classification method is the decision tree: a group of decision nodes connected by branches that extend downward from the root node and end in leaf nodes. Starting at the root node, which is customarily positioned at the top of the decision tree diagram, attributes are evaluated at the decision nodes, with each possible outcome leading to a branch. Each branch ultimately leads either to another decision node or to a terminating leaf node, although the resulting tree will not always be the simplest possible one. A decision tree comprises nodes where attributes are evaluated. In a univariate tree, the test at each internal node uses only one of the attributes, and the outgoing branches of a node represent all potential results of the test at that node. Figure 2.9 provides a straightforward decision tree for the classification of samples using the two input attributes X and Y.

Figure 2.9: A simple decision tree with the tests on attributes X and Y (Source: Synthesis by the author)

A decision tree provides a powerful technique for classification and prediction in diabetes diagnosis problems. Various decision tree algorithms are available to classify data, including ID3, C4.5, C5, J48, CART, and CHAID (Aiswarya Iyer et al., 2015). The popular decision tree model C4.5 builds the branch with the largest information gain ratio (Quinlan JR, 1993). Small changes to the dataset, however, will probably produce a significantly different decision tree. In terms of advantages, decision trees are easy to understand, do not require normalization of data, can handle many different data types, and process large amounts of data quickly. However, decision trees handle time-dependent data poorly, and the time and cost required to build decision tree models can be considerable.
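A univariate tree like the one in Figure 2.9 can be sketched with scikit-learn as follows; the data are synthetic and the feature names X and Y simply mirror the figure, so this is an illustration rather than the configuration used in this study.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic two-attribute data standing in for the X/Y example above.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, y)

# Print the learned decision nodes and leaf nodes as nested rules.
print(export_text(tree, feature_names=["X", "Y"]))
```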
2.3.3.4. Neural Network

The discovery that complicated learning systems in animal brains consist of networks of closely interconnected neurons served as the inspiration for neural networks. Although a given neuron may have a very straightforward structure, dense networks of interconnected neurons are capable of carrying out challenging learning tasks such as classification and pattern recognition. At a fundamental level, artificial neural networks are an effort to mimic the nonlinear learning that takes place in natural neuronal networks. The Artificial Neural Network (ANN) is another method frequently used in data mining applications. A neural network is a network of densely connected processing units with a complex structure that exhibits some characteristics of a biological neural network. Because of how neural networks are built, users have the option to apply parallelism at the various layer levels. Fault tolerance is another important ANN feature: ANNs work best when there is a lot of noise and uncertainty in the information. ANN is an information-processing method that deviates significantly from traditional methods in that it solves problems by training on examples rather than by following a fixed procedure (K. Anil Jain et al., 1996; George Cybenko et al., 1996).

Based on the training methodology, neural networks may be split into two categories: supervised and unsupervised. Unsupervised networks do not need the desired output for every input, whereas supervised networks do. The back-propagation algorithm is the most widely used neural network algorithm. Although there are several ways to employ neural networks for classification, the emphasis here is on feedforward multilayer networks, or multilayer perceptrons, the most thoroughly studied and utilized neural network classifiers (R.P. Lippmann, 1989).

Let us examine the simple neural network shown in Figure 2.10. The neural network is made up of two or more layers, though most networks have three: an input layer, a hidden layer, and an output layer. There may be additional hidden layers, although most networks have only one, which is adequate for the majority of applications. The neural network is fully connected, meaning that each node is linked to every node in the next layer but not to nodes in the same layer. The number and kind of attributes in the data set usually determine the number of input nodes. Both the number of hidden layers and the number of nodes in each hidden layer are user-configurable, and depending on the classification task at hand, more than one node may be present in the output layer.

Figure 2.10: Simple neural network (Source: Synthesis by the author)

Utilizing neural networks has many benefits, including the fact that they are extremely robust to noisy data. Uninformative (or even erroneous) instances in the data set can be tolerated by the network because it has multiple nodes (artificial neurons) with weights assigned to each link. In contrast to decision trees, which generate rules that are simple for non-experts to comprehend, neural networks are difficult for humans to interpret. Additionally, training periods for neural networks are often longer than those for decision trees, frequently stretching to several hours. Complex classification problems can be handled by multilayer perceptrons (MLP) (Shrivastava et al., 2011; Hagan MT et al., 1996). However, MLP's drawbacks are obvious: there is no prior knowledge of the ideal hidden layer size. Too small a setting results in a very weak network that over-generalizes; too large a setting results in very slow training and many hyperplanes that may coincide after training, and otherwise the problem is overfitting.
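A minimal scikit-learn sketch of such a fully connected feedforward network is shown below; the single hidden layer of 10 units and the synthetic data are illustrative assumptions, not the configuration used later in this study.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic data: 300 samples, 5 input features.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# One hidden layer with 10 nodes; weights on each link are adjusted by
# back-propagation-style gradient updates during fit().
net = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)
net.fit(X, y)
print("training accuracy:", net.score(X, y))
```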
There are several sectors where neural networks may be applied, including finance, trading, business analysis, business planning, and corporate risk management, as well as other areas such as business risk assessment and weather forecasting. Neural networks are also widely used in technology and other applications such as video games, speech recognition, social network filtering, automatic translation, and medical diagnostics. There are also application cases in which neural networks analyze transactions using historical data and identify better trading opportunities.

2.3.4. Methods to evaluate classification models

Accuracy, speed, scalability, interpretability, and robustness are typically used as criteria when comparing different algorithms (Sossi Alaoui et al., 2017). The confusion matrix is a method that shows the predicted and the actual classifications, according to Provost and Kohavi (1998). It gives rise to many standard measures, including Precision, Recall, and the F-measure.

2.3.4.1. Confusion Matrix, Accuracy, ROC, AUC, and Precision/Recall

The confusion matrix is a matrix that shows, for the data points belonging to a particular class, which class they are predicted to fall into. The confusion matrix is k x k in size, where k is the number of classes. Assume that class A is positive and class B is negative. The important terms in the confusion matrix are the following:

Figure 2.11: Confusion matrix for binary classification (Source: Synthesis by the author)

• True positive (TP): a positive instance correctly predicted as positive.
• False positive (FP): a negative instance incorrectly predicted as positive.
• False negative (FN): a positive instance incorrectly predicted as negative.
• True negative (TN): a negative instance correctly predicted as negative.

Figure 2.12: Outcomes of a confusion matrix (Source: Synthesis by the author)

A false positive is widely described as a "type I error", and a false negative as a "type II error". Precision and recall are calculated from the confusion matrix. These measures extend classification accuracy and provide a more detailed insight into model evaluation; which one we prefer depends on the task and our goals.

Precision measures how reliable our model is when its prediction is positive: it focuses on making accurate positive predictions and shows how many of them turn out to be correct.

Precision = TP / (TP + FP)

Recall measures how completely our model identifies the positive class. Recall focuses on the actual positive instances and reflects how many of them the model predicts correctly.

Recall = TP / (TP + FN)

The F-score is an additional statistic that combines precision and recall into a single number; it is calculated as the harmonic mean of the two:

F1 = 2 × Precision × Recall / (Precision + Recall)

Because it considers both false positives and false negatives, the F-score is a more relevant metric than accuracy in situations with an unequal class distribution. The best F-score value is 1 and the worst is 0. In addition, we also have the formula for accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Accuracy is the fraction of correctly classified samples in the overall data set. It only tells us the proportion of data that is correctly classified; it does not tell us how each class is classified, which classes are classified most accurately, or into which classes data is typically misclassified. The true positive rate (TPR), also known as sensitivity, is the same as recall: it measures the fraction of the positive class that is correctly predicted to be positive. Specificity is comparable to sensitivity, except that it is only concerned with the negative class: it measures the fraction of the negative class that is correctly predicted to be negative.

Figure 2.13: Sensitivity and Specificity (Source: Synthesis by the author)
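The relationships among these measures can be illustrated with a short calculation. The counts below are invented for illustration only; they are not from this study's confusion matrix.

```python
# Computing the metrics defined above from confusion-matrix counts.
TP, FP, FN, TN = 90, 10, 5, 95  # made-up counts, for illustration only

precision = TP / (TP + FP)
recall = TP / (TP + FN)              # also sensitivity / true positive rate
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)
specificity = TN / (TN + FP)

print(f"precision={precision:.3f}, recall={recall:.3f}, F1={f1:.3f}")
print(f"accuracy={accuracy:.3f}, specificity={specificity:.3f}")
```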
The ROC curve and AUC (area under the curve) measures are best explained using a logistic regression example. Logistic regression outputs the likelihood of a sample being positive; to discriminate between the positive and negative classes, we define a threshold value for this probability, and a sample is categorized as positive if its probability exceeds the threshold. As a result, different threshold values cause certain samples to be categorized differently, affecting precision and recall scores. The ROC curve describes the model's performance at all threshold levels by merging the confusion matrices at all threshold values. The ROC curve's X-axis represents the false positive rate (1 - specificity), while the Y-axis represents the true positive rate (sensitivity):

TPR (Sensitivity) = TP / (TP + FN)

FPR (1 - Specificity) = 1 - TN / (TN + FP) = FP / (TN + FP)

When the threshold is 0, the model predicts that all samples are positive, so TPR (sensitivity) is 1; however, since no negative prediction is made, FPR (1 - specificity) is also 1. When the threshold is set to 1, TPR and FPR both become 0. As a result, setting the threshold to 0 or 1 is not a wise choice. Our goal is to decrease the false positive rate (FPR) while increasing the true positive rate (TPR). The ROC curve shows that when TPR increases, FPR also increases. How many false positives can we tolerate, then? Rather than trying to find the optimal threshold value on the ROC curve, we can use a different statistic known as AUC (area under the curve). The area under the ROC curve between (0,0) and (1,1) is determined using integral calculus. AUC essentially aggregates the model's performance over all threshold levels. AUC values can be as high as 1, which denotes a flawless classifier; the larger the AUC, the better the classifier performs. Classifier A outperforms classifier B in the following figure.

Figure 2.14: Area under the ROC Curve (Source: Melo, F. (2013))

2.3.4.2. Cross Validation: Holdout and K-fold cross validation

The hold-out method divides the original data set into two independent sets according to a certain ratio; for example, the training set accounts for 70% and the testing set for 30%. This method is suitable for large data sets, but the samples may not be representative of the entire data (e.g., some classes may be missing from the test set). It can be improved by using stratified sampling, so that each class is evenly distributed in both the training and evaluation datasets, or by random subsampling: performing the hold-out k times and taking the accuracy acc(M) as the average of the k accuracy values.

Figure 2.15: Hold-out method
(Source: Synthesis by the author)

K-fold cross-validation splits the data into k subsets of the same size (called folds). One of the folds is used as the evaluation dataset and the rest as the training set, and the process is repeated until every fold has served as the evaluation dataset.

Figure 2.16: K-fold cross-validation (Source: Synthesis by the author)

The k-fold method is used more often because the model is trained and evaluated on many different pieces of data, which increases the reliability of the evaluation measures. The hold-out method usually performs well on large data sets; however, on small or moderate data sets, the effectiveness of a model evaluated this way depends heavily on how the data are divided and on the division ratio.
2.4. Previous empirical evidences applying data mining in forecasting the financial distress

Numerous research areas use data mining, and the prediction of business bankruptcy is a well-known topic in finance-related research. Investors aim to lower credit risk and keep a strategic distance from unsuccessful investments (Wilson and Sharda, 1994). As a result, this subject has been investigated by a variety of authors in the past, and predictions of bankruptcy significantly affect the financial markets.

2.4.1. Empirical evidences with foreign research subjects

Numerous foreign papers on the use of data mining in the financial sector have been published worldwide in order to improve the evaluation of enterprise financial distress. The research paper "A data mining approach to the prediction of corporate failure" (2001) by Feng Yu Lin and Sally McClean studied the financial data of 1,113 UK companies from 1890 to 1999. The authors use four single classifiers - discriminant analysis, logistic regression, neural networks, and decision trees - each based on two feature selection methods for predicting corporate failure. The analysis concludes that a hybrid method performs better when predicting company collapse a year in advance.

As shown in "Corporate Bankruptcy Prediction using Data Mining Techniques: Evidence from Iran" (2012) by Mahmoud Mousavi Shiri, Mahnaz Ahangary, Seyed Hesam Vaghfi, and Abolfazl Kholousi, the study sample consists of 144 companies listed on the Tehran stock exchange from 2005 to 2009. Various data mining algorithms, such as neural networks, logistic regression, SVM, BayesNet, and decision trees, were compared for years t, t-1, and t-2, where year t is the bankruptcy year for the distressed companies and non-bankrupt companies are matched into the sample. The CART algorithm proved most effective in Iran at separating bankrupt from non-bankrupt firms, with an average accuracy of 94.93% over the three-year period.

The study "Comparison of Support Vector Machine and Back Propagation Neural Network in Evaluating the Enterprise Financial Distress" (2010) by Ming-Chang Lee and Chang constructed an enterprise financial analysis methodology based on a support vector machine and a back-propagation neural network. They concluded that the support vector machine provides higher precision and lower error rates than the back-propagation network, even though the difference between the performance measurements is slight.

"Comparison Of Wavelet Network And Logistic Regression In Predicting Enterprise Financial Distress" (2015) by Ming-Chang Lee and Li-Er Su reviewed the wavelet neural network structure, the wavelet network training algorithm, and accuracy and error rates (accuracy of classification, Type I error, and Type II error). The major research contribution is a potential model for predicting business failure (a wavelet network model versus a logistic regression model). The results reveal that the wavelet network model is highly accurate and improves on the logistic regression model in terms of Type I error, Type II error, and overall prediction accuracy.
According to Efstathios Kirkos and Yannis Manolopoulos' review "Data Mining In Finance And Accounting: A Review Of Current Research Trends" (2015), the purpose of that study is to classify the most popular data mining methods in the financial literature. The selected research papers come from reputable journals of four publishers: Elsevier, Emerald, Kluwer, and Wiley. The conclusion is that most analyses seem to favor the neural network model.

2.4.2. Empirical evidences with Vietnamese research subjects

Data mining also appears in some financial research publications in Vietnam. Their findings imply that the likelihood of financial distress at Vietnamese enterprises decreases as financial liquidity, asset productivity, solvency, and profitability increase.

An example is "Khai Phá Dữ Liệu Trên Nền Oracle Và Ứng Dụng" (Data mining on the Oracle platform and its applications) (2014) by Nguyễn Thị Minh Lý, who researched the problem of commercial bank classification. The thesis proposed a model to address this problem based on the Naive Bayes approach, the Support Vector Machine method, and the decision tree method. The experimental findings demonstrate that the suggested decision tree-based model has the highest accuracy.

The study "Ứng dụng Data Mining dự báo kiệt quệ tài chính ở các Công ty Dược niêm yết tại Việt Nam" (Applying data mining to forecast financial distress at listed pharmaceutical companies in Vietnam) (2016) by Hồ Thị Thanh Thảo used the data mining method to identify early signs of financial distress, such as declining profits, and to show which financial metrics are most effective in forecasting financial distress. The research subjects are Vietnamese joint stock companies listed on the Ho Chi Minh City Stock Exchange and the Hanoi Stock Exchange from 2011 to 2015. The algorithms used include a decision tree, a neural network, and a Support Vector Machine, to determine whether these algorithms predict financial distress well and which one is best. Based on the research results, all three methods forecast accurately, but the decision tree method gives the most exact responses.

As shown by Binh Pham Vo Ninh, Trung Do Thanh, and Duc Vo Hong's research "Financial distress and bankruptcy prediction: An appropriate model for listed firms in Vietnam" (2018), that study used data from 800 listed companies from 10 different industries traded on the Ho Chi Minh Stock Exchange (HOSE) and the Hanoi Stock Exchange (HNX) between 2003 and 2016. Logistic regression is used in a complete model that takes into account the crucial elements of business financial distress: accounting factors, market factors, and two macroeconomic indicators. Additionally, alternative default prediction models are compared using the AUC. The empirical findings show how accounting, market, and macroeconomic factors affect the likelihood of financial hardship in Vietnamese firms over the study period; however, the comprehensive model indicates that accounting factors seem to have a greater influence than market variables.

CHAPTER 3: METHODOLOGY

3.1. Research process

Figure 3.1 illustrates that there are two classes of factors that cause firms to experience financial distress. The first class is internal factors, which consists of two sub-factors, namely financial factors and non-financial factors; the second is the external class, which includes macroeconomic variables.
In this paper, we use some of the financial factors with the application of data-mining techniques such as classification, outlier detection, prediction, and visualization, as well as their algorithms, namely Neural Network, Logistic Regression, Support Vector Machine, and Decision Tree, to find the most appropriate model for the prediction of financial distress.

Figure 3.1: Overall framework of using data mining techniques for prediction of financial distress
Source: Synthesis by the author

3.2. Research model

In this research, we build the predictive model of financial distress probability based on the research of Dr. Edward I. Altman (1968). Edward Altman, a professor at New York University, employed multiple discriminant analysis (MDA) to distinguish between bankrupt and non-bankrupt firms based on a set of predesignated financial variables. Altman demonstrates that a year before bankruptcy, the financial characteristics of bankrupt and non-bankrupt companies are considerably different. The original Altman Z-score quantitative model proposed by Altman in 1968 is frequently used to assess the probability of bankruptcy of a manufacturing or wholesale company listed on the stock exchange within the next 2 years. The probability of correctness of this model is 94% within 1 year and 74% within 2 years. Altman's estimated discriminant function is:

Z = 1.2X1 + 1.4X2 + 3.3X3 + 0.6X4 + 0.999X5

The fundamental premise of the Z-score methodology is these financial measures:

X1: Net working capital/Total assets.
X2: Retained earnings/Total assets.
X3: Earnings before interest and taxes/Total assets.
X4: Market value of equity/Book value of total liabilities.
X5: Sales/Total assets.
Z: Probability of financial distress of a manufacturing or wholesale company listed on the stock exchange.

In which:
• Z-score > 2.99: Safe Zone - business is not in danger of bankruptcy.
• 1.81 < Z-score < 2.99: Grey Zone - business is in danger of bankruptcy.
• Z-score < 1.81: Distress Zone - business is at high risk of bankruptcy.

3.3. Variable measurements

3.3.1. Dependent variable: Z-scores

We first calculate the Z-score for each firm using the Altman (1968) model with the given data, and then use Altman's interpretation of the score to convert those numbers into the qualitative data that forms the variable "Results". This approach is suitable because the interpretation has been strictly reviewed by other academic papers, such as Nguyen Phuc Canh and Vu Xuan Hung (2014), Hoang Thi Hong Van (2020), and Rahman et al. (2021). The Altman Z-score interpretation is shown in the table below:

Table 3.1: Interpretation of Z-score

| The interval | Value | Interpretation |
|---|---|---|
| Z-score > 2.99 | Safe Zone | Business is not in danger of bankruptcy |
| 1.81 < Z-score < 2.99 | Grey Zone | Business is in danger of bankruptcy |
| Z-score < 1.81 | Distress Zone | Business is at high risk of bankruptcy |

Source: Altman model (1968)
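The scoring and zoning rule above can be expressed in a few lines of Python. This is a minimal sketch of the discriminant function and zone mapping as restated in this section; the example ratios are the sample means from Table 3.3 and are used purely for illustration.

```python
# A sketch of the Altman (1968) Z-score rule stated above: compute Z from
# the five ratios and map it to the three zones used in this study.
def altman_z(nwcta, reta, ebitta, mvetd, nrta):
    return 1.2 * nwcta + 1.4 * reta + 3.3 * ebitta + 0.6 * mvetd + 0.999 * nrta

def zone(z):
    if z > 2.99:
        return "Safe Zone"       # not in danger of bankruptcy
    if z < 1.81:
        return "Distress Zone"   # at high risk of bankruptcy
    return "Grey Zone"           # in danger of bankruptcy

# Illustration using the sample-mean ratios from Table 3.3.
z = altman_z(0.1468, 0.0458, 0.1197, 2.7143, 1.3401)
print(round(z, 3), zone(z))  # about 3.60 -> Safe Zone
```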
3.3.2. Independent variables

3.3.2.1. Net Working Capital/Total assets (NWCTA)

Net working capital is the difference between the firm's current assets and its current liabilities. The numerator represents the company's financial capacity to pay its short-term obligations, i.e., the short-term liquidity of the firm. Dividing net working capital by total assets measures the firm's net liquid assets relative to total capitalization and eliminates differences in size among the firms. According to Altman (1968), this liquidity ratio was the most useful and showed more statistical significance on both a univariate and a multivariate basis. Its inclusion is consistent with it being the strongest indicator of eventual discontinuance. Calculation formula:

X1 = Net Working Capital / Total Assets

X1 is significant and has a positive relationship in the Z-score model. The articles of Nguyễn Xuân Hùng (2014); Trung Do Thanh, Binh Pham Vo Ninh, and Duc Vo Hong (2018); and Hoàng Thị Hồng Vân (2020) all prove that X1 is a stable variable qualified for inclusion in the model.

3.3.2.2. Retained earnings/Total assets (RETA)

The retained earnings to total assets ratio gauges the corporation's ability to accumulate earnings using its total assets. This indicator shows long-term cumulative profitability, reflects the extent of the company's leverage, and eliminates differences in the amount of total assets among the firms. In addition, as mentioned in the pecking order theory, retained earnings are also the firm's first back-up source of funds for fulfilling its obligations to debtors when the company cannot generate profits or even has negative operating cash flow from its business activities. As claimed by Altman (1968), this ratio implicitly takes into account the age of a company, because a young firm has not had time to build up its cumulative profits. Calculation formula:

X2 = Retained Earnings / Total Assets

In the Z-score model, X2 exhibits a strong and positive relationship. The articles of Nguyen Xuan Hung (2014); Trung Do Thanh, Binh Pham Vo Ninh, and Duc Vo Hong (2018); and Hoang Thi Hong Van (2020) all demonstrate that X2 is a stable variable suitable for inclusion in the model.

3.3.2.3. Net revenues/Total assets (NRTA)

The capital-turnover ratio is a common financial ratio that shows how well a company's assets generate revenue, and it illustrates how well management handles challenging market conditions. On a purely univariate statistical-significance basis, it would not have appeared in the model at all; however, according to Altman (1968), this ratio ranks second in its contribution to the overall discriminating ability of the model because of its unique relationship to the other variables. Calculation formula (X5 in the discriminant function above):

X5 = Net Revenues / Total Assets

A number of research papers have included this ratio in their models and demonstrated a positive relationship, such as Nguyen Xuan Hung (2014), Arif Darmawan and Joko Supriyanto (2018), and Hoang Thi Hong Van (2020). Thus, this variable is significant and stable enough to be included in the model.

3.3.2.4. EBIT/Total assets (EBITTA)

Obtained by dividing earnings before interest and taxes (EBIT) by total assets, this ratio is regarded as a sign of a company's ability to make operating profits by utilizing its assets effectively, while eliminating tax and interest factors. The higher the ratio, the better the firm can generate sufficient cash flow to pay its debtors and the government. According to Altman (1968), this ratio is especially suitable for research on corporate failure, because a firm's ability to produce earnings is what ultimately determines whether it remains in business. Calculation formula (X3 in the discriminant function above):

X3 = EBIT / Total Assets

Altman's results show that this ratio is significant at the 0.001 level. The research of Shashikanta Baisag and Dr. PramodKumar Patjoshi (2020) shows that EBIT to total assets has a positive effect on financial distress prediction.
However, according to Binh Pham Vo Ninh, Trung Do Thanh, and Duc Vo Hong (2017), this ratio is reported to be statistically significant at the 1–10% level with a negative correlation with default probability, and it has the largest impact on financial distress in a logistic regression. Because of its consistency and significance, this measure can be used to predict financial distress.

3.3.2.5. Equity-to-debt ratio (MVETD)

The equity-to-debt ratio evaluates the company's overall debt in relation to the capital the owners initially put up and the profits retained over time. A very low debt-to-equity ratio can be a sign that the company is very mature and has accumulated a lot of money over the years. Altman (2005) used this metric to quantify default in emerging markets: the greater the ratio, the lower the default likelihood. Calculation formula (X4 in the discriminant function above):

X4 = Market Value of Equity / Book Value of Total Debt

When analyzing the performance of the Z-score model with various modifications of the original model for firms from 31 European and three non-European countries, Edward I. Altman, Małgorzata Iwanicz-Drozdowska, Erkki K. Laitinen, and Arto Suvas (2016) found that for each model the coefficient of the equity-to-debt ratio is very close to zero, indicating a minor effect on the logit. However, equity-to-debt is significant at the 0.001 level in Altman (1968) and in the research of Binh Pham Vo Ninh, Trung Do Thanh, and Duc Vo Hong (2017). Therefore, this variable has enough significance and reliability to be used in financial distress forecasting.

3.3.3. Summary of variable measurements

Table 3.2 below summarizes the variable measurements for the five independent factors affecting the probability of financial distress of listed manufacturing and wholesale companies in Vietnam in 2022 and 2023:

Table 3.2: Five independent variables selected in this study

| No | Category | Code | Formula | Paper |
|---|---|---|---|---|
| - | Dependent variable: probability of financial distress | Z-score | Altman Z-score | Altman (1968); Nguyen Phuc Canh and Vu Xuan Hung (2014); Hoang Thi Hong Van (2020); Rahman et al. (2021) |
| X1 | Liquidity | NWCTA | Net Working Capital/Total Assets | Edward I. Altman (1968); James A. Ohlson (1980); Ming Xu and Chu Zhang (2008); Ben Chin-Fook Yap, David Gun-Fie Yong and Wai-Ching Poon (2010); Nguyen Xuan Hung (2014); Trung Do Thanh, Binh Pham Vo Ninh, and Duc Vo Hong (2018); Hoang Thi Hong Van (2020) |
| X2 | Leverage | RETA | Retained Earnings/Total Assets | Edward I. Altman (1968); Nguyen Xuan Hung (2014); Trung Do Thanh, Binh Pham Vo Ninh, and Duc Vo Hong (2018); Hoang Thi Hong Van (2020) |
| X3 | Profitability | EBITTA | EBIT/Total Assets | Edward I. Altman (1968); Shashikanta Baisag and Dr. PramodKumar Patjoshi (2020); Binh Pham Vo Ninh, Trung Do Thanh and Duc Vo Hong (2017) |
| X4 | Market valuation | MVETD | Market Value of Equity/Book Value of Total Debt | Edward I. Altman (1968); Edward I. Altman, Małgorzata Iwanicz-Drozdowska, Erkki K. Laitinen, and Arto Suvas (2016); Binh Pham Vo Ninh, Trung Do Thanh and Duc Vo Hong (2017) |
| X5 | Turnover | NRTA | Net Revenues/Total Assets | Edward I. Altman (1968); Nguyen Xuan Hung (2014); Arif Darmawan and Joko Supriyanto (2018); Hoang Thi Hong Van (2020) |

Source: Synthesis by the author.

3.4. Data collection methods and descriptive statistics before preprocessing

We collected one sample of 627 listed companies, which is then divided into two parts: a training dataset (439 companies) and a forecast dataset (188 companies).
In our study, the training dataset includes the five factors used to assess a company's likelihood of bankruptcy through Z-scores for 439 manufacturing and wholesale companies listed on the three Vietnamese stock exchanges (HOSE, HNX, UPCoM), taken from audited consolidated financial reports in 2021. The training dataset of 439 listed companies in the manufacturing and wholesale industries (Appendix 1) includes the five independent variables: (Current Assets - Current Liabilities)/Total Assets; Retained Earnings/Total Assets; Earnings Before Interest and Taxes/Total Assets; Market Value of Equity/Book Value of Total Liabilities; and Net Revenue/Total Assets.

With the proposed model, the 439 observations are extracted from the 2021 financial statements of 439 manufacturing and wholesale companies on HOSE, UPCoM, and HNX. Because the financial statements of listed companies have been reviewed and publicly disclosed in the mass media according to the regulations of the State Securities Commission and the stock exchanges, the reliability of the data is high. From the collected data, the authors compute descriptive statistics for the variables, including the mean, standard deviation, minimum, and maximum; the results are presented in Table 3.3.

Table 3.3: Descriptive statistics of quantitative variables before preprocessing

| Variable | Obs | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|
| NWCTA | 439 | 0.1468 | 0.8232 | -9.8370 | 4.1191 |
| RETA | 439 | 0.0458 | 0.0871 | -0.5752 | 0.3105 |
| EBITTA | 439 | 0.1197 | 0.2201 | -0.3732 | 3.7180 |
| MVETD | 439 | 2.7143 | 6.7835 | -0.9192 | 75.9577 |
| NRTA | 439 | 1.3401 | 1.1982 | 0.0011 | 10.7641 |

(Source: Authors summarize the results on STATA 14 software.)

To measure liquidity, the authors use the variable "Net Working Capital/Total Assets" (NWCTA), which has an average value of 0.147. The highest value of NWCTA in the study is 4.1191 and the lowest is -9.8370. This shows that some businesses tend to invest in current assets at a low level, or carry high current liabilities. For instance, in 2021 Mien Trung Petroleum Construction Joint Stock Company (PXM) had current liabilities roughly 10 times larger than current assets, making it difficult to pay short-term debts on time. This led to the effective discontinuance of the company for over 10 years while awaiting implementation of the restructuring policy of its holding company.

Retained Earnings/Total Assets (RETA) represents average cumulative profitability and has an average value of 0.0459. The highest value of RETA in the study is 0.3105 and the lowest is -0.5752. There are two interpretations for this variable. First, it implicitly shows that young businesses are likely to have a lower ratio than older firms, owing to the shorter time available to build up cumulative profits: Quốc Tế Holding JSC (with around 10 years of operation) experienced a sharp revenue drop and did business below cost, resulting in a gross loss of nearly 4 billion dong in the third quarter of 2021 due to many disputes arising from its main real estate business. Second, the lower the ratio, the less effective the management of operating activities; for example, SaiGon Education JSC (founded in 1950) is currently experiencing negative equity of nearly 31 billion dong after the company unilaterally terminated labor contracts with more than 550 employees in 2017.
To measure the true earning power of the firm's assets, the authors use the variable Earnings Before Interest and Taxes/Total Assets (EBITTA), which has an average value of 0.1197 with a standard deviation of only 22%, indicating that for every dollar of assets a manufacturing or wholesale company invests, it commonly returns 11.97 cents in EBIT per year. These companies do not fully utilize their economic resources but try to keep them at a moderate level to generate profits. Safoco Foodstuff Joint Stock Company (SAF), however, is one food-manufacturing company with truly productive and efficient management in utilizing its total assets, resulting in a cumulative nine-month profit of nearly 40 billion dong in 2022.

Market Value of Equity/Book Value of Total Debt (MVETD) represents capital structure and has an average value of 2.714. The highest value is 75.9577 and the lowest is -0.9192, indicating that companies commonly use preferred and common stock more than current and long-term debt to raise capital. In addition, the standard deviation of this ratio is 6.784, indicating significant volatility relative to the mean. This can be explained by some companies, such as Sara Vietnam JSC, having equity 76 times larger than total debt. The other reason this ratio is so extreme is explained specifically in part 4.2 of this study.

To measure turnover, the authors use Sales/Total Assets (NRTA), which has an average value of 1.34, indicating that for every dollar a manufacturing or wholesale company invests in total assets, it commonly returns 1.34 dollars in net revenue per year. In our sample, companies in the safe zone have high and wide-ranging values of this ratio due to effective management in dealing with competitive conditions.

Figure 3.2: Statistical results of the training dataset before preprocessing
Source: Results from Orange program

Figure 3.2 above offers statistical results for the training dataset before preprocessing. In our sample, manufacturing and wholesale enterprises that are not likely to face financial distress in 2022 and 2023 account for the largest share, specifically:

• 207 companies in the safe zone. Of the three groups, these companies have not only the highest positive values but also the most volatility in four variables: NWCTA, EBITTA, MVETD, and NRTA. These values show significant positive skewness.
• 105 companies in the distress zone. The companies in this zone have the least volatility in the five variables, and these values are not normally distributed.
• 127 companies in the grey zone. Contrary to the safe zone, these companies have the most negative values and high volatility in two variables: NWCTA and RETA. These values do not follow a normal distribution.

Overall, each variable in our training dataset has considerable extreme outliers and follows a non-Gaussian distribution, meaning that the subsequent prediction of financial distress might not reach a high level of confidence. Therefore, our group preprocesses the training dataset by imputing missing data values and eliminating extreme outliers before training and forecasting.

CHAPTER 4: RESULTS

4.1. Results of preprocessing data

In this part, our team gives some practical and specific explanations of why we preprocess the training dataset. Then, we preprocess the training set by imputing missing values and eliminating extreme outliers.
First, the team opened the training-set Excel file in the Orange program, then observed, visualized, and preprocessed the data, as below:

Figure 4.1: Process of preprocessing data on Orange
Source: Group's results run from Orange software

There is no missing data in our sample. The following figure 4.2 shows the first 20 listed companies in the manufacturing and wholesale sectors in the training dataset:

Figure 4.2: Training dataset of 20 listed companies before processing
Source: Group's results run from Orange software

However, given the distributions of our sample in the histograms, each variable contains some outliers that lie at abnormal distances from other values, as shown in figure 4.3. This greatly affects the prediction of a company's financial distress. There are two main practical reasons why the prediction goes wrong if outliers are not removed: (1) these companies are subsidiaries that merely maintain business activities while implementing the restructuring policy of the parent company (dissolution, merger, bankruptcy, etc.); (2) the companies have switched to business lines other than the two main industries that the group wants to study (manufacturing and wholesale). For example, Mien Trung Petroleum Construction Joint Stock Company (PXM) has accumulated losses for 10 consecutive years and negative equity for 9 consecutive years. Although the company has a very high probability of bankruptcy, it is only limited to trading on UPCoM; from 2012 to 2021 the business operated almost in maintenance mode, trying to retain its personnel apparatus while awaiting implementation of the restructuring policy (bankruptcy, dissolution, or merger) of the Vietnam Oil and Gas Construction Joint Stock Corporation.

Figure 4.3: Distributions of our sample in five variables before preprocessing
Source: Group's results run from Orange software

Therefore, the team identified and removed outliers using a one-class SVM with a non-linear (RBF) kernel in the Orange program, for two reasons: (1) the training dataset is not high-dimensional (5 features versus 439 observations), and (2) all five variables have a non-Gaussian distribution (Figure 4.3).

Figure 4.4: Illustration of removing outliers
Source: Group's results run from Orange software

After processing the missing data values and removing the outliers, the team obtained a dataset of 415 observations (24 observations removed) that can be used to train the model and give better prediction results, as shown in figure 4.5. Finally, the team saved the processed training data.

Figure 4.5: Training data of 20 listed companies after preprocessing
Source: Results from Orange program
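Outside the widget interface, the same outlier-removal step can be approximated with scikit-learn's one-class SVM. The sketch below is an illustration under stated assumptions: the file name is a hypothetical placeholder, and nu = 0.05 is an illustrative bound on the outlier fraction (close to the 24/439 actually removed), not necessarily the exact Orange setting.

```python
import pandas as pd
from sklearn.svm import OneClassSVM

cols = ["NWCTA", "RETA", "EBITTA", "MVETD", "NRTA"]
df = pd.read_excel("training.xlsx")  # hypothetical file name

# RBF-kernel one-class SVM; nu roughly bounds the fraction flagged as outliers.
detector = OneClassSVM(kernel="rbf", nu=0.05)
labels = detector.fit_predict(df[cols])  # -1 = outlier, +1 = inlier

clean = df[labels == 1]  # keep inliers only
print(f"removed {(labels == -1).sum()} of {len(df)} observations")
```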
4.2. Descriptive statistics after processing the training dataset

After eliminating 24 extreme outliers, our training dataset includes 415 observations from manufacturing and wholesale companies listed in 2021 on HOSE, UPCoM, and HNX. From the preprocessed data, the authors compute descriptive statistics for the variables; the results are presented in Table 4.6:

Table 4.6: Descriptive statistics of quantitative variables after preprocessing

| Variable | Obs | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|
| NWCTA | 439 | 0.1468 | 0.8232 | -9.8370 | 4.1191 |
| RETA | 439 | 0.0458 | 0.0871 | -0.5752 | 0.3105 |
| EBITTA | 439 | 0.1197 | 0.2201 | -0.3732 | 3.7180 |
| MVETD | 439 | 2.7143 | 6.7835 | -0.9192 | 75.9577 |
| NRTA | 439 | 1.3401 | 1.1982 | 0.0011 | 10.7641 |

(Source: Authors summarize the results on STATA 14 software.)

It is clear from comparing the above table with Table 3.3 that the standard deviations of all five variables decreased, confirming that we eliminated the extreme outliers. Additionally, the mean values of EBITTA, MVETD, and NRTA decreased slightly, while there was an insignificant climb in the mean values of NWCTA and RETA. The mean of MVETD fell the most, by about 0.600, and its standard deviation fell to 3.129; this variable experienced the most significant decrease in both mean and standard deviation compared to the other variables. There was a small decrease of around 0.010 in the mean values of EBITTA and NRTA, while their standard deviations decreased to 0.098 and 0.904, respectively. The mean values of NWCTA and RETA increased slightly, to 0.216 and 0.050 respectively, while both variables saw a slight drop in their standard deviations.

To illustrate the distributions of our processed data in more detail (figure 4.6), each variable now contains fewer outliers lying at abnormal distances from other values. Although the five variables still do not follow a proper normal distribution, this dataset can partly eliminate statistical errors and violations of assumptions and make the models consistent, hence improving the accuracy of the prediction of companies' financial distress.

Figure 4.6: Distributions of our sample in five variables after preprocessing
Source: Group's results run from Orange software

Another technique that we used to evaluate and choose the final variable profile is to determine the interaction between the variables in the function. Simple observation of descriptive statistics and of discriminant coefficients from past empirical studies is not enough, and can be misleading, since the actual variable measurement units are not all equivalent. From table 4.8 below, the authors found that the correlation coefficients are all less than 0.7000 and differ from 0 (the highest correlation coefficient is 0.6506, and the lowest is -0.2384). Therefore, multicollinearity is not a problem for any pair of independent variables, and the study can use the model with all five independent variables to learn and choose the most suitable classification method, and hence forecast.

Table 4.8: Correlations of quantitative variables after preprocessing
(Source: Authors summarize the results on STATA14 software.)

According to Cochran (1977), the majority of correlations between variables in previous studies were positive, and negative correlations are more beneficial than positive ones in adding new information to the function. Interestingly, table 4.8 shows that NRTA has its most negative correlation with MVETD. This means that the lower MVETD is (due to cumulative operating losses), the higher NRTA a company may have. One possible explanation is that when MVETD is excessive, the firm can face poor credit conditions and hence has to issue more shares to gain sufficient capital to generate sales. Therefore, including NRTA in this model is appropriate because of its correlations with the other variables, even though this variable has proven insignificant in many past studies.

Figure 4.7 below offers statistical results for the training dataset after preprocessing. In our sample, manufacturing and wholesale enterprises that are unlikely to face financial distress in 2022 and 2023 still account for the largest share, specifically: 194 companies in the safe zone (13 observations eliminated), 95 companies in the distress zone (10 observations eliminated), and 126 companies in the grey zone (1 observation eliminated). The more positive the values of all five variables, the lower the likelihood of financial distress that manufacturing and wholesale companies may experience.

Figure 4.7: Statistical results of the training dataset after preprocessing
Source: Group's results run from Orange program
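The multicollinearity check behind Table 4.8 can be sketched in pandas as follows; the file name is a hypothetical placeholder for the preprocessed training data.

```python
import numpy as np
import pandas as pd

cols = ["NWCTA", "RETA", "EBITTA", "MVETD", "NRTA"]
df = pd.read_excel("training_preprocessed.xlsx")  # hypothetical file name

corr = df[cols].corr()
print(corr.round(4))

# Check the 0.7 rule of thumb used above, ignoring the diagonal of 1s.
off_diagonal = corr.abs().values[~np.eye(len(cols), dtype=bool)]
print("highest off-diagonal |correlation|:", off_diagonal.max().round(4))
```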
4.3. Results of choosing and evaluating the most suitable classification method

After data preprocessing, with the training dataset of 415 listed companies from the manufacturing and wholesale industries on the three exchanges HOSE, HNX, and UPCoM, our group selected the most appropriate data classification method based on several evaluation results and the confusion matrix, as shown in figure 4.8:

Figure 4.8: Procedure for selecting and evaluating data classification methods
Source: Results from Orange program

First, we used the Orange software to input the training dataset. After inputting it, we declared the role of each variable in the training dataset as follows:

• The independent variables NWCTA, RETA, EBITTA, MVETD, and NRTA are declared as "feature".
• The dependent variable Results is declared as "target". Results takes three values: Safe Zone, Distress Zone, and Grey Zone.
• The variable Code does not participate in the training process and is categorical, so it is declared as "meta".
• The variable No does not participate in the training process but is numeric, so it is set to "skip".

Figure 4.9: Describe the roles of the variables in the training dataset
Source: Results from Orange program

After declaring the properties of the variables as in the figure above, the team moved on to the Test and Score section to get an overview of the indicators and choose the most suitable model for the study. The team used the cross-validation evaluation method with the number of folds set to 5 (k = 5) to avoid duplication between the test sets: because the model is trained and evaluated on many different pieces of data, the reliability of the model's evaluation measures increases.

Figure 4.10: Result of the layered evaluation model by Cross Validation
Source: Results from Orange program

Of the 4 classification methods that the team chose to test (Tree, SVM, Neural Network, Logistic Regression), the Neural Network model is rated the highest on all 5 indexes: AUC, CA, F1, Precision, and Recall. In particular, this model correctly predicts 93.2% of instances across the classes, and the AUC value is 98.9%, showing that it is more effective than the rest. The team evaluated this model further through the confusion matrix, as shown in the figure below.

Figure 4.11: Neural Network's Confusion Matrix
Source: Results from Orange program

The figure shows that the neural network model predicts 97 companies to be in the distress zone (at high risk of bankruptcy), with 91.8% of these predictions matching the actual values. In addition, 122 companies are predicted to be in the grey zone (in danger of bankruptcy), of which 4 companies are misclassified. Finally, the model predicts 196 companies to be in the safe zone (not in danger of bankruptcy), where the accuracy reaches 95.9% and only 2 companies are misclassified.
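The Test and Score comparison in Figure 4.10 can be reproduced outside Orange with scikit-learn. The sketch below uses the same four classifier families and 5-fold cross-validation; the synthetic three-class data and default hyperparameters are assumptions for illustration, so the scores will not match Orange's output.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 415-row, 5-feature, 3-class training data.
X, y = make_classification(n_samples=415, n_features=5, n_informative=4,
                           n_classes=3, random_state=0)

models = {
    "Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(probability=True, random_state=0),
    "Neural Network": MLPClassifier(max_iter=2000, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc_ovr").mean()
    ca = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    print(f"{name:20s} AUC={auc:.3f} CA={ca:.3f}")
```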
Figure 4.12: ROC analysis
Source: Results from Orange program

Figure 4.12 shows that the AUC for the Neural Network's ROC curves is higher than that of the other classification methods' curves. The Neural Network model therefore does a better job of correctly classifying the positive classes in the dataset. From this, the team can conclude that the Neural Network model is effective and well suited to the training dataset, and is therefore suitable for predicting the bankruptcy of the remaining companies through the forecast dataset.

4.4. Results of forecasting data by using the Neural Network model

In this part, we forecast and evaluated the other 188 listed companies in the same industries using the Neural Network model. Figure 4.13 shows the forecast data of 20 listed companies:

Figure 4.13: Forecasting dataset of 20 listed companies
Source: Results from Orange program

Based on the above procedure of training and evaluating the dataset of 415 listed companies in the manufacturing and wholesale industries on the three stock exchanges HOSE, HNX, and UPCoM, we found that Neural Network is the most appropriate classification method. The team therefore took the same steps as with the training dataset to forecast, as in figure 4.14 below:

Figure 4.14: Neural Network forecasting process
Source: Results from Orange program

Just like with the training dataset, the team loaded the forecast dataset into the Orange program and set the properties of its variables, as in figure 4.15 below:
• The independent variables NWCTA, RETA, EBITTA, MVETD, NRTA, and Results are declared as "feature".
• The variable Code does not participate in the prediction process but is categorical data, so it is declared with the "meta" attribute.
• The variable No does not participate in the training process but is numeric data, so it is set to "skip".

Figure 4.15: Properties of the variables in the forecast dataset
Source: Results from Orange program

Then we used the Predictions widget to view the forecasts produced by the Neural Network model. Figure 4.16 shows the forecast results of the first 20 companies of the forecast dataset:

Figure 4.16: Forecast results using the Neural Network model
Source: Results from Orange program

The forecast results for the remaining 188 listed companies in the manufacturing and wholesale sectors show that:
• 72 listed companies are in the safe zone. These 72 companies will not go bankrupt, with the correct probability being 94% within 1 year (2022) and 74% within 2 years (2023).
• 59 companies are in the distress zone. These 59 companies are at risk of bankruptcy, with the correct probability being 94% within 1 year and 74% within 2 years.
• 57 companies are in the gray zone. These 57 companies are at high risk of bankruptcy, with a true probability of 94% within 1 year and 74% within 2 years.

Figure 4.17: Statistical forecasting results by the Neural Network model
Source: Results from Orange program
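The Predictions step can be mirrored in code as well. Below is a minimal sketch under the same assumptions as before (hypothetical training.csv and forecast.csv files with the column names used in this study): it fits a neural network on the 415 preprocessed training observations, then outputs the predicted zone and class probabilities for each of the 188 forecast companies and tallies the counts per zone.

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

FEATURES = ["NWCTA", "RETA", "EBITTA", "MVETD", "NRTA"]

train = pd.read_csv("training.csv")     # 415 preprocessed observations (hypothetical path)
forecast = pd.read_csv("forecast.csv")  # 188 companies to classify (hypothetical path)

# Neural network comparable to Orange's widget (which wraps MLPClassifier).
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(100,),
                                    max_iter=1000, random_state=0))
model.fit(train[FEATURES], train["Results"])

# Predicted zone and class probabilities for every forecast company.
forecast["Predicted"] = model.predict(forecast[FEATURES])
proba = model.predict_proba(forecast[FEATURES])
for zone, col in zip(model.classes_, proba.T):
    forecast[f"P({zone})"] = col.round(3)

# Tally companies per predicted zone (the report finds 72 / 59 / 57).
print(forecast["Predicted"].value_counts())
```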
CHAPTER 5: DISCUSSIONS, LIMITATIONS, AND RECOMMENDATIONS

5.1. Discussions

The topic "Application of the neural network model in forecasting financial distress of listed manufacturing and wholesale companies in 2022, and 2023 by using Orange program" has basically completed the research objectives set out, through two aspects:
• Theoretically, the study presented the general theoretical basis of data mining techniques and data classification methods (specifically the Neural Network model).
• Experimentally, the study combined the use of technology (Orange, STATA, and Excel) in the financial sector, applying the Altman (1968) model to predict the probability of bankruptcy in 2022 and 2023 of manufacturing and wholesale enterprises listed on the three stock exchanges in Vietnam, through 5 independent variables (NWCTA, RETA, EBITTA, MVETD, NRTA) and 1 dependent variable, Results (with 3 values: Safe zone, Distress zone, and Gray zone).

In the study, the team used 439 companies in the training dataset and 188 companies in the forecast dataset. After the group preprocessed the data (removing outliers), the training dataset was left with 415 observations. The authors draw the conclusions below.

Firstly, regarding the question "How do internal factors affect the criteria for assessing the possibility of bankruptcy of the company (Z-score) presented through descriptive statistics?", our group found that the more positive the values of all five variables, the lower the likelihood of financial distress that manufacturing and wholesale companies may experience.

Secondly, regarding the question "Given the training dataset, which suitable model provided by Orange software should be used to predict financial distress with a high level of confidence?", our group chose Neural Network as the most effective and suitable classification method for this study, based on the training dataset of 415 observations (after removing 24 obs during data preprocessing). Among the 4 classification methods the team chose to test and evaluate (Tree, SVM, Neural Network, Logistic Regression), the Neural Network model is rated highest on all 5 indexes: AUC, CA, F1, Precision, and Recall. Moreover, this model also has the highest proportion of correctly predicted values in the confusion matrix.

Thirdly, regarding the question "Using the selected model, what is the likelihood of financial distress of the companies in 2022 and 2023?", we used the Neural Network model on the forecast data (the remaining 188 observations in the two industries: manufacturing and wholesale). We found that 72 listed companies will not go bankrupt, 59 companies are at risk of bankruptcy, and 57 companies are at high risk of bankruptcy, with a true probability of 94% within 1 year (2022) and 74% within 2 years (2023).

5.2. Recommendations

• For domestic companies:

Domestic companies can increase their Z-score by reducing debt in their capital structure. A firm should monitor financial ratios such as cash flow to total debt, net income to total assets, and total debt to total assets. To ensure proper asset utilization and lower the danger of financial crisis, the company should keep its debt at an optimal level; a worked example of how debt reduction moves the Z-score is sketched below.
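As a concrete illustration, the sketch below applies Altman's (1968) coefficients, in their common decimal-ratio form, and the standard zone cut-offs (distress below 1.81, gray between 1.81 and 2.99, safe above 2.99) to the five ratios used in this study. The sample figures for the firm are hypothetical.

```python
def altman_z(nwcta, reta, ebitta, mvetd, nrta):
    """Altman (1968) Z-score from the five ratios used in this study."""
    return 1.2 * nwcta + 1.4 * reta + 3.3 * ebitta + 0.6 * mvetd + 1.0 * nrta

def zone(z):
    # Standard Altman cut-offs for listed manufacturing firms.
    if z < 1.81:
        return "Distress zone"
    if z <= 2.99:
        return "Gray zone"
    return "Safe zone"

# Hypothetical firm: cutting debt raises MVETD (market equity / book
# liabilities) and lifts Z out of the gray zone.
before = altman_z(0.20, 0.05, 0.08, 1.50, 0.90)   # Z ~ 2.37 -> Gray zone
after  = altman_z(0.20, 0.05, 0.08, 2.60, 0.90)   # Z ~ 3.03 -> Safe zone
print(f"Before: Z = {before:.2f} ({zone(before)})")
print(f"After:  Z = {after:.2f} ({zone(after)})")
```

Because MVETD carries book liabilities in its denominator, reducing debt is the most direct lever on this component of the score, which is consistent with the recommendation above.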
When an industry slumps, highly indebted companies are more likely to experience financial difficulties. However, such businesses can continue to thrive even when their industry is in trouble by applying effective management techniques; achieving efficiency while downsizing is one of the primary tactics for recovering from financial hardship. To identify early warning signs of crises and take preventive action against impending danger, a firm should pay close attention to both internal (financial and nonfinancial) and external (macroeconomic) causes of financial difficulty, especially macroeconomic policies, since the government's macroeconomic policies directly affect how the firm does business. Because businesses must adhere to the law, the board of directors and finance supervisors need to understand how new legislation and government policies may impact their performance, and must quickly establish internal control and risk management systems.

• For the government:

The financial distress of companies is significantly influenced by the overall financial health of the economy. Structural reforms, improvements to the business climate, reduced uncertainty, and measures to address the deterioration of bank asset quality through an enhanced legal and institutional insolvency framework all help improve the overall financial health of the economy. When developing its policies, the government must take the business community into consideration. In other words, government policies must be pro-business in order to support and strengthen firm growth rather than slow it down. The government should also provide infrastructure improvements and a favorable environment for businesses to flourish.

5.3. Limitations

Even though the study was conducted with scientific rigor in mind, some limitations remain. In 2021, social distancing measures affected the circulation of goods, disrupting supply chains and the production and business activities of the manufacturing and wholesale industries domestically and internationally; this significantly affects the financial indicators in the study. The data were collected while the COVID-19 epidemic was ongoing and only beginning to show signs of being under control, which also influences predictions about the future of businesses after the pandemic.

Another limitation is that the companies under investigation were all publicly held manufacturing organizations for which extensive financial data, including market price quotations, were accessible. Extending the investigation to businesses with smaller assets and to unincorporated companies, where the frequency of failure is higher than among larger corporations, would therefore be a subject for future research.

In this paper, the team used only the Altman Z-score ratios as the measure of companies' financial distress, without referencing other models with similar functions. The degree of accuracy and reliability of the Z-score obtained in tests carried out in Vietnam on listed industrial businesses (banks, insurance, and financial enterprises were excluded) cannot be applied to other countries, even to nations with similar environmental characteristics. It would therefore be wise to proceed with a preliminary validation of the model for companies quoted on different marketplaces.

5.4. Directions

Based on the limitations that prevent the research from being fully complete, the team would like to offer the following directions for further research on the topic:
● To guarantee that the collected data are not diluted and the results are clearer, the research subject should be a specialized industry.
● To increase the accuracy of the results, the model must be tested against reality and the results consistently evaluated. We recommend that model variables be added and appropriately adjusted to each macro period, in order to address the limitation the authors recognize in this report.
● To support effective decisions by investors, the researchers propose investigating models other than the Altman model. Future studies will explore more beneficial models and focused research techniques for forecasting enterprise development.

REFERENCES

GSO (2020). Annual report of the General Statistics Office of Vietnam in 2020.
GSO (2021). Annual report of the General Statistics Office of Vietnam in 2021.
Ross, S. A., Westerfield, R. W., & Jaffe, J. (2012). Corporate Finance, 10th edition.
The Law on Bankruptcy 2014 (Vietnam); Decree No. 58/2012/ND-CP of the Government.
Adebayo, A. O., & Chaubey, M. S. (2019). Data mining classification techniques on the analysis of student's performance. GSJ, 7(4), 45-52.
Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. The Journal of Finance, 23(4), 589-609.
Altman, E., & Hotchkiss, E. (2006). Corporate financial distress and bankruptcy. NJ: John Wiley & Sons.
Baimwera, B., & Muriuki, A. M. (2014). Analysis of corporate financial distress determinants: A survey of non-financial firms listed in the NSE. International Journal of Current Business and Social Sciences, 1(2), 58-80.
Baisag, S., & Patjoshi, P. (2020). Corporate financial distress prediction – a review paper. PalArch's Journal of Archaeology of Egypt/Egyptology, 17(9), 2109-2118.
Breiman, L., & Ihaka, R. (1984). Nonlinear discriminant analysis via scaling and ACE. Davis, CA: Department of Statistics, University of California.
Calderon, T. G., Cheh, J. J., & Kim, I. W. (2003). How large corporations use data mining to create value. Management Accounting Quarterly, 4(2).
Chancharat, N. (2008). An empirical analysis of financially distressed Australian companies: the application of survival analysis.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (Methodological), 20(2), 215-232.
Devji, S., & Suprabha, K. R. (2016). Corporate financial distress and stock return: Evidence from Indian stock market. Nitte Management Review, 10(1), 34-44.
Dhar, S., Mukherjee, T., & Ghoshal, A. K. (2010, December). Performance evaluation of neural network approach in financial prediction: Evidence from Indian market. In 2010 International Conference on Communication and Computational Intelligence (INCOCCI) (pp. 597-602). IEEE.
Elloumi, F., & Gueyié, J. P. (2001). Financial distress and corporate governance: an empirical analysis. Corporate Governance: The International Journal of Business in Society.
Erdem, Z., Polikar, R., Gurgen, F., & Yumusak, N. (2005, June). Ensemble of SVMs for incremental learning. In International Workshop on Multiple Classifier Systems (pp. 246-256). Springer, Berlin, Heidelberg.
Fan, A., & Palaniswami, M. (2000, July). Selecting bankruptcy predictors using a support vector machine approach. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000) (Vol. 6, pp. 354-359). IEEE.
Frydman, H., Altman, E. I., & Kao, D. L. (1985). Introducing recursive partitioning for financial classification: the case of financial distress. The Journal of Finance, 40(1), 269-291.
Fun, M. H., & Hagan, M. T. (1996, September). Modular neural networks for friction modeling and compensation. In Proceedings of the 1996 IEEE International Conference on Control Applications (pp. 814-819). IEEE.
Hải, T. V. (2017). Nhận diện gian lận báo cáo tài chính của các công ty niêm yết trên thị trường chứng khoán Việt Nam – bằng chứng thực nghiệm tại sàn giao dịch chứng khoán HOSE [Identifying financial statement fraud among companies listed on the Vietnamese stock market – empirical evidence from the HOSE exchange].
Halili, F., & Rustemi, A. (2016). Predictive modeling: data mining regression technique applied in a prototype. International Journal of Computer Science and Mobile Computing, 5(8), 207-215.
Idrees, S., & Qayyum, A. (2018). The impact of financial distress risk on equity returns: A case study of non-financial firms of Pakistan Stock Exchange. Journal of Economics Bibliography, 5(2), 49-59.
Ikpesu, F., Vincent, O., & Dakare, O. (2020). Financial distress overview, determinants, and sustainable remedial measures: Financial distress. In Corporate Governance Models and Applications in Developing Economies (pp. 102-113). IGI Global.
Iyer, A., Jeyalatha, S., & Sumbaly, R. (2015). Diagnosis of diabetes using classification mining techniques. arXiv preprint arXiv:1502.03774.
Jain, A. K., Mao, J., & Mohiuddin, K. M. (1996). Artificial neural networks: A tutorial. Computer, 29(3), 31-44.
Jiming, L., & Weiwei, D. (2011). An empirical study on the corporate financial distress prediction based on logistic model: Evidence from China's manufacturing industry. International Journal of Digital Content Technology and its Applications, 5(6), 368-379.
Kantardzic, M. (2011). Data mining: concepts, models, methods, and algorithms. John Wiley & Sons.
Kirkos, E., & Manolopoulos, Y. (2004). Data mining in finance and accounting: a review of current research trends. In Proceedings of the 1st International Conference on Enterprise Systems and Accounting (ICESAcc) (pp. 63-78).
Kristanti, F. T., Rahayu, S., & Huda, A. N. (2016). The determinant of financial distress on Indonesian family firm. Procedia – Social and Behavioral Sciences, 219, 440-447.
Kumar, P. R., & Ravi, V. (2007). Bankruptcy prediction in banks and firms via statistical and intelligent techniques – a review. European Journal of Operational Research, 180(1), 1-28.
Larose, D. T. (2005). An introduction to data mining (translation and adaptation by Thierry Vallaud).
Lee, M. C., & Su, L. E. (2015). Comparison of wavelet network and logistic regression in predicting enterprise financial distress. International Journal of Computer Science & Information Technology, 7(3), 83-96.
Lee, M. C., & To, C. (2010). Comparison of support vector machine and back propagation neural network in evaluating the enterprise financial distress. arXiv preprint arXiv:1007.5133.
Lê, C. H. A., & Nguyễn, T. H. (2012). Kiểm định mô hình chỉ số Z của Altman trong dự báo thất bại doanh nghiệp tại Việt Nam [Testing Altman's Z-score model in predicting corporate failure in Vietnam].
Lin, F. Y., & McClean, S. (2001). A data mining approach to the prediction of corporate failure. Knowledge-Based Systems, 14(3-4), 189-195.
Lippmann, R. P. (1989). Pattern classification using neural networks. IEEE Communications Magazine, 27(11), 47-50.
McLachlan, G. J. (2004). Discriminant analysis and statistical pattern recognition. John Wiley & Sons.
Neha, K., & Reddy, M. Y. (2020). A study on applications of data mining. International Journal of Scientific & Technology Research, 9(02).
Nguyễn, T. M. L. (2014). Khai phá dữ liệu trên nền ORACLE và ứng dụng [Data mining on the ORACLE platform and its applications] (Doctoral dissertation, Đại học Quốc gia Hà Nội).
Ninh, B. P. V., Do Thanh, T., & Hong, D. V. (2018). Financial distress and bankruptcy prediction: An appropriate model for listed firms in Vietnam. Economic Systems, 42(4), 616-624.
Ohlson, J. A. (1980). Financial ratios and the probabilistic prediction of bankruptcy. Journal of Accounting Research, 109-131.
Papakyriakou, D., & Barbounakis, I. S. Data mining methods: A review. International Journal of Computer Applications.
Parker, J. A. (2011). On measuring the effects of fiscal policy in recessions. Journal of Economic Literature, 49(3), 703-718.
Provost, F. J., Fawcett, T., & Kohavi, R. (1998, July). The case against accuracy estimation for comparing induction algorithms. In ICML (Vol. 98, pp. 445-453).
Quinlan, J. R. (1993, June). Combining instance-based and model-based learning. In Proceedings of the Tenth International Conference on Machine Learning (pp. 236-243).
Rani, K. U. (2011). Analysis of heart diseases dataset using neural network approach. arXiv preprint arXiv:1110.2626.
Shiri, M. M., & Ahangary, M. (2012). Corporate bankruptcy prediction using data mining techniques: Evidence from Iran. African Journal of Scientific Research, 8(1).
Sossi Alaoui, S., Farhaoui, Y., & Aksasse, B. (2017, April). A comparative study of the four well-known classification algorithms in data mining. In International Conference on Advanced Information Technology, Services and Systems (pp. 362-373). Springer, Cham.
Steinberg, D., & Colla, P. (1995). CART: tree-structured non-parametric data analysis. San Diego, CA: Salford Systems.
Supriyanto, J., & Darmawan, A. (2018). The effect of financial ratio on financial distress in predicting bankruptcy. Journal of Applied Managerial Accounting, 2(1), 110-120.
Thảo, H. T. T. (2016). Ứng dụng Data Mining dự báo kiệt quệ tài chính ở các Công ty Dược niêm yết tại Việt Nam [Applying data mining to forecast financial distress in listed pharmaceutical companies in Vietnam].
Thim, C. K., Choong, Y. V., & Nee, C. S. (2011). Factors affecting financial distress: The case of Malaysian public listed firms. Corporate Ownership and Control, 8(4), 345-351.
Tinoco, M. H., & Wilson, N. (2013). Financial distress and bankruptcy prediction among listed companies using accounting, market and macroeconomic variables. International Review of Financial Analysis, 30, 394-419.
Trang, H. C., & Nhị, V. V. (2020). Ảnh hưởng của thành viên nữ trong hội đồng quản trị đến hiệu quả hoạt động của các công ty niêm yết [The impact of female board members on the performance of listed companies]. Tạp chí Phát triển Kinh tế, 61-75.
Turetsky, H. F., & McEwen, R. A. (2001). An empirical investigation of firm longevity: A model of the ex ante predictors of financial distress. Review of Quantitative Finance and Accounting, 16(4), 323-343.
Vân, H. T. H. Vận dụng mô hình Z-score trong dự báo khả năng phá sản doanh nghiệp tại Việt Nam [Applying the Z-score model to forecast corporate bankruptcy risk in Vietnam].
Wesa, E. W., & Otinga, H. N. (2018). Determinants of financial distress among listed firms at the Nairobi Securities Exchange, Kenya. Strategic Journal of Business and Change Management, 9492, 1056-1073.
Wilson, R. L., & Sharda, R. (1994). Bankruptcy prediction using neural networks. Decision Support Systems, 11(5), 545-557.
Wruck, K. H. (1990). Financial distress, reorganization, and organizational efficiency. Journal of Financial Economics, 27(2), 419-444.
Yap, B. C. F., Yong, D. G. F., & Poon, W. C. (2010). How well do financial ratios and multiple discriminant analysis predict company failures in Malaysia. International Research Journal of Finance and Economics, 54(13), 166-175.
Yohannes, Y., & Webb, P. (1998). Classification and regression trees, CART: a user manual for identifying indicators of vulnerability to famine and chronic food insecurity (Vol. 3). International Food Policy Research Institute.
Zaki, M. J., & Wong, L. (2003). Data mining techniques. WSPC Lecture Notes Series.

APPENDIX 1: TRAINING SET BEFORE PROCESSING DATA
APPENDIX 2: TRAINING SET AFTER PROCESSING DATA
APPENDIX 3: TEST DATA BEFORE FORECASTING
APPENDIX 4: TEST DATA AFTER FORECASTING BY USING THE NEURAL NETWORK MODEL