HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SCHOOL OF INDUSTRIAL MANAGEMENT

THESIS
APPLICATION OF SEMMA PROCESS IN FORECASTING APARTMENT PRICE IN HO CHI MINH CITY

Name: Trịnh Trần Nguyên Chương
Student ID: 1952195
Supervisor in university: Phạm Quốc Trung
Order number: 37-CLC
Ho Chi Minh City – 2022

Vietnam National University HCMC
UNIVERSITY OF TECHNOLOGY
SCHOOL OF INDUSTRIAL MANAGEMENT
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom – Happiness

THESIS ASSIGNMENT
DEPARTMENT: Information System
STUDENT NAME: Trịnh Trần Nguyên Chương
STUDENT ID: 1952195
SPECIALIZATION: Business Administration
CLASS: CC19QKD1
1. Title: Application of SEMMA Process in Forecasting Apartment Price in Ho Chi Minh City
2. Thesis assignment (requirements for content and data): Identify factors affecting apartment prices, predict apartment prices, and classify apartments based on these factors. The data is collected from the real estate exchange and the main constructors.
3. Date of assignment: 25/2/2022
4. Date of completion: 15/5/2022
5. Supervisor's full name: Assoc. Prof. Dr. Pham Quoc Trung, advised on: Proposal and Thesis
The proposal is approved by the School/Department.
Ho Chi Minh City, 16 May 2022
HEAD OF DEPARTMENT (Sign and write full name)
PRIMARY SUPERVISOR (Sign and write full name)

Acknowledgements
I would like to express my heartfelt and sincere appreciation to Assoc. Prof. Dr. Phạm Quốc Trung. I will be eternally thankful for your direction, wisdom, passion, and encouragement in helping me study and carry out this project. I also want to send my sincere thanks to all of the faculty lecturers, without whom I would not have had the essential knowledge to complete this research. Armed with this university's invaluable experience, I can apply it to my future career. Despite my efforts, I am conscious that this project is still incomplete and will inevitably contain flaws. It is my pleasure to hear from professors on how I can improve even more. Finally, I wish you health, wealth, and success in your endeavors.
Trinh Tran Nguyen Chuong

Abstract
In recent years, facing an increasing population in Ho Chi Minh City, the demand for housing has increased. Among the available types of houses, the apartment remains a reasonable option that low-income people can afford. Driven by the relationship between supply and demand and other factors, apartment prices have also increased sharply and show no signs of decreasing. This research aims to identify the influencing factors, test and compare the performance of predictive models, and classify apartments based on those influencing factors. The variables include internal factors of apartment buildings and external factors based on areas in Ho Chi Minh City. In addition, the predictive and classification models will help real estate investors and prospective homebuyers predict and categorize housing prices, thereby reducing losses for property developers and giving bargaining power to prospective homebuyers. A comparative analysis is performed on data mining techniques: predictive models such as Linear Regression, the XGBoost algorithm, and the K-Nearest Neighbors algorithm are evaluated based on the residual score, Root Mean Squared Error, and coefficient of determination, while a Random Forest classification model is used to classify apartments.
The results of the algorithms are visualized with metrics and animations to provide clear insights into how the models are performing for both potential homebuyers and real estate investors. Contents CHAPTER 1. INTRODUCTION .......................................................................................... 1 1.1 Background ................................................................................................................. 1 1.2 Problem statement..................................................................................................... 2 1.3 Research Gap ............................................................................................................. 2 1.3.1 In Vietnam ................................................................................................................ 2 1.3.2 In other countries ...................................................................................................... 3 1.4 Research objective ..................................................................................................... 3 1.5 Scope and subject of the study ................................................................................. 4 1.5.1 Research Scope ......................................................................................................... 4 1.5.2 Research Subject ....................................................................................................... 4 1.5.3 Research Method ...................................................................................................... 4 1.6 Significance of study.................................................................................................. 5 1.7 Research Structure .................................................................................................... 5 CHAPTER 2. LITERATURE REVIEW ............................................................................... 6 2.1 Definition .................................................................................................................... 6 2.1.1 Data mining process .................................................................................................. 6 2.1.2 SEMMA Process ....................................................................................................... 6 2.1.3 Theoretical foundations of apartments...................................................................... 8 2.1.3.1 The concept of the apartment ............................................................................. 8 2.1.3.2 Current classification of apartments and condominiums ................................... 9 2.1.3.3 Characteristics of the apartment market ............................................................ 9 2.1.3.4 Apartment price prediction ................................................................................ 9 2.2 Regression Model (formula and explanation) ...................................................... 10 2.2.1 Extreme Gradient Boosting (XGBoost) .................................................................. 10 2.2.2 Linear Regression ................................................................................................... 12 2.2.3 Ordinary least-squares (OLS) ................................................................................. 14 2.2.3.1 Definition ......................................................................................................... 
14 2.2.3.2 OLS results interpretation ................................................................................ 15 2.2.4 K-Nearest Neighbors (KNN) .................................................................................. 15 2.3 Random Forest classification ................................................................................. 17 2.3.1 Definition of classification ...................................................................................... 17 2.3.2 Definition of Random Forest classification ............................................................ 17 2.4 Model evaluation ..................................................................................................... 19 2.4.1 Residual and Predicted Values................................................................................ 19 2.4.2 RMSE...................................................................................................................... 20 2.4.3 Coefficient of determination (R-Squared) .............................................................. 21 2.5 Related work ............................................................................................................ 21 2.5.1 Foreign research ...................................................................................................... 21 2.5.2 Domestic research ................................................................................................... 22 2.5.3 Reviews of previous research papers ...................................................................... 22 2.6 Conclusion ................................................................................................................ 22 CHAPTER 3. METHODOLOGY ........................................................................................ 23 3.1 Methodology Research Process .............................................................................. 23 3.2 Materials Introducing ............................................................................................. 24 3.3 Data Sampling ......................................................................................................... 25 3.4 Data Exploring......................................................................................................... 28 3.5 Data Modifying ........................................................................................................ 30 3.5.1 Outliers.................................................................................................................... 30 3.5.1.1 Log transformation........................................................................................... 30 3.5.1.2 Skewness and Kurtosis .................................................................................... 33 3.6 Data preprocessing .................................................................................................. 35 3.6.1 Preprocessing process ............................................................................................. 35 3.6.2 Correlationship ........................................................................................................ 37 3.7 Experimental Design ............................................................................................... 39 3.7.1 Data Modeling: ....................................................................................................... 
39 3.7.2 Data Accessing: ...................................................................................................... 39 CHAPTER 4: RESULT ANALYSIS AND DISCUSSION ................................................ 41 4.1 Result Analysis......................................................................................................... 41 4.1.1 Ordinary Least Square method evaluation .............................................................. 41 4.1.1.1 Regression results according to Ordinary Least Squares method (OLS) ........ 41 4.1.1.2 Correlation test ................................................................................................. 42 4.1.2 Predictive Models Evaluation ................................................................................. 42 4.1.2.1 Linear Regression ............................................................................................ 42 4.1.2.2 XGBoost .......................................................................................................... 43 4.1.2.3 KNN ................................................................................................................. 45 4.1.3 Random Forest classification .................................................................................. 46 4.2 Dicussion .................................................................................................................. 47 4.2.1 Factors affecting apartment prices in Ho Chi Minh city......................................... 47 4.2.2 Prediction model ..................................................................................................... 48 4.2.2.1 Residual............................................................................................................ 48 4.2.2.2 Root mean squared error .................................................................................. 49 4.2.2.3 The R2 score .................................................................................................... 49 4.2.2.4 Result summary ............................................................................................... 50 4.2.3 Classification model................................................................................................ 51 CHAPTER 5: CONCLUSION.............................................................................................. 57 5.1 Conclusion ................................................................................................................ 57 5.2 Research Meaning ................................................................................................... 58 5.3 Limitation................................................................................................................. 59 5.4 Future Research ...................................................................................................... 59 REFERENCE ......................................................................................................................... 61 Appendix ................................................................................................................................. 63 List of Figure Figure 1. Average commercial real estate price (millions/m^2) each year .............................. 1 Figure 2. SEMMA process ........................................................................................................ 7 Figure 3. 
XGBoost Model........................................................................................................ 11 Figure 4. Linear Regression Model ......................................................................................... 13 Figure 5. Example of OLS Regression model based on Weight and Height ........................... 15 Figure 6. Visual presentation of simulated working example ................................................. 16 Figure 7. Decision tree graph ................................................................................................... 18 Figure 8. Random Forest Classifier graph ............................................................................... 18 Figure 9. Residual Scatter plot ................................................................................................. 19 Figure 10. Methodology Research Process .............................................................................. 23 Figure 11. Interferce of Batdongsan.com.vn ........................................................................... 24 Figure 12. First step of extracting ............................................................................................ 26 Figure 13. Second step of extracting........................................................................................ 26 Figure 14. Boxplot of Price distribution .................................................................................. 31 Figure 15. Boxplot of price after log transformation ............................................................... 31 Figure 16. Probability Plot before log normalization .............................................................. 32 Figure 17. Probability Plot after log normalization ................................................................. 32 Figure 18. Skew distribution .................................................................................................... 33 Figure 19. Kurtosis distribution ............................................................................................... 34 Figure 20. Distribution of price before log normalization ....................................................... 34 Figure 21. Distribution of price after log normalization .......................................................... 35 Figure 22. Correlationship graph ............................................................................................. 37 Figure 23. Comparing LR's actual vs predicted values ........................................................... 42 Figure 24. LR's Residual graphs .............................................................................................. 43 Figure 25. Comparing XGBoost's actual vs predicted values ................................................. 43 Figure 26. XGBoost's Residual graphs .................................................................................... 44 Figure 27. Comparing KNN's actual vs predicted values ........................................................ 45 Figure 28. KNN's Residual graphs .......................................................................................... 46 Figure 29. Confusion Matrix.................................................................................................... 47 Figure 30. Residual score comparison ..................................................................................... 48 Figure 31. 
RMSE comparison.................................................................................................. 49 Figure 32. R-squared comparison ............................................................................................ 50 Figure 33. Decision tree (number 10) ........................................................................................ 1 Figure 34. 5 main factors affecting classification .................................................................... 56 List of Table Table 1. Description of all attributes which were collected..................................................... 27 Table 2. Total number of apartments each district .................................................................. 28 Table 3. Segment of all apartments.......................................................................................... 30 Table 4. Numeric transformation ............................................................................................. 36 Table 5. Location factors ......................................................................................................... 36 Table 6. Interpreting correlation .............................................................................................. 38 Table 7. OLS Regression Results ............................................................................................ 41 Table 8. Classification performance ........................................................................................ 47 Table 9. Evaluation score summary ......................................................................................... 51 Table 10. Classification report ................................................................................................. 55 OLS Ordinary Least Squares RMSE Root Mean Square Error KDD Knowledge discovery in databases CRISP-DM Cross Industry Standard Process for Data Mining HCMC Ho Chi Minh City R2 R-Squared KNN K-Nearest Neighbors LR Linear Regression RF Random Forest DW Durbin-Waston MAE Mean Absolute Error MAPE Mean Absolute Percentage Error MSE Mean Squared Error SD Standard Deviation RNN Recurrent Neural Network LSTM Long Short-Term Memory CHAPTER 1. INTRODUCTION 1.1 Background Vietnam is now a promising investment destination due to the development in which Ho Chi Minh City plays the role of the economic locomotive of the country and real estate becomes an effective investment and profitable field. In recent years, the housing price has risen gradually and there is no signal of negative growth which is caused by many reasons. The city is facing a big challenge with a large population due to reproduction or migration from other provinces, leading to high population density. 
According to Le Minh Phuong Mai (2021), from the University of Finance – Marketing, the population of Ho Chi Minh City reached nearly 13 million people in 2021, of which the immigrant population is nearly 3 million, accounting for about 23% of the population and including more than 400,000 students and more than 50,000 newly married couples per year. A survey by the Department of Construction found that about 500,000 households have no house and about 81,000 households are in need of social housing. Out of a total of more than 402,000 workers and laborers working in the 17 export processing zones, industrial parks, and high-tech zones of the city, 284,000 people (70.6%) need accommodation, but only about 39,400 of them have had this need met, roughly 15% of the demand (Le, 2021). The rapid population growth has put great pressure on housing, pushing housing prices up each year.

Figure 1. Average commercial real estate price (million VND/m2) each year

1.2 Problem statement
A common method that many countries use to predict house prices is to compare the average house price with the average income of the people, because real estate prices depend largely on the supply–demand relationship in the market. House prices are also closely influenced by many groups of factors and evaluation criteria, such as the location of the property, convenient transportation, complete infrastructure, the ability to generate profit or the amenities attached to the house and land, and favorable conditions for buyers in terms of administrative procedures on land use rights, house ownership, construction permits, etc. (Vuong, 2016). In determining house prices, investors must calculate carefully and choose an appropriate method, because real estate prices increase continuously and almost never decrease, in either the long term or the short term. However, non-statistical prediction methods, such as predictions based on the supply–demand relationship, cannot give high accuracy. Therefore, it is necessary to use statistical tools so that housing prices can be predicted with high accuracy and in less time, rather than the traditional way of collecting data and comparing records with each other to make a prediction. A machine learning model (here, a regression model) is essentially a quantitative method that can be used effectively to produce such a prediction. According to Qingqi Zhang (2021) from The Hong Kong University of Science and Technology, a comparative experiment has also revealed that multiple regression applied to property appraisal works well with the given data; besides, models based on multiple regression tend to attach more significance to statistical inference than to prediction due to their nature. Following the features above, the topic of this research, "APPLICATION OF SEMMA PROCESS IN FORECASTING APARTMENT PRICE IN HO CHI MINH CITY", is suitable.

1.3 Research Gap
1.3.1 In Vietnam
There are many research papers exploring aspects of housing prices and forecasting in Ho Chi Minh City, and their authors have an overall and detailed view of the housing situation in this city. However, methods of forecasting housing prices are still manual, such as collecting property information, comparing, and estimating prices based on available information.
The article "SITUATION OF USE OF COMPARATIVE AND COST METHOD IN REAL ESTATE VALUATION IN HO CHI MINH CITY" by NGUYEN TRUONG LUU (Nguyen, 2009) has shown that the comparative method is used for real estate price estimation: finding properties that have been bought and sold on the market and are similar to the appraised property, and then estimating an appropriate sale price based on comparable factors that reflect the differences between them, in order to find the true value of the property. The author has specifically analyzed how to apply this method and its benefits, advantages, and achievements. However, the study still has limitations, such as the comparative method being ineffective in the case of big data and in capturing the degree of influence of other factors on real estate prices.

1.3.2 In other countries
Housing price prediction is studied in countries around the world; these studies focus on predicting housing prices to support not only investors and buyers but also government social policy. There are many methods to forecast housing prices, but the most popular is using a machine learning model. For example, in the research paper "House Price Prediction Using LSTM", the authors use a machine learning model, a Recurrent Neural Network (RNN), as the foundation of the solution (Chen, Wei and Xu, 2017). The study focused on the housing markets in Beijing, Shanghai, Guangzhou, and Shenzhen and used information from 80 districts over 155 months to build the prediction model and make good predictions for the following 2 months. However, that research paper still has limitations; for example, the lack of data means the accuracy of the RNN prediction model does not reach maximum efficiency.

1.4 Research objective
Currently, real estate prices in Vietnam in general and in Ho Chi Minh City in particular depend on many qualitative factors, such as legal issues or Feng Shui, a matter of great interest in Southeast Asian countries, especially Vietnam. These qualitative factors (such as feng shui, house orientation, and residential composition) cannot be used to predict specific prices of real estate such as land and houses. Apartment buildings, on the other hand, are less dependent on qualitative variables and provide full details about each unit inside them.
The first objective: collecting and identifying the quantitative factors that can affect apartment prices via apartment buildings' information.
The second objective: applying the SEMMA process to build regression models for predicting apartment prices and comparing the accuracy of these models.
The third objective: classifying all of the apartments into segments based on the affecting factors.
In order to perform real estate prediction, it is necessary to have a good understanding of machine learning applications and knowledge of statistics, as well as a clean and high-quality data source.

1.5 Scope and subject of the study
1.5.1 Research Scope
The study aims to find out the accuracy and effectiveness of classification and regression analysis for apartment price prediction in Ho Chi Minh City, including collecting and analyzing housing data over the last four years (2018 to 2022), with three months to scrape the data and build the models. The segments that the report targets include high-end, middle-end, and low-end apartments.
1.5.2 Research Subject
Apartment price prediction in Ho Chi Minh City.

1.5.3 Research Method
The method used in the research process is concretized through the following steps:
− Collect information and documents from a website specializing in real estate brokerage: Batdongsan.com.vn.
− Gather, collect, and process documents, combining them with learned and practical knowledge to build machine learning models that serve the house price forecasting process.
− Test and compare results from the different models used in order to select the best one.
− Use classification algorithms to classify the apartment data into segments.
− Discuss the final results.

1.6 Significance of study
Predicting home prices can help prospective homebuyers get an estimate of the possible future price of a property, which helps them plan their finances well. In addition, home price predictions also benefit real estate investors by revealing house price trends in a certain location.

1.7 Research Structure
The study includes 5 chapters:
− Introduction: Provides key information on the problem statement, methodological guidelines, key findings, and key conclusions of housing price forecasting in HCMC.
− Literature review: Analyzes and combines scholarly sources on previous research related to the property appraisal process, statistical literacy, and machine learning tools.
− Methodology: Explains the design of the study, the techniques used to collect the information, and other aspects relevant to the experiment.
− Research results: Reviews the information in the introduction, evaluates the results, and finds the model that best fits.
− Conclusion: Provides final thoughts and summaries of the entire work, considers the limits of the research and results, and suggests potential solutions or new ideas based on the results obtained.

CHAPTER 2. LITERATURE REVIEW
2.1 Definition
2.1.1 Data mining process
Jiawei Han and Micheline Kamber stated that data mining has received a lot of attention in the information industry and in society in general in recent years because of the widespread availability of massive amounts of data and the pressing need to turn that data into useful information and knowledge. Market analysis, fraud detection, and client retention, as well as production control and science exploration, can all benefit from the information and knowledge gathered (Han & Kamber, 2001). There appears to be no such thing as too much data in today's increasingly data-driven world. Data, however, is only valuable if it can be analyzed, sorted, and sifted through to determine its true worth. Most industries collect huge amounts of data, but without a filtering process that generates graphs, charts, and trending data models, the data is useless. Filtering through the massive amount of data, at the speed at which it is collected, is difficult. As a result, scaling up our analysis power to manage the massive amounts of data we currently receive has become economically and scientifically vital. A number of data mining processes aim to extract information from rapidly accumulating data and then circulate that knowledge in order to improve the process of obtaining quality data and optimize operations. The Knowledge Discovery in Databases (KDD) process; Sample, Explore, Modify, Model, and Assess (SEMMA); and the Cross Industry Standard Process for Data Mining (CRISP-DM) are three popular methods.
2.1.2 SEMMA Process
Data mining, according to the SAS Institute, is the process of sampling, exploring, modifying, modeling, and assessing (SEMMA) huge amounts of data in order to identify previously unrecognized patterns that can be used as a competitive advantage (SAS, 2009). The SEMMA process consists of five steps: Sample, Explore, Modify, Model, and Assess.

Figure 2. SEMMA process

● Sample: This stage comprises selecting a subset of the relevant volume of data from the large dataset provided for the model's development. The purpose of the first stage of the process is to identify the variables or factors (both dependent and independent) that influence the process. The data is then categorized into preparation and validation sets.
● Explore: Univariate and multivariate analysis are used in this step to investigate interrelated relationships between data pieces and discover data gaps. While multivariate analysis examines the links between variables, univariate analysis examines each factor separately to determine its role in the overall scheme. With a large focus on data visualization, all of the influencing factors that may affect the study's outcome are examined.
● Modify: In this step, business logic is used to apply lessons learned in the exploration phase to the data acquired in the sample phase. In other words, the data is processed and cleaned before being passed on to the modeling step, where it is examined to see whether it needs to be refined and transformed.
● Model: After the variables have been refined and the data has been cleaned, the modeling step uses a range of data mining techniques to create a projected model of how the data generates the process's final, desired output.
● Assess: At this point in the SEMMA process, the model is assessed to see how useful and dependable it is for the issue at hand. The data may now be put to the test and used to determine how effective its performance is.

The SEMMA process provides an easy-to-understand procedure for developing and maintaining data mining projects in an organized and efficient manner. It thus provides a framework for conception, creation, and evolution, assisting in the presentation of business solutions as well as the identification of data mining business objectives (Santos & Azevedo, 2005).

2.1.3 Theoretical foundations of apartments
2.1.3.1 The concept of the apartment
An apartment, also known as a flat, is a self-contained housing unit (a type of residential real estate) that occupies part of a floor of a building. Apartment tenure ranges from large-scale public housing to owner occupation within what is legally a condominium, to tenants renting from a private owner. An apartment building is normally constructed on a plot of land and consists of many apartment units. An individual or household eligible to own an apartment owns the space between the apartment's walls, floor, and ceiling; at the same time, the apartment owner also has the right to use the general utilities of the apartment building. Some apartment dwellers in the United States own their units, either through a housing cooperative, in which residents hold shares in a corporation that owns the building or complex, or through a condominium, in which residents own their flats but share ownership of the common areas. The majority of apartments are purpose-built, although huge older houses are occasionally partitioned into apartments.
In Vietnam, according to Article 03 of the Law on Housing 2014: "an apartment building is a house with 2 floors or more, with many apartments on one floor, with common walkways and stairs, and with a system of infrastructure and floors for the common use of households and individuals. This includes apartments built for residential purposes and apartments built with the mixed purpose of both residence and business. Each household or individual has its own private share and a shared ownership share in the apartment building."

2.1.3.2 Current classification of apartments and condominiums
According to Circular No. 31/2016/TT-BXD of the Ministry of Construction, apartment buildings are classified based on the following four groups of criteria:
- Criteria related to architecture or planning.
- Criteria related to the technical system.
- Criteria related to infrastructure and services.
- Criteria related to quality, management, and operation.

2.1.3.3 Characteristics of the apartment market
The "apartment market", similar to the general concept of a "market", is often understood as a place where transactions in apartment goods take place: a collection of conditions and agreements through which buyers and sellers exchange goods with each other. However, because apartments differ from ordinary goods, the market has its own characteristics. The first is its regional character, since an apartment is a commodity that cannot be moved; it is therefore associated with the economic, natural, and social characteristics of the region and depends on the traditional culture and psychological characteristics of each region. Second, the supply of apartments reacts more slowly to fluctuations in demand and price than other goods: when demand increases, supply cannot respond quickly, because this type of good takes time to produce. Third, the apartment market is governed by law: all transactions in the apartment market must be supervised and managed by the State, for example through registration and the issuance of ownership certificates. Apartments with complete legal documentation are more valuable because they can participate in all transaction activities such as transfer, legalization, and mortgage. The participation of the State in the apartment market through legislation makes this market stable and safe.

2.1.3.4 Apartment price prediction
As people's living standards increase, demand for housing rises quickly. An apartment or flat is a self-contained housing unit (a type of house) that forms part of a building, generally on a single storey. While some people buy an apartment as an investment or as a property due to its affordable price, most people around the world buy an apartment as a shelter.
According to the article "Predicting Housing Sales in Turkey Using ARIMA, LSTM and Hybrid Models", housing markets have a beneficial impact on a country's currency, which is a key metric in the national economy. Homeowners purchase products for their homes, such as furniture and domestic equipment, while home builders and contractors purchase raw materials to build houses to meet demand, indicating the economic ripple effect caused by new housing supply. Aside from that, a country's high level of housing production shows that consumers have the financial means to make a substantial investment and that the construction sector is in good shape (Temür et al., 2019).
Every year, there is an increase in housing demand, which also leads to an increase in apartment prices. Most stakeholders, including buyers, developers, house builders, and the real estate industry, would like to know the exact attributes or factors influencing the apartment price, since numerous variables such as location and property demand may influence it; this knowledge helps investors make decisions and helps house builders set apartment prices.

2.2 Regression Model (formula and explanation)
2.2.1 Extreme Gradient Boosting (XGBoost)
When compared to an individual predictor, a model that aggregates the predictions of several predictors frequently produces better results. Ensemble learning is a strategy that uses a number of predictors to form an ensemble. Bagging, boosting, and stacking are three types of approaches that can be used to create an ensemble method. Random Forest (RF), for example, is an ensemble of randomized decision trees that is often trained using the bagging approach. Boosting, unlike bagging, trains predictors sequentially rather than concurrently.

Figure 3. XGBoost Model

Under the gradient boosting framework, XGBoost is a scalable end-to-end tree boosting system that fits each new predictor to the residual errors made by the previous predictor. Many additive functions are used to forecast the outcome, as in the equation

\hat{y}_i = y_i^{(0)} + \eta \sum_{k=1}^{M} f_k(X_i)

where \hat{y}_i is the projected result based on features X_i, y_i^{(0)} is the initial guess (typically the mean of the observed values in the training set), and \eta is the learning rate that allows the model to improve smoothly while adding new trees without overfitting. The estimate after adding the k-th estimator f_k is

\hat{y}^{(k)} = \hat{y}^{(k-1)} + \eta f_k

where \hat{y}^{(k)} is the k-th predicted result and f_k is defined by the leaf weights. The following regularized objective is minimized to learn the functions employed in the model above:

L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \quad \text{where } \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2

The difference between the forecast \hat{y}_i and the target y_i is measured by l, a differentiable convex loss function. The second term penalizes the model's complexity and functions as an additional regularization factor to prevent overfitting. XGBoost also allows users to discover the relative relevance or contribution of particular input factors in forecasting the response, because it is built on decision trees and single trees are highly interpretable. According to Breiman et al., I_l^2(T) can be employed as a measure of importance for each predictor variable x_l, where J is the number of nodes in the tree:

I_l^2(T) = \sum_{t=1}^{J-1} i_t^2 \, I(v(t) = l)

The importance measure is generalized to XGBoost by averaging over the trees, as shown below, where M is the number of trees:

I_l^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_m)
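To make the boosting formulation above more concrete, the following minimal Python sketch fits an XGBoost regressor on a generic tabular dataset. It is an illustration only: the file name, the "Price" target column, and the hyperparameter values are assumptions for demonstration, not the exact configuration used in this thesis.

# Minimal sketch of fitting an XGBoost regressor (illustrative assumptions only).
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

df = pd.read_csv("apartments_clean.csv")   # hypothetical preprocessed dataset
X = df.drop(columns=["Price"])             # independent variables
y = df["Price"]                            # dependent variable (e.g. log-transformed price)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = XGBRegressor(
    n_estimators=500,      # M: number of additive trees f_k
    learning_rate=0.05,    # eta: shrinkage applied when adding each new tree
    max_depth=6,
    reg_lambda=1.0,        # lambda: L2 penalty on leaf weights in Omega(f)
    gamma=0.0,             # gamma: penalty per additional leaf in Omega(f)
)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Relative importance of each input factor, as discussed above
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head())

The hyperparameters map directly onto the symbols in the equations above (n_estimators to M, learning_rate to \eta, reg_lambda and gamma to the regularization term \Omega), which is why they are shown explicitly here.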
2.2.2 Linear Regression
Linear regression is the simplest and earliest predictive method; it estimates a continuous outcome using a linear combination of predictors (independent variables). The goal of linear regression models is to minimize the mean squared error (the average squared discrepancy between the observed and predicted outcome values) when estimating the regression coefficient vector \beta. Given a dataset of n observations, a linear regression model with p predictors is written as

Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \varepsilon_i, \quad i = 1, 2, \ldots, n,

where Y_i represents the continuous response for the i-th observation, the parameter \beta_j, j = 1, \ldots, p, represents the effect size of covariate j on the response, x_{i,j} represents the j-th variable value for the i-th observation, and \varepsilon_i is the random error term.
In linear regression analysis, there are several assumptions: the error terms \varepsilon_i are independent, uncorrelated, and normally distributed with a mean of zero and constant variance \sigma^2 (a.k.a. homoscedasticity). The linear regression model has the advantage of excellent interpretability of the coefficients and strong prediction on small training data sets (Hastie et al., 2001). Its disadvantage is sensitivity to outliers, which are a regular occurrence in most datasets: the presence of even a small number of outliers can affect the linear model's performance.

Figure 4. Linear Regression Model

While linear regression is a very straightforward method for capturing the complexity of housing predictions, it contains key concepts that are used to construct alternative regression techniques. Many recent statistical learning methods, such as splines and generalized additive models, can be considered extensions or generalizations of linear regression.

2.2.3 Ordinary least squares (OLS)
2.2.3.1 Definition
"Ordinary least-squares (OLS) regression is an extended linear modeling approach that may be used to describe a single response variable on at least an interval scale", argued Dan Hutcheson from VLSI Research Inc. He also concluded that this technique can be used with single or multiple explanatory factors, as well as with categorical explanatory variables that have been properly coded (Hutcheson, 2011). The OLS method uses a formula similar to that of linear regression, but in the OLS technique we must select the values of b_1 and b_0 that minimize the total sum of squares of the differences between the computed and observed values of y:

S = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_1 x_i - b_0)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^2 \rightarrow \min

where \hat{y}_i is the predicted value for the i-th observation, y_i is the actual value for the i-th observation, \hat{\varepsilon}_i is the error/residual for the i-th observation, and n is the total number of observations. To find the values of b_0 and b_1 that minimize S, it is necessary to take the partial derivative with respect to each coefficient and set it equal to zero.

Figure 5. Example of OLS Regression model based on Weight and Height

2.2.3.2 OLS results interpretation
R-squared: the coefficient of determination. It is the proportion of the variance in the dependent variable that is predictable/explained.
Adjusted R-squared: the modified form of R-squared, adjusted for the number of independent variables in the model. The value of adjusted R-squared increases when we include extra variables that actually improve the model.
F-statistic: the ratio of the mean squared error of the model to the mean squared error of the residuals. It determines the overall significance of the model.
Coef: the coefficients of the independent variables and the constant term in the equation.
t: the value of the t-statistic. It is the ratio of the difference between the estimated and hypothesized values of a parameter to the standard error.
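As a hedged illustration of how the OLS quantities just described (R-squared, adjusted R-squared, F-statistic, coefficients, and t-values) are typically obtained in Python, the short sketch below fits an OLS model with statsmodels. The feature names and file name are assumed placeholders, not the exact feature set used later in this thesis.

# Illustrative OLS fit with statsmodels; column names are assumptions.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("apartments_clean.csv")          # hypothetical preprocessed dataset
X = df[["Total_Square", "Bedroom", "Bathroom"]]   # placeholder explanatory variables
X = sm.add_constant(X)                            # adds b0 (the intercept term)
y = df["Price"]

ols_model = sm.OLS(y, X).fit()   # chooses coefficients that minimize S, the sum of squared residuals
print(ols_model.summary())       # reports R-squared, adj. R-squared, F-statistic,
                                 # coefficients and their t-statistics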
2.2.4 K-Nearest Neighbors (KNN)
The k-nearest neighbors algorithm (KNN) is a nonparametric classification method: to choose the class to predict for a new observation, it decides which points from the training set are similar enough to be considered, picks the k closest data points to the new observation, and takes the most common class among them (Sutton, 2012).
According to Dr. Zhongheng Zhang, MMed, Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University, there are two important concepts in the KNN() function. The first is that the KNN() function uses the Euclidean distance, which is determined using the equation below:

D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}

where p and q are the subjects to be compared, each with n characteristics. The other concept is the k parameter, which determines how many neighbors the KNN algorithm will choose. The choice of k has a considerable impact on the KNN algorithm's diagnostic performance (Zhang, 2016).

Figure 6. Visual presentation of simulated working example

Classes 1, 2, and 3 are denoted by red, green, and blue, respectively. Dots represent test data and triangles represent training data.

2.3 Random Forest classification
2.3.1 Definition of classification
Classification is a supervised machine learning technique used to predict group membership for data examples (Sukumaran, 2013). Although a variety of machine learning approaches are available, classification is the most extensively employed. Classification is a well-known problem in machine learning, particularly in future planning and knowledge discovery, and is regarded as one of the most important topics tackled by machine learning and data mining experts (Baradwaj, 2012). Examples of classification techniques include linear classifiers, logistic regression, the Naive Bayes classifier, the perceptron, support vector machines; quadratic classifiers, k-means clustering, boosting, Random Forest (RF); neural networks, Bayesian networks, and so on (Ayodele, 2010).

2.3.2 Definition of Random Forest classification
Xindong Wu, from Hefei University of Technology and the Yuanda Australia International Education Center, defines the decision tree as the machine learning model used, within a random forest, to predict an object. Random forest models can be classifiers or regressors, which means they can predict the category of an item or a certain dependent value for the item. To do this, a tree structure is used, with leaves representing values and branches representing feature conjunctions that lead to those values. Random forests are among the most popular machine learning algorithms due to their comprehensibility and simplicity (Wu et al., 2007).

Figure 7. Decision tree graph

Decision tree models are sometimes found to be unstable, which means that a little change in the decision tree's starting parameters might cause the model forecast to fluctuate greatly. They also have a proclivity towards overfitting. Random Forest is an ensemble learning model that overcomes these problems by combining bootstrap aggregation with randomized decision trees.

Figure 8. Random Forest Classifier graph

Random Forest, according to Leo Breiman, from the Statistics Department of the University of California, "is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest" (Breiman, 2001). It constructs decision trees from several samples and uses their majority vote for classification and their average for regression. Each decision tree is trained on a subset of the feature space that is chosen at random. Trees built in distinct subspaces broaden the categorization in complementary ways, improving overall classification accuracy and stability while avoiding overfitting. A minimal sketch of training such a classifier is shown below.
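The following minimal sketch shows how such an ensemble classifier can be trained in Python with scikit-learn. It is illustrative only: the label column "Segment" (high-end / middle-end / low-end), the file name, and the parameter values are assumptions rather than the thesis's exact setup.

# Illustrative Random Forest classification of apartments into segments; names are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("apartments_clean.csv")   # hypothetical preprocessed dataset
X = df.drop(columns=["Segment"])           # influencing factors
y = df["Segment"]                          # hypothetical class label: high-end / middle-end / low-end

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest (see the discussion that follows)
    max_features="sqrt",   # random subset of features considered at each split
    random_state=42,
)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

The n_estimators parameter corresponds to the number of trees discussed in the next paragraph, and max_features controls the random feature subspace each tree is grown in.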
The number of trees produced for the Random Forest is an important factor to consider when creating the model. For a feature space of m dimensions, for example, there are m subspaces in which decision trees may be built. When a higher number of trees is employed, the model performs better; although the growth in model accuracy slows as the number of trees grows, it does not grind to a halt.

2.4 Model evaluation
2.4.1 Residual and Predicted Values
The deterministic component takes the form of a straight line that provides the predicted (mean/expected) response for a given predictor variable value. The residual terms represent the difference between the predicted value and the observed value of an individual. They are assumed to be independently and identically normally distributed with zero mean and constant variance, and they account for natural variability as well as possible measurement error (Kim, 2019). Our data should thus appear as a collection of points scattered randomly around a straight line with constant variability along the line.

Figure 9. Residual scatter plot

The residual scatter plots allow us to check:
● Normality: the residuals should be normally distributed about the predicted responses.
● Linearity: the residuals should have a straight-line relationship with the predicted responses.
● Homoscedasticity: the variance of the residuals about the predicted responses should be the same for all predicted responses.

2.4.2 RMSE
The term "error" in statistics refers to a deviation from a known, correct value. As a result, the root mean square error (RMSE) statistic measures the average deviation of a group of observations from a known value. The RMSE is calculated as

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}

where \mu is the known, correct value, n is the number of observations of \mu, and x_i is one of the set of n observations. RMSE is a measure of accuracy because it represents the dispersion around a true value; a test of whether the RMSE is zero, for example, might be used to confirm or deny observation bias. The precision and accuracy of an observation set can be assessed using both statistics (RMSE and standard deviation, SD), and neither statistic is sufficient on its own:
● A small RMSE indicates a low standard deviation.
● A small SD does not imply a small RMSE, and a big RMSE does not imply a large standard deviation.
● A big standard deviation implies a large root mean square error (RMSE).
● To put it another way, a small RMSE indicates that the estimated mean is close to the true mean, which implies that the calculated SD is close to the RMSE: the sample is precise and accurate.
● Because the estimated mean is far from the true mean if the sample is biased, a small SD does not imply a small RMSE: the sample is precise but erroneous.
● The same rationale applies in the other direction: a large RMSE does not imply a large SD.
● A large SD implies a large RMSE: the sample is dispersed, and since both SD and RMSE are dispersion measures, if SD is large, RMSE must be large as well; the sample is imprecise and may or may not be accurate.
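To illustrate how these evaluation quantities are typically computed in Python, the short sketch below derives residuals, the RMSE, and the R-squared score (defined formally in the next subsection) from a set of actual and predicted prices. The numeric arrays are made-up example values, not results from this study.

# Illustrative computation of residuals, RMSE, and R-squared; the numbers are made-up examples.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([2.10, 3.40, 1.95, 4.20, 2.80])   # actual prices (example values)
y_pred = np.array([2.25, 3.10, 2.05, 4.00, 2.95])   # predicted prices from some model

residuals = y_true - y_pred                           # should scatter randomly around zero
rmse = np.sqrt(mean_squared_error(y_true, y_pred))    # root mean squared error
r2 = r2_score(y_true, y_pred)                         # coefficient of determination

print("Residuals:", residuals)
print("RMSE:", round(rmse, 4))
print("R2:", round(r2, 4))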
2.4.3 Coefficient of determination (R-squared)
According to Stanton A. Glantz and his colleagues, the coefficient of determination, designated R^2 and pronounced "R squared," is the fraction of the variation in the dependent variable that is predicted from the independent variable(s) (Glantz, Slinker & Neilands, 2000). It is a statistic applied in the context of statistical models whose primary objective is either the prediction of future events or the testing of hypotheses based on other relevant information. Based on the fraction of total variance explained by the model, it measures how well the observed results are reproduced. The R^2 value is determined as below, where y_i represents the actual price, \hat{y}_i represents the predicted price, and N is the number of samples:

R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}

The arithmetic mean, \bar{y}, is determined as follows:

\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i

In regression analysis assessment, the coefficient of determination can be more (naturally) informative than MAE, MAPE, MSE, and RMSE, since the former can be stated as a percentage whereas the latter measures have arbitrary ranges. On the test datasets in that paper, it also proved more robust to bad fits than SMAPE (Chicco, Warrens & Jurman, 2021).

2.5 Related work
2.5.1 Foreign research
In the article "Data engineering for house price prediction" (Burgt, 2017), the authors predict housing prices in the Dutch housing market with regression models and combine different regression models in order to create a model with higher accuracy and a lower error rate during a period of increasing demand for real estate. The research model consists of four steps to preprocess the data. First, price index construction: different types of price indices are discussed. Second, data engineering is conducted on the data sets in order to make the features fit better for machine learning. Third, regression is used for predicting real estate prices. Finally, sale and neighborhood comparisons: the sales prices are compared between real estate agents and between neighborhoods based on the dataset. The research also points out that although the fundamental sale forecast component has a decent foundation, there are upgrades that might make it more accurate and trustworthy throughout all of the Netherlands' regions.

2.5.2 Domestic research
"Analysis of variables impacting apartment pricing, case study in District 2 of Ho Chi Minh City", from Ho Thi Nhai's thesis (2015), used a hedonic model on 130 unit samples from 20 apartment buildings that were available for sale, had successful transactions in District 2, and had prices ranging from 14 million/m2 to 25 million/m2. The research findings suggest that just seven independent variables have an influence on pricing, including distance to the center, floor location, number of toilets, position of the flat on the floor, surroundings, nearby services, and the investor's reputation.

2.5.3 Reviews of previous research papers
The Dutch research reports the following results: with or without weighting, the accuracy of the basic prediction is not great; the accuracy was below zero, and the average error and standard deviation were around 60,000 and 70,000, respectively. On the other hand, with the ability to merge models into one final model, each municipality model paired with the Netherlands model of the same property type resulted in an accuracy of 83.95, which at first glance does not appear to be a significant improvement. However, the average error and standard deviation, which are 38,000 and 43,000 respectively, show a significant improvement.

2.6 Conclusion
Chapter Two reviewed the theoretical literature that will be used in the following chapters, as well as previous domestic and international studies related to real estate price prediction in the region.
Determining influencing factors and predicting real estate prices in general, and apartment prices in Ho Chi Minh City in particular, is usually done with the hedonic method and has not been optimized. This study will use the Ordinary Least Squares method to determine the factors affecting apartment prices. At the same time, it performs apartment price forecasting by applying three machine learning algorithms, XGBoost, Linear Regression, and K-Nearest Neighbors, to forecast apartment prices in Ho Chi Minh City. These algorithms will be evaluated for accuracy and feasibility through RMSE, the residual score, and the R-squared score. Besides, apartment classification with Random Forest classification is also used to support real estate companies as well as apartment buyers.

CHAPTER 3. METHODOLOGY
3.1 Methodology Research Process
The necessary steps in the research methodology are summarized in the following map:

Figure 10. Methodology Research Process

The research methodology includes 5 steps:
- Material Introducing: introducing the data sources and the amount of data needed for the research.
- Data Sampling: showing how the data is obtained (including the software and the collection steps).
- Data Exploring: analyzing a large data collection in an unstructured way to uncover initial patterns and characteristics of the data (Data Exploring is the first part of data pre-processing).
- Data Modifying: transforming raw data into an understandable format (Data Modifying is the second part of data pre-processing).
- Experiment Design: including Data Modeling (creating a data model for an information system using several types of machine learning models) and Data Assessing (evaluating the models to verify whether they meet the project's quality requirements).

3.2 Materials Introducing
Before starting with the first step, it is essential to discuss the source from which the data was collected. Since data collection takes place on computers (instead of collecting data in person at real estate offices), it is essential and optimal to mine data from a website that specializes in real estate. Currently, there are many online real estate brokers, and Batdongsan.com.vn is one of them. Although Batdongsan.com.vn is a relatively young exchange, the company is one of the biggest real estate exchanges in Vietnam. At Batdongsan.com.vn, sellers and buyers can upload their products directly to the website, so the data collected can also be considered primary data (Batdongsan.com.vn, 2021).

Figure 11. Interface of Batdongsan.com.vn

Before posting products, buyers and sellers need to create an account and identify themselves, thereby ensuring the product's reputation and avoiding fraud before direct transactions are made. Besides, in order to avoid data with prices that are very high or very low compared to reality, I collect reference data on the prices of apartment building contractors and compare it with the data collected from Batdongsan.com.vn to remove junk data. In Vietnam, there are many real estate exchange websites, such as Chotot.com, Rever.vn, or Propzy.vn… It is possible to collect data about apartments from those websites, which provide plenty of data sources for building the regression models. However, because these websites serve the same function, sellers can upload the same products to different websites. Therefore, collecting from multiple websites can lead to many duplicate records and affect the accuracy of the models.
On the other hand, the amount of data scraped from Batdongsan.com.vn alone can be enough for building regression models (about 4,000 to 5,000 apartments).

3.3 Data Sampling

The sampling process uses scraping software to collect data from the internet in a form that the machine learning model can understand. The output data is interpreted by a machine during parsing, but it is difficult for humans to grasp. Data extraction is another term for data scraping. Data scraping is particularly useful because, if humans collect data from the internet, there are numerous opportunities for error, whereas computers transfer data between programs in the form of data structures that ensure data integrity. Even so, in terms of data consistency and correctness, the data can still be problematic. As a result, data scraping only obtains raw data and necessitates considerable preprocessing and, in some cases, human intervention. The data scraping activity relies mostly on Internet sources for data collection and cannot be completely automated. For example, when scraping data from a website, the scraper relies on the structure that the developer has given to each unique HTML element, such as an ID attribute and a class attribute assigned to each item in the same group. This helps in the creation of a script in nearly any programming language. Besides using Python packages as crawlers, ParseHub, an application developed to crawl data from the web, is used in this research. Data extraction technologies have traditionally been either too difficult to utilize for non-technical people or too simplistic to manage the complexity and interaction of modern websites. When collectors consider how much time is wasted collecting data, the necessity for a strong and versatile solution that makes data appear directly at the fingertips becomes even more clear. Data scientists, for example, spend 50 to 80 percent of their time collecting and preparing data rather than putting it to use. ParseHub allows developers complete control over how they choose, structure, and alter items, eliminating the need to dig through the browser's web inspector. There is no need to create another web scraper because users are able to handle interactive maps, endless scrolling, authentication, dropdowns, forms, and more with ease using ParseHub (ParseHub, 2021). Web scraping from well-established companies, on the other hand, is not simple, since the organizations deploy defensive algorithms and software to prevent unauthorized access to their website. As a result, the goal is to use tools or applications that can scrape data as intelligently as a human. This is accomplished by automating human online browsing behavior. ParseHub can scrape data with a delay such as 3 to 5 seconds so that, for example, the system may not notice the data extraction from the website and may mistake it for routine activity.

Each product is extracted in three steps from the website:
− The first step is collecting general information about the product (product title, product location and squared area of the product).

Figure 12. First step of extracting

− The second step is that ParseHub automatically accesses the URL links to collect more data on each property. The page contains detailed information about the product, and collection proceeds according to the following criteria: number of bedrooms, number of bathrooms, other facilities, main contractors, and project name.

Figure 13. Second step of extracting
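The thesis performs this step with ParseHub; purely as an illustration, a comparable two-step crawler with a polite delay could be sketched in Python as below. The base URL and CSS selectors are hypothetical placeholders and do not describe the real structure of Batdongsan.com.vn.

```python
# Illustrative sketch only: listing page -> detail page, with a 3-5 second delay.
# All selectors and the URL below are hypothetical assumptions, not the real site.
import time
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-real-estate-site.vn/apartments?page={page}"  # hypothetical

def scrape_listing_page(page_number):
    """Collect title, location, area and the detail-page link of each product."""
    html = requests.get(BASE_URL.format(page=page_number), timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    products = []
    for card in soup.select("div.product-card"):          # hypothetical selector
        products.append({
            "title": card.select_one("h3.title").get_text(strip=True),
            "district": card.select_one("span.location").get_text(strip=True),
            "square": card.select_one("span.area").get_text(strip=True),
            "detail_url": card.select_one("a")["href"],
        })
    return products

def scrape_detail_page(url):
    """Follow the product link and read bedrooms, bathrooms, project name, price."""
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    return {
        "bedroom": soup.select_one("span.bedroom").get_text(strip=True),   # hypothetical
        "bathroom": soup.select_one("span.bathroom").get_text(strip=True),
        "project": soup.select_one("span.project").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
    }

rows = []
for page in range(1, 3):                 # small page range for illustration only
    for product in scrape_listing_page(page):
        rows.append({**product, **scrape_detail_page(product["detail_url"])})
        time.sleep(4)                    # polite 3-5 second delay, as described above
```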
Moreover, by setting ParseHub to automatically turn pages, the crawling process continues through the two steps above until the end of the last page. The data scraping process can take a lot of time due to the large amount of data. The final step is to access the apartment investors' websites to get some features which are not extracted from the Batdongsan.com.vn website in the two steps above (including: swimming pool, gym facilities, furniture included, population density in the districts, car parking, time to access the city center, and price per meter in the areas). Besides, it is also necessary to get the reference prices from the apartment investors, then compare them to the prices on the Batdongsan.com.vn platform and remove outliers. The variables expected to be extracted are in the table below:

Table 1. Description of all attributes which were collected

Index | Name of attribute | Description | Data Type
1 | Title | Product introduction title | String
2 | ID | Address of real estate | Object
3 | District | Location of apartment | Object
4 | Total_Square | Area of real estate | Object
5 | Bedroom | Number of bedrooms | Numeric
6 | Bathroom | Number of bathrooms | Numeric
7 | Price | Apartment price | Object
8 | Policy | The current certificate of ownership | String
9 | Furniture | Furniture provided with the real estate | Boolean
10 | Swimming_Pool | Swimming pool provided with the real estate | Boolean
11 | Gym | Gym facilities of the apartment | Boolean
12 | Price Per Meter Each District (B/M2) | The average price per square meter in each district | Float
13 | Population Density | The population density of each district | Float
14 | Coordinate | Coordinates of the apartment project | Object
15 | Project Name | Name of apartment project | String

3.4 Data Exploring

The total number of samples collected is 4,900. After checking and cleaning the data, some samples will be removed because they lack information on the variables, fall outside the scope of the study, or fail the quality assessment. Analysis is performed using the Python programming language and statistical packages.

Table 2. Total number of apartments in each district

Index | District | Number of apartments | Price per meter (million/m2)
1 | District 2 | 724 | 53.83
2 | District 7 | 609 | 43.97
3 | District 9 | 587 | 40.89
4 | Tan Phu | 416 | 40.74
5 | District 4 | 332 | 61.14
6 | District 8 | 274 | 35.54
7 | District 12 | 266 | 33.71
8 | Binh Thanh | 253 | 60.17
9 | Thu Duc | 223 | 36.97
10 | Binh Tan | 219 | 33.91
11 | Binh Chanh | 212 | 34.86
12 | District 6 | 139 | 42.91
13 | Nha Be | 122 | 37.64
14 | District 10 | 97 | 66.67
15 | District 5 | 96 | 48.48
16 | Go Vap | 87 | 40.03
17 | Phu Nhuan | 83 | 62.51
18 | Tan Binh | 74 | 52.21
19 | District 11 | 43 | 46.36
20 | District 3 | 24 | 72.87
21 | District 1 | 20 | 99.05

In terms of research scope, the district with the largest number of survey samples is District 2, with 724 samples and an average apartment unit price of 53.83 million VND/m2, while the district with the highest transaction price is District 1, with an average price of 99.05 million/m2.
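As an illustration only, a district summary like Table 2 could be produced with a short pandas aggregation; the sketch below assumes the cleaned samples are in a DataFrame df with a "District" column and a numeric "price_per_m2" column (both column names are assumptions).

```python
# Minimal sketch of a per-district summary (count and mean unit price),
# assuming cleaned data in a pandas DataFrame `df` with assumed column names.
import pandas as pd

summary = (
    df.groupby("District")
      .agg(number_of_apartments=("District", "size"),
           price_per_meter=("price_per_m2", "mean"))
      .sort_values("number_of_apartments", ascending=False)
      .round(2)
)
print(summary)
```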
According to Circular 31/2016/TT-BXD, effective February 15, 2017 and issued by the Department of Planning and Architecture, apartments are classified into grades A, B, and C based on four factors: architectural planning; technical systems and equipment; services and social infrastructure; and quality, management and operation. From there, the average price of each apartment grade can be determined. Currently, the prices of apartments in Ho Chi Minh City are as follows: high-class grade A apartments cost more than 60 million per square meter, mid-range grade B apartments cost from 35 to 60 million per square meter and, finally, low-end grade C apartments are priced at less than 35 million per square meter.

Table 3. Segments of all apartments

Index | Segment | Number of apartments | Price per meter (million/m2)
1 | A | 839 | 73.437
2 | B | 2695 | 45.347
3 | C | 1366 | 29.35

3.5 Data Modifying

3.5.1 Outliers

3.5.1.1 Log transformation

After changing prices to numeric (float) form, it is easy to see that apartment prices contain outliers that can affect the results of the predictive models.

Figure 14. Boxplot of Price distribution

The outliers can be seen to lie in the price range of the luxury grade A apartments. Therefore, the problem of outliers is addressed by taking the logarithm of the price variable.

Figure 15. Boxplot of price after log transformation

The price after log transformation still has some outliers, so it is necessary to drop the log_price values higher than 0.9. The dataset finally has 4,882 values left. According to Chambers (Chambers et al., 1983), the probability plot is a graphical tool for determining whether a data set follows a given distribution, such as the normal or Weibull. In addition, a line can be fitted to the points and added as a reference line. The further the points deviate from this line, the stronger the indication of a departure from the specified distribution.

Figure 16. Probability Plot before log normalization

Figure 17. Probability Plot after log normalization

It can be seen that the fitted values along the reference line range from -3 to more than 2. Based on the probability plot, the straight, diagonal line in the normal probability plot after the logarithmic transformation indicates normally distributed data rather than Weibull-distributed data.

3.5.1.2 Skewness and Kurtosis

Skewness is a measure of asymmetry, while kurtosis is a measure of a distribution's 'peakedness.' The values of skewness and kurtosis, as well as their standard errors, are provided by most statistical software, according to Hae-Young Kim from Korea University (Kim, 2013). Skewness is a measure of a variable's asymmetry in its distribution. The skew value of a normal distribution is 0, meaning that the distribution is symmetric. A positive skew value indicates that the right-hand tail of the distribution is longer than the left-hand tail and that the majority of the values are located to the left of the mean. A negative skew value, on the other hand, shows that the left-hand tail of the distribution is longer than the right-hand tail and that the majority of the values are to the right of the mean. West et al. (1996) advocated an absolute skew value > 2 as a measure indicating severe deviation from normality.

Figure 18. Skew distribution

On the other hand, kurtosis is a metric for determining how peaked a distribution is. The original kurtosis value is frequently referred to as kurtosis (proper), and an absolute kurtosis (proper) value > 7 is used to indicate a significant divergence from normality. Most statistical tools, such as SPSS, give 'excess' kurtosis, which is calculated by subtracting 3 from the kurtosis (proper). For a fully normal distribution, the excess kurtosis should be zero. Positive excess kurtosis is known as a leptokurtic distribution, which means a high peak, while negative excess kurtosis is known as a platykurtic distribution, which means a flat-topped curve (West et al., 1996).

Figure 19. Kurtosis distribution
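Before turning to the price distributions in Figures 20 and 21, the skewness and kurtosis checks and the log transformation described above can be reproduced with a short sketch; it assumes the numeric price column is df["Price"] (the column name is an assumption).

```python
# Sketch of the normality checks and log transformation described in this section,
# assuming the numeric price column is df["Price"] (an assumed column name).
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

print("skewness:", df["Price"].skew())        # pandas reports sample skewness
print("kurtosis:", df["Price"].kurt())        # pandas reports excess kurtosis

df["log_price"] = np.log(df["Price"])         # log transformation of the target
print("skewness after log:", df["log_price"].skew())
print("kurtosis after log:", df["log_price"].kurt())

# Probability plot against the normal distribution (cf. Figures 16 and 17)
stats.probplot(df["log_price"], dist="norm", plot=plt)
plt.show()
```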
Figure 20. Distribution of price before log normalization

Figure 21. Distribution of price after log normalization

As we can see, the price distribution is not normally distributed. The target variable is skewed to the right due to multiple outliers (skewness is 1.3393 and kurtosis is 2.4661). After the log transformation, the skewness score is 0.2484 and the kurtosis score is -0.49024; the log transformation clearly brings the distribution much closer to normality, which eliminates the majority of the issues discussed above.

3.6 Data preprocessing

3.6.1 Preprocessing process

Data preprocessing is used to turn a messy dataset into a clean dataset that can be used by machine learning algorithms. Data in raw format, which cannot be analyzed, is subjected to data preparation processes. In our case, the information was gathered from multiple property websites where it was entered by property agents, so there are missing values, data in various formats, and erroneous information. Data integration was used to aggregate data from diverse sources into a single dataset. The data records were transformed using data transformation methods into a format suitable for machine learning analysis. Because the data collected from the Batdongsan.com.vn website is already relatively clean and systematic after removing outliers through the comparison process, what we need to do is make some adjustments for building the machine learning model. As a first step, we can use Excel, a popular and easy-to-use tool, to categorize and convert data types. In this case, the area of the house and the price are not numeric, so we change them to serve the construction of the model. Moreover, we are able to change values from Boolean to numeric (1 is "yes" and 0 is "no"). The variables which need to be changed from Boolean to numeric (binary standardized) are:

Table 4. Numeric transformation

Index | Variable | Old value (False) | New value (False) | Old value (True) | New value (True)
1 | Furniture | Basic | 0 | Included | 1
2 | Policy (House ownership certificate) | No | 0 | Yes | 1
3 | Swimming_pool | No | 0 | Yes | 1
4 | Gym | No | 0 | Yes | 1

Besides, other factors that can affect real estate in general and apartments in particular are the facilities located in the city. These factors include the distance to the airport, the distance to the train station, and facilities in the areas such as schools. These factors are calculated based on the coordinates of the apartment projects. The geographical distances can be easily calculated using the package geopy.distance. One of the key factors in a residential area is the "distance to market, supermarket or mall", which can affect the price of apartments in that area. However, nowadays apartment buildings have commercial centers or supermarkets on the ground floor, which distorts the overall value of the variable "distance_to_market". Therefore, the study removes this variable from the beginning to ensure the accuracy of the model. In addition, social facilities such as schools in each area were collected from the government website https://www.hcmcpv.org.vn/, which is the website of the Party Committee of Ho Chi Minh City.
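As a minimal sketch of how the distance features listed in Table 5 below could be derived with geopy.distance, assuming each apartment has latitude/longitude columns Lat and Long; the airport coordinates used here are approximate and purely illustrative.

```python
# Sketch of one distance feature, assuming apartment coordinates in df["Lat"]/df["Long"].
# The airport coordinates are approximate reference values (an assumption).
from geopy.distance import geodesic

TAN_SON_NHAT_AIRPORT = (10.8188, 106.6520)    # approximate coordinates (assumption)

def distance_to_airport_km(lat, long):
    """Geodesic distance in kilometres from an apartment to the airport."""
    return geodesic((lat, long), TAN_SON_NHAT_AIRPORT).km

df["Distance_to_airport"] = [
    distance_to_airport_km(lat, long) for lat, long in zip(df["Lat"], df["Long"])
]
```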
Table 5. Location factors

Index | Variable | Description | Data type
1 | Distance_to_airport | Distance to airport | Float
2 | Distance_to_train_station | Distance to train station | Float
3 | Distance_to_school | Distance to school | Float
4 | Distance_to_hospital | Distance to hospital | Float

3.6.2 Correlation

To be able to observe which data can be used in model building, a correlation plot is needed to see the correlations between the variables, thereby excluding those with a high dependency ratio. In addition, other graphs which can show the relationship between price and other features will be presented, such as the increase in house prices in proportion to the size of the plot, or how location affects price.

Figure 22. Correlation graph

According to author Albert, Chief Statistician of the Philippines, and his colleagues, correlation is a measure of a relationship between variables. In correlated data, a change in one variable's value is linked to a change in another variable's value, either in the same (positive correlation) or opposite (negative correlation) direction (Albert et al., 2008). Several descriptive statistics were utilized to describe the study's results, including mean, standard deviation, minimum, and maximum values. Descriptive statistics are used to characterize and assess data in order to generate concise summaries and derive some helpful conclusions. When at least one of the variables is ordinal, the Spearman rho correlation is employed to assess the connection between them. According to Albert, the correlation coefficient's range and degree of relationship are shown in the table below (Albert et al., 2008).

Table 6. Interpreting correlation

Based on this table, we can take the input or predictor variables and store them in the Features dataset, which can be used to predict price. Besides removing the variables that are not statistically significant, including "Title", "ID", "District", "Coordinate" and "Project Name", we will continue to exclude variables with low or no correlation with price (between -0.3 and 0.3). As can be seen in Figure 22, multicollinearity persists among a variety of attributes. However, for the sake of learning, we will keep them for now and let the models handle them afterwards. Let's take a look at some of the remaining connections.
− Bathroom and Bedroom have a correlation of 0.73, or 73%.
− Normalized Meter and Bedroom have a 73% correlation.
− Distance to central and Distance to railway have an 89% correlation.
− Distance to the airport and Distance to railway have an 85% correlation.
− Distance to the hospital and Distance to the high school have a 100% correlation.
Therefore, to better suit multiple linear regression techniques, some attributes have to be removed from the dataset.

3.7 Experimental Design

The experiment was performed to pre-process the data and evaluate the predictive accuracy of the models. A multi-stage experiment is required for reliable results. Since the data from sections 3.4, 3.5 and 3.6 are used for the whole experiment, the remaining phases, Data Modeling and Data Assessing, can be defined as:

3.7.1 Data Modeling
- Affecting Factors Identification: the Ordinary Least Squares (OLS) method is used to estimate the parameters in the regression equation, so it is possible to determine the relationship between the dependent variable (apartment price) and the independent variables (affecting factors). The OLS model is imported from the statsmodels package.
- Price prediction: the data, including the dependent variable ("log_price") and the independent variables (in the "Features" dataset), will be divided into two parts, since it is essential to train the model on one part and use the other for evaluation. The dataset will be split 80% for training and 20% for testing. Three machine learning algorithms, XGBoost imported from the XGBoost package, and Linear Regression and K-Nearest Neighbors imported from the scikit-learn package, will be used to predict apartment prices.
- Apartment Classification: the data, including the dependent variable ("segment") and the independent variables (in the "df_class" dataset with "segment" dropped), will be divided into two parts, one for training and one for evaluation. The dataset will be split 80% for training and 20% for testing. The Random Forest Classifier algorithm imported from the scikit-learn package will be used to classify apartments.

3.7.2 Data Assessing
- Affecting Factors Identification: we will utilize the R-squared, which is the coefficient of determination, to evaluate the Ordinary Least Squares technique. It is the fraction of the dependent variable's variability that can be predicted or explained. The coefficients of the independent variables and the constant term in the equation make up the coef scores. In addition, we will re-examine the model for autocorrelation to avoid overfitting and underfitting. Andy Field explains how to use the Durbin-Watson test to look for serial correlations between errors in regression models. It checks whether adjacent residuals are correlated, which is useful when independent errors are assumed (Field, 2009). Field also indicated that the assumption of no autocorrelation is acceptable when the DW value is between 1 and 3.
- Price prediction: there are a variety of error measurements that can be used to evaluate a model's prediction performance. The three predictive models (Linear Regression, XGBoost and KNN) will be evaluated by three commonly used metrics in the regression field. First of all, the residual score refers to the difference between the dependent variable's predicted score (as calculated by the predictive models) and the actual observed score. A close-to-zero score suggests that the difference between each pair of points is small, indicating the predictive model's effectiveness. Secondly, the fit of the data to the regression model is indicated by R-squared (R2). An R-squared value close to 1.0 implies that the model makes accurate predictions. Thirdly, the root mean squared error (RMSE) is an accuracy statistic that shows how far the model is, on average, from the observed data points. The RMSE can also be thought of as a measure of how spread out the residuals are. Each model's error measures will be graphically illustrated. Lastly, the most effective model will be the one that is rated best by at least two of the three error measurements.
- Apartment Classification: the Precision, Recall, F1 Score, and Accuracy of the Random Forest Classifier algorithm are evaluated using the confusion matrix and the values in the classification report. According to Marina Sokolova, Precision is the number of correctly classified positive examples divided by the number of examples labeled as positive by the system, Recall is the number of correctly classified positive examples divided by the number of positive examples in the data, and the F1 Score is a combination of the two. Furthermore, accuracy is defined as a classifier's overall effectiveness (Sokolova & Lapalme, 2009).
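A minimal sketch of the modeling and assessment steps in sections 3.7.1 and 3.7.2 is given below, assuming the engineered predictors are stored in a DataFrame Features, the regression target in log_price, and the classification data in df_class with a segment column; the 80/20 split follows the text, while the random seeds and default hyperparameters are illustrative assumptions.

```python
# Minimal sketch of the modelling and assessment steps, under the assumptions
# stated above (Features, log_price and df_class are built in the earlier steps).
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, r2_score, classification_report
from xgboost import XGBRegressor

# 1) Affecting-factor identification: the OLS summary reports R-squared, the
#    coefficients and the Durbin-Watson statistic used in the assessment step.
ols_model = sm.OLS(log_price, sm.add_constant(Features)).fit()
print(ols_model.summary())

# 2) Price prediction: 80/20 split, three regressors, residual mean / RMSE / R2.
X_train, X_test, y_train, y_test = train_test_split(
    Features, log_price, test_size=0.2, random_state=42)
for name, model in [("Linear Regression", LinearRegression()),
                    ("XGBoost", XGBRegressor()),
                    ("KNN", KNeighborsRegressor())]:
    pred = model.fit(X_train, y_train).predict(X_test)
    print(name,
          "residual:", float(np.mean(y_test - pred)),
          "RMSE:", float(np.sqrt(mean_squared_error(y_test, pred))),
          "R2:", r2_score(y_test, pred))

# 3) Apartment classification: 80/20 split and a random forest classifier.
Xc = df_class.drop(columns="segment")
yc = df_class["segment"]
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    Xc, yc, test_size=0.2, random_state=42)
clf = RandomForestClassifier(random_state=42).fit(Xc_train, yc_train)
print(classification_report(yc_test, clf.predict(Xc_test)))
```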
Besides, Random Forest is a collection of many Decision Trees used to classify apartments, thereby increasing the classification accuracy over a single Decision Tree. We can visualize one of the Decision Trees to see on which criteria the apartments are classified.

CHAPTER 4: RESULT ANALYSIS AND DISCUSSION

On the dataset of Ho Chi Minh City apartments, the three machine learning models XGBoost, Linear Regression, and KNN were used and their performance in forecasting prices was evaluated. The models were compared and scored based on their measured accuracy and the time it took to obtain that accuracy.

4.1 Result Analysis

After applying the Data Modeling process described in the Experiment Design, all models are evaluated in this final SEMMA step for their usability and reliability for the studied problem. The data may now be examined and used to estimate how effective their performance is.

4.1.1 Ordinary Least Squares method evaluation

4.1.1.1 Regression results according to the Ordinary Least Squares method (OLS)

Table 7. OLS Regression Results

The regression results show that at the 5% level of significance, most of the variables are statistically significant (p-value < 0.05) except for the variable Density (People/km2), so there is enough basis to show that the independent variables impact apartment prices. Besides, the signs of the regression coefficients of the variables are all in accordance with expectations and previous studies. The results show that the model has a coefficient of R2 = 0.792 = 79.2%, which means that the independent variables in the model explain 79.2% of the variation of apartment prices in Ho Chi Minh City.

4.1.1.2 Correlation test

If autocorrelation occurs in the model, the OLS estimates become ineffective. Therefore, the author uses the Durbin-Watson d test to detect this phenomenon. The specific test result is as follows: the Durbin-Watson d-statistic = 1.527, which is in the range 1 < d < 3; according to Andy Field, values below 1 or above 3 are cause for concern (Field, 2009). Therefore, it can be concluded that the model has no autocorrelation.

4.1.2 Predictive Models Evaluation

4.1.2.1 Linear Regression

Figure 23. Comparing LR's actual vs predicted values

The R2 coefficient is only 0.799820. This suggests that the model explains or captures about 80 percent of the fluctuation in apartment prices, while the remaining roughly 20 percent is attributable to external variables. With an R2 of 0.799820, the model has good accuracy. This also suggests that by incorporating different predictors, we can enhance the model. The RMSE is nearly 0.074578. It indicates that the model's predictions are on average 0.074578 units off from the actual data. For our model, 0.074578 may not be a bad value. The mean of the residuals is -0.00049, which is very close to zero.

Figure 24. LR's Residual graphs

4.1.2.2 XGBoost

Figure 25. Comparing XGBoost's actual vs predicted values

XGBoost outperforms Linear Regression and KNN in terms of prediction, with the highest R2 value: 0.868886. This suggests that the model can explain or capture about 86.9% of the variation in apartment prices. XGBoost has an RMSE of about 0.060356, which is lower than that of the Linear Regression and KNN models. It indicates that the model's predictions are on average 0.060356 units off from the actual values. For our models, 0.060356 is the smallest error value. The residual score of XGBoost is -0.000533, which is slightly farther from zero than that of Linear Regression.

Figure 26. XGBoost's Residual graphs

4.1.2.3 KNN
Figure 27. Comparing KNN's actual vs predicted values

Finally, the KNN model falls between LR and XGBoost, with an R2 value of 0.840146. It means that the model can explain or capture nearly 84 percent of the price fluctuation. The RMSE value of 0.066644 is lower than the score of Linear Regression but higher than the score of XGBoost. This suggests that the model's forecasts are on average 0.066644 units off the mark. As a result, 0.066644 may not be a suitable fit for the model. KNN also shows a residual score of -0.002048, which is very far from zero compared to those of XGBoost and Linear Regression.

Figure 28. KNN's Residual graphs

4.1.3 Random Forest classification

We can evaluate the classification of the data based on the confusion matrix below:

Figure 29. Confusion Matrix

According to the confusion matrix, it is easy to count the number of each outcome type for the three classes, which can be shown in the following table:

Table 8. Classification performance

Class | True Positive | False Positive | True Negative | False Negative
A | 134 | 73 | 761 | 9
B | 497 | 16 | 290 | 174
C | 156 | 101 | 713 | 7

Besides, the classification report is produced to calculate the Precision, Recall and F1-score of the data after applying the random forest model. Although the data is quite imbalanced, the confusion matrix is quite good and the accuracy is 81% on the dataset of 977 values which were taken for testing.

4.2 Discussion

4.2.1 Factors affecting apartment prices in Ho Chi Minh City

The research results show that 11 factors, including Bathroom, Normalized_Meter, Policy, Furniture, Price_per_meter_each_district (b/m2), Swimming_pool, Gym, Density (People/km2), distance_to_central, distance_to_airport and Distance_to_hospital, are the factors affecting apartment prices in Ho Chi Minh City. The detailed research model, with coefficients rounded to four decimal places, is as follows:

Ln(P) = -0.1952 + 0.0074*Bathroom + 0.0062*Normalized_Meter + 0.0050*Policy + 0.0006*Furniture + 0.0518*Price_per_meter_each_district (b/m2) + 0.1701*Swimming_pool + 0.1615*Gym + 8.362e-07*Density (People/km2) - 0.0029*distance_to_central + 0.0015*distance_to_airport + 6.055e-07*Distance_to_hospital + Ɛ

4.2.2 Prediction model

4.2.2.1 Residual

The residual score describes the difference between the predicted score and the actual score for each of the three models. Figure 30 below represents the residual scores of the three models, where KNN has the residual score farthest from zero (-0.002048) and the nearest score is -0.000490 from the Linear Regression model. In the middle is the residual score of XGBoost, which is approximately -0.000533.

Figure 30. Residual score comparison

4.2.2.2 Root mean squared error

MSE occasionally inflates the actual error, making it harder to grasp the true magnitude of the error. The RMSE measure, which is generated by simply taking the square root of the MSE, solves this problem. Figure 31 depicts the performance of the machine learning algorithms used in this study using the RMSE performance metric.

Figure 31. RMSE comparison

XGBoost has the lowest RMSE score (0.060356), while KNN shows the second lowest score (0.066644) and the highest score belongs to Linear Regression (0.074578); both are not as good as XGBoost.

4.2.2.3 The R2 score

For the three regression models, R-squared (R2) expresses the amount of variation explained by the independent variable or variables.
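For reference, using the same notation as the R² formula in Chapter 2 (with $y_i$ the actual price, $\hat{y}_i$ the predicted price and $N$ the number of test samples), the other two metrics discussed in this section take their standard definitions:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}, \qquad \text{mean residual} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)$$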
Figure 32 compares the performance of the machine learning algorithms used in this study by R2 score.

Figure 32. R-squared comparison

It is clear that the highest R2 score is 0.868886 from the XGBoost model, the second is KNN with 0.840146, and the last one is Linear Regression (0.799820).

4.2.2.4 Result summary

Figure 30 shows the residual distribution of the testing dataset after applying the Linear Regression, XGBoost and KNN models for predicting price. It is clear that the residual points from Linear Regression show the lowest dispersion compared to the graphs from the KNN and XGBoost models. Moreover, the residual distributions from KNN and Linear Regression have skew scores between -5 and +5, which means they are nearly symmetric. In terms of residual, RMSE, and R-squared values, Table 9 shows how the models performed on both data sets. The residual score of XGBoost is not as close to zero as that of Linear Regression, but it is considerably closer to zero than KNN's value (-0.000533 compared to -0.000490 and -0.002048, respectively). However, the R2 value alone can be sufficient to illustrate which model is better for each data set, since the R2 values are greater when the RMSE values are lower. Therefore, for these datasets, the XGBoost model has the best R2 value and the lowest RMSE score and can be considered the best model to predict apartment prices in Ho Chi Minh City.

Table 9. Evaluation score summary

Model | Residual | RMSE | R2 score
Linear Regression | -0.000490 | 0.074578 | 0.799820
XGBoost | -0.000533 | 0.060356 | 0.868886
KNN | -0.002048 | 0.066644 | 0.840146

4.2.3 Classification model

Figure 33. Decision tree (number 10)

We created the random forest, and Figure 33 shows one of the decision trees (number 10) from this random forest model. This tree has a non-uniform depth of partitioning, with 26 internal nodes (including the root node) and 34 leaf nodes. The numbers and characters in the top row of each node of this tree represent the categories generated by the best splitting criterion. The Gini Index of a node, as defined by Hosein Shahnas, is a numerical value representing the quality of a node's split on a variable (feature) (Shahnas, 2020). The numbers in the third row represent the number of observations assigned to the node, while the numbers in the fourth row represent the number of observations separated into each class. Finally, the numbers in the last row indicate the apartment segment's class. All 2494 observations in the training set were assigned to the leaf nodes of the entire decision tree at the end of the tree-growing procedure. In a leaf node, the class value indicates the forecast value of a given category's apartment class. For example, the class of the 5 samples in the first node of the last row indicates that an apartment which has a price of less than 1.79 billion, no swimming pool, a distance to the railway of less than 7.6 kilometers, and no gym is of class type C. The tree depicts how several key independent variables in the training set separate the apartment segments. The top or root node of the tree in Figure 33 represents all of the training set values used in estimator 10. The initial split is made on whether the price is lower than 2.371 billion; this splitter is chosen as the best at this node by comparing the split results given by all independent fields (variables), as this variable results in the greatest reduction in node impurity. This variable is the best at distinguishing across categories in the target field.
The two nodes at the second level of the tree, from left to right, reflect other criteria for separation (Price less than 1.79 billion and no swimming pool, respectively). We can compare the impact of the independent factors on the target field across categories using this partitioning. The splitting rule that makes up the initial split is frequently regarded as the brains of decision tree algorithms (Berry and Linoff, 2000). This matches the reality of housing transactions in Ho Chi Minh City. The nodes at the following levels assist in further classifying the apartment segments, with the leaves representing the final sorted values.

Table 10. Classification report

The diagram depicts the classification. The confusion matrix in Figure 29 and Table 10 show that the random forest classifier algorithm accurately classifies the data set with an average accuracy of more than 81 percent. Nevertheless, many observations contain attribute combinations that are comparable to those seen in other classes. Furthermore, the samples that are misclassified are often near the true class, indicating that the class ordering is significant. If the majority of the misclassified samples were in classes further away from the diagonal, the ordering might have been considered pointless in the sense that there was no evident basis for the particular ordering. Besides, from the training dataset, a random forest classification model was created to aid in the investigation of the relationship between resale prices of Ho Chi Minh City apartments and housing characteristics, as well as the identification of which characteristics are significant in predicting resale prices. Based on rules given in terms of the independent variables, random forest algorithms execute several tests and generate the optimum sequence for regressing and forecasting the dependent variable. These tests determine the optimal splitters, which successively partition the training data until they reach terminal (leaf) nodes.

Figure 34. 5 main factors affecting classification

Using the suggested random forest technique, the created random forest reveals that Price, Gym, Swimming_pool, distance_to_central and Price_per_meter_each_district are all key variables that affect the categorization of apartment class values.

CHAPTER 5: CONCLUSION

5.1 Conclusion

- Objective 1: Collecting and identifying the quantitative factors that can affect apartment prices via apartment buildings' information. Based on scientific theory, the samples collected through the biggest real estate exchange and the results of previous studies, the author has built an OLS regression model to analyze the impact of the factors affecting the price of apartments in Ho Chi Minh City. Through quantitative analysis based on a dataset of 4,882 observed samples collected through an experimental survey of apartment buildings in 20 districts (excluding Cu Chi and Hoc Mon), the research results showed that apartment prices in HCMC are affected by 11 factors: Bathroom, Normalized_Meter, Policy, Furniture, Price_per_meter_each_district (b/m2), Swimming_pool, Gym, Density (People/km2), distance_to_central, distance_to_airport and Distance_to_hospital. However, the regression model in this study is built on the factors affecting the price of apartments in HCMC, so the selected variables may change when building the model in other areas.
- Objective 2: Applying the SEMMA process to building regression models for predicting apartment prices and comparing the accuracy of these models. The purpose of this study is to assess and compare the performance of three popular machine learning regression models for apartment price prediction in Ho Chi Minh City: Linear Regression, K-Nearest Neighbors, and XGBoost. When the same data set with 12 attributes (including features and target) was applied, it was discovered that XGBoost provided more accurate predictions. Significantly, the XGBoost model outperformed the other models in terms of accuracy for a large data set of approximately 4,882 values. The XGBoost model was able to achieve good performance with an accuracy of up to 89%. Machine learning is considered to be effective for housing price prediction, in this case apartment price prediction, in normal scenarios, but deviates during exceptional events. Further enhancements and model selection could improve future performance and make it a valuable tool for decision-makers. Nonetheless, there is a lot of uncertainty about housing prices, for example from factors that could cause a real estate bubble, especially in Vietnam, which affects the overall performance. Existing models have not yet adequately reflected these uncertainties. As a result, we can only hope that this uncertainty will not be present in future predictions; it remains a barrier to machine learning in general.

- Objective 3: Classifying all of the apartments into segments based on the affecting factors. A random forest system uses many decision trees, each of which has three types of nodes: decision nodes, leaf nodes, and root nodes. The decision tree's final output is represented by the leaf nodes of each tree. A majority vote technique is used to pick the final result. In this situation, the random forest system's final output is the output picked by the majority of the decision trees. As a result, the random forest algorithm is more accurate than a single decision tree at predicting the outcome. For the reasons stated above, the random forest algorithm is an alternative exploratory data analysis tool for examining the link between home prices and a variety of housing factors and identifying the important determinants of housing prices. The results of the study, based on the Random Forest method with 83% accuracy, show that price, apartment building amenities such as a gym and a swimming pool, and proximity to the city center are all important. The distance to the center and the average area price of each district (Price_per_meter_each_district) are important variables in deciding which apartments to buy or invest in.

5.2 Research Meaning

The research results show that there is a big difference between apartment prices in different districts. The price of apartments in the central area is much higher than in other areas, which is a special point of attention for investors because this is the main factor creating the phenomenon of real estate price bubbles in general and of apartment prices in Ho Chi Minh City in particular. When the demand for housing in the central areas is too high, the price of apartments in these areas rises too far above their real value. However, at a certain point, when the market is saturated, the liquidity of apartments in this area will no longer be high, which leads to a serious price drop that can strongly affect the apartment market there.
Therefore, investors should rely on their capital sources and investment purposes to make an appropriate choice. Besides, the experimental results show that when the apartment complex has many internal utilities, the price also increases, by 16.2% for a swimming pool and 14.3% for a gym. This is worth noting for project investors when designing a project, especially paying attention to the planning of utility areas inside the project and ensuring the completion of the legal status of the apartment to enhance the value of the apartment as well as its reputation. For real estate investors, when investing in apartments, to ensure high profit potential they also need to choose apartments with many utilities. Machine learning algorithms, which are used to predict prices and classify apartments, are among the effective tools for supporting the decision making of investors and real estate companies. In addition, if the above models are used appropriately, investors can use them as a tool to appraise apartment prices in an area, thereby investing at the right price and making more profit. Housing and real estate market management agencies can effectively use the models to monitor and issue policies to ensure the stability of apartment prices in particular and real estate in general, to prevent real estate bubbles from becoming larger.

5.3 Limitation

The study was completed in a short amount of time, and the scope of the investigation was confined to only 20 districts in Ho Chi Minh City, rather than the full Ho Chi Minh City region. Although the data quality has been enhanced by direct price comparison with premium data from investors, the study still relies on secondary data collected from real estate exchanges. Because the number of survey samples is small, it does not reflect all areas of HCMC. The factors affecting the price of apartments in HCMC used in the model are not comprehensive; for example, for some apartments that were built a long time ago, there is no information about the owner, project name, or construction year provided by the Department of Statistics. So, not only has it become hard to predict with great precision, but it has also become impossible to categorize residences more precisely. The study only uses Hedonic regression models to find influencing factors, three price prediction models, and a random forest classifier algorithm to classify apartments. Furthermore, while the above methodologies are commonly used in countries around the world, the research findings are only applicable to each country's geographical and cultural peculiarities. As a result, the application in HCMC has a number of disadvantages.

5.4 Future Research

To deal with the study's limitations, the following future research directions are suggested: First of all, additional research may extend the scope of the study to include all of HCMC, including the suburbs. Accordingly, the element of district location, which may be separated into five areas (center district, east, west, south, and north), needs to be expanded further. In addition, the research object can be expanded to include not only apartments but also other types of real estate. Secondly, other apartment-specific criteria, such as property turnover, proximity to the river, guarantor bank reputation, and policy for foreign owners, may be included in future research.
Other criteria that could be considered include state macro policy, regulations governing the purchase of flats by foreigners, the manner and timing of payment under the contract, the quality of the apartment's domestic water supply, monthly management fees, the apartment's age, and the number of times the apartment has been transferred. Machine learning models, using an ever-increasing supply of data, are expected to become increasingly useful in the future. Besides, having more data and more influencing elements, such as those described above, can improve the models' accuracy. Investigating additional methods for predicting or classifying apartments is also worthwhile, in order to identify an algorithm superior to the four algorithms used in this thesis.

REFERENCE

1. Zhang, Q. (2021). Housing Price Prediction Based on Multiple Linear Regression. Scientific Programming, vol. 2021, Article ID 7678931, 9 pages. https://doi.org/10.1155/2021/7678931
2. Le, M. (2021). Thực trạng thị trường nhà ở đô thị cho người thu nhập trung bình tại thành phố Hồ Chí Minh [The current situation of the urban housing market for middle-income people in Ho Chi Minh City]. PROCEEDINGS, 16(1), 77-90. doi: 10.46223/hcmcoujs.proc.vi.16.1.1858.2021
3. Vuong, Q. (2016). Các nhân tố ảnh hưởng đến giá nhà đất ở trên địa bàn thành phố Cần Thơ [Factors affecting residential property prices in Can Tho City]. Tạp Chí Khoa Học Thương Mại, 91.
4. Chen, X., Wei, L., & Xu, J. (2017). House Price Prediction Using LSTM. Hong Kong: The Hong Kong University of Science and Technology. Available at: <https://arxiv.org/pdf/1709.08432.pdf>.
5. Nguyen, T. (2009). Thực trạng sử dụng phương pháp so sánh và phương pháp chi phí trong thẩm định giá bất động sản tại thành phố Hồ Chí Minh [The use of the comparison and cost methods in real estate valuation in Ho Chi Minh City]. University of Economics HCMC. Available at: <https://text.xemtailieu.net/tai-lieu/thuc-trang-su-dung-phuong-phap-so-sanh-vaphuong-phap-chi-phi-trong-tham-dinh-gia-bat-dong-san-tai-thanh-pho-ho-chiminh-81260.html>.
6. Temür, A. S., Akgün, M., & Temür, G. (2019). Predicting Housing Sales in Turkey Using ARIMA, LSTM and Hybrid Models. Journal of Business Economics and Management, 20(5), 920-938. doi: 10.3846/jbem.2019.10190
7. Burgt, v. d. (2017). Data engineering for house price prediction. Eindhoven University of Technology. Retrieved 15 December 2021, from https://pure.tue.nl/ws/portalfiles/portal/72297619/0831848_Burgt_v.d._E.J.T.G._thesis_CSE.pdf
8. Zhang, Z. (2016). Introduction to machine learning: k-nearest neighbors. Annals of Translational Medicine, 4(11), 218. Available at: <https://www.researchgate.net/publication/303958989_Introduction_to_machine_learning_K-nearest_neighbors>.
9. Shi, Y. (2011). Comparing K-Nearest Neighbors and Potential Energy Method in classification problem. A case study using KNN applet by E.M. Mirkes and real-life benchmark data sets. Leicester: University of Leicester. Available at: <https://arxiv.org/ftp/arxiv/papers/1211/1211.0879.pdf>.
10. Parsehub.com. (2021). ParseHub. Available at: <https://www.parsehub.com/intro>.
11. Ghilani, C. D. (2010). Adjustment computations: Spatial data analysis (5th ed.). Hoboken, New Jersey: Wiley. 672 pp.
12. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Los Angeles: University of California. Retrieved from https://digitalassets.lib.berkeley.edu/math/ucb/text/math_s5_v1_article-17.pdf
13. Kim, H. (2019). Statistical notes for clinical researchers: simple linear regression 3 – residual analysis. Restorative Dentistry & Endodontics, 44(1).
14. Kim, H. (2013). Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis. Restorative Dentistry & Endodontics, 38(1), 52. doi: 10.5395/rde.2013.38.1.52
15. Schober, P., Boer, C., & Schwarte, L. (2018). Correlation Coefficients. Anesthesia & Analgesia, 126(5), 1763-1768. doi: 10.1213/ane.0000000000002864
16. Chicco, D., Warrens, M., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623. doi: 10.7717/peerj-cs.623
17. Glantz, S., Slinker, B., & Neilands, T. (2000). Primer of applied regression & analysis of variance.
18. Ayodele, T. (2010). Types of Machine Learning Algorithms. New Advances in Machine Learning, 3.
19. Steel, R. G. D., & Torrie, J. H. (1960). Principles and Procedures of Statistics with Special Reference to the Biological Sciences. McGraw Hill.
20. Amaratunga, D., Cabrera, J., & Lee, Y.-S. (2008). Enriched random forests. Bioinformatics, 24(18), 2010–2014.
21. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427-437. doi: 10.1016/j.ipm.2009.03.002
22. Field, A. (2009). Discovering statistics using IBM SPSS statistics (3rd ed.). London: Sage.

Appendix

Appendix 1. Statistical description of feature variables

Statistic | Bathroom | Normalized_Meter | Policy | Furniture | Price_per_meter_each_district (b/m2)
count | 4882 | 4882 | 4882 | 4882 | 4882
mean | 1.78 | 66.57 | 0.42 | 0.51 | 0.12
std | 0.53 | 15.52 | 0.49 | 0.50 | 0.18
min | 1.00 | 23.00 | 0.00 | 0.00 | 0.03
25% | 1.00 | 56.00 | 0.00 | 0.00 | 0.03
50% | 2.00 | 67.00 | 0.00 | 1.00 | 0.04
75% | 2.00 | 76.68 | 1.00 | 1.00 | 0.06
max | 4.00 | 100.00 | 1.00 | 1.00 | 0.56

Appendix 2. Statistical description of feature variables (continued)

Statistic | Density (People/km2) | distance_to_central | distance_to_airport | Distance_to_hospital | log_price
count | 4882 | 4882 | 4882 | 4882 | 4882.00
mean | 17596.93 | 7.62 | 9.55 | 2.59 | 0.44
std | 16428.52 | 3.8 | 4 | 27.89 | 0.16
min | 16 | 0.49 | 0.35 | 0.04 | 0.03
25% | 3482 | 5.02 | 6.49 | 0.91 | 0.32
50% | 9855 | 7.39 | 9.4 | 1.76 | 0.43
75% | 28922 | 10.45 | 12.14 | 2.56 | 0.56
max | 65113 | 18 | 19.92 | 1126.24 | 0.90

Appendix 3.
Data correlation table of apartments data Bedroom Bathroom Normalized_Meter Policy Furniture Swimming_pool Gym Density (People/km2) log_price lat long distance_to_ central distance_to_ airport distance_to_ railway Distance_to_ highschool Distance_to_ hospital 0,096 Price_per_meter_ each_district (b/m2) -0,094 Bedroom 1,000 0,733 0,730 -0,111 -0,201 0,101 -0,015 0,290 0,072 0,124 0,047 -0,028 -0,008 -0,040 -0,043 Bathroom 0,733 1,000 0,670 -0,113 0,035 -0,071 -0,120 0,038 -0,068 0,340 0,037 0,129 0,065 -0,037 0,010 -0,035 -0,037 Normalized_Meter 0,730 0,670 1,000 -0,085 0,073 -0,020 -0,105 0,110 0,027 0,524 0,114 0,125 -0,082 -0,088 -0,108 -0,015 -0,018 Policy -0,111 -0,113 -0,085 1,000 0,032 0,160 0,138 0,124 -0,094 0,064 0,025 0,159 0,105 0,178 0,163 -0,020 -0,015 Furniture 0,096 0,035 0,073 0,032 1,000 0,028 0,184 0,116 0,029 0,182 0,012 0,032 -0,063 -0,028 -0,047 0,022 0,024 Price_per_meter_each_district (b/m2) -0,094 -0,071 -0,020 0,160 0,028 1,000 0,283 0,106 -0,324 0,196 0,069 0,338 -0,170 0,062 0,010 0,001 0,000 Swimming_pool -0,201 -0,120 -0,105 0,138 0,184 0,283 1,000 0,340 0,199 0,605 0,027 0,135 -0,395 -0,195 -0,296 0,022 0,027 Gym -0,101 -0,038 -0,110 0,124 0,116 0,106 0,340 1,000 0,089 0,447 0,058 0,194 -0,190 -0,034 -0,103 0,007 0,013 Density (People/km2) -0,015 -0,068 0,027 -0,094 0,029 -0,324 0,199 0,089 1,000 0,230 -0,592 -0,665 0,003 -0,012 0,290 0,340 0,524 0,064 0,182 0,196 0,605 0,447 0,230 1,000 0,445 0,065 -0,499 log_price -0,414 -0,231 -0,338 0,006 0,008 lat -0,072 -0,037 -0,114 0,025 0,012 0,069 0,027 0,058 -0,073 -0,026 0,073 0,026 1,000 0,303 0,357 -0,186 0,263 -0,006 0,004 long -0,124 -0,129 -0,125 0,159 0,032 0,338 0,135 0,194 -0,445 0,065 0,303 1,000 0,185 0,619 0,544 0,004 0,024 distance_to_central 0,047 0,065 -0,082 0,105 -0,063 -0,170 -0,395 0,190 -0,499 -0,414 0,357 0,185 1,000 0,591 0,894 -0,039 -0,020 distance_to_airport -0,028 -0,037 -0,088 0,178 -0,028 0,062 -0,195 0,034 -0,592 -0,231 0,186 0,619 0,591 1,000 0,846 -0,010 0,011 distance_to_railway -0,008 0,010 -0,108 0,163 -0,047 0,010 -0,296 0,103 -0,665 -0,338 0,263 0,544 0,894 0,846 1,000 -0,024 0,000 Distance_to_highschool -0,040 -0,035 -0,015 -0,020 0,022 0,001 0,022 0,007 0,003 0,006 0,006 0,004 -0,039 -0,010 -0,024 1,000 0,999 Distance_to_hospital -0,043 -0,037 -0,018 -0,015 0,024 0,000 0,027 0,013 -0,012 0,008 0,004 0,024 -0,020 0,011 0,000 0,999 1,000 64 Appendix 4. 
Data types and descriptions of all variables after model preprocessing

Index | Name of attribute | Description | Data Type
1 | Title | Product introduction title | Object
2 | ID | Address of real estate | Object
3 | District | Location of apartment | Object
4 | Total_Square | Area of real estate | Object
5 | Bedroom | Number of bedrooms | Int64
6 | Bathroom | Number of bathrooms | Int64
7 | Price | Apartment price | Object
8 | Normalized_Price | Price after preprocessing | Float
9 | Normalized_Meter | Total area of real estate after preprocessing | Float64
10 | Policy | The current certificate of ownership | Int64
11 | Furniture | Furniture provided with the real estate | Int64
12 | Swimming_Pool | Swimming pool provided with the real estate | Int64
13 | Gym | Gym facilities of the apartment | Int64
14 | Price Per Meter Each District (B/M2) | The average price per square meter | Float64
15 | Population Density | The population density of each district | Float64
16 | Lat | Latitude of the apartments | Float64
17 | Long | Longitude of the apartments | Float64
18 | Project Name | Name of apartment project | Object
19 | segment | Apartment's type of segment | Category
20 | Distance_to_airport | Distance to airport | Float
21 | Distance_to_train_station | Distance to train station | Float
22 | Distance_to_school | Distance to school | Float
23 | Distance_to_hospital | Distance to hospital | Float
24 | Log_Price | Normalized price after log transformation | Float

Appendix 5. VIF Score

Index | Variable | VIF
0 | Bathroom | 23.70975569
1 | Normalized_Meter | 30.43331151
2 | Policy | 1.920238579
3 | Furniture | 2.136332285
4 | Price_per_meter_each_district (b/m2) | 1.911938943
5 | Swimming_pool | 2.494362889
6 | Gym | 8.031218678
7 | Density (People/km2) | 3.26741299
8 | distance_to_central | 8.872648525
9 | distance_to_airport | 11.35895792
10 | Distance_to_hospital | 1.012099197
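The VIF scores in Appendix 5 can be reproduced, for example, with statsmodels; the sketch below assumes the eleven selected predictors are in a pandas DataFrame X (the name is an assumption).

```python
# Sketch of how the VIF scores in Appendix 5 could be computed, assuming the
# selected predictors are stored in a pandas DataFrame `X` (an assumed name).
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

X_const = sm.add_constant(X)                   # add an intercept before computing VIF
vif = pd.DataFrame({
    "Variable": X.columns,
    "VIF": [variance_inflation_factor(X_const.values, i + 1)
            for i in range(X.shape[1])],       # i + 1 skips the constant column
})
print(vif)
```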