HO CHI MINH CITY UNIVERSITY OF TECHNOLOGY
SCHOOL OF INDUSTRIAL MANAGEMENT
THESIS
APPLICATION OF SEMMA PROCESS IN
FORECASTING APARTMENT PRICE IN HO CHI
MINH CITY
Name: Trịnh Trần Nguyên Chương
Student ID: 1952195
Supervisor in university: Phạm Quốc Trung
Order number: 37-CLC
Ho Chi Minh City – 2022
Vietnam National University
HCMC UNIVERSITY OF TECHNOLOGY
SCHOOL OF INDUSTRIAL MANAGEMENT
SOCIALIST REPUBLIC OF VIETNAM
Independence – Freedom - Happiness
THESIS ASSIGNMENT
DEPARTMENT: Information System
STUDENT NAME: Trịnh Trần Nguyên Chương
STUDENT ID: 1952195
SPECIALIZATION: Business Administration
CLASS: CC19QKD1
1. Title: Application Of Semma Process In Forecasting Apartment Price In Ho Chi
Minh City
2. Thesis assignment (requirement for content and data): Identify factors affecting apartment prices, predict apartment prices, and classify apartments based on these affecting factors. The data is collected from the real estate exchange and the main constructors.
3. Date of assignment: 25/2/2022
4. Date of completion: 15/5/2022
5. Supervisor’s full name: Assoc. Prof. Dr. Pham Quoc Trung. Advised on: Proposal and Thesis.
The proposal is approved by the School/Department.
Ho Chi Minh City, May 16th, 2022
HEAD OF DEPARTMENT
(Sign and write full name)
PRIMARY SUPERVISOR
(Sign and write full name)
Acknowledgements
I would like to express my heartfelt and sincere appreciation to Assoc. Prof. Dr. Phạm Quốc Trung.
I will be eternally thankful and obliged to you for your direction, wisdom, passion, and encouragement in assisting me in studying and implementing this project.
I also want to send my sincere thanks to all of the faculty lecturers, without whom I would not have had the essential knowledge to complete my research. Armed with this university's invaluable experience, I can apply it to my future career.
Despite my efforts, I am aware that this project is still incomplete and will inevitably contain flaws. It would be my pleasure to hear from the professors on how I can improve it further.
Finally, I wish you health, wealth, and success in your endeavors.
Trinh Tran Nguyen Chuong
Abstract
In recent years, with the increasing population of Ho Chi Minh City, the demand for housing has risen. Among the different types of housing, apartments are a reasonable option that low-income people can afford. Driven by the supply-demand relationship and other factors, apartment prices have also increased sharply and show no signs of decreasing. This research aims to identify the influencing factors, to test and compare the performance of predictive models, and to classify apartments based on those influencing factors. The variables include internal factors of the apartment buildings and external factors based on areas in Ho Chi Minh City. In addition, the predictive and classification models will help real estate investors and prospective homebuyers predict and categorize housing prices, thereby reducing losses for property developers and giving bargaining power to prospective homebuyers. A comparative analysis is performed on data mining techniques: predictive models, namely Linear Regression, the XGBoost algorithm and the K-Nearest Neighbors algorithm, are evaluated based on residual score, root mean squared error and coefficient of determination, while a Random Forest classification model is used to classify the apartments. The results of the algorithms are visualized with metrics and animations to provide clear insights into how the models perform, for both potential homebuyers and real estate investors.
Contents
CHAPTER 1. INTRODUCTION
1.1 Background
1.2 Problem statement
1.3 Research Gap
1.3.1 In Vietnam
1.3.2 In other countries
1.4 Research objective
1.5 Scope and subject of the study
1.5.1 Research Scope
1.5.2 Research Subject
1.5.3 Research Method
1.6 Significance of study
1.7 Research Structure
CHAPTER 2. LITERATURE REVIEW
2.1 Definition
2.1.1 Data mining process
2.1.2 SEMMA Process
2.1.3 Theoretical foundations of apartments
2.1.3.1 The concept of the apartment
2.1.3.2 Current classification of apartments and condominiums
2.1.3.3 Characteristics of the apartment market
2.1.3.4 Apartment price prediction
2.2 Regression Model (formula and explanation)
2.2.1 Extreme Gradient Boosting (XGBoost)
2.2.2 Linear Regression
2.2.3 Ordinary least-squares (OLS)
2.2.3.1 Definition
2.2.3.2 OLS results interpretation
2.2.4 K-Nearest Neighbors (KNN)
2.3 Random Forest classification
2.3.1 Definition of classification
2.3.2 Definition of Random Forest classification
2.4 Model evaluation
2.4.1 Residual and Predicted Values
2.4.2 RMSE
2.4.3 Coefficient of determination (R-Squared)
2.5 Related work
2.5.1 Foreign research
2.5.2 Domestic research
2.5.3 Reviews of previous research papers
2.6 Conclusion
CHAPTER 3. METHODOLOGY
3.1 Methodology Research Process
3.2 Materials Introducing
3.3 Data Sampling
3.4 Data Exploring
3.5 Data Modifying
3.5.1 Outliers
3.5.1.1 Log transformation
3.5.1.2 Skewness and Kurtosis
3.6 Data preprocessing
3.6.1 Preprocessing process
3.6.2 Correlationship
3.7 Experimental Design
3.7.1 Data Modeling
3.7.2 Data Assessing
CHAPTER 4. RESULT ANALYSIS AND DISCUSSION
4.1 Result Analysis
4.1.1 Ordinary Least Square method evaluation
4.1.1.1 Regression results according to Ordinary Least Squares method (OLS)
4.1.1.2 Correlation test
4.1.2 Predictive Models Evaluation
4.1.2.1 Linear Regression
4.1.2.2 XGBoost
4.1.2.3 KNN
4.1.3 Random Forest classification
4.2 Discussion
4.2.1 Factors affecting apartment prices in Ho Chi Minh City
4.2.2 Prediction model
4.2.2.1 Residual
4.2.2.2 Root mean squared error
4.2.2.3 The R2 score
4.2.2.4 Result summary
4.2.3 Classification model
CHAPTER 5. CONCLUSION
5.1 Conclusion
5.2 Research Meaning
5.3 Limitation
5.4 Future Research
REFERENCE
Appendix
List of Figures
Figure 1. Average commercial real estate price (millions/m^2) each year
Figure 2. SEMMA process
Figure 3. XGBoost Model
Figure 4. Linear Regression Model
Figure 5. Example of OLS Regression model based on Weight and Height
Figure 6. Visual presentation of simulated working example
Figure 7. Decision tree graph
Figure 8. Random Forest Classifier graph
Figure 9. Residual Scatter plot
Figure 10. Methodology Research Process
Figure 11. Interface of Batdongsan.com.vn
Figure 12. First step of extracting
Figure 13. Second step of extracting
Figure 14. Boxplot of Price distribution
Figure 15. Boxplot of price after log transformation
Figure 16. Probability Plot before log normalization
Figure 17. Probability Plot after log normalization
Figure 18. Skew distribution
Figure 19. Kurtosis distribution
Figure 20. Distribution of price before log normalization
Figure 21. Distribution of price after log normalization
Figure 22. Correlationship graph
Figure 23. Comparing LR's actual vs predicted values
Figure 24. LR's Residual graphs
Figure 25. Comparing XGBoost's actual vs predicted values
Figure 26. XGBoost's Residual graphs
Figure 27. Comparing KNN's actual vs predicted values
Figure 28. KNN's Residual graphs
Figure 29. Confusion Matrix
Figure 30. Residual score comparison
Figure 31. RMSE comparison
Figure 32. R-squared comparison
Figure 33. Decision tree (number 10)
Figure 34. 5 main factors affecting classification
List of Tables
Table 1. Description of all attributes which were collected
Table 2. Total number of apartments each district
Table 3. Segment of all apartments
Table 4. Numeric transformation
Table 5. Location factors
Table 6. Interpreting correlation
Table 7. OLS Regression Results
Table 8. Classification performance
Table 9. Evaluation score summary
Table 10. Classification report
List of Abbreviations
OLS: Ordinary Least Squares
RMSE: Root Mean Square Error
KDD: Knowledge Discovery in Databases
CRISP-DM: Cross Industry Standard Process for Data Mining
HCMC: Ho Chi Minh City
R2: R-Squared
KNN: K-Nearest Neighbors
LR: Linear Regression
RF: Random Forest
DW: Durbin-Watson
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
MSE: Mean Squared Error
SD: Standard Deviation
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
CHAPTER 1. INTRODUCTION
1.1 Background
Vietnam is now a promising investment destination thanks to its development, in which Ho Chi Minh City plays the role of the economic locomotive of the country, and real estate has become an effective and profitable field of investment. In recent years, housing prices have risen steadily and show no sign of negative growth, which is caused by many factors. The city faces a big challenge with its large population, due to natural growth and migration from other provinces, leading to high population density. According to Le Minh Phuong Mai (2021) from the University of Finance - Marketing, in 2021 the population of Ho Chi Minh City will reach nearly 13 million people, with an immigrant population of nearly 3 million, accounting for about 23% of the total, including more than 400,000 students and more than 50,000 newly married couples per year. A survey by the Department of Construction found that about 500,000 households have no house and about 81,000 households are in need of social housing. Out of more than 402,000 workers and laborers working in the 17 export processing zones, industrial parks and high-tech zones of the city, 284,000 people (70.6%) need accommodation, but only 39,400 of them have had this demand met, accounting for about 15% of the demand (Le, 2021). This rapid population growth has put great pressure on housing, making housing prices rise each year.
Figure 1. Average commercial real estate price (millions/m^2) each year
1.2 Problem statement
A common method that many countries use to predict house prices is to compare the average house price with the average income of the people, because real estate prices depend largely on the supply-demand relationship in the market. However, house prices also depend on and are closely influenced by many groups of factors and evaluation criteria, such as the location of the property, convenient transportation and complete infrastructure; the ability to bring profits or the amenities attached to the house and land; and favorable conditions for buyers in terms of administrative procedures for land use rights, house ownership, construction permits, etc. (Vuong, 2016).
In determining house prices, investors must carefully calculate and select an appropriate method, because real estate prices increase almost continuously and rarely decrease in either the long or the short term. However, non-statistical prediction methods, such as prediction based on the supply-demand relationship, cannot give high accuracy. Therefore, it is necessary to use statistical tools that can predict housing prices with higher accuracy and in less time, rather than the traditional approach of collecting data and comparing records to make a prediction.
A machine learning model (also known as a regression model) is basically a quantitative method that can be used effectively to make predictions. According to Qingqi Zhang (2021) from The Hong Kong University of Science and Technology, a comparative experiment has revealed that multiple regression applied to property appraisal works well with the given data. However, models based on multiple regression tend to attach more significance to statistical inference than to prediction, due to their nature. Following the features above, the topic of this research, "APPLICATION OF SEMMA PROCESS IN FORECASTING APARTMENT PRICE IN HO CHI MINH CITY", is suitable.
1.3 Research Gap
1.3.1 In Vietnam
There are many research papers exploring aspects of housing prices and forecasting in Ho Chi Minh City, so the authors have an overall and detailed view of the housing situation in this city. However, methods of forecasting housing prices are still manual, such as collecting property information, then comparing and estimating prices based on the available information.
The article "SITUATION OF USE OF COMPARATIVE AND COST METHOD IN REAL ESTATE VALUATION IN HO CHI MINH CITY" by NGUYEN TRUONG LUU (Nguyen, 2009) showed that the comparative method is used for real estate price estimation: finding properties that have been bought and sold on the market and are similar to the appraised property, and then estimating an appropriate sale price based on comparable factors that reflect the differences between them, in order to find the true value of the property. The author specifically analyzed how to apply this method and its benefits, advantages and achievements. However, there are still limitations in the study: the comparative method is not effective in the case of big data, and it does not capture the degree of influence of other factors on real estate prices.
1.3.2 In other countries
Housing price prediction is a topic studied in countries around the world; this research focuses on predicting housing prices to support not only investors and buyers but also government social policy. There are many methods to forecast housing prices, but the most popular is using a machine learning model. For example, in the research paper "House Price Prediction Using LSTM", the authors use a machine learning model named RNN (Recurrent Neural Network) as the foundation of a solution (Chen, Wei and Xu, 2017). The study focused on the housing markets of Beijing, Shanghai, Guangzhou and Shenzhen and used information from 80 districts over 155 months to build the prediction model and make good predictions for the next 2 months. However, that research paper still has limitations, such as the lack of data, which prevents the RNN prediction model from reaching maximum accuracy.
1.4 Research objective
Currently, real estate prices in Vietnam in general and in Ho Chi Minh City in particular depend on many qualitative factors, such as legal issues or Feng Shui issues, a matter of great interest in Southeast Asian countries, especially Vietnam. These qualitative factors (such as feng shui, house orientation, and residential composition) cannot be used to predict specific prices of real estate such as land and houses. Apartment buildings, on the other hand, are less dependent on qualitative variables and, in addition, have full details about each unit in them.
The first objective: collecting and identifying the quantitative factors that can affect apartment prices via apartment buildings' information.
The second objective: applying the SEMMA process to building regression models for predicting apartment prices and comparing the accuracy of these models.
The third objective: classifying all of the apartments into segments based on the affecting factors.
In order to be able to do real estate prediction, it is necessary to have a good
understanding of machine learning applications, knowledge of statistics as well as
a clean and high-quality data source.
1.5 Scope and subject of the study
1.5.1 Research Scope
The study aims to find out the accuracy and effectiveness of classification and regression analysis for apartment price prediction in Ho Chi Minh City, including collecting and analyzing housing data from the last four years (2018 to 2022), with three months spent scraping the data and building the models. The segments that the report targets include high-end, middle-end, and low-end apartments.
1.5.2 Research Subject
Apartment price prediction in Ho Chi Minh city
1.5.3 Research Method
The method is used throughout the research process and is concretized through the following steps:
− Collect information and documents from a website specializing in real estate brokerage: Batdongsan.com.vn.
− Gather, collect and process documents, combined with learned and practical knowledge, to build machine learning models that serve the house price forecasting process.
− Test and compare results from the different models used, to select the best results.
− Use classification algorithms to classify the apartment data into each segment.
− Give a discussion of the final result.
1.6 Significance of study
Predicting home prices can help prospective homebuyers estimate the possible future price of a property, allowing them to plan their finances well. In addition, home price predictions also benefit real estate investors by revealing the trends in house prices in a certain location.
1.7 Research Structure
The study includes 5 chapters:
− Introduction: provides key information on the problem statement, methodological guidelines, key findings and key conclusions of housing price forecasting in HCMC.
− Literature review: analyzes and combines scholarly sources on previous research related to the property appraisal process, statistical literacy, and machine learning tools.
− Methodology: explains the design of the study, the techniques used to collect the information, and other aspects relevant to the experiment.
− Research results: reviews the information in the introduction, evaluates the results, and finds the model that best fits.
− Conclusion: provides final thoughts and a summary of the entire work, considers the limits of the research and results, and suggests potential solutions or new ideas based on the results obtained.
CHAPTER 2. LITERATURE REVIEW
2.1 Definition
2.1.1 Data mining process
Jiawei Han and Micheline Kamber stated that data mining has received a lot of attention in the information industry and in society in general in recent years, because of the widespread availability of massive amounts of data and the pressing need to turn that data into useful information and knowledge. Market analysis, fraud detection, and client retention, as well as production control and science exploration, can all benefit from the information and knowledge gathered (Han & Kamber, 2001).
There appears to be no such thing as too much data in today's increasingly data-driven world. Data, however, is only valuable if it can be analyzed, sorted, and sifted through to determine its true worth.
Most industries collect huge amounts of data, but without a filtering process that generates graphs, charts, and trending data models, the data is useless. However, filtering through the massive amount of data, at the speed with which it is collected, is difficult. As a result, scaling up our analysis power to manage the massive amounts of data we currently receive has become economically and scientifically vital.
There are a number of data mining processes that aim to extract information from rapidly accumulating data and then circulate that knowledge in order to improve the process of obtaining quality data and optimize operations. The Knowledge Discovery in Databases (KDD) process; Sample, Explore, Modify, Model, and Assess (SEMMA); and the Cross Industry Standard Process for Data Mining (CRISP-DM) are three popular methods.
2.1.2 SEMMA Process
Data mining, according to SAS Institute, is the process of sampling, exploring,
modifying, modeling, and assessing (SEMMA) huge amounts of data in order to
identify previously unrecognized patterns that can be used as a competitive
advantage (SAS, 2009). The SEMMA process consists of five steps: Sample,
Explore, Modify, Model and Assess.
Figure 2. SEMMA process
● Sample: This stage comprises selecting a subset of the relevant volume dataset from a large dataset provided for the model's development. The purpose of the first stage of the process is to identify variables or factors that influence the process (both dependent and independent). After that, the data is categorized into preparation and validation categories.
● Explore: Univariate and multivariate analysis are used in this step to investigate interrelated relationships between data pieces and discover data gaps. While multivariate analysis examines the link between variables, univariate analysis examines each factor separately to determine its role in the overall scheme. With a large focus on data visualization, all of the influencing factors that may influence the study's outcome are studied.
● Modify: In this step, business logic is used to draw lessons learnt in the exploration phase from the data acquired in the sample phase. In other words, the data is processed and cleaned before being passed on to the modeling step, where it is examined to see if it needs to be refined and transformed.
● Model: After the variables have been refined and the data has been cleaned, the modeling step uses a range of data mining techniques to create a projected model of how the data generates the process's final, desired output.
● Assess: At this point in the SEMMA process, the model is assessed to see how useful and dependable it is for the issue at hand. The data may now be put to the test and used to determine how effective its performance is.
SEMMA process provides an easy-to-understand process for developing and
maintaining Data Mining projects in an organized and efficient manner. It thus
provides a framework for conception, creation, and evolution, assisting in the
presentation of business solutions as well as the identification of DM business
objectives (Santos & Azevedo, 2005).
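To make the five stages concrete, the short Python sketch below walks a hypothetical apartment dataset through Sample, Explore, Modify, Model and Assess with pandas and scikit-learn. The file name apartments.csv, the column names and the chosen model are illustrative assumptions, not the exact pipeline used later in this thesis.

```python
# Minimal SEMMA walkthrough on an assumed "apartments.csv" with a "price" column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Sample: draw a working subset and split it into training and validation parts
df = pd.read_csv("apartments.csv").sample(frac=0.8, random_state=42)
train, valid = train_test_split(df, test_size=0.2, random_state=42)

# Explore: univariate summaries and correlations with the target
print(train.describe())
print(train.corr(numeric_only=True)["price"].sort_values())

# Modify: clean and transform (drop missing rows, one-hot encode categoricals)
train = pd.get_dummies(train.dropna())
valid = pd.get_dummies(valid.dropna()).reindex(columns=train.columns, fill_value=0)

# Model: fit a predictive model on the prepared data
X_train, y_train = train.drop(columns="price"), train["price"]
X_valid, y_valid = valid.drop(columns="price"), valid["price"]
model = LinearRegression().fit(X_train, y_train)

# Assess: check how useful and dependable the model is on held-out data
print("R-squared on validation data:", r2_score(y_valid, model.predict(X_valid)))
```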
2.1.3 Theoretical foundations of apartments
2.1.3.1 The concept of the apartment
An apartment, also known as a flat, is a self-contained housing unit (a type of
residential real estate) that is located on each floor of a building. Apartment tenure
ranges from large-scale public housing to owner occupation within what is legally
a condominium, to tenants renting from a private owner. The apartment building
is normally constructed on a plot of land and consists of many units of
apartments inside. An individual or household who is eligible to own an apartment
owns the space between the apartment's walls, floor, and ceiling. At the same time,
the apartment owner also has a right to use the general utility in the apartment
building.
Some apartment dwellers in the United States own their units, either through a
housing cooperative, in which people hold shares in a corporation that owns the
building or complex, or through a condominium, in which people own their flats
but share ownership of the common areas. The majority of apartments are purpose-built, although huge older houses are occasionally partitioned into apartments.
In Vietnam, according to the content of Article 03 of the Law on Housing 2014:
“an apartment building is a house with 2 floors or more, with many apartments on
one floor, with common walkways and stairs, and with an infrastructure system.
Common use floors for households and individuals. Including apartments built for
residential purposes and apartments built with mixed purposes of both residential
and business. Each household or individual has its own share and shared
ownership in the apartment building”.
2.1.3.2 Current classification of apartments and condominiums
According to the Circular No. 31/2016/TT-BXD of the Ministry of
Construction, the criteria for classification of apartment buildings are based on the
following four groups of criteria:
- Criteria related to architecture or planning.
- The criteria related to the technical system.
- The criteria related to infrastructure and services.
- The criteria related to quality, management and operation.
2.1.3.3 Characteristics of the apartment market
"Apartment market", which is similar to the concept of "market", is often
understood as a place where transactions in apartment goods take place, a
collection of conditions and agreements through which buyers and sellers
exchange goods with each other.
However, because apartment goods are different from ordinary goods, they
have their own characteristics. The first feature is the regional feature, since the
apartment is a commodity that cannot be moved. Therefore, it is often associated
with the economic, natural and social characteristics of the region and depends on
the traditional culture and psychological characteristics of each region. The supply
of apartment goods reacts more slowly to fluctuations in demand and the price of
apartments. For apartment goods, when demand increases, supply cannot
guarantee a quick response like other goods. Therefore, this type of goods takes
time to create the product. The apartment market is governed by law: all transactions in the apartment market must be supervised and managed by the State, for example through registration and issuance of ownership certificates. Apartments with clear legal status are more valuable because they can participate in all transaction activities such as transfer, legalization, and mortgage. The State participates in the apartment market by passing legislation to make this market stable and safe.
2.1.3.4 Apartment price prediction
As people's living standards have increased, there has been a fast increase in the demand for housing. An apartment or flat is a self-contained housing unit (a type of house) which occupies part of a building, generally on a single story. While some people buy an apartment as an investment or as a property due to its affordable price, the majority of people around the world buy an apartment as a shelter.
Housing markets, according to the article “Predicting Housing Sales in Turkey
Using Arima, Lstm and Hybrid Models'', have a beneficial impact on a country's
currency, which is a key metric in the national economy. Homeowners will
purchase products for their homes, such as furniture and domestic equipment,
while home builders or contractors would purchase raw materials to build houses
to meet demand, indicating the economic wave impact caused by the new housing
supply. Aside from that, consumers have the financial means to make a substantial
investment, and the construction sector is in good shape, as seen by a country's
high level of housing production. (Temür et al., 2019).
Every year, there is an increase in housing demand, which also leads to an
increase in apartment prices. Most stakeholders, including buyers and developers,
house builders, and the real estate industry, would like to know the exact attributes
or accurate factors influencing the apartment price to help investors make
decisions and help house builders set the apartment price when there are numerous
variables such as location and property demand that may influence the house price.
2.2 Regression Model (formula and explanation)
2.2.1 Extreme Gradient Boosting (XGBoost)
When compared to an individual predictor, a model that aggregates the
predictions of several predictors frequently produces better results. Ensemble
learning is a strategy that uses a number of predictors to form an ensemble.
Bagging, boosting, and stacking are three types of approaches that can be used to
create an ensemble method. Random Forest (RF), for example, is an ensemble of
random forests that is often trained using the bagging approach. Boosting, unlike
bagging, trains predictors sequentially rather than concurrently.
Figure 3. XGBoost Model
Under the gradient boosting framework, XGBoost is a scalable end-to-end tree boosting system that fits the new predictor to the residual errors created by the prior predictor. Many additive functions are used to forecast the outcome, as in the equation

$$\hat{y}_i = y_i^{(0)} + \eta \sum_{k=1}^{M} f_k(X_i)$$

where $\hat{y}_i$ is the projected result based on features $X_i$, $y_i^{(0)}$ is the initial guess (typically the mean of the observed values in the training set), and $\eta$ is the learning rate that allows the model to improve smoothly while adding new trees without overfitting.

The estimation $f_k$ of the additional $k$-th estimator is given by

$$\hat{y}^{(k)} = \hat{y}^{(k-1)} + \eta f_k$$

where $\hat{y}^{(k)}$ is the $k$-th predicted result and $f_k$ is defined by the leaf weights.

The following regularized objective is minimized to learn the functions employed in the model above:

$$L(\phi) = \sum_i l(\hat{y}_i, y_i) + \sum_k \Omega(f_k), \qquad \Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^2$$

The difference between the forecast $\hat{y}_i$ and the target $y_i$ is measured by $l$, a differentiable convex loss function. The second term penalizes the model's complexity and functions as an additional regularization factor to prevent overfitting.

XGBoost also allows users to discover the relative relevance or contribution of particular input factors in forecasting the response, because it is built on ensembles of trees and single trees are highly interpretable. According to Breiman et al., $I_l^2(T)$ can be employed as a measure of importance for each predictor variable $x_l$, where $J$ is the number of nodes in the tree:

$$I_l^2(T) = \sum_{t=1}^{J-1} i_t^2 \, I(v(t) = l)$$

The importance measure is generalized to XGBoost by averaging over the trees, as shown below, where $M$ is the number of trees:

$$I_l^2 = \frac{1}{M} \sum_{m=1}^{M} I_l^2(T_m)$$
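As an illustration only, the following Python sketch fits a gradient-boosted tree regressor with the xgboost library on synthetic data; the hyperparameter values are assumptions rather than the tuning used in this study, but they map directly onto the quantities in the formulas above (M, eta, gamma, lambda).

```python
# Hedged sketch: XGBoost regression on synthetic data, not the thesis dataset.
from xgboost import XGBRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = XGBRegressor(
    n_estimators=300,    # M: number of additive trees f_k
    learning_rate=0.1,   # eta: shrinks each tree's contribution to avoid overfitting
    max_depth=4,
    reg_lambda=1.0,      # lambda: L2 penalty on leaf weights w in Omega(f)
    gamma=0.0,           # gamma: penalty per leaf T in Omega(f)
)
model.fit(X_train, y_train)
print("Test R^2:", model.score(X_test, y_test))
print("Relative feature importances:", model.feature_importances_)
```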
2.2.2 Linear Regression
Linear regression is the simplest and earliest predictive method, which includes
estimating a continuous outcome using a linear combination of predictors
(independent variables and dependent variable). The goal of linear regression
models is to minimize the mean squared error (the average squared discrepancy
between the observed and anticipated result values) when estimating the
regression coefficient vector β.
A linear regression model with $p$ predictors, given a dataset of $n$ observations, is written as

$$Y_i = \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$$

where $Y_i$ represents the continuous response for the $i$-th observation, the parameter $\beta_j$, $j = 1, \ldots, p$, represents the effect size of covariate $j$ on the response, $x_{i,j}$ represents the $j$-th variable value for the $i$-th observation, and $\varepsilon_i$ is the random error term.
In linear regression analysis, there are several assumptions: the error terms $\varepsilon_i$ are independent, uncorrelated and normally distributed with mean zero and constant variance $\sigma^2$ (a.k.a. homoscedasticity).
The linear regression model has the advantage of having excellent
interpretability of the coefficients and strong prediction in small training data sets
(Hastie et al., 2001). Linear regression has the disadvantage of being sensitive to
outliers, which is a regular occurrence in most datasets. The presence of a small
number of outliers in the dataset can have an impact on the linear model's
performance.
Figure 4. Linear Regression Model
While linear regression is a very straightforward method for capturing the
complexity of housing predictions, it contains key concepts that are utilized to
construct alternative regression techniques. Many recent statistical learning
methods, such as splines and generalized additive models, can be considered as
extensions or generalizations of linear regression.
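The sketch below, using simulated predictors rather than the apartment data, shows how such a linear model is fitted with scikit-learn and how the estimated coefficients can be read off for interpretation.

```python
# Hedged sketch: fitting y_i = beta_1 x_{i,1} + ... + beta_p x_{i,p} + eps_i on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # three predictors
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
print("Estimated coefficients:", model.coef_)  # interpretable effect sizes beta_j
print("MSE:", mean_squared_error(y, model.predict(X)))
```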
2.2.3 Ordinary least-squares (OLS)
2.2.3.1 Definition
“Ordinary least-squares (OLS) regression is an extended linear modeling
approach that may be used to describe a single response variable on at least an
interval scale” argued by Dan Hutcheson from VLSI Research Inc. He also
concluded that this technique can be used with single or multiple explanatory
factors, as well as categorical explanatory variables that have been properly coded
(Hutcheson, G. D. 2011).
The OLS method uses a formula similar to that of linear regression, but in the OLS technique we must select the values $b_1$ and $b_0$ that minimize the total sum of squares of the difference between the computed and observed values of $y$:

$$S = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \sum_{i=1}^{n}(y_i - b_1 x_i - b_0)^2 = \sum_{i=1}^{n} \hat{\varepsilon}_i^{\,2} = \min$$

where $\hat{y}_i$ is the predicted value for the $i$-th observation, $y_i$ is the actual value for the $i$-th observation, $\hat{\varepsilon}_i$ is the error/residual for the $i$-th observation, and $n$ is the total number of observations. To find the values of $b_0$ and $b_1$ that minimize $S$, it is necessary to take the partial derivative with respect to each coefficient and set it equal to zero.
Figure 5. Example of OLS Regression model based on Weight and Height
2.2.3.2 OLS results interpretation
R-squared: the coefficient of determination. It is the proportion of the variance in the dependent variable that is predictable/explained.
Adjusted R-squared: the modified form of R-squared, adjusted for the number of independent variables in the model. The value of adjusted R-squared increases when we include extra variables that actually improve the model.
F-statistic: the ratio of the mean squared error of the model to the mean squared error of the residuals. It determines the overall significance of the model.
Coef: the coefficients of the independent variables and the constant term in the equation.
t: the value of the t-statistic. It is the ratio of the difference between the estimated and hypothesized values of a parameter to the standard error.
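For illustration, the following statsmodels sketch fits an OLS model on simulated data; its summary() output reports exactly the quantities described above (R-squared, adjusted R-squared, F-statistic, coef and t).

```python
# Hedged sketch: OLS with statsmodels on simulated data.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 3.0 * x + 1.0 + rng.normal(scale=0.5, size=100)

X = sm.add_constant(x)          # adds the intercept term b0
results = sm.OLS(y, X).fit()    # minimizes the sum of squared residuals S
print(results.summary())        # R-squared, Adj. R-squared, F-statistic, coef, t, etc.
```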
2.2.4 K-Nearest Neighbors (KNN)
The k-nearest neighbors algorithm (KNN) is a nonparametric classification method. To decide which of the points from the training set are similar enough to be considered when choosing the class to predict for a new observation, it picks the k closest data points to the new observation and takes the most common class among these (Sutton, 2012).
According to Dr. Zhongheng Zhang, MMed, Department of Critical Care Medicine, Jinhua Municipal Central Hospital, Jinhua Hospital of Zhejiang University, there are two important concepts about the KNN() function. The first is that the KNN() function uses Euclidean distance, which is determined using the equation below:

$$D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}$$

where $p$ and $q$ are the subjects to be compared, with $n$ characteristics. The other concept is the $k$ parameter, which determines how many neighbors the KNN algorithm will choose. The choice of $k$ has a considerable impact on the KNN algorithm's diagnostic performance (Zhang, 2016).
Figure 6. Visual presentation of simulated working example
The class 1, 2 and 3 are denoted by red, green and blue colors, respectively.
Dots represent test data and triangles are training data.
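A minimal scikit-learn sketch of the idea is shown below. Because this thesis applies KNN to price prediction, a regressor variant that averages the k nearest targets is assumed here; the training points are synthetic and Euclidean distance is the library default (Minkowski with p=2).

```python
# Hedged sketch: k-nearest-neighbors regression on synthetic, already-scaled features.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
X_train = rng.uniform(0, 1, size=(300, 2))              # e.g. scaled area and floor
y_train = 50 + 120 * X_train[:, 0] + 30 * X_train[:, 1]

knn = KNeighborsRegressor(n_neighbors=5)   # k controls how many neighbors are averaged
knn.fit(X_train, y_train)
print(knn.predict([[0.5, 0.5]]))           # mean target of the 5 closest training points
```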
2.3 Random Forest classification
2.3.1 Definition of classification
Classification is a supervised machine learning technique used to predict group membership for data examples (Sukumaran, 2013). Although there are a variety of machine learning approaches available, classification is the most extensively employed. Classification is a well-liked problem in machine learning, particularly in future planning and knowledge discovery, and it is regarded as one of the most important topics tackled by machine learning and data mining experts (Baradwaj, 2012). Linear classifiers, logistic regression, the Naive Bayes classifier, the perceptron, support vector machines, quadratic classifiers, k-means clustering, boosting, Random Forest (RF), neural networks, Bayesian networks, and so on are examples of classification techniques (Ayodele, 2010).
2.3.2 Definition of Random Forest classification
Xindong Wu, from Hefei University of Technology and Yuanda Australia International Education Center, gives the following definition: a decision tree is a machine learning model used to predict an outcome for an object, and a random forest is built from such trees. Random forest models can be classifiers or regressors, which means they can predict the categorization of an item or a certain dependent value for the item. To do this, a tree structure is used, with leaves representing values and branches representing feature conjunctions that lead to those values. Random forests are among the most popular machine learning algorithms due to their comprehensibility and simplicity (Wu, Xindong et al., 2007).
Figure 7. Decision tree graph
Decision tree models are sometimes found to be unstable, which means that a little change in the decision tree's starting parameters might cause the model forecast to fluctuate greatly. They also have a proclivity towards overfitting. Random Forest is an ensemble learning model that overcomes these problems by combining bootstrap aggregation with random decision trees.
Figure 8. Random Forest Classifier graph
Random Forest, according to Leo Breiman, from Statistics Department
University of California, “is a combination of tree predictors such that each tree
depends on the values of a random vector sampled independently and with the same
distribution for all trees in the forest” (Breiman, 2001). It constructs decision trees
from several samples and uses their majority vote for classification and average for
regression. Each decision tree is trained on a subset of the feature space that is
chosen at random. Trees in distinct subspaces broaden their categorization in
complementary ways, improving overall classification and stability while avoiding
overfitting.
The number of trees produced for the Random Forest is an important factor to consider while creating the model. For a feature space of m dimensions, for example, there are m subspaces in which decision trees may be built. When a higher number of trees is employed, the model performs better. While the growth in model accuracy slows as the number of trees grows, it does not grind to a halt.
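The sketch below builds a random-forest classifier with scikit-learn on synthetic data; the three generated classes merely stand in for the low-end, middle-end and high-end apartment segments, and the hyperparameters are illustrative assumptions.

```python
# Hedged sketch: random-forest classification on synthetic three-class data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200,     # number of bootstrapped decision trees
                            max_features="sqrt",  # random feature subspace at each split
                            random_state=0)
rf.fit(X_train, y_train)
print("Accuracy:", rf.score(X_test, y_test))
print("Feature importances:", rf.feature_importances_)
```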
2.4 Model evaluation
2.4.1 Residual and Predicted Values
The deterministic component is in the form of a straight line which provides the
predicted (mean/expected) response for a given predictor variable value.
The residual terms represent the difference between the predicted value and the observed value of an individual. They are assumed to be independently and identically normally distributed with zero mean and constant variance, and account for natural variability as well as possible measurement error (Kim, 2019). Our data should thus appear to be a collection of points that are randomly scattered around a straight line, with constant variability along the line:
Figure 9. Residual Scatter plot
The residual scatter plots allow you to check:
● Normality: the residuals should be normally distributed about the predicted responses.
● Linearity: the residuals should have a straight-line relationship with the predicted responses.
● Homoscedasticity: the variance of the residuals about predicted responses should be the same for all predicted responses.
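A residual scatter plot of this kind can be produced with a few lines of matplotlib, as sketched below; the y_true and y_pred arrays are simulated stand-ins for a fitted model's outputs.

```python
# Hedged sketch: residuals-vs-predicted plot for checking normality, linearity, homoscedasticity.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
y_true = rng.normal(loc=100, scale=20, size=200)
y_pred = y_true + rng.normal(scale=5, size=200)   # stand-in predictions
residuals = y_true - y_pred

plt.scatter(y_pred, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Predicted value")
plt.ylabel("Residual")
plt.title("Residuals vs predicted values")
plt.show()
```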
2.4.2 RMSE
The term "error" in statistics refers to a deviation from a known, correct value.
As a result, the root mean square error (RMSE) statistic measures the average
deviation of a group of observations from a known value. The RMSE is calculated
as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}$$

where $\mu$ is the known, correct value; $n$ is the number of observations of $\mu$; and $x_i$ is one of the set of $n$ observations.

RMSE is a measure of accuracy because it represents the dispersion around a true value. A test of whether RMSE is zero, for example, might be used to confirm or deny observation bias.
The precision and accuracy of an observation set can be assessed using both statistics, and neither statistic is sufficient on its own:
● A small RMSE indicates a low standard deviation.
● A small SD does not imply a small RMSE, and a big RMSE does not imply a large standard deviation.
● A big standard deviation implies a large root mean square error (RMSE).
● To put it another way, a small RMSE indicates that the estimated mean is close to the true mean, which implies that the calculated SD is close to the RMSE: the sample is precise and accurate.
● Because the estimated mean is far from the true mean if the sample is biased, a small SD does not imply a small RMSE: the sample is precise but inaccurate.
● The same rationale applies to the previous bullet: a large RMSE does not imply a large SD.
● A large SD indicates a large RMSE: the sample is dispersed, and both SD and RMSE are dispersion measures, thus if SD is large, RMSE must be large as well: the sample is imprecise and may or may not be accurate.
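The short sketch below computes RMSE around a known value mu and, for comparison, the standard deviation of the same observations; the numbers are made up purely to illustrate the two dispersion measures.

```python
# Hedged sketch: RMSE around a known value vs. SD around the sample mean.
import numpy as np

mu = 10.0                                     # known, correct value
observations = np.array([10.2, 9.8, 10.5, 11.0, 9.5, 10.9])

rmse = np.sqrt(np.mean((observations - mu) ** 2))
sd = np.std(observations)                     # spread around the sample mean, ignores bias
print(f"RMSE = {rmse:.3f}, SD = {sd:.3f}")
```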
2.4.3 Coefficient of determination (R-Squared)
According to Stanton A Glantz and his colleagues, the coefficient of
determination, designated 𝑅2 and pronounced "R squared," is the fraction of the
variation in the dependent variable that is predicted from the independent
variable(s) (Glantz, Slinker & Neilands, 2000).
It is a statistic applied in the context of statistical models whose primary objective is either the prediction of future events or the testing of hypotheses based on other relevant information. Based on the fraction of total variance explained by the model, it measures how well observed outcomes are replicated by the model.

The $R^2$ value is determined as below, where $y_i$ represents the actual price, $\hat{y}_i$ represents the predicted price, and $N$ is the number of samples:

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}$$

The arithmetic mean, $\bar{y}$, is determined as follows:

$$\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$$
In regression analysis assessment, the coefficient of determination can be more (naturally) useful than MAE, MAPE, MSE, and RMSE, since the former can be stated as a percentage while the latter measures have arbitrary ranges. On the test datasets in the paper, it also demonstrated more robustness to bad fits than SMAPE (Chicco, Warrens & Jurman, 2021).
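The sketch below computes R² once from the formula above and once with scikit-learn's r2_score; the actual and predicted prices are invented values used only to show that the two computations agree.

```python
# Hedged sketch: coefficient of determination by hand and via scikit-learn.
import numpy as np
from sklearn.metrics import r2_score

y_actual = np.array([2.1, 3.4, 2.8, 4.0, 3.3])       # e.g. prices in billions of VND (made up)
y_predicted = np.array([2.0, 3.6, 2.9, 3.8, 3.1])

ss_res = np.sum((y_actual - y_predicted) ** 2)
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)
print("R^2 (formula):", 1 - ss_res / ss_tot)
print("R^2 (sklearn):", r2_score(y_actual, y_predicted))
```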
2.5 Related work
2.5.1 Foreign research
Following the article "Data engineering for house price prediction" (Burgt, 2017), the authors predicted housing prices in the Dutch housing market with regression models and combined different regression models in order to create a model with higher accuracy and a lower error rate during a period of increasing demand in real estate.
The research model consists of four steps to preprocess the data. First, price index construction: different types of price indices are discussed. Second, data engineering is conducted on the data sets in order to make the features fit better for machine learning. Third, regression is used for predicting real estate prices. Finally, sale and neighborhood comparisons: the sales prices between real estate agents are compared, as are prices between neighborhoods, based on the dataset. The research also points out that although the fundamental sale forecast component has a decent foundation, there are upgrades that might make it more accurate and trustworthy throughout all of the Netherlands' areas.
2.5.2 Domestic research
"Analysis of variables impacting apartment pricing, case study in District 2 of Ho Chi Minh City", from Ho Thi Nhai's thesis (2015), used a Hedonic model to collect 130 unit samples from 20 apartment buildings that were available for sale, had successful transactions in District 2, and had high prices ranging from 14 million/m2 to 25 million/m2. The research findings suggest that just seven independent variables have an influence on pricing, including distance to the center, floor location, number of toilets, position of the flats on the floor, surroundings, handy services, and the investor's reputation.
2.5.3 Reviews of previous research papers
The research finally gives some results: with or without weighting, the accuracy of this prediction is not great; the accuracy score was below zero and the average error and standard deviation were around 60,000 and 70,000 respectively. On the other hand, with the ability to merge models into one final model, each municipality model paired with the Netherlands model of the same property type resulted in an accuracy of 83.95, which at first glance does not appear to be a significant improvement. However, the average error and standard deviation, which are 38,000 and 43,000 respectively, show a significant improvement.
2.6 Conclusion
Chapter Two reviews the theoretical literature that will be used in the following chapters, together with previous domestic and international studies related to real estate price prediction in the region. Determining influencing factors and predicting real estate prices in general, and apartment prices in Ho Chi Minh City in particular, has often been done with the Hedonic method and has not been optimized. This study will use the Ordinary Least Squares method to determine the factors affecting apartment prices. At the same time, apartment price forecasting will be performed by applying three machine learning algorithms, XGBoost, Linear Regression and K-Nearest Neighbors, to forecast apartment prices in Ho Chi Minh City. These algorithms will be evaluated for accuracy and feasibility through RMSE, residual score and R-squared score. Besides, apartment classification by Random Forest classification is also used to support real estate companies as well as apartment buyers.
CHAPTER 3. METHODOLOGY
3.1 Methodology Research Process
During the research process, the necessary steps in the research methodology are summarized in the following map:
Figure 10. Methodology Research Process
The research methodology includes 5 steps:
- Material Introducing: giving an introduction to the data sources and the amount of data needed for the research.
- Data Sampling: showing how to get the data (including the software and the collecting steps).
- Data Exploring: analyzing a huge data collection in an unstructured way to uncover initial patterns and characteristics of the data (Data Exploring is the first part of Data Pre-Processing).
- Data Modifying: transforming raw data into an understandable format (Data Modifying is the second part of Data Pre-Processing).
- Experiment Design: including Data Modeling (creating a data model for an information system by using many types of machine learning models) and Data Assessing (evaluating the data to verify whether these models match the project quality requirements).
3.2 Materials Introducing
Before starting with the first step, it is essential to discuss the source from which the data was collected. Since data collection takes place on computers (instead of collecting data in person at real estate offices), it is essential and optimal to dig the data from a website that specializes in real estate. Currently, there are many online real estate brokers, and Batdongsan.com.vn is one of them. Although Batdongsan.com.vn is a relatively young exchange, the company is one of the biggest real estate exchanges in Vietnam. At Batdongsan.com.vn, sellers and buyers can upload their products directly to the website, so the data collected is also considered primary data (Batdongsan.com.vn, 2021).
Figure 11. Batdongsan.com.vn's website interface
Before posting products, buyers and sellers need to create an account and
identify themselves, thereby ensuring the product's reputation and avoiding fraud
before direct transactions are made. Besides, in order to filter out listings with
prices that are unrealistically high or low, I will collect reference data on the prices
published by the apartment building contractors and compare them with the data
collected from Batdongsan.com.vn to remove junk data.
In Vietnam, there are many real estate exchange websites, such as Chotot.com,
Rever.vn or Propzy.vn. It is possible to collect data about apartments from those
websites, which would provide plenty of data sources for building the regression
models. However, because they serve the same function, sellers can upload the
same products to different websites. Therefore, collecting from multiple websites
can lead to many duplicate records and affect the accuracy of the models. On the
other hand, the amount of data scraped from Batdongsan.com.vn alone is enough
for building the regression models (about 4,000 to 5,000 apartments).
3.3 Data Sampling
The sampling process uses scraping software to collect data from the internet in
a form that the machine learning model can understand. The output data is
interpreted by a machine during parsing, although it is difficult for humans to grasp.
Data extraction is another term for data scraping. Data scraping is particularly
useful because, when humans collect data from the internet, there are numerous
opportunities for error, whereas computers transfer data between programs in the
form of data structures that ensure data integrity.
In terms of data consistency and correctness, the data could be problematic. As
a result, data scraping only obtains raw data and necessitates considerable
preprocessing and, in some cases, human intervention. The data scraping activity
relies mostly on internet sources for data collection and cannot be completely
automated. For example, when scraping data from a website, the scraper relies on
the structure that the developer has assigned to each unique HTML element, such
as an ID attribute and a class attribute for each item in the same group. This makes
it possible to create a scraping script in nearly any programming language. Besides
using Python packages as crawlers, ParseHub, an application developed to crawl
data from websites, is used in this research.
Data extraction technologies have traditionally been either too difficult to
utilize for non-technical people or too simplistic to manage the complexity and
interaction of modern websites.
When collectors consider how much time is wasted collecting data, the
necessity for a strong and versatile solution that makes data appear directly at
the fingertips becomes even more clear. Data scientists, for example, spend 50 to
80 percent of their time collecting and preparing data rather than putting it to use.
ParseHub allows developers complete control over how they choose, structure,
and alter items, eliminating the need to dig through the browser's web inspector.
There is no need to create another web scraper because users are able to handle
interactive maps, endless scrolling, authentication, dropdowns, forms, and more
with ease using ParseHub (ParseHub, 2021).
Web scraping from well-established companies, on the other hand, is not
simple, since these organizations deploy defensive algorithms and software to
prevent unauthorized access to their websites. As a result, the goal is to use tools
or applications that can scrape data as intelligently as a human. This is
accomplished by automating human online browsing behavior. For example,
ParseHub can scrape data with a delay of 3 to 5 seconds between requests, so the
system may not notice the data extraction and will treat it as routine activity.
Each product's data is extracted from the website in three steps:
− The first step is collecting general information about the product (product
title, product location and square area of the product).
Figure 12. First step of extracting
− The second step is that ParseHub automatically accesses the URL links to
collect more data on each listing. The page contains detailed information about the
product, which is collected according to the following criteria: number of
bedrooms, number of bathrooms, other facilities, main contractor, and project
name.
Figure 13. Second step of extracting
Moreover, by setting ParseHub to automatically turn pages, the crawling
process continues through the two steps above until the last page is reached. The
data scraping process can take a lot of time due to the large amount of data.
The final step is to access the apartment investors' websites to get some features
which are not extracted from Batdongsan.com.vn in the two steps above (swimming
pool, gym facilities, furniture included, population density of the districts, car
parking, time to access the city center, and price per meter in the areas). Besides,
it is also necessary to get the reference prices from the apartment investors and
then compare them with the prices on the Batdongsan.com.vn platform to remove
outliers.
The variables I expect to extract are shown in the table below:
Table 1. Description of all attributes which were collected

Index | Name of attributes | Description | Data Type
1 | Title | Product introduction title | String
2 | ID | Address of real estate | Object
3 | District | Location of apartment | Object
4 | Total_Square | Area of real estate | Object
5 | Bedroom | Number of bedrooms | Numeric
6 | Bathroom | Number of bathrooms | Numeric
7 | Price | Apartment price | Object
8 | Policy | The current certificate of ownership | String
9 | Furniture | Furniture provided with the real estate | Boolean
10 | Swimming_Pool | Swimming pool provided with the real estate | Boolean
11 | Gym | Gym facilities of the apartment | Boolean
12 | Price Per Meter Each District (B/M2) | The average price per square meter in each district | Float
13 | Population Density | The population density of each district | Float
14 | Coordinate | Coordinates of the apartment project | Object
15 | Project Name | Name of Apartment project | String

3.4 Data Exploring
The total number of samples collected is 4900. After checking and cleaning the
data, some samples were removed because they lacked information on variables,
fell outside the scope of the study, or did not pass the quality assessment. The
analysis is performed using the Python programming language and its statistical
packages.
Table 2. Total number of apartments in each district

Index | District | Number of apartments | Price per meter (million/m2)
1 | District 2 | 724 | 53.83
2 | District 7 | 609 | 43.97
3 | District 9 | 587 | 40.89
4 | Tan Phu | 416 | 40.74
5 | District 4 | 332 | 61.14
6 | District 8 | 274 | 35.54
7 | District 12 | 266 | 33.71
8 | Binh Thanh | 253 | 60.17
9 | Thu Duc | 223 | 36.97
10 | Binh Tan | 219 | 33.91
11 | Binh Chanh | 212 | 34.86
12 | District 6 | 139 | 42.91
13 | Nha Be | 122 | 37.64
14 | District 10 | 97 | 66.67
15 | District 5 | 96 | 48.48
16 | Go Vap | 87 | 40.03
17 | Phu Nhuan | 83 | 62.51
18 | Tan Binh | 74 | 52.21
19 | District 11 | 43 | 46.36
20 | District 3 | 24 | 72.87
21 | District 1 | 20 | 99.05
In terms of research scope, the district with the largest number of survey
samples is District 2, with 724 samples and an average apartment unit price of
53.83 million VND/m2. The district with the highest transaction price is District 1,
with an average price of 99.05 million VND/m2.
According to Circular 31/2016/TT-BXD, effective February 15, 2017, issued at
the Department of Planning and Architecture, apartments are classified into grades
A, B, and C based on four factors: architectural planning; technical systems and
equipment; services and social infrastructure; and quality, management and
operation.
From there, the average price of each apartment grade is decided. Currently,
apartment prices in Ho Chi Minh City fall into three segments: high-end grade A
apartments cost more than 60 million per square meter, mid-range grade B
apartments cost from 35 to 60 million per square meter, and low-end grade C
apartments cost less than 35 million per square meter.
Table 3. Segment of all apartments

Index | Segment | Number of apartments | Price per meter (million/m2)
1 | A | 839 | 73.437
2 | B | 2695 | 45.347
3 | C | 1366 | 29.35

3.5 Data Modifying
3.5.1 Outliers
3.5.1.1 Log transformation
After changing prices to numeric (float) form, it is easy to see that apartment
prices show outliers that can affect the results of predictive models.
Figure 14. Boxplot of Price distribution
The outliers can be seen to be located in the price range of grade-A luxury
apartments. Therefore, the outlier problem is handled by taking the logarithm of
the price variable.
Figure 15. Boxplot of price after log transformation
The price still shows some outliers after the log transformation, so it is necessary
to drop the records whose log_price is higher than 0.9. The dataset finally has
4882 values left.
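The steps above can be sketched with a few lines of Python. The DataFrame and column names ("df", "Normalized_Price", "log_price") and the base-10 logarithm of prices expressed in billions of VND are assumptions made for illustration, since the text only reports the 0.9 threshold.

# A minimal sketch of the log transformation and outlier filtering (assumed names).
import numpy as np
import pandas as pd

df = pd.read_csv("apartments.csv")                       # hypothetical input file
df["log_price"] = np.log10(df["Normalized_Price"])       # assumed: log base 10 of price in billion VND
df = df[df["log_price"] <= 0.9].reset_index(drop=True)   # drop records above the 0.9 threshold
print(len(df), "rows remain after removing outliers")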
According to Chambers (Chambers et al., 1983), the probability plot is a
graphical tool for determining whether a data set follows a given distribution, such
as the normal or Weibull distribution. In addition, a line can be fitted to the points
and added as a reference line. The further the points deviate from this line, the
stronger the indication of a departure from the specified distribution.
Figure 16. Probability Plot before log normalization
Figure 17. Probability Plot after log normalization
It can be seen that the fitted values along the reference line range from -3 to
more than 2. Based on the probability plot, the nearly straight diagonal line after
the log transformation indicates normally distributed data rather than Weibull-
distributed data.
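Probability plots like those in Figures 16 and 17 can be produced with scipy; a small sketch follows, with the column names carried over as assumptions from the earlier sketch.

# Sketch of the normal probability plots before and after the log transformation.
import matplotlib.pyplot as plt
from scipy import stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(df["Normalized_Price"], dist="norm", plot=ax1)  # before log transformation
ax1.set_title("Probability plot of raw price")
stats.probplot(df["log_price"], dist="norm", plot=ax2)         # after log transformation
ax2.set_title("Probability plot of log price")
plt.tight_layout()
plt.show()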
3.5.1.2 Skewness and Kurtosis
Skewness is a measure of asymmetry, while kurtosis is a measure of a
distribution's 'peakedness.' The values of skewness and kurtosis, as well as their
standard errors, are provided by most statistical software, according to Hae-Young
Kim, from Korea University (Kim, 2013).
Skewness is a measure of a variable's asymmetry in its distribution. The skew
value of a normal distribution is 0, meaning that the distribution is symmetric. A
positive skew value indicates that the right-hand tail of the distribution is longer
than the left-hand tail and that the majority of the values are located to the left of
the mean. A negative skew value, on the other hand, shows that the left-hand tail
of the distribution is longer than the right-hand tail and that the majority of the
values are to the right of the mean. An absolute skew value > 2 has been advocated
as a measure indicating severe deviation from normality (West et al., 1996).
Figure 18. Skew distribution
On the other hand, kurtosis is a metric for determining how peaked a
distribution is. The original kurtosis value is frequently referred to as kurtosis
(proper), and an absolute kurtosis (proper) value > 7.1 is used to indicate a
significant divergence from normality. Most statistical tools, such as SPSS, give
'excess' kurtosis, which is calculated by subtracting 3 from the kurtosis (proper).
For a fully normal distribution, the excess kurtosis should be zero. A positive excess
kurtosis is known as a leptokurtic distribution, which means a high peak, while a
negative excess kurtosis is known as a platykurtic distribution, which means a
flat-topped curve (West et al., 1996).
Figure 19. Kurtosis distribution
Figure 20. Distribution of price before log normalization
Figure 21. Distribution of price after log normalization
As we can see, the price distribution is not normally distributed. The target
variable is skewed to the right due to multiple outliers (skewness is 1.3393 and
kurtosis is 2.4661). After the log transformation, the skewness score is 0.2484 and
the kurtosis score is -0.49024, so the log transformation clearly restores the
near-normality of the variable, which eliminates the majority of the other issues
discussed before.
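The reported skewness and kurtosis values can be reproduced roughly as follows; note that pandas returns excess kurtosis, matching the convention discussed above (column names are assumed).

# Sketch: skewness and (excess) kurtosis before and after the log transformation.
print("skewness :", df["Normalized_Price"].skew(), "->", df["log_price"].skew())
print("kurtosis :", df["Normalized_Price"].kurtosis(), "->", df["log_price"].kurtosis())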
3.6 Data Preprocessing
3.6.1 Preprocessing process
Data preprocessing is used to turn a messy dataset into a clean dataset that can
be used by machine learning algorithms. Raw data, which cannot be analyzed
directly, is subjected to data preparation processes. In our case, the information
was gathered from property websites where it was entered by property agents, so
there are missing values, data in various formats, and erroneous information. Data
integration was used to aggregate data from diverse sources into a single dataset,
and data transformation methods were used to convert the records into a format
suitable for machine learning analysis.
Because the data collected from the Batdongsan.com.vn website is already
relatively clean and systematic after removing outliers through the comparison
process, what we need to do is make some adjustments for building the machine
learning model. As a first step, we can use Excel, a popular and easy-to-use tool,
to categorize and convert data types. In this case, the area of the house and the
price are not stored as numbers (numeric type), so we convert them to serve the
construction of the model. Moreover, we are able to change values from Boolean
to numeric (1 is "yes" and 0 is "no"). The variables that need to be changed from
Boolean to numeric (binary standardized) are listed in Table 4; a small code sketch
of this mapping follows the table.
Table 4. Numeric transformation

Index | Variable | Old value | New value | Old value | New value
1 | Furniture | Basic | 0 | Included | 1
2 | Policy (House ownership certificate) | No | 0 | Yes | 1
3 | Swimming_pool | No | 0 | Yes | 1
4 | Gym | No | 0 | Yes | 1
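A small pandas sketch of the binary standardization in Table 4 is given below; the raw label strings are assumed to match the text scraped from the listings.

# Sketch of the Boolean-to-numeric conversion in Table 4 (label strings assumed).
binary_maps = {
    "Furniture":     {"Basic": 0, "Included": 1},
    "Policy":        {"No": 0, "Yes": 1},
    "Swimming_pool": {"No": 0, "Yes": 1},
    "Gym":           {"No": 0, "Yes": 1},
}
for column, mapping in binary_maps.items():
    df[column] = df[column].map(mapping).astype("int64")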
Besides, factors that can affect real estate in general and apartments in particular
are the facilities located in the city. These factors include the distance to the airport,
the distance to the train station, and the facilities in the areas such as schools. The
above factors are calculated based on the coordinates of the apartment projects.
The above geographical distances can be easily calculated using the package
geopy.distance. One of the key factors in a residential area is "distance to market,
supermarket or mall" which can affect the price of apartments in that area.
However, nowadays, apartment buildings have commercial centers or
supermarkets on the ground floor, which affects the overall value of the variable
"distance_to_market". Therefore, the study will remove the above variable from
the beginning to ensure the accuracy of the model.
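The distance features can be computed with geopy as sketched below; the reference coordinates for the city center and the airport are approximate values chosen for illustration, not figures taken from the thesis.

# Sketch of the geographical distance features (reference coordinates are approximate).
from geopy.distance import geodesic

CENTER  = (10.7769, 106.7009)   # assumed city-center reference point (Ben Thanh area)
AIRPORT = (10.8188, 106.6519)   # assumed Tan Son Nhat airport coordinates

df["distance_to_central"] = df.apply(
    lambda row: geodesic((row["Lat"], row["Long"]), CENTER).km, axis=1)
df["distance_to_airport"] = df.apply(
    lambda row: geodesic((row["Lat"], row["Long"]), AIRPORT).km, axis=1)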
In addition, social facilities such as schools in each area were collected from
the government website https://www.hcmcpv.org.vn/, which is the website of the
Party Committee of Ho Chi Minh City.
Table 5. Location factors

Index | Variable | Description | Data type
1 | Distance_to_airport | Distance to airport | Float
2 | Distance_to_train_station | Distance to train station | Float
3 | Distance_to_school | Distance to school | Float
4 | Distance_to_hospital | Distance to hospital | Float
3.6.2 Correlation
To observe which data can be used in model building, a correlation plot is
needed to examine the correlations of the variables, thereby excluding those with
a high dependency ratio. In addition, other graphs that show the relationship
between price and other features will be presented, such as the increase of house
prices in proportion to the size of the plot, or how location affects price.
Figure 22. Correlation graph
According to Albert and his colleagues from the office of the Chief Statistician
of the Philippines, correlation is a measure of a relationship between variables. In
correlated data, a change in one variable's value is linked to a change in another
variable's value, either in the same (positive correlation) or opposite (negative
correlation) direction (Albert et al., 2008). Several descriptive statistics were
utilized to describe the study's results, including mean, standard deviation,
minimum, and maximum values. Descriptive statistics are used to characterize and
assess data in order to generate concise summaries and derive some helpful
conclusions. When at least one of the variables is ordinal, the Spearman rho
correlation is employed to assess the connection between them. According to
Albert, the correlation coefficient's range and degree of relationship are shown in
the table below (Albert et al., 2008).
Table 6. Interpreting correlation
Based on this table, we can take the input, or predictor, variables and store them
in the Features dataset, which can be used to predict price. Besides removing the
variables that are not statistically significant, including "Title", "ID", "District",
"Coordinate" and "Project Name", we will continue to exclude variables with low
or no correlation with price (correlations between -0.3 and 0.3).
As can be seen in Figure 22, multicollinearity persists across a variety of
attributes. However, for the sake of learning, we will keep them for now and let
the models handle them afterwards. Let's take a look at some of the remaining
connections:
− Bathroom and Bedroom have a correlation of 0.73, or 73%.
− Normalized_Meter and Bedroom have a 73% correlation.
− Distance to central and Distance to railway have an 89% correlation.
− Distance to the airport and Distance to railway have an 85% correlation.
− Distance to the hospital and Distance to the high school have a 100%
correlation.
Therefore, to better suit multiple linear regression techniques, some attributes
have to be removed from the dataset.
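A sketch of this correlation screening is shown below; which variable of each collinear pair is dropped is an assumption based on the correlations listed above.

# Sketch: correlation matrix (Figure 22) and removal of one variable per collinear pair.
import seaborn as sns
import matplotlib.pyplot as plt

corr = df.select_dtypes("number").corr()   # numeric columns only
plt.figure(figsize=(12, 10))
sns.heatmap(corr, cmap="coolwarm")
plt.show()

# distance_to_railway overlaps with the other distance features (0.89 / 0.85),
# and Distance_to_highschool is almost identical to Distance_to_hospital (1.00).
df = df.drop(columns=["distance_to_railway", "Distance_to_highschool"], errors="ignore")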
3.7 Experimental Design
The experiment was performed to pre-process the data and evaluate the
predictive accuracy of the models. A multi-stage experiment is required for
predictable results. However, since the data from sections 3.4, 3.5 and 3.6 will be
used for the whole experiment, these remaining phases including Data Modeling
and Data Assessing can be defined as:
3.7.1 Data Modeling:
- Affecting Factors Identification: Ordinary least squares (OLS) method is
used to estimate the parameters in the regression equation, so it is possible to
determine the relationship between the dependent variable (apartment price) and
the independent variables (Affecting Factors). The OLS Model is imported from
the statsmodels package.
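A minimal statsmodels sketch of this step is given below, assuming the independent variables are already gathered in a DataFrame named Features.

# Sketch of the OLS estimation used for identifying the affecting factors.
import statsmodels.api as sm

X = sm.add_constant(Features)                 # add the intercept term
ols_model = sm.OLS(df["log_price"], X).fit()  # dependent variable: log_price
print(ols_model.summary())                    # coefficients, p-values, R-squared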
- Price prediction: The data, including the dependent variable ("log_price")
and the independent variables (in the "Features" dataset), will be divided into two
parts: one to train the model and the other to evaluate it. The dataset will be split
80% for training and 20% for testing. Three machine learning algorithms, XGBoost
imported from the XGBoost package, and Linear Regression and K-Nearest
Neighbors imported from the scikit-learn package, will be used to predict apartment
prices.
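The split and the three regressors can be sketched as follows; the hyperparameters and the random_state are illustrative assumptions, not values reported in the thesis.

# Sketch of the 80/20 split and the three price-prediction models.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(
    Features, df["log_price"], test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "XGBoost": XGBRegressor(),
    "KNN": KNeighborsRegressor(),
}
for name, model in models.items():
    model.fit(X_train, y_train)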
- Apartment Classification: The data, including the dependent variable
("segment") and the independent variables (in the "df_class" dataset with
"segment" dropped), will be divided into two parts: one to train the model and the
other to evaluate it. The dataset will be split 80% for training and 20% for testing.
The Random Forest Classifier algorithm imported from the scikit-learn package
will be used to classify apartments.
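A corresponding sketch for the classification setup follows; n_estimators and random_state are assumptions.

# Sketch of the Random Forest classification of apartment segments.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_cls = df_class.drop(columns=["segment"])
y_cls = df_class["segment"]
Xc_train, Xc_test, yc_train, yc_test = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42)

rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(Xc_train, yc_train)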
3.7.2 Data Assessing:
- Affecting Factors Identification: We will utilize R-squared, the coefficient of
determination, to evaluate the Ordinary Least Squares technique. It is the fraction
of the dependent variable's variability that can be predicted or explained. The coef
scores are the coefficients of the independent variables and the constant term in
the equation. In addition, we will re-evaluate the model's correlation to avoid
overfitting and underfitting. Andy Field explains how to use the Durbin-Watson
test to look for serial correlations between errors in regression models. It checks
whether adjacent residuals are correlated, which is useful when independent errors
are assumed (Field, 2009). Field also indicated that concluding there is no
autocorrelation is acceptable when the DW value is between 1 and 3.
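The Durbin-Watson check on the fitted OLS model can be sketched as follows, reusing the ols_model object from the earlier sketch.

# Sketch of the Durbin-Watson test on the OLS residuals (Field, 2009).
from statsmodels.stats.stattools import durbin_watson

dw = durbin_watson(ols_model.resid)
print("Durbin-Watson statistic:", dw)   # values between 1 and 3 are considered acceptable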
- Price prediction: There are a variety of error measurements that can be used
to evaluate a model's prediction performance. The three predictive models (Linear
Regression, XGBoost and KNN) will be evaluated by three commonly used
metrics in the regression sector. First of all, the residual score refers to the
difference between the dependent variable's predicted value (as calculated by the
predictive models) and the actual observed value. A score close to zero suggests
that the difference between each pair of points is small, indicating the predictive
model's effectiveness. Secondly, the fit of the data to the regression model is
indicated by the coefficient of determination, R-squared (R2); an R-squared value
close to 1.0 implies that the model makes accurate predictions. Thirdly, the root
mean squared error (RMSE) is an accuracy statistic that shows how far the model
is on average from the observed data points; the RMSE can also be thought of as
a measure of how spread out the residuals are. Each model's error measures will
be graphically illustrated. Lastly, the most effective model will be the one that is
rated best by at least two of the three error measurements.
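A sketch of how these three metrics can be computed for each fitted model follows, with object names carried over from the earlier sketches.

# Sketch: mean residual, RMSE and R-squared for each regression model.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

for name, model in models.items():
    y_pred = model.predict(X_test)
    residual = np.mean(y_test - y_pred)                  # mean residual
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # root mean squared error
    r2 = r2_score(y_test, y_pred)                        # coefficient of determination
    print(f"{name}: residual={residual:.6f}, RMSE={rmse:.6f}, R2={r2:.6f}")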
- Apartment Classification: The Precision, Recall, F1 score, and Accuracy of
the Random Forest Classifier algorithm are evaluated using the confusion matrix
and the values in the classification report. According to Marina Sokolova,
Precision is the number of correctly classified positive examples divided by the
number of examples labeled as positive by the system, Recall is the number of
correctly classified positive examples divided by the number of positive examples
in the data, and the F1 score is a combination of the two. Furthermore, accuracy is
defined as a classifier's overall effectiveness (Sokolova & Lapalme, 2009). Besides,
a Random Forest is a collection of many Decision Trees used to classify apartments,
thereby increasing the classification accuracy over a single Decision Tree. We can
visualize one of the Decision Trees to see which criteria the apartments are
classified on.
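A sketch of this classification evaluation is shown below, assuming the fitted classifier from the previous sketch.

# Sketch: confusion matrix and classification report for the Random Forest classifier.
from sklearn.metrics import classification_report, confusion_matrix

yc_pred = rf_clf.predict(Xc_test)
print(confusion_matrix(yc_test, yc_pred))
print(classification_report(yc_test, yc_pred))   # precision, recall, F1-score, accuracy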
CHAPTER 4: RESULT ANALYSIS AND DISCUSSION
On the dataset of Ho Chi Minh City apartments, the three machine learning
models XGBoost, Linear Regression, and KNN were used to evaluate predictive
performance in terms of forecasting prices. The models were compared and scored
based on their measured accuracy and the time it took to obtain that accuracy.
4.1 Result Analysis
After applying the Data Modeling process described in the Experimental
Design, all models are evaluated in this final SEMMA step for their usability and
reliability for the studied problem. The data may now be examined and used to
estimate how effective each model's performance is.
4.1.1 Ordinary Least Square method evaluation
4.1.1.1 Regression results according to Ordinary Least Squares method
(OLS)
Table 7. OLS Regression Results
The regression results show that, at the 5% level of significance, most of the
variables are statistically significant (p-value < 0.05) except for the variable
Density (People/km2), so there is enough basis to conclude that the independent
variables impact apartment prices. Besides, the signs of the regression coefficients
of all variables are in accordance with expectations and previous studies.
The results show that the model has a coefficient of R2=0.792=79.2%, which
means that the independent variables in the model explain 79.2% of the variation
of apartment prices in Ho Chi Minh City.
4.1.1.2 Correlation test
The fact that the model occurs autocorrelation will make the OLS estimates
ineffective. Therefore, the author uses the Durbin - Watson d test to detect this
phenomenon. Specific test results are as follows: Durbin-Watson d-Statistics =
1.527 which is in the range 1 < d <3, according to Andy Field those values below
1 or more than 3 are cause for concern (Field, 2009). Therefore, it can be concluded
that the model has no autocorrelation.
4.1.2 Predictive Models Evaluation
4.1.2.1 Linear Regression
Figure 23. Comparing LR's actual vs predicted values
The R2 coefficient is 0.799820. This suggests that the model explains or
captures about 80 percent of the fluctuation in apartment prices, while the
remaining roughly 20 percent is attributable to external variables. With an R2 of
0.799820, the model has good accuracy, and it also suggests that by incorporating
different predictors, we could enhance the model further. The RMSE is nearly
0.074578, which indicates that the model's predictions are on average 0.074578
units off from the actual data; for our model, this is not a bad value. The mean of
the residuals is -0.00049, which is very close to zero.
Figure 24. LR's Residual graphs
4.1.2.2 XGBoost
Figure 25. Comparing XGBoost's actual vs predicted values
XGBoost outperforms Linear Regression and KNN in terms of prediction, with
the highest R2 value: 0.868886. This suggests that the model can explain or capture
about 87% of the variation in apartment prices. XGBoost has an RMSE of about
0.060356, which is lower than the Linear Regression and KNN models; it indicates
that the model's predictions are on average 0.060356 units off from the actual
values, the smallest error value among our models. The residual score of XGBoost
is -0.000533, which is slightly farther from zero than that of Linear Regression.
Figure 26. XGBoost's Residual graphs
4.1.2.3 KNN
Figure 27. Comparing KNN's actual vs predicted values
Finally, the KNN model falls between LR and XGBoost, with an R2 value of
0.840146. It means that the model can explain or capture nearly 84 percent of the
price fluctuation. The RMSE value of 0.066644 is lower than the score of Linear
Regression but higher than the score of XGBoost. This suggests that the model's
average forecasts are 0.066644 units off the mark, so KNN may not be the most
suitable fit for the data. KNN also shows a residual score of -0.002048, which is
much farther from zero than those of XGBoost and Linear Regression.
Figure 28. KNN's Residual graphs
4.1.3 Random Forest classification
We can evaluate the classification of data based on the confusion matrix as
below:
Figure 29. Confusion Matrix
From the confusion matrix, it is easy to count the number of true positives, false
positives, true negatives and false negatives for each of the three classes, which
can be shown in the following table:
Table 8. Classification performance
Class | True Positive | False Positive | True Negative | False Negative
A | 134 | 73 | 761 | 9
B | 497 | 16 | 290 | 174
C | 156 | 101 | 713 | 7
Besides, the classification report is produced to calculate the Precision, Recall
and F1-score of the data after applying the random forest model. Although the data
is quite imbalanced, the confusion matrix is quite good and the accuracy is 81% on
the test set of 977 values.
4.2 Discussion
4.2.1 Factors affecting apartment prices in Ho Chi Minh city
The research results show that 11 factors, including Bathroom,
Normalized_Meter, Policy, Furniture, Price_per_meter_each_district (b/m2),
Swimming_pool, Gym, Density (people/km2), distance_to_central,
distance_to_airport and distance_to_hospital, affect apartment prices in Ho Chi
Minh City. The detailed research model, with coefficients rounded to four digits
after the decimal point, is as follows:
Ln(P) = -0.1952 + 0.0074*Bathroom + 0.0062*Normalized_Meter + 0.0050*Policy + 0.0006*Furniture + 0.0518*Price_per_meter_each_district(b/m2) + 0.1701*Swimming_pool + 0.1615*Gym + 8.362e-07*Density(People/km2) - 0.0029*distance_to_central + 0.0015*distance_to_airport + 6.055e-07*Distance_to_hospital + Ɛ
4.2.2 Prediction model
4.2.2.1 Residual
The residual score describes the difference between each predicted value and
the corresponding actual value for each of the three models.
Figure 30 below represents the residual score of three models where KNN has
the farthest-to-zero residual score (-0.002048), and the nearest score is -0.000490
from the Linear Regression model. In the middle is the residual score of XGBoost
which is approximately -0.000533.
Figure 30. Residual score comparison
4.2.2.2 Root mean squared error
MSE occasionally raises the actual error, making it harder to realize and
comprehend the true mistake amount. The RMSE measure, which is generated by
simply calculating the square root of MSE, solves this problem. Figure 31 depicts
the performance of machine learning algorithms used in this study using the
RMSE performance matrix.
Figure 31. RMSE comparison
XGBoost has the lowest RMSE score (0.060356), while KNN shows the second
lowest score (0.066644) and the highest score belongs to Linear Regression
(0.074578); both are not as good as XGBoost.
4.2.2.3 The R2 score
In three regression models, R-squared (R2) can express the amount of variation
given by an independent variable or variables. Figure 32 comparing the
performance of machine learning algorithms used in this study by R2 score.
Figure 32. R-squared comparison
It is clear that the highest R2 score is 0.868886 from the XGBoost model and
the second is KNN with 0.840146 and the last one is Linear Regression
(0.799820).
4.2.2.4 Result summary
Figure 30 shows the residual distribution of the testing dataset after applying
the Linear Regression, XGBoost and KNN models for predicting price. It is clear
that the residual points of Linear Regression show the lowest dispersion compared
to the graphs of the KNN and XGBoost models. Moreover, the residual
distributions from KNN and Linear Regression have skew scores between -5 and
+5, which means they are nearly symmetric.
In terms of residuals, RMSE, and R-squared values, Table 9 shows how the
models performed on the data. The residual score of XGBoost is not as close to
zero as that of Linear Regression, but it is much closer to zero than KNN's value
(-0.000533 compared to -0.000490 and -0.002048, respectively). However, the R2
value alone can be sufficient to illustrate which model is better for each data set,
since the R2 values are greater when the RMSE values are lower. Therefore, for
these datasets, the XGBoost model has the best R2 value and the lowest RMSE
score, which makes it the best model to predict apartment prices in Ho Chi Minh
City.
Table 9. Evaluation score summary

Model | Residual | RMSE | R2 score
Linear Regression | -0.000490 | 0.074578 | 0.799820
XGBoost | -0.000533 | 0.060356 | 0.868886
KNN | -0.002048 | 0.066644 | 0.840146
4.2.3 Classification model
Figure 33. Decision tree (number 10)
Figure 33 shows one of the decision trees (number 10) from the created random
forest model. This tree has a non-uniform partitioning depth, with 26 internal nodes
(including the root node) and 34 leaf nodes. The numbers and characters in the top
row of each node represent the splitting condition chosen by the best splitting
criterion. The Gini index of a node, as defined by Hosein Shahnas, is a numerical
value representing the quality of a node's split on a variable (feature) (Shahnas,
2020). The numbers in the third row represent the number of observations assigned
to the node, while the numbers in the fourth row represent the number of
observations separated into each class. Finally, the value in the last row indicates
the apartment segment class. All 2494 observations in the training set were assigned
to the leaf nodes of the decision tree at the end of the tree-growing procedure.
In a leaf node, the class value indicates the predicted apartment class for the
observations that reach it. For example, the class of the 5 samples in the first node
of the last row indicates that an apartment with a price of less than 1.79 billion, no
swimming pool, a distance to the railway of less than 7.6 kilometers, and no gym
belongs to class C.
The tree depicts how several key independent variables in the training set
separate apartment segments. The top or root node of the tree in Figure 33
represents all of the values in the training set used by estimator 10. The splitting
variable in the initial split, which tests whether the price is lower than 2.371 billion,
is chosen as the best at this node by comparing the classification results given by
all independent fields (variables), as it produces the greatest reduction in node
impurity. This variable is the best at distinguishing across categories in the target
field. The two nodes at the second level of the tree, from left to right, reflect other
criteria for separation (price less than 1.79 billion and no swimming pool,
respectively). We can compare the impact of the independent factors on the target
field across categories using this partitioning. The splitting rule that makes the
initial split is frequently regarded as the brains of decision tree algorithms (Berry
and Linoff, 2000). This matches the reality of housing transactions in Ho Chi Minh
City. The nodes at the following levels further classify the apartment segments,
with the leaves representing the final sorted values.
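A sketch of how a single tree such as estimator number 10 can be drawn from the fitted forest is given below; the forest object and column names are carried over from the earlier sketches.

# Sketch: plot one decision tree (estimator 10) from the random forest.
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
plot_tree(rf_clf.estimators_[10],
          feature_names=list(Xc_train.columns),
          class_names=[str(c) for c in rf_clf.classes_],
          filled=True, fontsize=6)
plt.show()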
Table 10. Classification report
The confusion matrix in Figure 29 and the classification report in Table 10
show that the random forest classifier algorithm accurately classifies the data set
with an average accuracy of more than 81 percent. The misclassified samples
mostly contain attribute combinations that are comparable to those seen in other
classes.
Furthermore, the samples that are misclassified are often near to the true class,
indicating that the class ordering is significant. If the majority of the misclassified
samples were in classes that were further away from the diagonal, the ordering
may have been considered pointless in the sense that there was no evident basis
for the particular ordering.
Besides, from the training dataset, a random forest classification model was
created to aid in the investigation of the relationship between resale prices of Ho
Chi Minh city apartments and housing characteristics, as well as the identification
of which characteristics are significant in predicting resale prices. Based on rules
given in terms of the independent variables, random forest algorithms execute
several tests and generate the optimum sequence for regressing and forecasting the
dependent variable. These tests determine the optimal splitters, which successively
partition the training data until they reach terminal (leaf) nodes.
Figure 34. 5 main factors affecting classification
Using the suggested random forest technique, the created random forest reveals
that Price, Gym, Swimming pool, Distance to central and
Price_per_meter_each_district are the key variables that affect the classification
of the apartment class values.
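The ranking behind Figure 34 corresponds to the impurity-based feature importances of the fitted forest, which can be listed as sketched below (object names assumed from the earlier sketches).

# Sketch: top factors by impurity-based feature importance.
import pandas as pd

importances = pd.Series(rf_clf.feature_importances_, index=Xc_train.columns)
print(importances.sort_values(ascending=False).head(5))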
CHAPTER 5: CONCLUSION
5.1 Conclusion
- Objective 1: Collecting and identifying the quantitative factors that can affect
apartment prices via apartment buildings’ information.
Based on scientific theory, the samples collected through the biggest real estate
exchange, and the results of previous studies, the author has built an OLS regression
model to analyze the impact of the factors affecting apartment prices in Ho Chi
Minh City.
Through quantitative analysis based on a dataset of 4882 observed samples
collected through an experimental survey at apartment buildings in 20 districts
(excluding Cu Chi and Hoc Mon), the research results showed that apartment
prices in HCMC are affected by 11 factors: Bathroom, Normalized_Meter, Policy,
Furniture, Price_per_meter_each_district (b/m2), Swimming_pool, Gym, Density
(people/km2), distance_to_central, distance_to_airport and distance_to_hospital.
However, the regression model in this study is built on the factors affecting
apartment prices in HCMC, so the selected variables may change when building
the model for other areas.
Objective 2: Applying the SEMMA process to building regression models for
predicting apartment prices and comparing the accuracy of these models.
The purpose of this study is to assess and compare the performance of three
popular machine learning regression models for apartment price prediction in
Ho Chi Minh City: Linear Regression, K-Nearest Neighbors, and XGBoost. When
the same data set with 12 attributes (including features and target) was applied, it
was discovered that XGBoost provided more accurate predictions. Significantly,
the XGBoost model outperformed the other models in terms of accuracy for a
large data set of approximately 4882 values. The XGBoost model was able to
achieve good performance with an accuracy of up to 89%.
Machine learning is considered to be effective for housing price prediction, in
this case apartment price prediction, in normal scenarios, but deviates during
exceptional events. Further enhancements and model selections could improve
future performance and make it a valuable tool for decision-makers. Nonetheless,
there is a lot of uncertainty about housing prices; for example, the factors that
could cause a real estate bubble, especially in Vietnam, affect the overall
performance. Existing models have not yet adequately reflected these
uncertainties. We hope that this uncertainty can be removed in future predictions;
for now, it remains a barrier to machine learning in general.
Objective 3: Classifying all of the apartments into each apartment's
segment based on the affecting factors.
A random forest system uses multiple decision trees, each of which has three
types of nodes: root nodes, decision nodes, and leaf nodes. The decision tree's final
output is represented by its leaf nodes, and a majority vote is used to pick the final
output: the random forest system's final output is the output picked by the majority
of the decision trees. As a result, the random forest algorithm predicts the outcome
more accurately than a single decision tree. For the reasons stated above, the
random forest algorithm is an alternative exploratory data analysis tool for
examining the link between home prices and a variety of housing factors and for
identifying the important determinants of housing prices.
The results of the study, based on the Random Forest method with 83%
accuracy, show that price, apartment building amenities such as a gym and a
swimming pool, proximity to the city center, and the average district price
(Price_per_meter_each_district) are important variables in deciding which
apartments to buy or invest in.
5.2 Research Meaning
Research results show that there is a big difference between apartment prices in
different districts. The price of apartments in the central area is much higher than
in other areas, which is a special point of attention for investors because this is the
main factor creating the phenomenon of real estate price bubbles in general or
apartment prices in Ho Chi Minh City in particular. When the demand for housing
in the central areas is too high, the price of apartments in these areas rises too high
compared to its real value. However, at a certain point when the market is
saturated, the liquidity of apartments in this area will no longer be high which leads
to a serious price drop that can cause the apartment market here to be strongly
affected. Therefore, investors should depend on their capital sources and
investment purposes to make an appropriate choice.
Besides, the experimental results show that when an apartment complex has
many internal amenities, the price also increases: by 16.2% for a swimming pool
and 14.3% for a gym. This is a point worth noting for project investors when
designing a project, especially regarding the planning of amenity areas inside the
project and ensuring the completion of the apartment's legal status, in order to
enhance both the value of the apartment and the investor's reputation. For real
estate investors, when investing in apartments, to ensure high profit potential, they
also need to choose apartments with many amenities.
Machine learning algorithms, which are used here to predict prices and classify
apartments, are an effective tool for supporting the decision making of investors
and real estate companies. In addition, if the above models are used appropriately,
investors can use them as a tool to appraise apartment prices in an area, thereby
investing at the right price and making more profit. Housing and real estate market
management agencies can also use the models to monitor and issue policies that
keep apartment prices in particular, and real estate in general, stable, and to prevent
real estate bubbles from growing larger.
5.3 Limitation
The study was completed in a short amount of time, and the scope of the
investigation was confined to only 20 districts of Ho Chi Minh City rather than
the full Ho Chi Minh City region. Despite the direct price comparison with
reference data from investors, the reliance on secondary data collected from real
estate exchanges still limits the data quality. Because the number of survey samples
is small, it does not reflect all areas of HCMC. The factors affecting the price of
apartments in HCMC used in the model are not comprehensive; for example, for
some apartments that were built a long time ago, there is no information about the
owner, project name, or construction year provided by the Department of Statistics.
So, not only has it become hard to predict with great precision, but it has also
become difficult to categorize residences more precisely. The study only uses
Hedonic regression models to find influencing factors, three price prediction
models and a random forest classifier algorithm to classify apartments.
Furthermore, while the above methodologies are commonly used in countries
around the world, research findings from each country are only applicable to that
country's geographical and cultural peculiarities. As a result, the application in
HCMC has a number of disadvantages.
5.4 Future Research
To deal with the study's limitations, the following future research directions are
suggested:
First of all, additional research may extend the scope of the study to include all
of HCMC, including the suburbs. As a result, the element of district location,
which may be separated into five areas (central district, east, west, south, and
north), needs to be expanded further. In addition, the research object can be
expanded to include not only apartments but also other types of real estate.
Secondly, other apartment-specific criteria may be included in future research,
such as property turnover, proximity to the river, guarantor bank reputation, policy
for foreign owners, state macro policy, regulations governing the purchase of flats
by foreigners, the manner and timing of payment under the contract, the quality of
the apartment's domestic water, monthly management fees, the apartment's age,
and the number of times the apartment has been transferred. Machine learning
models, fed with an ever-increasing supply of data, are expected to become
increasingly useful in the future.
Besides, having more data and more influencing elements, such as those
described above, can improve the model's accuracy. Exploring more methods for
predicting or classifying apartments is also worthwhile, in order to identify an
algorithm superior to the four algorithms used in this thesis.
REFERENCE
1. Zhang, Q. (2021). Housing Price Prediction Based on Multiple Linear Regression. Scientific Programming, vol. 2021, Article ID 7678931, 9 pages. https://doi.org/10.1155/2021/7678931
2. Le, M. (2021). Thực trạng thị trường nhà ở đô thị cho người thu nhập trung bình tại thành phố Hồ Chí Minh. PROCEEDINGS, 16(1), 77-90. doi: 10.46223/hcmcoujs.proc.vi.16.1.1858.2021
3. Vuong, Q. (2016). Các nhân tố ảnh hưởng đến giá nhà đất ở trên địa bàn thành phố Cần Thơ. Tạp Chí Khoa Học Thương Mại, 91.
4. Chen, X., Wei, L., & Xu, J. (2017). House Price Prediction Using LSTM. Hong Kong: The Hong Kong University of Science and Technology. Available at: https://arxiv.org/pdf/1709.08432.pdf
5. Nguyen, T. (2009). Thực trạng sử dụng phương pháp so sánh và phương pháp chi phí trong thẩm định giá bất động sản tại thành phố hồ chí minh. University of Economics HCMC. Available at: https://text.xemtailieu.net/tai-lieu/thuc-trang-su-dung-phuong-phap-so-sanh-vaphuong-phap-chi-phi-trong-tham-dinh-gia-bat-dong-san-tai-thanh-pho-ho-chiminh-81260.html
6. Temür, A. S., Akgün, M., & Temür, G. (2019). Predicting Housing Sales in Turkey Using ARIMA, LSTM and Hybrid Models. Journal of Business Economics and Management, 20(5), 920–938. doi: 10.3846/jbem.2019.10190
7. Burgt, v. (2017). Data engineering for house price prediction. Eindhoven University of Technology. Retrieved 15 December 2021, from https://pure.tue.nl/ws/portalfiles/portal/72297619/0831848_Burgt_v.d._E.J.T.G._thesis_CSE.pdf
8. Zhang, Z. (2016). Introduction to machine learning: k-nearest neighbors. Annals of Translational Medicine, 4(11), 218-218. Available at: https://www.researchgate.net/publication/303958989_Introduction_to_machine_learning_K-nearest_neighbors
9. Shi, Y. (2011). Comparing K-Nearest Neighbors and Potential Energy Method in classification problem. A case study using KNN applet by E.M. Mirkes and real-life benchmark data sets. Leicester: University of Leicester. Available at: https://arxiv.org/ftp/arxiv/papers/1211/1211.0879.pdf
10. Parsehub.com (2021). ParseHub. Available at: https://www.parsehub.com/intro
11. Ghilani, C. D. (2010). Adjustment computations: Spatial data analysis, 5th ed. Hoboken, New Jersey: Wiley. 672 pp.
12. MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Los Angeles: University of California. Retrieved from https://digitalassets.lib.berkeley.edu/math/ucb/text/math_s5_v1_article-17.pdf
13. Kim, H. (2019). Statistical notes for clinical researchers: simple linear regression 3 – residual analysis. Restorative Dentistry & Endodontics, 44(1).
14. Kim, H. (2013). Statistical notes for clinical researchers: assessing normal distribution (2) using skewness and kurtosis. Restorative Dentistry & Endodontics, 38(1), 52. doi: 10.5395/rde.2013.38.1.52
15. Schober, P., Boer, C., & Schwarte, L. (2018). Correlation Coefficients. Anesthesia & Analgesia, 126(5), 1763-1768. doi: 10.1213/ane.0000000000002864
16. Chicco, D., Warrens, M., & Jurman, G. (2021). The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Computer Science, 7, e623. doi: 10.7717/peerj-cs.623
17. Glantz, S., Slinker, B., & Neilands, T. (2000). Primer of applied regression & analysis of variance.
18. Ayodele, T. (2010). Types of Machine Learning Algorithms. New Advances in Machine Learning, 3.
19. Steel, R. G. D., & Torrie, J. H. (1960). Principles and Procedures of Statistics with Special Reference to the Biological Sciences. McGraw Hill.
20. Amaratunga, D., Cabrera, J., & Lee, Y.-S. (2008). Enriched random forests. Bioinformatics, 24(18), 2010–2014.
21. Sokolova, M., & Lapalme, G. (2009). A systematic analysis of performance measures for classification tasks. Information Processing & Management, 45(4), 427-437. doi: 10.1016/j.ipm.2009.03.002
22. Field, A. (2009). Discovering statistics using IBM SPSS statistics (3rd ed.). London: Sage.
Appendix
Appendix 1. Statistical Description of feature variables
Statistic | Bathroom | Normalized_Meter | Policy | Furniture | Price_per_meter_each_district (b/m2)
count | 4882 | 4882 | 4882 | 4882 | 4882
mean | 1.78 | 66.57 | 0.42 | 0.51 | 0.12
std | 0.53 | 15.52 | 0.49 | 0.50 | 0.18
min | 1.00 | 23.00 | 0.00 | 0.00 | 0.03
25% | 1.00 | 56.00 | 0.00 | 0.00 | 0.03
50% | 2.00 | 67.00 | 0.00 | 1.00 | 0.04
75% | 2.00 | 76.68 | 1.00 | 1.00 | 0.06
max | 4.00 | 100.00 | 1.00 | 1.00 | 0.56
Appendix 2. Statistical Description of feature variables (continued)
Statistic | Density (People/km2) | distance_to_central | distance_to_airport | Distance_to_hospital | log_price
count | 4882 | 4882 | 4882 | 4882 | 4882.00
mean | 17596.93 | 7.62 | 9.55 | 2.59 | 0.44
std | 16428.52 | 3.8 | 4 | 27.89 | 0.16
min | 16 | 0.49 | 0.35 | 0.04 | 0.03
25% | 3482 | 5.02 | 6.49 | 0.91 | 0.32
50% | 9855 | 7.39 | 9.4 | 1.76 | 0.43
75% | 28922 | 10.45 | 12.14 | 2.56 | 0.56
max | 65113 | 18 | 19.92 | 1126.24 | 0.90
Appendix 3. Data correlation table of apartments data
Bedroom
Bathroom
Normalized_Meter
Policy
Furniture
Swimming_pool
Gym
Density
(People/km2)
log_price
lat
long
distance_to_
central
distance_to_
airport
distance_to_
railway
Distance_to_
highschool
Distance_to_
hospital
0,096
Price_per_meter_
each_district
(b/m2)
-0,094
Bedroom
1,000
0,733
0,730
-0,111
-0,201
0,101
-0,015
0,290
0,072
0,124
0,047
-0,028
-0,008
-0,040
-0,043
Bathroom
0,733
1,000
0,670
-0,113
0,035
-0,071
-0,120
0,038
-0,068
0,340
0,037
0,129
0,065
-0,037
0,010
-0,035
-0,037
Normalized_Meter
0,730
0,670
1,000
-0,085
0,073
-0,020
-0,105
0,110
0,027
0,524
0,114
0,125
-0,082
-0,088
-0,108
-0,015
-0,018
Policy
-0,111
-0,113
-0,085
1,000
0,032
0,160
0,138
0,124
-0,094
0,064
0,025
0,159
0,105
0,178
0,163
-0,020
-0,015
Furniture
0,096
0,035
0,073
0,032
1,000
0,028
0,184
0,116
0,029
0,182
0,012
0,032
-0,063
-0,028
-0,047
0,022
0,024
Price_per_meter_each_district
(b/m2)
-0,094
-0,071
-0,020
0,160
0,028
1,000
0,283
0,106
-0,324
0,196
0,069
0,338
-0,170
0,062
0,010
0,001
0,000
Swimming_pool
-0,201
-0,120
-0,105
0,138
0,184
0,283
1,000
0,340
0,199
0,605
0,027
0,135
-0,395
-0,195
-0,296
0,022
0,027
Gym
-0,101
-0,038
-0,110
0,124
0,116
0,106
0,340
1,000
0,089
0,447
0,058
0,194
-0,190
-0,034
-0,103
0,007
0,013
Density (People/km2)
-0,015
-0,068
0,027
-0,094
0,029
-0,324
0,199
0,089
1,000
0,230
-0,592
-0,665
0,003
-0,012
0,290
0,340
0,524
0,064
0,182
0,196
0,605
0,447
0,230
1,000
0,445
0,065
-0,499
log_price
-0,414
-0,231
-0,338
0,006
0,008
lat
-0,072
-0,037
-0,114
0,025
0,012
0,069
0,027
0,058
-0,073
-0,026
0,073
0,026
1,000
0,303
0,357
-0,186
0,263
-0,006
0,004
long
-0,124
-0,129
-0,125
0,159
0,032
0,338
0,135
0,194
-0,445
0,065
0,303
1,000
0,185
0,619
0,544
0,004
0,024
distance_to_central
0,047
0,065
-0,082
0,105
-0,063
-0,170
-0,395
0,190
-0,499
-0,414
0,357
0,185
1,000
0,591
0,894
-0,039
-0,020
distance_to_airport
-0,028
-0,037
-0,088
0,178
-0,028
0,062
-0,195
0,034
-0,592
-0,231
0,186
0,619
0,591
1,000
0,846
-0,010
0,011
distance_to_railway
-0,008
0,010
-0,108
0,163
-0,047
0,010
-0,296
0,103
-0,665
-0,338
0,263
0,544
0,894
0,846
1,000
-0,024
0,000
Distance_to_highschool
-0,040
-0,035
-0,015
-0,020
0,022
0,001
0,022
0,007
0,003
0,006
0,006
0,004
-0,039
-0,010
-0,024
1,000
0,999
Distance_to_hospital
-0,043
-0,037
-0,018
-0,015
0,024
0,000
0,027
0,013
-0,012
0,008
0,004
0,024
-0,020
0,011
0,000
0,999
1,000
Appendix 4. Names and descriptions of all variables after preprocessing

Index | Name of attributes | Description | Data Type
1 | Title | Product introduction title | Object
2 | ID | Address of real estate | Object
3 | District | Location of apartment | Object
4 | Total_Square | Area of real estate | Object
5 | Bedroom | Number of bedrooms | Int64
6 | Bathroom | Number of bathrooms | Int64
7 | Price | Apartment price | Object
8 | Normalized_Price | Price after preprocessing | Float
9 | Normalized_Meter | Total area of real estate after preprocessing | Float64
10 | Policy | The current certificate of ownership | Int64
11 | Furniture | Furniture provided with the real estate | Int64
12 | Swimming_Pool | Swimming pool provided with the real estate | Int64
13 | Gym | Gym facilities of the apartment | Int64
14 | Price Per Meter Each District (B/M2) | The average price per square meter | Float64
15 | Population Density | The population density of each district | Float64
16 | Lat | Latitude of the apartments | Float64
17 | Long | Longitude of the apartments | Float64
18 | Project Name | Name of Apartment project | Object
19 | segment | Apartment's type of segment | Category
20 | Distance_to_airport | Distance to airport | Float
21 | Distance_to_train_station | Distance to train station | Float
22 | Distance_to_school | Distance to school | Float
23 | Distance_to_hospital | Distance to hospital | Float
24 | Log_Price | Normalized price after log transformation | Float
Appendix 5. VIF Score
Index | Variable | VIF
0 | Bathroom | 23.70975569
1 | Normalized_Meter | 30.43331151
2 | Policy | 1.920238579
3 | Furniture | 2.136332285
4 | Price_per_meter_each_district (b/m2) | 1.911938943
5 | Swimming_pool | 2.494362889
6 | Gym | 8.031218678
7 | Density (People/km2) | 3.26741299
8 | distance_to_central | 8.872648525
9 | distance_to_airport | 11.35895792
10 | Distance_to_hospital | 1.012099197