Machine Learning project

Property Monthly Rent Analysis in Klang Valley
Nur Alysha Ismady, Jazmina Huda Khairuddin, Azami Shamsudin, Yong Han Syern, Nur Sarah Mohd Suhaimi
Computer and Information Science Department, Prepared for Dr. Shakirah Taib
Abstract
Tenants, landlords and researchers all have a keen interest in the rental market in Klang Valley, Malaysia. This study provides a general overview of the current trends, factors and dynamics influencing rental rates in Klang Valley. Kuala Lumpur and its neighbouring suburbs such as Klang, Puchong, and Shah Alam make up the Klang Valley, which serves as a significant economic and cultural centre. Renters, real estate professionals, and policymakers must be able to make informed decisions based on current local rental price changes. This paper compares two clustering algorithms, K-Means and DBSCAN, on historical rental data to identify key factors influencing rental rates in Klang Valley.
I. Introduction
Variations in rental costs in Klang Valley can be caused by various factors, including overall economic conditions, supply and demand dynamics, and improvements in transportation infrastructure. Economic conditions play a crucial role, as a thriving economy may drive up demand for rental properties, leading to higher prices. Conversely, during economic downturns, rental prices may stabilize or even decrease as demand falls. Furthermore, rental prices in Klang Valley vary across a wide range of property types, from affordable houses and apartments to luxurious condominiums.
Location, amenities and property characteristics affect rental pricing as well. Within Klang Valley, there are large pricing differences between districts. Because of their proximity to business centers, entertainment hubs, and public transportation, Kuala Lumpur's prime locations command higher rents, whereas suburban neighborhoods provide more affordable rental options. Therefore, research and analysis of rental pricing trends will be essential for guiding strategies as Klang Valley continues to develop, and for ensuring a sustainable and balanced rental market for both residents and landlords.
II. Literature Review
Population growth in Klang Valley has
been significantly driven by job opportunities. The
Klang Valley, which includes Kuala Lumpur and its
surrounding areas, is the economic and financial hub
of Malaysia. It offers a wide range of job
opportunities across various industries, such as
finance, technology, manufacturing, services, and
tourism. Housing prices, including rental prices, have
generally seen an upward trend over the years, driven by
factors such as population growth, increasing
demand for urban living and limited land supply.
The mismatch between income levels and cost of
living in urban areas has become a challenge for
many individuals and families. The infrastructure
and economic activities in urban and rural areas
differ and have led to variations in the cost of living
[1].
Rapidly increasing housing prices have
become a common issue for Malaysians. As house
prices rise, rental prices tend to follow suit. Before
2008, a terrace house in Kuala Lumpur could be
purchased for RM250,000 to RM450,000, but by
2012, the cost rose to RM500,000 to RM900,000.
Factors contributing to the escalating housing prices
in Klang Valley include changes in household
structure due to fast urbanization, increased
construction costs, housing speculation, and
proximity to workplaces and facilities. The impact
of inflation rate on housing prices is considered
negligible [2].
The high rental prices in Klang Valley do
not only affect Malaysian citizens but also impact
immigrant workers in Malaysia. Malaysia is known
for being a significant destination for immigrant
labor from various Asian countries,
including Pakistan. However, there
have been few studies conducted to understand the
barriers and challenges faced by Pakistani
immigrants in Malaysia [3]. One of the major
obstacles they encounter is the expensive cost of
rent, which prevents them from finding suitable
housing in good condition [3]. They often find the
rental prices shocking and challenging to afford.
Despite their hopes of finding cheaper rental options
in the housing market, the reality proves to be
difficult for them.
The scarcity of affordable housing leading
to a demand-supply gap has emerged as a critical
issue in Malaysia. If this problem remains
unresolved, it will worsen the challenges of house
ownership and accommodation in numerous urban
areas across the country [4]. The primary factors
contributing to the demand-supply gap include the
high cost of materials, unfavorable government
policies, and insufficient control and monitoring by
the government on the type of housing being built
[4]. As a result, housing prices have seen a sharp
increase, making it difficult for potential buyers to
secure bank loans. Additionally, the rising demand
for rental properties has driven up rental fees in
cities, leading to more people being forced to rent
rather than buy. Furthermore, certain urban
residential areas have become overcrowded due to
the limited availability of suitable housing options.
These challenges need to be addressed to alleviate
the housing crisis and ensure a more sustainable
housing market in Malaysia.
III. Methodology
Machine learning models can help tenants and property owners make informed decisions by using data-driven methods to forecast future trends and provide insight into the factors influencing rental pricing, particularly in Klang Valley. The dataset was retrieved from Kaggle and contains rental listings for the Malaysian regions of Selangor and Kuala Lumpur, all originally compiled from “mudah.my”. It consists of 13 attributes and over 20,000 records. Table 1 describes each attribute.
Table 1. Data description
Variable                Description
ads_id                  The listing ID (unique)
prop_name               Name of property or building
completion_year         Year the property was completed
monthly_rent            Monthly rent in Malaysian Ringgit (MYR)
location                Property location in Kuala Lumpur Region
property_type           Property type: apartment, condominium, flat, duplex, studio, etc.
rooms                   Number of rooms available in each unit
parking                 Number of parking spaces for each unit
bathroom                Number of bathrooms in each unit
size                    Total area of the unit (SQFT)
furnished               Furnishing status of the unit (Fully, Partial or Non-furnished)
facilities              Main facilities available
additional_facilities   Proximity to attractions, malls, schools, etc.
The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology is used in the machine learning process, as it provides a structured and comprehensive framework that guides the development of machine learning models and ensures that all approaches are aligned with the problem.
Business Understanding: The project’s objectives are established in this phase. Python with Google Colaboratory is the primary tool used to carry out the project. The data is analyzed and visualized using Python libraries such as pandas, matplotlib and missingno. This phase also involves determining the source from which the data will be acquired.
Data Understanding: The acquired dataset has 13 variables with integer, float, and object data types. This phase involves examining and reviewing the quality, structure and content of the dataset that will be used for analysis. It entails gathering the dataset from the identified data source, understanding its form, identifying problems with its overall quality, and judging its completeness.
Data Preparation: Data cleansing was done to improve data quality by identifying null values and eliminating errors and discrepancies. Missing or incorrect data in the dataset would lead to inaccurate results. The datatype of each variable was checked to ensure that every variable in the dataset is associated with the correct and appropriate datatype. Outliers were also identified in the dataset. However, the outliers were kept because they may represent valid and significant observations.
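The checks described above can be sketched roughly as follows. This is a minimal illustration on a hypothetical four-row sample, not the project's actual notebook code: the column names follow Table 1, but the values and the choice to drop rows without a rent value are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample mimicking a few columns from Table 1 (values are made up).
df = pd.DataFrame({
    "monthly_rent": [1500.0, 2200.0, np.nan, 1800.0],
    "size": [850.0, 1100.0, 700.0, np.nan],
    "furnished": ["Fully Furnished", None, "Not Furnished", "Partially Furnished"],
})

# Count missing values per column to judge completeness.
missing_per_column = df.isna().sum()

# Inspect the data type pandas assigned to each column.
column_dtypes = df.dtypes

# One possible cleaning choice (an assumption): drop rows without a rent value.
cleaned = df.dropna(subset=["monthly_rent"])

print(missing_per_column)
print(column_dtypes)
print(len(cleaned))
```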
Furthermore, the data pre-processing steps included feature engineering, chiefly scaling and encoding. Min-Max scaling, Standard scaling and Robust scaling were applied to ensure that features are on a similar scale. For categorical data, One-Hot encoding was applied to convert each category into a binary feature, and a Label encoder was used to assign a unique integer value to each category.
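As a rough sketch of these scaling and encoding steps, the example below uses scikit-learn and pandas on a hypothetical five-row sample. The derived names robust_monthly_rent and furnished_Label match the feature names reported in the results, while the other new column names and all values are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import (
    LabelEncoder,
    MinMaxScaler,
    RobustScaler,
    StandardScaler,
)

# Hypothetical sample; the 9000 entry plays the role of an outlier that was kept.
df = pd.DataFrame({
    "monthly_rent": [1200.0, 1500.0, 1800.0, 2500.0, 9000.0],
    "furnished": ["Fully", "Partial", "Non", "Fully", "Partial"],
})

rent = df[["monthly_rent"]]  # scalers expect a 2-D input

# Min-Max scaling maps the column onto [0, 1].
df["minmax_monthly_rent"] = MinMaxScaler().fit_transform(rent).ravel()
# Standard scaling centres on the mean and divides by the standard deviation.
df["standard_monthly_rent"] = StandardScaler().fit_transform(rent).ravel()
# Robust scaling centres on the median and divides by the interquartile range,
# so a kept outlier has less influence on the scale.
df["robust_monthly_rent"] = RobustScaler().fit_transform(rent).ravel()

# Label encoding assigns one integer per category (alphabetical order).
df["furnished_Label"] = LabelEncoder().fit_transform(df["furnished"])
# One-Hot encoding creates one binary column per category.
one_hot = pd.get_dummies(df["furnished"], prefix="furnished")

print(df)
print(one_hot)
```

Robust scaling is the least sensitive of the three to extreme rents, which may be why the robust-scaled rent and size features appear among the clustering inputs.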
Modelling: K-means and DBSCAN are popular unsupervised machine learning clustering algorithms designed to group similar data points based on their features. K-means is a centroid-based algorithm that divides data into a predefined number of K clusters. The algorithm iteratively finds centroids that minimize the squared distances between data points and their assigned centroid. Its effectiveness relies on the choice of "k", which must be specified before any clustering analysis is conducted [5]. In contrast to well-known clustering algorithms like K-means, DBSCAN is a density-based algorithm that clusters data points based on their density and does not require predefining or restricting the number of clusters or classes [6]. Instead, it uses epsilon (ε) as the radius within which to search for neighboring points and minPts as the minimum number of points needed to form a dense region.
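For reference, the quantity that K-means minimizes can be written in the standard textbook form below. The symbols are notation introduced here for illustration: C_k is the set of points assigned to cluster k and \mu_k is that cluster's centroid.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,
\qquad
\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```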
For this study, the K-means algorithm was
set to have 5 clusters (K=5), and the DBSCAN
algorithm used an epsilon (ε) of 0.5 and a minimum
number of samples (min_samples) of 5.
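A minimal sketch of this configuration with scikit-learn is shown below. The parameter values (K=5, eps=0.5, min_samples=5) come from the text, while the synthetic blob data, the n_init value and the random_state values are assumptions that stand in for the prepared rental features.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the scaled feature matrix (assumption, not the rental data).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0], [10.0, 10.0]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.3, random_state=0)

# K-means with the number of clusters fixed in advance (K=5).
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)
kmeans_labels = kmeans.fit_predict(X)

# DBSCAN with a neighbourhood radius of 0.5 and at least 5 samples per dense region.
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# DBSCAN marks points that belong to no dense region with the label -1.
n_dbscan_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))

print("K-means clusters:", len(set(kmeans_labels)))
print("DBSCAN clusters:", n_dbscan_clusters, "noise points:", n_noise)
```

Because eps is a distance in feature space, the same value of 0.5 can behave very differently depending on which scaling was applied to the features.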
Evaluation: The performance of the K-means and DBSCAN algorithms is assessed using two metrics for unsupervised machine learning: the Silhouette Score and the Davies-Bouldin Index. These metrics measure how well the data points are grouped in terms of compactness within clusters and separation between clusters. The Silhouette Score is the average silhouette coefficient over all data points, where each coefficient represents how similar a point is to its own cluster compared to other clusters. The coefficient ranges from -1 to +1, with higher values indicating better clustering.
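In its standard definition, the silhouette coefficient of a point i and the overall Silhouette Score S can be written as below. The notation is introduced here for illustration: a(i) is the mean distance from point i to the other points in its own cluster, b(i) is the smallest mean distance from point i to the points of any other cluster, and n is the number of data points.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)
```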
On the other hand, the Davies-Bouldin Index evaluates the average similarity between each cluster and its most similar cluster. It considers both within-cluster distances and distances between clusters. A lower Davies-Bouldin Index indicates that the clusters are better defined and more distinct from each other. Together, these metrics provide insight into the quality of clustering results and assist in comparing different algorithms or determining the optimal number of clusters for a given dataset.
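Both metrics are available in scikit-learn, and a hedged sketch of how they might be computed is given below. The synthetic data and the helper score_clustering are assumptions for illustration. The scores this produces on well-separated blobs are not comparable to the scores reported for the rental dataset.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

# Synthetic stand-in for the prepared feature matrix (assumption).
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0], [5.0, 0.0], [10.0, 10.0]])
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.3, random_state=0)

kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)


def score_clustering(features, labels):
    """Hypothetical helper: return (silhouette, davies_bouldin), or None
    when fewer than two distinct labels exist and the metrics are undefined."""
    if len(set(labels.tolist())) < 2:
        return None
    return (
        float(silhouette_score(features, labels)),
        float(davies_bouldin_score(features, labels)),
    )


# Note: both functions treat DBSCAN's noise label (-1) as an ordinary cluster label.
kmeans_scores = score_clustering(X, kmeans_labels)
dbscan_scores = score_clustering(X, dbscan_labels)

print("K-means (silhouette, Davies-Bouldin):", kmeans_scores)
print("DBSCAN  (silhouette, Davies-Bouldin):", dbscan_scores)
```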
Deployment: The model developed for this project was not designed for deployment in a production environment. It was primarily focused on data exploration, analysis, and research, prioritizing accuracy and interpretability over deployment efficiency. The insights gained from this model can serve as a foundation for future iterations of deployment-ready models, if needed.
IV. Results and Discussion
Table 2: K-Means Silhouette and Davies-Bouldin Index Scores
Metric                          Score
K-Means Silhouette Score        0.2816401271565697
K-Means Davies-Bouldin Index    1.1741890620961124
Table 2 shows the K-Means Silhouette and Davies-Bouldin Index scores. K-Means performed reasonably well, with some cluster separation: the Davies-Bouldin Index is approximately 1.17 and the Silhouette Score is approximately 0.28. The Davies-Bouldin Index of 1.17 suggests moderately well-defined clusters, and the Silhouette Score of 0.28 indicates a modest degree of separation between clusters. Although the scores are not high, the clusters produced by K-Means can be distinguished to some extent.
Hence, K-Means can be considered reasonably reliable and scalable for this dataset.
Table 3: DBSCAN Silhouette and Davies-Bouldin Index Scores
Metric                          Score
DBSCAN Silhouette Score         0.00124851598085306
DBSCAN Davies-Bouldin Index     1.4344434732612028
DBSCAN, on the other hand, did not produce clear clusters for the dataset. Based on Table 3, the DBSCAN Silhouette Score is approximately 0.00125, which is very close to 0. This indicates that the clusters are not well-defined, overlap with one another, or both. In addition, the DBSCAN Davies-Bouldin Index is approximately 1.43, higher than the 1.17 obtained by K-Means, which means that the clusters are less well-separated than desired. Taken together, the low Silhouette Score and the higher Davies-Bouldin Index suggest that DBSCAN did not perform well on this dataset.
Figure 2: DBSCAN Cluster based on
Combined Features
Figure 2 shows that the clusters lie on or close to 0 on the graph. DBSCAN groups points that are close together and have enough nearby neighbors into a cluster. Some data points may exhibit characteristics that make their cluster membership ambiguous, as they may share similar features or attributes with more than one cluster.
Based on five features (robust_monthly_rent, robust_size, region_Label, furnished_Label and completion_year), the clustering did not reveal significant factors affecting house rental prices in Klang Valley. The rental prices of properties in different clusters are similar, as the clusters overlap with one another. Factors that drive rental demand include the job market in Klang Valley and lifestyle preferences. The overlap may indicate that properties in Klang Valley share similar characteristics, such as number of bedrooms, amenities or location, which contribute to comparable rental prices. However, rental prices can fluctuate over time due to changes in demand, economic conditions or lifestyle trends.
Figure 1: K-Means Cluster based on Multiple Features
Figure 1 shows that the data points overlap with one another, although cluster 3 shows some separation from the others.
V. Conclusion
Rental price analysis often involves exploring large datasets with multiple variables. Unsupervised learning algorithms such as K-Means and DBSCAN clustering can help identify patterns and group similar properties based on various features, providing a deeper understanding of rental market trends. This research has compared the performance of the two clustering algorithms, shedding light on the strengths and weaknesses of each. In addition, data exploration, pre-processing, and feature engineering techniques were applied to improve the output of the algorithms.
While both K-Means and DBSCAN provided insights into the rental market, the K-Means algorithm demonstrated more promising results, with better-separated clusters. Nonetheless, further exploration and analysis may be required to identify and understand the underlying factors influencing the rental market in Klang Valley more comprehensively. Additionally, other machine learning algorithms can be explored on this dataset to determine their effectiveness in rental price analysis. In conclusion, the study's findings provide a foundation for further investigation and open possibilities for exploring different machine learning approaches to enhance rental price analysis.
References
[1] S. Pinjaman and M. Kogid, "Macroeconomic Determinants of House Prices in Malaysia," Jurnal Ekonomi Malaysia, vol. 54, no. 1, pp. 153-154, 2020. http://dx.doi.org/10.17576/JEM-2020-5401-11
[2] P. A. Mariadas, M. Selvanathan and T. K. Hong, "A Study on Housing Price in Klang Valley, Malaysia," International Business Research, vol. 9, no. 12, November 2016. doi:10.5539/ibr.v9n12p103
[3] T. Zermina, M. N. Ajis and N. A. Zainal Abidin, "Examining the Housing Experiences in Malaysia: a Qualitative Research on Pakistani Immigrant Labours," Journal of International Migration and Integration, vol. 21, pp. 241–251, 2020. doi:10.1007/s12134-019-00723-7
[4] A. P. J. Chan and B. H. C. Lee, "A Study on Factors Causing the Demand-Supply Gap of Affordable Housing," INTI Journal Special Edition – Built Environment, pp. 6-10, 2016. http://eprints.intimal.edu.my/600/1/EA%20-%201.pdf
[5] M. Ahmed, R. Seraj and S. M. Shamsul Islam, "The k-means Algorithm: A Comprehensive Survey and Performance Evaluation," Electronics, vol. 9, no. 8, p. 1295, 2020. https://doi.org/10.3390/electronics9081295
[6] J. C. Perafan-Lopez, V. L. Ferrer-Gregory, C. Nieto-Londoño and J. Sierra-Pérez, "Performance Analysis and Architecture of a Clustering Hybrid Algorithm Called FA+GA-DBSCAN Using Artificial Datasets," Entropy (Basel), vol. 24, no. 7, p. 875, 2022. doi:10.3390/e24070875
Google Colab Link: https://colab.research.google.com/drive/1jWnn7MeXfF-Lt6yn-mP521NTf1lWP5l-?usp=sharing