Property Monthly Rent Analysis in Klang Valley

Nur Alysha Ismady, Jazmina Huda Khairuddin, Azami Shamsudin, Yong Han Syern, Nur Sarah Mohd Suhaimi
Computer and Information Science Department
Prepared for Dr. Shakirah Taib

Abstract

Tenants, landlords and researchers all have a keen interest in the rental market in Klang Valley, Malaysia. This study aims to provide a general overview of the current trends, factors and dynamics influencing rental rates in Klang Valley. Kuala Lumpur and its neighbouring suburbs such as Klang, Puchong, and Shah Alam make up the Klang Valley, which serves as a significant economic and cultural centre. Renters, real estate professionals, and policymakers must be able to make informed decisions based on current local rental price changes. This paper compares two machine learning algorithms, K-Means and DBSCAN, to analyze historical rental data and identify key factors influencing rental rates in Klang Valley through clustering.

I. Introduction

Variations in rental costs in Klang Valley can be caused by various factors. These factors include overall economic conditions, supply and demand dynamics, and improvements in transportation infrastructure. Economic conditions play a crucial role, as a thriving economy may drive up demand for rental properties, leading to higher prices. Conversely, during economic downturns, rental prices may stabilize or even decrease as demand falls. Furthermore, rental properties in Klang Valley span a wide range of property types, from affordable houses and apartments to luxurious condominiums. Location, amenities and property characteristics affect rental pricing as well. Within Klang Valley, there are large pricing differences between districts. Due to their proximity to business centers and entertainment hubs and their easy access to public transportation,
Kuala Lumpur’s prime locations charge higher rents, whereas suburban neighborhoods provide more affordable rental options. Therefore, research and analysis of rental pricing trends will be essential for guiding strategies as Klang Valley continues to develop and for ensuring a sustainable and balanced rental market for both residents and landlords.

II. Literature Review

Population growth in Klang Valley has been significantly driven by job opportunities. The Klang Valley, which includes Kuala Lumpur and its surrounding areas, is the economic and financial hub of Malaysia. It offers a wide range of job opportunities across various industries, such as finance, technology, manufacturing, services, and tourism. Housing prices, including rental prices, have generally followed an upward trend over the years, driven by factors such as population growth, increasing demand for urban living and limited land supply. The mismatch between income levels and the cost of living in urban areas has become a challenge for many individuals and families. Differences in infrastructure and economic activity between urban and rural areas have led to variations in the cost of living [1].

Rapidly increasing housing prices have become a common issue for Malaysians. As house prices rise, rental prices tend to follow suit. Before 2008, a terrace house in Kuala Lumpur could be purchased for RM250,000 to RM450,000, but by 2012 the cost had risen to RM500,000 to RM900,000. Factors contributing to the escalating housing prices in Klang Valley include changes in household structure due to fast urbanization, increased construction costs, housing speculation, and proximity to workplaces and facilities. The impact of the inflation rate on housing prices is considered negligible [2].

The high rental prices in Klang Valley affect not only Malaysian citizens but also immigrant workers in Malaysia.
Malaysia is known as a significant destination for immigrant labor from various South and South East Asian countries, including Pakistani laborers. However, few studies have been conducted to understand the barriers and challenges faced by Pakistani immigrants in Malaysia [3]. One of the major obstacles they encounter is the high cost of rent, which prevents them from finding suitable housing in good condition [3]. They often find the rental prices shocking and difficult to afford. Despite their hopes of finding cheaper rental options in the housing market, the reality proves to be difficult for them.

The scarcity of affordable housing, which has led to a demand-supply gap, has emerged as a critical issue in Malaysia. If this problem remains unresolved, it will worsen the challenges of house ownership and accommodation in numerous urban areas across the country [4]. The primary factors contributing to the demand-supply gap include the high cost of materials, unfavorable government policies, and insufficient government control and monitoring of the type of housing being built [4]. As a result, housing prices have risen sharply, making it difficult for potential buyers to secure bank loans. Additionally, the rising demand for rental properties has driven up rental fees in cities, as more people are forced to rent rather than buy. Furthermore, certain urban residential areas have become overcrowded due to the limited availability of suitable housing options. These challenges need to be addressed to alleviate the housing crisis and ensure a more sustainable housing market in Malaysia.

III. Methodology

Machine learning models can help tenants and property owners make informed decisions by using data-driven methods to forecast future trends and provide insight into the factors influencing rental pricing, particularly in Klang Valley. The data was retrieved from Kaggle.
The dataset includes details on the cost of renting in the Malaysian regions of Selangor and Kuala Lumpur. All the compiled data originated from “mudah.my”. The dataset consists of 13 variables and over 20,000 records; Table 1 describes each variable.

Table 1. Data description

Variable                Description
ads_id                  The listing ID (unique)
prop_name               Name of property or building
completion_year         Property’s year of completion
monthly_rent            Monthly rent in Malaysian Ringgit (MYR)
location                Property location in Kuala Lumpur Region
property_type           Property type; apartment, condominium, flat, duplex, studio etc.
rooms                   Number of rooms available in each unit
parking                 Number of parking spaces for each unit
bathroom                Number of bathrooms in each unit
size                    Total area of the unit (SQFT)
furnished               Furnishing status of the unit (Fully, Partial or Non-furnished)
facilities              Main facilities available
additional_facilities   Proximity to attraction areas, malls, schools, etc.

The CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology is used in the machine learning process, as it provides a structured and comprehensive framework to guide the development of machine learning models and ensures that all approaches are aligned with the problem.

Business Understanding: The project’s objectives are established in this phase. Python with Google Colaboratory is the primary tool used to carry out the project. The data is analyzed and visualized using Python libraries such as pandas, matplotlib and missingno. In addition, this phase involves determining the source from which the data will be acquired.

Data Understanding: The acquired data has 13 variables with data types of integer, float, and object. This phase requires examining and reviewing the quality, structure and content of the dataset that will be used for analysis.
Furthermore, this entails gathering the dataset from the identified data source, understanding its form, identifying problems with its overall quality, and judging its completeness.

Data Preparation: Data cleansing was done to improve data quality by identifying null values and eliminating errors and discrepancies to avoid inaccuracies. Missing or incorrect data in the dataset would reduce the reliability of the analysis. The data type of each variable was checked to ensure that every variable in the dataset is associated with the correct and appropriate data type. Additionally, outliers were identified in the dataset. However, it was decided to keep the outliers, as they may represent valid and significant observations. Furthermore, the data pre-processing steps include feature engineering, of which scaling and encoding are essential techniques. Min-Max scaling, Standard scaling and Robust scaling are applied in this study to ensure that features are on a similar scale. For categorical data, One-Hot encoding is applied to convert each category into a binary feature, and Label encoding is used to assign a unique integer value to each category.

Modelling: K-means and DBSCAN are popular unsupervised machine learning clustering algorithms designed to group similar data points based on their features. K-means is a centroid-based algorithm that divides data into a predefined number of clusters, K. The algorithm iteratively finds centroids that minimize the squared distances between data points and their assigned centroid. The effectiveness of the K-means algorithm relies on the choice of "k", which must be specified before any clustering analysis is conducted [5].
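
The preparation and K-means steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's actual notebook: the listings are synthetic, the helper names (make_toy_listings, prepare_features, fit_kmeans) are hypothetical, and only Robust scaling and Label encoding are shown. The column names follow Table 1 and the feature names reported later in the paper.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, RobustScaler


def make_toy_listings(n=300, seed=0):
    """Build synthetic listings; the values are made up for illustration only."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "monthly_rent": rng.normal(1800, 600, n).clip(400, None),
        "size": rng.normal(900, 250, n).clip(200, None),
        "furnished": rng.choice(
            ["Fully Furnished", "Partially Furnished", "Not Furnished"], n
        ),
    })


def prepare_features(df):
    """Robust-scale the numeric columns and label-encode the categorical one."""
    out = pd.DataFrame(index=df.index)
    scaled = RobustScaler().fit_transform(df[["monthly_rent", "size"]])
    out["robust_monthly_rent"] = scaled[:, 0]
    out["robust_size"] = scaled[:, 1]
    out["furnished_Label"] = LabelEncoder().fit_transform(df["furnished"])
    return out


def fit_kmeans(features, k, seed=0):
    """Fit K-means with a fixed, user-chosen k and return the fitted model."""
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)


listings = make_toy_listings()
features = prepare_features(listings)

# Inertia is the sum of squared distances from each point to its centroid,
# the quantity K-means minimizes. Comparing it across k shows why k matters.
inertias = {k: fit_kmeans(features, k).inertia_ for k in (2, 3, 4, 5, 6)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```

Because k has to be fixed in advance, printing the inertia for several candidate values is one common way to judge how sensitive the result is to that choice.
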
In contrast to well-known clustering algorithms like K-means, DBSCAN, a density-based algorithm that clusters data points based on their density, does not require the number of clusters or classes to be predefined or restricted [6]. Instead, it uses epsilon (ε) as the radius within which to search for neighboring points and minPts as the minimum number of points needed to form a dense region. For this study, the K-means algorithm was set to 5 clusters (K=5), and the DBSCAN algorithm used an epsilon of 0.5 and a minimum number of samples (min_samples) of 5.

Evaluation: The performance of the K-means and DBSCAN algorithms is assessed using two metrics for unsupervised machine learning: the Silhouette Score and the Davies-Bouldin Index. These metrics measure how well the data points are grouped in terms of compactness within clusters and separation between clusters. The Silhouette Score is the mean silhouette coefficient over all data points, where each coefficient represents a point's similarity to its own cluster compared to other clusters. The coefficient ranges from -1 to +1, with higher values indicating better clustering. The Davies-Bouldin Index, on the other hand, evaluates the average similarity between each cluster and its most similar cluster. It considers both within-cluster distances and distances between clusters. A lower Davies-Bouldin Index indicates that the clusters are better defined and more distinct from each other. These metrics provide valuable insight into the quality of clustering results and assist in comparing different algorithms or determining the optimal number of clusters for a given dataset.

Deployment: The model developed for this project was not designed for deployment in a production environment. It was primarily intended for data exploration, analysis and research purposes, prioritizing accuracy and interpretability over deployment efficiency. The insights gained from this model can serve as a foundation for future iterations of deployment-ready models, if needed.

IV. Results and Discussion

Table 2: K-Means Silhouette and Davies-Bouldin Index Scores

Metrics   K-Means Silhouette Score   K-Means Davies-Bouldin Index
Scores    0.2816401271565697         1.1741890620961124

Table 2 shows the K-Means Silhouette and Davies-Bouldin Index scores. K-Means performed reasonably well, with some cluster separation: it achieved a Silhouette Score of approximately 0.28 and a Davies-Bouldin Index of approximately 1.17. The Davies-Bouldin Index of 1.17 suggests moderately well-defined clusters, and the Silhouette Score of 0.28 indicates some, though limited, separation between clusters. Although neither score is high, the clusters generated by K-Means are reasonably well-separated and can be distinguished from one another to some extent. Hence, K-Means can be considered reasonably reliable and scalable for this dataset.

Table 3: DBSCAN Silhouette and Davies-Bouldin Index Scores

Metrics   DBSCAN Silhouette Score    DBSCAN Davies-Bouldin Index
Scores    0.00124851598085306        1.4344434732612028

DBSCAN, on the other hand, did not produce clear clusters for the dataset. As Table 3 shows, the DBSCAN Silhouette Score is approximately 0.00125, which is very close to 0. This indicates that the clusters are not well-defined, overlap with one another, or both. In addition, the DBSCAN Davies-Bouldin Index is approximately 1.43, which is higher than that of K-Means and means that the clusters are not as well-separated as desired. Taken together, the low Silhouette Score and the high Davies-Bouldin Index suggest that DBSCAN was not effective in creating distinct clusters for this dataset.
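
As a hedged sketch of how the two metrics can be computed for both algorithms, the example below uses scikit-learn with the parameter values the study reports (K=5, eps=0.5, min_samples=5). The data is a synthetic stand-in generated with make_blobs, so the printed scores will not match Tables 2 and 3. The helper evaluate_clustering is hypothetical, and excluding DBSCAN noise points before scoring is an assumption, not necessarily what the study did.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def evaluate_clustering(X, labels):
    """Return (silhouette, Davies-Bouldin), or None if fewer than 2 clusters.

    DBSCAN marks noise points with the label -1; they are excluded here,
    which is one possible convention and not necessarily the study's.
    """
    mask = labels != -1
    kept = labels[mask]
    if len(np.unique(kept)) < 2:
        return None  # both metrics need at least two clusters
    return (
        silhouette_score(X[mask], kept),
        davies_bouldin_score(X[mask], kept),
    )


# Synthetic stand-in for the scaled rental features (not the real dataset).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.6, random_state=0)

kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

for name, labels in (("K-Means", kmeans_labels), ("DBSCAN", dbscan_labels)):
    scores = evaluate_clustering(X, labels)
    if scores is None:
        print(f"{name}: fewer than two clusters, metrics undefined")
    else:
        print(f"{name}: silhouette={scores[0]:.3f}, Davies-Bouldin={scores[1]:.3f}")
```

A higher silhouette and a lower Davies-Bouldin value both point to better clustering, so the two numbers are read in opposite directions when comparing algorithms.
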
Figure 1: K-Means Cluster based on Multiple Features

Figure 1 shows that the K-Means clusters overlap with one another, although cluster 3 shows some separation.

Figure 2: DBSCAN Cluster based on Combined Features

Figure 2 shows that the DBSCAN clusters lie at or close to 0 on the graph. It can be observed that DBSCAN groups points that are close together and have enough nearby neighbors into a cluster. Some data points may exhibit characteristics that make their cluster membership ambiguous; these points might have features or attributes similar to more than one cluster.

Based on the five features used (robust_monthly_rent, robust_size, region_Label, furnished_Label and completion_year), it can be concluded that none of these features emerges as a significant factor affecting house rental prices in Klang Valley, which may instead be influenced by external factors. The clustering results show that the rental prices of properties in different clusters are similar, as the clusters overlap with one another. Factors that drive rental demand include the job market in Klang Valley and lifestyle preferences. The overlap may indicate that properties in Klang Valley have similar characteristics, such as number of bedrooms, amenities or location, which contribute to comparable rental prices. However, rental prices may fluctuate over time due to changes in demand, economic conditions or lifestyle trends.

V. Conclusion

Rental price analysis often involves exploring large datasets with multiple variables. Thus, unsupervised learning algorithms such as K-Means and DBSCAN clustering can help identify patterns and group similar properties based on various features, providing a deeper understanding of rental market trends.
This research has compared the performance of the two clustering algorithms, shedding light on the strengths and weaknesses of each. In addition, data exploration, pre-processing, and feature engineering techniques have been applied to improve the output of the algorithms. While both K-Means and DBSCAN provided insights into the rental market, the K-Means algorithm demonstrated more promising results, with reasonably well-separated clusters. Nonetheless, further exploration and analysis may be required to identify and understand more comprehensively the underlying factors influencing the rental market in Klang Valley. Additionally, other machine learning algorithms can be explored on this dataset to determine their effectiveness in rental price analysis. In conclusion, the study's findings provide a foundation for further investigation and open possibilities for exploring different machine learning approaches to enhance rental price analysis.

References

[1] S. Pinjaman and M. Kogid, "Macroeconomic Determinants of House Prices in Malaysia," Jurnal Ekonomi Malaysia, vol. 54, no. 1, pp. 153-154, 2020. http://dx.doi.org/10.17576/JEM-2020-5401-11
[2] P. A. Mariadas, M. Selvanathan and T. K. Hong, "A Study on Housing Price in Klang Valley, Malaysia," International Business Research, vol. 9, no. 12, November 2016. doi:10.5539/ibr.v9n12p103
[3] T. Zermina, M. N. Ajis and N. A. Zainal Abidin, "Examining the Housing Experiences in Malaysia: a Qualitative Research on Pakistani Immigrant Labours," Journal of International Migration and Integration, vol. 21, pp. 241–251, 2020. doi:10.1007/s12134-019-00723-7
[4] A. P. J. Chan and B. H. C. Lee, "A Study on Factors Causing the Demand-Supply Gap of Affordable Housing," INTI Journal Special Edition – Built Environment, pp. 6-10, 2016. http://eprints.intimal.edu.my/600/1/EA%20-%201.pdf
[5] M. Ahmed, R. Seraj and S. M. Shamsul Islam, "The k-means Algorithm: A Comprehensive Survey and Performance Evaluation," Electronics, vol. 9, no. 8, p. 1295, 2020. https://doi.org/10.3390/electronics9081295
[6] J. C. Perafan-Lopez, V. L. Ferrer-Gregory, C. Nieto-Londoño and J. Sierra-Pérez, "Performance Analysis and Architecture of a Clustering Hybrid Algorithm Called FA+GA-DBSCAN Using Artificial Datasets," Entropy (Basel), vol. 24, no. 7, p. 875, 2022. doi:10.3390/e24070875

Google Colab Link: https://colab.research.google.com/drive/1jWnn7MeXfF-Lt6yn-mP521NTf1lWP5l-?usp=sharing