VIETNAM NATIONAL UNIVERSITY, HANOI
INTERNATIONAL SCHOOL
--------------
FINAL REPORT
Course: Introduction to Data Science
Topic: Housing Price Prediction
Group 1
Lecturer:
Ph.D Trương Công Đoàn
Course ID:
INS3254 – INS325403
Member:
Nguyễn Minh Thuý – 23070553
Hoàng Khánh Ly – 22070870
Phạm Linh Thu – 22070972
Nguyễn Thị Thu Huyền – 23071060
Group member
Name                    Student ID    Contribution    Note
Nguyễn Minh Thuý        23070553      25%             Leader
Hoàng Khánh Ly          22070870      25%
Phạm Linh Thu           22070972      25%
Nguyễn Thị Thu Huyền    23071060      25%
The total of all members combined is 100%
Table of contents
Group member
Introduction
1. Topic meaning of project
2. Motivation and Objectives of the Study
3. Conclusion
Chapter 1. Dataset
1.1. Source of dataset
1.2. Features Explanation
1.3. Descriptive statistics
1.4. Conclusion
Chapter 2. Exploratory Data Analysis (EDA)
2.1. Raw Data
2.2. Cleaning Data
2.2.1. Handle missing values
2.2.2. Handle duplicated values
2.2.3. Handle outlier values
2.2.4. Standardizing data formats
2.3. Dimensionality Reduction
2.4. Distribution of Each Feature
2.4.1. Feature Statistics
2.4.2. Skewness and Kurtosis
2.4.3. Feature Scaling
2.4.4. Splitting the Data
2.6. Dataset Preparation
Chapter 3. Algorithm
3.1. LightGBM
3.2. Training Model
3.2.1. Train-Test set split
3.2.2. Initializing and Training Model
3.2.3. Training with Optimized Hyperparameters
3.2.4. Feature Importance Analysis
3.3. Testing Data
3.3.1. Preparing the Test Data
3.3.2. Feature Scaling on Test Data
3.3.3. Making Predictions with LightGBM Models
3.3.4. Merging Predictions with Test Data
Chapter 4. Results
4.1. Model Evaluation
4.2. Model Evaluating Process
4.3. Performance Evaluation
Conclusion
5.1. Summary of the Project
5.2. Key Findings
5.3. Limitations
5.4. Future Work
REFERENCES
Introduction
1. Topic meaning of project
Housing price prediction involves estimating the market value of residential properties
based on various features and historical data. Accurate predictions are essential for
multiple stakeholders, including buyers seeking fair prices, sellers aiming for optimal
returns, investors making strategic decisions, and financial institutions assessing loan
risks. The ability to predict housing prices reliably facilitates transparency in the real
estate market, enhances investment strategies, and contributes to economic stability by
informing policy-making and urban planning.
Figure 1. Predict house prices
2. Motivation and Objectives of the Study
The motivation behind this study is to leverage advanced machine learning
techniques to enhance the accuracy of housing price predictions. The housing
market is notoriously dynamic, and traditional statistical methods often fall short
in capturing the intricate patterns and trends within this market. By employing
state-of-the-art machine learning algorithms, this project aims to provide a more
reliable and precise forecasting model. The primary objectives are:
- Understanding the Problem: Grasping the factors influencing
housing prices and their interrelationships.
- Data Analysis and Preprocessing: Cleaning and preparing the data to
ensure its suitability for modeling.
- Model Development: Implementing and optimizing the LightGBM
algorithm to achieve high predictive performance.
- Performance Evaluation: Assessing the model's accuracy and
reliability using appropriate metrics.
- Insights Generation: Identifying key features that significantly impact
housing prices to inform stakeholders.
3. Conclusion
In conclusion, this project aims to advance the field of housing price prediction
through the application of sophisticated machine learning techniques and
optimization methods. By addressing the critical questions surrounding housing
market dynamics, this study seeks to contribute valuable insights that can drive
better decision-making in the real estate sector. The successful completion of this
project will not only demonstrate the efficacy of the chosen tools and methods but
also provide a practical framework for future research and applications in
predictive analytics.
Chapter 1. Dataset
1.1. Source of dataset
Data is essential when predicting house prices with machine learning, because the
model learns the patterns related to house prices from it. Our team extracted the
data from a Kaggle notebook [4].
Here is our dataset:
Figure 1.1. Dataset
1.2. Features Explanation
The dataset encompasses the following features, each playing a crucial role in
influencing housing prices:
- SquareFeet: The total square footage of the property.
- Bedrooms: Number of bedrooms in the property.
- Bathrooms: Number of bathrooms in the property.
- Neighborhood: A categorical feature indicating if the property is in a
Rural, Urban, or Suburban area.
- YearBuilt: The year the property was constructed.
- Price: The target variable representing the price of the property.
1.3. Descriptive statistics
The dataset comprises 50,000 samples, divided into a training set of 40,000
samples and a testing set of 10,000 samples. Both subsets are free from missing
values, ensuring data completeness and integrity.
Number of Samples:
- Training Set: 40,000 samples
- Testing Set: 10,000 samples
Independent and Dependent Variables
- Independent Variables: SquareFeet, Bedrooms, Bathrooms,
Neighborhood, and YearBuilt. These features are used by the model to
predict the target variable.
- Dependent Variable: Price. This is the variable of interest that the
model seeks to predict based on the independent variables.
Figure 1.2.1. Structure and characteristics of the data
Figure 1.2.2. Label of each column of the data
1.4. Conclusion
The dataset is comprehensive, well-structured, and devoid of missing values,
providing a solid foundation for developing a reliable housing price prediction
model. The clear distinction between training and testing sets facilitates unbiased
model evaluation and performance assessment. The descriptive statistics highlight
the central tendencies and dispersions of the features, offering insights into their
distributions and potential impact on the target variable.
Chapter 2. Exploratory Data Analysis (EDA)
2.1. Raw Data
The two lines of code display(train) and display(test) show the contents of the
two variables train and test. They are used to identify the columns (features),
the amount of data, and the target variable in the dataset.
train._append(test, ignore_index=True) combines the two DataFrames (train and
test) into a new DataFrame (df) for convenience in data analysis, missing data
handling, and data visualization.
ignore_index=True resets the row index so that it runs continuously.
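The combining step can be sketched as follows. This is a minimal example on made-up rows, not the report's actual data. It uses pd.concat, the public pandas function for this job, in place of the private _append method; the toy column values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the real train and test DataFrames (values are made up).
train = pd.DataFrame({"SquareFeet": [1500, 2100], "Price": [200000.0, 310000.0]})
test = pd.DataFrame({"SquareFeet": [1800], "Price": [255000.0]})

# pd.concat stacks the rows of both frames; ignore_index=True renumbers
# the rows 0..n-1 instead of keeping each frame's original index.
df = pd.concat([train, test], ignore_index=True)

print(df.shape)
print(list(df.index))
```

Because the index is continuous after the merge, row positions can later be used to separate the two parts again.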
2.2. Cleaning Data
2.2.1. Handle missing values
This code snippet checks and displays information about missing values in the
train and test datasets. It uses the isnull(), sum(), and info() functions from the
pandas library to achieve this.
Handle missing data (missing values) in DataFrame df in two ways:
- df.dropna(inplace=True): This line uses pandas' dropna() method to
drop all rows that contain at least one missing value. inplace=True
means the change will be made directly on the original DataFrame df,
rather than returning a new DataFrame.
- df['Neighborhood'].fillna('Unknown', inplace=True): This line focuses
on the Neighborhood column. It uses the fillna() method to fill in
missing values in this column with the value "Unknown". Similar to
above, inplace=True ensures changes are applied directly to df.
These lines handle missing values in the Bedrooms, SquareFeet, and YearBuilt
columns of DataFrame df. They apply a common data cleaning technique:
imputation, that is, filling in each missing value with the mean of the
corresponding column. This keeps the column means unchanged and prevents the
problems that missing values can cause in further analysis or in machine
learning models.
Check if the data has been completely cleaned, by printing out the size of the
DataFrame and the total number of missing values in each column. If there are no
missing values, the data is ready for further analysis and processing steps.
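A minimal sketch of the check, fill, and recheck sequence is shown below on a made-up four-row DataFrame. The values are assumptions for illustration. The sketch assigns the filled column back instead of using inplace=True on a column selection, since that pattern can raise warnings in recent pandas versions. It also skips dropna(), because dropping every incomplete row first would leave nothing for the fill steps to do.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value per column.
df = pd.DataFrame({
    "SquareFeet": [1500.0, np.nan, 2100.0, 1800.0],
    "Bedrooms": [3.0, 2.0, np.nan, 4.0],
    "YearBuilt": [1990.0, 2005.0, 1978.0, np.nan],
    "Neighborhood": ["Rural", None, "Urban", "Suburb"],
})

# Count missing values per column before cleaning.
print(df.isnull().sum())

# Categorical column: replace missing labels with a placeholder category.
df["Neighborhood"] = df["Neighborhood"].fillna("Unknown")

# Numerical columns: replace missing values with the column mean.
for col in ["Bedrooms", "SquareFeet", "YearBuilt"]:
    df[col] = df[col].fillna(df[col].mean())

# Recheck: size of the DataFrame and total number of missing values.
print(df.shape)
print(df.isnull().sum().sum())
```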
2.2.2. Handle duplicated values
This code snippet is designed to identify and display duplicate rows within the
Pandas DataFrame called df.
This code is using a method called drop_duplicates() which is part of the Pandas
library in Python. Pandas is often used for data manipulation and analysis. Here's
what's happening:
- df: This refers to your DataFrame, which is essentially a table holding
your data. Think of it like a spreadsheet.
- drop_duplicates(): This is the function that does the work. It's designed
to find and remove rows in your DataFrame that are exact duplicates of
other rows.
- inplace=True: This is an important argument. It means that the changes
made by drop_duplicates() will be applied directly to the original df
DataFrame. Without inplace=True, the function would create a new
DataFrame with the duplicates removed, leaving the original df
unchanged.
This code snippet rechecks whether any duplicate rows remain in the DataFrame df
after the removal step. Here's how it works step by step:
- df.duplicated(): This part of the code calls the duplicated() function on
the DataFrame df. The duplicated() function examines each row of the
DataFrame. It returns a Series of Boolean values (True/False). For each
row, it assigns True if the row is a duplicate (meaning it's identical to a
previous row), and False otherwise.
- .sum(): This part is chained to the result of df.duplicated(). The sum()
function is applied to the Series of Boolean values. Since True is treated
as 1 and False as 0, the sum() function effectively counts the number of
True values.
This line is crucial for data quality assurance. It verifies the effectiveness of the
previous step where duplicate rows were supposed to be removed. By rechecking,
the user ensures the dataset is clean and ready for further analysis.
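The detect, remove, and recheck sequence described above can be sketched as follows. The three-row DataFrame is made up for illustration, and the sketch assigns the result back instead of using inplace=True, which has the same effect.

```python
import pandas as pd

# Made-up sample in which the second row repeats the first one.
df = pd.DataFrame({
    "SquareFeet": [1500, 1500, 2100],
    "Bedrooms": [3, 3, 4],
    "Price": [200000, 200000, 310000],
})

# duplicated() marks every row that repeats an earlier row; sum() counts them.
duplicates_before = int(df.duplicated().sum())

# Keep the first occurrence of each row and drop the repeats.
df = df.drop_duplicates()

# Recheck: the count should now be zero.
duplicates_after = int(df.duplicated().sum())

print(duplicates_before, duplicates_after)
```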
2.2.3. Handle outlier values
This function uses the Interquartile Range (IQR) method to identify and remove
outliers from a specified column of a Pandas DataFrame. It calculates the upper
and lower boundaries based on the IQR and then filters the DataFrame to keep
only the data points within those boundaries.
def remove_outliers(df, column):
- This line defines a function named remove_outliers. It takes two
arguments:
- df: The Pandas DataFrame containing the data.
- column: The name of the column from which to remove outliers (a
string).
q1 = df[column].quantile(0.25):
- This line calculates the first quartile (Q1) of the specified column. The
first quartile represents the 25th percentile of the data.
q3 = df[column].quantile(0.75):
- This line calculates the third quartile (Q3) of the specified column. The
third quartile represents the 75th percentile of the data.
iqr = q3 - q1:
- This line calculates the Interquartile Range (IQR), which is the
difference between the third and first quartiles (Q3 - Q1). The IQR
measures the spread of the middle 50% of the data.
upper_boundary = q3 + 1.5 * iqr:
- This line calculates the upper boundary for identifying outliers. Any
data point above this boundary is considered an outlier.
lower_boundary = q1 - 1.5 * iqr:
- This line calculates the lower boundary for identifying outliers. Any
data point below this boundary is considered an outlier.
new_df = df.loc[(df[column] > lower_boundary) & (df[column] < upper_boundary)]:
- This line creates a new DataFrame (new_df) that includes only the rows
where the values in the specified column are within the calculated
boundaries (i.e., not outliers).
return new_df:
- The function returns the new DataFrame (new_df) with the outliers
removed.
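Putting the lines described above together gives the following function. The statements are the ones quoted in the walkthrough; the small Price sample used to try it out is made up for illustration.

```python
import pandas as pd

def remove_outliers(df, column):
    # First and third quartiles of the column.
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    # Spread of the middle 50% of the data.
    iqr = q3 - q1
    # Values beyond 1.5 * IQR from the quartiles count as outliers.
    upper_boundary = q3 + 1.5 * iqr
    lower_boundary = q1 - 1.5 * iqr
    # Keep only the rows strictly inside the boundaries.
    new_df = df.loc[(df[column] > lower_boundary) & (df[column] < upper_boundary)]
    return new_df

# Made-up sample: five ordinary prices and one extreme value.
sample = pd.DataFrame({"Price": [100, 110, 120, 130, 140, 1000]})
sample_clean = remove_outliers(sample, "Price")

print(f"Rows before cleaning: {sample.shape[0]}")
print(f"Rows after cleaning: {sample_clean.shape[0]}")
```

In this sample the quartiles are 112.5 and 137.5, so the boundaries are 75 and 175 and only the value 1000 is removed.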
This code cleans the data by removing unusual values (outliers) from the 'Price'
column and stores the cleaned data in a new DataFrame named 'df_clean'. This
helps improve data quality and the accuracy of subsequent analyses or models.
By comparing the two numbers printed, the user can see how many rows were
removed during the cleaning process. This provides a clear indication of the
impact of outlier removal on the dataset size. If a significant number of rows are
removed, it highlights that there were a substantial number of outliers.
print(f"Rows before cleaning: {df.shape[0]}"):
- This line displays the original number of rows in the dataset before any
data cleaning was performed, stored in the DataFrame called df.
print(f"Rows after cleaning: {df_clean.shape[0]}"):
- This line displays the number of rows remaining after the data
cleaning steps, specifically after removing outliers. The cleaned data
is held in the DataFrame called df_clean.
2.2.4. Standardizing data formats
In essence, this code block provides a concise summary of the structure and
contents of both the train and test datasets. It helps to understand the data at a high
level, identify potential issues like missing values, and confirm that the datasets
are properly structured.
The code is changing the descriptive labels (Rural, Urban, Suburb) to numbers (1,
2, 3) to make it easier for the computer to process. This process is important for
preparing data for machine learning tasks.
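One way to perform this label-to-number conversion is a dictionary passed to the pandas map() method, as sketched below. The mapping follows the labels and numbers given above; whether the report's code used map() or another encoder is an assumption.

```python
import pandas as pd

# Made-up sample of the categorical column.
df = pd.DataFrame({"Neighborhood": ["Rural", "Urban", "Suburb", "Urban"]})

# Label-to-number mapping as described in the text.
neighborhood_codes = {"Rural": 1, "Urban": 2, "Suburb": 3}

# map() replaces each label with its number; labels missing from the
# dictionary would become NaN, which makes unexpected categories visible.
df["Neighborhood"] = df["Neighborhood"].map(neighborhood_codes)

print(df["Neighborhood"].tolist())
```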
2.3. Dimensionality Reduction
Reducing Features
- Purpose: Reduce the DataFrame to the predictor features only.
- In this step, the code removes the "Price" column (the target) from the
DataFrame and saves the list of remaining features in train_feature.
- The result is: SquareFeet, Bedrooms, Bathrooms, Neighborhood, YearBuilt.
2.4. Distribution of Each Feature
Description: Observe the distribution of features in the dataset.
Meaning: Draw a histogram of each feature in both train and test. Calculate the
skewness and kurtosis of the features to check the normal distribution.
2.4.1. Feature Statistics
The statistics table shows the columns retained after removing the Price variable
(the dimension reduction step). The columns include: SquareFeet, Bedrooms,
Bathrooms, Neighborhood, YearBuilt. The statistics table helps summarize the
data and detect trends such as:
- Features like SquareFeet and YearBuilt have a wide distribution.
- Features like Bedrooms and Bathrooms have stable values with low
standard deviation.
2.4.2. Skewness and Kurtosis
Purpose: Measure the skewness and kurtosis of characteristics to check for normal
distribution.
Skewness measures the asymmetry of a distribution compared to a normal
distribution:
- Skewness > 0: Right skewed (long right tail).
- Skewness < 0: Left skewed (long left tail).
- Skewness ≈ 0: Nearly symmetrical distribution.
Kurtosis measures how heavy the tails of a distribution are compared to the
normal distribution. The thresholds below refer to excess kurtosis (kurtosis
minus 3):
- Kurtosis > 0: Leptokurtic distribution (heavier tails than normal).
- Kurtosis < 0: Platykurtic distribution (lighter tails than normal).
- Kurtosis ≈ 0: Mesokurtic distribution (similar to normal).
Skewness: SquareFeet, Bedrooms, Bathrooms, Neighborhood, and YearBuilt all have
skewness close to 0.
Kurtosis: All columns have kurtosis < 0, indicating that these features are
platykurtic, that is, flatter than the normal distribution.
The data is not too concentrated in the center but is more widely dispersed.
=> The distributions of the columns are roughly symmetric and show neither large
skewness nor excessive kurtosis.
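Both statistics can be computed directly with pandas, as in the sketch below. The data is randomly generated from uniform distributions purely for illustration; it is not the housing dataset, and the report does not say which library calls it used. Note that the pandas kurt() method returns excess kurtosis, so 0 corresponds to the normal distribution.

```python
import numpy as np
import pandas as pd

# Synthetic, uniformly distributed stand-ins for two of the features.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "SquareFeet": rng.uniform(1000, 3000, size=5000),
    "Bedrooms": rng.integers(2, 6, size=5000),
})

# Asymmetry of each column (0 means symmetric).
skewness = features.skew()

# Excess kurtosis of each column (0 means normal-like tails).
excess_kurtosis = features.kurt()

print(skewness)
print(excess_kurtosis)
```

Uniformly spread values produce skewness near 0 and negative excess kurtosis, which is the same pattern reported for the features above.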
2.4.3. Feature Scaling
Standardizing numerical features ensures that each feature contributes equally to
the model training process by having a mean of 0 and a standard deviation of 1.
This is particularly important when features have varying scales.
Benefits: Accelerates model convergence and prevents features with larger scales
from dominating the model training process.
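Standardization replaces each value x with (x - mean) / standard deviation. A minimal sketch with scikit-learn's StandardScaler follows. The choice of StandardScaler and the practice of fitting it on the training rows only are assumptions, since the report's scaling code is not reproduced in the text; the sample values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows with features on very different scales.
train = pd.DataFrame({"SquareFeet": [1000.0, 2000.0, 3000.0], "Bedrooms": [2.0, 3.0, 4.0]})
test = pd.DataFrame({"SquareFeet": [2500.0], "Bedrooms": [5.0]})

scaler = StandardScaler()

# Learn each column's mean and standard deviation from the training rows.
train_scaled = scaler.fit_transform(train)

# Apply the same mean and standard deviation to the test rows.
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
```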
2.4.4. Splitting the Data
The dataset was divided into training and evaluation sets using an 80-20 split
(40,000 samples for training and 10,000 samples for testing) to facilitate unbiased
model evaluation.
2.6. Dataset Preparation
Re-splitting dataset: After the data has been merged (train and test) for
processing and analysis, you need to split the dataset into two separate parts to
prepare for the training and testing phase (train-test split).
Specifically:
- train: Includes 40,000 samples for use in the model training process.
- test: Includes 10,000 samples for evaluating the model after training.
Reset the index on the test set to clean the index after data extraction.
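A positional re-split can be sketched as follows. It assumes that the training rows come first in the combined DataFrame and that the row order was not changed in between; the report does not show how its code locates the boundary, so the variable n_train is hypothetical. A ten-row toy frame stands in for the 50,000-row dataset.

```python
import pandas as pd

# Toy combined frame: the first n_train rows came from train, the rest from test.
df = pd.DataFrame({"SquareFeet": range(10), "Price": range(100, 110)})
n_train = 8  # hypothetical stand-in for the 40,000 training rows

# Rows before the boundary form the training set.
train = df.iloc[:n_train]

# Rows after the boundary form the test set; reset_index(drop=True)
# renumbers them from 0 and discards the old index values.
test = df.iloc[n_train:].reset_index(drop=True)

print(len(train), len(test))
print(list(test.index))
```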
Recheck data (Validation)
Goal
- Ensure that the data set after dividing retains important statistical
information.
Meaning
- Consider the main characteristics of two data sets (mean, standard
deviation, min, max, etc.).
- Ensure that splitting does not change the integrity of the data.
Chapter 3. Algorithm
3.1. LightGBM
LightGBM (Light Gradient Boosting Machine) was chosen for its efficiency,
scalability, and superior performance in handling large datasets with complex
feature interactions. It excels in gradient boosting frameworks by leveraging
histogram-based algorithms and leaf-wise tree growth, making it well-suited for
regression tasks like housing price prediction.
3.2. Training Model
3.2.1. Train-Test set split
Figure 3.2.1 Train-Test set split
The parameter test_size=0.2 indicates that 20% of the data will be used for the
test set, the remaining 80% will be used for the training set.
The parameter random_state=2019 ensures the consistency of data splitting,
making it possible for the results to be consistently reproduced across different
runs of the code.
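The effect of these two parameters can be checked with a small sketch. The arrays are synthetic and serve only to show the resulting sizes and the reproducibility; the names X_eval and y_eval follow the evaluation-set names used later in this report.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 samples with 2 features each.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# 80% of the rows go to training, 20% to evaluation.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.2, random_state=2019
)

# Repeating the call with the same random_state gives the same split.
X_train_again, X_eval_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=2019
)

print(X_train.shape, X_eval.shape)
print(np.array_equal(X_eval, X_eval_again))
```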
3.2.2. Initializing and Training Model
LGBMRegressor(): Initializes the LightGBM regressor with default
hyperparameters.
model.fit(X_data_feature, y_data_feature): Trains the model on the entire
training dataset.
3.2.3. Training with Optimized Hyperparameters
The objective function guides Optuna in exploring the hyperparameter space. It
involves training the LightGBM model with a given set of hyperparameters.
Parameter Definitions:
- n_estimators: Number of boosting iterations.
- learning_rate: Step size shrinkage used to prevent overfitting.
- max_depth: Maximum depth of the trees.
- min_child_weight: Minimum sum of instance weight (hessian) needed
in a child.
- subsample: Fraction of samples to be used for fitting the individual base
learners.
- colsample_bytree: Fraction of features to be used for each tree.
- reg_lambda: L2 regularization term on weights.
- reg_alpha: L1 regularization term on weights.
Sampler: Optuna's RandomSampler was used. It draws each hyperparameter
configuration independently at random from the search space.
Trials: 50 trials were executed, and the configuration with the best score was
kept as the optimal parameter set.
Best Parameters Identification
Increased n_estimators: A higher number of boosting iterations allows the model
to learn more complex patterns but requires careful regularization to prevent
overfitting.
Lower learning_rate: Smaller learning rates typically lead to better generalization
but require more boosting iterations.
Optimized subsample and colsample_bytree: These parameters control the
fraction of data and features used per tree, enhancing model diversity and reducing
overfitting.
Training LightGBM with Optimized Parameters:
The objective='regression_l1' specifies that the model uses the L1 loss function
for regression, which minimizes the mean absolute error during training.
The model is trained using the following command: .fit(X_train, y_train)
Here, the model is trained on the training dataset (X_train, y_train), which
contains features and their corresponding labels (housing prices).
3.2.4. Feature Importance Analysis
model[i].feature_importances_: Retrieves the importance scores of each feature.
Visualization: Plots the top 10 most important features to help identify which
variables significantly impact the target 'Price'.
Insight from Code Comments: "Prices are highly correlated to SquareFeet." This
indicates that the SquareFeet feature is a strong predictor of house prices.
3.3. Testing Data
3.3.1. Preparing the Test Data
test.reset_index(drop=True): Resets the index of the test DataFrame and drops
the old index. This ensures that the DataFrame has a clean, sequential index
starting from 0.
import_test0.drop(columns=['Price','Neighborhood','YearBuilt'], axis=1):
Removes the columns 'Price', 'Neighborhood', and 'YearBuilt' from import_test0.
import_test: Displays the resulting DataFrame after dropping the specified
columns.
3.3.2. Feature Scaling on Test Data
3.3.3. Making Predictions with LightGBM Models
predict_LGBM = LGBM_model.predict(test_pred_target0): Uses the trained
LightGBM regressor (LGBM_model) to predict housing prices on the prepared test
data.
pd.DataFrame(predict_LGBM): Converts the array of LightGBM predictions into a
DataFrame.
set_axis(axis=1, labels=['LGBM_pred']): Renames the column of the
LightGBM predictions DataFrame to 'LGBM_pred'.
3.3.4. Merging Predictions with Test Data
test.merge(predict_XGB_df, how='inner', left_index=True, right_index=True):
Merges the original test DataFrame with the XGBoost predictions
(predict_XGB_df) based on their indices, using an inner join to ensure only
matching indices are retained.
test_pred.merge(predict_LGBM_df, how='inner', left_index=True, right_index=True):
Further merges the resulting test_pred DataFrame with the LightGBM predictions
(predict_LGBM_df) on the same basis.
test_pred: Displays the final merged DataFrame containing both actual values
and predictions from both models.
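The conversion and the index-based merge can be sketched for the LightGBM predictions as follows. The prediction array holds made-up numbers standing in for the output of LGBM_model.predict(), and the XGBoost merge is left out because the XGBoost model is not described in this report.

```python
import numpy as np
import pandas as pd

# Made-up test rows with a clean 0..n-1 index.
test = pd.DataFrame({
    "SquareFeet": [1500, 2100, 1800],
    "Price": [200000.0, 310000.0, 255000.0],
})

# Stand-in for LGBM_model.predict(...): made-up numbers, not real model output.
predict_LGBM = np.array([210000.0, 300000.0, 250000.0])

# Wrap the array in a DataFrame and name its single column.
predict_LGBM_df = pd.DataFrame(predict_LGBM).set_axis(["LGBM_pred"], axis=1)

# Join row i of the test data with prediction i by matching index values.
test_pred = test.merge(predict_LGBM_df, how="inner", left_index=True, right_index=True)

print(test_pred)
```

The merge lines up correctly only because both frames share the same 0..n-1 index, which is why the test index is reset before this step.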
Chapter 4. Results
4.1. Model Evaluation
To comprehensively assess the LightGBM model's performance, multiple
evaluation metrics were employed:
- Median Absolute Error (MedAE): Measures the median of absolute
differences between predicted and actual values. Provides a robust
measure less sensitive to outliers.
- R² Score: Indicates the proportion of variance in the dependent variable
(Price) predictable from the independent variables. Reflects the model's
explanatory power.
- Cross-Validation R²: Assesses the model's consistency and
generalizability across different training-validation splits.
4.2. Model Evaluating Process
In this section, we evaluate the performance of our LightGBM (LGBMRegressor)
model using several metrics to ensure robustness and reliability. Our approach
involves cross-validation, predictions on an evaluation set, and the use of multiple
performance metrics.
a) Cross-Validation (CV):
Cross-validation is performed to assess the model's ability to generalize.
cross_val_score: This function performs 5-fold cross-validation (cv = 5). It splits
the training data (X_train, y_train) into 5 parts.
- It trains the model LGBM_model on 4 parts and tests it on the remaining
part.
- This process is repeated 5 times, using a different part for testing each
time.
cv_LGBM.mean(): Calculates the average performance across the 5 folds, giving
a more robust estimate of the model's ability to generalize to unseen data.
b) Prediction on Evaluation Set:
LGBM_model.predict(X_eval): The trained model is used to predict housing
prices on a separate evaluation set (X_eval).
y_pred_LGBM_eval: The variable stores the predicted prices.
c) Performance Metrics:
R-squared (R2):
r2_score_LGBM_eval = r2_score(y_eval, y_pred_LGBM_eval): This metric
measures how well the predicted prices match the actual prices (y_eval) in the
evaluation set. A higher R2 score (closer to 1) indicates better model performance.
Median Absolute Error (MedAE):
MedAE_LGBM = np.sqrt(median_absolute_error(y_eval, y_pred_LGBM_eval)):
median_absolute_error calculates the median of the absolute differences between
the predicted and actual prices. It is less sensitive to outliers than the mean
absolute error. The median absolute error is already expressed in the units of
the target variable (housing price), so the square root is not needed to restore
the scale. The value stored in MedAE_LGBM is therefore the square root of the
MedAE, not the MedAE itself.
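The two metrics, and the effect of the square root, can be seen in a small worked example. The actual and predicted prices below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import median_absolute_error, r2_score

# Made-up actual and predicted prices for five houses.
y_eval = np.array([200000.0, 310000.0, 255000.0, 180000.0, 400000.0])
y_pred_LGBM_eval = np.array([210000.0, 300000.0, 250000.0, 200000.0, 360000.0])

# Share of the variance in the actual prices explained by the predictions.
r2_score_LGBM_eval = r2_score(y_eval, y_pred_LGBM_eval)

# Median of the absolute errors, in the same units as the prices.
medae = median_absolute_error(y_eval, y_pred_LGBM_eval)

# The quantity the report's code stores: the square root of the MedAE.
MedAE_LGBM = np.sqrt(medae)

print("R2_score (eval): ", r2_score_LGBM_eval)
print("MedAE in price units: ", medae)
print("sqrt(MedAE): ", MedAE_LGBM)
```

The absolute errors are 10000, 10000, 5000, 20000, and 40000, so the MedAE is 10000 price units, while its square root is 100. The two numbers should not be read on the same scale.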
Evaluation on Test Set
r2_score_LGBM_test = r2_score(test_pred['Price'],
test_pred['LGBM_pred']): This line calculates the R-squared score for the
LightGBM model on the test set.
MedAE_LGBM_test = (np.sqrt(median_absolute_error(test_pred['Price'],
test_pred['LGBM_pred']))): This line calculates the square root of the Median
Absolute Error (MedAE) for the LightGBM model on the test set.
d) Printing Results
print("CV: ", cv_LGBM.mean()): Prints the average cross-validation score for
the model.
print('R2_score (eval): ', r2_score_LGBM_eval): Prints the R2 score for the
model on the evaluation set.
print("MedAE: ", MedAE_LGBM): Prints the square root of the MedAE for the
model on the evaluation set.
Results:
Interpretation:
- MedAE: A lower MedAE means that the typical (median) prediction
error is smaller.
- R² Score: An R² of 0.56 signifies that 56% of the variance in
housing prices is explained by the model, indicating moderate
predictive capability.
- Cross-Validation R²: A mean CV R² of 0.55 is close to the R² on
the evaluation set, which suggests that the model generalizes
consistently across different data subsets.
print('R2_score (eval): ', r2_score_LGBM_test): This line prints the calculated
R-squared score to the console.
print("MedAE: ", MedAE_LGBM_test): This line prints the calculated square
root of the MedAE to the console.
Results:
Interpretation:
- R² Score: An R² of 0.56 on the test set indicates that the model
retains substantial predictive power on unseen data, though
slightly lower than on the validation set.
- MedAE: If 183.5 is the value printed by the code above, it is
the square root of the MedAE, which corresponds to a median
absolute error of about 33,672 in price units on the test set.
4.3. Performance Evaluation
The model’s performance metrics are stored in a data frame for easy comparison
and visualization:
The results are as follows:
Conclusion
5.1. Summary of the Project
This project focused on developing a robust housing price prediction model using
the LightGBM algorithm, enhanced through Optuna's hyperparameter
optimization. The process began with comprehensive data preprocessing,
including cleaning, feature selection, and scaling, ensuring that the model was
trained on high-quality and well-structured data. By meticulously addressing data
anomalies and standardizing feature scales, the foundation was laid for accurate
and reliable predictions. The integration of advanced machine learning techniques
facilitated the creation of a model capable of capturing complex relationships
within the dataset, ultimately aiming to provide precise housing price estimates.
5.2. Key Findings
The optimized LightGBM model demonstrated significant predictive
performance, effectively minimizing the Median Absolute Error (MedAE)
through fine-tuned hyperparameters identified by Optuna. Feature importance
analysis revealed that SquareFeet was the most influential predictor of housing
prices, followed by Bedrooms and Bathrooms. This insight underscores the
critical role of these features in determining property values. Additionally, the
comparative analysis with the XGBoost model highlighted the strengths of
LightGBM in handling large datasets with intricate feature interactions, affirming
its suitability for this prediction task. The successful merging of predictions with
actual test data facilitated a comprehensive evaluation of the model's accuracy and
reliability.
5.3. Limitations
Despite the project's successes, certain limitations were encountered. The dataset,
while extensive, may not encapsulate all external factors influencing housing
prices, such as economic indicators, local market trends, or property-specific
attributes like condition and amenities. The exclusion of features like
Neighborhood and YearBuilt, based on their minimal impact in feature
importance analysis, might overlook subtle regional influences or historical
property value trends. Furthermore, the model's reliance on historical data means
it may not fully account for abrupt market shifts or unforeseen external factors,
potentially affecting its adaptability to dynamic real estate environments.
5.4. Future Work
Future research can address these limitations by incorporating a broader range of
features, including macroeconomic indicators and detailed property
characteristics, to enhance the model's comprehensiveness. Expanding the dataset
to include temporal data could improve the model's ability to adapt to market
fluctuations and emerging trends. Exploring ensemble methods that combine the
strengths of multiple algorithms, such as stacking LightGBM and XGBoost, may
further elevate predictive performance. Additionally, deploying the model in a
real-world application, such as a web-based prediction tool, could provide
practical value to users seeking instant and dynamic housing price estimates.
Integrating real-time data and continuously updating the model will ensure its
relevance and accuracy in ever-evolving real estate markets.
REFERENCES
[1.] Bergstra, J., & Bengio, Y. (2012). Random search for hyper-parameter
optimization. Journal of Machine Learning Research, 13, 281-305.
[2.] Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for
hyper-parameter optimization. In Advances in Neural Information Processing Systems (pp. 2546-2554).
[3.] Brownlee, J. (2020). Master Machine Learning Algorithms. Machine
Learning Mastery. Retrieved from https://machinelearningmastery.com/mastermachine-learning-algorithms/
[4.] Guan, L. (2020). Machine Learning with Optuna for Housing Price
Prediction. Kaggle Notebook. Retrieved from
https://www.kaggle.com/code/guanlintao/ml-optuna-eda-housing-priceprediction/notebook
[5.] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12, 2825-2830.