An Optimal House Price Prediction Algorithm: XGBoost

An accurate prediction of house prices is a fundamental requirement for various sectors, including real estate and mortgage lending. It is widely recognized that a property’s value is not solely determined by its physical attributes but is significantly influenced by its surrounding neighborhood. Meeting the diverse housing needs of individuals while balancing budget constraints is a primary concern for real estate developers. 

  • house price prediction
  • XGBoost
  • feature engineering
  • feature importance
  • hyperparameter tuning
  • machine learning
  • regression modeling

1. Introduction

Housing is one of the basic human needs. House price prediction is of utmost importance for real estate and mortgage lending organizations due to the significant contribution of the real estate sector to the global economy. This process is beneficial not only for businesses but also for buyers, as it helps mitigate risks and bridges the gap between supply and demand [1]. To estimate house prices, regression methods are commonly employed, utilizing numerous variables to create models [2]. An efficient and accessible housing price prediction model has numerous benefits for various stakeholders. Real estate businesses can utilize the model to assess risks and make informed investment decisions. Mortgage lending organizations can leverage it to evaluate loan applications and determine appropriate interest rates. Buyers can use the model to estimate the affordability of properties and make informed purchasing decisions. Most importantly, the recent instability of house prices has made prediction models more necessary than ever.
Previous studies [3][4][5] have applied various machine learning (ML) algorithms for house price prediction, with the focus on developing a model; not much attention has been paid to house price predictors. Researchers' literature review findings suggest that various traditional ML algorithms have been studied; however, there is a need to identify the optimal methodology for house price prediction. For example, Madhuri et al. [6] compared multiple linear regression, lasso regression, ridge regression, elastic net regression, and gradient boosting regression algorithms for house price prediction. However, their study did not propose an optimal solution, because they applied the regression algorithms with default settings only, with no attempt to achieve optimality. The dearth of research in this area underscores the need for a more comprehensive study of the diverse elements that contribute to the effectiveness of house price predictive models. By delving deeper into the identification and analysis of these influential factors, researchers can unveil valuable insights that will aid in achieving an optimal house price prediction model. This is beneficial to the real estate sector for understanding the significant factors that influence house costs.
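To illustrate what "default settings only" looks like in practice, the sketch below fits an out-of-the-box XGBoost regressor and reports RMSE and R-squared. It is a minimal illustration rather than any cited study's pipeline: the California housing data merely stands in for a house price dataset, and all names are assumptions.

```python
# Minimal sketch: an XGBoost regressor with default hyperparameters,
# the kind of unoptimized baseline most prior studies relied on.
# The California housing data is an illustrative stand-in dataset.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = XGBRegressor(random_state=42)  # default settings only, no tuning
model.fit(X_train, y_train)

pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print(f"RMSE: {rmse:.4f}, R-squared: {r2_score(y_test, pred):.4f}")
```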

2. Optimal House Price Prediction Algorithm

Predicting house prices provides insights into economic trends, guides investment decisions, and supports the development of effective policies for sustainable housing markets. The study in [7] emphasized the reliance of real estate investors and portfolio managers on house price predictions for making informed investment decisions. Recent market trends have demonstrated a clear connection between the accuracy of these predictions and the improved optimization of investment portfolios. Anticipating fluctuations in house prices empowers investors to proactively adapt their portfolios, seize emerging opportunities, and strategically navigate risks, leading to more robust and resilient investment outcomes. Furthermore, the authors in [8] discussed how individuals can gain a better understanding of real estate for their own personal investment and financing decisions. Similarly, [9] added that financial institutions and policymakers recognize house price trends as an economic indicator, since fluctuations in house prices can affect consumer spending, borrowing, and the overall economy. The study in [10] proposed an intuitive theoretical model of house prices, where the demand for housing was driven by how much individuals could borrow from financial institutions. A borrower’s level of debt depends on his or her disposable income and the current interest rate. The study showed that actual house prices and the amount individuals can borrow are related in the long run, with plausible and statistically significant adjustments. The authors in [11] argued that the landscape influences the real estate market, adding that macro-variables (foreign exchange) and micro-variables (such as transportation access, financial stability, and stocks) can change land prices and can therefore be used to predict future land prices.
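One plausible formalization of the borrowing mechanism described in [10] (the exact functional form here is an illustrative assumption, not the authors' specification) is the standard annuity bound on borrowing capacity:

$$ B = \frac{\theta Y}{r}\left(1 - (1 + r)^{-n}\right), $$

where $Y$ is disposable income, $\theta$ the share of income available for debt service, $r$ the per-period interest rate, and $n$ the loan term. Under this bound, borrowing capacity, and with it housing demand and prices, rises with disposable income and falls with the interest rate, consistent with the long-run relationship the study reports.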
ML has revolutionized the process of uncovering patterns and making reliable predictions, because ML involves acquiring knowledge from past experience in relation to specific tasks and performance criteria [12]. ML algorithms fall into two main categories, namely the supervised and the unsupervised approach [13]. The supervised ML approach uses a subset of labeled data (where the target variable is known) for training and then tests on the remaining data to make predictions on unseen datasets [14]. The unsupervised ML approach, by contrast, does not require a labeled dataset; it facilitates analysis by uncovering hidden patterns and makes predictions from unlabeled datasets [15]. In the context of house price prediction, previous studies have conceptualized the problem as a classification task [16] or a regression task [17]. Supervised ML algorithms are capable of modeling both tasks. An example of the classification approach is the work of [16], which aimed to predict whether the closing house price was greater than or less than the listing house price. The target variable was transformed to “high” when the closing price was greater than or equal to the listing price and to “low” when the closing price was lower than the listing price. Their classification results showed that RIPPER (repeated incremental pruning to produce error reduction) outperformed C4.5, naïve Bayes, and AdaBoost on the Fairfax County, Virginia house dataset, which consisted of 5359 townhouse records.
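As an illustration of the classification framing used in [16], the following sketch derives the binary target from listing and closing prices. The column names and toy values are hypothetical placeholders, not the actual Fairfax County data schema.

```python
# Sketch of the target transformation described in [16]: label a record
# "high" when the closing price is greater than or equal to the listing
# price, and "low" otherwise. Columns and values are hypothetical.
import pandas as pd

sales = pd.DataFrame({
    "listing_price": [450_000, 320_000, 610_000],
    "closing_price": [460_000, 300_000, 610_000],
})
sales["price_class"] = (
    sales["closing_price"] >= sales["listing_price"]
).map({True: "high", False: "low"})
print(sales)  # price_class becomes the label for a classifier
```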
Most studies have approached the house price prediction problem as a regression task, so as to provide estimates that are predictive in determining the direction of future trends. For example, in China, [18] used 9875 records of Jinan city estate market data for house price prediction. The paper showed that CatBoost was superior to multiple linear regression and random forest, with an R-squared of 91.3% and an RMSE of 772.408. In the Norwegian housing market, [19] introduced a squared percentage error (SPE) loss function to improve XGBoost for a house price prediction model, and showed that the resulting algorithm, named SPE-XGBoost, achieved the lowest RMSE of 0.154. The authors in [17] used a Boston (USA) house dataset, consisting of 506 entries and 14 features, to implement a random forest regressor and achieved an R-squared of 90%, an MSE (mean square error) of 6.7026, and an RMSE (root mean square error) of 2.5889. Similarly, Ref. [20] showed that lasso regression outperformed linear regression, polynomial regression, and ridge regression on the Boston house dataset, with an R-squared of 88.79% and an RMSE of 2.833. The authors in [6] used the King County housing dataset to compare multiple linear regression, ridge regression, lasso regression, elastic net regression, AdaBoost regression, and gradient boosting, and showed that gradient boosting achieved the superior result. However, it is worth stating that most of these studies applied a basic (default) regression model without considering optimizing the model and did not perform a comprehensive analysis of feature importance. For illustration, a summary of the literature findings is provided in Table 1 below.
Table 1.
Summary of the literature evidencing the datasets used and their findings.

Author | Dataset | Findings | RMSE
Zou [18] | Jinan city estate market, China | CatBoost is superior to multiple linear regression and random forest, with an R-squared of 91.3%. | 772.408
Hjort et al. [19] | Norwegian housing market | SPE-XGBoost achieved the lowest RMSE compared with linear regression, nearest-neighbour regression, random forest, and SE-XGBoost. | 0.154
Adetunji et al. [17] | Boston (USA) house dataset | Random forest regressor achieved an R-squared of 90% and an MSE (mean square error) of 6.7026. | 2.5889
Sanyal et al. [20] | Boston (USA) house dataset | Lasso regression outperformed linear regression, polynomial regression, and ridge regression, with an R-squared of 88.79%. | 2.833
Madhuri et al. [6] | King County housing (USA) | Gradient boosting showed a superior result, with an adjusted R-squared of 91.77%, over multiple linear regression, ridge regression, lasso regression, elastic net regression, and AdaBoost regression. | 10,971,390,390
Aijohani [1] | King County housing (USA) | Ridge regression outperformed lasso regression and multiple linear regression, with an adjusted R-squared of 67.3%. | 224,121
Viana and Barbosa [21] | King County (KC), USA; Fayette County (FC), USA; São Paulo (SP), Brazil; Porto Alegre (POA), Brazil | Spatial interpolation attention network and linear regression showed robust performance over other models such as random forest, LightGBM, XGBoost, and auto-sklearn. | 115,763 (KC); 22,783 (FC); 154,964 (SP); 94,201 (POA)

In summary, the researchers reviewed the recent literature, focusing on the techniques utilized, to provide up-to-date information on house price prediction models. The findings showed that only a few studies considered optimality and the significance of features. To evidence this, the techniques (including any optimization approach) used in previous studies are summarized in Table 2 below.

Table 2.
Summary of the recent literature evidencing techniques/optimization.

Author(s) | Method | Hyperparameter Tuning
Azimlu et al. [22] | ANN, GP, Lasso, Ridge, Linear, Polynomial, SVR | Not performed
Wang [23] | OLS Linear Regression, Random Forest | Not performed
Fan et al. [24] | Ridge Linear Regression, Lasso Linear Regression, Random Forest, Support Vector Regressor (Linear Kernel and Gaussian Kernel), XGBoost | GridSearchCV
Viana and Barbosa [21] | Linear Regression, Random Forest, LightGBM, XGBoost, Auto-sklearn, Regression Layer | Keras (Hyperas)
Aijohani [1] | Multiple Regression, Lasso Regression, Ridge Regression | Not performed
Sharma et al. [25] | Linear Regression, Gradient Boosting Regressor, Histogram Gradient Boosting Regressor, and Random Forest | Not performed
Madhuri et al. [6] | Multiple Regression, Lasso Regression, Ridge Regression, Elastic Net Regression, and Gradient Boosting Regression | Not performed
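To make concrete the hyperparameter tuning step that Table 2 shows most studies omitted, the sketch below runs a grid search over an XGBoost regressor with GridSearchCV, in the spirit of Fan et al. [24]. The parameter grid and stand-in dataset are illustrative assumptions, not a prescription from any cited study.

```python
# Minimal sketch of GridSearchCV-based tuning for an XGBoost regressor,
# i.e., the optimization step most prior studies skipped.
# The search grid and dataset below are illustrative assumptions.
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import GridSearchCV, train_test_split
from xgboost import XGBRegressor

X, y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

param_grid = {
    "n_estimators": [200, 400],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    XGBRegressor(random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # sklearn negates RMSE
    cv=5,
)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```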