Energy Reports 8 (2022) 8511–8522
Contents lists available at ScienceDirect
Energy Reports
journal homepage: www.elsevier.com/locate/egyr

Research paper

Multi-dimensional data-based medium- and long-term power-load forecasting using double-layer CatBoost

Wen Xiang a,b, Peng Xu a, Junlong Fang a,∗, Qinghe Zhao a, Zhenggang Gu c, Qirui Zhang c

a Northeast Agricultural University, Harbin, CO 150038, China
b Economic and Technological Research Institute of State Grid Heilongjiang Electric Power Co., LTD, Harbin, CO 150038, China
c State Grid Heilongjiang Electric Power Co. LTD, Harbin, CO 150036, China

Article info
Article history: Received 29 January 2022; received in revised form 25 May 2022; accepted 21 June 2022; available online 1 July 2022
Keywords: Load forecasting; Machine learning; CatBoost; Randomised search CV

Abstract
In this study, a medium- and long-term power load prediction method is proposed based on the two-layer categorical boosting (CatBoost) algorithm with multi-dimensional feature considerations. The influences of economic fluctuation, power generation disruption, and meteorological data on power load are considered simultaneously, which broadens the dimension of the power-load forecasting data characteristics. A randomised search cross-validation (CV) regression model is also applied to model parameter optimisation. Real data from a province in northeast China were used for the training and test sets. In a comparison with nine advanced load prediction models, including eXtreme gradient boosting and adaptive boosting, the proposed method achieved a coefficient of determination (R²) of 0.925, a mean absolute percentage error (MAPE) of 0.0158, and a root-mean-square error (RMSE) of 274.2036. In this study, a popular, viable artificial intelligence technology, two-layer CatBoost, was explored, and multi-dimensional external variables of power generation were added for the first time for load prediction.
Finally, a higher-accuracy tree-based load forecasting model is presented. The method has good potential for use in medium- and long-term power-load forecasting applications.
© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

∗ Corresponding author. E-mail address: jlfang@neau.edu.cn (J. Fang).

1. Introduction

Medium- and long-term power-load forecasting is a necessary condition for ensuring the correct operation of power systems. Accurate power load forecasting is conducive to the accurate implementation of power activities, such as power generation, fuel procurement, maintenance, investment plans, and safety analysis (Zhong et al., 2014). By contrast, inaccurate medium- and long-term load forecasting increases the operating cost of the power system. For example, if the forecast is too high, the power supply will be in a thermal standby state for a long time, wasting generation energy and making power distribution inefficient. Conversely, if the forecast is too low, the generator sets will run at high energy consumption and the system may be unable to supply the required power, leading to power outages. Both situations hinder the safe and economic operation of the power system (Gao and Gao, 2014).

1.1. Literature survey

In general, the prediction methods covered in the literature are based on mathematical analyses. Smooth-curve (Mao et al., 2008; Ji and Wu, 2018), elastic-coefficient (Ertugrul, 2016), and fuzzy linear regression (Jiang et al., 2018; Liu et al., 2019) methods are widely used in power load prediction. The advantage of these algorithmic models is that they are simple to apply. However, they often consider only the relationship between time and historical load data when predicting the future power load, ignoring the numerous other factors that affect the load forecasting results.
Thus, these methods cannot meet the accuracy requirements of practical work.

To improve the accuracy of load forecasting and overcome the limitation of relying on a single factor, some researchers have incorporated multi-dimensional factors into medium- and long-term load forecasting. For example, in Gu (2004), the correlation between economy and load was determined based on the matter-element model using the concept of classification, in combination with the economic indicators of output value and gross domestic product of the three major industries. However, the selected indicators in the study were not comprehensive, because economic characteristics also include industry, investment, consumption, and other factors, which have varying degrees of impact on the power load.

https://doi.org/10.1016/j.egyr.2022.06.063 (ISSN 2352-4847)

In Luo et al. (2020), multi-time period data were included, and the dependence between economic and meteorological characteristics in different time periods was fully considered, demonstrating the influence of economic and meteorological data on power-load-forecasting results. However, the economic data in the article were provided annually and their accuracy is limited, which weakens to a certain extent the impact of economic data on power-load forecasting. Although these articles combine multiple factors, there is still room for improvement in the algorithms in terms of prediction accuracy. It is difficult to express mathematically the nonlinear relationship between power load and influencing factors.

With the growing number of intelligent algorithms, artificial intelligence forecasting based on anatomically informed basis functions (AIBF) technology is being increasingly used in load forecasting to improve accuracy. For example, the improved grey theory in Zhao and Wang (2021) aims to minimise the average relative error between the predicted and actual values. The one-dimensional search method was adopted to solve the model background value, thus addressing the poor accuracy associated with the weight coefficient of the model background. However, the initial value of the model was fuzzified, and prediction accuracy can still be improved. In Wang (2021), all the data in the original data series of the grey model were considered as initial values in the calculations, and the model value with the highest accuracy was selected as the initial value for prediction. The advantage of this method is the simplicity of the operation. However, because every dataset in the original data series can generate uncertain random errors, it cannot be inferred that the initial model value with the highest prediction accuracy matches any of the values listed in the original data column. Because the approach lacks a precise theoretical basis, its prediction accuracy can still be improved. In Liu et al. (2021), a multi-layer, bi-directional recurrent neural network model based on long short-term memory (LSTM) was proposed for power-load prediction. In this study, the training time and prediction time were shortened. However, the data classification was not sufficiently accurate, and this problem remains unresolved. In Yang et al. (2018), grey correlation analysis was adopted, and the improved fireworks algorithm was proved to be effective in optimising the weight coefficient of the background value and the correction term of the initial value of the grey model. Although this solved the problem of prediction accuracy, the problems of slow convergence, premature convergence, and low efficiency of the algorithm still need to be solved. In Zhang et al. (2019), a power-load forecasting method using LSTM was proposed. Compared with the traditional recurrent neural network model, a memory unit was added. In this study, the problem of vanishing or exploding gradients in recurrent neural networks was solved successfully. However, when a large amount of data was imported into the model, the computational efficiency deteriorated. In Yang et al. (2020), a load-forecasting method was proposed based on support vector regression for the distribution system. In this study, the parameters were integrated successfully using particle swarm optimisation, and the prediction effect improved compared with the traditional method. However, the large amount of misaligned data biased the prediction results. Some methods have proved effective in other fields, but their applicability to medium- and long-term power-load forecasting still needs to be determined (Zhang et al., 2020, 2022; Talaat et al., 2022).

1.2. Gaps and objectives

According to the latest progress on medium- and long-term power-load forecasting, some artificial intelligence methods have been applied, but there are still issues to address, such as slow convergence, high sample requirements, and overfitting. In summary, medium- and long-term load forecasting is characterised by a large prediction time span and long cycles. The main problem of the commonly used medium- and long-term load-forecasting methods is that they identify the correlation between the load and equal-interval, single-characteristic data, or consider the impact of different dimensional characteristic data on load forecasting independently. These methods rarely consider the relationship between multi-dimensional data and load simultaneously.
The continuous changes in system load data cannot be easily formulated mathematically, and the accuracy needs to be improved.

To address the above considerations, a medium- and long-term power load prediction method is proposed based on the two-layer categorical boosting (CatBoost) algorithm. In the first layer, economic, meteorological, and power generation data are input into the CatBoost algorithm. These data are processed and constructed into multiple learning tree models. These tree models are then ordered and boosted, which effectively addresses the prediction-shift problem. To estimate the degree of correlation between each factor and the power load, the tree models are combined continuously. Finally, the factors are ranked by degree of correlation, from high to low. In the second layer, the random search module is applied to the factors that are highly correlated with the power load, and a randomised search cross-validation (CV) regression model is applied to model parameter optimisation. Following model training, a power-load prediction model based on the CatBoost algorithm is obtained. In this study, the number of selected factors is particularly critical to the coefficient of determination (R²), which is used as the standard of model accuracy for this choice; the model with the first eleven factors gave the best load prediction. In comparisons with other algorithms, the mean absolute percentage error (MAPE) and root-mean-square error (RMSE) are also considered as standards of model accuracy. The experimental results show that, compared with nine load prediction models, including eXtreme gradient boosting (XGBoost) and adaptive boosting (AdaBoost), the proposed power-load prediction model based on the CatBoost algorithm displayed higher accuracy, better data interpretation capacity, and better model effects.
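As an illustration of the three accuracy measures named above, the following minimal Python sketch computes MAPE, RMSE, and R² for a pair of sequences. It assumes the usual definitions of MAPE and RMSE and takes R² as the ratio of the explained to the total sum of squares, which is the form given later in Section 2.4. The function names and the four sample values are hypothetical and are not data from the study.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (0.05 means 5%)."""
    return sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error, in the same unit as the load values."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Coefficient of determination computed as ESS / TSS (assumed form)."""
    mean_actual = sum(actual) / len(actual)
    tss = sum((a - mean_actual) ** 2 for a in actual)     # total sum of squares
    ess = sum((p - mean_actual) ** 2 for p in predicted)  # explained sum of squares
    return ess / tss


# Hypothetical daily loads, for illustration only (not data from the study).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

print(f"MAPE = {mape(actual, predicted):.4f}")
print(f"RMSE = {rmse(actual, predicted):.4f}")
print(f"R2   = {r_squared(actual, predicted):.4f}")
```

Note that many libraries compute R² as one minus the ratio of the residual to the total sum of squares instead; the two forms coincide only for least-squares linear fits, so values from the two conventions are not directly comparable for tree models.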
In this study, a popular, viable artificial intelligence technology, two-layer CatBoost, is explored for the first time for load prediction. Internal variables based on time transformation are set, and multi-dimensional external variables of climate, economy, and power generation are added. Through the node splitting and entropy in the feature importance observation tree, the normalised feature variable correlation graph is used to rank the correlation of the introduced feature variables. Using R² as the standard, the optimal number of characteristic variables is determined. To avoid feature standardisation, randomised-search CV is used to optimise the hyperparameters of the CatBoost regression model. The accuracy and prediction results of nine advanced models were compared with those of the method proposed herein, focusing on the advantages of the CatBoost algorithm among boosted decision-tree algorithms. MAPE, RMSE, and R² were considered as the evaluation criteria for model accuracy. The real data of a province in northeast China were adopted for the test and training sets. The experimental results show that the prediction accuracy of this method is higher than that of the comparison models. The main novelties include the exploration of two-layer CatBoost and the addition of multi-dimensional external variables of power generation for the first time for load prediction. Finally, a higher-accuracy tree-based load forecasting model is presented.

2. Experiments and methods

CatBoost can be flexibly used to handle various types of data, including continuous and discrete values. This makes the algorithm more suitable for medium- and long-term load forecasting. The algorithm has been proved to achieve a high prediction accuracy within a short parameter adjustment time. Herein, factor data are processed to obtain better load forecasting results, and a two-layer CatBoost algorithm for load prediction has been selected.

2.1. Data sources

Economic data from a region in China were selected from January 2017 to March 2020. The data source is the official website of the National Bureau of Statistics of China. The data include the consumer price index (X1), industrial producer purchasing price index (X2), producer price index (X3), industrial added value increase rate (X4), industrial added value increase rate (X5), real estate investment increase (X6), real-estate investment (X7), accumulated value of residential investment (X8), cumulative growth of residential investment (X9), cumulative value of residential investment production and construction area (X10), and the cumulative growth of real estate construction area (X11).

Meteorological data from a region in China from January 2017 to March 2020 were also selected. The data source is the official website of China's National Meteorological Administration. The data include the daily mean wind speed (X12), daily maximum sustained wind speed (X13), daily mean temperature (X14), daily maximum temperature (X15), and daily minimum temperature (X16).

Power generation data for a region in China from January 2017 to March 2020 were selected. The data source was China's state grid enterprises (internal data). They include the daily power generation (X17), accumulated power in current generation (X18), daily new energy generation (X19), new energy generation to the current cumulative value (X20), daily thermal power generation (X21), thermal power generation to the current cumulative value (X22), accumulated growth to the current generation growth rate (X23), and daily power generation year-on-year growth rate (X24).

The load data from January 2017 to March 2020 came from the internal data of China's state grid enterprises. The original collection interval of the load data was 15 min, and the experimental data were resampled to a period of 1 day. The data from January 1, 2017 to September 30, 2019 were considered as the training dataset, and the data from October 1, 2019 to March 31, 2020 as the test data.

2.1.1. Meteorological data processing

In this study, the changes in meteorological characteristics within a certain range were considered. Meteorological data, such as temperature data, that fell outside this range were regarded as abnormal (Sulandari et al., 2020). Abnormal data were corrected according to the monthly maximum, minimum, and average values and the adjacent data, and the quartiles were calculated for data with large differences between adjacent values. Then, the acceptable data value range was set. When the range coefficient β was equal to three, extreme abnormal numerical detection was initiated. For ordinal data, data outside the value range were detected and eliminated as abnormal data. For nominal data, the detection procedure for abnormal data was the same as that of the ordinal data. The acceptable range is

$$
Q_1 - \beta (Q_3 - Q_1) \;\sim\; Q_3 + \beta (Q_3 - Q_1) \tag{1}
$$

where Q1 is the first quartile, and Q3 is the third quartile. Once the training data are sorted from small to large, Q1 and Q3 are the values located at 25% and 75% of the sequence, respectively. Python's Pandas library supports the connection of two data frames with indices or specified columns. Herein, the fill function in Python's Pandas library was applied to fill in the missing data (Xiao, 2021). Occasionally, data losses occur because of specific time delays in data release or low update speed (Yu and Li, 2016).

2.1.2. Economic data processing

There is a close relationship between economic development and power load. Thus, daily economic parameter data were selected as reference because of their high reference value and accuracy (Wang et al., 2021). In general, there was no data distortion; however, in the process of data collection, there will inevitably be some abnormal or unpublished data, which can lead to abnormal input values and abnormal fluctuations. If these data are not processed before model training, the training effect and accuracy of the model will be reduced. Missing values can be estimated by means of correlation. In view of the characteristics of regional economic data, the methods used for handling abnormal data in this study are described below.

The first step involved horizontal processing of the data. In general, economic data are smooth, with few differences between similar data. In this case, horizontal processing can be used as follows:

$$
y_t = \frac{y_{t-1} + y_{t+1}}{2} \tag{2}
$$

The second step is to process the data vertically. In general, economic data also have a certain periodicity, and the values at corresponding time points of different periods differ only slightly. In this case, vertical processing can be used as follows:

$$
y_t = \frac{y_{d-1} + y_{d+1}}{2} \tag{3}
$$

The third step is to conduct four-point data processing. By combining the horizontal and vertical economic data, better conclusions can be drawn, and the processed data will be relatively more accurate:

$$
y_t = \frac{y_{t-1} + y_{t+1} + y_{d-1} + y_{d+1}}{4} \tag{4}
$$

Finally, the authors applied the fill-in function in Python's Pandas library, a data analysis package for Python, to fill in missing values in the data (Hou, 2020). One option involves the specification of what the missing value will be replaced with. For the growth rate of economic data, it is more acceptable to replace all missing values with zeros, as this will not affect the overall growth trend (Khwaja et al., 2020). The authors converted the data on economic indicators into daily data, to reflect the long-term economic development and satisfy the requirements of daily data analysis while meeting the needs of overall forecasting. The transformation formula can be specified as follows:

$$
y_t = E_{d_1}(t) = E_{d_2}(t) = \cdots = E_{d_n}(t) = E_m(t) \tag{5}
$$

where d_n stands for the nth day, E_{d_n}(t) stands for the economic data value on day d_n of each month, and E_m(t) stands for the monthly economic data value.

2.1.3. Power generation data processing

In general, power loads are characterised by periodicity, and there are specific similarities between loads and power generation at the same time of the day. Power generation data correlate positively with power load data. When power generation increases, it can solve the problem of insufficient load supply and demand. Therefore, the corresponding power generation data need to be kept within a certain range. The authors set the maximum possible variation range of the predicted value according to the power generation data of two time points (Barman et al., 2019). If the absolute value of the difference between the generation value and the generation data at the two time points exceeds the threshold value, the generation data are considered abnormal (Liang et al., 2019). The power generation data process is described below.

The first step involved the setting of the target function:

$$
\begin{cases}
|y(i,t) - y(i,t-1)| > \alpha(t) \\
|y(i,t) - y(i,t+1)| > \beta(t)
\end{cases} \tag{6}
$$

Then,

$$
y(i,t) = \frac{y(i,t+1) + y(i,t-1)}{2} + y(i-1,t) - \frac{y(i-1,t+1) + y(i-1,t-1)}{2} \tag{7}
$$

where α(t) and β(t) represent the range interval and load data, respectively.

In the second step, the previously available data of the same day are substituted into the above formula. This yielded

$$
\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_k \end{bmatrix}
=
\begin{bmatrix}
\sum_{i=1}^{n} a_i g_i(y_1) \\
\sum_{i=1}^{n} a_i g_i(y_2) \\
\vdots \\
\sum_{i=1}^{n} a_i g_i(y_k)
\end{bmatrix}
-
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix} \tag{8}
$$

Consider the least-squares principle,

$$
\sum_{k=1}^{m} \delta_k^2 = \Delta_{\min} \tag{9}
$$

$$
\sum_{k=1}^{m} \delta_k \frac{\partial \delta_k}{\partial a_i} = 0 \quad (i = 1, 2, \ldots, n) \tag{10}
$$

The coefficients a_i can thus be solved, and the fitting curve can be obtained. This fitting curve can be used to correct the missing data.

2.2. First CatBoost layer

The first layer of CatBoost is the factor correlation analysis. The advantage of CatBoost is that it reduces the probability of misclassifying sample data. CatBoost adopts the weighted voting method to increase the weight of data associated with small errors, thereby improving the accuracy of the predicted results (Malekizadeh et al., 2020). In this study, the load forecasting process based on the first-layer CatBoost algorithm is outlined below.

Set $x_i \in X \subseteq \mathbb{R}^n$ and $y_i \in Y = \{-1, +1\}$. Initialise the training data weights,

$$
D_1 = (\omega_{11}, \ldots, \omega_{1i}, \ldots, \omega_{1N}) \tag{11}
$$

$$
\omega_{1i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N \tag{12}
$$

Let the training samples with weights be learnt to obtain the classifier G_m(x_i). The classification error rate of G_m(x) is calculated as

$$
e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} \omega_{mi} I(G_m(x_i) \neq y_i) \tag{13}
$$

where

$$
I = \begin{cases} 1, & G_m(x_i) \neq y_i \\ 0, & \text{other} \end{cases} \tag{14}
$$

Calculate the weight of the classifier G_m(x_i). Then,

$$
\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m} \tag{15}
$$

Update the weight distribution:

$$
D_{m+1} = (\omega_{m+1,1}, \omega_{m+1,2}, \ldots, \omega_{m+1,N}), \quad
\omega_{m+1,i} = \frac{\omega_{mi}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad i = 1, 2, \ldots, N \tag{16}
$$

The normalisation factor is thus defined as

$$
Z_m = \sum_{i=1}^{N} \omega_{mi} \exp(-\alpha_m y_i G_m(x_i)) \tag{17}
$$

Repeat the process M times (m = 1, 2, 3, ..., M) to reduce the amount of overfitting and computation required by CatBoost, thus making the computation faster when large amounts of data are involved.

2.3. Second CatBoost layer

The second layer of CatBoost is used to establish the load-forecasting model. CatBoost can be executed based on a second-order Taylor expansion, wherein both first and second derivatives are used, thus making the solution of the model more efficient. In the hypothetical data,

$$
\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{18}
$$

K is the number of subsets, F denotes all the possible subsets, f_k(x_i) represents a subset, and the model consists of subsets. For the best parameters, the objective function is defined as

$$
\mathrm{Obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{19}
$$

By adding the prediction model, the result of the tth accumulation becomes

$$
\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{20}
$$

By expanding the objective function (second-order Taylor series) and removing the constant term in the objective function, the following equation can be obtained:

$$
\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{21}
$$

The complexity term in the objective function can be used to prevent data overfitting. The datasets were randomly sorted and formed into groups based on random permutations (Sadaei et al., 2019). Assuming a given sequence, the average of each set of data in the same category was calculated (Ko and Lee, 2013). All the classified data were converted into numerical results. This model can be expressed as

$$
\hat{x}_{k}^{i} = \frac{\sum_{j=1}^{p-1} \left[ x_{\sigma_j,k} = x_{\sigma_p,k} \right] \cdot Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} \left[ x_{\sigma_j,k} = x_{\sigma_p,k} \right] + a} \tag{22}
$$

where P represents the prior term, and a represents the weight coefficient (a > 0). In this study, the pseudocode of the two-layer CatBoost is given as follows:

Algorithm 1: Updating the models and calculating model values for gradient estimation
input: {(X_k, Y_k)}, k = 1, ..., n, ordered according to Y_k; the number of trees I
M_i ← 0 for i = 1, ..., n
for iteration ← 1 to I do
    for i ← 1 to n do
        for j ← 1 to i − 1 do
            g_j ← d/da Loss(y_j, a), evaluated at a = M_i(X_j)
        M ← LearnOneTree((X_j, g_j) for j = 1, ..., i − 1)
        M_i ← M_i + M
return M_1, ..., M_n; M_1(X_1), M_2(X_2), ..., M_n(X_n)

2.4. Evaluation criteria for algorithm performance

The authors established a machine learning model and provided an evaluation value to assess the model. To verify the consistency between the predicted results and the actual load, MAPE, RMSE, and R² were introduced as evaluation indices (Li et al., 2018). MAPE is the most extensively used measure of predictive accuracy in enterprises and organisations (Massaoudi et al., 2021). It is used to reflect the average degree of relative error. The smaller the value of MAPE in the following equation, the smaller the error will be.

$$
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} |\delta| = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{23}
$$

RMSE is usually used in validation experiments on climate predictions to penalise data items with large errors (Fu et al., 2018). It is therefore more sensitive to large deviations. Given that this parameter represents prediction bias, the smaller the value of RMSE in the following equation, the smaller the prediction bias will be.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \varepsilon^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{24}
$$

In the linear regression model, R² is considered the contribution rate of the explanatory variable with respect to the change in the response variable. The closer it is to the value of one, the better the regression will be (Ahmad and Zhang, 2020). The calculation model is as follows:

$$
\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{25}
$$

$$
\mathrm{ESS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{26}
$$

$$
R^2 = \mathrm{ESS} / \mathrm{TSS} \tag{27}
$$

3. Results and discussion

3.1. Data processing results

3.1.1. Economic data processing results

The results of economic data after processing are as follows. Because of the availability of reliable economic data, only a small amount of abnormal data cleaning and data filling was carried out in this study. The marked positions in the graph indicate the missing part of the data (see Figs. 3.1, 3.2 and 3.3).

Fig. 3.1. Results of price index data pre-processing.
Fig. 3.2. Results of increase rate data pre-processing.
Fig. 3.3. Results of real-estate investment and growth data pre-processing.

3.1.2. Meteorological data processing results

The meteorological data after processing are as follows. Because meteorological data are easy to obtain, there are few outliers and missing values in this study. We only filled in a small amount of missing data and deleted outlier data. In this section, the marked positions in the graph indicate the missing part of the data (Figs. 3.4 and 3.5).

Fig. 3.4. Wind speed pre-processing results.
Fig. 3.5. Temperature data pre-processing results.

3.1.3. Power generation data processing results

The processed power generation data are displayed below. Because the power generation data in this study come from the internal data of the enterprise, the data sources are considered preferable; there are no missing values, and only a small number of abnormal values are observed. Therefore, only a small amount of outlier data was deleted. The marked positions in the graph indicate the missing part of the data (Figs. 3.6 and 3.7).

Fig. 3.6. Results of electrical energy generation data pre-processing.
Fig. 3.7. Results of electrical generation growth data pre-processing.

3.2. Results and discussion on factor correlation

The data from January 1, 2017 to September 30, 2019 were considered as the training dataset, and the data from October 1, 2019 to March 31, 2020 as the test data. According to the method in Section 2, the first-layer CatBoost prediction model was established. The order of feature importance is shown in Fig. 3.8. Among the features, X25, X26, X27, X28, and X29 denote the year, month, and day of the corresponding data, the conditions of holidays, and cold days, respectively. In the feature importance diagram based on entropy, the cumulative values of real estate and residential investments are the most important external variables, accounting for 7.23% and 6.77% of the economic data, respectively. The maximum and average temperatures accounted for 5.83% and 5.49% of the climate data, respectively. In the power data, the daily and cumulative values of power generation were the most important, and accounted for 7.20% and 6.56% of the data, respectively.

Fig. 3.8. Feature importance diagram.

3.3. Selection results of experimental factors

The number of selected relevant factors had a specific influence on the accuracy of the load forecasting results. Herein, the number of factors used in load forecasting was increased step by step. R² was regarded as the criterion for selecting the number of factors. The higher the value of R², the better the accuracy of the corresponding predicted outcome. When 10 to 13 related factors are selected, the R² value is the largest, as shown in Fig. 3.9. Finally, 11 relevant factors were selected. In order of relevance, these factors are: time (month) judgment condition (X26), time (day) judgment condition (X27), cumulative increase in real estate investment (X6), daily power generation (X17), accumulated residential investment value (X8), daily cumulative power generation (X18), daily maximum temperature (X15), daily average temperature (X14), thermal power generation to the current day cumulative value (X22), daily minimum temperature (X16), and daily thermal power generation (X21).

3.4. Experimental results and discussion

In this study, Python was used as a tool for all algorithmic models. To build an accurate load-forecasting model, the parameters of the load-forecasting model need to be adjusted, but the manual adjustment of the parameters is laborious. Therefore, the Python random search module 'Randomised-Search-CV' was used to quickly adjust various combination parameters. Randomised-Search-CV does not try every single combination of hyperparameters in detail. When the number of random searches is limited, the hyperparameters are sampled randomly, and the search can still get close to the best set. The model parameters of the algorithms were set as follows:

(1) CatBoost: learning_rate = 0.04, iterations = 700, depth = 6, l2_leaf_reg (reg_lambda) = 2.
(2) XGBoost: learning_rate = 0.02, n_estimators = 500, max_depth = 8, subsample = 0.56.
(3) AdaBoost: learning_rate = 0.08, n_estimators = 464.

In this study, default values were selected for the parameters of the other prediction algorithms (Table 3.1). LinearSVR implements a linear support vector regression machine based on liblinear. It has greater flexibility in the selection of penalty and loss functions and can be easily extended to a large number of samples (Liu and Zhang, 2022). Decision tree is a decision analysis method based on the known probability of occurrence of various scenarios. It is an intuitive graphical method of probability analysis for calculating the probability that the expected value of net present value is greater than or equal to zero as well as for evaluating project risk and assessing feasibility (Shi et al., 2022). Multilayer perceptron (MLP) is a feedforward artificial neural network model, which maps multiple input datasets to a single output dataset (Lu and Yang, 2022). Random forest, which belongs to ensemble learning, a major branch of machine learning, integrates many decision trees into a forest to predict the final result (Yang et al., 2021). Gradient boosting decision tree (GBDT) is an ensemble algorithm based on decision trees. It is a widely used algorithm, which can be used for classification and regression (Xia, 2022). The bagging algorithm, an ensemble learning algorithm in the field of machine learning, can be combined with other classification and regression algorithms to improve accuracy and stability while avoiding overfitting by reducing the variance of the results (Huang et al., 2016). Extra trees is an extremely randomised tree method, which is also an ensemble machine learning algorithm (Zhang, 2020).

To verify the actual performance of the established load-forecasting algorithm, the test set was applied to obtain the prediction results. The experimental result of each algorithm is shown in Table 3.2 and Figs. 3.10, 3.11 and 3.12. As shown in Fig. 3.10
and Table 3.2, the CatBoost algorithmic model is the best for load forecasting, with MAPE = 0.0158, RMSE = 274.2036, and R2 = 0.9250. Compared with LinearSVR, decision-tree, MLP, random-forest, bagging, GBDT, extra-tree, XGBoost, and AdaBoost algorithms, the RMSE of CatBoost model yielded reductions of 17.45%, 72.36%, 100.40%, 44.30%, 125.18%, 9.23%, 43.98%, 20.65%, 65.68%, respectively, the MAPE of CatBoost model yielded reductions of 13.29%, 84.17%, 122.78%, 40.50%, 8518 W. Xiang, P. Xu, J. Fang et al. Energy Reports 8 (2022) 8511–8522 Fig. 3.9. Relationship between the number of characteristic variables and R2 . Fig. 3.12. Comparison of R2 values of the tested algorithms. Fig. 3.10. Comparison of root-mean-square values of the tested algorithms. Table 3.1 Default parameters of tested algorithms. Algorithm Default parameters LinearSVR intercept_scaling: 1.0 max_iter: 1000 tol: 0.0001 min_samples_leaf: 1 min_samples_split: 2 min_weight_fraction_leaf: 0.0 alpha: 0.0001 max_fun: 15000 max_iter: 200 min_samples_leaf: 1 min_samples_split: 2 min_weight_fraction_leaf: 0.0 alpha: 0.9 max_depth: 3 n_estimators: 100 max_features: 1.0, max_samples: 1.0, n_estimators: 10, min_samples_leaf: 1 min_samples_split: 2 min_weight_fraction_leaf: 0.0 DecisionTree (DT) Multilayer Perceptron (MLP) RandomForest (RF) Gradient Boosting Decision Tree (GBDT) Bagging Fig. 3.11. Comparison of mean average percentage error values of the tested algorithms. ExtraTree (ET) 155.06%, 9.49%, 37.97%, 18.98%, and 55.06%, respectively, and the R2 of CatBoost model increases of 3.31%, 30.02%, 41.74%, 7.64%, 96.14%, 1.49%, 10.51%, 3.69%, and 13.83%, respectively. In this study, the top three algorithmic models excelling in accuracy were selected for load forecasting, namely CatBoost, RF, and XGB, and their results were compared with the actual values. The load forecasting was for a province in northeast China, from July to December 2020. 
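The randomised search over hyperparameters used for model tuning in this section can be illustrated with a small, dependency-free sketch. It is not the tuning code of this study: the one-dimensional ridge model, the synthetic data, and the names fit_ridge, cv_rmse, randomised_search, and samplers are all assumptions introduced for illustration. The sketch only shows the idea that a fixed number of randomly drawn settings is scored by cross-validation instead of enumerating every combination.

```python
import math
import random


def fit_ridge(xs, ys, l2, fit_intercept):
    """Closed-form 1-D ridge regression; returns (slope, intercept)."""
    x0 = sum(xs) / len(xs) if fit_intercept else 0.0
    y0 = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - x0) * (y - y0) for x, y in zip(xs, ys))
    sxx = sum((x - x0) ** 2 for x in xs)
    slope = sxy / (sxx + l2)
    return slope, y0 - slope * x0


def cv_rmse(xs, ys, params, k=5):
    """Mean validation RMSE over k contiguous folds."""
    n = len(xs)
    bounds = [round(i * n / k) for i in range(k + 1)]
    scores = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        # Train on everything outside the fold, validate on the fold.
        slope, intercept = fit_ridge(xs[:lo] + xs[hi:], ys[:lo] + ys[hi:], **params)
        sq_err = [(y - (slope * x + intercept)) ** 2
                  for x, y in zip(xs[lo:hi], ys[lo:hi])]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return sum(scores) / k


def randomised_search(xs, ys, samplers, n_iter, seed=0):
    """Score n_iter randomly drawn settings; return (best_params, best_score, history)."""
    rng = random.Random(seed)
    history = []
    for _ in range(n_iter):
        params = {name: draw(rng) for name, draw in samplers.items()}
        history.append((params, cv_rmse(xs, ys, params)))
    best_params, best_score = min(history, key=lambda item: item[1])
    return best_params, best_score, history


# Synthetic data (an assumption for illustration): y = 3x + 50 + noise.
data_rng = random.Random(42)
xs = [data_rng.uniform(0, 10) for _ in range(100)]
ys = [3 * x + 50 + data_rng.gauss(0, 1) for x in xs]

# Each hyperparameter is drawn from its own distribution on every iteration.
samplers = {
    "l2": lambda rng: 10 ** rng.uniform(-3, 3),  # log-uniform penalty strength
    "fit_intercept": lambda rng: rng.choice([True, False]),
}
best_params, best_score, history = randomised_search(xs, ys, samplers, n_iter=20)
print(best_params, round(best_score, 3))
```

In practice, a library routine such as scikit-learn's RandomizedSearchCV, which the module name above presumably refers to, would play the role of randomised_search, with its n_iter argument controlling how many settings are sampled.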
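As a worked check of the evaluation metrics and the relative differences quoted above, the following sketch computes RMSE, MAPE, and R2 for a small made-up series and then recomputes the CatBoost versus XGBoost gaps from the Table 3.2 values. The function names and the toy series are assumptions for illustration, and R2 is written here in the common 1 - RSS/TSS form, which may differ in presentation from the formula used earlier in the paper.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (0.0158 means 1.58%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - RSS/TSS."""
    mean = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean) ** 2 for t in y_true)
    return 1 - rss / tss


def relative_gap(value, reference):
    """Percentage by which value exceeds reference."""
    return 100 * (value - reference) / reference


# Toy series (hypothetical values, not data from the study).
actual = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
print(rmse(actual, predicted), mape(actual, predicted), r2(actual, predicted))

# Table 3.2 values: (RMSE, MAPE, R2) for CatBoost and XGBoost.
cat = (274.2036, 0.0158, 0.9250)
xgb = (322.0762, 0.0179, 0.8953)
rmse_gap = relative_gap(xgb[0], cat[0])  # XGBoost RMSE relative to CatBoost
mape_gap = relative_gap(xgb[1], cat[1])  # XGBoost MAPE relative to CatBoost
r2_gain = relative_gap(cat[2], xgb[2])   # CatBoost R2 relative to XGBoost
print(round(rmse_gap, 2), round(mape_gap, 2), round(r2_gain, 2))
```

Note that the error gaps are taken relative to the CatBoost value, whereas the R2 gain is taken relative to the compared model; this is the convention that reproduces the percentages reported in the text.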
The load-forecasting results after standardisation are shown in Fig. 3.13. The method proposed in this study is the closest to the actual values. As shown in Fig. 3.13, the CatBoost, RF, and XGB algorithmic models predicted the power load trend accurately. The refrigeration power load decreased considerably around October 2020 as the climate cooled. The predicted trend of the three models is consistent with the actual values, verifying the sensitivity and effectiveness of the models. Among them, CatBoost had the highest degree of fit between the predicted and actual curves, and the model is more sensitive than the other two algorithms. On the day when the forecasts were closest to the actual load, CatBoost, RF, and XGB achieved MAPE values of 3.57%, 5.44%, and 7.92%, respectively. On the day when the forecasts deviated the most, CatBoost, RF, and XGB reached MAPE values of 14.43%, 17.57%, and 18.16%, respectively.

Fig. 3.13. Relationship between predicted and actual values of the three algorithm models tested herein.

Table 3.2
Experimental results of tested algorithms.
Algorithm    Root-mean-square error (RMSE)    Mean absolute percentage error (MAPE)    R2
Cat          274.2036                         0.0158                                   0.9250
XGB          322.0762                         0.0179                                   0.8953
Ada          472.6259                         0.0291                                   0.7114
LinearSVR    549.5250                         0.0352                                   0.6526
DT           395.6999                         0.0222                                   0.8593
MLP          617.4581                         0.0403                                   0.4716
RF           299.5294                         0.0173                                   0.9114
GBDT         394.8017                         0.0218                                   0.8370
Bagging      330.8425                         0.0188                                   0.8920
ET           454.3186                         0.0245                                   0.8126

The medium- and long-term load forecasting model based on the CatBoost algorithm proposed herein reduced the overall forecast error considerably. The experimental results showed that CatBoost had the highest prediction accuracy and the best experimental effect compared with the other algorithms. Boosting models generally perform better when dealing with a variety of external variables, which is attributed to their tree structure; in addition, the ensemble method incorporated into the model helps suppress overfitting in the regression task, that is, the load prediction task of this study. By contrast, some of the aforementioned studies can provide good model interpretability and high accuracy in the scheduling task, offering more meaningful guidance for related power planning.

4. Conclusion

In this study, the double-layer CatBoost algorithm was used for medium- and long-term power-load forecasting, subject to the influences of multi-dimensional characteristics. The dataset comprised internal variables based on time transformation and multi-dimensional external variables in meteorology, economy, and power generation. The introduced feature variables were ranked by their degree of correlation using the normalised feature-variable correlation graph, according to the node splitting and entropy observed in the feature importance tree, and the optimal number of feature variables was then identified based on R2. The CatBoost algorithm was optimised to avoid feature standardisation. The purpose of this study was to achieve a better grey model than the time series itself and improve the accuracy of the model. In our next study, more advanced high-performance algorithms will be compared.

Abbreviations and Nomenclature

MAPE      Mean absolute percentage error
RMSE      Root-mean-squared error
TSS       Total sum-of-squares
ESS       Explained sum-of-squares
R2        Coefficient-of-determination
Zm        Normalisation factor
ŷi        Y estimate
Obj(t)    Objective function
x̂ik       X processing value
Gm(xi)    Predicted label

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Role of the funding source

The authors acknowledge the funding of the scientific research project by the State Grid Heilongjiang Co. LTD (funding code 522448190001).

References

Ahmad, T., Zhang, H., 2020. Novel deep supervised ML models with feature selection approach for large-scale utilities and buildings short and medium-term load requirement forecasts. Energy (Oxford) 209, 118477. http://dx.doi.org/10.1016/j.energy.2020.118477.
Barman, M., Behari, N., Choudhury, D., 2019. Season specific approach for short-term load forecasting based on hybrid FA-SVM and similarity concept. Energy (Oxford) 174, 886–896. http://dx.doi.org/10.1016/j.energy.2019.03.010.
Ertugrul, Ö.F., 2016. Forecasting electricity load by a novel recurrent extreme learning machines approach. Int. J. Electr. Power Energy Syst. 78, 429–435. http://dx.doi.org/10.1016/j.ijepes.2015.12.006.
Fu, X., Zeng, X., Feng, P., Cai, X., 2018. Clustering-based short-term load forecasting for residential electricity under the increasing-block pricing tariffs in China. Energy (Oxford) 165, 76–89. http://dx.doi.org/10.1016/j.energy.2018.09.156.
Gao, D., Gao, S., 2014. Summary of research on medium and long term power load forecasting. Sci. Technol. Innov. Guide 7 (25), http://dx.doi.org/10.3969/j.issn.1674-098X.2014.07.017.
Gu, J., 2004. Study on the model of mid-long term load forecasting for power system based on matter element. Proc. CSU-EPSA 16 (6), 68–71. http://dx.doi.org/10.1023/B:JOGO.0000006653.60256.f6.
Hou, B., 2020. Data analysis of communication system based on Python. Commun. Technol. 53 (7), 1715–1720. http://dx.doi.org/10.3969/j.issn.10020802.2020.07.023.
Huang, X., Li, W., Song, T., Wang, Y., 2016. Application of bagging-CART algorithm optimized by genetic algorithm in transformer fault diagnosis. High Volt. Eng. http://dx.doi.org/10.13336/j.1003-6520.hve.20160412052.
Ji, B., Wu, Z., 2018. Application of exponential smoothing method in power system load forecasting. Technol. Innov. Appl. 30, 173–174, CNKI:SUN:CXYY.0.2018-30-077.
Jiang, H., Zhang, Y., Muljadi, E., Zhang, J.J., Gao, D.W., 2018. A short-term and high-resolution distribution system load forecasting approach using support vector regression with hybrid parameters optimization. IEEE T. Smart Grid 9 (4), 3341–3350. http://dx.doi.org/10.1109/TSG.2016.2628061.
Khwaja, A.S., Anpalagan, A., Naeem, M., Venkatesh, B., 2020. Joint bagged-boosted artificial neural networks: Using ensemble machine learning to improve short-term electricity load forecasting. Electr. Pow. Syst. Res. 179, 106080. http://dx.doi.org/10.1016/j.epsr.2019.106080.
Ko, C., Lee, C., 2013. Short-term load forecasting using SVR (support vector regression)-based radial basis function neural network with dual extended Kalman filter. Energy 49, 413–422. http://dx.doi.org/10.1016/j.energy.2012.11.015.
Li, Y., Che, J., Yang, Y., 2018. Subsampled support vector regression ensemble for short term electric load forecasting. Energy (Oxford) 164, 160–170. http://dx.doi.org/10.1016/j.energy.2018.08.169.
Liang, Y., Niu, D., Hong, W., 2019. Short term load forecasting based on feature extraction and improved general regression neural network model. Energy (Oxford) 166, 653–663. http://dx.doi.org/10.1016/j.energy.2018.10.119.
Liu, Z., Liu, A., Li, Y., 2021. Medium term load forecasting model based on attention RESNET LSTM network. Chem. Autom. Instrum. 48 (6), 575–580, 1000-3932(2021)06-0575-07. http://qikan.cqvip.com/Qikan/Article/Detail?id=7106067568.
Liu, X., Teng, H., Gong, Y., Teng, D., 2019. Short-term load forecasting based on the improved Kalman filter algorithm. Electr. Meas. Instrum. 56 (3), 42–46. http://dx.doi.org/10.19753/j.issn1001-1390.2019.03.007.
Liu, M., Zhang, Q., 2022. Prediction of strip crown based on support vector machine and neural network. CAAI Trans. Intell. Syst. http://dx.doi.org/10.11992/tis.202101002.
Lu, H., Yang, S., 2022. Three-dimensional object detection algorithm based on deep neural networks for automatic driving. J. BEIJING Univ. Technol. http://dx.doi.org/10.11936/bjutxb2021100027.
Luo, S., Ma, M., Jiang, L., Jin, B., Lin, Y., Diao, X., Li, C., Yang, B., 2020. Medium and long-term load forecasting method considering multi-time scale data. Proc. CSEE 40, 11–19. http://dx.doi.org/10.13334/j.0258-8013.pcsee.190550.
Malekizadeh, M., Karami, H., Karimi, M., Moshari, A., Sanjari, M.J., 2020. Short-term load forecast using ensemble neuro-fuzzy model. Energy (Oxford) 196, 117–127. http://dx.doi.org/10.1016/j.energy.2020.117127, 2020.
Mao, L., Jiang, Y., Long, R., Li, N., Huang, H., Huang, S., 2008. Medium- and long-term load forecasting based on partial least squares regression analysis. Power Syst. Technol. 32 (19), 71–77, CNKI:SUN:DWJS.0.2008-19-020.
Massaoudi, M., Refaat, S.S., Chihi, I., Trabelsi, M., Oueslati, F.S., Abu-Rub, H., 2021. A novel stacked generalization ensemble-based hybrid LGBM-XGB-MLP model for short-term load forecasting. Energy 214, 118874. http://dx.doi.org/10.1016/j.energy.2020.118874.
Sadaei, H.J., de Lima, P.C., Silva, E., Guimarães, F.G., Lee, M.H., 2019. Short-term load forecasting by using a combined method of convolutional neural networks and fuzzy time series. Energy (Oxford) 175, 365–377. http://dx.doi.org/10.1016/j.energy.2019.03.081.
Shi, H., Gao, T., Ding, M., Li, Z., Zhang, Z., Yan, J., 2022. Wind power multi-interval composite short-term prediction method based on trend clustering and decision tree. Acta Energiae Solaris Sinica http://dx.doi.org/10.19912/j.02540096.tynxb.2020-0734.
Sulandari, W., Lee, M.H., Rodrigues, P.C., 2020. Indonesian electricity load forecasting using singular spectrum analysis, fuzzy systems and neural networks. Energy (Oxford) 190, 116408. http://dx.doi.org/10.1016/j.energy.2019.116408.
Talaat, M., Taghreed, S., Mohamed, A.E., Hatata, A.Y., 2022. Integrated MFFNNMVO approach for PV solar power forecasting considering thermal effects and environmental conditions. Electr. Power Energy Syst. 135, 107570. http://dx.doi.org/10.1016/j.ijepes.2021.107570, 2022.
Wang, Q., 2021. Analysis and prediction of medium and long term load characteristics of power system based on spatial auto regressive model. J. Northeast Electr. Power Univ. 41 (3), 118–123. http://dx.doi.org/10.19718/j.issn.10052992.2021-03-0118-06.
Wang, Z., Zhou, X., Tian, J., Huang, T., 2021. Hierarchical parameter optimization-based support vector regression for power load forecasting. Sustain. Cities Soc. 71, 102937. http://dx.doi.org/10.1016/j.scs.2021.102937.
Xia, B., 2022. Mechanical speed prediction method of air percussion rotary drilling based on GBDT algorithm. Manuf. Autom. 1009-0134(2022)03-018504, http://qikan.cqvip.com/Qikan/Article/Detail?id=7106847234.
Xiao, H., 2021. Review of Python technology in data visualization. Netw. Inf. Eng. 13, 87–89. http://dx.doi.org/10.16520/j.cnki.1000-8519.2021.13.029.
Yang, N., Li, H., Yuan, J., 2018. Medium- and long-term load forecasting method considering grey correlation degree analysis. Proc. CSU-EPSA 30 (6), 108–114. http://dx.doi.org/10.3969/j.issn.1003-8930.2018.06.017.
Yang, J., Luo, C., Zhang, S., 2020. Short-term load forecasting based on phase space reconstruction and SVR coupling model. Electr. Meas. Instrum. 57 (16), 96–100. http://dx.doi.org/10.19753/j.issn1001-1390.2020.16.017.
Yang, S., Wu, L., Liu, D., 2021. Cooling load prediction and characteristic analysis of terminal based on random forest. Buil. Energy Environ. 1003-0344(2021)12-001-6, http://qikan.cqvip.com/Qikan/Article/Detail?id=7106541478.
Yu, Y., Li, W., 2016. A hybrid short-term load forecasting method based on improved ensemble empirical mode decomposition and back propagation neural network. J. Zhejiang Univ. Sci. A 17 (2), 101–114. http://dx.doi.org/10.1631/jzus.A1500156.
Zhang, Y., 2020. Research on software defect prediction based on extra tree. Intell. Comput Appl. http://dx.doi.org/10.11907/rjdk.191625.
Zhang, Y., Ai, Q., Lin, L., Yuan, S., Li, Z., 2019. A very short-term load forecasting method based on deep LSTM RNN at zone level. Power Syst. Technol. 43 (06), 1884–1892. http://dx.doi.org/10.13335/j.1000-3673.pst.2018.2101.
Zhang, Z., Dou, C., Yue, D., Zhang, B., 2022. Predictive Voltage Hierarchical Controller Design for Islanded Microgrids under Limited Communication. IEEE, http://dx.doi.org/10.1109/TCSI.2021.3117048, 2022.
Zhang, Z., Mishra, Y., Dong, Y., Dou, C., Zhang, B., Tian, Y.C., 2020. Delay-Tolerant Predictive Power Compensation Control for Photovoltaic Voltage Regulation. IEEE, http://dx.doi.org/10.1109/TII.2020.3024069, 2020.
Zhao, W., Wang, F., 2021. Prediction model for medium and long term electric load based on improved grey theory. Northeast Electr. Power Technol. 32 (7), 325–331. http://dx.doi.org/10.3969/j.issn.1004-7913.2011.07.015.
Zhong, Q., Sun, W., Yu, N., Liu, C., Wang, F., Zhang, X., 2014. Load and power forecasting in active distribution network planning. Proc. CS 34 (19), 3050–3056. http://dx.doi.org/10.13334/j.0258-8013.pcsee.2014.19.002.

Wen Xiang (1989-), female, intermediate engineer, received the M.S. degree in Agricultural Electrification and Automation from Northeast Agricultural University, China, in 2015. She is now studying for a doctorate in the College of Electrical and Information, Northeast Agricultural University, mainly researching power grid planning. Meanwhile, she works at the Economic and Technological Research Institute of State Grid Heilongjiang Electric Power Co. LTD, where she is mainly responsible for investment and evaluation work.

Peng Xu (1996-), male, postgraduate, graduated from Northeast Agricultural University, China, in 2018, majoring in electrical engineering and automation.
He is now studying for a master's degree in the School of Electrical Information of Northeast Agricultural University, mainly studying machine learning and power grid planning.

Junlong Fang (1971-), male, professor, doctor of engineering, doctoral supervisor, is Dean of the School of Electrical and Information, Northeast Agricultural University, and reserve leader of agricultural electrification and automation, a provincial key discipline of Northeast Agricultural University. He is executive director of the Heilongjiang Electrical Engineering Society, director of the Heilongjiang Automation Society, and a member of the agricultural power special committee of the Heilongjiang Agricultural Engineering Society. His research directions are power system automation, information processing, and intelligent measurement and control.

Zhenggang Gu (1979-), male, bachelor's degree, master's degree, graduated from Harbin University of Science and Technology. He works at the State Grid Heilongjiang Electric Power Co. LTD, mainly researching investment, evaluation, and power grid data analysis.

Qinghe Zhao (1995-), male, received the Bachelor of Engineering degree in electrical engineering from Northeast Agricultural University, China. He is currently working toward the PhD degree in agricultural electrification at NEAU, China. His major research focuses on the application of advanced algorithms in power systems and load forecasting. His research interests lie in the areas of machine learning with GBDT and deep learning.

Zhang Qirui (1980-), male, master's degree, graduated from Harbin Institute of Technology, majoring in electrical engineering and automation. His research directions are power grid investment management and power system and automation. He has participated in Heilongjiang province's power grid investment interface research, power grid development diagnosis and analysis, power grid project post-evaluation, and power grid investment ability research, and is now in charge of power grid.