
Power Load Forecasting with Double-Layer CatBoost

Energy Reports 8 (2022) 8511–8522
Contents lists available at ScienceDirect
Energy Reports
journal homepage: www.elsevier.com/locate/egyr
Research paper
Multi-dimensional data-based medium- and long-term power-load
forecasting using double-layer CatBoost
Wen Xiang a,b, Peng Xu a, Junlong Fang a,∗, Qinghe Zhao a, Zhenggang Gu c, Qirui Zhang c
a Northeast Agricultural University, Harbin, CO 150038, China
b Economic and Technological Research Institute of State Grid Heilongjiang Electric Power Co., LTD, Harbin, CO 150038, China
c State Grid Heilongjiang Electric Power Co. LTD, Harbin, CO 150036, China
Article info
Article history:
Received 29 January 2022
Received in revised form 25 May 2022
Accepted 21 June 2022
Available online 1 July 2022
Keywords:
Load forecasting
Machine learning
CatBoost
Randomised search CV
Abstract
In this study, a medium- and long-term power load prediction method is proposed based on the
two-layer categorical boosting (CatBoost) algorithm with multi-dimensional feature considerations.
Simultaneously, the influences of economic fluctuation, power generation disruption, and meteorological data on power load are considered, whereby the dimension of power-load forecasting data
characteristics is broadened. A randomised search cross-validation (CV) regression model is also
applied to model parameter optimisation. Real data from a province in northeast China were used for
the training and test sets. In comparisons with nine advanced load prediction models, including eXtreme
gradient boosting and adaptive boosting, the proposed method achieved a coefficient of determination (R²)
of 0.925, a mean absolute percentage error (MAPE) of 0.0158, and a root-mean-square error (RMSE)
of 274.2036. In this study, a popular, viable artificial intelligence technology, two-layer CatBoost, was
explored, and multi-dimensional external variables of power generation were added for the first time
for load prediction. Finally, a higher accuracy load forecasting tree model is presented. The method
has good potential for use in medium- and long-term power-load forecasting applications.
© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND
license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Medium- and long-term power-load forecasting is a necessary
condition for ensuring the correct operation of power systems.
Correct power load forecasting is conducive to the accurate implementation of power activities, such as power generation, fuel
procurement, maintenance, investment plans, and safety analysis (Zhong et al., 2014). By contrast, inaccurate middle- and
long-term load-forecasting increases the operating cost of the
power system. For example, if the prediction result is too large,
the power supply will be in a thermal standby state for a long
time, thus resulting in the waste of power generation energy and
inefficient power distribution. However, if the prediction result is
too small, it will cause high-energy consumption of the generator
set, and can render the system incapable of supplying power,
leading to power outage. The above two situations will hinder
the safe and economic operation of the power system (Gao and
Gao, 2014).
∗ Corresponding author.
E-mail address: jlfang@neau.edu.cn (J. Fang).
1.1. Literature survey
In general, the prediction methods covered in the literature
are usually based on mathematical analyses. Smooth curve (Mao
et al., 2008; Ji and Wu, 2018), elastic coefficient (Ertugrul, 2016),
and fuzzy linear regression (Jiang et al., 2018; Liu et al., 2019)
are widely used in power load prediction. The advantages of
these algorithmic models are reflected in the simplicity of their
applications. However, often only the relationship between time
and historical load data is considered to predict the future power
load value, ignoring the numerous factors affecting the load forecasting results. Thus, these methods cannot meet the accuracy
requirements of practical work.
To improve the accuracy of load forecasting and solve the
problem of a single factor affecting the accuracy of power load,
some researchers incorporated multi-dimensional factors into
medium and long-term load forecasting. For example, in Gu
(2004), the correlation between economy and load was determined based on the matter-element model using the concept
of classification, in combination with the economic indicators
of output value and gross-domestic product of the three major
industries. However, the selected indicators in the study were
not comprehensive because economic characteristics also include
industry, investment, consumption, and other factors, which have
varying degrees of impact on the power load. In Luo et al. (2020),
multi-time period data were included, and the dependence between economic and meteorological
characteristics in different time periods was fully considered, demonstrating the influence
of economic and meteorological data on power-load-forecasting results. However, the economic
data in the article were provided annually, and the accuracy is limited, which weakens to a
certain extent the impact of economic data on power-load forecasting. In these articles,
although the combination of multiple factors is realised, there is still room for improvement
in the algorithm in terms of prediction accuracy.
It is difficult to express mathematically the nonlinear relationship between power load and
influencing factors. With the growing number of intelligent algorithms, artificial intelligence
forecasting based on anatomically informed basis functions (AIBF) technology is being
increasingly used in load forecasting to improve accuracy. For example, the improved grey
theory in Zhao and Wang (2021) aims to minimise the average relative error between the
predicted and actual values. The one-dimensional search method was adopted to solve the model
background value, thus addressing the poor accuracy issue associated with the weight
coefficient of the model background. However, the initial value of the model was fuzzified,
and prediction accuracy can still be improved. In Wang (2021), all the data in the original
data series of the grey model were considered as initial values in the calculations, and the
model value with the highest accuracy was selected as the initial value for prediction. The
advantage of this method is the simplicity of the operation. However, because every dataset
in the original data series is capable of generating uncertain random errors, it cannot be
inferred that the initial model value with the highest prediction accuracy can match any of
the values listed in the original data column. The lack of a precise theoretical basis leaves
room for improving prediction accuracy. In Liu et al. (2021), a multi-layer, bi-directional,
recursive, neural network model based on long short-term memory (LSTM) was proposed for
power-load prediction. In that study, the training time and prediction time were shortened.
However, the accuracy of data classification was not sufficient, and the problem of inaccurate
data classification still remains. In Yang et al. (2018), grey correlation analysis was
adopted, and the improved fireworks algorithm was proved to be effective in optimising the
weight coefficient of the background value and the correction term of the initial value of
the grey model. Although this solved the problem of prediction accuracy, the problems of slow
convergence speed, premature convergence, and low efficiency of the algorithm still need to be
solved. In Zhang et al. (2019), a power-load forecasting method using LSTM was proposed.
Compared with the traditional recursive neural network model, a memory unit was added. In that
study, the problem associated with the disappearing or exploding recurrent neural network
gradient was solved successfully. However, with a large amount of data imported into the
model, the computational efficiency deteriorated. In Yang et al. (2020), a load-forecasting
method was proposed based on support vector regression for the distribution system. In that
study, the parameters were integrated successfully using particle swarm optimisation, and the
prediction effect improved compared with the traditional method. However, the large amount of
misaligned data biased the prediction results. The methods in some articles have been proven
to be effective in other fields, but their applicability to medium- and long-term power-load
forecasting still needs to be determined (Zhang et al., 2020, 2022; Talaat et al., 2022).
https://doi.org/10.1016/j.egyr.2022.06.063
2352-4847/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1.2. Gaps and objectives
According to the latest progress on medium- and long-term
power-load forecasting, some artificial intelligence methods have
been applied, but there are still issues to address, such as the slow
convergence speed, high-sample requirements, and overfitting.
In summary, medium- and long-term load forecasting has
large prediction time span and long-cycle characteristics. The
main problems of the current, commonly used medium- and
long-term load-forecasting methods are the identification of the
correlation between the load, equal intervals, and single characteristic data, or the independent consideration of the impact
of different dimensional characteristic data on load forecasting. These methods rarely consider the relationship between
multi-dimensional data and load simultaneously. The continuous changes in system load data cannot be easily formulated
mathematically, and the accuracy needs to be improved.
To address the above considerations, a medium- and long-term power load prediction method is proposed based on the
two-layer categorical boosting (CatBoost) algorithm. In the first
layer, economic, meteorological, and power generation data are
input into the CatBoost algorithm. These data are processed and
constructed into multiple learning tree models. These tree models are then sorted and promoted, which can effectively solve
the prediction offset problem. To estimate the degree of correlation between each factor and power load, the tree model is
combined continuously. Finally, the factors with the correlation
degree (from high to low) are identified. In the second layer, the
random search module is applied to the factors which are highly
correlated with the power load while a randomised search cross-validation (CV) regression model is applied to model parameter
optimisation. Following model training, a power-load prediction
model based on the CatBoost algorithm is proposed. In this study,
the selection of the number of factors is particularly critical to the
coefficient of determination (R²), which is considered a standard
of model accuracy. It is thus concluded that the load prediction
effect of the model with the first eleven factors is the best. In
comparisons with other algorithms, the mean absolute percentage
error (MAPE) and root-mean-square error (RMSE) are considered
as standards of model accuracy. As observed in the experimental
results, compared with the predicted values of nine load prediction models, including eXtreme gradient boosting (XGBoost)
and adaptive boosting (AdaBoost), the proposed power-load prediction model based on the CatBoost algorithm in this study
displayed higher accuracy, better data interpretation capacity,
and better model effects.
In this study, a popular, viable artificial intelligence technology, two-layer CatBoost, is explored for the first time for load
prediction. The internal variables based on time transformation
are set, and the multi-dimensional external variables of climate,
economy, and power generation are added. Through the node
splitting and entropy in the feature importance observation tree,
the normalised feature variable correlation graph is used to rank
the correlation of the introduced feature variables. Using R2 as
the standard, the optimal number of characteristic variables is
determined. To prevent the features from being standardised,
randomised-search CV is used to optimise the hyper parameters of the algorithm in the CatBoost regression model. The
accuracy and prediction results of nine advanced models were
compared with the methods proposed herein, thus focusing on
the advantages of the CatBoost algorithm in decision boosting
tree algorithms. MAPE, RMSE, and R2 were considered as the
evaluation criteria for model accuracy. The real data of a province
in northeast China were adopted for the test and training sets.
The experimental results show that the prediction accuracy of
this method is higher than that of the comparison models. The
main novelties include the exploration of two-layer CatBoost and
the addition of multi-dimensional external variables of power
generation for the first time for load prediction. Finally, a higher-accuracy
tree-based load-forecasting model is presented.
2. Experiments and methods
25% and 75% of the data in the sequence, respectively. Python’s
Pandas library supports the connection of two data frames with
indices or specified columns. Herein, the fill function in Python’s
Pandas library was applied to fill in the missing data (Xiao, 2021).
Occasionally, data losses occur because of specific time delays
in data release or low-update speed (Yu and Li, 2016). CatBoost
can be flexibly used to handle various types of data, including
continuous and discrete values. This algorithm is more suitable
for medium- and long-term load forecasting. The algorithm has
been proved to achieve a high-predictive readiness rate within
a short parameter adjustment time. Herein, factor data are processed to obtain better load forecasting results, and a two-layer
CatBoost algorithm for load prediction has been selected.
2.1.2. Economic data processing
There is a close relationship between economic development
and power load. Thus, daily economic parameter data were selected as reference because of their high-reference value and
accuracy (Wang et al., 2021). In general, there was no data distortion; however, in the process of data collection, there will
inevitably be some abnormal data or unpublished data, which
can lead to abnormal input values and abnormal fluctuations. If
these data are not processed before model training, the training
effect and training accuracy of the model will be reduced. Missing
values can be obfuscated by means of correlation. In view of the
characteristics of economic data in regions, the methods used for
handling abnormal data in this study are described below.
The first step involved horizontal processing of the data. In
general, economic data are smooth with a few differences between similar data. In this case, horizontal processing can be used
as follows:
$$y_t = \frac{y_{t-1} + y_{t+1}}{2} \tag{2}$$
The second step is to process the data vertically. In general,
economic data also have a certain periodicity, and the value of
a time point corresponding to the period differs by little. In this
case, vertical processing can be used as follows,
2.1. Data sources
Economic data from a region in China were selected from January 2017 to March 2020. The data source is the official website
of the National Bureau of Statistics of China. Data include the consumer price index, X1, industrial producer purchasing price index,
X2, producer price index, X3, industrial added value increase rate,
X4, industrial added value increase rate, X5, real estate investment increase, X6, real-estate investment, X7, accumulated value
of residential investment, X8, cumulative growth of residential
investment, X9, cumulative value of residential investment production and construction area, X10, and the cumulative growth
of real estate construction area, X11.
Meteorological data from a region in China from January 2017
to March 2020 were also selected. The data source is the official
website of China’s National Meteorological Administration. The
data include the daily mean wind speed, X12, daily maximum
sustained wind speed, X13, daily mean temperature, X14, daily
maximum temperature, X15, and daily minimum temperature,
X16.
Power generation data for a region in China from January 2017
to March 2020 were selected. The data source was China’s state
grid enterprises (internal data). They include the daily power
generation, X17, accumulated power in current generation, X18,
daily new energy generation, X19, new energy generation to the
current cumulative value, X20, daily thermal power generation,
X21, thermal power generation to the current cumulative value,
X22, accumulated growth to the current generation growth rate,
X23, and daily power generation year-on-year growth rate, X24.
The load data from January 2017 to March 2020 came from
the internal data of China’s state grid enterprises. The original
collection interval of load data was 15 min, and the experimental
data were resampled to a daily resolution. The data from
January 1, 2017 to September 30, 2019 were considered as the
training dataset, and the data from October 1, 2019 to March 31,
2020 as the test data.
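The resampling and chronological split described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the source does not state how the 15-min readings were aggregated, so averaging them per calendar day is an assumption, and the names `resample_daily` and `split_by_date` are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta


def resample_daily(readings):
    """Collapse (timestamp, load) pairs into one value per calendar day.

    Assumption: the daily value is the mean of that day's readings.
    """
    buckets = defaultdict(list)
    for timestamp, load in readings:
        buckets[timestamp.date()].append(load)
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


def split_by_date(daily, first_test_day):
    """Chronological split: days before first_test_day train, the rest test."""
    train = {d: v for d, v in daily.items() if d < first_test_day}
    test = {d: v for d, v in daily.items() if d >= first_test_day}
    return train, test


# Two synthetic days of 15-min readings (96 per day) around the split date.
start = datetime(2019, 9, 30)
readings = [(start + timedelta(minutes=15 * k), 100.0 + k // 96) for k in range(192)]
daily = resample_daily(readings)
train, test = split_by_date(daily, date(2019, 10, 1))
```

Splitting by date rather than at random keeps every test day strictly after the training period, which matches the forecasting setting.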
2.1.1. Meteorological data processing
In this study, the changes in meteorological characteristics
within a certain range were considered. Meteorological data, such
as temperature data, have been regarded as abnormal (Sulandari
et al., 2020). Abnormal data were corrected according to monthly
maximum, minimum, average, and adjacent data, calculating the
quartiles for data with large differences in adjacent values. Then,
the acceptable data value range was set. When the range was
equal to three, extreme abnormal numerical detection was initiated. For ordinal data, data outside the value range were detected
and eliminated as abnormal data. For nominal data, the detection
procedure for abnormal data was the same as that of the ordinal
data.
$$Q_3 + \beta (Q_3 - Q_1) \sim Q_1 - \beta (Q_3 - Q_1) \tag{1}$$
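The acceptance range above can be computed in a few lines of Python. The sketch below is illustrative only: it assumes β = 3 (one reading of the remark about the value three and extreme-value detection), it uses the inclusive quantile definition of the standard library, and the function names are hypothetical.

```python
from statistics import quantiles


def iqr_bounds(values, beta=3.0):
    """Return (lower, upper) limits Q1 - beta*(Q3 - Q1) and Q3 + beta*(Q3 - Q1)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    return q1 - beta * spread, q3 + beta * spread


def flag_outliers(values, beta=3.0):
    """Mark every value that falls outside the acceptance range."""
    low, high = iqr_bounds(values, beta)
    return [not (low <= v <= high) for v in values]


# Example: one implausible reading among otherwise similar values.
flags = flag_outliers([1, 2, 3, 4, 5, 50])
```

A larger β widens the range, so only the most extreme readings are flagged.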
$$y_t = \frac{y_{d-1} + y_{d+1}}{2} \tag{3}$$
The third step is to conduct four-point data processing. By combining horizontal and vertical economic data, better conclusions
can be drawn, and the processed data will be relatively more
accurate:
$$y_t = \frac{y_{t-1} + y_{t+1} + y_{d-1} + y_{d+1}}{4} \tag{4}$$
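The horizontal, vertical, and four-point corrections can be illustrated on a flat list of samples. This sketch assumes that index t runs along the series, that the vertical neighbours d − 1 and d + 1 are the same position one period earlier and one period later, and that all four neighbours exist and are valid. The function names and the `period` argument are hypothetical.

```python
def horizontal_fill(series, index):
    """Mean of the two neighbours in time."""
    return (series[index - 1] + series[index + 1]) / 2


def vertical_fill(series, index, period):
    """Mean of the same position in the previous and the next period."""
    return (series[index - period] + series[index + period]) / 2


def four_point_fill(series, index, period):
    """Mean of the two horizontal and the two vertical neighbours."""
    return (
        series[index - 1] + series[index + 1]
        + series[index - period] + series[index + period]
    ) / 4


# Three periods of length 3; the value at index 4 is missing.
# The index must be interior so that all four neighbours exist.
series = [10, 20, 30, 12, None, 36, 14, 26, 34]
series[4] = four_point_fill(series, index=4, period=3)
```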
Finally, the authors applied the fill-in function in Python’s
Pandas library, which is a data analysis package from Python for
filling in missing values in the data (Hou, 2020). One option involves the specification of what the missing value will be replaced
with. For the growth rate of economic data, it is more acceptable
to replace all missing values with zeros, as this will not affect the
overall growth trend (Khwaja et al., 2020). The authors converted
the data on economic indicators into daily data, to reflect the
long-term economic development, and satisfy the requirements
of daily data analysis while meeting the needs of overall forecasting. The transformation formula can be specified as follows,
$$E_{d_1}(t) = E_{d_2}(t) = \cdots = E_{d_n}(t) = E_m(t) \tag{5}$$
where d_n stands for the nth day, E_{d_n}(t) stands for the economic
data value on day d_n of each month, and E_m(t) stands for the
monthly economic data value.
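The monthly-to-daily conversion amounts to repeating each monthly value on every day of that month. A minimal sketch follows, assuming the monthly figures are keyed by (year, month); the function name `monthly_to_daily` is hypothetical.

```python
import calendar
from datetime import date


def monthly_to_daily(monthly):
    """Assign each monthly value to every calendar day of its month."""
    daily = {}
    for (year, month), value in monthly.items():
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            daily[date(year, month, day)] = value
    return daily


# Example with two months of a hypothetical growth-rate indicator.
daily_values = monthly_to_daily({(2020, 2): 1.5, (2020, 3): 2.0})
```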
2.1.3. Power generation data processing
In general, power loads are characterised by periodicity, and
there are specific similarities between loads and power generation at the same time of the day. Power generation data
positively correlate with power load data. When power generation increases, it can solve the problem of insufficient load supply
and demand. Therefore, the corresponding power generation data
need to be kept within a certain range. The authors set the
maximum possible variation range of the predicted value according to the power generation data of two time points (Barman
where Q1 is the first quartile, and Q3 is the third quartile. Once the
training data are sorted from small to large, Q1 and Q3 represent
et al., 2019). If the absolute value of the difference between the
generation value and the generation data at the two time points
exceeds the threshold value, the generation data are considered
abnormal (Liang et al., 2019). The power generation data process
is described below.
The first step involved the setting of the target function.
$$\begin{cases} |y(i,t) - y(i,t-1)| > \alpha(t) \\ |y(i,t) - y(i,t+1)| > \beta(t) \end{cases} \tag{6}$$
Then,
$$y(i,t) = \frac{y(i,t+1) + y(i,t-1)}{2} + y(i-1,t) - \frac{y(i-1,t+1) + y(i-1,t-1)}{2} \tag{7}$$
where α(t) and β(t) represent the range interval and load data,
respectively.
In the second step, the previously available data of the same
day are substituted into the above formula. This yielded,
$$\begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_k \end{bmatrix} = \begin{bmatrix} \sum_{i=1}^{n} a_i g_i(y_1) \\ \sum_{i=1}^{n} a_i g_i(y_2) \\ \vdots \\ \sum_{i=1}^{n} a_i g_i(y_k) \end{bmatrix} - \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_k \end{bmatrix} \tag{8}$$
Consider the least-squares principle,
$$\sum_{k=1}^{m} \delta_k^2 = \Delta_{\min} \tag{9}$$
$$\sum_{k=1}^{m} \delta_k \frac{\partial \delta_k}{\partial a_i} = 0 \quad (i = 1, 2, \ldots, n) \tag{10}$$
a_i can thus be solved, and the fitting curve can be obtained.
This fitting curve can be used to correct the missing data.
2.2. First CatBoost layer
The first layer of CatBoost is the factor correlation analysis.
The advantage of CatBoost is that it will reduce the probability of misclassifying sample data. CatBoost adopts the weighted
voting method to increase the weight of data associated with
small errors, thereby improving the accuracy of the predicted
results (Malekizadeh et al., 2020).
In this study, the load forecasting process based on the first
layer CatBoost algorithm is outlined below.
Set x_i ∈ X ⊆ R^n, y_i ∈ Y = {−1, +1}. Initialise the training data
weights,
$$D_1 = (\omega_{11}, \ldots, \omega_{1i}, \ldots, \omega_{1N}) \tag{11}$$
$$\omega_{1i} = \frac{1}{N}, \quad i = 1, 2, \ldots, N \tag{12}$$
Let the training samples with weights be learnt to obtain the
classifier G_m(x_i). The classification error rate of G_m(x) is calculated
as
$$e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} \omega_{mi} I(G_m(x_i) \neq y_i) \tag{13}$$
where
$$I = \begin{cases} 1, & G_m(x_i) \neq y_i \\ 0, & \text{other} \end{cases} \tag{14}$$
Calculate the weight of the classifier G_m(x_i). Then,
$$\alpha_m = \frac{1}{2} \log \frac{1 - e_m}{e_m} \tag{15}$$
Update the weight distribution:
$$D_{m+1} = (\omega_{m+1,1}, \omega_{m+1,2}, \ldots, \omega_{m+1,N})$$
$$\omega_{m+1,i} = \frac{\omega_{mi}}{Z_m} \exp(-\alpha_m y_i G_m(x_i)), \quad i = 1, 2, \ldots, N \tag{16}$$
The normalisation factor is thus defined as
$$Z_m = \sum_{i=1}^{N} \omega_{mi} \exp(-\alpha_m y_i G_m(x_i)) \tag{17}$$
Repeat the process M times (m = 1, 2, 3, ..., M) to reduce the
amount of overfitting and computation required by CatBoost, thus
making the computation faster when large amounts of data are
involved.
2.3. Second CatBoost layer
The second layer of CatBoost is used to establish the load-forecasting model. CatBoost can be executed based on a second-order Taylor expansion, wherein both first and second derivatives
are used, thus making the solution of the model more efficient.
In the hypothetical data,
$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in F \tag{18}$$
K is the number of subsets, ŷ_i denotes all the possible subsets,
f_k(x_i) represents a subset, and the model consists of subsets. For
the best parameters, the objective function is defined as,
$$\mathrm{Obj}(\theta) = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k) \tag{19}$$
By adding the prediction model, the result of the tth accumulation becomes
$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) \tag{20}$$
By expanding the objective function (second order Taylor series)
and removing the constant term in the objective function, the
following equation can be obtained.
$$\mathrm{Obj}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \tag{21}$$
The complexity in the objective function can be used to prevent data overfitting. The datasets were randomly sorted and
formed into groups based on random permutations (Sadaei et al.,
2019). Assuming a given sequence, the average of each set of data
in the same category was calculated (Ko and Lee, 2013). All the
classified data were converted into numerical results. This model
can be expressed as
$$\hat{x}_k^i = \frac{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] \cdot Y_{\sigma_j} + a \cdot P}{\sum_{j=1}^{p-1} [x_{\sigma_j,k} = x_{\sigma_p,k}] + a} \tag{22}$$
where P represents the prior term, and a represents the
weight coefficient of the prior (a > 0).
In this study, the pseudocode of the two-layer CatBoost is
given as follows:
3.1.2. Meteorological data processing results
The meteorological data after processing are as follows.
Because meteorological data are easy to obtain, there are few
outliers and missing values in this study. We only filled in a small
amount of missing data and deleted outlier data. In this section,
the marked positions in the graph indicate the missing part of the
data (Figs. 3.4 and 3.5).
Algorithm 1: Updating the models and calculating model
values for gradient estimation
input: {(X_k, Y_k)}, k = 1 … n, ordered according to Y_k, and the
number of trees is I;
M_i ← 0 for i = 1 … n;
for iteration ← 1 to I do
    for i ← 1 to n do
        for j ← 1 to i - 1 do
            g_j ← d/da Loss(y_j, a) evaluated at a = M_i(X_j);
        M ← LearnOneTree((X_j, g_j) for j = 1 … i - 1);
        M_i ← M_i + M;
return M_1 … M_n; M_1(X_1), M_2(X_2) … M_n(X_n)
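The conversion of categorical values described in Section 2.3 shares one idea with Algorithm 1: each sample is processed using only the samples that precede it in a random permutation. The sketch below illustrates that idea for the category encoding, with a prior P and a weight a > 0 as in the text. It is a simplified illustration under these assumptions, not the CatBoost library implementation, and the function name is hypothetical.

```python
import random


def ordered_target_statistic(categories, targets, prior, a=1.0, seed=0):
    """Encode each category using only samples earlier in a random permutation."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)  # random permutation of the samples

    sums, counts = {}, {}  # running target sum and count per category
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        seen_sum = sums.get(cat, 0.0)
        seen_count = counts.get(cat, 0)
        # Prior-smoothed mean over preceding samples of the same category.
        encoded[idx] = (seen_sum + a * prior) / (seen_count + a)
        # Only now does the current sample become visible to later ones.
        sums[cat] = seen_sum + targets[idx]
        counts[cat] = seen_count + 1
    return encoded


encoded = ordered_target_statistic(["A", "A", "B"], [2.0, 4.0, 7.0], prior=0.0)
```

Because a sample never contributes to its own encoding, its target value cannot leak into its own feature, which is the motivation for processing samples in permutation order.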
3.1.3. Power generation data processing results
The processed power generation data are displayed below.
Because the power generation data in this study come from
the internal data of the enterprise, data sources are considered
preferable, there are no missing values and only a small number
of abnormal values are observed. So, only a small amount of
outlier data was deleted. The marked positions in the graph
indicate the missing part of the data (Figs. 3.6 and 3.7).
2.4. Evaluation criteria for algorithm performance
The authors established a machine learning model and provided an evaluation value to assess the model. To verify the
consistency between the predicted results and the actual load,
MAPE, RMSE, and R2 were introduced as evaluation indices (Li
et al., 2018).
MAPE is the most extensively used measure of predictive accuracy in enterprises and organisations (Massaoudi et al., 2021). It
is used to reflect the average degree of relative error. The smaller
the value of MAPE in the following equation, the smaller the error
will be.
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} |\delta| = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_i - y_i}{y_i} \right| \tag{23}$$
3.2. Results and discussion on factor correlation
The data from January 1, 2017 to September 30, 2019 were
considered as the training dataset, and the data from October
1, 2019 to March 31, 2020 as the test data. According to the
method in Section 2, the first layer CatBoost prediction model was
established. The order of feature importance is shown in Fig. 3.8.
Among the features, x25, x26, x27, x28, and x29, denote the
year, month, day of the corresponding data, the conditions of
holidays, and cold days, respectively. In the feature importance
diagram based on entropy, the cumulative values of real estate and residential investments are the most important external
variables, accounting for 7.23% and 6.77% of the economic data,
respectively. The maximum and average temperatures accounted
for 5.83% and 5.49% of the climate data, respectively. In the power
data, the daily and cumulative values of power generation were
the most important, and accounted for 7.20% and 6.56% of the
data, respectively.
RMSE is usually used in validation experiments on climate
predictions to penalise data items with large errors (Fu et al.,
2018). Because the errors are squared before averaging, RMSE is
sensitive to large deviations. Given that this parameter represents
prediction bias, the smaller the value of RMSE in the following
equation, the smaller the prediction bias will be.
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \varepsilon^2} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \tag{24}$$
3.3. Selection results of experimental factors
The number of selected relevant factors had a specific influence on the accuracy of load forecasting results. Herein, the number of factors were increased in load forecasting. R2 was regarded
as the criterion for selecting the number of factors. The higher
the value of R2 , the better the accuracy of the corresponding
predicted outcome is.
When 10 to 13 related factors are selected, the R2 value is
the largest, as shown in Fig. 3.9. Finally, 11 relevant factors
were selected. According to the degree of relevance, these factors
are: time (month) judgment condition, X26, time (day) judgment
condition, X27, cumulative increase in real estate investment, X6,
daily power generation, X17, accumulated residential investment
value, X8, daily cumulative power generation, X18, daily maximum temperature, X15, daily average temperature, X14, thermal
power generation to the current day cumulative value, X22, daily
minimum temperature, X16, and daily thermal power generation,
X21.
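The selection procedure can be sketched as a loop over the ranked factor list. The scoring callback is an assumption: in the study it would train the forecasting model on the given factors and return R², whereas here a toy function stands in for it. The names `best_factor_count` and `score_subset` are hypothetical, and ties are resolved in favour of the smaller factor count, which the source does not specify.

```python
def best_factor_count(ranked_factors, score_subset):
    """Try the top-1, top-2, ... factors and keep the count with the best score."""
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_factors) + 1):
        score = score_subset(ranked_factors[:k])
        if score > best_score:  # strict comparison keeps the first maximum
            best_k, best_score = k, score
    return best_k, best_score


# Toy stand-in for "train the model and return R2": peaks at three factors.
def toy_score(subset):
    return 1.0 - 0.1 * (len(subset) - 3) ** 2


factors = ["X26", "X27", "X6", "X17", "X8"]
k, score = best_factor_count(factors, toy_score)
```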
In the linear regression model, R² is considered the contribution rate of the explanatory variable with respect to the change
in the predictor variable. The closer it is to the value of one,
the better the regression will be (Ahmad and Zhang, 2020). The
calculation model is as follows,
$$\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2 \tag{25}$$
$$\mathrm{ESS} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 \tag{26}$$
$$R^2 = \mathrm{ESS} / \mathrm{TSS} \tag{27}$$
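The three evaluation indices can be written out directly from their definitions. The sketch below is illustrative and its function names are hypothetical. MAPE is returned as a fraction (multiply by 100 for a percentage), which appears consistent with the value of 0.0158 reported in the abstract. R² follows the ESS/TSS form given above, which need not coincide with the more common 1 − RSS/TSS form for models other than ordinary least squares.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction."""
    return sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root-mean-square error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2_ess_tss(actual, predicted):
    """Coefficient of determination as explained over total sum of squares."""
    mean = sum(actual) / len(actual)
    tss = sum((a - mean) ** 2 for a in actual)
    ess = sum((p - mean) ** 2 for p in predicted)
    return ess / tss


actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 300.0]
scores = (mape(actual, predicted), rmse(actual, predicted), r2_ess_tss(actual, predicted))
```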
3. Results and discussion
3.4. Experimental results and discussion
3.1. Data processing results
In this study, Python was used as a tool for all algorithmic
models. To build an accurate load-forecasting model, the parameters of the load-forecasting model need to be adjusted, but the
manual adjustment of the parameters is laborious. Therefore, the
Python random search module ‘Randomised-Search-CV’ was used
to quickly adjust various combination parameters. The method of
Randomised-Search-CV is not to try every single combination of
hyper parameters in details. When the number of random search
3.1.1. Economic data processing results
The results of economic data after processing are as follows:
Because of the availability of reliable economic data, only a
small amount of abnormal data cleaning and data filling were
carried out in this study. The marked positions in the graph
indicate the missing part of the data (see Figs. 3.1, 3.2 and 3.3).
Fig. 3.1. Results of price index data pre-processing.
Fig. 3.2. Results of increase rate data pre-processing.
Fig. 3.3. Results of real-estate investment and growth data pre-processing.
Fig. 3.4. Wind speed pre-processing results.
Fig. 3.5. Temperature data pre-processing results.
Fig. 3.6. Results of electrical energy generation data pre-processing.
is limited, the hyper parameters are sampled randomly, and it can
get close to the best set.
The model parameters of the algorithm were set as follows:
(1) CatBoost: learning_rate = 0.04, iterations = 700, depth = 6, l2_leaf_reg (reg_lambda) = 2.
(2) XGBoost: learning_rate = 0.02, n_estimators = 500, max_depth = 8, subsample = 0.56.
(3) AdaBoost: learning_rate = 0.08, n_estimators = 464.
In this study, default values were selected for other prediction
algorithm parameters. LinearSVR is a linear support vector
regression model implemented on top of liblinear. It has greater flexibility in the selection of penalty and
loss functions and can be easily extended to a large number of
samples (Liu and Zhang, 2022). Decision tree is a decision analysis
Fig. 3.7. Results of electrical generation growth data pre-processing.
Fig. 3.8. Feature importance diagram.
method based on the known probability of occurrence of various scenarios. It is an intuitive graphical method of probability
analysis for calculating the probability that the expected value
of net present value is greater than or equal to zero as well as
for evaluating project risk and assessing feasibility (Shi et al.,
2022). Multilayer perceptron (MLP) is a feedforward artificial
neural network model, which maps multiple input datasets to
a single output dataset (Lu and Yang, 2022). Random forest,
which essentially belongs to a major branch of machine learning,
ensemble learning, integrates many decision trees into a forest
to predict the final result (Yang et al., 2021). Gradient boosting
decision tree (GBDT) is an integrated algorithm based on decision tree. It is a widely used algorithm, which can be used
for classification and regression (Xia, 2022). Bagging algorithm,
a group learning algorithm in the field of machine learning, can
be combined with other classification and regression algorithms
to improve accuracy and stability while avoiding overfitting by
reducing the variance of the results (Huang et al., 2016). Extra
trees is an extreme random tree, which is also an integrated
machine learning algorithm (Zhang, 2020).
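The randomised hyperparameter search mentioned above can be illustrated with a minimal, library-free sketch. It is not the authors' implementation: the function names, the toy predictor `shrunken_mean_model`, and the `shrinkage` parameter are hypothetical and exist only to show the idea of sampling parameter sets at random and scoring each one by k-fold cross-validation.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def randomised_search_cv(fit_predict, param_space, X, y, n_iter=20, k=5, seed=0):
    """Try n_iter random parameter sets; return the one with the lowest mean CV RMSE."""
    rng = random.Random(seed)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, float("inf")
    for _ in range(n_iter):
        # Random sampling instead of an exhaustive grid over param_space.
        params = {name: rng.choice(values) for name, values in param_space.items()}
        fold_rmse = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            preds = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
                **params,
            )
            squared_errors = [(p - y[i]) ** 2 for p, i in zip(preds, test_idx)]
            fold_rmse.append(mean(squared_errors) ** 0.5)
        score = mean(fold_rmse)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def shrunken_mean_model(X_train, y_train, X_test, shrinkage):
    """Hypothetical toy model: predict shrinkage * mean(y_train) for every test row."""
    centre = shrinkage * mean(y_train)
    return [centre for _ in X_test]


# Made-up data for illustration only; not data from the study.
X_demo = list(range(20))
y_demo = [10.0] * 20
best_params, best_score = randomised_search_cv(
    shrunken_mean_model, {"shrinkage": [0.0, 1.0]}, X_demo, y_demo, n_iter=50
)
```

In practice a library routine such as scikit-learn's `RandomizedSearchCV` would probably replace the hand-written loop, with the search space covering parameters like `learning_rate`, `iterations`, and `depth`. Which tool was used in the study is not stated in this section.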
To verify the actual performance of the established load-forecasting algorithm, the test set was applied to obtain the
prediction results. The experimental results of each algorithm are
shown in Table 3.2 and Figs. 3.10, 3.11 and 3.12.
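For reference, the three evaluation metrics reported in this section can be sketched as follows. This assumes the standard textbook definitions, with MAPE expressed as a fraction and R2 computed as one minus the ratio of residual to total sum of squares. The function names and sample values are illustrative and are not taken from the study.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination, computed as 1 - residual SS / total SS."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up loads for illustration only; not data from the study.
actual = [100.0, 200.0, 300.0]
forecast = [110.0, 190.0, 300.0]
scores = (rmse(actual, forecast), mape(actual, forecast), r2_score(actual, forecast))
```

Lower RMSE and MAPE indicate a better fit, whereas a higher R2 does, which is why the comparison that follows treats the first two as errors and the third as a score.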
As shown in Fig. 3.10 and Table 3.2, the CatBoost model performs best for load forecasting, with MAPE = 0.0158,
RMSE = 274.2036, and R2 = 0.9250. Relative to CatBoost, the RMSE values of the
XGBoost, AdaBoost, LinearSVR, decision-tree, MLP, random-forest, GBDT, bagging, and extra-tree models (the row order of Table 3.2)
are higher by 17.45%, 72.36%, 100.40%, 44.30%, 125.18%,
9.23%, 43.98%, 20.65%, and 65.68%, respectively, and their MAPE
values are higher by 13.29%, 84.17%, 122.78%, 40.50%,
Fig. 3.9. Relationship between the number of characteristic variables and R2.
Fig. 3.10. Comparison of root-mean-square error values of the tested algorithms.
Fig. 3.11. Comparison of mean absolute percentage error values of the tested algorithms.
Fig. 3.12. Comparison of R2 values of the tested algorithms.
Table 3.1
Default parameters of tested algorithms.
Algorithm                              Default parameters
LinearSVR                              intercept_scaling: 1.0; max_iter: 1000; tol: 0.0001
DecisionTree (DT)                      min_samples_leaf: 1; min_samples_split: 2; min_weight_fraction_leaf: 0.0
Multilayer Perceptron (MLP)            alpha: 0.0001; max_fun: 15000; max_iter: 200
RandomForest (RF)                      min_samples_leaf: 1; min_samples_split: 2; min_weight_fraction_leaf: 0.0
Gradient Boosting Decision Tree (GBDT) alpha: 0.9; max_depth: 3; n_estimators: 100
Bagging                                max_features: 1.0; max_samples: 1.0; n_estimators: 10
ExtraTree (ET)                         min_samples_leaf: 1; min_samples_split: 2; min_weight_fraction_leaf: 0.0
155.06%, 9.49%, 37.97%, 18.98%, and 55.06%, respectively, and the
R2 of the CatBoost model is higher by 3.31%, 30.02%, 41.74%, 7.64%,
96.14%, 1.49%, 10.51%, 3.69%, and 13.83%, respectively.
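The percentages above can be reproduced, up to rounding in the last digit, from the values in Table 3.2. The sketch below assumes a convention inferred from the numbers rather than stated in the text: RMSE and MAPE gaps are divided by the CatBoost value, whereas the R2 gain is divided by the other model's R2. The dictionary keys and function name are illustrative.

```python
# (RMSE, MAPE, R2) per model, copied from Table 3.2.
RESULTS = {
    "CatBoost": (274.2036, 0.0158, 0.9250),
    "XGBoost": (322.0762, 0.0179, 0.8953),
    "AdaBoost": (472.6259, 0.0291, 0.7114),
    "LinearSVR": (549.5250, 0.0352, 0.6526),
    "DecisionTree": (395.6999, 0.0222, 0.8593),
    "MLP": (617.4581, 0.0403, 0.4716),
    "RandomForest": (299.5294, 0.0173, 0.9114),
    "GBDT": (394.8017, 0.0218, 0.8370),
    "Bagging": (330.8425, 0.0188, 0.8920),
    "ExtraTree": (454.3186, 0.0245, 0.8126),
}


def relative_gaps(results, reference="CatBoost"):
    """Return {model: (RMSE gap %, MAPE gap %, R2 gain %)} against the reference model."""
    ref_rmse, ref_mape, ref_r2 = results[reference]
    gaps = {}
    for name, (rmse, mape, r2) in results.items():
        if name == reference:
            continue
        gaps[name] = (
            100 * (rmse - ref_rmse) / ref_rmse,  # error gap relative to the reference
            100 * (mape - ref_mape) / ref_mape,  # error gap relative to the reference
            100 * (ref_r2 - r2) / r2,            # R2 gain relative to the other model
        )
    return gaps


gaps = relative_gaps(RESULTS)
# Models ranked by R2, highest first; the first three are the most accurate.
top_three = sorted(RESULTS, key=lambda model: RESULTS[model][2], reverse=True)[:3]
```

Several RMSE and MAPE gaps exceed 100%, so they are better read as how much larger the other models' errors are than CatBoost's, not as reductions in the usual sense.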
In this study, the three most accurate models, namely CatBoost, RF, and XGB, were selected for load forecasting, and their results were compared with the actual values. The load forecasting was for a province in northeast China, from July to December 2020. The load-forecasting results after standardisation are shown in Fig. 3.13. It is observed that the method proposed in this study is the closest to the actual values. As shown in Fig. 3.13, the CatBoost, RF, and XGB models predicted the power load trend accurately. The refrigeration power load decreased considerably around October 2020 as the climate cooled. The predicted trend of the three models is consistent with the actual values, verifying the sensitivity and effectiveness of the models. Among them, CatBoost had the highest fitting degree between the predicted and actual curves, and the model is more sensitive than the other two algorithms. When the daily load forecast values were closest to the actual values, CatBoost, RF, and XGB achieved MAPE values of 3.57%, 5.44%, and 7.92%, respectively. When the daily load forecast values deviated most from the actual values, CatBoost, RF, and XGB reached MAPE values of 14.43%, 17.57%, and 18.16%, respectively.
The medium- and long-term load forecasting model based on the CatBoost algorithm proposed herein reduced the overall forecast error considerably. The experimental results showed that CatBoost had the highest prediction accuracy and the best experimental effect compared with the other algorithms. The boosting model generally performs better when dealing with a variety of external variables, which is attributed to the tree structure of the model, and the ensemble method effectively suppresses overfitting in the regression task, that is, the load prediction task in this study. By contrast, some of the aforementioned studies can provide good model interpretability and high accuracy in the scheduling task for more meaningful guidance in related power planning.
Fig. 3.13. Relationship between predicted and actual values of the three algorithm models tested herein.
Table 3.2
Experimental results of tested algorithms.
Algorithm    Root-mean-square error    Mean absolute percentage error    R2
Cat          274.2036                  0.0158                            0.9250
XGB          322.0762                  0.0179                            0.8953
Ada          472.6259                  0.0291                            0.7114
LinearSVR    549.5250                  0.0352                            0.6526
DT           395.6999                  0.0222                            0.8593
MLP          617.4581                  0.0403                            0.4716
RF           299.5294                  0.0173                            0.9114
GBDT         394.8017                  0.0218                            0.8370
Bagging      330.8425                  0.0188                            0.8920
ET           454.3186                  0.0245                            0.8126
4. Conclusion
In this study, the double-layer CatBoost algorithm was used in medium- and long-term power load forecasting, subject to the influences of multi-dimensional characteristics. The dataset comprised internal variables based on time transformation and multi-dimensional external variables in meteorology, economy, and power generation. The normalised feature variable correlation graph, based on the node splitting and entropy observed in the feature importance tree, was used to rank the introduced feature variables by correlation degree, whereby the optimal number of feature variables was identified based on R2. The CatBoost algorithm was optimised to avoid feature standardisation. The purpose of this study was to achieve a better grey model than the time series itself and improve the accuracy of the model. In our next study, more advanced high-performance algorithms will be compared.
Abbreviations and Nomenclature
MAPE       Mean absolute percentage error
RMSE       Root-mean-squared error
TSS        Total sum-of-squares
ESS        Explained sum-of-squares
R2         Coefficient-of-determination
Zm         Normalisation factor
ŷi         Y estimate
Obj(t)     Objective function
x̂ik        X processing value
Gm(xi)     Predicted label
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared
to influence the work reported in this paper.
Role of the funding source
The authors acknowledge the funding of the scientific research
project by the State Grid Heilongjiang Co. LTD (funding code
522448190001).
References
Ahmad, T., Zhang, H., 2020. Novel deep supervised ML models with feature selection approach for large-scale utilities and buildings short and medium-term load requirement forecasts. Energy (Oxford) 209, 118477. http://dx.doi.org/10.1016/j.energy.2020.118477.
Barman, M., Behari, N., Choudhury, D., 2019. Season specific approach for short-term load forecasting based on hybrid FA-SVM and similarity concept. Energy (Oxford) 174, 886–896. http://dx.doi.org/10.1016/j.energy.2019.03.010.
Ertugrul, Ö.F., 2016. Forecasting electricity load by a novel recurrent extreme learning machines approach. Int. J. Electr. Power Energy Syst. 78, 429–435. http://dx.doi.org/10.1016/j.ijepes.2015.12.006.
Fu, X., Zeng, X., Feng, P., Cai, X., 2018. Clustering-based short-term load forecasting for residential electricity under the increasing-block pricing tariffs in China. Energy (Oxford) 165, 76–89. http://dx.doi.org/10.1016/j.energy.2018.09.156.
Gao, D., Gao, S., 2014. Summary of research on medium and long term power load forecasting. Sci. Technol. Innov. Guide 7 (25), http://dx.doi.org/10.3969/j.issn.1674-098X.2014.07.017.
Gu, J., 2004. Study on the model of mid-long term load forecasting for power system based on matter element. Proc. CSU-EPSA 16 (6), 68–71. http://dx.doi.org/10.1023/B:JOGO.0000006653.60256.f6.
Hou, B., 2020. Data analysis of communication system based on Python. Commun. Technol. 53 (7), 1715–1720. http://dx.doi.org/10.3969/j.issn.10020802.2020.07.023.
Huang, X., Li, W., Song, T., Wang, Y., 2016. Application of bagging-CART algorithm optimized by genetic algorithm in transformer fault diagnosis. High Volt. Eng. http://dx.doi.org/10.13336/j.1003-6520.hve.20160412052.
Ji, B., Wu, Z., 2018. Application of exponential smoothing method in power system load forecasting. Technol. Innov. Appl. 30, 173–174, CNKI:SUN:CXYY.0.2018-30-077.
Jiang, H., Zhang, Y., Muljadi, E., Zhang, J.J., Gao, D.W., 2018. A short-term and high-resolution distribution system load forecasting approach using support vector regression with hybrid parameters optimization. IEEE T. Smart Grid 9 (4), 3341–3350. http://dx.doi.org/10.1109/TSG.2016.2628061.
Khwaja, A.S., Anpalagan, A., Naeem, M., Venkatesh, B., 2020. Joint bagged-boosted artificial neural networks: Using ensemble machine learning to improve short-term electricity load forecasting. Electr. Pow. Syst. Res. 179, 106080. http://dx.doi.org/10.1016/j.epsr.2019.106080.
Ko, C., Lee, C., 2013. Short-term load forecasting using SVR (support vector regression)-based radial basis function neural network with dual extended Kalman filter. Energy 49, 413–422. http://dx.doi.org/10.1016/j.energy.2012.11.015.
Li, Y., Che, J., Yang, Y., 2018. Subsampled support vector regression ensemble for short term electric load forecasting. Energy (Oxford) 164, 160–170. http://dx.doi.org/10.1016/j.energy.2018.08.169.
Liang, Y., Niu, D., Hong, W., 2019. Short term load forecasting based on feature extraction and improved general regression neural network model. Energy (Oxford) 166, 653–663. http://dx.doi.org/10.1016/j.energy.2018.10.119.
Liu, Z., Liu, A., Li, Y., 2021. Medium term load forecasting model based on attention RESNET LSTM network. Chem. Autom. Instrum. 48 (6), 575–580, 1000-3932(2021)06-0575-07. http://qikan.cqvip.com/Qikan/Article/Detail?id=7106067568.
Liu, X., Teng, H., Gong, Y., Teng, D., 2019. Short-term load forecasting based on the improved Kalman filter algorithm. Electr. Meas. Instrum. 56 (3), 42–46. http://dx.doi.org/10.19753/j.issn1001-1390.2019.03.007.
Liu, M., Zhang, Q., 2022. Prediction of strip crown based on support vector machine and neural network. CAAI Trans. Intell. Syst. http://dx.doi.org/10.11992/tis.202101002.
Lu, H., Yang, S., 2022. Three-dimensional object detection algorithm based on deep neural networks for automatic driving. J. Beijing Univ. Technol. http://dx.doi.org/10.11936/bjutxb2021100027.
Luo, S., Ma, M., Jiang, L., Jin, B., Lin, Y., Diao, X., Li, C., Yang, B., 2020. Medium and long-term load forecasting method considering multi-time scale data. Proc. CSEE 40, 11–19. http://dx.doi.org/10.13334/j.0258-8013.pcsee.190550.
Malekizadeh, M., Karami, H., Karimi, M., Moshari, A., Sanjari, M.J., 2020. Short-term load forecast using ensemble neuro-fuzzy model. Energy (Oxford) 196, 117–127. http://dx.doi.org/10.1016/j.energy.2020.117127, 2020.
Mao, L., Jiang, Y., Long, R., Li, N., Huang, H., Huang, S., 2008. Medium- and long-term load forecasting based on partial least squares regression analysis. Power Syst. Technol. 32 (19), 71–77, CNKI:SUN:DWJS.0.2008-19-020.
Massaoudi, M., Refaat, S.S., Chihi, I., Trabelsi, M., Oueslati, F.S., Abu-Rub, H., 2021. A novel stacked generalization ensemble-based hybrid LGBM-XGB-MLP model for short-term load forecasting. Energy 214, 118874. http://dx.doi.org/10.1016/j.energy.2020.118874.
Sadaei, H.J., de Lima, P.C., Silva, E., Guimarães, F.G., Lee, M.H., 2019. Short-term load forecasting by using a combined method of convolutional neural networks and fuzzy time series. Energy (Oxford) 175, 365–377. http://dx.doi.org/10.1016/j.energy.2019.03.081.
Shi, H., Gao, T., Ding, M., Li, Z., Zhang, Z., Yan, J., 2022. Wind power multi-interval composite short-term prediction method based on trend clustering and decision tree. Acta Energiae Solaris Sinica http://dx.doi.org/10.19912/j.02540096.tynxb.2020-0734.
Sulandari, W., Lee, M.H., Rodrigues, P.C., 2020. Indonesian electricity load forecasting using singular spectrum analysis, fuzzy systems and neural networks. Energy (Oxford) 190, 116408. http://dx.doi.org/10.1016/j.energy.2019.116408.
Talaat, M., Taghreed, S., Mohamed, A.E., Hatata, A.Y., 2022. Integrated MFFNN-MVO approach for PV solar power forecasting considering thermal effects and environmental conditions. Electr. Power Energy Syst. 135, 107570. http://dx.doi.org/10.1016/j.ijepes.2021.107570, 2022.
Wang, Q., 2021. Analysis and prediction of medium and long term load characteristics of power system based on spatial auto regressive model. J. Northeast Electr. Power Univ. 41 (3), 118–123. http://dx.doi.org/10.19718/j.issn.10052992.2021-03-0118-06.
Wang, Z., Zhou, X., Tian, J., Huang, T., 2021. Hierarchical parameter optimization-based support vector regression for power load forecasting. Sustain. Cities Soc. 71, 102937. http://dx.doi.org/10.1016/j.scs.2021.102937.
Xia, B., 2022. Mechanical speed prediction method of air percussion rotary drilling based on GBDT algorithm. Manuf. Autom. 1009-0134(2022)03-018504, http://qikan.cqvip.com/Qikan/Article/Detail?id=7106847234.
Xiao, H., 2021. Review of Python technology in data visualization. Netw. Inf. Eng. 13, 87–89. http://dx.doi.org/10.16520/j.cnki.1000-8519.2021.13.029.
Yang, N., Li, H., Yuan, J., 2018. Medium- and long-term load forecasting method considering grey correlation degree analysis. Proc. CSU-EPSA 30 (6), 108–114. http://dx.doi.org/10.3969/j.issn.1003-8930.2018.06.017.
Yang, J., Luo, C., Zhang, S., 2020. Short-term load forecasting based on phase space reconstruction and SVR coupling model. Electr. Meas. Instrum. 57 (16), 96–100. http://dx.doi.org/10.19753/j.issn1001-1390.2020.16.017.
Yang, S., Wu, L., Liu, D., 2021. Cooling load prediction and characteristic analysis of terminal based on random forest. Buil. Energy Environ. 1003-0344(2021)12-001-6, http://qikan.cqvip.com/Qikan/Article/Detail?id=7106541478.
Yu, Y., Li, W., 2016. A hybrid short-term load forecasting method based on improved ensemble empirical mode decomposition and back propagation neural network. J. Zhejiang Univ. Sci. A 17 (2), 101–114. http://dx.doi.org/10.1631/jzus.A1500156.
Zhang, Y., 2020. Research on software defect prediction based on extra tree. Intell. Comput Appl. http://dx.doi.org/10.11907/rjdk.191625.
Zhang, Y., Ai, Q., Lin, L., Yuan, S., Li, Z., 2019. A very short-term load forecasting method based on deep LSTM RNN at zone level. Power Syst. Technol. 43 (06), 1884–1892. http://dx.doi.org/10.13335/j.1000-3673.pst.2018.2101.
Zhang, Z., Dou, C., Yue, D., Zhang, B., 2022. Predictive Voltage Hierarchical Controller Design for Islanded Microgrids under Limited Communication. IEEE, http://dx.doi.org/10.1109/TCSI.2021.3117048, 2022.
Zhang, Z., Mishra, Y., Dong, Y., Dou, C., Zhang, B., Tian, Y.C., 2020. Delay-Tolerant Predictive Power Compensation Control for Photovoltaic Voltage Regulation. IEEE, http://dx.doi.org/10.1109/TII.2020.3024069, 2020.
Zhao, W., Wang, F., 2021. Prediction model for medium and long term electric load based on improved grey theory. Northeast Electr. Power Technol. 32 (7), 325–331. http://dx.doi.org/10.3969/j.issn.1004-7913.2011.07.015.
Zhong, Q., Sun, W., Yu, N., Liu, C., Wang, F., Zhang, X., 2014. Load and power forecasting in active distribution network planning. Proc. CS 34 (19), 3050–3056. http://dx.doi.org/10.13334/j.0258-8013.pcsee.2014.19.002.
Wen Xiang (1989-), female, intermediate engineer, received the M.S. degree in Agricultural Electrification and Automation from Northeast Agricultural University, China, in 2015. She is now studying for a doctorate in the College of Electrical and Information, Northeast Agricultural University, mainly researching power grid planning. Meanwhile, she works at the Economic and Technological Research Institute of State Grid Heilongjiang Electric Power Co. LTD, where she is mainly responsible for investment and evaluation work.
Peng Xu (1996-), male, postgraduate, graduated from Northeast Agricultural University of China in 2018, majoring in electrical engineering and automation. He is now studying for a master’s degree in the School of Electrical Information of Northeast Agricultural University, mainly studying machine learning and power grid planning.
Junlong Fang (1971-), male, professor, doctor of engineering, doctoral supervisor, Dean of the School of Electrical and Information, Northeast Agricultural University, and reserve leader of agricultural electrification and automation, a provincial key discipline of Northeast Agricultural University. He is executive director of the Heilongjiang Electrical Engineering Society, director of the Heilongjiang Automation Society, and member of the agricultural power special committee of the Heilongjiang Agricultural Engineering Society. His research directions are power system automation, information processing, and intelligent measurement and control.
Zhenggang Gu (1979-), male, bachelor’s degree, master’s degree, graduated from Harbin University of Science and Technology. He works at the State Grid Heilongjiang Electric Power Co. LTD, mainly researching investment, evaluation, and power grid data analysis.
Qinghe Zhao (1995-), male, received the Bachelor of Engineering degree in electrical engineering from Northeast Agricultural University, China. Currently, he is working toward the PhD degree in agricultural electrification at NEAU, China. His major research focuses on the application of advanced algorithms in power systems and load forecasting. His research interests lie in the areas of GBDT-based machine learning and deep learning.
Zhang Qirui (1980-), male, master’s degree, graduated from Harbin Institute of Technology, majoring in electrical engineering and automation. His research directions are power grid investment management and power system and automation. He has participated in Heilongjiang province’s power grid investment interface research, power grid development diagnosis and analysis, power grid project post-evaluation, and power grid investment ability research, and is now in charge of power grid.