Data-driven Game-based Pricing for Sharing Rooftop Photovoltaic Generation and Energy Storage in the Residential Building Cluster under Uncertainties

Xu Xu, Member, IEEE, Yan Xu, Senior Member, IEEE, Ming-Hao Wang, Member, IEEE, Jiayong Li, Member, IEEE, Zhao Xu, Senior Member, IEEE, Songjian Chai, Yufei He, Student Member, IEEE

Abstract—In this paper, a novel machine learning based data-driven pricing method is proposed for sharing rooftop photovoltaic (PV) generation and energy storage (ES) in an electrically interconnected residential building cluster (RBC). The energy sharing process is modeled as a leader-followers Stackelberg game in which the owner of the rooftop PV system prices the self-generated PV energy and operates the ES devices, while the local electricity consumers in the RBC choose their energy consumption under the given internal electricity prices. To track the stochastic rooftop PV panel output, a long short-term memory (LSTM) network based rolling-horizon prediction function is developed to dynamically predict future trends of PV generation. Together with the system information, the predicted information is fed into a Q-learning based decision-making process to find near-optimal pricing strategies. Simulation results verify the effectiveness of the proposed approach in solving energy sharing problems with partial or uncertain information.

Index Terms—Pricing method, photovoltaic generation, energy storage, residential building cluster, energy sharing, Stackelberg game, long short-term memory network, Q-learning algorithm

This work is partially supported by the National Natural Science Foundation of China (Grant No. 71971183). The work of J. Li is supported by the National Natural Science Foundation of China (Grant No. 51907056). Y. Xu's work is supported by the Nanyang Assistant Professorship from Nanyang Technological University, Singapore. (Corresponding authors: Zhao Xu and Jiayong Li.)
X. Xu is with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong Special Administrative Region, China (email: benxx.xu@connect.polyu.hk).
Y. Xu is with the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore (email: xuyan@ntu.edu.sg).
M.-H. Wang is with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong Special Administrative Region, China (email: minghao.wang@polyu.edu.hk).
Z. Xu is with both the Shenzhen Research Institute and the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong Special Administrative Region, China (email: eezhaoxu@polyu.edu.hk).
J. Li is with the College of Electrical and Information Engineering, Hunan University, Changsha, China (email: j-y.li@connect.polyu.hk).
S. J. Chai and Y. F. He are both with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong Special Administrative Region, China (emails: chaisongjian@gmail.com; daniel.v.he@connect.polyu.hk).

I. INTRODUCTION
In recent years, rooftop photovoltaic (PV) systems have been widely deployed in residential buildings [1], providing a clean energy supply during the daytime. However, for a residential building cluster (RBC) comprising electrically interconnected buildings (see Fig. 1), the management of PV energy sharing is a critical concern. The concept of energy sharing has been widely used in power systems and is well studied in existing papers such as Refs. [2]-[5]. Many research efforts have also been devoted to energy sharing management among end-users. Conventional energy sharing methods are based on optimization algorithms. In Ref. [6], based on Lyapunov optimization, an online energy sharing framework is presented to enhance the self-sufficiency and PV consumption of nano-grid clusters. Ref. [7] proposes a two-stage robust energy sharing approach for a prosumer microgrid with renewable energy integration, storage units, and load shifting. Ref. [8] employs a heuristic algorithm to establish a day-ahead energy management method integrating home appliance scheduling and energy sharing among smart houses. In Ref. [9], a peer-to-peer energy sharing strategy with distributed transactions is developed for an energy building cluster including different types of energy buildings. Ref. [10] develops a game theory based energy sharing management method for the microgrid as well as a billing mechanism based on PV energy and load consumption. In Ref. [11], an online optimization based algorithm is proposed for cost-aware energy sharing among electricity consumers in a cooperative community. Ref. [12] presents a hybrid energy sharing management framework to facilitate heat and PV energy sharing among smart buildings. However, these existing works have several deficiencies: (i) uncertain renewable generation is not well considered during the energy sharing process; (ii) the multiple electricity consumers living in the RBC have different living behaviors, which makes it difficult to reach an agreement on PV energy allocation; and (iii) conflicts of interest between the rooftop PV system owner and the local electricity consumers need to be addressed properly.
The existing optimization methods can be classified as model-based methods that rely on an accurate mathematical formulation of the energy sharing process. However, the energy sharing problem usually involves unknown or uncertain information in practice, so iterative solution algorithms are generally adopted. This poses two potential challenges: (i) certain assumptions and simplifications are required to ensure the convergence of some iterative algorithms; and (ii) an iterative algorithm may be impractical in the real world due to possible non-convergence issues. By comparison, as a model-free, adaptive, and concise machine
learning technique [13], reinforcement learning exhibits excellent performance in decision-making processes. Reinforcement learning algorithms have been widely employed to model power system operation problems, such as multi-microgrid energy management [14], voltage control [15], electric vehicle charging [16], and dynamic economic dispatch [17]. However, the application of reinforcement learning to energy sharing management is still at an early stage. In this regard, this paper proposes a fully data-driven method based on a deep neural network and a reinforcement learning algorithm for making game-theoretic dynamic pricing strategies to optimally share rooftop PV energy with the electricity consumers in an RBC. The main contributions of this paper can be summarized as follows:
1) The proposed dynamic data-driven game-based pricing decision-making process is described as a Markov Decision Process (MDP), which can be well addressed by the Q-learning algorithm. Compared with conventional optimization methods, the proposed method can be flexibly and easily applied through off-line training and on-line implementation, with no requirement for initial knowledge. Besides, the computation efficiency is substantially improved.
2) The long short-term memory (LSTM) network is duly integrated into the proposed pricing framework to capture future trends of rooftop PV generation with a rolling time window. The predicted information is fed into the reinforcement learning based decision-making process to help the Q-learning agent find near-optimal pricing strategies.
3) To express the preferences of local consumers regarding environmental awareness, the concept of willingness-to-pay (WTP) is introduced. The original complex game-based energy pricing optimization model is thereby transformed into an efficient discriminatory auction, in which near-optimal pricing strategies can be quickly determined by the rooftop PV system owner using the proposed pricing method.
The rest of this paper is organized as follows. Section II models the energy sharing in the RBC. Section III describes the decision-making process of the pricing strategy, including the LSTM network, the MDP formulation, and the Q-learning process. Numerical results are given in Section IV. Finally, we conclude this paper in Section V.

II. PROBLEM MODELING

Fig. 1 depicts the structure of the energy sharing in an RBC. As shown in this figure, the rooftop PV system comprises two kinds of devices, i.e., PV panels and energy storage (ES) devices. The rooftop PV system owner is the energy sharing executor responsible for the interoperability among the various components in Fig. 1. The owner is in charge of providing self-generated PV energy to all electricity consumers in the RBC and of operating the local ES devices. Moreover, the owner is responsible for guaranteeing the maximum utilization of local PV generation within the RBC. It is assumed that smart meters are installed in the RBC to gather system data and to receive instructions or information from the rooftop PV system owner.

Fig. 1. Structure of the energy sharing in an RBC.
A. Profit Model of Rooftop PV System Owner

In this paper, we assume that the rooftop PV system owner is an external company, characterized by a financial objective only, i.e., maximizing the revenue from sharing self-generated PV energy and operating the local ES devices. It is assumed that the ES is charged by rooftop PV generation only, so as to properly track the dispatch of local energy. The investment and operation costs of the rooftop PV system are omitted during the energy sharing process. Since the PV power output varies with solar intensity and ambient temperature [19], maximum power point tracking (MPPT) control [18] is usually applied to the PV panels to maximize the PV generation. The actual values of the PV panel output are $[\tilde{P}_h^{PV}, \tilde{P}_{h+1}^{PV}, \dots, \tilde{P}_H^{PV}]$, where $\mathcal{H} = \{h, h+1, \dots, H\}$ denotes the time slot set. At each hour, the rooftop PV system owner acts as the leader that sets the uniform price for local PV generation, so the hourly profit $Rev_h^O$ of the owner can be defined as follows:

$$Rev_h^O = \sum_{i \in N^C} \lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) + \lambda^{FiT} \left( P_h^{PVgrid} + P_h^{ESgrid} \right) - \lambda_h^{TOU} \left[ \sum_{i} \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) - P_h^{PV} \right]^+ \quad (1)$$

In Eq. (1), the first term represents the profit of selling PV energy $P_{ih}^{PVuser}$ and electricity in the ES $P_{ih}^{ESuser}$ to the local electricity consumers at a uniform price $\lambda_h^U$, and the second term denotes the profit of selling PV energy $P_h^{PVgrid}$ and electricity in the ES $P_h^{ESgrid}$ to the utility grid at the feed-in tariff rate $\lambda^{FiT}$. $N^C$ is the set of electricity consumers in the RBC. The third term describes the compensation cost for the mismatch between the energy sold to the electricity consumers and the actual PV generation. This mismatch cost is caused by prediction errors, since the rooftop PV system owner makes the pricing strategy based on predicted information, which cannot be perfectly accurate. $[\cdot]^+$ represents the projection operator onto the non-negative orthant, i.e., $[x]^+ = \max(x, 0)$.
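For illustration, the hourly profit in Eq. (1) can be evaluated in a few lines of code. The following minimal Python sketch assumes the per-consumer energy quantities are held in NumPy arrays; the function and variable names are ours, not part of the paper's notation.

```python
import numpy as np

def hourly_revenue(lam_U, lam_FiT, lam_TOU, pv_user, es_user,
                   pv_grid, es_grid, pv_actual):
    """Hourly profit Rev_h^O of the rooftop PV system owner, Eq. (1).

    pv_user, es_user: arrays over consumers i in N^C (energy sold locally);
    pv_grid, es_grid: energy fed into the utility grid;
    pv_actual: realized PV output P_h^PV.
    """
    sold_local = np.sum(pv_user + es_user)        # sum over consumers i
    revenue = lam_U * sold_local                  # internal sales at the uniform price
    revenue += lam_FiT * (pv_grid + es_grid)      # feed-in tariff sales
    shortfall = max(sold_local - pv_actual, 0.0)  # [.]^+ = max(x, 0)
    revenue -= lam_TOU * shortfall                # compensation for prediction mismatch
    return revenue
```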
B. Utility Cost of Electricity Consumers

The electricity consumers in the RBC are followers who decide whether to purchase electricity from the rooftop PV system owner or from the utility grid according to the given price signals. The utility cost $U_{ih}^C$ of electricity consumer $i \in N^C$ is given as follows:

$$U_{ih}^C = \lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) + \lambda_h^{TOU} P_{ih}^G + w_i^E \lambda^E P_{ih}^G \quad (2)$$

where the first and second terms denote the cost of purchasing electricity from the rooftop PV system owner ($P_{ih}^{PVuser}$, $P_{ih}^{ESuser}$) and from the utility grid ($P_{ih}^G$), respectively. The third term describes the greenhouse gas emission cost with coefficient $\lambda^E$. Specifically, the weight factor $w_i^E \in [0,1]$ is introduced to reflect the environmental awareness of electricity consumer $i$. In practice, $w_i^E$ can be adjusted according to the preferences of the electricity consumers on a case-by-case basis.

Note that the demand $P_{ih}^D$ of electricity consumer $i$ can be satisfied by $P_{ih}^{PVuser}$, $P_{ih}^{ESuser}$, and $P_{ih}^G$, i.e., $P_{ih}^D = P_{ih}^{PVuser} + P_{ih}^{ESuser} + P_{ih}^G$. Hence, the WTP $\lambda_{ih}^{WTP}$ of electricity consumer $i$ for the local energy can be derived by substituting $P_{ih}^D - P_{ih}^{PVuser} - P_{ih}^{ESuser}$ for $P_{ih}^G$ in (2):

$$\lambda_{ih}^{WTP} = \lambda_h^{TOU} + w_i^E \lambda^E \quad (3)$$

C. Stackelberg Game based PV Energy Sharing

In this subsection, a one-leader, N-follower Stackelberg game [20] is employed to formulate the PV energy sharing model. The basic idea of this game is that the leader acts first, and the followers then observe the leader's action and make their own decisions accordingly. Specifically, as the leader in this game, the owner of the rooftop PV system (including the rooftop PV panels and ES devices) sets the internal price $\lambda_h^U$ at which the local PV energy is sold to the local building users. The local PV energy can also be sold to the utility grid at the FiT rate $\lambda^{FiT}$. The goal of the leader is to maximize its daily revenue by pricing and selling the local PV energy. Meanwhile, the building users act as the followers in this game: they choose to buy the local PV energy at the internal price $\lambda_h^U$ and/or electricity from the utility grid at the TOU price $\lambda_h^{TOU}$. The goal of the followers is to minimize their daily electricity bills by choosing their energy consumption, i.e., from the local provider and/or the utility grid. In this regard, the Stackelberg game $G$ for this problem can be described as follows:

$$G = \left\{ (Owner \cup N^C);\ \{\lambda_h^U\}, \{P_h^{ESgrid}\}, \{P_h^{ESin}\};\ \{P_{ih}^{PVuser}\}, \{P_{ih}^{ESuser}\}, \{P_{ih}^G\};\ \{Rev_h^O\}, \{U_{ih}^C\} \right\}$$

where $(Owner \cup N^C)$ denotes the player set, in which the rooftop PV system owner acts as the game leader and the building consumers take the roles of game followers in response to the strategy of the leader; $\{\lambda_h^U\}$, $\{P_h^{ESgrid}\}$, and $\{P_h^{ESin}\}$ are the strategy sets of the game leader; $\{P_{ih}^{PVuser}\}$, $\{P_{ih}^{ESuser}\}$, and $\{P_{ih}^G\}$ are the strategy sets of the game followers; and $\{Rev_h^O\}$ and $\{U_{ih}^C\}$ are the profit (1) of the leader and the utility cost (2) of the followers, respectively. Thus, the bi-level energy sharing model is formulated as:

$$\max_{\{\lambda_h^U,\, P_h^{ESgrid},\, P_h^{ESin},\, P_{ih}^{PVuser},\, P_{ih}^{ESuser},\, P_{ih}^G\}} Rev^O = \sum_{h \in \mathcal{H}} \left\{ \sum_{i \in N^C} \lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) + \lambda^{FiT} \left( P_h^{PVgrid} + P_h^{ESgrid} \right) - \lambda_h^{TOU} \left[ \sum_i \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) - P_h^{PV} \right]^+ \right\} \quad (4)$$

s.t.

$$P_h^{PV} = \sum_{i \in N^C} P_{ih}^{PVuser} + P_h^{PVgrid} + P_h^{ESin} \quad (5)$$

$$P_h^{ESsoc} = P_{h-1}^{ESsoc} + \eta^{ESin} P_h^{ESin} - \eta^{ESout} \left( \sum_{i \in N^C} P_{ih}^{ESuser} + P_h^{ESgrid} \right) \quad (6)$$

$$P_{h=h_1}^{ESsoc} = P^{ESinit} \quad (7)$$

$$\sum_{i \in N^C} P_{ih}^{ESuser} + P_h^{ESgrid} \le \bar{P}^{ES} \quad (8)$$

$$0 \le P_h^{ESin} \le \bar{P}^{ES} \quad (9)$$

$$0 \le P_h^{ESsoc} \le \bar{P}^{ESsoc} \quad (10)$$

$$\lambda_h^U,\ P_h^{PVgrid},\ P_h^{ESgrid},\ P_{ih}^{PVuser},\ P_{ih}^{ESuser} \in \mathbb{R}^+ \quad (11)$$

$$\max_{\{P_{ih}^{PVuser},\, P_{ih}^{ESuser},\, P_{ih}^G\}} \; -U_i^C = -\sum_{h \in \mathcal{H}} \left\{ \lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) + \lambda_{ih}^{WTP} P_{ih}^G \right\} \quad (12)$$

s.t.

$$P_{ih}^{PVuser} + P_{ih}^{ESuser} + P_{ih}^G = P_{ih}^D \;:\; \mu_{ih}^D \quad (13)$$

$$P_{ih}^{PVuser},\ P_{ih}^{ESuser},\ P_{ih}^G \in \mathbb{R}^+ \;:\; \mu_{ih}^{PVuser},\ \mu_{ih}^{ESuser},\ \mu_{ih}^G \quad (14)$$

where the upper-level model (4)-(11) maximizes the profit of the PV system owner in the RBC; the objective function (4) is the daily revenue of the PV system owner.
Eq. (5) describes the dispatch of the PV energy, which can be sold to the local electricity consumers ($P_{ih}^{PVuser}$), fed into the utility grid ($P_h^{PVgrid}$), or stored in the ES ($P_h^{ESin}$). Constraints (6)-(10) give the operating limits of the ES devices, and (11) ensures that the upper-level variables are non-negative. The lower-level model (12)-(14) minimizes the electricity cost of the local electricity consumers: (13) balances the supply and demand of each electricity consumer, with dual variable $\mu_{ih}^D$, and (14) imposes non-negativity on the lower-level variables, with dual variables $\mu_{ih}^{PVuser}$, $\mu_{ih}^{ESuser}$, and $\mu_{ih}^G$.
The difficulty in solving the proposed bi-level energy sharing problem (4)-(14) is that it is a nonlinear and nonconvex problem. Conventionally, the Karush-Kuhn-Tucker (KKT) conditions can be employed to transform the original model into a Mathematical Program with Equilibrium Constraints (MPEC) model (see Appendix A, Appendix B, and Appendix C), which can be directly solved by commercial solvers, e.g., CPLEX [21] and Gurobi [22]. However, a large number of mixed-integer variables are involved in the KKT conditions, resulting in a heavy computation burden. Moreover, conventional optimization methods are not practical, since they rely on the assumption of a perfect prediction of the PV panel output. The optimization based pricing strategy is also not sufficiently reasonable, since the rooftop PV system owner then focuses only on the current profit and overlooks future rewards. In the following section, we therefore propose a novel pricing method based on a dynamic uncertainty prediction model and a model-free reinforcement learning method, which can easily find near-optimal pricing strategies.

III. PROPOSED DATA-DRIVEN PRICING STRATEGY

A. Mapping the Energy Sharing Model to a Discriminatory Auction

As studied in Ref. [10], the Stackelberg Equilibrium (SE) of a Stackelberg game is reached as long as all participants obtain their optimal solutions. Thus, our proposed bi-level energy sharing framework reaches the SE once the PV system owner (leader) finds the optimal pricing strategy for selling the self-generated PV energy and, meanwhile, all local consumers (followers) determine their electricity consumption, i.e., from the rooftop PV system and the utility grid. It is assumed that the load information of the electricity consumers in the RBC can be utilized by the PV system owner, since advanced non-intrusive load monitoring devices [23] can be installed in the residential buildings for long-term observation. To maximize the profit of the PV system owner, the self-generated PV energy is dispatched in descending order of the WTP values. Therefore, the optimal uniform price of PV energy equals one of the WTP values offered by the consumers; in other words, the uniform price (WTP) that brings the highest revenue to the PV system owner is returned, as the single-hour sketch below illustrates.
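A minimal Python sketch of this price search is given below for one hour, assuming known demands and WTP values; the names are illustrative, and the sketch ignores the inter-hour coupling introduced by the ES.

```python
def best_uniform_price(wtp, demand, pv_available):
    """Pick the uniform price among the consumers' WTP values that maximizes
    the owner's revenue for a single hour (Section III-A).

    wtp[i]: willingness-to-pay of consumer i, Eq. (3);
    demand[i]: demand of consumer i; pv_available: local energy on offer.
    """
    best_price, best_revenue = None, float("-inf")
    for price in sorted(set(wtp), reverse=True):   # candidate uniform prices
        supply, revenue = pv_available, 0.0
        # Consumers with WTP >= price buy locally, served in descending WTP order
        for i in sorted(range(len(wtp)), key=lambda j: -wtp[j]):
            if wtp[i] < price or supply <= 0:
                continue
            q = min(demand[i], supply)
            revenue += price * q
            supply -= q
        if revenue > best_revenue:
            best_price, best_revenue = price, revenue
    return best_price, best_revenue
```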
In this way, the original complex bi-level PV energy sharing problem is recast as an efficient discriminatory auction for the local energy (i.e., rooftop PV generation and the electricity stored in the ES). According to Eq. (3), both the electricity price $\lambda_h^{TOU}$ and the greenhouse gas emission cost coefficient $\lambda^E$ are known, so the WTP value is mainly determined by the weight factor $w_i^E$; the weight factor $w_i^E$ is therefore the quantity to be selected when making the pricing strategy.

B. LSTM Network for Dynamic PV Generation Prediction

Considering the uncertain PV generation, a prediction function should be added to the rooftop PV system to facilitate the pricing decision-making process. In this subsection, an LSTM-based sequence-to-sequence model is formulated to predict the future rooftop PV panel output. This model includes three parts, the encoder, the encoder vector, and the decoder, and maps an input sequence to an output sequence whose lengths may differ. The LSTM network is a variant of the standard recurrent neural network (RNN) [24]. By substituting LSTM units for the basic hidden neurons in the RNN, the LSTM network can deal with the gradient vanishing and explosion issues of long-term dependencies [25]. As shown in Fig. 2, the LSTM unit includes three kinds of gate controllers, i.e., the input gate, the forget gate, and the output gate, which mainly determine what information should be remembered. These three gates are calculated as follows:

$$i_t = \sigma \left( W_{ix} x_t + W_{ih} h_{t-1} + b_i \right) \quad (15)$$

$$f_t = \sigma \left( W_{fx} x_t + W_{fh} h_{t-1} + b_f \right) \quad (16)$$

$$o_t = \sigma \left( W_{ox} x_t + W_{oh} h_{t-1} + b_o \right) \quad (17)$$

where $\sigma$ represents the sigmoid function, whose output lies in $[0,1]$ and describes how much information is let through. $W_{ix}$, $W_{ih}$, $W_{fx}$, $W_{fh}$, $W_{ox}$, and $W_{oh}$ denote the weight matrices of the input, forget, and output gates, and $b_i$, $b_f$, and $b_o$ are the corresponding bias vectors. It should be noted that temporal memory is implemented in the LSTM network by switching the different gates to prevent gradient vanishing. The external inputs of the LSTM unit are the previous cell state $C_{t-1}$, the previous hidden state $h_{t-1}$, and the current input vector $x_t$. An intermediate state $\tilde{C}_t$ is then generated as

$$\tilde{C}_t = \tanh \left( W_{cx} x_t + W_{ch} h_{t-1} + b_c \right) \quad (18)$$

Accordingly, the memory cell and the hidden state of the LSTM are updated as

$$C_t = f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t \quad (19)$$

$$h_t = o_t \otimes \tanh(C_t) \quad (20)$$

where $\tanh$ is the nonlinear activation function and the operator $\otimes$ denotes the pointwise multiplication of two vectors. In this work, historical PV generation data are collected and fed into the proposed encoder-decoder sequence-to-sequence model, where the LSTM network is used as the training algorithm. As the output of this prediction model, $y_t, y_{t+1}, \dots, y_{t+12}$ denotes the predicted future 12-hour PV generation. This predicted information is fed into the Q-learning process to make pricing strategies in a rolling-window manner.
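The gate and state updates (15)-(20) amount to a few matrix operations per time step. A minimal NumPy sketch of one LSTM unit update follows; the weight-dictionary layout is our own convention, and the parameters are assumed to be already trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM unit update following Eqs. (15)-(20).

    W and b hold the gate parameters, e.g. W['ix'], W['ih'], b['i'] for the
    input gate; all shapes are illustrative.
    """
    i_t = sigmoid(W['ix'] @ x_t + W['ih'] @ h_prev + b['i'])      # input gate, Eq. (15)
    f_t = sigmoid(W['fx'] @ x_t + W['fh'] @ h_prev + b['f'])      # forget gate, Eq. (16)
    o_t = sigmoid(W['ox'] @ x_t + W['oh'] @ h_prev + b['o'])      # output gate, Eq. (17)
    c_tilde = np.tanh(W['cx'] @ x_t + W['ch'] @ h_prev + b['c'])  # intermediate state, Eq. (18)
    c_t = f_t * c_prev + i_t * c_tilde                            # memory cell, Eq. (19)
    h_t = o_t * np.tanh(c_t)                                      # hidden state, Eq. (20)
    return h_t, c_t
```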
C. MDP Formulation

As described in Section III-A, the original complex bi-level PV energy sharing problem is formulated as an efficient discriminatory auction for rooftop PV energy, where the weight factor $w_i^E$ needs to be selected to make the pricing strategies. This pricing problem can be formulated as a finite MDP [26], in which the outcomes are partly controlled by the decision-maker (the rooftop PV system owner) and partly random. Under the Q-learning framework [27], the MDP is formulated as follows.

Fig. 2. Schematic of the proposed LSTM network and Q-learning based data-driven pricing method.

1) State Set $S_h$: The state $s_h \in S$ at hour $h$ includes the following information: the current TOU electricity price $\lambda_h^{TOU}$, the feed-in tariff rate $\lambda^{FiT}$, the current state of charge of the ES $P_h^{ESsoc}$, and the predicted future trend of the rooftop PV panel output $[P_h^{PV}, P_{h+1}^{PV}, \dots, P_H^{PV}]$.
2) Action Set $A$: As described in Section III-A, the action $a_h \in A$ for the current state $s_h$ represents the weight factor $w_i^E$.
3) Reward $r_h$: In this paper, the reward $r_h$ is the cumulative profit of the rooftop PV system owner from participating in the energy sharing from $h$ to $H$, as described by Eq. (4).
4) Action-value Function $Q^\pi(s, a)$: The cumulative reward is used as the action-value function to evaluate the quality of state-action pairs:

$$Q^\pi(s, a) = \mathbb{E}_\pi \left[ \sum_{k=0}^{K} \gamma^k \, r_{h+k+1} \,\middle|\, s_h = s,\ a_h = a \right] \quad (21)$$

where $k \in \{0, 1, \dots, K\}$ denotes the time step and $\pi$ represents the policy, which maps a state to an action. Note that $\gamma \in [0,1]$ is the discount rate indicating the relative importance of future rewards against the current reward. The primary goal of the proposed pricing problem is to maximize the action-value function by finding the optimal policy $\pi^*$, i.e., a sequence of optimal actions (weight factors $w_i^E$):

$$Q^*(s, a) = \max_\pi Q^\pi(s, a) \quad (22)$$

The Q-learning algorithm iteratively updates the action-value function via the Bellman equation [28]:

$$Q^{\pi^*}(s_h, a_h) = r(s_h, a_h) + \gamma \max_{a_{h+1}} Q(s_{h+1}, a_{h+1}) \quad (23)$$

The Q-value is then updated as

$$Q(s_h, a_h) \leftarrow (1 - \theta)\, Q(s_h, a_h) + \theta\, Q^{\pi^*}(s_h, a_h) \quad (24)$$

where $\theta \in [0,1]$ denotes the learning rate, indicating to what extent the new Q-value overrides the old one.

TABLE I
STATE SET, ACTION SET, AND REWARD FUNCTION FOR EACH HOUR

State set $S_h$:        $\{\lambda^{FiT},\ \lambda_h^{TOU},\ [P_h^{PV}, P_{h+1}^{PV}, \dots, P_H^{PV}],\ P_h^{ESsoc}\}$
Action set $A$:         $\{w_1^E, w_2^E, \dots, w_{N_E}^E\}$
Reward function $r_h$:  Eq. (4)
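Eqs. (23) and (24) together define a single tabular update. A minimal Python sketch, with the Q-function held in a dictionary (the names are illustrative):

```python
def q_update(Q, s, a, reward, s_next, actions, gamma=0.9, theta=0.1):
    """One tabular Q-learning update combining Eqs. (23) and (24).

    Q: dict mapping (state, action) -> value; actions: action set A
    (candidate weight factors w^E); gamma: discount rate; theta: learning rate.
    """
    # Bellman target, Eq. (23): current reward plus the discounted best next value
    target = reward + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
    # Soft update, Eq. (24): blend the old estimate with the new target
    Q[(s, a)] = (1 - theta) * Q.get((s, a), 0.0) + theta * target
```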
D. Q-learning Algorithm based Solution Method

Algorithm 1: Proposed Dynamic Pricing Method
1. Repeat for each hour $h$:
   PV panel output prediction
2.   Collect the PV panel output data
3.   Feed the collected data into the trained LSTM network to predict the future trend of the PV panel output
   Q-learning algorithm based decision-making process
4.   Input the action set $A$
5.   Initialize the state set $S_h$
6.   Initialize the Q-value $Q(s_h, a_h)$ arbitrarily
7.   Repeat for each episode:
8.     Repeat for each state $s_h$:
9.       Update the state set $S_h \leftarrow \{\lambda^{FiT}, \lambda_h^{TOU}, [P_h^{PV}, P_{h+1}^{PV}, \dots, P_H^{PV}], P_h^{ESsoc}\}$
10.      Choose an action $a_h$ from the current action set $A$
11.      Calculate the current reward $r_h(s_h, a_h)$
12.      Update the Q-value $Q(s_h, a_h)$
13.    Until $s_{h+1} = s_H$
14.  Until the maximum episode is reached
15.  Output the optimal policy $\pi^* = \arg\max_\pi Q$, i.e., $\{a_h^*, a_{h+1}^*, \dots, a_H^*\}$
16.  Execute the optimal action $a_h^*$ for the current hour $h$
17. Until $h = H$

Algorithm 1 describes the implementation of the proposed Q-learning based solution method for the formulated MDP pricing problem. In each hour, the proposed LSTM network based PV generation prediction function runs to output the future PV generation. These predicted values are then fed into the Q-learning process for making the optimal pricing strategy. Specifically, in each episode, an action is selected for the current state following the $\epsilon$-greedy policy ($\epsilon \in [0,1]$) [29]: the Q-learning agent either executes a random action from the set of available actions, with probability $\epsilon$, or selects an action whose current Q-value is maximal, with probability $1 - \epsilon$. After choosing an action, the current reward is calculated via Eq. (4) and the Q-value is updated via Eq. (24). At the end of each episode, the termination criterion is checked; if it is not satisfied, the agent moves to the next episode and repeats the above process. Finally, the agent obtains optimal actions for each coming hour. Note that only the optimal action for the current hour is executed, since the pricing strategy is updated every hour. The above procedure is repeated until the end hour, i.e., $h = H$. The flowchart of the proposed Q-learning algorithm based decision-making process is depicted in Fig. 3, and a compact sketch of the training loop follows below.

Fig. 3. The proposed Q-learning algorithm based decision-making process.
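The following Python sketch compacts the off-line training loop of Algorithm 1, reusing q_update from the sketch in Section III-C. The object `env` stands in for an assumed simulator of the hourly auction that evaluates the reward via Eq. (4); its reset()/step() interface is illustrative, not part of the paper.

```python
import random

def choose_action(Q, s, actions, eps=0.1):
    """epsilon-greedy policy used in Algorithm 1 [29]."""
    if random.random() < eps:                  # explore with probability eps
        return random.choice(actions)
    # exploit: pick the action with the highest current Q-value
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def train(env, actions, episodes=50000, gamma=0.9, theta=0.1, eps=0.1):
    """Off-line training loop sketched after Algorithm 1 (steps 7-14)."""
    Q = {}
    for _ in range(episodes):
        s = env.reset()                        # initial state set S_h
        done = False
        while not done:                        # repeat for each state s_h
            a = choose_action(Q, s, actions, eps)
            s_next, reward, done = env.step(a) # reward r_h evaluated via Eq. (4)
            q_update(Q, s, a, reward, s_next, actions, gamma, theta)
            s = s_next
    return Q
```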
IV. NUMERICAL RESULTS

A. Test Case Setup

Fig. 4. TOU prices in the summer of 2019. (Source: Alectra Utilities)

In this paper, we consider a test case with six apartment buildings, each containing 60 households. It is assumed that each apartment building has a 100 m² roof area and that the installed rooftop PV capacity is 16.6 kWp. The capacity of the ES devices within the rooftop PV system is 10 kVA. The TOU price data are collected from Alectra Utilities (see Fig. 4). The daily individual load data published by the National Renewable Energy Laboratory (NREL) [30] are used in the case study. It should be noted that, for simplicity, we randomly select 360 fractions in the range $[0,1]$ to represent the weight factors describing the environmental awareness of all electricity consumers in the RBC: a value of zero means weak environmental awareness, while a value of one means strong environmental awareness. In real-world scenarios, these weight factors can be obtained in several ways, such as a questionnaire survey or non-intrusive long-term observation of individual load consumption. For the Q-learning based decision-making process, the discount rate $\gamma$ is set to 0.9, so that the obtained pricing strategy is foresighted enough to avoid future risks. All simulations are implemented in MATLAB on an Intel Core i7 at 2.4 GHz with 8 GB of memory.

B. Performance of the LSTM Network based Prediction Function

TABLE II
SUMMARY OF TRAINING SETTINGS OF THE LSTM NETWORK

Network   Hyperparameter            Value/Function
Encoder   Encoder length            36
          Layers                    1
          Hidden states             200
          Kernel regularizer        0.001
          Activation function       ReLU [32]
Decoder   Decoder length            12
          Layers                    1
          Hidden states             200
          Activation function       ReLU
          Kernel regularizer        0.001
Others    MLP layers                1
          MLP activation function   Tanh [33]
          Epochs                    100
          Batch size                64
          Loss function             Mean squared error [34]
          Optimizer                 Nadam [35]

The PV dataset used for network training is collected from the Global Energy Forecasting Competition 2014 [31], which is publicly accessible online. The dataset covers 12 numerical weather prediction (NWP) variables and the hourly PV power output measured from 1 April 2012 to 1 July 2014 at three neighboring PV plants in Australia. In this case study, only the PV power output observed at site 1 is used for model construction; the integration of NWP information and the neighboring measurements is beyond the scope of this work, since the historical samples suffice to establish the forecasting model on a rolling basis. Before learning, the night-time measurements (7:00 pm - 7:00 am) are removed; the data from 1 April 2012 to 1 April 2014 are used for model training, and the rest is used for prediction. The settings of the adopted encoder-decoder LSTM network are listed in Table II (a plausible code realization is sketched at the end of this subsection). The predictive skill of the well-trained LSTM network for different look-ahead horizons is shown in Fig. 5. Based on this predicted information, a sequence of optimal actions can be selected by the Q-learning agent, considering the trade-off between the current reward and future rewards through the discount factor. However, only the action for the current hour is executed, so a relatively large prediction error has only a minor effect on our results.
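Since Ref. [35] covers Keras, one plausible Keras realization of the Table II settings is sketched below. The exact layer wiring is our reading of the encoder-decoder description rather than the authors' released code, and the tanh output head assumes the PV targets are normalized to the activation range.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_seq2seq(enc_len=36, dec_len=12, hidden=200):
    # Encoder: 36 past hourly PV readings -> fixed-length encoder vector (Table II)
    inp = keras.Input(shape=(enc_len, 1))
    enc = layers.LSTM(hidden, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-3))(inp)
    # Encoder vector repeated once per look-ahead hour of the decoder
    dec_in = layers.RepeatVector(dec_len)(enc)
    dec = layers.LSTM(hidden, activation="relu",
                      kernel_regularizer=regularizers.l2(1e-3),
                      return_sequences=True)(dec_in)
    # One-layer MLP head with tanh; one output per predicted hour (normalized PV)
    out = layers.TimeDistributed(layers.Dense(1, activation="tanh"))(dec)
    model = keras.Model(inp, out)
    model.compile(optimizer="nadam", loss="mse")  # Nadam [35], MSE loss [34]
    return model

# Training with the Table II settings would then be:
# model.fit(X_train, y_train, epochs=100, batch_size=64)
```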
It should be noted that taking the NWP information into account and making dynamic intraday adjustments to the day-ahead forecast could further improve the forecasting accuracy, which would in turn benefit the reinforcement learning based decision-making process in finding near-optimal solutions; this will be investigated in our future work.

Fig. 5. Prediction performance with different time steps.

C. Performance of the Q-learning Decision-making Process

Fig. 6. Q-learning process during 5×10⁴ episodes.

The Q-learning process over 50,000 episodes is shown in Fig. 6. During the first 10,000 episodes, the Q-value increases rapidly, as the Q-learning agent learns from trial and error after each episode at the initial learning stage. The increment of the Q-value then becomes small, and the Q-value finally stabilizes after sufficient training. Therefore, a near-optimal pricing strategy is obtained after approximately 30,000 episodes.

Fig. 7. Optimal internal uniform price for each hour.

The optimal action can be selected by using the Q-learning algorithm, as seen in Fig. 7, which depicts the near-optimal internal uniform prices as well as the TOU prices during the daytime. During the off-peak time slots, e.g., 7:00-10:00 and 19:00-20:00, the obtained internal uniform prices are higher than the TOU prices. The reason is that the rooftop PV generation is low in these time slots, so the PV system owner adopts a premium pricing strategy to maximize its profit. On the contrary, during the on-peak time slots, i.e., 11:00-18:00, the rooftop PV generation is high due to strong solar irradiance. The PV system owner therefore charges lower prices so that the self-generated PV electricity can be sold to more building users, leading to near-maximum revenue for the rooftop PV system owner.

Fig. 8. Number of building users involved in the energy sharing process in each hour.

Fig. 8 shows the number of building users who succeed in the bidding for local PV generation in each hour. Few building users can be supplied with PV energy during the off-peak hours, while more building users can utilize local PV electricity during the on-peak hours.

Fig. 9. Dispatch of the ES in each hour.

Fig. 9 illustrates the dispatch of the ES in each hour. As shown in this figure, the PV system owner tends to sell the self-generated PV energy to the building users or store it in the ES devices, rather than sell it to the utility grid, so as to maximize its economic benefit; the PV energy is thus stored in the ES during the daytime and dispatched to the local consumers at night.
Hence, our proposed energy sharing model and pricing strategy can facilitate the utilization of local PV generation, reducing the negative effects caused by intermittent PV energy integration.
In this subsection, a comparative case study is also conducted to demonstrate the effectiveness of the pricing strategy obtained from the proposed energy sharing model. Three different pricing strategies are considered:
(i) Strategy 1 (proposed internal uniform price): This pricing strategy is obtained by solving our formulated leader-follower energy sharing model (4)-(14).
(ii) Strategy 2 (TOU price): The price of rooftop PV electricity applied to the local consumers equals the TOU price. From the perspective of the consumers, the choice of energy supplier (the rooftop PV system owner or the utility grid) then results in the same electricity bill.
(iii) Strategy 3 (market clearing price): The price of rooftop PV electricity applied to the local consumers equals the market clearing price. Under this pricing strategy, the rooftop PV system owner earns the same income whether it sells the self-generated PV energy to the local consumers or to the utility grid.

TABLE III
DAILY PROFIT WITH DIFFERENT PRICING STRATEGIES

Pricing strategy                        Daily profit ($)
Strategy 1: Internal uniform price      41.99
Strategy 2: TOU price                   39.71
Strategy 3: Market clearing price       24.47

Fig. 10. Comparison of hourly revenue with and without reinforcement learning based on the same prediction information.

Fig. 10 depicts the hourly profit under these three pricing strategies based on the same prediction information provided by our proposed LSTM model. As seen in this figure, the profit under Strategy 1 (internal uniform price) is always higher than that under the other two strategies; accordingly, this pricing strategy leads to the highest daily profit (see Table III). Under Strategy 2, only a few consumers with strong environmental awareness (high WTP) purchase PV energy from the rooftop PV system owner, since the PV output is time-varying and hence not as stable as the electricity from the utility grid. Under Strategy 3, although the local consumers are more likely to buy the rooftop PV generation due to the relatively low price, the limited PV panel output cannot bring the rooftop PV system owner a high profit. Therefore, our proposed pricing model can duly address the conflict of interest between the rooftop PV system owner and the local consumers by subtly utilizing the environmental awareness of the consumers.

D. Comparison with a Conventional Optimization Method

Fig. 11. Convergence performance of the MILP based optimization method and the proposed Q-learning algorithm based RL method.

TABLE IV
COMPUTATION PERFORMANCE OF THE MILP BASED OPTIMIZATION METHOD AND THE Q-LEARNING ALGORITHM BASED RL METHOD

Solution method                     Profit ($)   Computation time (s)
Conventional optimization method    43.075       3400.42
Q-learning algorithm                41.994       15.339

Fig. 11 compares the revenues obtained by the MILP based optimization method (solved by CPLEX [21]) and our proposed Q-learning algorithm based RL method. As seen in this figure, the proposed solution method shows poor performance at the initial training stage, since it is still undergoing trial and error. However, after experiencing more episodes,
the agent adapts to the learning environment and adjusts its policy via the exploration and exploitation mechanism. Finally, it finds a near-optimal pricing strategy. Table IV lists the computation efficiency of the two solution methods. Our proposed method significantly reduces the computation time, which benefits the PV energy sharing process. Considering the adaptivity of model-free RL to the external environment, the proposed well-performing pricing method is therefore recommended for energy sharing management in the RBC.

V. CONCLUSION

This paper proposes a novel dynamic-prediction and reinforcement-learning based game-theoretic pricing model for sharing rooftop PV energy in the RBC. Specifically, Stackelberg game theory is used to model the energy sharing between the rooftop PV system owner in the RBC and the local electricity consumers. With the introduction of the WTP of each consumer, the original complex uniform auction for local PV energy is transformed into an efficient discriminatory auction, which can be formulated as an MDP. We then develop a Q-learning algorithm based solution method to find a near-optimal pricing strategy. Besides, an LSTM network based PV generation prediction model is built to dynamically update the action-state space by providing hourly predicted information about the future trends of the rooftop PV panel output. The numerical results verify the effectiveness of the proposed method in dealing with PV energy sharing management in an RBC comprising electrically interconnected apartment buildings.
For the implementation of the proposed dynamic pricing method, some major limitations should be noted: 1) the WTP value of each building consumer needs to be duly considered, and a user survey is suggested to obtain reasonable WTP values; 2) the privacy of building consumers may be violated, since they send their daily load requirement information to the rooftop PV system owner every hour; this issue can be addressed by using non-intrusive load monitoring devices for the long-term observation of individual load changes; 3) precise smart meters need to be installed in the apartments to measure the energy consumption, resulting in a high installation cost that may not be acceptable to individual users.

APPENDIX A
KKT CONDITIONS OF THE NONLINEAR MODEL (4)-(14)

The general formulation of the proposed bi-level energy sharing model (4)-(14) can be described as follows:

$$\min_{\{x, y, \lambda, \mu\}} f_1(x, y, \lambda, \mu) \quad (25)$$

$$\text{s.t. } h_1(x, y, \lambda, \mu) = 0 \quad (26)$$

$$g_1(x, y, \lambda, \mu) \ge 0 \quad (27)$$

$$\min_{\{y\}} f_2(x, y) \quad (28)$$

$$\text{s.t. } h_2(x, y) = 0 \;:\; \lambda \quad (29)$$

$$g_2(x, y) \ge 0 \;:\; \mu \quad (30)$$

The Karush-Kuhn-Tucker (KKT) conditions of the lower-level optimization problem (28)-(30) can be integrated into the upper-level optimization problem (25)-(27), given as follows:
$$\min_{\{x, y, \lambda, \mu\}} f_1(x, y, \lambda, \mu) \quad (31)$$

$$\text{s.t. } h_1(x, y, \lambda, \mu) = 0 \quad (32)$$

$$g_1(x, y, \lambda, \mu) \ge 0 \quad (33)$$

$$\nabla_y f_2(x, y) + \lambda \nabla_y h_2(x, y) + \mu \nabla_y g_2(x, y) = 0 \quad (34)$$

$$h_2(x, y) = 0 \quad (35)$$

$$g_2(x, y) \ge 0 \;\perp\; \mu \ge 0 \quad (36)$$

For the lower-level problem (12)-(14), the Lagrangian is introduced as follows:

$$L = -\lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) - \lambda_h^{TOU} P_{ih}^G - w_i^E \lambda^E P_{ih}^G - \mu_{ih}^D \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} + P_{ih}^G - P_{ih}^D \right) - \mu_{ih}^{PVuser} P_{ih}^{PVuser} - \mu_{ih}^{ESuser} P_{ih}^{ESuser} - \mu_{ih}^G P_{ih}^G \quad (37)$$

Therefore, the lower-level problem can be replaced by its KKT conditions:

$$\frac{\partial L}{\partial \mu_{ih}^D} = P_{ih}^{PVuser} + P_{ih}^{ESuser} + P_{ih}^G - P_{ih}^D = 0 \quad (38)$$

$$\frac{\partial L}{\partial P_{ih}^{PVuser}} = -\lambda_h^U - \mu_{ih}^D - \mu_{ih}^{PVuser} = 0 \quad (39)$$

$$\frac{\partial L}{\partial P_{ih}^{ESuser}} = -\lambda_h^U - \mu_{ih}^D - \mu_{ih}^{ESuser} = 0 \quad (40)$$

$$\frac{\partial L}{\partial P_{ih}^G} = -\lambda_h^{TOU} - w_i^E \lambda^E - \mu_{ih}^D - \mu_{ih}^G = 0 \quad (41)$$

$$P_{ih}^{PVuser} \ge 0 \;\perp\; \mu_{ih}^{PVuser} \ge 0 \quad (42)$$

$$P_{ih}^{ESuser} \ge 0 \;\perp\; \mu_{ih}^{ESuser} \ge 0 \quad (43)$$

$$P_{ih}^G \ge 0 \;\perp\; \mu_{ih}^G \ge 0 \quad (44)$$

APPENDIX B
LINEARIZATION OF THE NONLINEAR MODEL (4)-(14)

There are two nonlinearities in the proposed bi-level optimization model (4)-(14): 1) the nonlinear term $\lambda_h^U (P_{ih}^{PVuser} + P_{ih}^{ESuser})$ in the objective function (4); and 2) the complementarity constraints (42)-(44). By the strong duality theorem, if a problem is convex, the objective functions of the primal and dual problems have the same value at the optimum [36]. To linearize $\lambda_h^U (P_{ih}^{PVuser} + P_{ih}^{ESuser})$, the strong duality condition is introduced here: the primal objective function (12) of the lower-level problem equals its dual objective function $\mu_{ih}^D P_{ih}^D$, i.e.,

$$\lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) + \lambda_h^{TOU} P_{ih}^G + w_i^E \lambda^E P_{ih}^G = \mu_{ih}^D P_{ih}^D \quad (45)$$

Accordingly, the linear expression for $\lambda_h^U (P_{ih}^{PVuser} + P_{ih}^{ESuser})$ can be written as follows:

$$\lambda_h^U \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) = \mu_{ih}^D P_{ih}^D - \lambda_h^{TOU} P_{ih}^G - w_i^E \lambda^E P_{ih}^G = \mu_{ih}^D P_{ih}^D - \lambda_{ih}^{WTP} P_{ih}^G \quad (46)$$

As for the complementarity constraints (42)-(44), Ref. [37] provides linear expressions by introducing a large positive constant $M$ and binary variables $u_{ih}^{PVuser}$, $u_{ih}^{ESuser}$, and $u_{ih}^G$, described as follows:

$$P_{ih}^{PVuser},\ P_{ih}^{ESuser},\ P_{ih}^G \ge 0 \quad (47)$$

$$\mu_{ih}^{PVuser},\ \mu_{ih}^{ESuser},\ \mu_{ih}^G \ge 0 \quad (48)$$

$$P_{ih}^{PVuser} \le (1 - u_{ih}^{PVuser}) M \quad (49)$$

$$P_{ih}^{ESuser} \le (1 - u_{ih}^{ESuser}) M \quad (50)$$

$$P_{ih}^G \le (1 - u_{ih}^G) M \quad (51)$$

$$\mu_{ih}^{PVuser} \le u_{ih}^{PVuser} M \quad (52)$$

$$\mu_{ih}^{ESuser} \le u_{ih}^{ESuser} M \quad (53)$$

$$\mu_{ih}^G \le u_{ih}^G M \quad (54)$$

$$u_{ih}^{PVuser},\ u_{ih}^{ESuser},\ u_{ih}^G \in \{0, 1\} \quad (55)$$
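As an illustration of the Fortuny-Amat encoding (47)-(55), the sketch below builds one complementarity pair, Eq. (42), in gurobipy (the paper mentions Gurobi [22] as a suitable solver); the variable names and the value of M are illustrative.

```python
import gurobipy as gp
from gurobipy import GRB

M = 1e4  # large positive constant; must upper-bound the primal and dual variables [37]
m = gp.Model("bigM_complementarity")
p = m.addVar(lb=0.0, name="P_ih_PVuser")            # primal variable, Eq. (47)
mu = m.addVar(lb=0.0, name="mu_ih_PVuser")          # dual variable, Eq. (48)
u = m.addVar(vtype=GRB.BINARY, name="u_ih_PVuser")  # binary switch, Eq. (55)
m.addConstr(p <= (1 - u) * M)   # Eq. (49): u = 1 forces P_ih_PVuser = 0
m.addConstr(mu <= u * M)        # Eq. (52): u = 0 forces mu_ih_PVuser = 0
# Together these enforce: P_ih_PVuser >= 0 complementary to mu_ih_PVuser >= 0, Eq. (42)
```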
APPENDIX C
FINAL LINEARIZED BI-LEVEL ENERGY SHARING MODEL

By using the KKT conditions (Appendix A) and the linearization methods (Appendix B), the final linearized bi-level energy sharing model is formulated as follows:

$$\max_{\{\lambda_h^U,\, P_h^{ESgrid},\, P_h^{ESin},\, P_{ih}^{PVuser},\, P_{ih}^{ESuser},\, P_{ih}^G,\, \mu_{ih}^D,\, \mu_{ih}^{PVuser},\, \mu_{ih}^{ESuser},\, \mu_{ih}^G\}} Rev^O = \sum_{h \in \mathcal{H}} \left\{ \sum_{i \in N^C} \left( \mu_{ih}^D P_{ih}^D - \lambda_{ih}^{WTP} P_{ih}^G \right) + \lambda^{FiT} \left( P_h^{PVgrid} + P_h^{ESgrid} \right) - \lambda_h^{TOU} \left[ \sum_i \left( P_{ih}^{PVuser} + P_{ih}^{ESuser} \right) - P_h^{PV} \right]^+ \right\} \quad (56)$$

s.t. (5)-(11), (38)-(41), (47)-(55)

REFERENCES

[1] E. O'Shaughnessy, D. Cutler, K. Ardani, and R. Margolis, "Solar plus: Optimization of distributed solar PV through battery storage and dispatchable load in residential buildings," Applied Energy, vol. 213, pp. 11-21, 2018.
[2] W. Tushar, T. K. Saha, C. Yuen, D. Smith, and H. V. Poor, "Peer-to-peer trading in electricity networks: An overview," IEEE Transactions on Smart Grid, 2020.
[3] W. Tushar et al., "Three-party energy management with distributed energy resources in smart grid," IEEE Transactions on Industrial Electronics, vol. 62, no. 4, pp. 2487-2498, 2014.
[4] W. Tushar et al., "Energy storage sharing in smart grid: A modified auction-based approach," IEEE Transactions on Smart Grid, vol. 7, no. 3, pp. 1462-1475, 2016.
[5] X. Xu, J. Li, Y. Xu, Z. Xu, and C. S. Lai, "A two-stage game-theoretic method for residential PV panels planning considering energy sharing mechanism," IEEE Transactions on Power Systems, 2020.
[6] N. Liu et al., "Online energy sharing for nanogrid clusters: A Lyapunov optimization approach," IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4624-4636, 2017.
[7] S. Cui, Y.-W. Wang, J.-W. Xiao, and N. Liu, "A two-stage robust energy sharing management for prosumer microgrid," IEEE Transactions on Industrial Informatics, vol. 15, no. 5, pp. 2741-2752, 2018.
[8] B. S. K. Patnam and N. M. Pindoriya, "Centralized stochastic energy management framework of an aggregator in active distribution network," IEEE Transactions on Industrial Informatics, vol. 15, no. 3, pp. 1350-1360, 2018.
[9] S. Cui, Y.-W. Wang, and J.-W. Xiao, "Peer-to-peer energy sharing among smart energy buildings by distributed transaction," IEEE Transactions on Smart Grid, 2019.
[10] N. Liu, X. Yu, C. Wang, and J. Wang, "Energy sharing management for microgrids with PV prosumers: A Stackelberg game approach," IEEE Transactions on Industrial Informatics, vol. 13, no. 3, pp. 1088-1098, 2017.
[11] G. Ye, G. Li, D. Wu, X. Chen, and Y. Zhou, "Towards cost minimization with renewable energy sharing in cooperative residential communities," IEEE Access, vol. 5, pp. 11688-11699, 2017.
[12] N. Liu, J. Wang, X. Yu, and L. Ma, "Hybrid energy sharing for smart building cluster with CHP system and PV prosumers: A coalitional game approach," IEEE Access, vol. 6, pp. 34098-34108, 2018.
[13] Z. Wan, H. Li, H. He, and D. Prokhorov, "Model-free real-time EV charging scheduling based on deep reinforcement learning," IEEE Transactions on Smart Grid, 2018.
[14] Y. Du and F. Li, "Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[15] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, "Real-time voltage control using deep reinforcement learning," arXiv preprint arXiv:.09374, 2019.
[16] N. Sadeghianpourhamami, J. Deleu, and C. Develder, "Definition and evaluation of model-free coordination of electrical vehicle charging with reinforcement learning," IEEE Transactions on Smart Grid, 2019.
[17] P. Dai, W. Yu, G. Wen, and S. Baldi, "Distributed reinforcement learning algorithm for dynamic economic dispatch with unknown generation cost functions," IEEE Transactions on Industrial Informatics, 2019.
[18] J. Ahmed and Z. Salam, "An improved method to predict the position of maximum power point during partial shading for PV arrays," IEEE Transactions on Industrial Informatics, vol. 11, no. 6, pp. 1378-1387, 2015.
[19] A. W. Azhari, K. Sopian, A. Zaharim, and M. Al Ghoul, "A new approach for predicting solar radiation in tropical environment using satellite images: Case study of Malaysia," Transactions on Environment and Development, vol. 4, no. 4, pp. 373-378, 2008.
[20] R. B. Myerson, Game Theory. Harvard University Press, 2013.
[21] IBM ILOG CPLEX, "V12.1: User's manual for CPLEX," International Business Machines Corporation, vol. 46, no. 53, p. 157, 2009.
[22] Gurobi Optimization, Inc., "Gurobi optimizer reference manual," 2015.
[23] S. R. Shaw, S. B. Leeb, L. K. Norford, and R. W. Cox, "Nonintrusive load monitoring and diagnostics in power systems," IEEE Transactions on Instrumentation and Measurement, vol. 57, no. 7, pp. 1445-1454, 2008.
[24] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo, "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," in Advances in Neural Information Processing Systems, 2015, pp. 802-810.
[25] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition," arXiv preprint, 2014.
[26] T. M. Hansen, E. K. Chong, S. Suryanarayanan, A. A. Maciejewski, and H. J. Siegel, "A partially observable Markov decision process approach to residential home energy management," IEEE Transactions on Smart Grid, vol. 9, no. 2, pp. 1271-1281, 2016.
[27] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279-292, 1992.
[28] H. J. Kappen, "Optimal control theory and the linear Bellman equation," 2011.
[29] M. Tokic, "Adaptive ε-greedy exploration in reinforcement learning based on value differences," in Annual Conference on Artificial Intelligence, Springer, 2010, pp. 203-210.
[30] N. Blair et al., "System Advisor Model, SAM 2014.1.14: General description," National Renewable Energy Laboratory (NREL), Golden, CO, USA, 2014.
[31] G. I. Nagy, G. Barta, S. Kazi, G. Borbély, and G. Simon, "GEFCom2014: Probabilistic solar and wind power forecasting using a generalized additive tree ensemble approach," International Journal of Forecasting, vol. 32, no. 3, pp. 1087-1093, 2016.
[32] Y. Li and Y. Yuan, "Convergence analysis of two-layer neural networks with ReLU activation," in Advances in Neural Information Processing Systems, 2017, pp. 597-607.
[33] E. Fan, "Extended tanh-function method and its applications to nonlinear equations," Physics Letters A, vol. 277, no. 4-5, pp. 212-218, 2000.
[34] Z. Wang and A. C. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98-117, 2009.
[35] A. Gulli and S. Pal, Deep Learning with Keras. Packt Publishing Ltd, 2017.
[36] G. Dantzig, Linear Programming and Extensions. Princeton University Press, 2016.
[37] J. Fortuny-Amat and B. J. McCarl, "A representation and economic interpretation of a two-level programming problem," Journal of the Operational Research Society, vol. 32, no. 9, pp. 783-792, 1981.

Xu Xu (S'18-M'19) received the M.E. and Ph.D. degrees from The Hong Kong Polytechnic University, Hong Kong SAR, in 2016 and 2019, respectively. Dr. Xu is with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hung Hom, Hong Kong SAR, China. His current research interests include power system planning and operation, renewable power integration, energy management, and artificial intelligence applications in power engineering.

Yan Xu (S'10-M'13-SM'19) received the B.E. and M.E. degrees from South China University of Technology, Guangzhou, China, in 2008 and 2011, respectively, and the Ph.D. degree from The University of Newcastle, Australia, in 2013. He is now Nanyang Assistant Professor at the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), and a Cluster Director at the Energy Research Institute @ NTU (ERI@N), Singapore. Previously, he held The University of Sydney Postdoctoral Fellowship in Australia. His research interests include power system stability and control, microgrids, and data analytics for smart grid applications. Dr. Xu is an Editor for IEEE Transactions on Smart Grid, IEEE Transactions on Power Systems, IEEE Power Engineering Letters, and CSEE Journal of Power and Energy Systems, and an Associate Editor for IET Generation, Transmission & Distribution.

Ming-Hao Wang (S'15-M'18) received the B.Eng. (Hons.) degree in electrical and electronic engineering from the Huazhong University of Science and Technology, Wuhan, China, and the University of Birmingham, Birmingham, U.K., in 2012, and the M.Sc. and Ph.D. degrees, both in electrical and electronic engineering, from The University of Hong Kong, Hong Kong, in 2013 and 2017, respectively. Since 2018, he has been with the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong, where he is currently a Research Assistant Professor. His research interests include power systems and power electronics.

Zhao Xu (M'06-SM'12) received the B.Eng., M.Eng., and Ph.D. degrees from Zhejiang University, the National University of Singapore, and The University of Queensland in 1996, 2002, and 2006, respectively.
From 2006 to 2009, he was an Assistant and later Associate Professor with the Centre for Electric Technology, Technical University of Denmark, Lyngby, Denmark. Since 2010, he has been with The Hong Kong Polytechnic University, where he is currently a Professor in the Department of Electrical Engineering and Leader of the Smart Grid Research Area. He is also a foreign Associate Staff member of the Centre for Electric Technology, Technical University of Denmark. His research interests include demand side management, grid integration of wind and solar power, electricity market planning and management, and AI applications. He is an Editor of Electric Power Components and Systems, the IEEE PES Power Engineering Letters, and the IEEE Transactions on Smart Grid. He is currently the Chairman of the IEEE PES/IES/PELS/IAS Joint Chapter in the Hong Kong Section.

Jiayong Li (S'16-M'19) received the B.Eng. degree from Zhejiang University, Hangzhou, China, in 2014, and the Ph.D. degree from The Hong Kong Polytechnic University, Hong Kong, in 2018. He is currently an Assistant Professor with the College of Electrical and Information Engineering, Hunan University, Changsha, China. He was a Postdoctoral Research Fellow with The Hong Kong Polytechnic University and a Visiting Scholar with Argonne National Laboratory, Argonne, IL, USA. His research interests include power economics, energy management, distribution system planning and operation, renewable energy integration, and demand-side energy management.

Songjian Chai received the Ph.D. degree from The Hong Kong Polytechnic University, Hong Kong SAR, in 2018. He is currently a Postdoctoral Research Fellow with The Hong Kong Polytechnic University. His research interests include variable renewable generation forecasting, electricity price forecasting, power system uncertainty analysis, and artificial intelligence applications in power engineering.

Yufei He (S'17) received the B.Eng. degree from Zhejiang University, China, in 2016. He is currently working toward the Ph.D. degree in the Department of Electrical Engineering, The Hong Kong Polytechnic University, Hong Kong. His research interests include power electronic control for grid integration of renewables.