
Distributed Deep Reinforcement Learning with Wideband Sensing for Dynamic Spectrum Access

Umuralp Kaytaz∗ , Seyhan Ucar‡ , Baris Akgun† and Sinem Coleri∗
Department of Electrical and Electronics Engineering, Koc University, Istanbul, Turkey∗
Department of Computer Engineering, Koc University, Istanbul, Turkey†
InfoTech Labs, Toyota Motor North America R&D, Mountain View, CA ‡
{ukaytaz@ku.edu.tr, seyhan.ucar@toyota.com, baakgun@ku.edu.tr, scoleri@ku.edu.tr}
Abstract—Dynamic Spectrum Access (DSA) improves spectrum utilization by allowing secondary users (SUs) to opportunistically access temporary idle periods in the primary user (PU) channels. Previous studies on utility-maximizing spectrum access strategies mostly require complete network state information and therefore may not be practical. Model-free reinforcement learning (RL) based methods, such as Q-learning, on the other hand, are promising adaptive solutions that do not require complete network information. In this paper, we tackle this research dilemma and propose deep Q-learning originated spectrum access (DQLS) based decentralized and centralized channel selection methods for network utility maximization, namely DEcentralized Spectrum Allocation (DESA) and Centralized Spectrum Allocation (CSA), respectively. Actions generated through a centralized deep Q-network (DQN) are utilized in CSA, whereas DESA adopts a non-cooperative approach in spectrum decisions. We use extensive simulations to investigate the spectrum utilization of our proposed methods for varying primary and secondary network sizes. Our findings demonstrate that the proposed methods significantly outperform traditional methods, including slotted-Aloha and random assignment, while achieving 88% of optimal channel access.
Index Terms—Cognitive radio, dynamic spectrum access,
deep reinforcement learning, medium access control (MAC).
I. INTRODUCTION
Cognitive Radio (CR) is a promising technology for
emerging wireless communication systems due to its efficient
utilization of the frequency bands. CR enhances temporal and spatial efficiency by exploiting temporary idle periods, a.k.a. spectrum holes, in the frequency usage. Dynamic Spectrum Access (DSA) plays a central role in CR networks by allowing secondary users (SUs) to opportunistically access spectrum holes in the primary user (PU) channels. Network utility maximization during DSA requires efficient frequency selection and minimal interference. Previous studies on DSA-related channel selection strategies mostly require complete network state information and therefore may not be practical [1]–[3]. Model-free reinforcement learning (RL) based methods, such as Q-learning, on the other hand, are promising solutions that do not require complete network information [4].
Deep Reinforcement Learning (DRL) adopts deep neural network (DNN) architectures for approximating the objective values during learning [5]. The high-dimensional spectrum assignment problem, which involves large action and state spaces, can be solved via DRL without any prior knowledge about the network state and/or environment.
Up to now, several works with RL and/or deep RL (DRL)
have been proposed with the aim of interference avoidance
and/or utility maximization in CR networks. DRL based
frequency allocation has been investigated for maximizing
the number of successful transmissions in [6]. A similar
approach has been proposed in [7] to maximize the network
utility during spectrum access via Deep Q-network (DQN).
Deep Q-learning (DQL) with a narrowband sensing policy has been examined for heterogeneous networks that operate under different MAC protocols [8]. However, the narrowband sensing adopted in these studies limits the usage of multiple spectral opportunities. [9] examines multi-agent RL with wideband sensing for learning sensing policies in a distributed manner. Different from that architecture, we focus on spectrum assignment and primary system frequency allocations. Wideband sensing with a DQN-based access policy has been investigated for utility maximization of a single SU [10]. However, the existence of multiple cognitive agents in different primary network settings and the effect of centralized policy learning are omitted in that work.
In this paper, we propose DQL and wideband sensing based utility maximization for multiple SUs for the first time in the literature. The original contribution of this study
is threefold. First, we develop a Markov Decision Process
(MDP) formulation for utility maximization during secondary
system decision making based on wideband sensing. Second,
we propose novel decentralized and centralized spectrum selection methods based on wideband sensing, namely DEcentralized Spectrum Allocation (DESA) and Centralized Spectrum Allocation (CSA), respectively. Third, we investigate
the spectrum utilization of the proposed spectrum allocation
methods for varying primary and secondary network sizes.
The remainder of the paper is organized as follows. Section II describes the system model. Section III presents the
formulation and RL based learning methods. DQN architecture and DQL-based spectrum allocation algorithms are
provided in Section IV. Section V details simulation setup
and analyzes experimental outcomes. Finally, we present our
concluding remarks in Section VI.
II. SYSTEM MODEL
Fig. 1 represents the adopted network architecture based on the IEEE 802.22 Wireless Regional Area Network (WRAN) standard.
PUs and SUs are assumed to be WRAN Base Station (BS)
q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s, a_t = a \right] \qquad (2)

There exists an optimal deterministic stationary policy for each MDP model [11]. This policy maximizes the expected reward returned from a state in the MDP model with unknown transition probabilities. The optimal state-value function v∗ and the optimal action-value function q∗ for all s ∈ S and a ∈ A under policy π can be denoted as follows:

v_*(s) = \max_{\pi} v^{\pi}(s) \qquad (3)

q_*(s, a) = \max_{\pi} q^{\pi}(s, a) \qquad (4)
C. Q-Learning
Q-learning is a widely used control algorithm for policy-independent approximation of the optimal action-value function q∗ based on the Bellman optimality equations. In the absence of the state transition probabilities P, optimal policy calculation requires approximating the optimal value functions q∗ and v∗ for utility maximization. The Bellman optimality equations allow policy-independent representations and recursive calculation of the optimal value functions [4]. These equations are:
v_*(s) = \max_{a} \mathbb{E}\left[ r_{t+1} + \gamma v_*(s_{t+1}) \mid s_t = s, a_t = a \right] \qquad (5)

q_*(s, a) = \mathbb{E}\left[ r_{t+1} + \gamma \max_{a_{t+1}} q_*(s_{t+1}, a_{t+1}) \mid s_t = s, a_t = a \right] \qquad (6)

The learning algorithm uses multiple experiences and a one-step look-ahead approach for recursively approximating the q∗ values, in which an experience represents a sample taken from the MDP model. An experience at time slot t can be denoted as e_t = ⟨s_t, a_t, r_t, s_{t+1}⟩. Q-learning uses the following update rule for solving the MDP formulation and approximating the q∗ values:

q(s, a) \leftarrow q(s, a) + \Delta q(s, a) \qquad (7)

where, using learning rate α, Δq(s, a) is defined as

\Delta q(s, a) = \alpha \left[ r_t + \gamma \max_{a_{t+1}} q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right] \qquad (8)

Table I: Reinforcement Learning Notation

  Notation                              Description
  γ                                     Discount factor
  π                                     Policy
  r_t                                   Reward observed at time t
  s_t                                   State at time t
  a_t                                   Action taken at time t
  v^π(s)                                State-value function under policy π
  q^π(s, a)                             Action-value function under policy π
  v∗(s)                                 Optimal state-value function
  q∗(s, a)                              Optimal action-value function
  e_t = ⟨s_t, a_t, r_t, s_{t+1}⟩        Experience at time t
IV. APPROACH

Next, we describe our DQN architecture and algorithms for DQL based spectrum allocation. First, we explain our DQN architecture, detailing the hyperparameters selected for the learning procedure. Later on, we present the wideband sensing based DQL algorithm for single-SU frequency selections. Finally, we elaborate on our proposed centralized and decentralized DRL architectures for multi-user utility maximization during secondary system spectrum decisions.

A. Deep Q-Network

Q-learning calculation requires memorization and representation of Q-values during the expected reward calculation. DQN is a deep reinforcement learning method that uses a deep neural network architecture for value storage and expected utility approximation during the Q-learning procedure. Using DQN allows complex mapping of high-dimensional state-space representations to action-values [6], [7]. Our DQN training procedure uses the experience replay technique. With this approach, we ensure convergent and stable training of the DQN model by sampling past experience at each batch and training on decorrelated examples [12].

Table II: Hyperparameters for DQN

  Hyperparameter            Value
  Number of episodes        200
  Number of time slots      10000
  Number of layers          5
  Batch size B              10
  Memory size               2000
  Learning rate α           0.001
  Discount factor γ         0.95
  Exploration rate ε        1.0 → 0.01
  Epsilon decay             0.997
  Loss function             Huber
  Optimizer                 Adam
The feed-forward network consists of 5 fully connected layers, where the input is the state s_i. The input layer is of size 1 × N and the output layer is a vector over the action set A. The hidden layers are of size 30, 20, and 10 and use the ReLU activation function, which computes f(x) = max(x, 0). The last layer of the network uses the softmax activation function $\sigma(x_j) = e^{x_j} / \sum_{i} e^{x_i}$ for predicting the Q-value of each action. We chose the Adam optimization algorithm for updating the weights during DQN training [13]. The Huber loss has been chosen as the loss function during backpropagation due to its robustness against outliers [14]. At each iteration i, the Adam algorithm is used to update the weights θ_i of the DQN by minimizing the loss function

L_i(\theta_i) =
\begin{cases}
\frac{1}{2}\,(y_i - Q_i)^2 & \text{for } |y_i - Q_i| \le c \\
c\,|y_i - Q_i| - \frac{1}{2}\,c^2 & \text{otherwise}
\end{cases}
\qquad (9)

where $y_i = \mathbb{E}\left[ r_{i+1} + \gamma \max_{a'} q_*(s_{i+1}, a') \right]$ is calculated with the DQN using the weights θ_{i−1} from the previous iteration, $Q_i = q(s_i, a \mid \theta_i)$, and c = 1.0 as presented in Table II.
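As an illustration of this architecture, a minimal Keras sketch is given below; the input dimension (number of sensed PU channels), the output dimension (size of the action set), and the variable names are our assumptions, while the layer sizes, activations, loss, and optimizer follow Table II and the description above.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_dqn(num_channels, num_actions, learning_rate=0.001):
    """Feed-forward DQN: input of size 1 x N, hidden layers of 30/20/10
    ReLU units, softmax output over the action set, Huber loss, Adam."""
    model = keras.Sequential([
        layers.Input(shape=(num_channels,)),
        layers.Dense(30, activation="relu"),
        layers.Dense(20, activation="relu"),
        layers.Dense(10, activation="relu"),
        layers.Dense(num_actions, activation="softmax"),
    ])
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss=keras.losses.Huber(delta=1.0),  # c = 1.0 as in Table II
    )
    return model

# Example (assumed sizes): 20 sensed PU channels, N + 1 = 21 candidate actions.
dqn_train = build_dqn(num_channels=20, num_actions=21)
dqn_target = keras.models.clone_model(dqn_train)
dqn_target.set_weights(dqn_train.get_weights())
```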
Algorithm 1: DQL originated Spectrum Access (DQLS)
Input: S ← primary system channel occupation
Output: DQN^target
Data: Mem : memory, b : minibatch
1   Initialize DQN^train, DQN^target ← DQN(S, A)
2   for each episode E_t do
3       for each state s_t ∈ E_t do
4           Compute a_t = DQN^train.get_action(s_t)
5           a_t ← exploration with prob. ε
6           Compute r_t = DQN^train.get_reward(s_t, a_t)
7           Get next state s_{t+1}
8           Store Mem.store(e_t = ⟨s_t, a_t, r_t, s_{t+1}⟩)
9           Update s_t ← s_{t+1}
10          if |Mem| > B then
11              Compute b = Mem.sample(B)
12              for each e_i in b do
13                  Compute t = E[ r_{i+1} + γ max_{a_{i+1}} q∗(s_{i+1}, a_{i+1}) ]
14                  Compute DQN^train.train(s_i, t)
15      Update DQN^target ← DQN^train (copy weights)
B. DQL originated Spectrum Access Algorithm (DQLS)
Algorithm 1 is executed for learning spectrum decision policies based on detected PU channel allocations. Learning is triggered upon receiving primary system channel information from the environment using wideband sensing. First, two DQN implementations (DQN^train, DQN^target) are initialized for the experience replay procedure using the state space S and action space A (Line 1). Training lasts for a predetermined number of episodes (Lines 2-16). At each episode, randomly generated PU spectrum decisions are determined using wideband sensing (Lines 3-15). The action a_t corresponding to the highest expected utility is returned by DQN^train given the current PU channel occupations s_t (Line 4). The training procedure uses an ε-greedy policy, choosing random actions over the actions that return the highest expected utility with probability ε (Line 5). Exploration is necessary in order to find better action selections that result in higher reward signals. We decrease the exploration rate ε from 1.0 to 0.01 using the epsilon decay factor of 0.997 for a balanced exploration-exploitation trade-off. The immediate reward obtained by the selected action is recorded for the rest of the training procedure (Line 6).

After obtaining the channel occupation in the subsequent time slot, the current state is updated and an experience e_t is stored in the memory Mem using the action a_t, the states s_t and s_{t+1}, and the immediate reward r_t (Lines 7-9). Once the experience count surpasses the predetermined batch size B, experiences are randomly sampled from the memory to form a minibatch and train on decorrelated experiences (Lines 10-11). Based on the information contained in each experience, the Q-value Q(s, a) of that state-action pair is calculated and the DQN model DQN^train is trained (Lines 12-14). After each episode, the weights learned by the training model DQN^train are used for updating the target model DQN^target (Line 15).
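A minimal sketch of the per-slot operations described above is given below (ε-greedy selection, replay memory storage, and minibatch training); the buffer layout is our own assumption, and a Keras-style predict/fit interface is assumed for the DQN models.

```python
import random
from collections import deque
import numpy as np

memory = deque(maxlen=2000)    # replay memory size from Table II
gamma, batch_size = 0.95, 10   # discount factor and minibatch size B

def select_action(dqn_train, state, num_actions, epsilon):
    """epsilon-greedy selection (Algorithm 1, Line 5); epsilon is assumed to be
    decayed from 1.0 towards 0.01 by the factor 0.997 outside this function."""
    if random.random() < epsilon:
        return random.randrange(num_actions)
    q_values = dqn_train.predict(state[np.newaxis, :], verbose=0)[0]
    return int(np.argmax(q_values))

def replay_step(dqn_train, dqn_target):
    """Sample a decorrelated minibatch and fit DQN^train (Lines 10-14); the
    target value is computed with the previous-iteration (target) network."""
    if len(memory) <= batch_size:
        return
    for s, a, r, s_next in random.sample(list(memory), batch_size):
        y = dqn_train.predict(s[np.newaxis, :], verbose=0)[0]
        y[a] = r + gamma * np.max(
            dqn_target.predict(s_next[np.newaxis, :], verbose=0)[0])
        dqn_train.fit(s[np.newaxis, :], y[np.newaxis, :], verbose=0)
```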
Algorithm 2: Centralized Spectrum Allocation (CSA)
Input: K ← number of SUs, N ← number of PUs
Output: Set^action : secondary system actions
Data: s_i ← state during i-th slot
1   Compute |A| ← (N + 1)^K
2   Initialize Set^action = {}, assigned^ch = 0
3   Train DQN^target, DQN^train ← DQLS
4   Compute a = DQN^target.get_action(s_i)
5   while assigned^ch ≠ K do
6       Update |A| ← (N + 1)^(K − assigned^ch − 1)
7       Compute PU_index = ⌊a / |A|⌋
8       Update Set^action.append(PU_index)
9       Update a = a mod |A|
10      Update assigned^ch += 1
C. Centralized Spectrum Allocation Algorithm (CSA)
Centralized DQL architecture uses Centralized Spectrum
Allocation Algorithm (CSA) for learning spectrum decision
policies. After obtaining primary system channel occupation
patterns using distributed sensing [9], Algorithm 2 is run for
determining secondary system spectrum decisions. Initially, the number of possible actions, i.e., the action space size, is calculated considering that each of the K SUs has (N+1) actions for the N detected PU channels (Line 1). An empty set of actions Set^action is created together with a variable assigned^ch for counting the number of assigned SU actions (Line 2). The centralized DQN is trained and an action number encoding all individual actions of the K SUs is computed using the single-DQN based DQLS algorithm (Lines 3-4). The action code a, generated by the central DQN for a given state s_i, is decoded into separate action numbers until all SUs have an assigned action (Lines 5-10). For each individual SU action, this procedure keeps the actions of the remaining SUs as a subset and divides the action code by the number of possible remaining actions (N + 1)^(K − assigned^ch − 1) (Lines 6-7). During the rest of the procedure, the action code representing the cumulative actions of the remaining SUs and the number of assigned SU actions are updated (Lines 9-10). After the calculation of the current channel assigned to the SU, the immediate reward signal is obtained from the environment and the total utility is updated (Lines 12-13).
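The decoding loop of Algorithm 2 is effectively a base-(N+1) expansion of the joint action index. A minimal sketch under that reading is shown below; the function name and the interpretation of index 0 as "no transmission" are our assumptions.

```python
def decode_joint_action(a, num_sus, num_pus):
    """Decode a joint action index a in [0, (N+1)^K) into one channel choice
    per SU, following Lines 5-10 of Algorithm 2. Index 0 is read here as
    'no transmission'; 1..N denote PU channels."""
    actions = []
    assigned = 0
    while assigned != num_sus:
        block = (num_pus + 1) ** (num_sus - assigned - 1)
        actions.append(a // block)   # PU index for this SU (Line 7)
        a = a % block                # joint action of the remaining SUs (Line 9)
        assigned += 1
    return actions

# Example: 2 SUs and 3 PU channels -> (3+1)^2 = 16 joint actions (0..15).
print(decode_joint_action(a=7, num_sus=2, num_pus=3))  # prints [1, 3]
```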
D. Decentralized Spectrum Allocation Algorithm (DESA)
The decentralized policy learning architecture consists of non-cooperative secondary BSs that perform independent wideband sensing and policy learning during DSA. Initially, each secondary BS (SU) performs wideband sensing individually and determines the primary system channel allocations. Based on this spectrum occupancy information, SUs are trained using the proposed DQLS algorithm (Algorithm 3, Lines 1-2). Upon completion of the training procedure, the secondary BSs choose the action and primary channel index that provide the highest expected utility given the state s_j at the current time slot j (Lines 3-4).
Algorithm 3: Decentralized Spectrum Allocation (DESA)
Input: S ← primary system channel occupation
Output: a_j : PU channel at j-th slot
Data: s_j ← state at j-th slot
1   for each SU ∈ secondary system do
2       Train DQN^train_SU, DQN^target_SU ← DQLS
3   for each SU ∈ secondary system do
4       Compute a_j = DQN^target_SU.get_action(s_j)
V. SIMULATIONS
A. Simulation Setup
Simulations of CSA and DESA algorithms have been
implemented for centralized and decentralized spectrum decision scenarios using open-source neural-network library
Keras [15] and numerical computation library TensorFlow
[16]. CSA uses a single DQN for generating secondary
system actions during DSA. The central DQN in CSA is aware of the primary system channels, which are detected by the SUs using a wideband sensing approach. In DESA, each SU is modeled as a learning cognitive agent that is capable of performing sensing and policy learning operations independently. We have modeled the PU channels as independent 2-state Markov chains that are either in state 1 (occupied) or state 0 (vacant). Similar to the previous work in [10], namely the DQN-based access policy (DQNP), we have implemented DSA for N = 20 channels, each assigned to a different PU, with randomly generated transition probability matrices $\{P_i\}_{i=1}^{20}$.
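A small sketch of how such two-state PU channels can be simulated is given below; the random seed and matrix construction are illustrative only, not the exact matrices used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20  # number of PU channels

# Row-stochastic 2x2 transition matrices P_i, one per channel.
# P[i, s, s'] = probability of moving from state s to state s' (0: vacant, 1: occupied).
P = rng.uniform(size=(N, 2, 2))
P = P / P.sum(axis=2, keepdims=True)

def step_channels(states, P, rng):
    """Advance all N independent two-state Markov chains by one time slot."""
    return np.array([rng.choice(2, p=P[i, s]) for i, s in enumerate(states)])

states = rng.integers(0, 2, size=N)   # initial occupancy per channel
states = step_channels(states, P, rng)
```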
The network utility performance under CSA and DESA has been compared with the classical slotted-Aloha protocol and the optimal policy performance. The expected channel throughput under the slotted-Aloha protocol at a given time slot i is given by $n_i p_i (1 - p_i)^{n_i - 1}$, where $p_i$ represents the transmission probability and $n_i$ is the number of SUs in the CR environment [7]. For the optimal policy performance comparison, we have implemented the fixed-pattern channel-switching based optimal policy (OP) algorithm proposed in [6]. Each of the simulations was carried out for T = 10,000 time slots and E = 200 episodes. We have used the average SU utility per episode as the performance evaluation metric, representing the average cumulative reward gained by the SUs in each episode.
B. Simulation Results
We first present simulation results for spectrum decisions
of a single SU coexisting with multiple PUs in the CR
environment. Fig. 2 shows the average SU utility per episode obtained by the spectrum decisions of a single SU under DQLS.

Figure 2: DQLS versus optimal channel access policy (average SU utility per episode over episodes; DQLS and OP curves for 1 SU with 15, 20, and 25 PUs)

Note that the network scenario with 1 SU and 20 PUs corresponds to the architecture proposed for DQNP. We have two observations from Fig. 2. First, the average SU throughput increases as the number of detected PU channels decreases.
A low cardinality of the action space enables the SU to learn a better channel selection policy. Therefore, a higher cumulative utility is obtained at the end of each episode as the number of PUs per SU decreases. Second, convergence of the DQN-based access policy takes longer with an increasing number of detected PU channels. Increasing complexity in the CR network results in higher-dimensional action and state space representations; hence, policy learning from channel occupancy observations slows down.
Fig. 3 shows the average spectrum selection performance of SUs under CSA. During the simulations, the number of PU channels is set to 20 while the number of base stations in the secondary system is varied as 2, 3, and 4. Similar to the results presented in Fig. 2, SUs perform better with the DQN-based channel selection strategy as the number of available PU channels per SU decreases. Different from slotted-Aloha based medium access, the average utility under CSA increases with an increasing number of SUs. Furthermore, the CSA-based spectrum policy performance surpasses random access and slotted-Aloha based spectrum decisions as the number of SUs increases to 4. On the other hand, 52% of the optimal policy performance is achieved.
Figure 3: Spectrum policy performance under CSA (average SU utility per episode over episodes; OP, CSA, and slotted Aloha with 20 PUs and 2 or 4 SUs, and a random access policy)

Figure 4: Spectrum policy performance under DESA (average SU utility per episode over episodes; OP, DESA, and slotted Aloha with 20 PUs and 2 or 4 SUs, and a random access policy)

Fig. 4 depicts the performance of non-cooperative SUs under DESA. Similar to the results depicted in Fig. 2 and Fig. 3, the DESA-based channel access results show that an increasing number of SUs increases the average performance of the DQN-based spectrum policy. It is evident that DESA-based spectrum assignment outperforms the other approaches.
Furthermore, this performance gap drastically increases as the number of SUs reaches 4. Compared to the results obtained by centralized spectrum assignment under CSA, we observe that independent policy learning in the decentralized scenario results in higher average network utility values. Moreover, 88% of the optimal policy performance is achieved when 20 PUs and 4 SUs are available in the CR environment.
VI. CONCLUSION AND FUTURE WORK
In this paper, we present multi-agent deep reinforcement
learning based spectrum selection with wideband sensing
capability for multi-user utility maximization during DSA.
Initially, DQN-based spectrum decisions of a single SU coexisting with multiple PUs have been derived. Later on, we propose novel algorithms for DQL and wideband sensing based decentralized and centralized spectrum selection, namely DESA and CSA. Through simulations, we demonstrate that the performance of DQN-based frequency selection increases as the number of available PU channels per SU decreases. Moreover, we observe that independent policy learning in a non-cooperative manner under DESA results in more effective spectrum decisions than centralized spectrum assignment under CSA. Overall, our proposed methods improve the spectrum utilization compared to traditional spectrum assignment methods, while 88% and 52% of optimal channel access are achieved by DESA and CSA, respectively.
Going forward, our future work will focus on extending
both the system model and proposed algorithms. For the
system model, we will concentrate on efficient utilization of
OFDMA resource blocks and analyze aggregated interference
at primary receivers during secondary system transmissions.
Considering the great potential of DRL-based technologies
for DSA, we also plan to work on creating an open source
dataset and implementing other DRL approaches such as
policy gradient methods.
REFERENCES
[1] K. Wang and L. Chen, “On optimality of myopic policy for restless multi-armed bandit problem: An axiomatic approach,” IEEE Transactions on Signal Processing, vol. 60, no. 1, pp. 300–309, Jan. 2012.
[2] S. H. A. Ahmad, M. Liu et al., “Optimality of myopic sensing in
multichannel opportunistic access,” IEEE Transactions on Information
Theory, vol. 55, no. 9, pp. 4040–4050, Sep. 2009.
[3] Q. Zhao, L. Tong et al., “Decentralized cognitive MAC for opportunistic spectrum access in ad hoc networks: A POMDP framework,” IEEE Journal on Selected Areas in Communications, vol. 25, no. 3, pp. 589–600, Apr. 2007.
[4] R. S. Sutton and A. G. Barto, Reinforcement Learning: An
Introduction, 2nd ed. The MIT Press, 2018. [Online]. Available:
http://incompleteideas.net/book/the-book-2nd.html
[5] V. Mnih, K. Kavukcuoglu et al., “Human-level control through deep
reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533,
Feb. 2015. [Online]. Available: http://dx.doi.org/10.1038/nature14236
[6] S. Wang, H. Liu et al., “Deep reinforcement learning for dynamic
multichannel access in wireless networks,” IEEE Transactions on
Cognitive Communications and Networking, vol. 4, no. 2, pp. 257–
265, June 2018.
[7] O. Naparstek and K. Cohen, “Deep multi-user reinforcement learning for distributed dynamic spectrum access,” IEEE Transactions on
Wireless Communications, vol. 18, no. 1, pp. 310–323, Jan 2019.
[8] Y. Yu, T. Wang, and S. C. Liew, “Deep-reinforcement learning multiple
access for heterogeneous wireless networks,” in IEEE International
Conference on Communications (ICC), May 2018, pp. 1–7.
[9] J. Lunden, S. R. Kulkarni et al., “Multiagent reinforcement learning
based spectrum sensing policies for cognitive radio networks,” IEEE
Journal of Selected Topics in Signal Processing, vol. 7, no. 5, pp.
858–868, Oct 2013.
[10] H. Q. Nguyen, B. T. Nguyen et al., “Deep q-learning with multiband
sensing for dynamic spectrum access,” in 2018 IEEE International
Symposium on Dynamic Spectrum Access Networks (DySPAN), Oct
2018, pp. 1–5.
[11] R. Bellman, “The theory of dynamic programming,” Bull. Amer.
Math. Soc., vol. 60, no. 6, pp. 503–515, 11 1954. [Online]. Available:
https://projecteuclid.org:443/euclid.bams/1183519147
[12] M. Andrychowicz, F. Wolski et al., “Hindsight experience replay,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg et al., Eds. Curran Associates, Inc., 2017, pp. 5048–5058. [Online]. Available: http://papers.nips.cc/paper/7090-hindsight-experience-replay.pdf
[13] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
CoRR, vol. abs/1412.6980, 2015.
[14] P. J. Huber, “Robust estimation of a location parameter,” Ann. Math.
Statist., vol. 35, no. 1, pp. 73–101, 03 1964. [Online]. Available:
https://doi.org/10.1214/aoms/1177703732
[15] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
[16] M. Abadi, P. Barham et al., “Tensorflow: A system for large-scale
machine learning,” in 12th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[Online]. Available: https://www.usenix.org/system/files/conference/
osdi16/osdi16-abadi.pdf