
Neshat Varjavand: Deep Learning Project

1. Introduction
As wireless networks evolve and the number of mobile devices increases, there is a growing demand for spectrum resources in the 4G and 5G bands; however, the radio frequency spectrum is a limited resource. This demand for wireless communication is driving the development of efficient dynamic spectrum access (DSA) schemes for emerging wireless network technologies.
Some parts of the radio frequency spectrum are heavily used while others remain underutilized.
Unlicensed bands, in particular, suffer from overutilization and cross-technology interference.
Monitoring and understanding spectrum resource usage are crucial for improving and regulating radio
spectrum utilization, especially in complex wireless systems like 5G. At the same time, this monitoring
requires distributed sensing over a wide frequency range, which results in a large amount of radio
spectrum data. Extracting meaningful information from this massive and complex dataset necessitates
advanced algorithms.
To address these challenges, new innovative spectrum access schemes and identification mechanisms
are being developed to provide awareness about the radio environment. Identification of technology,
modulation type recognition, and interference source detection are essential for effective interference
mitigation and the coexistence of heterogeneous wireless networks.
Cognitive radio networks (CRN) play a significant role in dynamically and efficiently utilizing frequency
bands. Cognitive radio has been proposed as a solution to spectrum scarcity by allowing secondary users
(SUs) to utilize underutilized spectrum bands without causing interference to primary users (PUs).
Spectrum sensing, which detects the presence of PUs, is a key technology in CR.
Many spectrum sensing algorithms have been proposed for different scenarios, including traditional detectors and deep learning-based detectors. Traditional detectors rely on assumptions about the signal and noise model and face challenges due to computational complexity, whereas deep learning-based detectors learn from the sensing data itself and have been explored as a promising alternative to traditional detectors.
Deep neural networks (DNNs) can approximate optimal resource allocation strategies with reduced
computational complexity.
2. Literature Review
Various deep learning approaches, including supervised and unsupervised learning, have been
investigated for CR systems. Paper [1] introduces the concept of end-to-end learning from spectrum data, which involves utilizing deep neural networks for advanced wireless signal identification in spectrum monitoring applications. End-to-end learning for spectrum monitoring enables automatic feature learning from wireless signal representations and trains wireless signal classifiers in a single step.
Two case studies, modulation recognition and wireless technology interference detection, are conducted
using different signal representations (temporal IQ data, amplitude/phase, and frequency domain). The
paper demonstrates that the choice of data representation significantly affects the accuracy, with
variations of up to 29%.
The proposed approach involves cognitive radio users sensing their environment and reporting the results
to a base station (BS), which forwards the information to a data center (DC). The DC combines the
sensing data from multiple CR users to create a spectrum map and determine the presence of primary
users (PUs). The DC then shares the spectrum availability information with the cognitive users. To
achieve this, the DC uses a convolutional neural network (CNN) model trained offline to distinguish
between occupied frequency channels and spectrum holes. Also, in the context of cognitive IoT, devices
are equipped with cognitive capabilities to search for interference-free spectrum bands and adjust their
transmission parameters accordingly. CR-IoT devices send spectrum sensing reports to a CNN-based
DC, which learns and estimates the presence of other emitters. This information is used to detect
interference sources and identify interference-free channels. The proposed architecture for CR and CR-IoT networks is shown in Fig. 1.
Fig. 1: Proposed architecture of CR SUs and CR-IoTs [1]
Table 1 shows the structure of the proposed CNN and the hyperparameters that must be set. For both case studies,
67% of randomly selected examples are used for training in batches of 1024, and 33% are used for testing
and validation. Both sets of examples are uniformly distributed in Signal-to-Noise Ratio (SNR) from
-20dB to 20dB.
Table 1. CNN structure
The detection efficiency depends on the complexity of the CNN structure used during prediction, i.e., on the time required to compute the convolutions and activations of all neurons. For the optimization of the model parameters, the paper uses the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.001 to ensure convergence. To accelerate learning and convergence, the input data were normalized and the Rectified Linear Unit (ReLU) activation function was used. The CNNs were trained for 70 epochs, and the model with the lowest validation loss was selected for evaluation.
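For concreteness, the training setup described above can be expressed in a few lines of TensorFlow/Keras. This is only a minimal sketch: the 1D-CNN layers, snapshot shapes, class count, and checkpoint file name are illustrative assumptions, since the exact Table 1 structure is not reproduced here.

# Minimal training sketch: Adam (lr=0.001), ReLU activations, 70 epochs,
# batches of 1024, model selection by lowest validation loss.
# Layer sizes and data shapes are placeholders, not the Table 1 configuration.
import numpy as np
from tensorflow import keras

def build_cnn(input_shape, num_classes):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        keras.layers.Conv1D(64, 3, activation="relu"),
        keras.layers.Conv1D(64, 3, activation="relu"),
        keras.layers.Flatten(),
        keras.layers.Dense(128, activation="relu"),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# x: normalized signal snapshots, y: integer class labels (random placeholders).
x = np.random.randn(1000, 128, 2).astype("float32")
y = np.random.randint(0, 11, size=1000)

model = build_cnn(input_shape=(128, 2), num_classes=11)
checkpoint = keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True)
model.fit(x, y, batch_size=1024, epochs=70,
          validation_split=0.33, callbacks=[checkpoint])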
To evaluate the prediction accuracy of the end-to-end wireless signal classification models for tasks such
as modulation type recognition or interference identification, it is necessary to compare their predictions
with the true response values of the observed spectrum data. The performance of these classification
methods can be measured using the prediction accuracy on a test data sample. The overall classification
test error over $m_{\text{test}}$ testing snapshots can be calculated using the formula:

$E_{\text{test}} = \frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} l(\hat{y}_i, y_i)$    (1)
The classification accuracy can then be obtained as

$\text{accuracy} = 1 - E_{\text{test}}$    (2)

Here, $l(\hat{y}_i, y_i)$ represents the loss function that quantifies the difference between the estimated value and
the true value for each instance. The number of true positives (TP), false positives (FP), and false
negatives (FN) were determined as follows:
• If a signal was detected and annotated as belonging to a particular class in the labelled test data, it was counted as TP.
• If a signal was predicted as belonging to a particular class but did not actually belong to that class according to the labelled test data, it was counted as FP.
• If a signal was present in the labelled test data for a particular instance but was not detected in that instance, it was counted as FN.
We can also derive the precision (P), recall (R), and F1 score:

$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$    (3)

$F1 = 2 \times \frac{P \times R}{P + R}$    (4)
For overall performance measurement, multiple per-class performance metrics are combined through a
prevalence-weighted macro-average across the class metrics. The term "macro-average" refers to the
calculation of precision, recall, and F1 score separately for each class and then averaging them across
all the classes. This is in contrast to "micro-average," which calculates precision, recall, and F1 score
by aggregating all true positives, false positives, and false negatives across all classes before
calculating the metrics.
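As a quick illustration of the difference between these averaging schemes (including the prevalence-weighted variant described next), the snippet below computes the test error of Eqs. (1)-(2) and then the macro-, micro-, and weighted-averaged precision, recall, and F1 with scikit-learn. The labels and predictions are made-up toy data, not results from [1].

# Toy illustration of Eqs. (1)-(2) and of macro vs. micro vs. weighted averaging.
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 0, 2])

# Test error and accuracy with a 0/1 loss per snapshot.
E_test = np.mean(y_pred != y_true)
print("accuracy =", 1 - E_test)

# Per-class metrics combined three different ways.
for avg in ("macro", "micro", "weighted"):  # "weighted" = prevalence-weighted macro-average
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred,
                                                  average=avg, zero_division=0)
    print(f"{avg:>8}: P={p:.3f}  R={r:.3f}  F1={f1:.3f}")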
The "prevalence-weighted" aspect of the metric refers to the fact that the average is weighted by the
prevalence of each class in the dataset. This means that the performance of the model on more
prevalent classes is given more weight in the calculation than less prevalent classes. The performance
results are shown in Fig. 2 and Fig. 3.
Fig. 2: Performance results for modulation recognition classifiers vs. SNR [1]
Fig. 3: Performance results for interference identification classifiers vs. SNR [1]
The confusion matrix (Fig. 4) is used to provide a detailed overview of the per-class performance.
Fig. 4: Confusion matrices for the modulation recognition data for
SNR 6dB, frequency representation [1]
Signals within a dataset that exhibit similar characteristics in one data representation are more difficult
to discriminate. The performance of the classifier can be improved by increasing the quality of the
wireless signal dataset, by adding more training examples, more variation among the examples (e.g.
varying channel conditions), and tuning the model hyper-parameters.
As SUs face challenges such as fading and communication issues, Cooperative Spectrum Sensing (CSS)
has been introduced in [2] to improve the sensing quality by utilizing the results of cooperating secondary
users (CSUs). However, the optimal cooperation strategy in CSS is affected by channel conditions. This
paper proposes a Deep Reinforcement Learning (DRL)-based CSS approach, employing Reinforcement Learning (RL) (Fig. 5 shows the RL principle) to select the SU that should contribute its measurement to the cooperative decision. Hence, the proposed DRL approach improves CSS efficiency from both a resource and a time viewpoint.
Fig. 5: The basic principle of reinforcement learning [11]
The SU broadcasts a cooperative sensing request, and all one-hop neighbors respond to its appeal intermittently with their local sensing results. The initiating SU, which requests the cooperative sensing, is called the agent; it combines the local sensing results and takes the final decision.
The paper has formulated the CSS problem as a Markov Decision Problem (MDP) and introduced state
and action spaces and strategy selection policy. Moreover, the reward function is introduced based on
the correlation of cooperating users. The reward calculation considers the correlation of the newly
selected SU, which depends on the location of each SU in the cognitive network. The correlation coefficient is used to determine whether the PU sensing results of different SUs are correlated or uncorrelated. The cost is calculated by summing the correlation coefficients from the first step up to the current step. The RL policy for selecting the next action is based on the Bellman equation, and a Deep Q-network (DQN) is employed to estimate the optimal action-value function (Q). The DQN structure is shown in Fig. 6.
Fig. 6: The flowchart of DQN [11]
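The correlation-based cost and reward described above can be sketched as follows. This is only an illustration of the idea under assumed quantities: the distance-based correlation model, the decay constant d_corr, and the penalty weight are hypothetical, not the exact formulation of [2].

# Hedged sketch: the cost accumulates the correlation coefficients of the SUs
# selected so far, and the reward favors adding an SU that is weakly correlated
# (spatially diverse) with the already-selected set.
import numpy as np

def correlation_coefficient(pos_a, pos_b, d_corr=50.0):
    """Illustrative distance-based correlation between two SU locations."""
    d = np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b))
    return float(np.exp(-d / d_corr))

def step_cost(selected_positions, new_position):
    """Sum of correlations between the newly selected SU and previous SUs."""
    return sum(correlation_coefficient(p, new_position) for p in selected_positions)

def reward(selected_positions, new_position, penalty=1.0):
    # Lower accumulated correlation (more spatial diversity) -> higher reward.
    return -penalty * step_cost(selected_positions, new_position)

selected = [(0.0, 0.0), (120.0, 40.0)]
print(reward(selected, (10.0, 5.0)))     # strongly correlated with the first SU
print(reward(selected, (300.0, 250.0)))  # weakly correlated, higher reward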
At each step, one of the available SUs is selected to report its local sensing result, which is transmitted to the agent. The gathered sensing matrix Y is processed by a CNN with three convolutional layers and two subsequent FC layers to obtain the global sensing result, i.e., to calculate the action-value function with which the DRL determines the absence or presence of the PU. The action-values and rewards are updated based on the cost function and the local sensing results. Finally, the agent decides on the presence or absence of the PU based on the combined local results.
For comparison with other works, the metrics are the Receiver Operating Characteristic (ROC), the number of active SUs, and the sensing error, which is the average of the probability of false alarm (Pfa) and the probability of miss detection (Pm). Fig. 7 shows the ROC comparison.
Fig. 7: The ROC of the proposed scheme for P = 30 dBm [2]
The combination of convolutional neural networks (CNNs) and long short-term memory (LSTM)
networks has been proposed to learn the energy-correlation features and PU activity pattern from the
sensing data, improving the spectrum sensing performance [3]. In the proposed CNN-LSTM-based
spectrum sensing detector, the CNN structure extracts energy-correlation features from the current and
historical sensing data, treated as sample covariance matrices. In fact, directly processing the sensing
samples in their original form would result in high computation complexity and redundancy. To address
this, the sensing samples are preprocessed by calculating the sample covariance matrix. The sample
covariance matrix captures the essential energy-correlation information. The extracted features
corresponding to different sensing periods are then input into an LSTM network to learn the PU activity
pattern. The structure of CNN-LSTM is shown in Fig. 8.
Fig. 8: CNN-LSTM structure [3]
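The covariance-based preprocessing described above, which reduces the raw sensing samples of each period before they reach the CNN, can be sketched as follows. The numbers of receive antennas and samples per sensing period are illustrative assumptions, not values from [3].

# Sketch of the preprocessing step: raw complex sensing samples of one sensing
# period are reduced to a sample covariance matrix, then stacked as CNN input.
import numpy as np

def sample_covariance(X):
    """X: complex sensing samples of shape (num_antennas, num_samples)."""
    M, N = X.shape
    return (X @ X.conj().T) / N  # (M, M) Hermitian sample covariance

rng = np.random.default_rng(0)
M, N = 8, 256  # assumed antenna count and samples per sensing period
X = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
R = sample_covariance(X)

# Stack real and imaginary parts as two "image" channels for the CNN input.
cnn_input = np.stack([R.real, R.imag], axis=-1)  # shape (M, M, 2)
print(cnn_input.shape)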
The proposed detection algorithm is summarized in [3].
The CNN-LSTM detector is free from signal-noise model assumptions and has shown superior
performance compared to benchmark detectors in scenarios with and without noise uncertainty. In contrast, the APASS algorithm learns both energy-correlation and temporal features using the CNN alone. However, as the CNN is not specialized in processing temporal features, the CNN-LSTM algorithm outperforms the APASS algorithm.
Benchmark algorithms, including MED, SSE, AGM, and APASS, are compared. ROC curves are shown in Fig. 9; in the figure legend, 'CL' denotes the CNN-LSTM algorithm. The CNN-LSTM algorithm outperforms the traditional detectors in terms of detection performance.
Fig. 9: ROC curves for comparison [3]
In the field of Intelligent Transportation Systems (ITS), in the 5G/6G era, Vehicular Ad-Hoc Networks
(VANETs) play a crucial role in enabling collision avoidance, traffic management, and safety warnings.
However, limited spectrum availability and congestion pose challenges for VANET-enabled services.
To overcome this, unlicensed white space devices and Cognitive Radio Networks (CRNs) are employed.
Integration of CRNs with VANETs (CR-VANETs) enables efficient spectrum utilization and sustainable
communication. As mentioned for CRNs, spectrum sensing and dynamic channel scheduling are
essential for opportunistic spectrum access and reliable data transmission in CR-VANETs, considering
vehicle mobility and avoiding interference with primary users.
[4] has proposed intelligent hybrid learning spectrum agents that perform sensing using a deep learning
model for the CR-VANETs framework. It effectively forecasts the spectrum occupancy in the primary
spectrum and allocates the idle channels to the cognitive vehicles using a Support Vector Machine
(SVM) classifier.
[5] has proposed Reinforcement learning to model PU activity patterns for predicting free channels for
dynamic spectrum access (DSA). The RL model is implemented on a road-side unit (RSU) and sends
predicted vacant channels to vehicles. Adaptive spectrum sensing using energy and feature detectors
improves spectrum detection performance. The proposed approach shows improvement in vehicular
communication through simulation results. The RL technique outperforms history-based schemes, as
demonstrated in simulation results. Flowchart of the proposed algorithm’s sensing architecture is given
in Fig. 10.
Fig. 10: Flowchart of the sensing architecture [5]
The proposed dynamic spectrum access algorithm in [6], based on RL, aims to optimize channel capacity
in CR-VANETs. The algorithm utilizes the Q-learning method to find the optimal access strategy for
secondary vehicles (SVs). SVs are considered agents that interact with the CR-VANETs communication
environment. The sensing results of SVs represent the state, and the access conditions of SVs are defined
as actions. The reward is determined based on the SV’s capacity.
To improve the Q-learning algorithm, a new reward function is introduced to avoid collisions between
SVs and increase the access success rate. The Q-values are updated based on the reward, and the learning
rate determines the weight given to the current reward versus future cumulative rewards. The strategy
for selecting the optimal access strategy is updated using the ε-greedy algorithm.
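A generic tabular sketch of the Q-learning update and ε-greedy selection just described is given below. The state/action sizes and the placeholder environment are assumptions, and the collision-avoiding reward shaping introduced in [6] is not reproduced.

# Tabular Q-learning with ε-greedy selection (illustrative placeholder environment).
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))  # explore
    return int(np.argmax(Q[state]))           # exploit

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    # Standard Bellman-based Q-learning update.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])

rng = np.random.default_rng(0)
num_states, num_actions = 16, 4  # assumed sensing-state/access-action sizes
Q = np.zeros((num_states, num_actions))
s = 0
for t in range(1000):
    a = epsilon_greedy(Q, s, epsilon=0.1, rng=rng)
    # Placeholder environment: random next state, reward 1 for a successful access.
    s_next = int(rng.integers(num_states))
    r = 1.0 if a != 0 and rng.random() < 0.5 else 0.0
    q_update(Q, s, a, r, s_next)
    s = s_next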
To address the issue of the exponential growth in the size of the Q-table, the IDQN (Improved Deep Q-Network) method is employed. It involves training a neural network to approximate the Q-values. The
network takes the sensing results as input and outputs the Q-values for each action. The training process
involves minimizing the loss function and updating the network weights. The target network is used to
stabilize the training process.
The algorithm depicts the interaction between SVs, the environment, and the neural network. The reward
is determined based on the SV’s sensing results and action. The sensing results are obtained from the
dynamic occupancy channel of primary vehicles, modeled as a Markov chain.
DRL methods are effective for optimizing decision-making in complex and high-dimensional problems.
Specifically in managing spectral regions among wireless connected devices, which involves a large
state space and partial observation, integrating DRL techniques proves beneficial.
Most research focuses on single-user OSA using single-agent DRL due to the exponential increase in
state-action space in multiuser settings. However, joint optimization problems require precise CSI and
can be computationally expensive. RL optimizes behavior based on rewards, but predicting CSI in
practice is challenging due to PU traffic and sensing errors. Partially Observable Markov Decision
Processes (POMDPs) offer a potential solution for decision-making in CR-VANETs by integrating
observations over time. [7] proposes solutions for channel allocation, future channel state estimation,
selective sensing, and spectrum allocation using DRL and POMDP. The system model includes primary
and secondary networks, with a CBS controlling CSS decisions and channel allocation. Simulation
results and analysis are presented, and the article concludes with a system model description.
This paper addresses three key issues, and their respective solutions, within a single framework for CR-VANETs:
• Reliable CSS performance in CR-VANETs through a DRL algorithm.
• Channel indexing for selective sensing using LSTM-based time series analysis.
• Channel/spectrum allocation to the CR-VANETs using a value iteration algorithm in the POMDP framework.
As noted in Fig. 11, a single time frame (T) is divided into τ and (T − τ) distinct slots for SS and OSA, respectively. Fig. 12 illustrates the sensing process of M SUs detecting m distinct PU channels. Each SU senses a distinct channel at a specific time, providing spatial and temporal diversity. SUs forward their local decisions and GPS data to the CBS through the CCCH. The CBS collects these local results in a matrix for the global CSS decision.
Fig. 11: Time-frame model for the CR-VANETs [7]
Fig. 12: Proposed spectrum sensing technique in CR-VANETs [7]
To address the influence of DNN initialization on resource allocation, [8] has proposed a novel resource allocation strategy using an ensemble of multiple DNNs, which aims to maximize the sum spectral efficiency (SE) of SUs while respecting interference constraints on PUs. The approach is based on unsupervised
learning. Performance evaluation demonstrates that the proposed scheme achieves near-optimal
performance, and the ensemble of multiple DNNs further enhances the DNN-based approach.
Paper [9] introduces SL-MAC, a MAC protocol based on intelligent Spectrum Learning, which integrates deep learning (DL) techniques to improve channel access efficiency in IEEE 802.11-based WLANs. It
uses a convolutional neural network (CNN) model to identify stations involved in collisions and
dynamically schedule their data transmissions. The proposed protocol achieves better channel
coordination and spectrum sharing by leveraging DL inference results.
This paper describes a typical WLAN scenario where multiple stations contend for channel access using
the IEEE 802.11 scheme. Collisions occur when multiple stations transmit RTS (Request to Send) packets simultaneously, leading to low channel utilization. The AP (Access Point) utilizes the pre-trained CNN model to infer the number of users involved in collisions and their identities, and then schedules
their data transmissions using a CTS (Clear to Send) packet, informing them when they are allowed to
transmit their data packets without further collisions. Other users adjust their NAV (Network Allocation
Vector, a timer that is set by AP) accordingly and remain silent during this period. After the NAV expires,
all users can contend for channel access again.
The proposed protocol leverages the capability of the pre-trained CNN model to identify the number and
identities of stations involved in collisions. The CNN model is trained offline using RF (Radio
Frequency) traces collected from different transmission scenarios. By solving a multi-class classification problem, the CNN model achieves high accuracy in identifying the stations involved in collisions. The scalability of the protocol depends on the average number of stations experiencing collisions.
The SL-MAC protocol design ensures a high overall throughput; however, the achieved throughput is degraded by the inference errors introduced by the trained CNN model, especially when the number of stations involved in collisions is large, so there is a trade-off between the performance gain brought by deep learning and the inference accuracy. Normalized throughput versus the number of devices
is shown in Fig. 13.
Fig. 13: Normalized throughput versus number of devices [9]
The inference accuracy of the trained CNN model is defined as the ratio of the number of correct inferences to the total number of inferences. Two typical error cases are considered: 1) over-estimation and 2) under-estimation. Specifically, if over-estimation happens, the inference results include not only the stations involved in collisions but also other stations that did not collide. If under-estimation occurs, the inference results may miss one or more users involved in collisions; as a result, only a portion of the users involved in collisions can be scheduled to transmit, which degrades fairness among devices. It has been shown that over-estimation negatively impacts the overall performance of the protocol. On the other hand, under-estimation of the number of colliding stations does not have a decisive impact on system throughput, which may either decrease or remain the same.
Fig. 14: CNN framework in the proposed MAC protocol [9]
Fig.15: Flowchart of the CNN framework in the proposed MAC protocol [9]
The proposed CNN architecture is designed as a master-CNN model and slave-CNN models. The CNN framework and its flowchart in the proposed MAC protocol are shown in Fig. 14 and Fig. 15. The collected RF traces (using the USRP2 testbed) are processed in a convolutional layer, reshaped into a suitable tensor, and fed into the CNN models. The CNN structure includes convolutional layers followed by Rectified Linear Unit (ReLU) layers. Feature extraction is performed using these layers, and fully-connected (FC) layers are used for multi-class classification with Softmax activation. The master-CNN model predicts the number of stations involved in collisions, while the slave-CNN models identify the IDs of the stations.
The CNN models are optimized using the Adam optimizer and trained offline until they can learn the
features from the RF traces and make accurate inferences. The training involves back-propagation, and
Cross-entropy loss is employed as the loss function for classification. Once the CNN models are trained,
they can be used for online inference with an I/Q dataset. This two-step training approach improves
accuracy by reducing the number of classes compared to conventional CNN training. Extensive
simulations demonstrate the benefits of SL-MAC, as in Table 2.
Table 2. Inference accuracy
Paper [10] discusses the issue of spectrum interference among different technologies using the industrial, scientific, and medical (ISM) radio bands. It proposes a Deep Neural Network (DNN) approach for predicting the spectrum occupancy of unknown neighboring networks in the near future.
This prediction helps existing network schedulers avoid collisions and optimize overall channel capacity.
The DNN is trained online, using supervised learning. The paper demonstrates a reduction in collisions
and an increase in overall throughput in a network with unknown neighboring networks.
[11] presents a novel approach to address the challenge of dynamic spectrum access in cognitive radio
networks. By combining deep reinforcement learning with evolutionary game theory, the proposed
method utilizes a Deep Q-network (DQN) framework, enabling individual users to independently select
channels and effectively improve the utilization of spectrum resources. Additionally, the replicator
dynamic model from evolutionary game theory is incorporated to maintain a balance of collaboration
among users. Simulation results demonstrate the effectiveness of the algorithm in reducing collision
rates and increasing system capacity. The algorithm follows a distributed approach, involving steps such
as initializing Q-values, selecting actions, calculating transmission utility and channel capacity, updating
reward values and Q-values, and recording system metrics. The ultimate goal is to achieve equilibrium
among distributed secondary users.
Paper [12] proposes a deep reinforcement learning (DRL) algorithm based on a deep Q-network (DQN)
to optimize spectrum access in a dynamic spectrum access (DSA) scenario. The paper considers a
dynamic spectrum access (DSA) scenario (Fig.16) with N primary users and M secondary users
communicating through N channels, examining both frequency division multiple access (FDMA) and
non-orthogonal multiple access (NOMA) schemes. SUs employ spectrum sensing to make access
decisions based on channel state information (CSI) obtained through sensing. DRL is applied to learn
the optimal access strategy, with neural networks used to predict Q-values.
Fig. 16: The DSA model for CR [12]
In the DQN-based FDMA scheme, one channel is divided into two sub-channels, allowing up to two
users to transmit simultaneously. SUs perform spectrum sensing to detect the channel states, and based
on the sensed CSI, they make spectrum access decisions using a Neural Network. Rewards are provided
based on different cases: no access, accessing an idle channel, accessing a busy channel with or without
interference.
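A hedged sketch of the case-based reward just listed is shown below; the numeric reward values and argument names are placeholders rather than the exact values used in [12].

# Illustrative case-based reward for the DQN-FDMA scheme (values are placeholders).
def fdma_reward(accessed, channel_idle, interference):
    if not accessed:      # SU chose not to transmit
        return 0.0
    if channel_idle:      # successful access on an idle (sub-)channel
        return 1.0
    if interference:      # collided with a PU or another SU
        return -1.0
    return -0.5           # busy channel but no harmful interference

print(fdma_reward(True, True, False))   # 1.0
print(fdma_reward(True, False, True))   # -1.0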
In the DQN-based NOMA scheme, users employ Non-Orthogonal Multiple Access (NOMA) to utilize
the complete channel bandwidth for improved throughput. Power domain-based NOMA superimposes
the signals of multiple users at the transmitter, and at the receiver, Successive Interference Cancellation (SIC)
is used to separate the users’ signals. The reward function considers throughput, throughput loss, and
interference from Primary Users (PUs) or previous SUs.
These schemes aim to optimize spectrum access decisions for SUs in CR systems, considering channel
conditions and interference, to achieve maximum system throughput. Simulation results demonstrate
improved system throughput (Fig. 17) and reduced interference.
The training process of DQN-based DSA is shown in Algorithm 3. The algorithm takes various
parameters for configuring the DQN, including memory capacity, number of time slots, number of
samples, training threshold, batch size, discounted factor, copy frequency of the network, learning rate,
and exploration rate.
Fig. 17: Average throughput of SUs vs. time slots [12]
[13] focuses on distributed spectrum access in cognitive radio networks. It introduces Deep Recurrent
Q-Networks (DRQN) with different recurrent neural network architectures for distributed spectrum
access. Performance comparison with other techniques like Q-learning (QL), Deep Q-Network (DQN),
and RC-based DRQN is done in the paper. The proposed DRQN models offer complexity-performance
trade-offs based on factors such as partial observation, number of channels, and training time.
This work utilizes Long Short-Term Memory (LSTM) for the hidden layer in the proposed DRQN-based spectrum access scheme. LSTM's ability to remember longer sequences enhances the successful channel access rate, especially in scenarios with a large number of PUs and SUs. It considers a multi-agent scenario with varying numbers of SUs and evaluates DRQN-based spectrum access performance in both complete and partial observation scenarios. Unlike previous works, the research takes into account sensing errors in the multi-agent scenario, providing a more realistic assessment of the spectrum access method.
In the first LSTM-based DRQN architecture, (DRQN-LSTM 1), LSTM cells are used in the recurrent
layer, followed by a single dense layer with linear activation for action selection. The output layer
contains M+1 neurons representing M channels and the option of no transmission. (Fig. 18)
Fig.18: DRQN-LSTM1 [13]
The state space (S), action space (A), and reward function (R_n^k) are defined as follows:
1. State space: The observed state o_n^k consists of M binary values indicating the presence or absence of PU activity on each channel. In the case of a large number of channels, the SU may choose to sense only a subset to avoid delays and resource constraints. Unobserved channels are represented as -1.
2. Action space: At each time step, the SU agent selects an action, which represents choosing a channel based on the observed state o_n^k and policy π_k. The action space has M+1 elements, including the option not to transmit.
3. Reward function: The reward is designed to prioritize desired outcomes. In the absence of PU transmission, the reward is based on the signal-to-interference-plus-noise ratio (SINR) at the SU receiver; mutual interference is considered when multiple SUs transmit simultaneously (a minimal sketch of this reward logic follows the list).
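As referenced in item 3, the following is a minimal sketch of that reward logic, assuming a log2(1 + SINR) rate-style reward and an arbitrary collision penalty; neither constant is taken from [13].

# Hedged sketch of the SINR-based reward: an achievable-rate reward when no PU
# is active on the chosen channel, and a penalty otherwise.
import numpy as np

def sinr(signal_power, interference_powers, noise_power):
    return signal_power / (noise_power + sum(interference_powers))

def drqn_reward(pu_active, signal_power, interference_powers, noise_power,
                collision_penalty=-1.0):
    if pu_active:
        return collision_penalty
    return float(np.log2(1.0 + sinr(signal_power, interference_powers, noise_power)))

# One SU transmitting alone vs. two SUs interfering on the same channel.
print(drqn_reward(False, 1.0, [], 0.1))       # high reward
print(drqn_reward(False, 1.0, [0.8], 0.1))    # reduced by mutual SU interference
print(drqn_reward(True, 1.0, [], 0.1))        # PU present -> penalty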
The training algorithm involves running independent DRQN at each SU node. The algorithm initializes
the networks and replay buffer and proceeds with multiple episodes. In each episode, the SU observes
the state, selects an action using ε-greedy exploration, applies the action to the environment, and
collects the reward and next state. The experiences are stored in the replay buffer. The main network’s
weights are periodically updated with the target network’s weights. After accumulating a sufficient
number of transitions in the replay buffer, sequences of transitions are used to update the network
weights based on the target values. This process is repeated for each episode until convergence.
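The target-value computation at the core of this update step can be sketched as follows. To stay self-contained, the "networks" are plain linear maps acting on single observations rather than recurrent models over sequences; the shapes, discount factor, and batch contents are illustrative.

# Sketch of DQN/DRQN-style target values: the main network evaluates current
# Q-values, a periodically synced target network provides the bootstrap values.
import numpy as np

rng = np.random.default_rng(0)
obs_dim, num_actions, gamma = 4, 3, 0.95

W_main = rng.standard_normal((obs_dim, num_actions))
W_target = W_main.copy()            # target network starts as a copy of the main one

def q_values(W, obs):               # stand-in for a (recurrent) Q-network
    return obs @ W

def td_targets(rewards, next_obs, dones):
    bootstrap = q_values(W_target, next_obs).max(axis=1)
    return rewards + gamma * bootstrap * (1.0 - dones)

# One sampled batch of transitions (illustrative shapes).
obs      = rng.standard_normal((32, obs_dim))
actions  = rng.integers(num_actions, size=32)
rewards  = rng.random(32)
next_obs = rng.standard_normal((32, obs_dim))
dones    = np.zeros(32)

targets = td_targets(rewards, next_obs, dones)
td_error = targets - q_values(W_main, obs)[np.arange(32), actions]
# Periodically: W_target = W_main.copy()  (the weight sync described in the text)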
The second DRQN architecture (DRQN-LSTM 2) includes LSTM cells in the recurrent layer and an
additional fully connected layer with ReLU activation. The output layer remains the same as in
DRQN-LSTM 1 (Fig. 19).
Fig. 19: DRQN-LSTM2 [13]
The plot of average throughput vs. number of secondary users in Fig. 20 demonstrates that the
performance of all compared techniques decreases as the number of SUs increases, indicating increased
congestion with a fixed number of channels. However, the proposed scheme outperforms all benchmark
schemes for all values of K. The average throughput obtained by DRQN-LSTM 1 and DRQN-LSTM 2
does not significantly differ.
Fig. 20: Average throughput vs. number of secondary users [13]
3. Implementation
The implementation is based on paper [14], but it can be used as the base environment and DQN in
other proposed designs.
[14] has proposed a communication method based on Aloha-type narrowband transmission, which
allows users to share a limited bandwidth channel without strict coordination. Users select a channel
and transmit their data with a certain attempt probability. Collisions, where multiple users transmit
simultaneously, are handled using collision detection and back-off mechanisms to improve channel
efficiency.
To address the challenge of users having partial knowledge of the network state, a Long Short Term
Memory (LSTM) layer is used to aggregate past observations and estimate the true state. This helps
users make more informed decisions based on historical information.
Instead of using experience replay, the algorithm collects episodes and creates target values for
multi-user learning. This approach incorporates interactions and dependencies among users more
directly and comprehensively, improving learning and decision-making in the dynamic spectrum
access (DSA) environment.
During real-time operation, users update their Deep Q-Network (DQN) weights by communicating
with a central unit. By mapping their local observations to spectrum access actions based on the
trained DQN, users aim to increase channel throughput by reducing idle time slots and collisions.
The objective of each user is to find a policy that maximizes its expected accumulated discounted
reward, considering both individual and social rewards. The action value, or Q-value, represents the
expected cumulative rewards for each action in a given state.
A Dueling DQN is used instead of simple DQN (Fig. 21). It separates the estimation of the state
value and the advantage value, and the advantage layer is responsible for estimating the advantages.
By decoupling the estimation of the advantages from the estimation of the state value, Dueling DQN
can provide more accurate and targeted learning, focusing on states where the advantages of
different actions differ significantly. This separation helps mitigate the overestimation bias and
improves the overall performance of the algorithm.
Fig. 21: Proposed Dueling DQN [14]
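A minimal Keras sketch of such a dueling head is given below; the hidden sizes and feature dimension are assumptions, and only the value/advantage recombination Q(s, a) = V(s) + A(s, a) − mean_a A(s, a) is shown, not the full agent of [14].

# Dueling head: separate state-value and advantage streams, recombined into Q-values.
import tensorflow as tf
from tensorflow import keras

def dueling_head(num_actions, feature_dim=32, hidden=64):
    features = keras.Input(shape=(feature_dim,))
    v = keras.layers.Dense(hidden, activation="relu")(features)
    v = keras.layers.Dense(1)(v)                        # state value V(s)
    a = keras.layers.Dense(hidden, activation="relu")(features)
    a = keras.layers.Dense(num_actions)(a)              # advantages A(s, a)
    # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
    q = keras.layers.Lambda(
        lambda t: t[0] + t[1] - tf.reduce_mean(t[1], axis=1, keepdims=True))([v, a])
    return keras.Model(features, q)

model = dueling_head(num_actions=3)
model.summary()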
Double Q-learning is employed to decouple action selection from Q-value evaluation. The online
Q-network is used to select actions, while the target Q-network is used to evaluate the selected
actions periodically. This helps mitigate overestimation bias and stabilize the learning process.
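The double Q-learning target just described can be written compactly as below; the Q-vectors are made-up numbers standing in for the outputs of the online and target networks.

# Double Q-learning target: the online network selects the next action,
# the target network evaluates it.
import numpy as np

def double_q_target(reward, q_next_online, q_next_target, gamma=0.95, done=False):
    if done:
        return reward
    best_action = int(np.argmax(q_next_online))           # selection: online network
    return reward + gamma * q_next_target[best_action]    # evaluation: target network

q_next_online = np.array([0.2, 1.3, 0.7])
q_next_target = np.array([0.1, 0.9, 1.5])
print(double_q_target(reward=1.0, q_next_online=q_next_online,
                      q_next_target=q_next_target))       # 1 + 0.95 * 0.9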
The DQN is trained offline at a central unit, and the users update their DQN weights in real-time.
The goal is to effectively learn how to access the channel to increase throughput by reducing idle
time slots and collisions, even without complete knowledge of the number of users.
By implementing this approach, the proposed algorithm achieves approximately twice the channel
throughput compared to slotted-Aloha with optimal attempt probability, without requiring complete
knowledge of the number of users.
• multi_user_network_env.py
This code defines a class “env_network”, which represents an environment for a network channel
allocation problem.
- The “env_network” class has the following attributes:
- NUM_USERS: The number of users in the network.
- NUM_CHANNELS: The number of channels available for allocation.
- ATTEMPT_PROB: The probability of a user attempting to access a channel.
- REWARD: The reward value for successful channel access.
- action_space: An array representing the possible channel allocation actions, including the
option of not accessing any channel.
- users_action: An array to store the action chosen by each user.
- users_observation: An array to store the observation for each user.
- The ‘reset’ method is currently empty and can be used to reset the environment if needed.
- The ‘sample’ method randomly selects an action for each user from the ‘action_space’ array and
returns the resulting actions as an array.
- The ‘step’ method takes an array of actions as input, updates the environment based on those
actions, and returns the observations and rewards for each user.
  - The method starts by asserting that the size of the 'action' array equals the number of users ('NUM_USERS') in the environment. This ensures that the actions provided are valid for the given number of users.
  - Next, a 'channel_alloc_frequency' array is created to keep track of the number of users accessing each channel. The array has a size of 'NUM_CHANNELS + 1', where index 0 represents not accessing any channel.
  - An empty list 'obs' is initialized to store the observations for each user, and an array 'reward' is created to store the rewards for each user.
  - The loop iterates over each action. For each action, a random probability 'prob' is generated between 0 and 1.
  - If the generated probability is less than or equal to 'ATTEMPT_PROB' (the probability of a user attempting to access a channel), the user's action is stored in the 'users_action' array at index 'j', and the corresponding channel in the 'channel_alloc_frequency' array is incremented by 1. 'ATTEMPT_PROB' controls the likelihood of users making a channel access attempt: a higher value results in a higher probability of users attempting to access a channel, while a lower value reduces the frequency of channel access attempts.
  - After the loop, a check is performed to ensure that no channel has more than one user allocated. For channels with more than one user, the count is reset to 0, effectively allowing only one user per channel.
  - Another loop iterates over the actions, and for each user, their observation is set based on the value in the 'channel_alloc_frequency' array for their allocated channel. If the user's action is 0 (not accessing any channel), their observation is set to 0. Additionally, if a user's observation is 1, indicating successful channel access, their reward is set to 1.
  - The user's observation and reward values are then appended as a tuple '(observation, reward)' to the 'obs' list.
  - The remaining channel capacity is calculated by subtracting the 'channel_alloc_frequency' from 1 for all channels except the no-access channel (index 0). This represents the capacity that is still available for each channel.
  - The 'residual_channel_capacity' array is appended to the 'obs' list.
  - The 'obs' list, containing the observations for each user, the corresponding rewards, and the residual channel capacity, is returned.
- In summary, the 'step' method first initializes a 'channel_alloc_frequency' array to keep track of the number of users accessing
each channel. The array has a size of ‘NUM_CHANNELS + 1’, where index 0 represents not
accessing any channel.
- It then iterates over the given actions and randomly assigns each user to attempt accessing a
channel based on the ‘ATTEMPT_PROB’ probability. If a user attempts to access a channel, their
action is stored in ‘users_action’, and the corresponding channel in ‘channel_alloc_frequency’ is
incremented.
- Afterward, the ‘channel_alloc_frequency’ array is checked, and if more than one user is allocated
to a channel, the count for that channel is reset to 0. This ensures that at most one user can access a
channel.
- Next, the ‘users_observation’ array is updated based on the allocated channels. If a user’s action
is 0 (not accessing any channel), their observation is set to 0. Otherwise, their observation is set to
the value stored in ‘channel_alloc_frequency’ for their allocated channel.
- The rewards are calculated based on the observations. If a user’s observation is 1, indicating
successful channel access, their reward is set to 1.
- The observations and rewards for each user are appended to the ‘obs’ list.
- The ‘residual_channel_capacity’ is calculated by subtracting the ‘channel_alloc_frequency’ from
1 for all channels except the no-access channel (index 0). This represents the remaining capacity of
each channel. The ‘residual_channel_capacity’ is also appended to the ‘obs’ list.
- The ‘obs’ list, containing the observations and rewards for each user, as well as the residual channel
capacity, is returned.
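For reference, the following is a condensed sketch of an environment consistent with the description above; it mirrors the interface ('sample', 'step', 'ATTEMPT_PROB', and so on) but is a re-creation for illustration, not the repository file itself.

# Condensed sketch of env_network consistent with the description above.
import numpy as np

class env_network:
    def __init__(self, num_users, num_channels, attempt_prob):
        self.NUM_USERS = num_users
        self.NUM_CHANNELS = num_channels
        self.ATTEMPT_PROB = attempt_prob
        self.REWARD = 1
        # action 0 means "do not access any channel"
        self.action_space = np.arange(self.NUM_CHANNELS + 1)
        self.users_action = np.zeros(self.NUM_USERS, dtype=int)
        self.users_observation = np.zeros(self.NUM_USERS, dtype=int)

    def reset(self):
        pass  # nothing to reset in this minimal sketch

    def sample(self):
        # One random action per user, drawn from the action space.
        return np.random.choice(self.action_space, size=self.NUM_USERS)

    def step(self, action):
        assert action.size == self.NUM_USERS
        channel_alloc_frequency = np.zeros(self.NUM_CHANNELS + 1, dtype=int)
        obs, reward = [], np.zeros(self.NUM_USERS)
        for j in range(self.NUM_USERS):
            if np.random.uniform() <= self.ATTEMPT_PROB:
                self.users_action[j] = action[j]
                channel_alloc_frequency[action[j]] += 1
        # A collision (more than one user on a channel) leaves nobody successful.
        channel_alloc_frequency[channel_alloc_frequency > 1] = 0
        for j in range(self.NUM_USERS):
            if self.users_action[j] == 0:      # user did not attempt access
                self.users_observation[j] = 0
            else:
                self.users_observation[j] = channel_alloc_frequency[self.users_action[j]]
            if self.users_observation[j] == 1:  # successful, uncontested access
                reward[j] = self.REWARD
            obs.append((self.users_observation[j], reward[j]))
        residual_channel_capacity = 1 - channel_alloc_frequency[1:]
        obs.append(residual_channel_capacity)
        return obs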
Now we can use the environment as follows:
# import the env_network class from multi_user_network_env.py
from multi_user_network_env import env_network
import numpy as np
import matplotlib.pyplot as plt

# for example:
NUM_USERS = 3
NUM_CHANNELS = 2
ATTEMPT_PROB = 1

# initialize the environment
env = env_network(NUM_USERS, NUM_CHANNELS, ATTEMPT_PROB)

# sample random actions from the action space
action = env.sample()
print(action)
When we pass actions to the environment, it takes these actions and returns the immediate reward as well as an acknowledgement for each channel. Finally, it also returns the residual (remaining) capacity of each channel.
• drqn.py
The code defines a QNetwork class and a Memory class.
• class QNetwork: This class represents a Q-network.
- ‘__init__’ method: This is the constructor method that initializes the Q-network object. It
takes several parameters:
- ‘learning_rate’: The learning rate used for training the network (default is 0.01).
- ‘state_size’: The size of the input state (default is 4).
- ‘action_size’: The number of possible actions (default is 2).
- ‘hidden_size’: The number of hidden units in the network (default is 10).
- ‘step_size’: The number of time steps in the input sequence (default is 1).
- ‘name’: The name of the network (default is ‘QNetwork’).
The method sets up the TensorFlow graph for the Q-network. Here are the key steps:
- ‘with tf.variable_scope(name)’: This creates a variable scope to encapsulate the network’s
variables.
- ‘self.inputs_’: This defines a placeholder for the input state sequence, with shape ‘[None,
step_size, state_size]’.
- ‘self.actions_’: This defines a placeholder for the actions taken, with shape ‘[None]’.
- ‘one_hot_actions’: This converts the actions to one-hot encoded vectors.
- ‘self.targetQs_’: This defines a placeholder for the target Q-values, with shape ‘[None]’.
The network architecture includes an LSTM layer and two fully connected layers:
- ‘self.lstm’: This creates a basic LSTM cell with ‘hidden_size’ units.
- ‘self.lstm_out’, ‘self.state’: This applies the LSTM cell to the input state sequence, producing the
LSTM output and hidden state.
- ‘self.reduced_out’: This extracts the last step of the LSTM output and reshapes it to match the
hidden size.
- ‘self.w2’, ‘self.b2’: These define the weights and biases for the second fully connected layer.
- ‘self.h2’: This applies the second fully connected layer to the reduced output.
- ‘self.w3’, ‘self.b3’: These define the weights and biases for the output layer.
- ‘self.output’: This applies the third fully connected layer to the second layer’s output.
The Q-values are calculated as the dot product between the output layer and the one-hot encoded
actions. The loss is computed as the mean squared error between the predicted Q-values and the
target Q-values. Finally, an Adam optimizer is used to minimize the loss.
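The original drqn.py builds this network with the TensorFlow 1 graph API (placeholders and variable scopes); the sketch below expresses the same architecture in Keras for readability, using the constructor defaults listed above. The masking of the output by the one-hot actions during the training step is omitted here for brevity.

# Keras re-expression of the described QNetwork: LSTM layer, ReLU dense layer,
# linear output layer, MSE loss, Adam optimizer.
from tensorflow import keras

def build_q_network(state_size=4, action_size=2, hidden_size=10,
                    step_size=1, learning_rate=0.01):
    inputs = keras.Input(shape=(step_size, state_size))
    x = keras.layers.LSTM(hidden_size)(inputs)                  # recurrent layer
    x = keras.layers.Dense(hidden_size, activation="relu")(x)   # second FC layer
    q_values = keras.layers.Dense(action_size)(x)               # output layer (Q-values)
    model = keras.Model(inputs, q_values)
    model.compile(optimizer=keras.optimizers.Adam(learning_rate), loss="mse")
    return model

q_net = build_q_network()
q_net.summary()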
• class Memory: This class represents a memory buffer for storing and sampling experiences.
- ‘__init__’ method: This is the constructor method that initializes the Memory object. It takes
one parameter:
- ‘max_size’: The maximum size of the memory buffer (default is 1000).
The method sets up the memory buffer as a deque (double-ended queue) with a maximum length of 'max_size'.
- ‘add’ method: This method adds an experience to the memory buffer.
- ‘sample’ method: This method samples a batch of experiences from the memory buffer. It
takes two parameters:
- ‘batch_size’: The number of experiences to sample.
- ‘step_size’: The number of time steps in each experience.
The method randomly selects ‘batch_size’ indices from the buffer and retrieves the corresponding
experiences. Each experience consists of ‘step_size’ consecutive entries from the buffer, forming a
sequence of states, actions, and rewards. The sampled batch of experiences is returned as a list of
sequences.
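A minimal version of the Memory buffer consistent with this description might look as follows (deque-backed storage and random sampling of 'step_size'-long sequences); it is a sketch, not the original file.

# Minimal experience-replay buffer: deque storage plus sequence sampling.
from collections import deque
import numpy as np

class Memory:
    def __init__(self, max_size=1000):
        self.buffer = deque(maxlen=max_size)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, step_size):
        # Choose starting indices so every sampled sequence fits in the buffer.
        idx = np.random.choice(np.arange(len(self.buffer) - step_size),
                               size=batch_size, replace=False)
        return [[self.buffer[i + s] for s in range(step_size)] for i in idx]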
References:
1) M. Kulin, T. Kazaz, I. Moerman and E. De Poorter, "End-to-End Learning From Spectrum Data: A Deep
Learning Approach for Wireless Signal Identification in Spectrum Monitoring Applications," in IEEE Access,
vol. 6, pp. 18484-18501, 2018, doi: 10.1109/ACCESS.2018.2818794.
2) R. Sarikhani and F. Keynia, "Cooperative Spectrum Sensing Meets Machine Learning: Deep Reinforcement
Learning Approach," in IEEE Communications Letters, vol. 24, no. 7, pp. 1459-1462, July 2020, doi:
10.1109/LCOMM.2020.2984430.
3) J. Xie, J. Fang, C. Liu and X. Li, "Deep Learning-Based Spectrum Sensing in Cognitive Radio: A CNN-LSTM Approach," in IEEE Communications Letters, vol. 24, no. 10, pp. 2196-2200, Oct. 2020, doi: 10.1109/LCOMM.2020.3002073.
4) R. Ahmed, Y. Chen, B. Hassan, L. Du, T. Hassan and J. Dias, "Hybrid Machine-Learning-Based Spectrum Sensing and Allocation With Adaptive Congestion-Aware Modeling in CR-Assisted IoV Networks," in IEEE Internet of Things Journal, vol. 9, no. 24, pp. 25100-25116, Dec. 2022, doi: 10.1109/JIOT.2022.3195425.
5) Chembe, C., Kunda, D., Ahmedy, I., Md Noor, R., Md Sabri, A. Q., & Ngadi, M. A., Infrastructure based
spectrum sensing scheme in VANET using reinforcement learning. Vehicular Communications, 18, 100161.,
2019, doi.org/10.1016/j.vehcom.2019.100161
6) Chen, L., Fu, K., Zhao, Q., & Zhao, X. A multi-channel and multi-user dynamic spectrum access algorithm
based on deep reinforcement learning in Cognitive Vehicular Networks with sensing error. Physical
Communication, 55, 101926., 2022, doi.org/10.1016/j.phycom.2022.101926
7) Paul, A., & Choi, K. Deep learning-based selective spectrum sensing and allocation in cognitive vehicular
radio networks. Vehicular Communications, 41, 100606., 2023, doi.org/10.1016/j.vehcom.2023.100606
8) Lee, W., & Chung, B. C., Ensemble deep learning based resource allocation for multi-channel underlay
cognitive radio system. ICT Express., 2022 doi.org/10.1016/j.icte.2022.08.009
9) B. Yang, X. Cao, O. Omotere, X. Li, Z. Han and L. Qian, "Improving Medium Access Efficiency With
Intelligent Spectrum Learning," in IEEE Access, vol. 8, pp. 94484-94498, 2020, doi:
10.1109/ACCESS.2020.2995398.
10) R. Mennes, F. A. P. De Figueiredo and S. Latré, "Multi-Agent Deep Learning for Multi-Channel Access in
Slotted Wireless Networks," in IEEE Access, vol. 8, pp. 95032-95045, 2020, doi:
10.1109/ACCESS.2020.2995456.
11) P. Yang et al., "Dynamic Spectrum Access in Cognitive Radio Networks Using Deep Reinforcement Learning
and Evolutionary Game," 2018 IEEE/CIC International Conference on Communications in China (ICCC),
Beijing, China, 2018, pp. 405-409, doi: 10.1109/ICCChina.2018.8641242.
12) Li, Z., Liu, X., & Ning, Z., Dynamic spectrum access based on deep reinforcement learning for multiple
access in cognitive radio. Physical Communication, 54, 101845., 2022,
https://doi.org/10.1016/j.phycom.2022.101845
13) Giri, M. K., & Majumder, S., Distributed dynamic spectrum access through multi-agent deep recurrent
Q-learning in cognitive radio network. Physical Communication, 58, 102054., 2023,
doi.org/10.1016/j.phycom.2023.102054
14) O. Naparstek and K. Cohen, "Deep Multi-User Reinforcement Learning for Dynamic Spectrum Access in
Multichannel Wireless Networks," GLOBECOM 2017 - 2017 IEEE Global Communications Conference,
Singapore, 2017, pp. 1-7, doi: 10.1109/GLOCOM.2017.8254101.