1. Introduction

As wireless networks evolve and the number of mobile devices grows, demand for spectrum resources in the 4G and 5G bands keeps increasing; the radio frequency spectrum, however, is a limited resource. This demand is driving the development of efficient dynamic spectrum access (DSA) schemes for emerging wireless network technologies. Some parts of the radio frequency spectrum are heavily used while others remain underutilized, and unlicensed bands in particular suffer from overutilization and cross-technology interference. Monitoring and understanding spectrum usage are therefore crucial for improving and regulating radio spectrum utilization, especially in complex wireless systems such as 5G. This monitoring, however, requires distributed sensing over a wide frequency range, which produces a large amount of radio spectrum data; extracting meaningful information from such a massive and complex dataset requires advanced algorithms. To address these challenges, new spectrum access schemes and identification mechanisms are being developed to provide awareness of the radio environment. Technology identification, modulation type recognition, and interference source detection are essential for effective interference mitigation and for the coexistence of heterogeneous wireless networks.

Cognitive radio networks (CRNs) play a significant role in utilizing frequency bands dynamically and efficiently. Cognitive radio (CR) has been proposed as a solution to spectrum scarcity: it allows secondary users (SUs) to exploit underutilized spectrum bands without causing interference to primary users (PUs). Spectrum sensing, which detects the presence of PUs, is a key technology in CR. Many spectrum sensing algorithms have been proposed for different scenarios, including traditional detectors and deep learning-based detectors. Traditional detectors rely on assumptions about the signal-noise model and face challenges due to computational complexity, whereas deep learning detectors learn from the sensing data itself and have emerged as a promising alternative. Deep neural networks (DNNs) can approximate optimal resource allocation strategies with reduced computational complexity.

2. Literature Review

Various deep learning approaches, including supervised and unsupervised learning, have been investigated for CR systems. Paper [1] introduces the concept of end-to-end learning from spectrum data, which uses deep neural networks for advanced wireless signal identification in spectrum monitoring applications. End-to-end learning for spectrum monitoring enables automatic feature learning from wireless signal representations and trains wireless signal classifiers in a single step. Two case studies, modulation recognition and wireless technology interference detection, are conducted using different signal representations (temporal IQ data, amplitude/phase, and frequency domain). The paper demonstrates that the choice of data representation significantly affects accuracy, with variations of up to 29%. In the proposed approach, cognitive radio users sense their environment and report the results to a base station (BS), which forwards the information to a data center (DC). The DC combines the sensing data from multiple CR users to create a spectrum map and determine the presence of primary users (PUs). The DC then shares the spectrum availability information with the cognitive users.
To achieve this, the DC uses a convolutional neural network (CNN) model trained offline to distinguish between occupied frequency channels and spectrum holes. Also, in the context of cognitive IoT, devices are equipped with cognitive capabilities to search for interference-free spectrum bands and adjust their transmission parameters accordingly. CR-IoT devices send spectrum sensing reports to a CNN-based DC, which learns and estimates the presence of other emitters. This information is used to detect interference sources and identify interference-free channels. The proposed architecture for CR and CR-IoT networks is shown in Fig. 1.

Fig. 1: Proposed architecture of CR SUs and CR-IoTs [1]

Table 1 shows the structure of the proposed CNN. The hyperparameters are set as follows: for both case studies, 67% of randomly selected examples are used for training in batches of 1024, and 33% are used for testing and validation. Both sets of examples are uniformly distributed in signal-to-noise ratio (SNR) from -20 dB to 20 dB.

Table 1: CNN structure

The detection efficiency depends on the complexity of the CNN structure used during prediction, which determines the time required to calculate the convolutions and activations in all neurons. For the optimization of the model parameters, the paper uses the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.001 to ensure convergence. To accelerate learning and convergence, the input data was normalized and the Rectified Linear Unit (ReLU) activation function was used. The CNNs were trained for 70 epochs, and the model with the lowest validation loss was selected for evaluation.

To evaluate the prediction accuracy of the end-to-end wireless signal classification models for tasks such as modulation type recognition or interference identification, their predictions must be compared with the true response values of the observed spectrum data. The performance of these classification methods can be measured by the prediction accuracy on a test data sample. The overall classification test error over $N_{test}$ testing snapshots can be calculated as

$E_{test} = \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} f(\hat{y}_i, y_i)$   (1)

and the classification accuracy can then be obtained as

$\text{accuracy} = 1 - E_{test}$   (2)

Here, $f(\hat{y}_i, y_i)$ represents the loss function that quantifies the difference between the estimated value and the true value for each instance. The numbers of true positives (TP), false positives (FP), and false negatives (FN) were determined as follows:
• If a signal was detected and annotated as belonging to a particular class in the labelled test data, it was counted as a TP.
• If a signal was predicted as belonging to a particular class but did not actually belong to that class according to the labelled test data, it was counted as an FP.
• If a signal was present in the labelled test data for a particular instance but was not detected in that instance, it was counted as an FN.
From these, the precision (P), recall (R), and F1 score can be derived:

$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$   (3)

$F1 = 2 \times \frac{P \times R}{P + R}$   (4)

For overall performance measurement, the per-class performance metrics are combined through a prevalence-weighted macro-average across the class metrics. The term "macro-average" refers to calculating precision, recall, and F1 score separately for each class and then averaging them across all classes.
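As a concrete illustration of Eqs. (1)-(4) and the prevalence-weighted macro-average (not code from [1]; the confusion matrix below is made up purely for illustration), the per-class metrics can be computed as follows:

import numpy as np

# Hypothetical 3-class confusion matrix: rows = true class, columns = predicted class.
cm = np.array([
    [50,  5,  5],   # class 0
    [ 4, 40,  6],   # class 1
    [ 2,  3, 85],   # class 2
])

tp = np.diag(cm).astype(float)
fp = cm.sum(axis=0) - tp          # predicted as class c but actually another class
fn = cm.sum(axis=1) - tp          # actually class c but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Prevalence-weighted macro-average: weight each per-class score by the
# fraction of test examples that belong to that class.
prevalence = cm.sum(axis=1) / cm.sum()
weighted_f1 = np.sum(prevalence * f1)

accuracy = tp.sum() / cm.sum()    # equivalently 1 - E_test for a 0/1 loss
print(precision, recall, f1, weighted_f1, accuracy)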
This is in contrast to the "micro-average," which calculates precision, recall, and F1 score by aggregating all true positives, false positives, and false negatives across all classes before computing the metrics. The "prevalence-weighted" aspect means that the average is weighted by the prevalence of each class in the dataset, so the model's performance on more prevalent classes is given more weight than its performance on less prevalent classes. The performance results are shown in Fig. 2 and Fig. 3.

Fig. 2: Performance results for modulation recognition classifiers vs. SNR [1]

Fig. 3: Performance results for interference identification classifiers vs. SNR [1]

The confusion matrix (Fig. 4) provides a detailed overview of the per-class performance.

Fig. 4: Confusion matrices for the modulation recognition data at 6 dB SNR, frequency-domain representation [1]

Signals within a dataset that exhibit similar characteristics in a given data representation are more difficult to discriminate. The performance of the classifier can be improved by increasing the quality of the wireless signal dataset, by adding more training examples and more variation among the examples (e.g., varying channel conditions), and by tuning the model hyperparameters.

As SUs face challenges such as fading and communication issues, Cooperative Spectrum Sensing (CSS) has been introduced in [2] to improve sensing quality by combining the results of cooperating secondary users (CSUs). However, the optimal cooperation strategy in CSS is affected by channel conditions. The paper proposes a Deep Reinforcement Learning (DRL)-based CSS approach, employing reinforcement learning (RL; its basic principle is shown in Fig. 5) to select which SU should contribute its measurements at each step. Hence, the proposed DRL approach can make CSS more efficient in terms of both resources and time.

Fig. 5: The basic principle of reinforcement learning [11]

The SU broadcasts a request for cooperative sensing, and all one-hop neighbors respond to the request with their local sensing results. The initiating SU, which requests cooperative sensing, is called the agent; it combines the local sensing results and takes the final decision. The paper formulates the CSS problem as a Markov Decision Process (MDP) and introduces the state and action spaces and the strategy selection policy. Moreover, the reward function is based on the correlation of the cooperating users: the reward calculation considers the correlation of the newly selected SU, which depends on the location of each SU in the cognitive network. A correlation coefficient is used to determine whether the sensing of the PU is correlated or uncorrelated between SUs, and the cost is calculated by summing the correlation coefficients from the first step up to the current step. The RL policy for choosing the next action is based on the Bellman equation, and a Deep Q-Network (DQN) is employed to estimate the optimal action-value function Q. The DQN structure is shown in Fig. 6.

Fig. 6: The flowchart of DQN [11]

At each step, one of the available SUs is selected to report its local sensing result, which is transmitted to the agent. The sensing matrix Y is gathered and processed by a CNN with three convolutional layers followed by two fully connected (FC) layers to obtain the global sensing result, i.e., to compute the action-value function that the DRL uses to decide on the absence or presence of the PU.
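To make the action-value update concrete, the following is a minimal, generic sketch of the Bellman-style target a DQN is trained toward (illustrative only, not code from [2]; q_network, gamma, and the tensor shapes are assumptions, with a random stand-in for the CNN):

import numpy as np

rng = np.random.default_rng(0)

def q_network(state_batch, num_actions=8):
    # Stand-in for the CNN that maps a sensing matrix to action values (illustrative only).
    return rng.standard_normal((state_batch.shape[0], num_actions))

# Batch of transitions (s, a, r, s') collected while selecting cooperating SUs.
states = rng.standard_normal((32, 10, 10))      # e.g. 10x10 sensing matrices
actions = rng.integers(0, 8, size=32)
rewards = rng.standard_normal(32)               # correlation-based reward as in [2]
next_states = rng.standard_normal((32, 10, 10))
gamma = 0.9

# Bellman target: r + gamma * max_a' Q(s', a')
targets = rewards + gamma * q_network(next_states).max(axis=1)

# The network is then trained to reduce the squared error between Q(s, a) and the target.
q_sa = q_network(states)[np.arange(32), actions]
loss = np.mean((targets - q_sa) ** 2)
print(loss)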
The action-values and rewards are updated based on the cost function and the local sensing results, and the agent finally decides on the presence or absence of the PU. For comparison with other works, the metrics are the receiver operating characteristic (ROC), the number of active SUs, and the sensing error, defined as the average of the probability of false alarm (Pfa) and the probability of missed detection (Pm). Fig. 7 shows the ROC comparison.

Fig. 7: The ROC of the proposed scheme for P = 30 dBm [2]

The combination of convolutional neural networks (CNNs) and long short-term memory (LSTM) networks has been proposed to learn the energy-correlation features and the PU activity pattern from the sensing data, improving spectrum sensing performance [3]. In the proposed CNN-LSTM-based spectrum sensing detector, the CNN extracts energy-correlation features from the current and historical sensing data, treated as sample covariance matrices. Directly processing the sensing samples in their original form would result in high computational complexity and redundancy; to address this, the sensing samples are preprocessed by computing the sample covariance matrix, which captures the essential energy-correlation information. The extracted features corresponding to different sensing periods are then fed into an LSTM network to learn the PU activity pattern. The structure of the CNN-LSTM is shown in Fig. 8.

Fig. 8: CNN-LSTM structure [3]

The CNN-LSTM detector is free from signal-noise model assumptions and has shown superior performance compared to benchmark detectors in scenarios with and without noise uncertainty. In contrast, the APASS algorithm learns both energy-correlation and temporal features using a CNN alone; since the CNN is not specialized in processing temporal features, the CNN-LSTM algorithm outperforms APASS. Benchmark algorithms, including MED, SSE, AGM, and APASS, are compared, and ROC curves are shown in Fig. 9, where 'CL' in the legend denotes the CNN-LSTM algorithm. The CNN-LSTM algorithm outperforms traditional detectors in terms of detection performance.

Fig. 9: ROC curves for comparison [3]

In the field of Intelligent Transportation Systems (ITS), in the 5G/6G era, Vehicular Ad-Hoc Networks (VANETs) play a crucial role in enabling collision avoidance, traffic management, and safety warnings. However, limited spectrum availability and congestion pose challenges for VANET-enabled services. To overcome this, unlicensed white space devices and cognitive radio networks (CRNs) are employed. The integration of CRNs with VANETs (CR-VANETs) enables efficient spectrum utilization and sustainable communication. As with CRNs in general, spectrum sensing and dynamic channel scheduling are essential for opportunistic spectrum access and reliable data transmission in CR-VANETs, considering vehicle mobility and avoiding interference with primary users. [4] has proposed intelligent hybrid-learning spectrum agents that perform sensing using a deep learning model within a CR-VANETs framework; the approach forecasts spectrum occupancy in the primary spectrum and allocates idle channels to the cognitive vehicles using a Support Vector Machine (SVM) classifier. [5] has proposed reinforcement learning to model PU activity patterns and predict free channels for dynamic spectrum access (DSA). The RL model is implemented on a road-side unit (RSU), which sends the predicted vacant channels to vehicles.
Adaptive spectrum sensing using energy and feature detectors improves spectrum detection performance. Simulation results show that the proposed approach improves vehicular communication and that the RL technique outperforms history-based schemes. A flowchart of the proposed sensing architecture is given in Fig. 10.

Fig. 10: Flowchart of the sensing architecture [5]

The dynamic spectrum access algorithm proposed in [6], based on RL, aims to optimize channel capacity in CR-VANETs. The algorithm uses Q-learning to find the optimal access strategy for secondary vehicles (SVs). SVs are treated as agents that interact with the CR-VANET communication environment: the sensing results of the SVs represent the state, the access conditions of the SVs are defined as actions, and the reward is determined based on the SV's capacity. To improve the Q-learning algorithm, a new reward function is introduced to avoid collisions between SVs and increase the access success rate. The Q-values are updated based on the reward, with the learning rate determining the weight given to the current reward versus future cumulative rewards, and the access strategy is updated using the ε-greedy algorithm (a minimal sketch of this tabular update is given at the end of this passage). To address the exponential growth in the size of the Q-table, the IDQN (Improved Deep Q-Network) method is employed: a neural network is trained to approximate the Q-values, taking the sensing results as input and outputting the Q-values for each action. Training minimizes a loss function and updates the network weights, and a target network is used to stabilize training. The algorithm describes the interaction between the SVs, the environment, and the neural network. The reward is determined based on the SV's sensing results and action, and the sensing results are obtained from the dynamic channel occupancy of primary vehicles, modeled as a Markov chain.

DRL methods are effective for optimizing decision-making in complex and high-dimensional problems. In particular, managing spectral resources among wirelessly connected devices involves a large state space and partial observation, so integrating DRL techniques is beneficial. Most research focuses on single-user opportunistic spectrum access (OSA) using single-agent DRL because of the exponential increase in the state-action space in multi-user settings. However, joint optimization problems require precise CSI and can be computationally expensive. RL optimizes behavior based on rewards, but predicting CSI in practice is challenging due to PU traffic and sensing errors. Partially Observable Markov Decision Processes (POMDPs) offer a potential solution for decision-making in CR-VANETs by integrating observations over time. [7] proposes solutions for channel allocation, future channel state estimation, selective sensing, and spectrum allocation using DRL and a POMDP. The system model includes primary and secondary networks, with a CBS controlling CSS decisions and channel allocation, and simulation results and analysis are presented. The paper addresses three key issues and their respective solutions in a single framework for CR-VANETs:
• Reliable CSS performance in CR-VANETs through a DRL algorithm.
• Channel indexing for selective sensing using LSTM-based time series analysis.
• Channel/spectrum allocation to the CR-VANETs using a value iteration algorithm in the POMDP framework.
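As referenced above for [6], the following is a minimal, generic sketch of a tabular Q-learning update with ε-greedy action selection (illustrative only; the state encoding, reward shaping, and all names and sizes here are assumptions and are far simpler than in [6]):

import numpy as np

rng = np.random.default_rng(1)

num_states, num_actions = 16, 4          # toy sizes; [6] derives these from sensing results
q_table = np.zeros((num_states, num_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate

def choose_action(state):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))
    return int(np.argmax(q_table[state]))

def update(state, action, reward, next_state):
    # One-step Q-learning update toward the Bellman target.
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])

# Toy interaction loop with a random stand-in for the environment.
state = 0
for _ in range(1000):
    action = choose_action(state)
    reward = rng.random()                # in [6] the reward reflects capacity and collisions
    next_state = int(rng.integers(num_states))
    update(state, action, reward, next_state)
    state = next_state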
As noted in Fig. 11, a single time frame (T) is divided into τ and (T − τ) slots for spectrum sensing (SS) and OSA, respectively. Fig. 12 illustrates the sensing process of M SUs detecting m distinct PU channels. Each SU senses a distinct channel at a specific time, providing spatial and temporal diversity. The SUs forward their local decisions and GPS data to the CBS through the common control channel (CCCH), and the CBS collects these local results into a matrix for the global CSS decision.

Fig. 11: Time-frame model for the CR-VANETs [7]

Fig. 12: Proposed spectrum sensing technique in CR-VANETs [7]

To address the influence of DNN initialization on resource allocation, [8] has proposed a novel resource allocation strategy using an ensemble of multiple DNNs, which aims to maximize the sum spectral efficiency (SE) of the SUs while respecting interference constraints on the PUs. The approach is based on unsupervised learning. Performance evaluation demonstrates that the proposed scheme achieves near-optimal performance, and the ensemble of multiple DNNs further enhances the DNN-based approach.

Paper [9] introduces SL-MAC, a MAC protocol based on intelligent spectrum learning that integrates deep learning (DL) techniques to improve channel access efficiency in IEEE 802.11-based WLANs. It uses a convolutional neural network (CNN) model to identify the stations involved in collisions and dynamically schedule their data transmissions; the protocol achieves better channel coordination and spectrum sharing by leveraging the DL inference results. The paper describes a typical WLAN scenario where multiple stations contend for channel access using the IEEE 802.11 scheme. Collisions occur when multiple stations transmit RTS (Request to Send) packets simultaneously, leading to low channel utilization. The AP (Access Point) uses the pre-trained CNN model to infer the number of users involved in collisions and their identities, and then schedules their data transmissions using a CTS (Clear to Send) packet, informing them when they are allowed to transmit their data packets without further collisions. Other users adjust their NAV (Network Allocation Vector, a timer set by the AP) accordingly and remain silent during this period; after the NAV expires, all users can contend for channel access again. The CNN model is trained offline using RF (radio frequency) traces collected from different transmission scenarios, and by solving a multi-class classification problem it achieves high accuracy in identifying the stations involved in collisions. The scalability of the protocol depends on the average number of stations experiencing collisions. The SL-MAC protocol design ensures a high overall throughput; however, the achieved throughput degrades due to the inference errors introduced by the trained CNN model, especially when the number of stations involved in collisions is large, so there is a trade-off between the performance gain brought by deep learning and the inference accuracy. The normalized throughput versus the number of devices is shown in Fig. 13.

Fig. 13: Normalized throughput versus number of devices [9]

The inference error rate introduced by the trained CNN model is defined as the ratio of the number of incorrect inferences to the total number of inferences. Two typical cases are considered: 1) over-estimation and 2) under-estimation.
Specifically, if over-estimation happens, the inference results include not only the stations involved in collisions but also other stations that did not collide. If under-estimation occurs, the inference results may miss one or more users involved in collisions; as a result, only a portion of the colliding users can be scheduled to transmit, which degrades fairness among devices. It has been shown that over-estimation negatively impacts the overall performance of the protocol, whereas under-estimation of the number of colliding stations does not have a decisive impact on system throughput, which may either decrease or remain the same.

The proposed CNN architecture is designed as a master-CNN model and several slave-CNN models; the CNN framework and its flowchart in the proposed MAC protocol are shown in Fig. 14 and Fig. 15.

Fig. 14: CNN framework in the proposed MAC protocol [9]

Fig. 15: Flowchart of the CNN framework in the proposed MAC protocol [9]

The collected RF traces (from a USRP2 testbed) are processed in a convolutional layer, reshaped into a suitable tensor, and fed into the CNN models. The CNN structure includes convolutional layers followed by Rectified Linear Unit (ReLU) layers; feature extraction is performed by these layers, and fully connected (FC) layers with Softmax activation are used for multi-class classification. The master-CNN model predicts the number of stations involved in collisions, while the slave-CNN models identify the IDs of those stations. The CNN models are optimized using the Adam optimizer and trained offline until they learn the features of the RF traces and make accurate inferences; training uses back-propagation with cross-entropy as the classification loss. Once trained, the CNN models can be used for online inference on an I/Q dataset. This two-step training approach improves accuracy by reducing the number of classes compared to conventional CNN training. Extensive simulations demonstrate the benefits of SL-MAC, as shown in Table 2.

Table 2: Inference accuracy

[10] discusses the issue of spectrum interference among different technologies sharing the industrial, scientific, and medical (ISM) radio bands. It proposes a Deep Neural Network (DNN) approach for predicting the spectrum occupancy of unknown neighboring networks in the near future. This prediction helps existing network schedulers avoid collisions and optimize overall channel capacity. The DNN is trained online using supervised learning, and the paper demonstrates a reduction in collisions and an increase in overall throughput in a network with unknown neighboring networks.

[11] presents a novel approach to the challenge of dynamic spectrum access in cognitive radio networks. Combining deep reinforcement learning with evolutionary game theory, the proposed method uses a Deep Q-Network (DQN) framework that enables individual users to independently select channels and effectively improve the utilization of spectrum resources. Additionally, the replicator dynamics model from evolutionary game theory is incorporated to maintain a balance of collaboration among users. Simulation results demonstrate the effectiveness of the algorithm in reducing collision rates and increasing system capacity.
The algorithm follows a distributed approach, involving steps such as initializing Q-values, selecting actions, calculating transmission utility and channel capacity, updating reward values and Q-values, and recording system metrics. The ultimate goal is to achieve equilibrium among the distributed secondary users.

Paper [12] proposes a deep reinforcement learning (DRL) algorithm based on a deep Q-network (DQN) to optimize spectrum access in a dynamic spectrum access (DSA) scenario. The paper considers a DSA scenario (Fig. 16) with N primary users and M secondary users communicating through N channels, examining both frequency division multiple access (FDMA) and non-orthogonal multiple access (NOMA) schemes. SUs employ spectrum sensing and make access decisions based on the channel state information (CSI) obtained through sensing. DRL is applied to learn the optimal access strategy, with neural networks used to predict the Q-values.

Fig. 16: The DSA model for CR [12]

In the DQN-based FDMA scheme, one channel is divided into two sub-channels, allowing up to two users to transmit simultaneously. SUs perform spectrum sensing to detect the channel states and, based on the sensed CSI, make spectrum access decisions using a neural network. Rewards are assigned according to different cases: no access, accessing an idle channel, and accessing a busy channel with or without interference. In the DQN-based NOMA scheme, users employ NOMA to utilize the complete channel bandwidth for improved throughput: power-domain NOMA superimposes the signals of multiple users at the transmitter, and successive interference cancellation (SIC) is used at the receiver to separate the users' signals. The reward function considers throughput, throughput loss, and interference from primary users (PUs) or previous SUs. These schemes aim to optimize the spectrum access decisions of SUs in CR systems, considering channel conditions and interference, to achieve maximum system throughput. Simulation results demonstrate improved system throughput (Fig. 17) and reduced interference. The training process of the DQN-based DSA is shown in Algorithm 3; it takes various configuration parameters for the DQN, including the memory capacity, number of time slots, number of samples, training threshold, batch size, discount factor, copy frequency of the network, learning rate, and exploration rate.

Fig. 17: Average throughput of SUs vs. time slots [12]

[13] focuses on distributed spectrum access in cognitive radio networks. It introduces Deep Recurrent Q-Networks (DRQNs) with different recurrent neural network architectures for distributed spectrum access, and compares their performance with other techniques such as Q-learning (QL), Deep Q-Network (DQN), and RC-based DRQN. The proposed DRQN models offer complexity-performance trade-offs depending on factors such as partial observation, the number of channels, and training time. The work uses Long Short-Term Memory (LSTM) for the hidden layer in the proposed DRQN-based spectrum access scheme; LSTM's ability to remember longer sequences enhances the successful channel access rate, especially in scenarios with a large number of PUs and SUs. It considers a multi-agent scenario with varying numbers of SUs and evaluates DRQN-based spectrum access performance in both complete and partial observation scenarios.
Unlike previous works, this research takes sensing errors into account in the multi-agent scenario, providing a more realistic assessment of the spectrum access method. In the first LSTM-based DRQN architecture (DRQN-LSTM 1), LSTM cells are used in the recurrent layer, followed by a single dense layer with linear activation for action selection. The output layer contains M+1 neurons representing the M channels and the option of no transmission (Fig. 18).

Fig. 18: DRQN-LSTM 1 [13]

The state space (S), action space (A), and reward function (R) are defined as follows:
1. State space: The observed state consists of M binary values indicating the presence or absence of PU activity on each channel. In the case of a large number of channels, the SU may choose to sense only a subset to avoid delays and resource constraints; unobserved channels are represented as -1.
2. Action space: At each time step, the SU agent selects an action, i.e., a channel, based on the observed state and its policy. The action space has M+1 elements, including the option not to transmit.
3. Reward function: The reward is designed to prioritize desired outcomes. In the absence of PU transmission, the reward is based on the signal-to-interference-plus-noise ratio (SINR) at the SU receiver, and mutual interference is considered when multiple SUs transmit simultaneously.

The training algorithm runs an independent DRQN at each SU node. The algorithm initializes the networks and the replay buffer and proceeds through multiple episodes. In each episode, the SU observes the state, selects an action using ε-greedy exploration, applies the action to the environment, and collects the reward and next state; the experiences are stored in the replay buffer. The target network's weights are periodically updated from the main network. After a sufficient number of transitions has accumulated in the replay buffer, sequences of transitions are used to update the network weights based on the target values. This process is repeated for each episode until convergence.

The second DRQN architecture (DRQN-LSTM 2) includes LSTM cells in the recurrent layer and an additional fully connected layer with ReLU activation; the output layer remains the same as in DRQN-LSTM 1 (Fig. 19).

Fig. 19: DRQN-LSTM 2 [13]

The plot of average throughput vs. number of secondary users in Fig. 20 shows that the performance of all compared techniques decreases as the number of SUs increases, indicating increased congestion with a fixed number of channels. However, the proposed scheme outperforms all benchmark schemes for all values of K, and the average throughput obtained by DRQN-LSTM 1 and DRQN-LSTM 2 does not differ significantly.

Fig. 20: Average throughput vs. number of secondary users [13]

Implementation

The implementation is based on paper [14], but it can serve as the base environment and DQN for the other proposed designs. [14] proposes a communication method based on Aloha-type narrowband transmission, which allows users to share a limited-bandwidth channel without strict coordination. Users select a channel and transmit their data with a certain attempt probability. Collisions, where multiple users transmit simultaneously, are handled using collision detection and back-off mechanisms to improve channel efficiency. To address the challenge of users having only partial knowledge of the network state, a Long Short-Term Memory (LSTM) layer is used to aggregate past observations and estimate the true state.
This helps users make more informed decisions based on historical information. Instead of using experience replay, the algorithm collects episodes and creates target values for multi-user learning; this incorporates the interactions and dependencies among users more directly and comprehensively, improving learning and decision-making in the dynamic spectrum access (DSA) environment. During real-time operation, users update their Deep Q-Network (DQN) weights by communicating with a central unit. By mapping their local observations to spectrum access actions with the trained DQN, users aim to increase channel throughput by reducing idle time slots and collisions. The objective of each user is to find a policy that maximizes its expected accumulated discounted reward, considering both individual and social rewards; the action value, or Q-value, represents the expected cumulative reward of each action in a given state.

A Dueling DQN is used instead of a plain DQN (Fig. 21). It separates the estimation of the state value from the estimation of the advantage values, with the advantage layer responsible for estimating the advantages. By decoupling the two, the Dueling DQN can provide more accurate and targeted learning, focusing on states where the advantages of different actions differ significantly; this separation helps mitigate overestimation bias and improves the overall performance of the algorithm.

Fig. 21: Proposed Dueling DQN [14]

Double Q-learning is employed to decouple action selection from Q-value evaluation: the online Q-network is used to select actions, while the target Q-network is used to evaluate the selected actions periodically. This also helps mitigate overestimation bias and stabilize the learning process. The DQN is trained offline at a central unit, and the users update their DQN weights in real time. The goal is to learn how to access the channel so as to increase throughput by reducing idle time slots and collisions, even without complete knowledge of the number of users. With this approach, the proposed algorithm achieves approximately twice the channel throughput of slotted Aloha with the optimal attempt probability, without requiring complete knowledge of the number of users.

• multi_user_network_env.py

This file defines the class env_network, which represents an environment for a network channel allocation problem. The env_network class has the following attributes:
- NUM_USERS: the number of users in the network.
- NUM_CHANNELS: the number of channels available for allocation.
- ATTEMPT_PROB: the probability of a user attempting to access a channel.
- REWARD: the reward value for successful channel access.
- action_space: an array representing the possible channel allocation actions, including the option of not accessing any channel.
- users_action: an array storing the action chosen by each user.
- users_observation: an array storing the observation for each user.

The reset method is currently empty and can be used to reset the environment if needed. The sample method randomly selects an action for each user from the action_space array and returns the resulting actions as an array. The step method takes an array of actions as input, updates the environment based on those actions, and returns the observations and rewards for each user.
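Based on this description, the following is a condensed re-implementation sketch of such an environment (not the original multi_user_network_env.py; behaviour that the description leaves open, such as resetting stale actions, is an assumption). The behaviour of the original step method is walked through in detail below.

import numpy as np

class EnvNetwork:
    # Minimal sketch of a multi-user channel-access environment (illustrative only).

    def __init__(self, num_users, num_channels, attempt_prob, reward=1):
        self.num_users = num_users
        self.num_channels = num_channels
        self.attempt_prob = attempt_prob
        self.reward = reward
        # Action 0 means "do not access any channel"; actions 1..num_channels pick a channel.
        self.action_space = np.arange(num_channels + 1)
        self.users_action = np.zeros(num_users, dtype=int)
        self.users_observation = np.zeros(num_users, dtype=int)

    def sample(self):
        # Pick a random action for every user.
        return np.random.choice(self.action_space, size=self.num_users)

    def step(self, actions):
        assert len(actions) == self.num_users
        channel_alloc = np.zeros(self.num_channels + 1, dtype=int)
        # Each user attempts its chosen action with probability attempt_prob.
        for j, a in enumerate(actions):
            self.users_action[j] = 0
            if np.random.random() <= self.attempt_prob:
                self.users_action[j] = a
                channel_alloc[a] += 1
        # A channel chosen by more than one user is a collision: nobody succeeds on it.
        channel_alloc[1:][channel_alloc[1:] > 1] = 0
        obs, rewards = [], np.zeros(self.num_users)
        for j in range(self.num_users):
            a = self.users_action[j]
            self.users_observation[j] = 0 if a == 0 else channel_alloc[a]
            if self.users_observation[j] == 1:
                rewards[j] = self.reward
            obs.append((self.users_observation[j], rewards[j]))
        # Remaining capacity per channel (1 if still free, 0 if successfully taken).
        residual = 1 - channel_alloc[1:]
        obs.append(residual)
        return obs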
The step method works as follows:
- It starts by asserting that the size of the action array equals the number of users (NUM_USERS) in the environment, ensuring that the actions provided are valid for the given number of users.
- Next, a channel_alloc_frequency array is created to keep track of the number of users accessing each channel. The array has a size of NUM_CHANNELS + 1, where index 0 represents not accessing any channel.
- An empty list obs is initialized to store the observations for each user, and an array reward is created to store the rewards for each user.
- A loop iterates over each action. For each action, a random probability prob is generated between 0 and 1.
- If the generated probability is less than or equal to ATTEMPT_PROB (the probability of a user attempting to access a channel), the user's action is stored in the users_action array at index j, and the count for the corresponding channel in the channel_alloc_frequency array is incremented by 1. ATTEMPT_PROB therefore controls the likelihood of users making a channel access attempt: a higher value results in a higher probability of users attempting to access a channel, while a lower value reduces the frequency of channel access attempts.
- After the loop, a check ensures that no channel has more than one user allocated. For channels with more than one user, the count is reset to 0, effectively allowing only one user per channel.
- Another loop iterates over the actions, and each user's observation is set to the value in the channel_alloc_frequency array for their allocated channel. If the user's action is 0 (not accessing any channel), their observation is set to 0. If a user's observation is 1, indicating successful channel access, their reward is set to 1.
- Each user's observation and reward are then appended as a tuple (observation, reward) to the obs list.
- The remaining channel capacity, residual_channel_capacity, is calculated by subtracting channel_alloc_frequency from 1 for all channels except the no-access channel (index 0); this represents the capacity still available on each channel. The residual_channel_capacity array is appended to the obs list.
- The obs list, containing the observations and rewards for each user as well as the residual channel capacity, is returned.
Now, we can use the environment as follows:

# import env_network from multi_user_network_env.py
from multi_user_network_env import env_network

# example parameters
NUM_USERS = 3
NUM_CHANNELS = 2
ATTEMPT_PROB = 1

# initializing the environment
env = env_network(NUM_USERS, NUM_CHANNELS, ATTEMPT_PROB)

# sample random actions from action_space
action = env.sample()
print(action)

When we pass actions to the environment via its step method, it takes these actions and returns the immediate reward for each user as well as an acknowledgement of the channel; finally, it also returns the residual capacity of the channels (the remaining capacity).

• drqn.py

This file defines a QNetwork class and a Memory class.

class QNetwork: This class represents a Q-network. Its __init__ method is the constructor that initializes the Q-network object and takes several parameters:
- learning_rate: the learning rate used for training the network (default 0.01).
- state_size: the size of the input state (default 4).
- action_size: the number of possible actions (default 2).
- hidden_size: the number of hidden units in the network (default 10).
- step_size: the number of time steps in the input sequence (default 1).
- name: the name of the network (default 'QNetwork').

The method sets up the TensorFlow graph for the Q-network. The key steps are:
- with tf.variable_scope(name): creates a variable scope that encapsulates the network's variables.
- self.inputs_: a placeholder for the input state sequence, with shape [None, step_size, state_size].
- self.actions_: a placeholder for the actions taken, with shape [None].
- one_hot_actions: the actions converted to one-hot encoded vectors.
- self.targetQs_: a placeholder for the target Q-values, with shape [None].

The network architecture includes an LSTM layer and two fully connected layers:
- self.lstm: a basic LSTM cell with hidden_size units.
- self.lstm_out, self.state: the result of applying the LSTM cell to the input state sequence, producing the LSTM output and hidden state.
- self.reduced_out: the last step of the LSTM output, reshaped to match the hidden size.
- self.w2, self.b2: the weights and biases of the second fully connected layer.
- self.h2: the result of applying the second fully connected layer to the reduced output.
- self.w3, self.b3: the weights and biases of the output layer.
- self.output: the result of applying the third fully connected layer to the second layer's output.

The Q-values are calculated as the dot product between the output layer and the one-hot encoded actions. The loss is computed as the mean squared error between the predicted Q-values and the target Q-values, and an Adam optimizer is used to minimize the loss.
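The description above corresponds to a TensorFlow 1.x graph. The following is a condensed sketch of such a network for orientation only (not the original drqn.py; the dense-layer helpers stand in for the explicit w2/b2 and w3/b3 variables, and the ReLU on the hidden layer is an assumption). Under TensorFlow 2 it relies on the tf.compat.v1 API with eager execution disabled.

import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

class QNetwork:
    def __init__(self, learning_rate=0.01, state_size=4, action_size=2,
                 hidden_size=10, step_size=1, name='QNetwork'):
        with tf.variable_scope(name):
            # Placeholders for state sequences, actions taken, and target Q-values.
            self.inputs_ = tf.placeholder(tf.float32, [None, step_size, state_size])
            self.actions_ = tf.placeholder(tf.int32, [None])
            self.targetQs_ = tf.placeholder(tf.float32, [None])
            one_hot_actions = tf.one_hot(self.actions_, action_size)

            # LSTM layer over the input sequence; keep only the last time step.
            lstm = tf.nn.rnn_cell.BasicLSTMCell(hidden_size)
            lstm_out, _ = tf.nn.dynamic_rnn(lstm, self.inputs_, dtype=tf.float32)
            reduced_out = lstm_out[:, -1, :]

            # Two fully connected layers; the last one outputs one value per action.
            h2 = tf.layers.dense(reduced_out, hidden_size, activation=tf.nn.relu)
            self.output = tf.layers.dense(h2, action_size, activation=None)

            # Q-value of the taken action, mean-squared-error loss, Adam optimizer.
            Q = tf.reduce_sum(self.output * one_hot_actions, axis=1)
            self.loss = tf.reduce_mean(tf.square(self.targetQs_ - Q))
            self.opt = tf.train.AdamOptimizer(learning_rate).minimize(self.loss)

# Building the graph (training would additionally require a tf.Session and feed dicts).
net = QNetwork()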
class Memory(): This class represents a memory buffer for storing and sampling experiences. Its __init__ method is the constructor that initializes the Memory object and takes one parameter:
- max_size: the maximum size of the memory buffer (default 1000).

The method sets up the memory buffer as a deque (double-ended queue) with a maximum length of max_size. The add method adds an experience to the memory buffer. The sample method samples a batch of experiences from the memory buffer and takes two parameters:
- batch_size: the number of experiences to sample.
- step_size: the number of time steps in each experience.

The sample method randomly selects batch_size indices from the buffer and retrieves the corresponding experiences. Each experience consists of step_size consecutive entries from the buffer, forming a sequence of states, actions, and rewards. The sampled batch of experiences is returned as a list of sequences.

References

1) M. Kulin, T. Kazaz, I. Moerman and E. De Poorter, "End-to-End Learning From Spectrum Data: A Deep Learning Approach for Wireless Signal Identification in Spectrum Monitoring Applications," IEEE Access, vol. 6, pp. 18484-18501, 2018, doi: 10.1109/ACCESS.2018.2818794.
2) R. Sarikhani and F. Keynia, "Cooperative Spectrum Sensing Meets Machine Learning: Deep Reinforcement Learning Approach," IEEE Communications Letters, vol. 24, no. 7, pp. 1459-1462, July 2020, doi: 10.1109/LCOMM.2020.2984430.
3) J. Xie, J. Fang, C. Liu and X. Li, "Deep Learning-Based Spectrum Sensing in Cognitive Radio: A CNN-LSTM Approach," IEEE Communications Letters, vol. 24, no. 10, pp. 2196-2200, Oct. 2020, doi: 10.1109/LCOMM.2020.3002073.
4) R. Ahmed, Y. Chen, B. Hassan, L. Du, T. Hassan and J. Dias, "Hybrid Machine-Learning-Based Spectrum Sensing and Allocation With Adaptive Congestion-Aware Modeling in CR-Assisted IoV Networks," IEEE Internet of Things Journal, vol. 9, no. 24, pp. 25100-25116, Dec. 2022, doi: 10.1109/JIOT.2022.3195425.
5) C. Chembe, D. Kunda, I. Ahmedy, R. Md Noor, A. Q. Md Sabri and M. A. Ngadi, "Infrastructure based spectrum sensing scheme in VANET using reinforcement learning," Vehicular Communications, vol. 18, 100161, 2019, doi: 10.1016/j.vehcom.2019.100161.
6) L. Chen, K. Fu, Q. Zhao and X. Zhao, "A multi-channel and multi-user dynamic spectrum access algorithm based on deep reinforcement learning in Cognitive Vehicular Networks with sensing error," Physical Communication, vol. 55, 101926, 2022, doi: 10.1016/j.phycom.2022.101926.
7) A. Paul and K. Choi, "Deep learning-based selective spectrum sensing and allocation in cognitive vehicular radio networks," Vehicular Communications, vol. 41, 100606, 2023, doi: 10.1016/j.vehcom.2023.100606.
8) W. Lee and B. C. Chung, "Ensemble deep learning based resource allocation for multi-channel underlay cognitive radio system," ICT Express, 2022, doi: 10.1016/j.icte.2022.08.009.
9) B. Yang, X. Cao, O. Omotere, X. Li, Z. Han and L. Qian, "Improving Medium Access Efficiency With Intelligent Spectrum Learning," IEEE Access, vol. 8, pp. 94484-94498, 2020, doi: 10.1109/ACCESS.2020.2995398.
10) R. Mennes, F. A. P. De Figueiredo and S. Latré, "Multi-Agent Deep Learning for Multi-Channel Access in Slotted Wireless Networks," IEEE Access, vol. 8, pp. 95032-95045, 2020, doi: 10.1109/ACCESS.2020.2995456.
11) P. Yang et al., "Dynamic Spectrum Access in Cognitive Radio Networks Using Deep Reinforcement Learning and Evolutionary Game," 2018 IEEE/CIC International Conference on Communications in China (ICCC), Beijing, China, 2018, pp. 405-409, doi: 10.1109/ICCChina.2018.8641242.
12) Z. Li, X. Liu and Z. Ning, "Dynamic spectrum access based on deep reinforcement learning for multiple access in cognitive radio," Physical Communication, vol. 54, 101845, 2022, doi: 10.1016/j.phycom.2022.101845.
13) M. K. Giri and S. Majumder, "Distributed dynamic spectrum access through multi-agent deep recurrent Q-learning in cognitive radio network," Physical Communication, vol. 58, 102054, 2023, doi: 10.1016/j.phycom.2023.102054.
14) O. Naparstek and K. Cohen, "Deep Multi-User Reinforcement Learning for Dynamic Spectrum Access in Multichannel Wireless Networks," GLOBECOM 2017 - 2017 IEEE Global Communications Conference, Singapore, 2017, pp. 1-7, doi: 10.1109/GLOCOM.2017.8254101.