Received 26 February 2023, accepted 20 March 2023, date of publication 29 March 2023, date of current version 3 April 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3263117

OBTAIN: Observational Therapy-Assistance Neural Network for Training State Recognition

MINXIAO WANG 1 AND NING YANG 2 (Member, IEEE)
1 School of Electrical, Computer and Biomedical Engineering, Southern Illinois University, Carbondale, IL 62901, USA
2 School of Computing, Southern Illinois University, Carbondale, IL 62901, USA

Corresponding author: Ning Yang (nyang@siu.edu)

This work was supported in part by the U.S. NSF under Grant CC-2018919. The associate editor coordinating the review of this manuscript and approving it for publication was Sotirios Goudos.

ABSTRACT Children with autism spectrum disorder (ASD) often require long-term, high-quality intervention. An important factor affecting intervention quality is the therapist's observation skill, which allows the therapist to adjust intervention strategies promptly based on the child's state. However, experienced therapists are in short supply, and observation skills take junior therapists a long time to acquire. This motivates us to use data-driven deep learning to build an OBservational Therapy-AssIstance Neural network (OBTAIN), a weakly-supervised learning framework for therapy training state recognition. OBTAIN first represents a child's skeleton-sequence data as one large graph. A graph representation learning module then extracts training state features. To learn spatial-temporal behavior features more effectively and efficiently, a novel structure-aware GCN block is designed in OBTAIN's learning module. Finally, a MILnet and a corresponding joint MIL loss are used to learn state score prediction from the extracted features. Experimental results on the DREAM dataset (0.824 AUC score) demonstrate that OBTAIN can effectively recognize training states in autism intervention.

INDEX TERMS Autism, therapy assistance, weakly-supervised learning, GCN, multiple instance learning.

I. INTRODUCTION
Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder, and children with ASD can benefit from long-term intervention. Therapeutic treatments such as Applied Behavior Analysis (ABA), the Relationship Development Intervention (RDI) model, and Cognitive-Behavior Therapy (CBT) have established standard protocols that can be delivered in education, clinic, community, or home settings. In a typical intervention session, the therapist sets a training goal, and the task training cycle can be broken down into two basic steps. First, the therapist gives instructions and the child responds. Then, based on observing the child's response, the therapist identifies the child's mood and adjusts the next instruction. Since children with ASD usually have their own strengths, difficulties, and other special needs, the same instructions may work for some children but be ignored by others. Children may not be able to express their feelings in language, but their eye contact and body movements signal whether they understand an instruction and are willing to follow it. It usually takes several cycles for the therapist to find the best way to communicate with the child and get one task done well. In this child-therapist interaction, the key component is observation: the intervention training can move on effectively only if the child's response is recognized properly and promptly by the therapist.
Currently, this observational process is handled by therapists alone. Senior therapists' observational experience helps them better understand children's special behaviors and adapt their interventions dynamically. Junior therapists who lack observational experience learn the typical behaviors of children with ASD in professional education and accumulate experience during intervention practice. However, most observation skills are learned through individual cases, which makes them hard to teach directly and effectively. Most states currently face a shortage of experienced providers for ASD intervention. This motivates us to provide observational assistance that helps quickly close the gap in observation ability between experienced and junior therapists, thereby improving the effectiveness of ASD intervention.

FIGURE 1. Traditional therapy intervention.
FIGURE 2. OBTAIN assisted therapy intervention.

To achieve our long-term goal of enhancing autism intervention quality, the first step is to build an observation system that recognizes children's training states during the intervention. Considering that experienced therapists improve their observational skills through practice in actual interventions, we believe recent data-driven deep learning techniques have great potential for this task. Fortunately, many therapists record their intervention sessions for further analysis, and deep learning models can learn general knowledge and patterns from the recorded data more efficiently than humans. In recent years, machine learning (ML), and especially deep learning (DL), has been applied to autism research. For example, many ASD diagnosis studies take advantage of deep learning on different data types, including magnetic resonance imaging (MRI) [6], facial expression [7], eye gaze [8], movement patterns [9], [10], and multimodal data [11]. Besides ASD diagnosis, Robot-Assisted Therapy (RAT) has also gained attention. Social robots in RAT can improve the quality of therapy by acting as a mediator between therapists and children with ASD, especially strengthening the effect of the therapist on the child, as the blue arrow on the right side of Fig. 1 illustrates. Different from previous works, we focus on the other direction, as Fig. 2 shows: a deep learning-based observational assistance method that helps therapists identify children's states as reflected by their behavior.

In our study, providing learning-based real-time therapy training state recognition is challenging due to the following difficulties: 1) Limited data. Data collection and annotation are very costly, and only large-scale clinical experiments can provide the needed dataset. In addition, for privacy reasons, only high-level perception data from sensors, such as skeleton data, are available for training the DL model. 2) Blurred definition of training state. Good or bad training states are defined relative to historical and ongoing individual performance. As an abstract concept, the training state may not be directly tied to particular behaviors, and no existing dataset has been annotated with training states. 3) Efficiency requirement for real-world scenarios.
During real-world therapy intervention, the observational system is expected to process input in real time and generate predictions efficiently.

In this work, we propose an Observational Therapy-Assistance Neural Network (OBTAIN), shown in Fig. 3, which aims to provide observational assistance efficiently during interactions between therapists and children. OBTAIN contains two key components: (1) a GCN-based neural network that extracts spatial and temporal features from children's 3D skeleton behavior data, which benefits privacy and data protection; and (2) a multiple instance learning (MIL) network, which predicts a score over training time reflecting the child's personal training state in a therapy session. We train OBTAIN with weakly-supervised learning and a joint MIL loss to cope with the scarcity of labeled data. To verify our method, we evaluate OBTAIN on the DREAM dataset [3] provided by the DREAM project [4]. Although the DREAM dataset does not include training state annotations, we take advantage of its large-scale comparison between RET and SHT data to simulate intervention scenarios offered by therapists of different experience levels. Our recognition results on the DREAM dataset achieve a 0.824 AUC score, which shows OBTAIN can detect where a bad training state happens and locate it in time. Furthermore, we compare our proposed structure-aware GCN block with two classical GCN blocks, GAT and GraphSAGE; the proposed block achieves better performance (0.824 AUC score) than GAT and GraphSAGE (0.812 and 0.809 AUC score). Additionally, combining an attention-based MIL pooling bag loss with the MIL ranking loss increases the performance of our model by approximately 0.017 AUC score. We summarize the contributions of this paper as follows: 1) A weakly-supervised framework for recognizing children's training states during ASD intervention. Weakly-supervised learning relieves the requirement for temporal annotations of training states. 2) A structure-aware GCN block, which unifies spatial and temporal convolution as a message passing operation on a large skeleton-sequence graph to extract movement features. 3) A joint MIL loss, which includes a bag loss for attention-based MIL pooling and a MIL ranking loss for instance state score regression. The self-attention mechanism fits the prediction of relative training states during a therapy session well. 4) Experiments on a large-scale skeleton ASD therapy dataset, DREAM, where our weakly-supervised learning framework achieves a 0.824 AUC score. Compared with classical GraphSAGE and GAT blocks, the proposed structure-aware GCN block helps OBTAIN achieve better performance.

In the remainder of this paper, related work is discussed in Section II. Section III presents the design details of our OBTAIN model. Section IV introduces the DREAM dataset and the data pre-processing for training and evaluation. Training and evaluation performance, with analysis on the DREAM dataset, is shown in Section V, together with a comparison of our structure-aware GCN block against classical GCN blocks and an ablation study of the OBTAIN model. Finally, Section VI concludes the paper.

II. RELATED WORK
In this section, we introduce related work on skeleton-based action recognition, graph representation learning, and multiple instance learning.
A. SKELETON-BASED ACTION RECOGNITION
Numerous deep learning-based methods, including RNN-based, CNN-based, and GCN-based approaches, have been proposed for 3D skeleton-based action recognition. However, RNNs and CNNs are suited to processing vector sequences or 2D grids rather than skeleton data, which is naturally embedded in the form of graphs. Recently, Graph Convolutional Networks (GCNs) have attracted much attention because human 3D-skeleton data is naturally a topological graph rather than the sequence vector or pseudo-image that RNN-based or CNN-based methods deal with [13]. The spatial temporal graph convolutional network (ST-GCN) model presented in [32] constructs a spatial-temporal graph with joints as graph vertices, and with the natural connectivity of the human body and of time as graph edges. AS-GCN [14] and 2s-AGCN [15] improve on ST-GCN by defining and adding more complex spatial graph connections so that more action features can be learned. Most existing skeleton-based GCN action recognition approaches share a similar core component, the ST-GCN block, which is a graph convolutional layer followed by a 1D CNN layer. ST-GCN is thus a special case compared with other well-known graph convolutional structures: its two-step structure aims to learn movement features from spatial-temporal graphs, but it differs from most graph representation learning methods, which aggregate information from neighbors. Although previous works achieve remarkable performance on action recognition tasks, the two-step ST-GCN is problematic due to its increased computational complexity, both in network parameters and in the pre-defined spatial graph structure. Meanwhile, the ability to recognize states across different actions during an intervention is crucial in our state recognition task. Therefore, in this work we attempt to unify spatial and temporal convolution as a message passing operation in a single layer while maintaining the ability to extract hidden patterns from spatial-temporal graphs.

B. GRAPH REPRESENTATION LEARNING
Graph representation learning [26] aims at learning low-dimensional vector representations of nodes to facilitate a better understanding of the semantic relationships among nodes in graphs. It has been adopted in many different applications. For example, in knowledge repositories, node sequences can be regarded as sentences and fed into a skip-gram model to learn low-dimensional node representations. Another important application is predicting relationships in social networks. Recently, there has been a surge of approaches that seek to learn representations encoding structural information about spatial graphs, such as biological protein-protein networks [27] and 3D point clouds [28]. GCNs have made substantial progress in the past decade and achieved great success in those tasks. Considering the recent success of graph representation learning on spatial graphs, our strategy is to apply it to skeleton sequence data for the task of training state recognition during ASD therapy.

C. MULTIPLE INSTANCE LEARNING
Multiple Instance Learning (MIL) is a branch of weakly supervised learning [17]. MIL allows weak supervision of the training data and considers groups of observations, called bags, where ground-truth labels are only available at the bag level. The labels of the observations in a bag, called instances, are unknown.
In a binary classification problem, MIL assumes that a bag is negative if all the instances it contains are negative, and positive if it contains at least one positive instance; the bag label can thus be seen as the maximum over its (unknown) instance labels. Deep MIL has been successfully applied to various domains, such as anomaly detection [18], human action recognition [19], and medical image analysis [20]. However, the MIL ranking loss is formulated as a non-convex problem due to the non-smooth and non-convex hinge loss [29]. In MIL approaches the bag labels are given, so MIL pooling can be adopted to train bag-level predictions. The most common MIL pooling approaches [35] use either mean pooling or max pooling, which are non-trainable, together with a cross-entropy loss instead of a hinge loss. A fully trainable MIL pooling is proposed in [30], which provides insight into the contribution of each instance to the bag label. A loss-based attention mechanism is proposed in [31], which simultaneously learns instance weights, instance predictions, and bag predictions for deep multiple instance learning. Inspired by these MIL pooling approaches, we introduce a bag-level loss into our joint MIL loss to overcome the non-convexity problem in our model's optimization.

III. OBTAIN ARCHITECTURE
In this section, we introduce the design details of our OBTAIN model. The whole framework consists of two phases, a behavior feature extracting phase and a weakly-supervised multiple instance learning phase, as demonstrated in Fig. 3. The two core parts corresponding to these phases are the ‘‘structure-aware GCN Network’’ and the ‘‘Multiple Instance Learning (MIL) Network’’. In the first phase, we design and configure our neural network structures with a focus on both efficacy and efficiency. In the second phase, we use the MIL method to convert training state recognition into a training state score regression problem. Further, we improve the flexibility of MIL pooling with self-attention, which can better predict the relative training state score. Additionally, we train our model with a joint MIL loss, which includes a ‘‘bag loss’’ and a ‘‘MIL ranking loss’’.

FIGURE 3. OBTAIN model consists of two phases: a behavior feature extracting phase and a weakly-supervised multiple instance learning phase. OBTAIN takes two data streams: positive bags contain bad training state instances and negative bags contain only good training state instances.
FIGURE 4. Illustration of the graph structure for a 3-frame skeleton sequence. Joints of the human body are denoted as blue circles, skeleton connections within a single frame are denoted as solid lines, and dashed lines represent the temporal continuity of the same joint across adjacent frames. In each frame, joint nodes have a fixed index order, and the node indexes of adjacent frames are consecutive.

In the rest of this section, we first give a general overview of the behavior feature extracting pipeline (phase 1). Then we explain the design of the structure-aware GCN block. Finally, we describe the details of the multiple instance network and the joint MIL loss (phase 2).

A. BEHAVIOR FEATURE EXTRACTING
The behavior feature extracting pipeline is designed to leverage deep learning to extract feature representations from the original human body skeleton sequence data for downstream training state detection.
As shown in Fig. 3, our extractor pipeline has four components: (1) an input layer; (2) a GCN (graph convolutional network) module; (3) a TCN (temporal convolutional network) module; and (4) an adaptive pooling layer. Besides these neural network modules, a data format module is required to process the raw data.

Input format (graph). Skeleton sequence data contains the 3D coordinates of each human joint in each frame. Previous work usually formats the node set in two dimensions; for example, when constructing a skeleton sequence with $N$ joints and $T$ frames as a spatial-temporal graph, the node set is $V = \{v_{(t,i)} \mid t = 1, \ldots, T;\ i = 1, \ldots, N\}$. In this work, we also employ a graph structure on skeleton sequence data, but we format the node set in one dimension. A graph construction example is presented in Fig. 4, where a three-frame skeleton sequence is represented as a graph $G = (V, E)$, with $E$ the edge set. In Fig. 4, joints of the human body are denoted as blue circles, skeleton connections within a single frame are denoted as solid lines, and dashed lines represent the temporal continuity of the same joint across adjacent frames. In the DREAM dataset, the 3D coordinates of 12 upper-body joints are recorded. We organize the joint sequence of a single frame in order from top to bottom and left to right, then concatenate the next frame at the tail, so the node set is $V = \{v_i \mid i = 1, \ldots, N \times T\}$. The one-dimensional node set helps us unify spatial and temporal graph convolution. The edge set $E$ contains two subsets, the spatial edges and the temporal edges. Spatial edges are defined by the natural topology of human joints, $E_s = \{(v_p, v_q) \mid (p, q) \in \mathrm{Adj}\}$, where $\mathrm{Adj}$ is the adjacency matrix of the joints. Temporal edges can simply be written as $E_t = \{(v_i, v_{i+N})\}$, where $N$ is the number of joints in each frame. Hence, the whole edge set is $E = E_s \cup E_t$.

Input layer. As the first layer of the extractor pipeline, the input layer receives predefined graph data whose node features are the raw 3D coordinates of the joints. A batch normalization layer is therefore added at the beginning to normalize the raw data. Then we use a GCN layer to encode each node feature from 3D coordinates into a 64-channel feature.

GCN module. As stated earlier, a child's state is an indistinct concept, and the same kind of state can occur across different actions. Considering this, we generate embeddings with a shallow network, which is more sensitive to the tiny movement features that reflect mental and emotional state. A GCN operation aggregates feature information from a node's local neighborhood; as the number of layers increases, the receptive range reaches more distant neighbors. A shallow GCN restricts the receptive range to a local neighborhood and thus focuses on short-time dependence. Besides this local characteristic, a shallow network structure is also more lightweight and efficient. In our graph convolutional network module, a novel graph structure-aware GCN layer is adopted as the basic block. We define the first-order neighborhood as the GCN block's aggregation range for each node. Since we have already formatted the node set in one dimension and combined spatial and temporal edges, our graph structure-aware GCN layer can aggregate spatial and temporal features in one layer. That means the pre-defined spatial configuration partitioning usually needed for intra-frame spatial GCN processing is no longer necessary. In addition, residual links are added over the stacked blocks; as shown in Fig. 4, each block residual link connects the features before and after the block.
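To make the input format described above concrete, the following sketch (our own illustration, not code released with the paper) builds the one-dimensional node indexing and the combined spatial/temporal edge list for a skeleton sequence. The 12-joint adjacency list UPPER_BODY_EDGES is a hypothetical placeholder, since the exact DREAM joint ordering is not given here.

```python
import numpy as np

# Hypothetical upper-body skeleton: pairs of joint indices (0..11) that are
# physically connected within one frame. The true DREAM joint order differs.
UPPER_BODY_EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (2, 5), (5, 6), (6, 7),
                    (2, 8), (8, 9), (8, 10), (10, 11)]

def build_skeleton_sequence_graph(num_joints: int, num_frames: int):
    """Return an edge list over the flattened node set V = {v_i | i = 1..N*T},
    combining spatial edges (within a frame) and temporal edges (v_i, v_{i+N})."""
    edges = []
    for t in range(num_frames):
        offset = t * num_joints
        # Spatial edges: natural joint connectivity inside frame t.
        for p, q in UPPER_BODY_EDGES:
            edges.append((offset + p, offset + q))
        # Temporal edges: the same joint in two consecutive frames.
        if t + 1 < num_frames:
            for j in range(num_joints):
                edges.append((offset + j, offset + num_joints + j))
    return np.array(edges).T  # shape (2, num_edges), PyG-style edge_index

def flatten_skeleton_sequence(xyz):
    """xyz: (T, N, 3) joint coordinates -> (N*T, 3) node feature matrix,
    flattened in the same frame-by-frame order as the node indexing."""
    T, N, _ = xyz.shape
    return xyz.reshape(T * N, 3)

edge_index = build_skeleton_sequence_graph(num_joints=12, num_frames=3)
```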
TCN module. Note that action recognition predicts a single action class for a whole skeleton sequence, whereas state recognition must perceive a training state that varies over time; it should be able to locate a child's anomalous state during therapy intervention in a timely manner. After the GCN module extracts movement representation features for all nodes, we deploy a TCN module to learn features in the temporal domain [40]. Before entering the TCN module, we partition the whole sequence into multiple segments according to the temporal resolution required by the anomalous-state localization task. In Fig. 3, we break the original sequence, the bag, into segments, or instances, and the TCN module then extracts the feature of each instance independently. We employ dilated convolutions similar to [33], because the TCN architecture is not only more accurate than canonical recurrent networks such as LSTMs and GRUs but also simpler and clearer. The dilated convolution operation can be formulated as:

$F(s) = \sum_{i=1}^{k} f(i) \cdot x_{s - d \cdot i}$   (1)

where $s$ is the index of an element of the input sequence, $k$ is the filter's kernel size, and $d$ is the dilation factor.

Adaptive pooling layer. Different from [33], we keep the dilation factor of each layer at one and add a max pooling layer between adjacent layers, since only one prediction is needed for each instance (segment). Our TCN blocks therefore still increase the receptive field quickly as layers are stacked, without increasing the dilation, while the added pooling layer reduces the feature size and makes the TCN module more efficient. Each residual block contains a branch leading out to a TCN layer whose outputs are added to the input of the block. The residual blocks allow layers to learn modifications to the identity mapping, which has repeatedly been shown to benefit very deep networks.
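The sketch below illustrates the kind of temporal block described above: a 1D convolution with the dilation factor fixed to one, a residual connection, and max pooling between blocks so the receptive field grows while the sequence shrinks. It is our own minimal sketch under assumed channel and kernel sizes, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One temporal block: dilation-1 convolution, residual link, then max
    pooling so stacked blocks enlarge the receptive field while halving the
    temporal length. Channel and kernel sizes are illustrative assumptions."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2, dilation=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)  # halves the temporal length

    def forward(self, x):            # x: (batch, channels, time)
        out = self.relu(self.conv(x))
        out = out + x                # residual: learn a modification of identity
        return self.pool(out)

# Example: 64-channel features over 1024 frames, three stacked blocks.
tcn = nn.Sequential(TCNBlock(64), TCNBlock(64), TCNBlock(64))
features = torch.randn(8, 64, 1024)
print(tcn(features).shape)           # torch.Size([8, 64, 128])
```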
B. STRUCTURE-AWARE GCN BLOCK
The structure-aware GCN block is the key component of our feature extractor pipeline. It enables us to unify spatial and temporal graph convolution while making sure that both spatial and temporal features can be learned.

FIGURE 5. Structure-aware GCN block. The blue circle denotes a center node; red circles denote neighborhood nodes of the center node. The neighborhood is divided into four parts: ‘‘up and left’’, ‘‘down and right’’, ‘‘previous frame’’, and ‘‘next frame’’. The features of each part are processed independently and then concatenated.

Recall that in the data format part, we store the node set in one dimension and format the skeleton sequence data as one large graph without distinguishing spatial and temporal edges. This lets us efficiently aggregate features from all neighboring nodes in a single layer. Another notable contribution is that the structure-aware GCN block is designed so that the features aggregated along spatial edges and along temporal edges remain distinct. Fig. 5 shows how the structure-aware GCN block aggregates information from a node's local neighbors and generates an embedding feature for that node. As shown in the upper part of Fig. 5, the blue circle in the middle frame denotes the current node at which the structure-aware GCN filter is working, and red circles denote the neighbor nodes within the aggregation range. Note that the upper-body skeleton data has a quite simple spatial structure, as shown in Fig. 4: most nodes have only two spatial neighbors, and each neighborhood part must be followed by its own independent neural network layers. Considering model efficiency, we therefore combine the ‘‘up’’ and ‘‘left’’ spatial relationships, and likewise ‘‘down’’ and ‘‘right’’. We thus partition the aggregation range of each node into four parts, the ‘‘up and left’’ part, the ‘‘down and right’’ part, the ‘‘previous frame’’ part, and the ‘‘next frame’’ part, shown as different color regions in Fig. 5. Formally, we define $N(v_i)$ as the neighborhood of node $v_i$, where $i$ is the index of the node in the whole graph. The neighbor node sets of the four aggregation range parts are then four subsets of $N(v_i)$, defined as:

$N_{ul}(v_i) = \{v_j \mid \forall (v_i, v_j) \in E,\ N > j - i > 0\}$
$N_{dr}(v_i) = \{v_j \mid \forall (v_i, v_j) \in E,\ 0 > j - i > -N\}$
$N_{pf}(v_i) = \{v_j \mid \forall (v_i, v_j) \in E,\ j - i = -N\}$
$N_{nf}(v_i) = \{v_j \mid \forall (v_i, v_j) \in E,\ j - i = N\}$   (2)

where $N$ is the number of joints in one frame and $E$ is the edge set. Neighbor node features are aggregated together if they belong to the same range part. The mean operator takes the element-wise mean of the feature vectors and produces a feature embedding $h^{l-1}_{N_p(v)}$ for each aggregation range part, and four corresponding fully connected layers take the four regions' features and reduce the number of channels to 1/4 of the original:

$h^{l-1}_{N_p(v)} = \mathrm{AGG}_{mean}(\{h^{l-1}_u,\ \forall u \in N_p(v)\}),\quad p \in \{ul, dr, pf, nf\}$   (3)

where $\mathrm{AGG}_{mean}(\cdot)$ is an aggregator with an element-wise mean operator and $l$ is the layer number. The new, shorter features from the four parts are then concatenated and sent to a fusion layer. The fused feature represents the neighbor information of the root node, with features from the different spatial and temporal parts kept in separate channels. The root node feature is fed through another linear layer to transform its representation. The final feature representation output by one structure-aware GCN block is the summation of the fused neighbor feature $h^{l}_{N(v)}$ and the new root feature $h^{l}_v$ in equations (4) and (5):

$h^{l}_{N(v)} = W_f \cdot \mathrm{CAT}(h^{l-1}_{N_{ul}(v)}, h^{l-1}_{N_{dr}(v)}, h^{l-1}_{N_{pf}(v)}, h^{l-1}_{N_{nf}(v)})$   (4)
$h^{l}_v = W_r \cdot h^{l-1}_v$   (5)

where $\mathrm{CAT}(\cdot)$ denotes the concatenation operation, $W_f$ is the weight of the fusion layer, and $W_r$ is the weight of the root layer.
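A minimal sketch of the aggregation in Eqs. (2)-(5) is shown below, assuming the flattened node indexing and the combined edge list from the graph-construction sketch, with each edge listed in both directions. Channel sizes and implementation details are our own assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class StructureAwareGCNBlock(nn.Module):
    """Mean-aggregate four neighborhood parts (up/left, down/right, previous
    frame, next frame), project each to out_ch/4 channels, fuse the
    concatenation (W_f), and add the transformed root feature (W_r)."""

    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.part_fc = nn.ModuleList(nn.Linear(in_ch, out_ch // 4) for _ in range(4))
        self.fuse = nn.Linear(out_ch, out_ch)   # W_f
        self.root = nn.Linear(in_ch, out_ch)    # W_r

    def forward(self, h, edge_index, num_joints):
        # h: (V, in_ch) node features; edge_index: (2, E), each edge in both
        # directions, row 0 = center node i, row 1 = neighbor node j.
        center, neigh = edge_index
        diff = neigh - center                        # j - i, as in Eq. (2)
        masks = [(diff > 0) & (diff < num_joints),   # up & left part
                 (diff < 0) & (diff > -num_joints),  # down & right part
                 diff == -num_joints,                # previous frame part
                 diff == num_joints]                 # next frame part
        parts = []
        for m, fc in zip(masks, self.part_fc):
            agg = torch.zeros_like(h)
            cnt = torch.zeros(h.size(0), 1, device=h.device)
            agg.index_add_(0, center[m], h[neigh[m]])          # sum neighbors
            cnt.index_add_(0, center[m],
                           torch.ones(int(m.sum()), 1, device=h.device))
            parts.append(fc(agg / cnt.clamp(min=1)))           # mean, Eq. (3)
        h_neigh = self.fuse(torch.cat(parts, dim=-1))          # Eq. (4)
        return h_neigh + self.root(h)                          # sum of (4) and (5)
```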
C. MULTIPLE INSTANCE LEARNING FOR REGRESSING TRAINING STATE SCORE
Since large-scale, precise temporal annotation of training states in therapy session records is almost impossible to obtain, we cannot simply learn the patterns as in a standard classification problem. Instead, we treat it as a Multiple Instance Learning (MIL) problem. In this work, we mainly use a MIL ranking loss for optimization. To overcome the optimization difficulty caused by the non-smoothness and non-convexity of the MIL ranking loss [34], we introduce a bag loss based on attention-based MIL pooling. The resulting joint MIL loss is more suitable for instance state score regression than the MIL ranking loss alone.

Problem Statement. In our scenario, we only have session-clip-level annotations. A session clip containing a bad training state is labeled positive; otherwise, it is labeled negative. Following the real-world anomaly detection method of [18], we represent a positive video as a positive bag $B_p$ whose temporal segments are individual instances $(p_1, p_2, \ldots, p_m)$, where $m$ is the number of instances in the bag. We assume that at least one of these instances contains bad training state frames. Similarly, a negative video is denoted by a negative bag $B_n$ whose temporal segments are negative instances $(n_1, n_2, \ldots, n_m)$; in the negative bag, all instances are good training state frames. In this work, we divide each session clip into a fixed number of segments (e.g., 32 segments) during training, and these segments of an intervention session clip are the instances in the bag.

MIL ranking loss. MILnet predicts a training state score for each segment of an intervention session clip. To distinguish good from bad training states, MILnet should predict a higher score for a bad training state frame than for a good one, i.e., $f_{\mathrm{MILnet}}(p_i) > f_{\mathrm{MILnet}}(n_j)$. But since we only have clip-level (bag) annotations, it is impossible to compare every pair of $p_i$ and $n_j$. Hence, we use a multiple instance ranking objective:

$\max_{i \in B_p} f_{\mathrm{MILnet}}(p_i) > \max_{j \in B_n} f_{\mathrm{MILnet}}(n_j)$   (6)

Instead of enforcing ranking on every instance of the bag, the MIL ranking loss enforces ranking only on the two instances with the highest training state scores in the positive and negative bags, respectively. The segment with the highest training state score in the positive bag is most likely to be the true positive instance (a bad training state segment). The segment with the highest training state score in the negative bag is the one that looks most similar to a bad training state segment but is actually a good one. We implement a hinge-based ranking loss [18] for training, in order to keep a large margin between the positive and negative instances:

$l(B_p, B_n) = \max\!\big(0,\ 1 - \max_{i \in B_p} f_{\mathrm{MILnet}}(p_i) + \max_{j \in B_n} f_{\mathrm{MILnet}}(n_j)\big)$   (7)

Further, since an intervention session clip is a sequence of segments, the training state score should vary smoothly between segments. Hence, we minimize the difference of scores between adjacent segments to enforce temporal smoothness of the MILnet predictions. The additional loss term $l_{smooth}$ is:

$l_{smooth} = \sum_{i=1}^{m-1} \big(f_{\mathrm{MILnet}}(p_i) - f_{\mathrm{MILnet}}(p_{i+1})\big)^2$   (8)

Attention-based MIL pooling. Attention-based MIL pooling is more flexible than mean pooling and max pooling, since the attention mechanism is trainable and adaptive. Attention is widely used in deep learning for image captioning and text analysis [30]. Different from existing methods that generate attention weights with a dedicated neural network [30], [35], our method is inspired by self-attention [21], which uses scaled dot-product attention. As shown in Fig. 3, in the multiple instance learning module, the extracted features are transformed by a multilayer perceptron (MLP) head, and the transformed feature embeddings represent the instances. We compute the dot products of each pair of instance features to obtain the self-attention map, an $m \times m$ matrix. We then sum each row and apply a softmax function to obtain the weights; the output is the instance attention weight. With the attention weights, we apply a weighted summation over the instance scores to obtain the bag score prediction.
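The sketch below combines the three ingredients described above into one joint loss. It is our own hedged reading of the method: the exact bag-loss formulation and the weighting factors lam_smooth and lam_bag are illustrative assumptions, not values given in the paper.

```python
import torch
import torch.nn.functional as F

def attention_bag_score(feats, scores):
    """Scaled dot-product self-attention over the m instances of one bag: the
    (m, m) attention map is summed per row and softmax-normalized into instance
    weights; the weighted sum of sigmoid instance scores is the bag score."""
    attn = feats @ feats.t() / feats.size(1) ** 0.5   # (m, m) attention map
    weights = torch.softmax(attn.sum(dim=1), dim=0)   # (m,) instance weights
    return (weights * torch.sigmoid(scores)).sum()    # scalar in (0, 1)

def joint_mil_loss(pos_scores, neg_scores, pos_feats, neg_feats,
                   lam_smooth=1e-3, lam_bag=1.0):
    """pos_scores/neg_scores: (m,) instance scores of a positive/negative bag;
    pos_feats/neg_feats: (m, d) instance embeddings from the MLP head."""
    # Hinge-based MIL ranking loss on the top-scored instances, Eq. (7).
    ranking = F.relu(1.0 - pos_scores.max() + neg_scores.max())
    # Temporal smoothness of adjacent positive-bag scores, Eq. (8).
    smooth = ((pos_scores[1:] - pos_scores[:-1]) ** 2).sum()
    # Attention-pooled bag loss: binary cross-entropy against the bag labels
    # (positive bag = 1, negative bag = 0).
    p_bag = attention_bag_score(pos_feats, pos_scores)
    n_bag = attention_bag_score(neg_feats, neg_scores)
    bag = F.binary_cross_entropy(p_bag, torch.ones_like(p_bag)) + \
          F.binary_cross_entropy(n_bag, torch.zeros_like(n_bag))
    return ranking + lam_smooth * smooth + lam_bag * bag
```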
Since we have the bag labels, a binary cross-entropy loss is easily computed.

IV. DREAM DATASET PREPARATION
In this section, we introduce the DREAM dataset and our data preparation. In order to train and evaluate the OBTAIN model, the data needs to include both pure good-state sessions and labeled sessions with state decline. However, for privacy reasons, no existing dataset contains training state annotations. Fortunately, the published DREAM dataset provides a large-scale comparison between RET (Robot Enhanced Therapy) and SHT (Standard Human Treatment). We use RET and SHT session data to simulate intervention scenarios with different training states. The corresponding evaluation results on the DREAM dataset are presented in Section V.

A. DREAM DATASET DESCRIPTION
The DREAM dataset [3] collected behavioral data during a large-scale evaluation of Robot Enhanced Therapy (RET). The study separated 61 children diagnosed with ASD into two groups: Applied Behavior Analysis (ABA) therapists provided training directly to half of the children (SHT), while a social robot supervised by a therapist interacted with the other children (RET). The therapy sessions for both groups were recorded by three RGB cameras and two RGBD (Kinect) cameras, and all detailed information about the children's behavior during therapy is included. The dataset recorded 3000 therapy sessions, providing a total of more than 300 hours of therapy. Most therapy sessions targeted three social skills, namely imitation (IM), joint attention (JA), and turn-taking in collaborative play (TT), while part of the session data carries no task label. As mentioned, the blurred definition of the training state is a challenge: the training state is not directly tied to particular behaviors. We therefore evaluate our method on both mixed-task data and independent single tasks. Meanwhile, the naturally imbalanced session counts among IM, JA, and TT allow us to test our method with both sufficient and insufficient data.

B. DEFINITION OF TRAINING STATE IN DATASET
The core function of our observational assistance method is recognizing unsatisfactory training states, helping therapists identify children's responses to a training task and adjust their strategy so that the children can achieve the training goal. Considering that every child with ASD has particular strengths and difficulties in various areas, there is no standard definition of satisfactory and unsatisfactory states, which also means labeling the data would be a challenge, not to mention that temporal annotations would be required for long intervention records. For the purpose of bridging the gap in observation ability between experienced and unprepared therapists, we transform this challenge by treating children's behavior during interventions by experienced therapists as a good training state and during interventions by unprepared therapists as a bad training state. The temporal annotation requirement can therefore be dropped, and training data is easier to obtain. Specifically, because robots can attract more attention from children with ASD and, as shown in [24], children might be more motivated to participate in training with them, we consider the behavior during RET intervention sessions as a good training state under experienced therapists. Correspondingly, the behavior during SHT interventions is treated as a bad training state under unprepared therapists.
Although we would not claim that children are always in a good training state during interventions offered by experienced therapists, they perform better most of the time. Hence, it is reasonable to simplify our observational assistance goal to distinguishing the training states associated with interventions provided by providers of different levels: experienced therapists and unprepared therapists (in the DREAM dataset, RET and SHT). In a real intervention scenario, the training state varies over time at different levels, which is one reason that motivates us to provide observational assistance. To identify the states of the children during the intervention and inform the therapist accordingly, we train our model to detect bad training states mixed into good training states. However, the DREAM dataset provides the original therapy records for RET and SHT separately. To better simulate practical training cases and reflect the real fluctuation of training states over time, we pre-process the data by splicing RET and SHT therapy record clips together.

C. DATA PRE-PROCESSING
Before extracting features, we initialize the input data clips with a fixed data format: each clip has a fixed length of N skeleton frames (N = 1024), and each skeleton frame includes J joints (J = 12). The dimension of the initial feature of each joint (its 3D position) is F. For each session clip, we take a length-N RET therapy record as the base and mix an SHT data sequence of random length into a random position of the base to simulate the occurrence of a bad training state. Then we divide each data clip into 32 non-overlapping segments and consider each segment an instance of the bag. The number of segments (32) is set empirically [18]. For each segment, we extract features to represent it.
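A minimal sketch of this clip construction is given below, under our own assumptions: frame-level labels are kept only to evaluate frame-based detection later, the inserted SHT portion is capped at half the clip length, and both source records are assumed long enough.

```python
import numpy as np

def make_training_clip(ret_frames, sht_frames, clip_len=1024, num_segments=32, rng=None):
    """Take a length-1024 RET sequence as the base, splice a random-length SHT
    sequence into a random position to simulate a bad training state, and split
    the result into 32 non-overlapping segments (the bag instances).
    ret_frames / sht_frames: (frames, 12, 3) arrays of joint coordinates."""
    rng = rng or np.random.default_rng()
    clip = ret_frames[:clip_len].copy()
    labels = np.zeros(clip_len, dtype=np.int64)     # 1 = bad (SHT) frame
    # Random length (assumed capped at clip_len // 2) and insertion point.
    bad_len = int(rng.integers(1, clip_len // 2))
    start = int(rng.integers(0, clip_len - bad_len))
    clip[start:start + bad_len] = sht_frames[:bad_len]
    labels[start:start + bad_len] = 1
    # 32 non-overlapping segments become the instances of the bag.
    segments = clip.reshape(num_segments, clip_len // num_segments, 12, 3)
    return segments, labels
```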
V. EVALUATION
This section presents the evaluation results of our proposed OBTAIN framework. First, we evaluate the performance of OBTAIN on the DREAM dataset. Second, we compare our novel structure-aware GCN block with the two most popular classical approaches, GraphSAGE and GAT. Third, we use an ablation study to analyze the effectiveness of each module.

A. EVALUATION METRICS
We evaluate our OBTAIN model with the frame-based receiver operating characteristic (ROC) curve and the corresponding area under the curve (AUC), which are common evaluation metrics for anomaly detection methods. The ROC curve shows the performance of a classification model at all classification thresholds by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) as the threshold varies. Its scale-invariant and classification-threshold-invariant properties make AUC suitable for our application.

FIGURE 6. Testing loss for different training subsets among ALL, IM, TT, JA, and no label.

B. PERFORMANCE ON DREAM DATASET
To evaluate the effectiveness of OBTAIN, we train the OBTAIN model to recognize bad training state segments mixed into good training state session clips. The challenge is that the characteristics of the abnormal state vary greatly across the divergent behaviors of each training topic. Hence, we use different task subsets to show that OBTAIN can generally recognize the states related to divergent behaviors.

Experiment Setup: We use AdamW [41] with a learning rate of 0.00001 and reduce it at epoch 200, when the training loss stops decreasing. The training batch size is set to 1024. We train and test OBTAIN on five subsets covering different intervention tasks (ALL, IM, TT, JA, no label), where ‘‘ALL’’ is the whole training dataset with all intervention tasks and ‘‘no label’’ denotes data without a task label. Each subset is split into a training set (80%) and a testing set (20%).

Performance Analysis: Fig. 6 shows the test loss curves when the OBTAIN model is trained on different subsets, as a function of iterations. Fig. 7 shows the convergence curves for each subset: as the loss converges, the testing AUC scores for the different subsets correspondingly increase and converge. Both figures show that our model has stable performance (more than a 0.8 AUC score) on each independent subset, which means it can recognize the training states associated with the divergent behavior styles presented during interventions targeting different social skills. It should be noted that the JA subset converges noticeably more slowly than the others; the reason is that JA is the smallest subset, containing only about 5% of the whole dataset. In practical cases, several training topics are combined during one training session, so it is difficult to train an independent state recognition model limited to a specific social skill intervention. That is why we not only train separately on each subset but also evaluate the model trained on the whole training dataset. Fig. 8 shows the ROC curves tested on the ALL, IM, TT, JA, and no label test subsets. The areas under the ROC curves of ALL and IM are very similar, around 0.824; TT and JA are slightly lower, around 0.823 and 0.814, respectively; and notably, the ROC on the no label test set is higher than on ALL. This is caused by the imbalance of intervention tasks in the whole training dataset: among the 9352 intervention clip samples in total, there are 1152 IM samples, 6840 TT samples, only 504 JA samples, and 2008 samples without task labels. Overall, children with ASD differ greatly in their adaptability during the intervention, which is reflected in their behavior and in the states that need to be addressed, and our model is able to identify the training states successfully.

FIGURE 7. Testing AUC curves show the model's convergence progress on different training subsets.
FIGURE 8. ROC curve comparison among different test subsets. The model is trained on the whole training set.

In Fig. 9, we present qualitative results of our state recognition method on eight intervention clips. OBTAIN provides successful and timely detection of bad training states by generating high state scores for frames recorded during SHT interventions and low (close to 0) state scores for good states (RET intervention samples), as shown in the first six clips. The last two clips illustrate failure cases: the model predicts a low score for a bad training state and raises some false alarms on a normal intervention clip.

FIGURE 9. State score prediction results on the test set. Window blocks with red color represent the ground truth of the anomalous region. Curves represent the predicted training state score.
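For concreteness, the frame-based AUC used above can be computed by expanding per-segment (instance) scores to frame level and scoring them against the frame labels from the clip construction; a small sketch assuming scikit-learn is available and the shapes given in the docstring.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(segment_scores, frame_labels, frames_per_segment=32):
    """segment_scores: (num_clips, 32) predicted state scores per instance;
    frame_labels: (num_clips, 1024) ground-truth 0/1 frame labels.
    Each segment score is repeated over its 32 frames before scoring."""
    frame_scores = np.repeat(segment_scores, frames_per_segment, axis=1)
    return roc_auc_score(frame_labels.ravel(), frame_scores.ravel())
```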
C. COMPARISON WITH CLASSICAL GCN BLOCKS
To validate that the proposed structure-aware GCN block is more suitable for learning from skeleton sequence data than classical graph representation methods, we evaluated the performance of our structure-aware GCN against the two most popular approaches, GraphSAGE [36] and GAT [37].

Comparison Setup: We replace only the structure-aware GCN blocks in the extractor pipeline with GraphSAGE or GAT blocks, then repeat the experiments on the ALL subset of DREAM with exactly the same experimental setup. We use the published implementations of GraphSAGE and GAT provided by the PyG (PyTorch Geometric) v2.0.4 library [25].

Analysis: Fig. 10 and Fig. 11 show the testing AUC score and loss over 300 epochs of training. From Fig. 10, we can see that the model with our structure-aware GCN block achieves a higher AUC score than GAT and GraphSAGE. Besides the higher AUC score, we also find differences in the ROC curves compared with GAT and GraphSAGE. Fig. 12 shows the ROC curve of the best-performing model among structure-aware GCN, GAT, and GraphSAGE. The ROC curve is plotted with TPR (True Positive Rate) on the y-axis against FPR (False Positive Rate) on the x-axis; TPR is also called recall or sensitivity, and FPR can be written as 1-specificity. Sensitivity and specificity trade off against each other, and as Fig. 12 shows, our structure-aware GCN is more balanced between the two. That means the extractor pipeline with structure-aware GCN blocks learns a better feature distribution for both positive and negative samples. The curves of GAT and GraphSAGE show that models with GAT or GraphSAGE blocks do predict high state scores for abnormal states, but they also predict many high scores for normal states; in that case, the predictor cannot really pick out the valuable information.

FIGURE 10. Testing AUC curve of the proposed structure-aware GCN compared with GAT and GraphSAGE.
FIGURE 11. Testing loss curve of the proposed structure-aware GCN compared with GAT and GraphSAGE.
FIGURE 12. ROC of the proposed structure-aware GCN compared with GAT and GraphSAGE.
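The comparison amounts to a drop-in swap of the neighborhood aggregation layer. A sketch of the swapped layers using the PyG API is shown below; the channel sizes and the placeholder edge_index are assumptions for illustration, and in practice the combined spatial/temporal edges from the graph-construction sketch (listed in both directions) would be used.

```python
import torch
from torch_geometric.nn import SAGEConv, GATConv

# GraphSAGE / GAT layers operating on the same flattened skeleton-sequence
# graph as the structure-aware block; 4 heads * 16 channels = 64 channels out.
sage = SAGEConv(in_channels=64, out_channels=64)
gat = GATConv(in_channels=64, out_channels=16, heads=4)

x = torch.randn(12 * 1024, 64)                # node features for one clip
edge_index = torch.randint(0, 12 * 1024, (2, 200_000))  # placeholder edges
out_sage = sage(x, edge_index)                # (12288, 64)
out_gat = gat(x, edge_index)                  # (12288, 64)
```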
D. ABLATION STUDY
In our OBTAIN model, the extractor pipeline consists of the input layer, a structure-aware GCN module, and a TCN module, and the joint MIL loss function includes a bag loss and a MIL ranking loss. To analyze the effectiveness of each module, we further present an ablation study comparing variant combinations of modules trained with different loss functions. To ensure the integrity of the model, we adopt a linear layer to replace the GCN-based input layer, or global adaptive pooling to replace the TCN module. The four variants are: 1) Linear + GCN blocks + global pooling; 2) Linear + GCN blocks + TCN; 3) GCN + GCN blocks + global pooling; 4) GCN + GCN blocks + TCN. In addition, we separately train these four variants with the proposed joint MIL loss and with the MIL ranking loss only.

TABLE 1. AUC scores on different ablation settings trained with different MIL loss functions.
FIGURE 13. Testing AUC scores on the different ablation settings in Table 1 show a decreasing trend, which is caused by the non-smoothness and non-convexity of the MIL ranking loss.

Analysis: The results are shown in Table 1. We first consider the ranking loss only and set variant 1 as the baseline. Employing the TCN module instead of simple adaptive global pooling brings a 0.04 increase in the AUC score. Choosing a GCN block as the input layer brings around a 0.021 increase in AUC score compared to using a linear layer, and variant 4 has a 0.046 higher AUC score than the baseline. The previous experiment already showed the effectiveness of our structure-aware GCN block; we therefore conclude that the GCN-based input layer is beneficial for subsequent spatial-temporal feature extraction, and that the TCN module has a larger impact on temporal abnormal-state prediction than the GCN-based input layer. The joint MIL loss results in Table 1 show a similar trend: the input layer module and the TCN module increase the OBTAIN model's performance under either loss function. Meanwhile, comparing the two AUC score columns, training with the joint MIL loss achieves better performance than the MIL ranking loss alone on all four variants, which means the proposed joint MIL loss directly increases the performance of our model without any change to the model structure. Furthermore, we found a decreasing AUC trend when training with the MIL ranking loss only. Fig. 13 shows the testing AUC scores of the four variants trained with the MIL ranking loss: all four curves reach their best performance before 50 epochs and then begin to decrease. This decrease is caused by the non-smoothness and non-convexity of the MIL ranking loss, with the optimization stuck in a local optimum; as a weakly-supervised approach, a reduction of the loss does not guarantee an improvement in performance. The results in Fig. 7 and Fig. 10, trained with the proposed joint MIL loss, show no decreasing trend and achieve better performance, which demonstrates that our joint MIL loss overcomes the non-smoothness and non-convexity problem of the MIL ranking loss.

VI. CONCLUSION
In this paper, we propose OBTAIN, a weakly-supervised children's training state recognition framework that provides observational assistance to therapists during ASD intervention. OBTAIN includes an extractor pipeline that encodes feature representations of children's behavior. We design a structure-aware GCN block that unifies spatial and temporal convolutional operations in the same layer to extract features. A MILnet is proposed to learn to identify children's special behavior by distinguishing different training states. We introduce a joint MIL loss that combines MIL ranking-based optimization with MIL pooling-based optimization. The experiments on the DREAM dataset achieve a 0.824 AUC score, which shows our model can successfully distinguish children's different training states. Moreover, the proposed structure-aware GCN block achieves better performance than GAT and GraphSAGE (0.812 and 0.809 AUC score). Additionally, our joint MIL loss improves the performance of our model over the MIL ranking loss by around 0.017 AUC score. A limitation of this work is that we evaluate the proposed OBTAIN only on the DREAM dataset, which is the only published dataset of behavioral data from ASD therapy sessions labeled with one of two conditions, Robot Enhanced Therapy or Standard Human Treatment. We used it in [38] to explore machine learning concepts for analyzing the behavioral data; in this paper we design a new method and obtain much better performance.
Our future goal is to utilize OBTAIN to aid teletherapy for children with ASD, as teletherapy can make therapy more accessible to those living in remote or underserved areas by reducing the need for in-person sessions. Nonetheless, therapists may be limited in observing their patients without in-person interventions; we therefore plan to use OBTAIN to enhance the quality of teletherapy for ASD patients. In conclusion, the proposed OBTAIN is a successful attempt at building an observation model that recognizes children's training states during ASD intervention. We believe that OBTAIN is a first milestone toward enhancing long-term, high-quality intervention with deep learning.

REFERENCES
[1] F. Chiarotti and A. Venerosi, ‘‘Epidemiology of autism spectrum disorders: A review of worldwide prevalence estimates since 2014,’’ Brain Sci., vol. 10, no. 5, pp. 274–295, 2020.
[2] G. Bertamini, A. Bentenuto, S. Perzolli, and E. Paolizzi, ‘‘Quantifying the child–therapist interaction in ASD intervention: An observational coding system,’’ Brain Sci., vol. 11, no. 3, pp. 366–389, 2021.
[3] E. Billing, T. Belpaeme, H. Cai, H.-L. Cao, A. Ciocan, C. Costescu, D. David, R. Homewood, D. Hernandez Garcia, P. Gómez Esteban, H. Liu, V. Nair, S. Matu, A. Mazel, M. Selescu, E. Senft, S. Thill, B. Vanderborght, D. Vernon, and T. Ziemke, ‘‘The DREAM dataset: Supporting a data-driven study of autism spectrum disorder and robot enhanced therapy,’’ PLoS ONE, vol. 15, no. 8, pp. 1–15, Aug. 2020.
[4] H.-L. Cao, ‘‘Robot-enhanced therapy: Development and validation of supervised autonomous robotic system for autism spectrum disorders therapy,’’ IEEE Robot. Autom. Mag., vol. 26, no. 2, pp. 49–58, Jun. 2019.
[5] R. A. J. de Belen, T. Bednarz, A. Sowmya, and D. Del Favero, ‘‘Computer vision in autism spectrum disorder research: A systematic review of published studies from 2009 to 2019,’’ Translational Psychiatry, vol. 10, no. 1, pp. 1–20, Sep. 2020.
[6] H. Li, N. A. Parikh, and L. He, ‘‘A novel transfer learning approach to enhance deep neural network classification of brain functional connectomes,’’ Frontiers Neurosci., vol. 12, pp. 1–15, Jul. 2018.
[7] M. Leo, M. D. Coco, P. Carcagni, C. Distante, M. Bernava, and G. Pioggia, ‘‘Automatic emotion recognition in robot-children interaction for ASD treatment,’’ in Proc. ICCV Workshops, 2015, pp. 145–153.
[8] M. Jiang and Q. Zhao, ‘‘Learning visual attention to identify people with autism spectrum disorder,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 3267–3276.
[9] S. Piana, C. Malagoli, M. C. Usai, and A. Camurri, ‘‘Effects of computerized emotional training on children with high functioning autism,’’ IEEE Trans. Affect. Comput., vol. 12, no. 4, pp. 1045–1054, Oct. 2021.
[10] A. Zunino, P. Morerio, A. Cavallo, C. Ansuini, J. Podda, F. Battaglia, E. Veneselli, C. Becchio, and V. Murino, ‘‘Video gesture analysis for autism spectrum disorder detection,’’ in Proc. 24th Int. Conf. Pattern Recognit. (ICPR), Aug. 2018, pp. 3421–3426.
[11] O. Rudovic, J. Lee, L. Mascarell-Maricic, B. W. Schuller, and R. W. Picard, ‘‘Measuring engagement in robot-assisted autism therapy: A cross-cultural study,’’ Frontiers Robot. AI, vol. 4, pp. 1–17, Jul. 2017.
[12] H. Javed and C. Park, ‘‘Behavior-based risk detection of autism spectrum disorder through child-robot interaction,’’ in Proc. HRI, 2020, pp. 275–277.
[13] B. Ren, M. Liu, R. Ding, and H.
Liu, ‘‘A survey on 3D skeleton-based action recognition using learning method,’’ 2020, arXiv:2002.05907. [14] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, ‘‘Actionalstructural graph convolutional networks for skeleton-based action recognition,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3595–3603. [15] L. Shi, Y. Zhang, J. Cheng, and H. Lu, ‘‘Two-stream adaptive graph convolutional networks for skeleton-based action recognition,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 12026–12035. [16] F. Shi, C. Lee, L. Qiu, Y. Zhao, T. Shen, S. Muralidhar, T. Han, S.-C. Zhu, and V. Narayanan, ‘‘STAR: Sparse transformer-based action recognition,’’ 2021, arXiv:2107.07089. [17] Z.-H. Zhou, ‘‘A brief introduction to weakly supervised learning,’’ Nat. Sci. Rev., vol. 5, no. 1, pp. 44–53, Jan. 2017. [18] W. Sultani, C. Chen, and M. Shah, ‘‘Real-world anomaly detection in surveillance videos,’’ in Proc. CVPR, 2018, pp. 6479–6488. [19] S. Ali and M. Shah, ‘‘Human action recognition in videos using kinematic features and multiple instance learning,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 2, pp. 288–303, Feb. 2010. [20] K. Das, S. Conjeti, J. Chatterjee, and D. Sheet, ‘‘Detection of breast cancer from whole slide histopathological images using deep multiple instance CNN,’’ IEEE Access, vol. 8, pp. 213502–213511, 2020. [21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11. [22] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, ‘‘Transformers are RNNs: Fast autoregressive transformers with linear attention,’’ in Proc. ICML, 2020, pp. 5156–5165. [23] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang, ‘‘SimGNN: A neural network approach to fast graph similarity computation,’’ in Proc. 12th ACM Int. Conf. Web Search Data Mining, Jan. 2019, pp. 384–392. [24] M. Coeckelbergh, C. Pop, R. Simut, A. Peca, S. Pintea, D. David, and B. Vanderborght, ‘‘A survey of expectations about the role of robots in robot-assisted therapy for children with ASD: Ethical acceptability, trust, sociability, appearance, and attachment,’’ Sci. Eng. Ethics, vol. 22, no. 1, pp. 47–65, Feb. 2016. [25] M. Fey and J. E. Lenssen, ‘‘Fast graph representation learning with PyTorch geometric,’’ 2019, arXiv:1903.02428. [26] W. L. Hamilton, R. Ying, and J. Leskovec, ‘‘Representation learning on graphs: Methods and applications,’’ 2017, arXiv:1709.05584. [27] Y. Hou, H. Chen, C. Li, J. Cheng, and M.-C. Yang, ‘‘A representation learning framework for property graphs,’’ in Proc. 25th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 65–73. [28] S. Kim, J. Park, and B. Han, ‘‘Rotation-invariant local-to-global representation learning for 3D point cloud,’’ in Proc. NeurIPS, vol. 33, 2020, pp. 8174–8185. [29] Y. Hu, M. Li, and N. Yu, ‘‘Multiple-instance ranking: Learning to rank images for image retrieval,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2008, pp. 1–8. [30] M. Ilse, J. Tomczak, and M. Welling, ‘‘Attention-based deep multiple instance learning,’’ in Proc. Mach. Learn. Res., vol. 80, Jul. 2018, pp. 2127–2136. [31] X. Shi, F. Xing, Y. Xie, Z. Zhang, L. Cui, and L. Yang, ‘‘Loss-based attention for deep multiple instance learning,’’ in Proc. AAAI Conf. Artif. Intell., vol. 34, no. 4, pp. 5742–5749, Apr. 2020. VOLUME 11, 2023 [32] S. Yan, Y. Xiong, and D. 
Lin, ‘‘Spatial temporal graph convolutional networks for skeleton-based action recognition,’’ in Proc. AAAI, 2018, pp. 1–9. [33] S. Bai, J. Z. Kolter, and V. Koltun, ‘‘An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,’’ 2018, arXiv:1803.01271. [34] C. Bergeron, G. Moore, J. Zaretzki, C. M. Breneman, and K. P. Bennett, ‘‘Fast bundle algorithm for multiple-instance learning,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1068–1079, Jun. 2012. [35] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, ‘‘Deep multi-instance networks with sparse label assignment for whole mammogram classification,’’ in Proc. MICCAI, 2017, pp. 603–611. [36] W. Hamilton, Z. Ying, and J. Leskovec, ‘‘Inductive representation learning on large graphs,’’ in Advances in Neural Information Processing Systems, vol. 30. Red Hook, NY, USA: Curran Associates, 2017. [37] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, ‘‘Graph attention networks,’’ in Proc. ICLR, 2017, pp. 20–32. [38] M. Wang and N. Yang, ‘‘OTA-NN: Observational therapy-assistance neural network for enhancing autism intervention quality,’’ in Proc. IEEE 19th Annu. Consum. Commun. Netw. Conf. (CCNC), Jan. 2022, pp. 1–7. [39] G. Li, M. Müller, A. Thabet, and B. Ghanem, ‘‘DeepGCNs: Can GCNs go as deep as CNNs?’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 9267–9276. [40] A. Sabater, L. Santos, J. Santos-Victor, A. Bernardino, L. Montesano, and A. C. Murillo, ‘‘One-shot action recognition in challenging therapy scenarios,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2021, pp. 2777–2785. [41] I. Loshchilov and F. Hutter, ‘‘Decoupled weight decay regularization,’’ 2017, arXiv:1711.05101. MINXIAO WANG received the B.S. degree in electronic and communication engineering and the M.S. degree in electronic and information engineering from the Civil Aviation University of China, in 2014 and 2018, respectively. He is currently pursuing the Ph.D. degree with the School of Electrical, Computer, and Biomedical Engineering, Southern Illinois University, Carbondale, IL, USA. His research interests include developing machine learning and deep learning models for diverse applications related to action recognition, object detection, and network intrusion detection. NING YANG (Member, IEEE) received the M.S. degree in computer engineering from the University of Massachusetts, Amherst, MA, USA, in 2006, and the Ph.D. degree in computer engineering from Southern Illinois University, Carbondale, IL, USA, in 2020, where she is currently an Assistant Professor with the School of Computing, Information Technology Program. Her research interests include machine learning, network security, and network intrusion detection. 31961