Received 26 February 2023, accepted 20 March 2023, date of publication 29 March 2023, date of current version 3 April 2023.
Digital Object Identifier 10.1109/ACCESS.2023.3263117
OBTAIN: Observational Therapy-Assistance
Neural Network for Training State Recognition
MINXIAO WANG1 AND NING YANG2, (Member, IEEE)
1 School of Electrical, Computer and Biomedical Engineering, Southern Illinois University, Carbondale, IL 62901, USA
2 School of Computing, Southern Illinois University, Carbondale, IL 62901, USA
Corresponding author: Ning Yang (nyang@siu.edu)
This work was supported in part by the U.S. NSF under Grant CC-2018919.
ABSTRACT Children with autism spectrum disorder (ASD) often require long-term and high-quality intervention. An important factor affecting the quality of the intervention is therapists' observation skill, by which therapists can opportunely adjust their intervention strategies based on children's states. However, there is a shortage of experienced therapists, and observational skill is time-consuming for junior therapists to acquire. This motivates us to use data-driven deep learning methods to build an OBservational Therapy-AssIstance Neural network (OBTAIN), a weakly-supervised learning framework for therapy training state recognition. OBTAIN first represents children's skeleton-sequence data as a large graph. Then a graph representation learning module is used to extract training state features. To learn spatial-temporal behavior features more effectively and efficiently, a novel structure-aware GCN block is designed in OBTAIN's learning module. Finally, a MILnet and a corresponding joint MIL loss are used to learn state score prediction from the extracted features. Experimental results (0.824 AUC score) on the DREAM dataset demonstrate that OBTAIN can effectively recognize the training states in autism intervention.
INDEX TERMS Autism, therapy assistance, weakly-supervised learning, GCN, multiple instance learning.
I. INTRODUCTION
Autism Spectrum Disorder (ASD) is a neurodevelopmental
disorder. Children with ASD can benefit from long-term
intervention. Some therapeutic treatments, such as Applied Behavior Analysis (ABA), the Relationship Development Intervention (RDI) model, and Cognitive-Behavior Therapy (CBT), have established standard protocols that can be delivered in education, clinic, community, or home settings.
In the typical intervention session, therapists set a training
goal and the task training cycle could be broken down into
two basic steps. First, the therapist gives instructions, and
the child responds. Then, based on observing the child’s
response, the therapist identifies the child’s mood, then
adjusts for the next instruction. Since children with ASD
usually have their own strengths, difficulties, and other
special needs, the same instructions may work for some
children but are ignored by others. Children may not be able
to express their feelings through language, but their eye contact and body movements signal whether they
understand the instruction and are willing to follow it. Usually,
it can take several cycles for the therapist to find the best
way to communicate with the child to get one task well
done. In this child–therapist interaction, the key component is
observation. The intervention training could effectively move
on if the child’s response is recognized properly and promptly
by the therapist.
Currently, this observational process is handled by
therapists only. For senior therapists, their observational
experience can help them better understand children’s
special behavior and provide dynamic adaptation. Junior therapists who lack observational experience might learn the typical behaviors of children with ASD during professional education and accumulate experience through intervention practice. However, most observation skills are learned through individual cases, making them hard to teach directly and effectively. Nowadays, most states face
a shortage of experienced providers for ASD intervention.
This motivates us to provide some observational assistance,
which will help quickly fill the gap of observation ability
between experienced and junior therapists, thus improving
the effectiveness of ASD intervention.
FIGURE 1. Traditional therapy intervention.
FIGURE 2. OBTAIN assisted therapy intervention.
To achieve our long-term goal of enhancing autism
intervention quality, the first step is building an observation system to recognize children’s training states during
the intervention. Considering that experienced therapists
improve their observational skills through practice in actual
interventions, we believe that recently advanced data-driven
deep learning techniques have great potential for solving this
task. Fortunately, many therapists recorded their intervention
sessions for further analysis. Deep learning models can learn
general knowledge and patterns from the recorded data more
efficiently than humans.
In recent years, machine learning (ML), especially deep
learning (DL) techniques have been applied to autism
research. For example, many ASD diagnosis studies take
the advantage of deep learning for detecting different data
types including magnetic resonance imaging (MRI) [6],
facial expression [7], eye gaze [8], movement pattern [9],
[10], and multimodal data [11]. Besides ASD diagnosis,
Robotic Assisted Therapy (RAT) also gains attention. Social
robots in RAT can improve the quality of therapy, acting
as a mediator between therapists and children with ASD,
especially improving the effects from therapists to children
with ASD, as the blue arrow on the right side in Fig.1
demonstrates. Different from previous works, we focus on another direction, as Fig.2 shows: a deep learning-based observational assistance method that helps therapists identify children's states as reflected by their behavior.
In our study, it is challenging to provide learning-based real-time therapy training state recognition due to the following difficulties:
1) Limited data. Data collection and annotation are very costly; only large-scale clinical experiments can provide the needed dataset. In addition, due to privacy concerns, only high-level perception data from sensors, such as skeleton data, are available for training the DL model.
2) Blurred definition of training state. A good or bad training state is defined relative to historical and ongoing individual performance. As an abstract concept, the training state may not be directly relevant to particular behaviors. No existing dataset has been annotated with training states.
3) Efficiency requirement for real-world scenarios. During real-world therapy intervention, the observational system is expected to process input in real time and generate predictions efficiently.
In this work, we propose an Observational Therapy-Assistance Neural Network (OBTAIN) as shown in Fig.3,
which aims at providing observational assistance efficiently
during the interactions between therapists and children.
Our OBTAIN model contains two key components: (1) a
GCN-based neural network to extract spatial and temporal
features from children’s 3D skeleton behavior data, which
is beneficial to privacy and data protection; (2) a multiple instance learning (MIL) network, which predicts a score over time reflecting the child's personal training state in a therapy session. We use weakly-supervised learning to train the OBTAIN model with a joint MIL loss to overcome the scarcity of labeled data.
To verify our method, we evaluate our OBTAIN model on the DREAM dataset [3] provided by the DREAM project [4]. Although the DREAM dataset does not include training state annotations, we take advantage of the large-scale comparison between the provided RET and SHT data to simulate intervention scenarios offered by therapists of different levels. Our recognition results on the DREAM dataset achieve a 0.824 AUC score, which shows OBTAIN can successfully detect when a bad training state happens and locate it in time. Furthermore, we compare our proposed structure-aware GCN block with two classical GCN blocks: GAT and GraphSAGE. The proposed structure-aware GCN block achieves better performance (0.824 AUC score) than GAT and GraphSAGE (0.812 and 0.809 AUC score). Additionally, combining an attention-based MIL pooling bag loss with the MIL ranking loss also increases the performance of our model by approximately 0.017 AUC score. We summarize the
contributions of this paper as follows:
1) A weakly-supervised framework for recognizing children’s training states during ASD intervention. The
weakly-supervised learning relieves the requirement for
temporal annotations of training states.
2) A structure-aware GCN block, which unifies spatial and
temporal convolution as the message passing operation
on a large skeleton-sequence graph to extract movement
features.
3) A joint MIL loss, which includes a bag loss for attention-based MIL pooling and a MIL ranking loss for instance-level state score regression. The self-attention mechanism fits well with predicting relative training states during a therapy session.
4) Experiments are conducted on a large-scale skeleton ASD therapy dataset, DREAM, on which our weakly-supervised learning framework achieves a 0.824 AUC score. Compared with classical GraphSAGE and GAT blocks, the proposed structure-aware GCN block helps OBTAIN achieve better performance.
In the remainder of this paper, related work is discussed
in Section II. Section III presents the design details of
our OBTAIN model. Section IV introduces the DREAM
dataset and data pre-processing for training and evaluating.
The training and evaluation performance with analysis on
DREAM dataset is shown in Section V. Meanwhile, we also
present the comparison of our structure-aware GCN block
against classical GCN blocks and the ablation study of our
OBTAIN model. Finally, Section VI concludes the paper.
II. RELATED WORK
In this section, we introduce related work on skeleton-based action recognition, graph representation learning, and multiple instance learning.
A. SKELETON-BASED ACTION RECOGNITION
Numerous deep learning-based methods, such as RNN-based, CNN-based, and GCN-based approaches, have been proposed for 3D skeleton-based action recognition. However, both RNNs and CNNs are suited to processing vector sequences or 2D grids rather than skeleton data, which are naturally embedded in the form of graphs. Recently, Graph Convolutional Networks (GCNs) have attracted much attention, as human 3D-skeleton data is naturally a topological graph instead of the sequence vector or pseudo-image that RNN-based or CNN-based methods deal with [13]. The spatial temporal graph convolution network (ST-GCN) model is presented in [32], which constructs a spatial-temporal graph with joints as graph vertices and natural connectivity in both human body structure and time as graph edges. AS-GCN [14] and 2s-AGCN [15] have proposed improvements to ST-GCN that obtain better performance by defining and adding more complex spatial graph connections so that more action features can be learned.
Most existing skeleton-based GCN action recognition
approaches share a similar core component: the ST-GCN block, which is a graph convolutional layer followed by a 1D-CNN layer. ST-GCN is thus a special case compared with other well-known graph convolutional structures. Its two-step structure aims to learn movement features from spatial-temporal graphs, but it differs from most graph representation learning methods, which aggregate information from neighbors. Although most previous works achieved remarkable performance on action recognition tasks, the two-step
ST-GCN is problematic due to the increased computational
complexity in both network parameters and pre-defined
spatial graph structure. Meanwhile, the ability of our model
to recognize the states during intervention across different
actions is crucial in our state recognition task. Therefore,
in this work, we make an attempt to unify spatial and temporal
convolution as the message passing operation in a single layer
but still maintain the ability to extract hidden patterns from
spatial-temporal graphs.
B. GRAPH REPRESENTATION LEARNING
Graph representation learning [26] aims at learning low-dimensional vector representations of nodes to facilitate a better understanding of the semantic relationships among nodes in graphs. It is adopted in many different applications. For example, in knowledge repositories, node sequences can be regarded as sentences and fed into the skip-gram model to learn low-dimensional node representations. Another important application is predicting relationships in social networks.
Recently, there has been a surge of approaches that seek to learn representations that encode structural information about spatial graphs, such as biological protein-protein networks [27] and 3D point clouds [28]. GCNs have made a lot of progress in the past decade and achieved great success in those tasks. Considering the recent success of graph representation learning on spatial graphs, our strategy is to apply it to skeleton sequence data for the task of training state recognition during ASD therapy.
C. MULTIPLE INSTANCE LEARNING
Multiple Instance Learning (MIL) is a branch of weakly
supervised learning [17]. MIL allows weak supervision of
the training data and considers groups of observations, called
bags, where ground-truth labels are only available at the
bag level. The labels of the observations in a bag, called
instances, are unknown. In a binary classification problem,
MIL assumes that a bag is identified as negative if all the instances it contains are negative. However, if it contains at least one positive instance, the bag will be marked as positive. Deep MIL has been successfully applied to
various domains, such as anomaly detection [18], human
action recognition [19], and medical image analysis [20].
However, the MIL ranking loss leads to a non-convex optimization problem due to the non-smooth and non-convex hinge loss [29].
In MIL approaches, the bag labels are given. Therefore, MIL pooling can be adopted to train bag-level predictions. Most common MIL pooling approaches [35] utilize either mean pooling or max pooling, which are non-trainable, with a cross-entropy loss instead of a hinge loss. A fully trainable
MIL pooling is proposed in [30], which provides insight
into the contribution of each instance to the bag label.
A loss-based attention mechanism is proposed in [31], which
simultaneously learns instance weights and predictions, and
bag predictions for deep multiple instance learning. Inspired
by MIL pooling approaches, we introduce a bag level loss to
our joint MIL loss to overcome the non-convex problem in
our model’s optimization.
III. OBTAIN ARCHITECTURE
In this section, we introduce the design details of our
OBTAIN model. The whole framework consists of two phases: a behavior feature extracting phase and a weakly-supervised multiple instance learning phase, which are demonstrated in Fig.3. The two core parts corresponding to those two phases are the ''structure-aware GCN Network'' and the ''Multiple Instance Learning (MIL) Network''. In the first
phase, we design and configure our neural network structures
focusing on both efficacy and efficiency. In the second
phase, we use the MIL method to convert training state recognition into a training state score regression problem.
FIGURE 3. OBTAIN model consists of two phases: the behavior feature extracting phase and the weakly-supervised multiple instance learning phase. Two data streams are taken by OBTAIN: positive bags contain bad training state instances and negative bags contain only good training state instances.
FIGURE 4. Illustration of graph structure for 3 frames skeleton
sequences. Joints of the human body are denoted as blue circles,
skeleton connections in a single frame are denoted as solid lines, and
dashed lines present the timely continuity of the same joint from
adjacent frames. In each frame, joint nodes are set with fixed sequences
index. The node indexes between adjacent frames are consecutive.
Further, we improve the flexibility of MIL pooling with self-attention, which can better predict the relative training state
score. Additionally, we use joint MIL loss, which includes
a ‘‘bag loss’’ and a ‘‘MIL ranking loss’’ to train our model.
In the rest of this section, we first generally overview the
behavior feature extracting pipeline (phase 1). Then we
specifically explain the design of a structure-aware GCN
block. Finally, we describe the details of the multiple instance network and the joint MIL loss (phase 2).
A. BEHAVIOR FEATURE EXTRACTING
The behavior feature extracting pipeline is designed to
leverage deep learning to extract feature representations from
original human body skeleton sequence data for downstream training state detection. As shown in Fig.3, our extractor
pipeline has four components: (1) input layer; (2) GCN
(Graph Convolutional Networks) module; (3) TCN (temporal
convolutional neural networks) module; (4) adaptive pooling
layer. Besides neural network modules, a data format module
is required to process raw data.
Input format (Graph). Skeleton sequence data contains 3D
coordinates of each human joint in each frame. Previous
work usually formatted the node set in two dimensions. For
example, constructing a skeleton sequence with N joints and
T frames as a spatial-temporal graph, the node set is presented as V = {v_{(t,i)} | t = 1, . . . , T; i = 1, . . . , N}. In this work, we also employ a graph structure on skeleton sequence data, but we format the node set in one dimension. A graph construction example is presented in Fig.4, where three frames of skeleton data are represented as a graph G = (V, E),
where E is the edge set. In Fig.4, joints of the human
body are denoted as blue circles, skeleton connections in
the single frame are denoted as solid lines, and dashed lines
present the timely continuity of the same joint from adjacent
frames. In the DREAM dataset, the 3D coordinates of 12 upper-body joints are recorded. We organize the joint sequence of a single frame from top to bottom and left to right, then concatenate the next frame at the tail. So the node set is denoted as V = {v_i | i = 1, . . . , N × T}. The one-dimensional node set helps us unify spatial and temporal graph convolution.
The edge set E contains two subsets, the spatial edges and the temporal edges. The spatial edges are defined by the natural topology of human joints, denoted as E_s = {(v_p, v_q) | (p, q) ∈ Adj}, where Adj is the adjacency matrix of joints. The temporal edge set can be presented as E_t = {(v_i, v_{i+N})}, where N is the number of joints in each frame. Hence, the whole edge set is E = {E_s, E_t}.
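To make this construction concrete, the sketch below builds the one-dimensional node index and the combined edge set E = {E_s, E_t} for a toy skeleton; the joint count and adjacency pairs are illustrative assumptions rather than the exact DREAM configuration.

```python
def build_skeleton_graph(num_joints, num_frames, spatial_pairs):
    """Build the one-dimensional node set and combined edge set described
    above: node v_i with i = t * N + j indexes joint j of frame t."""
    num_nodes = num_joints * num_frames
    edges = []
    # Spatial edges E_s: natural skeleton connections repeated in every frame.
    for t in range(num_frames):
        offset = t * num_joints
        for (p, q) in spatial_pairs:
            edges.append((offset + p, offset + q))
    # Temporal edges E_t: the same joint in adjacent frames, (v_i, v_{i+N}).
    for i in range(num_nodes - num_joints):
        edges.append((i, i + num_joints))
    return num_nodes, edges

# Toy example: 4 upper-body joints over 3 frames (hypothetical adjacency).
num_nodes, edges = build_skeleton_graph(num_joints=4, num_frames=3,
                                        spatial_pairs=[(0, 1), (1, 2), (1, 3)])
print(num_nodes)      # 12 nodes
print(edges[:6])      # the first few (spatial) edges
```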
Input layer. As the first layer of the extractor pipeline,
the input layer receives predefined graph data whose nodes’
features are raw 3D coordinates of joints. Therefore, a data
batch normalization layer is added at the beginning to
normalize raw data. Then, we use a GCN layer to encode
the node feature from 3D coordinates to a feature with
64 channels.
GCN module. As stated in the previous part, children’s
state is an indistinct concept. The same kind of state can cross
different actions. Considering this difference, we complete embedding generation with a shallow network, which is more sensitive to the tiny movement features that reflect mental and spiritual state. The GCN operation aggregates feature information from a node's local neighborhood; with an increasing number of layers, the receptive range reaches more distant neighbors. A shallow GCN can restrict the receptive
range within a local neighborhood, especially focusing
on short-time dependence. Besides the local characteristic,
shallow network structure also means more lightweight and
efficient.
In our graph convolutional network module, a novel graph
structure-aware GCN layer is adopted as a basic block.
We define the first order neighborhood as GCN block’s aggregate range on each node. Since we have already formatted
the node set in one dimension and combined spatial edge and
temporal edge together, our graph structure-aware GCN layer
can aggregate spatial and temporal features in one layer. That
means pre-defined spatial configuration partitioning, which is
usually needed for processing spatial GCN intra-frame, is no
longer necessary. In addition, residual links are added over stacked blocks; as shown in Fig.4, the block residual link connects the features before and after each block.
TCN module. Note that action recognition predicts a single action class for the whole skeleton sequence, while state recognition needs to perceive a training state that varies with time. It should be able to locate children's anomalous states during therapy intervention in a timely manner.
After the GCN module extracts movement representation
features for all nodes, we deploy a TCN module to learn
features in the temporal domain [40]. Before entering the TCN module, we need to partition the whole sequence into multiple segments according to the temporal resolution required for the anomaly-state locating task. In Fig.3, we break the
original sequence, the bag, into segments, or instances.
Then TCN module will extract the feature of each instance
independently. We employ similar dilated convolutions as
in [33], because the TCN architecture is not only more
accurate than canonical recurrent networks such as LSTMs
and GRUs, but also simpler and clearer. Dilated convolution
operation can be formulated as:
F(s) =
k
X
f (i) · xs−d·i
(1)
i=1
where s denotes the index of an element of the input sequence, k denotes the filter's kernel size, and d denotes the dilation factor.
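To make Eq. (1) concrete, the short sketch below evaluates the dilated convolution directly on a toy sequence; the boundary handling (computing outputs only where all required past samples exist) is a simplifying assumption.

```python
import numpy as np

def dilated_conv_1d(x, f, d):
    """Directly evaluate Eq. (1): F(s) = sum_{i=1..k} f(i) * x[s - d*i],
    restricted to positions s where every needed past sample exists."""
    k = len(f)
    return np.array([sum(f[i - 1] * x[s - d * i] for i in range(1, k + 1))
                     for s in range(d * k, len(x))])

x = np.arange(10.0)                 # toy input sequence
f = np.array([0.5, 0.25, 0.25])     # toy filter with kernel size k = 3
print(dilated_conv_1d(x, f, d=2))   # dilation factor d = 2
```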
Adaptive pooling layer. Different from [33], we modify the structure by keeping the dilation factor of each layer at one and adding a max pooling layer between two adjacent layers, since only one prediction for each instance (segment) is needed.
is needed. Therefore, our TCN blocks still have the ability
to increase the receptive field quickly with stacking layers
without increasing dilation. Meanwhile, the added pooling
layer also reduces feature size and makes our TCN module
more efficient. The residual block contains a branch leading
out to a TCN layer, whose outputs are added to the input of the
block. The residual blocks allow layers to learn modifications
to the identity mapping, which has repeatedly been shown to
benefit very deep networks.
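A minimal PyTorch sketch of the modified temporal block described above is given below: dilation is kept at one, a max pooling layer between layers grows the receptive field while shrinking the feature size, and a residual branch is added back to the block output. Channel sizes and the kernel size are illustrative assumptions, not the exact OBTAIN configuration.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """Temporal block with dilation fixed to 1, a max pooling layer, and a
    residual connection, sketching the modification described above."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, dilation=1)
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2)              # halves the temporal length
        self.res = nn.Conv1d(in_ch, out_ch, kernel_size=1)   # match channel count

    def forward(self, x):              # x: (batch, channels, time)
        out = self.pool(self.relu(self.conv(x)))
        res = self.pool(self.res(x))   # pooled residual branch
        return out + res

# Toy usage: one instance's features, 64 channels over 32 time steps.
block = TCNBlock(in_ch=64, out_ch=128)
print(block(torch.randn(8, 64, 32)).shape)   # torch.Size([8, 128, 16])
```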
B. STRUCTURE-AWARE GCN BLOCK
Structure-aware GCN block is the key component of our
feature extractor pipeline. It enables us to unify spatial and
temporal graph convolution together, meanwhile making sure that both spatial and temporal features can be learned.
FIGURE 5. Structure-aware GCN block. The blue circle denotes the center node, red circles denote neighborhood nodes of the center node. The neighborhood is divided into four parts: ''up and left part'', ''down and right part'', ''previous frame part'', and ''next frame part''. For each part, features are processed independently and then concatenated together.
Recall that in the data format part, we store the node set
in one dimension and format skeleton sequences data as a
large graph without distinguishing spatial or temporal edges.
This enables us to efficiently aggregate features from all
neighbor nodes in a single layer. Another notable contribution
is that we design the structure-aware GCN block to assure
the features aggregated along spatial and temporal edges are
different.
Fig.5 shows how structure-aware GCN block aggregates
information from one node’s local neighbors and generates
embedding representation features for the node. As shown in the upper part of Fig.5, the blue circle in the middle frame denotes the current node on which the structure-aware GCN filter is operating. Red circles denote neighbor nodes in the
aggregate range.
Note that the upper-body skeleton data has a quite simple spatial structure, as shown in Fig.4; most nodes have only two neighbor nodes. Since each neighborhood part needs to be followed by independent neural network layers, considering model efficiency we combine the ''up'' and ''left'' spatial relationships together, and likewise ''down'' and ''right''.
Therefore, we partition the aggregate range of each node
into four parts, which are the ‘‘up and left part’’, ‘‘down and
right part’’, ‘‘previous frame part’’, and ‘‘next frame part’’,
as shown with different color regions in Fig.5. Formally,
we define N (vi ) as the neighborhood of node vi , where i is
the index of the node in the whole graph. Then the neighbor
nodes set for four aggregate range parts are four subsets of
N (vi ). They would be defined as:
N_ul(v_i) = {v_j | ∀(v_i, v_j) ∈ E, N > j − i > 0}
N_dr(v_i) = {v_j | ∀(v_i, v_j) ∈ E, 0 > j − i > −N}
N_pf(v_i) = {v_j | ∀(v_i, v_j) ∈ E, j − i = −N}
N_nf(v_i) = {v_j | ∀(v_i, v_j) ∈ E, j − i = N}    (2)
where N is the number of joints in one frame, and E is the
edge set.
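Because the partition in Eq. (2) depends only on the index difference j − i, it can be computed without any per-frame bookkeeping; a small sketch under an assumed toy edge set is shown below.

```python
def partition_neighborhood(i, edges, N):
    """Split the neighbors of node i into the four aggregate-range parts of
    Eq. (2), based only on the index difference j - i."""
    parts = {"ul": [], "dr": [], "pf": [], "nf": []}
    for (a, b) in edges:
        if i not in (a, b):
            continue
        j = b if a == i else a
        diff = j - i
        if 0 < diff < N:
            parts["ul"].append(j)      # up-and-left spatial neighbors
        elif -N < diff < 0:
            parts["dr"].append(j)      # down-and-right spatial neighbors
        elif diff == -N:
            parts["pf"].append(j)      # same joint, previous frame
        elif diff == N:
            parts["nf"].append(j)      # same joint, next frame
    return parts

# Toy graph: N = 4 joints per frame; a few spatial and temporal edges.
edges = [(4, 5), (5, 6), (5, 7),      # spatial edges inside frame 1
         (1, 5), (5, 9)]              # temporal edges for the same joint
print(partition_neighborhood(5, edges, N=4))
# {'ul': [6, 7], 'dr': [4], 'pf': [1], 'nf': [9]}
```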
Neighbor nodes' features h^{l−1}_{N_p(v)} are aggregated together if they belong to the same range part. The mean operator takes the element-wise mean of the feature vectors, producing a feature embedding for each aggregate range part. Four corresponding fully connected layers take those four regions' features and reduce the number of channels to 1/4 of the original.

h^{l−1}_{N_p(v)} = AGG_mean({h^{l−1}_u, ∀u ∈ N_p(v)}), p ∈ {ul, dr, pf, nf}    (3)

where AGG_mean(·) is an aggregator with an element-wise mean operator, and l is the layer number.
After that, the new shorter features from the four parts are concatenated together and sent to a fusion layer. The fused feature represents the neighbor information of the root node. In this way, features from different spatial and temporal parts are saved separately in different channels. The root node feature is fed through another linear layer to transform its representation. The final feature representation output of one structure-aware GCN block is the summation of the fused neighbor feature h^l_{N(v)} and the new root feature h^l_v in equations (4) and (5).

h^l_{N(v)} = W_f · CAT(h^{l−1}_{N_ul(v)}, h^{l−1}_{N_dr(v)}, h^{l−1}_{N_pf(v)}, h^{l−1}_{N_nf(v)})    (4)
h^l_v = W_r · h^{l−1}_v    (5)

where CAT(·) denotes the concatenation operation, W_f is the weight of the fusion layer, and W_r is the weight of the root layer.
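Putting Eqs. (3)-(5) together, a minimal sketch of one structure-aware GCN block is given below. Neighbor index sets are assumed to be precomputed per node as in Eq. (2); the layer dimensions, the handling of empty parts, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StructureAwareGCNBlock(nn.Module):
    """Sketch of Eqs. (3)-(5): mean-aggregate each of the four neighborhood
    parts, reduce each to out_ch/4 channels, concatenate and fuse (W_f),
    then add the transformed root feature (W_r)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.part_fc = nn.ModuleList(
            [nn.Linear(in_ch, out_ch // 4) for _ in range(4)])  # per-part layers
        self.fuse = nn.Linear(out_ch, out_ch)   # W_f, fusion layer
        self.root = nn.Linear(in_ch, out_ch)    # W_r, root transform

    def forward(self, h, neighbor_parts):
        # h: (num_nodes, in_ch); neighbor_parts[v]: dict of 4 index lists.
        outputs = []
        for v in range(h.size(0)):
            part_feats = []
            for k, key in enumerate(("ul", "dr", "pf", "nf")):
                idx = neighbor_parts[v][key]
                agg = (h[idx].mean(dim=0) if len(idx) > 0
                       else torch.zeros(h.size(1)))      # Eq. (3); empty part -> zeros
                part_feats.append(self.part_fc[k](agg))
            fused = self.fuse(torch.cat(part_feats))     # Eq. (4)
            outputs.append(fused + self.root(h[v]))      # Eq. (5) and summation
        return torch.stack(outputs)

# Toy usage: 3 nodes with 8-channel features and hypothetical neighbor parts.
h = torch.randn(3, 8)
parts = [{"ul": [1], "dr": [], "pf": [], "nf": [2]},
         {"ul": [2], "dr": [0], "pf": [], "nf": []},
         {"ul": [], "dr": [1], "pf": [0], "nf": []}]
print(StructureAwareGCNBlock(in_ch=8, out_ch=16)(h, parts).shape)  # (3, 16)
```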
C. MULTIPLE INSTANCE LEARNING FOR REGRESSING
TRAINING STATE SCORE
Since large-scale, precise temporal annotation of training states in therapy session records is almost impossible to obtain, we cannot simply learn the patterns the same way as in a standard classification problem. Instead, we treat it as a Multiple Instance Learning (MIL) problem. In this work, we mainly use the MIL ranking loss for optimization. To overcome the optimization difficulty caused by the nonsmoothness and nonconvexity of the MIL ranking loss [34], we introduce an attention-based MIL pooling bag loss. The joint MIL loss is more suitable for instance state score regression than the MIL ranking loss alone.
Problem Statement. In our scenario, we only have session-clip-level annotations. A session clip containing a bad training state is labeled as positive; otherwise, it is labeled as negative. Following the real-world anomaly detection method of [18],
we represent a positive video as a positive bag Bp , and
different temporal segments are individual instances in the
bag (p1 , p2 , . . . , pm ), where m is the number of instances
in the bag. We assume that at least one of these instances
contains bad training state frames. Similarly, the negative
video is denoted by a negative bag Bn , where temporal
segments in this bag are negative instances (n1 , n2 , . . . , nm ).
In the negative bag, all of the instances are good training
state frames. In this work, we divide each session clip into a fixed number of segments (e.g., 32 segments) during training. These segments of an intervention session clip are the instances in a bag.
MIL ranking loss. MILnet predicts a training state score for each segment of an intervention session clip. For the purpose of distinguishing good or bad training states, MILnet should predict a higher score for a bad training state frame than for a good one, i.e., f_MILnet(p_i) > f_MILnet(n_j). But we only have intervention-session-clip-level (bag) annotations, so it is impossible to compare every pair of p_i and n_j. Hence, we use a multiple instance ranking loss to achieve that goal; the objective function can be formulated as:

max_{i∈B_p} f_MILnet(p_i) > max_{j∈B_n} f_MILnet(n_j)    (6)

Instead of enforcing ranking on every instance of the bag, the MIL ranking loss enforces ranking only on the two instances having the highest training state score in the positive and negative bags, respectively. The segment corresponding to the highest training state score in the positive bag is most likely to be the true positive instance (a bad training state segment). The segment corresponding to the highest training state score in the negative bag is the one that looks most similar to a bad training state segment, but it actually is a good one.
We implement a hinge-based ranking loss [18] for training, in order to keep a large margin between the positive and negative instances:

l(B_p, B_n) = max(0, 1 − max_{i∈B_p} f_MILnet(p_i) + max_{j∈B_n} f_MILnet(n_j))    (7)
Further, since the intervention session clip is a sequence
of segments, the training state score should vary smoothly
between segments. Hence, we minimize the difference of scores for adjacent segments to enforce the prediction of MILnet to be temporally smooth. The additional loss term l_smooth is shown as:
l_smooth = Σ_{i=1}^{m−1} (f_MILnet(p_i) − f_MILnet(p_{i+1}))^2    (8)
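A minimal sketch that combines the hinge-based ranking term of Eq. (7) with the temporal smoothness term of Eq. (8) is given below; the relative weighting of the smoothness term is an illustrative assumption.

```python
import torch

def mil_ranking_with_smoothness(pos_scores, neg_scores, smooth_weight=1.0):
    """pos_scores / neg_scores: (m,) instance scores of a positive / negative
    bag. Returns Eq. (7) plus smooth_weight times Eq. (8)."""
    # Eq. (7): hinge loss on the highest-scoring instance of each bag.
    ranking = torch.clamp(1.0 - pos_scores.max() + neg_scores.max(), min=0.0)
    # Eq. (8): squared differences of adjacent segment scores (positive bag).
    smooth = ((pos_scores[:-1] - pos_scores[1:]) ** 2).sum()
    return ranking + smooth_weight * smooth

# Toy usage with 32 instance scores per bag.
print(mil_ranking_with_smoothness(torch.rand(32), torch.rand(32)))
```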
Attention-based MIL pooling. Attention-based MIL pooling is more flexible than mean pooling and max pooling
since the attention mechanism is a trainable and adaptive
method. The attention mechanism is widely used in deep
learning for image captioning and text analysis [30]. Different from existing methods that generate attention weights with a neural network [30], [35], our method is inspired by self-attention [21], which uses scaled dot-product attention.
As shown in Fig.3, in the multiple instance learning
module, extracted features are transformed by a multiple
layer perceptron (MLP) header and transformed features
embedding represent instances. We compute the dot products
of each pair of instance features and obtain the self-attention map, which is an N × N matrix. Then we sum each row and apply a softmax function to obtain the weights on the values. The output is the instance attention weights.
With the attention weights, we apply a weighted summation over the instance scores to get the bag score prediction. Since we have the bag labels, a binary cross-entropy loss can easily be calculated.
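The pooling described above can be sketched as follows: pairwise dot products of the instance embeddings give the N × N self-attention map, its row sums pass through a softmax to give one weight per instance, and the weighted sum of instance scores gives the bag score used in a binary cross-entropy bag loss. The scaling of the dot products and the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def attention_mil_bag_score(instance_feats, instance_scores):
    """instance_feats: (N, d) embeddings from the MLP header;
    instance_scores: (N,) per-instance training state scores."""
    d = instance_feats.size(1)
    # Pairwise dot products -> N x N self-attention map (scaling is an assumption).
    attn_map = instance_feats @ instance_feats.t() / d ** 0.5
    # Sum each row, then softmax to obtain one attention weight per instance.
    weights = F.softmax(attn_map.sum(dim=1), dim=0)
    # Weighted summation of instance scores gives the bag score prediction.
    return (weights * instance_scores).sum()

# Toy usage: 32 instances with 64-dim features; the bag label is positive (1).
bag_score = attention_mil_bag_score(torch.randn(32, 64), torch.rand(32))
bag_loss = F.binary_cross_entropy(bag_score.clamp(0.0, 1.0), torch.tensor(1.0))
print(bag_loss)
```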
IV. DREAM DATASET PREPARATION
In this section, we introduce the DREAM dataset and data
preparation. In order to train and evaluate our OBTAIN
model, data need to include both pure good state sessions
and labeled sessions with state decline. However, due to privacy concerns, no existing dataset contains training state
annotations. Fortunately, the published DREAM dataset
provides a large-scale comparison between RET (Robot
Enhanced Therapy) and SHT (Pure Human Therapy). We use
RET and SHT session data to simulate the intervention
scenarios with different training states. The corresponding
evaluation results with the DREAM dataset will be presented in Section V.
A. DREAM DATASET DESCRIPTION
The DREAM dataset [3] collected behavioral data during a
large-scale evaluation of Robot Enhanced Therapy (RET).
The project separated 61 children diagnosed with ASD into two groups. Applied Behavior Analysis (ABA) therapists provided training directly to half of the children (SHT), while a social robot supervised by a therapist interacted with the other children (RET). The therapy sessions
for both groups are recorded by three RGB cameras and two
RGBD (Kinect) cameras, and all the detailed information
about children’s behavior during the therapy is included. The
dataset recorded 3000 therapy sessions, providing a total of
more than 300 hours of therapy.
Most therapy sessions targeted three social skills, namely
imitation (IM), joint-attention (JA), and turn-taking in
collaborative play (TT), while a part of session data contains
no task label. As we mentioned, the blurred definition
of the training state is a challenge. The training state
is not directly relevant to particular behaviors. Therefore
we evaluate our method on both mixed task data and
independent single tasks. Meanwhile, the naturally imbalanced session numbers among IM, JA, and TT help us test our method with both sufficient and insufficient data.
B. DEFINITION OF TRAINING STATE IN DATASET
The core function of our observational assistance method is
recognizing unsatisfied training states to help the therapists
identify children’s responses to a training task and adjust
their strategy to help children achieve the training goal.
Considering that every child with ASD has particular strengths and difficulties in various areas, there is no standard definition of satisfied and unsatisfied states, which also means labeling data is a challenge, not to mention that temporal annotations would be required for long intervention records.
For the purpose of bridging the gap of observation ability
between experienced therapists and unprepared therapists,
we transform this challenge by treating children’s behavior
through experienced therapists’ intervention as a good
training state, and behavior through unprepared therapists' intervention as a bad training state. Therefore, the temporal annotation requirement can be dropped and training data are easier to obtain.
Specifically, robots can attract more attention from children with ASD and, as [24] shows, children might be more motivated to participate in training. We therefore consider the behavior during intervention sessions with RET as a good training state under experienced therapists. Correspondingly, the behavior during SHT intervention is treated as a bad training state under unprepared therapists. Although we would not say that children are always in a good training state during interventions offered by experienced therapists, most of the time they perform better. Hence, it is reasonable to
simplify our observational assistance goal to distinguish the
training states related to the behavior when interventions are
provided by different level providers: experienced therapists
and unprepared therapists (in the DREAM dataset, RET and
SHT).
Actually, in the real intervention scenario, the training
states vary at different levels over time. That’s one reason
that motivates us to provide observational assistance. For
the purpose of identifying the states of the children during
the intervention and informing the therapist according to the
observation, we train our model to detect the bad training
states mixed in good training states. However, the DREAM
dataset provides the original therapy records for RET/SHT
separately. In order to better simulate the practical training
cases and reflect the real training state fluctuation over time,
we pre-process data by splicing the RET and SHT therapy
records clips together.
C. DATA PRE-PROCESSING
Before extracting features, we initialize input data clips in a fixed format: each clip has a fixed length of N skeleton frames (N=1024), and each skeleton frame includes J joints (J=12). The dimension of the initial feature of each joint (the 3D position of the joint) is F. For each session clip, we take a length-N RET therapy record as the base, and mix a random-length SHT data sequence into a random position of the base to simulate the occurrence of a bad training state. Then,
we divide each data clip into 32 non-overlapping segments
and consider each data segment as an instance of the bag.
The number of segments (32) is empirically set [18]. For each
segment, we extract features to represent it.
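A minimal sketch of this clip synthesis is given below: a length-N RET sequence serves as the base, a random-length SHT snippet is spliced in at a random position to simulate a bad training state, and the result is split into 32 non-overlapping segments (instances). Array shapes and names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_session_clip(ret_seq, sht_seq, clip_len=1024, num_segments=32):
    """Splice a random-length SHT snippet into a RET base clip, then split the
    clip into non-overlapping segments that act as bag instances."""
    clip = ret_seq[:clip_len].copy()                  # base: good state (RET)
    bad_len = int(rng.integers(1, clip_len // 2))     # random snippet length
    start = int(rng.integers(0, clip_len - bad_len))  # random insert position
    clip[start:start + bad_len] = sht_seq[:bad_len]   # mix in SHT (bad state)
    segments = clip.reshape(num_segments, clip_len // num_segments,
                            *clip.shape[1:])
    frame_labels = np.zeros(clip_len)                 # frame-level ground truth
    frame_labels[start:start + bad_len] = 1
    return segments, frame_labels

# Toy skeleton sequences: (frames, 12 joints, 3 coordinates).
ret = rng.normal(size=(1024, 12, 3))
sht = rng.normal(size=(1024, 12, 3))
segments, labels = make_session_clip(ret, sht)
print(segments.shape)   # (32, 32, 12, 3)
```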
V. EVALUATION
This section shows the evaluation results of our proposed
OBTAIN framework. First, we evaluate the performance of
our OBTAIN framework on the DREAM dataset. Second,
we compare the performance of our novel structure-aware
GCN block with the two most popular classical approaches,
GraphSAGE and GAT. Third, we use an ablation study to
analyze the effectiveness of each module.
A. EVALUATION METRICS
We evaluate our OBTAIN model by frame based receiver
operating characteristic (ROC) curve and corresponding area
under the curve (AUC), which are common evaluation metrics used by anomaly detection methods. The ROC
curve is a graph showing the performance of a classification
model at all classification thresholds. This curve plots two
parameters: True Positive Rate (TPR) and False Positive
Rate (FPR). On different classification thresholds, the ROC
curve is drawn on TPR vs. FPR space. The scale-invariant
and classification-threshold-invariant properties make AUC
suitable for our application.
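Frame-based ROC and AUC can be computed with standard tooling; a minimal sketch with scikit-learn, assuming per-frame ground-truth labels and predicted state scores, is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Hypothetical per-frame ground truth (1 = bad training state) and state scores.
y_true = np.array([0, 0, 0, 1, 1, 1, 0, 0])
y_score = np.array([0.1, 0.2, 0.3, 0.8, 0.7, 0.9, 0.4, 0.2])

fpr, tpr, thresholds = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))               # area under the curve
```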
FIGURE 6. Testing loss for different training subsets among ALL, IM, TT,
JA, and no label.
B. PERFORMANCE ON DREAM DATASET
To evaluate the effectiveness of OBTAIN, we train OBTAIN
model to recognize bad training state segments which are
mixed with good training state session clips. The challenge
is that characteristics of the abnormal state vary greatly for
divergent behaviors for each special training topic. Hence,
we use different task subsets to prove OBTAIN is able to
generally recognize the states related to divergent behaviors.
Experiment Setup: We use AdamW [41] with a learning
rate of 0.00001 and reduce it at epoch 200, when the training
loss stops decreasing. The training batch size is set to 1024.
We train and test our OBTAIN with five subsets with different
intervention tasks (ALL, IM, TT, JA, no label), where ‘‘ALL’’
means the whole training dataset includes all intervention
tasks, and ‘‘no label’’ denotes the data without a task label.
Each subset is split into two parts with fixed proportions,
namely the training set (80%) and the testing set (20%).
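The optimizer setup described above can be sketched as follows; the model stand-in and the exact learning-rate decay factor are assumptions, since only the initial rate (0.00001) and the reduction epoch (200) are stated.

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 1)   # placeholder standing in for the OBTAIN model

# AdamW with the stated initial learning rate of 0.00001.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Reduce the learning rate at epoch 200; the 0.1 decay factor is an assumption.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[200], gamma=0.1)

for epoch in range(300):
    # ... one training epoch over batches of size 1024 would run here,
    # calling optimizer.step() after each batch ...
    scheduler.step()
```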
Performance Analysis: Fig.6 shows the test loss curves
when OBTAIN model is trained with different subsets as a
function of iterations. Fig.7 shows the convergence curves for each subset; as the loss converges, the testing AUC scores for the different subsets correspondingly increase and converge.
Both figures show that our model has a stable performance
(more than 0.8 AUC score) on each independent subset. That
means our model has the ability to recognize the training
states related to divergent behavior styles presented during
interventions targeting various social skills. Note that the JA subset shows a clearly slower increase than the other subsets. The reason is that JA is the smallest subset, containing only around 5% of the whole dataset.
In the practical case, during a training session, several
training topics are combined together. It’s hard to train an
independent state recognition model limited to the specific
social skill intervention. That’s the reason we not only do
the training separately on each subset but also evaluate the
performance of the model trained by the whole training
dataset. Fig.8 shows the ROC curves tested on ALL, IM, TT,
JA and no label test subsets. From the results, we can see
that the areas under the ROC curves of ALL and IM are very
similar, around 0.824. The ROCs of the TT and JA test sets are slightly lower than the others: TT is around 0.823 and JA is around 0.814. Notably, the ROC on the no label test set is higher than that of ALL. That
is caused by the imbalance of intervention tasks in the whole
train dataset. Among the total 9352 intervention clip samples,
there are 1152 IM samples, 6840 TT samples, only 504 JA
samples, and 2008 samples without task labels.
Overall, children with ASD have great differences regarding their adaptability during the intervention, which is
FIGURE 7. Testing AUC curves show the model’s convergence progress on
different train subsets.
FIGURE 8. ROC curve comparison among different test subsets. The
model is trained on the whole train set.
shown by their behavior and states that need to be addressed.
Our model has the ability to identify the training states
successfully. In Fig.9, we present qualitative results of our
state recognition method on eight intervention clips. OBTAIN
provides successful and timely detection of those bad training
states by generating high state scores for frames recorded during SHT interventions and low (close to 0) state scores for good states (RET intervention samples), as shown in the first six clips. The last two clips illustrate failure cases: the model may predict a low score for a bad training state, or predict some false alarms on a normal intervention clip.
C. COMPARISON WITH CLASSICAL GCN BLOCKS
To validate that the proposed structure-aware GCN block
is more suitable for learning information from skeleton
sequences data than classical graph representation methods,
FIGURE 9. State scores prediction results in the test set. Window blocks
with red color represent the ground truth of the anomalous region.
Curves represent the predicted training state score.
FIGURE 10. Testing AUC curve of the proposed structure-aware GCN compared with GAT and GraphSAGE.
FIGURE 12. ROC of the proposed structure-aware GCN compared with GAT and GraphSAGE.
Besides a higher AUC score, we also find differences in
ROC curve compared with GAT and GraphSAGE. Fig.12
shows the ROC curve of the best performance among
structure-aware GCN, GAT, and GraphSAGE. As we know,
the ROC curve is plotted with TPR (True Positive Rate)
against the FPR (False Positive Rate) where TPR is on
the y-axis and FPR is on the x-axis. TPR can be called
recall or sensitivity, and FPR can be denoted as 1-specificity. Sensitivity and specificity are inversely related. As Fig.12 shows, our structure-aware GCN is more balanced between sensitivity and specificity. That means
the extractor pipeline with structure-aware GCN blocks can
learn better feature distribution for both positive and negative
samples. While the curves of GAT and GraphSAGE show that
the models with GAT or GraphSAGE blocks do predict high
state scores for abnormal states, they also predict many high
scores for normal states. In this case, the predictor cannot really extract valuable information.
D. ABLATION STUDY
FIGURE 11. Testing loss curve of the proposed structure-aware GCN compared with GAT and GraphSAGE.
we evaluated the performance of our structure-aware GCN
against the two most popular approaches, GraphSAGE [36]
and GAT [37].
Comparison Setup: We only replace the structure-aware
GCN blocks in the extractor pipeline with GraphSAGE or
GAT blocks, then repeat the experiments on the ALL subset of DREAM with exactly the same experimental setup. We use the published implementations of GraphSAGE and GAT provided by the PyG (PyTorch Geometric) v2.04 library [25].
Analysis: Fig.10 and Fig.11 show the testing performance
of AUC score and loss during 300 epochs training. From
Fig.10, we can see that the model with our structure-aware
GCN block can achieve a higher AUC score than GAT and
GraphSAGE.
In our OBTAIN model, the extractor pipeline consists of
the input layer, a structure-aware GCN module, and a TCN
module, and the joint MIL loss function includes a bag loss and a MIL ranking loss. To analyze the effectiveness of each module,
we further present an ablation study to compare variant
combinations of modules with different loss functions.
To ensure the integrity of the model, we adopt a linear layer to
replace the GCN-based input layer or global adaptive pooling
to replace TCN modules. Four variants can be described:
1): Linear + GCN blocks + global pooling
2): Linear + GCN blocks + TCN
3): GCN + GCN blocks + global pooling
4): GCN + GCN blocks + TCN
In addition, we separately train those four variants with the
proposed joint MIL loss and MIL ranking loss.
Analysis: The results are shown in Table 1. We consider
ranking loss only and set variant 1 as a baseline. Firstly,
we see that employing the TCN module instead of simple adaptive global pooling brings a 0.04 increase in the AUC
TABLE 1. AUC scores on different ablation settings trained with different
MIL loss functions.
FIGURE 13. Testing AUC score on different ablation settings in Table 1
have a decreasing trend, which is caused by nonsmoothness and
nonconvexity of MIL ranking loss.
score. Then, choosing the GCN block as the input layer brings an increase of around 0.021 in AUC score compared to using a linear layer. And variant 4 has a 0.046 higher AUC score than the
baseline. The previous experiment has already shown the
effectiveness of our structure-aware GCN block, therefore,
we conclude that the GCN-based input layer is beneficial
for subsequent spatial-temporal feature extraction. TCN has
a more important impact on the temporal abnormal state
prediction than the GCN-based input layer.
The joint MIL loss results in Table 1 show a similar trend: the input layer module and the TCN module increase the OBTAIN model's performance when trained with different loss functions. Meanwhile, comparing the two AUC score columns, training with the joint MIL loss achieves better performance than the MIL ranking loss alone on all four variants.
That means the proposed joint MIL loss can directly increase
the performance of our model without any model structure
variation.
Furthermore, we found an AUC decrease trend during the training process when using the MIL ranking loss only. Fig.13 shows the four variants' testing AUC scores trained with the MIL ranking loss. All four curves achieve their best performance before 50 epochs, then all four curves start to decrease in AUC performance. This decrease is caused by the nonsmoothness and nonconvexity of the MIL ranking loss; the optimization gets stuck in a local optimum. Moreover, as a weakly-supervised approach, a reduction of the loss does not guarantee an improvement in performance. The results shown in Fig.7 and Fig.10, which are trained with the proposed joint MIL loss, do not show a decreasing trend and achieve better performance. Therefore, those results show that our proposed joint MIL loss can overcome the nonsmoothness and nonconvexity problem of the MIL ranking loss.
VI. CONCLUSION
In this paper, we propose OBTAIN, a weakly-supervised children's training state recognition framework for providing observational assistance to therapists during ASD intervention. OBTAIN
includes an extractor pipeline for encoding feature representation for children’s behavior. We design a structure-aware
GCN block that can unify spatial and temporal convolutional
operations in the same layer to extract features. A MILNet
is proposed for learning how to identify children’s special
behavior by analyzing different training states. We introduce
a joint MIL loss to combine MIL ranking based optimization
with MIL pooling based optimization. The experiments on
DREAM dataset achieve 0.824 AUC score, which shows
our model can successfully distinguish children’s different
training states. Moreover, the proposed structure-aware GCN
block can achieve a better performance than GAT and
GraphSAGE (0.812 and 0.809 AUC score). Additionally, our joint MIL loss increases the performance of our model over the MIL ranking loss alone by around 0.017 AUC score. The
limitation of the work is that we only evaluate the proposed
OBTAIN on the DREAM dataset. DREAM dataset is the
only available published dataset on behavioral data of ASD
therapy sessions, which are labeled as one of two conditions,
Robot Enhanced Therapy or Standard Human Treatment.
We used it in [38], where we employed machine learning concepts to analyze the behavioral data; the new method designed in this paper achieves much better performance.
Our future goal is to utilize OBTAIN to aid teletherapy for
children with ASD, as it can make therapy more accessible
to those living in remote or underserved areas by reducing
the need for in-person sessions. Nonetheless, therapists may
face limitations on observing their patients without in-person
interventions. Therefore, we plan to use OBTAIN to enhance
the quality of teletherapy for ASD patients.
In conclusion, the proposed OBTAIN is a successful attempt at building an observation model to recognize children's training states during ASD intervention. We believe
that OBTAIN is the first milestone for enhancing long-term
high-quality intervention with deep learning.
REFERENCES
[1] F. Chiarotti and A. Venerosi, ‘‘Epidemiology of autism spectrum disorders:
A review of worldwide prevalence estimates since 2014,’’ Brain Sci.,
vol. 10, no. 5, pp. 274–295, 2020.
[2] G. Bertamini, A. Bentenuto, S. Perzolli, and E. Paolizzi, ‘‘Quantifying the
child–therapist interaction in ASD intervention: An observational coding
system,’’ Brain Sci., vol. 11, no. 3, pp. 366–389, 2021.
[3] E. Billing, T. Belpaeme, H. Cai, H.-L. Cao, A. Ciocan, C. Costescu,
D. David, R. Homewood, D. Hernandez Garcia, P. Gómez Esteban, H. Liu,
V. Nair, S. Matu, A. Mazel, M. Selescu, E. Senft, S. Thill, B. Vanderborght,
D. Vernon, and T. Ziemke, ''The DREAM dataset: Supporting a data-driven
study of autism spectrum disorder and robot enhanced therapy,’’ PLoS
ONE, vol. 15, no. 8, pp. 1–15, Aug. 2020.
[4] H.-L. Cao, ‘‘Robot-enhanced therapy: Development and validation of
supervised autonomous robotic system for autism spectrum disorders
therapy,’’ IEEE Robot. Autom. Mag., vol. 26, no. 2, pp. 49–58, Jun. 2019.
[5] R. A. J. de Belen, T. Bednarz, A. Sowmya, and D. Del Favero, ‘‘Computer
vision in autism spectrum disorder research: A systematic review of
published studies from 2009 to 2019,’’ Translational Psychiatry, vol. 10,
no. 1, pp. 1–20, Sep. 2020.
[6] H. Li, N. A. Parikh, and L. He, ‘‘A novel transfer learning approach
to enhance deep neural network classification of brain functional
connectomes,’’ Frontiers Neurosci., vol. 12, pp. 1–15, Jul. 2018.
[7] M. Leo, M. D. Coco, P. Carcagni, C. Distante, M. Bernava, and G. Pioggia,
‘‘Automatic emotion recognition in robot-children interaction for ASD
treatment,’’ in Proc. ICCV Workshops, 2015, pp. 145–153.
[8] M. Jiang and Q. Zhao, ‘‘Learning visual attention to identify people with
autism spectrum disorder,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV),
Oct. 2017, pp. 3267–3276.
[9] S. Piana, C. Malagoli, M. C. Usai, and A. Camurri, ‘‘Effects of
computerized emotional training on children with high functioning
autism,’’ IEEE Trans. Affect. Comput., vol. 12, no. 4, pp. 1045–1054,
Oct. 2021.
[10] A. Zunino, P. Morerio, A. Cavallo, C. Ansuini, J. Podda, F. Battaglia,
E. Veneselli, C. Becchio, and V. Murino, ‘‘Video gesture analysis for
autism spectrum disorder detection,’’ in Proc. 24th Int. Conf. Pattern
Recognit. (ICPR), Aug. 2018, pp. 3421–3426.
[11] O. Rudovic, J. Lee, L. Mascarell-Maricic, B. W. Schuller, and R. W. Picard,
‘‘Measuring engagement in robot-assisted autism therapy: A cross-cultural
study,’’ Frontiers Robot. AI, vol. 4, pp. 1–17, Jul. 2017.
[12] H. Javed and C. Park, ‘‘Behavior-based risk detection of autism
spectrum disorder through child-robot interaction,’’ in Proc. HRI, 2020,
pp. 275–277.
[13] B. Ren, M. Liu, R. Ding, and H. Liu, ‘‘A survey on 3D skeleton-based
action recognition using learning method,’’ 2020, arXiv:2002.05907.
[14] M. Li, S. Chen, X. Chen, Y. Zhang, Y. Wang, and Q. Tian, ''Actional-structural graph convolutional networks for skeleton-based action recognition,'' in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2019, pp. 3595–3603.
[15] L. Shi, Y. Zhang, J. Cheng, and H. Lu, ‘‘Two-stream adaptive graph
convolutional networks for skeleton-based action recognition,’’ in Proc.
IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019,
pp. 12026–12035.
[16] F. Shi, C. Lee, L. Qiu, Y. Zhao, T. Shen, S. Muralidhar, T. Han, S.-C. Zhu,
and V. Narayanan, ‘‘STAR: Sparse transformer-based action recognition,’’
2021, arXiv:2107.07089.
[17] Z.-H. Zhou, ‘‘A brief introduction to weakly supervised learning,’’ Nat. Sci.
Rev., vol. 5, no. 1, pp. 44–53, Jan. 2017.
[18] W. Sultani, C. Chen, and M. Shah, ‘‘Real-world anomaly detection in
surveillance videos,’’ in Proc. CVPR, 2018, pp. 6479–6488.
[19] S. Ali and M. Shah, ‘‘Human action recognition in videos using kinematic
features and multiple instance learning,’’ IEEE Trans. Pattern Anal. Mach.
Intell., vol. 32, no. 2, pp. 288–303, Feb. 2010.
[20] K. Das, S. Conjeti, J. Chatterjee, and D. Sheet, ‘‘Detection of breast cancer
from whole slide histopathological images using deep multiple instance
CNN,’’ IEEE Access, vol. 8, pp. 213502–213511, 2020.
[21] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ in Proc. Adv.
Neural Inf. Process. Syst., vol. 30, 2017, pp. 1–11.
[22] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, ‘‘Transformers are
RNNs: Fast autoregressive transformers with linear attention,’’ in Proc.
ICML, 2020, pp. 5156–5165.
[23] Y. Bai, H. Ding, S. Bian, T. Chen, Y. Sun, and W. Wang, ‘‘SimGNN: A
neural network approach to fast graph similarity computation,’’ in Proc.
12th ACM Int. Conf. Web Search Data Mining, Jan. 2019, pp. 384–392.
[24] M. Coeckelbergh, C. Pop, R. Simut, A. Peca, S. Pintea, D. David, and
B. Vanderborght, ‘‘A survey of expectations about the role of robots in
robot-assisted therapy for children with ASD: Ethical acceptability, trust,
sociability, appearance, and attachment,’’ Sci. Eng. Ethics, vol. 22, no. 1,
pp. 47–65, Feb. 2016.
[25] M. Fey and J. E. Lenssen, ‘‘Fast graph representation learning with
PyTorch geometric,’’ 2019, arXiv:1903.02428.
[26] W. L. Hamilton, R. Ying, and J. Leskovec, ‘‘Representation learning on
graphs: Methods and applications,’’ 2017, arXiv:1709.05584.
[27] Y. Hou, H. Chen, C. Li, J. Cheng, and M.-C. Yang, ‘‘A representation
learning framework for property graphs,’’ in Proc. 25th ACM SIGKDD Int.
Conf. Knowl. Discovery Data Mining, Jul. 2019, pp. 65–73.
[28] S. Kim, J. Park, and B. Han, ‘‘Rotation-invariant local-to-global representation learning for 3D point cloud,’’ in Proc. NeurIPS, vol. 33, 2020,
pp. 8174–8185.
[29] Y. Hu, M. Li, and N. Yu, ‘‘Multiple-instance ranking: Learning to rank
images for image retrieval,’’ in Proc. IEEE Conf. Comput. Vis. Pattern
Recognit., Jun. 2008, pp. 1–8.
[30] M. Ilse, J. Tomczak, and M. Welling, ‘‘Attention-based deep multiple
instance learning,’’ in Proc. Mach. Learn. Res., vol. 80, Jul. 2018,
pp. 2127–2136.
[31] X. Shi, F. Xing, Y. Xie, Z. Zhang, L. Cui, and L. Yang, ‘‘Loss-based
attention for deep multiple instance learning,’’ in Proc. AAAI Conf. Artif.
Intell., vol. 34, no. 4, pp. 5742–5749, Apr. 2020.
[32] S. Yan, Y. Xiong, and D. Lin, ‘‘Spatial temporal graph convolutional
networks for skeleton-based action recognition,’’ in Proc. AAAI, 2018,
pp. 1–9.
[33] S. Bai, J. Z. Kolter, and V. Koltun, ‘‘An empirical evaluation of generic
convolutional and recurrent networks for sequence modeling,’’ 2018,
arXiv:1803.01271.
[34] C. Bergeron, G. Moore, J. Zaretzki, C. M. Breneman, and K. P. Bennett,
‘‘Fast bundle algorithm for multiple-instance learning,’’ IEEE Trans.
Pattern Anal. Mach. Intell., vol. 34, no. 6, pp. 1068–1079, Jun. 2012.
[35] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, ‘‘Deep multi-instance networks
with sparse label assignment for whole mammogram classification,’’ in
Proc. MICCAI, 2017, pp. 603–611.
[36] W. Hamilton, Z. Ying, and J. Leskovec, ‘‘Inductive representation learning
on large graphs,’’ in Advances in Neural Information Processing Systems,
vol. 30. Red Hook, NY, USA: Curran Associates, 2017.
[37] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio,
‘‘Graph attention networks,’’ in Proc. ICLR, 2017, pp. 20–32.
[38] M. Wang and N. Yang, ‘‘OTA-NN: Observational therapy-assistance neural
network for enhancing autism intervention quality,’’ in Proc. IEEE 19th
Annu. Consum. Commun. Netw. Conf. (CCNC), Jan. 2022, pp. 1–7.
[39] G. Li, M. Müller, A. Thabet, and B. Ghanem, ‘‘DeepGCNs: Can GCNs go
as deep as CNNs?’’ in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV),
Oct. 2019, pp. 9267–9276.
[40] A. Sabater, L. Santos, J. Santos-Victor, A. Bernardino, L. Montesano,
and A. C. Murillo, ‘‘One-shot action recognition in challenging therapy
scenarios,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit.
Workshops (CVPRW), Jun. 2021, pp. 2777–2785.
[41] I. Loshchilov and F. Hutter, ‘‘Decoupled weight decay regularization,’’
2017, arXiv:1711.05101.
MINXIAO WANG received the B.S. degree in
electronic and communication engineering and
the M.S. degree in electronic and information
engineering from the Civil Aviation University
of China, in 2014 and 2018, respectively. He is
currently pursuing the Ph.D. degree with the
School of Electrical, Computer, and Biomedical
Engineering, Southern Illinois University, Carbondale, IL, USA. His research interests include
developing machine learning and deep learning
models for diverse applications related to action recognition, object
detection, and network intrusion detection.
NING YANG (Member, IEEE) received the M.S.
degree in computer engineering from the University of Massachusetts, Amherst, MA, USA,
in 2006, and the Ph.D. degree in computer
engineering from Southern Illinois University,
Carbondale, IL, USA, in 2020, where she is
currently an Assistant Professor with the School
of Computing, Information Technology Program.
Her research interests include machine learning,
network security, and network intrusion detection.