A Semi-Supervised Learning Approach to Integrated Salient Risk Features for Bone Diseases

Hui Li*
Department of Computer Science and Engineering, State University of New York at Buffalo, USA
hli24@buffalo.edu

Xiaoyi Li
Department of Computer Science and Engineering, State University of New York at Buffalo, USA
xiaoyili@buffalo.edu

Murali Ramanathan
Department of Pharmaceutical Sciences, State University of New York at Buffalo, USA
murali@buffalo.edu

Aidong Zhang
Department of Computer Science and Engineering, State University of New York at Buffalo, USA
azhang@buffalo.edu
ABSTRACT
The study of risk factor analysis and prediction for diseases requires an understanding of the complicated and highly correlated relationships behind numerous potential risk factors (RFs). Existing models for this purpose usually fix a small number of RFs based on expert knowledge. Although handcrafted RFs are usually statistically significant, the abandoned RFs might still contain valuable information for explaining the comprehensiveness of a disease. However, it is impossible to simply keep all of the RFs, so finding integrated risk features among numerous potential RFs becomes a particularly challenging task. Another major challenge for this task is the lack of sufficient labeled data and the missing values in the training data.
In this paper, we focus on identifying the relationships between a bone disease and its potential risk factors by learning a deep graphical model in an epidemiologic study, for the purpose of predicting osteoporosis and bone loss. An effective risk factor analysis approach which delineates both the observed and the hidden risk factors behind a disease encapsulates the salient features and also provides a framework for two prediction tasks. Specifically, we first investigate an approach to show the salience of the integrated risk features, yielding more abstract and useful representations for the prediction. Then we formulate the whole prediction problem as two separate tasks to evaluate our new representation of integrated features. With the success of the osteoporosis prediction, we further take advantage of the Positive output and predict the progression trend of osteoporosis severity.
* Corresponding author.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
BCB ’13, September 22 - 25, 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2434-2/13/09 ...$15.00.
We capture the characteristics of the data itself and the intrinsic relatedness between the two relevant prediction tasks by constructing a deep belief network followed by a two-stage fine-tuning (FT). Moreover, our proposed method yields stable and promising results without using any prior information. The superior performance on our evaluation metrics confirms the effectiveness of the proposed approach for extracting integrated salient risk features for predicting bone diseases.
Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications—
Data Mining; J.3 [Life and Medical Sciences]: Health.
General Terms
Algorithms, Performance, Experimentation.
Keywords
Risk Factor Analysis (RFA), Integrated Features, Deep Belief Net (DBN), Restricted Boltzmann Machine (RBM), Osteoporosis, Bone Fracture.
1. INTRODUCTION
Modeling the relationships between a disease and its potential risk factors (RFs) is a crucial task in epidemiology and public health. Usually, numerous potential RFs need to be considered simultaneously when assessing disease determinants and predicting the progression of the disease, for the purpose of disease control or prevention. More importantly, some common diseases may be clinically silent but can cause significant mortality and morbidity after onset. Unless prevented or treated early, these diseases will affect the quality of life and increase the burden of healthcare costs. With the success of risk factor analysis and disease prediction based on an intelligent computational model, unnecessary tests can be avoided. The information can assist in evaluating the risk of the occurrence of a disease, monitoring the disease progression, and facilitating early prevention measures.
[Figure 1: Risk factors for osteoporosis, including demographics, diet, lifestyle, diagnosis, and fractures (vertebral, wrist, hip), among others.]

[Figure 2: Structure of how a deep graphical model learns shared features for two prediction tasks: the input data feed a shared intermediate representation, whose outputs feed classifiers for osteoporosis prediction and bone loss rate prediction.]
In this paper, we focus on finding the integrated salient
risk features for the study of osteoporosis and bone fracture prediction. Over the past few decades, osteoporosis
has been recognized as an established and well-defined disease that affects more than 75 million people in the United
States, Europe and Japan, and it causes more than 8.9 million fractures annually worldwide [32]. It’s reported that
20-25% of people with a hip fracture are unable to return
to independent living and 12-20% die within one year. In
2003, the World Health Organization (WHO) embarked on
a project to integrate information on RFs and bone mineral
density (BMD) to better predict the fracture risk in men
and women worldwide [31]. Osteoporosis in the vertebrae
can cause serious problems for women such as bone fracture. The diagnosis of osteoporosis is usually based on the
assessment of BMD. The most widely validated technique to
measure BMD is dual energy X-ray absorptiometry (DXA).
Different from osteoporosis, which is measured by BMD, bone fracture risk is determined by the bone loss rate and various factors such as demographic attributes, family history, and lifestyle. Some studies have stratified their analysis of fracture risk into fast and slow bone losers. With a faster rate of bone loss, people have a higher future risk of fracture [24].
As shown in Figure 1, osteoporosis and bone fracture
are complicated diseases which are associated with potential
RFs that include but are not limited to the information of
demographic attributes, patients’ clinical records regarding
disease diagnoses and treatments, family history, diet, and
lifestyle. Different representations of this information may entangle the different explanatory factors of variation behind various RFs and diseases. Some fundamental questions have been attracting researchers' interest in this area: for example, how can we perform feature extraction and select the integrated significant features? Also, what are appropriate approaches for manifold feature extraction that accurately maintain the real and intricate relationships between a disease and its potential risk factors? A good
representation has an advantage in capturing underlying factors with shared statistical strength for predicting two relevant tasks, as illustrated in Figure 2. This representation-learning model discovers explanatory factors in the top layer of the shared intermediate representation by combining knowledge from both the input data and the output prediction tasks. The rich interactions among numerous potential
RFs or between RFs and a disease can complicate our final
prediction tasks. How can we cope with these complex interactions? How can we disentangle the salient integrated
features from complex data? The proposed approach shows
a good solution to these questions.
The major challenges of analyzing disease RFs and making
any kind of disease prediction can be summarized from two
perspectives:
The complexity of the data representation and
processing.
With the advancement of computer technologies, various real-world medical data can be collected and warehoused. However, due to the complexity of diseases, disease risk factor analysts require comprehensive and complex data, as shown in Figure 1. It is a challenge to completely and precisely process those data using a feasible data structure. On the other hand, even with a longitudinal study and careful planning of data collection over decades, handling the plentiful missing values in such complicated data sets is a significant challenge, owing to the lack of information from participants.
The lack of a comprehensive model. Such a data-rich environment creates new opportunities for investigating the relationships between the disease and all potential RFs simultaneously. However, there is a lack of effective tools for analyzing these data. People may overlook a lot of information about a disease in which a wealth of valuable hidden information still exists. As a consequence, which pieces of information are more influential than others for a particular disease remains an endless argument.
Traditionally, the assessment of the relationship between
a disease and a potential risk factor is achieved by finding statistically significant associations using regression models [5, 6, 25, 16] such as linear regression, logistic regression, Poisson regression, and Cox regression. Although these regression models are theoretically acceptable for analyzing the risk dependence of several variables, they pay little attention to the nature of the RFs and the disease. Sometimes fewer than ten significant RFs are fed into those models, which is not enough for intelligently predicting a complicated disease such as osteoporosis. Other data mining studies with this objective have focused on association rules [11],
decision trees [23], and Artificial Neural Networks (ANNs) [19]. With these methods, it is ineffective to build a comprehensive model that can guide medical decision making when a large number of potential RFs need to be studied simultaneously. Usually, a limited set of RFs is selected based on the knowledge of physicians, since handling many features is computationally expensive; alternatively, feature selection techniques are used to select a limited number of RFs before feeding them to a classifier. However, the feature selection problem is known to be NP-hard [4] and, more importantly, the abandoned features might still contain valuable information. Furthermore, the performance of an ANN depends on a good setting of meta-parameters, so parameter tuning is an inevitable issue. Under these scenarios, most of these traditional data mining approaches may not be effective.
Mining the causal relationships between RFs and a specific disease has attracted considerable research attention in recent years. In [21, 10, 20], limited RFs are used to construct a Bayesian network, and the RFs are assumed conditionally independent of one another. It is also worth mentioning that the random forest decision tree has been investigated for identifying osteoporosis cases [22]. The data in that work are processed using FRAX [1]. Although FRAX is a popular fracture risk assessment tool developed by the WHO, it may not be appropriate to directly adopt the results from this prediction tool for evaluating the validity of an algorithm, since FRAX sometimes overestimates or underestimates the fracture risk [15]. The prediction results from FRAX need to be interpreted with caution and properly re-evaluated. Some hybrid data mining approaches might also be used to combine classical classification methods with feature selection techniques for the purpose of improving the performance or minimizing the computational expense on a large data set [30], but they are limited by the challenge of explaining the selected features.
Recent research has been devoted to learning algorithms for deep architectures, with impressive results in several areas. The ability to perform inference over an exponential number of subspaces without using labels is particularly suitable for our assumption, namely, that observed RFs are caused by various hidden reasons, and that one hidden reason may directly or indirectly affect other RFs. With the help of an efficient inference technique, we build our learning procedure to formulate a set of risk factors which integrates both observed and hidden risk factors. Another significant advantage is that our integrated salient features are extracted directly from the raw input without discarding any RFs, which maximally utilizes the provided information.
Our contribution in this paper can be summarized as follows:
• We investigate the problem of risk factor analysis on
bone diseases and propose a hypothesis that observed
RFs are caused by hidden reasons which should be
modeled at the same time.
• We propose a learning framework to simultaneously capture the uniqueness of each risk factor and its potential relationships with hidden reasons.
• Our proposed framework utilizes a large amount of unlabeled data and is fine-tuned by two sets of hierarchical labeled information.
• We learn the model using the original high-dimensional data and extract the important features that are more likely to interpret the hidden reasons, since they are highly correlated with a disease. In this way, our work could potentially save medical researchers years of effort in explaining the reasons for selecting a specific risk factor for a disease under different demographic characteristics.
2. METHODOLOGY
In this section, we first briefly describe the evolution of energy models as preliminaries to our proposed method, for the purpose of better understanding the intuition behind our task of risk factor analysis (RFA). Then we introduce single-layer and multi-layer learning approaches to yield the integrated salient features for diseases, respectively. Finally, we show the pipeline of our framework as an overall two-step prediction system.
2.1 Preliminaries

2.1.1 Hopfield Net
A Hopfield network is a form of recurrent artificial neural network invented by John Hopfield [28]. It serves as a content-addressable memory system with binary threshold nodes, where each unit (a node in the graph simulating an artificial neuron) can be updated using the following rule:

$$S_i = \begin{cases} 1 & \text{if } \sum_j W_{i,j} S_j > \theta_i, \\ -1 & \text{otherwise,} \end{cases} \qquad (1)$$
where $W_{i,j}$ is the strength of the connection weight from unit j to unit i, $S_j$ is the state of unit j, and $\theta_i$ is the threshold of unit i. Based on Eq. (1), the energy of the Hopfield net is defined as

$$E = -\frac{1}{2}\sum_{i,j} W_{i,j} S_i S_j + \sum_i \theta_i S_i. \qquad (2)$$
The difference in the global energy that results from a single unit i being 0 (off) versus 1 (on), denoted as $\Delta E_i$, is given as follows:

$$\Delta E_i = \sum_j W_{i,j} S_j + \theta_i. \qquad (3)$$
Eq. (2) ensures that when units are randomly chosen to update, the energy E will either decrease in value or stay the same. Furthermore, repeatedly updating the network will eventually converge to a state which is a local minimum of the energy function (which is considered to be a Lyapunov function [12]). Thus, if a state is a local minimum of the energy function, it is a stable state for the network. Note that this energy function belongs to a general class of models in physics, under the name of Ising models. This in turn
is a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.

[Figure 3: Shallow restricted Boltzmann machine, including one visible layer V and one hidden layer h.]
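To make these dynamics concrete, here is a minimal NumPy sketch (our own illustration, not code from the paper) of the asynchronous update rule in Eq. (1) and the energy in Eq. (2); every accepted update either lowers the energy or leaves it unchanged:

```python
import numpy as np

def energy(W, theta, s):
    # Eq. (2): E = -1/2 * sum_{i,j} W_ij s_i s_j + sum_i theta_i s_i
    return -0.5 * s @ W @ s + theta @ s

def hopfield_settle(W, theta, s, max_sweeps=100, rng=None):
    """Asynchronous updates via Eq. (1) until no unit changes state."""
    rng = rng or np.random.default_rng(0)
    s = s.copy()
    for _ in range(max_sweeps):
        changed = False
        for i in rng.permutation(len(s)):          # random update order
            new_state = 1 if W[i] @ s > theta[i] else -1
            if new_state != s[i]:
                s[i] = new_state
                changed = True
        if not changed:                            # a stable local minimum
            break
    return s

# toy run: 4 units, symmetric weights, zero self-connections
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4)); W = (W + W.T) / 2; np.fill_diagonal(W, 0.0)
theta = np.zeros(4)
s0 = np.array([1.0, -1.0, 1.0, -1.0])
print(energy(W, theta, s0), "->", energy(W, theta, hopfield_settle(W, theta, s0)))
```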
2.1.2 Boltzmann Machines
Boltzmann machines (BMs) can be seen as the stochastic, generative counterpart of Hopfield nets [3]. They are one of the first examples of a neural network capable of learning internal representations, and they are able to represent and (given sufficient time) solve difficult combinatorial problems. The global energy in a Boltzmann machine is identical in form to that of a Hopfield network, with the difference that the partial derivative with respect to each unit (Eq. (3)) can be expressed as the difference of the energies of two states:

$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}. \qquad (4)$$
If we want to train the network so that it will converge to a global state according to a data distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. The units in the BM are divided into "visible" units, V, and "hidden" units, h. The visible units are those which receive information from the data. The distribution over the data set is denoted as $P^+(V)$. After the distribution over global states converges, marginalizing over the hidden units yields the estimated distribution $P^-(V)$, which is the distribution of our model. The difference between the two can then be measured using the KL-divergence [17], and the partial gradient of this difference is used to update the network. But the computation time grows exponentially with the machine's size and with the magnitude of the connection strengths.
2.2 Single-Layer Learning for Integrated Features
Recently, a state-of-the-art single-layer greedy learning module named the Restricted Boltzmann Machine (RBM), a variant of the Boltzmann Machine that can be made quite efficient, has attracted great interest. It is a powerful model and has been successfully used in the areas of computer vision and natural language processing (NLP). An RBM is a generative stochastic graphical model that can learn a probability distribution over its set of inputs, with the restriction that its visible units and hidden units must form a fully connected bipartite graph. Specifically, it has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units [13]. We show a shallow RBM in Figure 3. The model defines the following energy function $E: \{0,1\}^{D+F} \to \mathbb{R}$:
$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j, \qquad (5)$$
where θ = {a, b, W} are the model parameters. D and F are
the number of visible units and hidden units, respectively.
The joint distribution over the visible and hidden units is
defined by:
$$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp\big(-E(v, h; \theta)\big), \qquad (6)$$
where Z(θ) is the partition function that plays the role of a
normalizing constant for the energy function.
Exact maximum likelihood learning is intractable in RBM.
In practice, efficient learning is performed using Contrastive
Divergence (CD) [7]. To learn succinct representations, the
model needs to be constrained by sparsity [18]. In particular, each hidden unit activation is penalized in the form $\sum_{j=1}^{S} KL(\rho \,\|\, v_j)$, where S is the total number of hidden units, $v_j$ is the activation of unit j, and ρ is a predefined sparsity parameter, typically a small value close to zero (we use 0.05 in our model). So the overall cost of the sparse RBM used in our model is:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j + \beta \sum_{j=1}^{S} KL(\rho \,\|\, v_j) + \lambda \|W\|, \qquad (7)$$
where $\|W\|$ is the regularizer and both β and λ are hyper-parameters¹.
The advantage of the RBM is that it investigates an expressive representation of the input risk factors. Each hidden unit in the RBM is able to encode at least one high-order interaction among the input variables. Given a specific number of latent reasons in the input, the RBM requires fewer hidden units to represent the problem complexity. Under this scenario, RFs can be analyzed by an RBM model with the efficient learning algorithm CD. In this paper, we use the RBM for unsupervised greedy layer-wise pre-training. Specifically, each sample describes a state of the visible units in the model. The goal of learning is to minimize the overall energy so that the data distribution can be better captured using this single-layer approach.
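To illustrate what such a pre-training step might look like in code, here is a minimal CD-1 sketch for a binary RBM in NumPy (our own simplified illustration under stated assumptions; the sparsity and weight-decay terms of Eq. (7) are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, a, b, lr=0.05, rng=None):
    """One CD-1 step for a binary RBM on a mini-batch v0 (batch x D).
    W: D x F weights; a: hidden biases (F,); b: visible biases (D,)."""
    rng = rng or np.random.default_rng(0)
    # positive phase: sample hidden units given the data
    ph0 = sigmoid(v0 @ W + a)                      # p(h = 1 | v0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # negative phase: one Gibbs round (z = 1, as in the paper)
    pv1 = sigmoid(h0 @ W.T + b)                    # p(v = 1 | h0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + a)
    # gradient approximation: data statistics minus model statistics
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n
    a += lr * (ph0 - ph1).mean(axis=0)             # hidden biases
    b += lr * (v0 - v1).mean(axis=0)               # visible biases
    return W, a, b
```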
2.3 Multi-Layer Learning for Integrated Features
The new representations learned by a shallow (one-layer) RBM can model some directed hidden causalities behind the RFs. But there are more abstract reasons behind them (i.e., the reasons of the reasons). To sufficiently model reasons at different levels of abstraction, we can stack more layers onto the shallow RBM to form a deep graphical model, namely a Deep Belief Net (DBN).

A DBN is a probabilistic generative model that is composed of multiple layers of stochastic, latent variables [13]. The latent variables typically have binary values and are often called hidden units or feature detectors. The top two layers form an RBM, which can be viewed as an associative memory. The lower layer forms a multi-layer perceptron (MLP) [26], which receives top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.

There is an efficient, layer-by-layer procedure for learning the top-down, generative weights that determine how the variables in one layer depend on the variables in the layers above.

¹ We tried different settings for both β and λ and found that our model is not very sensitive to these parameters. We fixed β to 0.1 and λ to 0.0001 for all experiments.

[Figure 4: Two-layer deep belief network, including one visible layer V and two hidden layers h1 and h2, in which the top two layers form an RBM and the bottom layer forms a multi-layer perceptron.]
The bottom-up inference from the observed variables V through the hidden layers $h^k$ ($k = 1, \ldots, l$) follows a chain rule:

$$p(h^l, h^{l-1}, \ldots, h^1 \mid v) = p(h^l \mid h^{l-1})\, p(h^{l-1} \mid h^{l-2}) \cdots p(h^1 \mid v), \qquad (8)$$
where, denoting the bias of layer k as $b^k$ and letting σ be the logistic sigmoid function, for each of the m units in layer k, given the n units in layer k−1,

$$p(h^k_j = 1 \mid h^{k-1}) = \sigma\Big(b^k_j + \sum_{i=1}^{n} W^k_{ji} h^{k-1}_i\Big). \qquad (9)$$
The top-down inference is a symmetric version of the bottom-up inference, which can be written as

$$p(h^{k-1}_i = 1 \mid h^k) = \sigma\Big(a^{k-1}_i + \sum_{j=1}^{m} W^k_{ij} h^k_j\Big), \qquad (10)$$

where we denote the bias of layer k−1 as $a^{k-1}$.
We show a two-layer DBN in Figure 4, in which the pre-training follows a greedy layer-wise training procedure. Specifically, one layer is added on top of the network at each step, and only that top layer is trained as an RBM using the CD strategy [7]. After each RBM has been trained, its weights are clamped, a new layer is added, and the above procedure is repeated.
After pre-training, the values of the latent variables in every layer can be inferred by a single bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction. The top layer of the DBN forms a compressed manifold of the input data, in which each unit has a distinct weighted non-linear relationship with all of the input factors. This new representation of the RFs later serves as the input to several traditional classifiers.
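A compact sketch of this greedy layer-wise stacking, reusing the sigmoid and cd1_update helpers from the RBM sketch above (again our own illustration, not the authors' code):

```python
def pretrain_dbn(data, layer_sizes, epochs=10, rng=None):
    """Greedy layer-wise pre-training: train one RBM per layer, clamp its
    weights, then feed its hidden probabilities upward to the next layer."""
    rng = rng or np.random.default_rng(0)
    params, x = [], data
    for n_hidden in layer_sizes:                   # e.g. [256, 11]
        D = x.shape[1]
        W = rng.normal(0, 0.01, size=(D, n_hidden))
        a, b = np.zeros(n_hidden), np.zeros(D)
        for _ in range(epochs):
            for batch in np.array_split(x, max(1, len(x) // 20)):  # batch ~20
                W, a, b = cd1_update(batch, W, a, b, rng=rng)
        params.append((W, a, b))                   # weights clamped from here on
        x = sigmoid(x @ W + a)                     # bottom-up pass to next layer
    return params, x                               # x: top-layer representation
```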
To incorporate labeled samples, we add a regression layer on top of the DBN to get classification results, which can be used to update the overall model using back-propagation.
The training procedure using the two sources of labels is shown in Algorithm 1.

Algorithm 1: DBN training algorithm for risk factors
Input: all risk factors; learning rate; Gibbs rounds z; stopping patience d
Output: model parameters M(W, a, b)
Pre-training Stage:
1:  randomly initialize all W, a, b
2:  for t from layer V to h_{l−1}
3:      clamp t and run CD_z to update M_t and t+1
4:  end
Fine-tuning Stage:
5:  randomly drop out 30% of the hidden units in each layer
6:  repeat
7:      for each predicted result r
8:          calculate the cost c between r and the ground truth g1
9:          calculate the partial gradient of c with respect to M
10:         update M
11:         calculate the cost c′ on the held-out set
12:         if c′ has been larger than its previous value for d rounds
13:             break
14:         end
15:     end
16: end
17: do the fine-tuning stage again with ground truth g2
In Algorithm 1, lines 2 to 4 reflect a layer-wise Contrastive Divergence (CD) learning procedure, where z is a predetermined hyper-parameter that controls how many Gibbs rounds are completed for each sampling step and t+1 is the state of the upper layer. In our experiments, we choose z to be 1. The pre-training phase stops when all layers are exhausted.
Lines 5 to 15 show a standard gradient update procedure (fine-tuning). Since we have ground truths g1 and g2 representing different measurements, we implement a second fine-tuning procedure using g2 after the stage using g1. Notice that we first randomly drop out 30% of the hidden units in each layer, for the purpose of alleviating the counter-effect between fine-tuning on the two different types of labels g1 and g2. To prevent over-fitting we use early stopping; specifically, the fine-tuning procedure halts when the validation error stops decreasing or starts to increase within 5 mini-batches. The semantic meanings of g1 and g2 are explicitly stated in Section 2.4.
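For concreteness, one fine-tuning stage with dropout and early stopping might be sketched as follows; iter_minibatches, forward_with_dropout, logistic_cost, and grad are hypothetical helpers, since the paper does not publish its implementation:

```python
def finetune(params, X, y, X_val, y_val, lr=0.05, patience=5, drop=0.3,
             rng=None):
    """One fine-tuning stage (run once with g1, then again with g2)."""
    rng = rng or np.random.default_rng(0)
    best_val, bad_rounds = np.inf, 0
    # drop 30% of hidden units per layer (line 5 of Algorithm 1)
    masks = [rng.random(W.shape[1]) > drop for W, _, _ in params]
    while True:
        for Xb, yb in iter_minibatches(X, y, size=20):             # hypothetical
            pred = forward_with_dropout(params, Xb, masks)         # hypothetical
            g = grad(logistic_cost, params, pred, yb)              # hypothetical
            params = [(W - lr * gW, a - lr * ga, b - lr * gb)
                      for (W, a, b), (gW, ga, gb) in zip(params, g)]
        val = logistic_cost(forward_with_dropout(params, X_val, masks), y_val)
        bad_rounds = bad_rounds + 1 if val >= best_val else 0
        best_val = min(best_val, val)
        if bad_rounds >= patience:        # early stopping on the held-out set
            return params
```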
The main advantage of the DBN is that it tends to produce more expressive and invariant results than the single-layer network while also reducing the size of the representation. This approach obtains filter-like representations if we treat the unit weights as filters [9]. We want to filter out the insignificant risk factors and thus find robust, integrated features which are a fusion of both observed risk factors and hidden reasons for predicting bone diseases.
2.4 Model Pipeline
Our system pipeline, including two main components, is described by the flow chart in Figure 5. The first component illustrates the proposed risk factor analysis framework for the integrated salient risk features. In the proposed framework, we first keep the original samples with 672 potential RFs and two types of labels. Then we feed all of them into the risk factor analysis (RFA) module. On the one hand, we train our model using a shallow RBM and a two-layer DBN without using the two types of labels (i.e., without the fine-tuning stage). Such an unsupervised training procedure aims at reducing the freedom of data fitting when the ultimate goal is to predict bone diseases given risk factors.
[Figure 5: Pipeline of our proposed method. Raw RFs (672 per sample) and two types of labels (L1, L2) enter the RFA module (Component 1: risk factor analysis), which outputs 11 integrated risk features feeding the two-phase prediction module (Component 2: prediction, Phase 1 and Phase 2).]

Actually, this is an unsupervised pre-training phase which guides the learning towards basins of optima that support better generalization. Moreover, most real-world healthcare data lack ground truth, and therefore achieving a good performance during an unsupervised phase is all the more influential for the disease prediction. On the other hand, we train both the shallow RBM and the two-layer DBN with the two-stage fine-tuning procedure. In the context of scarce labeled data, both models have shown promise during this semi-supervised process as well. Samples in the original data are therefore projected onto a new space with a predetermined dimensionality². These integrated low-dimensional
risk features can be viewed as a new representation and can also be treated as the input flowing into the second component. In summary, Component 1 acts as a knowledge integration component that generates integrated risk features which maintain the properties of both the observed risk factors and the latent risk factors in the data.
In Figure 5, Component 2 evaluates our new integrated features using a two-step prediction module composed of Phase 1 and Phase 2. Specifically, Phase 1 aims to predict whether a person tends to develop abnormal bone (osteopenia or osteoporosis) after 10 years, as measured by the BMD value. We regard abnormal bone as Positive and normal bone as Negative. In Algorithm 1, we use g1 to represent this measure, which includes two cases: (1) the person will have osteopenia or osteoporosis 10 years later, and (2) the person's bones will have a healthy density after 10 years. Then, for the abnormal cases, Phase 2 is used to predict the annual bone loss rate (high or low), measured by a series of BMD values. For Phase 2, we treat a high bone loss rate as Positive and a low bone loss rate as Negative. In Algorithm 1, we use g2 to represent this measure, which includes two cases: (1) annual bone loss at a high rate for the next 10 years, and (2) annual bone loss at a low rate for the next 10 years. Obviously, the size of the dataset for the second phase shrinks, because we have discarded the negative cases after the first phase. In either of the two phases, the Positive prediction is usually what attracts people's attention.
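The cascade in Component 2 can be summarized with a short sketch (illustrative only; the classifier follows the LR baseline used in the experiments, and the argument layout is our own assumption):

```python
from sklearn.linear_model import LogisticRegression

def two_phase_predict(features, g1, g2_for_abnormal, new_features):
    """Phase 1: normal vs. abnormal bone; Phase 2: bone loss rate,
    run only on the cases that Phase 1 flags as Positive (abnormal)."""
    phase1 = LogisticRegression(max_iter=1000).fit(features, g1)
    is_abnormal = phase1.predict(new_features) == 1        # Positive output
    phase2 = LogisticRegression(max_iter=1000).fit(
        features[g1 == 1], g2_for_abnormal)                # abnormal cases only
    loss_rate = phase2.predict(new_features[is_abnormal])  # high vs. low
    return is_abnormal, loss_rate
```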
3. EXPERIMENTS

3.1 Data Set

The Study of Osteoporotic Fractures (SOF) is the largest and most comprehensive study of risk factors (RFs) for bone diseases, including 9,704 Caucasian women aged 65 years
and older. It contains 20 years of prospective data about osteoporosis, bone fractures, breast cancer, and so on. Potential risk factors (RFs) and confounders were classified into
20 categories such as demographics, family history, lifestyle,
and medical history [2]. As shown in Figure 6, there are
missing values for both risk factor space and label space,
denoted as empty shapes.
[Figure 6: Illustration of the missing data in the SOF data set, shown as empty shapes in both the RFs space and the labels space, for Patient1 through Patient9704.]

A number of potential RFs were grouped and organized at the first and second visits, comprising 672 variables scattered across 20 categories, as the input of our model. The rest of the visits contain time-series dual-energy X-ray absorptiometry (DXA) scan results on bone mineral density (BMD) variation, which are extracted and processed as the labels for our data set. Based on the WHO standard, a T-score of less than -1³ indicates the osteopenia condition, the precursor to osteoporosis, which is used as the first type of label. The second type of label is the annual rate of BMD variation. We use at least two BMD values in the data set to calculate the bone loss rate, and we define a high bone loss rate as greater than 0.84% bone loss per year [27]. We have shown how to employ these multi-label data and complete a hierarchical prediction task in Section 2.4 and Algorithm 1. Notice that this is a partially labeled data set, since some patients only came for the first and second visits and never took a DXA scan in the following visits, like the example Patient3 shown in Figure 6.
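As an illustration of how the two label types could be derived from the BMD series (thresholds taken from the text; the array layout is a hypothetical assumption):

```python
import numpy as np

def make_labels(tscores_10y, bmd_series, visit_years):
    """g1: abnormal bone (T-score < -1) after 10 years -> Positive (1).
    g2: annual bone loss rate > 0.84% -> high (1), else low (0)."""
    g1 = (np.asarray(tscores_10y) < -1.0).astype(int)
    # annual loss rate from at least two BMD measurements per patient
    first, last = bmd_series[:, 0], bmd_series[:, -1]
    span = visit_years[:, -1] - visit_years[:, 0]
    annual_loss_pct = 100.0 * (first - last) / (first * span)
    g2 = (annual_loss_pct > 0.84).astype(int)
    return g1, g2
```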
3.2 Evaluation Metric
The error rate on a test dataset is commonly used to evaluate classification performance. Nevertheless, for most skewed medical data sets, the error rate can still be low when the entire minority class is misclassified into the majority class. Thus, two alternative measurements are used in this paper. First, Receiver Operating Characteristic (ROC) curves are plotted to capture how the number of correctly classified abnormal cases varies with the number of normal cases incorrectly classified as abnormal. Since in most medical problems we usually care
about the fraction of examples classified as abnormal cases that are truly abnormal, Precision-Recall (PR) curves are also plotted to show this property. We present the confusion matrix in Table 1 and several derived quality measures in Table 2.

Table 1: Confusion matrix.

                   | Actual Positive | Actual Negative
Predicted Positive | TP              | FP
Predicted Negative | FN              | TN

² We use 11 dimensions, which is consistent with the number of expert-selected RFs.
³ A T-score of -1 corresponds to a BMD of 0.82, if the reference BMD is 0.942 and the reference standard deviation is 0.122.
Table 2: Metrics definition.

True Positive Rate  = TP / (TP + FN)
False Positive Rate = FP / (FP + TN)
Precision           = TP / (TP + FP)
Recall              = TP / (TP + FN)
Error Rate          = (FP + FN) / (TP + TN + FP + FN)
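These definitions translate directly into code; a quick sketch (ours) of the derived measures given confusion-matrix counts:

```python
def metrics(tp, fp, fn, tn):
    """Derived quality measures from the confusion matrix in Table 1."""
    return {
        "true_positive_rate": tp / (tp + fn),   # equals recall
        "false_positive_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "error_rate": (fp + fn) / (tp + tn + fp + fn),
    }
```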
3.3 Experiment Setup
Since no single classifier is considered to perform best on all classification problems, we choose two classical classifiers⁴ to validate our risk factor analysis results against the expert opinion. Logistic Regression (LR) is widely used among experts to assess clinical RFs and predict fracture risk. Support Vector Machines (SVMs) have also been applied to various real-world problems.
RFs are extracted based on the expert opinion [1, 8, 29, 14] and summarized by the variables in Table 3. We apply the two basic classifiers mentioned above and choose their parameters by cross-validation for fairness. Notice that this is a supervised learning process, since all samples for this expert-knowledge-based model are labeled. For a fair comparison with the classification results using expert knowledge, we fix the number of output dimensions of the RFA module to be equal to the number of expert-selected RFs. Specifically, we fix the number of units in the output layer to 11, where each unit in this layer represents a new integrated feature describing complex relationships among all 672 input factors, rather than a single independent RF like those in the set of typical risk factors selected by experts shown in Table 3.
For all the experiments involving RFA, the learning rate used to update the weights is fixed to the value of 0.05⁵ and the number of iterations is set to 10 for efficiency⁶. We use mini-batch gradient updates for the parameters, and the batch size is set to 20. After RFA is trained, we simply feed it with the whole data set, obtain the new integrated RFs, and then run the same classification module to get the results.

Table 3: Typical risk factors from the expert opinion.

Variable             | Type    | Description
Age                  | Numeric | Between 65 and 84
Weight               | Numeric |
Height               | Numeric |
BMI                  | Numeric | BMI = weight / height²
Parent fall          | Boolean | Hip fracture in the patient's mother or father
Smoke                | Boolean |
Excess alcohol       | Boolean | 3 or more units of alcohol daily
Rheumatoid arthritis | Boolean |
Physical activity    | Boolean | Use of arms to stand up from a chair
Physical exercise    | Boolean | Takes walks for exercise
BMD                  | Numeric | Normal: T-score > -1; Abnormal: T-score <= -1

⁴ If we only train RFA by fine-tuning, the model ends up as a traditional neural network (NN). In our experiments, NN training was more likely to result in over-fitting, and the performance was no better than a non-linear SVM classifier with an RBF kernel.
⁵ It is chosen from the validation set.
⁶ We observed that the model cost can reach a relatively stable state within 5 to 10 iterations.
Since the sample size is large and highly imbalanced in Phase 1, we evaluate the performance using both ROC and PR curves. However, the number of samples in Phase 2 is small and balanced, so we only evaluate the performance using the classification error rate in that phase.
3.4 Results and Evaluation

3.4.1 Phase 1: Osteoporosis Prediction
The overall results for the SOF data after Phase 1 are shown in Figure 7. As introduced in Section 2.4, we first implement the RBM and the two-layer DBN for unsupervised learning, also known as the pre-training stage. Then we add labeled information into both models for supervised learning, also known as the fine-tuning (FT) stage. Figure 7 shows the results of risk factor analysis using the RBM without/with FT and the DBN without/with FT for the two classifiers, presented as grouped ROC and PR curves. Furthermore, the area under the curve (AUC) of the ROC curve for each classifier (denoted "LR-ROC", "SVM-ROC") and the AUC of the PR curve (denoted "LR-PR", "SVM-PR") are shown in Table 4. AUC indicates the performance of a classifier: the larger the better (an AUC of 1.0 indicates perfect performance). The classification results using expert knowledge are also shown as the baseline for comparison.
From Figure 7(a), we observe that a shallow RBM without FT gets a sense of how the data are distributed, which represents the basic characteristics of the data itself, shown as the LR (RBM) and SVM (RBM) curves in the figure. Although the performance is not always higher than that of the expert model, shown as the LR (Expert) and SVM (Expert) curves in Figure 7(a), this is a completely unsupervised process that borrows no knowledge from any type of labeled information. Achieving such comparable performance is not easy, since the expert model is trained in a supervised way. But we find from the above experiments that the model lacks focus on a specific task, which leads to poorer performance. Further improvements may be possible through more thorough experiments with the two types of labeled data to complete a
two-stage fine-tuning that better satisfies our prediction tasks. Next, we take advantage of the labeled information and move from an unsupervised task to a semi-supervised one because of the partially labeled data. Figure 7(a) shows the classification results, which boost the performance of all classifiers thanks to the two-stage fine-tuning, shown as the LR (RBM with FT) and SVM (RBM with FT) curves, with their AUC values, in Figure 7(a). In particular, the AUC of PR of our model significantly outperforms the expert system.

Table 4: AUC of the ROC and PR curves for the expert knowledge model and our RFA model using four different structures.

Risk Factors From:     | LR-ROC | SVM-ROC | LR-PR | SVM-PR
Expert knowledge       | 0.729  | 0.601   | 0.458 | 0.343
Shallow RBM without FT | 0.638  | 0.591   | 0.379 | 0.358
Shallow RBM with FT    | 0.795  | 0.785   | 0.594 | 0.581
DBN without FT         | 0.662  | 0.631   | 0.393 | 0.386
DBN with FT            | 0.878  | 0.879   | 0.718 | 0.720

Table 5: Classification error rates given the expert-based model and our model.

            | LR-Error | SVM-Error
Expert      | 0.3833   | 0.3259
DBN with FT | 0.1066   | 0.0936

[Figure 7: Performance comparison. (a) ROC and PR curves for expert knowledge and the shallow RBM without/with FT. (b) ROC and PR curves for expert knowledge and the DBN without/with FT. The per-curve AUC values in the legends correspond to Table 4.]
Since the capacity of the RBM model with one hidden layer is usually small, there is a need for a more expressive model over the complex data. To satisfy this need, we add a new layer of non-linear perceptrons at the bottom of the RBM, which forms a DBN as shown in Figure 4. This newly added layer greatly enlarges the overall model expressiveness. More importantly, the deeper structure is able to extract more abstract reasons. As expected, using a deeper structure without labeled information yields better performance than the shallow RBM model, shown as the LR (DBN) and SVM (DBN) curves in Figure 7(b). The model further improves its behavior after the two-stage fine-tuning, shown as the LR (DBN with FT) and SVM (DBN with FT) curves in Figure 7(b), with their AUC values. All the numerical AUC results, rounded to 3 significant figures, are listed in Table 4.
3.4.2 Phase 2: Bone Loss Rate Prediction
In this section, we show the bone loss rate prediction using the abnormal cases after Phase 1. A high bone loss rate is an important predictor of higher fracture risk. Moreover, it is reported that the RFs that account for high and low bone loss rates are different [27]. Our integrated risk features are good at capturing this property, since they integrate the characteristics of the data itself and are carefully tuned with the help of the two kinds of labels. We compare the results between the expert-knowledge-based model and our DBN-with-fine-tuning model, which yields the best performance for Phase 1. The classification error rate is defined in Table 2.
Since our result is also fine-tuned by the bone loss rate, we can directly feed the 11 new integrated features into Phase 2. Table 5 shows that our model achieves high predictive power when predicting the bone loss rate. In this case, the expert model fails because the limited features are not sufficient to describe the bone loss rate, which may interact with other different RFs. This highlights the need for a more complex
model to extract the precise attributes from a large number of potential RFs. Moreover, our RFA module takes into account the whole data set, not only keeping the 672 risk factor dimensions but also utilizing the two types of labeled data, that is, normal/abnormal bone and low/high bone loss rate. The fine-tuning effects can also be important for bone loss prediction, although the amount of labeled data and the number of samples are smaller than in the first prediction task.
4. CONCLUSION AND FUTURE WORK
A challenge for disease risk factor analysis is that the complicated and highly correlated relationships behind the risk factors are difficult to mine. In addition, the lack of complete ground truth for medical data is an inevitable obstacle to developing a state-of-the-art model in the realm of health informatics. Existing approaches neither incorporate the information from the whole data set nor adequately solve the partially labeled problem with limited medical data resources. In this paper, an integrated approach is developed to identify the osteoporotic risk features associated with patients' continuous medical records. By learning a deep graphical model using both the shallow Restricted Boltzmann Machine (RBM) and the Deep Belief Net (DBN) structure, together with a two-stage fine-tuning (FT), the predictive performance and stability for predicting osteoporosis and bone loss improve stepwise, since the model is more expressive and wisely tuned by the labeled information. The risk factor analysis (RFA) captures the data manifold and identifies an accurate set of lower-dimensional risk features from the original higher-dimensional data. We obtain an integrated set of new features that preserves properties from both the osteoporosis and the bone loss rate perspectives, yielding better prediction performance than the expert opinion.
In future work, we plan to extend the problem from risk factor analysis and prediction on bone diseases to other diseases for single-task learning. For multi-task learning, we aim to investigate the use of shared risk factors for entirely different diseases. We will also try to enrich the data from one source to two or more sources, such as combining patients' DXA scan images with their clinical records and questionnaire data, to construct a multi-view learning framework.
5. ACKNOWLEDGMENTS
The materials published in this paper are partially supported by the National Science Foundation under Grants
No. 1218393, No. 1016929, and No. 0101244.
6. REFERENCES
[1] http://www.shef.ac.uk/FRAX/.
[2] http://www.sof.ucsf.edu/interface/.
[3] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, pages 147–169, 1985.
[4] E. Amaldi and V. Kann. On the approximability of
minimizing nonzero variables or unsatisfied relations
in linear systems. Theoretical Computer Science,
209(1):237–260, 1998.
[5] R. Bender. Introduction to the use of regression
models in epidemiology. Methods Mol Biol,
471:179–195, 2009.
[6] D. Black, M. Steinbuch, L. Palermo,
P. Dargent-Molina, R. Lindsay, M. Hoseyni, and
O. Johnell. An assessment tool for predicting fracture
risk in postmenopausal women. Osteoporosis
International, 12(7):519–528, 2001.
[7] M. A. Carreira-Perpinan and G. E. Hinton. On
contrastive divergence learning, 2005.
[8] Cummings, S.R., Nevitt, M.C., Browner, W.S., Stone,
K., Fox, K.M., Ensrud, K.E., Cauley, J., Black, D.,
and Vogt, T.M. Risk factors for hip fracture in white
women. Study of Osteoporotic fractures research group,
332:767–773, 1995.
[9] D. Erhan, A. Courville, and Y. Bengio. Understanding
representations learned in deep architectures.
Technical report, Technical Report 1355, Université de
Montréal/DIRO, 2010.
[10] L. Getoor, J. T. Rhee, D. Koller, and P. Small.
Understanding tuberculosis epidemiology using
structured statistical models. Artificial Intelligence in
Medicine, 30(3):233–256, 2004.
[11] S. H. Ha. Medical domain knowledge and associative
classification rules in diagnosis. International Journal
of Knowledge Discovery in Bioinformatics (IJKDB),
2(1):60–73, 2011.
[12] S. F. Hafstein. An algorithm for constructing Lyapunov functions, 2007.
[13] G. E. Hinton and R. R. Salakhutdinov. Reducing the
dimensionality of data with neural networks. Science,
2006.
[14] Hui, S.L., Slemenda, C.W. and Johnston, C.C. Age
and bone mass as predictors of fracture in a
prospective study. The Journal of Clinical
Investigation, 81:1804–1809, 1988.
[15] J. Kanis, A. Oden, H. Johansson, and E. McCloskey.
Pitfalls in the external validation of FRAX. Osteoporosis
International, pages 1–9, 2012.
[16] J. Kanis, A. Odén, O. Johnell, H. Johansson,
C. De Laet, J. Brown, P. Burckhardt, C. Cooper,
C. Christiansen, S. Cummings, et al. The use of
clinical risk factors enhances the performance of bmd
in the prediction of hip and osteoporotic fractures in
men and women. Osteoporosis international,
18(8):1033–1046, 2007.
[17] S. Kullback. On information and sufficiency, 1951.
[18] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep
belief net model for visual area V2. In Advances in
Neural Information Processing Systems 20, pages
873–880. Nips Foundation, 2008.
[19] G. Lemineur, R. Harba, N. Kilic, O. Ucan, O. Osman,
and L. Benhamou. Efficient estimation of osteoporosis
using artificial neural networks. In Industrial
Electronics Society, 2007. IECON 2007. 33rd Annual
Conference of the IEEE, pages 3039–3044. IEEE, 2007.
[20] H. Li, C. Buyea, X. Li, M. Ramanathan, L. Bone, and
A. Zhang. 3d bone microarchitecture modeling and
fracture risk prediction. In Proceedings of the ACM
Conference on Bioinformatics, Computational Biology
and Biomedicine, pages 361–368. ACM, 2012.
[21] J. Li, J. Shi, and D. Satz. Modeling and analysis of
disease and risk factors through learning bayesian
networks from observational data. Quality and
Reliability Engineering International, 24(3):291–302,
2008.
[22] W. Moudani, A. Shahin, F. Chakik, and D. Rajab.
Intelligent decision support system for osteoporosis
prediction. International Journal of Intelligent
Information Technologies (IJIIT), 8(1):26–45, 2012.
[23] C. Ordonez and K. Zhao. Evaluating association rules
and decision trees to predict multiple target attributes.
Intelligent Data Analysis, 15(2):173–192, 2011.
[24] B. J. Riis. The role of bone loss. The American
journal of medicine, 98(2):29S–32S, 1995.
[25] J. Robbins, A. Schott, P. Garnero, P. Delmas,
D. Hans, and P. Meunier. Risk factors for hip fracture
in women with high bmd: Epidos study. Osteoporosis
international, 16(2):149–154, 2005.
[26] F. Rosenblatt. Principles of neurodynamics;
perceptrons and the theory of brain mechanisms.
Spartan Books, Washington, 1962.
[27] J. Sirola, A.-K. Koistinen, K. Salovaara, T. Rikkonen,
M. Tuppurainen, J. S. Jurvelin, R. Honkanen,
E. Alhava, and H. Kröger. Bone loss rate may interact
with other risk factors for fractures among elderly
women: A 15-year population-based study. Journal of
osteoporosis, 2010, 2010.
[28] A. J. Storkey and R. Valabregue. The basins of
attraction of a new hopfield learning rule. Neural
Networks, 12(6):869–876, 1999.
[29] Taylor, B.C., Schreiner, P.J., Stone, K.L., Fink, H.A.,
Cummings, S.R., Nevitt, M.C., Bowman, P.J., and
Ensrud, K.E. Long-term prediction of incident hip
fracture risk in elderly white women: study of
osteoporotic fractures. J Am Geriatr Soc Int,
52:1479–1486, 2004.
[30] W. Wang, G. Richards, and S. Rea. Hybrid data
mining ensemble for predicting osteoporosis risk. In
Engineering in Medicine and Biology Society, 2005.
IEEE-EMBS 2005. 27th Annual International
Conference of the, pages 886–889. IEEE, 2006.
[31] WHO Scientific Group. Prevention and management
of osteoporosis. Who technical report series, world
health organization, Geneva, 2003.
[32] World Health Organization. WHO scientific group on
the assessment of osteoporosis at primary health care
level. Summary meeting report, Brussels, Belgium,
May 5-7 2004.