A Semi-Supervised Learning Approach to Integrated Salient Risk Features for Bone Diseases∗

Hui Li, Department of Computer Science and Engineering, State University of New York at Buffalo, USA, hli24@buffalo.edu
Xiaoyi Li, Department of Computer Science and Engineering, State University of New York at Buffalo, USA, xiaoyili@buffalo.edu
Murali Ramanathan, Department of Pharmaceutical Sciences, State University of New York at Buffalo, USA, murali@buffalo.edu
Aidong Zhang, Department of Computer Science and Engineering, State University of New York at Buffalo, USA, azhang@buffalo.edu

∗Corresponding author.

BCB '13, September 22-25, 2013, Washington, DC, USA. Copyright 2013 ACM 978-1-4503-2434-2/13/09.

ABSTRACT

The study of risk factor analysis and prediction for diseases requires understanding the complicated and highly correlated relationships behind numerous potential risk factors (RFs). Existing models for this purpose usually fix a small number of RFs based on expert knowledge. Although handcrafted RFs are usually statistically significant, the abandoned RFs might still contain valuable information for explaining the comprehensiveness of a disease. However, it is impractical to simply keep all of the RFs, so finding the integrated risk features from numerous potential RFs becomes a particularly challenging task. Another major challenge for this task is the lack of sufficient labeled data and the missing values in the training data. In this paper, we focus on identifying the relationships between a bone disease and its potential risk factors by learning a deep graphical model in an epidemiologic study, for the purpose of predicting osteoporosis and bone loss. An effective risk factor analysis approach that delineates both observed and hidden risk factors behind a disease encapsulates the salient features and also provides a framework for two prediction tasks. Specifically, we first investigate an approach that shows the salience of the integrated risk features, yielding more abstract and useful representations for the prediction. We then formulate the whole prediction problem as two separate tasks to evaluate our new representation of integrated features. Building on the success of the osteoporosis prediction, we further take advantage of the Positive output and predict the progression trend of osteoporosis severity. We capture the characteristics of the data itself and the intrinsic relatedness between two relevant prediction tasks by constructing a deep belief network followed by a two-stage fine-tuning (FT). Moreover, our proposed method produces stable and promising results without using any prior information. The superior performance on our evaluation metrics confirms the effectiveness of the proposed approach for extracting integrated salient risk features for predicting bone diseases.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications—Data Mining; J.3 [Life and Medical Sciences]: Health.

General Terms: Algorithms, Performance, Experimentation.
Keywords: Risk Factor Analysis (RFA), Integrated Features, Deep Belief Net (DBN), Restricted Boltzmann Machine (RBM), Osteoporosis, Bone Fracture.

1. INTRODUCTION

Modeling the relationships between a disease and its potential risk factors (RFs) is a crucial task of epidemiology and public health. Usually, numerous potential RFs need to be considered simultaneously for assessing disease determinants and predicting the progression of the disease, for the purpose of disease control or prevention. More importantly, some common diseases may be clinically silent but can cause significant mortality and morbidity after onset. Unless prevented or treated early, these diseases will affect the quality of life and increase the burden of healthcare costs. With the success of risk factor analysis and disease prediction based on an intelligent computational model, unnecessary tests can be avoided. The information can assist in evaluating the risk of the occurrence of a disease, monitoring the disease progression, and facilitating early prevention measures.

[Figure 1: Risk factors for osteoporosis (demographics, diet, lifestyle, diagnosis, and fracture sites such as vertebral, wrist, and hip fractures).]

[Figure 2: Structure of how a deep graphical model learns a shared intermediate representation of the input data for two prediction tasks (osteoporosis prediction and bone loss rate prediction) via classifiers.]

In this paper, we focus on finding the integrated salient risk features for the study of osteoporosis and bone fracture prediction. Over the past few decades, osteoporosis has been recognized as an established and well-defined disease that affects more than 75 million people in the United States, Europe and Japan, and it causes more than 8.9 million fractures annually worldwide [32]. It is reported that 20-25% of people with a hip fracture are unable to return to independent living and 12-20% die within one year. In 2003, the World Health Organization (WHO) embarked on a project to integrate information on RFs and bone mineral density (BMD) to better predict fracture risk in men and women worldwide [31]. Osteoporosis in the vertebrae can cause serious problems for women, such as bone fracture. The diagnosis of osteoporosis is usually based on the assessment of BMD. The most widely validated technique to measure BMD is dual-energy X-ray absorptiometry (DXA). Different from osteoporosis measured by BMD, bone fracture risk is determined by the bone loss rate and various factors such as demographic attributes, family history and lifestyle. Some studies have stratified their analysis of fracture risk into fast and slow bone losers. With a faster rate of bone loss, people have a higher future risk of fracture [24]. As shown in Figure 1, osteoporosis and bone fracture are complicated diseases associated with potential RFs that include, but are not limited to, demographic attributes, patients' clinical records regarding disease diagnoses and treatments, family history, diet, and lifestyle. Different representations of this information might entangle the different explanatory factors of variation behind various RFs and diseases. Fundamental questions have been attracting researchers' interest in this area, for example: how can we perform feature extraction and select the integrated significant features?
Also, what are appropriate approaches for manifold feature extraction that accurately maintain the real and intricate relationships between a disease and its potential risk factors? A good representation has an advantage in capturing underlying factors with shared statistical strength for predicting two relevant tasks, as illustrated in Figure 2. This representation-learning model discovers explanatory factors in the top layer of the shared intermediate representation by combining knowledge from both the input data and the output prediction tasks. The rich interactions among numerous potential RFs, or between RFs and a disease, can complicate our final prediction tasks. How can we cope with these complex interactions? How can we disentangle the salient integrated features from complex data? The proposed approach offers a good solution to these questions.

The major challenges of analyzing disease RFs and making any kind of disease prediction can be summarized from two perspectives:

The complexity of the data representation and processing. With the advancement of computer technologies, various real-world medical data can be collected and warehoused. However, due to the complexity of diseases, disease risk factor analysts require comprehensive and complex data, as shown in Figure 1. It is a challenge to completely and precisely process those data using a feasible data structure. On the other hand, even with a longitudinal study and careful planning of data collection over decades, handling the plentiful missing values in such complicated data sets, caused by incomplete patient information, remains a significant challenge.

The lack of a comprehensive model. Such a data-rich environment creates new opportunities for investigating the relationships between the disease and all potential RFs simultaneously. However, there is a lack of effective tools for analyzing these data. People sometimes overlook a lot of information about a disease in which a wealth of valuable hidden information may still exist. As a consequence, which pieces of information are more influential than others for a particular disease becomes an endless argument.

Traditionally, the assessment of the relationship between a disease and a potential risk factor is achieved by finding statistically significant associations using regression models [5, 6, 25, 16] such as linear regression, logistic regression, Poisson regression, and Cox regression. Although these regression models are theoretically acceptable for analyzing the risk dependence of several variables, they pay little attention to the nature of the RFs and the disease. Sometimes fewer than ten significant RFs are fed into those models, which is not enough for intelligently predicting a complicated disease such as osteoporosis. Other data mining studies with this objective have focused on association rules [11], decision trees [23] and Artificial Neural Networks (ANN) [19]. With these methods, it is difficult to build a comprehensive model that can guide medical decision making when a large number of potential RFs need to be studied simultaneously. Usually a limited set of RFs is selected based on the knowledge of physicians, since handling many features is computationally expensive; alternatively, feature selection techniques are used to select a limited number of RFs before feeding a classifier. However, the feature selection problem is known to be NP-hard [4] and, more importantly, the abandoned features might still contain valuable information.
Furthermore, the performance of an ANN depends on a good setting of meta-parameters, so parameter tuning is an inevitable issue. Under these scenarios, most of these traditional data mining approaches may not be effective.

Mining the causal relationships between RFs and a specific disease has attracted considerable research attention in recent years. In [21, 10, 20], limited RFs are used to construct a Bayesian network and the RFs are assumed conditionally independent of one another. It is also worth mentioning that the random forest decision tree has been investigated for identifying osteoporosis cases [22]. The data in that work is processed using FRAX [1]. Although this is a popular fracture risk assessment tool developed by the WHO, it may not be appropriate to directly adopt the results from this prediction tool for evaluating the validity of an algorithm, since FRAX sometimes overestimates or underestimates the fracture risk [15]. The prediction results from FRAX need to be interpreted with caution and properly re-evaluated. Some hybrid data mining approaches might also combine classical classification methods with feature selection techniques for the purpose of improving performance or minimizing the computational expense on a large data set [30], but they are limited by the challenge of explaining the selected features.

Recent research has been devoted to learning algorithms for deep architectures, with impressive results in several areas. The ability to infer over an exponential number of subspaces without using labels is typically suitable for our assumption, namely that observed RFs are caused by various hidden reasons, and one hidden reason may directly or indirectly affect other RFs. With the help of an efficient inference technique, we build our learning procedure to formulate a set of risk factors which integrates both observed and hidden risk factors. Another significant point is that our integrated salient features are directly extracted from the raw input without discarding any RFs, which maximally utilizes the provided information.

Our contributions in this paper can be summarized as follows:

• We investigate the problem of risk factor analysis on bone diseases and propose the hypothesis that observed RFs are caused by hidden reasons which should be modeled at the same time.

• We propose a learning framework to simultaneously capture the uniqueness of each risk factor and their potential relationships with hidden reasons.

• Our proposed framework utilizes a large amount of unlabeled data and is fine-tuned by two sets of hierarchical labeled information.

• We learn the model using the original high-dimensional data and extract the important features that are more likely to interpret the hidden reasons, since they are highly correlated with a disease. In this way, our work could potentially save medical researchers years of effort in explaining the reason for selecting a specific risk factor for a disease under different demographic characteristics.

2. METHODOLOGY

In this section, we first briefly describe the evolution of energy-based models as preliminaries to our proposed method, in order to convey the intuition behind our task of risk factor analysis (RFA). We then introduce single-layer and multi-layer learning approaches for yielding the integrated salient features for diseases, respectively. Finally, we show the pipeline of our framework as an overall two-step prediction system.
2.1 Preliminaries

2.1.1 Hopfield Net

A Hopfield network is a form of recurrent artificial neural network introduced by John Hopfield [28]. It serves as a content-addressable memory system with binary threshold nodes, where each unit (a node in the graph simulating an artificial neuron) can be updated using the following rule:

$$S_i = \begin{cases} 1 & \text{if } \sum_j W_{i,j} S_j > \theta_i, \\ -1 & \text{otherwise,} \end{cases} \qquad (1)$$

where $W_{i,j}$ is the strength of the connection weight from unit $j$ to unit $i$, $S_j$ is the state of unit $j$, and $\theta_i$ is the threshold of unit $i$. Based on Eq. (1), the energy of the Hopfield net is defined as

$$E = -\frac{1}{2}\sum_{i,j} W_{i,j} S_i S_j + \sum_i \theta_i S_i. \qquad (2)$$

The difference in the global energy that results from a single unit $i$ being 0 (off) versus 1 (on), denoted as $\Delta E_i$, is given as follows:

$$\Delta E_i = \sum_j W_{i,j} S_j + \theta_i. \qquad (3)$$

Eq. (2) ensures that when units are randomly chosen to update, the energy $E$ will either decrease in value or stay the same. Furthermore, repeatedly updating the network will eventually converge to a state which is a local minimum of the energy function (which is considered to be a Lyapunov function [12]). Thus, if a state is a local minimum of the energy function, it is a stable state for the network. Note that this energy function belongs to a general class of models in physics known as Ising models. This in turn is a special case of Markov networks, since the associated probability measure, the Gibbs measure, has the Markov property.

2.1.2 Boltzmann Machines

Boltzmann machines (BM) can be seen as the stochastic, generative counterpart of Hopfield nets [3]. They are one of the first examples of a neural network capable of learning internal representations, and they are able to represent and (given sufficient time) solve difficult combinatorial problems. The global energy in a Boltzmann machine is identical in form to that of a Hopfield network, with the difference that the partial derivative with respect to each unit (Eq. (3)) can be expressed as the difference of the energies of two states:

$$\Delta E_i = E_{i=\text{off}} - E_{i=\text{on}}. \qquad (4)$$

If we want to train the network so that it will converge to global states according to a data distribution that we have over these states, we need to set the weights so that the global states with the highest probabilities get the lowest energies. The units in the BM are divided into "visible" units, $V$, and "hidden" units, $h$. The visible units are those which receive information from the data. The distribution over the data set is denoted as $P^+(V)$. After the distribution over global states converges and is marginalized over the hidden units, we get the estimated distribution $P^-(V)$, which is the distribution of our model. The difference between the two can then be measured using the KL-divergence [17], and the partial gradient of this difference is used to update the network. However, the computation time grows exponentially with the machine's size and with the magnitude of the connection strengths.
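To make the preliminaries concrete, the following is a minimal sketch (our illustration, not code from the original paper) of how the update rule in Eq. (1) and the energy in Eq. (2) interact; the weights and states are randomly generated for demonstration.

```python
import numpy as np

def energy(W, theta, s):
    """Global energy E = -1/2 * sum_ij W_ij s_i s_j + sum_i theta_i s_i (Eq. 2)."""
    return -0.5 * s @ W @ s + theta @ s

def update_unit(W, theta, s, i):
    """Asynchronously update unit i with the binary threshold rule (Eq. 1)."""
    s[i] = 1 if W[i] @ s > theta[i] else -1
    return s

rng = np.random.default_rng(0)
n = 8
W = rng.normal(size=(n, n)); W = (W + W.T) / 2   # symmetric weights
np.fill_diagonal(W, 0)                           # no self-connections
theta = np.zeros(n)
s = rng.choice([-1, 1], size=n)

# Repeated asynchronous updates never increase the energy, so the
# network settles into a local minimum, i.e., a stable state.
for _ in range(50):
    s = update_unit(W, theta, s, rng.integers(n))
print(energy(W, theta, s))
```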
2.2 Single-Layer Learning for Integrated Features

Recently, one of the most effective single-layer greedy learning modules, the Restricted Boltzmann Machine (RBM), has attracted great interest; it is a variant of the Boltzmann machine whose learning can be made quite efficient. It is a powerful model and has been successfully used in the areas of computer vision and natural language processing (NLP).

An RBM is a generative stochastic graphical model that can learn a probability distribution over its set of inputs, with the restriction that its visible units and hidden units must form a fully connected bipartite graph. Specifically, it has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units [13]. We show a shallow RBM in Figure 3.

[Figure 3: Shallow restricted Boltzmann machine (RBM) including one visible layer V and one hidden layer h.]

The model defines the following energy function $E : \{0,1\}^{D+F} \to \mathbb{R}$:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j, \qquad (5)$$

where $\theta = \{a, b, W\}$ are the model parameters, and $D$ and $F$ are the numbers of visible units and hidden units, respectively. The joint distribution over the visible and hidden units is defined by:

$$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)), \qquad (6)$$

where $Z(\theta)$ is the partition function that plays the role of a normalizing constant for the energy function. Exact maximum likelihood learning is intractable in an RBM. In practice, efficient learning is performed using Contrastive Divergence (CD) [7]. To learn succinct representations, the model needs to be constrained by sparsity [18]. In particular, each hidden unit activation is penalized in the form $\sum_{j=1}^{S} KL(\rho \,\|\, v_j)$, where $S$ is the total number of hidden units, $v_j$ is the activation of unit $j$, and $\rho$ is a predefined sparsity parameter, typically a small value close to zero (we use 0.05 in our model). So the overall cost of the sparse RBM used in our model is:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j + \beta \sum_{j=1}^{S} KL(\rho \,\|\, v_j) + \lambda \|W\|, \qquad (7)$$

where $\|W\|$ is the regularizer and both $\beta$ and $\lambda$ are hyper-parameters. (We tried different settings for both $\beta$ and $\lambda$ and found our model is not very sensitive to these parameters; we fixed $\beta$ to 0.1 and $\lambda$ to 0.0001 for all the experiments.)

The advantage of the RBM is that it produces an expressive representation of the input risk factors. Each hidden unit in an RBM is able to encode at least one high-order interaction among the input variables. Given a specific number of latent reasons in the input, the RBM requires fewer hidden units to represent the problem complexity. Under this scenario, RFs can be analyzed by an RBM model with the efficient learning algorithm CD. In this paper, we use the RBM for unsupervised greedy layer-wise pre-training. Specifically, each sample describes a state of the visible units in the model. The goal of learning is to minimize the overall energy so that the data distribution can be better captured by this single-layer approach.
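For concreteness, below is a minimal sketch of one CD-1 update for a binary RBM with the sparsity penalty of Eq. (7). This is our illustration under stated assumptions: the bias-based sparsity push and the L1-style weight shrinkage are common heuristics standing in for the exact gradients of the KL penalty and the $\|W\|$ regularizer, and the hyper-parameter values follow the ones quoted in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, a, b, lr=0.05, rho=0.05, beta=0.1, lam=1e-4):
    # Positive phase: infer hidden activations from the clamped data.
    ph0 = sigmoid(v0 @ W + a)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs round (z = 1) yields a reconstruction.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + a)
    n = v0.shape[0]
    # Approximate log-likelihood gradient, plus an L1-style shrinkage
    # standing in for the ||W|| regularizer (assumed form).
    dW = (v0.T @ ph0 - pv1.T @ ph1) / n - lam * np.sign(W)
    # Bias-based heuristic pushing mean activations toward rho,
    # in place of the exact KL(rho || v_j) gradient.
    da = (ph0 - ph1).mean(0) + beta * (rho - ph0.mean(0))
    db = (v0 - pv1).mean(0)
    return W + lr * dW, a + lr * da, b + lr * db

D, F = 672, 100                       # visible (RFs) and hidden sizes
W = 0.01 * rng.normal(size=(D, F))
a, b = np.zeros(F), np.zeros(D)
batch = (rng.random((20, D)) < 0.3).astype(float)  # fake mini-batch of 20
W, a, b = cd1_step(batch, W, a, b)
```

The hidden size of 100 is an arbitrary placeholder; only the visible size (672 RFs) and the batch size of 20 come from the paper.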
2.3 Multi-Layer Learning for Integrated Features

The new representations learned by a shallow RBM (a one-layer RBM) can model some directed hidden causalities behind the RFs. But there are more abstract reasons behind them (i.e., the reasons of the reasons). To sufficiently model reasons at different levels of abstraction, we can stack more layers onto the shallow RBM to form a deep graphical model, namely a Deep Belief Net (DBN). A DBN is a probabilistic generative model that is composed of multiple layers of stochastic, latent variables [13]. The latent variables typically have binary values and are often called hidden units or feature detectors. The top two layers form an RBM which can be viewed as an associative memory. The lower layers form a multi-layer perceptron (MLP) [26] which receives top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector.

[Figure 4: Two-layer deep belief network including one visible layer V and two hidden layers h1 and h2, in which the top two layers form an RBM and the bottom layer forms a multi-layer perceptron.]

There is an efficient, layer-by-layer procedure for learning the top-down, generative weights that determine how the variables in one layer depend on the variables in the layers above. The bottom-up inference from the observed variables $V$ through the hidden layers $h^k$ ($k = 1, \ldots, l$ when $l \geq 2$) follows a chain rule:

$$p(h^l | h^{l-1}, \ldots, h^1, v) = p(h^l | h^{l-1})\, p(h^{l-1} | h^{l-2}) \cdots p(h^1 | v), \qquad (8)$$

where, if we denote the bias for layer $k$ as $b^k$ and $\sigma$ is the logistic sigmoid function, then for $m$ units in layer $k$ and $n$ units in layer $k-1$,

$$p(h^k_j = 1 \mid h^{k-1}) = \sigma\Big(b^k_j + \sum_{i=1}^{n} W^k_{ji} h^{k-1}_i\Big). \qquad (9)$$

The top-down inference is a symmetric version of the bottom-up inference, which can be written as

$$p(h^{k-1}_i = 1 \mid h^k) = \sigma\Big(a^{k-1}_i + \sum_{j=1}^{m} W^k_{ij} h^k_j\Big), \qquad (10)$$

where we denote the bias for layer $k-1$ as $a^{k-1}$.

We show a two-layer DBN in Figure 4, in which the pre-training follows a greedy layer-wise training procedure. Specifically, one layer is added on top of the network at each step, and only that top layer is trained as an RBM using the CD strategy [7]. After each RBM has been trained, its weights are clamped, a new layer is added, and the above procedure is repeated. After pre-training, the values of the latent variables in every layer can be inferred by a single bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction. The top layer of the DBN forms a compressed manifold of the input data, in which each unit has a distinct weighted nonlinear relationship with all of the input factors. This new representation of RFs later serves as the input to several traditional classifiers. To incorporate labeled samples, we add a regression layer on top of the DBN to get classification results, which can be used to update the overall model using back-propagation.

Algorithm 1: DBN training algorithm for risk factors
Input: All risk factors, learning rate ε, Gibbs rounds z, stopping patience d
Output: Model parameters M (W, a, b)
Pre-training Stage:
1: Randomly initialize all W, a, b
2: for t from layer V to h^(l-1)
3:   clamp t and run CD_z to update M_t and t+1
4: end
Fine-tuning Stage:
5: randomly drop out 30% of hidden units for each layer
6: repeat
7:   for each predicted result r
8:     calculate cost c between r and ground truth g1
9:     calculate partial gradient of c with respect to M
10:    update M
11:    calculate cost c' on held-out set
12:    if c' is larger than the previous c' for d rounds
13:      break
14:    end
15:  end
16: end
17: do the fine-tuning stage again with ground truth g2

The training procedure using the two label sources is shown in Algorithm 1, in which lines 2 to 4 reflect a layer-wise Contrastive Divergence (CD) learning procedure, where z is a predetermined hyper-parameter that controls how many Gibbs rounds are completed for each sampling and t+1 is the state of the upper layer. In our experiments, we choose z to be 1. The pre-training phase stops when all layers are exhausted. Lines 5 to 15 show a standard gradient update procedure (fine-tuning). Since we have ground truths g1 and g2 representing different measurements, we implement the second fine-tuning procedure using g2 after the stage using g1. Notice that we first randomly drop out 30% of the hidden units in each layer for the purpose of alleviating the counter-effect between fine-tuning on the two different types of labels g1 and g2. To prevent over-fitting we use early stopping; specifically, the fine-tuning procedure halts when the validation error stops decreasing or starts to increase within 5 mini-batches. The different semantic meanings of g1 and g2 are explicitly stated in Section 2.4.

The main advantage of the DBN is that it tends to produce more expressive and invariant results than a single-layer network while also reducing the size of the representation. This approach yields filter-like representations if we treat the unit weights as filters [9]. We want to filter out the insignificant risk factors and thus find robust and integrated features which are the fusion of both observed risk factors and hidden reasons for predicting bone diseases.
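The control flow of Algorithm 1 can be sketched as follows. This is our paraphrase, not the authors' code: it reuses the `cd1_step` sketch above, and for brevity the fine-tuning stage only updates a logistic regression layer added on top of the DBN features, whereas Algorithm 1 back-propagates through all layers.

```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def pretrain(data, layer_sizes, epochs=10):
    """Lines 2-4: train each layer as an RBM on the layer below."""
    params, x = [], data
    rng = np.random.default_rng(0)
    for n_hidden in layer_sizes:
        W = 0.01 * rng.normal(size=(x.shape[1], n_hidden))
        a, b = np.zeros(n_hidden), np.zeros(x.shape[1])
        for _ in range(epochs):
            W, a, b = cd1_step(x, W, a, b)   # clamp layer, run CD_1
        params.append((W, a))
        x = sigmoid(x @ W + a)               # propagate activations up
    return params

def features(params, x, drop=0.0, rng=np.random.default_rng(1)):
    """Bottom-up pass; `drop` mimics the 30% dropout of line 5."""
    for W, a in params:
        x = sigmoid(x @ W + a)
        if drop:
            x = x * (rng.random(x.shape) >= drop)
    return x

def fine_tune(params, x, y, x_val, y_val, lr=0.05, patience=5):
    """Lines 5-15: gradient updates with early stopping on a held-out set."""
    f, f_val = features(params, x, drop=0.3), features(params, x_val)
    w = np.zeros(f.shape[1]); best, bad = np.inf, 0
    while bad < patience:
        w += lr * f.T @ (y - sigmoid(f @ w)) / len(y)   # logistic update
        c = np.mean((y_val - sigmoid(f_val @ w)) ** 2)  # validation cost
        best, bad = (c, 0) if c < best else (best, bad + 1)
    return w

# Two-stage fine-tuning: first on g1 (osteoporosis labels, lines 6-15),
# then again on g2 (bone loss rate labels, line 17):
#   params = pretrain(unlabeled_rfs, layer_sizes=[100, 11])
#   w1 = fine_tune(params, x_train, g1, x_val, g1_val)
#   w2 = fine_tune(params, x_train, g2, x_val, g2_val)
```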
2.4 Model Pipeline

Our system pipeline, including two main components, can be described by the flow chart in Figure 5. The first component illustrates the proposed risk factor analysis framework for the integrated salient risk features. In the proposed framework, we first keep the original samples with 672 potential RFs and two types of labels. Then we feed all of them into the risk factor analysis (RFA) module. On the one hand, we train our model using a shallow RBM and a two-layer DBN without using the two types of labels (i.e., without the fine-tuning stage). Such an unsupervised training procedure aims to reduce the freedom of data fitting when the ultimate goal is to predict bone diseases given risk factors. In effect, this is an unsupervised pre-training phase which guides the learning towards basins of optima that support better generalization. Moreover, most real-world healthcare data lack ground truth, and therefore achieving good performance during an unsupervised phase is more influential for the disease prediction. On the other hand, we train both the shallow RBM and the two-layer DBN with a two-stage fine-tuning procedure. In the context of scarce labeled data, both models have also shown promise during this semi-supervised process. Samples in the original data are thereby projected onto a new space with a predetermined dimensionality (we use 11 dimensions, consistent with the number of expert-selected RFs). These integrated low-dimensional risk features can be viewed as a new representation and can also be treated as the input flowing into the second component. In summary, Component 1 acts as a knowledge integration component to generate integrated risk features which maintain the properties of both the observed risk factors and the latent risk factors in the data.

[Figure 5: Pipeline of our proposed method. Component 1 (risk factor analysis) maps the 672 raw RFs through the RFA module into 11 integrated risk features; Component 2 (prediction) uses them in Phase 1 and Phase 2.]

In Figure 5, Component 2 evaluates our new integrated features using a two-step prediction module composed of Phase 1 and Phase 2, respectively. Specifically, Phase 1 aims to predict whether a person tends to develop abnormal bone (osteopenia or osteoporosis) after 10 years, as measured by the BMD value. We regard abnormal bone as Positive and normal bone as Negative. In Algorithm 1, we use g1 to represent this measure, which includes two cases: (1) the person will have osteopenia or osteoporosis 10 years later, and (2) the person's bones have a healthy density after 10 years. Then, for the abnormal cases, Phase 2 is used to predict the annual bone loss rate (high or low) measured by a series of BMD data. For Phase 2, we treat a high bone loss rate as Positive and a low bone loss rate as Negative. In Algorithm 1, we use g2 to represent this measure, which includes two cases: (1) annual bone loss at a high rate for the next 10 years, and (2) annual bone loss at a low rate for the next 10 years. Obviously, the size of the dataset for the second phase shrinks because we discard the negative cases after the first phase. In either of the two phases, the Positive prediction usually attracts the most attention.
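The cascade in Component 2 can be sketched as follows. This is our illustration using the LR and SVM classifiers named in Section 3.3; the data here are synthetic stand-ins for the 11 integrated risk features and the two label types, and all variable names are ours.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# X stands for the 11 integrated risk features produced by the RFA
# module; g1 and g2 are the two label types described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 11))
g1 = rng.integers(0, 2, 200)            # Positive = abnormal bone
g2 = rng.integers(0, 2, 200)            # Positive = high bone loss rate

# Phase 1: osteoporosis prediction on all samples.
phase1 = LogisticRegression(max_iter=1000).fit(X, g1)
positives = phase1.predict(X) == 1

# Phase 2: bone loss rate prediction, only on the Phase-1 positives,
# so the second-phase dataset shrinks as described in the text.
phase2 = SVC().fit(X[positives], g2[positives])
high_loss = phase2.predict(X[positives])
```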
3. EXPERIMENTS

3.1 Data Set

The Study of Osteoporotic Fractures (SOF) is the largest and most comprehensive study of risk factors (RFs) for bone diseases; it includes 9,704 Caucasian women aged 65 years and older. It contains 20 years of prospective data about osteoporosis, bone fractures, breast cancer, and so on. Potential risk factors (RFs) and confounders were classified into 20 categories such as demographics, family history, lifestyle, and medical history [2]. As shown in Figure 6, there are missing values in both the risk factor space and the label space, denoted as empty shapes. A number of potential RFs are grouped and organized at the first and second visits, which include 672 variables scattered over 20 categories, as the input of our model. The remaining visits contain time-series dual-energy X-ray absorptiometry (DXA) scan results on bone mineral density (BMD) variation, which are extracted and processed as the labels for our data set. Based on the WHO standard, a T-score of less than -1 (a T-score of -1 corresponds to a BMD of 0.82, if the reference BMD is 0.942 and the reference standard deviation is 0.122) indicates the osteopenia condition that is the precursor to osteoporosis; this is used as the first type of label. The second type of label is the annual rate of BMD variation. We use at least two BMD values in the data set to calculate the bone loss rate, and we define a high bone loss rate as more than 0.84% bone loss per year [27]. We have shown how to employ this multi-label data and carry out a hierarchical prediction task in Section 2.4 and Algorithm 1. Notice that this is a partially labeled data set, since some patients only came to the first and second visits and never took a DXA scan in the following visits, like the example Patient3 shown in Figure 6.

[Figure 6: Illustration of the missing data in the SOF data set (Patient1 through Patient9704 against RF1 through RF672 and labels L1, L2), shown as empty shapes in both the RFs space and the Labels space.]
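As an illustration of the labeling rules above, the following sketch derives the two label types from a patient's BMD values. The reference BMD and standard deviation are the values quoted in the text; the exact formula the authors used for the annual loss rate is not given, so the one below is an assumption.

```python
# Hypothetical helpers (names are ours) for deriving the two label types.

def t_score(bmd, ref_bmd=0.942, ref_sd=0.122):
    """WHO T-score: standard deviations below the reference BMD."""
    return (bmd - ref_bmd) / ref_sd

def g1_label(bmd_10y):
    """Label type 1 (Phase 1): Positive if abnormal (T-score <= -1)."""
    return 1 if t_score(bmd_10y) <= -1 else 0

def g2_label(bmd_first, bmd_last, years):
    """Label type 2 (Phase 2): Positive if annual loss exceeds 0.84%.
    The per-year percentage-loss formula is our assumption."""
    annual_loss_pct = 100 * (bmd_first - bmd_last) / (bmd_first * years)
    return 1 if annual_loss_pct > 0.84 else 0

print(g1_label(0.80))            # T-score ~ -1.16 -> Positive (abnormal)
print(g2_label(0.90, 0.80, 10))  # ~1.1% loss per year -> Positive (high)
```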
3.2 Evaluation Metric

The error rate on a test dataset is commonly used to evaluate classification performance. Nevertheless, for most skewed medical data sets, the error rate can remain low even when the entire minority class is misclassified as the majority class. Thus, two alternative measurements are used in this paper. First, Receiver Operating Characteristic (ROC) curves are plotted to capture how the number of correctly classified abnormal cases varies with the number of normal cases incorrectly classified as abnormal. Second, since in most medical problems we usually care about the fraction of examples classified as abnormal that are truly abnormal, Precision-Recall (PR) curves are also plotted to show this property. We present the confusion matrix in Table 1 and several derived quality measures in Table 2.

Table 1: Confusion matrix.

                       Actual Positive   Actual Negative
  Predicted Positive   TP                FP
  Predicted Negative   FN                TN

Table 2: Metrics definition.

  True Positive Rate  = TP / (TP + FN)
  False Positive Rate = FP / (FP + TN)
  Precision           = TP / (TP + FP)
  Recall              = TP / (TP + FN)
  Error Rate          = (FP + FN) / (TP + TN + FP + FN)

3.3 Experiment Setup

Since no single classifier can be assumed to perform best, we choose two classical classifiers to validate our risk factor analysis results against the expert opinion. (If we were to train RFA by fine-tuning alone, the model would reduce to a traditional neural network (NN); in our experiments, NN training was more prone to over-fitting and performed no better than a non-linear SVM classifier with an RBF kernel.) Logistic Regression (LR) is widely used among experts to assess clinical RFs and predict fracture risk. The Support Vector Machine (SVM) has also been applied to various real-world problems. RFs are extracted based on the expert opinion [1, 8, 29, 14] and summarized in Table 3. We apply the two basic classifiers mentioned above and choose their parameters by cross-validation for fairness. Notice that this is a supervised learning process, since all samples for this expert-knowledge-based model are labeled.

Table 3: Typical risk factors from the expert opinion.

  Variable              Type     Description
  Age                   Numeric  Between 65 and 84
  Weight                Numeric
  Height                Numeric
  BMI                   Numeric  BMI = weight/height^2
  Parent fall           Boolean  Hip fracture in the patient's mother or father
  Smoke                 Boolean
  Excess alcohol        Boolean  3 or more units of alcohol daily
  Rheumatoid arthritis  Boolean
  Physical activity     Boolean  Use of arms to stand up from a chair
  Physical exercise     Boolean  Takes walks for exercise
  BMD                   Numeric  Normal: T-score > -1; Abnormal: T-score <= -1

For a fair comparison with the classification results using expert knowledge, we fix the number of output dimensions of the RFA module to be equal to the number of expert-selected RFs. Specifically, we fix the number of units in the output layer to 11, where each unit in this layer represents a new integrated feature describing complex relationships among all 672 input factors, rather than a single independent RF like the typical risk factors selected by experts shown in Table 3. For all the experiments involving RFA, the learning rate used to update the weights is fixed to 0.05 (chosen on the validation set) and the number of iterations is set to 10 for efficiency (we observed that the model cost reaches a relatively stable state within 5 to 10 iterations). We use mini-batch gradient updates for the parameters, and the batch size is set to 20. After RFA is trained, we simply feed it the whole data set, obtain the new integrated RFs, and then run the same classification module to get results. Since the sample size is large and highly imbalanced in Phase 1, we evaluate the performance using both ROC and PR curves. However, the number of samples in Phase 2 is small and balanced, so we only evaluate the performance using the classification error rate in this phase.
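A short sketch (our illustration) of how the metrics of Section 3.2 can be computed with scikit-learn, using average precision as the PR-curve AUC; the scores below are synthetic stand-ins for classifier outputs on the integrated features.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)                # 1 = abnormal (Positive)
scores = y_true * 0.5 + rng.random(500) * 0.8   # fake classifier scores

print("ROC AUC:", roc_auc_score(y_true, scores))
print("PR  AUC:", average_precision_score(y_true, scores))

# Error rate from the confusion-matrix entries of Tables 1-2.
y_pred = (scores > 0.6).astype(int)
tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))
tn = np.sum((y_pred == 0) & (y_true == 0))
print("Error rate:", (fp + fn) / (tp + tn + fp + fn))
```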
3.4 Results and Evaluation

3.4.1 Phase 1: Osteoporosis Prediction

The overall results on the SOF data after Phase 1 are shown in Figure 7.

[Figure 7: Performance comparison. (a) ROC and PR curves for expert knowledge and shallow RBM without/with FT. (b) ROC and PR curves for expert knowledge and DBN without/with FT. Axes: True Positive Rate vs. False Positive Rate, and Precision vs. Recall.]

In Section 2.4, we introduced that we first implement the RBM and the two-layer DBN for unsupervised learning, also known as the pre-training stage. Then we add the labeled information into both models for supervised learning, also known as the fine-tuning (FT) stage. Figure 7 shows the results of risk factor analysis using the RBM without/with FT and the DBN without/with FT for the two classifiers, presented using ROC and PR curves as a group. Furthermore, the area under the curve (AUC) of the ROC curve for each classifier (denoted "LR-ROC", "SVM-ROC") and the AUC of the PR curve (denoted "LR-PR", "SVM-PR") are shown in Table 4. The AUC indicates the performance of a classifier: the larger the better (an AUC of 1.0 indicates perfect performance). The classification results using expert knowledge are also shown as the baseline for comparison.

From Figure 7(a), we observe that a shallow RBM without FT gets a sense of how the data are distributed, representing the basic characteristics of the data itself, shown as the LR (RBM) and SVM (RBM) curves in the figure. Although the performance is not always higher than the expert model, shown as the LR (Expert) and SVM (Expert) curves in Figure 7(a), this is a completely unsupervised process that borrows no knowledge from any type of labeled information. Achieving such comparable performance is not easy, since the expert model is trained in a supervised way. But we find from the above experiments that the unsupervised model lacks focus on a specific task, which leads to poorer performance. Further improvements are possible by making fuller use of the two types of labeled data through the two-stage fine-tuning, which better fits our prediction tasks. We therefore take advantage of the labeled information and move from an unsupervised task to a semi-supervised task on the partially labeled data. Figure 7(a) shows that the two-stage fine-tuning boosts the performance of all classifiers, shown as the LR (RBM with FT) and SVM (RBM with FT) curves together with their AUC values. In particular, the PR AUC of our model significantly outperforms the expert system.

Since the capacity of the RBM model with one hidden layer is usually small, this indicates the need for a more expressive model over the complex data. To satisfy this need, we add a new layer of non-linear perceptron at the bottom of the RBM, which forms a DBN as shown in Figure 4. This newly added layer greatly enlarges the overall model expressiveness. More importantly, the deeper structure is able to extract more abstract reasons. As we expected, using a deeper structure without labeled information yields better performance than the shallow RBM model, shown as the LR (DBN) and SVM (DBN) curves in Figure 7(b). The model further improves after the two-stage fine-tuning, shown as the LR (DBN with FT) and SVM (DBN with FT) curves in Figure 7(b) together with their AUC values. All the numerical AUC results, rounded to 3 significant figures, are listed in Table 4.

Table 4: AUC of ROC and PR curves for the expert knowledge model and our RFA model using four different structures.

  Risk Factors From:       LR-ROC   SVM-ROC   LR-PR   SVM-PR
  Expert knowledge         0.729    0.601     0.458   0.343
  Shallow RBM without FT   0.638    0.591     0.379   0.358
  Shallow RBM with FT      0.795    0.785     0.594   0.581
  DBN without FT           0.662    0.631     0.393   0.386
  DBN with FT              0.878    0.879     0.718   0.720
3.4.2 Phase 2: Bone Loss Rate Prediction

In this section, we show the bone loss rate prediction using the abnormal cases from Phase 1. A high bone loss rate is an important predictor of higher fracture risk. Moreover, it has been reported that the RFs that account for high and low bone loss rates are different [27]. Our integrated risk features are good at detecting this property, since they integrate the characteristics of the data itself and are carefully tuned with the help of the two kinds of labels. We compare the results of the expert-knowledge-based model and our DBN-with-fine-tuning model, which yields the best performance in Phase 1. The classification error rate is defined in Table 2. Since our result is also fine-tuned by the bone loss rate, we can directly feed the 11 new integrated features into Phase 2.

Table 5: Classification error rates for the expert-based model and our model.

             Expert   DBN with FT
  LR-Error   0.3833   0.1066
  SVM-Error  0.3259   0.0936

Table 5 shows that our model achieves high predictive power when predicting the bone loss rate. In this case, the expert model fails because its limited features are not sufficient to describe the bone loss rate, which may interact with many other RFs. This highlights the need for a more complex model to extract the precise attributes from a large number of potential RFs. Moreover, our RFA module takes the whole data set into account, not only keeping all 672 risk factor dimensions but also utilizing the two types of labeled data, that is, normal/abnormal bone and low/high bone loss rate. The fine-tuning effects can also be important for bone loss prediction, even though the amount of labeled data and the number of samples are smaller than in the first prediction task.

4. CONCLUSION AND FUTURE WORK

A challenge for disease risk factor analysis is that the complicated and highly correlated relationships behind these risk factors are difficult to mine. In addition, the lack of complete ground truth for medical data is an inevitable obstacle to developing a state-of-the-art model in the realm of health informatics. Existing approaches neither incorporate the information from the whole data set nor adequately solve the partially labeled problem with limited medical data resources. In this paper, an integrated approach is developed to identify the osteoporotic risk features associated with patients' continuous medical records. By learning a deep graphical model using both a shallow Restricted Boltzmann Machine (RBM) and a Deep Belief Net (DBN) structure, together with a two-stage fine-tuning (FT), the predictive performance and stability for predicting osteoporosis and bone loss improve stepwise, since the model is more expressive and is wisely tuned by the labeled information. The risk factor analysis (RFA) captures the data manifold and identifies an accurate set of lower-dimensional risk features from the original higher-dimensional data. We obtain an integrated set of new features that preserves properties from both the osteoporosis and bone loss rate perspectives, yielding better prediction performance than the expert opinion.

In future work, we plan to extend the problem from risk factor analysis and prediction on bone diseases to other diseases for single-task learning. For multi-task learning, we aim to investigate the use of shared risk factors for entirely different diseases. We will also try to enrich the data from one source to two or more sources, such as combining patients' DXA scan images with their clinical records and questionnaire data to construct a multi-view learning framework.
5. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No. 1218393, No. 1016929, and No. 0101244.

6. REFERENCES

[1] http://www.shef.ac.uk/FRAX/.
[2] http://www.sof.ucsf.edu/interface/.
[3] H. Ackley, E. Hinton, and J. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, pages 147–169, 1985.
[4] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science, 209(1):237–260, 1998.
[5] R. Bender. Introduction to the use of regression models in epidemiology. Methods Mol Biol, 471:179–195, 2009.
[6] D. Black, M. Steinbuch, L. Palermo, P. Dargent-Molina, R. Lindsay, M. Hoseyni, and O. Johnell. An assessment tool for predicting fracture risk in postmenopausal women. Osteoporosis International, 12(7):519–528, 2001.
[7] M. A. Carreira-Perpinan and G. E. Hinton. On contrastive divergence learning, 2005.
[8] S. R. Cummings, M. C. Nevitt, W. S. Browner, K. Stone, K. M. Fox, K. E. Ensrud, J. Cauley, D. Black, and T. M. Vogt. Risk factors for hip fracture in white women. Study of Osteoporotic Fractures Research Group, 332:767–773, 1995.
[9] D. Erhan, A. Courville, and Y. Bengio. Understanding representations learned in deep architectures. Technical Report 1355, Université de Montréal/DIRO, 2010.
[10] L. Getoor, J. T. Rhee, D. Koller, and P. Small. Understanding tuberculosis epidemiology using structured statistical models. Artificial Intelligence in Medicine, 30(3):233–256, 2004.
[11] S. H. Ha. Medical domain knowledge and associative classification rules in diagnosis. International Journal of Knowledge Discovery in Bioinformatics (IJKDB), 2(1):60–73, 2011.
[12] S. F. Hafstein. An algorithm for constructing Lyapunov functions, 2007.
[13] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 2006.
[14] S. L. Hui, C. W. Slemenda, and C. C. Johnston. Age and bone mass as predictors of fracture in a prospective study. The Journal of Clinical Investigation, 81:1804–1809, 1988.
[15] J. Kanis, A. Oden, H. Johansson, and E. McCloskey. Pitfalls in the external validation of FRAX. Osteoporosis International, pages 1–9, 2012.
[16] J. Kanis, A. Odén, O. Johnell, H. Johansson, C. De Laet, J. Brown, P. Burckhardt, C. Cooper, C. Christiansen, S. Cummings, et al. The use of clinical risk factors enhances the performance of BMD in the prediction of hip and osteoporotic fractures in men and women. Osteoporosis International, 18(8):1033–1046, 2007.
[17] S. Kullback. On information and sufficiency, 1951.
[18] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20, pages 873–880. NIPS Foundation, 2008.
[19] G. Lemineur, R. Harba, N. Kilic, O. Ucan, O. Osman, and L. Benhamou. Efficient estimation of osteoporosis using artificial neural networks. In Industrial Electronics Society, 2007 (IECON 2007), 33rd Annual Conference of the IEEE, pages 3039–3044. IEEE, 2007.
The use of clinical risk factors enhances the performance of bmd in the prediction of hip and osteoporotic fractures in men and women. Osteoporosis international, 18(8):1033–1046, 2007. [17] S. Kullback. On information and sufficiency, 1951. [18] H. Lee, C. Ekanadham, and A. Y. Ng. Sparse deep belief net model for visual area V2. In Advances in Neural Information Processing Systems 20, pages 873–880. Nips Foundation, 2008. [19] G. Lemineur, R. Harba, N. Kilic, O. Ucan, O. Osman, and L. Benhamou. Efficient estimation of osteoporosis 50 [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] using artificial neural networks. In Industrial Electronics Society, 2007. IECON 2007. 33rd Annual Conference of the IEEE, pages 3039–3044. IEEE, 2007. H. Li, C. Buyea, X. Li, M. Ramanathan, L. Bone, and A. Zhang. 3d bone microarchitecture modeling and fracture risk prediction. In Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine, pages 361–368. ACM, 2012. J. Li, J. Shi, and D. Satz. Modeling and analysis of disease and risk factors through learning bayesian networks from observational data. Quality and Reliability Engineering International, 24(3):291–302, 2008. W. Moudani, A. Shahin, F. Chakik, and D. Rajab. Intelligent decision support system for osteoporosis prediction. International Journal of Intelligent Information Technologies (IJIIT), 8(1):26–45, 2012. C. Ordonez and K. Zhao. Evaluating association rules and decision trees to predict multiple target attributes. Intelligent Data Analysis, 15(2):173–192, 2011. B. J. Riis. The role of bone loss. The American journal of medicine, 98(2):29S–32S, 1995. J. Robbins, A. Schott, P. Garnero, P. Delmas, D. Hans, and P. Meunier. Risk factors for hip fracture in women with high bmd: Epidos study. Osteoporosis international, 16(2):149–154, 2005. F. Rosenblatt. Principles of neurodynamics; perceptrons and the theory of brain mechanisms. Spartan Books, Washington, 1962. J. Sirola, A.-K. Koistinen, K. Salovaara, T. Rikkonen, M. Tuppurainen, J. S. Jurvelin, R. Honkanen, E. Alhava, and H. Kröger. Bone loss rate may interact with other risk factors for fractures among elderly women: A 15-year population-based study. Journal of osteoporosis, 2010, 2010. A. J. Storkey and R. Valabregue. The basins of attraction of a new hopfield learning rule. Neural Networks, 12(6):869–876, 1999. Taylor, B.C., Schreiner, P.J., Stone, K.L., Fink, H.A., Cummings, S.R., Nevitt, M.C., Bowman, P.J., and Ensrud, K.E. Long-term prediction of incident hip fracture risk in elderly white women: study of osteoporotic fractures. J Am Geriatr Soc Int, 52:1479–1486, 2004. W. Wang, G. Richards, and S. Rea. Hybrid data mining ensemble for predicting osteoporosis risk. In Engineering in Medicine and Biology Society, 2005. IEEE-EMBS 2005. 27th Annual International Conference of the, pages 886–889. IEEE, 2006. WHO Scientific Group. Prevention and management of osteoporosis. Who technical report series, world health organization, Geneva, 2003. World Health Organization. WHO scientific group on the assessment of osteoporosis at primary health care level. Summary meeting report, Brussels, Belgium, May 5-7 2004. ACM-BCB 2013 51