2013 IEEE International Conference on Bioinformatics and Biomedicine

A Generative Framework for Prediction and Informative Risk Factor Selection of Bone Diseases

Hui Li∗, Xiaoyi Li∗, Yuan Zhang†, Murali Ramanathan‡, Aidong Zhang∗
∗ Department of Computer Science and Engineering, State University of New York at Buffalo, USA {hli24,xiaoyili,azhang}@buffalo.edu
† College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing, China {zhangyuan}@emails.bjut.edu.cn
‡ Department of Pharmaceutical Sciences, State University of New York at Buffalo, USA {murali}@buffalo.edu

Abstract—With the rapid development of the healthcare industry, overwhelming amounts of electronic health records (EHRs) have been documented and shared by healthcare institutions and practitioners. It is important to take advantage of EHR data to develop an effective disease risk management model that not only predicts the progression of the disease but also provides a candidate list of informative risk factors (RFs) that can help prevent it. Although EHRs are valuable sources of comprehensive patient information, it is difficult to pinpoint the underlying causes of a disease in order to assess a patient's risk of developing it. Because EHR data are highly entangled, it is also challenging to discriminate between patients with and without the disease for the purpose of selecting the RFs that cause it. To tackle these challenges, we propose a disease memory (DM) framework that extracts integrated features by modeling the relationships among RFs and, more importantly, between RFs and the target disease, by establishing a deep graphical model with two types of labels. The variants of DM can model the characteristics of patients with and without the disease, respectively, by training deep networks with different samples. Experiments on a real bone disease data set show that the proposed framework can successfully predict the bone disease and select informative RFs that are beneficial and useful for clinical decision support. Most of the selected RFs are validated by the medical literature, and some new RFs should attract interest in medical research. The stable and promising performance on the evaluation metrics confirms the effectiveness of our model.

I. INTRODUCTION

The Electronic Health Record (EHR) is a longitudinal electronic record of patient health information, including diverse information such as demographics, medications, past medical history, laboratory data, and lifestyle. EHRs are valuable sources for exploratory analysis and statistics to assist clinical decision-making and further medical research. Researchers have been converting EHR data into risk factors (RFs) for disease risk analysis, which includes two crucial tasks: disease risk prediction and informative risk factor (RF) selection. With the success of both tasks, patients can avoid unnecessary tests, reduce the cost of public health care, and change their modifiable RFs for disease control or prevention. Usually, numerous potential RFs need to be considered simultaneously, since the observed and hidden reasons behind all RFs are worth learning for the exploration of disease progression.
However, it is an extremely challenging task to capture the disease characteristics and clinical nuances for predicting disease progression and detecting the informative risk factors (RFs), due to the complexity and diversity of EHR data. The difficulties manifest in several aspects. First, it is hard to find a good RF representation so that salient integrated features can be disentangled from heterogeneous information. Second, it is difficult to discriminate the different roles of independent features for both healthy and diseased patients.

Osteoporosis and bone fractures are common bone diseases associated with aging; they may be clinically silent but can cause significant mortality and morbidity after onset. Over the past few decades, osteoporosis has been recognized as a common bone disease that affects more than 75 million people in the United States, Europe and Japan, and it causes more than 8.9 million fractures annually worldwide [1]. It is reported that 20-25% of people with a hip fracture are unable to return to independent living and 12-20% die within one year. Although the diagnosis of osteoporosis is usually based on the assessment of bone mineral density (BMD) using dual energy X-ray absorptiometry (DXA), the World Health Organization (WHO) embarked on a project to integrate information on RFs to better predict the risk of bone disease in men and women worldwide [2].

In this paper, we propose a novel approach for the study of bone diseases in two aspects: bone disease prediction and disease RF selection according to significance. Existing models usually fall into two categories: expert knowledge based models and handcrafted feature set based models. The former relies mainly on a small number of well-known RFs that have been validated by experts in the field, as in [3]. However, the information based on expert knowledge is limited, so some important features might be discarded, which affects the predictive performance. The latter tries to find informative RFs by calculating their statistical significance and then measuring their predictive power. The assessment of the relationship between a disease and a handcrafted RF is based on regression models [4], Artificial Neural Networks (ANNs) [5], or association rules and decision trees [6]. Although these models are theoretically acceptable for analyzing the risk dependence of several variables, they pay little attention to the relationships among RFs and between RFs and the target disease. Furthermore, they usually select statistically significant features from an expert-supported candidate list, which means useful information may still be lost if the list is not comprehensive. Recently, mining the causal relationship between RFs and a specific disease has attracted considerable research attention. In [7], a limited set of RFs is used to construct a Bayesian network, and the RFs are assumed to be conditionally independent of one another. However, learning Bayesian networks becomes hard and even intractable as the number of RFs increases.

Fig. 1: Overview of our framework for bone health.
II. PROBLEM DEFINITION

In this section, we define our problem by showing a pipeline for the whole framework. Generally speaking, our proposed system is a two-task framework, as shown in Fig. 1. The upper component of Fig. 1 shows the roadmap for the first task: bone disease prediction based on integrated RFs. The bottom component of Fig. 1 shows the roadmap for the second task: informative RF selection. Given patients' information, our system can not only predict the risk of osteoporosis and bone fractures, but also rank the informative RFs and explain the semantics of each RF. The description of each component is given as follows.

Task 1 – The Bone Disease Prediction Component. In this component, we feed the original data set to the comprehensive disease memory (CDM). The training of CDM includes two procedures: pre-training and fine-tuning. In the pre-training step, we train CDM in an unsupervised fashion. This pre-training procedure aims at capturing the characteristics among all RFs, with the ultimate goal of guiding the learning towards basins of optima that support better generalization. In the fine-tuning step, we take advantage of two types of labeled information (osteoporosis and bone loss rate) in order to focus on these two prediction tasks. We use a greedy layer-wise learning algorithm to train a two-layer Deep Belief Network (DBN), which is the structure of CDM. Besides, all RFs in the original data are projected onto a new space of lower dimensionality by restricting the number of units in the output layer of the DBN. Therefore, the integrated risk features extracted by the CDM module from the original data set are a lower-dimensional representation of the original higher-dimensional RFs, which will be examined by a two-phase prediction module. The prediction module is composed of two classifiers, Logistic Regression (LR) and Support Vector Machine (SVM). In Phase 1, we predict the risk of osteoporosis for all test samples. We regard osteoporotic bone as the positive output and normal bone as the negative output, because osteoporotic patients tend to have more severe bone fractures. In Phase 2, we further predict the risk of bone loss rate for all positive samples from Phase 1. A high bone loss rate, as the positive output, indicates a higher possibility of bone fractures.

Task 2 – The Informative RF Selection Component. Since we are not able to explain the semantics of the integrated RFs extracted by the first component, we need to select the meaningful and significant RFs from all candidates in the second component. Instead of using all samples in the training procedure, we first split the original data set into two parts: diseased samples and non-diseased samples. In the training procedure, we separately train the bone disease memory (BDM) using diseased samples and the non-disease memory (NDM) using non-diseased samples, shown as dashed arrows in the bottom component of Fig. 1. Once the training session is complete, both memories are used to reconstruct data from the contrasting group of samples. A two-layer DBN, which is the structure of NDM and BDM, is able to reconstruct samples, but it yields large reconstruction errors if we use BDM to reconstruct non-diseased samples because of the mismatch between the input data and the memory module. These contrasts are valuable information for explaining why a non-diseased person may develop the disease. Similarly, the differences are obvious when reconstructing diseased samples using NDM. All RFs cumulatively contribute to the reconstruction errors. Our ultimate goal is to find the top-N individual RFs that contribute most to the reconstruction errors. The top-N selected RFs form a candidate informative RF list that will be validated using medical knowledge such as medical reports from the WHO and the National Osteoporosis Foundation (NOF), as well as biomedical literature from PubMed.
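To make the two-phase prediction pipeline of Task 1 concrete, the Python sketch below wires integrated risk features into the Phase 1 and Phase 2 classifiers. It is a minimal illustration, not the authors' implementation: the feature matrices and labels are random placeholders, and scikit-learn's LogisticRegression and SVC are used as stand-ins for the LR and SVM classifiers.

```python
# Minimal sketch of the two-phase prediction module of Task 1.
# Assumes `X_integrated` holds the low-dimensional integrated risk features
# produced by CDM; all names and data here are illustrative placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_integrated = rng.random((1000, 11))          # 11 integrated risk features per patient
y_osteo = rng.integers(0, 2, size=1000)        # Phase 1 label: osteoporosis (1) vs. normal (0)
y_rate = rng.integers(0, 2, size=1000)         # Phase 2 label: high (1) vs. low (0) bone loss rate

# Phase 1: predict osteoporosis for all samples
phase1 = LogisticRegression(max_iter=1000).fit(X_integrated, y_osteo)
positive = phase1.predict(X_integrated) == 1   # osteoporotic bone is the positive output

# Phase 2: predict bone loss rate only for the positive samples from Phase 1
phase2 = SVC(kernel="rbf").fit(X_integrated[positive], y_rate[positive])
high_loss = phase2.predict(X_integrated[positive])
print("patients flagged with high bone loss rate:", int(high_loss.sum()))
```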
III. METHODOLOGY

In this section, we first introduce the single-layer and multi-layer learning approaches that are preliminaries to our proposed method. Then we propose our model, focusing on prediction and informative RF selection for bone diseases.

A. Single-Layer Learning for the Latent Reasons

To obtain a good RF representation of the latent reasons in the data, we propose to use the Restricted Boltzmann Machine (RBM) [8]. An RBM is a generative stochastic graphical model that can learn a probability distribution over its set of inputs, with the restriction that its visible units and hidden units must form a fully connected bipartite graph. Specifically, it has a single layer of hidden units that are not connected to each other and have undirected, symmetrical connections to a layer of visible units. We show a shallow RBM in Fig. 2(a). The model defines the following energy function $E : \{0,1\}^{D+F} \rightarrow \mathbb{R}$:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j, \qquad (1)$$

where $\theta = \{a, b, W\}$ are the model parameters, and $D$ and $F$ are the numbers of visible and hidden units. The joint distribution over the visible and hidden units is defined by:

$$P(v, h; \theta) = \frac{1}{Z(\theta)} \exp(-E(v, h; \theta)), \qquad (2)$$

where $Z(\theta)$ is the partition function that plays the role of a normalizing constant for the energy function. Exact maximum likelihood learning is intractable in RBMs. In practice, efficient learning is performed using Contrastive Divergence (CD) [9]. In particular, each hidden unit activation is penalized in the form $\sum_{j=1}^{F} KL(\rho \,\|\, h_j)$, where $F$ is the total number of hidden units, $h_j$ is the activation of unit $j$, and $\rho$ is a predefined sparsity parameter, typically a small value close to zero (we use 0.05 in our model). So the overall cost of the sparse RBM used in our model is:

$$E(v, h; \theta) = -\sum_{i=1}^{D}\sum_{j=1}^{F} v_i W_{ij} h_j - \sum_{i=1}^{D} b_i v_i - \sum_{j=1}^{F} a_j h_j + \beta \sum_{j=1}^{F} KL(\rho \,\|\, h_j) + \lambda \|W\|, \qquad (3)$$

where $\|W\|$ is a weight regularizer, and $\beta$ and $\lambda$ are hyper-parameters (we tried different settings for both $\beta$ and $\lambda$, found our model is not very sensitive to them, and fixed $\beta$ to 0.1 and $\lambda$ to 0.0001).

Fig. 2: (a) A shallow Restricted Boltzmann Machine, which contains a layer of visible units v that represent the data and a layer of hidden units h that learn to represent features capturing higher-order correlations in the data. The two layers are connected by a matrix of symmetrically weighted connections, W, and there are no connections within a layer. (b) A 2-layer DBN in which the top two layers form an RBM and the bottom layer forms a multi-layer perceptron. It contains a layer of visible units v and two layers of hidden units h1 and h2.

The advantage of the RBM is that it investigates an expressive representation of the input risk factors. Each hidden unit in an RBM is able to encode at least one high-order interaction among the input variables. Given a specific number of latent reasons in the input, the RBM requires fewer hidden units to represent the problem complexity. Under this scenario, RFs can be analyzed by an RBM model with an efficient CD learning algorithm. In this paper, we use the RBM for unsupervised greedy layer-wise pre-training. Specifically, each sample describes a state of the visible units in the model. The goal of learning is to minimize the overall energy so that the data distribution can be better captured by the single-layer model.
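As an illustration of Eqs. (1)-(3), the following Python sketch implements a sparse RBM trained with one step of contrastive divergence (CD-1). It is a simplified sketch under stated assumptions, not the authors' code: the values ρ = 0.05, β = 0.1 and λ = 0.0001 come from the text, while the initialization, learning rate, epoch count, and the simple (ρ − q) form of the sparsity gradient are common approximations.

```python
# Minimal sketch of a sparse RBM trained with CD-1, following Eqs. (1)-(3).
# Only rho, beta, and lambda are taken from the paper; everything else
# (initialization, learning rate, epochs) is an assumption of this sketch.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseRBM:
    def __init__(self, n_visible, n_hidden, rho=0.05, beta=0.1, lam=1e-4, lr=0.01):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)   # visible biases
        self.a = np.zeros(n_hidden)    # hidden biases
        self.rho, self.beta, self.lam, self.lr = rho, beta, lam, lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.a)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b)

    def cd1_update(self, v0):
        # Positive phase: infer hidden activations from the data
        h0 = self.hidden_probs(v0)
        # One Gibbs step (CD-1): sample hidden, reconstruct visible, re-infer hidden
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        # Approximate log-likelihood gradient with weight decay (lambda term)
        dW = (v0.T @ h0 - v1.T @ h1) / n - self.lam * self.W
        db = (v0 - v1).mean(axis=0)
        # Sparsity term pushes the mean hidden activation q towards rho (beta term)
        q = h0.mean(axis=0)
        da = (h0 - h1).mean(axis=0) + self.beta * (self.rho - q)
        self.W += self.lr * dW
        self.b += self.lr * db
        self.a += self.lr * da

# Toy usage: compress 672 binary risk factors into 11 hidden units
X = (rng.random((256, 672)) < 0.2).astype(float)
rbm = SparseRBM(n_visible=672, n_hidden=11)
for epoch in range(10):
    rbm.cd1_update(X)
```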
B. Multi-Layer Learning for Mining Abstractive Reasons

The new representations learned by a shallow RBM (a one-layer RBM) can model some directed hidden causalities behind the RFs. But there are more abstractive reasons behind them (i.e., the reasons of the reasons). To sufficiently model reasons at different levels of abstraction, we can stack more layers onto the shallow RBM to form a deep graphical model, namely a DBN [10]. A DBN is a probabilistic generative model composed of multiple layers of stochastic, latent variables. The latent variables typically have binary values and are often called hidden units. The top two layers form an RBM, which can be viewed as an associative memory. The lower layer forms a multi-layer perceptron (MLP) [11], which receives top-down, directed connections from the layers above. The states of the units in the lowest layer represent a data vector. We show a two-layer DBN in Fig. 2(b), in which the pre-training follows a greedy layer-wise training procedure. Specifically, one layer is added on top of the network at each step, and only that top layer is trained as an RBM using the CD strategy [9]. After each RBM has been trained, its weights are clamped, a new layer is added, and the above procedure is repeated. After pre-training, the values of the latent variables in every layer can be inferred by a single, bottom-up pass that starts with an observed data vector in the bottom layer and uses the generative weights in the reverse direction. The top layer of the DBN forms a compressed manifold of the input data, in which each unit has a distinct weighted non-linear relationship with all of the input factors.

C. Bone Disease Prediction Using CDM

Our goal now is to disentangle the salient integrated features from the complex EHR data for bone disease prediction. We define an integrated RF learning model based on the given data set for two types of bone disease prediction, osteoporosis and bone loss rate, and the DBN introduced in the last section. Our general idea is shown in Fig. 3, where a good RF representation for predicting osteoporosis and bone loss rate is achieved by learning a set of intermediate representations using a DBN structure at the bottom and appending a regression layer (classifiers) on top of it. This learning model can capture the characteristics from both the observed input (bottom-up learning) and the labeled information (top-down learning). The internal model, which memorizes the trained parameters using the whole training data and preserves the information for both normal and abnormal patients, is termed the comprehensive disease memory (CDM). That is, the learned representation model CDM discovers good intermediate representations that can be shared across the two prediction tasks by combining knowledge from both the input layer with the original training data and the output layer with the two types of class labels.

Fig. 3: Bone disease prediction using a two-layer DBN model.
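The greedy layer-wise construction just described can be sketched with off-the-shelf components. The snippet below stacks two scikit-learn BernoulliRBMs and puts a logistic regression layer on top, mirroring Fig. 2(b)/Fig. 3. It is only an approximation of CDM under stated assumptions: scikit-learn's RBM is trained with persistent CD, the first layer's size is illustrative (only the 11-unit output layer matches the paper), and the joint two-stage fine-tuning described next is omitted.

```python
# Minimal sketch of greedy layer-wise pre-training of a two-layer DBN with a
# classifier on top. Data, labels, and the 64-unit first layer are placeholders.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = (rng.random((500, 672)) < 0.2).astype(float)   # 672 candidate RFs
y_osteo = rng.integers(0, 2, size=500)              # first label type: osteoporosis

# Layer 1: train an RBM on the raw risk factors, then freeze it
rbm1 = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
h1 = rbm1.fit_transform(X)

# Layer 2: train the next RBM on the first layer's hidden activations
rbm2 = BernoulliRBM(n_components=11, learning_rate=0.05, n_iter=20, random_state=0)
h2 = rbm2.fit_transform(h1)          # 11 integrated risk features

# Regression layer on top of the pre-trained representation
clf = LogisticRegression(max_iter=1000).fit(h2, y_osteo)
print("training accuracy:", clf.score(h2, y_osteo))
```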
The training procedure for CDM focuses on two specific prediction tasks (osteoporosis and bone loss rate), with all risk factors as the input and the model parameters as the output. It includes a pre-training stage and a fine-tuning stage. In the first stage, the unsupervised pre-training stage, we apply the layer-wise CD learning procedure to put the parameter values in an appropriate range for further supervised training. It guides the learning towards basins of attraction of minima that support better risk factor generalization from the training data set. The result of the pre-training procedure thus establishes an initialization point for the fine-tuning procedure inside a region of parameter space to which the parameters are henceforth restricted. In the second stage, the fine-tuning (FT) stage, we take advantage of the two types of labeled information to train our model in a supervised fashion. In this way, the prediction errors for both prediction tasks are minimized. Specifically, we use the parameters from the pre-training stage to calculate the prediction result for each sample and then back-propagate the errors between the predicted result and the ground truth about osteoporosis from top to bottom, updating the model parameters to a better state. Since we have another type of labeled information, we then repeat the fine-tuning stage by calculating the errors between the predicted result and the other ground truth about bone loss rate. After the two-stage training procedure, our CDM is well trained and can be used to predict bone diseases.
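A minimal sketch of the second, supervised stage is shown below: a shared encoder (standing in for the pre-trained DBN) is fine-tuned by back-propagating errors for the two label types in turn. PyTorch is used here purely for automatic differentiation; the layer sizes, optimizer, alternation schedule, and random initialization (instead of the pre-trained RBM weights) are assumptions of the sketch, not the authors' implementation.

```python
# Minimal sketch of two-stage fine-tuning with two label types
# (osteoporosis and bone loss rate). All data and sizes are placeholders,
# and the encoder would normally start from the pre-trained DBN weights.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(672, 64), nn.Sigmoid(),
                        nn.Linear(64, 11), nn.Sigmoid())
head_osteo = nn.Linear(11, 1)      # osteoporosis vs. normal bone
head_loss_rate = nn.Linear(11, 1)  # high vs. low bone loss rate

params = list(encoder.parameters()) + list(head_osteo.parameters()) + list(head_loss_rate.parameters())
opt = torch.optim.SGD(params, lr=0.01)
bce = nn.BCEWithLogitsLoss()

X = torch.rand(500, 672)
y_osteo = torch.randint(0, 2, (500, 1)).float()
y_rate = torch.randint(0, 2, (500, 1)).float()

for epoch in range(50):
    # Alternate the two fine-tuning passes, one per label type
    for head, y in ((head_osteo, y_osteo), (head_loss_rate, y_rate)):
        opt.zero_grad()
        loss = bce(head(encoder(X)), y)   # errors back-propagate through the shared encoder
        loss.backward()
        opt.step()
```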
D. Informative Risk Factor Selection Using BDM and NDM

In the previous section, we used CDM to model diseased and healthy patients together and establish a comprehensive disease memory that captures the salience of all RFs in a limited number of integrated RFs for predicting osteoporosis and bone loss rate. In this section, we model the diseased patients and healthy patients separately, based on their unique characteristics, and identify the RFs that cause the disease (osteoporosis). We first define a pair of disease memory models with a contrast pattern (diseased patients vs. non-diseased patients). We term the bone disease memory (BDM) model the type of DM model that is trained on the diseased samples, so that it only memorizes the characteristics of those patients who suffer from osteoporosis or have a high bone loss rate. BDM differs from CDM in that it is a disease-targeted model that implies possible latent reasons for those abnormal patients. Given an abnormal sample, our goal is to represent the latent reasons leading to his/her disease. The top block of Fig. 4(a) shows a hierarchical latent structure underlying the observed RFs, which is trained using the abnormal samples. To find informative RFs, we apply this model with the normal samples as the input data and its reconstruction as the output, as illustrated in Fig. 4(a). Note that there are obvious contrasts between the input and output, since data reconstructed by BDM reflects abnormal cases, which is contrary to the input. Under this scenario, the differences between the two sides help us find the informative RFs.

Fig. 4: Informative RF selection based on (a) BDM and (b) NDM.

Similarly, we term the non-disease memory (NDM) model the model that is trained on the non-diseased samples, who have normal bone and a low bone loss rate, and memorizes their attributes. The structure of NDM is similar to that of BDM, as shown in Fig. 4(b), but NDM is a non-disease targeted model that keeps information about normal patients. Contrary to BDM, the top block of NDM memorizes the characteristics of normal patients, since it is trained entirely on the normal samples. It has the same function as BDM for finding the informative RFs. Also, it can serve as a cross-validation for analyzing the informative RFs provided by BDM.

Distance Metrics. To find the informative RFs that cause a normal case to become abnormal, we look inside and track the distance for each column pair (each column is a risk factor) between the original data and the reconstructed data. Note that unreliable information also yields a large distance. To remove the unreliable information and purify the informative RF list, we first examine the validity of BDM. We calculate the distance $d_{dB}^{(k)}$ between the original diseased samples and the data generated by BDM, and we denote the distance for the $k$th RF between the original non-diseased data and the data generated by BDM as $d_{nB}^{(k)}$. The cumulative distance for the $k$th RF is $d_{cB}^{(k)} = |d_{nB}^{(k)} - d_{dB}^{(k)}|$. We use the Root Mean Square Error (RMSE) to calculate both $d_{nB}^{(k)}$ and $d_{dB}^{(k)}$, and the absolute difference for $d_{cB}^{(k)}$. The sum of the distances over all RFs is large, since BDM and the input data follow diverse distributions. NDM has a similar function to BDM. The only difference is that NDM is used to generate samples with the diseased samples as input, and the distance between the reconstructed data and the original data is $d_{dN}^{(k)}$; the validation for NDM is the distance $d_{nN}^{(k)}$ between the original non-diseased samples and the data generated by NDM. The cumulative distance is then $d_{cN}^{(k)} = |d_{dN}^{(k)} - d_{nN}^{(k)}|$. We rank the distances $d_{cB}^{(k)}$ and $d_{cN}^{(k)}$ in descending order, respectively, and then find the top-N informative RFs. Ideally, the candidate informative RFs produced by BDM and NDM are consistent and close to one another, because only the informative RFs cause a larger distance once the unreliable data have been successfully removed.
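The cumulative distance just defined is easy to express in code. The sketch below computes $d_{cB}^{(k)}$ for every risk factor and returns the top-N ranked list; `reconstruct_bdm` is a placeholder for a reconstruction pass through a trained BDM (the NDM side, $d_{cN}^{(k)}$, is symmetric). It is meant as an illustration of the metric, not the authors' implementation.

```python
# Minimal sketch of the per-RF distance metric used for informative RF selection.
# `reconstruct_bdm(X)` is assumed to return BDM's reconstruction of the samples
# in X (e.g., a down-up pass through the trained DBN); it is a placeholder.
import numpy as np

def rmse_per_column(X, X_hat):
    # RMSE between original and reconstructed data, computed per risk factor (column)
    return np.sqrt(((X - X_hat) ** 2).mean(axis=0))

def top_n_informative_rfs(X_disease, X_normal, reconstruct_bdm, rf_names, n=20):
    # d_dB: validity check, diseased samples reconstructed by the disease memory
    d_dB = rmse_per_column(X_disease, reconstruct_bdm(X_disease))
    # d_nB: non-diseased samples pushed through the disease memory
    d_nB = rmse_per_column(X_normal, reconstruct_bdm(X_normal))
    # Cumulative distance d_cB = |d_nB - d_dB|, ranked in descending order
    d_cB = np.abs(d_nB - d_dB)
    order = np.argsort(d_cB)[::-1][:n]
    return [(rf_names[k], float(d_cB[k])) for k in order]
```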
IV. EXPERIMENTS

A. Data Set

The Study of Osteoporotic Fractures (SOF) is the largest and most comprehensive study of risk factors (RFs) for bone diseases; it includes 9704 Caucasian women aged 65 years and older and contains 20 years of prospective data about osteoporosis, bone fractures, breast cancer, and so on. Potential risk factors (RFs) and confounders were classified into 20 categories such as demographics, family history, lifestyle, and medical history [12]. A number of potential RFs are grouped and organized at the first and second visits, which include 672 variables scattered over the 20 categories, and serve as the input to our model. The remaining visits contain time-series dual-energy X-ray absorptiometry (DXA) scan results on the bone mineral density (BMD) measure, which are extracted and processed as the labels for our data set. Based on the WHO standard, a T-score of less than -1 indicates the osteopenia condition that is the precursor to osteoporosis (a T-score of -1 corresponds to a BMD of 0.82 if the reference BMD is 0.942 and the reference standard deviation is 0.122); this is used as the first type of label. The second type of label is the annual rate of BMD variation. We use at least two BMD values in the data set to calculate the bone loss rate and define a high bone loss rate as more than 0.84% bone loss per year [13].

B. Evaluation Metric

The error rate on a test data set is commonly used to evaluate classification performance. Nevertheless, for most skewed medical data sets, the error rate can still be low even when all minority samples are misclassified into the majority class. Thus, two alternative measurements are used in this paper. First, Receiver Operating Characteristic (ROC) curves are plotted to capture how the number of correctly classified abnormal cases varies with the number of normal cases incorrectly classified as abnormal. Since in most medical problems researchers attach great importance to the fraction of examples classified as abnormal that are truly abnormal, Precision-Recall (PR) curves are also plotted to show this property.

Fig. 5: Performance comparison. (a) ROC and PR curves for single-layer learning. (b) ROC and PR curves for multi-layer learning.

C. Experiments and Results for Task 1

RFs are extracted based on expert opinion [3], [14] and summarized using the following variables: age, weight, height, BMI, parent fall, smoking, excess alcohol, rheumatoid arthritis, and physical activity. We apply two basic classifiers, LR and SVM, and choose their parameters by cross-validation for fairness. Note that this is a supervised learning process, since all samples for this expert knowledge based model are labeled. For a fair comparison with the classification results using expert knowledge, we fix the number of output dimensions to be equal to the number of expert-selected RFs. Specifically, we fix the number of units in the output layer to 11, where each unit in this layer represents a new integrated feature describing complex relationships among all 672 input factors, rather than a set of typical RFs selected by experts. Since the sample size is large and highly imbalanced in Phase 1, we evaluate the performance using the area under the curve (AUC) of the ROC and PR curves. AUC indicates the performance of a classifier: the larger the better (an AUC of 1.0 indicates perfect performance). The number of samples in Phase 2 is small and balanced, so we only evaluate the performance using the classification error rate. We present and discuss the experimental results for both phases below.
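The ROC and PR AUC values reported below can be computed as in the following sketch, given held-out labels and a model's probability scores. The scikit-learn functions are standard; the label and score arrays are random placeholders, not output from the paper's models.

```python
# Minimal sketch of computing the AUC of ROC and Precision-Recall curves
# for one classifier; `y_true` and `y_score` are illustrative placeholders.
import numpy as np
from sklearn.metrics import roc_auc_score, precision_recall_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                            # 1 = abnormal, 0 = normal
y_score = np.clip(y_true * 0.6 + rng.random(500) * 0.5, 0, 1)    # fake probability scores

roc_auc = roc_auc_score(y_true, y_score)
precision, recall, _ = precision_recall_curve(y_true, y_score)
pr_auc = auc(recall, precision)
print(f"ROC AUC = {roc_auc:.3f}, PR AUC = {pr_auc:.3f}")
```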
1) Phase 1 – Osteoporosis Prediction: From Fig. 5(a), we observe that a shallow RBM without FT, "LR (RBM)" and "SVM (RBM)", captures a sense of how the data is distributed, which represents the basic characteristics of the data itself. Although the performance is not always higher than that of the expert model, "LR (Expert)" and "SVM (Expert)", this is a completely unsupervised process that does not borrow knowledge from any type of labeled information. Achieving such comparable performance is not easy, since the expert model is trained in a supervised way. But we find that the model lacks focus on a specific task and thus yields poorer performance. Further improvements may be possible through more thorough experiments with a two-stage fine-tuning. So we take advantage of the labeled information and transform the unsupervised task into a semi-supervised task using the partially labeled data. Figure 5(a) shows the classification results after using the two-stage fine-tuning to boost the performance of all classifiers, "LR (RBM with FT)" and "SVM (RBM with FT)". In particular, the PR AUC of our model outperforms the expert system. Since the capacity of an RBM model with one hidden layer is usually small, this indicates the need for a more expressive model over the complex data. To satisfy this need, we add a new layer of non-linear perceptrons at the bottom of the RBM, which forms a DBN as shown in Fig. 2(b). This newly added layer greatly enlarges the overall model expressiveness. More importantly, the deeper structure is able to extract more abstractive reasons. As we expected, using a deeper structure without labeled information, both LR (DBN) and SVM (DBN) yield better performance than the shallow RBM model, as illustrated in Fig. 5(b). The models "LR (DBN with FT)" and "SVM (DBN with FT)" further improve this behavior because of the two-stage fine-tuning. The performance using the DBN model improves at a 32% average rate on the ROC measure and an 80% average rate on the PR measure.

2) Phase 2 – Bone Loss Rate Prediction: In this section, we show the bone loss rate prediction using the abnormal cases after Phase 1. A high bone loss rate is an important predictor of higher fracture risk. Our integrated risk features are good at detecting this property, since they integrate the characteristics of the data itself and are tuned with the help of the two kinds of labels. We compare the results between the expert knowledge based model and our DBN with fine-tuning, which yields the best performance in Phase 1. Since our result is also fine-tuned on the bone loss rate, we can directly feed the 11 new integrated features into Phase 2. Table I shows that our model achieves high predictive power when predicting bone loss rate. In this case, the expert model fails because the limited features are not sufficient to describe the bone loss rate, which may interact with other RFs. This highlights the need for a more complex model to extract the precise attributes from a large number of potential RFs. Moreover, our CDM module takes the whole data set into account, not only keeping the 672 risk factor dimensions but also utilizing the two types of labeled data.

TABLE I: Classification error rate comparison

              Expert    DBN with FT
  LR-Error    0.383     0.107
  SVM-Error   0.326     0.094

D. Experiments and Results for Task 2

In this section, we show the experiments and results on informative RF selection. Based on the proposed method shown in Fig. 1, we present a case study that lists the top 20 informative RFs selected using BDM and NDM in Table II.
Variable descriptions are taken from the data provider [12].

TABLE II: Informative risk factors generated by BDM and NDM

  Category              Variable   Description
  Demographics          AGE        The patient's age at this visit
  Fracture history      IFX14      Vertebral fractures
                        INTX       Intertrochanteric fractures
                        FACEF      Face fracture
                        ANYF       Follow-up time to 1st any fracture since current visit
  History               MHIP80     Mom hip fracture after age 80
  Exam                  DSTBMC     Distal radius bone mass content (gm/cm)
                        PRXBMD     Proximal radius bone mass density (gm/cm2)
  Physical performance  TURNUM     Number of steps in turn
                        STEADY     Steadiness of turn
                        STEPUP     Ability to step up one step
                        STDARM     Does participant use arms to stand up?
                        GAID       Aid used for pace tests (i.e. crutch, cane, walker)
  Exercise              50TMWT     Total number of times of activity/year at age 50
  Life style            DR30       How often did you have 5 or more drinks one day
  Breast cancer         BRSTCA     Breast cancer status
  Blood pressure        LISYS      Systolic blood pressure lying down (mmHg)
                        DIZZY      Dizziness upon standing up
  Vision                CSHAVG     Average contrast sensitivity

In this study, osteoporosis appears to be associated with several known risk factors that are well described in the literature. Based on the universal rule used by FRAX [3], a popular fracture risk assessment tool developed by the WHO, some of the selected RFs have already been used to evaluate the fracture risk of patients, such as age, fracture history, family history, BMD and excess alcohol intake. Some researchers find that not only are the well-known RFs associated with osteoporosis and more falls, but lifestyle-related behavioral and environmental risk factors are also important causes of falls in older women. In Table II, some selected RFs have been well studied, such as DIZZY, GAID, STDARM and 50TMWT [15], [16]. The rest of the RFs should attract medical researchers' interest and call attention to monitoring bone disease progression. In general, it is probably not practical to acquire many features from all participants. So what are the most important questions the physician needs to ask, and how many features are needed to achieve a good predictive performance?
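The question of how many selected RFs are enough is exactly the experiment summarized in Fig. 6 and described next: train a classifier on the top-N informative RFs for increasing N and track the AUC. A minimal sketch of that loop is shown below; `ranked_rfs`, the feature matrix, and the labels are placeholders, and the evaluation protocol (a single train/test split) is an assumption of the sketch, not the paper's setup.

```python
# Minimal sketch of evaluating prediction performance as a function of the
# number of top-ranked informative RFs (cf. Fig. 6). All data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((1000, 672))                    # original risk factors
y = rng.integers(0, 2, size=1000)              # osteoporosis label
ranked_rfs = rng.permutation(672)              # placeholder for the BDM/NDM ranking

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
for n in (5, 10, 20, 30, 40, 50):
    cols = ranked_rfs[:n]                      # keep only the top-n informative RFs
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    score = roc_auc_score(y_te, clf.predict_proba(X_te[:, cols])[:, 1])
    print(f"top-{n:2d} RFs: ROC AUC = {score:.3f}")
```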
Using the proposed approach, we selected the top 50 informative RFs, instead of using all of them, and fed them directly to the Logistic Regression classifier for osteoporosis prediction. Fig. 6 shows that we only need the top 20 informative RFs to improve both the ROC and PR curves. The area under the ROC curve and the precision-recall curve (AUC) for our selected RFs (denoted as Informative RF) is even better than for the RFs selected using expert knowledge (denoted as Expert RF) when the number of selected RFs is fixed at 20. The proposed informative RF selection method exhibits great power in predicting osteoporosis, in that the selected RFs are more significant than the remaining RFs. However, the prediction performance of the top 50 RFs selected by BDM and NDM is always inferior to that of the integrated RFs extracted by CDM (denoted as Integrated RF), because some information is discarded that might still contribute to enhancing the predictive behavior.

Fig. 6: Osteoporosis prediction based on informative RFs. AUC of ROC curve (left) and AUC of Precision-Recall curve (right).

V. CONCLUSIONS

We proposed to tackle the problem of bone disease prediction and informative risk factor (RF) selection by modeling the observed and latent reasons behind risk factors (RFs) using a deep graphical model pre-trained by the CD algorithm. We found an effective way of modeling the comprehensiveness and uniqueness of different samples. First, we combined two types of bone disease labeled information to train our model for the prediction task. Second, we formulated a reconstruction pattern comparison framework to select the informative RFs for bone diseases. Besides, a group of "disease memories" (DMs), including the comprehensive disease memory (CDM), the bone disease memory (BDM) and the non-disease memory (NDM), were defined and applied in our experiments. Our extensive experimental results showed that the proposed method improves the prediction performance and has great potential to select the informative RFs for bone diseases.

VI. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No. 1218393, No. 1016929, and No. 0101244.

REFERENCES

[1] World Health Organization et al., "WHO scientific group on the assessment of osteoporosis at primary health care level," 2004.
[2] WHO Scientific Group on the Prevention and Management of Osteoporosis, Prevention and Management of Osteoporosis: Report of a WHO Scientific Group. WHO, 2003.
[3] http://www.shef.ac.uk/FRAX/
[4] R. Bender, "Introduction to the use of regression models in epidemiology," Methods Mol Biol, vol. 471, pp. 179–195, 2009.
[5] G. Lemineur, R. Harba, N. Kilic, O. Ucan, O. Osman, and L. Benhamou, "Efficient estimation of osteoporosis using artificial neural networks," in Industrial Electronics Society. IEEE, 2007, pp. 3039–3044.
[6] C. Ordonez and K. Zhao, "Evaluating association rules and decision trees to predict multiple target attributes," Intelligent Data Analysis, vol. 15, no. 2, pp. 173–192, 2011.
[7] H. Li, C. Buyea, X. Li, M. Ramanathan, L. Bone, and A. Zhang, "3D bone microarchitecture modeling and fracture risk prediction," in Proceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine. ACM, 2012, pp. 361–368.
[8] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, 2006.
[9] M. A. Carreira-Perpinan and G. E. Hinton, "On contrastive divergence learning," 2005.
[10] G. E. Hinton, "Deep belief networks," Scholarpedia, vol. 4, no. 5, p. 5947, 2009.
[11] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington: Spartan Books, 1962.
[12] http://www.sof.ucsf.edu/interface/
[13] J. Sirola, A.-K. Koistinen, K. Salovaara, T. Rikkonen, M. Tuppurainen, J. S. Jurvelin, R. Honkanen, E. Alhava, and H. Kröger, "Bone loss rate may interact with other risk factors for fractures among elderly women: A 15-year population-based study," Journal of Osteoporosis, vol. 2010, 2010.
[14] S. R. Cummings, M. C. Nevitt, W. S. Browner, K. Stone, K. M. Fox, K. E. Ensrud, J. Cauley, D. Black, and T. M. Vogt, "Risk factors for hip fracture in white women," Study of Osteoporotic Fractures Research Group, vol. 332, pp. 767–773, 1995.
[15] K. A. Faulkner, J. A. Cauley, S. A. Studenski, D. Landsittel, S. Cummings, K. E. Ensrud, M. Donaldson, and M. Nevitt, "Lifestyle predicts falls independent of physical risk factors," Osteoporosis International, vol. 20, no. 12, pp. 2025–2034, 2009.
[16] R. Bensen, J. D. Adachi, A. Papaioannou, G. Ioannidis, W. P. Olszynski, R. J. Sebaldt, T. M. Murray, R. G. Josse, J. P. Brown, D. A. Hanley et al., "Evaluation of easily measured risk factors in the prediction of osteoporotic fractures," BMC Musculoskeletal Disorders, vol. 6, no. 1, p. 47, 2005.