An Ensemble Model for Mobile Device based Arrhythmia Detection

Kang Li, Suxin Guo, Jing Gao and Aidong Zhang
Computer Science and Engineering Department
State University of New York at Buffalo
Buffalo, 14260, USA
{kli22, suxinguo, jing, azhang}@buffalo.edu

ABSTRACT
Recent advances in smart mobile device technology have resulted in global availability of portable computing devices capable of performing many complex functions. With the ultimate intent of promoting human well-being, mobile device based arrhythmia detection (MAD) has recently attracted considerable attention. Without any guidance or supervision from experts, the performance of arrhythmia detection is usually unsatisfactory. Supervised learning can learn from labeled cardiac cycles to detect arrhythmias for each mobile device user if enough training data is provided. However, it is time-consuming, costly and sometimes impossible to have experts annotate enough training data for each user. To tackle this problem, we take advantage of publicly available and well annotated data to infer knowledge which can be treated as experts for MAD. To reduce the space usage of the framework, we extract from each source of labeled data an expert model, which consists of a task-independent individual characteristic vector and a task-related preference vector. Multiple experts are then integrated into an ensemble model for arrhythmia detection. Both the space and time complexities of the proposed approach are theoretically analyzed and experimentally examined. To evaluate the performance of the method, we implement it on the MIT-BIH Arrhythmia Dataset and compare it with seven state-of-the-art methods in the area. Extensive experimental results show that the proposed algorithm outperforms all the baseline methods, which validates its effectiveness in MAD.
Categories and Subject Descriptors
I.5.2 [Pattern Recognition]: Design Methodology - Classifier design and evaluation; J.3 [Computer Applications]: Life and Medical Science - health

General Terms
Algorithms, Theory, Experimentation, Performance

Keywords
ECG, Arrhythmia Detection, Ensemble Model

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
BCB ’13, September 22 - 25, 2013, Washington, DC, USA
Copyright 2013 ACM 978-1-4503-2434-2/13/09 ...$15.00.
ACM-BCB 2013

Figure 1: A Mobile ECG Monitoring System.

1. INTRODUCTION
Arrhythmias, which are irregular rates or rhythms of heartbeats, reveal abnormal heart activities. In serious cases, the heart cannot pump enough blood to the body during an arrhythmia, which can damage the brain and the heart and can even cause sudden cardiac death. Therefore, monitoring heart activities and detecting arrhythmias are of great importance to people’s well-being. Arrhythmia detection can be used to warn of the onset of heart disease, speed up first aid and save lives. The electrocardiogram (ECG), which records cardiac electrical activity over time as a periodic signal, plays a major role in monitoring heart activities and detecting arrhythmias. Each period represents the dynamic pattern of heart activity in one cardiac cycle (heart beat). In the task of arrhythmia detection, cardiac cycles are divided into two classes: normal and abnormal. The former stands for normal states of the heart, while the latter represents abnormal cardiac cycles which are likely to lead to heart damage or even sudden death. Our goal is to detect abnormal cardiac cycles from ECG data in a timely manner.
Recent advances in smart mobile devices enable people to integrate many complex functions into portable computing devices, including smart phones. Advances in smart health sensors further enable the miniaturization of ECG assessment equipment. Due to the importance of arrhythmia detection and the widespread use of smart phones, integrating ECG monitoring and arrhythmia detection functions into mobile devices has aroused great interest. Fig. 1 [13] presents such a cyber-physical system. In the system, ECG signals collected through ECG electrodes are first sent to smart phones. After the signals are processed, the data are sent to a remote data warehouse for long-term data management.

One simple way to infer abnormal cardiac cycles is to detect anomalous patterns in ECG data. This approach is referred to as unsupervised learning, as no annotated data are included and no experts are consulted during the process. Such unsupervised arrhythmia detection methods usually suffer from relatively poor performance due to the lack of supervision. Therefore, supervised learning approaches, which learn from expert-annotated ECG data to predict the abnormality of unlabeled ECG data, should be preferred. However, most of the existing supervised approaches fail to handle the following difficulties in MAD:

• Lack of labeled data: Existing supervised arrhythmia detection methods usually require sufficient labeled data for both normal and abnormal cardiac cycles in order to make accurate predictions on unlabeled data. However, the labeling process requires professional training and consumes a lot of time. Furthermore, the number of normal cardiac cycles is usually much larger than that of abnormal ones, and thus collecting enough labeled data for abnormal heart beats is especially difficult. Therefore, it is nearly impossible to obtain sufficient labeled data for each mobile device user.
• Limited space on mobile devices: Typically, in supervised learning, the more labeled instances are provided in the training set, the better the supervised method is likely to perform. In MAD, we need to store a large number of labeled cardiac cycles if we want to guarantee the effectiveness of the supervised learning approaches. However, due to the limited storage space of mobile devices, it is very difficult to store enough labeled data, which in turn degrades the performance of supervised learning approaches.

• Requirement of high efficiency: To issue timely alerts, arrhythmia detection has to be performed in real time on the streaming data collected from mobile devices. This requires a highly efficient algorithm.

To tackle these challenges, we propose an effective and efficient framework that combines expert knowledge extracted from multiple sources of well-annotated data and applies the knowledge to mobile device users for the task of arrhythmia detection. Specifically, to handle the lack of labeled data, we take advantage of existing publicly available and well annotated ECG sets to guide the classification of cardiac cycles of mobile device users. To reduce the space usage, we extract an expert model from each ECG data set in the form of two vectors: a task-independent individual characteristic vector and a task-related preference vector. The former represents the type of cardiac cycles that the expert is proficient with, and the latter encodes how the expert decides whether a cardiac cycle is normal or abnormal (arrhythmia). An ensemble model is then built upon the extracted expert characteristics and preferences, with higher weights assigned to more helpful experts. Intuitively, if a source ECG data set exhibits patterns similar to those of the target user, we place more trust in the expert constructed from this data set, because it is more likely to make correct decisions on the cardiac cycles of the target user.
Cardiac cycles of the target user are classified according to the weighted integration of the experts’ preferences. Note that in this framework, the expert extraction and expert integration are derived off-line and then applied online to predict on mobile device users’ ECG data. The model can therefore process the collected ECG signals of mobile device users at very high speed.

Although there are many existing methods for detecting arrhythmias using ECG signals, the work proposed in this paper differs significantly from the existing work in both the method and the targeted task. Here we summarize the differences from two major categories of related work.

One relevant topic is distribution matching [13, 2], which has been used to tackle the arrhythmia detection problem on ECG signals. Although these approaches also utilize publicly available and well-labeled data sources to help detect arrhythmia in target users, they adopt quite different assumptions and strategies. They are based on the idea of matching the distributions of source signals and target signals by reweighting the features, in the hope that matching distributions can be found between source and target users. In contrast, the proposed approach does not make any assumptions about the distributions of the source or target signals. Our model is designed based on the well-grounded principle that an expert should be weighted higher if the source data it learns from has characteristics similar to those of the target signals.

Another relevant study is multiple source transfer learning (MSTL) [15, 4, 6], which can transfer knowledge from multiple labeled sources (sets of labeled ECG signals in this task) to unlabeled data in a target domain (the target user’s ECG signals in this task). Note that these approaches are not specifically designed for arrhythmia detection, and thus they fail to address some unique challenges in the proposed problem.
For example, they are unable to handle the class imbalance problem (normal cardiac cycles far outnumber abnormal cardiac cycles) or the space shortage problem. Unlike these approaches, our proposed method is developed to address these challenges and thus is more suitable for the MAD problem.

In summary, the contributions of this paper include:

• We propose a novel model which maps each labeled source ECG set to two vectors while preserving the characteristics and discriminating ability of the expert. With the expert extraction model, space usage can be significantly reduced to meet the requirements of a mobile device based implementation.

• We propose an effective and efficient ensemble model, which integrates decisions from all the experts for the task of MAD. The contribution of each expert to the model is derived from how similar the source and target signals are, so the experts which are closer to the target get higher weights in the combination. We derive an efficient solution to infer the model, and we theoretically analyze the time and space usage of the framework.

• We evaluate the proposed ensemble model on real and publicly available data for the task of arrhythmia detection. We compare the proposed approach with seven state-of-the-art algorithms. Experiments show that the proposed approach outperforms the compared algorithms and is highly efficient in both time and space.

2. METHOD
2.1 Notation and Problem Definition

Table 1: Notation
  $t$                                   number of well labeled ECG sets
  $\{n_i\}_{i=1}^{t}$                   number of cardiac cycles in the $t$ labeled ECG sets
  $n_T$                                 number of cardiac cycles in the target ECG set
  $r$                                   number of features of each cardiac cycle
  $c$                                   number of classes in label matrices
  $\{S^i_{n_i \times r}\}_{i=1}^{t}$    $t$ labeled ECG sets
  $\{Y^i_{n_i \times c}\}_{i=1}^{t}$    $t$ label matrices for the source ECG sets
  $T_{n_T \times r}$                    target unlabeled ECG set
  $Y^T_{n_T \times c}$                  target label matrix

In the paper, scalars and constants are denoted by lower-case letters in $\{a, b, \alpha, \beta, ...\}$.
Matrices and vectors are denoted by upper-case letters in $\{A, B, \Gamma, \Lambda, ...\}$. We use $A_{s \times t}$ to indicate that matrix $A$ contains $s$ rows and $t$ columns. $A_{i,:}$ is the $i$-th row, $A_{:,j}$ is the $j$-th column and $A_{ij}$ is the entry in the $i$-th row and $j$-th column.

The formulation of our model involves several mathematical concepts, which we explicitly introduce here. The trace of a square matrix $U_{s \times s}$ is defined as $Tr(U_{s \times s}) = \sum_{i=1}^{s} U_{ii}$. The squared Frobenius norm of a matrix $A$ is defined as $\|A\|_F^2 = Tr(A^{\top} A) = Tr(A A^{\top})$. $A \circ B$ denotes the Hadamard product of two equal-size matrices $A$ and $B$, by which we have $(A \circ B)_{ij} = A_{ij} \cdot B_{ij}$. Similarly, the Hadamard division is defined as $(A \oslash B)_{ij} = A_{ij} / B_{ij}$. $\mathcal{N}(A \mid B, \sigma^2)$ represents that $A$ follows a Gaussian distribution with mean $B$ and variance $\sigma^2$. $I_{s \times t}$ is an all-ones matrix with $I_{ij} = 1$ for $i \in [1, s]$ and $j \in [1, t]$.

As listed in Table 1, we suppose there are $t$ well labeled ECG signals and different signals are collected from different individuals. Viewing the $t$ source ECG sets as $t$ experts, we denote the $t$ experts as $S^1_{n_1 \times r}, ..., S^t_{n_t \times r}$. $n_i$ for $i \in [1, t]$ is the number of cardiac cycles in the $i$-th expert and $r$ is the number of features for each cardiac cycle. Without loss of generality, we suppose the values in $\{S^i\}_{i=1}^{t}$ have been scaled to be nonnegative. The label matrices for the $t$ experts are $Y^1_{n_1 \times c}, ..., Y^t_{n_t \times c}$, in which $Y^i_{j,m}$ is the probability that the $j$-th cardiac cycle in the $i$-th expert ECG set belongs to the $m$-th class for $i \in [1, t]$, $j \in [1, n_i]$ and $m \in [1, c]$. Without loss of generality, we use the first class $m = 1$ to denote the class of normal cardiac cycles and use $m \in [2, c]$ to denote $c - 1$ types of arrhythmias. For simplicity, in the paper we only consider the two-class classification task, in which we aim to determine if each cardiac cycle is normal or abnormal. We denote the ECG signals of a mobile device user as $T_{n_T \times r}$, and the unknown label matrix for the target as $Y^T_{n_T \times c}$.
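The matrix operations introduced in the notation above map directly onto array operations. The following is a minimal NumPy sketch, not part of the original paper, that checks the trace identity for the squared Frobenius norm and shows the Hadamard product and division; the example matrices and the helper name `squared_frobenius` are illustrative assumptions.

```python
import numpy as np

def squared_frobenius(A):
    """Squared Frobenius norm computed through the trace, Tr(A^T A)."""
    return float(np.trace(A.T @ A))

# Two equal-size example matrices (arbitrary illustrative values).
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
B = np.array([[2.0, 4.0], [1.0, 2.0], [5.0, 3.0]])

hadamard_product = A * B     # (A o B)_ij = A_ij * B_ij
hadamard_division = A / B    # element-wise division A_ij / B_ij
ones_column = np.ones((3, 1))  # the all-ones matrix I_{3x1}

if __name__ == "__main__":
    print("Tr(A^T A) =", squared_frobenius(A))
    print("Tr(A A^T) =", float(np.trace(A @ A.T)))
```

Element-wise `*` and `/` are used on purpose: in NumPy, `@` is the ordinary matrix product, which is a different operation from the Hadamard product.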
The problem of arrhythmia detection can be expressed as: given a target ECG signal $T_{n_T \times r}$, $t$ source ECG sets $\{S^i_{n_i \times r}\}_{i=1}^{t}$ and the corresponding label sets $\{Y^i_{n_i \times c}\}_{i=1}^{t}$, learn a mapping $F: T \rightarrow Y^T$.

2.2 Expert Extraction
Generally, there are two drawbacks to directly utilizing the $t$ source ECG sets for learning on the target ECG set. On the one hand, each source ECG set usually contains over a thousand cardiac cycles. When $t$ is large, the source ECG sets take up a huge amount of space and cause high energy usage on mobile devices. On the other hand, because ECG signals are collected using different equipment and under different conditions, there is noise in the ECG sets, which reduces the discriminating ability of the source ECG sets.

To solve the above-mentioned problems, in this section we propose a novel expert extraction method to map each source ECG set to two vectors: a task-independent expert characteristic vector and a task-related expert preference vector. The former (characteristics) is the consensus shared by the cardiac cycles from the same expert (source ECG set), and captures the type of cardiac cycles that the expert is proficient with. The latter measures how the expert decides whether a cardiac cycle is normal or abnormal (arrhythmia). The two extracted vectors can thus capture both the characteristics and the discriminating ability of each expert while significantly reducing space usage on mobile devices.

Without loss of generality, we assume that each cardiac cycle can be expressed as:

$$S^i_{j,:} = X^i_{1 \times f} B^{\top}_{f \times r} + \Lambda^i_{j,:} + \epsilon_S. \tag{1}$$

In Eq. 1, $\epsilon_S$ is the noise term, $X^i$ is the task-independent characteristics of the $i$-th expert, $f$ is the number of features in the characteristics, $B$ is the mapping matrix between the original feature space of $S^i$ and the characteristic feature space of $X^i$, and $\Lambda^i_{j,:}$ is the unique property of the $j$-th cardiac cycle in the $i$-th expert, which determines the label of the cardiac cycle.

For each person, normal cardiac cycles usually far outnumber arrhythmias. This can be observed even in patients with severe heart diseases. Mathematically, this fact means that the feature vectors $S^i$ of cardiac cycles are located around a mean $X^i B^{\top}$ and have a low probability of deviating strongly from it with a large $\Lambda^i_{j,:}$. Such knowledge can be included in the model through a prior distribution on $\Lambda^i$. Specifically, we place a row-wise independent spherical Gaussian prior distribution on $\Lambda^i$ as:

$$p(\Lambda^i) \propto \exp\left(-\frac{Tr\left(\Lambda^i C_{\Lambda^i}^{-1} \Lambda^{i\top}\right)}{2}\right) = \exp\left(-\sum_{k=1}^{r} \frac{\|\Lambda^i_{:,k}\|^2}{2 C_{\Lambda^i_k}}\right). \tag{2}$$

In Eq. 2, $C_{\Lambda^i}$ is a diagonal matrix with diagonal entries $C_{\Lambda^i} = diag\{C_{\Lambda^i_1}, ..., C_{\Lambda^i_r}\}$.

The conditional distributions over the existing expert ECG sets $\{S^i\}_{i=1}^{t}$ are:

$$p(S^i \mid X^i, B) = \prod_{j=1}^{n_i} \prod_{k=1}^{r} \mathcal{N}\left(S^i_{jk} \mid (X^i B^{\top})_{jk}, \sigma_S^2\right),$$

in which $\sigma_S^2$ is introduced by the noise term $\epsilon_S$ and the unique property of each cardiac cycle $\Lambda^i_{j,k}$.

Similarly, for the label matrix $Y^i$, we have:

$$Y^i_{j,:} = \Lambda^i_{j,:} W^i_{r \times c} + \epsilon_Y,$$

in which $\epsilon_Y$ is the noise term, and $W^i$ is the proposed task-related preference vector of the $i$-th expert. $W^i$ measures how the $i$-th expert determines whether a cardiac cycle $S^i_{j,:}$ is normal or abnormal (arrhythmia). The conditional distributions over the observed label matrices $\{Y^i\}_{i=1}^{t}$ are:

$$p(Y^i \mid \Lambda^i, W^i) = \prod_{j=1}^{n_i} \prod_{m=1}^{c} \mathcal{N}\left(Y^i_{jm} \mid (\Lambda^i W^i)_{jm}, \sigma_Y^2\right),$$

in which $\sigma_Y^2$ is introduced by the noise $\epsilon_Y$.
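To make the generative assumptions above concrete, the following sketch draws one synthetic source set by following Eq. 1, the Gaussian prior of Eq. 2 and the label model. It is an illustration only and is not taken from the paper: the function name `simulate_expert`, the dimensions, the noise levels and the variance values are all assumptions, and the simulation ignores the nonnegative scaling that the paper applies to real data.

```python
import numpy as np

def simulate_expert(n=200, r=39, f=5, c=2, sigma_s=0.05, sigma_y=0.05, seed=0):
    """Draw one synthetic expert (S, Y) from the assumed generative model."""
    rng = np.random.default_rng(seed)
    X = rng.random((1, f))        # task-independent characteristic vector X^i
    B = rng.random((r, f))        # mapping matrix B
    W = rng.random((r, c))        # task-related preference W^i
    # Diagonal of C_Lambda: one variance per feature (hypothetical values).
    C = 0.1 + rng.random(r)
    # Eq. 2: each column k of Lambda is zero-mean Gaussian with variance C[k].
    Lam = rng.normal(0.0, np.sqrt(C), size=(n, r))
    ones = np.ones((n, 1))
    # Eq. 1: shared mean X B^T, plus per-cycle deviation, plus noise.
    S = ones @ X @ B.T + Lam + rng.normal(0.0, sigma_s, size=(n, r))
    # Label model: labels depend only on the per-cycle deviation Lambda.
    Y = Lam @ W + rng.normal(0.0, sigma_y, size=(n, c))
    return S, Y, X, B, W, Lam

if __name__ == "__main__":
    S, Y, X, B, W, Lam = simulate_expert()
    print(S.shape, Y.shape)
```

The point of the sketch is the split of roles: $X^i B^{\top}$ carries what all cycles of one individual share, while $\Lambda^i$ carries the per-cycle deviation that the preference $W^i$ turns into a label.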
By Bayesian inference, to obtain the optimal values for $\{X^i\}_{i=1}^{t}$, $B$ and $\{W^i\}_{i=1}^{t}$, we need to maximize the log posterior distribution:

$$J_p = \max_{X,B,W} \log p(X, B, W \mid S, Y) = \max_{X,B,W} \log \prod_{i=1}^{t} p(X^i, B, W^i \mid S^i, Y^i) \propto \max_{X,B,W} \log \prod_{i=1}^{t} p(S^i \mid X^i, B)\, p(Y^i \mid \Lambda^i, W^i).$$

By estimating $\{\Lambda^i\}_{i=1}^{t}$ as $\Lambda^i = S^i - I X^i B^{\top}$, we further formulate the objective as:

$$J_p = \max_{X,B,W} \sum_{i=1}^{t} \left[ -\frac{1}{2} \sum_{j=1}^{n_i} \sum_{k=1}^{r} \left( \log(2\pi\sigma_S^2) + \frac{(S^i_{jk} - X^i_{j,:} B_{k,:}^{\top})^2}{\sigma_S^2} \right) - \frac{1}{2} \sum_{j=1}^{n_i} \sum_{m=1}^{c} \left( \log(2\pi\sigma_Y^2) + \frac{\left(Y^i_{jm} - (S^i_{j,:} - X^i_{j,:} B^{\top}) W^i_{:,m}\right)^2}{\sigma_Y^2} \right) \right].$$

In the above objective $J_p$, $\sigma_S^2$ and $\sigma_Y^2$ are parameters; therefore, the objective is equivalent to:

$$J = \min_{X,B,W} \sum_{i=1}^{t} \left[ \left\| S^i_{n_i \times r} - I^i_{n_i \times 1} X^i_{1 \times f} (B_{r \times f})^{\top} \right\|_F^2 + \gamma \left\| \left( S^i_{n_i \times r} - I^i_{n_i \times 1} X^i_{1 \times f} (B_{r \times f})^{\top} \right) W^i - Y^i \right\|_F^2 \right], \tag{3}$$

in which $\gamma = \sigma_S^2 / \sigma_Y^2$ is a nonnegative parameter.

This equivalent objective fits our initial intuitions well. In $J$, the term $\| S^i_{n_i \times r} - I^i_{n_i \times 1} X^i_{1 \times f} (B_{r \times f})^{\top} \|_F^2$ guarantees that the extracted task-independent expert characteristics $X^i$ represent each cardiac cycle in the $i$-th source ECG set well. The other term, $\| (S^i_{n_i \times r} - I^i_{n_i \times 1} X^i_{1 \times f} (B_{r \times f})^{\top}) W^i - Y^i \|_F^2$, ensures that the extracted task-related expert preference $W^i$ can correctly classify cardiac cycles in the $i$-th source ECG set. The parameter $\gamma$ works as the trade-off coefficient which controls the relative importance of the two parts.

To optimize the objective, we add nonnegative constraints to $\{X^i\}_{i=1}^{t}$, $\{W^i\}_{i=1}^{t}$ and $B$ as in [12] and use the iterative multiplicative update method [12]. With this technique, we iteratively update one target matrix while fixing all the other matrices. In detail:

(1) Fix $B$ and $\{X^i\}_{i=1}^{t}$; the objective $J$ becomes:

$$J_W = \min_{W \geq 0} \sum_{i=1}^{t} \left\| \Lambda^i W^i - Y^i \right\|_F^2.$$

For each $W^i$, we have:

$$\frac{\partial J_W}{\partial W^i} = -2 \Lambda^{i\top} Y^i + 2 \Lambda^{i\top} \Lambda^i W^i.$$

Therefore, the updating rule for $\{W^i\}_{i=1}^{t}$ is:

$$W^i \leftarrow W^i \circ \left[ (\Lambda^{i\top} Y^i) \oslash (\Lambda^{i\top} \Lambda^i W^i) \right]. \tag{4}$$

(2) Fix $B$ and $\{W^i\}_{i=1}^{t}$; the objective $J$ becomes:

$$J_X = \min_{X \geq 0} \sum_{i=1}^{t} \left\| S^i - I^i X^i B^{\top} \right\|_F^2 + \gamma \left\| H^i - I^i X^i B^{\top} W^i \right\|_F^2,$$

in which $H^{i\top}_{c \times n_i} = W^{i\top} S^{i\top} - Y^{i\top}$ for $i \in [1, t]$.

For each $X^i$, we have:

$$\frac{\partial J_X}{\partial X^i} = -2 I^{i\top} S^i B + 2 I^{i\top} I^i X^i B^{\top} B + \gamma \left( -2 I^{i\top} H^i P + 2 I^{i\top} I^i X^i P^{\top} P \right),$$

where $P^i_{c \times f} = W^{i\top} B$. Therefore, the updating rule for $\{X^i\}_{i=1}^{t}$ is:

$$X^i \leftarrow X^i \circ \left[ \left( I^{i\top} S^i B + \gamma I^{i\top} H^i P \right) \oslash \left( I^{i\top} I^i X^i B^{\top} B + \gamma I^{i\top} I^i X^i P^{\top} P \right) \right]. \tag{5}$$

(3) Fix $\{X^i\}_{i=1}^{t}$ and $\{W^i\}_{i=1}^{t}$; the objective $J$ becomes:

$$J_B = \min_{B \geq 0} \sum_{i=1}^{t} \left[ \left\| S^i - I^i X^i B^{\top} \right\|_F^2 + \gamma \left\| H^i - I^i X^i B^{\top} W^i \right\|_F^2 \right].$$

For $B$, we have:

$$\frac{\partial J_B}{\partial B} = \sum_{i=1}^{t} \left[ 2 B Q^i Q^{i\top} - 2 S^{i\top} Q^{i\top} + 2\gamma \left( W^i W^{i\top} B Q^i Q^{i\top} - W^i H^{i\top} Q^{i\top} \right) \right],$$

in which $Q^i_{f \times n_i} = X^{i\top} I^{i\top}$. Therefore, the updating rule for $B$ is:

$$B \leftarrow B \circ \left[ \sum_{i=1}^{t} \left( S^{i\top} Q^{i\top} + \gamma W^i H^{i\top} Q^{i\top} \right) \oslash \sum_{i=1}^{t} \left( B Q^i Q^{i\top} + \gamma W^i W^{i\top} B Q^i Q^{i\top} \right) \right]. \tag{6}$$

We summarize the solution in Alg. 1.

Algorithm 1 Expert Extraction for Mobile Device based Arrhythmia Detection
Require: $t$ source ECG sets $\{S^i\}_{i=1}^{t}$, the corresponding label matrices $\{Y^i\}_{i=1}^{t}$, number of features $f$ for expert characteristics and nonnegative trade-off coefficient $\gamma$
Ensure: Expert characteristics $\{X^i\}_{i=1}^{t}$, expert preferences $\{W^i\}_{i=1}^{t}$ and mapping matrix $B$
1: Randomly initialize $\{X^i\}_{i=1}^{t}$, $\{W^i\}_{i=1}^{t}$ and $B$ under the non-negative constraints
2: repeat
3:   Update $\{W^i\}_{i=1}^{t}$ as: $W^i \leftarrow W^i \circ \left[ (\Lambda^{i\top} Y^i) \oslash (\Lambda^{i\top} \Lambda^i W^i) \right]$
4:   Update $\{X^i\}_{i=1}^{t}$ as: $X^i \leftarrow X^i \circ \left[ (I^{i\top} S^i B + \gamma I^{i\top} H^i P) \oslash (I^{i\top} I^i X^i B^{\top} B + \gamma I^{i\top} I^i X^i P^{\top} P) \right]$
5:   Update $B$ as: $B \leftarrow B \circ \left[ \sum_{i=1}^{t} (S^{i\top} Q^{i\top} + \gamma W^i H^{i\top} Q^{i\top}) \oslash \sum_{i=1}^{t} (B Q^i Q^{i\top} + \gamma W^i W^{i\top} B Q^i Q^{i\top}) \right]$
6: until convergence

There are two types of stopping conditions that can be applied to Alg. 1. First, setting a maximum number of iterations for the iterative updating algorithm.
Second, setting an upper bound on the change of a specified matrix. In the implementation, we use the second method, and at the end of the $s$-th iteration we check whether $\sum_{j=1}^{r} \sum_{m=1}^{f} \lvert B^{(s)}_{j,m} - B^{(s-1)}_{j,m} \rvert \leq 0.0001$. If yes, the algorithm has reached convergence; if not, it proceeds to the next iteration.

Another issue that needs to be taken care of is under-sampling the labeled source ECG sets to balance the number of cardiac cycles in each class. The reason for this operation is that, in the objective $J$, data with imbalanced class distributions cause $W^i$ to put more weight on instances in the majority class (normal cardiac cycles in the paper). In the implementation, we randomly under-sample the normal cardiac cycles so that their number equals the number of arrhythmias in each expert ECG set.

With the expert extraction method, we are able to map each labeled expert ECG set to two $1 \times f$ vectors while keeping the discriminating ability of the expert. In the real world, since there is a nearly infinite number of possible labeled expert ECG sets, storing all possible processed expert vectors on mobile devices would be very inefficient in both space and time. To solve this problem, we propose a straightforward strategy to efficiently select the processed experts and to keep the number of experts at a user-specified size. The principle of the expert selection strategy is to maximize the coverage of the set of experts while minimizing duplicates. We define the similarity of the $i$-th and the $j$-th experts as:

$$S(i, j) = S_v(X^i, X^j)\, S_v(W^i, W^j),$$

where $S_v(U, V)$ is an off-the-shelf similarity measure on two vectors $U$ and $V$. If two experts have high similarity, then they are largely duplicates of each other. A threshold pruning strategy can be used to select the experts: with a pre-defined threshold on the expert similarity, iteratively check whether any two experts have a similarity higher than the threshold. If so, randomly remove one of them. Repeat until no more experts can be removed.
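The threshold pruning strategy described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the paper leaves the similarity measure $S_v$ open, so cosine similarity is used here as an assumed choice, and the function names `cosine_similarity`, `expert_similarity` and `prune_experts` are hypothetical.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity of two arrays, flattened (assumed choice for S_v)."""
    u, v = np.ravel(u), np.ravel(v)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

def expert_similarity(X_i, W_i, X_j, W_j):
    """S(i, j) = S_v(X^i, X^j) * S_v(W^i, W^j)."""
    return cosine_similarity(X_i, X_j) * cosine_similarity(W_i, W_j)

def prune_experts(X_list, W_list, threshold, seed=0):
    """Return indices of experts kept after threshold pruning."""
    rng = np.random.default_rng(seed)
    kept = list(range(len(X_list)))
    while True:
        # Find any remaining pair whose similarity exceeds the threshold.
        pair = next(
            ((i, j)
             for a, i in enumerate(kept)
             for j in kept[a + 1:]
             if expert_similarity(X_list[i], W_list[i],
                                  X_list[j], W_list[j]) > threshold),
            None,
        )
        if pair is None:
            return kept  # no more duplicates: stop
        # Randomly remove one expert of the duplicated pair.
        kept.remove(pair[int(rng.integers(2))])

if __name__ == "__main__":
    X_list = [np.array([1.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    W_list = [np.eye(2), np.eye(2), np.array([[0.0, 1.0], [1.0, 0.0]])]
    print(prune_experts(X_list, W_list, threshold=0.9))
```

Because the removed expert is chosen at random, the surviving set depends on the seed; only its size and the absence of pairs above the threshold are determined by the inputs in this small example.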
2.3 Ensemble of Experts
Given $t$ processed and selected experts, the most naive approach is to select the best expert for each mobile device user. However, this naive strategy has several significant drawbacks. First, choosing the best expert for a specific mobile device user would either involve complex distribution matching between the user’s ECG and the experts’ signals, or require labeled cardiac cycles of the user to supervise the choice of the expert. Therefore, the naive strategy is either inefficient or expensive (requiring labeled cardiac cycles). Second, compared to the number of mobile device users, the number of experts is much smaller. We may not be able to find an expert that performs well enough for each mobile device user.

To avoid the above drawbacks, in this section we propose an ensemble of experts model which integrates the selected experts for MAD. Specifically, the integration of opinions from the selected $t$ experts is achieved through:

$$p(Y^T \mid T) = \sum_{i=1}^{t} p(Y^T \mid T, S^i)\, p(S^i \mid T). \tag{7}$$

In Eq. 7, $p(Y^T \mid T)$ is the conditional probability of the target label matrix $Y^T$ given the target ECG set $T$. $p(Y^T \mid T, S^i)$ is the probability that the $i$-th expert $S^i$, classifying the target set $T$, obtains label matrix $Y^T$. $p(S^i \mid T)$ is the probability that the $i$-th expert fits the classification task on the target set $T$. Among the three conditional probabilities, $p(Y^T \mid T, S^i)$ can be easily obtained as:

$$p(Y^T \mid T, S^i) = \left( T - I^T_{n_T \times 1} X^i_{1 \times f} (B_{r \times f})^{\top} \right) W^i_{r \times c}.$$

We model the conditional probability $p(S^i \mid T)$ as:

$$p(S^i \mid T) = \alpha + \beta D(X^i, X^T),$$

in which $X^T$ is the mobile user characteristic vector; we discuss the details of estimating $X^T$ in the next section. $\alpha$ and $\beta$ are two coefficients that map the distance between $X^i$ and $X^T$ to the probability $p(S^i \mid T)$. $D$ is a distance metric measuring the closeness of two characteristic vectors $X^i$ and $X^T$. For simplicity, in the implementation we use the L2 norm as the distance metric:

$$D(X^i, X^T) = \sqrt{\sum_{k=1}^{f} (X^i_k - X^T_k)^2}.$$

The intuition behind the model of $p(S^i \mid T)$ is that if an expert has characteristics closer to those of the target $T$, then it is more trustworthy in learning the target $T$. In our setting, $D$ is a distance metric, so the closer $X^T$ and $X^i$ are, the smaller $D(X^T, X^i)$ is. To fit the intuition, we constrain $\beta$ as $\beta \leq 0$.

Manually setting the values of $\alpha$ and $\beta$ is subjective and may not fit real situations well. Therefore, we propose a cross-validation based learning method to set the values of $\alpha$ and $\beta$ automatically. When $\alpha$ and $\beta$ are optimized, for each of the $t$ labeled expert ECG sets, the label matrix $Y^i$ should be close to the label matrix estimated by all the other experts. Following this intuition, we obtain the objective for optimizing $\alpha$ and $\beta$ as:

$$J_{\alpha\beta} = \min_{\alpha,\beta} \sum_{i=1}^{t} \sum_{j=1}^{n_i} \left[ Y^i_{j,:} - (\alpha + \beta \cdot D_{i,:}) R^{ij} \right]^2,$$

where $D_{il}$ is the distance measured by $D(X^i, X^l)$ for $l \in [1, t]$ and $l \neq i$. The auxiliary matrix $R^{ij}_{(t-1) \times c}$ is defined as:

$$R^{ij}_{l,:} = \left( S^i_{j,:} - X^l B^{\top} \right) \cdot W^l.$$

Then we have:

$$\frac{\partial J_{\alpha\beta}}{\partial \alpha} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} I \left[ 2 R^{ij} R^{ij\top} (\alpha \cdot I + \beta \cdot D_i)^{\top} - 2 R^{ij} Y^{i\top}_{j,:} \right],$$

$$\frac{\partial J_{\alpha\beta}}{\partial \beta} = \sum_{i=1}^{t} \sum_{j=1}^{n_i} D_i \left[ 2 R^{ij} R^{ij\top} (\alpha \cdot I + \beta \cdot D_i)^{\top} - 2 R^{ij} Y^{i\top}_{j,:} \right].$$

Setting $\frac{\partial J_{\alpha\beta}}{\partial \alpha} = 0$ and $\frac{\partial J_{\alpha\beta}}{\partial \beta} = 0$, we then obtain:

$$\alpha = \frac{\sum_{i=1}^{t} \sum_{j=1}^{n_i} I \left( R^{ij} Y^{i\top}_{j,:} - R^{ij} R^{ij\top} \beta D_i^{\top} \right)}{\sum_{i=1}^{t} \sum_{j=1}^{n_i} I R^{ij} R^{ij\top} I^{\top}},$$

$$\beta = \frac{\sum_{i=1}^{t} \sum_{j=1}^{n_i} D_i \left( R^{ij} Y^{i\top}_{j,:} - R^{ij} R^{ij\top} \alpha I^{\top} \right)}{\sum_{i=1}^{t} \sum_{j=1}^{n_i} D_i R^{ij} R^{ij\top} D_i^{\top}}.$$

2.4 Mobile Device based Implementation
With the extracted expert characteristics $\{X^i\}_{i=1}^{t}$, expert preferences $\{W^i\}_{i=1}^{t}$, mapping matrix $B$, and the ensemble of experts model, we are able to classify ECG signals of mobile device users. In this section, we explicitly discuss the details of the mobile device based implementation.

To measure the closeness of the target ECG signal to each existing expert, we extract the characteristics of the target through the objective:

$$J_T = \min_{X^T \geq 0} \left\| \hat{T}_{n \times r} - I_{n \times 1} X^T_{1 \times f} (B_{r \times f})^{\top} \right\|_F^2. \tag{8}$$

In the objective $J_T$, $\hat{T}_{n \times r}$ is a part of the target ECG set and $n \leq n_T$. In a mobile device based implementation, the objective stands for learning the user characteristics using the data collected in a certain period of time. The classification of the user’s signal starts after this extraction period. To solve the objective, an iterative multiplicative approach is derived. Since

$$\frac{\partial J_T}{\partial X^T} = -2 I^{\top} \hat{T} B + 2 I^{\top} I X^T B^{\top} B,$$

we can infer:

$$X^T \leftarrow X^T \circ \left[ (I^{\top} \hat{T} B) \oslash (I^{\top} I X^T B^{\top} B) \right].$$

Then the cardiac cycles in the target ECG signal can be classified according to Eq. 7. We summarize the algorithm for MAD in Alg. 2.

Algorithm 2 Mobile Device based Arrhythmia Detection
Require: $t$ expert characteristics $\{X^i\}_{i=1}^{t}$, expert preferences $\{W^i\}_{i=1}^{t}$, the mapping matrix $B_{r \times f}$, ensemble model coefficients $\alpha$ and $\beta$, and the target ECG set $T$
Ensure: $Y^T$, the label matrix which contains the probability that each cardiac cycle in $T$ is an arrhythmia
1: Randomly initialize $X^T$ under the non-negative constraints
2: repeat
3:   Update $X^T$ as: $X^T \leftarrow X^T \circ \left[ (I^{\top} \hat{T} B) \oslash (I^{\top} I X^T B^{\top} B) \right]$
4: until convergence
5: Calculate the distance between the target and expert characteristics: $D^T(i) = D(X^i, X^T)$ for $i \in [1, t]$
6: Classify cardiac cycles in $T$ through: $Y^T_{j,:} = \sum_{i=1}^{t} \left( T_{j,:} - X^i B^{\top} \right) W^i \left( \alpha + \beta D^T_i \right)$

2.5 Space Usage and Time Efficiency
In this section, we theoretically analyze the space usage and the time efficiency of the proposed ensemble algorithm.
2.5.1 Space Usage
In this method, mobile devices need to store the following information: $t$ extracted expert characteristics $\{X^i_{1 \times f}\}_{i=1}^{t}$; $t$ extracted expert preferences $\{W^i_{r \times c}\}_{i=1}^{t}$; one mapping matrix $B_{r \times f}$; and the two parameters of the ensemble model, $\alpha$ and $\beta$. Moreover, to avoid repeating the calculations of $D^T_{(t-1) \times 1}$ and $\{X^i B^{\top}\}_{i=1}^{t}$, we also prefer to keep them on the mobile devices. Therefore, the overall space for the proposed ensemble algorithm is:

$$SPACE_M = O(t \cdot f + t \cdot r \cdot c + r \cdot f + t + t \cdot r + 1) \leq O(t \cdot r \cdot c + r^2).$$

In $SPACE_M$, $c$ is the fixed number of classes and $c \geq 2$. $r$ is the number of features in each cardiac cycle, which is usually determined by the sampling frequency of the ECG sensors used. $f$ is the input number of features and $0 \leq f \leq r$. Thus the overall space is linear w.r.t. the number of experts.

In comparison, general multiple source learning baselines for arrhythmia detection usually require mobile devices to store the $t$ original ECG sets $\{S^i_{n_i \times r}\}_{i=1}^{t}$. For simplicity, suppose $n_i = n$ for $i \in [1, t]$. In these baselines, the minimum space usage is:

$$SPACE_B = O(t \cdot n \cdot r).$$

Several existing methods utilize over-sampling and under-sampling to balance the number of cardiac cycles in each class. Suppose that for each expert ECG set, $\rho \cdot n$ of the cycles are arrhythmias, in which $\rho \in (0, 0.5)$. Then for over-sampling baselines, the space usage is at least:

$$SPACE_O = O((1 - \rho) \cdot t \cdot n \cdot r).$$

And for under-sampling baselines, the space usage is at least:

$$SPACE_U = O(\rho \cdot t \cdot n \cdot r).$$

Figure 2: Theoretical Comparisons on Space Usages. (Plot of log space usage against the number of experts, from 10 to 100, for the mixture of experts model and the multiple sources learning, over-sampling and under-sampling baselines.)

Under the settings $\rho = 0.1$, $r = 40$, $c = 10$ and $n = 1000$, the detailed comparisons among the above strategies are shown in Fig. 2. In comparison, the space used by the proposed ensemble model is consistently 50 to 500 times less than that of the other baselines.
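The space expressions above can be evaluated directly for the stated setting. The sketch below is illustrative only: it counts stored values using the terms inside the $O(\cdot)$ expressions and ignores constant factors, so it is not expected to reproduce Fig. 2 exactly. The text does not fix the value of $f$; the choice $f = r$ (its upper bound) is an assumption, and the function names are hypothetical.

```python
def space_proposed(t, r, c, f):
    """Terms of SPACE_M: X (t*f), W (t*r*c), B (r*f), D^T (t), X B^T (t*r), alpha/beta."""
    return t * f + t * r * c + r * f + t + t * r + 1

def space_baseline(t, n, r):
    """SPACE_B: storing the t original ECG sets."""
    return t * n * r

def space_oversampling(t, n, r, rho):
    """SPACE_O lower bound as given in the text."""
    return (1 - rho) * t * n * r

def space_undersampling(t, n, r, rho):
    """SPACE_U lower bound as given in the text."""
    return rho * t * n * r

if __name__ == "__main__":
    rho, r, c, n = 0.1, 40, 10, 1000   # setting stated in the text
    f = r                               # assumption: f at its upper bound
    for t in range(10, 101, 10):
        m = space_proposed(t, r, c, f)
        print(t, m,
              round(space_baseline(t, n, r) / m, 1),
              round(space_oversampling(t, n, r, rho) / m, 1),
              round(space_undersampling(t, n, r, rho) / m, 1))
```

The proposed model's count has no dependence on $n$, which is the source of the gap: every baseline grows with $t \cdot n \cdot r$, while the ensemble grows with $t \cdot r \cdot c$.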
Along with the expert selection strategy proposed in Section 2.2, the space usage of the proposed model is low enough for mobile devices. 2.5.2 2.5.1 0 Time Efficiency The computation costs are as follows. The proposed ensemble model is divided into four parts. (1) Expert Extraction. As shown in Alg. 1, this part involves an iterative solution, in which we iteratively update i {W1×f }ti=1 , {X i }ti=1 and Br×f . Specifically, in each iteration, updating {W i }ti=1 takes O(t · c · n · r2 ); updating {X i }ti=1 takes O(t · n · r · f 2 ); and updating B takes O(t · c · n · r2 · f 2 ). 271 Suppose q is the total number of iterations in the optimizing process, the overall time complexity for the expert extraction is: T IM EEE = O(q · t · c · n · r2 · f 2 ). (2) Ensemble Model Learning. The aim of this part is to learn the parameters α and β for the ensemble model. ij In the process, calculating the auxiliary matrix Rl.: takes O(t2 · c · n · r · f ). Calculating α and β takes O(t3 · c · n). Thus the overall time complexity for the ensemble model learning is: 2 3 T IM EM E = O(t · c · n · r · f + t · c · n). (3) Target Characteristics Extraction. As shown in Alg. 2, this part focuses on extracting the characteristic vectors X T for the target ECG signals from mobile device users. The time complexity is: T IM ET E = O(q · n · r · f 2 ). (4) Real Time Classification. With the above learned model, the cardiac in the target ECG set canbe clasP cycles T sified as Yj,: = ti=1 Tj,: − X i B > W i α + βDiT . Since X i B > and D> have been calculated and stored in the last step. The classification of each cardiac cycle in the target ECG set can be achieved at the cost of: T IM ERT = O(t · c · r). Among the four steps, the experts extraction part and the ensemble model learning part (step (1) and (2)) don’t need input of any information from the target users, thus can be finished before the implementation on mobile devices. In the step (3), T IM ET E is linear w.r.t. 
the number of iterations in the optimizing process as well as the number of cardiac cycles used for the target characteristics extraction. This step can be finished on mobile devices before the cardiac cycles of the target users are classified. The final step, which requires real time processing, takes only $O(t\cdot c\cdot r)$. This complexity is linear in the number of experts $t$, considering that $r$, the number of features in each cardiac cycle, and $c$, the number of classes, are task-dependent constants. We experimentally evaluate $TIME_{TE}$ and $TIME_{RT}$ in Section 3.

3. EXPERIMENTS

In this section, we experimentally evaluate and demonstrate the effectiveness and efficiency of the proposed ensemble model for MAD. In the experiments, we implement the proposed algorithm on the MIT-BIH Arrhythmia Dataset [16], and compare the method with seven state-of-the-art approaches in the area. The results show that the proposed framework can successfully integrate decisions from different experts and identify arrhythmias in ECG signals, with significant improvements over the compared methods in most cases. Besides, the real time processing speed and low space usage further validate that the proposed algorithm is suitable for mobile device based implementation.

3.1 Dataset Descriptions

The MIT-BIH Arrhythmia Dataset [16] has been used in many existing studies [13, 9] for experiments on arrhythmia detection. For a fair comparison, we use the ECG signals of the same set of patients as in [9]. The patient numbers of the set are 100, 101, 103, 105, 109, 115, 121, 201, 202, 210, 215, 230 and 232. The selected ECG signals are preprocessed in a way similar to [14]. Each patient's ECG signal covers from 1008 to 1415 cardiac cycles ($\{n_i\}_{i=1}^{t}$), and each cardiac cycle is represented by $r = 39$ features. The percentage of arrhythmias ($\rho$) in each patient's ECG set varies from 0.7% to 29.38%, with an average of only 9.09%.
3.2 Evaluation Metric

The most common evaluation metric for classification results is overall accuracy, which captures the percentage of correctly classified instances over the whole set of instances. However, for arrhythmia detection, the accuracy metric does not work properly. For example, if all the instances in an extremely imbalanced dataset are classified into the majority class, the measured accuracy is very high even though every classification on the minority class is wrong. The ECG datasets in our experiments are very imbalanced, and in arrhythmia detection we focus more on the arrhythmias, i.e., the minority. Therefore, in the experiments, we have to use an evaluation metric that is robust to class imbalance.

Generally, for binary classification, the classified results can be divided into four cases: TP (true positive), TN (true negative), FP (false positive) and FN (false negative), where TP denotes the instances that are classified as positive and are positive in the ground truth, and similarly for TN, FP and FN. Without loss of generality, in the experiments, we assume the class of arrhythmias to be the positive class. The ROC curve illustrates the fraction of TP out of the positives vs. the fraction of FP out of the negatives, and thus can be used to evaluate the performance of arrhythmia detection. In the experiments, we use the area under the ROC curve (AUC) to numerically evaluate the performance of each investigated method. AUC is in the range $[0, 1]$; the higher the AUC a method achieves, the better its performance.

3.3 Compared Methods

We compare the proposed ensemble model with a set of 6 state-of-the-art approaches from the multiple sources transfer learning (MSTL) area, which are described as follows. The first two compared approaches are two recent MSTL methods, CRC [15] and GCM [8].
CRC uses the assumption that all source experts are closely related to the target sets, while GCM assumes that the majority of the source experts are similar to the target. Based on these different assumptions, both methods seek to maximize the consensus among experts on the classifications of target instances.

Two recently proposed MSTL methods, MDA [4] and LWE [7], relax the above assumptions by weighing the importance of the source experts. Both methods are developed based on the assumption that if the predictions of an expert are consistent among instances that are close in the feature space, then the expert should obtain a high weight. The difference between the two methods is that MDA assigns a weight to each expert, while LWE puts a weight on each instance of each expert.

Since the datasets used in the experiments are highly imbalanced, two MSTL methods for imbalanced datasets, DAM [6] and SLW [9], are also considered. DAM computes the weight of each source by computing the Maximum Mean Discrepancy [1] between source samples and target samples. SLW assigns weights to each expert based on the expert's ability to predict accurately on the local region of the target in the feature space.

Table 2: Experiments on the MIT-BIH Arrhythmia Dataset.

Data Index  CRC    GCM    MDA    LWE    DAM    LP     SLW    EE
100         0.666  0.777  0.760  0.722  0.925  0.959  0.975  1.000
101         0.611  0.779  0.742  0.423  0.753  0.802  0.820  0.987
103         0.511  0.626  0.478  0.543  0.648  0.683  0.920  0.955
105         0.522  0.654  0.714  0.718  0.725  0.617  0.731  0.713
109         0.620  0.739  0.700  0.753  0.879  0.837  0.964  0.949
115         0.576  0.679  0.654  0.720  0.746  0.503  0.713  0.860
121         0.534  0.610  0.655  0.492  0.572  0.526  0.710  0.893
201         0.600  0.699  0.843  0.854  0.894  0.892  0.945  0.926
202         0.600  0.715  0.818  0.795  0.847  0.675  0.881  0.930
210         0.617  0.699  0.830  0.819  0.899  0.828  0.947  0.932
215         0.620  0.760  0.537  0.632  0.544  0.869  0.919  0.935
230         0.614  0.679  0.334  0.610  0.674  0.824  0.859  0.680
232         0.652  0.771  0.724  0.948  0.855  0.954  0.978  0.982
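As a reading aid for Table 2, the per-method tallies that the performance discussion relies on (average AUC, number of datasets on which a method is best, and pairwise win counts) can be recomputed directly from the table entries. The sketch below only re-tabulates the published numbers; the helper names are hypothetical and not part of the original work.

```python
# AUC scores from Table 2, ordered by patient:
# 100, 101, 103, 105, 109, 115, 121, 201, 202, 210, 215, 230, 232
AUC = {
    "CRC": [0.666, 0.611, 0.511, 0.522, 0.620, 0.576, 0.534,
            0.600, 0.600, 0.617, 0.620, 0.614, 0.652],
    "GCM": [0.777, 0.779, 0.626, 0.654, 0.739, 0.679, 0.610,
            0.699, 0.715, 0.699, 0.760, 0.679, 0.771],
    "MDA": [0.760, 0.742, 0.478, 0.714, 0.700, 0.654, 0.655,
            0.843, 0.818, 0.830, 0.537, 0.334, 0.724],
    "LWE": [0.722, 0.423, 0.543, 0.718, 0.753, 0.720, 0.492,
            0.854, 0.795, 0.819, 0.632, 0.610, 0.948],
    "DAM": [0.925, 0.753, 0.648, 0.725, 0.879, 0.746, 0.572,
            0.894, 0.847, 0.899, 0.544, 0.674, 0.855],
    "LP":  [0.959, 0.802, 0.683, 0.617, 0.837, 0.503, 0.526,
            0.892, 0.675, 0.828, 0.869, 0.824, 0.954],
    "SLW": [0.975, 0.820, 0.920, 0.731, 0.964, 0.713, 0.710,
            0.945, 0.881, 0.947, 0.919, 0.859, 0.978],
    "EE":  [1.000, 0.987, 0.955, 0.713, 0.949, 0.860, 0.893,
            0.926, 0.930, 0.932, 0.935, 0.680, 0.982],
}


def mean_auc(auc):
    """Average AUC of every method over the 13 ECG sets."""
    return {m: sum(s) / len(s) for m, s in auc.items()}


def count_best(auc, method):
    """Number of ECG sets on which `method` reaches the highest AUC."""
    n = len(auc[method])
    return sum(
        1 for k in range(n)
        if auc[method][k] == max(scores[k] for scores in auc.values())
    )


def count_wins(auc, a, b):
    """Number of ECG sets on which method `a` scores strictly above `b`."""
    return sum(1 for x, y in zip(auc[a], auc[b]) if x > y)


if __name__ == "__main__":
    for method, avg in sorted(mean_auc(AUC).items(), key=lambda kv: -kv[1]):
        print(f"{method:4s} mean AUC = {avg:.3f}, best on {count_best(AUC, method)} sets")
```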
Moreover, to demonstrate the benefits of MSTL, we include Label Propagation (LP) [20] as another compared method. LP does not use any source expert; it relies purely on propagating labels from a small amount of labeled data in the target's dataset. The results of these compared algorithms on the MIT-BIH Arrhythmia Dataset are collected from [9].

3.4 Performance Study

In the experiments, we iteratively fix one of the ECG sets as the target while viewing the other sets as the source experts. For all the cases, we set $\gamma = 1000$ and $f = 16$. We run all the experiments in MATLAB on a PC with a 2.30 GHz Intel Core i7-3610QM CPU and 8.00 GB RAM.

We present the experimental results in Table 2. In the table, the results of the proposed method are listed under EE (Ensemble of Experts), the name of each ECG set is listed under "Data Index", and the best performance among the investigated approaches is marked in bold.

Comparing the proposed method with LP, our method outperforms LP in 12 out of the 13 ECG signal sets. Please notice that in the experiments on LP, several target cardiac cycles are assumed to be labeled ahead of learning, while the proposed ensemble model requires no labeled cardiac cycles in the targets. The superior performance of our method over LP verifies the benefit of transferring knowledge from publicly available and well labeled ECG sets to the signals of general mobile device users. The knowledge transferring process not only needs no labeled cardiac cycles of the targets, but also performs significantly better than the supervised LP method, which shows that the proposed ensemble model is effective in addressing the lack of labeled data in the task of MAD.

Among the 6 compared state-of-the-art approaches in MSTL, we notice that CRC generally performs badly on all the cases.
In 11 out of the 13 cases, the results are worse than those of the LP method, which indicates that for arrhythmia detection, multiple sources transfer learning using CRC cannot achieve better performance than label propagation on the target itself. The reason for the bad performance of CRC is that the characteristics of each ECG set can be quite different from the others, so the consensus of the diverse expert characteristics may not match the target well. GCM works better since it only assumes that the majority of the experts are reasonable. However, for the task of arrhythmia detection, the majority of experts may still be far from the targets in the feature space. Thus the performance of GCM is still lower than that of LP in 9 out of the 13 cases.

The two weighted MSTL methods, MDA and LWE, show very unstable performance over different cases. In several ECG sets, such as 103 for MDA and 101 for LWE, the AUC scores are even lower than 0.5, which means performing worse than random guessing. Such instability is mainly caused by the severe data imbalance in ECG sets. Since the percentages and distributions of arrhythmias in different ECG sets are quite diverse, the weighting MSTL approaches MDA and LWE perform well when the weighting fits the target and perform badly otherwise.

As for the two MSTL methods for imbalanced data sets, DAM and SLW, the results are better than those of LP in most cases. Similar to the proposed ensemble model, these two methods also assume that the importance of each expert depends on the distance between the expert and the target. The difference between the two methods and ours is that DAM uses Maximum Mean Discrepancy to estimate the distance, SLW uses the percentage of shared regions in the feature space to capture the distance, and the proposed ensemble model uses the extracted expert and target characteristics to estimate the distance.
The superiority of these three methods over all the other compared approaches supports the robustness of the assumption.

Comparing the proposed method with the other 7 investigated approaches, our model achieves the best result in 8 out of the 13 cases. In the other 5 cases, the results are also close to the best or highly ranked among the 8 investigated methods. Overall, the proposed ensemble model outperforms all the other investigated methods in average AUC score. This shows that the proposed ensemble model designed for MAD can outperform state-of-the-art multiple sources transfer learning and supervised classification techniques.

Figure 3: Convergence Rate (changes over iterations; y-axis: Log Rate of Change; x-axis: Number of Iterations; patients 100, 201 and 230).

Figure 5: Testing Time (testing time for each cardiac cycle; y-axis: Running Time; x-axis: Cardiac Cycle Index; patients 100, 201 and 230).

3.5 Running Time and Space Usage

3.5.1 Running Time

As mentioned in Section 2.5.2, in the mobile device based implementation, we mainly need to consider the time costs of two steps: target characteristics extraction and real time classification. We thus focus on testing the related time costs $TIME_{TE}$ and $TIME_{RT}$, respectively.

The time cost of the target characteristics extraction can be observed in detail in Fig. 3 and Fig. 4, where we show the plots for patients 100, 201 and 230 (due to the space limit, we only show 3 cases). Fig. 3 shows the value of $\sum_{j=1}^{r}\sum_{m=1}^{f} \| B^{(s)}_{j,m} - B^{(s-1)}_{j,m} \|$ at the $s$-th iteration, while $s$ varies from 1 to 2000.
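The convergence measure plotted in Fig. 3 is the sum of entry-wise absolute changes of $B$ between consecutive iterations. A minimal sketch of how such a measure can drive a stopping rule is given below. The update rule `toy_update` is a hypothetical stand-in (the actual update of Alg. 2 is not reproduced here), and using $10^{-4}$ as a tolerance with a cap of 2000 iterations is an assumption borrowed from the values reported in the text, not a stopping rule stated by the authors.

```python
def change_rate(b_prev, b_curr):
    """Sum over all entries (j, m) of |B_curr[j][m] - B_prev[j][m]|."""
    return sum(
        abs(curr - prev)
        for row_prev, row_curr in zip(b_prev, b_curr)
        for prev, curr in zip(row_prev, row_curr)
    )


def run_until_converged(update, b0, tol=1e-4, max_iter=2000):
    """Apply `update` repeatedly until the change rate drops below `tol`.

    Returns the final matrix and the number of iterations performed.
    """
    b = b0
    for s in range(1, max_iter + 1):
        b_next = update(b)
        rate = change_rate(b, b_next)
        b = b_next
        if rate < tol:
            return b, s
    return b, max_iter


def toy_update(b):
    """Hypothetical update: move every entry halfway towards 1.0."""
    return [[0.5 * (v + 1.0) for v in row] for row in b]


if __name__ == "__main__":
    b_final, iterations = run_until_converged(toy_update, [[0.0, 0.0]])
    print(iterations, b_final)
```

Plotting `rate` on a logarithmic axis against `s` would give a curve of the same kind as Fig. 3.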
From the plot, we can observe that the change rate of $B$ falls below $10^{-4}$ before the 2000-th iteration, which means that in our settings the target characteristics extraction process converges in fewer than 2000 iterations. Fig. 4 shows the time cost of each iteration in the process. For all three patients, each iteration of the target characteristics extraction takes less than $10^{-3}$ seconds. Therefore, overall, the target characteristics extraction process takes less than 2 seconds.

The other part of the mobile device based implementation of our method is the real time classification of the users' cardiac cycles. To evaluate this time cost, we show the classification time for each cardiac cycle of patients 100, 201 and 230 in Fig. 5. The time cost for classifying each cardiac cycle is less than $2\times 10^{-4}$ seconds, which is 4000 times shorter than the typical duration of a cardiac cycle (0.8 seconds).

According to the above results, even though mobile devices have lower processing speed than the equipment used in the above simulations, the time costs of target characteristics extraction and cardiac cycle classification are low enough for mobile device based implementation. Moreover, as mentioned in Section 2.5.2, the target characteristics extraction time $TIME_{TE}$ is independent of the number of experts, while the classification time $TIME_{RT}$ is linear in the number of experts. Thus the above experimental results suggest that the proposed ensemble model can use over 40,000 experts in the learning while keeping real time processing speed. To sum up, the experiments in this section show that the proposed ensemble model is highly efficient for MAD.

Figure 4: Running Time (running time over iterations; y-axis: Log Running Time; x-axis: Number of Iterations; patients 100, 201 and 230).

In this section, by numerically evaluating the running time and space usage of our model in the implementation, we seek to demonstrate the efficiency of our model for MAD.
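The running time figures quoted in this subsection follow from simple arithmetic, which the short sketch below makes explicit. It uses integer microseconds to avoid rounding noise. The constant and function names are hypothetical, and the value of 12 experts behind the measured classification time is an assumption based on the setup of Section 3.4 (one ECG set as target, the other 12 as sources), not a figure stated alongside the timing results.

```python
CYCLE_US = 800_000        # typical cardiac cycle: 0.8 s
CLASSIFY_US = 200         # upper bound per classified cycle: 2e-4 s
ITERATION_US = 1_000      # upper bound per extraction iteration: 1e-3 s
MAX_ITERATIONS = 2_000    # iterations shown in Fig. 3 and Fig. 4
MEASURED_EXPERTS = 12     # assumption: 12 source experts in the measurement


def realtime_margin(cycle_us, classify_us):
    """How many classifications fit into one cardiac cycle."""
    return cycle_us // classify_us


def max_experts(measured_experts, margin):
    """Largest expert count that still fits one cycle, assuming the
    classification time grows linearly with the number of experts."""
    return measured_experts * margin


def extraction_time_us(iterations, iteration_us):
    """Upper bound on the total target characteristics extraction time."""
    return iterations * iteration_us


if __name__ == "__main__":
    margin = realtime_margin(CYCLE_US, CLASSIFY_US)
    print("margin:", margin)
    print("experts within budget:", max_experts(MEASURED_EXPERTS, margin))
    print("extraction bound (s):",
          extraction_time_us(MAX_ITERATIONS, ITERATION_US) / 1_000_000)
```

Under these assumptions the linear scaling leaves room for somewhat more than the 40,000 experts mentioned in the text.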
Figure 6: Space Usage in Experiments (y-axis: Space Usage (KB); x-axis: Dataset Index, patients 100, 201 and 230; series: Oversampling, Naive, Undersampling, Mixture of Experts).

3.5.2 Space Usage

As discussed in Section 2.5.1, for the mobile device based implementation, we need to store $\{X^i\}_{i=1}^{t}$, $\{W^i\}_{i=1}^{t}$, $B$ and several other parameters on the mobile devices. We experimentally evaluate the space usage and show the results in Fig. 6. Besides the space usage of the proposed method, $SPACE_M$, the space costs of the three baseline strategies discussed earlier have also been experimentally evaluated: $SPACE_B$ for naive multiple source transfer learning, $SPACE_O$ for over-sampling MSTL and $SPACE_U$ for under-sampling MSTL. The proposed ensemble model uses around 100 times less space than all the baselines, and needs only around 10 KB for learning from 12 source experts. When scaling to 40,000 experts, the space usage of our method is only around 40 MB, which is still quite low for mobile device based implementation. In contrast, for the most space-saving baseline, under-sampling, the space usage would be over 1 GB, which is very large for mobile device based monitoring programs and tends to waste much energy at run time. Overall, the experiments in this section show that the proposed ensemble model is space efficient enough for MAD.

4. RELATED WORK

In the areas of machine learning and data mining, arrhythmia detection can be categorized as a specific problem in anomaly detection [19, 18], which focuses on mining the irregular and rare instances in data sets. For this problem, one general idea is to over-sample the minority [5] or under-sample the majority [11]. As discussed in Section 2.2, in the proposed method we use the same idea and randomly under-sample the majority to balance the trained expert preferences $\{W^i\}_{i=1}^{t}$.
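Random under-sampling of the majority class, as mentioned above, can be sketched in a few lines. This is a generic illustration rather than the procedure of Section 2.2, whose details are not reproduced here; the function name, the label convention (1 for arrhythmia, 0 for normal) and the fixed seed are assumptions made for the example.

```python
import random


def undersample_majority(labels, seed=0):
    """Return sorted indices of a class-balanced subset.

    All minority instances (label 1) are kept, together with an equally
    sized random sample of majority instances (label 0).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    return sorted(minority + kept)


if __name__ == "__main__":
    # Toy label vector with 10% arrhythmias, mirroring rho = 0.1.
    labels = [1] * 10 + [0] * 90
    subset = undersample_majority(labels)
    print(len(subset), sum(labels[i] for i in subset))
```

The balanced index set would then be used to select the cardiac cycles on which an expert's preference matrix is trained.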
Another general idea [3] for handling data imbalance is to weigh the importance of each instance, such that minority instances obtain higher weights while majority instances obtain lower weights. The drawback of these approaches is that they require sufficient labeled instances to guide the sampling and reweighing.

There are also several methods that do not require any supervised information. The one-class support vector machine (OCSVM) [17] assumes that majority instances are densely located in the reproducing kernel Hilbert space and tries to draw a round boundary to distinguish majority from minority. The method in [10] uses another assumption, namely that minority instances are either not in any cluster or far from their cluster centers. On the one hand, although such unsupervised methods do not require labeled instances, they always require the training data to include sufficient minority instances to learn the distributions of minority and majority well. For MAD, such conditions can hardly be satisfied, because there may be no arrhythmia during the training data collection period, no matter how long the period is. On the other hand, such methods usually need a long learning time, which is inefficient for mobile device based implementation.

5. CONCLUSIONS

In this paper, we proposed an ensemble model for MAD. To address the lack of labeled data for each mobile device user, we proposed to integrate information from existing publicly available and well labeled ECG signals as supervised experts to guide the learning of cardiac cycles in the ECG signals of mobile device users. To reduce the space usage as well as the running time, each expert ECG set is mapped into a characteristic vector and a preference vector before the implementation on mobile devices. Moreover, to fully use the multiple extracted experts, we proposed to simultaneously consider the opinions of all experts, following the principle that an expert that is closer to the target has better predicting ability.
Experiments on a real world data set validated that the proposed method performs better than 7 state-of-the-art methods while keeping low space usage and real time processing speed for MAD.

6. ACKNOWLEDGMENTS

The materials published in this paper are partially supported by the National Science Foundation under Grants No. 1218393, No. 1016929, and No. 0101244.

7. REFERENCES

[1] Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, pages e49–e57, 2006.
[2] M. T. Bahadori, Y. Liu, and D. Zhang. Learning with minimum supervision: A general framework for transductive transfer learning. Proc. of ICDM'11, pages 61–70, 2011.
[3] X. Chang, Q. Zheng, and P. Lin. Cost-sensitive supported vector learning to rank. Learning, pages 305–314, 2009.
[4] R. Chattopadhyay, J. Ye, S. Panchanathan, W. Fan, and I. Davidson. Multi-source domain adaptation and its application to early detection of fatigue. Proc. of KDD'11, pages 717–725, 2011.
[5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, 2002.
[6] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua. Domain adaptation from multiple sources via auxiliary classifiers. Proc. of ICML'09, pages 1–8, 2009.
[7] J. Gao, W. Fan, J. Jiang, and J. Han. Knowledge transfer via multiple model local structure mapping. Proc. of KDD'08, pages 283–291, 2008.
[8] J. Gao, F. Liang, W. Fan, Y. Sun, and J. Han. Graph-based consensus maximization among multiple supervised and unsupervised models. Proc. of NIPS'09, 2009.
[9] L. Ge, J. Gao, H. Ngo, K. Li, and A. Zhang. On handling negative transfer and imbalanced distributions in multiple source transfer learning. Proc. of SDM'13, 2013.
[10] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.
[11] S. B. Kotsiantis and P. E. Pintelas.
Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics Computing and Teleinformatics, 1(1):46–55, 2003.
[12] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, pages 788–91, 1999.
[13] K. Li, N. Du, and A. Zhang. Detecting ECG abnormalities via transductive transfer learning. In Proc. of ACM BCB'12, pages 210–217, 2012.
[14] P. Li, K. Chan, S. Fu, and S. Krishnan. An abnormal ECG beat detection approach for long-term monitoring of heart patients based on hybrid kernel machine ensemble. Biomedical Engineering, pages 346–355, 2005.
[15] P. Luo, F. Zhuang, H. Xiong, Y. Xiong, and Q. He. Transfer learning from multiple source domains via consensus regularization. Proc. of CIKM'08, page 103, 2008.
[16] G. B. Moody and R. G. Mark. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine, pages 45–50, 2001.
[17] B. Schölkopf, J. C. Platt, J. S. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[18] C. G. M. Snoek, M. Worring, J. C. Van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. Proc. of MULTIMEDIA'06, pages 421–430, 2006.
[19] L. I. Xin-fu, Y. U. Yan, and Y. I. N. Peng. A New Method of Text Categorization on Imbalanced Datasets, pages 10–13. 2008.
[20] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Proc. of NIPS'03, page 595602, 2003.