Detecting ECG Abnormalities via Transductive Transfer Learning

Kang Li, Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, NY 14260, U.S.A., kli22@buffalo.edu
Nan Du, Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, NY 14260, U.S.A., nandu@buffalo.edu
Aidong Zhang, Department of Computer Science and Engineering, SUNY at Buffalo, Buffalo, NY 14260, U.S.A., azhang@buffalo.edu

ABSTRACT
Detecting Electrocardiogram (ECG) abnormalities is the process of identifying irregular cardiac activities that may lead to severe heart damage or even sudden death. Due to the rapid development of cyber-physical systems and health informatics, embedding the function of ECG abnormality detection into various devices for real-time monitoring has attracted increasing interest in the past few years. The existing machine learning and pattern recognition techniques developed for this purpose usually require sufficient labeled training data for each user. However, obtaining such supervised information is difficult, which makes the proposed ECG monitoring function unrealistic. To tackle this problem, we take advantage of existing well-labeled ECG signals and propose a transductive transfer learning framework for the detection of abnormalities in ECG. In our model, unsupervised signals from target users are classified with knowledge transferred from the supervised source signals. In the experimental evaluation, we implemented our method on the MIT-BIH Arrhythmias Dataset and compared it with both anomaly detection and transductive learning baseline approaches. Extensive experiments show that our proposed algorithm remarkably outperforms all the compared methods, demonstrating its effectiveness in detecting ECG abnormalities.

Categories and Subject Descriptors
I.5.3 [Pattern Recognition]: Clustering—algorithms; J.3 [Computer Applications]: Life and Medical Science—health

General Terms
Algorithms, Performance

Keywords
ECG, anomaly detection, transductive learning

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ACM-BCB'12, October 7-10, 2012, Orlando, FL, USA.
Copyright 2012 ACM 978-1-4503-1670-5/12/10 ...$15.00.

[Figure 1: A Cyber-Physical System.]

1. INTRODUCTION
ECG is an important and commonly used interpretation of cardiac electrical activities over time and can be used to assess different heart-related biological processes and diseases, including cardiac arrhythmias [21], mental stress [24], morbid obesity [5], etc. The collected ECG signals can be naturally divided into sequenced instances, each of which is an observed cardiac cycle (heart beat). In ECG signal monitoring, existing studies usually classify these instances into two classes, normal and abnormal, which represent the normal state of the heart and abnormal biological events that may lead to heart damage or even sudden death, respectively. An automatic ECG abnormality detection approach is thus very meaningful for prompt responses to heart attacks and a better understanding of the underlying biological mechanisms.
Due to the rapid development of health informatics and cyber-physical systems, embedding the function of ECG abnormality detection into wearable devices for health monitoring and timely alarms has attracted increasing interest during the past few years. Such a system usually consists of electrodes attached to the human body, portable devices such as smart phones for signal collection, and a remote database for data management and analysis, as presented in Figure 1. One of the major factors limiting the popular usage of these cyber-physical systems is that the existing machine learning and pattern recognition techniques developed for ECG abnormality detection usually require sufficient labeled training data for each user to make the model user-specific and accurate. However, for ECG signals, such supervised information is very difficult to obtain without professional knowledge and considerable effort. Moreover, since abnormal instances are usually far fewer than normal ones, supervised methods require even more data in these cases to capture enough variability in the data distributions, which makes the labels expensive to obtain.

To solve this problem, we take advantage of existing well-labeled ECG signals and use them as sources in a novel transductive transfer learning framework for ECG abnormality detection. We propose to learn the ECG signals of target patients in an unsupervised manner, with knowledge transferred from these supervised sources. In detail, knowledge transferring is achieved by the Kernel Mean Matching (KMM) [8] method, in which source instances are weighted to match the distribution of target instances in a reproducing kernel Hilbert space (RKHS). A novel weighted transductive one-class support vector machine (WTOSVM) as well as a labeling process are then proposed to discriminatively classify instances in the target.

The rest of this paper is organized as follows: we briefly discuss the related work in Section 2; after introducing the notation used in this paper, we present the theoretical development of our method in Section 3; the experiments and corresponding analysis are shown in Section 4; and finally we draw conclusions and discuss perspectives in Section 5.

2. RELATED WORK
As a classical problem in machine learning and pattern recognition, ECG abnormality detection can be categorized into the areas of anomaly detection and imbalanced learning, due to the imbalance between the numbers of instances of the normal and abnormal classes in ECG data. Similar datasets exist in a lot of research and applications, such as text mining [28], video indexing [27] and network security [7]. In such datasets, one or more classes have significantly fewer instances than the others, which conflicts with the assumption made by regular machine learning methods that each class should be equal or nearly equal in size. The failure of traditional methods to cover the distributions of imbalanced datasets has spurred general interest in developing new algorithms for them. The developed methods are further applied to many relevant areas such as noise filtering [1], intrusion detection [9] and innovation detection [19]. To solve the problem, several solutions have been proposed at both the data and the algorithmic levels. The former includes many different re-sampling methods that rebalance the class distributions, such as random under-sampling of majority instances [15] and the synthetic minority over-sampling technique (SMOTE) [4].
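As a concrete illustration of these data-level strategies (they are related work and not part of the method proposed in this paper), both random under-sampling [15] and SMOTE [4] are available in the third-party imbalanced-learn Python package; the sketch below uses synthetic placeholder data and is only meant to show the idea.

# Illustrative sketch: data-level rebalancing with random under-sampling [15]
# and SMOTE [4] via the imbalanced-learn package (not used in this paper).
import numpy as np
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

rng = np.random.RandomState(0)
X = rng.randn(1000, 4)                      # placeholder features (e.g., beat descriptors)
y = np.where(rng.rand(1000) < 0.05, 1, -1)  # roughly 5% abnormal (+1), rest normal (-1)

X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X, y)
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
print((y_under == 1).sum(), (y_under == -1).sum())  # classes balanced by discarding normals
print((y_smote == 1).sum(), (y_smote == -1).sum())  # classes balanced by synthesizing abnormals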
The algorithmic approaches usually give higher weights to minority instances in cost-sensitive and discriminative methods, aiming to rebalance the biased decision boundaries [6, 3]. These popular approaches always require supervised information to guide the rebalancing process, which is expensive to obtain, as explained in the Introduction.

Several methods have been proposed for unsupervised learning on imbalanced datasets. The one-class support vector machine (OCSVM) [26] adapts the support vector machine (SVM) algorithm to bound the data within a circle while minimizing its radius. Kernel nearest neighbors (KNN) [23] assumes that majority instances are dense and minority instances are sparse, and then classifies them according to their kernel distances. [10] makes another assumption, that each minority instance should either not belong to any cluster or be far from its cluster center, and presents a clustering-based method for detecting local outliers. Other unsupervised methods can generally be divided into these three categories. The main challenge in unsupervised learning on imbalanced ECG datasets is how to uncover the real distribution of each class without any labeled information. Inspired by the success of transfer learning in many areas, we propose a transductive transfer learning algorithm for ECG abnormality detection. Instead of assuming the distributions of normal and abnormal instances, we classify each target with the assistance of a well-labeled source, which fits our common experience in real life, such as diagnosing a disease in a target patient with cases of similar patients having the same disease.

The problem formulated in this paper is related to transfer learning, which is the process of transferring information from a source domain to help learning in a target domain. It is commonly used when sources and targets are drawn from the same feature space. Specifically, transductive transfer learning is the process of knowledge transferring when only labeled sources and unlabeled targets are available. It has been successfully implemented in various areas [22, 2]. The major difficulty in knowledge transferring is that source and target data are not likely to be drawn from the same distribution (a situation known as covariate shift), so directly estimating the target data through the source is not applicable in most cases. To address the problem, we follow the framework proposed in [8], which matches the kernel means of the source and the target in an RKHS for knowledge transferring.

The main contributions of this paper are:
• We propose a novel anomaly detection algorithm to solve the problem of detecting irregular events in collected ECG signals.
• We investigate the problem from the transductive transfer learning point of view. Since our model is general, it can be further extended to transductive learning on imbalanced datasets in other areas.
• In the experimental study of our method on the MIT-BIH Arrhythmias Dataset [18], our model outperforms all the other baseline methods for ECG abnormality detection and achieves 25% to 60% performance improvement over the baselines, which demonstrates the effectiveness and superiority of our proposed algorithm.

3. METHOD
Notice that the preprocessing method of dividing ECG signals into instances (heart beats) is not the focus of this paper. Instead, we aim at unsupervised recognition of the preprocessed instances. Before diving into the details of our framework, let us first define the notation used in the following sections.
Suppose that for a target user/patient we obtain an ECG signal containing t instances T = {x_i | i ∈ [1, t]} to be classified into normal and abnormal classes. A selected labeled source signal S = {x̂_i | i ∈ [1, s]} will be used to help the learning process on T. For simplicity, we assume that x_i and x̂_i are drawn from the same feature space and that the label vector for the source is Ŷ = {ŷ_i | ŷ_i ∈ {−1, +1}, i ∈ [1, s]}, where −1 represents majority (normal) instances and +1 represents minority (abnormal) instances, respectively. The problem addressed in this paper can be expressed as: given a target imbalanced dataset T, a source dataset S and its labels Ŷ, learn a mapping f : T → Y, in which Y = {y_i | y_i ∈ {−1, +1}, i ∈ [1, t]}.

[Figure 2: Distributions of two imbalanced datasets (normal and abnormal instances of datasets 1 and 2, plotted as Dimension #1 vs. Dimension #2).]

3.1 Knowledge Transferring
Our aim is to classify instances in an unsupervised target with the assistance of labeled instances in a source. However, directly classifying data in the target using models trained on the source is not applicable, because the source and the target are not independently and identically drawn (i.i.d.) from the same distribution. As shown in Figure 2, in a two-dimensional space, two imbalanced datasets 1 and 2 are presented in different shapes, and their abnormal instances are marked in red. If we directly classify the data in dataset 2 with classifiers learned on dataset 1, most normal instances will be labeled as abnormal due to the different distributions of these two datasets.

To enable knowledge transferring between datasets with different distributions, we use the KMM method [8]. Suppose we obtain the target samples Z = {(x_1, y_1), ..., (x_t, y_t)} ⊆ X × Y from a Borel probability distribution Pr(x, y), and the source samples Ẑ = {(x̂_1, ŷ_1), ..., (x̂_s, ŷ_s)} ⊆ X̂ × Ŷ from another such distribution P̂r(x, y). A learning method on the source generally minimizes the expected risk:

\theta^{*} = \arg\min_{\theta} R[\hat{Pr}, \theta, l(x, y, \theta)] = \arg\min_{\theta} E_{(x,y)\sim \hat{Pr}}[l(x, y, \theta)], \quad (1)

where l(x, y, θ) is a loss function that depends on a parameter vector θ, such as the negative log-likelihood −log Pr(y|x, θ) or the misclassification loss; E is the expectation; and R is the overall loss that we wish to minimize. Since Pr(x, y) and P̂r(x, y) are different, what we would really like to minimize is R[Pr, θ, l]. An observation from the field of importance sampling [11] is that

\theta^{*} = \arg\min_{\theta} R[Pr, \theta, l(x, y, \theta)] = \arg\min_{\theta} E_{(x,y)\sim Pr}[l(x, y, \theta)]
           = \arg\min_{\theta} E_{(x,y)\sim \hat{Pr}}\Big[\frac{Pr(x, y)}{\hat{Pr}(x, y)}\, l(x, y, \theta)\Big] = \arg\min_{\theta} R[\hat{Pr}, \theta, \beta(x, y)\, l(x, y, \theta)], \quad (2)

where \beta(x, y) = Pr(x, y)/\hat{Pr}(x, y). Through this function, we can compute the risk with respect to Pr using P̂r. The key problem, however, is that the coefficient vector β(x, y) is usually unknown, and we need to estimate it from the data. In our context, since Pr and P̂r differ only in Pr(x) and P̂r(x), we have β(x, y) = Pr(x)/P̂r(x). β can then be viewed as a weighting factor for the source instances: it gives higher weights to observations that are under-represented in P̂r and lower weights to over-represented instances. The estimation of β(x, y) thus does not require any supervised information in either the source or the target.

According to Equation 2, knowledge transferring from a source to a target is accomplished by estimating the target data distribution Pr(x) by β(x)P̂r(x). The idea of KMM in learning the β vector is to minimize the discrepancy between the centers of the instances in the target and the weighted instances in the source in an RKHS. Moreover, a constraint |\frac{1}{s}\sum_{i=1}^{s}\beta_i - 1| \le \epsilon is enforced to ensure that β(x)P̂r(x) is close to a probability distribution. The objective function for KMM is then expressed as:

\beta = \arg\min_{\beta} \Big\| E_{x\sim Pr}[\varphi(x)] - E_{x\sim \hat{Pr}}[\beta\,\varphi(x)] \Big\|^{2}
      = \arg\min_{\beta} \Big\| \frac{1}{s}\sum_{i=1}^{s}\beta_i \varphi(\hat{x}_i) - \frac{1}{t}\sum_{i=1}^{t}\varphi(x_i) \Big\|^{2}
      = \arg\min_{\beta} \frac{1}{s^{2}}\beta^{T}K\beta - \frac{2}{st}\hat{K}^{T}\beta + \mathrm{const}, \quad (3)

where φ is a mapping function which maps the data of the source and the target into an RKHS; K_{ij} = k(x̂_i, x̂_j); \hat{K}_i = \sum_{j=1}^{t}k(\hat{x}_i, x_j); and const represents an unknown constant which can be ignored in the optimization process. The function k can be an arbitrary kernel function specified according to the task at hand. In our experiments, we set k to be the commonly used Gaussian kernel,

k(x_i, x_j) = \frac{1}{2\pi\sigma^{2}} \exp\Big[-\frac{(x_i - x_j)^{T}(x_i - x_j)}{2\sigma^{2}}\Big]. \quad (4)

3.2 Weighted Transductive One Class SVM
OCSVM [26], which bounds the data in a circle while minimizing the radius of the boundary, has been successfully used on many imbalanced datasets. However, it cannot be used directly in our transductive transfer learning framework: the major difficulty is that each instance is weighted by β for knowledge transferring. To solve this, we propose a variant objective function, called the Weighted Transductive OCSVM (WTOSVM), which can be expressed as:

\min_{R, O, \xi} \; R^{2} + \frac{1}{s\gamma}\sum_{i=1}^{s}\xi_i \quad \text{s.t.} \quad \|\beta_i\varphi_i - O\|^{2} \le R^{2} + \xi_i, \;\; \xi_i \ge 0, \quad (5)

where R is the radius of the decision boundary; O is the center of the bounding circle; ξ_i is a slack variable, which measures the degree of misclassification of x̂_i; and γ controls the trade-off between a small radius of the boundary and a small error penalty. Using multipliers α_i, ζ_i ≥ 0, we introduce the Lagrangian:

L(R, O, \xi, \alpha, \zeta) = R^{2} + \frac{1}{s\gamma}\sum_{i=1}^{s}\xi_i + \sum_{i=1}^{s}\alpha_i\big[\|\beta_i\varphi_i - O\|^{2} - R^{2} - \xi_i\big] - \sum_{i=1}^{s}\zeta_i\xi_i. \quad (6)

Setting the derivatives with respect to the primal variables R, ξ and O to zero yields:

\frac{\partial L}{\partial R} = 2R\Big(1 - \sum_{i=1}^{s}\alpha_i\Big) = 0 \;\Rightarrow\; \sum_{i=1}^{s}\alpha_i = 1,
\qquad \frac{\partial L}{\partial \xi_i} = \frac{1}{s\gamma} - \alpha_i - \zeta_i = 0 \;\Rightarrow\; 0 \le \alpha_i \le \frac{1}{s\gamma},
\qquad \frac{\partial L}{\partial O} = -2\sum_{i=1}^{s}\alpha_i(\beta_i\varphi_i - O) = 0 \;\Rightarrow\; O = \sum_{i=1}^{s}\alpha_i\beta_i\varphi_i. \quad (7)

Substituting Equation 7 into Equation 5 and using the kernel function

K(x, y) = \varphi(x)\cdot\varphi(y), \quad (8)

we obtain the dual problem:

\max_{\alpha} \; \sum_{i=1}^{s}\alpha_i\beta_i^{2}K(\hat{x}_i, \hat{x}_i) - \sum_{i=1}^{s}\sum_{j=1}^{s}\alpha_i\alpha_j\beta_i\beta_jK(\hat{x}_i, \hat{x}_j) \quad \text{s.t.} \quad 0 \le \alpha_i \le \frac{1}{s\gamma}, \;\; \sum_{i=1}^{s}\alpha_i = 1, \;\; 0 \le \gamma \le 1. \quad (9)

The optimal values of α_i and the corresponding center of the decision boundary are determined by solving the quadratic programming (QP) problem in (9). There are many off-the-shelf QP solvers that are convenient for such tasks; in our case, we take advantage of the Optimization Toolbox in MATLAB to solve the objective function of WTOSVM. The distances between the weighted instances in the source and the center of the round boundary are:

DS(\hat{x}_i) = \|\beta_i\varphi(\hat{x}_i) - O\|^{2}. \quad (10)

3.3 Labeling Process
After transferring knowledge from a source to a target via the KMM method, the weighted supervised instances in the source can participate in the process of learning on the unsupervised target. The crucial problems are how to use the supervised information provided by the weighted source instances and how to discriminate abnormal instances in the target. Through the WTOSVM, the weighted instances in the source are re-distributed inside a circle and a round decision boundary is learned.
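For illustration, the two optimization steps above (the KMM weights of Equation 3 and the WTOSVM dual of Equation 9) can be prototyped in a few lines of Python. This sketch is not the authors' MATLAB implementation: it substitutes scipy's general-purpose SLSQP solver for a dedicated QP solver, the slack ε of the KMM constraint is set to an assumed placeholder value, and all function names are ours; it is only practical for small numbers of instances.

# Minimal sketch (assumptions noted above): KMM weights (Eq. 3) and the
# WTOSVM dual (Eq. 9) solved with scipy's SLSQP instead of a dedicated QP solver.
import numpy as np
from scipy.optimize import minimize

def gaussian_kernel(A, B, sigma):
    # Gaussian kernel of Equation 4 between the rows of A and the rows of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def kmm_weights(X_src, X_tgt, sigma=100.0, eps=0.1):
    # Equation 3: weight source instances so their kernel mean matches the target's.
    s, t = len(X_src), len(X_tgt)
    K = gaussian_kernel(X_src, X_src, sigma)              # K_ij = k(x_hat_i, x_hat_j)
    kappa = gaussian_kernel(X_src, X_tgt, sigma).sum(1)   # K_hat_i = sum_j k(x_hat_i, x_j)
    obj = lambda b: b @ K @ b / s ** 2 - 2.0 * (kappa @ b) / (s * t)
    cons = [{'type': 'ineq', 'fun': lambda b: eps - (b.mean() - 1.0)},   # |mean(b) - 1| <= eps,
            {'type': 'ineq', 'fun': lambda b: eps + (b.mean() - 1.0)}]   # written as two inequalities
    res = minimize(obj, np.ones(s), bounds=[(0.0, None)] * s,
                   constraints=cons, method='SLSQP')
    return res.x

def wtosvm(X_src, beta, sigma=100.0, gamma=0.5):
    # Equation 9 (maximized there, so its negative is minimized here), plus Eq. 10 distances.
    s = len(X_src)
    K = gaussian_kernel(X_src, X_src, sigma)
    Kb = np.outer(beta, beta) * K                          # beta_i * beta_j * K_ij
    obj = lambda a: a @ Kb @ a - a @ (beta ** 2 * np.diag(K))
    cons = [{'type': 'eq', 'fun': lambda a: a.sum() - 1.0}]
    res = minimize(obj, np.full(s, 1.0 / s), bounds=[(0.0, 1.0 / (s * gamma))] * s,
                   constraints=cons, method='SLSQP')
    alpha = res.x
    ab = alpha * beta                                      # implicit center O = sum_i ab_i * phi(x_hat_i)
    DS = beta ** 2 * np.diag(K) - 2.0 * beta * (K @ ab) + ab @ K @ ab
    return alpha, DS

The defaults σ = 100 and γ = 0.5 mirror the settings reported in Section 4.6; in practice, the subset Nt of the target introduced later would be passed as X_tgt.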
However, labeling instances in the target using the trained decision boundary of the source is impractical for two main reasons. First, distances calculated through parameters trained for the source will be comparatively larger for data in the target because of the discrepancy between the distributions Pr and βP̂r, even if they are perfectly matched in their kernel means. Second, labeling data solely by WTOSVM actually adopts the assumption made in OCSVM that minorities are sparse and majorities are dense, which may work well if the assumption is met but causes high bias otherwise.

Notice that in WTOSVM the distance from each weighted instance to the center is calculated, and instances with larger distances are supposed to be abnormal with higher probability, and vice versa. Since supervised information is available for the source, we can estimate the probability distribution of normal instances over different distances as:

p_n(d) = \frac{\sum_{i=1}^{s}[\mathrm{sgn}(-1\cdot\hat{y}_i)\cdot\mathrm{sgn}(DS(\hat{x}_i) - d)]}{\sum_{i=1}^{s}\mathrm{sgn}(-1\cdot\hat{y}_i)}, \quad (11)

where p_n(d) measures the probability that an instance is normal given that its distance to the center is d; sgn(x) = 1 if x > 0 and sgn(x) = 0 otherwise. In the same way, we estimate the probability distribution of abnormal instances over different distances as:

p_a(d) = \frac{\sum_{i=1}^{s}[\mathrm{sgn}(\hat{y}_i)\cdot\mathrm{sgn}(d - DS(\hat{x}_i))]}{\sum_{i=1}^{s}\mathrm{sgn}(\hat{y}_i)}, \quad (12)

where p_a(d) measures the probability that an instance is abnormal given that its distance to the center is d. Using OCSVM, we can re-distribute the instances in the target in the same way as the source trained by WTOSVM, and the distances between the instances in the target dataset and the center of the round boundary are:

DT(x_i) = \|\varphi_i - O_T\|^{2}, \quad (13)

in which O_T is the center of the target instances determined by OCSVM. Our idea for discriminating them is to use this distance to calculate the probabilities of the normal and abnormal states, according to the statistically estimated probability distributions p_n and p_a in the source. The decision-making process is:

y_i = \begin{cases} -1 & \text{if } p_n(DT(x_i)) > p_a(DT(x_i)), \\ +1 & \text{otherwise.} \end{cases} \quad (14)

Our algorithm for transductive transfer learning on imbalanced datasets is summarized in Algorithm 1. Notice that this framework is for transductive transfer learning using a single source.

Algorithm 1: ECG Abnormality Detection via Transductive Transfer Learning
Input: A supervised source S, an unsupervised target T, a subset Nt ⊆ T, the kernel parameter σ and the trade-off parameter γ.
Output: The label vector Y for the instances in T.
1. Use the subset Nt to weight the instances in S through KMM and obtain the weighting vector β.
2. Implement WTOSVM on βS to calculate DS, p_n and p_a.
3. Use OCSVM on T to obtain DT.
4. for i = 1 to t do:
5.   if p_n(DT(x_i)) > p_a(DT(x_i)):
6.     y_i = −1;
7.   else:
8.     y_i = +1;
9.   end if
10. end for
11. return Y = {y_i}, i = 1, ..., t.

3.4 Source Selection
Given the above framework for transductive learning on a target ECG signal with a single source, a noticeable unsolved problem is how to take advantage of different sources for each target. Considering each source as a decision maker, we could ideally combine the opinions from all the sources on the data in the target and make a better decision. However, this is not the case for our problem, because no supervised information is provided for the target. Existing unsupervised multiple-source ensemble approaches can be divided into two categories: unweighted and weighted combination.
The former assigns equal weights to the opinions from different decision makers, ignoring the fact that they should have different suitability and reliability for the same target, which causes a lot of bias and drives down the overall performance; the latter always assumes that the opinions from the majority of sources should be close to each other, which does not hold in our cases due to the difficulty in handling data imbalance. These ensemble methods would therefore introduce much randomness into the decision-making process. To avoid this and to keep the focus of this paper, we concentrate on selecting the most reasonable source for each target, and we present a straightforward approach for it based on cross-validation. In detail, to choose one optimal source from n well-labeled candidate datasets S_1, S_2, S_3, ..., S_n, we iteratively set each of them to be the source and the other candidates to be the targets. The source which has the highest accumulated performance is set to be the optimal source for the aimed target. The underlying assumption of this strategy is that a source that consistently performs well on other targets will be a reasonable source for the aimed target. In the experiments, we verify this assumption and show that this simple source selection strategy obtains results that fit the highest performance of each target well and outperforms the comparative methods in most cases. Moreover, this source selection framework can be executed beforehand and easily implemented in a parallel way, which greatly decreases the computing time.

4. EXPERIMENTS
In this section, we demonstrate the effectiveness of the transductive learning based ECG abnormality detection framework. The algorithm is evaluated on the MIT-BIH Arrhythmias Dataset [18] and compared to a set of baseline approaches. The results show that the proposed framework successfully transfers knowledge from sources to targets to better identify irregular events in ECG signals, with significant improvements. Besides this, the optimal source selection method is also experimentally tested. Its results fit the highest performance of each target well and outperform all those of the comparative methods, which verifies the practicality and superiority of our method. Finally, experiments on parameter sensitivity are performed to demonstrate the robustness of our model against various settings.

4.1 Dataset Description and Preprocessing
The MIT-BIH Arrhythmias Dataset [18] contains 48 half-hour excerpts of ECG signals, obtained from 47 subjects. Each heart beat is labeled as normal or abnormal by clinical experts. From it, we randomly picked 10 records, including the following files: 100, 101, 103, 105, 109, 115, 121, 210, 215 and 232. The preprocessing and feature extraction of each signal are implemented in a similar way to [17]. The percentage of abnormal instances in each preprocessed dataset varies from 0.7% to 29.38%, with only 9.09% on average, which indicates that these datasets are highly imbalanced.

4.2 Evaluation Metric
The overall accuracy on a test dataset is commonly used to evaluate the performance of a classifier. However, for imbalanced datasets, especially in severely imbalanced cases, the overall accuracy can be extremely high even when all minority instances are misclassified into the majority class, so an alternative measurement is employed here. In the binary classification, we refer to the minority class as Positive and the majority class as Negative.
After obtaining the four values TP (True Positive), TN (True Negative), FP (False Positive) and FN (False Negative), the G-Mean [16] is calculated as:

\mathrm{G\text{-}Mean} = \sqrt{\frac{TP}{TP + FN}\cdot\frac{TN}{TN + FP}}. \quad (15)

The G-Mean is a combined measurement of the accuracies on the two classes. It ranges from 0 to 1, and higher values indicate better performance. The reason why we use this measurement is that it is extremely sensitive to classification errors on minority instances in severely imbalanced datasets. In some ECG signals there are fewer than 10 abnormal heart beats and several thousand normal ones, which means that a model misclassifying only 1% of the data may have a G-Mean as low as 0. For this evaluation metric, perfect recognition of most instances does not guarantee a high outcome.

4.3 Baseline Approaches
We compare the transductive transfer learning based ECG abnormality detection framework with a set of baseline approaches from both the anomaly detection and the transductive learning areas, including:

OCSVM [26]: Our model is built upon OCSVM, so we include it as a comparative approach.

Local Distance Based Outlier Factor (LDOF) [29]: LDOF is a local distance based, unsupervised anomaly detection method. It defines a distance factor measuring how much an object deviates from its neighborhood. Objects with a higher deviation degree measured by this factor are supposed to be more abnormal than others. In the implementation, the major difference of this approach from the others is that it requires an estimate of the total number of abnormal instances (ECG heart beats). We set this number to 9.09% of each target dataset, according to the actual average percentage of minority instances in the datasets.

Clustering Based Outlier Detection (CBOD) [30]: This is a clustering based, unsupervised anomaly detection method. It assumes that abnormal instances are either in small clusters or far from the centers of large clusters. Partitioning Around Medoids (PAM) [14] and a defined distance-based factor are used for clustering and for detecting outliers in large clusters, respectively.

Transductive Support Vector Machine (TSVM) [13]: TSVM is one of the most popular knowledge transferring methods in existing research. It builds the discriminative structure based on the margin of separating hyperplanes on both the source and the target data. In the experiments, we implemented it through SVMlight [12].

Modified Bisecting DBSCAN (BRSD-DB) [25]: This is another transductive learning approach. It estimates the data structure of both the source and the target through clustering, and aims to rebalance the distribution of the source by uniformly drawing samples from each cluster. The rebalanced source data can then be viewed as a supervised training dataset for the target. In the experiments, we use an SVM for the final learning and classification.

4.4 Performance Study
The results are summarized in Table 1, where we denote the name of each dataset by Data Index.
For each target, when using the transductive learning approaches, including TSVM, BRSD-DB and WTOSVM, the other datasets are iteratively set to be the source and the highest G-Mean of these results is presented in the table. The parameter settings are explained in the Parameter Sensitivity section. In the implementation, we ran our algorithm in MATLAB on a PC with a 3.17 GHz Intel Core 2 Duo processor and 4 GB of RAM. The average time for classifying an instance by our algorithm is 0.0127 second, which takes only 1.59% of the duration of a usual heart beat according to [20].

Table 1: Comparison of our method with the baselines on the MIT-BIH Arrhythmias Dataset (G-Mean).
Data Index   OCSVM    LDOF     CBOD     TSVM     BRSD-DB   WTOSVM
100          0.8713   0.5190   0.9798   0.1715   0.5941    0.9876
101          0.8609   0.8662   0.8770   0.5222   0.6735    0.9444
103          0.8230   0.6753   0.7701   0.3927   0.3534    0.8910
105          0.3343   0.4124   0.4722   0.6079   0.5639    0.7291
109          0.9431   0.5331   0.9323   0.7068   0.7269    0.9056
115          0.7191   0.7219   0.5248   0.4578   0.5928    0.7590
121          0.6734   0.6255   0.6428   0.2671   0.2646    0.7820
210          0.6293   0.4631   0.6857   0.4717   0.6867    0.9063
215          0.6097   0.3267   0.5989   0.8638   0.4541    0.9082
232          0.4665   0.2921   0.4259   0.9676   0.9595    0.9754

Since our framework is based on OCSVM, comparison with it best verifies the effectiveness of our model. It can be clearly seen that for 9 out of the 10 tasks, the proposed ECG abnormality detection framework significantly outperforms the original OCSVM method, achieving an average improvement of 26.79%. The only exception is Data 109, where our result is lower by 0.04. These facts indicate that in most cases the proposed framework transfers knowledge from sources into targets successfully, and this transferred knowledge does help in the classification of the data in the targets. The other two anomaly detection methods also show much lower performance than ours in most cases. Each of them can achieve relatively high G-Means in several cases, and CBOD even achieves a slightly higher score than our method for Data 109. However, these two methods show obviously unstable performance over the different datasets. Such instability also exists in TSVM and BRSD-DB. There are two explanations for it. One is the sensitivity of the evaluation metric G-Mean, which stresses misclassification of minority instances and exaggerates overall performance variations. The other is that all the comparative methods are built on their own assumptions: LDOF assumes minorities lie in sparse areas; CBOD assumes that abnormal instances are either in small clusters or far from cluster centers; TSVM assumes there are shared hyperplanes for the source and the target; and BRSD-DB assumes that clustering can mine the shared data distributions of sources and targets. Different from them, our method only assumes that for a given target there are sources with close distributions to it, and it can thus produce more stable performance given enough candidate sources. According to this assumption, the relatively lower performance of our method on dataset 109 can be explained by the absence of an investigated source with a sufficiently close data distribution. In spite of this, the results of our method show significantly higher stability than those of the comparative methods, demonstrating the robustness of the proposed algorithm.
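As a side note on the metric, the G-Means in Table 1 follow Equation 15. The short sketch below (illustrative only, not taken from the authors' MATLAB code; names are ours) computes it and shows how a single missed abnormal beat can pull the score down even when the overall accuracy is high.

# Illustrative G-Mean (Equation 15) with the abnormal class (+1) treated as Positive.
import numpy as np

def g_mean(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    sens = tp / (tp + fn) if (tp + fn) else 0.0   # accuracy on abnormal beats
    spec = tn / (tn + fp) if (tn + fp) else 0.0   # accuracy on normal beats
    return float(np.sqrt(sens * spec))

# 98 normal beats and 2 abnormal beats; missing one abnormal beat keeps the overall
# accuracy at 99% but drops the G-Mean to about 0.707.
print(g_mean([-1] * 98 + [1, 1], [-1] * 98 + [1, -1]))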
On average, our method improves the performance of the listed five baseline methods by 26.79%, 61.68%, 27.18%, 61.86% and 49.72%, respectively.

4.5 Source Selection Study
The experimental results of the cross-validation based source selection method are shown in Figure 3, in which we denote the selected sources as Optimal Sources and the sources with the best performance as Best Sources. In the results, the Optimal Sources for 100, 103, 121 and 215 are also the Best Sources for them, and thus produce the highest possible G-Means in our framework. In all the cases, the deviations of the G-Means using Best and Optimal Sources are less than 0.03, and the root-mean-square deviation (RMSD) is only 0.0146, showing that the Optimal Sources fit the Best Sources well. When compared with the listed baseline approaches, our method using the Optimal Sources still outperforms them on 8 out of the 10 ECG signals. For 109, the G-Mean is lowered to 0.8829 and falls below OCSVM and CBOD; and for 232, the performance is reduced to 0.9531 and is thus slightly lower than TSVM and BRSD-DB. However, notice that the presented G-Means of TSVM and BRSD-DB are for the best sources of the targets. On average, our method using the selected optimal sources improves the performance of the baseline methods by 25.44%, 59.95%, 25.82%, 60.13% and 48.11%, respectively.

[Figure 3: Cross-Validation Based Source Selection (G-Mean per Data Index, using Best Sources vs. Optimal Sources).]

4.6 Parameter Sensitivity
There are three parameters in the proposed algorithm: the subset Nt of the target data which is involved in the knowledge transferring process, the kernel parameter σ, and the trade-off parameter γ used in the objective. The traditional way of setting parameters by experimenting on training data of each target is not applicable here, because no supervised information exists in the target ECG signals during the learning process. So here we show some experimental results on parameter sensitivity and give some basic principles for setting the parameters. Without loss of generality, we choose data 100 as the source and data 101 as the target for the following experiments.

The results on the sensitivity of Nt are shown in Figure 4. We fixed σ = 100 and γ = 0.5, and then varied the size of Nt from 100 to 1011, where 1011 is the size of the target data 101. In each setting, if the size is K, the first K instances in the data are set as the subset Nt. The performance is relatively stable, with variance 1.39 × 10^{-4}. This suggests that the subset does not have a significant impact on the performance in these settings. However, to avoid possible randomness caused by insufficient learning instances, as well as overfitting caused in the opposite way, when using other sources and targets we set Nt to be the top 500 instances of each target for the experiments shown in Table 1.

[Figure 4: Parameter Sensitivity of Nt (source: dataset 100, target: dataset 101).]

To test the sensitivity of σ, we fixed Nt to be the top 100 instances and γ = 0.5, and then varied σ from 40 to 140, as shown in Figure 5. When σ rises from 40 to 80, the G-Mean first reaches its lowest point at σ = 50 and then gradually improves. After the point of 80, the performance stays almost stable until σ ≥ 120. Therefore a reasonable setting for σ should be chosen from [80, 120]. For Table 1, we set it to 100.

[Figure 5: Parameter Sensitivity of σ (source: dataset 100, target: dataset 101).]

In a similar way, we tested the sensitivity of the parameter γ, as shown in Figure 6. When varying its value from 0.1 to 0.9, the performance of our framework first increases to a turning point at γ = 0.3 and then stays almost the same until γ = 0.7. A hardly observable decrease happens at the end, from 0.7 to 0.9. We hereby consider 0.3 to 0.7 a reasonable range for γ, and set it to 0.5 in the experiments in Table 1.

[Figure 6: Parameter Sensitivity of γ (source: dataset 100, target: dataset 101).]

For the baseline approaches, we set parameters in a similar way to the three experiments above. We can also notice that, in spite of the changes caused by parameter variations, our proposed WTOSVM significantly outperforms the best baseline method (LDOF when the target is 101) in most of the cases. This demonstrates the robustness of our method against various parameter settings.

5. CONCLUSIONS AND PERSPECTIVES
Unsupervised ECG abnormality detection is an important problem in health informatics and cyber-physical systems that has not been well investigated. In this work, we take advantage of well-labeled ECG signals and propose a transductive transfer learning framework that utilizes these labeled source data in the classification of target unsupervised ECG signals. The experimental results on the MIT-BIH Arrhythmias Dataset show that the proposed method outperforms both anomaly detection and transfer learning baseline models in most of the cases and improves significantly over each of them, by 25% to 60% on G-Mean. A cross-validation based source selection framework is also presented. It determines the optimal source for unsupervised learning on each target, and the related experimental results fit the highest G-Mean for each target well and significantly outperform the baselines in most cases, demonstrating the practicality and superiority of our framework. Moreover, we have also presented the sensitivity of each parameter used in the model and given principles for setting them. In the future, we plan to investigate ECG abnormality detection based on transductive transfer learning using multiple sources.

6. REFERENCES
[1] B. W. Andrews, T.-M. Yi, and P. A. Iglesias. Optimal noise filtering in the chemotactic response of Escherichia coli. PLoS Computational Biology, 2(11):12, 2006.
[2] M. T. Bahadori, Y. Liu, and D. Zhang. Learning with minimum supervision: A general framework for transductive transfer learning. IEEE, 2011.
[3] X. Chang, Q. Zheng, and P. Lin. Cost-sensitive supported vector learning to rank. Learning, pages 305–314, 2009.
[4] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16(1):321–357, 2002.
[5] J. Domienik-Karlowicz, B. Lichodziejewska, W. Lisik, M. Ciurzynski, P. Bienias, A. Chmura, and P. Pruszczyk. Electrocardiographic criteria of left ventricular hypertrophy in patients with morbid obesity. EARSeL eProceedings, 10(3):1–8, 2011.
[6] P. Domingos. MetaCost: A general method for making classifiers cost-sensitive. Pages 155–164. ACM, 1999.
[7] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo.
A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. Volume 6, page 20. Kluwer, 2002.
[8] A. Gretton, A. J. Smola, J. Huang, M. Schmittfull, K. M. Borgwardt, and B. Schölkopf. Covariate shift by kernel mean matching. Pages 131–160. MIT Press, 2009.
[9] M. Guimaraes and M. Murray. Overview of intrusion detection and intrusion prevention. In Proceedings of the 5th Annual Conference on Information Security Curriculum Development (InfoSecCD '08), page 44, 2008.
[10] Z. He, X. Xu, and S. Deng. Discovering cluster-based local outliers. Pattern Recognition Letters, 24(9-10):1641–1650, 2003.
[11] J. Huang, A. Smola, A. Gretton, K. Borgwardt, and B. Schölkopf. Correcting sample selection bias by unlabeled data. Advances in Neural Information Processing Systems, 19:601–608, 2006.
[12] T. Joachims. Making large-scale support vector machine learning practical. Pages 169–184. MIT Press, Cambridge, MA, 1999.
[13] T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the International Conference on Machine Learning (ICML), pages 200–209, 1999.
[14] L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. Volume 39. John Wiley & Sons, 1990.
[15] S. B. Kotsiantis and P. E. Pintelas. Mixture of expert agents for handling imbalanced data sets. Annals of Mathematics, Computing and Teleinformatics, 1(1):46–55, 2003.
[16] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the International Conference on Machine Learning (ICML), pages 179–186, 1997.
[17] P. Li, K. Chan, S. Fu, and S. Krishnan. An abnormal ECG beat detection approach for long-term monitoring of heart patients based on hybrid kernel machine ensemble. Biomedical Engineering, 3541/2005:346–355, 2005.
[18] G. B. Moody and R. G. Mark. The impact of the MIT-BIH arrhythmia database. IEEE Engineering in Medicine and Biology Magazine, 20(3):45–50, 2001.
[19] M. Nakatsuji, Y. Miyoshi, and Y. Otsuka. Innovation detection based on user-interest ontology of blog community. Innovation, 4273(1):9–11, 2006.
[20] P. Palatini. Resting heart rate. Heart, 33:622–625, 1999.
[21] D. Patra, M. K. Das, and S. Pradhan. Integration of FCM, PCA and neural networks for classification of ECG arrhythmias. International Journal, (February), 2010.
[22] B. Quanz and J. Huan. Large margin transductive transfer learning. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), page 1327, 2009.
[23] S. Ramaswamy, R. Rastogi, and K. Shim. Efficient algorithms for mining outliers from large data sets. ACM SIGMOD Record, 29(2):427–438, 2000.
[24] G. Ranganathan and K. College. ECG signal processing using dyadic wavelet for mental stress assessment. Engineering, 2010.
[25] J. Ren, X. Shi, W. Fan, and P. S. Yu. Type-independent correction of sample selection bias via structural discovery and re-balancing. Pages 565–576. Citeseer, 2008.
[26] B. Schölkopf, J. C. Platt, J. S. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the support of a high-dimensional distribution. Neural Computation, 13(7):1443–1471, 2001.
[27] C. G. M. Snoek, M. Worring, J. C. Van Gemert, J.-M. Geusebroek, and A. W. M. Smeulders. The challenge problem for automated detection of 101 semantic concepts in multimedia. In Proceedings of the 14th Annual ACM International Conference on Multimedia (MULTIMEDIA '06), pages 421–430, 2006.
[28] L. I. Xin-fu, Y. U. Yan, and Y. I. N. Peng. A new method of text categorization on imbalanced datasets. Pages 10–13, 2008.
[29] K. Zhang, M. Hutter, and H. Jin.
A new local distance-based outlier detection approach for scattered real-world data. Advances in Knowledge Discovery and Data Mining, 5476:813–822, 2009.
[30] B. A. Zoubi. An effective clustering-based approach for outlier detection. European Journal of Scientific Research, 28(2):310–316, 2009.