Developing a Predictive Model of Stroke using Support Vector Machine

Jovel T. Rosado
Technological Institute of the Philippines
Manila, Philippines
jovelrosado08@gmail.com

Alexander A. Hernandez
Technological Institute of the Philippines
Manila, Philippines
alexander.hernandez@tip.edu.ph

Conference Paper · October 2019 · DOI: 10.1109/TSSA48701.2019.8985498

Abstract: Health is a fundamental human right of all Filipinos, as stated in the 1987 Philippine Constitution. According to data published by the World Health Organization in 2018, an estimated 41 million deaths were caused by non-communicable diseases, among them stroke and its complications. Thus, given the parameters for the risk factors of stroke, a predictive model is developed for the occurrence of stroke based on the medical records of the patient. To ensure quality data, the medical data of the patients underwent data preprocessing, and principal component analysis was used for dimension reduction. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve (AUC). The study used a dataset of 1,500 patients from Cavite, Philippines: 60 percent for training the model, 30 percent for testing, and 10 percent for validation. The SVM model achieved an accuracy of 99% on the training data, 98.89% on the testing data, and 97.33% on the validation data. The results show the potential of the predictive model for stroke, which remains relevant for researchers and practitioners in the medical and health sciences.

Keywords: support vector machine, principal component analysis, stroke prediction, Philippines

I. INTRODUCTION

Stroke is the top life-threatening disease in the world and the leading cause of cognitive disorder worldwide [1]. To decrease the burden of stroke in the population, it is first necessary to identify the modifiable risk factors and to demonstrate the effectiveness of risk reduction efforts [2]. Accordingly, preventing stroke remains one of the essential targets in the fields of neurology, cardiology, vascular medicine, and geriatric medicine [3]. In 2016, there were an estimated 41 million deaths due to non-communicable diseases. The largest share was due to cardiovascular disease, which accounted for 17.9 million deaths, equivalent to 44% of all non-communicable disease deaths [4]. In the Philippines, according to the Philippine Statistics Authority (PSA), stroke was the top leading cause of death, with 74,134 deaths or 12.7 percent of the total [5].

However, the growing number of stroke incidents can be addressed through innovation and technology. The use of machine learning in knowledge discovery for disease prediction has been one of the interesting and relevant topics addressed by researchers [6]. Accordingly, because of the importance of disease prediction, several studies have been conducted on modeling procedures for prediction [7]. This study incorporates machine learning and proposes a new predictive method using Principal Component Analysis (PCA) and a supervised machine learning algorithm. PCA is used for dimensionality reduction and for dealing with the multicollinearity problem in the experimental data [8]. Support Vector Machine (SVM) is a technique suitable for disease prediction tasks [9]; thus, SVM is chosen to predict stroke. An SVM-based approach with various kernel functions has produced accurate results and shown the predictive power of SVM with a small set of input parameters [10].

The paper intends to develop a predictive model using the medical records of patients. The data undergo dimension reduction through Principal Component Analysis, which reduces the range of continuous data into a range of values or categories, and are then processed using a Support Vector Machine. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve.

II. RELATED WORKS

A. Overview of Stroke

Stroke is a prevalent disease that can affect the patient and his/her family for many years. It is one of the world's major causes of adult disability. Developing countries face this kind of non-communicable disease [11]. For this reason, knowing what stroke is, is an essential first step. A stroke is a "brain attack." It can occur at any time and can affect anyone. It happens when blood flow to an area of the brain is cut off. When this occurs, brain cells die due to the absence of oxygen, and the capabilities regulated by that region of the brain, such as memory and muscle control, are lost.

The common signs of stroke are weakness or numbness of the face, arm, and leg on one side of the body, difficulty speaking, and trouble seeing in one or both eyes. A patient can also experience sudden severe dizziness, loss of balance, and severe headache, and lastly, increasing drowsiness with possible loss of consciousness and confusion [12].

B. Support Vector Machine

Support Vector Machine (SVM) is a machine learning method based on statistical learning theory.
In the descriptor space of the training data, a separating hyperplane is constructed, and samples are classified according to the side of the hyperplane on which they fall [13]. SVM can be used to discover complicated patterns. Similarity functions (kernels) are selected to transform the data and to choose the data points, or vectors, that support it [14]. Moreover, SVM is one of the supervised learning methods used for classification, prediction, and regression analysis [15].

Figure 1. Maximum Margin Separating Hyperplane (figure labels: Positive Hyperplane, Negative Hyperplane)

Figure 1 shows the margin between the classes and the hyperplane used to classify data of two classes. Support vectors are used to obtain the maximum margin from each class of data [16]. The solid line is the maximum-margin separating hyperplane. The points with the smallest margins are exactly the ones closest to the decision boundary, lying on the lines parallel to it. Thus, only the coefficients associated with these three points will be non-zero at the optimal solution to the optimization problem. These three points are known as support vectors [17].

C. Principal Component Analysis (PCA)

PCA is a significant method from multivariate analysis that is often used for data dimensionality reduction. It is also a popular way to extract significant features from the training data used to learn a machine learning model [18]. PCA is used in this study on the patients' data sets for the prediction of stroke.

In general, PCA works as a linear transformation that converts the original data variables into a feature space with the same dimensions as the unprocessed data. The transformed variables in the feature space are uncorrelated and are called principal components. The transformation aims to maximize the variance among the projected variables in the feature space, which allows the contribution of each principal component to be evaluated. The principal components that carry most of the variance can then be selected and the remaining ones discarded [19].

D. Stroke Prediction Model

For the prediction of stroke, this study uses a machine learning method, SVM, to predict the possibility of stroke based on the medical records of the patient. In a study conducted by Bentley et al. [20], SVM achieved higher accuracy than radiological methods. In addition, in the research undertaken by Colak, Karaman, and Turtay [21], SVM and ANN were used to predict stroke based on selected early diagnostic predictors for a clinical decision support system.

Moreover, Xiang [22] applied and compared different categories of machine learning models with good interpretability, including generalized linear models, to build predictions for stroke and thromboembolism. That study used integrated machine learning approaches, including data curation, feature engineering, and supervised learning, to build the thromboembolism prediction model, and showed that the approach could achieve significantly better prediction performance.

III. MATERIALS AND METHODS

This study applies the general framework of knowledge discovery in databases, presented in Figure 2.

Figure 2. Knowledge Discovery in Databases

A. Datasets

The data used in this study came from the medical records of patients, originally owned by a hospital in Cavite, Philippines. The study covers a total of 1,500 patients from the past year to the present. The medical records of the hospital contain different variables, as shown in Table 1.

TABLE 1. PATIENTS' MEDICAL DATA

| Attribute | Description |
|---|---|
| Age | Patient's age |
| Sex | Gender of the patient |
| Chief Complaint | Patient's major health complaint |
| Diabetes | If patient has diabetes |
| Hypertension | If patient has hypertension |
| Smoker | If patient is a smoker |
| Alcoholic and Beverage Drinker | If patient is alcoholic and beverage drinker |
| Blood Pressure | Blood pressure of the patient |
| Pulse Rate | Pulse rate of the patient |
| Weight | Weight of the patient |

The data sets consist of 33 attributes (patient's name, age, sex, civil status, birthday, nationality, occupation, father's name, mother's name, chief complaint, history of present illness, past medical history, diabetes, hypertension, cancer, pulmonary tuberculosis, others, smoker, alcoholic and beverage drinker, food and drug allergy, general appearance, blood pressure, respiratory rate, temperature, weight, SHEENT (skin, head, eyes, ears, nose & throat), chest and lungs, CVS, abdomen, genitalia, extremities, CNS, diagnosis) that were cleaned and underwent dimension reduction to extract the essential features used to train the support vector machine. The data was narrowed down to 11 attributes that served as the attributes for the stroke prediction model. These remaining 11 attributes are the factors associated with stroke, as annotated by the physicians. Based on the medical records of the patient, if he/she had positive responses for all the attributes used, then he/she was considered to have the probability of having a stroke. Of the total data, 60% (900 records) was used for training the model, 30% (450) for testing, and 10% (150) for validation.

The linear classifier used by the SVM has the form

$$f(x) = W \cdot I + \text{bias} \tag{1}$$

where $W$ is the weight vector, $I$ is the input vector, and bias is the bias term. The separating hyperplane is defined by $f(x) = 0$. The first class, which falls above the hyperplane, has $f(x) > 0$, and the other class, below the hyperplane, has $f(x) < 0$ [24].

D. Evaluation

The performance of the model is evaluated using accuracy, defined as the number of correctly classified instances divided by the total number of instances in the dataset, as used in another study [25].
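As a small illustration of Equation (1) and of accuracy as defined above, the sketch below classifies a few points by the sign of f(x) and scores the predictions. The weights, bias, sample points, and labels are made-up values chosen only for illustration; they are not parameters or data from this study, and the function names are hypothetical.

```python
def decision_function(weights, inputs, bias):
    """Evaluate f(x) = W . I + bias for one input vector (Equation 1)."""
    return sum(w * i for w, i in zip(weights, inputs)) + bias


def predict(weights, inputs, bias):
    """Return 1 when the point lies above the hyperplane (f(x) > 0), else 0."""
    return 1 if decision_function(weights, inputs, bias) > 0 else 0


def accuracy(y_true, y_pred):
    """Correctly classified instances divided by the total number of instances."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical weights, bias, and samples, chosen only for illustration.
weights = [1.0, -1.0]
bias = -0.5
samples = [[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 1.0]]
y_true = [1, 0, 1, 1]

y_pred = [predict(weights, x, bias) for x in samples]
print(y_pred)                    # [1, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 0.75
```

In this sketch a point lying exactly on the hyperplane (f(x) = 0) is assigned to class 0; that tie-breaking choice is an assumption of the example rather than something specified in the text.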
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2}$$

where TP is the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives.

TP Rate: the fraction of positive instances that were predicted as positive. The true-positive rate is also called sensitivity [25].

$$TPR = \frac{TP}{TP + FN} \tag{3}$$

Precision is defined as the degree to which repeated measurements under unchanged conditions show the same results [25]. For a classifier, it is the fraction of instances predicted as positive that are actually positive.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class [26].

$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$

F-measure is the combination of both precision and recall. It is used to estimate the query classification performance [25].

$$\text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{6}$$

B. Data Preprocessing

Since the hospital did not have electronic copies of the patients' medical records, individual records were encoded in Microsoft Excel. After encoding the information of all 1,500 patients with 33 attributes, the data were cleaned by deleting redundant information and unnecessary details, leaving 11 attributes as the parameters of stroke.

The raw data contains binary, nominal, and numeric types. For the different data types, this study designed sets of cleansing rules to ensure that complete and accurate data are available. The cleansing rules were used to standardize the format, correct input errors, or discard values that could not be recognized.

Afterwards, the missing values that could not be connected to other features were imputed. Features with too many missing entries are discarded because their distributions are difficult to estimate, which may lead to inaccurate results. Xiang [22] suggests that if a binary feature has more than 80% missing instances, or a numeric or multi-value nominal feature has more than 60% missing entries, the feature should be removed from the data sets. Thus, the other 22 variables were dropped since they were not necessary stroke parameters and some had missing values.

C. Model Building

Further preprocessing was performed to remove outliers in the data set. PCA is used for feature selection, as it is a standard method of extracting the essential features from the training data. Feature selection can serve distinct aspects of data analysis: better data visualization and comprehension, reduced computational time and analysis duration, and improved predictive accuracy [23].

This study used the SVM algorithm for model building. It utilizes both linear and nonlinear kernel functions. It classifies the data by finding the hyperplane, the boundary that separates the data points of the first class from those of the second class. The larger the margin found, the better the model [24]. The SVM uses a linear classifier of the form given in Equation (1).

IV. EXPERIMENTAL RESULTS

Prevention is better than cure. Detecting early signs of potential stroke is essential since stroke is life-threatening, and early detection could improve a patient's life expectancy and health condition. A supervised algorithm known as SVM was used to develop the model of stroke prediction.

The classifier could still be improved in terms of accuracy and other relevant performance measures.

TABLE 3. Model Testing Result

| | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| Model Testing (450) | 98.89% | 75% | 81.82% | 78.41% | 99.8% (.998) |

Table 3, on the other hand, presents the parameters for evaluating the model using the testing data, which consists of 450 medical records. The accuracy of the SVM model on the testing data is 98.89%. Precision is 75%, and recall, the fraction of actual positive stroke cases correctly identified by the SVM model, is 81.82%. The F1 score of the SVM model is 78.41%, and the AUC is 99.8% (.998). Based on the results, the classifier is able to predict correctly based on the patterns learned in the training activity. Thus, the model is accurate enough to use for predicting potential stroke.

Figure 3.
Data plot of patients using SVM

Figure 3 shows the plot of the patients' data. The blue dots indicate patients who are negative for stroke, and the brown dots indicate patients who have the possibility of stroke. The spread of the Radial Basis Function (RBF) kernel shows that the gamma value is high enough that the decision boundary starts to cover the spread of the data better, transforming the data into a higher-dimensional feature space. RBF is a popular kernel (a way of computing the dot product of two vectors) used in the SVM model. It is a function whose value depends on the distance from the origin or from some other fixed point.

In this study, accuracy, precision, recall, F1 score, and AUC are computed to evaluate the performance of the SVM classifier. The 1,500 records were divided into 60% training, 30% testing, and 10% validation. The data underwent cross-validation to evaluate and compare the results by dividing the data into two segments: one used to learn or train a model, and the other used to validate the model.

TABLE 2. Model Training Result

| | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| Model Training Data (900) | 99% | 80% | 76.19% | 78.10% | 99.8% (.998) |

TABLE 4. Model Validation Result

| | Predicted Correctly | Not Predicted Correctly |
|---|---|---|
| Without Stroke | 143 | 3 |
| With Stroke | 3 | 1 |

Number of Data: 150. Accuracy: 97.33%.

Table 4 shows the validation result, which used 10% of the total data, or 150 records. The model correctly predicted 143 records without stroke and incorrectly predicted 3. Moreover, it correctly predicted 3 instances with stroke and incorrectly predicted 1. Based on these validation results, the model is 97.33% accurate.

TABLE 5. Model Testing Confusion Matrix

| | Pred. without stroke | Pred. with stroke |
|---|---|---|
| True without stroke | 436 | 3 |
| True with stroke | 2 | 9 |

Table 5 shows the confusion matrix of the data used in testing the model.
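The counts in Table 5 are enough to recompute the test-set measures with Equations (2) and (4) to (6). The sketch below is only a worked check, assuming stroke is the positive class, so that TN = 436, FP = 3, FN = 2, and TP = 9; the function names are hypothetical and not part of the study's implementation.

```python
# Counts read from Table 5 (testing set, 450 records); stroke is the positive class.
tn, fp = 436, 3  # true without stroke: predicted without / predicted with
fn, tp = 2, 9    # true with stroke:    predicted without / predicted with


def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(tp, fp, tn, fn):
    """Compute the measures of Equations (2), (4), (5), and (6) from raw counts."""
    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f_measure = safe_div(2 * recall * precision, recall + precision)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


metrics = evaluate(tp, fp, tn, fn)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
# accuracy: 98.89%
# precision: 75.00%
# recall: 81.82%
# f_measure: 78.26%
```

Accuracy, precision, and recall agree with the values reported in Table 3. The F-measure from Equation (6) comes to about 78.26%, slightly below the reported 78.41%; the reported figure equals the arithmetic mean of precision and recall, (75 + 81.82) / 2, which may indicate that a simple average was used there. This is an observation from the arithmetic, not something the paper states.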
The rows of the confusion matrix correspond to the known truth, and the columns correspond to what the model predicted. The model correctly identified 436 patients without stroke and 9 patients with stroke. On the other hand, 3 patients without stroke were identified by the algorithm as having a stroke. Lastly, 2 patients had a stroke, but the algorithm classified them as without stroke.

Table 2 presents the parameters for evaluating the model using the training data, which consists of 900 medical records. The accuracy of the SVM model on the training data is 99.00%. Precision is 80%, and recall, the fraction of actual positive stroke cases correctly identified by the SVM model, is 76.19%. The F1 score of the SVM model is 78.10%, and the AUC is 99.8% (.998), which is close to that of an ideal classifier. The results show that the classifier could still be improved.

Hence, from the above study, it can be seen that the model obtained an accuracy of 99% on the training data and 98.89% on the testing data. To better ensure the accuracy and efficiency of the algorithm used, the model underwent validation and generated a result of 97.33%. To better understand classifier performance, the F1 score matters, as it provides a balance between recall and precision [28].

V. CONCLUSION

The objective of this study is to develop a predictive model using SVM to predict the possibility of stroke among patients in Cavite, Philippines. Predictions from the SVM with the RBF kernel resulted in a high-performance classifier, reported as 1.0. This can assist doctors in planning earlier for better stroke detection and medication. This study demonstrates the predictive capability of SVM with 1,500 patients and 10 attributes. The evaluation resulted in an accuracy of 99% on the training data and 98.89% on the testing data, with a validation result of 97.33%.

This study is not free from limitations, and thus recommends some future activities. The study could be used in the future for stroke prevention, since it could detect the early occurrence of stroke among the patients of Cavite, Philippines. The results could also help in developing a control plan for those patients, since stroke cannot be detected beforehand. This study could also be used for developing other models for further comparison of different machine learning algorithms.

REFERENCES

[1] V Mozaffarian, D. B. (2015). Heart disease and stroke statistics 2015 update: a report from the American Heart Association. American Heart Association, Circulation 131, e29–322.
[2] Amelia K. Boehme, C. E. (2017). Stroke Risk Factors, Genetics, and Prevention. Circulation Research Journal of the American Heart Association.
[3] M. Edip Gurol, J. S. (2018). Advances in Stroke Prevention in 2018. Journal of Stroke, 143-144.
[4] WHO. (2018). World Health Statistics 2018: Monitoring Health for the SDGs, Sustainable Development Goals. Geneva: World Health Organization.
[5] PSA. (2018, February 12). Deaths in the Philippines 2016. Retrieved from Philippine Statistics Authority: https://psa.gov.ph/content/deathsphilippines-2016
[6] Mehrbakhsh Nilashi, H. A. (September 2017). Knowledge Discovery and Diseases Prediction: A Comparative Study of Machine Learning Techniques. Journal of Soft Computing and Decision Support Systems, 4(5), 8-16.
[7] Nilashi, M. b. (2017). An Analytical Method for Diseases Prediction Using Machine Learning Techniques. Computers & Chemical Engineering, 106, 212-223.
[8] Nilashi, M. E. (2016). A multi-criteria collaborative filtering recommender system using clustering and regression techniques. Journal of Soft Computing and Decision Support Systems, 5, 24-30.
[9] Hazi Mohammad Azamathulla, A. H. (2017). Application of Data Mining Methods in Diabetes Prediction. 2017 2nd International Conference on Image, Vision and Computing (IEEE), 106-110.
[10] Jeena RS, D. S. (2016). Stroke Prediction Using SVM. International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT) (IEEE).
[11] Subha PP, P. G. (2015). Pattern and risk factors of stroke in the young among stroke patients admitted in medical college hospital, Thiruvananthapuram. Ann Indian Acad Neurol, 18:20-3.
[12] National Stroke Association. (2019). (American Heart Association Inc.) Retrieved May 28, 2019, from https://www.stroke.org/understand-stroke/what-is-stroke/
[13] Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.
[14] Jean-Emmanuel Bibault, P. G. (2016). Big Data and machine learning in radiation oncology: State of the art and future prospect. Elsevier, 110-117.
[15] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.
[16] Raoof Gholami, N. F. (2017). Support Vector Machine: Principles, Parameters, And Applications. Elsevier, 515-533.
[17] Ng, A. (n.d.). Stanford Edu. Retrieved May 30, 2019, from cs229.stanford.edu/notes/cs229-notes3.pdf
[18] Smita Jhajharia, H. K. (2016). A Neural Network Based Breast Cancer Prognosis Model with PCA Processed Feature. Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI). Jaipur, India.
[19] O. Inan, M. S. (n.d.). A new hybrid feature selection method based on association rules and pca for detection of breast cancer. International Journal of Innovative Computing and Information and Control, 09(02), 727-739.
[20] P. Bentley, J. G. (n.d.). Prediction of stroke thrombolysis outcome using CT brain machine learning. NeuroImage, 4, 635-640.
[21] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.
[22] Xiang Li, P. H. (2017). Integrated Machine Learning Approaches for Predicting Ischemic Stroke and Thromboembolism in Atrial Fibrillation. AMIA Annual Proceedings Archive, 799-807.
[23] Ioannis Kavakiotis, O. T. (2017). Machine Learning and Data Mining Methods in Diabetes Research. Elsevier Computational and Structural Biotechnology Journal (15), 104-116.
[24] Radhimeennakshi, S. (2016). Classification and prediction of Heart Disease Risk Using Data Mining Techniques of Support Vector Machine and Artificial Neural Network. International Conference on Computing for Sustainable Global Development (INDIACom), IEEE, 3107-3111.
[25] Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.
[26] Joshi, R. (2016, September 9). Exsilio Solutions. Retrieved June 4, 2019, from https://blog.exsilio.com/all/accuracy-precision-recall-f1score-interpretation-of-performance-measures/
[27] Harleen Kaur, V. K. (2018). Predictive modelling and analytics for diabetes using a machine learning approach. Applied Computing and Informatics.
[28] J. Li, O. A. (2017). Glycaemic index precision: a pilot study of data linkage challenges and the application of machine learning. IEEE EMBS Int. Conf. on Biomed. & Health Informat (BHI), 357-360.