Developing a Predictive Model of Stroke using Support Vector Machine
Conference Paper · October 2019
DOI: 10.1109/TSSA48701.2019.8985498
Developing a Predictive Model of Stroke using
Support Vector Machine
Jovel T. Rosado
Alexander A. Hernandez
Technological Institute of the Philippines
Manila, Philippines
jovelrosado08@gmail.com
Technological Institute of the Philippines
Manila, Philippines
alexander.hernandez@tip.edu.ph
Abstract— Health is a fundamental human right of all the
Filipinos in the Philippines, as stated by the Philippine
Constitution of 1987. According to data published by the World Health Organization in 2018, an estimated 41 million deaths in 2016 were caused by non-communicable diseases, of which cardiovascular diseases, including stroke, accounted for 17.9 million. Given the parameters for the risk factors of stroke, this study develops a predictive model for the occurrence of stroke based on the medical records of the patient. To ensure data quality, the medical data of the patients underwent preprocessing, and principal component analysis was used for dimension reduction. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve. The study used a dataset of 1,500 patients from Cavite, Philippines: 60 percent for training the model, 30 percent for testing, and 10 percent for validation. The SVM model achieved an accuracy of 99% on the training data, 98.89% on the testing data, and 97.33% on the validation data. The results show the potential of the predictive model for stroke, which remains relevant for researchers and practitioners in the medical and health sciences.
Keywords—support vector machine, principal component analysis, stroke prediction, Philippines

I. INTRODUCTION

Stroke is the top life-threatening disease in the world and the leading cause of cognitive disorders worldwide [1]. To reduce the burden of stroke in the population, it is necessary first to identify the modifiable risk factors and to demonstrate the effectiveness of risk reduction efforts [2]. Accordingly, stroke prevention remains one of the essential targets in the fields of neurology, cardiology, vascular medicine, and geriatric medicine [3].

In 2016, an estimated 41 million deaths were caused by non-communicable diseases. Cardiovascular disease accounted for the largest share, with 17.9 million deaths, equivalent to 44% of all non-communicable disease deaths [4]. In the Philippines, according to the Philippine Statistics Authority (PSA), stroke was the leading cause of death, with 74,134 deaths or 12.7 percent of the total [5].

However, the growing number of stroke incidents can be addressed through innovation and technology. The use of machine learning in knowledge discovery for disease prediction has been one of the interesting and relevant topics addressed by researchers [6]. Because of the importance of disease prediction, several studies have been conducted on modeling procedures for prediction [7]. This study incorporates machine learning and proposes a new predictive method using Principal Component Analysis (PCA) and a supervised machine learning algorithm. PCA is used for dimensionality reduction and for dealing with the multi-collinearity problem in the experimental data [8].

Support Vector Machine (SVM) is a technique suitable for disease prediction tasks [9]; thus, SVM is chosen to predict stroke. An SVM-based approach with various kernel functions produced accurate results and showed the predictive power of SVM with a small set of input parameters [10].

This paper intends to develop a predictive model using the medical records of the patients. The records undergo dimension reduction through Principal Component Analysis, which reduces the range of continuous data into a range of values or categories, and are then processed using a Support Vector Machine. The model is evaluated using accuracy, precision, recall, F1 score, and area under the curve.

II. RELATED WORKS

A. Overview of Stroke

Stroke is a prevalent disease that can affect patients and their families for many years. It is one of the world’s major causes of adult disability, and developing countries also face this kind of non-communicable disease [11]. For this reason, knowing what stroke is, is an essential first step. A stroke is a “brain attack.” It can occur at any time and can affect anyone. It happens when blood flow to an area of the brain is cut off. When this occurs, brain cells die due to the absence of oxygen, and the capabilities regulated by the affected brain region, such as memory and muscle control, are lost. The common signs of stroke are weakness or numbness of the face, arm, and leg on one side of the body, difficulty speaking, and trouble seeing in one or both eyes. A patient can also experience sudden severe dizziness, loss of balance, and a severe headache, as well as increasing drowsiness with possible loss of consciousness and confusion [12].
B. Support Vector Machine

Support Vector Machine (SVM) is a machine learning method based on statistical learning theory. In the descriptor space of the training data, a separating hyperplane is constructed, and samples are classified according to the side of the hyperplane on which they fall [13]. SVM can be used to discover complicated patterns. Similarity functions (or kernels) are selected to transform the data and to select the data points, or vectors, that support the decision boundary [14]. Moreover, SVM is one of the supervised learning methods used for classification, prediction, and regression analysis [15].

Figure 1. Maximum Margin separating Hyperplane (figure labels: Positive Hyperplane, Negative Hyperplane)

Figure 1 shows the margin between the classes and the hyperplane used to classify data of two classes. Support vectors are used to obtain the maximum margin from each class of data [16]. The solid line is the maximum margin separating hyperplane. The points with the smallest margins are exactly the ones closest to the decision boundary, lying on the lines parallel to it. Thus, only the multipliers corresponding to these three points will be non-zero at the optimal solution of the optimization problem. These three points are known as support vectors [17].

C. Principal Component Analysis (PCA)

PCA is a prominent method from multivariate analysis that is often used for data dimensionality reduction. It is also a popular way to extract significant features from the training data used to learn a machine learning model [18]. In this study, PCA is applied to the data sets of the patients for the prediction of stroke.

In general, PCA works as a linear transformation that converts the original data variables into a feature space with the same number of dimensions as the unprocessed data. The transformed variables in the feature space are uncorrelated and are called principal components. The transformation aims to maximize the variance among the projected variables in the feature space, which enables the contribution of each principal component to be evaluated. The principal components that carry most of the variance can then be retained and the remaining ones discarded [19].

D. Stroke Prediction Model

To predict stroke, this study uses a machine learning method, SVM, to predict the possibility of stroke based on the medical records of the patient. In a study conducted by Bentley et al. [20], SVM achieved higher accuracy than radiological methods. In the research undertaken by Colak, Karaman, and Turtay [21], SVM and ANN predicted stroke based on selected early diagnostic predictors for a clinical decision support system.

Moreover, Xiang [22] applied and compared different categories of machine learning models with good interpretability, including generalized linear models, to build prediction models for stroke and thromboembolism. That study used integrated machine learning approaches, including data curation, feature engineering, and supervised learning, to build the thromboembolism prediction model, and showed that the approach could achieve significantly better prediction performance.

III. MATERIALS AND METHODS

This study applies the general framework of knowledge discovery in databases, presented in Figure 2.

Figure 2. Knowledge Discovery in Databases

A. Datasets

The data used in this study came from the medical records of patients, originally owned by a hospital in Cavite, Philippines. The data cover a total of 1,500 patients from the past year to the present. The medical records of the hospital contain different variables, such as those shown in Table 1.

TABLE 1. PATIENTS’ MEDICAL DATA
Attribute | Description
Age | Patient’s age
Sex | Gender of the patient
Chief Complaint | Patient’s major health complaint
Diabetes | If patient has diabetes
Hypertension | If patient has hypertension
Smoker | If patient is a smoker
Alcoholic and Beverage Drinker | If patient is alcoholic and beverage drinker
Blood Pressure | Blood pressure of the patient
Pulse Rate | Pulse rate of the patient
Weight | Weight of the patient

The data sets consist of 33 attributes (patient’s name, age, sex, civil status, birthday, nationality, occupation, father’s name, mother’s name, chief complaint, history of present illness, past medical history, diabetes, hypertension, cancer, pulmonary tuberculosis, others, smoker, alcoholic and beverage drinker, food and drug allergy, general appearance, blood pressure, respiratory rate, temperature, weight, SHEENT (skin, head, eyes, ears, nose & throat), chest and lungs, CVS, abdomen, genitalia, extremities, CNS, diagnosis). These were cleaned and underwent dimension reduction to extract the essential features used to train the support vector machine. The data were narrowed down to 11 attributes that served as the attributes for the stroke prediction model. The remaining 11 attributes were the data related to the causes of stroke, as annotated by the physicians. Based on the medical records of the patient, if he/she had positive responses for all of the attributes used, then he/she had a probability of having a stroke. Of the total data, 60% (900 records) was used for training the model, 30% (450) was used as the testing data set, and 10% (150) was used for validation.
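The 60/30/10 partition described above can be sketched as follows. This is a minimal illustration under stated assumptions: the paper does not say how records were assigned to each subset, so a seeded random shuffle is assumed here, and the function name `split_indices` and the seed value are hypothetical.

```python
import random


def split_indices(n_records, train_frac=0.6, test_frac=0.3, seed=0):
    """Shuffle record indices and split them into train/test/validation parts.

    The validation part receives whatever remains after the train and test
    parts are taken, so the three parts always cover every record once.
    """
    indices = list(range(n_records))
    # Assumption: a seeded random shuffle; the paper does not state the method.
    random.Random(seed).shuffle(indices)
    n_train = round(n_records * train_frac)
    n_test = round(n_records * test_frac)
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]
    return train, test, validation


train_idx, test_idx, val_idx = split_indices(1500)
print(len(train_idx), len(test_idx), len(val_idx))
```

With 1,500 records this yields subsets of 900, 450, and 150 indices, matching the counts reported in the text. Because stroke cases are rare in the data, a stratified split that preserves the class ratio in each subset may be a reasonable alternative.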
B. Data Preprocessing

Since the hospital did not have electronic copies of the medical records of the patients, the individual records were encoded in Microsoft Excel. After encoding the information of all 1,500 patients with 33 attributes, the data were cleaned by deleting redundant information and unnecessary details, leaving 11 attributes as the parameters for stroke.

The raw data contain binary, nominal, and numeric types. For the different data types, this study designed sets of cleansing rules to ensure that complete and accurate data are available. The cleansing rules were used to standardize the format, correct input errors, or discard values that could not be recognized.

After the missing values that could not be connected to other features were imputed, the features with too many missing entries were discarded because their distributions are difficult to estimate, which may lead to inaccurate results. Xiang [22] suggests that if a binary feature has more than 80% missing instances, or a numeric or multi-value nominal feature has more than 60% missing entries, then the feature should be removed from the data set. Thus, the other 22 variables were dropped since they were not necessary stroke parameters, and some had missing values.

C. Model Building

Further preprocessing was performed to remove outliers in the data set. PCA is used for feature selection, as it is a standard method of extracting the essential features from the training data. Feature selection can serve distinct aspects of data analysis, such as better data visualization and comprehension, reduced computational time and analysis duration, and improved predictive accuracy [23].

This study used the SVM algorithm for model building. It utilizes both linear and nonlinear kernel functions. It classifies the data by finding the hyperplane, the boundary that separates the data points of the first class from those of the second class. The larger the margin found, the better the model [24].

The SVM uses a linear classifier of the following form:

\[ f(x) = W I + \text{bias} \tag{1} \]

where W is the weight vector, I is the input vector, and bias is the bias term. The separating hyperplane is defined by f(x) = 0. Therefore, the first class, which falls above the hyperplane, has f(x) > 0, and the other class, which falls below the hyperplane, has f(x) < 0 [24].

D. Evaluation

The performance of the model is evaluated using accuracy, defined as the number of correctly classified instances divided by the total number of instances in the dataset, as used in another study [25].

\[ \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{2} \]

where TP is true positive, FP is false positive, TN is true negative, and FN is false negative.

TP rate is the fraction of positive data that were predicted positive. The true-positive rate is also called sensitivity [25].

\[ \text{TPR} = \frac{TP}{TP + FN} \tag{3} \]

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations [25].

\[ \text{Precision} = \frac{TP}{TP + FP} \tag{4} \]

Recall is the ratio of correctly predicted positive observations to all observations in the actual positive class [26].

\[ \text{Recall} = \frac{TP}{TP + FN} \tag{5} \]

F-measure combines precision and recall as their harmonic mean. It is used to estimate the classification performance [25].

\[ \text{F-Measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} \tag{6} \]
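The linear classifier of Equation (1), described under Model Building, can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names `decision_value` and `classify`, and the weight, bias, and input values, are hypothetical and chosen only to show how the sign of f(x) assigns a class.

```python
def decision_value(weights, inputs, bias):
    """Evaluate f(x) = W . I + bias for a single input vector."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    return sum(w * i for w, i in zip(weights, inputs)) + bias


def classify(weights, inputs, bias):
    """Return +1 above the hyperplane (f > 0), -1 below it (f < 0), 0 on it."""
    value = decision_value(weights, inputs, bias)
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0


# Hypothetical two-feature example; these numbers are not from the study.
W = [2.0, -1.0]
BIAS = -0.5

print(classify(W, [1.0, 0.5], BIAS))   # f = 2.0 - 0.5 - 0.5 = 1.0, above the hyperplane
print(classify(W, [0.0, 1.0], BIAS))   # f = -1.0 - 0.5 = -1.5, below the hyperplane
print(classify(W, [0.25, 0.0], BIAS))  # f = 0.5 - 0.5 = 0.0, on the hyperplane
```

In a trained SVM, the weights and bias are obtained by maximizing the margin on the training data rather than being set by hand. With a nonlinear kernel such as RBF, the dot product is replaced by kernel evaluations against the support vectors.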
IV. EXPERIMENTAL RESULTS

Prevention is better than cure. Detecting the early signs of a potential stroke is essential since stroke is life-threatening. Early detection could improve the patient’s life expectancy and health condition. A supervised algorithm, SVM, was used to develop the stroke prediction model.

Figure 3. Data plot of patients using SVM

Figure 3 shows the plot of the patients’ data. The blue dots indicate patients who are negative for stroke, and the brown dots indicate patients who have the possibility of stroke. The spread of the Radial Basis Function (RBF) kernel shows that the gamma value is high enough that the decision boundary starts to cover the spread of the data better, transforming the data into a higher-dimensional feature space. RBF is a popular kernel (a way of computing the dot product of two vectors in the feature space) used in SVM models. It is a function whose value depends only on the distance between the two input vectors.

In this study, accuracy, precision, recall, F1 score, and AUC are computed to evaluate the performance of the SVM classifier. The 1,500 records were divided into 60% training, 30% testing, and 10% validation. The data underwent cross-validation to evaluate and compare the results by dividing the data into two segments: one used to learn or train a model, and the other used to validate the model.

TABLE 2. Model Training Result
Model | Accuracy | Precision | Recall | F1 Score | AUC
Training Data (900) | 99% | 80% | 76.19% | 78.10% | 99.8% (.998)

Table 2 presents the evaluation measures for the model on the training data, which consists of 900 medical records. The accuracy of the SVM model on the training data is 99.00%. Precision is 80%, and recall, the fraction of actual positive stroke cases that were correctly identified, is 76.19%. The F1 score of the SVM model is 78.10%, and the AUC is 99.8% (.998), which is close to the value of 1.0 for an ideal classifier. The results show that the classifier could still be improved in terms of accuracy and other relevant performance measures.

TABLE 3. Model Testing Result
Model | Accuracy | Precision | Recall | F1 Score | AUC
Testing (450) | 98.89% | 75% | 81.82% | 78.41% | 99.8% (.998)

Table 3, on the other hand, presents the evaluation measures for the model on the testing data, which consists of 450 medical records. The accuracy of the SVM model on the testing data is 98.89%. Precision is 75%, and recall, the fraction of actual positive stroke cases that were correctly identified, is 81.82%. The F1 score of the SVM model is 78.41%, and the AUC is 99.8% (.998). Based on the results, the classifier is able to predict correctly based on the patterns learned during training. Thus, the model is accurate enough to use for predicting potential stroke.

TABLE 4. Model Validation Result (Number of Data: 150; Accuracy: 97.33%)
Class | Predicted Correctly | Not Predicted Correctly
Without Stroke | 143 | 3
With Stroke | 3 | 1

Table 4 shows the validation result, which used 10% of the total data (150 records). The model correctly predicted 143 instances without stroke and incorrectly predicted 3. Moreover, it correctly predicted 3 instances with stroke and incorrectly predicted 1. Based on these validation results, the model is 97.33% accurate.

TABLE 5. Model Testing Confusion Matrix
 | True without Stroke | True with Stroke
pred without stroke | 436 | 2
pred with stroke | 3 | 9

Table 5 shows the confusion matrix of the data used in testing the model. The rows in the confusion matrix correspond to what the model predicted, and the columns correspond to the known truth. There are 436 patients without stroke who were correctly identified by the model, and 9 patients with stroke who were correctly identified. On the other hand, there were 3 patients without stroke whom the algorithm identified as having a stroke. Lastly, 2 patients had a stroke, but the algorithm identified them as without stroke.

Hence, the model obtained an accuracy of 99% on the training data and 98.89% on the testing data. To better ensure the accuracy and efficiency of the algorithm used, the model underwent validation and obtained an accuracy of 97.33%. For a better understanding of the classifier’s performance, the F1 score matters, as it provides a balance between recall and precision [28].
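As a worked check, the counts in Table 5 can be substituted into the evaluation formulas of Section III.D. The sketch below assumes that "with stroke" is the positive class, which the paper does not state explicitly; the variable names are illustrative.

```python
# Counts read from Table 5, assuming "with stroke" is the positive class.
TP = 9     # with stroke, predicted with stroke
TN = 436   # without stroke, predicted without stroke
FP = 3     # without stroke, predicted with stroke
FN = 2     # with stroke, predicted without stroke

accuracy = (TP + TN) / (TP + FP + TN + FN)                   # Equation (2)
precision = TP / (TP + FP)                                   # Equation (4)
recall = TP / (TP + FN)                                      # Equations (3) and (5)
f_measure = 2 * recall * precision / (recall + precision)    # Equation (6)
arithmetic_mean = (precision + recall) / 2                   # for comparison only

for name, value in [
    ("accuracy", accuracy),
    ("precision", precision),
    ("recall", recall),
    ("F-measure", f_measure),
    ("mean of precision and recall", arithmetic_mean),
]:
    print(f"{name}: {value:.2%}")
```

Under this assumption, accuracy (98.89%), precision (75.00%), and recall (81.82%) agree with Table 3. Equation (6) gives about 78.26%, whereas the reported F1 score of 78.41% coincides with the arithmetic mean of precision and recall, so the reported figure may have been computed differently from Equation (6).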
V. CONCLUSION

The objective of this study is to develop a predictive model using SVM to predict the possibility of stroke among patients in Cavite, Philippines. The SVM with the RBF kernel resulted in a high-performance classifier, reported as 1.0 for RBF. This can help doctors plan better stroke detection and medication in the near future. This study demonstrates the predictive capability of SVM with 1,500 patients and 10 attributes. The evaluation resulted in an accuracy of 99% on the training data and 98.89% on the testing data, with a validation result of 97.33%.

This study is not free from limitations; thus, it recommends some future activities. The study could be used in the future for stroke prevention since it could detect the early occurrence of stroke among the patients of Cavite, Philippines. The results could also help in developing a control plan for those patients since stroke cannot be detected beforehand. This study could also be used for developing another model for further comparison of different machine learning algorithms.

REFERENCES

[1] V Mozaffarian, D. B. (2015). Heart disease and stroke statistics 2015 update: a report from the American Heart Association. American Heart Association, Circulation 131, e29–322.
[2] Amelia K. Boehme, C. E. (2017). Stroke Risk Factors, Genetics, and Prevention. Circulation Research Journal of the American Heart Association.
[3] M. Edip Gurol, J. S. (2018). Advances in Stroke Prevention in 2018. Journal of Stroke, 143-144.
[4] WHO. (2018). World Health Statistics 2018: Monitoring Health for SDGs, sustainable development goals. Geneva World Health Organization.
[5] PSA. (2018, February 12). Deaths in the Philippines 2016. Retrieved from Philippine Statistics Authority: https://psa.gov.ph/content/deathsphilippines-2016
[6] Mehrbakhsh Nilashi, H. A. (September 2017). Knowledge Discovery and Diseases Prediction: A Comparative Study of Machine Learning Techniques. Journal of Soft Computing and Decision Support Systems, 4(5), 8-16.
[7] Nilashi, M. b. (2017). An Analytical Method for Diseases Prediction Using Machine Learning Techniques. Computers & Chemical Engineering, 106, 212-223.
[8] Nilashi, M. E. (2016). A multi-criteria collaborative filtering recommender system using clustering and regression techniques. Journal of Soft Computing and Decision Support Systems, 5, 24-30.
[9] Hazi Mohammad Azamathulla, A. H. (2017). Application of Data Mining Methods in Diabetes Prediction. 2017 2nd International Conference on Image, Vision and Computing (IEEE), 106-110.
[10] Jeena RS, D. S. (2016). Stroke Prediction Using SVM. International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT) (IEEE).
[11] Subha PP, P. G. (2015). Pattern and risk factors of stroke in the young among stroke patients admitted in medical college hospital, Thiruvananthapuram. Ann Indian Acad Neurol, 18:20-3.
[12] National Stroke Association. (2019). (American Heart Association Inc.) Retrieved May 28, 2019, from https://www.stroke.org/understand-stroke/what-is-stroke/
[13] Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.
[14] Jean-Emmanuel Bibault, P. G. (2016). Big Data and machine learning in radiation oncology: State of the art and future prospect. Elsevier, 110-117.
[15] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.
[16] Raoof Gholami, N. F. (2017). Support Vector Machine: Principles, Parameters, And Applications. Elsevier, 515-533.
[17] Ng, A. (n.d.). Stanford Edu. Retrieved May 30, 2019, from cs229.stanford.edu/notes/cs229-notes3.pdf
[18] Smita Jhajharia, H. K. (2016). A Neural Network Based Breast Cancer Prognosis Model with PCA Processed Feature. Intl. Conference on Advances in Computing, Communications and Informatics (ICACCI). Jaipur, India.
[19] O. Inan, M. S. (n.d.). A new hybrid feature selection method based on association rules and pca for detection of breast cancer. International Journal of Innovative Computing and Information and Control, 09(02), 727-739.
[20] P. Bentley, J. G. (n.d.). Prediction of stroke thrombolysis outcome using CT brain machine learning. Neuroimage, 4, 635-640.
[21] Cemil Colak, E. K. (2015). Application of knowledge discovery process on the prediction of stroke. Elsevier, 181-185.
[22] Xiang Li, P. H. (2017). Integrated Machine Learning Approaches for Predicting Ischemic Stroke and Thromboembolism in Atrial Fibrillation. AMIA Annual Proceedings Archive, 799-807.
[23] Ioannis Kavakiotis, O. T. (2017). Machine Learning and Data Mining Methods in Diabetes Research. Elsevier Computational and Structural Biotechnology Journal(15), 104-116.
[24] Radhimeennakshi, S. (2016). Classification and prediction of Heart Disease Risk Using Data Mining Techniques of Support Vector Machine and Artificial Neural Network. IEEE International Conference on Computing for Sustainable Global Development (INDIACom), 3107-3111.
[25] O. Dr. S. Vijayarani, M. S. (2015). Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics and Informatics, 4(4), 13-25.
[26] Joshi, R. (2016, September 9). Exsilio Solutions. Retrieved June 4, 2019, from https://blog.exsilio.com/all/accuracy-precision-recall-f1score-interpretation-of-performance-measures/
[27] Harleen Kaur, V. K. (2018). Predictive modelling and analytics for diabetes using a machine learning approach. Applied Computing and Informatics.
[28] J. Li, O. A. (2017). Glycaemic index precision: a pilot study of data linkage challenges and the application of machine learning. IEEE EMBS Int. Conf. on Biomed. & Health Informat (BHI), 357-360.