
Sleep Apnea Summary

Introduction
Sleep apnea is a disorder that impairs a person’s breathing during sleep: breathing repeatedly stops and starts while the person is asleep. Risk factors include hypertension, diabetes, and other cardiovascular disorders.
Dataset Description
The dataset consists of 35 patient records, with the ECG voltage sampled every 10 milliseconds. Each 1-minute segment of the ECG signal was labelled as apnea or non-apnea by health experts; the labels do not distinguish stages of apnea. In total there are 12 features: 6 time-domain features and 6 frequency-domain features (refer to the report for details).
Standardization and Scaling
Distance-based algorithms such as KNN and SVM are affected by the scale of the variables, because their performance is driven by the Euclidean distance, which depends heavily on the magnitudes of the variables. To eliminate the impact of magnitude, all the variables must be brought to the same scale. The technique of standardization is used for this purpose:

Z = (X − μ) / σ

where μ is the mean and σ is the standard deviation of the feature.
Reference: https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/
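A minimal sketch of this step with sklearn's StandardScaler (the feature matrix X below is a placeholder, not the real ECG features):

import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature matrix; in the project this would be the
# 12 ECG features extracted from the dataset.
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# StandardScaler applies Z = (X - mu) / sigma column-wise:
# mu is the column mean, sigma the column standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # ~0 for each feature
print(X_scaled.std(axis=0))   # ~1 for each feature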
Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. A decision tree splits a node on a single feature, choosing the split that most increases the homogeneity of the resulting nodes, and that split is not influenced by the other features.
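As an illustrative check on synthetic data (an assumption, not taken from the report), a decision tree makes the same predictions with or without scaling, because each split thresholds a single feature:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 2) * [1.0, 1000.0]   # two features on very different scales
y = (X[:, 0] > 0.5).astype(int)        # label depends only on the first feature

X_std = StandardScaler().fit_transform(X)

tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)
tree_std = DecisionTreeClassifier(random_state=0).fit(X_std, y)

# Rescaling only shifts the split thresholds, not which samples end up
# on each side of a split, so the predictions agree.
print((tree_raw.predict(X) == tree_std.predict(X_std)).all())   # True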
Feature Selection
Irrelevant features, whether completely or partially irrelevant, can hurt a model’s performance and reduce its accuracy. Feature selection methods are therefore applied to pick the best subset from the full set of features, and the machine learning algorithms are then run on this subset to measure performance in terms of accuracy, recall, and precision.
Advantages of Feature Selection:
● Reduced overfitting: less redundant data means fewer decisions based on noise.
● Improved accuracy: data with less noise gives the model a better chance of being accurate.
● Reduced training time: fewer features lower the algorithm’s complexity, so it trains faster.
Chi-Square Test
The chi-square test is used to select features from the ECG dataset. The chi-square statistic is computed between each feature and the target, and the features with the highest chi-square values are selected. In effect, this measures the association between two categorical variables (the features are converted to integers for this purpose). Because the chi-square test applies only to non-negative features, the 10 best features of the extracted ECG dataset were selected on non-negative values using sklearn's SelectKBest, as sketched below.
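A minimal sketch of this selection step (X and y are stand-ins for the 12 extracted ECG features and the apnea labels; chi2 requires non-negative values, hence the MinMax scaling):

from sklearn.datasets import make_classification   # stand-in data
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the 12-feature ECG dataset and apnea/non-apnea labels.
X, y = make_classification(n_samples=500, n_features=12, random_state=0)

# chi2 requires non-negative features, so scale into [0, 1] first.
X_nonneg = MinMaxScaler().fit_transform(X)

# Keep the 10 features with the highest chi-square scores.
selector = SelectKBest(score_func=chi2, k=10)
X_best = selector.fit_transform(X_nonneg, y)

print(selector.scores_)                    # chi-square score per feature
print(selector.get_support(indices=True))  # indices of the 10 kept features
print(X_best.shape)                        # (500, 10)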
On the 10 selected features, the same machine learning algorithms that were applied before feature selection (KNN, SVM, Random Forest, and Naïve Bayes) were applied again using the sklearn library, and the accuracy, recall, and precision of each algorithm were calculated.
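A sketch of this evaluation loop, reusing X_best and y from the selection sketch above; the report's actual hyperparameters are not stated here, so sklearn defaults are assumed:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X_train, X_test, y_train, y_test = train_test_split(
    X_best, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Naive Bayes": GaussianNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(name,
          "accuracy:", accuracy_score(y_test, y_pred),
          "recall:", recall_score(y_test, y_pred),
          "precision:", precision_score(y_test, y_pred))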
Reference video (Krish Naik): https://www.youtube.com/watch?v=EqLBAmtKMnQ
Models
KNN
Adv: Simple and intuitive; adapts immediately to new data, since training merely stores the samples.
Disadv: Prediction cost grows with the size of the training data; affected by outliers and imbalanced data.
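A tiny sketch on toy data (not the ECG features) of why KNN adapts immediately: fitting just stores the samples, and all the work happens at prediction time:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)                 # "training" just stores the data (lazy learner)
print(knn.predict([[1.4]]))   # [0]

# Adapting to new data is just refitting on the extended set; the cost
# is paid at prediction time, which scans all stored points.
X2 = np.vstack([X, [[1.5]]])
y2 = np.append(y, 1)
knn.fit(X2, y2)
print(knn.predict([[1.4]]))   # [1], the new point changed the neighbourhood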
SVM
Adv: More efficient in high-dimensional spaces; relatively memory efficient; works well when there is little prior knowledge about the data.
Disadv: Requires long training times on large datasets; choosing the right kernel function is difficult.
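One practical answer to the kernel-choice difficulty is to compare kernels by cross-validation; a sketch on stand-in data:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Score each candidate kernel with 5-fold cross-validation and
# keep the one with the best mean accuracy.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())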
Naïve Bayes
Adv: Simple to implement; very fast, since the conditional probabilities can be computed directly; works well with large datasets.
Disadv: Assumes the features are independent of one another.
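A sketch on stand-in data showing why training is fast: fitting GaussianNB only estimates a per-class mean and variance for each feature in one pass:

from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

nb = GaussianNB().fit(X, y)
print(nb.theta_.shape)        # (n_classes, n_features) per-class means
print(nb.var_.shape)          # per-class variances (sigma_ in old sklearn)
print(nb.predict_proba(X[:2]))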
Logistic Regression
Adv: Logistic regression is easy to implement and interpret, and very efficient to train. It provides not only a measure of how relevant a predictor is (the coefficient size) but also its direction of association (positive or negative). It is very fast at classifying unknown records.
Disadv: The major limitation of logistic regression is the assumption of linearity between the log-odds of the dependent variable and the independent variables.
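A sketch on stand-in data of reading off the coefficient sizes and directions of association:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

logreg = LogisticRegression(max_iter=1000).fit(X, y)

# The sign of each coefficient gives the direction of association with
# the positive class; the magnitude is a rough measure of influence
# (comparable only if the features are on the same scale).
for i, coef in enumerate(logreg.coef_[0]):
    print(f"feature {i}: coefficient = {coef:+.3f}")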
Random Forest
Adv: Much less prone to overfitting than a single tree, since averaging many trees lowers the variance; can handle both continuous and categorical variables; requires no feature scaling; robust to outliers.
Disadv: Requires a lot of time for training, as it combines many decision trees to determine the class; the ensemble is also harder to interpret, which makes the significance of each variable less clear.
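A sketch on stand-in data; the ensemble is literally a collection of decision trees whose votes are combined:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Training cost grows with the number of trees in the ensemble; each
# tree votes and the majority class is predicted.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(len(rf.estimators_))   # 200 individual decision trees
print(rf.predict(X[:3]))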