EC6431D PATTERN RECOGNITION AND ANALYSIS
PROJECT -1
BREAST CANCER PREDICTION USING BAYESIAN
DECISION THEORY
DATE: 28/02/2022
Team Members: GANGA MANJIMA S _ M210143EE _ IPA
NISHA SAHAI _ M210518EE _ IPA
RENU KASHYAP _ M210514EE _ IPA
AIM:
To design and develop a simple classifier using the concept of Bayesian decision theory.
NAIVE BAYES CLASSIFIER:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' Theorem. It
is not a single algorithm but a family of algorithms that share a common principle: every feature
being classified is assumed to be independent of every other feature, given the class.
Bayes' theorem provides a way to calculate the probability of a piece of data belonging
to a given class, given our prior knowledge. Bayes' Theorem is stated as

P(A|B) = P(B|A) * P(A) / P(B)

where
P(A|B) = posterior probability,
P(B|A) = likelihood,
P(A) = prior probability,
P(B) = marginal probability (evidence).
Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification
problems. It is called Naive Bayes or idiot Bayes because the calculations of the probabilities for
each class are simplified to make their calculations tractable. Rather than attempting to calculate
the probabilities of each attribute value, they are assumed to be conditionally independent given
the class value.
This is a very strong assumption that is most unlikely in real data, i.e. that the attributes do not
interact. Nevertheless, the approach performs surprisingly well on data where this assumption
does not hold.
NAIVE BAYES IMPLEMENTATION ON THE BREAST CANCER DATASET:
Worldwide, breast cancer is the most common type of cancer in women and the second highest in
terms of mortality rates. A diagnosis of breast cancer is made when an abnormal lump is found
(from self-examination or an x-ray) or a tiny speck of calcium is seen (on an x-ray). After a suspicious
lump is found, the doctor conducts a diagnosis to determine whether it is cancerous and, if so,
whether it has spread to other parts of the body.
This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison from
Dr. William H. Wolberg.
Features of this dataset are:
1. mean_radius
2. mean_perimeter
3. mean_texture
4. mean_smoothness
5. diagnosis (the target variable)
Classes: 1 - tests positive
0 - tests negative
CODE EXPLANATION:
STEP 1
Importing required libraries and dataset csv file
In the first step, we import the required libraries and load the CSV file of the Breast Cancer
Prediction Dataset, which was created for the "AI for Social Good: Women Coders' Bootcamp" and is
available at https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset.
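A minimal sketch of this step (the exact CSV file name is an assumption; it depends on how the downloaded Kaggle file is saved):

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the breast cancer prediction dataset from the downloaded CSV file
df = pd.read_csv("Breast_cancer_data.csv")
print(df.head())
print(df.shape)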
STEP 2
DISTRIBUTION OF CLASS 0 AND CLASS 1 IN DATAFILE
From the dataset we can see that our target variable is diagnosis. To understand how the target
variable (diagnosis) is distributed, we plot a histogram. From the plot it is clear that there are
more positive cases than negative cases, so we can observe that the dataset is imbalanced.
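One possible way to plot this distribution with seaborn (the column name diagnosis is taken from the dataset description above):

# Count plot of the target variable "diagnosis" (1 = positive, 0 = negative)
sns.countplot(x="diagnosis", data=df)
plt.title("Distribution of class 0 and class 1")
plt.show()

# Numeric counts of each class, to confirm the imbalance
print(df["diagnosis"].value_counts())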
STEP 3
CHECKING WHETHER THE FEATURES ARE INDEPENDENT
To apply a naive Bayes classifier, the features must be independent of each other, because naive
Bayes assumes conditional independence of the features; that is why it is called "naive" Bayes.
To check this, we plot a heatmap: a matrix in which each cell contains a correlation score (here
the Pearson correlation), plotted using the sns (seaborn) library.
From the heatmap we can see that mean_radius, mean_area and mean_perimeter are strongly positively
correlated, so we can keep just one of them; here we keep mean_radius.
Now our data contains only mean_texture, mean_radius, mean_smoothness and the target variable.
After removing the dependent (correlated) features, the remaining features are approximately
independent of each other, and we can display the first 10 rows of the reduced datafile.
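A sketch of this step, assuming the dataset is loaded as the DataFrame df with the column names mentioned above:

# Pearson correlation matrix of all columns, visualised as a heatmap
sns.heatmap(df.corr(method="pearson"), annot=True, cmap="coolwarm")
plt.show()

# mean_radius, mean_perimeter and mean_area are strongly positively correlated,
# so we keep mean_radius and drop the other two
df = df.drop(columns=["mean_perimeter", "mean_area"])
print(df.head(10))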
Then we plot histograms of our features, because we want to fit an approximating distribution:
before fitting a known distribution to our dataset, we have to check whether the empirical
distribution of each feature resembles that known distribution.
Here we can see the distributions of our features mean_radius, mean_smoothness and mean_texture.
From the plots it is clear that we can use a normal (Gaussian) distribution for calculating the
posterior probability. However, the data here are continuous, and naive Bayes cannot use raw
continuous values directly, so to tackle the continuous data we use two approaches, described
below.
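A sketch of plotting the per-feature distributions (assumes seaborn 0.11+ for histplot):

# Histogram of each remaining feature, split by class, to check whether a
# normal (Gaussian) distribution is a reasonable approximation
for col in ["mean_radius", "mean_texture", "mean_smoothness"]:
    sns.histplot(data=df, x=col, hue="diagnosis", kde=True)
    plt.title(col)
    plt.show()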
STEP 4
CALCULATING PROBABILITIES P(Y=y) FOR ALL POSSIBLE y:
Gaussian Probability Density Function:
Calculating the probability or likelihood of observing a given real value such as X1 is difficult.
One way we can do this is to assume that the X1 values are drawn from a distribution, such as a
bell curve or Gaussian distribution. A Gaussian distribution can be summarised using only two
numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate
the probability of a given value. This piece of math is called the Gaussian probability density
function (or Gaussian PDF) and can be calculated as

f(x) = (1 / (sqrt(2π) * σ)) * exp(−(x − μ)² / (2σ²))

where μ is the mean and σ is the standard deviation of the values of that feature.
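The Gaussian PDF above can be written as a small helper function; this is an illustrative sketch, not the authors' exact code:

import math

def gaussian_pdf(x, mean, stdev):
    # Gaussian probability density of x, given the mean and standard deviation
    exponent = math.exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1.0 / (math.sqrt(2 * math.pi) * stdev)) * exponent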
Class Probabilities
Now it is time to use the statistics calculated from our training data to calculate probabilities for
new data. Probabilities are calculated separately for each class. This means that we first calculate
the probability that a new piece of data belongs to the first class, then calculate probabilities that
it belongs to the second class, and so on for all the classes. The probability that a piece of data
belongs to a class is calculated as follows:
P(class|data) = P(X|class) * P(class)
We may note that this is different from the Bayes theorem described above. The division by the
marginal probability P(data) has been removed to simplify the calculation. This means that the
result is no longer strictly a probability of the data belonging to a class. The value is still
maximised, however, meaning that the class whose calculation produces the largest value is taken
as the prediction. This is a common implementation simplification, as we are often more interested
in the class prediction than in the probability itself. The input variables are treated separately,
giving the technique its name, "naive". For the above example, where we have two input variables,
the probability that a row belongs to the first class, 0, can be calculated as:
P (class=0|X1, X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)
Now we can see why we need to separate the data by class value. The Gaussian probability density
function from the previous step is how we calculate the probability of a real value such as X1,
and the statistics we prepared are used in this calculation.

Below is a function named calculate_class_probabilities() that ties all of this together. It takes
a set of prepared summaries and a new row as input arguments. First, the total number of training
records is calculated from the counts stored in the summary statistics. This is used to calculate
the probability of a given class, P(class), as the ratio of rows with that class to all rows in
the training data. Next, probabilities are calculated for each input value in the row, using the
Gaussian probability density function and the statistics for that column and that class.
Probabilities are multiplied together as they accumulate. This process is repeated for each class
in the dataset. Finally, a dictionary of probabilities is returned, with one entry for each class.
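A sketch of such a calculate_class_probabilities() function, assuming the summaries are stored as a dictionary mapping each class value to a list of (mean, stdev, count) tuples, one per input column, and using the gaussian_pdf() helper sketched earlier:

def calculate_class_probabilities(summaries, row):
    # summaries: {class_value: [(mean, stdev, count) for each input column]}
    # row: list of input values for one new record
    total_rows = sum(summaries[label][0][2] for label in summaries)
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        # P(class) estimated as the fraction of training rows with this class
        probabilities[class_value] = class_summaries[0][2] / float(total_rows)
        for i, (mean, stdev, _count) in enumerate(class_summaries):
            # multiply in P(x_i | class) from the Gaussian PDF defined above
            probabilities[class_value] *= gaussian_pdf(row[i], mean, stdev)
    return probabilities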
STEP 5:
Approach 1: calculate P(X=x | Y=y) using a Gaussian distribution
Here we first calculate the prior probabilities and the likelihood. Since our features are
independent of each other, calculating the likelihood only requires the product of the individual
feature probabilities.
#Function to calculate prior probabilities and likelihood
#function for naive bayes gaussian distribution:
Here we calculate the posterior probability for all the classes and output the particular class
that has the maximum posterior probability.
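A sketch of what the Gaussian naive Bayes prediction function might look like on the reduced DataFrame (the function name and signature are illustrative; scipy's norm.pdf is used for the Gaussian likelihood):

import numpy as np
from scipy.stats import norm

def naive_bayes_gaussian(train, X_test, features, target="diagnosis"):
    classes = sorted(train[target].unique())

    # prior probabilities P(Y=y) for every class
    priors = [len(train[train[target] == c]) / len(train) for c in classes]

    # per-class mean and standard deviation of every feature
    stats = {c: {f: (train.loc[train[target] == c, f].mean(),
                     train.loc[train[target] == c, f].std())
                 for f in features}
             for c in classes}

    predictions = []
    for _, row in X_test.iterrows():
        posteriors = []
        for idx, c in enumerate(classes):
            likelihood = 1.0
            for f in features:
                mean, std = stats[c][f]
                # Gaussian likelihood P(X=x | Y=y) for this feature
                likelihood *= norm.pdf(row[f], mean, std)
            # unnormalised posterior: likelihood * prior
            posteriors.append(likelihood * priors[idx])
        # predict the class with the maximum posterior
        predictions.append(classes[int(np.argmax(posteriors))])
    return np.array(predictions)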
Testing of model
For testing the model, we have to split the dataset into train data and test data.
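A possible way to split the data and evaluate the model (the split ratio and random seed are assumptions):

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, f1_score

features = ["mean_radius", "mean_texture", "mean_smoothness"]

# hold out a portion of the data for testing
train, test = train_test_split(df, test_size=0.2, random_state=41)

y_pred = naive_bayes_gaussian(train, test[features], features, target="diagnosis")
print(confusion_matrix(test["diagnosis"], y_pred))
print(f1_score(test["diagnosis"], y_pred))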
RESULT FOR THE APPROACH 1 :
Here we got an F1 score of 97.37% with zero false negatives, so our model gives a very good
result.
APPROACH 2: CALCULATE P(X=x | Y=y) CATEGORICALLY
#function for converting continuous features to categorical features
Here we convert the continuous features into categorical features. We set bins = 3, which divides
each feature into 3 labelled categories, here [0, 1, 2].
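A sketch of this conversion using pandas.cut with 3 equal-width bins (column names as above):

# Convert each continuous feature into 3 equal-width bins labelled 0, 1 and 2
for col in ["mean_radius", "mean_texture", "mean_smoothness"]:
    df[col] = pd.cut(df[col], bins=3, labels=[0, 1, 2])

# after binning, re-split the data into train and test before fitting the categorical model
train, test = train_test_split(df, test_size=0.2, random_state=41)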
# function for calculating the likelihood categorically
Here there is no need to extract features again; we calculate the likelihood of each category
separately.
# function for naive bayes categorical distribution:
Here we calculate P(X=x1 | Y=y) * P(X=x2 | Y=y) * ... * P(X=xn | Y=y) * P(Y=y) for every class y
and find the maximum.
Computing the posterior probability for the categorical naive Bayes differs from the previous
approach only in the way the likelihood is obtained: we multiply together the conditional
probabilities returned by the categorical likelihood function instead of Gaussian densities.
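A sketch of the categorical likelihood and the categorical naive Bayes prediction function (names and signatures are illustrative):

import numpy as np

def calculate_likelihood_categorical(train, feature, feat_val, target, class_val):
    # P(X = feat_val | Y = class_val) estimated from category counts within the class
    subset = train[train[target] == class_val]
    return len(subset[subset[feature] == feat_val]) / len(subset)

def naive_bayes_categorical(train, X_test, features, target="diagnosis"):
    classes = sorted(train[target].unique())
    priors = [len(train[train[target] == c]) / len(train) for c in classes]

    predictions = []
    for _, row in X_test.iterrows():
        posteriors = []
        for idx, c in enumerate(classes):
            likelihood = 1.0
            for f in features:
                # multiply the conditional probabilities P(X=x_i | Y=y)
                likelihood *= calculate_likelihood_categorical(train, f, row[f], target, c)
            # unnormalised posterior P(X=x1|Y=y) * ... * P(X=xn|Y=y) * P(Y=y)
            posteriors.append(likelihood * priors[idx])
        predictions.append(classes[int(np.argmax(posteriors))])
    return np.array(predictions)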
TEST FOR CATEGORICAL MODEL:
Here the F1 score of this model is 95%, so the score decreased compared to the previous model.
The number of FALSE POSITIVES decreased, but at the same time the number of FALSE NEGATIVES
increased. This is because the categories are equally spaced, which is not the optimal way of
binning; so, if we only consider equal-width categories, the previous (Gaussian) model is the
better one.
RESULT:
F1_score = 97% (for naive Bayes Gaussian)
F1_score = 95% (for naive Bayes categorical)
CONCLUSION:
In the code, the true outcome is known, i.e. whether the woman has breast cancer or not, and it
matches the predictions made by our Gaussian naive Bayes model as well as by the categorical
naive Bayes model.
Thus, we have successfully classified our dataset into classes using Bayesian decision theory.