EC6431D PATTERN RECOGNITION AND ANALYSIS
PROJECT - 1: BREAST CANCER PREDICTION USING BAYESIAN DECISION THEORY
DATE: 28/02/2022
Team Members:
GANGA MANJIMA S _ M210143EE _ IPA
NISHA SAHAI _ M210518EE _ IPA
RENU KASHYAP _ M210514EE _ IPA

AIM:
To design and develop a simple classifier using the concept of Bayesian decision theory.

NAIVE BAYES CLASSIFIER:
Naive Bayes classifiers are a collection of classification algorithms based on Bayes' theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features being classified is assumed to be independent of each other. Bayes' theorem provides a way to calculate the probability of a piece of data belonging to a given class, given our prior knowledge. Bayes' theorem is stated as

P(A|B) = P(B|A) * P(A) / P(B)

where P(A|B) is the posterior probability, P(B|A) is the likelihood, P(A) is the prior probability, and P(B) is the marginal probability.

Naive Bayes is a classification algorithm for binary (two-class) and multiclass classification problems. It is called naive Bayes (or idiot Bayes) because the calculation of the probabilities for each class is simplified to make it tractable. Rather than attempting to model how the attribute values interact, the attributes are assumed to be conditionally independent given the class value. This is a very strong assumption that is unlikely to hold in real data, i.e. that the attributes do not interact. Nevertheless, the approach performs surprisingly well even on data where this assumption does not hold.

NAIVE BAYES IMPLEMENTATION ON THE BREAST CANCER DATASET:
Worldwide, breast cancer is the most common type of cancer in women and the second highest in terms of mortality rates. Diagnosis of breast cancer is performed when an abnormal lump is found (from self-examination or X-ray) or a tiny speck of calcium is seen (on an X-ray). After a suspicious lump is found, the doctor conducts a diagnosis to determine whether it is cancerous and, if so, whether it has spread to other parts of the body. This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.

Features of this dataset are:
1. mean_radius
2. mean_perimeter
3. mean_texture
4. mean_area
5. mean_smoothness

Target variable: diagnosis. Classes: 1 for testing positive, 0 for testing negative.

CODE EXPLANATION:

STEP 1: IMPORTING REQUIRED LIBRARIES AND THE DATASET CSV FILE
As a first step, we import the required libraries and use a CSV file, the breast cancer prediction dataset created for "AI FOR SOCIAL GOOD: women coders bootcamp", obtained from "https://www.kaggle.com/merishnasuwal/breast-cancer-prediction-dataset".

STEP 2: DISTRIBUTION OF CLASS 0 AND CLASS 1 IN THE DATAFILE
From the dataset, we can see that our target variable is diagnosis. To see how the target variable is distributed, we plot a histogram. From the plot it is clear that positive cases outnumber negative cases, so we can observe that the dataset is imbalanced.

STEP 3: CHECKING WHETHER THE FEATURES ARE INDEPENDENT
Naive Bayes assumes that the features are independent of each other given the class (this assumption is what makes it "naive"), so before applying the classifier we check how strongly the features are correlated. We plot a heatmap: it is a matrix in which each cell contains a correlation score between a pair of features. Here we have used the Pearson correlation score, and the heatmap is plotted using the sns (seaborn) library.
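As an illustration of Steps 1 to 3, a minimal sketch of the loading and plotting code is given below. This is not the project's original code; the file name Breast_cancer_data.csv, the column name diagnosis, and the use of pandas/seaborn/matplotlib are assumptions based on the Kaggle dataset and the sns library mentioned above.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Step 1: load the dataset (file name assumed from the Kaggle page)
df = pd.read_csv("Breast_cancer_data.csv")
print(df.head(10))

# Step 2: distribution of class 0 and class 1 in the target variable
sns.countplot(x="diagnosis", data=df)
plt.title("Distribution of diagnosis (1 = positive, 0 = negative)")
plt.show()

# Step 3: Pearson correlation heatmap between the features
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.title("Pearson correlation heatmap")
plt.show()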
From the heatmap we can see that mean_radius, mean_area and mean_perimeter are strongly positively correlated with each other, so we need to keep only one of them; here we have taken mean_radius. Our data now contains only mean_texture, mean_radius, mean_smoothness and the target variable. After removing the dependent features, the remaining features are independent of each other, and we can view 10 rows of the datafile after this extraction.

Next we plot histograms of our features, because we want to fit a known distribution to them, and for that we have to check whether the empirical distributions mimic the known distribution. Here we can see the distributions of the features mean_radius, mean_smoothness and mean_texture. From the plots it is clear that we can use the normal distribution for calculating the posterior probability. However, the data are continuous, and naive Bayes based on raw frequency counts does not give good results for continuous data, so we tackle the continuous data using two approaches.

STEP 4: CALCULATING THE PROBABILITIES P(Y=y) FOR ALL POSSIBLE y

Gaussian Probability Density Function:
Calculating the probability or likelihood of observing a given real value like X1 is difficult. One way to do this is to assume that the X1 values are drawn from a distribution, such as a bell curve or Gaussian distribution. A Gaussian distribution can be summarized using only two numbers: the mean and the standard deviation. Therefore, with a little math, we can estimate the probability of a given value. This piece of math is called the Gaussian probability density function (Gaussian PDF) and is calculated as:

f(x) = (1 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu)^2 / (2 * sigma^2))

where mu is the mean and sigma is the standard deviation of the variable, estimated from the training data.

Class Probabilities:
Now it is time to use the statistics calculated from our training data to calculate probabilities for new data. Probabilities are calculated separately for each class: we first calculate the probability that a new piece of data belongs to the first class, then the probability that it belongs to the second class, and so on for all the classes. The probability that a piece of data belongs to a class is calculated as:

P(class|data) = P(X|class) * P(class)

Note that this is different from Bayes' theorem described above: the division by the marginal probability has been removed to simplify the calculation. This means the result is no longer strictly a probability of the data belonging to a class, but the value is still maximized, and the class giving the largest value is taken as the prediction. This is a common implementation simplification, since we are usually more interested in the class prediction than in the probability itself. The input variables are treated separately, giving the technique its name "naive". For example, with two input variables X1 and X2, the probability that a row belongs to the first class (class 0) is calculated as:

P(class=0|X1, X2) = P(X1|class=0) * P(X2|class=0) * P(class=0)

Now we can see why we need to separate the data by class value: the Gaussian probability density function from the previous step is how we calculate the probability of a real value like X1, and the per-class statistics we prepared are used in this calculation. A function named calculate_class_probabilities() ties all of this together. It takes a set of prepared summaries and a new row as input arguments. First, the total number of training records is calculated from the counts stored in the summary statistics. This is used to calculate the probability of a given class, P(class), as the ratio of rows with that class to all rows in the training data. Next, probabilities are calculated for each input value in the row using the Gaussian probability density function and the statistics for that column and class, and the probabilities are multiplied together as they accumulate. This process is repeated for each class in the dataset. Finally, a dictionary of probabilities is returned with one entry per class.
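The report describes calculate_class_probabilities() without reproducing it, so the following is a minimal sketch consistent with the description above. The summary layout (one (mean, stdev, count) tuple per input column for each class) and the helper name calculate_probability are assumptions for illustration.

from math import sqrt, exp, pi

def calculate_probability(x, mean, stdev):
    # Gaussian probability density function for a single value
    exponent = exp(-((x - mean) ** 2) / (2 * stdev ** 2))
    return (1 / (sqrt(2 * pi) * stdev)) * exponent

def calculate_class_probabilities(summaries, row):
    # summaries: {class_value: [(mean, stdev, count) for each input column]}
    # row: list of input values for one new sample
    total_rows = sum(summaries[label][0][2] for label in summaries)
    probabilities = {}
    for class_value, class_summaries in summaries.items():
        # P(class) estimated as the fraction of training rows with this class
        probabilities[class_value] = class_summaries[0][2] / float(total_rows)
        for i, (mean, stdev, _count) in enumerate(class_summaries):
            # multiply in P(X_i | class) from the Gaussian PDF
            probabilities[class_value] *= calculate_probability(row[i], mean, stdev)
    return probabilities

The class whose entry in the returned dictionary has the largest value is taken as the prediction.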
STEP 5: APPROACH 1: CALCULATE P(X=x|Y=y) USING A GAUSSIAN DISTRIBUTION
Here we first calculate the prior probabilities and the likelihood. Since our features are independent of each other, the likelihood is simply the product of the individual feature probabilities.
# Function to calculate prior probabilities and likelihood
# Function for naive Bayes with a Gaussian distribution:
Here we calculate the posterior probability for every class and output the class with the maximum probability.

TESTING OF THE MODEL:
For testing the model, we split the dataset into training data and test data.

RESULT FOR APPROACH 1:
Here we obtained an F1 score of 97.37% with zero false negatives, so our model gives very good results.

APPROACH 2: CALCULATE P(X=x|Y=y) CATEGORICALLY
# Function for converting continuous features to categorical features
Here we convert the continuous features into categorical features. We use bins = 3, which divides each parameter into 3 labels, here [0, 1, 2].
# Function for calculating the likelihood categorically
There is no need to extract the features again; here we calculate the likelihood of each category separately.
# Function for naive Bayes with a categorical distribution:
Here we calculate P(X=x1|Y=y) * P(X=x2|Y=y) * ... * P(X=xn|Y=y) * P(Y=y) for every y and find the maximum. Computing the posterior probability for categorical naive Bayes differs from the previous approach only in that, while calculating the likelihood, we multiply the conditional probabilities returned by the categorical likelihood function.

TEST FOR THE CATEGORICAL MODEL:
Here the F1 score of the model is 95%, so the score has decreased compared with the previous model. The number of false positives decreased, but the number of false negatives increased. This is because the categories are equally spaced, which is not the optimal binning, so the previous model is the better choice if we restrict ourselves to equal-width categories.

RESULT:
F1 score = 97% (for naive Bayes with the Gaussian distribution)
F1 score = 95% (for naive Bayes with the categorical distribution)

CONCLUSION:
In the code, the true outcome is known, i.e. whether the patient has breast cancer or not, and it is predicted correctly for almost all test samples by both our naive Bayes Gaussian distribution model and our naive Bayes categorical model. Thus, we have successfully classified our dataset into classes using Bayesian decision theory.
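For reference, a minimal sketch of the categorical variant described in Approach 2 is given below. This is not the project's original code; it assumes pandas is available, that the features have already been reduced to mean_radius, mean_texture and mean_smoothness, and that equal-width binning with pd.cut mirrors the bins = 3 step described above.

import pandas as pd

def to_categorical(df, feature_cols, bins=3):
    # Convert each continuous feature into equal-width bins labelled 0, 1, 2
    out = df.copy()
    for col in feature_cols:
        out[col] = pd.cut(out[col], bins=bins, labels=list(range(bins))).astype(int)
    return out

def predict_categorical(train, feature_cols, target_col, row):
    # Categorical naive Bayes: pick the class y maximizing
    # P(x1|y) * ... * P(xn|y) * P(y), with probabilities from category counts
    best_class, best_score = None, -1.0
    for y, group in train.groupby(target_col):
        score = len(group) / len(train)                      # prior P(Y=y)
        for col in feature_cols:
            counts = group[col].value_counts()
            score *= counts.get(row[col], 0) / len(group)    # P(X=x|Y=y)
        if score > best_score:
            best_class, best_score = y, score
    return best_class

The F1 score reported for each approach can then be computed on the held-out test rows, for example with sklearn.metrics.f1_score.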