BREAST CANCER PREDICTION USING LOGISTIC REGRESSION ALGORITHM

A Project report submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING

Submitted by

A SAI KRISHNA - 320106512001
ADADALA GURU DATTA - 320106512002
MADDINENI YAKSHITH - 320106512030

Under the guidance of

PROF. G. SASIBHUSHANA RAO
PROFESSOR

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING,
ANDHRA UNIVERSITY COLLEGE OF ENGINEERING, VISAKHAPATNAM-530003.
2023-2024

BONAFIDE CERTIFICATE

This is to certify that the project work entitled "BREAST CANCER PREDICTION USING LOGISTIC REGRESSION ALGORITHM" is a bonafide work done by A SAI KRISHNA (Regd. No.: 320106512001), ADADALA GURU DATTA (Regd. No.: 320106512002), and MADDINENI YAKSHITH (Regd. No.: 320106512030) under the esteemed guidance of PROF. G. SASIBHUSHANA RAO, submitted in partial fulfilment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in Electronics and Communication Engineering, Andhra University College of Engineering, Andhra University, Visakhapatnam, during the year 2023-2024.

PROF. G. SASIBHUSHANA RAO, Project Guide, Department of Electronics and Communication Engineering, Andhra University College of Engineering, Visakhapatnam.

PROF. P.V. SRIDEVI, Head of Department, Department of Electronics and Communication Engineering, Andhra University College of Engineering, Visakhapatnam.

ACKNOWLEDGEMENT

We would like to express our deep gratitude to our project guide PROF. G. SASIBHUSHANA RAO, Professor, Department of Electronics and Communication Engineering, AUCE, for his guidance with unsurpassed knowledge and immense encouragement. We are grateful to PROF. P.V. SRIDEVI, Head of the Department, Electronics and Communication Engineering, for providing us with the required facilities for the completion of the project work.

We are thankful to Prof. P. Rajesh Kumar, Prof. M.S. Anuradha, Dr. V. Malleswara Rao, Dr. S. Aruna, Dr. G. Rajeswara Rao, Dr. K. Chiranjeevi, and Dr. Praveen Babu Choppala of the Department of Electronics and Communication Engineering, Andhra University College of Engineering, Andhra University, Visakhapatnam, for their kind encouragement. We thank all the scholars, technical staff, and non-teaching staff of the Department of Electronics and Communication Engineering, Andhra University College of Engineering, Andhra University, Visakhapatnam, for their constant support.

Regards,
A SAI KRISHNA (Regd. No.: 320106512001)
ADADALA GURU DATTA (Regd. No.: 320106512002)
MADDINENI YAKSHITH (Regd. No.: 320106512030)

DECLARATION

We hereby declare that the project entitled "BREAST CANCER PREDICTION USING LOGISTIC REGRESSION ALGORITHM", submitted in partial fulfilment for the award of the degree of Bachelor of Technology in Electronics and Communication Engineering for the academic year 2023-2024, is the record of the bonafide work carried out by A SAI KRISHNA (Regd. No.: 320106512001), ADADALA GURU DATTA (Regd. No.: 320106512002), and MADDINENI YAKSHITH (Regd. No.: 320106512030) under the guidance of PROF. G. SASIBHUSHANA RAO, Department of Electronics and Communication Engineering, Andhra University College of Engineering, Andhra University, Visakhapatnam.

Place: Visakhapatnam
Date:

A SAI KRISHNA (Regd. No.: 320106512001)
ADADALA GURU DATTA (Regd. No.: 320106512002)
MADDINENI YAKSHITH (Regd. No.: 320106512030)

ABSTRACT

Breast cancer is a significant health concern affecting millions of women worldwide. It is a type of cancer that develops in the cells of the breast, most commonly in the ducts or lobules. It is the most common cancer among women worldwide and can also affect men, although this is rare.
Early detection plays a pivotal role in improving survival rates and treatment outcomes. In recent years, machine learning techniques have emerged as powerful tools for breast cancer prediction, aiding in the identification of high-risk individuals and informing personalized treatment strategies. Risk factors for breast cancer include age, family history of breast cancer, genetic mutations (such as BRCA1 and BRCA2), hormonal factors (early menstruation, late menopause, hormone replacement therapy), lifestyle factors (diet, physical activity, alcohol consumption), and certain medical conditions (such as dense breast tissue and previous radiation therapy).

Logistic regression is one of the most popular machine learning algorithms and comes under the supervised learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables. Because logistic regression predicts a categorical dependent variable, the outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, and so on. However, instead of giving the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1.

Machine learning is used across many domains around the world, and the healthcare industry is no exception. Machine learning can play an essential role in predicting the presence or absence of cancerous (malignant) cells. Such information, if predicted well in advance, can provide important insights to doctors, who can then adapt their diagnosis and treatment on a per-patient basis. In this work, we predict breast cancer in patients using machine learning algorithms.
In this project, the analysis of an existing classifier (K-Nearest Neighbours) and of the proposed classifier (Logistic Regression) is performed. Both are trained and validated on multiple samples of the data, and their performance is compared to determine which gives the better accuracy and predictive analysis.

CONTENTS PAGE No.

CHAPTER-1: INTRODUCTION
1.1 Introduction to Breast Cancer 1
1.2 Motivation of the Work 12
1.3 Problem Statement 12
1.4 Organisation of the Work 13

CHAPTER-2: LITERATURE SURVEY 14

CHAPTER-3: EXISTING METHODOLOGY
3.1 K-Nearest Neighbours 20
3.2 Flow Chart 21
3.3 Types of KNN 24
3.4 Advantages of KNN 24
3.5 Disadvantages of KNN 25

CHAPTER-4: WORKING OF SYSTEM
4.1 System Architecture 26
4.2 Machine Learning 28
4.3 Machine Learning Methods 29
4.4 Logistic Regression 32
4.5 Work Flow 33
4.6 Advantages of Logistic Regression over KNN 38

CHAPTER-5: EXPERIMENTAL ANALYSIS
5.1 System Configuration 40
5.1.1 Hardware Requirements 40
5.1.2 Software Requirements 40
5.2 Features Extracted 40
5.3 Performance Analysis 42
5.4 Result 43

CHAPTER-6: CONCLUSION AND FUTURE WORK
6.1 Conclusion 45
6.2 Future Scope 46

APPENDIX 47
REFERENCES 49

LIST OF FIGURES

Fig. No. Figure Description Page No.
1.1 Breast Tumour 5
1.2 Progression of Breast Cancer 6
1.3 Benign and Malignant Tumour 7
3.1 K-Nearest Neighbour 20
4.1 System Working 25
4.2 Supervised Learning 29
4.3 Unsupervised Learning 30
4.4 Semi-Supervised Learning 31
4.5 Reinforcement Learning 31
4.6 Sigmoid Function 32
4.7 Biopsy Image 34
4.8 Data Pre-Processing 37
5.1 Biopsy Image 41
5.2 Breast Cancer Output 42
5.3 Confusion Matrix 43
5.4 Performance Metrics 44

LIST OF EQUATIONS

EQUATION NO. PAGE NO.
Equation 3.1 22
Equation 3.2 23
Equation 3.3 23
Equation 3.4 23
Equation 4.1 32
Equation 4.2 35
Equation 4.3 35
Equation 4.4 35
Equation 4.5 35
Equation 4.6 35
Equation 4.7 36
Equation 4.8 36
Equation 4.9 36
Equation 4.10 37
Equation 4.11 37

LIST OF TABLES

TABLE NO. PAGE NO.
Table 5.1 42
Table 5.2 43
Table 5.3 44

CHAPTER 1
INTRODUCTION

1.1 INTRODUCTION TO BREAST CANCER

Breast cancer has become the most frequent health issue among women, especially middle-aged women. Early detection of breast cancer can help women overcome the disease, and the death rate can thereby be reduced. In the present-day scenario, mammograms are used to screen for breast cancer and are known to be the most effective scanning technique. In this work, the detection of cancer cells is performed using machine learning techniques.

The exact cause of breast cancer is not fully understood, but it is believed to result from a combination of genetic, hormonal, environmental, and lifestyle factors. Mutations in specific genes, such as BRCA1 and BRCA2, are associated with an increased risk of developing breast cancer. Hormonal factors, such as estrogen and progesterone, also play a role, as evidenced by the higher incidence of breast cancer in women with early onset of menstruation, late menopause, or hormone replacement therapy. Additionally, lifestyle factors such as alcohol consumption, obesity, and lack of physical activity have been linked to an increased risk of breast cancer.

Breast cancer typically begins in the cells lining the milk ducts (ductal carcinoma) or the lobules (lobular carcinoma) of the breast. Over time, the abnormal cells may invade surrounding breast tissue and metastasize to other body parts, such as the lymph nodes, bones, liver, or lungs. The exact mechanisms underlying cancer progression vary depending on the subtype of breast cancer, which includes hormone receptor-positive, HER2-positive, and triple-negative breast cancer.
Studies say that over 1,70,000 new breast cancer cases were likely to develop in India by 2020. According to research, 1 in every 28 women is likely to be affected by the disease. While breast cancer occurs almost entirely in women, around 1-2% of cases occur in men.

1.1.1 OVERVIEW

Breast cancer is a disease in which abnormal breast cells grow out of control and form tumours. If left unchecked, the tumours can spread throughout the body and become fatal. Breast cancer cells begin inside the milk ducts and the milk-producing lobules of the breast. The earliest form (in situ) is not life-threatening and can be detected in its early stages. Cancer cells can spread into nearby breast tissue (invasion), creating tumours that cause lumps or thickening. Invasive cancers can spread to nearby lymph nodes or other organs (metastasize). Metastasis can be life-threatening and fatal. Treatment is based on the person, the type of cancer, and its spread, and combines surgery, radiation therapy, and medications.

1.1.2 SCOPE OF THE WORK

Female gender is the strongest breast cancer risk factor. Approximately 99% of breast cancers occur in women, and 0.5-1% of breast cancers occur in men. The treatment of breast cancer in men follows the same principles of management as for women. Certain factors increase the risk of breast cancer, including increasing age, obesity, harmful use of alcohol, family history of breast cancer, history of radiation exposure, reproductive history (such as the age at which menstrual periods began and age at first pregnancy), tobacco use, and postmenopausal hormone therapy. Approximately half of breast cancers develop in women who have no identifiable breast cancer risk factor other than gender (female) and age (over 40 years). A family history of breast cancer increases the risk of breast cancer, but most women diagnosed with breast cancer do not have a known family history of the disease.
Lack of a known family history does not necessarily mean that a woman is at reduced risk. Certain inherited high-penetrance gene mutations greatly increase breast cancer risk, the most dominant being mutations in the genes BRCA1, BRCA2, and PALB2. Women found to have mutations in these major genes may consider risk reduction strategies such as surgical removal of both breasts or chemoprevention strategies.

In 2022, there were 2.3 million women diagnosed with breast cancer and 670,000 deaths globally. Breast cancer occurs in every country of the world, in women at any age after puberty, but with increasing rates in later life. Global estimates reveal striking inequities in the breast cancer burden according to human development. For instance, in countries with a very high Human Development Index (HDI), 1 in 12 women will be diagnosed with breast cancer in their lifetime and 1 in 71 women will die of it. In contrast, in countries with a low HDI, while only 1 in 27 women is diagnosed with breast cancer in her lifetime, 1 in 48 women will die from it.

1.1.3 TYPES OF BREAST CANCER

Breast cancer can be invasive or non-invasive. Invasive breast cancer spreads into surrounding tissues and/or distant organs; non-invasive breast cancer does not go beyond the milk ducts or lobules in the breast. About 80% of breast cancers are invasive, and about 20% are non-invasive. There are multiple types of breast cancer, which are classified based on how they look under a microscope.

A. Classification based on microscopic view:

• Ductal carcinoma in situ (DCIS). This is a non-invasive cancer (stage 0) that is located only in the duct and has not spread outside the duct.
• Invasive or infiltrating ductal carcinoma. This is cancer that has spread outside of the ducts. It is the most common type of invasive breast cancer.
• Invasive lobular carcinoma. This is a type of breast cancer that has spread outside of the lobules.

B. Classification based on tumour characteristics: There are 3 main types of breast cancer that are determined by doing specific tests on a sample of the tumour to determine its characteristics.

• Hormone receptor positive: Breast cancers expressing estrogen receptors (ER) and/or progesterone receptors (PR) are called "hormone receptor-positive." These receptors are proteins found in cells. Tumours that have estrogen receptors are called "ER positive," and tumours that have progesterone receptors are called "PR positive." Only 1 of these receptors needs to be positive for a cancer to be called hormone receptor-positive. This type of cancer may depend on the hormones estrogen and/or progesterone to grow. Hormone receptor-positive cancers can occur at any age, but they are more common after menopause. About two-thirds of breast cancers have estrogen and/or progesterone receptors. Cancers without these receptors are called "hormone receptor-negative." Hormone receptor-positive breast cancers are commonly treated using hormone therapy.

• HER2 positive: About 15% to 20% of breast cancers depend on the gene called "human epidermal growth factor receptor 2" (HER2) to grow. These cancers are called "HER2 positive" and have many copies of the HER2 gene or high levels of the HER2 protein. These proteins are also called "receptors." The HER2 gene makes the HER2 protein, which is found in cancer cells and is important for tumour cell growth. HER2-positive breast cancers grow more quickly. They can also be either hormone receptor-positive or hormone receptor-negative. HER2-positive early-stage breast cancers are commonly treated using HER2-targeted therapies. Cancers that have no HER2 protein are called "HER2 negative," and cancers that have low levels of the HER2 protein and/or few copies of the HER2 gene are sometimes now called "HER2 low."

• Triple negative.
If a tumour does not express ER, PR, or HER2, the tumour is called "triple negative." Triple-negative breast cancer makes up about 10% to 20% of invasive breast cancers. It seems to be more common among younger women, particularly younger Black women and Hispanic women, and is also more common in women with a mutation in the BRCA1 gene. Experts often recommend that people with triple-negative breast cancer be tested for BRCA gene mutations.

1.1.4 SYMPTOMS OF BREAST CANCER

Most people will not experience any symptoms while the cancer is still early, hence the importance of early detection. Breast cancer can present combinations of symptoms, especially when it is more advanced. Symptoms of breast cancer can include:

• a breast lump or thickening, often without pain;
• a change in the size, shape, or appearance of the breast, or dimpling, redness, pitting, or other changes in the skin;
• a change in nipple appearance or in the skin surrounding the nipple (areola);
• abnormal or bloody fluid from the nipple.

1.1.5 TYPES OF TUMOURS

A tumour is a pathologic disturbance of cell growth, characterized by excessive and abnormal proliferation of cells. Tumours are abnormal masses of tissue that may be solid or fluid-filled. When the growth of tumour cells is confined to the site of origin and the cells are of normal physicality, the tumour is classed as benign. When the cells are abnormal and can grow uncontrollably, they are cancerous cells, i.e., the tumour is malignant. Tumours are also called "neoplasms". To determine whether a tumour is benign or cancerous, a doctor can take a sample of the cells with a biopsy procedure. The biopsy is then analysed under a microscope by a pathologist (a doctor specializing in laboratory science).

Fig 1.1 Breast Tumour

1. Benign Tumours: Noncancerous

If the cells are non-cancerous, the tumour is benign. It won't invade nearby tissues or spread to other areas of the body (metastasize).
A benign tumour is less harmful unless it is present near important organs, tissues, nerves, or blood vessels and causes damage. Fibroids in the uterus and breast, polyps of the colon, and moles are some examples of benign tumours. Benign tumours can be removed by surgery. They can grow very large, sometimes weighing pounds, and they can be dangerous, such as when they occur in the brain and crowd the normal structures in the enclosed space of the skull. They can press on vital organs or block channels. Also, some types of benign tumours, such as intestinal polyps, are considered precancerous and are removed immediately to prevent them from becoming malignant. Benign tumours usually don't recur once removed, but if they do, it is usually in the same place.

2. Malignant Tumours: Cancerous

Malignant means that the tumour is made of cancer cells and can invade nearby tissues. Some cancer cells can move into the bloodstream or lymph nodes, from where they can spread to other tissues within the body; this is called metastasis. Cancer can occur anywhere in the body, including the breast, lungs, intestines, reproductive organs, blood, or skin. For example, breast cancer begins in the breast tissue and may spread to lymph nodes in the armpit if it's not caught early enough and treated. Once breast cancer has spread to the lymph nodes, the cancer cells can travel to other areas of the body, like the bones or liver. The breast cancer cells can then form tumours in those locations, referred to as secondary tumours. A biopsy of these tumours might show characteristics of the original breast cancer tumour.

Fig 1.2 Progression of Breast Cancer

Differences Between Benign and Malignant Tumours

There are many important differences between benign and malignant tumours, which are as follows:

i. On the basis of growth rate: Generally, malignant tumours grow more rapidly than benign tumours, although there are slow-growing and fast-growing tumours in either category.

ii.
On the basis of ability to invade locally: Malignant tumours tend to invade the tissues around them. One of the most prominent hallmarks of cancer is penetration of the basal membrane that surrounds normal tissues.

iii. On the basis of ability to spread to a distance: Malignant tumours may spread to other parts of the body via the bloodstream or the lymphatic system. Malignant tumours may also invade nearby tissues and send finger-like projections into them, while benign tumours do not. Benign tumours only grow in size at the place of their origin.

Fig 1.3 Benign and Malignant Tumour

iv. On the basis of recurrence: Benign tumours can be removed completely by surgery as they have clearer boundaries, and as a result they are less likely to recur; if they do recur, it is only at the original site. Malignant tumours may spread to other parts of the body and are more likely to recur, such as breast cancer recurring in the lungs or bones.

v. On the basis of cellular appearance: When a pathologist looks at tumour cells under a microscope, it is usually possible to determine whether they are normal cells, benign cells, or cancerous cells, as cancer cells often have abnormal chromosomes and DNA, making their nuclei larger and darker, and they often have different shapes than normal cells. However, sometimes the difference is subtle.

vi. On the basis of systemic effects: Some benign tumours secrete hormones, such as benign pheochromocytomas, but malignant tumours are more likely to do so. Malignant tumours can secrete substances that cause effects throughout the body, such as fatigue and weight loss; this is known as paraneoplastic syndrome.

vii. On the basis of treatment: A benign tumour can usually be completely treated with surgery, although some may be treated with radiation therapy or chemotherapy. Some benign tumours are not treated at all, as they do not pose any health risk.
Malignant tumours may require chemotherapy, radiation therapy, or immunotherapy medications to eliminate tumour cells that remain after treatment or to treat secondary tumours present in other parts of the body.

1.1.6 STAGING OF BREAST CANCER

Staging is a way of describing how extensive the breast cancer is, including the size of the tumour, whether it has spread to lymph nodes, whether it has spread to distant parts of the body, and what its biomarkers are. Staging can be done either before or after a patient undergoes surgery: staging done before surgery is called the clinical stage, and staging done after surgery is called the pathologic stage.

TNM Staging System: The most common tool that doctors use to describe the stage is the TNM system.

• Tumour (T): The primary tumour size in the breast and its biomarkers.
• Node (N): The spread of the tumour to the lymph nodes, the number of nodes involved, and their position.
• Metastasis (M): The spread of cancer to other parts of the body.

There are 5 major stages of breast cancer: stage 0 (zero), which is non-invasive ductal carcinoma in situ (DCIS), and stages I through IV (1 through 4), which are used for invasive breast cancer. Here are more details on each part of the TNM system for breast cancer:

1. Tumour (T): Using the TNM system, the "T" plus a letter or number (0 to 4) is used to describe the size and location of the tumour. Tumour size is measured in centimetres (cm).

• TX: The primary tumour cannot be evaluated.
• T0 (T zero): There is no evidence of cancer in the breast.
• Tis: Refers to carcinoma in situ. The cancer is confined within the ducts of the breast tissue and has not spread into the surrounding tissue of the breast. There are 2 types of breast carcinoma in situ:
• Tis (DCIS): DCIS is a non-invasive cancer, but if not removed, it may develop into an invasive breast cancer later.
DCIS means that cancer cells have been found in breast ducts and have not spread past the layer of tissue where they began.
• Tis (Paget's disease): Paget's disease of the nipple is a rare form of early, non-invasive cancer that is only in the skin cells of the nipple. Sometimes Paget's disease is associated with an invasive breast cancer; if there is an invasive breast cancer, it is classified according to the stage of the invasive tumour.

T1: The tumour in the breast is 20 millimetres (mm) or smaller in size at its widest area (a little less than an inch). This stage is broken into 4 substages depending on the size of the tumour:
• T1mi is a tumour that is 1 mm or smaller.
• T1a is a tumour that is larger than 1 mm but 5 mm or smaller.
• T1b is a tumour that is larger than 5 mm but 10 mm or smaller.
• T1c is a tumour that is larger than 10 mm but 20 mm or smaller.

T2: The tumour is larger than 20 mm but not larger than 50 mm.
T3: The tumour is larger than 50 mm.
T4: The tumour falls into 1 of the following groups:
• T4a means the tumour has grown into the chest wall.
• T4b is when the tumour has grown into the skin.
• T4c is cancer that has grown into both the chest wall and the skin.

2. Node (N): The "N" in the TNM staging system stands for lymph nodes. These small, bean-shaped organs help fight infection. Lymph nodes near where the cancer started are called regional lymph nodes. Regional lymph nodes include:
• lymph nodes located under the arm, called the axillary lymph nodes;
• lymph nodes located above and below the collarbone;
• lymph nodes located under the breastbone, called the internal mammary lymph nodes.

Lymph nodes in other parts of the body are called distant lymph nodes. The information below describes the staging:

1. NX: The lymph nodes were not evaluated.
2. N0: Either of the following:
• No cancer was found in the lymph nodes.
• Only areas of cancer smaller than 0.2 mm are in the lymph nodes.
3.
N1: The cancer has spread to 1 to 3 axillary lymph nodes and/or the internal mammary lymph nodes. If the cancer in the lymph node is larger than 0.2 mm but 2 mm or smaller, it is called "micrometastatic" (N1mi).
4. N2: The cancer has spread to 4 to 9 axillary lymph nodes, or it has spread to the internal mammary lymph nodes but not the axillary lymph nodes.
5. N3: The cancer has spread to 10 or more axillary lymph nodes, or it has spread to the lymph nodes located under the clavicle, or collarbone. It may also have spread to the internal mammary lymph nodes. Cancer that has spread to the lymph nodes above the clavicle, called the supraclavicular lymph nodes, is also described as N3.

If there is cancer in the lymph nodes, knowing how many lymph nodes are involved and where they are helps doctors plan treatment. The pathologist can find out the number of axillary lymph nodes that contain cancer after they are removed during surgery. It is not common to remove the supraclavicular or internal mammary lymph nodes during surgery; if there is cancer in these lymph nodes, treatment other than surgery, such as radiation therapy, chemotherapy, or hormonal therapy, is generally used.

3. Metastasis (M): The "M" in the TNM system describes whether the cancer has spread to other parts of the body, called metastasis. In that case, the disease is no longer considered early-stage or locally advanced cancer.
1. MX: Distant spread cannot be evaluated.
2. M0: There is no evidence of distant metastases.
3. M0 (i+): There is no clinical or radiographic evidence of distant metastases; however, there is microscopic evidence of tumour cells in the blood, bone marrow, or other lymph nodes that are no larger than 0.2 mm.
4. M1: There is evidence of metastasis to another part of the body, meaning breast cancer cells are growing in other organs.

Stage Groups of Breast Cancer

Doctors assign the stage of the cancer by combining the T, N, and M classifications.

1.
Stage 0: Stage zero (0) describes disease that is only in the ducts of the breast tissue and has not spread to the surrounding tissue. It is also called non-invasive or in situ cancer (Tis, N0, M0).
2. Stage IA: The tumour is small, invasive, and has not spread to the lymph nodes (T1, N0, M0).
3. Stage IB: Cancer has spread to the lymph nodes, and the cancer in the lymph node is larger than 0.2 mm but less than 2 mm in size. There is either no evidence of a tumour in the breast or the tumour in the breast is 20 mm or smaller (T0 or T1, N1mi, M0).
4. Stage IIA: Any 1 of these conditions:
a. There is no evidence of a tumour in the breast, but the cancer has spread to 1 to 3 axillary lymph nodes. It has not spread to distant parts of the body (T0, N1, M0).
b. The tumour is 20 mm or smaller and has spread to 1 to 3 axillary lymph nodes (T1, N1, M0).
c. The tumour is larger than 20 mm but not larger than 50 mm and has not spread to the axillary lymph nodes (T2, N0, M0).
5. Stage IIB: Either of these conditions:
a. The tumour is larger than 20 mm but not larger than 50 mm and has spread to 1 to 3 axillary lymph nodes (T2, N1, M0).
b. The tumour is larger than 50 mm but has not spread to the axillary lymph nodes (T3, N0, M0).
6. Stage IIIA: The tumour of any size has spread to 4 to 9 axillary lymph nodes or to internal mammary lymph nodes. It has not spread to other parts of the body (T0, T1, T2, or T3; N2; M0). Stage IIIA may also be a tumour larger than 50 mm that has spread to 1 to 3 axillary lymph nodes (T3, N1, M0).
7. Stage IIIB: The tumour has spread to the chest wall or caused swelling or ulceration of the breast. It may or may not have spread to up to 9 axillary or internal mammary lymph nodes. It has not spread to other parts of the body (T4; N0, N1, or N2; M0).
8. Stage IIIC: A tumour of any size that has spread to 10 or more axillary lymph nodes, the internal mammary lymph nodes, and/or the lymph nodes under the collarbone.
It has not spread to other parts of the body (any T, N3, M0).
9. Stage IV (metastatic): The tumour can be any size and has spread to other organs, such as the bones, lungs, brain, liver, distant lymph nodes, or chest wall (any T, any N, M1). Metastatic breast cancer that is found when the cancer is first diagnosed occurs about 6% of the time; this may be called de novo metastatic breast cancer. Most commonly, metastatic breast cancer is found after a previous diagnosis and treatment of early-stage breast cancer.

1.2 MOTIVATION OF THE WORK

The primary objective of this research is to develop a robust breast cancer prediction model to forecast the likelihood of breast cancer occurrence. Additionally, the study aims to identify the most effective classification algorithm for detecting the presence of breast cancer in patients. This is justified through a comprehensive comparative analysis of various classification algorithms, with logistic regression as the primary focus. While logistic regression is a widely used machine learning technique, accurately predicting breast cancer is a critical endeavour demanding the highest level of precision. Therefore, the algorithms undergo thorough evaluation using diverse assessment methodologies and criteria. Through this research, both researchers and medical professionals will gain valuable insights to enhance breast cancer detection and diagnosis, ultimately leading to improved patient outcomes.

1.3 PROBLEM STATEMENT

The primary challenge in breast cancer lies in its early detection. Although there are existing diagnostic tools for breast cancer prediction, they often come with significant drawbacks, such as high cost or inefficiency in accurately assessing the risk of breast cancer in individuals. Early detection of breast cancer plays a crucial role in reducing mortality rates and minimizing overall complications.
However, continuous monitoring of patients is not always feasible, and the availability of round-the-clock medical consultation is limited by the expertise and time it requires. With the abundance of data available in today's world, various machine learning algorithms offer a promising avenue for analysing data to uncover hidden patterns. These hidden patterns can serve as valuable insights for health diagnosis in medical datasets, empowering healthcare professionals to enhance breast cancer detection and provide timely interventions, ultimately improving patient outcomes.

1.4 ORGANISATION OF THE WORK

This project is structured into six chapters, each dedicated to various aspects of utilizing machine learning for breast cancer prediction. It commences with an introduction to breast cancer and its stages in the first chapter. Following this, the second chapter reviews previous research and the application of machine learning in breast cancer prediction. In the third chapter, the K-nearest neighbours method, a prevalent approach in this field, is explained. Subsequently, the fourth chapter proposes employing logistic regression alongside other machine learning techniques to enhance prediction accuracy. The fifth chapter undertakes a comparative analysis of the effectiveness of the existing and proposed methods. Finally, the project concludes by summarizing findings and suggesting future research directions in the sixth chapter. The overarching objective is to advance breast cancer prediction using machine learning techniques and contribute to ongoing efforts in cancer research.

CHAPTER 2
LITERATURE SURVEY

To develop a comprehensive understanding of image feature extraction techniques and machine learning algorithms, it is crucial to first familiarize ourselves with the existing research and advancements in these domains.
Therefore, in this section, we provide an overview of various papers and books that we have explored to gather relevant insights for our project. By examining these diverse sources, we aim to lay a solid foundation of knowledge necessary for effectively utilizing these techniques in our work.

[1] Tarini Sinha. Tumours: Benign and Malignant. Canc Therapy & Oncol Int J. 2018; 10(3): 555790. DOI:10.19080/CTOIJ.2018.10.555790

A tumour is a pathologic disturbance of cell growth, characterized by excessive and abnormal proliferation of cells. Tumours are abnormal masses of tissue which may be solid or fluid-filled. When the growth of tumour cells is confined to the site of origin and the cells appear normal, they are classed as benign tumours. When the cells are abnormal and can grow uncontrollably, they are cancerous cells, i.e. a malignant tumour. Tumours are also called 'neoplasms'. To determine whether a tumour is benign or cancerous, a doctor can take a sample of the cells with a biopsy procedure. The biopsy is then analysed under a microscope by a pathologist (a doctor specializing in laboratory science).

[2] Comparison of machine learning models for breast cancer diagnosis. IAES International Journal of Artificial Intelligence (IJ-AI), Vol. 12, No. 1, March 2023, pp. 415-421. ISSN: 2252-8938

Breast cancer is the most common cause of death among women worldwide. If breast cancer is detected early, the death rate can be reduced. Machine learning (ML) techniques are a hot topic for study and have proved influential in cancer prediction and early diagnosis. This study's objective is to predict and diagnose breast cancer using ML models and evaluate the most effective based on six criteria: specificity, sensitivity, precision, accuracy, F1-score and receiver operating characteristic curve. All work is done in the Anaconda environment, which uses Python's NumPy and SciPy numerical and scientific libraries, along with pandas and matplotlib.
This study used the Wisconsin diagnostic breast cancer dataset to test ten ML algorithms: decision tree, linear discriminant analysis, forests of randomized trees, gradient boosting, passive aggressive, logistic regression, naïve Bayes, nearest centroid, support vector machine, and perceptron. After collecting the findings, we performed a performance evaluation and compared these various classification techniques. The gradient boosting model outperformed all other algorithms, scoring 96.77% on the F1-score.

[3] Comparing Logistic Regression to the K-nearest Neighbours (KNN) technique, A Novel Pattern Discovery Based Human Activity Recognition. Research Scholar, Saveetha School of Engineering, Saveetha Institute of Medical and Technical Sciences, Saveetha University, Chennai, Tamil Nadu, India, Pincode: 602105. 10(1S) 1625-1624

The main objective of this research study is to improve accuracy for human activity recognition using data analysis techniques. Materials and Methods: A Logistic Regression model with sample size 10 and a K-nearest Neighbours (KNN) algorithm with sample size 10 were iterated at different times, predicting the accuracy percentage of human activity recognition. Results: Human activity recognition utilizing novel Logistic Regression achieved 95.52% accuracy compared with 90.52% accuracy for the K-nearest Neighbours (KNN) algorithm (p=0.42) (p=0.5). The Logistic Regression algorithm in computer vision appears to perform significantly better than the K-nearest Neighbours (KNN) algorithm.

[4] Extract the Similar Images Using the Grey Level Cooccurrence Matrix and the Hu Invariants Moments. Beshaier A. Abdulla, Yossra H. Ali, Nuha J. Ibrahim, Department of Computer Science, University of Technology, Iraq.

In recent years, many types of research have introduced different methods and techniques for a correct and reliable image retrieval system.
The goal of this paper is a comparison study between two different methods, the Grey level co-occurrence matrix and the Hu invariant moments; this study is done by building an image retrieval system employing each method separately and comparing the results. The Euclidean distance measure is used to compute the similarity between the query image and database images. Both systems are evaluated according to the measures used in the detection, description, and matching fields, which are precision, recall, and accuracy; in addition, the mean square error (MSE) and structural similarity index (SSIM) are used. As the results show, the Grey level co-occurrence matrix (GLCM) had outstanding and better results than the Hu invariant moments method.

[5] Detection and Classification of Cancer from Microscopic Biopsy Images Using Clinically Significant and Biologically Interpretable Features. Rajesh Kumar, Rajeev Srivastava, and Subodh Srivastava, Department of Computer Science and Engineering, Indian Institute of Technology (Banaras Hindu University), Varanasi 221005, India.

This study proposes an automated framework for cancer detection from microscopic biopsy images. It includes image enhancement, background cell segmentation, feature extraction, and classification stages. Selected methods, like Contrast Limited Adaptive Histogram Equalization for image enhancement and the k-means segmentation algorithm for background cell segmentation, were chosen based on comparative analysis. Various clinically significant features are extracted, including gray-level texture, colour-based, and wavelet features. Classification into normal and cancerous categories is done using the K-nearest Neighbour method. The framework's performance is evaluated on 1000 biopsy images representing four tissue types.
[6] VolcAshDB: a Volcanic Ash Database of classified particle images and features. Damià Benet, Fidel Costa, Christina Widiwijayanti, John Pallister, Gabriela Pedreros, Patrick Allard, Hanik Humaida, Yosuke Aoki, Fukashi Maeno.

Volcanic ash is a valuable source of information for understanding volcanic activity, but classifying ash particles is challenging due to varying observations and a lack of standardized methodologies. To address this, we developed the Volcanic Ash Database (VolcAshDB), containing over 6,300 high-resolution images of ash particles and quantitative features. Each particle is classified into one of four main categories: free crystal, altered material, lithic, and juvenile. VolcAshDB is publicly available and facilitates comparative studies and machine learning model training to automate particle classification and reduce observer biases.

[7] Gray level co-occurrence matrix (GLCM) texture-based crop classification using low altitude remote sensing platforms. Naveed Iqbal, Rafia Mumtaz, Uferah Shafi, and Syed Mohammad Hassan Zaidi.

For many computer vision and machine learning problems, large training sets are key for good performance. However, the most computationally expensive part of many computer vision and machine learning algorithms consists of finding nearest-neighbour matches to high-dimensional vectors that represent the training data. We propose new algorithms for approximate nearest-neighbour matching and evaluate and compare them with previous algorithms. For matching high-dimensional features, we find two algorithms to be the most efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority search k-means tree. We also propose a new algorithm for matching binary features by searching multiple hierarchical clustering trees and show it outperforms methods typically used in the literature.
We show that the optimal nearest-neighbour algorithm and its parameters depend on the dataset characteristics and describe an automated configuration procedure for finding the best algorithm to search a particular dataset. In order to scale to very large datasets that would otherwise not fit in the memory of a single machine, we propose a distributed nearest-neighbour matching framework that can be used with any of the algorithms described in the paper. All this research has been released as an open-source library called the Fast Library for Approximate Nearest Neighbours (FLANN), which has been incorporated into OpenCV and is now one of the most popular libraries for nearest-neighbour matching.

[8] Logistic regression in data analysis: An overview. International Journal of Data Analysis Techniques and Strategies, 3(3):281-299, July 2011. DOI:10.1504/IJDATS.2011.041335

Logistic regression (LR) continues to be one of the most widely used methods in data mining in general and binary data classification in particular. This paper is focused on providing an overview of the most important aspects of LR when used in data analysis, specifically from an algorithmic and machine learning perspective, and how LR can be applied to imbalanced and rare events data.

[9] Machine Learning: Algorithms, Real-World Applications and Research Directions. Iqbal H. Sarker. Received: 27 January 2021 / Accepted: 12 March 2021 / Published online: 22 March 2021.

In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data, business data, social media data, health data, etc. To intelligently analyse these data and develop the corresponding smart and automated applications, the knowledge of artificial intelligence (AI), particularly machine learning (ML), is the key.
Various types of machine learning algorithms, such as supervised, unsupervised, semi-supervised, and reinforcement learning, exist in the area. Besides, deep learning, which is part of a broader family of machine learning methods, can intelligently analyse data on a large scale. In this paper, we present a comprehensive view of these machine learning algorithms that can be applied to enhance the intelligence and the capabilities of an application. Thus, this study's key contribution is explaining the principles of different machine learning techniques and their applicability in various real-world application domains, such as cybersecurity systems, smart cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges and potential research directions based on our study. Overall, this paper aims to serve as a reference point for both academia and industry professionals, as well as for decision-makers in various real-world situations and application areas, particularly from the technical point of view.

[10] High-Level K-Nearest Neighbours (HLKNN): A Supervised Machine Learning Model for Classification Analysis. Elife Ozturk Kiyak, Bita Ghasemkhani and Derya Birant. Independent Researcher, Izmir 35390, Turkey; elife.ozturk@cs.deu.edu.tr; Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390, Turkey.

The k-nearest neighbours (KNN) algorithm has been widely used for classification analysis in machine learning. However, it suffers from noise samples that reduce its classification ability and therefore its prediction accuracy. This article introduces the high-level k-nearest neighbours (HLKNN) method, a new technique for enhancing the k-nearest neighbours algorithm, which can effectively address the noise problem and contribute to improving the classification performance of KNN.
Instead of only considering the k neighbours of a given query instance, it also takes into account the neighbours of these neighbours. Experiments were conducted on 32 well-known datasets. The results showed that the proposed HLKNN method outperformed the standard KNN method, with average accuracy values of 81.01% and 79.76%, respectively. In addition, the experiments demonstrated the superiority of HLKNN over previous KNN variants in terms of the accuracy metric on various datasets, showcasing its effectiveness in improving classification accuracy.

Summary:

The presented research papers offer a comprehensive exploration of machine learning and computer vision applications across diverse domains. Central to these studies is the pivotal role of logistic regression and the Grey Level Co-occurrence Matrix (GLCM) in advancing classification and image analysis tasks. Beginning with breast cancer prediction and diagnosis, the research underscores the significance of leveraging machine learning techniques, with logistic regression emerging as a potent tool for accurate classification. Subsequent investigations delve into human activity recognition, where logistic regression showcases its superiority over K-nearest neighbours (KNN), highlighting its efficacy in computer vision tasks. Additionally, GLCM emerges as a key player in image retrieval methods. Notably, in the context of automated cancer detection from microscopic biopsy images, GLCM's utilization alongside techniques such as image enhancement, segmentation, and classification underscores its importance in achieving accurate diagnoses.
These findings underscore the pivotal roles of logistic regression and GLCM in driving advancements across machine learning and computer vision domains, ultimately contributing to the resolution of complex challenges in extracting image features.

CHAPTER 3
EXISTING METHODOLOGY

3.1 K-NEAREST NEIGHBOURS:

Using the K-Nearest Neighbours (KNN) algorithm for breast cancer prediction involves collecting a dataset comprising relevant features such as tumour characteristics and patient data, with corresponding labels indicating benign or malignant status. After preprocessing the data, including feature selection and normalization, the KNN model is trained by storing the entire dataset. During prediction, the algorithm calculates the distance between a new, unlabelled data point and all other points in the training set, identifying the k nearest neighbours based on distance metrics such as Euclidean distance.

The k-nearest neighbours (KNN) algorithm is a non-parametric, supervised learning classifier which uses proximity to make classifications or predictions about the grouping of an individual data point. It is one of the most popular and simplest classifiers used in machine learning today. While the KNN algorithm can be used for either regression or classification problems, it is typically used as a classification algorithm. Here's a basic overview of how KNN works:

1. Training Phase: In the training phase, KNN simply memorizes the features and labels of the training dataset. No explicit model is built.

2. Prediction Phase: When a new data point is provided, KNN calculates the distance between that point and every other point in the training dataset. The most common distance metric used is Euclidean distance, but other metrics like Manhattan distance or cosine similarity can also be used.

3. Choosing K: K represents the number of nearest neighbours to consider.
A small value of K can make the model sensitive to noise, while a large value of K can make the model overly generalized. Choosing the right value of K is crucial and often involves experimentation and cross-validation.

4. Classification: For classification tasks, KNN takes a majority vote among the K nearest neighbours and assigns the class label that is most common among them to the new data point.

3.2 FLOW CHART

[Flow chart: Data → Training data / Testing data → Choosing K → KNN algorithm → KNN model → Classification]

1. Data: To train a K-Nearest Neighbours (KNN) model, you'll need a dataset consisting of features and their corresponding labels. Here's a breakdown of the data you'll need:

Features: These are the characteristics or attributes used to predict the label. Each feature should be represented as a numerical value or a value that can be converted into a numerical representation. For example, if you're predicting breast cancer, features might include tumour size, age of the patient, number of positive lymph nodes, etc.

Labels: These are the outcomes or classes that you want to predict based on the features. Labels should be categorical values, such as "benign" or "malignant" for breast cancer prediction.

2. Splitting of data: Splitting the data into training and testing sets is a fundamental step in machine learning model development.

A. Training Set: The training set is used to train the machine learning model. It consists of a portion of the data (typically the majority) and includes both the features (input variables) and their corresponding labels (output variables). During training, the model learns the patterns and relationships between the features and labels in the training data.

B. Testing Set: The testing set is used to evaluate the performance of the trained model. It consists of a separate portion of the data that was not used during training. The testing set allows us to assess how well the trained model generalizes to new, unseen data.
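The KNN steps described above (store the training data, compute distances to a query point, and take a majority vote among the k nearest neighbours) can be sketched in a few lines of Python. This is a minimal illustration on made-up numbers, not the project's actual implementation; the feature values, labels, and choice of k = 3 are all hypothetical.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors (Euclidean metric)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # 1. Compute the distance from the query to every stored training point
    dists = sorted(zip((euclidean(x, query) for x in train_X), train_y))
    # 2. Keep the k nearest neighbours and 3. majority-vote their labels
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

# Toy dataset: [tumour size, patient age] with hypothetical labels
train_X = [[1.0, 40], [1.2, 45], [3.5, 60], [4.0, 65], [1.1, 50], [3.8, 55]]
train_y = ["benign", "benign", "malignant", "malignant", "benign", "malignant"]

print(knn_predict(train_X, train_y, [3.9, 62], k=3))  # prints "malignant"
```

Note that, as discussed above, real features should be normalized before computing distances, since an unscaled feature with a large range (here, age) would otherwise dominate the metric.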
By evaluating the model on data it hasn't seen before, we can estimate its performance in real-world scenarios.

Purpose of Splitting: Splitting the data into training and testing sets helps prevent overfitting, which occurs when a model learns to memorize the training data instead of learning general patterns. By evaluating the model on separate testing data, we can assess its ability to generalize beyond the training examples. Additionally, splitting the data allows us to measure the model's performance objectively and identify potential issues such as underfitting or overfitting.

3. Distance Metrics of KNN: To determine which data points are closest to a given query point, the distance between the query point and the other data points must be calculated. These distance metrics help to form decision boundaries, which partition query points into different regions.

1. Euclidean distance (p=2): This is the most commonly used distance measure, and it is limited to real-valued vectors. Using the formula below, it measures a straight line between the query point and the other point being measured.

Euclidean Distance = d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)² )    (3.1)

2. Manhattan distance (p=1): This is another popular distance metric, which measures the absolute value of the difference between two points. It is also referred to as taxicab distance or city block distance, as it is commonly visualized with a grid, illustrating how one might navigate from one address to another via city streets.

Manhattan Distance = d(x, y) = Σ_{i=1}^{n} |x_i − y_i|    (3.2)

3. Minkowski distance: This distance measure is the generalized form of the Euclidean and Manhattan distance metrics. The parameter p in the formula below allows for the creation of other distance metrics. Euclidean distance is represented by this formula when p is equal to two, and Manhattan distance is denoted with p equal to one.
Minkowski Distance = d(x, y) = ( Σ_{i=1}^{n} |x_i − y_i|^p )^(1/p)    (3.3)

4. Hamming distance: This technique is typically used with Boolean or string vectors, identifying the points where the vectors do not match. As a result, it has also been referred to as the overlap metric. It can be represented with the following formula:

Hamming Distance = D_H = Σ_{i=1}^{k} |x_i − y_i|, where |x_i − y_i| = 0 if x_i = y_i and 1 if x_i ≠ y_i    (3.4)

Defining K: In K-Nearest Neighbours (KNN), "K" represents the number of nearest neighbours used to predict a new data point. It is a hyperparameter that needs to be specified before training the model. When a new data point is provided, the algorithm calculates the distances between that point and all the points in the training dataset. It then selects the K closest points (nearest neighbours) based on these distances.

Fig. 3.1 K-Nearest Neighbour

3.3 TYPES OF KNN:

1. Basic KNN: This is the standard version of the algorithm, where the prediction for a new data point is made based on the majority class (for classification) or the average (or weighted average) of the target values (for regression) of its nearest neighbours.

2. Weighted KNN: In weighted KNN, instead of giving equal importance to all nearest neighbours, the algorithm assigns weights to them based on their distance from the new data point. Typically, closer neighbours have a higher weight, while farther neighbours have a lower weight. This helps to give more influence to neighbours that are more similar to the new data point.

3.4 ADVANTAGES OF KNN:

1. Easy to implement: Given the algorithm's simplicity and accuracy, it is one of the first classifiers that a new data scientist will learn.

2. Adapts easily: As new training samples are added, the algorithm adjusts to account for any new data, since all training data is stored in memory.

3. Few hyperparameters: KNN only requires a k value and a distance metric, which is low when compared to other machine learning algorithms.

3.5 DISADVANTAGES OF KNN:

1.
Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of dimensionality, which means that it doesn't perform well with high-dimensional data inputs. This is sometimes also referred to as the peaking phenomenon: after the algorithm attains the optimal number of features, additional features increase the number of classification errors, especially when the sample size is smaller.

2. Computational Complexity: One of the major drawbacks of KNN is its computational complexity, especially during the prediction phase. As the algorithm needs to calculate the distances between the new data point and all points in the training set, it can become computationally expensive, particularly with large datasets or high-dimensional data. This makes KNN inefficient for real-time applications or datasets with a large number of features.

3. Memory Usage: Since KNN is an instance-based algorithm, it needs to store the entire training dataset in memory. This can be memory-intensive, especially for large datasets, and may not be feasible for memory-constrained environments.

4. Sensitive to Outliers and Noise: KNN is sensitive to outliers and noisy data points, as they can significantly affect the calculation of distances and the prediction outcome. Outliers or irrelevant features can distort the decision boundaries and lead to poor performance.

5. Optimal Value of K: Choosing the right value for K is crucial for the performance of the KNN algorithm. Selecting an inappropriate value of K can result in underfitting or overfitting the data. Determining the optimal value often requires experimentation and validation techniques.

CHAPTER 4
PROPOSED METHODOLOGY

4.1 SYSTEM ARCHITECTURE

The system architecture gives an overview of the workings of the system. The working of this system is described as shown in Figure 4.1.

Fig.
4.1 System Working

Designing a system architecture for breast cancer prediction typically involves several components, including data collection, preprocessing, feature extraction, model training, and deployment. Here's a high-level overview of a system architecture for breast cancer prediction:

1. Data Collection: Obtain datasets containing relevant medical information such as patient demographics, medical history, imaging data (like mammograms), and biopsy results. Datasets may come from hospitals, research institutions, or public repositories like the SEER database.

2. Data Pre-processing:
• Clean the data to handle missing values, outliers, and inconsistencies.
• Normalize or standardize numerical features.
• Encode categorical variables.
• Split the data into training, validation, and testing sets.

3. Feature Extraction:
• Extract relevant features from the data that can help in predicting breast cancer.
• Features may include demographic information (age, ethnicity), clinical data (family history, previous biopsies), and imaging features (from mammograms or MRI scans).

4. Model Selection and Training:
• Choose appropriate machine learning or deep learning models for breast cancer prediction.
• Train the selected models using the pre-processed data.

5. Evaluation:
• Evaluate the trained models using metrics such as accuracy, precision, recall, and F1-score.
• Perform cross-validation to assess model generalization.
• Compare the performance of different models to select the best one.

6. Deployment:
• Once a satisfactory model is obtained, deploy it in a production environment.
• Develop an interface (web-based, mobile app, or API) for users to interact with the model.
• Ensure security and privacy measures are in place, especially when dealing with sensitive medical data.
• Monitor the deployed model for performance and retrain periodically with new data if necessary.

7.
Continuous Improvement:
• Regularly update the model with new data to improve prediction accuracy.
• Incorporate feedback from clinicians and end-users to refine the system and address any usability issues.

4.2 MACHINE LEARNING (ML)

Machine learning (ML) is a branch of artificial intelligence and computer science that focuses on using data and algorithms to enable AI to imitate the way that humans learn, gradually improving its accuracy. The basic concept of machine learning in data science involves using statistical learning and optimization methods that let computers analyse datasets and identify patterns.

How does Machine Learning work: The learning system of a machine learning algorithm consists of three main parts.

1. Decision Process
2. An Error Function
3. Model Optimization Process

Decision Process: In general, machine learning algorithms are used to make a prediction or classification. Based on some input data, which can be labelled or unlabelled, your algorithm will produce an estimate about a pattern in the data.

An Error Function: An error function evaluates the prediction of the model. If there are known examples, an error function can make a comparison to assess the accuracy of the model.

Model Optimization Process: If the model can fit better to the data points in the training set, then weights are adjusted to reduce the discrepancy between the known example and the model estimate. The algorithm will repeat this iterative "evaluate and optimize" process, updating weights autonomously until a threshold of accuracy has been met.

4.3 MACHINE LEARNING METHODS

Machine learning (ML) has four basic types of learning methods:

1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4.
Reinforcement Machine Learning

4.3.1 Supervised Machine Learning: Supervised learning is the type of machine learning in which machines are trained using well-"labelled" training data, and based on that data, machines predict the output. Labelled data means some input data is already tagged with the correct output. Supervised learning is a process of providing input data as well as correct output data to the machine learning model. A supervised learning algorithm aims to find a mapping function to map the input variable (x) to the output variable (y). During training, the model adjusts its parameters based on the provided input-output pairs to minimize the error between its predictions and the true labels. Once trained, the model can make predictions on new, unseen data. The term "supervised" refers to the presence of labelled data that guides the learning process by providing correct answers for the training examples.

Fig. 4.2 Supervised Learning

4.3.2 Unsupervised Machine Learning: Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data. The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format. The advantages of unsupervised learning are:

● Unsupervised learning helps find useful insights from the data.
● Unsupervised learning is very similar to how a human learns to think by their own experiences, which makes it closer to real AI.
● Unsupervised learning works on unlabelled and uncategorized data, which makes unsupervised learning more important.
● In the real world, we do not always have input data with the corresponding output, so to solve such cases, we need unsupervised learning.

Fig.
4.3 Unsupervised Learning

4.3.3 Semi-Supervised Learning: Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labelled data set to guide classification and feature extraction from a larger, unlabelled data set. Semi-supervised learning can solve the problem of not having enough labelled data for a supervised learning algorithm. It also helps if it is too costly to label enough data.

Fig. 4.4 Semi-Supervised Learning

4.3.4 Reinforcement Learning: Reinforcement learning is an area of machine learning. It is about taking suitable action to maximize reward in a particular situation. It is employed by various software and machines to find the best possible behaviour or path to take in a specific situation. Reinforcement learning differs from supervised learning in that, in supervised learning, the training data has the answer key with it, so the model is trained with the correct answer itself, whereas in reinforcement learning there is no answer; the reinforcement agent decides what to do to perform the given task. In the absence of a training dataset, it is bound to learn from its experience.

Fig. 4.5 Reinforcement Learning

PROPOSED ALGORITHM: LOGISTIC REGRESSION

Logistic regression is one of the most popular machine learning algorithms, which comes under the supervised learning technique. It is used for predicting a categorical dependent variable using a given set of independent variables. Logistic regression predicts the output of a categorical dependent variable; therefore, the outcome must be a categorical or discrete value. It can be either Yes or No, 0 or 1, True or False, etc., but instead of giving the exact value as 0 or 1, it gives probabilistic values which lie between 0 and 1. Logistic regression is similar to linear regression except in how they are used.
Linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems. In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).

f(x) = 1 / (1 + e^(−x))    (4.1)

The curve from the logistic function indicates the likelihood of something, such as whether the cells are cancerous or not, or whether a mouse is obese or not based on its weight.

Fig. 4.6 Sigmoid Function

Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using continuous and discrete datasets.

4.4 WORK FLOW:

[Work flow: Collection of microscopic biopsy images → Feature extraction → Data processing → Train and evaluation split → Logistic regression → Prediction → Classification (Breast cancer negative / Breast cancer positive)]

4.4.1 MICROSCOPIC BIOPSY IMAGES:

Microscopic biopsy images of breast cancer are crucial for the diagnosis and treatment of the disease. These images are obtained through procedures such as fine needle aspiration (FNA) biopsy or core needle biopsy, where a small sample of tissue is collected from the breast for examination under a microscope. In these images, pathologists analyse the tissue samples to identify characteristic features of breast cancer, such as abnormal cell growth patterns, the presence of cancerous cells, and tissue architecture. Key features that pathologists look for include the size, shape, and arrangement of cells, the presence of mitotic figures (indicating cell division), and features like nuclear atypia and tumour-infiltrating lymphocytes.

Fig. 4.7 Biopsy Image

We collected breast cancer images from Kaggle as a step toward building a robust dataset for breast cancer detection and classification tasks.
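The sigmoid function of Eq. (4.1) is what lets logistic regression turn a linear score into a probability between 0 and 1, which is then thresholded (commonly at 0.5) into a class label. A minimal sketch in Python, using a single feature and a hypothetical weight and bias rather than the project's trained model:

```python
import math

def sigmoid(z):
    # Eq. (4.1): f(z) = 1 / (1 + e^(-z)); squashes any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # Linear score w*x + b passed through the sigmoid gives P(class = 1 | x)
    return sigmoid(w * x + b)

# Hypothetical weight and bias for one feature (e.g. a normalized tumour measurement)
w, b = 2.0, -3.0
p = predict_proba(2.5, w, b)  # sigmoid(2.0), roughly 0.88
label = "positive" if p >= 0.5 else "negative"
print(round(p, 2), label)
```

In a real model, several weighted features would enter the linear score, and the weights would be fitted to the training data by minimizing a loss such as cross-entropy.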
The availability of diverse image datasets plays a crucial role in training, feature extraction, and evaluating machine learning models for medical image analysis. Kaggle, being a popular platform for data science competitions and repositories, offers a wide range of datasets, including those related to breast cancer.

4.4.2 FEATURE EXTRACTION:

The GLCM (Gray-Level Co-occurrence Matrix) is a common method of describing the texture of images by studying their spatial correlation characteristics. In 1973, Haralick et al. proposed using the GLCM to describe texture features. The GLCM has shown an excellent ability in breast cancer histopathological image recognition, especially for the three-channel features of the images. In this project, three-channel features are considered. We calculate the GLCM in four directions, 0, π/4, π/2, and 3π/4, with a gray level of 256. Then, from the GLCM, 22 features were calculated, including autocorrelation, contrast, correlation in two forms, cluster prominence, cluster shade, dissimilarity, energy, entropy, homogeneity in two forms, maximum probability, sum of squares, sum average, sum variance, sum entropy, difference variance, difference entropy, normalized inverse difference, normalized inverse difference moment, and information measures of correlation in two forms.

Given the GLCM of an image, p(i, j) is the (i, j)th entry of the normalized GLCM, px(i) is the ith entry of the marginal-probability matrix obtained by summing the rows of p(i, j), Ng is the number of distinct grey levels in the quantized image, and μ is the mean value of the normalized GLCM. The mean values and variances for the rows and columns of the matrix are

μx = Σ_i Σ_j i · p(i, j)    (4.2)

μy = Σ_i Σ_j j · p(i, j)    (4.3)

σx² = Σ_i Σ_j (i − μx)² · p(i, j)    (4.4)

σy² = Σ_i Σ_j (j − μy)² · p(i, j)    (4.5)

1. Contrast: Measures the local variations in the image. High contrast values indicate large differences between neighbouring pixel intensities.
Contrast, denoted C, is calculated as the weighted sum of squared differences between the intensity values of pixel pairs in the GLCM:

Contrast = Σ_{i,j=0}^{levels−1} P(i, j) (i − j)²    (4.6)

2. Correlation: Correlation is a textural feature of the GLCM that measures the linear dependency between the grey tones of an image. Correlation is 1 for a perfectly positively correlated image and −1 for a perfectly negatively correlated image; it is NaN for a constant image. Correlation, denoted Corr, is calculated as the weighted covariance of pixel pairs in the GLCM, normalized by the standard deviations of the intensity values:

Correlation = Σ_{i,j=0}^{levels−1} P(i, j) [(i − μx)(j − μy) / √(σx² σy²)]    (4.7)

3. Dissimilarity: Measures the average difference in intensity between neighbouring pixels; high dissimilarity values indicate greater heterogeneity in texture. Dissimilarity, denoted D, is calculated as the weighted sum of absolute differences between the intensity values of pixel pairs in the GLCM:

Dissimilarity = Σ_{i,j=0}^{levels−1} P(i, j) |i − j|    (4.8)

4. Homogeneity: The homogeneity feature of the GLCM measures the similarity or uniformity of grayscale values in an image, reflecting how close the distribution of GLCM elements is to the GLCM diagonal. The GLCM itself is a statistical method that quantifies texture patterns by counting how often pixel pairs with specific intensity values occur at a given spatial relationship:

Homogeneity = Σ_{i,j=0}^{levels−1} P(i, j) / (1 + (i − j)²)    (4.9)

5.
Angular Second Moment (ASM): The Angular Second Moment (ASM), also known as energy, is a statistical measure derived from the GLCM in texture analysis. ASM measures the uniformity or homogeneity of grayscale values in an image: it indicates the orderliness or predictability of the texture pattern, where a higher ASM value reflects a more uniform texture with less variation in pixel intensity values. ASM is calculated as the sum of squared elements in the GLCM:

ASM = Σ_{i,j=0}^{levels−1} P(i, j)²    (4.10)

6. Energy: Energy, like ASM, is a statistical measure derived from the GLCM and reflects the uniformity or homogeneity of grayscale values in an image texture. Energy, denoted E, is the square root of the ASM:

Energy = √ASM    (4.11)

4.4.3 DATA PREPROCESSING

Data pre-processing is an important step in the creation of a machine learning model. Initially, data may not be clean or in the required format for the model, which can cause misleading outcomes. In pre-processing, we transform the data into the required format; this step deals with noise, duplicates, and missing values in the dataset.

Fig 4.8 Data Pre-processing

Data pre-processing includes activities like importing datasets, splitting datasets, attribute scaling, etc. Pre-processing of data is required to improve the accuracy of the model.

4.4.4 TRAINING A MODEL

During the training phase, the model learns to recognize patterns or relationships between the input data and the output labels provided in the dataset. This process involves adjusting the model's internal parameters to minimize a predefined measure of error or loss between the predicted outputs and the actual labels in the training data. The model iteratively updates its parameters using an optimization algorithm, such as gradient descent, to find the optimal values that minimize the loss function.
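The iterative parameter update described above can be sketched as plain gradient descent on the logistic (cross-entropy) loss. The following NumPy toy uses synthetic two-feature data standing in for extracted texture features; it is an illustrative sketch, not the project's actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for extracted features: two features per sample,
# label 0 = non-cancerous, label 1 = cancerous (illustrative data only).
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),   # class 0 cluster
               rng.normal(3.0, 1.0, (50, 2))])  # class 1 cluster
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)   # weights, one per feature
b = 0.0           # intercept
lr = 0.1          # learning-rate hyperparameter

for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
    # Gradient of the mean cross-entropy loss with respect to w and b.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

# Threshold the fitted probabilities at 0.5 to get class labels.
pred = (1.0 / (1.0 + np.exp(-(X @ w + b))) >= 0.5).astype(int)
accuracy = np.mean(pred == y)
print("training accuracy:", accuracy)
```

Each pass computes the gradient of the loss over the whole training set and moves the parameters a small step against it; after enough iterations the weights settle near the values that minimize the loss, which is exactly the behaviour described in Section 4.4.4.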
4.4.5 PREDICTION OF DISEASE

Overall, model training for breast cancer prediction involves careful consideration of data preprocessing, model selection, hyperparameter tuning, and evaluation to build an effective and reliable predictive model. By optimizing these aspects of model training, healthcare practitioners can develop accurate and robust predictive models for early detection and diagnosis of breast cancer, ultimately improving patient outcomes.

4.5 ADVANTAGES OF LOGISTIC REGRESSION OVER KNN:

Although the K-Nearest Neighbours (KNN) algorithm is popular in breast cancer prediction, logistic regression gives more accurate results when dealing with large datasets. The advantages of logistic regression over KNN are:

Interpretability: Logistic regression provides clear and interpretable results by estimating the probability of each class based on the input features. This allows clinicians and researchers to understand the factors contributing to the prediction, such as the importance of specific features like tumour size or patient age.

Efficiency with Large Datasets: Logistic regression can be more computationally efficient than KNN, especially with large datasets. KNN requires storing the entire training dataset and computing distances to all data points for each prediction, which becomes computationally expensive as the dataset grows. Logistic regression, on the other hand, learns model parameters from the data and makes predictions from a learned function, which can be faster for large datasets.

Handles Irrelevant Features: Logistic regression can handle irrelevant features or noise in the data more effectively than KNN. Logistic regression estimates a coefficient for each feature, and features with low coefficients are considered less important for prediction. In contrast, KNN treats all features equally and can be sensitive to irrelevant or noisy features, potentially leading to suboptimal predictions.
Better Performance with Linearly Separable Data: Logistic regression performs well when the decision boundary between classes is approximately linear. In cases where the relationship between features and class labels is roughly linear or can be approximated by a linear function, logistic regression may outperform KNN.

Regularization: Logistic regression can easily incorporate regularization techniques such as L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting and improve generalization performance. Regularization helps control model complexity and can improve predictive performance, especially when dealing with high-dimensional data or correlated features.

CHAPTER 5
EXPERIMENTAL ANALYSIS

5.1 SYSTEM CONFIGURATION

Python is an interpreted, high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage collected. The following are the required system configurations.

5.1.1 HARDWARE REQUIREMENTS

Processor  : any modern processor
RAM        : minimum 4 GB
Hard disk  : minimum 100 GB

5.1.2 SOFTWARE REQUIREMENTS

Operating system : Windows family
Technology       : Python 3.7
IDE              : Jupyter Notebook

5.1.3 FEATURES EXTRACTED FROM SAMPLE IMAGE:

In our image analysis process, we employed the Gray-Level Co-occurrence Matrix (GLCM) technique to extract six key texture features from biopsy images. These features, namely Contrast, Correlation, Angular Second Moment (ASM), Energy, Homogeneity, and Dissimilarity, offer valuable insights into the textural characteristics present within the images.

Contrast provides an indication of local intensity variations, aiding in the identification of regions with pronounced changes in grayscale values.
Correlation measures the linear relationship between the gray levels of neighbouring pixels, offering insight into the spatial coherence of the image. ASM, also referred to as uniformity, quantifies the orderliness of the intensity distribution, capturing how uniform or varied the texture appears. Energy, derived from the sum of squared elements in the GLCM, complements ASM by representing the smoothness or roughness of the overall intensity distribution. Homogeneity gauges the closeness of the element distribution in the GLCM to its diagonal, offering a measure of the image's uniformity or heterogeneity. Dissimilarity quantifies the average absolute difference between pairs of pixels, providing further granularity on the variations in intensity across the image. By leveraging these features extracted through GLCM analysis, we enhance our understanding of the texture characteristics of the biopsy images, facilitating tasks such as image classification, segmentation, and the detection of anomalies with greater precision and accuracy.

Fig. 5.1 Biopsy Image

Table 5.1 Features of a sample image

Feature                  Extracted value
Contrast                 1326.0000
Correlation              0.24160
Dissimilarity            0.01860
Homogeneity              0.73390
Angular Second Moment    0.08690
Energy                   0.05667

Predicted output: The logistic regression model then generates predicted outputs, which are binary values of 1 or 0. Here, a prediction of 1 implies the presence of cancer in the biopsy image, while a prediction of 0 signifies the absence of cancer, i.e. non-cancerous tissue. This binary classification framework allows us to effectively distinguish between cancerous and non-cancerous regions within the biopsy images based on the texture features extracted through GLCM analysis.

Fig. 5.2 Breast Cancer Output

5.3 PERFORMANCE ANALYSIS

In this project, machine learning algorithms such as KNN and logistic regression are used for the prediction of breast cancer.
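The GLCM features discussed above follow directly from their defining equations (4.6)-(4.11). The sketch below builds a normalized GLCM for a tiny 4-level toy patch and evaluates several of the formulas with NumPy; it is illustrative only and not the project's biopsy pipeline:

```python
import numpy as np

def glcm(image, levels, dx=1, dy=0):
    """Normalized gray-level co-occurrence matrix for one pixel offset (dx, dy)."""
    P = np.zeros((levels, levels))
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    return P / P.sum()

# Tiny 4-level "image" standing in for a biopsy patch (illustrative only).
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
P = glcm(img, levels=4)
i, j = np.indices(P.shape)  # row and column gray-level indices

contrast      = np.sum(P * (i - j) ** 2)          # Eq. (4.6)
dissimilarity = np.sum(P * np.abs(i - j))         # Eq. (4.8)
homogeneity   = np.sum(P / (1.0 + (i - j) ** 2))  # Eq. (4.9)
asm           = np.sum(P ** 2)                    # Eq. (4.10)
energy        = np.sqrt(asm)                      # Eq. (4.11)

# Correlation needs the row/column means and variances of Eqs. (4.2)-(4.5).
mu_i, mu_j = np.sum(i * P), np.sum(j * P)
var_i = np.sum(P * (i - mu_i) ** 2)
var_j = np.sum(P * (j - mu_j) ** 2)
correlation = np.sum(P * (i - mu_i) * (j - mu_j)) / np.sqrt(var_i * var_j)  # Eq. (4.7)

print(contrast, dissimilarity, homogeneity, asm, energy, correlation)
```

In the real pipeline the same computation is applied to the quantized biopsy image at the four offsets corresponding to the directions 0, π/4, π/2, and 3π/4.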
Here we extracted features from the microscopic biopsy images with the help of the gray-level co-occurrence matrix: correlation, contrast, homogeneity, ASM, energy, and dissimilarity were extracted for the predictive analysis. For evaluating the experiment, various evaluation metrics like accuracy, confusion matrix, precision, recall, and F1 score are considered.

Accuracy: Accuracy is the ratio of the number of correct predictions to the total number of inputs in the dataset. It is expressed as:

Accuracy = (TP + TN) / (TP + FP + FN + TN)

where TP: True Positive, FN: False Negative, FP: False Positive, TN: True Negative.

Correlation Matrix: The correlation matrix in machine learning is used for feature selection. It represents the dependency between the various attributes.

Precision: It is the ratio of correct positive results to the total number of positive results predicted by the system.

Recall: It is the ratio of correct positive results to the total number of actual positive samples in the dataset.

F1 Score: It is the harmonic mean of Precision and Recall. It measures the test accuracy; the range of this metric is 0 to 1.

Confusion Matrix: It gives us a matrix as output and describes the overall performance of the system.

Fig 5.3 Confusion matrix

Table 5.2 TP, TN, FP, FN

                     TP   TN   FP   FN
KNN                  36   71    7    0
Logistic regression  40   70    1    3

5.4 RESULT

Following the application of the logistic regression algorithm for training and testing in our machine learning workflow, we evaluated its performance using a comprehensive set of metrics: Accuracy, F1 Score, Precision, and Recall. Leveraging the insights provided by the confusion matrix, comprising counts of True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN), we employed the respective equations for each metric to derive their values.
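As a worked check, the logistic regression counts from Table 5.2 can be plugged into these equations directly:

```python
# Confusion-matrix counts for logistic regression, from Table 5.2.
TP, TN, FP, FN = 40, 70, 1, 3

accuracy  = (TP + TN) / (TP + TN + FP + FN)  # fraction of all predictions correct
precision = TP / (TP + FP)                   # correct positives among predicted positives
recall    = TP / (TP + FN)                   # correct positives among actual positives
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
# accuracy ≈ 0.965, matching the 96.5% reported for logistic regression
```

The same arithmetic with the KNN counts (TP=36, TN=71, FP=7, FN=0) gives an accuracy of 107/114 ≈ 93.9%, in line with the KNN figure reported below.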
Comparison of Performance Metrics: In our analysis, we compare the performance metrics of the K-Nearest Neighbours (KNN) and logistic regression algorithms. This comparative analysis enables us to identify the most effective algorithm for our specific task, potentially informing clinical decision-making and enhancing diagnostic accuracy in medical settings.

Fig 5.4 Performance Metrics

Table 5.3 Performance Metrics

                     Accuracy   Precision   Recall   F1 score
KNN                  93.8%      0.92        0.94     0.93
Logistic Regression  96.5%      0.97        0.96     0.96

CHAPTER 6
CONCLUSION AND FUTURE WORK

6.1 CONCLUSION

Breast cancer is a pressing global health issue, and its early detection is paramount for effective treatment and better patient outcomes. With the advent of advanced technologies like machine learning, there is significant potential to transform healthcare practices, particularly in the realm of breast cancer prediction. Early prognosis facilitated by machine learning models allows for timely interventions and informed treatment decisions, ultimately leading to reduced complications and mortality rates among patients.

The increasing prevalence of breast cancer underscores the critical need for improved diagnostic methods and treatment approaches. Machine learning algorithms such as logistic regression offer a powerful tool for breast cancer prediction by leveraging patient characteristics and medical history to model the probability of cancer occurrence. This enables healthcare practitioners to enhance the accuracy and efficiency of diagnosis, thereby facilitating timely interventions and personalized treatment plans tailored to individual patient needs. By integrating machine learning into breast cancer prediction, healthcare professionals can empower patients with valuable insights and personalized care strategies.
This proactive approach not only facilitates early detection but also enables the implementation of preventive measures and lifestyle modifications to mitigate risk factors associated with breast cancer development. Patients can benefit from timely information and support, leading to improved health outcomes and quality of life.

Furthermore, the integration of advanced technological support in healthcare systems streamlines data analysis and decision-making processes, optimizing patient care delivery. Machine learning algorithms like logistic regression enable healthcare providers to extract valuable insights from large and complex datasets, facilitating more accurate and tailored treatment strategies. This not only improves patient care but also drives advancements in medical research and clinical practice.

In conclusion, the adoption of machine learning algorithms, particularly logistic regression, for breast cancer prediction represents a significant advancement in healthcare. This innovative approach has the potential to revolutionize breast cancer diagnosis and treatment, ultimately leading to improved patient outcomes and advancements in the field of medicine as a whole. Through collaborative efforts between healthcare professionals, researchers, and technology experts, we can harness the power of machine learning to address the challenges posed by breast cancer and improve the lives of patients worldwide.

6.2 FUTURE SCOPE

For breast cancer prediction using logistic regression, there are numerous avenues for future research that could advance the field. Here are several potential directions:

Feature Selection: Investigate which features or variables are most informative for predicting breast cancer and refine the logistic regression model accordingly. This could involve exploring a wide range of clinical, demographic, and genetic factors to identify the most relevant predictors.
Comparison with Other Algorithms: Logistic regression is just one of many machine learning algorithms that can be applied to breast cancer prediction. Researchers could compare the performance of logistic regression with other algorithms such as random forests, support vector machines, or neural networks. This comparative analysis could help identify the most effective modelling approaches for breast cancer prediction.

Model Interpretation: Explore methods for interpreting logistic regression models to gain insights into the underlying factors associated with breast cancer risk. Researchers could analyse the estimated coefficients of the predictor variables to understand their impact on the likelihood of breast cancer occurrence. This could provide valuable insights into the biological, environmental, and lifestyle factors contributing to breast cancer development.

Validation: Validate logistic regression models using independent datasets to assess their generalizability and robustness. This validation process would involve testing the models on data from different populations or healthcare settings to ensure their reliability and applicability in diverse clinical contexts.

APPENDIX

Python: Python is an interpreted, high-level, general-purpose programming language created by Guido van Rossum and first released in 1991. Python's design philosophy emphasizes code readability with its notable use of significant whitespace. Its language constructs and object-oriented approach aim to help programmers write clear, logical code for small and large-scale projects. Python is dynamically typed and garbage collected. It supports multiple programming paradigms, including procedural, object-oriented, and functional programming.

Sklearn: Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in Python.
It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface in Python. This library, which is largely written in Python, is built upon NumPy, SciPy, and Matplotlib.

NumPy: NumPy is a library for the Python programming language, adding support for large, multidimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was originally created by Jim Hugunin with contributions from several other developers. In 2005, Travis Oliphant created NumPy by incorporating features of the competing Numarray into Numeric, with extensive modifications. NumPy is open-source software and has many contributors.

CV2: The OpenCV-Python library is a collection of Python bindings for dealing with computer vision problems. The method cv2.imread() opens a file and loads an image. If the image cannot be read (due to insufficient permissions, a missing file, or an unsupported or invalid format), this method returns an empty matrix.

Matplotlib: Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. It provides an object-oriented API for embedding plots into applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely resemble that of MATLAB, though its use is discouraged.

Seaborn: Seaborn is a Python data visualization library built on top of Matplotlib and closely integrated with pandas data structures. It provides a high-level interface for drawing attractive and informative statistical graphics.
Visualization is the central part of Seaborn, which helps in the exploration and understanding of data.