BREAST CANCER PREDICTION USING LOGISTIC
REGRESSION ALGORITHM
A Project report submitted in partial fulfilment of the requirements for
the award of the degree of
BACHELOR OF TECHNOLOGY
IN
ELECTRONICS AND COMMUNICATION ENGINEERING
Submitted by
A SAI KRISHNA          -  320106512001
ADADALA GURU DATTA     -  320106512002
MADDINENI YAKSHITH     -  320106512030
Under the guidance of
PROF. G. SASIBHUSHANA RAO
PROFESSOR
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
ENGINEERING,
ANDHRA UNIVERSITY COLLEGE OF ENGINEERING,
VISAKHAPATNAM-530003.
2023-2024
DEPARTMENT OF ELECTRONICS AND COMMUNICATION
ENGINEERING,
ANDHRA UNIVERSITY COLLEGE OF ENGINEERING,
VISAKHAPATNAM – 530003.
2023 – 2024
BONAFIDE CERTIFICATE
This is to certify that the project work entitled “BREAST CANCER PREDICTION USING
LOGISTIC REGRESSION ALGORITHM” is a bonafide work done by A SAI KRISHNA
(Regd. No.: 320106512001), ADADALA GURU DATTA (Regd. No.: 320106512002), and
MADDINENI YAKSHITH (Regd. No.: 320106512030) under the esteemed guidance of
PROF. G. SASIBHUSHANA RAO, submitted in partial fulfilment of the requirements for
the award of the degree of BACHELOR OF TECHNOLOGY in Electronics and
Communication Engineering, Andhra University College of Engineering, Andhra University,
Visakhapatnam, during the year 2023-2024.
PROF. G. SASIBHUSHANA RAO
PROJECT GUIDE
Department of Electronics and Communication Engineering,
Andhra University College of Engineering,
Visakhapatnam.

PROF. P.V. SRIDEVI
HEAD OF DEPARTMENT
Department of Electronics and Communication Engineering,
Andhra University College of Engineering,
Visakhapatnam.
ACKNOWLEDGEMENT
We would like to express our deep gratitude to our project guide PROF. G.
SASIBHUSHANA RAO, Professor, Department of Electronics and Communication
Engineering, AUCE, for his guidance with unsurpassed knowledge and immense
encouragement. We are grateful to PROF. P.V. SRIDEVI, Head of the Department,
Electronics and Communication Engineering, for providing us with the required
facilities for the completion of the project work.
We are thankful to Prof. P. Rajesh Kumar, Prof. M.S. Anuradha, Dr. V.
Malleswara Rao, Dr. S. Aruna, Dr. G. Rajeswara Rao, Dr. K. Chiranjeevi,
Dr. Praveen Babu Choppala of Department of Electronics and Communication
Engineering, Andhra University College of Engineering, Andhra University,
Visakhapatnam for their kind encouragement.
We thank all the scholars, technical staff and non-teaching staff of the Department
of Electronics and Communication Engineering, Andhra University College of
Engineering, Andhra University, Visakhapatnam for their constant support.
Regards,
A SAI KRISHNA (Regd.No.:320106512001),
ADADALA GURU DATTA (Regd.No.:320106512002),
MADDINENI YAKSHITH (Regd.No.:320106512030).
DECLARATION
We hereby declare that the project entitled “BREAST CANCER PREDICTION
USING LOGISTIC REGRESSION ALGORITHM”, submitted in partial fulfilment
for the award of the degree of Bachelor of Technology in Electronics and
Communication Engineering for the academic year 2023-2024, is the record of
the bonafide work carried out by A SAI KRISHNA (Regd. No.: 320106512001),
ADADALA GURU DATTA (Regd. No.: 320106512002), and MADDINENI
YAKSHITH (Regd. No.: 320106512030) under the guidance of PROF. G.
SASIBHUSHANA RAO, Department of Electronics and Communication
Engineering, Andhra University College of Engineering, Andhra University,
Visakhapatnam.
Place: Visakhapatnam
Date:
Regards,
A SAI KRISHNA (Regd.No.:320106512001),
ADADALA GURU DATTA (Regd.No.:320106512002),
MADDINENI YAKSHITH (Regd.No.:320106512030).
ABSTRACT
Breast cancer is a significant health concern affecting millions of women worldwide. It is a
type of cancer that develops in the cells of the breast, most commonly in the ducts or lobules.
It is the most common cancer among women worldwide and can also affect men, although
this is rare. Early detection plays a pivotal role in improving survival rates and treatment
outcomes. In recent years, machine learning techniques have emerged as powerful tools for
breast cancer prediction, aiding in the identification of high-risk individuals and informing
personalized treatment strategies.
Risk factors for breast cancer include age, family history of breast cancer, genetic mutations
(such as BRCA1 and BRCA2), hormonal factors (early menstruation, late menopause,
hormone replacement therapy), lifestyle factors (diet, physical activity, alcohol consumption),
and certain medical conditions (such as dense breast tissue and previous radiation therapy).
Logistic regression is one of the most popular machine learning algorithms and comes
under the supervised learning technique. It is used for predicting a categorical dependent
variable from a given set of independent variables.
Because logistic regression predicts a categorical dependent variable, the outcome must be a
categorical or discrete value: Yes or No, 0 or 1, true or false, etc. However, instead of giving
the exact value 0 or 1, it gives probabilistic values that lie between 0 and 1.
Machine learning is used across many domains around the world, and the healthcare industry
is no exception. Machine learning can play an essential role in predicting the presence or
absence of malignant cells. Such information, if predicted well in advance, can provide
important insights to doctors, who can then adapt their diagnosis and treatment on a
per-patient basis.
In this project, we work on predicting breast cancer using machine learning algorithms. We
analyse an existing classifier, K-Nearest Neighbours, alongside the proposed classifier,
Logistic Regression, training and validating both on multiple data samples, and compare
them to determine which gives better accuracy and predictive performance.
CONTENTS
                                                               PAGE No.
CHAPTER-1: INTRODUCTION
  1.1   Introduction to Breast Cancer                          1
  1.2   Motivation of the Work                                 12
  1.3   Problem Statement                                      12
  1.4   Organisation of the Work                               13
CHAPTER-2: LITERATURE SURVEY                                   14
CHAPTER-3: EXISTING METHODOLOGY
  3.1   K-Nearest Neighbours                                   20
  3.2   Flow Chart                                             21
  3.3   Types of KNN                                           24
  3.4   Advantages of KNN                                      24
  3.5   Disadvantages of KNN                                   25
CHAPTER-4: WORKING OF SYSTEM
  4.1   System Architecture                                    26
  4.2   Machine Learning                                       28
  4.3   Machine Learning Methods                               29
  4.4   Logistic Regression                                    32
  4.5   Work Flow                                              33
  4.6   Advantages of Logistic Regression over KNN             38
CHAPTER-5: EXPERIMENTAL ANALYSIS
  5.1   System Configuration                                   40
        5.1.1 Hardware Requirements                            40
        5.1.2 Software Requirements                            40
  5.2   Features Extracted                                     40
  5.3   Performance Analysis                                   42
  5.4   Result                                                 43
CHAPTER-6: CONCLUSION AND FUTURE WORK
  6.1   Conclusion                                             45
  6.2   Future Scope                                           46
APPENDIX                                                       47
REFERENCES                                                     49
LIST OF FIGURES
Fig. No.   Figure Description                  Page No.
1.1        Breast Tumour                       5
1.2        Progression of Breast Cancer        6
1.3        Benign and Malignant Tumour         7
3.1        K-Nearest Neighbour                 20
4.1        System Working                      25
4.2        Supervised Learning                 29
4.3        Unsupervised Learning               30
4.4        Semi-Supervised Learning            31
4.5        Reinforcement Learning              31
4.6        Sigmoid Function                    32
4.7        Biopsy Image                        34
4.8        Data Pre-Processing                 37
5.1        Biopsy Image                        41
5.2        Breast Cancer Output                42
5.3        Confusion Matrix                    43
5.4        Performance Metrics                 44
LIST OF EQUATIONS
Equation No.     Page No.
Equation 3.1     22
Equation 3.2     23
Equation 3.3     23
Equation 3.4     23
Equation 4.1     32
Equation 4.2     35
Equation 4.3     35
Equation 4.4     35
Equation 4.5     35
Equation 4.6     35
Equation 4.7     36
Equation 4.8     36
Equation 4.9     36
Equation 4.10    37
Equation 4.11    37
LIST OF TABLES
Table No.    Page No.
Table 5.1    42
Table 5.2    43
Table 5.3    44
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION TO BREAST CANCER
Breast cancer has become one of the most frequent health issues among women, especially
women in middle age. Early detection of breast cancer can help cure the disease and reduce
the death rate. In the present-day scenario, mammograms are used to screen for breast cancer
and are known to be the most effective scanning technique. In this work, the detection of
cancer cells is done using machine learning techniques.
The exact cause of breast cancer is not fully understood, but it is believed to result from a
combination of genetic, hormonal, environmental, and lifestyle factors. Mutations in specific
genes, such as BRCA1 and BRCA2, are associated with an increased risk of developing breast
cancer. Hormonal factors, such as Estrogen and progesterone, also play a role, as evidenced by
the higher incidence of breast cancer in women with early onset of menstruation, late
menopause, or hormone replacement therapy. Additionally, lifestyle factors such as alcohol
consumption, obesity, and lack of physical activity have been linked to an increased risk of
breast cancer.
Breast cancer typically begins in the cells lining the milk ducts (ductal carcinoma) or the
lobules (lobular carcinoma) of the breast. Over time, the abnormal cells may invade
surrounding breast tissue and metastasize to other body parts, such as the lymph nodes, bones,
liver, or lungs. The exact mechanisms underlying cancer progression vary depending on the
subtype of breast cancer, which includes hormone receptor-positive, HER2-positive, and triple-negative breast cancer.
Studies estimated that over 170,000 new breast cancer cases would develop in India by 2020.
According to research, 1 in every 28 women is likely to be affected by the disease. While
breast cancer occurs almost entirely in women, around 1-2% of cases occur in men, too.
1.1.1 OVERVIEW
Breast cancer is a disease in which abnormal breast cells grow out of control and form tumours.
If left unchecked, the tumours can spread throughout the body and become fatal. Breast cancer
cells begin inside the milk ducts and the milk-producing lobules of the breast. The earliest form
(in situ) is not life-threatening and can be detected in early stages. Cancer cells can spread into
nearby breast tissue (invasion). This creates tumours that cause lumps or thickening. Invasive
cancers can spread to nearby lymph nodes or other organs (metastasize). Metastasis can be
life-threatening. Treatment is based on the person, the type of cancer, and its spread, and
typically combines surgery, radiation therapy, and medications.
1.1.2 SCOPE OF THE WORK
Female gender is the strongest breast cancer risk factor. Approximately 99% of breast cancers
occur in women and 0.5–1% of breast cancers occur in men. The treatment of breast cancer in
men follows the same principles of management as for women.
Certain factors increase the risk of breast cancer including increasing age, obesity, harmful use
of alcohol, family history of breast cancer, history of radiation exposure, reproductive history
(such as age that menstrual periods began and age at first pregnancy), tobacco use, and
postmenopausal hormone therapy. Approximately half of breast cancers develop in women
who have no identifiable breast cancer risk factor other than gender (female) and age (over 40
years).
A family history of breast cancer increases the risk of breast cancer, but most women diagnosed
with breast cancer do not have a known family history of the disease. Lack of a known family
history does not necessarily mean that a woman is at reduced risk.
Certain inherited high penetrance gene mutations greatly increase breast cancer risk, the most
dominant being mutations in the genes BRCA1, BRCA2, and PALB-2. Women found to have
mutations in these major genes may consider risk reduction strategies such as surgical removal
of both breasts or chemoprevention strategies. In 2022, there were 2.3 million women
diagnosed with breast cancer and 670,000 deaths globally. Breast cancer occurs in every
country of the world in women at any age after puberty but with increasing rates in later life.
Global estimates reveal striking inequities in the breast cancer burden according to human
development. For instance, in countries with a very high Human Development Index (HDI),
1 in 12 women will be diagnosed with breast cancer in their lifetime and 1 in 71 women will
die of it. In contrast, in countries with a low HDI, only 1 in 27 women is diagnosed with
breast cancer in their lifetime, yet 1 in 48 women will die from it.
1.1.3 TYPES OF BREAST CANCER
Breast cancer can be invasive or non-invasive. Invasive breast cancer is cancer that spreads
into surrounding tissues and/or distant organs. Non-invasive breast cancer does not go beyond
the milk ducts or lobules in the breast. About 80% of breast cancer is invasive cancer, and about
20% is non-invasive cancer. There are multiple types of breast cancer, which are classified
based on how they look under a microscope.
A. Classification based on microscopic view:
• Ductal carcinoma in situ (DCIS). This is a non-invasive cancer (stage 0) that is located
only in the duct and has not spread outside the duct.
• Invasive or infiltrating ductal carcinoma. This is cancer that has spread outside of the
ducts. It is the most common type of invasive breast cancer.
• Invasive lobular carcinoma. This is a type of breast cancer that has spread outside of the
lobules.
B. Classification based on tumour characteristics:
There are 3 main types of breast cancer, identified by running specific tests on a sample of
the tumour to determine its characteristics.
• Hormone receptor positive: Breast cancers expressing estrogen receptors (ER) and/or
progesterone receptors (PR) are called “hormone receptor-positive.” These receptors are
proteins found in cells. Tumours that have estrogen receptors are called “ER positive.”
Tumours that have progesterone receptors are called “PR positive.” Only 1 of these
receptors needs to be positive for cancer to be called hormone receptor-positive. This
type of cancer may depend on the hormones estrogen and/or progesterone to grow.
Hormone receptor-positive cancers can occur at any age, but they are more common
after menopause. About two-thirds of breast cancers have estrogen and/or progesterone
receptors. Cancers without these receptors are called “hormone receptor-negative.”
Hormone receptor-positive breast cancers are commonly treated using hormone therapy.
• HER2 positive: About 15% to 20% of breast cancers depend on the gene called “human
epidermal growth factor receptor 2” (HER2) to grow. These cancers are called “HER2
positive” and have many copies of the HER2 gene or high levels of the HER2 protein.
These proteins are also called “receptors.” The HER2 gene makes the HER2 protein,
which is found in cancer cells and is important for tumour cell growth. HER2-positive
breast cancers grow more quickly. They can also be either hormone receptor-positive or
hormone receptor-negative. HER2-positive early-stage breast cancers are commonly
treated using HER2-targeted therapies. Cancers that have no HER2 protein are called
“HER2 negative.” Cancers that have low levels of the HER2 protein and/or few copies
of the HER2 gene are sometimes now called “HER2 low.”
• Triple negative: If a tumour does not express ER, PR, and HER2, the tumour is called
“triple negative.” Triple-negative breast cancer makes up about 10% to 20% of invasive
breast cancers. It seems to be more common among younger women, particularly
younger Black women and Hispanic women, and is also more common in women with a
mutation in the BRCA1 gene. Experts often recommend that people with triple-negative
breast cancer be tested for BRCA gene mutations.
1.1.4 SIGNS AND SYMPTOMS
Most people will not experience any symptoms while the cancer is still early, hence the
importance of early detection. Breast cancer can cause combinations of symptoms, especially
when it is more advanced. Symptoms of breast cancer can include:
• a breast lump or thickening, often without pain.
• a change in the size, shape, or appearance of the breast, or dimpling, redness, pitting, or
other changes in the skin.
• a change in nipple appearance or in the skin surrounding the nipple (areola).
• abnormal or bloody fluid from the nipple.
1.1.5 TYPES OF TUMOURS
A tumour is a pathologic disturbance of cell growth, characterized by excessive and abnormal
proliferation of cells. Tumours are abnormal masses of tissue that may be solid or fluid-filled.
When the growth of tumour cells is confined to the site of origin and the cells appear normal,
the tumour is classified as benign. When the cells are abnormal and can grow uncontrollably,
they are classified as cancerous, i.e., a malignant tumour. A tumour is also called a
‘neoplasm’. To determine whether a tumour is benign or cancerous, a doctor can take a
sample of the cells with a biopsy procedure. The biopsy is then analysed under a microscope
by a pathologist (a doctor specializing in laboratory science).
Fig 1.1 Breast Tumour
1. Benign Tumours: Noncancerous
If the cells are non-cancerous, the tumour is concluded as benign. It won’t invade nearby tissues
or spread to other areas of the body (metastasize). A benign tumour is less harmful unless it is
present near any important organs, tissues, nerves, or blood vessels and causes damage.
Fibroids in the uterus and breast, polyps of the colon, and moles are some examples of benign
tumours. Benign tumours can be removed by surgery. They can grow very large, sometimes
weighing pounds. They can be dangerous, such as when they occur in the brain and crowd the
normal structures in the enclosed space of the skull. They can press on vital organs or block
channels. Also, some types of benign tumours such as intestinal polyps are considered
precancerous and are removed immediately to prevent them from becoming malignant. Benign
tumours usually don’t recur once removed, but if they do, it is usually in the same place.
2. Malignant Tumours: Cancerous
Malignant means that the tumour is made of cancer cells and it can invade nearby tissues. Some
cancer cells can move into the bloodstream or lymph nodes, where they can spread to other
tissues within the body—this is called metastasis. Cancer can occur anywhere in the body
including the breast, lungs, intestines, reproductive organs, blood, or skin. For example, breast
cancer begins in the breast tissue and may spread to lymph nodes in the armpit if it’s not caught
early enough and treated. Once breast cancer has spread to the lymph nodes, the cancer cells
can travel to other areas of the body, like the bones or liver. The breast cancer cells can then
form tumours in those locations referred to as secondary tumours. A biopsy of these tumours
might show characteristics of the original breast cancer tumour.
Fig 1.2 Progression of Breast Cancer
Differences Between Benign and Malignant Tumours
There are many important differences between benign and malignant tumours which are as
follows:
i. On the Basis of Growth rate: Generally malignant tumours grow more rapidly than benign
tumours, although there are slow-growing and fast-growing tumours in either category.
ii. On the Basis of Ability to Invade Locally: Malignant tumours tend to invade the tissues
around them. One of the most prominent hallmarks of cancer is penetration of the basal
membrane that surrounds normal tissues.
iii. On the Basis of Ability to spread at Distance: Malignant tumours may spread to other
parts of the body using the bloodstream or the lymphatic system. Malignant tumours may also
invade nearby tissues and send out fingers into them, while benign tumours don’t. Benign
tumours only grow in size at the place of their origin.
Fig 1.3 Benign and Malignant Tumour
iv. On the Basis of Recurrence: Benign tumours can be removed completely by surgery as
they have clearer boundaries, and as a result, they are less likely to recur. If they do recur, it
is only at the original site. Malignant tumours may spread to other parts of the body and are
more likely to recur, such as breast cancer recurring in the lungs or bones.
v. On the Basis of Cellular Appearance: When a pathologist looks at tumour cells under a
microscope, it is usually possible to determine whether they are normal, benign, or
cancerous, as cancer cells often have abnormal chromosomes and DNA, making their nuclei
larger and darker. They also often have different shapes than normal cells. However,
sometimes the difference is subtle.
vi. On the Basis of Systemic Effects: Some benign tumours, such as benign
pheochromocytomas, secrete hormones, but malignant tumours are more likely to do so.
Malignant tumours can also secrete substances that cause effects throughout the body, such
as fatigue and weight loss; this is known as paraneoplastic syndrome.
vii. On the Basis of Treatments: A benign tumour can usually be completely treated with
surgery, although some may be treated with radiation therapy or chemotherapy. Some benign
tumours are not treated at all, as they do not pose any health risk. Malignant tumours may
require chemotherapy, radiation therapy, or immunotherapy medications to eliminate tumour
cells that remain after treatment or to treat secondary tumours present in other parts of the
body.
1.1.6 STAGING OF BREAST CANCER
Staging is a way of describing how extensive the breast cancer is, including the size of the
tumour, whether it has spread to lymph nodes, whether it has spread to distant parts of the body,
and what its biomarkers are. Staging can be done either before or after a patient undergoes
surgery. Staging done before surgery is called the clinical stage, and staging done after surgery
is called the pathologic stage.
TNM Staging System:
The most common tool that doctors use to describe the stage is the TNM system.
• Tumour (T): The primary tumour size in the breast and its biomarkers.
• Node (N): The spread of the tumour to the lymph nodes, the number of nodes involved,
and their position.
• Metastasis (M): The spread of cancer to other parts of the body.
There are 5 major stages of breast cancer: stage 0 (zero), which is non-invasive ductal
carcinoma in situ (DCIS), and stages I through IV (1 through 4), which are used for invasive
breast cancer. Here are more details on each part of the TNM system for breast cancer:
1. Tumour (T):
Using the TNM system, the “T” plus a letter or number (0 to 4) is used to describe the size
and location of the tumour. Tumour size is measured in centimetres (cm).
• TX: The primary tumour cannot be evaluated.
• T0 (T zero): There is no evidence of cancer in the breast.
• Tis: Refers to carcinoma in situ. The cancer is confined within the ducts of the breast
tissue and has not spread into the surrounding tissue of the breast. There are 2 types of
breast carcinoma in situ:
• Tis (DCIS): DCIS is a non-invasive cancer, but if not removed, it may develop into an
invasive breast cancer later. DCIS means that cancer cells have been found in breast
ducts and have not spread past the layer of tissue where they began.
• Tis (Paget’s disease): Paget’s disease of the nipple is a rare form of early, non-invasive
cancer that is only in the skin cells of the nipple. Sometimes Paget’s disease is
associated with invasive breast cancer. If there is an invasive breast cancer, it is
classified according to the stage of the invasive tumour.
• T1: The tumour in the breast is 20 millimetres (mm) or smaller at its widest area. This is
a little less than an inch. This stage is broken into 4 substages depending on the size of
the tumour:
• T1mi is a tumour that is 1 mm or smaller.
• T1a is a tumour that is larger than 1 mm but 5 mm or smaller.
• T1b is a tumour that is larger than 5 mm but 10 mm or smaller.
• T1c is a tumour that is larger than 10 mm but 20 mm or smaller.
• T2: The tumour is larger than 20 mm but not larger than 50 mm.
• T3: The tumour is larger than 50 mm.
• T4: The tumour falls into 1 of the following groups:
• T4a means the tumour has grown into the chest wall.
• T4b is when the tumour has grown into the skin.
• T4c is cancer that has grown into the chest wall and the skin.
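The size thresholds listed above can be expressed as a simple lookup. The sketch below is an illustrative simplification of our own (the function name `t_category` is ours, and it uses tumour size alone; real T staging also considers chest-wall and skin involvement, which is not size-based):

```python
def t_category(size_mm: float) -> str:
    """Map primary tumour size (mm) to a simplified T category,
    following the thresholds listed above. Illustrative sketch only:
    T4 (chest wall/skin involvement) is not size-based and is omitted."""
    if size_mm <= 1:
        return "T1mi"   # 1 mm or smaller
    if size_mm <= 5:
        return "T1a"    # larger than 1 mm but 5 mm or smaller
    if size_mm <= 10:
        return "T1b"    # larger than 5 mm but 10 mm or smaller
    if size_mm <= 20:
        return "T1c"    # larger than 10 mm but 20 mm or smaller
    if size_mm <= 50:
        return "T2"     # larger than 20 mm but not larger than 50 mm
    return "T3"         # larger than 50 mm

print(t_category(4))    # T1a
print(t_category(35))   # T2
```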
2. Node (N):
The “N” in the TNM staging system stands for lymph nodes. These small, bean-shaped organs
help fight infection. Lymph nodes near where the cancer started are called regional lymph
nodes. Regional lymph nodes include:
• Lymph nodes located under the arm, called the axillary lymph nodes.
• Lymph nodes located above and below the collarbone.
• Lymph nodes located under the breastbone, called the internal mammary lymph nodes.
Lymph nodes in other parts of the body are called distant lymph nodes. The information
below describes the staging.
1. NX: The lymph nodes were not evaluated.
2. N0: Either of the following:
• No cancer was found in the lymph nodes.
• Only areas of cancer smaller than 0.2 mm are in the lymph nodes.
3. N1: The cancer has spread to 1 to 3 axillary lymph nodes and/or the internal mammary
lymph nodes. If the cancer in the lymph node is larger than 0.2 mm but 2 mm or smaller,
it is called “micrometastatic” (N1mi).
4. N2: The cancer has spread to 4 to 9 axillary lymph nodes. Or, it has spread to the internal
mammary lymph nodes, but not the axillary lymph nodes.
5. N3: The cancer has spread to 10 or more axillary lymph nodes, or it has spread to the lymph
nodes located under the clavicle, or collarbone. It may have also spread to the internal
mammary lymph nodes. Cancer that has spread to the lymph nodes above the clavicle,
called the supraclavicular lymph nodes, is also described as N3.
If there is cancer in the lymph nodes, knowing how many lymph nodes are involved and where
they are helps doctors plan treatment. The pathologist can find out the number of axillary lymph
nodes that contain cancer after they are removed during surgery. It is not common to remove
the supraclavicular or internal mammary lymph nodes during surgery. If there is cancer in these
lymph nodes, treatment other than surgery, such as radiation therapy, chemotherapy, and
hormonal therapy, is generally used.
3. Metastasis:
The “M” in the TNM system describes whether the cancer has spread to other parts of the
body, called metastasis. Metastatic disease is no longer considered early-stage or locally
advanced cancer.
1. MX: Distant spread cannot be evaluated.
2. M0: There is no evidence of distant metastases.
3. M0 (i+): There is no clinical or radiographic evidence of distant metastases. However,
there is microscopic evidence of tumour cells in the blood, bone marrow, or other lymph
nodes that are no larger than 0.2 mm.
4. M1: There is evidence of metastasis to another part of the body, meaning breast cancer
cells are growing in other organs.
Stage Groups of Breast Cancer
Doctors assign the stage of the cancer by combining the T, N, and M classifications:
1. Stage 0: Stage zero (0) describes disease that is only in the ducts of the breast tissue and
has not spread to the surrounding tissue. It is also called non-invasive or in situ cancer (Tis,
N0, M0).
2. Stage IA: The tumour is small, invasive, and has not spread to the lymph nodes (T1, N0,
M0).
3. Stage IB: Cancer has spread to the lymph nodes and the cancer in the lymph node is larger
than 0.2 mm but less than 2 mm in size. There is either no evidence of a tumour in the
breast or the tumour in the breast is 20 mm or smaller (T0 or T1, N1mi, M0).
4. Stage IIA: Any 1 of these conditions:
a. There is no evidence of a tumour in the breast, but the cancer has spread to 1 to 3 axillary
lymph nodes. It has not spread to distant parts of the body (T0, N1, M0).
b. The tumour is 20 mm or smaller and has spread to 1 to 3 axillary lymph nodes (T1, N1,
M0).
c. The tumour is larger than 20 mm but not larger than 50 mm and has not spread to the
axillary lymph nodes (T2, N0, M0).
5. Stage IIB: Either of these conditions:
a. The tumour is larger than 20 mm but not larger than 50 mm and has spread to 1 to 3 axillary
lymph nodes (T2, N1, M0).
b. The tumour is larger than 50 mm but has not spread to the axillary lymph nodes (T3, N0,
M0).
6. Stage IIIA: The tumour of any size has spread to 4 to 9 axillary lymph nodes or to internal
mammary lymph nodes. It has not spread to other parts of the body (T0, T1, T2, or T3; N2;
M0). Stage IIIA may also be a tumour larger than 50 mm that has spread to 1 to 3 axillary
lymph nodes (T3, N1, M0).
7. Stage IIIB: The tumour has spread to the chest wall or caused swelling or ulceration of the
breast. It may or may not have spread to up to 9 axillary or internal mammary lymph nodes.
It has not spread to other parts of the body (T4; N0, N1, or N2; M0).
8. Stage IIIC: A tumour of any size that has spread to 10 or more axillary lymph nodes, the
internal mammary lymph nodes, and/or the lymph nodes under the collarbone. It has not
spread to other parts of the body (any T, N3, M0).
9. Stage IV (metastatic): The tumour can be any size and has spread to other organs, such as
the bones, lungs, brain, liver, distant lymph nodes, or chest wall (any T, any N, M1).
Metastatic breast cancer that is found when the cancer is first diagnosed occurs about 6%
of the time. This may be called de novo metastatic breast cancer. Most commonly,
metastatic breast cancer is found after a previous diagnosis and treatment of early-stage
breast cancer.
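The stage-group rules above combine the T, N, and M classifications into a single stage. They can be sketched as a small rule-based function (an illustrative simplification of our own; the function name `stage_group` is an assumption, and real staging also incorporates biomarkers such as ER/PR/HER2 status):

```python
def stage_group(t: str, n: str, m: str) -> str:
    """Combine simplified T, N, M values into an anatomic stage group,
    following the rules listed above. Illustrative sketch only."""
    if m == "M1":
        return "IV"     # any T, any N, M1
    if n == "N3":
        return "IIIC"   # any T, N3, M0
    if t == "T4":
        return "IIIB"   # T4; N0, N1, or N2; M0
    if n == "N2" or (t == "T3" and n == "N1"):
        return "IIIA"   # T0-T3, N2, M0; or T3, N1, M0
    if (t == "T2" and n == "N1") or (t == "T3" and n == "N0"):
        return "IIB"
    if (t in ("T0", "T1") and n == "N1") or (t == "T2" and n == "N0"):
        return "IIA"
    if t in ("T0", "T1") and n == "N1mi":
        return "IB"
    if t == "T1" and n == "N0":
        return "IA"
    if t == "Tis" and n == "N0":
        return "0"      # non-invasive (DCIS): Tis, N0, M0
    return "unclassified"

print(stage_group("T2", "N1", "M0"))    # IIB
print(stage_group("Tis", "N0", "M0"))   # 0
```

Note that the checks run from the most advanced stage downward, so a case matching several rules (e.g. T4 with N2) is assigned the higher stage, mirroring how the rules above are stated.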
1.2 MOTIVATION OF THE WORK
The primary objective of this research is to develop a robust breast cancer prediction model to
forecast the likelihood of breast cancer occurrence. Additionally, the study aims to identify the
most effective classification algorithm for detecting the presence of breast cancer in patients.
This endeavour is justified through a comprehensive comparative analysis utilizing various
classification algorithms, with logistic regression as the primary focus. While logistic
regression is a widely used machine learning technique, accurately predicting breast cancer is
a critical endeavour demanding the highest level of precision. Therefore, the algorithms
undergo thorough evaluation using diverse assessment methodologies and criteria. By
conducting this research, both researchers and medical professionals will gain valuable insights
to enhance breast cancer detection and diagnosis, ultimately leading to improved patient
outcomes.
1.3 PROBLEM STATEMENT
The primary challenge in breast cancer lies in its early detection. Although there are existing
diagnostic tools for breast cancer prediction, they often come with significant drawbacks, such
as high cost or inefficiency in accurately assessing the risk of breast cancer in individuals. Early
detection of breast cancer plays a crucial role in reducing mortality rates and minimizing
overall complications. However, continuous monitoring of patients is not always feasible, and
the availability of round-the-clock medical consultation is limited due to the requirement for
expertise and time. With the abundance of data available in today's world, leveraging various
machine learning algorithms offers a promising avenue for analysing data to uncover hidden
patterns. These hidden patterns can serve as valuable insights for health diagnosis in medical
datasets, empowering healthcare professionals to enhance breast cancer detection and provide
timely interventions, ultimately improving patient outcomes.
1.4 ORGANISATION OF THE WORK
This project is structured into six chapters, each dedicated to various aspects of utilizing
machine learning for breast cancer prediction. It commences with an introduction to breast
cancer and its stages in the first chapter. Following this, the second chapter reviews previous
research and the application of machine learning in breast cancer prediction. In the third
chapter, the K-nearest neighbours method, a prevalent approach in this field, is explained.
Subsequently, the fourth chapter proposes employing logistic regression alongside other
machine learning techniques to enhance prediction accuracy. The fifth chapter undertakes a
comparative analysis of the effectiveness of existing and proposed methods. Finally, the project
concludes by summarizing findings and suggesting future research directions in the sixth
chapter. The overarching objective is to advance breast cancer prediction using machine
learning techniques and contribute to ongoing efforts in cancer research.
CHAPTER 2
LITERATURE SURVEY
To develop a comprehensive understanding of image feature extraction techniques and
machine learning algorithms, it's crucial to first familiarize ourselves with the existing research
and advancements in these domains. Therefore, in this section, we provide an overview of
various papers and books that we have explored to gather relevant insights for our project. By
examining these diverse sources, we aim to lay a solid foundation of knowledge necessary for
effectively utilizing these techniques in our work.
[1] Tarini Sinha. Tumours: Benign and Malignant. Canc Therapy & Oncol Int J. 2018;
10(3): 555790. DOI:10.19080/CTOIJ.2018.10.555790
A tumour is a pathologic disturbance of cell growth, characterized by excessive and abnormal
proliferation of cells. Tumours are abnormal masses of tissue which may be solid or fluid-filled.
When the growth of tumour cells is confined to the site of origin and the cells are of normal
physicality, they are classified as benign tumours. When the cells are abnormal and can grow
uncontrollably, they are classified as cancerous cells, i.e. a malignant tumour. Tumours are also
called 'neoplasms'. To determine whether a tumour is benign or cancerous, a doctor can take a
sample of the cells with a biopsy procedure. The biopsy is then analysed under a microscope by
a pathologist (a doctor specializing in laboratory science).
[2] Comparison of machine learning models for breast cancer diagnosis
IAES International Journal of Artificial Intelligence (IJ-AI) Vol. 12, No. 1, March 2023,
pp. 415~421 ISSN: 2252-8938
Breast cancer is the most common cause of death among women worldwide. If breast cancer is
detected early, the death rate can be reduced. Machine learning (ML) techniques are a hot topic
of study and have proved influential in cancer prediction and early diagnosis. This
study's objective is to predict and diagnose breast cancer using ML models and evaluate the
most effective based on six criteria: specificity, sensitivity, precision, accuracy, F1-score and
receiver operating characteristic curve. All work is done in the anaconda environment, which
uses Python's NumPy and SciPy numerical and scientific libraries, and pandas and matplotlib.
This study used the Wisconsin diagnostic breast cancer dataset to test ten ML algorithms:
decision tree, linear discriminant analysis, forests of randomized trees, gradient boosting,
passive aggressive, logistic regression, naïve Bayes, nearest centroid, support vector machine,
and perceptron. After collecting the findings, we performed a performance evaluation and
compared these various classification techniques. Gradient boosting model outperformed all
other algorithms, scoring 96.77% on the F1-score.
[3] Comparing Logistic Regression to the K-nearest Neighbours (KNN) technique, A
Novel Pattern Discovery Based Human Activity Recognition Research Scholar, Saveetha
School of Engineering, Saveetha Institute of Medical and Technical Sciences, Saveetha
University, Chennai, Tamil Nādu, India, Pincode:602105.10(1S) 1625-1624
The main objective of this research study is to improve accuracy for Human Activity
Recognition using Data Analysis Techniques. Materials and Methods: Logistic Regression with
a sample size of 10 and the K-nearest Neighbours (KNN) algorithm with a sample size of 10
were iterated at different times, predicting the accuracy percentage of Human Activity
Recognition. Results: Human Activity Recognition utilizing Novel Logistic Regression achieved
95.52% accuracy compared with 90.52% for the K-nearest Neighbours (KNN) algorithm
(p=0.42) (p=0.5). The Logistic Regression algorithm in computer vision appears to perform
significantly better than the K-nearest Neighbours (KNN) algorithm.
[4] Extract the Similar Images Using the Grey Level Cooccurrence Matrix and the Hu
Invariants Moments Beshaier A. Abdulla a*, Yossra H. Ali b, Nuha J. Ibrahim c a
Department of Computers Science, University of Technology, Iraq.
In the last years, many types of research have introduced different methods and techniques for
a correct and reliable image retrieval system. The goal of this paper is a comparison study
between two different methods which are the Grey level co-occurrence matrix and the Hu
invariants moments, and this study is done by building up an image retrieval system employing
each method separately and comparing between the results. The Euclidian distance measure is
used to compute the similarity between the query image and database images. Both systems
are evaluated according to the measures that are used in detection, description, and matching
fields which are precision, recall, and accuracy, and addition to that mean square error (MSE)
and structural similarity index (SSIM) is used. And as it shows from the results the Grey level
co-occurrence matrix (GLCM) had outstanding and better results from the Hu invariants
moment method.
[5] Detection and Classification of Cancer from Microscopic Biopsy Images Using
Clinically Significant and Biologically Interpretable Features Rajesh Kumar, Rajeev
Srivastava, and Subodh Srivastava Department of Computer Science and Engineering,
Indian Institute of Technology (Banaras Hindu University), Varanasi 221005, India
This study proposes an automated framework for cancer detection from microscopic biopsy
images. It includes image enhancement, background cell segmentation, feature extraction, and
classification stages. Selected methods, like Contrast Limited Adaptive Histogram
Equalization for image enhancement and the k-means segmentation algorithm for background
cell segmentation, were chosen based on comparative analysis. Various clinically significant
features are extracted, including Gray level texture, colour-based, and wavelet features.
Classification into normal and cancerous categories is done using the K-nearest Neighbour
method. The framework's performance is evaluated on 1000 biopsy images representing four
tissue types.
[6] VolcAshDB: a Volcanic Ash Database of classified particle images and features
Damià Benet, Fidel Costa, Christina Widiwijayanti, John Pallister, Gabriela Pedreros,
Patrick Allard, Hanik Humaida, Yosuke Aoki, Fukashi Maeno
Volcanic ash is a valuable source of information for understanding volcanic activity, but
classifying ash particles is challenging due to varying observations and lack of standardized
methodologies. To address this, we developed Volcanic Ash Database (VolcAshDB),
containing over 6,300 high-resolution images of ash particles and quantitative features. Each
particle is classified into one of four main categories: free crystal, altered material, lithic, and
juvenile. VolcAshDB is publicly available and facilitates comparative studies and machine
learning model training to automate particle classification and reduce observer biases.
[7] Gray level co-occurrence matrix (GLCM) texture-based crop classification using low
altitude remote sensing platforms Naveed Iqbal, # Rafia Mumtaz,# Uferah Shafi,
and Syed Mohammad Hassan Zaidi
For many computer vision and machine learning problems, large training sets are key for good
performance. However, the most computationally expensive part of many computer vision
and machine learning algorithms consists of finding nearest neighbour matches to high
dimensional vectors that represent the training data. We propose new algorithms for
approximate nearest neighbour matching and evaluate and compare them with previous
algorithms. For matching high dimensional features, we find two algorithms to be the most
efficient: the randomized k-d forest and a new algorithm proposed in this paper, the priority
search k-means tree. We also propose a new algorithm for matching binary features by
searching multiple hierarchical clustering trees and show it outperforms methods typically used
in the literature. We show that the optimal nearest neighbour algorithm and its parameters
depend on the data set characteristics and describe an automated configuration procedure for
finding the best algorithm to search a particular data set. In order to scale to very large data sets
that would otherwise not fit in the memory of a single machine, we propose a distributed nearest
neighbour matching framework that can be used with any of the algorithms described in the
paper. All this research has been released as an open-source library called the Fast Library for
Approximate Nearest Neighbours (FLANN), which has been incorporated into OpenCV and is
now one of the most popular libraries for nearest-neighbour matching.
[8] Logistic regression in data analysis: An overview, July 2011 International Journal of
Data Analysis Techniques and Strategies 3(3):281-299
DOI:10.1504/IJDATS.2011.041335
Logistic regression (LR) continues to be one of the most widely used methods in data mining
in general and binary data classification in particular. This paper is focused on providing an
overview of the most important aspects of LR when used in data analysis, specifically from an
algorithmic and machine learning perspective and how LR can be applied to imbalanced and
rare events data.
[9] Machine Learning: Algorithms, Real-World Applications and Research Directions
Iqbal H. Sarker1,2 Received: 27 January 2021 / Accepted: 12 March 2021 / Published
online: 22 March 2021
In the current age of the Fourth Industrial Revolution (4IR or Industry 4.0), the digital world
has a wealth of data, such as Internet of Things (IoT) data, cybersecurity data, mobile data,
business data, social media data, health data, etc. To intelligently analyse these data and
develop the corresponding smart and automated applications, the knowledge of artificial
intelligence (AI), particularly machine learning (ML), is the key. Various types of machine
learning algorithms such as supervised, unsupervised, semi-supervised, and reinforcement
learning exist in the area. Besides, deep learning, which is part of a broader family of
machine learning methods, can intelligently analyse data on a large scale. In this paper, we
present a comprehensive view on these machine learning algorithms that can be applied to
enhance the intelligence and the capabilities of an application. Thus, this study’s key
contribution is explaining the principles of different machine learning techniques and their
applicability in various real-world application domains, such as cybersecurity systems, smart
cities, healthcare, e-commerce, agriculture, and many more. We also highlight the challenges
and potential research directions based on our study. Overall, this paper aims to serve as a
reference point for both academia and industry professionals as well as for decision-makers in
various real-world situations and application areas, particularly from the technical point of
view.
[10] High-Level K-Nearest Neighbours (HLKNN): A Supervised Machine Learning
Model for Classification Analysis Elife Ozturk Kiyak 1, Bita Ghasemkhani 2 and Derya
Birant 3, Independent Researcher, Izmir 35390, Turkey; elife.ozturk@cs.deu.edu.tr 2
Graduate School of Natural and Applied Sciences, Dokuz Eylul University, Izmir 35390,
Turkey.
The k-nearest neighbours (KNN) algorithm has been widely used for classification analysis in
machine learning. However, it suffers from noise samples that reduce its classification ability
and therefore prediction accuracy. This article introduces the high-level k-nearest neighbours
(HLKNN) method, a new technique for enhancing the k-nearest neighbours algorithm, which
can effectively address the noise problem and contribute to improving the classification
performance of KNN. Instead of only considering the k neighbours of a given query instance, it
also takes into account the neighbours of these neighbours. Experiments were conducted on
32 well-known popular datasets. The results showed that the proposed HLKNN method
outperformed the standard KNN method with average accuracy values of 81.01% and 79.76%,
respectively. In addition, the experiments demonstrated the superiority of HLKNN over
previous KNN variants in terms of the accuracy metric on various datasets.
Summary:
The presented research papers offer a comprehensive exploration of machine learning and
computer vision applications across diverse domains. Central to these studies is the pivotal role
of logistic regression and the Grey Level Co-occurrence Matrix (GLCM) in advancing
classification and image analysis tasks. Beginning with breast cancer prediction and diagnosis,
the research underscores the significance of leveraging machine learning techniques, with
logistic regression emerging as a potent tool for accurate classification. Subsequent
investigations delve into human activity recognition, where logistic regression showcases its
superiority over K-nearest neighbours (KNN), highlighting its efficacy in computer vision
tasks. Additionally, GLCM emerges as a key player in image retrieval methods. Notably, in
the context of automated cancer detection from microscopic biopsy images, GLCM's use
alongside techniques such as image enhancement, segmentation, and classification
underscores its importance in achieving accurate diagnoses. These findings highlight the
pivotal roles of logistic regression and GLCM in driving advances across the machine
learning and computer vision domains, ultimately contributing to the resolution of complex
challenges in extracting image features.
CHAPTER 3
EXISTING METHODOLOGY
3.1 K-NEAREST NEIGHBOURS:
Using the K-Nearest Neighbours (KNN) algorithm for breast cancer prediction involves
collecting a dataset comprising relevant features such as tumour characteristics and patient
data, with corresponding labels indicating benign or malignant status. After preprocessing the
data, including feature selection and normalization, the KNN model is trained by storing the
entire dataset. During prediction, the algorithm calculates the distance between a new,
unlabelled data point and all other points in the training set, identifying the k nearest neighbours
based on distance metrics such as Euclidean distance.
The k-nearest neighbours (KNN) algorithm is a non-parametric, supervised learning classifier
which uses proximity to make classifications or predictions about the grouping of an individual
data point. It is one of the simplest and most popular classification and regression methods
used in machine learning today.
While the KNN algorithm can be used for either regression or classification problems, it is
typically used as a classification algorithm.
Here's a basic overview of how KNN works:
1. Training Phase: In the training phase, KNN simply memorizes the features and labels of
the training dataset. No explicit model is built.
2. Prediction Phase: When a new data point is provided, KNN calculates the distance
between that point and every other point in the training dataset. The most common distance
metric used is Euclidean distance, but other metrics like Manhattan distance or cosine
similarity can also be used.
3. Choosing K: K represents the number of nearest neighbours to consider. A small value of
K can make the model sensitive to noise, while a large value of K can make the model
overly generalized. Choosing the right value of K is crucial and often involves
experimentation and cross-validation.
4. Classification: For classification tasks, KNN takes a majority vote among the K nearest
neighbours and assigns the class label that is most common among them to the new data
point.
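The four steps above can be sketched in plain Python. This is an illustrative, minimal implementation rather than the project's actual code; the two-feature samples (mean radius, mean texture) and their labels are hypothetical.

```python
from collections import Counter
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors (Euclidean metric)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    # Step 1: "training" is just storing the data (done by the caller).
    # Step 2: rank all training points by distance to the query.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], query))
    # Steps 3-4: take the k nearest neighbours and majority-vote on their labels.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical samples: (mean radius, mean texture) with benign/malignant labels
train_X = [(12.0, 15.0), (13.1, 14.2), (20.5, 22.0), (21.0, 25.3), (11.8, 16.1)]
train_y = ["benign", "benign", "malignant", "malignant", "benign"]

print(knn_predict(train_X, train_y, (12.5, 15.5), k=3))   # → benign
print(knn_predict(train_X, train_y, (20.8, 24.0), k=3))   # → malignant
```

In practice a library implementation such as scikit-learn's `KNeighborsClassifier` would be used, but the voting logic is the same.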
3.2 FLOW CHART
The KNN pipeline proceeds as: Data → split into Training data and Testing data → Choosing K
→ KNN algorithm → KNN Model → Classification.
1. Data:
To train a K-Nearest Neighbours (KNN) model, you'll need a dataset consisting of features
and their corresponding labels. Here's a breakdown of the data you'll need:
Features: These are the characteristics or attributes used to predict the label. Each feature
should be represented as a numerical value or a value that can be converted into a numerical
representation. For example, if you're predicting breast cancer, features might include tumour
size, age of the patient, number of positive lymph nodes, etc.
Labels: These are the outcomes or classes that you want to predict based on the features.
Labels should be categorical values, such as "benign" or "malignant" for breast cancer
prediction.
2. Splitting of data:
Splitting the data into training and testing sets is a fundamental step in machine learning
model development.
A. Training Set: The training set is used to train the machine learning model. It consists
of a portion of the data (typically the majority) and includes both the features (input
variables) and their corresponding labels (output variables). During training, the model
learns the patterns and relationships between the features and labels in the training data.
B. Testing Set: The testing set is used to evaluate the performance of the trained model. It
consists of a separate portion of the data that was not used during training. The testing set
allows us to assess how well the trained model generalizes to new, unseen data. By
evaluating the model on data it hasn't seen before, we can estimate its performance in
real-world scenarios.
Purpose of Splitting:
Splitting the data into training and testing sets helps prevent overfitting, which occurs when a
model learns to memorize the training data instead of learning general patterns. By evaluating
the model on separate testing data, we can assess its ability to generalize beyond the training
examples. Additionally, splitting the data allows us to measure the model's performance
objectively and identify potential issues such as underfitting or overfitting.
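A minimal sketch of such a split, using only the standard library; in practice scikit-learn's `train_test_split` provides the same behaviour. The 80/20 ratio and toy data are illustrative.

```python
import random

def train_test_split(X, y, test_ratio=0.2, seed=42):
    # Shuffle indices reproducibly, then carve off the last test_ratio share
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    cut = int(len(X) * (1 - test_ratio))
    train_idx, test_idx = idx[:cut], idx[cut:]
    return ([X[i] for i in train_idx], [y[i] for i in train_idx],
            [X[i] for i in test_idx], [y[i] for i in test_idx])

X = [[float(i)] for i in range(10)]   # 10 toy one-feature samples
y = [i % 2 for i in range(10)]        # alternating benign(0)/malignant(1) labels
X_tr, y_tr, X_te, y_te = train_test_split(X, y, test_ratio=0.2)
print(len(X_tr), len(X_te))           # → 8 2
```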
3. Distance Metrics of KNN:
To determine which data points are closest to a given query point, the distance between the
query point and the other data points will need to be calculated. These distance metrics help
to form decision boundaries, which partition query points into different regions.
1. Euclidean distance (p = 2): This is the most commonly used distance measure, and it is
limited to real-valued vectors. Using the formula below, it measures a straight line between
the query point and the other point being measured.

Distance = √( ∑ᵢ₌₁ⁿ (xᵢ − yᵢ)² )        (3.1)
2. Manhattan distance (p = 1): This is another popular distance metric, which measures the
absolute value between two points. It is also referred to as taxicab distance or city block
distance, as it is commonly visualized with a grid, illustrating how one might navigate from one
address to another via city streets.

Manhattan Distance = d(x, y) = ∑ᵢ₌₁ᵐ |xᵢ − yᵢ|        (3.2)
3. Minkowski distance: This distance measure is the generalized form of the Euclidean and
Manhattan distance metrics. The parameter p in the formula below allows for the creation of
other distance metrics: Euclidean distance is obtained when p equals two, and Manhattan
distance when p equals one.

Minkowski distance = ( ∑ᵢ₌₁ⁿ |xᵢ − yᵢ|ᵖ )^(1/p)        (3.3)
4. Hamming distance: This technique is typically used with Boolean or string vectors,
identifying the points where the vectors do not match. As a result, it has also been referred to
as the overlap metric. It can be represented with the following formula:

Hamming Distance = Dᴴ = ∑ᵢ₌₁ᵏ |xᵢ − yᵢ|,  where each term is 0 when xᵢ = yᵢ and non-zero when xᵢ ≠ yᵢ        (3.4)
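The parametric metrics above collapse into a single Minkowski function (p = 1 gives Manhattan, p = 2 gives Euclidean), and Hamming distance is a mismatch count. A small sketch with illustrative vectors:

```python
def minkowski(x, y, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def hamming(x, y):
    # Number of positions where the two sequences disagree
    return sum(a != b for a, b in zip(x, y))

x, y = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(minkowski(x, y, 1))             # Manhattan: 3 + 4 + 0 → 7.0
print(minkowski(x, y, 2))             # Euclidean: sqrt(9 + 16 + 0) → 5.0
print(hamming("karolin", "kathrin"))  # the strings differ in 3 positions → 3
```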
Defining K:
In K-Nearest Neighbours (KNN), "K" represents the number of nearest neighbours used to
predict a new data point. It's a hyperparameter that needs to be specified before training the
model. When a new data point is provided, the algorithm calculates the distances between that
point and all the points in the training dataset. It then selects the K closest points (nearest
neighbours) based on these distances.
Fig. 3.1 K-Nearest Neighbour
3.3 TYPES OF KNN:
1. Basic KNN: This is the standard version of the algorithm, where the prediction for a new
data point is made based on the majority class (for classification) or the average (or weighted
average) of the target values (for regression) of its nearest neighbours.
2. Weighted KNN: In weighted KNN, instead of giving equal importance to all nearest
neighbours, the algorithm assigns weights to them based on their distance from the new data
point. Typically, closer neighbours have a higher weight, while farther neighbours have a
lower weight. This helps to give more influence to neighbours that are more similar to the
new data point.
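A sketch of the weighted variant, using inverse-distance weights and hypothetical two-feature points; scikit-learn offers the same behaviour via `KNeighborsClassifier(weights='distance')`.

```python
from collections import defaultdict
import math

def weighted_knn_predict(train_X, train_y, query, k=3):
    # Keep the k nearest neighbours, then weight each vote by 1/distance
    # so closer neighbours influence the prediction more strongly.
    nearest = sorted(
        (math.dist(pt, query), label) for pt, label in zip(train_X, train_y)
    )[:k]
    scores = defaultdict(float)
    for d, label in nearest:
        scores[label] += 1.0 / (d + 1e-9)   # epsilon avoids division by zero
    return max(scores, key=scores.get)

train_X = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
train_y = ["benign", "benign", "malignant", "malignant"]
print(weighted_knn_predict(train_X, train_y, (0.1, 0.1), k=3))  # → benign
```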
3.4 ADVANTAGES OF KNN:
1. Easy to implement: Given the algorithm’s simplicity and accuracy, it is one of the first
classifiers that a new data scientist will learn.
2. Adapts easily: As new training samples are added, the algorithm adjusts to account for
any new data since all training data is stored in memory.
3. Few hyperparameters: KNN requires only a k value and a distance metric, far fewer
hyperparameters than most other machine learning algorithms.
3.5 DISADVANTAGES OF KNN:
1. Curse of dimensionality: The KNN algorithm tends to fall victim to the curse of
dimensionality, which means that it doesn’t perform well with high-dimensional data
inputs. This is sometimes also referred to as the peaking phenomenon, where after the
algorithm attains the optimal number of features, additional features increase the number
of classification errors, especially when the sample size is smaller.
2. Computational Complexity: One of the major drawbacks of KNN is its computational
complexity, especially during the prediction phase. As the algorithm needs to calculate
the distances between the new data point and all points in the training set, it can become
computationally expensive, particularly with large datasets or high-dimensional data. This
makes KNN inefficient for real-time applications or datasets with a large number of
features.
3. Memory Usage: Since KNN is an instance-based algorithm, it needs to store the entire
training dataset in memory. This can be memory-intensive, especially for large datasets,
and may not be feasible for memory-constrained environments.
4. Sensitive to Outliers and Noise: KNN is sensitive to outliers and noisy data points, as they
can significantly affect the calculation of distances and the prediction outcome. Outliers or
irrelevant features can distort the decision boundaries and lead to poor performance.
5. Optimal Value of K: Choosing the right value for K is crucial for the performance of the
KNN algorithm. Selecting an inappropriate value of K can result in underfitting or
overfitting the data. Determining the optimal value often requires experimentation and
validation techniques.
CHAPTER 4
PROPOSED METHODOLOGY
4.1 SYSTEM ARCHITECTURE
The system architecture gives an overview of the workings of the system. The working of this
system is described as shown in Figure 4.1.
Fig. 4.1 System Working
Designing a system architecture for breast cancer prediction typically involves several
components, including data collection, preprocessing, feature extraction, model training, and
deployment. Here's a high-level overview of a system architecture for breast cancer prediction:
1. Data Collection:
Obtain datasets containing relevant medical information such as patient demographics, medical
history, imaging data (like mammograms), and biopsy results. Datasets may come from
hospitals, research institutions, or public repositories like the SEER database.
2. Data Pre-processing:
• Clean the data to handle missing values, outliers, and inconsistencies.
• Normalize or standardize numerical features.
• Encode categorical variables.
• Split the data into training, validation, and testing sets.
3. Feature Extraction:
• Extract relevant features from the data that can help in predicting breast cancer.
• Features may include demographic information (age, ethnicity), clinical data (family
history, previous biopsies), and imaging features (from mammograms or MRI scans).
4. Model Selection and Training:
• Choose appropriate machine learning or deep learning models for breast cancer prediction.
• Train the selected models using the pre-processed data.
5. Evaluation:
• Evaluate the trained models using metrics such as accuracy, precision, recall, and F1-score.
• Perform cross-validation to assess model generalization.
• Compare the performance of different models to select the best one.
6. Deployment:
• Once a satisfactory model is obtained, deploy it in a production environment.
• Develop an interface (web-based, mobile app, or API) for users to interact with the model.
• Ensure security and privacy measures are in place, especially when dealing with sensitive
medical data.
• Monitor the deployed model for performance and retrain periodically with new data if
necessary.
7. Continuous Improvement:
• Regularly update the model with new data to improve prediction accuracy.
• Incorporate feedback from clinicians and end-users to refine the system and address any
usability issues.
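The evaluation metrics named in step 5 follow directly from confusion-matrix counts. A minimal sketch with hypothetical labels; in practice `sklearn.metrics` provides these functions.

```python
def classification_metrics(y_true, y_pred, positive="malignant"):
    # Confusion-matrix counts, treating "malignant" as the positive class
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

y_true = ["malignant", "benign", "malignant", "benign", "malignant"]
y_pred = ["malignant", "benign", "benign", "benign", "malignant"]
acc, prec, rec, f1 = classification_metrics(y_true, y_pred)
print(acc, prec)   # → 0.8 1.0  (one malignant case was missed, so recall is 2/3)
```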
4.2 MACHINE LEARNING (ML)
Machine learning (ML) is a branch of artificial intelligence and computer science that focuses
on using data and algorithms to enable AI to imitate the way that humans learn, gradually
improving its accuracy.
The basic concept of machine learning in data science involves using statistical learning and
optimization methods that let computers analyse datasets and identify patterns.
How does Machine Learning work:
The learning system of a machine learning algorithm mainly consists of three main parts.
1. Decision Process
2. An Error Function
3. Model Optimization Process
Decision Process:
In general, machine learning algorithms are used to make a prediction or classification. Based
on some input data, which can be labelled or unlabelled, your algorithm will produce an
estimate about a pattern in the data.
An Error Function:
An error function evaluates the prediction of the model. If there are known examples, an error
function can make a comparison to assess the accuracy of the model.
Model Optimization Process:
If the model can fit better to the data points in the training set, then weights are adjusted to
reduce the discrepancy between the known example and the model estimate. The algorithm
will repeat this iterative “evaluate and optimize” process, updating weights autonomously until
a threshold of accuracy has been met.
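This "evaluate and optimize" loop can be made concrete with the simplest possible case: gradient descent on a one-weight least-squares model. The data (generated by y = 2x), learning rate, and step count are illustrative only.

```python
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated by the true relationship y = 2x

w, lr = 0.0, 0.01           # initial weight and learning rate
for step in range(500):
    # Error function: gradient of the mean squared error of w*x against y
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    # Optimization: nudge the weight against the gradient
    w -= lr * grad
print(round(w, 3))          # converges to the true weight → 2.0
```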
4.3 MACHINE LEARNING METHODS
Machine learning (ML) has four basic types of learning methods
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Semi-Supervised Machine Learning
4. Reinforcement Machine Learning
4.3.1 Supervised Machine Learning:
Supervised learning is the type of machine learning in which machines are trained using well
"labelled" training data, and based on that data, machines predict the output. The labelled data
means some input data is already tagged with the correct output.
Supervised learning is a process of providing input data as well as correct output data to the
machine learning model. A supervised learning algorithm aims to find a mapping function to
map the input variable(x) with the output variable(y).
During training, the model adjusts its parameters based on the provided input-output pairs to
minimize the error between its predictions and the true labels. Once trained, the model can
make predictions on new, unseen data. The term "supervised" refers to the presence of labelled
data that guides the learning process by providing correct answers for the training examples.
Fig. 4.2 Supervised Learning
4.3.2 Unsupervised Machine Learning:
Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group that
data according to similarities, and represent that dataset in a compressed format.
The advantages of Unsupervised learning are:
● Unsupervised learning helps find useful insights from the data.
● Unsupervised learning is very similar to how a human learns to think by their own
experiences, which makes it closer to real AI.
● Unsupervised learning works on unlabelled and uncategorized data which makes
unsupervised learning more important.
● In the real world, we do not always have input data with the corresponding output so to solve
such cases, we need unsupervised learning.
Fig. 4.3 Unsupervised Learning
4.3.3 Semi-Supervised Learning:
Semi-supervised learning offers a happy medium between supervised and unsupervised
learning. During training, it uses a smaller labelled data set to guide classification and feature
extraction from a larger, unlabelled data set. Semi-supervised learning can solve the problem
of not having enough labelled data for a supervised learning algorithm. It also helps if it’s too
costly to label enough data.
Fig. 4.4 Semi-Supervised Learning
4.3.4 Reinforcement Learning:
Reinforcement learning is an area of Machine Learning. It is about taking suitable action to
maximize reward in a particular situation. It is employed by various software and machines to
find the best possible behaviour or path it should take in a specific situation. Reinforcement
learning differs from supervised learning: in supervised learning the training data comes with
the answer key, so the model is trained on the correct answers, whereas in reinforcement
learning there is no answer key and the reinforcement agent decides what to do to perform the
given task. In the absence of a training dataset, it is bound to learn from its own experience.
Fig. 4.5 Reinforcement Learning
PROPOSED ALGORITHM: LOGISTIC REGRESSION
Logistic regression is one of the most popular Machine Learning algorithms, which comes
under the Supervised Learning technique. It is used for predicting the categorical dependent
variable using a given set of independent variables.
Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. However,
instead of giving the exact values 0 and 1, it gives probabilistic values which lie between
0 and 1.
Logistic regression is very similar to linear regression except in how the two are used: linear
regression is used for solving regression problems, whereas logistic regression is used for
solving classification problems.
In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic function,
which predicts two maximum values (0 or 1).
f(x) = 1 / (1 + e^(−x))        (4.1)
The curve from the logistic function indicates the likelihood of something such as whether the
cells are cancerous or not, whether a mouse is obese or not based on its weight, etc.
Fig. 4.6 Sigmoid Function
Logistic Regression is a significant machine learning algorithm because it can provide
probabilities and classify new data using continuous and discrete datasets.
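A minimal sketch of Eq. (4.1) and of fitting a one-feature logistic regression by gradient descent. The "tumour size" feature, labels, learning rate, and iteration count are all hypothetical; in practice scikit-learn's `LogisticRegression` would be used.

```python
import math

def sigmoid(z):
    # Eq. (4.1): maps any real number into the interval (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature dataset: tumour size, with 1 = malignant, 0 = benign
xs = [1.0, 1.5, 2.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 1, 1, 1]

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)    # predicted probability of "malignant"
        w -= lr * (p - y) * x     # gradient of the log-loss w.r.t. the weight
        b -= lr * (p - y)         # ... and w.r.t. the bias

print(sigmoid(w * 1.0 + b) < 0.5)   # small tumour → P(malignant) below 0.5 → True
print(sigmoid(w * 4.5 + b) > 0.5)   # large tumour → P(malignant) above 0.5 → True
```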
4.4 WORK FLOW:
The proposed pipeline proceeds as: Collection of microscopic biopsy images → Feature
Extraction → Data processing → Train and Evaluation split → Logistic Regression →
Prediction → Classification as Breast Cancer Negative or Breast Cancer Positive.
4.4.1 MICROSCOPIC BIOPSY IMAGES:
Microscopic biopsy images of breast cancer are crucial for the diagnosis and treatment of the
disease. These images are obtained through procedures such as fine needle aspiration (FNA)
biopsy or core needle biopsy, where a small sample of tissue is collected from the breast for
examination under a microscope.
In these images, pathologists analyse the tissue samples to identify characteristic features of
breast cancer, such as abnormal cell growth patterns, presence of cancerous cells, and tissue
architecture. Key features that pathologists look for include the size, shape, and arrangement
of cells, presence of mitotic figures (indicating cell division), and features like nuclear atypia
and tumour-infiltrating lymphocytes.
Fig. 4.7 Biopsy Image
We collected breast cancer images from Kaggle as a step toward building a robust dataset
for breast cancer detection and classification tasks. The availability of diverse
image datasets plays a crucial role in training, feature extracting, and evaluating machine
learning models for medical image analysis. Kaggle, being a popular platform for data science
competitions and repositories, offers a wide range of datasets, including those related to breast
cancer.
4.4.2 FEATURE EXTRACTION:
The Gray Level Co-occurrence Matrix (GLCM) is a common method for describing the texture
of images by studying their spatial correlation characteristics. In 1973, Haralick et al.
proposed using the GLCM to describe texture features. The GLCM has proven effective for
breast cancer histopathological image recognition, especially when applied to the three
channels of the images. In this project, three-channel features are considered. We compute the
GLCM in four directions, 0, π/4, π/2, and 3π/4, with 256 gray levels. From the GLCM, 22
features can be calculated, including autocorrelation, contrast, correlation (in two forms),
cluster prominence, cluster shade, dissimilarity, energy, entropy, homogeneity (in two forms),
maximum probability, sum of squares, sum average, sum variance, sum entropy, difference
variance, difference entropy, normalized inverse difference, normalized inverse difference
moment, and information measures of correlation (in two forms).
Given the GLCM of an image, p(i, j) is the (i, j)th entry of the normalized GLCM. px(i) is the
ith entry of the marginal probability matrix obtained by summing the rows of p(i, j). Ng is the
number of distinct gray levels in the quantized image, and μ is the mean value of the
normalized GLCM. The means and variances of the rows and columns of the matrix are

μx = Σi Σj i · p(i, j)                                            (4.2)
μy = Σi Σj j · p(i, j)                                            (4.3)
σx² = Σi Σj (i − μx)² · p(i, j)                                   (4.4)
σy² = Σi Σj (j − μy)² · p(i, j)                                   (4.5)
1. Contrast:
Contrast measures the local variations in the image. High contrast values indicate large
differences between neighbouring pixel intensities. Contrast, denoted C, is calculated as the
sum of GLCM entries weighted by the squared difference between the intensity values of each
pixel pair:

Contrast = Σ_{i,j=0}^{levels−1} p(i, j) (i − j)²                  (4.6)
2. Correlation:
Correlation is a textural feature of the GLCM that measures the linear dependency between
the gray tones of an image. It is +1 for a perfectly positively correlated image, −1 for a
perfectly negatively correlated image, and NaN for a constant image. Correlation, denoted
Corr, is calculated as the weighted covariance of the pixel pairs in the GLCM, normalized by
the standard deviations of the intensity values:

Correlation = Σ_{i,j=0}^{levels−1} p(i, j) [(i − μx)(j − μy) / √(σx² σy²)]        (4.7)
3. Dissimilarity:
Dissimilarity measures the average difference in intensity between neighbouring pixels. High
dissimilarity values indicate greater heterogeneity in texture. Dissimilarity, denoted D, is
calculated as the sum of GLCM entries weighted by the absolute difference between the
intensity values of each pixel pair:

Dissimilarity = Σ_{i,j=0}^{levels−1} p(i, j) |i − j|              (4.8)
4. Homogeneity:
The homogeneity feature of the GLCM measures the similarity or uniformity of grayscale
values in an image. The GLCM quantifies texture by recording how often pixel pairs with
specific intensity values occur at a given spatial relationship; homogeneity reflects the
closeness of this distribution of elements to the GLCM diagonal:

Homogeneity = Σ_{i,j=0}^{levels−1} p(i, j) / (1 + (i − j)²)       (4.9)
5. Angular Second Moment (ASM):
The Angular Second Moment (ASM), also known as energy, is a statistical measure derived
from the GLCM in texture analysis. ASM measures the uniformity or homogeneity of
grayscale values in an image and indicates the orderliness or predictability of the texture
pattern: a higher ASM value reflects a more uniform texture with less variation in pixel
intensity values. ASM is calculated as the sum of the squared elements of the GLCM:

ASM = Σ_{i,j=0}^{levels−1} p(i, j)²                               (4.10)
6. Energy:
Energy, like ASM, is a statistical measure derived from the GLCM and reflects the uniformity
or homogeneity of grayscale values in an image texture. Energy, denoted E, is the square root
of the ASM:

Energy = √ASM                                                     (4.11)
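The six features above can all be computed directly from a normalized GLCM. The sketch below is a pure-NumPy illustration (not the exact implementation used in this project) that builds a symmetric GLCM for a single pixel offset and evaluates Eqs. (4.6)-(4.11); the 8-level random image is a hypothetical stand-in for a quantized biopsy image:

```python
import numpy as np

def glcm_features(image, levels=8, dx=1, dy=0):
    """Build a normalized symmetric GLCM for one offset and derive the six
    texture features of Eqs. (4.6)-(4.11). A pure-NumPy sketch."""
    glcm = np.zeros((levels, levels), dtype=np.float64)
    h, w = image.shape
    # Count co-occurring gray-level pairs at offset (dy, dx), symmetrically.
    for y in range(h - dy):
        for x in range(w - dx):
            a, b = image[y, x], image[y + dy, x + dx]
            glcm[a, b] += 1
            glcm[b, a] += 1
    glcm /= glcm.sum()                      # normalize to probabilities p(i, j)

    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    mu_x = (i * glcm).sum()
    mu_y = (j * glcm).sum()
    sd_x = np.sqrt(((i - mu_x) ** 2 * glcm).sum())
    sd_y = np.sqrt(((j - mu_y) ** 2 * glcm).sum())

    asm = (glcm ** 2).sum()
    return {
        "contrast":      (glcm * (i - j) ** 2).sum(),
        "correlation":   (glcm * (i - mu_x) * (j - mu_y)).sum() / (sd_x * sd_y),
        "dissimilarity": (glcm * np.abs(i - j)).sum(),
        "homogeneity":   (glcm / (1.0 + (i - j) ** 2)).sum(),
        "ASM":           asm,
        "energy":        np.sqrt(asm),
    }

# Hypothetical 8-level stand-in for a quantized biopsy image.
rng = np.random.default_rng(0)
img = rng.integers(0, 8, size=(32, 32))
print(glcm_features(img))
```

In the project's actual pipeline the GLCM is computed in all four directions (0, π/4, π/2, 3π/4); the single-offset version above keeps the sketch short.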
4.4.3 DATA PREPROCESSING
Data pre-processing is an important step in building a machine learning model, as shown
below. Raw data may not be clean or in the format the model requires, which can lead to
misleading outcomes. During pre-processing we transform the data into the required format
and deal with noise, duplicates, and missing values in the dataset.
Fig 4.8 Data Pre-processing
Data pre-processing has activities like importing datasets, splitting datasets, attribute scaling,
etc. Pre-processing of data is required to improve the accuracy of the model.
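A minimal pre-processing sketch with scikit-learn, assuming the six GLCM features per image are already collected in a matrix X (the random data and the 80/20 split ratio below are illustrative stand-ins):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row of six GLCM features per image,
# with a binary label (1 = cancerous, 0 = non-cancerous).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = rng.integers(0, 2, size=200)

# Split before scaling so the test set stays unseen during preprocessing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Attribute scaling: fit the scaler on training data only, then apply to both.
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)  # (160, 6) (40, 6)
```

Fitting the scaler on the training portion only avoids leaking test-set statistics into the model.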
4.4.4 TRAINING A MODEL
During the training phase, the model learns to recognize patterns or relationships in the input
data and output labels provided in the dataset. This process involves adjusting the model’s
internal parameters to minimize a predefined measure of error or loss between the predicted
outputs and the actual labels in the training data. The model iteratively updates its parameters
using an optimization algorithm, such as gradient descent, to find the optimal values that
minimize the loss function.
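As a sketch of this training step, the snippet below fits scikit-learn's LogisticRegression (whose solver iteratively minimizes the logistic loss) on hypothetical stand-in data; the feature matrix and the linear labelling rule are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data standing in for the extracted GLCM feature vectors.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6))
# Synthetic labels with a linear signal so the model has something to learn.
y = (X @ np.array([1.5, -2.0, 0.5, 0.0, 1.0, -0.5]) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

# Fitting adjusts one coefficient per feature (plus an intercept) to
# minimize the logistic loss over the training set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy :", model.score(X_test, y_test))
```

Because the synthetic labels are linearly separable, both accuracies come out close to 1 here; real biopsy features are noisier.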
4.4.5 PREDICTION OF DISEASE
Overall, model training for breast cancer prediction involves careful consideration of data
preprocessing, model selection, hyperparameter tuning, and evaluation to build an effective
and reliable predictive model. By optimizing these aspects of model training, healthcare
practitioners can develop accurate and robust predictive models for early detection and
diagnosis of breast cancer, ultimately improving patient outcomes.
4.5 ADVANTAGES OF LOGISTIC REGRESSION OVER KNN:
Although the K-Nearest Neighbours (KNN) algorithm is popular in breast cancer prediction,
logistic regression gives more accurate results when dealing with large datasets. The
advantages of the logistic regression algorithm over KNN are:
Interpretability: Logistic regression provides clear and interpretable results by estimating the
probability of each class based on the input features. This allows clinicians and researchers to
understand the factors contributing to the prediction, such as the importance of specific features
like tumour size or patient age.
Efficiency with Large Datasets: Logistic regression can be more computationally efficient,
especially with large datasets, compared to KNN. KNN requires storing the entire training
dataset and computing distances to all data points for each prediction, which can become
computationally expensive as the dataset size increases. Logistic regression, on the other hand,
involves learning model parameters from the data and making predictions based on a learned
function, which can be faster for large datasets.
Handles Irrelevant Features: Logistic regression can handle irrelevant features or noise in
the data more effectively than KNN. Logistic regression estimates the coefficients for each
feature, and features with low coefficients are considered less important for prediction. In
contrast, KNN treats all features equally and can be sensitive to irrelevant or noisy features,
potentially leading to suboptimal predictions.
Better Performance with Linearly Separable Data: Logistic regression performs well when
the decision boundary between classes is approximately linear. In cases where the relationship
between features and class labels is roughly linear or can be approximated by a linear function,
logistic regression may outperform KNN.
Regularization: Logistic regression can easily incorporate regularization techniques such as
L1 (Lasso) and L2 (Ridge) regularization to prevent overfitting and improve generalization
performance. Regularization helps control model complexity and can improve predictive
performance, especially when dealing with high-dimensional data or correlated features.
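As a sketch of how these penalties are chosen in scikit-learn (the dataset below is synthetic; C is the inverse regularization strength, so smaller C means stronger regularization):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic high-dimensional data with many uninformative features.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

# L2 (Ridge) is the default penalty: it shrinks all coefficients smoothly.
ridge = LogisticRegression(penalty="l2", C=1.0, max_iter=2000).fit(X, y)

# L1 (Lasso) needs a solver that supports it and drives weak coefficients to 0.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

n_zero = (lasso.coef_ == 0).sum()
print(f"L1 zeroed out {n_zero} of {lasso.coef_.size} coefficients")
```

The L1 model effectively performs feature selection, which ties back to the "handles irrelevant features" advantage above.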
CHAPTER 5
EXPERIMENTAL ANALYSIS
5.1 SYSTEM CONFIGURATION
Python is an interpreted, high-level, general-purpose programming language created by Guido
van Rossum and first released in 1991. Python's design philosophy emphasizes code
readability, notably through its use of significant whitespace. Its language constructs and
object-oriented approach aim to help programmers write clear, logical code for small and
large-scale projects. Python is dynamically typed and garbage-collected. The following are the
required system configurations.
5.1.1 HARDWARE REQUIREMENTS
Processor         : Any updated processor
RAM               : Minimum 4 GB
Hard Disk         : Minimum 100 GB

5.1.2 SOFTWARE REQUIREMENTS
Operating System  : Windows family
Technology        : Python 3.7
IDE               : Jupyter Notebook
5.1.3 FEATURES EXTRACTED FROM SAMPLE IMAGE:
In our image analysis process, we employed the Gray Level Co-occurrence Matrix (GLCM)
technique to extract six key texture features from biopsy images. These features, namely
Contrast, Correlation, Angular Second Moment (ASM), Energy, Homogeneity, and
Dissimilarity, offer valuable insights into the textural characteristics present within the images.
Contrast provides an indication of local intensity variations, aiding in the identification of
regions with pronounced changes in grayscale values. Correlation measures the linear
relationship between Gray levels of neighbouring pixels, offering insights into the spatial
coherence of the image. ASM, also referred to as uniformity or homogeneity, quantifies the
orderliness of intensity distribution, capturing how uniform or varied the texture appears.
Energy, derived from the sum of squared elements in the GLCM, complements ASM by
representing the overall intensity distribution’s smoothness or roughness.
Homogeneity gauges the closeness of element distribution in the GLCM to its diagonal,
offering a measure of the image’s uniformity or heterogeneity. Dissimilarity, on the other hand,
quantifies the average absolute difference between pairs of pixels, providing further granularity
on the variations in intensity across the image. By leveraging these features extracted through
GLCM analysis, we enhance our understanding of the biopsy images’ texture characteristics,
facilitating tasks such as image classification, segmentation, and the detection of anomalies
with greater precision and accuracy.
Fig.5.1 Biopsy Image
Table 5.1 Features of a sample image

Features Extracted        Values
Contrast                  1326.0000
Correlation               0.24160
Dissimilarity             0.01860
Homogeneity               0.73390
Angular Second Moment     0.08690
Energy                    0.05667
5.2 PREDICTED OUTPUT:
The logistic regression model then generates predicted outputs, which are binary values of 1 or
0. Here, a prediction of 1 implies the presence of cancer in the biopsy image, while a prediction
of 0 signifies the absence of cancer, or non-cancerous tissue. This binary classification
framework allows us to effectively distinguish between cancerous and non-cancerous regions
within the biopsy images based on the texture features extracted through GLCM analysis.
Fig.5.2 Breast Cancer Output
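A minimal illustration of this thresholding behaviour, using a hypothetical one-dimensional feature in place of the full GLCM feature vector:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy 1-D training set: label 1 when the (hypothetical) feature is large.
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
model = LogisticRegression().fit(X, y)

# predict_proba gives the class probabilities; predict thresholds at 0.5.
probs = model.predict_proba([[0.15], [0.85]])[:, 1]
labels = model.predict([[0.15], [0.85]])
print(probs)   # low probability for 0.15, high for 0.85
print(labels)  # [0 1]
```

In the project's setting, the probability would be the modeled likelihood that a biopsy image is cancerous, and the thresholded label the final 0/1 classification.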
5.3 PERFORMANCE ANALYSIS
In this project, machine learning algorithms such as KNN and logistic regression are used
for the prediction of breast cancer. We extracted features from the microscopic biopsy images
with the help of the gray-level co-occurrence matrix, namely correlation, contrast,
homogeneity, ASM, energy, and dissimilarity, for the predictive analysis. For evaluating the
experiment, evaluation metrics such as accuracy, confusion matrix, precision, recall, and
F1-score are considered.
Accuracy: Accuracy is the ratio of the number of correct predictions to the total number of
inputs in the dataset. It is expressed as:
● Accuracy = (TP + TN) /(TP+FP+FN+TN)
where
TP: True positive
FN: False Negative
FP: False Positive
TN: True Negative.
Correlation Matrix: The correlation matrix in machine learning is used for feature selection.
It represents dependency between various attributes.
Precision: It is the ratio of correct positive results to the total number of positive results
predicted by the system.
Recall: It is the ratio of correct positive results to the total number of actual positive samples
in the dataset.
F1 Score: It is the harmonic mean of Precision and Recall. It measures the test accuracy. The
range of this metric is 0 to 1.
Confusion Matrix: It gives a matrix as output that summarizes the overall performance of the system.
Fig 5.3 confusion matrix
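All of these metrics can be computed with scikit-learn; the true and predicted labels below are hypothetical:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# Hypothetical true and predicted labels (1 = cancer, 0 = non-cancerous).
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("accuracy :", accuracy_score(y_true, y_pred))   # (TP+TN)/(TP+FP+FN+TN)
print("precision:", precision_score(y_true, y_pred))  # TP/(TP+FP)
print("recall   :", recall_score(y_true, y_pred))     # TP/(TP+FN)
print("f1       :", f1_score(y_true, y_pred))
```

Note that `confusion_matrix` returns the counts in the order TN, FP, FN, TP when flattened, which is worth checking before reading off values.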
Table 5.2 TP, TN, FP, FN

                       TP    TN    FP    FN
KNN                    36    71     7     0
Logistic regression    40    70     1     3
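Plugging the logistic regression counts from Table 5.2 into the metric formulas reproduces the reported accuracy; small differences from the other reported figures presumably stem from rounding or from which class is treated as positive:

```python
# Counts from Table 5.2 (logistic regression): TP=40, TN=70, FP=1, FN=3.
tp, tn, fp, fn = 40, 70, 1, 3

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 110 / 114
precision = tp / (tp + fp)                   # 40 / 41
recall = tp / (tp + fn)                      # 40 / 43
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy  = {accuracy:.3f}")   # ~0.965, matching the reported 96.5%
print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"f1        = {f1:.3f}")
```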
5.4 RESULT
Following the application of the Logistic Regression algorithm for training and testing in our
machine learning workflow, we meticulously evaluated its performance using a comprehensive
set of metrics: Accuracy, F1 Score, Precision, and Recall. Leveraging the insights provided by
the confusion matrix—comprising counts of True Positives (TP), True Negatives (TN), False
Positives (FP), and False Negatives (FN)—we employed the respective equations for each
metric to derive their values.
Comparison of Performance Metrics:
In our analysis, we compare the performance metrics of the K-Nearest Neighbours (KNN) and
logistic regression algorithms. This comparative analysis enables us to identify the most
effective algorithm for our specific task, potentially informing clinical decision-making and
enhancing diagnostic accuracy in medical settings.
Fig 5.4 Performance Metrics
Table 5.3 Performance Metrics

                       Accuracy    Precision    Recall    F1 score
KNN                    93.8%       0.92         0.94      0.93
Logistic Regression    96.5%       0.97         0.96      0.96
CHAPTER 6
CONCLUSION AND FUTURE WORK
6.1 CONCLUSION
Breast cancer is a pressing global health issue, and its early detection is paramount for effective
treatment and better patient outcomes. With the advent of advanced technologies like machine
learning, there is significant potential to transform healthcare practices, particularly in the
realm of breast cancer prediction. Early prognosis facilitated by machine learning models
allows for timely interventions and informed treatment decisions, ultimately leading to reduced
complications and mortality rates among patients.
The increasing prevalence of breast cancer underscores the critical need for improved
diagnostic methods and treatment approaches. Machine learning algorithms, such as Logistic
Regression, offer a powerful tool for breast cancer prediction by leveraging patient
characteristics and medical history to model the probability of cancer occurrence. This enables
healthcare practitioners to enhance the accuracy and efficiency of diagnosis, thereby
facilitating timely interventions and personalized treatment plans tailored to individual patient
needs.
By integrating machine learning into breast cancer prediction, healthcare professionals can
empower patients with valuable insights and personalized care strategies. This proactive
approach not only facilitates early detection but also enables the implementation of preventive
measures and lifestyle modifications to mitigate risk factors associated with breast cancer
development. Patients can benefit from timely information and support, leading to improved
health outcomes and quality of life.
Furthermore, the integration of advanced technological support in healthcare systems
streamlines data analysis and decision-making processes, optimizing patient care delivery.
Machine learning algorithms like Logistic Regression enable healthcare providers to extract
valuable insights from large and complex datasets, facilitating more accurate and tailored
treatment strategies. This not only improves patient care but also drives advancements in
medical research and clinical practice.
In conclusion, the adoption of machine learning algorithms, particularly Logistic Regression,
for breast cancer prediction represents a significant advancement in healthcare. This innovative
approach has the potential to revolutionize breast cancer diagnosis and treatment, ultimately
leading to improved patient outcomes and advancements in the field of medicine as a whole.
Through collaborative efforts between healthcare professionals, researchers, and technology
experts, we can harness the power of machine learning to address the challenges posed by
breast cancer and improve the lives of patients worldwide.
6.2 FUTURE SCOPE
For breast cancer prediction using logistic regression, there are numerous avenues for future
research that could advance the field. Here are several potential directions:
Feature Selection: Investigate which features or variables are most informative for predicting
breast cancer and refine the logistic regression model accordingly. This could involve exploring
a wide range of clinical, demographic, and genetic factors to identify the most relevant
predictors.
Comparison with Other Algorithms: Logistic regression is just one of many machine
learning algorithms that can be applied to breast cancer prediction. Researchers could compare
the performance of logistic regression with other algorithms such as random forests, support
vector machines, or neural networks. This comparative analysis could help identify the most
effective modelling approaches for breast cancer prediction.
Model Interpretation: Explore methods for interpreting logistic regression models to gain
insights into the underlying factors associated with breast cancer risk. Researchers could
analyse the estimated coefficients of the predictor variables to understand their impact on the
likelihood of breast cancer occurrence. This could provide valuable insights into the biological,
environmental, and lifestyle factors contributing to breast cancer development.
Validation: Validate logistic regression models using independent datasets to assess their
generalizability and robustness. This validation process would involve testing the models on
data from different populations or healthcare settings to ensure their reliability and applicability
in diverse clinical contexts.
APPENDIX
Python:
Python is an interpreted, high-level, general-purpose programming language created by Guido
van Rossum and first released in 1991. Python's design philosophy emphasizes code
readability, notably through its use of significant whitespace. Its language constructs and
object-oriented approach aim to help programmers write clear, logical code for small and
large-scale projects. Python is dynamically typed and garbage-collected. It supports multiple
programming paradigms, including procedural, object-oriented, and functional programming.
Sklearn:
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It
provides a selection of efficient tools for machine learning and statistical modeling including
classification, regression, clustering and dimensionality reduction via a consistent interface in
Python. This library, which is largely written in Python, is built upon NumPy, SciPy and
Matplotlib.
NumPy:
NumPy is a library for the Python programming language, adding support for large,
multi-dimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric, was
originally created by Jim Hugunin with contributions from several other developers. In 2005,
Travis Oliphant created NumPy by incorporating features of the competing Numarray into
Numeric, with extensive modifications. NumPy is open-source software and has many
contributors.
CV2:
The OpenCV-Python library is a collection of Python bindings for dealing with computer
vision problems. The method cv2.imread() opens a file and loads an image. If the image cannot
be read, this method returns an empty matrix (due to insufficient permissions, a missing file,
or an unsupported or invalid format).
Matplotlib:
Matplotlib is a plotting library for the Python programming language and its numerical
mathematics extension NumPy. It provides an object-oriented API for embedding plots into
applications using general-purpose GUI toolkits like Tkinter, wxPython, Qt, or GTK. There is
also a procedural "pylab" interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.
Seaborn:
Seaborn is a Python data visualization library built on top of Matplotlib and closely
integrated with pandas data structures. It provides a high-level interface for drawing
attractive and informative statistical graphics, which helps in the exploration and
understanding of data.