A Comparative Study of Chronic Kidney Disease Prediction Using Machine Learning Techniques

Rehnuma Ferdous
ID: 19103008

A Thesis in the Partial Fulfillment of the Requirements for the Award of Bachelor of Computer Science and Engineering (BCSE)

Department of Computer Science and Engineering
College of Engineering and Technology
IUBAT – International University of Business Agriculture and Technology
Summer 2022

The thesis has been examined and approved,

_____________________________
Prof. Dr. Utpal Kanti Das
Professor and Chairman

_____________________________
Dr. Hasibur Rashid Chayon
Associate Professor and Coordinator

_____________________________
Arifa Tur Rahman
Assistant Professor and Supervisor

Letter of Transmittal

14 August 2022
The Chair
Thesis Defense Committee
Department of Computer Science and Engineering
IUBAT – International University of Business Agriculture and Technology
4 Embankment Drive Road, Sector 10, Uttara Model Town
Dhaka 1230, Bangladesh

Subject: Letter of Transmittal.

Dear Sir,
We are pleased to present our thesis report titled ‘A Comparative Study of Chronic Kidney Disease Prediction Using Machine Learning Techniques’, as required by IUBAT for the partial fulfillment of the requirements for the award of Bachelor of Computer Science and Engineering. It was indeed a great opportunity for us to work on this project and put our theoretical knowledge into practice. Finally, we would like to express our gratitude to you for giving us the opportunity to pursue our studies at your renowned university.
Yours sincerely,

_____________
Rehnuma Ferdous
19103008

Student’s Declaration

We hereby declare that this thesis report titled ‘A Comparative Study of Chronic Kidney Disease Prediction Using Machine Learning Techniques’ is our original work. It has never been presented previously or concurrently for any other purpose, reward or degree at IUBAT or any other institution, either by us or by any other student. We also declare that there is no plagiarism or data falsification, and that the materials used in this report and their sources have been duly cited.

_____________
Rehnuma Ferdous
19103008

Supervisor’s Certification

This is to certify that the thesis report on “A Comparative Study of Chronic Kidney Disease Prediction Using Machine Learning Techniques” has been carried out by Rehnuma Ferdous (bearing ID# 19103008), a student of the Department of Computer Science and Engineering of IUBAT – International University of Business Agriculture and Technology, as a partial fulfillment of the requirement of the degree of Bachelor of Computer Science and Engineering. The report has been prepared under my guidance and is a record of work carried out successfully. The student is now permitted to submit the report. I wish the student success in future endeavors.

_______________________________
Arifa Tur Rahman
Supervisor and Assistant Professor
Department of Computer Science and Engineering
IUBAT – International University of Business Agriculture and Technology

Abstract

Chronic Kidney Disease (CKD), also known as chronic renal disease, has emerged as a serious public health issue with a steadily increasing incidence. A person can survive without functioning kidneys for only about 18 days, which is why kidney transplants are in high demand. Dialysis is another option.
It is imperative to have effective approaches for the early detection, diagnosis and prognosis of CKD, and machine learning techniques are useful for this kind of prediction. This thesis proposes an approach for predicting CKD status from clinical data that comprises data preprocessing, a method for handling missing values using collaborative filtering, and attribute selection. Out of the 9 machine learning techniques considered, the extra trees classifier and the random forest classifier are shown to produce the best accuracy and the least bias toward individual attributes. The study also takes into account the practical problems of data gathering and emphasizes the need to apply domain knowledge when using machine learning to predict CKD status.

Index Terms - Machine learning, classification algorithms, chronic kidney disease, chronic renal disease

Acknowledgments

We take this opportunity to express our sincere gratitude to our research supervisor Arifa Tur Rahman, Department of CSE, IUBAT – International University of Business Agriculture and Technology. Our supervisor has a strong background in "Machine Learning" and a genuine interest in it. This project was made possible by our supervisor's never-ending patience, academic leadership, strong motivation, persistent encouragement, constant and energetic supervision, constructive criticism and invaluable counsel, and by the reading and correcting of numerous drafts at every stage.

We would like to express our heartiest gratitude to the Almighty Allah and also to Prof. Dr. Utpal Kanti Das, Chairman of the Department of Computer Science and Engineering, to Prof. Dr. Hasibur Rashid Chayon, Coordinator of the Department of Computer Science and Engineering, and to the other faculty members and staff of the Department of CSE of IUBAT – International University of Business Agriculture and Technology for helping us finish our research.
We would also like to extend our gratitude to all of our classmates at IUBAT – International University of Business Agriculture and Technology, who took part in discussions while also attending classes. Finally, we must respectfully acknowledge our parents' unwavering support and patience.

Table of Contents

Letter of Transmittal
Student’s Declaration
Supervisor’s Certification
Abstract
Acknowledgments
List of Figures
List of Tables
Chapter 1. Introduction
Chapter 2. Literature Review
    2.1 Chronic Kidney Disease (F. E. Murtagh et al.)
    2.2 Five stages of CKD (C. A. Johnson et al.)
    2.3 Little's MCAR method (S. Nair et al.)
    2.4 Characteristics to predict CKD (P. Yildirim et al.)
    2.5 WEKA data mining tool (D. Dua et al.)
Chapter 3. Research Methodology
    3.1 Data Preprocessing: Missing Value Handling
    3.2 Data Preprocessing: Feature Selection
    3.3 Model Training
    3.4 Model Evaluation and Selection
Chapter 4. Result and Discussion
    4.1 Algorithm feature significance standard deviation
Chapter 5. Conclusion
References

List of Figures

Figure 3.1 Proposed workflow
Figure 3.2 Heat map of attributes' relationships with the class variable
Figure 3.3 Albumin over specific gravity distribution
Figure 3.4 The ratio of serum creatinine to hemoglobin
Figure 3.5 Diabetes mellitus prevalence compared to high blood pressure
Figure 3.6 The distribution of appetite
Figure 3.7 Feature importance of each trained model

List of Tables

Table 3.1 TESTS FOR MEASURING MULTIPLE ATTRIBUTES AND MISSING VALUE PERCENTAGE
Table 3.2 LITTLE'S MCAR TEST RESULT
Table 3.3 PERCENTAGE CHANGE OF STATISTICS OF ATTRIBUTES AFTER FILLING MISSING VALUES
Table 3.4 ACCURACIES OF EACH ALGORITHM
Table 3.5 PRECISION, RECALL AND F1-SCORE OF EACH ALGORITHM
Table 3.6 FEATURE IMPORTANCE OF EACH ALGORITHM
Table 3.7 ALGORITHM IMPORTANCE STANDARD DEVIATION OF FEATURE

Chapter 1. Introduction

The kidneys are two bean-shaped organs, each roughly the size of a fist. They sit immediately under the rib cage, one on each side of the spine. The kidneys filter 120 to 150 quarts of blood each day to generate 1 to 2 quarts of urine. The primary function of the kidneys is to eliminate excess fluid and waste from the body through urine. A series of highly complex excretion and reabsorption processes combine to form urine. The body needs this system to maintain a stable chemical balance. The kidneys are in charge of controlling the body's salt, potassium, and acid levels as well as producing hormones that affect how other organs function. Among other things, hormones produced by the kidneys stimulate the production of red blood cells, regulate blood pressure, and control calcium metabolism.
Chronic kidney disease (CKD) is currently regarded as a major hazard to public health. With the help of laboratory tests, it is possible to diagnose chronic kidney disease, and there are treatments available to stop the disease from progressing or slow it down, lessen the complications of a lower GFR and the risk of cardiovascular disease, and improve survival and quality of life. Lack of water consumption, smoking, a poor diet, insufficient sleep, and a number of other factors can cause CKD. Around the world, 753 million people were afflicted by this illness in 2016, including 417 million females and 336 million males. Most often, the condition is discovered in its later stages, which can lead to renal failure. (“Kidney Disease: The Basics,” Aug. 2014)

A major problem worldwide, chronic kidney disease (CKD) is characterized by a gradual decline of kidney function over time. 14% of people on the planet have CKD. Over two million people worldwide depend on dialysis or a kidney transplant to stay alive, even though this number may represent only 10% of those who need treatment to survive. More people die from chronic renal illness than from breast or prostate cancer.

The estimated glomerular filtration rate (eGFR), which is calculated from creatinine level, gender, race, and age, generally determines the stage of CKD. There are five stages of kidney function. Function is normal in stage 1 and marginally impaired in stage 2; however, in the vast majority of cases, function is at stage 3. (F. E. Murtagh, vol. 40, no. 3, pp. 342–352, 2010)

The following signs and symptoms can manifest in a patient with untreated CKD: high blood pressure, anemia, fatigue, poor nutrition, nerve damage, and a reduced immune response, because at more advanced stages harmful concentrations of fluids, electrolytes, and wastes can accumulate in the blood and body. Because of this, it is crucial to identify CKD as soon as possible.
However, this can be challenging because the disease's signs and symptoms develop gradually and are not specific to CKD. Because some people have no symptoms at all, machine learning can be effective in determining whether or not a patient has CKD. Machine learning achieves this by training a predictive model on historical CKD patient data.

The most reliable test for determining kidney condition and the stage of chronic kidney disease is the glomerular filtration rate (GFR). It may be calculated using blood creatinine, age, race, gender, and other factors. The earlier a disease is detected, the greater the chance of halting or preventing it. (D. Dua, 2017)

It is possible to predict CKD status and CKD stage using machine learning. Machine learning is an important branch of artificial intelligence that applies classification and regression algorithms to infer future outcomes from past data. Machine learning strategies for CKD prediction have been studied on the basis of various data sets. The UCI repository dataset is listed among them as a benchmark dataset. The benchmark dataset is considered in this analysis, as it is in the majority of similar ones.
For that reason, this research discusses the problems of handling missing values in CKD data, offers a new method for handling missing values, and compares solutions using the UCI dataset. For predictions based on clinical data related to CKD, this work emphasizes the need for statistical analysis as well as domain knowledge of the attributes. The present method of diagnosis relies on urine analysis and the determination of serum creatinine concentration. A variety of clinical interventions, including screening and ultrasonography, are employed to achieve this goal. Those screened include people with hypertension, a history of cardiovascular disease, a relevant medical history, and relatives who have experienced renal illness. The estimated GFR is computed from the blood creatinine level, and the urine albumin-to-creatinine ratio (ACR) is determined from a first morning urine test. This study focuses on machine learning techniques like ACO and SVM to improve prediction accuracy by reducing features and choosing the right attributes.

Chapter 2. Literature Review

A. J. Hussain and his colleagues were able to predict CKD in its early stages with an accuracy of 0.995 using a multilayer perceptron, after preprocessing the data set with neural networks to fill in the missing values. Outliers were eliminated, statistical analysis was performed to identify the top seven attributes, and principal component analysis (PCA) was used to omit the attributes with the strongest inter-correlation. The accuracy of the trained models in that study is significantly affected by the missing value filling approach. However, the accuracy of missing value prediction was somewhat diminished, since only 260 complete records were available to the neural network for 20 attributes.
Eliminating attributes with more than 20% of their values missing has had a significant impact on the accuracy of replacing such values. The selection of attributes for the training model from each category has been facilitated by the classification of attributes by source, such as blood tests or urine tests. (A. J. Aljaaf, 2018, pp. 1–9.)

For the five stages of CKD, a method for predicting the stage was proposed, with a best per-stage accuracy of 0.997 and an average accuracy of 0.967. This method also computes the eGFR using the aforesaid data set with additional gender and race variables, while eliminating cases with missing values. (C. A. Johnson, vol. 70, no. 5, pp. 869–876, 2004)

Replacing missing data with constants results in substantially lower model accuracy. In our study, however, the missing values are handled according to their randomness as assessed by Little's MCAR test (see the Methodology section). Furthermore, when the features are examined, the importance is skewed toward serum creatinine. However, in the early stages of CKD, serum creatinine can remain at normal levels, and the combined importance of all other parameters may not exceed that of serum creatinine, which makes such a model unhelpful for early disease prediction. Because domain knowledge is not included, the trained models' accuracy in predicting new cases outside the data set is questionable. (S. Nair, vol. 37, no. 2, pp. 483–487, Feb. 2014)

In 2017, a team of researchers used a multiclass decision forest to predict CKD with 0.991 accuracy using 14 features. A neural network and a logistic regression model were also trained, and these models produced average accuracies of 0.975 and 0.960, respectively. The researchers excluded the cases with missing values. The correlations between the chosen attributes range from 0.2 to 0.8. Clinically, CKD can cause hypertension, and hypertension can cause CKD, and specific gravity has a 0.73 correlation with the class. It is possible that eliminating these attributes will reduce accuracy. (P. Yildirim, vol. 02, pp. 193–198, 2017)

In 2015, Lambodar J. and Narendra Ku. K. experimented with eight machine learning models using the WEKA data mining tool. With ROCs of 1 and accuracies of 0.950, 0.9975, and 0.99, respectively, the Naive Bayes, Multilayer Perceptron, and J48 algorithms exhibited the best receiver operating characteristic (ROC) and accuracy. In the study, the agreement strength was determined using the Kappa statistic, with the multilayer perceptron algorithm receiving the highest score of 0.9947 and the decision table and J48 algorithms receiving the lowest score of 0.9786. (D. Dua, 2017.)

In light of previous research based on the UCI CKD data set, it was found that many of the lower accuracy cases are due to inadequate handling of missing values and to the attribute selection mechanism.

Deep learning algorithms have become well-known methods for characterizing patients, modeling the progression of disease (Choi et al. 2016a; Ma et al. 2017; Choi et al. 2016c; Lipton et al. 2016), and creating synthetic EHR data for research purposes (Choi et al. 2017b). Predicting disease outcomes is the most popular application of disease progression modeling.
When learning disease trajectories from scratch, deep neural networks have relatively limited capability; therefore, it is occasionally necessary to incorporate existing clinical knowledge (Ma et al. 2018; Pham et al. 2017) or to complement EHR data with the naturally hierarchical structure of medical ontologies (Choi et al. 2017a). Modeling EHR data includes the imputation of missing values. RNNs can take advantage of the long-term dependency in time series to improve prediction performance, as demonstrated by Che et al. (2018). Deep learning models need a lot of data to provide high-quality results, typically more than most healthcare institutions can supply. Combining EHR data from many sources is a straightforward option, but data harmonization is a time-consuming procedure. So that deep learning models can work without site-specific data harmonization, Rajkomar et al. (2018) proposed a representation of EHRs based on the Fast Healthcare Interoperability Resources (FHIR) format. Data augmentation approaches can help training when working with an unbalanced dataset and insufficient high-quality samples. In order to generate candidate positive and negative samples for the identification of rare diseases, CONAN (Cui et al. 2020) incorporates generative adversarial networks (GANs). Pre-training and transfer learning can also help to address this issue (Bengio 2012; Dauphin et al. 2012). G-BERT (Shang et al. 2019) employed pretrained hospital visit representations for downstream prediction tasks. Rios and Kavuluru (2019) trained a CNN on a sizable general biomedical abstract corpus and used the knowledge acquired to predict diagnosis codes for one medical facility.

Algorithms for representation learning in the healthcare sector usually borrow from natural language processing (NLP). The commonly accepted idea is to apply Word2Vec methods (Mikolov et al. 2013a) to learn embeddings after discrete clinical concepts (such as clinical codes) are encoded as one-hot vectors (Bengio, Courville, and Vincent 2013). For instance, Med2Vec (Choi et al. 2016b) learned from intra-visit medical code co-occurrences as well as inter-visit sequential information using skip-gram (Mikolov et al. 2013b). The standard skip-gram model is predicated on the idea that words have different roles at different positions within a sentence. Due to the unordered nature of clinical codes, this assumption is invalid. The temporal order in which clinical concepts occur is often overlooked when NLP algorithms are used to represent them. Instead, we apply the algorithms to the feature dimension rather than the temporal dimension. To learn a multilevel embedding of EHR data, MiME (Choi et al. 2018) took advantage of the inherent structure of clinical codes, but this model required the EHR data to include complete structure information between diagnoses and treatments. GCT (Choi et al. 2020) has recently been suggested as a solution to this issue. GCT demonstrated that the Transformer is a suitable model for exploring such structure during training, taking the graphical structure of EHR data into account. GCT's initial usage of the Transformer to encode clinic visits served as the inspiration for our work.

Ge et al. (2019) predicted Parkinson's disease severity using a deep neural network with UCI's Parkinson's telemonitoring voice dataset of patients. The study comprised biomedical voice measurements of 42 patients with Parkinson's disease (PD). Severity prediction based on the total Unified Parkinson's Disease Rating Scale (UPDRS) achieved accuracy scores of 94.4422% and 62.7335% for the training and test datasets, respectively, and PD severity prediction based on the complete UPDRS achieved accuracy scores of 83.367% and 81.6657% for the training and test datasets, respectively.
Ayon and Islam (2019) proposed a method for the diagnosis of diabetes using a DNN on the Pima Indian Diabetes (PID) dataset from the UCI machine learning repository, with an accuracy of 98.35%, an F1 score of 98%, and an MCC of 97% for five-fold cross-validation. Additionally, an accuracy of 97.11%, a sensitivity of 96.25%, and a specificity of 98.80% were obtained for ten-fold cross-validation, indicating that five-fold cross-validation showed better performance.

Shafi et al. (2020) proposed a machine learning based solution to prevent cleft in the mother's womb using a deep learning method and 4 other methods, on 1000 samples of pregnant women from three different hospitals in Lahore, Punjab. The authors carried out data cleaning, scaling, and feature selection and compared the accuracy of all the algorithms: Random Forest (RF) 85.77%, Decision Tree (DT) 88.14%, K-Nearest Neighbor (KNN) 89.72%, Support Vector Machine (SVM) 90.69%, and Multilayer Perceptron (MLP), which is a deep neural network, 92.6%. They indicated that MLP yields a higher accuracy.

Sharma and Parmar (2020) proposed a model for heart disease prediction with a DNN on the heart disease UCI dataset, with six (6) classifiers including KNN, SVM, NB, RF, and DNN, using Talos optimization. Their work reported accuracies of 90.16% for KNN, 82.5% for Logistic Regression, 81.97% for SVM, 85.25% for NB, and 90.78% for DNN with Talos optimization.
Ahmed and Alshebly (2019) applied different machine learning algorithms, namely an artificial neural network (ANN) and logistic regression (LR), to a problem in the domain of clinical diagnosis and analyzed their prediction performance on 153 cases and eleven attributes of CKD patients. The observed performance of the ANN classifier was higher than that of the LR model, with an accuracy of 84.44%, a sensitivity of 84.21%, a specificity of 84.61%, and an Area Under the Curve (AUC) of 84.41%. They determined that the most important factors with a clear impact on chronic kidney disease patients are creatinine and urea.

Chapter 3. Research Methodology

Three crucial components make up the proposed methodology: data preparation, model training, and model selection (Fig. 1).

Fig. 1. Proposed workflow

A. Data preprocessing: Missing Value Handling
Data preprocessing in this work was done in two parts. The first step was to filter out the attributes for which more than 20% of the records lacked values (see Table I). As a result, the study does not include the attributes red blood cells, sodium, potassium, white blood cell count, and red blood cell count. The missing values in the remaining data were addressed in the second stage of data preparation.
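The first preprocessing step described above can be sketched as follows. This is a minimal illustration, assuming records are held as dictionaries with `None` marking a missing value; the function names, the `Record` alias, and the toy records are hypothetical and are not taken from the thesis code or from the UCI data set.

```python
from typing import Dict, List, Optional, Tuple

# One patient record: attribute name -> value, with None marking a missing value.
Record = Dict[str, Optional[float]]


def missing_fraction(records: List[Record], attribute: str) -> float:
    """Return the fraction of records in which `attribute` is missing."""
    missing = sum(1 for record in records if record.get(attribute) is None)
    return missing / len(records)


def drop_sparse_attributes(
    records: List[Record], threshold: float = 0.20
) -> Tuple[List[Record], List[str]]:
    """Keep only attributes whose missing fraction does not exceed `threshold`."""
    attributes = sorted({name for record in records for name in record})
    kept = [a for a in attributes if missing_fraction(records, a) <= threshold]
    filtered = [{a: record.get(a) for a in kept} for record in records]
    return filtered, kept


# Hypothetical toy records, for illustration only.
toy_records: List[Record] = [
    {"age": 48.0, "sodium": None, "hemoglobin": 15.4},
    {"age": 7.0, "sodium": None, "hemoglobin": 11.3},
    {"age": None, "sodium": 137.0, "hemoglobin": 9.6},
    {"age": 51.0, "sodium": 141.0, "hemoglobin": 11.2},
    {"age": 60.0, "sodium": 142.0, "hemoglobin": 12.4},
]

filtered_records, kept_attributes = drop_sparse_attributes(toy_records)
print(kept_attributes)
```

In the toy data, `sodium` is missing in 2 of 5 records (40%) and is dropped, while `age` is missing in exactly 20% of records and is kept, because the rule removes only attributes with more than 20% missing.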
TABLE I
TESTS FOR MEASURING MULTIPLE ATTRIBUTES AND MISSING VALUE PERCENTAGE

attribute                  missing percentage   test to obtain
class                      0.01 %
appetite                   0.27 %               doctor inspection
pedal edema                0.24 %               doctor inspection
anemia                     0.26 %               fbc
hypertension               0.56 %               doctor inspection
diabetes mellitus          0.54 %               fbc
coronary artery disease    0.52 %               doctor inspection
pus cell clumps            0.99 %               ufr
bacteria                   0.99 %               ufr
age                        2.35 %               doctor inspection
blood pressure             3.03 %               doctor inspection
serum creatinine           4.35 %               serum creatinine
blood urea                 4.76 %               blood urea
blood glucose random       11.10 %              rbs
albumin                    11.60 %              ufr
specific gravity           11.85 %              ufr
sugar                      12.35 %              ufr
hemoglobin                 13.10 %              fbc
pus cell                   16.35 %              ufr
packed cell volume         18.60 %              fbc
sodium                     22.85 %              serum electrolytes
potassium                  21.01 %              serum electrolytes
white blood cell count     27.28 %              fbc
red blood cell count       37.56 %              fbc
red blood cells            38.50 %              ufr

TABLE II
LITTLE'S MCAR TEST RESULT

name                 value
Chi. square          3170.482
value                1
degree of freedom    2171
P
missing patterns     108

To achieve realistic accuracy, missing values must be treated based on their distributions in the preprocessing stage. In this study, Little's MCAR test was run to check the randomness of the missing values. The mechanism causing the data to be missing determines the possible bias resulting from missing data and the analytical techniques used to fill in the gaps. (J. C. Jakobsen, vol. 17, no. 1, p. 162, 2017) Little's MCAR test is a chi-square test for multivariate quantitative data. It determines whether or not there is a significant difference between the means of the various missing-value patterns. Little's MCAR test results, which are shown in Table II, led this study to conclude that the missing values were random, because the 'p' value equals zero. In light of the fact that there are more positive CKD cases than negative CKD cases, substituting missing values with a constant may also reduce accuracy and bias the prediction. Some related works illustrate this scenario. (W. Gunarathne, 2017, pp. 291–296.), (S. Vijayarani, S. Dhayanand et al., vol. 4, no. 4, pp. 13–25, 2015). Considering the drawback of (A. J. Aljaaf, D. Al-Jumeily, 2018, pp. 1–9.) indicated in the related work, the K-Nearest Neighbor Imputer technique was employed in this study to fill in the missing values. By setting the number of estimators (an algorithm hyperparameter) equal to the number of full instances, which resulted in the lowest mean and lowest standard deviation change, the method was able to maintain the dataset's original distribution.

B. Data preprocessing: Feature Selection
The absolute values in the heat map of the correlations of the attributes with the class label (Fig. 2) reveal that the strongest correlations (more than 0.5) are with hemoglobin, specific gravity, albumin, hypertension, and diabetes mellitus.
The secondary attributes, with correlations greater than 0.3, are pus cell, blood glucose random, appetite, blood urea, pedal edema, sugar, anemia, and serum creatinine.

Fig. 2. Heat map of attributes' relationships with the class variable.

After taking into account the distribution of attribute values and the clinical perspective of the attributes, specific gravity, albumin, hemoglobin, hypertension, diabetes mellitus, blood glucose random, and serum creatinine were selected as the best subset of attributes to predict CKD. Below is a detailed explanation of how the chosen attributes were determined.

Specific gravity and albumin each take just 5 distinct values (Fig. 3). Plotted against one another, their values form a distinct cluster of CKD-negative cases.

Fig. 3.
Albumin over specific gravity distribution

The amount of albumin is estimated from a test for protein in the urine. Excess protein in the urine is a sign that the kidney's filtering mechanisms may have been damaged by disease, although a fever or strenuous activity can also produce it, so several tests over a few weeks are required to confirm the condition.

Hemoglobin levels can drop for three reasons: decreased production of red blood cells, increased destruction of red blood cells, and blood loss. Erythropoietin (EPO), a hormone, is produced by healthy kidneys. (“Facts About Chronic Kidney Disease,” May 2020). A hormone is a substance that the body produces and releases into the bloodstream to help trigger or modify specific bodily functions. EPO prompts the bone marrow to produce red blood cells, and these cells subsequently transport oxygen throughout the body. When the kidneys are diseased or injured, they are unable to produce enough EPO. As a result, the bone marrow produces fewer red blood cells, which causes anemia. However, before anemia develops (which occurs once less than 50% of kidney function remains), the levels of hemoglobin vary slightly.

TABLE III
PERCENTAGE CHANGE OF STATISTICS OF ATTRIBUTES AFTER FILLING MISSING VALUES

Attribute              Count   Mean    Std     Min     25%     50%     75%     Max
Hb                     13.00   0.66    -6.32   0.000   5.39    2.072   -2.59   0.000
Specific Gravity       11.75   0.01    -6.15   0.000   0.47    -0.12   0.000   0.000
Albumin                11.50   -1.39   -5.95   0.000   0.000   100     0.000   0.000
Hypertension           0.50    -.034   -0.19   0.000   0.000   0.000   0.000   0.000
DM                     0.50    -0.36   -0.19   0.000   0.000   0.000   0.000   0.000
Pus Cell               16.25   -2.22   -8.94   0.000   0.000   0.000   100     0.000
Blood Glucose Random   11.00   -0.14   -5.86   0.000   1.98    3.87    -2.46   0.000
Appetite               0.25    0.04    -0.12   0.000   0.000   0.000   0.000   0.000
BU                     4.75    -0.22   -2.43   0.000   0.000   4.55    -3.01   0.04
Pedal Edema            0.25    -0.10   -0.12   0.000   0.000   0.000   0.000   0.000
Sugar                  12.25   -4.53   -6.30   0.000   0.000   0.000   1000    0.000
Anemia                 0.25    -0.15   -0.12   0.000   0.000   0.000   0.000   0.000
SC                     4.25    -0.43   -2.16   0.000   0.000   7.15    0.85    0.000

Furthermore, the hemoglobin vs serum creatinine plot also suggests a separation of the two classes, negative and positive (Fig. 4) (“Kidney Disease: The Basics,” Aug. 2014).

Fig. 4. Hemoglobin over serum creatinine distribution.

Blood creatinine, creat, and creatinine are other names for serum creatinine. The breakdown of a substance called creatine, which takes place in muscle, results in the waste product creatinine. The kidneys are responsible for eliminating creatinine from the body. This test measures the level of creatinine in the blood. Creatine is part of the cycle that creates the energy needed to contract muscles. The body produces creatine and creatinine at a highly regular rate. In addition to existing kidney problems, a high-protein diet, congestive heart failure, diabetic complications, and dehydration can all increase the level of creatinine in the blood. Creatinine typically ranges between 0.6 and 1.1 mg/dL in women and between 0.7 and 1.3 mg/dL in men.

Up to two-thirds of cases of chronic kidney disease can be attributed to two key causes: diabetes and high blood pressure (see Fig. 5). Numerous organs, including the kidneys, heart, blood vessels, nerves, and eyes, are damaged by diabetes. High blood pressure, or hypertension, results when the pressure of blood against blood vessel walls becomes too great. If it is uncontrolled or poorly controlled, high blood pressure can be a primary cause of heart attacks, strokes, and CKD. Conversely, CKD can also result in high blood pressure.

Fig. 5. Diabetes mellitus over hypertension distribution.

In addition to the factors noted above, the obtainability and feasibility of the tests (Table I) were also considered in attribute selection.
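The correlation-based grouping of attributes described in this section can be illustrated with a short sketch. It assumes the Pearson correlation coefficient and a numerically encoded class label (1 for CKD, 0 for not CKD); the thesis does not state which correlation measure underlies its heat map, so that choice, the function names, and the toy values below are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Tuple


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


def group_by_class_correlation(
    columns: Dict[str, List[float]],
    labels: List[float],
    primary: float = 0.5,
    secondary: float = 0.3,
) -> Tuple[List[str], List[str]]:
    """Split attributes into primary (|r| > 0.5) and secondary (0.3 < |r| <= 0.5) groups."""
    scores = {name: abs(pearson(values, labels)) for name, values in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    primary_group = [name for name in ranked if scores[name] > primary]
    secondary_group = [name for name in ranked if secondary < scores[name] <= primary]
    return primary_group, secondary_group


# Hypothetical toy columns, not values from the UCI data set.
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_columns = {
    "hemoglobin": [9, 10, 8, 11, 15, 14, 16, 15],
    "sugar": [2, 3, 2, 1, 1, 2, 1, 2],
    "noise": [1, 2, 1, 2, 1, 2, 1, 2],
}

primary_attrs, secondary_attrs = group_by_class_correlation(toy_columns, toy_labels)
print(primary_attrs, secondary_attrs)
```

Using the absolute value keeps strongly negative relationships, such as low hemoglobin in CKD cases, in the primary group. As the text notes, correlation alone did not decide the final subset; value distributions, clinical meaning, and test feasibility were weighed as well.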
The distribution of appetite values against the classification shows a bias toward good appetite (Fig. 6). However, CKD is not the only cause of poor appetite, which can mislead the predictions when the trained model is applied to a new case.
Fig. 6. The distribution of appetite
C. Model Training:
In this study, training was done using 9 classification models: decision tree classifier, random forest classifier, Cat Boost, gradient boosting classifier, stochastic gradient boosting, XGB classifier, extra trees classifier, Ada Boost classifier, and K-Nearest Neighbors (KNN). The dataset was split into two parts: 70% of the data were used for training, and 30% were used for testing. The models were further improved using grid search on the training dataset and hyperparameter tuning via a genetic algorithm.
Of the 9 algorithms noted, 5 outperformed the rest in training accuracy, testing accuracy, and cross-validation accuracy: the decision tree classifier, random forest classifier, XGB classifier, extra trees classifier, and Ada Boost classifier. The implementations and comparisons were carried out in Python using the scikit-learn and Keras frameworks.
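The split-and-tune procedure described above can be sketched without any third-party dependency. This is a minimal illustration, not the study's scikit-learn pipeline: the data are synthetic stand-ins for hemoglobin and serum creatinine, a small KNN classifier stands in for the nine models, the function names are hypothetical, and only the grid-search step is shown (the genetic-algorithm tuning is omitted).

```python
import itertools
import random
from collections import Counter

# Purely synthetic stand-in data (an assumption for illustration only):
# [hemoglobin, serum creatinine] for 400 hypothetical patients.
rng = random.Random(42)
features = ([[rng.gauss(15.0, 1.0), rng.gauss(0.9, 0.2)] for _ in range(200)]      # label 0: not CKD
            + [[rng.gauss(10.5, 1.5), rng.gauss(3.0, 1.0)] for _ in range(200)])   # label 1: CKD
labels = [0] * 200 + [1] * 200


def split_70_30(xs, ys, seed=0):
    """Shuffle the sample indices and hold out 30% of them for testing."""
    order = list(range(len(xs)))
    random.Random(seed).shuffle(order)
    cut = round(0.7 * len(order))
    train, test = order[:cut], order[cut:]
    return ([xs[i] for i in train], [ys[i] for i in train],
            [xs[i] for i in test], [ys[i] for i in test])


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training samples closest to `query`."""
    def sq_dist(a):
        return sum((u - v) ** 2 for u, v in zip(a, query))
    nearest = sorted(range(len(train_x)), key=lambda i: sq_dist(train_x[i]))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def accuracy(train_x, train_y, eval_x, eval_y, k):
    """Fraction of evaluation samples whose predicted label is correct."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y for x, y in zip(eval_x, eval_y))
    return hits / len(eval_y)


def cross_val_accuracy(xs, ys, k, folds=5):
    """Mean accuracy over `folds` interleaved validation folds of the training data."""
    scores = []
    for f in range(folds):
        held_out = set(range(f, len(xs), folds))
        fit_x = [x for i, x in enumerate(xs) if i not in held_out]
        fit_y = [y for i, y in enumerate(ys) if i not in held_out]
        val_x = [xs[i] for i in sorted(held_out)]
        val_y = [ys[i] for i in sorted(held_out)]
        scores.append(accuracy(fit_x, fit_y, val_x, val_y, k))
    return sum(scores) / folds


def grid_search(xs, ys, grid):
    """Try every combination in `grid`; keep the best cross-validation accuracy."""
    best_params, best_score = None, -1.0
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid, combo))
        score = cross_val_accuracy(xs, ys, **params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


train_x, train_y, test_x, test_y = split_70_30(features, labels)
param_grid = {"k": [1, 3, 5, 7]}  # hypothetical search space
best_params, cv_score = grid_search(train_x, train_y, param_grid)
test_score = accuracy(train_x, train_y, test_x, test_y, **best_params)
print(f"train={len(train_x)} test={len(test_x)} best={best_params} "
      f"cv={cv_score:.3f} test={test_score:.3f}")
```

Tuning only on the training portion keeps the 30% test set untouched until the final accuracy is measured, which is why the grid search scores candidates by cross-validation rather than by test accuracy.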
TABLE IV
EACH ALGORITHM ACCURACIES
Algorithm                      Training accuracy   Testing accuracy
Decision Tree Classifier       99.92%              93.23%
Random Forest Classifier       100.0%              96.53%
XG Boost                       100.0%              95.26%
Extra Trees Classifier         100.0%              95.46%
Ada Boost Classifier           100.0%              92.93%
KNN                            78.68%              61.88%
Gradient Boosting Classifier   100.0%              94.64%
Stochastic Gradient Boosting   100.0%              97.69%
Cat Boost                      100.0%              96.51%

TABLE V
PRECISION, RECALL AND F1-SCORE OF EACH ALGORITHM
Algorithm                      Precision   Recall   F1-Score
Decision Tree Classifier       0.95        0.95     0.97
Random Forest Classifier       0.94        1.0      0.94
XG Boost                       0.96        0.96     0.95
Extra Trees Classifier         0.94        1.0      0.92
Ada Boost Classifier           0.92        0.91     0.93
KNN                            0.66        0.65     0.62
Gradient Boosting Classifier   0.98        1.0      0.94
Stochastic Gradient Boosting   0.97        1.0      0.96
Cat Boost                      0.93        0.94     0.92

D. Model Evaluation and Selection
The methods with the highest accuracy in all 3 data sets were chosen based on the results (Table IV, V). These include the XGB classifier, extra trees classifier, decision tree classifier, random forest classifier, and Ada Boost classifier.
Fig. 7. Feature importance in each trained model.

Chapter 4. Result and Discussion
Even though the cited models provided close to 100% training accuracy (Table IV), it is still important to recognize the features that have the most influence on each model before making a decision (Table VI). After determining the importance of the selected features for each prediction algorithm, the standard deviation of the feature importances of each algorithm was computed. This calculation directly shows how strongly each algorithm favors particular attributes (Table VI, Table VII, Fig. 7). According to the results (Table VI, Table VII, and Fig. 7), the extra trees classifier has the least bias toward individual features, followed by the random forest classifier. The most bias is present in the decision tree classifier, which has the largest standard deviation in Table VII.
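The bias measure described above can be sketched with Python's standard library. The snippet takes the rounded importances listed in Table VI and computes one standard deviation per classifier. Whether the study used the sample or the population formula is not stated, so the use of the sample formula here is an assumption, and because the inputs are rounded the values approximate rather than reproduce Table VII.

```python
from statistics import stdev

# Feature importances copied from Table VI, in the order: hemoglobin,
# specific gravity, serum creatinine, albumin, hypertension,
# diabetes mellitus, blood glucose random.
importances = {
    "Decision Tree Classifier": [0.583, 0.269, 0.032, 0.107, 0.006, 0.007, 0.027],
    "Random Forest Classifier": [0.245, 0.276, 0.165, 0.197, 0.054, 0.028, 0.047],
    "XGB Classifier":           [0.255, 0.137, 0.503, 0.086, 0.004, 0.007, 0.025],
    "Extra Tree Classifier":    [0.175, 0.244, 0.056, 0.153, 0.194, 0.133, 0.042],
    "Ada Boost Classifier":     [0.332, 0.328, 0.004, 0.146, 0.132, 0.084, 0.003],
}

# Sample standard deviation (an assumption; the study does not name the formula).
# A small value means importance is spread evenly across the features,
# a large value means the model leans on a few of them.
spread = {name: stdev(values) for name, values in importances.items()}

# Classifiers ordered from least to most biased toward particular features.
ranking = sorted(spread, key=spread.get)
for name in ranking:
    print(f"{name}: {spread[name]:.4f}")
```

The resulting order, from the extra trees classifier (most even) to the decision tree classifier (most concentrated), agrees with the ordering of Table VII.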
TABLE VI
FEATURE IMPORTANCE FOR EACH ALGORITHM
Attribute              Decision Tree   Random Forest   XGB          Extra Tree   Ada Boost
                       Classifier      Classifier      Classifier   Classifier   Classifier
Hemoglobin             0.583           0.245           0.255        0.175        0.332
Specific Gravity       0.269           0.276           0.137        0.244        0.328
Serum Creatinine       0.032           0.165           0.503        0.056        0.004
Albumin                0.107           0.197           0.086        0.153        0.146
Hypertension           0.006           0.054           0.004        0.194        0.132
Diabetes Mellitus      0.007           0.028           0.007        0.133        0.084
Blood Glucose Random   0.027           0.047           0.025        0.042        0.003

Although the data distribution covers the full range of CKD cases, common symptoms including appetite, anemia, and pedal edema are skewed more toward CKD. Although it is simple to make a successful prediction using this data set, it can also result in false positives when used in a general setting, as seen in the recall column of Table V. Furthermore, achieving a perfect accuracy was difficult without filling in the missing values using a collaborative imputer rather than a constant, because the values were clearly missing at random. Given the clinical significance of the features, some of them have a weaker correlation than others depending on the stage of the patient's disease.
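The difference between filling a gap with a constant and filling it from similar patients can be made concrete with a small sketch. This is an illustrative, dependency-free approximation of neighbour-based imputation, not the study's implementation: the rows are hypothetical [hemoglobin, serum creatinine] pairs, the function names are invented for this example, and the distance rule (Euclidean distance over the features both rows have, rescaled for the missing ones) is an assumption.

```python
import math
from statistics import mean


def nan_euclidean(a, b):
    """Euclidean distance over features present in both rows, rescaled for missing ones."""
    shared = [(u, v) for u, v in zip(a, b) if u is not None and v is not None]
    if not shared:
        return math.inf  # no feature in common: treat as infinitely far apart
    squared = sum((u - v) ** 2 for u, v in shared)
    return math.sqrt(squared * len(a) / len(shared))


def knn_impute(rows, k=2):
    """Fill each missing entry with the mean of that feature over the k nearest rows that have it."""
    filled = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors: every other row in which feature j is observed.
            donors = [(nan_euclidean(row, other), other[j])
                      for m, other in enumerate(rows)
                      if m != i and other[j] is not None]
            donors.sort(key=lambda d: d[0])
            filled[i][j] = mean(v for _, v in donors[:k])
    return filled


def constant_impute(rows, constant=0.0):
    """Baseline: replace every missing entry with the same constant."""
    return [[constant if v is None else v for v in row] for row in rows]


# Hypothetical patients: [hemoglobin (g/dL), serum creatinine (mg/dL)].
rows = [
    [15.2, 0.9],
    [14.8, 1.0],
    [9.6, 3.8],
    [10.1, 4.2],
    [None, 4.0],   # hemoglobin missing, creatinine resembles rows 2 and 3
    [15.0, None],  # creatinine missing, hemoglobin resembles rows 0 and 1
]

knn_filled = knn_impute(rows, k=2)
constant_filled = constant_impute(rows)
print("KNN:     ", knn_filled[4], knn_filled[5])
print("constant:", constant_filled[4], constant_filled[5])
```

The neighbour-based fill places each patient near the group it already resembles, while the constant fill inserts a value that no real patient in the sample has, which distorts the attribute statistics.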
TABLE VII
STANDARD DEVIATION OF FEATURE IMPORTANCE OF ALGORITHMS
Algorithm                  Standard Deviation
Extra Tree Classifier      0.070247746
Random Forest Classifier   0.101241419
Ada Boost Classifier       0.1362923604
XGB Classifier             0.1853412263
Decision Tree Classifier   0.21194746

The training procedure has a significant impact on the models' accuracy. After the models have been trained, it is evident that tree-based methods are more accurate than the other classification algorithms. This conclusion can be drawn from the distribution of the data set because, with the exception of serum creatinine, the classes are clearly separated in the chosen attributes. As for the reasons why the nominal values of these attributes change, there are many possible causes aside from CKD. Finally, when picking the method, certain trained models are biased toward particular features, as indicated in Table VI. This encourages decision makers to weigh more than one factor when making a choice, and the extra trees classifier has been chosen as a result.

Chapter 5.
Conclusion
Nearly 14% of the world's population is affected by chronic kidney disease (CKD), which can be fatal. By forecasting it with up to 100% accuracy, people can learn about it early on and receive treatment with the least expense and danger. Proper feature engineering reduces the number of features required by the prediction algorithm, effectively lowering the number of clinical tests that must be taken. Using the K-Nearest Neighbors imputer (KNN imputer) to fill in missing values based on their distribution and their relation to other attributes, rather than simply replacing them with a constant, improves prediction accuracy compared to related work done with the same dataset. Furthermore, the extra trees classifier and the random forest classifier are the superior algorithms for predicting CKD because they consistently achieve 100% training accuracy (Table IV) and exhibit little bias toward particular features in comparison to the other models. To predict whether CKD status will be positive or negative, this work proposes a new approach that includes data preprocessing, handling of missing values, and feature selection. This work also emphasizes how important it is to consider domain knowledge when choosing features for analyzing clinical records related to CKD. It will therefore be beneficial in the future to study how a KNN-imputer-based strategy copes with missing values in data sets related to other diseases. Additionally, by incorporating knowledge about genetics, water consumption patterns, and food types into the research, deeper insights into CKD can be gained.
References
National Kidney Foundation. 2022. Kidney Disease: The Basics. [online] Available at: <https://www.kidney.org/news/newsroom/fsindex> [Accessed 15 August 2022].
National Kidney Foundation. 2022. Kidney Disease. [online] Available at: <https://www.kidney.org/kidneydisease/global-facts-aboutkidneydisease/> [Accessed 15 August 2022].
National Kidney Foundation. 2022. Estimated Glomerular Filtration Rate (eGFR). [online] Available at: <https://www.kidney.org/atoz/content/gfr> [Accessed 15 August 2022].
Murtagh, F., Addington-Hall, J., Edmonds, P., Donohoe, P., Carey, I., Jenkins, K. and Higginson, I., 2010. Symptoms in the Month Before Death for Stage 5 Chronic Kidney Disease Patients Managed Without Dialysis. Journal of Pain and Symptom Management, 40(3), pp.342-352.
Xiao, J., Ding, R., Xu, X., Guan, H., Feng, X., Sun, T., Zhu, S. and Ye, Z., 2019. Comparison and development of machine learning tools in the prediction of chronic kidney disease progression. Journal of Translational Medicine, 17(1).
Archive.ics.uci.edu. 2022. UCI Machine Learning Repository. [online] Available at: <https://archive.ics.uci.edu/ml/index.php> [Accessed 15 August 2022].
Abdullah, A., Hafidz, S. and Khairunizam, W., 2020. Performance Comparison of Machine Learning Algorithms for Classification of Chronic Kidney Disease (CKD). Journal of Physics: Conference Series, 1529(5), p.052077.
International Journal of Recent Technology and Engineering, 2019. Predictive Analytics of Chronic Kidney Disease using Machine Learning Algorithm. 8(2), pp.940-947.
Aljaaf, A., Al-Jumeily, D., Haglan, H., Alloghani, M., Baker, T., Hussain, A. and Mustafina, J., 2018. Early Prediction of Chronic Kidney Disease Using Machine Learning Supported by Predictive Analytics. 2018 IEEE Congress on Evolutionary Computation (CEC).
Rady, E. and Anwar, A., 2019. Prediction of kidney disease stages using data mining algorithms. Informatics in Medicine Unlocked, 15, p.100178.
Johnson, C. A., Levey, A. S., Coresh, J., Levin, A., Lau, J. and Eknoyan, G., 2004. Clinical practice guidelines for chronic kidney disease in adults: Part I. Definition, disease stages, evaluation, treatment, and risk factors. American Family Physician, 70(5), pp.869-876.
Li, C., 2013. Little's Test of Missing Completely at Random. The Stata Journal: Promoting communications on statistics and Stata, 13(4), pp.795-809.
Nair, S., O’Brien, S., Hayden, K., Pandya, B., Lisboa, P., Hardy, K. and Wilding, J., 2014. Effect of a Cooked Meat Meal on Serum Creatinine and Estimated Glomerular Filtration Rate in Diabetes-Related Kidney Disease. Diabetes Care, 37(2), pp.483-487.
Yildirim, P., 2017. Chronic Kidney Disease Prediction on Imbalanced Data by Multilayer Perceptron: Chronic Kidney Disease Prediction. 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC).
Jakobsen, J., Gluud, C., Wetterslev, J. and Winkel, P., 2017. When and how should multiple imputation be used for handling missing data in randomised clinical trials – a practical guide with flowcharts. BMC Medical Research Methodology, 17(1).
S, V. and S, D., 2015. Data Mining Classification Algorithms for Kidney Disease Prediction. International Journal on Cybernetics & Informatics, 4(4), pp.13-25.
National Kidney Foundation. 2022. Facts About Chronic Kidney Disease. [online] Available at: <https://www.kidney.org/atoz/content/about-chronic-kidneydisease> [Accessed 15 August 2022].