PCOS Detect using Machine Learning Algorithms


International Research Journal of Engineering and Technology (IRJET), e-ISSN: 2395-0056, p-ISSN: 2395-0072
Volume: 09 Issue: 01 | Jan 2022 | www.irjet.net
© 2022, IRJET | Impact Factor value: 7.529 | ISO 9001:2008 Certified Journal | Page 1214

Kinjal Raut (1), Chaitrali Katkar (2), Prof. Dr. Mrs. Suhasini A. Itkar (3)
(1) Final Year Computer Engineering Student, PES Modern College of Engineering, Pune
(2) Final Year Computer Engineering Student, PES Modern College of Engineering, Pune
(3) Professor, Dept. of Computer Engineering, PES Modern College of Engineering, Pune, India

Abstract - The most common symptoms of polycystic ovary syndrome (PCOS) include missed, irregular or very light periods; ovaries that become large or contain many cysts; excess body hair (hirsutism), including on the chest and stomach; weight gain, especially around the abdomen; and acne or oily skin. The exact pathophysiology of PCOS is not yet known. This heterogeneous disorder is characterized mainly by the ovaries. PCOS is a multifactorial and polygenic condition. Machine learning is capable of "learning" features from very large amounts of data gathered through clinical practice to diagnose this disorder. This paper puts forward a solution to this problem which helps in early detection and prediction of PCOS treatment from an optimal and minimal set of parameters which have been statistically analyzed. The solution is built using machine learning algorithms such as Random Forest, Decision Tree, Support Vector Classifier, Logistic Regression, K Nearest Neighbors, XGBRF and CatBoost Classifier.

Key Words: Machine Learning, Polycystic Ovary Syndrome, Random Forest, Decision Tree, Support Vector Classifier, K Nearest Neighbours, Logistic Regression.

1. INTRODUCTION

Polycystic ovary syndrome (PCOS), also known as polycystic ovarian syndrome, is a hormonal endocrine disorder among women of reproductive age. Over five million women worldwide in their reproductive age are suffering from PCOS.

Over 10 million of the young generation have been affected globally, with 1 in every 4 young women having PCOS. The disease is more common in the urban population than in the rural population because of the lifestyle. The increase in the number of women with PCOS is directly correlated with a sedentary lifestyle, lack of nutritional food, lack of exercise, weight gain and obesity.

Technology is changing every aspect of our lives and making remarkable transformations in the healthcare industry; nowadays technology and humans are working hand in hand. For example, robots performing surgeries once seemed a fiction, but now they are performing critical and complex surgeries in hospitals. Machine learning is a subclass of artificial intelligence. It helps the system learn, identify patterns in datasets, make logical decisions and perform analysis on digital information including words, numbers, images and clicks. Machine learning applications mainly include image recognition, data prediction, medical diagnosis, health care and clinical care, etc. In this world of technology many advancements are taking place for the detection of PCOS, and machine learning algorithms are one of them.

PCOS is one of the most common endocrine disorders and affects 1 in 10 women of childbearing age. The exact prevalence of PCOS is not known but is variable, ranging from 2.2% to 26% globally. It was first detailed in 1935 by Stein and Leventhal as a syndrome manifested by hirsutism and obesity associated with enlarged polycystic ovaries. Women of reproductive age (15-40) experience hormonal imbalance, hence PCOS can happen at any age after puberty. The hormones needed are progesterone, luteinizing hormone (LH), estrogen and follicle stimulating hormone (FSH). The common symptoms of PCOS are irregular menstrual cycle, too much hair, acne, weight gain, darkening of skin and skin tags. There is a high risk of first trimester miscarriage; inappropriate growth of follicles in the ovaries can be prevented by detecting PCOS at an early stage. Hence detection of PCOS at the primary stage is important. This paper focuses on prediction of PCOS. The main work includes:

1. Selection of the most important attributes from the dataset using a feature selection method.
2. Applying machine learning algorithms on the selected features.
3. Comparing the performed algorithms in order to check accuracy.

1.1 Literature Review

Table 1: Summary of Literature Review

| Authors | Objectives | Research Design | Results |
| --- | --- | --- | --- |
| Palak et al. [2012] | A method to automate PCOS screening based on clinical and metabolic markers. | Classification of features based on Bayesian and Logistic Regression. | Among the two compared models, the best model built is the Bayesian classifier with accuracy 93.93%. |
| Purnama et al. [2015] | Detection of follicles based on USG images by using the binary follicle images, feature extraction and segmentation. | Three classification scenarios were designed: Neural Network-LVQ, KNN-Euclidean distance, SVM-RBF kernel. | On C=40 the SVM RBF kernel achieved 82% accuracy and on K=5 KNN achieved 78% accuracy. |
| Denny et al. [2019] | To overcome the time and cost involved in various clinical tests and ovary scanning. | PCOS features transformed with PCA, used with machine learning algorithms like SVM, KNN, RF, etc. | The best and most accurate model for PCOS detection came out to be Random Forest with 0.89 accuracy. |
| Subrato et al. [2020] | Data driven diagnosis of PCOS using a dataset on the Kaggle repository. | Classifiers used are as follows: gradient boosting, random forest, logistic regression and RFLR. Holdout and cross validation methods are applied. | The best testing accuracy, 91.01%, is obtained by RFLR, with a recall value of 90%. |
| Ning Ning Xie et al. [2020] | To identify gene biomarkers and build a diagnostic model. | A computational method applied by combining two machine learning algorithms, ANN and Random Forest. | A novel diagnostic model developed with accuracy of AUC: 0.7273 in the microarray dataset and 0.6488 in the RNA-seq dataset. |
| Priyanka et al. [2020] | Classification of PCOS will use physical symptoms and sonograms, in which only the physical symptoms will be presented. | Used different algorithms like K star, IB1 instance based, locally weighted learning, Decision Table, M5 rules, ZeroR, Random Forest and Random Tree to classify and find the best model. | Among the different algorithms performed, K star outperformed. |
| Tanwani Namrata | A model is built using the causes and symptoms of PCOS as inputs, and the output is predicted as presence or absence of PCOS. | Machine learning supervised classification algorithms used are KNN and Logistic Regression. | The best accurate model built is Logistic Regression with accuracy 92%. |
| Madhumitha et al. [2021] | Ovary details are a large range of follicles, type of cysts and follicle size, using image segmentation. | Based on pre-processing and morphological operations, SVM, KNN and Logistic Regression were used. | All three algorithms were combined into a hybrid model, and 0.98 accuracy was achieved. |
| Pijush et al. [2021] | Detection and prevention of this disease as early as possible. | Used SMOTE and five other algorithms, such as Logistic Regression, Random Forest, Decision Tree, Support Vector Machine and KNN, together for early detection of PCOS. | The best model achieved accuracy, training time: 0.010 sec, precision: 98%, recall: 97.11, F1 score: 98%, and AUROC: 95.6%. |
| Khan Inan et al. [2021] | Conducting a probabilistic approach to select statistically relevant features which contribute to PCOS instances. | SMOTE, ENN and ANOVA Test, Chi Square Test were used to identify important features. Classifiers such as XG Boost, SVM, KNN, NB, MLP, RF, AdaB were used. | XG Boost outperformed all other classifiers with accuracy 0.96 and 0.98 recall. |

2. Methodology

Development of a machine learning model trained on the dataset is an important step for successful implementation. The dataset contains attributes such as I beta-HCG (mIU/mL), II beta-HCG (mIU/mL), AMH (ng/mL), Age (yrs) and Weight (Kg), among others.


For this machine learning implementation, the platform used is Jupyter Notebook and the language used is Python.

Further, the steps to be performed in Data Preprocessing are:

iii. Feature Selection

Random Forest is used for both classification and regression. It builds multiple decision trees and merges them together to get a more accurate and stable prediction.

Decision Tree
Decision Tree is a type of supervised learning algorithm, represented graphically, for getting all possible solutions to a problem based on given conditions. It uses the CART algorithm, which stands for Classification and Regression Tree algorithm.

Support Vector Classifier (SVC)
The aim of SVC is to fit the data provided, returning a best-fit hyperplane that divides or categorizes the data. The data points that are closer to the hyperplane influence its position and orientation.

Logistic Regression
Logistic Regression is a type of supervised learning technique used for solving classification problems.
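To make the CART idea concrete, the sketch below scores candidate splits of a single numeric feature by Gini impurity, the criterion commonly used by CART for classification. It is only an illustration: the function names (`gini`, `best_split`) and the toy BMI values and labels are assumptions and are not taken from the paper's dataset or code.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_split(values, labels):
    """Return (threshold, weighted_gini) of the best binary split x <= threshold."""
    best_threshold, best_score = None, float("inf")
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for low, high in zip(distinct, distinct[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x <= threshold]
        right = [y for x, y in zip(values, labels) if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy, made-up data: one numeric feature and a binary label (1 = PCOS, 0 = no PCOS).
bmi = [19.0, 21.5, 22.0, 27.5, 29.0, 31.0]
has_pcos = [0, 0, 0, 1, 1, 1]

threshold, score = best_split(bmi, has_pcos)
print(threshold, score)
```

A full decision tree repeats this search over every feature and then recurses on the two resulting subsets; a random forest trains many such trees on resampled data and combines their votes.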

ii. Data Labeling

3. Selection of Implementation Platform

In data preparation, when appropriate data is identified, the data should be shaped in order to train the model. The data obtained is in CSV format and is read in Python. Visualize the data and check the correlations between different characteristics. In this step, activities such as checking for missing values or incomplete records, aggregation, augmentation, normalization and labelling are performed on structured, unstructured and semi-structured data. This dataset contains records of women patients who are suffering from PCOS.
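As a rough illustration of checking correlations between characteristics, the sketch below computes the Pearson correlation coefficient for every pair of columns. The column names mirror attributes mentioned in the paper, but the values are made up for the example, and the helper `pearson` is a hypothetical name rather than the authors' code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would need special handling.
    return cov / math.sqrt(var_x * var_y)


# Toy, made-up columns (not values from the Kaggle dataset).
columns = {
    "Weight (Kg)": [52.0, 60.0, 68.0, 75.0, 81.0],
    "BMI": [20.1, 22.4, 25.0, 27.9, 30.2],
    "Pulse rate (bpm)": [78.0, 72.0, 74.0, 80.0, 73.0],
}

names = list(columns)
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
for pair, r in correlations.items():
    print(pair, round(r, 3))
```

When the data is held in a pandas DataFrame, `DataFrame.corr()` produces the same kind of pairwise correlation matrix in one call.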

Data labeling is a method of identifying raw data (e.g. videos, images, text files) and adding informative tags that provide context and increase the usefulness of the data for the machine learning model. The non-numerical values are transformed into numerical values.
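The transformation of non-numerical values into numerical ones could look like the following minimal sketch. The helper `encode_labels`, the example values, and the choice of a Y/N text encoding for the fast food column are assumptions made purely for illustration.

```python
def encode_labels(values):
    """Map each distinct non-numeric value to an integer code (sorted for stability)."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Toy, made-up column values; "Blood Group" and "Fast food" appear in the paper's attribute list.
blood_group = ["A+", "O+", "B+", "A+", "O+"]
fast_food = ["Y", "N", "Y", "Y", "N"]

# Multi-valued category: one integer code per distinct value.
blood_codes, blood_mapping = encode_labels(blood_group)
# Binary category: an explicit 1/0 mapping keeps the meaning obvious.
fast_food_codes = [1 if v == "Y" else 0 for v in fast_food]

print(blood_mapping)
print(blood_codes)
print(fast_food_codes)
```

Integer codes impose an artificial order on categories such as blood group, so one-hot encoding is a common alternative when a model is sensitive to that ordering.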

This is the crucial step of collecting data. How well the model performs, and the accuracy obtained, depend on the dataset. Data can be collected from various platforms such as Kaggle, UCI Repository, BuzzFeed News, etc. We performed our research on the dataset named Polycystic ovary syndrome (PCOS), obtained from Kaggle.

The first and most important step is to define the problem by specifying the inputs provided to the model and the expected output of the model.

Feature selection is an important step in which the most relevant features are extracted from the dataset, and machine learning algorithms are then applied for better performance of the model. Its goal is to find the best possible features for building the model while ignoring the irrelevant details. Feature selection can be performed with common techniques including Filter methods, Wrapper methods and Embedded methods.
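A filter method scores each feature independently of any classifier and keeps the best-scoring ones. The paper does not say which criterion was used, so the sketch below substitutes the Fisher score (squared difference of class means divided by the sum of class variances) as one plausible example; the function names and the data are illustrative assumptions.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Class separation of one feature for a binary label (higher = more relevant)."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        return float("inf") if mean(pos) != mean(neg) else 0.0
    return (mean(pos) - mean(neg)) ** 2 / spread


def select_top_k(features, labels, k):
    """Rank feature names by Fisher score and keep the k best."""
    ranked = sorted(
        features,
        key=lambda name: fisher_score(features[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Toy, made-up data: three candidate features and a binary label (1 = PCOS).
features = {
    "AMH (ng/mL)": [2.1, 2.5, 1.9, 6.8, 7.4, 6.1],
    "Age (yrs)": [28, 33, 30, 29, 31, 32],
    "Pulse rate (bpm)": [72, 78, 74, 73, 77, 75],
}
labels = [0, 0, 0, 1, 1, 1]

selected = select_top_k(features, labels, k=2)
print(selected)
```

Wrapper and embedded methods differ in that they involve a model: a wrapper repeatedly trains a classifier on candidate feature subsets, while an embedded method reads importance from the trained model itself.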

4. Data Preparation

i. Data Cleaning

Data cleaning is the process of identifying the incorrect, incomplete or missing parts of the data and then modifying, replacing or deleting them. In our paper the dataset was first checked for missing values using Pandas and NumPy.
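The missing-value check and one possible repair can be sketched as follows. This version uses only the Python standard library so that it runs anywhere; the CSV text, the column subset and the choice of median imputation are assumptions for illustration, since the paper does not state how missing entries were replaced.

```python
import csv
import io
from statistics import median

# Toy, made-up CSV text standing in for the dataset file; empty cells are missing values.
raw_csv = """Age (yrs),Weight (Kg),AMH (ng/mL)
28,52.0,2.1
33,,2.5
30,68.0,
29,75.0,6.8
"""

rows = list(csv.DictReader(io.StringIO(raw_csv)))


def count_missing(rows):
    """Count empty cells per column."""
    return {col: sum(1 for r in rows if r[col].strip() == "") for col in rows[0]}


def fill_with_median(rows, column):
    """Convert a column to float, replacing empty cells by the median of observed values."""
    observed = [float(r[column]) for r in rows if r[column].strip() != ""]
    fill_value = median(observed)
    for r in rows:
        r[column] = float(r[column]) if r[column].strip() != "" else fill_value


missing = count_missing(rows)
for column in missing:
    fill_with_median(rows, column)

print(missing)
print(rows)
```

With pandas, the equivalent check is typically `df.isnull().sum()`, and `df.fillna(...)` or `df.dropna()` performs the replacement or deletion.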

1. Defining the Problem

2. Data Collection

2.1 Modelling

When the data is completely cleaned and selected, it is ready to be processed by the algorithms. The algorithms used to create the model are Random Forest, Decision Tree, Support Vector Classifier, Logistic Regression, K Nearest Neighbors, XGBRF and CatBoost Classifier.

Random Forest
Random Forest is a kind of supervised machine learning algorithm.

Further dataset attributes include Height (Cm), BMI, Blood Group, Pulse rate (bpm), RR (breaths/min), Fast food, Reg Exercise, BP_Systolic (mmHg), etc.

Fig -1: Block diagram of the system

CatBoost
CatBoost is a boosting library used for regression and classification.

Logistic Regression is a machine learning algorithm used for predicting a categorical dependent variable from a given set of independent variables; its predicted output is limited to values between 0 and 1.

K Nearest Neighbor (KNN)
K Nearest Neighbor, also known as a lazy learner algorithm, is a supervised learning algorithm used for both classification and regression. Instead of learning from the dataset immediately, it first stores the dataset and then, at the time of classification, performs an action on it.
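The "lazy learner" behaviour of KNN can be seen in a minimal sketch such as the one below: fitting merely stores the data, and all distance computations happen when a prediction is requested. The class name, the value of k, and the two-feature toy points (loosely read as BMI and AMH) are illustrative assumptions, not the paper's implementation.

```python
import math
from collections import Counter


class KNearestNeighbors:
    """Minimal k-NN classifier: 'training' only stores the data (lazy learning)."""

    def __init__(self, k=3):
        self.k = k
        self.points = []
        self.labels = []

    def fit(self, points, labels):
        # No model is built here; the dataset is simply remembered.
        self.points = list(points)
        self.labels = list(labels)
        return self

    def predict_one(self, query):
        # All the work happens at classification time.
        distances = sorted(
            (math.dist(query, point), label)
            for point, label in zip(self.points, self.labels)
        )
        nearest_labels = [label for _, label in distances[: self.k]]
        return Counter(nearest_labels).most_common(1)[0][0]


# Toy, made-up points: (BMI, AMH) pairs with a binary label (1 = PCOS).
points = [(20.0, 2.0), (21.0, 2.4), (22.5, 1.8), (29.0, 6.5), (30.5, 7.0), (31.0, 6.2)]
labels = [0, 0, 0, 1, 1, 1]

model = KNearestNeighbors(k=3).fit(points, labels)
print(model.predict_one((30.0, 6.8)))
print(model.predict_one((21.0, 2.0)))
```

Because Euclidean distance is sensitive to feature scale, features are usually normalized before KNN is applied to real data.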

The technique used in CatBoost is to convert categorical values into numbers using different types of statistics on combinations of categorical features and on combinations of categorical and numerical features.

Cross Validation
Cross validation in machine learning is a technique for validating model efficiency, in which the model is trained using a subset of the dataset and, after training, is evaluated using the complementary subset of the dataset.

Fig -2: Correlation between Features

Results and Discussion

The experimentation is performed on the dataset using various machine learning algorithms. The main objective is to find the most suitable algorithm for the classification of the dataset created. The algorithms used to construct the model are Decision Tree, SVC, Random Forest, Logistic Regression, K Nearest Neighbor, XGBRF and CatBoost Classifier.

Table 2: Accuracy of Different Classifier Models

| Models | Accuracy (%) |
| --- | --- |
| Decision Tree | 82.79 |
| SVC | 69.05 |
| Random Forest | 89.42 |
| Logistic Regression | 83.32 |
| K Nearest Neighbors | 74.34 |
| XGBRF | 85.89 |
| CatBoost Classifier | 92.64 |
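The train-on-a-subset, evaluate-on-the-complement procedure described under Cross Validation can be sketched as k-fold cross validation, one common form of the technique. The number of folds, the helper names and the majority-class stand-in model are assumptions for illustration; the paper does not specify the scheme it used, and the accuracies in the table were not produced by this code.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


def cross_validate(fit_predict, xs, ys, k):
    """Average accuracy of fit_predict(train_x, train_y, test_x) over k folds."""
    scores = []
    for train, test in k_fold_indices(len(xs), k):
        predictions = fit_predict(
            [xs[i] for i in train], [ys[i] for i in train], [xs[i] for i in test]
        )
        correct = sum(p == ys[i] for p, i in zip(predictions, test))
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def majority_baseline(train_x, train_y, test_x):
    """Hypothetical stand-in model: always predicts the most frequent training label."""
    majority = max(set(train_y), key=train_y.count)
    return [majority for _ in test_x]


# Toy, made-up data: ten samples with a binary label.
xs = list(range(10))
ys = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

folds = list(k_fold_indices(len(xs), k=5))
mean_accuracy = cross_validate(majority_baseline, xs, ys, k=5)
print(folds[0], mean_accuracy)
```

Any classifier can be plugged in by wrapping it in a function with the same `fit_predict(train_x, train_y, test_x)` shape; in practice the data is usually shuffled, and often stratified by class, before the folds are formed.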

XGBRF
XGBoost with Random Forest (XGBRF) is an ensemble method used for classification of PCOS. XGBoost is a gradient boosting algorithm and Random Forest is an example of a bagging algorithm. XGBRF is a modified version of the XGBoost classifier. The advantage of XGBRF is that it is used to overcome the problem of over-fitting.

CatBoost Classifier
CatBoost, or Categorical Boosting, is an open source library for classification. It works with multiple categories of data, including audio, text and image, as well as historical data.
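The bagging idea behind Random Forest, which the XGBRF description refers to, can be illustrated with a deliberately small sketch: several weak learners are each trained on a bootstrap sample and their predictions are combined by majority vote. One-feature decision stumps stand in for full trees here, and all names, the number of estimators and the data are assumptions; this is not how the XGBoost library implements XGBRF.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Fit a one-feature stump that predicts 1 when x > threshold."""
    best_threshold, best_errors = None, len(xs) + 1
    for threshold in sorted(set(xs)):
        errors = sum((x > threshold) != (y == 1) for x, y in zip(xs, ys))
        if errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def fit_bagged_stumps(xs, ys, n_estimators=15, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    indices = range(len(xs))
    thresholds = []
    for _ in range(n_estimators):
        sample = [rng.choice(indices) for _ in indices]
        thresholds.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return thresholds


def predict(thresholds, x):
    """Majority vote over the individual stump predictions."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Toy, made-up data: one feature (read as AMH) and a binary label (1 = PCOS).
amh = [1.9, 2.1, 2.5, 6.1, 6.8, 7.4]
labels = [0, 0, 0, 1, 1, 1]

ensemble = fit_bagged_stumps(amh, labels)
print(predict(ensemble, 9.0), predict(ensemble, 1.0))
```

Boosting differs in that the learners are trained sequentially, each one concentrating on the errors left by its predecessors, rather than independently on resampled data.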

Fig -3: Accuracy of different Classifier Models

The accuracies obtained by the different algorithms are: Decision Tree 82.79%, SVC 69.05%, Random Forest 89.42%, Logistic Regression 83.32%, K Nearest Neighbors 74.34%, XGBRF 85.89% and CatBoost Classifier 92.64%. Therefore, from the above results, the conclusion is that CatBoost Classifier has outperformed the others and obtained the highest accuracy.

3. CONCLUSIONS

In this paper, a machine learning model is successfully built and trained for early detection of PCOS. PCOS is one of the very common conditions in women, associated with psychological, reproductive and metabolic features. Some day-to-day activities that decrease the effects of PCOS are maintaining a healthy weight, limiting carbohydrates, being active, exercising daily and eating healthy food. The system in this paper helps in early detection of PCOS from an optimal and minimal set of parameters which have been statistically analyzed. Among the various algorithms used, CatBoost Classifier is found superior in performance. This model can be used by doctors for early screening and diagnosing of patients who are likely to develop this disorder. Therefore, with the use of various machine learning techniques, we have built a model to detect PCOS at an early stage.

REFERENCES

[1] Palak Mehrotra, Jyotirmoy Chatterjee, Chandan Chakraborty, "Automated Screening of Polycystic Ovary Syndrome using Machine Learning Techniques", IEEE, 2012.

[2] Bedy Purnama, Untari Novia Wisesti, Adiwijaya, Fhira Nhita, Andini Gayatri, Titik Mutiah, "A Classification of Polycystic Ovary Syndrome Based on Follicle Detection of Ultrasound Images", 2015 3rd International Conference on Information and Communication Technology (ICoICT).

[3] Amsy Denny, Anita Raj, Ashi Ashok, Maneesh Ram C, Remya George, "i-HOPE: Detection And Prediction System For Polycystic Ovary Syndrome (PCOS) Using Machine Learning Techniques", 2019 IEEE Region 10 Conference (TENCON 2019).

[4] Subrato Bharati, Prajoy Podder, M. Rubaiyat Hossain Mondal, "Diagnosis of Polycystic Ovary Syndrome Using Machine Learning Algorithms", 2020 IEEE Region 10 Symposium (TENSYMP), 5-7 June 2020, Dhaka, Bangladesh.

[5] Ning-Ning Xie, Fang-Fang Wang, Jue Zhou, Chang Liu, Fan Qu, "Establishment and Analysis of a Combined Diagnostic Model of Polycystic Ovary Syndrome with Random Forest and Artificial Neural Network", Hindawi BioMed Research International, Volume 2020.

[6] Priyanka R. Lele, Anuradha D. Thakare, "Comparative Analysis of Classifiers for Polycystic Ovary Syndrome Detection using Various Statistical Measures", International Journal of Engineering Research & Technology (IJERT), ISSN: 2278-0181, Vol. 9 Issue 03, March 2020.

[7] Namrata Tanwani, "Detecting PCOS using Machine Learning", IJMTES | International Journal of Modern Trends in Engineering and Science, ISSN: 2348-3121, Volume: 07 Issue: 01 2020.

[8] J. Madhumitha, M. Kalaiyarasi, S. Sakthiya Ram, "Automated Polycystic Ovarian Syndrome Identification with Follicle Recognition", 2021 3rd International Conference on Signal Processing and Communication

[9] Pijush Dutta, Shobhandeb Paul, Madhurima Majumder, "An Efficient SMOTE Based Machine Learning classification for Prediction & Detection of PCOS", Research Square, November 8th, 2021.

[10] Muhammad Sakib Khan Inan, Rubaiath E Ulfath, Fahim Irfan Alam, Fateha Khanam Bappee, Rizwan Hasan, "Improved Sampling and Feature Selection to Support Extreme Gradient Boosting for PCOS Diagnosis".
