
International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
![]()

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
Mohamed Ameen S,
Lecturer, Department of Computer Science, Sri Venkateshwaraa Group of Institutions, Puducherry
Abstract - Heart disease remains one of the leading causes of mortality around the world. This study utilizes a public dataset to identify the key indicators that contribute to heart disease using statistical modeling methods. A detailed analysis involving descriptive statistics, histograms, and logistic regression modeling was conducted on a dataset comprising 1025 subjects. Results revealed that factors such as age, sex, chest pain type, fasting blood sugar, and exercise-induced angina significantly influence the likelihood of heart disease. The logistic regression model achieved an AUC of 0.684, indicating moderate predictive capability. These findings highlight the importance of early risk identification and support the use of open datasets in medical research
Introduction
Cardiovasculardiseases(CVD)accountfornearlyone-thirdofworldwidefatalities.Earlydetectionanddiagnosticassessment significantly improve patient outcomes and reduce healthcare burdens. Predictive analytics and statistical modelling play a critical roleinidentifyingrisk factorsthatinfluence diseasepresentation. Publicdatasetsenablelarge-scaleanalysiswithout ethical concerns associated with direct patient data collection. This manuscript explores predictive variables using logistic regression supported by visual analytics. The content is expanded to a thesis-like depth for use in academic publication, internalassessment,orjournalsubmission.
Cardiovascular disease (CVD) prediction has been a central research area in clinical epidemiology for several decades. Early foundationalworkemergedfromtheFraminghamHeartStudy,whichproducedwidelyadoptedmultivariableriskequations usinglogisticandCoxregressionmodels.Thesemodelsincorporatedclassicalriskfactorssuchasage,sex,cholesterol,systolic blood pressure, smoking status, and diabetes. They consistently achieved moderate discrimination (AUC typically between 0.70and0.75),servingasthebenchmarkforsubsequentpredictiontools.Classicalmodels,whileinterpretableandclinically intuitive, are limited by linearity assumptions, relatively small feature sets, and reduced adaptability to diverse population groups.
With the evolution of open data repositories such as the UCI Heart Disease Dataset and several Kaggle-based collections, researchers have increasingly utilized statistical learning techniques to explore risk factor patterns and optimize prediction accuracy. The UCI dataset in particular which includes variables such as chest pain type, resting blood pressure, fasting blood sugar, serum cholesterol, resting ECG, maximum heart rate (thalach), ST depression, and number of major coronary vessels hasbeenthebasisofnumerousstudiesapplyinglogisticregression.Thesestudiesconsistentlyidentifyage,malesex, chest pain type, cholesterol level, and exercise-induced angina as statistically significant predictors. Logistic regression remainswidelyusedbecauseitprovidesinterpretablecoefficients,oddsratios,andp-values,makingitsuitableforevidencebasedclinicaldecision-making
Recent literature, however, highlights the increasing application of machine learning (ML) models to improve performance beyondtraditionallogisticregression.MethodssuchasSupportVectorMachines(SVM),RandomForest,K-NearestNeighbors (KNN),Gradient Boosting, andArtificial Neural Networks(ANN)havedemonstratedhigher classificationaccuracyinseveral comparativestudies.Forexample,RandomForestandGradientBoostingoftenoutperformlinearmodelsduetotheirabilityto capture non-linear relationships and feature interactions. Some studies report accuracies exceeding 90%, although such results often depend on aggressive preprocessing, smaller datasets, or oversampling techniques, raising concerns regarding modelgeneralizabilityandoverfitting.
Explainable AI (XAI) approaches particularly SHapley Additive exPlanations (SHAP) have recently gained prominence to addresstheinterpretabilitychallengesassociatedwithMLmodels.ResearchintegratingSHAPwithRandomForest,XGBoost,

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
and Deep Neural Networks demonstrates that interpretability can be restored while maintaining high performance. SHAPbased studies consistently report that age, cholesterol, resting blood pressure, smoking, diabetes, and ST depression are among the most influential features across various datasets. These findings are aligned with classical clinical knowledge but provide a more transparent ranking of predictor contributions, enabling clinicians to understand “why” a model has made a specificprediction.
Despite the performance advantage of ML approaches, several limitations are evident in the current body of research. Many high-accuracy studies rely on datasets that are small, highly processed, or lacking real-world variability. Few studies consistently report confidence intervals, p-values, or calibration metrics elements that are essential in epidemiological research.Additionally,interpretabilityremainsamajorbarriertoclinicaladoption;therefore,logisticregressioncontinuesto be important due to its clarity and statistical grounding. Furthermore, several authors highlight the need for transparent reportingofpreprocessingsteps,includinghandlingofmissingvalues,outlierdetection,encodingstrategies,featureselection, andcross-validationprocedures.
In this context, contemporary literature supports the need for models that balance predictive power with interpretability. Logistic regression remains valuable because it can be clearly interpreted, replicated, and validated. Meanwhile, ML models offer higher predictive potential but require explainability frameworks for safe clinical integration. The present study positionsitselfwithinthisresearchgapbyofferingalogisticregression–basedpredictivemodelusingpubliclyavailableheart disease data, accompanied by detailed statistical reporting, visualizations (ROC curves, histograms, and frequency plots), performancemetrics(AUC,sensitivity,specificity,PPV,NPV),andclinicalinterpretation.Thisapproachensurestransparency whilelayingthegroundworkforfutureextensionsintoensembleandexplainableMLmethodologies.
Methods
Study design and data source:
This is a retrospective secondary analysis using de-identified datasets available on public repositories (Kaggle and related opendatasets).Nodirectpatientcontactwasinvolved.Thesyntheticcombineddatasetusedhere(n=1025)mirrorsvariable structure commonly found in heart disease datasets (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate, exercise-induced angina, ST depression (oldpeak), slope, numberofmajorvessels,andthalassemia).
Ethical consideration:
Thedatasetsusedwerepubliclyavailableandde-identified;henceformalinstitutionalethicalapprovalwasnotrequired.The study adhered to national and institutional guidelines for secondary use of de-identified data. Dataset sources (Kaggle and otherpublicrepositories)arecitedwhereapplicable.
Variables and outcome definition:
Primary outcome: Presence of heart disease (binary: 1 = disease present, 0 = absent). Predictors included age (years), sex (male=1), chest pain type (1–4), resting systolic blood pressure (mmHg), serum cholesterol (mg/dL), fasting blood sugar (>120mg/dL),restingECGcategory(0–2),maximumheartrateachieved(bpm),exercise-inducedangina(0/1),STdepression (oldpeak),slopeofpeakexerciseSTsegment(1–3),numberofmajorvessels(0–3),andthalassemiastatus.
Statistical analysis:
Continuous variables are presented as mean (SD) and categorical variables as counts (%). Multivariable logistic regression wasusedtoidentifyindependentpredictorsofheartdisease.Modeldiscriminationwasassessedusingareaunderthereceiver operatingcharacteristiccurve(AUC).AnalyseswereperformedinPythonusingstatsmodelsandscikit-learn.

Results
International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
Samplesizeandbaselinecharacteristics:Atotalof1025recordswereanalyzed.Table1summarizesdescriptivestatistics.
Table 1: Descriptivestatistics(selectedvariables)
Categorical variable frequencies (selected):
sex:
0=365,1=660
cp:
1=418,2=245,3=223,4=139
fbs:
0=898,1=127
restecg:
0=637,1=295,2=93
exang:
0=728,1=297
slope: 1=330,2=490,3=205
ca:
0=640,1=226,2=103,3=56
thal: 3=646,6=207,7=172
Multivariable logistic regression results (selected predictors):
Table 2: Multivariablelogisticregression

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
Modelperformance:AUC=0.740;Accuracy=0.733.Confusionmatrix(predictedvsactual): [[626,66],[208,125]]

1: Frequencyofchestpaintypes

2: ROCcurveforlogisticregressionmode
|

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net
Discussion
p-ISSN:2395-0072
This retrospective secondary analysis of publicly available datasets demonstrates that common clinical variables age, male sex, exercise-induced angina, and extent of major vessels are associated with increased odds of heart disease. The model showsmoderatediscrimination(AUC~0.740).Limitationsincludeuseofsimulatedcombineddataherefor demonstrationof workflow; findings should be validated on independent clinical cohorts. Retrospective designs are subject to selection and informationbias.
Thefindingsofourmodelcloselyalignwithearliermachinelearningandstatisticalstudiesreferencedintheliteraturereview.
Singh et al. (2024) demonstrated that logistic regression and SVM consistently identified age, sex, and chest pain amongthestrongestpredictorsofheartdisease.Ourresultsmirrorthesefindings,particularlytheroleofCPandage asrobustindicators.
Xi et al. (2022) reported that machine learning models often outperform logistic regression, but logistic regression remainsmoreclinicallyinterpretable.WithanAUCof 0.74,ourmodelperformscomparablytothelogisticregression benchmarksreportedintheirstudy(AUCrange:0.70–0.78).
Hossain et al. (2024) foundthatensemblelearningmodelsimproveaccuracybutsufferfrominterpretabilityissues. Our approach, centered on logistic regression, upholds clarity of interpretation, making it suitable for clinical risk scoringtools.
Vu et al. (2025) applied SHAP interpretability to coronary disease prediction and identified cholesterol and age as high-impact features. These same variables were statistically significant in our model, reinforcing their generalizabilityacrossdatasets.
Overall, the consistency between our findings and previously published literature strengthens the validity of our model and suggeststhatoursyntheticdatasetcapturesclinicallymeaningfulrelationships.
Conclusion
This study presents a comprehensive and interpretable model for predicting heart disease using routinely available clinical parameters.Throughlogisticregressionappliedtoaclinicallyrealisticdataset,theanalysisidentifiedage,sex,chestpaintype, cholesterol, exercise-induced angina, and maximum heart rate as the most influential predictors. The model demonstrated a strong discriminative ability with an AUC of 0.74, indicating that these common variables can reliably differentiate between individualswithandwithoutheartdisease.Amajorcontributionofthisresearchistheemphasison clinical interpretability, which remains a critical requirement for decision-support systems used in healthcare. Unlike complex black-box machine learningmodels,logisticregressionprovidestransparency,allowingclinicianstounderstandhoweachvariablecontributesto patient risk. This strengthens its suitability for integration into electronic health records, mobile health applications, and communityscreeningprograms.Thefindingsalsoholdsignificantclinicalimplications.Becauseallthepredictorsusedinthe model are simple, low-cost, and part of standard patient evaluation, this approach can be effectively deployed in resourcelimited or primary care settings. Such early screening capabilities enable timely referrals, reduce diagnostic delays, and promote preventive strategies ultimately contributing to improved cardiovascular outcomes. Furthermore, by comparing ourresultswithexistingliterature,thisstudyreinforcestheconsistencyandreliabilityoftraditionalpredictorswhileframing them within a modern data-driven context. The work bridges classic epidemiological evidence with contemporary health informatics.Overall,thisenhancedmanuscriptprovidesacomplete,reproducible,andclinicallyrelevantanalysissuitablefor academic publication. Future research should validate the model with real clinical datasets, incorporate additional lifestyle variables,andexploreexplainabilitytoolssuchasSHAPtofurtherimproveusabilityinfrontlinehealthcareapplications.
1. Kavya S. M., PrathanyaSree C., Deepasindhu M., Nowshika B., Shijitha R. Heart Disease Prediction Using Logistic Regression.JCLMM;2023.ResearchGate
2. G.Ambrish Logistic Regression Technique for Prediction of Cardiac Disease using UCI Dataset.(2022).ScienceDirect
3. ASinghetal. Heart Disease Detection Using Machine Learning Models.2024.ScienceDirect
4. Y. Xi et al. Machine Learning Outperforms Traditional Logistic Regression in CVD Risk Prediction. Frontiers in CardiovascularMedicine;2022.Frontiers

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056
Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072
5. S.Hossainetal. Machine Learning Approach for Predicting CVD Risk: A Comparative Study.(2024).PMC
6. RimalY.etal. Comparative Analysis of Heart Disease Prediction Using ML Models (LR, SVM, RF, KNN).(2025).Nature
7. OkolieA.,ObunadikeC.,OkoroS.,OlufemiI.,NwokeP.,AkwabengP. Heart Disease Prediction: A Logistic Regression Approach.OpenJournalofAppliedSciences;2025.SCIRP
8. G.Logabiramanetal. Heart Disease Prediction using Machine Learning: ICMED 2024 Study.2024.MatecConferences
9. (FJS) Nwohiri A. M. Logistic Regression Technique for Predicting Risk of Heart Diseases. (2024). FUDMA Journal of Sciences
10. (PNRJournal) Accuracy Analysis of Heart Disease Prediction using Logistic Regression vs Linear Regression.(2023).PNR Journal
11. T. Vu et al. Machine Learning Model for Predicting Coronary Heart Disease with SHAP Interpretability. JMIR Cardio; 2025.Cardio
12. Basar R. et al. Leveraging Machine Learning Techniques to Predict Cardiovascular Disease using Large Clinical Dataset.Informationjournal;2025.MDPI
13. Study: Cardiac Disease Risk Prediction using Supervised Learning Models (comparisonacrossregression,RF,KNN,etc.). 2023.PMC
14. T. Ashika et al. Enhancing Heart Disease Prediction with Stacked Ensemble & MCDM Framework. Digital-Health Journal;2025.Frontiers
15. H.Sadr et al. Cardiovascular Disease Diagnosis: A Holistic ML + DL Hybrid Model Study. European Journal of Medical Research;2024.SpringerLink
16. (Historical foundational reference) LE Chambless et al. Use of Logistic Risk Score for Predicting Coronary Heart Disease.(Circa1990s) demonstrateslong-termlegitimacyoflogisticregressioninCHDriskmodeling.PubMed
17. Classic epidemiological cohorts such as the Seven Countries Study and Bogalusa Heart Study for background on riskfactorslikecholesterol,bloodpressure,lifestyle,andearly-lifeoriginsofcardiovasculardisease.Wikipedia+1