Retrospective Study on Predictive Indicators of Heart Disease Using Publicly Available Datasets by IRJET Journal

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072

Retrospective Study on Predictive Indicators of Heart Disease Using Publicly Available Datasets

Mohamed Ameen S,

Lecturer, Department of Computer Science, Sri Venkateshwaraa Group of Institutions, Puducherry

Abstract - Heart disease remains one of the leading causes of mortality around the world. This study utilizes a public dataset to identify the key indicators that contribute to heart disease using statistical modeling methods. A detailed analysis involving descriptive statistics, histograms, and logistic regression modeling was conducted on a dataset comprising 1025 subjects. Results revealed that factors such as age, sex, chest pain type, fasting blood sugar, and exercise-induced angina significantly influence the likelihood of heart disease. The logistic regression model achieved an AUC of 0.684, indicating moderate predictive capability. These findings highlight the importance of early risk identification and support the use of open datasets in medical research

Introduction

Cardiovasculardiseases(CVD)accountfornearlyone-thirdofworldwidefatalities.Earlydetectionanddiagnosticassessment significantly improve patient outcomes and reduce healthcare burdens. Predictive analytics and statistical modelling play a critical roleinidentifyingrisk factorsthatinfluence diseasepresentation. Publicdatasetsenablelarge-scaleanalysiswithout ethical concerns associated with direct patient data collection. This manuscript explores predictive variables using logistic regression supported by visual analytics. The content is expanded to a thesis-like depth for use in academic publication, internalassessment,orjournalsubmission.

Literature Review

Cardiovascular disease (CVD) prediction has been a central research area in clinical epidemiology for several decades. Early foundationalworkemergedfromtheFraminghamHeartStudy,whichproducedwidelyadoptedmultivariableriskequations usinglogisticandCoxregressionmodels.Thesemodelsincorporatedclassicalriskfactorssuchasage,sex,cholesterol,systolic blood pressure, smoking status, and diabetes. They consistently achieved moderate discrimination (AUC typically between 0.70and0.75),servingasthebenchmarkforsubsequentpredictiontools.Classicalmodels,whileinterpretableandclinically intuitive, are limited by linearity assumptions, relatively small feature sets, and reduced adaptability to diverse population groups.

With the evolution of open data repositories such as the UCI Heart Disease Dataset and several Kaggle-based collections, researchers have increasingly utilized statistical learning techniques to explore risk factor patterns and optimize prediction accuracy. The UCI dataset in particular which includes variables such as chest pain type, resting blood pressure, fasting blood sugar, serum cholesterol, resting ECG, maximum heart rate (thalach), ST depression, and number of major coronary vessels hasbeenthebasisofnumerousstudiesapplyinglogisticregression.Thesestudiesconsistentlyidentifyage,malesex, chest pain type, cholesterol level, and exercise-induced angina as statistically significant predictors. Logistic regression remainswidelyusedbecauseitprovidesinterpretablecoefficients,oddsratios,andp-values,makingitsuitableforevidencebasedclinicaldecision-making

Recent literature, however, highlights the increasing application of machine learning (ML) models to improve performance beyondtraditionallogisticregression.MethodssuchasSupportVectorMachines(SVM),RandomForest,K-NearestNeighbors (KNN),Gradient Boosting, andArtificial Neural Networks(ANN)havedemonstratedhigher classificationaccuracyinseveral comparativestudies.Forexample,RandomForestandGradientBoostingoftenoutperformlinearmodelsduetotheirabilityto capture non-linear relationships and feature interactions. Some studies report accuracies exceeding 90%, although such results often depend on aggressive preprocessing, smaller datasets, or oversampling techniques, raising concerns regarding modelgeneralizabilityandoverfitting.

Explainable AI (XAI) approaches particularly SHapley Additive exPlanations (SHAP) have recently gained prominence to addresstheinterpretabilitychallengesassociatedwithMLmodels.ResearchintegratingSHAPwithRandomForest,XGBoost,

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072

and Deep Neural Networks demonstrates that interpretability can be restored while maintaining high performance. SHAPbased studies consistently report that age, cholesterol, resting blood pressure, smoking, diabetes, and ST depression are among the most influential features across various datasets. These findings are aligned with classical clinical knowledge but provide a more transparent ranking of predictor contributions, enabling clinicians to understand “why” a model has made a specificprediction.

Despite the performance advantage of ML approaches, several limitations are evident in the current body of research. Many high-accuracy studies rely on datasets that are small, highly processed, or lacking real-world variability. Few studies consistently report confidence intervals, p-values, or calibration metrics elements that are essential in epidemiological research.Additionally,interpretabilityremainsamajorbarriertoclinicaladoption;therefore,logisticregressioncontinuesto be important due to its clarity and statistical grounding. Furthermore, several authors highlight the need for transparent reportingofpreprocessingsteps,includinghandlingofmissingvalues,outlierdetection,encodingstrategies,featureselection, andcross-validationprocedures.

In this context, contemporary literature supports the need for models that balance predictive power with interpretability. Logistic regression remains valuable because it can be clearly interpreted, replicated, and validated. Meanwhile, ML models offer higher predictive potential but require explainability frameworks for safe clinical integration. The present study positionsitselfwithinthisresearchgapbyofferingalogisticregression–basedpredictivemodelusingpubliclyavailableheart disease data, accompanied by detailed statistical reporting, visualizations (ROC curves, histograms, and frequency plots), performancemetrics(AUC,sensitivity,specificity,PPV,NPV),andclinicalinterpretation.Thisapproachensurestransparency whilelayingthegroundworkforfutureextensionsintoensembleandexplainableMLmethodologies.

Methods

Study design and data source:

This is a retrospective secondary analysis using de-identified datasets available on public repositories (Kaggle and related opendatasets).Nodirectpatientcontactwasinvolved.Thesyntheticcombineddatasetusedhere(n=1025)mirrorsvariable structure commonly found in heart disease datasets (age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar, resting ECG results, maximum heart rate, exercise-induced angina, ST depression (oldpeak), slope, numberofmajorvessels,andthalassemia).

Ethical consideration:

Thedatasetsusedwerepubliclyavailableandde-identified;henceformalinstitutionalethicalapprovalwasnotrequired.The study adhered to national and institutional guidelines for secondary use of de-identified data. Dataset sources (Kaggle and otherpublicrepositories)arecitedwhereapplicable.

Variables and outcome definition:

Primary outcome: Presence of heart disease (binary: 1 = disease present, 0 = absent). Predictors included age (years), sex (male=1), chest pain type (1–4), resting systolic blood pressure (mmHg), serum cholesterol (mg/dL), fasting blood sugar (>120mg/dL),restingECGcategory(0–2),maximumheartrateachieved(bpm),exercise-inducedangina(0/1),STdepression (oldpeak),slopeofpeakexerciseSTsegment(1–3),numberofmajorvessels(0–3),andthalassemiastatus.

Statistical analysis:

Continuous variables are presented as mean (SD) and categorical variables as counts (%). Multivariable logistic regression wasusedtoidentifyindependentpredictorsofheartdisease.Modeldiscriminationwasassessedusingareaunderthereceiver operatingcharacteristiccurve(AUC).AnalyseswereperformedinPythonusingstatsmodelsandscikit-learn.

Results

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072

Samplesizeandbaselinecharacteristics:Atotalof1025recordswereanalyzed.Table1summarizesdescriptivestatistics.

Table 1: Descriptivestatistics(selectedvariables)

Categorical variable frequencies (selected):

sex:

0=365,1=660

cp:

1=418,2=245,3=223,4=139

fbs:

0=898,1=127

restecg:

0=637,1=295,2=93

exang:

0=728,1=297

slope: 1=330,2=490,3=205

ca:

0=640,1=226,2=103,3=56

thal: 3=646,6=207,7=172

Multivariable logistic regression results (selected predictors):

Table 2: Multivariablelogisticregression

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072

Modelperformance:AUC=0.740;Accuracy=0.733.Confusionmatrix(predictedvsactual): [[626,66],[208,125]]

1: Frequencyofchestpaintypes

2: ROCcurveforlogisticregressionmode

Figures

Figure

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net

Discussion

p-ISSN:2395-0072

This retrospective secondary analysis of publicly available datasets demonstrates that common clinical variables age, male sex, exercise-induced angina, and extent of major vessels are associated with increased odds of heart disease. The model showsmoderatediscrimination(AUC~0.740).Limitationsincludeuseofsimulatedcombineddataherefor demonstrationof workflow; findings should be validated on independent clinical cohorts. Retrospective designs are subject to selection and informationbias.

Thefindingsofourmodelcloselyalignwithearliermachinelearningandstatisticalstudiesreferencedintheliteraturereview.

 Singh et al. (2024) demonstrated that logistic regression and SVM consistently identified age, sex, and chest pain amongthestrongestpredictorsofheartdisease.Ourresultsmirrorthesefindings,particularlytheroleofCPandage asrobustindicators.

 Xi et al. (2022) reported that machine learning models often outperform logistic regression, but logistic regression remainsmoreclinicallyinterpretable.WithanAUCof 0.74,ourmodelperformscomparablytothelogisticregression benchmarksreportedintheirstudy(AUCrange:0.70–0.78).

 Hossain et al. (2024) foundthatensemblelearningmodelsimproveaccuracybutsufferfrominterpretabilityissues. Our approach, centered on logistic regression, upholds clarity of interpretation, making it suitable for clinical risk scoringtools.

 Vu et al. (2025) applied SHAP interpretability to coronary disease prediction and identified cholesterol and age as high-impact features. These same variables were statistically significant in our model, reinforcing their generalizabilityacrossdatasets.

Overall, the consistency between our findings and previously published literature strengthens the validity of our model and suggeststhatoursyntheticdatasetcapturesclinicallymeaningfulrelationships.

Conclusion

This study presents a comprehensive and interpretable model for predicting heart disease using routinely available clinical parameters.Throughlogisticregressionappliedtoaclinicallyrealisticdataset,theanalysisidentifiedage,sex,chestpaintype, cholesterol, exercise-induced angina, and maximum heart rate as the most influential predictors. The model demonstrated a strong discriminative ability with an AUC of 0.74, indicating that these common variables can reliably differentiate between individualswithandwithoutheartdisease.Amajorcontributionofthisresearchistheemphasison clinical interpretability, which remains a critical requirement for decision-support systems used in healthcare. Unlike complex black-box machine learningmodels,logisticregressionprovidestransparency,allowingclinicianstounderstandhoweachvariablecontributesto patient risk. This strengthens its suitability for integration into electronic health records, mobile health applications, and communityscreeningprograms.Thefindingsalsoholdsignificantclinicalimplications.Becauseallthepredictorsusedinthe model are simple, low-cost, and part of standard patient evaluation, this approach can be effectively deployed in resourcelimited or primary care settings. Such early screening capabilities enable timely referrals, reduce diagnostic delays, and promote preventive strategies ultimately contributing to improved cardiovascular outcomes. Furthermore, by comparing ourresultswithexistingliterature,thisstudyreinforcestheconsistencyandreliabilityoftraditionalpredictorswhileframing them within a modern data-driven context. The work bridges classic epidemiological evidence with contemporary health informatics.Overall,thisenhancedmanuscriptprovidesacomplete,reproducible,andclinicallyrelevantanalysissuitablefor academic publication. Future research should validate the model with real clinical datasets, incorporate additional lifestyle variables,andexploreexplainabilitytoolssuchasSHAPtofurtherimproveusabilityinfrontlinehealthcareapplications.

REFERENCES

1. Kavya S. M., PrathanyaSree C., Deepasindhu M., Nowshika B., Shijitha R. Heart Disease Prediction Using Logistic Regression.JCLMM;2023.ResearchGate

2. G.Ambrish Logistic Regression Technique for Prediction of Cardiac Disease using UCI Dataset.(2022).ScienceDirect

3. ASinghetal. Heart Disease Detection Using Machine Learning Models.2024.ScienceDirect

4. Y. Xi et al. Machine Learning Outperforms Traditional Logistic Regression in CVD Risk Prediction. Frontiers in CardiovascularMedicine;2022.Frontiers

International Research Journal of Engineering and Technology (IRJET) e-ISSN:2395-0056

Volume: 12 Issue: 11 | Nov 2025 www.irjet.net p-ISSN:2395-0072

5. S.Hossainetal. Machine Learning Approach for Predicting CVD Risk: A Comparative Study.(2024).PMC

6. RimalY.etal. Comparative Analysis of Heart Disease Prediction Using ML Models (LR, SVM, RF, KNN).(2025).Nature

7. OkolieA.,ObunadikeC.,OkoroS.,OlufemiI.,NwokeP.,AkwabengP. Heart Disease Prediction: A Logistic Regression Approach.OpenJournalofAppliedSciences;2025.SCIRP

8. G.Logabiramanetal. Heart Disease Prediction using Machine Learning: ICMED 2024 Study.2024.MatecConferences

9. (FJS) Nwohiri A. M. Logistic Regression Technique for Predicting Risk of Heart Diseases. (2024). FUDMA Journal of Sciences

10. (PNRJournal) Accuracy Analysis of Heart Disease Prediction using Logistic Regression vs Linear Regression.(2023).PNR Journal

11. T. Vu et al. Machine Learning Model for Predicting Coronary Heart Disease with SHAP Interpretability. JMIR Cardio; 2025.Cardio

12. Basar R. et al. Leveraging Machine Learning Techniques to Predict Cardiovascular Disease using Large Clinical Dataset.Informationjournal;2025.MDPI

13. Study: Cardiac Disease Risk Prediction using Supervised Learning Models (comparisonacrossregression,RF,KNN,etc.). 2023.PMC

14. T. Ashika et al. Enhancing Heart Disease Prediction with Stacked Ensemble & MCDM Framework. Digital-Health Journal;2025.Frontiers

15. H.Sadr et al. Cardiovascular Disease Diagnosis: A Holistic ML + DL Hybrid Model Study. European Journal of Medical Research;2024.SpringerLink

16. (Historical foundational reference) LE Chambless et al. Use of Logistic Risk Score for Predicting Coronary Heart Disease.(Circa1990s) demonstrateslong-termlegitimacyoflogisticregressioninCHDriskmodeling.PubMed

17. Classic epidemiological cohorts such as the Seven Countries Study and Bogalusa Heart Study for background on riskfactorslikecholesterol,bloodpressure,lifestyle,andearly-lifeoriginsofcardiovasculardisease.Wikipedia+1