Essential Guide to Model Evaluation Metrics

Measure what matters: master evaluation metrics to make better clinical predictions.

A detailed and practical guide to model evaluation metrics, with a focus on healthcare and the real-world impact of prediction errors.

In data science, building a predictive model is only half the battle; the other half lies in measuring how well that model performs. A model is only as useful as the reliability of the decisions it supports. Especially in critical applications like medicine, where models may inform diagnoses or treatments, understanding evaluation metrics is not just technical diligence; it's a moral responsibility.

A predictive model that classifies whether patients are sick or healthy must do more than just "get it right" most of the time. It must minimize harmful mistakes: false positives, where healthy patients are mistakenly labeled as sick, and false negatives, where sick patients are declared healthy, a potentially life-threatening error. Hence, evaluating models with appropriate metrics is essential to ensure their real-world utility and safety.

Before diving into specific metrics, we need to understand how we represent a model's predictions against reality. This is where the confusion matrix becomes an indispensable tool.

Let's consider a simplified yet realistic example. Imagine we have developed a machine learning model to diagnose a rare disease based on blood test results. The dataset contains 1,000 patient records. Of these, 50 patients actually have the disease (positive cases), and 950 do not (negative cases).

Now suppose our model predicts everyone is healthy; it simply says "no disease" for every patient. In this case, it would be 95% accurate, since it correctly labeled 950 out of 1,000 patients. But it failed to detect any of the 50 truly sick patients. That's not just a poor model: it's a dangerous one.
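To make that arithmetic concrete, here is a minimal, illustrative sketch (the labels below are synthetic stand-ins for the 1,000-patient example, not a real dataset):

from sklearn.metrics import accuracy_score, recall_score

# 1,000 synthetic patients: 50 sick (1), 950 healthy (0)
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # a "model" that always predicts "healthy"

print(accuracy_score(y_true, y_pred))  # 0.95: looks impressive
print(recall_score(y_true, y_pred))    # 0.0: every sick patient is missed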
Accuracy alone, as we'll see, is often misleading in imbalanced datasets. That's why we need other metrics to evaluate model performance more carefully. And our journey begins with the confusion matrix.

The confusion matrix is a compact yet powerful tool that summarizes a classifier's predictions. It tells us not only how many predictions were correct, but what kinds of mistakes the model made. Here's a standard 2×2 confusion matrix for binary classification:
• True Positives (TP): The model correctly predicts a patient has the disease.
• False Positives (FP): The model incorrectly predicts disease in a healthy patient.
• False Negatives (FN): The model fails to identify a patient who actually has the disease.
• True Negatives (TN): The model correctly predicts a patient is healthy.

This structure is the foundation for most classification metrics. It allows us to compute quantities like precision, recall, specificity, and F1 score, all of which we'll explore in later chapters.

Python's scikit-learn makes it easy to compute and visualize confusion matrices:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # predicted labels

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix Example")
plt.show()
Figure 1: Confusion matrix generated by the provided code.

This snippet computes and plots the confusion matrix from actual vs. predicted values. In medical scenarios, where decision impact is critical, visual tools like this can help data scientists communicate performance more effectively to clinicians and stakeholders.

In this chapter, we've introduced why evaluation matters, explored a medical example, and unpacked the logic and structure of the confusion matrix. But the matrix is only the beginning; the numbers it contains can be transformed into various metrics that reveal different aspects of model behavior.

In the next chapter, we'll dive into these basic classification metrics, starting with accuracy but quickly moving beyond it to uncover why precision, recall, and F1 score are more informative in medical settings. We'll connect each metric to its clinical relevance, and show how you can compute them easily in Python.

Now that we understand the confusion matrix, we can derive from it a set of metrics that quantify a model's performance from different perspectives. Each metric answers a slightly different question and is more or less relevant depending on the clinical context.

In healthcare, we often care more about detecting sick patients (sensitivity) or avoiding unnecessary treatments (specificity) than we do about overall correctness (accuracy). In this chapter, we'll explore:
• What each metric means
• How to calculate it
• When and why it's important
• How to compute it in Python
• How to visualize it geometrically
Accuracy
Definition: The proportion of correct predictions (both positive and negative) over all predictions.

Usefulness: Accuracy can be misleading in imbalanced datasets. If 95% of patients are healthy, a model predicting "healthy" for everyone will achieve 95% accuracy but fail to detect any sick patients.
Python:
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
Sensitivity (Recall)

Definition: The proportion of actual positives (sick patients) that were correctly identified.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Also known as recall, this metric answers: "Of all the sick patients, how many did the model catch?"

Clinical relevance: Missing a diagnosis (a false negative) can be dangerous. High sensitivity is crucial when the cost of missing a disease is high.
Python:
from sklearn.metrics import recall_score

recall_score(y_true, y_pred)
Specificity

Definition: The proportion of actual negatives (healthy patients) that were correctly identified.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

This metric answers: "Of all the healthy patients, how many were correctly identified as healthy?"

Clinical relevance: A low specificity means many healthy people are falsely diagnosed, possibly undergoing unnecessary stress or treatments.

Note: scikit-learn does not include specificity by default, but you can calculate it manually.
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()
specificity = TN / (TN + FP)
Precision
Definition: The proportion of predicted positives that were actually positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Usefulness: Precision tells us how much we can trust positive predictions. In medical contexts where treatment is expensive or risky, precision matters.
Python:
from sklearn.metrics import precision_score

precision_score(y_true, y_pred)
F1 Score

Definition: The harmonic mean of precision and recall. It balances the trade-off between both metrics.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Usefulness: The F1 score is especially useful in imbalanced scenarios where both false positives and false negatives are costly.
Python:
from sklearn.metrics import f1_score

f1_score(y_true, y_pred)
Metric | Formula | What it Measures
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness
Sensitivity | TP / (TP + FN) | Ability to detect sick patients
Specificity | TN / (TN + FP) | Ability to avoid false alarms in healthy patients
Precision | TP / (TP + FP) | Trustworthiness of a positive prediction
F1 Score | 2 · (Precision · Recall) / (Precision + Recall) | Harmonic mean of precision and recall
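To tie the table together, here is a minimal sketch that computes each of these values from the toy labels used in the confusion matrix example (illustrative values only, not a clinical dataset):

from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))
print("Specificity:", TN / (TN + FP))
print("Precision:  ", precision_score(y_true, y_pred))
print("F1 score:   ", f1_score(y_true, y_pred))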
Now that we have the fundamental metrics covered, we are ready to dive deeper into evaluation through curves and thresholds. In the next chapter, we will explore ROC curves, AUC, and Precision-Recall curves: tools that show how performance changes as the decision threshold varies, which is particularly important in clinical risk prediction.

So far, we've evaluated models with fixed thresholds, classifying predictions as positive or negative based on a decision boundary (e.g., 0.5). However, many models output probabilities rather than hard labels. In these cases, we can adjust the threshold to explore different trade-offs between sensitivity and specificity, or between precision and recall.

This is where ROC curves and Precision-Recall curves become powerful tools. They allow us to visualize a model's performance across all thresholds, helping us make better decisions depending on clinical priorities.

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate at various threshold settings.

• X-axis: False Positive Rate = FP / (FP + TN)
• Y-axis: True Positive Rate = TP / (TP + FN)

Each point on the ROC curve corresponds to a different threshold. The closer the curve is to the top-left corner, the better the model is at distinguishing between classes.

The Area Under the Curve (AUC) provides a single value summarizing the entire ROC curve. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
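This ranking interpretation can be checked numerically; a small, illustrative sketch with synthetic labels and scores (not part of the book's example):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=500)           # synthetic labels
scores = y + rng.normal(0, 1, size=500)    # synthetic scores, higher on average for positives

# Fraction of (positive, negative) pairs in which the positive scores higher
pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(pairwise)                  # should (nearly) match...
print(roc_auc_score(y, scores))  # ...the AUC reported by scikit-learn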
AUC Value | Interpretation | Explanation
1.0 | Perfect model | Indicates a model that can classify all samples perfectly without any errors.
0.9 | Excellent | Represents a model with high discriminatory power, very close to perfect classification.
0.8 | Very good | Signifies a model with strong discrimination ability, though not as high as 0.9.
0.7 | Reasonable | Denotes a model with some discriminatory capability; room for improvement may exist.
0.6 | Relatively poor | Indicates a model struggling to distinguish between positive and negative samples.
0.5 | Random guessing | Represents a model that performs no better than random chance in classification tasks.
In a medical setting, AUC helps answer: "How well can the model rank sick patients above healthy ones?"

However, AUC-ROC can be misleading in imbalanced datasets. For example, in rare diseases where negatives dominate, a high AUC may hide poor performance on positives.

Python Example
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_scores = model.predict_proba(X_test)[:, 1]  # probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
The Precision-Recall (PR) curve plots precision against recall at various threshold settings:

• X-axis: Recall
• Y-axis: Precision

It is especially informative for imbalanced datasets, where ROC curves might present an overly optimistic view.
The area under the PR curve (AUC-PR) provides a summary of the model's ability to maintain precision as recall increases. It's typically lower than AUC-ROC but more reflective of real performance in rare-event settings like disease detection.

A high precision at high recall means the model can detect most sick patients without flooding the clinic with false alarms. This is a crucial balance when deploying screening tools.

Python Example
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

plt.plot(recall, precision, label=f"PR curve (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
The choice of threshold impacts:

• Sensitivity: A higher threshold yields fewer false positives, but also more false negatives.
• Precision: A lower threshold yields more predicted positives, but potentially lower precision.

In medicine, this trade-off must align with clinical goals:

• Early screening: Prefer high sensitivity (low threshold).
• Confirmatory diagnosis: Prefer high precision (high threshold).

The scikit-learn documentation has interactive notebooks that show how metrics change with thresholds. You can also use plotly or ipywidgets to build threshold sliders.
# Optional: Interactive threshold tuning
from ipywidgets import interact
from sklearn.metrics import accuracy_score, recall_score, precision_score

def update(threshold=0.5):
    y_pred_thresholded = (y_scores >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred_thresholded)
    recall = recall_score(y_true, y_pred_thresholded)
    precision = precision_score(y_true, y_pred_thresholded)
    print(f"Threshold: {threshold:.2f} - Accuracy: {acc:.2f}, Recall: {recall:.2f}, Precision: {precision:.2f}")

interact(update, threshold=(0.0, 1.0, 0.05));
The following table provides a concise overview of two commonly used evaluation curves in machine learning: the ROC (Receiver Operating Characteristic) curve and the Precision-Recall curve. These curves offer insights into the performance of classification models by examining different aspects of their predictions.

Curve Type | Axes | Best For | AUC Meaning
ROC Curve | TPR vs. FPR | General model discrimination | Probability a positive is ranked above a negative
Precision-Recall | Precision vs. Recall | Imbalanced classes, rare disease detection | Average precision across thresholds
ROC and PR curves provide a dynamic view of model performance across thresholds. They help clinicians and data scientists choose models and thresholds that align with real-world needs.

In the next chapter, we'll go beyond standard metrics to explore less common but powerful ones, such as Balanced Accuracy, Matthews Correlation Coefficient, and Cohen's Kappa, which are particularly useful in cases where traditional metrics fall short.

Basic metrics like accuracy and F1 score are often sufficient to evaluate models, but not always. In situations with class imbalance, multiple raters, or a need for more nuanced comparisons, advanced metrics provide deeper insights. In healthcare, where subtle mistakes can have serious consequences, these metrics help us evaluate models more rigorously and fairly.

In this chapter, we will cover:
• Balanced Accuracy – adjusts for class imbalance.
• Matthews Correlation Coefficient (MCC) – a balanced metric even for skewed classes.
• Cohen's Kappa – agreement beyond chance.
• Youden's Index – useful for threshold optimization in diagnostics.
• Lift and Gain Charts – for evaluating probabilistic models in decision-making pipelines.
Balanced Accuracy

Definition: The mean of sensitivity and specificity.

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$

It is especially useful for imbalanced datasets, as it gives equal weight to the performance on each class.

Imagine a rare disease affecting only 5% of patients. A model could achieve high accuracy by simply predicting everyone as healthy, but balanced accuracy penalizes such behavior, revealing poor sensitivity.

Python Example

from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_true, y_pred)
Matthews Correlation Coefficient (MCC)

Definition

MCC is a correlation coefficient between observed and predicted classifications:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

It returns a value between -1 (total disagreement) and +1 (perfect prediction), with 0 indicating random prediction.
Clinical Relevance

Unlike F1 or accuracy, MCC is invariant to class distribution. It is ideal when class imbalance is severe and false positives and false negatives are equally important.
Python Example
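A minimal sketch using scikit-learn's matthews_corrcoef (assuming y_true and y_pred as in the earlier examples):

from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_true, y_pred)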
Cohen's Kappa

Definition

Cohen's Kappa measures inter-rater agreement, adjusting for the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where:
• $p_o$ is the observed agreement.
• $p_e$ is the expected agreement by chance.

Clinical Relevance

Kappa is useful when comparing predictions from two sources (e.g., model vs. doctor), highlighting whether agreement is meaningful or just random.
Python Example

from sklearn.metrics import cohen_kappa_score

cohen_kappa_score(y_true, y_pred)
Youden's Index

Definition

Youden's Index summarizes the performance of a diagnostic test:

$$J = \text{Sensitivity} + \text{Specificity} - 1$$

This index ranges from 0 (worthless test) to 1 (perfect test). It's often used to determine the optimal threshold for binary classifiers.

In medicine, choosing the threshold that maximizes Youden's Index ensures the best trade-off between detecting disease and avoiding false alarms.
Python Example (manual):

import numpy as np
from sklearn.metrics import roc_curve

def youden_index(y_true, y_scores):
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    J = tpr - fpr
    idx = np.argmax(J)
    return thresholds[idx], J[idx]

optimal_thresh, index_val = youden_index(y_true, y_scores)
Lift and gain charts show how much better a model performs compared to random selection when we rank patients by predicted probability and then select the top X%.

• Gain chart: Shows the cumulative % of true positives captured as we increase the % of the population screened.
• Lift chart: Shows the ratio of detected positives to what would be expected by random selection.

These charts are useful when deploying models to prioritize screenings or allocate limited diagnostic resources. For example, testing the top 10% of patients by risk might identify 60% of true positives, a lift of 6x over random screening.
import scikitplot as skplt
import matplotlib.pyplot as plt

skplt.metrics.plot_cumulative_gain(y_true, model.predict_proba(X_test))
plt.title("Cumulative Gain Chart")
plt.show()

skplt.metrics.plot_lift_curve(y_true, model.predict_proba(X_test))
plt.title("Lift Curve")
plt.show()
Metric | What it Measures | Best Use Case
Balanced Accuracy | Mean of sensitivity and specificity | Imbalanced datasets
MCC | Overall classification quality | All cases, especially with imbalance
Cohen's Kappa | Agreement beyond chance | Comparing model vs. expert predictions
Youden's Index | Optimal diagnostic threshold | Choosing clinical thresholds
Lift / Gain Charts | Screening performance at ranked thresholds | Prioritizing patient testing or interventions
Advanced metrics give us more power to evaluate models fairly and usefully, especially in challenging clinical conditions. Some are mathematically complex, but they offer a richer understanding when accuracy and F1 are not enough.

Next, we will explore how metrics change in real-world conditions: what happens when data is unbalanced, noisy, or when predictions are probabilistic instead of binary. We'll also discuss calibration, an often overlooked but essential concept in healthcare risk prediction.

So far, we've focused on defining and calculating performance metrics. But understanding their clinical implications is just as important, and often more challenging. A great metric in one context can be dangerously misleading in another.

In this chapter, we explore:
• The real-world meaning of false positives and false negatives
• How to quantify the cost of errors
• How to choose the optimal threshold based on clinical needs
                | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
False Negative (FN):
A sick patient is classified as healthy.

Clinical implications:

• Disease goes undetected.
• No treatment is given.
• Disease may progress unnoticed.
• Higher risk of complications or death.

Example: A patient with early-stage cancer is classified as healthy -> no follow-up scan or biopsy -> late-stage detection.
False Positive (FP):
A healthy patient is classified as sick.

Clinical implications:

• Patient may undergo unnecessary testing or treatment.
• Psychological stress and anxiety.
• Possible side effects from unnecessary interventions.
• Resource waste in health systems.

Example: A patient is falsely diagnosed with heart disease -> sent for expensive and invasive tests -> mental burden and costs.
The impact of errors varies depending on the disease, the healthcare system, and the availability of treatments. For some conditions, missing a diagnosis (FN) is much worse than overdiagnosis (FP). For others, it's the reverse.

Disease Context | FP Cost | FN Cost
Cancer screening | Psychological stress, testing | Missed early treatment opportunity
Infectious diseases | Quarantine, stigma | Further spread of infection
Genetic disorders | Counseling, lifestyle changes | Loss of preventive measures
COVID triage (2020) | ICU bed occupied needlessly | Missed emergency support
Most models output probabilities. By default, we often use a 0.5 threshold to classify patients, but this is arbitrary. The best threshold depends on the clinical context and the relative cost of mistakes.
import numpy as np
from sklearn.metrics import precision_score, recall_score

for t in np.arange(0.1, 1.0, 0.1):  # candidate thresholds (illustrative range)
    y_pred = (y_scores >= t).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"Threshold: {t:.2f} | Precision: {p:.2f}, Recall: {r:.2f}")
This helps us visualize the trade-off: as the threshold increases, precision rises but recall falls, and vice versa.

Choose the threshold that maximizes:

$$\text{Sensitivity} + \text{Specificity} - 1$$

Use weighted losses for false positives vs. false negatives. For example:
import numpy as np
from sklearn.metrics import confusion_matrix

def cost_sensitive_score(y_true, y_scores, fp_cost=1, fn_cost=5):
    thresholds = np.arange(0.0, 1.01, 0.01)
    best_thresh = 0
    min_cost = float("inf")
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # fix labels so ravel() always yields 4 values
        TN, FP, FN, TP = cm.ravel()
        cost = FP * fp_cost + FN * fn_cost
        if cost < min_cost:
            min_cost = cost
            best_thresh = t
    return best_thresh, min_cost
A more formal approach comes from decision theory (often used in diagnostic test evaluation).
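One widely used decision-theoretic quantity in diagnostic test evaluation is the net benefit from decision curve analysis; a minimal, illustrative sketch (assuming y_true and y_scores as above, with p_t a hypothetical threshold probability reflecting the harm-benefit trade-off):

import numpy as np
from sklearn.metrics import confusion_matrix

def net_benefit(y_true, y_scores, p_t):
    # Net benefit = TP/N - FP/N * (p_t / (1 - p_t)), evaluated at threshold p_t
    y_pred = (np.asarray(y_scores) >= p_t).astype(int)
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    n = len(y_true)
    return TP / n - FP / n * (p_t / (1 - p_t))

# Compare net benefit across a few candidate threshold probabilities
for p_t in (0.1, 0.2, 0.3):
    print(p_t, net_benefit(y_true, y_scores, p_t))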
Suppose you have a model that predicts the risk of type 2 diabetes. You can tune your threshold based on:

• If your goal is early intervention, prioritize high sensitivity (lower threshold).
• If you have limited capacity for follow-up testing, prioritize high precision (higher threshold).

Figure suggestion: a slider-based plot that updates the confusion matrix and cost scores as the threshold is changed (see the previous chapter's interactive widget).

Factor | Low Threshold | High Threshold
False Positives | More | Fewer
Best For | Screening, early detection | Confirmatory diagnosis
Metrics alone are meaningless without context. In medicine, we must always ask: What happens to a real person if the model is wrong?

Choosing the right threshold is an ethical and practical decision. In the next chapter, we'll explore how to deal with imbalanced datasets, where traditional metrics fail and advanced techniques are required to train and evaluate reliable models.

In medical applications, we often encounter class imbalance: diseases are typically rare, meaning that the majority of patients in our dataset are healthy. This creates a serious challenge for model evaluation.

A model that simply predicts "healthy" for everyone can appear deceptively good if evaluated with the wrong metric. In this chapter, we will explore:

• Why accuracy can be misleading
• How to handle imbalance through resampling techniques
• Which metrics remain robust under imbalance
Let's say only 5% of patients in a dataset have a disease. That means 95% are healthy.

If a model always predicts "healthy", it will be:

$$\text{Accuracy} = \frac{950}{1000} = 95\%$$

That sounds great, but it's useless. The model misses every sick patient. In this context, accuracy rewards the majority class, regardless of clinical utility.
Here are metrics that provide meaningful insights even when one class is rare:

• Sensitivity (Recall): Measures the ability to find sick patients.
• Precision: Measures how many predicted positives are actually correct.
• F1 Score: Harmonic mean of precision and recall.
• Balanced Accuracy: Averages performance on both classes.
• MCC (Matthews Correlation Coefficient): Invariant to class size.
• AUC-PR: More informative than AUC-ROC in skewed datasets.

In rare-disease prediction, always report metrics beyond accuracy.
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.95, 0.05], flip_y=0,
                           n_features=20, n_informative=3, random_state=42)

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

print(classification_report(y, y_pred, digits=3))
Undersampling

Randomly remove examples from the majority class to balance the dataset.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X, y)

Pros:
• Simple and fast
• Preserves the minority class

Cons:
• May discard useful data
Oversampling (SMOTE)

Duplicate or synthetically generate minority class examples.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)

Pros:
• No data loss
• Helps the model learn patterns in rare classes

Cons:
• Risk of overfitting if oversampled naively
Class Weights

Many models (like logistic regression, SVMs, and decision trees) allow you to penalize errors on the minority class more heavily.

model = LogisticRegression(class_weight='balanced')
model.fit(X, y)

This approach does not change the dataset, only how the model interprets errors.
You can visualize how different strategies affect model performance:

from sklearn.metrics import roc_auc_score

model.fit(X, y)
print("Baseline AUC-ROC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))

model_resampled = LogisticRegression()
model_resampled.fit(X_res, y_res)
print("Resampled AUC-ROC:", roc_auc_score(y, model_resampled.predict_proba(X)[:, 1]))
Also useful: precision-recall curves, which respond more clearly to rare-class improvement.
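As a sketch of that comparison (reusing model, model_resampled, X, and y from the snippets above; illustrative, not a benchmark):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

for name, clf in [("Baseline", model), ("Resampled", model_resampled)]:
    probs = clf.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y, probs)
    ap = average_precision_score(y, probs)
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall: baseline vs. resampled")
plt.legend()
plt.show()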
Method | Goal | Risks
MCC | Robust to imbalance | Harder to explain to clinicians
Imagine you're building a model to detect a rare genetic disorder with a prevalence of 1%.

• A naive model will simply predict "healthy" for everyone.
• A well-designed model will leverage SMOTE or adjust class weights to detect the minority.
• Metrics like F1 score and AUC-PR will expose the difference in quality.

You can adapt a precision-recall curve on imbalanced vs. resampled datasets to show how oversampling improves recall at low thresholds. Try using:

• scikit-plot for quick plots.
• plotly for interactive curves.

In the real world, data is rarely clean or balanced. Understanding how imbalance distorts evaluation, and what to do about it, is key to deploying reliable models, especially in healthcare.

In the next chapter, we'll look at probabilistic predictions and model calibration, making sure that predicted probabilities actually reflect real-world risk, an essential step for clinical decision-making.
Most modern classifiers (logistic regression, random forests, gradient boosting, neural networks) don't just give a binary prediction. They produce probabilities. This allows for more nuanced decision-making: in clinical settings, it helps doctors assess risk, prioritize cases, and allocate resources.

But this introduces a new challenge: how do we evaluate the quality of probabilities?

In this chapter, we explore:
• Brier Score: for measuring the accuracy of probabilistic forecasts
• Log Loss: penalizing confidence in incorrect predictions
• Calibration: ensuring that predicted probabilities match observed frequencies
• Reliability curves: to visualize calibration

The Brier Score is the mean squared error between the predicted probability and the true class label:

$$\text{Brier Score} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$$

Where:
• $p_i$ is the predicted probability of the positive class
• $y_i \in \{0, 1\}$ is the actual label

Values range from 0 (perfect prediction) to 1 (worst).
It penalizes miscalibration and misplaced confidence. A model that is wrong yet overly confident (predicting 0.99 when the outcome is 0) will be penalized far more than a cautious model predicting 0.7.
from sklearn.metrics import brier_score_loss

brier_score_loss(y_true, y_prob)
Log Loss (Cross-Entropy Loss)

Definition

Log loss (a.k.a. binary cross-entropy) measures the negative log-likelihood of the true label given the predicted probability:

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

This metric strongly penalizes incorrect high-confidence predictions.

Clinical Relevance

Log loss is more sensitive than the Brier Score. It rewards well-calibrated, confident predictions and severely punishes wrong, overconfident ones.
from sklearn.metrics import log_loss

log_loss(y_true, y_prob)

Always ensure y_prob contains probabilities, not class labels.
Calibration: Do Probabilities Reflect Reality?

Calibration refers to how well the predicted probabilities match actual outcomes.

• A well-calibrated model that outputs 0.8 for 100 patients should be correct about 80 times.
• A poorly calibrated model may output extreme probabilities (0.99 or 0.01) without justification.

Poor calibration can lead to over-treatment or under-treatment in medical decision-making.

Reliability Curves

These curves plot predicted probability vs. observed frequency of positives. A perfectly calibrated model lies on the diagonal.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.show()
Some models (like logistic regression) are naturally well-calibrated. Others (e.g., random forests, gradient boosting) may need post-processing:

Platt Scaling

Fits a logistic regression on the model outputs.
Isotonic Regression

Fits a non-parametric monotonic function for flexible calibration.
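Both approaches are available in scikit-learn via CalibratedClassifierCV; a minimal sketch (base_model, X_train, y_train, and X_test are placeholders assumed for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

base_model = RandomForestClassifier(random_state=42)  # hypothetical uncalibrated model

# method='sigmoid' corresponds to Platt scaling; method='isotonic' to isotonic regression
calibrated = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)  # X_train, y_train assumed to exist
y_prob = calibrated.predict_proba(X_test)[:, 1]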
Situation | Importance of Calibration
Binary decision only | Low
Triage / prioritization | High
Communicating risk to clinicians | Very high
Integrating model into EHR systems | Extremely high

Metric | Measures | Best For
Log Loss | Likelihood penalization | Model optimization, high sensitivity
Calibration Curve | Visual check of calibration | Clinical risk models
Calibrated Models | Probability alignment | Safety-critical applications
In medicine, probability isn't just a number; it's a signal to act. Miscalibrated probabilities can lead to overconfidence or misdiagnosis.

In the final chapter, we'll cover explainability, fairness, and ethical use of metrics and models. A model that performs well but hides its logic or embeds bias can do more harm than good, even with perfect AUC.

A model can be accurate, well-calibrated, and statistically robust, and still be unethical or harmful.

In healthcare, machine learning is not just a technical exercise; it's a social one. Predictive models can amplify health disparities, exclude vulnerable groups, or make decisions no human can understand.

In this final chapter, we examine:

• What it means for a model to be fair
• How bias can emerge, even from well-intentioned data
• How to evaluate models ethically and transparently
• Tools and techniques to improve interpretability
Bias in Medical Models

Bias can appear in several forms:

Type of Bias | Example
Representation | Minorities underrepresented in training data
Measurement | Data collection varies across demographics
Historical | Models trained on data that reflect systemic inequities
Labeling | Labels reflect subjective or biased clinician decisions
Recent studies revealed that pulse oximeters may be less accurate in individuals with darker skin tones, a form of measurement bias that, if replicated in predictive models, could exacerbate inequalities in triage and treatment.

Several metrics help assess fairness:
Demographic Parity

The model's output is independent of a protected attribute (e.g., race, sex):

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)$$

Equal Opportunity

Equal true positive rates across groups:

$$P(\hat{Y} = 1 \mid Y = 1, A = a) = P(\hat{Y} = 1 \mid Y = 1, A = b)$$

Equalized Odds

Equal false positive and true positive rates across groups:

$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b), \quad y \in \{0, 1\}$$
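These group-wise rates can be checked directly from predictions; a minimal sketch, assuming a binary protected attribute array named group (hypothetical) alongside y_true and y_pred:

import numpy as np

y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
group = np.asarray(group)  # hypothetical protected attribute, e.g. 0/1

for g in np.unique(group):
    mask = group == g
    tpr = (y_pred[mask & (y_true == 1)] == 1).mean()  # true positive rate (equal opportunity)
    fpr = (y_pred[mask & (y_true == 0)] == 1).mean()  # false positive rate (equalized odds)
    rate = y_pred[mask].mean()                        # positive prediction rate (demographic parity)
    print(f"Group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}, positive rate={rate:.2f}")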
Interpretability: Making Models Understandable

In healthcare, a model's decision must be explainable. Clinicians need to trust, and sometimes justify, model outputs.

For tree-based models:

import matplotlib.pyplot as plt

importances = model.feature_importances_
plt.barh(feature_names, importances)
plt.show()
SHAP

SHAP provides local interpretability by attributing contributions to each feature.

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
LIME
LIME builds local surrogate models for explanation.
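A minimal sketch of the usual lime workflow for tabular data (assuming the fitted model, X_train, X_test, and feature_names from earlier; the class names and parameters are illustrative):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["healthy", "sick"],  # illustrative class names
    mode="classification",
)

# Explain a single prediction with a local surrogate model
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())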
Challenge | Ethical Risk | Example
Black-box models | Lack of transparency | Clinicians can't verify or trust decisions
Biased predictions | Exacerbating inequality | Systematic under-diagnosis of specific groups
Unclear thresholds | Unjust triage decisions | Denial of treatment based on arbitrary cutoffs
Overreliance on models | Deskilling of clinicians | Ignoring human judgment in edge cases
Summary Table

Concept | Tool / Method | Application
Bias detection | AIF360, Fairlearn | Auditing predictions
Interpretability | SHAP, LIME, Feature Importance | Explaining model decisions
Fair evaluation | Equal opportunity, parity | Ethical comparison of models
Clinical transparency | Explainable AI (XAI) | Building trust in model-based decisions
Metrics are not enough. The best evaluation frameworks combine mathematical rigor with ethical awareness. In healthcare, our models must not only be accurate; they must be fair, understandable, and accountable.

A good model respects the truth. A great model respects the people it's meant to help.

This is the end of the technical content, but the start of the real journey. Deploying models in medicine requires ongoing evaluation, human oversight, and humility.

As a final section of the book, we recommend:

• A curated list of further readings
• Links to datasets and open-source tools
• Code notebooks to explore real-world use cases