Essential Guide to Model Evaluation Metrics

Measure what matters: master evaluation metrics to make better clinical predictions.

A detailed and practical guide to model evaluation metrics, with a focus on healthcare and the real-world impact of prediction errors.

In data science, building a predictive model is only half the battle; the other half lies in measuring how well that model performs. A model is only as useful as the reliability of the decisions it supports. Especially in critical applications like medicine, where models may inform diagnoses or treatments, understanding evaluation metrics is not just technical diligence; it's a moral responsibility.

A predictive model that classifies whether patients are sick or healthy must do more than just "get it right" most of the time. It must minimize harmful mistakes: false positives, where healthy patients are mistakenly labeled as sick, and false negatives, where sick patients are declared healthy, a potentially life-threatening error. Hence, evaluating models with appropriate metrics is essential to ensure their real-world utility and safety.

Before diving into specific metrics, we need to understand how we represent a model's predictions against reality. This is where the confusion matrix becomes an indispensable tool.

Let's consider a simplified yet realistic example. Imagine we have developed a machine learning model to diagnose a rare disease based on blood test results. The dataset contains 1,000 patient records. Of these, 50 patients actually have the disease (positive cases), and 950 do not (negative cases).

Now suppose our model predicts everyone is healthy; it simply says "no disease" for every patient. In this case, it would be 95% accurate, since it correctly labeled 950 out of 1,000 patients. But it failed to detect any of the 50 truly sick patients. That's not just a poor model: it's a dangerous one.
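To make that arithmetic concrete, here is a minimal, illustrative sketch (the labels below are synthetic stand-ins for the 1,000-patient example, not a real dataset):

from sklearn.metrics import accuracy_score, recall_score

# 1,000 synthetic patients: 50 sick (1), 950 healthy (0)
y_true = [1] * 50 + [0] * 950
y_pred = [0] * 1000  # a "model" that always predicts "healthy"

print(accuracy_score(y_true, y_pred))  # 0.95: looks impressive
print(recall_score(y_true, y_pred))    # 0.0: every sick patient is missed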
Accuracy alone, as we'll see, is often misleading in imbalanced datasets. That's why we need other metrics to evaluate model performance more carefully. And our journey begins with the confusion matrix.

The confusion matrix is a compact yet powerful tool that summarizes a classifier's predictions. It tells us not only how many predictions were correct, but what kinds of mistakes the model made. Here's a standard 2×2 confusion matrix for binary classification:
• True Positives (TP): The model correctly predicts a patient has the disease.
• False Positives (FP): The model incorrectly predicts disease in a healthy patient.
• False Negatives (FN): The model fails to identify a patient who actually has the disease.
• True Negatives (TN): The model correctly predicts a patient is healthy.

This structure is the foundation for most classification metrics. It allows us to compute quantities like precision, recall, specificity, and F1 score, all of which we'll explore in later chapters.

Python's scikit-learn makes it easy to compute and visualize confusion matrices:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]  # predicted labels

cm = confusion_matrix(y_true, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix Example")
plt.show()
Figure 1: Confusion matrix generated by the provided code.

This snippet computes and plots the confusion matrix from actual vs. predicted values. In medical scenarios, where decision impact is critical, visual tools like this can help data scientists communicate performance more effectively to clinicians and stakeholders.

In this chapter, we've introduced why evaluation matters, explored a medical example, and unpacked the logic and structure of the confusion matrix. But the matrix is only the beginning; the numbers it contains can be transformed into various metrics that reveal different aspects of model behavior.

In the next chapter, we'll dive into these basic classification metrics, starting with accuracy but quickly moving beyond it to uncover why precision, recall, and F1 score are more informative in medical settings. We'll connect each metric to its clinical relevance, and show how you can compute them easily in Python.

Now that we understand the confusion matrix, we can derive from it a set of metrics that quantify a model's performance from different perspectives. Each metric answers a slightly different question and is more or less relevant depending on the clinical context.

In healthcare, we often care more about detecting sick patients (sensitivity) or avoiding unnecessary treatments (specificity) than we do about overall correctness (accuracy). In this chapter, we'll explore:
• What each metric means
• How to calculate it
• When and why it's important
• How to compute it in Python
• How to visualize it geometrically
Accuracy
Definition: The proportion of correct predictions (both positive and negative) over all predictions.

Usefulness: Accuracy can be misleading in imbalanced datasets. If 95% of patients are healthy, a model predicting "healthy" for everyone will achieve 95% accuracy but fail to detect any sick patients.
Python:
from sklearn.metrics import accuracy_score

accuracy_score(y_true, y_pred)
Sensitivity (Recall)

Definition: The proportion of actual positives (sick patients) that were correctly identified.

$$\text{Sensitivity} = \frac{TP}{TP + FN}$$

Also known as recall, this metric answers: "Of all the sick patients, how many did the model catch?"

Clinical relevance: Missing a diagnosis (a false negative) can be dangerous. High sensitivity is crucial when the cost of missing a disease is high.
Python:
from sklearn.metrics import recall_score

recall_score(y_true, y_pred)
Specificity

Definition: The proportion of actual negatives (healthy patients) that were correctly identified.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

This metric answers: "Of all the healthy patients, how many were correctly identified as healthy?"

Clinical relevance: A low specificity means many healthy people are falsely diagnosed, possibly undergoing unnecessary stress or treatments.

Note: scikit-learn does not include specificity by default, but you can calculate it manually.
cm = confusion_matrix(y_true, y_pred)
TN, FP, FN, TP = cm.ravel()
specificity = TN / (TN + FP)
Precision
Definition: The proportion of predicted positives that were actually positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Usefulness: Precision tells us how much we can trust positive predictions. In medical contexts where treatment is expensive or risky, precision matters.
Python:
from sklearn.metrics import precision_score

precision_score(y_true, y_pred)
F1 Score

Definition: The harmonic mean of precision and recall. It balances the trade-off between both metrics.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Usefulness: The F1 score is especially useful in imbalanced scenarios where both false positives and false negatives are costly.
Python:
from sklearn.metrics import f1_score

f1_score(y_true, y_pred)
Metric | Formula | What it Measures
Accuracy | (TP + TN) / (TP + TN + FP + FN) | Overall correctness
Sensitivity | TP / (TP + FN) | Ability to detect sick patients
Specificity | TN / (TN + FP) | Ability to avoid false alarms in healthy patients
Precision | TP / (TP + FP) | Trustworthiness of a positive prediction
F1 Score | 2 · (Precision · Recall) / (Precision + Recall) | Harmonic mean of precision and recall
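To tie the table together, here is a minimal sketch that computes each of these values from the toy labels used in the confusion matrix example (illustrative values only, not a clinical dataset):

from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             f1_score, confusion_matrix)

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Sensitivity:", recall_score(y_true, y_pred))
print("Specificity:", TN / (TN + FP))
print("Precision:  ", precision_score(y_true, y_pred))
print("F1 score:   ", f1_score(y_true, y_pred))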
Now that we have the fundamental metrics covered, we are ready to dive deeper into evaluation through curves and thresholds. In the next chapter, we will explore ROC curves, AUC, and Precision-Recall curves: tools that show how performance changes as the decision threshold varies, which is particularly important in clinical risk prediction.

So far, we've evaluated models with fixed thresholds, classifying predictions as positive or negative based on a decision boundary (e.g., 0.5). However, many models output probabilities rather than hard labels. In these cases, we can adjust the threshold to explore different trade-offs between sensitivity and specificity, or between precision and recall.

This is where ROC curves and Precision-Recall curves become powerful tools. They allow us to visualize a model's performance across all thresholds, helping us make better decisions depending on clinical priorities.

The Receiver Operating Characteristic (ROC) curve plots the true positive rate (sensitivity) against the false positive rate at various threshold settings.

• X-axis: False Positive Rate = FP / (FP + TN)
• Y-axis: True Positive Rate = TP / (TP + FN)

Each point on the ROC curve corresponds to a different threshold. The closer the curve is to the top-left corner, the better the model is at distinguishing between classes.

The Area Under the Curve (AUC) provides a single value summarizing the entire ROC curve. It represents the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
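This ranking interpretation can be checked numerically; a small, illustrative sketch with synthetic labels and scores (not part of the book's example):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = rng.integers(0, 2, size=500)           # synthetic labels
scores = y + rng.normal(0, 1, size=500)    # synthetic scores, higher on average for positives

# Fraction of (positive, negative) pairs in which the positive scores higher
pos, neg = scores[y == 1], scores[y == 0]
pairwise = (pos[:, None] > neg[None, :]).mean()

print(pairwise)                  # should (nearly) match...
print(roc_auc_score(y, scores))  # ...the AUC reported by scikit-learn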
AUC Value | Interpretation | Explanation
1.0 | Perfect model | Indicates a model that can classify all samples perfectly without any errors.
0.9 | Excellent | Represents a model with high discriminatory power, very close to perfect classification.
0.8 | Very good | Signifies a model with strong discrimination ability, though not as high as 0.9.
0.7 | Reasonable | Denotes a model with some discriminatory capability; room for improvement may exist.
0.6 | Relatively poor | Indicates a model struggling to distinguish between positive and negative samples.
0.5 | Random guessing | Represents a model that performs no better than random chance in classification tasks.
In a medical setting, AUC helps answer: "How well can the model rank sick patients above healthy ones?"

However, AUC-ROC can be misleading in imbalanced datasets. For example, in rare diseases where negatives dominate, a high AUC may hide poor performance on positives.

Python Example
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt

y_scores = model.predict_proba(X_test)[:, 1]  # probabilities for the positive class
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
roc_auc = auc(fpr, tpr)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
The Precision-Recall (PR) curve plots precision against recall at various threshold settings:

• X-axis: Recall
• Y-axis: Precision

It is especially informative for imbalanced datasets, where ROC curves might present an overly optimistic view.
The area under the PR curve (AUC-PR) provides a summary of the model's ability to maintain precision as recall increases. It's typically lower than AUC-ROC but more reflective of real performance in rare-event settings like disease detection.

A high precision at high recall means the model can detect most sick patients without flooding the clinic with false alarms. This is a crucial balance when deploying screening tools.

Python Example
from sklearn.metrics import precision_recall_curve, average_precision_score

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
ap = average_precision_score(y_true, y_scores)

plt.plot(recall, precision, label=f"PR curve (AP = {ap:.2f})")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve")
plt.legend()
plt.show()
The choice of threshold impacts:

• Sensitivity: A higher threshold yields fewer false positives, but also more false negatives.
• Precision: A lower threshold yields more predicted positives, but potentially lower precision.

In medicine, this trade-off must align with clinical goals:

• Early screening: Prefer high sensitivity (low threshold).
• Confirmatory diagnosis: Prefer high precision (high threshold).

The scikit-learn documentation has interactive notebooks that show how metrics change with thresholds. You can also use plotly or ipywidgets to build threshold sliders.
# Optional: Interactive threshold tuning
from ipywidgets import interact
from sklearn.metrics import accuracy_score, recall_score, precision_score

def update(threshold=0.5):
    y_pred_thresholded = (y_scores >= threshold).astype(int)
    acc = accuracy_score(y_true, y_pred_thresholded)
    recall = recall_score(y_true, y_pred_thresholded)
    precision = precision_score(y_true, y_pred_thresholded)
    print(f"Threshold: {threshold:.2f} - Accuracy: {acc:.2f}, Recall: {recall:.2f}, Precision: {precision:.2f}")

interact(update, threshold=(0.0, 1.0, 0.05));
The following table provides a concise overview of two commonly used evaluation curves in machine learning: the ROC (Receiver Operating Characteristic) curve and the Precision-Recall curve. These curves offer insights into the performance of classification models by examining different aspects of their predictions.

Curve Type | Axes | Best For | AUC Meaning
ROC Curve | TPR vs. FPR | General model discrimination | Probability a positive is ranked above a negative
Precision-Recall | Precision vs. Recall | Imbalanced classes, rare disease detection | Average precision across thresholds
ROC and PR curves provide a dynamic view of model performance across thresholds. They help clinicians and data scientists choose models and thresholds that align with real-world needs.

In the next chapter, we'll go beyond standard metrics to explore less common but powerful ones, such as Balanced Accuracy, Matthews Correlation Coefficient, and Cohen's Kappa, which are particularly useful in cases where traditional metrics fall short.

Basic metrics like accuracy and F1 score are often sufficient to evaluate models, but not always. In situations with class imbalance, multiple raters, or a need for more nuanced comparisons, advanced metrics provide deeper insights. In healthcare, where subtle mistakes can have serious consequences, these metrics help us evaluate models more rigorously and fairly.

In this chapter, we will cover:
• Balanced Accuracy – adjusts for class imbalance.
• Matthews Correlation Coefficient (MCC) – a balanced metric even for skewed classes.
• Cohen's Kappa – agreement beyond chance.
• Youden's Index – useful for threshold optimization in diagnostics.
• Lift and Gain Charts – for evaluating probabilistic models in decision-making pipelines.
Balanced Accuracy

Definition: The mean of sensitivity and specificity.

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2}$$

It is especially useful for imbalanced datasets, as it gives equal weight to the performance on each class.

Imagine a rare disease affecting only 5% of patients. A model could achieve high accuracy by simply predicting everyone as healthy, but balanced accuracy penalizes such behavior, revealing poor sensitivity.

Python Example

from sklearn.metrics import balanced_accuracy_score

balanced_accuracy_score(y_true, y_pred)
Matthews Correlation Coefficient (MCC)

Definition

MCC is a correlation coefficient between observed and predicted classifications:

$$MCC = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$

It returns a value between -1 (total disagreement) and +1 (perfect prediction), with 0 indicating random prediction.
Clinical Relevance

Unlike F1 or accuracy, MCC is invariant to class distribution. It is ideal when class imbalance is severe and false positives and false negatives are equally important.
Python Example
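A minimal sketch using scikit-learn's matthews_corrcoef (assuming y_true and y_pred as in the earlier examples):

from sklearn.metrics import matthews_corrcoef

matthews_corrcoef(y_true, y_pred)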
Cohen's Kappa

Definition

Cohen's Kappa measures inter-rater agreement, adjusting for the agreement expected by chance:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

Where:
• $p_o$ is the observed agreement.
• $p_e$ is the expected agreement by chance.

Clinical Relevance

Kappa is useful when comparing predictions from two sources (e.g., model vs. doctor), highlighting whether agreement is meaningful or just random.
Python Example

from sklearn.metrics import cohen_kappa_score

cohen_kappa_score(y_true, y_pred)
Youden's Index

Definition

Youden's Index summarizes the performance of a diagnostic test:

$$J = \text{Sensitivity} + \text{Specificity} - 1$$

This index ranges from 0 (worthless test) to 1 (perfect test). It's often used to determine the optimal threshold for binary classifiers.

In medicine, choosing the threshold that maximizes Youden's Index ensures the best trade-off between detecting disease and avoiding false alarms.
Python Example (manual):

import numpy as np
from sklearn.metrics import roc_curve

def youden_index(y_true, y_scores):
    fpr, tpr, thresholds = roc_curve(y_true, y_scores)
    J = tpr - fpr
    idx = np.argmax(J)
    return thresholds[idx], J[idx]

optimal_thresh, index_val = youden_index(y_true, y_scores)
Lift and gain charts show how much better a model performs compared to random selection when we rank patients by predicted probability and then select the top X%.

• Gain chart: Shows the cumulative % of true positives captured as we increase the % of the population screened.
• Lift chart: Shows the ratio of detected positives to what would be expected by random selection.

These charts are useful when deploying models to prioritize screenings or allocate limited diagnostic resources. For example, testing the top 10% of patients by risk might identify 60% of true positives, a lift of 6x over random screening.
import scikitplot as skplt
import matplotlib.pyplot as plt

skplt.metrics.plot_cumulative_gain(y_true, model.predict_proba(X_test))
plt.title("Cumulative Gain Chart")
plt.show()

skplt.metrics.plot_lift_curve(y_true, model.predict_proba(X_test))
plt.title("Lift Curve")
plt.show()
Metric | What it Measures | Best Use Case
Balanced Accuracy | Mean of sensitivity and specificity | Imbalanced datasets
MCC | Overall classification quality | All cases, especially with imbalance
Cohen's Kappa | Agreement beyond chance | Comparing model vs. expert predictions
Youden's Index | Optimal diagnostic threshold | Choosing clinical thresholds
Lift / Gain Charts | Screening performance at ranked thresholds | Prioritizing patient testing or interventions
Advanced metrics give us more power to evaluate models fairly and usefully, especially in challenging clinical conditions. Some are mathematically complex, but they offer a richer understanding when accuracy and F1 are not enough.

Next, we will explore how metrics change in real-world conditions: what happens when data is unbalanced, noisy, or when predictions are probabilistic instead of binary. We'll also discuss calibration, an often overlooked but essential concept in healthcare risk prediction.

So far, we've focused on defining and calculating performance metrics. But understanding their clinical implications is just as important, and often more challenging. A great metric in one context can be dangerously misleading in another.

In this chapter, we explore:
• The real-world meaning of false positives and false negatives
• How to quantify the cost of errors
• How to choose the optimal threshold based on clinical needs
                | Predicted Positive | Predicted Negative
Actual Positive | True Positive (TP) | False Negative (FN)
Actual Negative | False Positive (FP) | True Negative (TN)
False Negative (FN):
A sick patient is classified as healthy.

Clinical implications:

• Disease goes undetected.
• No treatment is given.
• Disease may progress unnoticed.
• Higher risk of complications or death.

Example: A patient with early-stage cancer is classified as healthy -> no follow-up scan or biopsy -> late-stage detection.
False Positive (FP):
A healthy patient is classified as sick.

Clinical implications:

• Patient may undergo unnecessary testing or treatment.
• Psychological stress and anxiety.
• Possible side effects from unnecessary interventions.
• Resource waste in health systems.

Example: A patient is falsely diagnosed with heart disease -> sent for expensive and invasive tests -> mental burden and costs.
The impact of errors varies depending on the disease, the healthcare system, and the availability of treatments. For some conditions, missing a diagnosis (FN) is much worse than overdiagnosis (FP). For others, it's the reverse.

Disease Context | FP Cost | FN Cost
Cancer screening | Psychological stress, testing | Missed early treatment opportunity
Infectious diseases | Quarantine, stigma | Further spread of infection
Genetic disorders | Counseling, lifestyle changes | Loss of preventive measures
COVID triage (2020) | ICU bed occupied needlessly | Missed emergency support
Most models output probabilities. By default, we often use a 0.5 threshold to classify patients, but this is arbitrary. The best threshold depends on the clinical context and the relative cost of mistakes.
import numpy as np
from sklearn.metrics import precision_score, recall_score

for t in np.arange(0.1, 1.0, 0.1):  # candidate thresholds (illustrative range)
    y_pred = (y_scores >= t).astype(int)
    p = precision_score(y_true, y_pred)
    r = recall_score(y_true, y_pred)
    print(f"Threshold: {t:.2f} | Precision: {p:.2f}, Recall: {r:.2f}")
This helps us visualize the trade-off: as the threshold increases, precision rises but recall falls, and vice versa.

Choose the threshold that maximizes:

$$\text{Sensitivity} + \text{Specificity} - 1$$

Use weighted losses for false positives vs. false negatives. For example:
import numpy as np
from sklearn.metrics import confusion_matrix

def cost_sensitive_score(y_true, y_scores, fp_cost=1, fn_cost=5):
    thresholds = np.arange(0.0, 1.01, 0.01)
    best_thresh = 0
    min_cost = float("inf")
    for t in thresholds:
        y_pred = (y_scores >= t).astype(int)
        cm = confusion_matrix(y_true, y_pred, labels=[0, 1])  # fix labels so ravel() always yields 4 values
        TN, FP, FN, TP = cm.ravel()
        cost = FP * fp_cost + FN * fn_cost
        if cost < min_cost:
            min_cost = cost
            best_thresh = t
    return best_thresh, min_cost
A more formal approach comes from decision theory (often used in diagnostic test evaluation).
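One widely used decision-theoretic quantity in diagnostic test evaluation is the net benefit from decision curve analysis; a minimal, illustrative sketch (assuming y_true and y_scores as above, with p_t a hypothetical threshold probability reflecting the harm-benefit trade-off):

import numpy as np
from sklearn.metrics import confusion_matrix

def net_benefit(y_true, y_scores, p_t):
    # Net benefit = TP/N - FP/N * (p_t / (1 - p_t)), evaluated at threshold p_t
    y_pred = (np.asarray(y_scores) >= p_t).astype(int)
    TN, FP, FN, TP = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    n = len(y_true)
    return TP / n - FP / n * (p_t / (1 - p_t))

# Compare net benefit across a few candidate threshold probabilities
for p_t in (0.1, 0.2, 0.3):
    print(p_t, net_benefit(y_true, y_scores, p_t))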
Suppose you have a model that predicts the risk of type 2 diabetes. You can tune your threshold based on:

• If your goal is early intervention, prioritize high sensitivity (lower threshold).
• If you have limited capacity for follow-up testing, prioritize high precision (higher threshold).

Figure suggestion: a slider-based plot that updates the confusion matrix and cost scores as the threshold is changed (see the previous chapter's interactive widget).

Factor | Low Threshold | High Threshold
False Positives | More | Fewer
Best For | Screening, early detection | Confirmatory diagnosis
Metrics alone are meaningless without context. In medicine, we must always ask: What happens to a real person if the model is wrong?

Choosing the right threshold is an ethical and practical decision. In the next chapter, we'll explore how to deal with imbalanced datasets, where traditional metrics fail and advanced techniques are required to train and evaluate reliable models.

In medical applications, we often encounter class imbalance: diseases are typically rare, meaning that the majority of patients in our dataset are healthy. This creates a serious challenge for model evaluation.

A model that simply predicts "healthy" for everyone can appear deceptively good if evaluated with the wrong metric. In this chapter, we will explore:

• Why accuracy can be misleading
• How to handle imbalance through resampling techniques
• Which metrics remain robust under imbalance
Let's say only 5% of patients in a dataset have a disease. That means 95% are healthy.

If a model always predicts "healthy", it will be:

$$\text{Accuracy} = \frac{950}{1000} = 95\%$$

That sounds great, but it's useless. The model misses every sick patient. In this context, accuracy rewards the majority class, regardless of clinical utility.
Here are metrics that provide meaningful insights even when one class is rare:

• Sensitivity (Recall): Measures the ability to find sick patients.
• Precision: Measures how many predicted positives are actually correct.
• F1 Score: Harmonic mean of precision and recall.
• Balanced Accuracy: Averages performance on both classes.
• MCC (Matthews Correlation Coefficient): Invariant to class size.
• AUC-PR: More informative than AUC-ROC in skewed datasets.

In rare-disease prediction, always report metrics beyond accuracy.
from sklearn.metrics import classification_report
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_classes=2, weights=[0.95, 0.05], flip_y=0,
                           n_features=20, n_informative=3, random_state=42)

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

print(classification_report(y, y_pred, digits=3))
Undersampling

Randomly remove examples from the majority class to balance the dataset.

from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X, y)

Pros:
• Simple and fast
• Preserves the minority class

Cons:
• May discard useful data
Oversampling (SMOTE)

Duplicate or synthetically generate minority class examples.

from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_res, y_res = smote.fit_resample(X, y)

Pros:
• No data loss
• Helps the model learn patterns in rare classes

Cons:
• Risk of overfitting if oversampled naively
Class Weights

Many models (like logistic regression, SVMs, and decision trees) allow you to penalize errors on the minority class more heavily.

model = LogisticRegression(class_weight='balanced')
model.fit(X, y)

This approach does not change the dataset, only how the model interprets errors.
You can visualize how different strategies affect model performance:

from sklearn.metrics import roc_auc_score

model.fit(X, y)
print("Baseline AUC-ROC:", roc_auc_score(y, model.predict_proba(X)[:, 1]))

model_resampled = LogisticRegression()
model_resampled.fit(X_res, y_res)
print("Resampled AUC-ROC:", roc_auc_score(y, model_resampled.predict_proba(X)[:, 1]))
Also useful: precision-recall curves, which respond more clearly to rare-class improvement.
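As a sketch of that comparison (reusing model, model_resampled, X, and y from the snippets above; illustrative, not a benchmark):

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

for name, clf in [("Baseline", model), ("Resampled", model_resampled)]:
    probs = clf.predict_proba(X)[:, 1]
    precision, recall, _ = precision_recall_curve(y, probs)
    ap = average_precision_score(y, probs)
    plt.plot(recall, precision, label=f"{name} (AP = {ap:.2f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall: baseline vs. resampled")
plt.legend()
plt.show()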
Method | Goal | Risks
MCC | Robust to imbalance | Harder to explain to clinicians
Imagine you're building a model to detect a rare genetic disorder with a prevalence of 1%.

• A naive model will simply predict "healthy" for everyone.
• A well-designed model will leverage SMOTE or adjust class weights to detect the minority.
• Metrics like F1 score and AUC-PR will expose the difference in quality.

You can adapt a precision-recall curve on imbalanced vs. resampled datasets to show how oversampling improves recall at low thresholds. Try using:

• scikit-plot for quick plots.
• plotly for interactive curves.

In the real world, data is rarely clean or balanced. Understanding how imbalance distorts evaluation, and what to do about it, is key to deploying reliable models, especially in healthcare.

In the next chapter, we'll look at probabilistic predictions and model calibration, making sure that predicted probabilities actually reflect real-world risk, an essential step for clinical decision-making.
Most modern classifiers (logistic regression, random forests, gradient boosting, neural networks) don't just give a binary prediction. They produce probabilities. This allows for more nuanced decision-making: in clinical settings, it helps doctors assess risk, prioritize cases, and allocate resources.

But this introduces a new challenge: how do we evaluate the quality of probabilities?

In this chapter, we explore:
• Brier Score: for measuring the accuracy of probabilistic forecasts
• Log Loss: penalizing confidence in incorrect predictions
• Calibration: ensuring that predicted probabilities match observed frequencies
• Reliability curves: to visualize calibration

The Brier Score is the mean squared error between the predicted probability and the true class label:

$$\text{Brier Score} = \frac{1}{N}\sum_{i=1}^{N}(p_i - y_i)^2$$

Where:
• $p_i$ is the predicted probability of the positive class
• $y_i \in \{0, 1\}$ is the actual label

Values range from 0 (perfect prediction) to 1 (worst).
It penalizes miscalibration and misplaced confidence. A model that is wrong yet overly confident (predicting 0.99 when the outcome is 0) will be penalized far more than a cautious model predicting 0.7.
from sklearn.metrics import brier_score_loss

brier_score_loss(y_true, y_prob)
Log Loss (Cross-Entropy Loss)

Definition

Log loss (a.k.a. binary cross-entropy) measures the negative log-likelihood of the true label given the predicted probability:

$$\text{Log Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log(p_i) + (1 - y_i)\log(1 - p_i)\right]$$

This metric strongly penalizes incorrect high-confidence predictions.

Clinical Relevance

Log loss is more sensitive than the Brier Score. It rewards well-calibrated, confident predictions and severely punishes wrong, overconfident ones.
from sklearn.metrics import log_loss

log_loss(y_true, y_prob)

Always ensure y_prob contains probabilities, not class labels.
Calibration: Do Probabilities Reflect Reality?

Calibration refers to how well the predicted probabilities match actual outcomes.

• A well-calibrated model that outputs 0.8 for 100 patients should be correct about 80 times.
• A poorly calibrated model may output extreme probabilities (0.99 or 0.01) without justification.

Poor calibration can lead to over-treatment or under-treatment in medical decision-making.

Reliability Curves

These curves plot predicted probability vs. observed frequency of positives. A perfectly calibrated model lies on the diagonal.
from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt

prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(prob_pred, prob_true, marker='o', label='Model')
plt.plot([0, 1], [0, 1], linestyle='--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.show()
Some models (like logistic regression) are naturally well-calibrated. Others (e.g., random forests, gradient boosting) may need post-processing:

Platt Scaling

Fits a logistic regression on the model outputs.
Isotonic Regression

Fits a non-parametric monotonic function for flexible calibration.
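Both approaches are available in scikit-learn via CalibratedClassifierCV; a minimal sketch (base_model, X_train, y_train, and X_test are placeholders assumed for illustration):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

base_model = RandomForestClassifier(random_state=42)  # hypothetical uncalibrated model

# method='sigmoid' corresponds to Platt scaling; method='isotonic' to isotonic regression
calibrated = CalibratedClassifierCV(base_model, method='sigmoid', cv=5)
calibrated.fit(X_train, y_train)  # X_train, y_train assumed to exist
y_prob = calibrated.predict_proba(X_test)[:, 1]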
Situation | Importance of Calibration
Binary decision only | Low
Triage / prioritization | High
Communicating risk to clinicians | Very high
Integrating model into EHR systems | Extremely high

Metric | Measures | Best For
Log Loss | Likelihood penalization | Model optimization, high sensitivity
Calibration Curve | Visual check of calibration | Clinical risk models
Calibrated Models | Probability alignment | Safety-critical applications
In medicine, probability isn't just a number; it's a signal to act. Miscalibrated probabilities can lead to overconfidence or misdiagnosis.

In the final chapter, we'll cover explainability, fairness, and ethical use of metrics and models. A model that performs well but hides its logic or embeds bias can do more harm than good, even with perfect AUC.

A model can be accurate, well-calibrated, and statistically robust, and still be unethical or harmful.

In healthcare, machine learning is not just a technical exercise; it's a social one. Predictive models can amplify health disparities, exclude vulnerable groups, or make decisions no human can understand.

In this final chapter, we examine:

• What it means for a model to be fair
• How bias can emerge, even from well-intentioned data
• How to evaluate models ethically and transparently
• Tools and techniques to improve interpretability
Bias in Medical Models

Bias can appear in several forms:

Type of Bias | Example
Representation | Minorities underrepresented in training data
Measurement | Data collection varies across demographics
Historical | Models trained on data that reflect systemic inequities
Labeling | Labels reflect subjective or biased clinician decisions
Recent studies revealed that pulse oximeters may be less accurate in individuals with darker skin tones, a form of measurement bias that, if replicated in predictive models, could exacerbate inequalities in triage and treatment.

Several metrics help assess fairness:
Demographic Parity

The model's output is independent of a protected attribute (e.g., race, sex):

$$P(\hat{Y} = 1 \mid A = a) = P(\hat{Y} = 1 \mid A = b)$$

Equal Opportunity

Equal true positive rates across groups:

$$P(\hat{Y} = 1 \mid Y = 1, A = a) = P(\hat{Y} = 1 \mid Y = 1, A = b)$$

Equalized Odds

Equal false positive and true positive rates across groups:

$$P(\hat{Y} = 1 \mid Y = y, A = a) = P(\hat{Y} = 1 \mid Y = y, A = b), \quad y \in \{0, 1\}$$
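These group-wise rates can be checked directly from predictions; a minimal sketch, assuming a binary protected attribute array named group (hypothetical) alongside y_true and y_pred:

import numpy as np

y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
group = np.asarray(group)  # hypothetical protected attribute, e.g. 0/1

for g in np.unique(group):
    mask = group == g
    tpr = (y_pred[mask & (y_true == 1)] == 1).mean()  # true positive rate (equal opportunity)
    fpr = (y_pred[mask & (y_true == 0)] == 1).mean()  # false positive rate (equalized odds)
    rate = y_pred[mask].mean()                        # positive prediction rate (demographic parity)
    print(f"Group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}, positive rate={rate:.2f}")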
Interpretability: Making Models Understandable

In healthcare, a model's decision must be explainable. Clinicians need to trust, and sometimes justify, model outputs.

For tree-based models:

import matplotlib.pyplot as plt

importances = model.feature_importances_
plt.barh(feature_names, importances)
plt.show()
SHAP

SHAP provides local interpretability by attributing contributions to each feature.

import shap

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)
LIME
LIME builds local surrogate models for explanation.
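A minimal sketch of the usual lime workflow for tabular data (assuming the fitted model, X_train, X_test, and feature_names from earlier; the class names and parameters are illustrative):

from lime.lime_tabular import LimeTabularExplainer

explainer = LimeTabularExplainer(
    X_train,
    feature_names=feature_names,
    class_names=["healthy", "sick"],  # illustrative class names
    mode="classification",
)

# Explain a single prediction with a local surrogate model
explanation = explainer.explain_instance(X_test[0], model.predict_proba, num_features=5)
print(explanation.as_list())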
Challenge | Ethical Risk | Example
Black-box models | Lack of transparency | Clinicians can't verify or trust decisions
Biased predictions | Exacerbating inequality | Systematic under-diagnosis of specific groups
Unclear thresholds | Unjust triage decisions | Denial of treatment based on arbitrary cutoffs
Overreliance on models | Deskilling of clinicians | Ignoring human judgment in edge cases
Summary Table

Concept | Tool / Method | Application
Bias detection | AIF360, Fairlearn | Auditing predictions
Interpretability | SHAP, LIME, Feature Importance | Explaining model decisions
Fair evaluation | Equal opportunity, parity | Ethical comparison of models
Clinical transparency | Explainable AI (XAI) | Building trust in model-based decisions
Metrics are not enough. The best evaluation frameworks combine mathematical rigor with ethical awareness. In healthcare, our models must not only be accurate; they must be fair, understandable, and accountable.

A good model respects the truth. A great model respects the people it's meant to help.

This is the end of the technical content, but the start of the real journey. Deploying models in medicine requires ongoing evaluation, human oversight, and humility.

As a final section of the book, we recommend:

• A curated list of further readings
• Links to datasets and open-source tools
• Code notebooks to explore real-world use cases