Essential Guide to Clean Data

Clean your data, clarify your insights: practical techniques for reliable analysis.

A hands-on guide to cleaning and validating data with Python for reliable analysis.

Ibon Martínez-Arranz

Introduction: Why Clean Data Matters

Before we even think about training a model, generating a plot, or calculating a correlation, there’s one thing that every data scientist, analyst, or researcher needs to do: clean the data.

Cleaning data might not be the most glamorous part of data science, but it is undoubtedly one of the most important. It’s the foundation on which everything else is built. No matter how advanced your algorithms are or how elegant your visualizations look, if your data is full of errors, inconsistencies, and missing values, your results will be misleading at best and completely wrong at worst.

What Do We Mean by “Clean Data”?

“Clean data” is a bit of a vague term, but broadly speaking, it refers to data that:

• Has consistent formatting and structure
• Uses the correct data types (e.g., dates are stored as dates, numbers as numbers)
• Has no missing or malformed values, or at least has had its missing data handled appropriately
• Doesn’t contain obvious errors, duplicates, or outliers (unless justified)
• Matches across sources if needed (e.g., foreign keys between tables)
• Is well-documented and reproducible

Clean data isn’t perfect: it’s realistic, usable, and trustworthy. In short, clean data is fit for purpose.

Dirty Data in the Real World

Unfortunately, most real-world datasets are messy. Very messy. Columns might be misnamed, data might be missing without explanation, units might be inconsistent, and errors can sneak in during data entry, export, or transformation. In clinical datasets, it’s common to see blood pressure measured in mmHg mixed with values that look suspiciously like they came from a different scale, or are simply typos.

This messiness isn’t just annoying; it’s dangerous. As noted by Karr et al. (2006), poor data quality can invalidate scientific results, introduce bias, and waste enormous amounts of time and resources. Especially in healthcare, misleading conclusions from dirty data can have real-world consequences.

The Hidden Cost of Not Cleaning

Skipping the cleaning process, or doing it half-heartedly, has cascading effects:

• Garbage in, garbage out: Bad input leads to bad output. Your model may appear to perform well, but it’s learning from flawed assumptions.
• Reproducibility issues: Without clear cleaning steps, it becomes hard to reproduce results.
• Loss of trust: Colleagues and stakeholders lose confidence in your analysis if inconsistencies pop up.
• Compliance risks: In regulated environments like clinical trials, dirty data violates standards such as GCP (Good Clinical Practice) and ALCOA+ principles.

Data Cleaning and ALCOA+

ALCOA+ stands for:

• Attributable: It should be clear who recorded the data.
• Legible: The data should be readable and permanent.
• Contemporaneous: Recorded at the time of the event.
• Original: The data must be the source or a certified copy.
• Accurate: Free from error.

The “+” includes: Complete, Consistent, Enduring, and Available. Data cleaning isn’t just about technical correctness; it’s about meeting these expectations too.

This Book’s Approach

This book is hands-on. We’re going to walk through real problems and clean them using Python, mostly with pandas, but also with other useful tools. Each chapter tackles a typical problem:

• What to do with missing data?
• How do I handle weird date formats?
• Are outliers always wrong?
• How do I track what changes I made to the data?

We’ll keep things practical, reproducible, and honest. You’ll get code examples, visualizations, and plenty of tips based on real-world experience.

References

• Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137–173. https://doi.org/10.1016/j.stamet.2005.08.005
• EMA (2010). Reflection paper on expectations for electronic source data and data transcribed to electronic data collection tools in clinical trials. European Medicines Agency. https://www.ema.europa.eu
• FDA (2018). Data Integrity and Compliance With Drug CGMP: Questions and Answers. https://www.fda.gov/media/119267/download

Loading and Inspecting Raw Data

Once you get your hands on a dataset, the first instinct is often to run a model or generate a plot. But before doing anything fancy, it’s crucial to slow down and look at the data you’ve got. Not the idea of the data. The actual contents.

Load it right

In Python, we commonly use pandas.read_csv() to load tabular data. But things can get tricky fast. Maybe the file uses semicolons instead of commas. Maybe there’s an encoding problem. Maybe the headers are actually in row 3. Loading data is often the first sign that something isn’t quite right.

    import pandas as pd

    # Default load
    df = pd.read_csv("data.csv")

    # Custom separator and encoding
    df = pd.read_csv("data.csv", sep=';', encoding='latin1')

Some useful arguments in read_csv():

• sep, delimiter: For files using ; or \t
• encoding: Try 'utf-8', 'latin1', or 'ISO-8859-1' if you see encoding errors
• skiprows: If the header isn’t in the first line
• na_values: To specify what strings should be treated as missing
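To see these arguments working together, here is a minimal sketch. The file contents are invented for illustration (two junk lines, a semicolon-separated table, and placeholder strings), and the dtype argument is an addition to the list above: it keeps IDs as text so that leading zeros survive.

```python
import io

import pandas as pd

# Hypothetical file contents: two junk lines, then a semicolon-separated table.
raw = io.StringIO(
    "Exported by LabSystem\n"
    "Generated 2024-01-01\n"
    "patient_id;weight;sex\n"
    "00123;72.5;F\n"
    "00456;n/a;M\n"
    "00789;-;F\n"
)

df = pd.read_csv(
    raw,
    sep=";",                    # semicolon-delimited file
    skiprows=2,                 # the header is on the third line
    na_values=["n/a", "-"],     # extra strings to treat as missing
    dtype={"patient_id": str},  # keep IDs as text so leading zeros survive
)

print(df)
print(df.dtypes)
```

With a real file you would pass the path instead of the StringIO object; the rest of the call stays the same.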

First visual inspection

Once loaded, start simple:

    df.head()
    df.tail()
    df.sample(5)

Then go deeper:

    df.info()
    df.describe()
    df.columns
    df.dtypes

These commands tell you:

• How many rows and columns there are
• If any columns are missing values
• What data types are being used
• If numerical columns have suspicious max/min values

Common surprises

At this stage, you often spot things like:

• IDs read as float instead of string (e.g., 00123 becomes 123.0)
• Columns with mixed types (e.g., numbers and text)
• Dates stored as object instead of datetime
• Missing values coded as “n/a”, “NA”, or “-”

These are all signals that the dataset needs attention before any analysis can begin.

Save a copy

Always save the raw version and keep your cleaning steps separate. This is good scientific practice, but also aligns with GCP and ALCOA+ principles: traceability, reproducibility, and transparency.

    df.to_csv("cleaned/step0_raw_copy.csv", index=False)

This way, you can always go back to the original if something goes wrong later.

Summary

• Always inspect the data before doing anything else.
• Be careful with loading parameters: files can be messy.
• Print and scan the first few rows, check data types and summaries.
• Save a copy of the raw data before you begin cleaning.

We haven’t modified anything yet, but now we know what we’re dealing with. Let’s get our hands dirty in the next chapter.

Handling Missing Values

Missing values are everywhere. In clinical data, you might have patients who skipped visits. In survey data, some respondents might refuse to answer certain questions. In IoT devices, a sensor might temporarily go offline. Whatever the source, missing data is a reality, and handling it well is a critical skill.

Why missingness matters

The way we handle missing data can have a big impact on the conclusions we draw. Drop too much, and you may lose signal or bias your sample. Impute the wrong way, and you risk creating false confidence. There’s no universal solution, but there are strategies that work well in different situations.

Let’s say you’re building a model to predict blood glucose levels. If half your rows are missing BMI values, dropping those rows could eliminate important patterns. On the other hand, filling them all with the average might mask real variability. Every decision here influences your final model.

Types of missing data

Statisticians typically distinguish three types:

• MCAR (Missing Completely At Random): There’s no pattern in the missingness. This is ideal (and rare).
• MAR (Missing At Random): The missingness depends on observed variables.
• MNAR (Missing Not At Random): The missingness depends on unobserved variables.

For example:

• MCAR: A lab machine randomly failed to record values.
• MAR: Older patients are less likely to answer a digital survey.
• MNAR: Patients with very high blood pressure skipped follow-up visits.

Understanding this classification helps you decide what methods are safe to use.
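One quick check for the MAR case is to compare the missing rate across groups of an observed variable. The sketch below uses made-up data (the column names age and survey_score are assumptions) and mirrors the survey example: if the missing rate differs a lot between age groups, the data is probably not MCAR. Note that a check like this can hint at MAR but can never rule out MNAR, because the unobserved cause is, by definition, not in the table.

```python
import pandas as pd

# Hypothetical survey data: older patients answer less often.
df = pd.DataFrame({
    "age": [25, 31, 38, 44, 52, 67, 71, 75, 80, 84],
    "survey_score": [7, 8, None, 6, 9, None, None, 5, None, None],
})

# Flag missing answers and bucket patients by an observed variable (age).
df["score_missing"] = df["survey_score"].isna()
df["age_group"] = pd.cut(df["age"], bins=[0, 60, 120], labels=["under 60", "60+"])

# Share of missing answers per age group.
missing_rate = df.groupby("age_group", observed=True)["score_missing"].mean()
print(missing_rate)
```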

Detecting missing values in pandas

    df.isna().sum()
    df[df['column_name'].isna()]

To get a quick overview:

    missing_percent = (df.isna().mean() * 100).round(2)
    print(missing_percent.sort_values(ascending=False))

You can also visualize missingness with the missingno library:

    import missingno as msno

    msno.matrix(df)
    msno.heatmap(df)

This shows which rows and columns have gaps and whether patterns exist. Heatmaps are great for spotting if two variables tend to be missing together.

Strategies to handle missing values

Drop them (but be careful)

    df_clean = df.dropna()

This works if only a few rows are affected and the loss of data won’t bias the outcome. But if 30% of your dataset disappears, maybe it’s time to pause.

You can also drop only rows missing specific columns:

    df = df.dropna(subset=['age', 'sex'])

Or drop columns with too many missing values:

    df = df.dropna(axis=1, thresh=int(len(df) * 0.5))  # keep columns with at least 50% non-null values

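Impute them (simple methods)

    # Assign the result back; calling fillna(inplace=True) on a selected column is unreliable in recent pandas
    df['age'] = df['age'].fillna(df['age'].median())
    df['sex'] = df['sex'].fillna('Unknown')

Be cautious with mean imputation: it can distort distributions, especially with skewed data. Median is often a safer bet.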
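A tiny worked example makes the difference concrete. The numbers below are invented (a hypothetical length-of-stay column with one very long stay): the mean is dragged toward the extreme value, while the median stays close to the bulk of the data.

```python
import pandas as pd

# Hypothetical skewed values: most stays are short, one is very long.
stay_days = pd.Series([2, 3, 3, 4, 5, 60, None, None], name="stay_days")

mean_value = stay_days.mean()      # pulled upward by the extreme value
median_value = stay_days.median()  # stays near the bulk of the data

filled_mean = stay_days.fillna(mean_value)
filled_median = stay_days.fillna(median_value)

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
```

Filling with the mean here would claim that the two unknown patients stayed almost 13 days, longer than five of the six patients we actually observed.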

Impute them (fancy methods)

You can use KNN imputation, regression models, or multiple imputation packages. One example with sklearn:

    from sklearn.impute import KNNImputer

    imputer = KNNImputer(n_neighbors=5)
    df[['age', 'bmi']] = imputer.fit_transform(df[['age', 'bmi']])

Or with IterativeImputer (multivariate regression):

    from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before the next import
    from sklearn.impute import IterativeImputer

    imp = IterativeImputer(random_state=42)
    df_imputed = imp.fit_transform(df.select_dtypes(include='number'))

These techniques try to “guess” missing values using other variables, which is powerful, but make sure to cross-validate your models to avoid overfitting to imputed data.

Flag them

Another trick is to add an indicator column:

    df['age_missing'] = df['age'].isna()

This way, your model can learn from the fact that data was missing, which might itself be informative. For example, patients who skipped lab tests might differ systematically from those who didn’t.

Real-world example: BMI in a clinical dataset

Suppose you’re analyzing data from a metabolic clinic and notice 15% of the patients are missing BMI. You:

1. Create a histogram of BMI to check its distribution.
2. Decide to fill missing values with the median.
3. Add a column bmi_missing = True/False.
4. Log this step and save an intermediate file.

    bmi_median = df['bmi'].median()
    df['bmi_missing'] = df['bmi'].isna()   # flag first, before the gaps are filled
    df['bmi'] = df['bmi'].fillna(bmi_median)
    df.to_csv("cleaned/step1_bmi_imputed.csv", index=False)

Now you’ve preserved the information, filled the blanks, and left a trail others can follow.

Documentation is key

Don’t just fill in blanks: write down what you did and why. This step matters for ALCOA+ and for reproducibility. Also, store your imputations in separate scripts or notebook cells so they can be reviewed. If someone else picks up your project, they should be able to trace every change.

Summary

• Missing data is normal, but must be handled thoughtfully.
• Understand the type of missingness before choosing a method.
• Use pandas or missingno to detect and explore missing values.
• Decide whether to drop, impute, or flag, and document your choices.
• Be cautious with automated imputations: always think before you fill.

Handling missing values well is a small step that can make a huge difference later. Your future self, and anyone reviewing your analysis, will thank you for being careful here.

Fixing Data Types and Formats

Getting data into the right format isn’t glamorous, but it’s essential. If your dates are stored as strings, your categories are treated as free text, and your numbers come in as objects, you’re going to hit a wall. Models won’t train. Visualizations will fail. Merges will break.

Correcting data types is one of those silent but powerful steps that makes everything else work.

Why it matters

Every column in a dataset has a type: integer, float, string, datetime, boolean, categorical... but when data is loaded from CSVs, Excel files, or databases, pandas has to guess what type each column is. And it often guesses wrong.

For example:

• A column of object type might actually hold dates, numbers, or a categorical variable.
• IDs like 00123 might be turned into 123.0, losing leading zeros.
• True/False values might show up as strings: “Yes”, “no”, “TRUE”, “0”.

The earlier you detect and fix these issues, the smoother everything else becomes.

Checking types in pandas

    df.dtypes

To get more detail:

    df.info()  # prints its report directly and returns None, so no print() is needed

You’ll see which columns are object, float64, int64, bool, datetime64, and so on.

Fixing numeric columns

Sometimes numbers are stored as strings:

    df['weight'] = pd.to_numeric(df['weight'], errors='coerce')

The errors='coerce' option turns non-numeric strings (like “n/a”) into NaN, which you can deal with later.

If some values use a comma as the decimal separator and others use a dot:

    df['height'] = df['height'].str.replace(',', '.', regex=False).astype(float)

Dates are one of the most common troublemakers. They often come in weird formats.

    df['visit_date'] = pd.to_datetime(df['visit_date'], errors='coerce')

You can specify the format if needed:

    df['visit_date'] = pd.to_datetime(df['visit_date'], format='%d/%m/%Y')

Check for NaT (missing datetimes) and make sure all values were converted correctly.
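One way to run that check, sketched here on invented values, is to compare the raw column with the parsed one: anything that was present before the conversion but is NaT afterwards failed to parse and deserves a look.

```python
import pandas as pd

# Hypothetical raw column with a mix of valid, invalid, and missing dates.
raw_dates = pd.Series(["12/03/2024", "31/12/2023", "2024-13-45", None, "not a date"])

parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y", errors="coerce")

# Values that were present in the raw column but became NaT failed to parse.
failed = raw_dates.notna() & parsed.isna()

print(f"{failed.sum()} value(s) could not be parsed:")
print(raw_dates[failed])
```

Keeping the raw column around until this check passes means that nothing is silently turned into a missing value.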

To extract parts of a date:

    df['year'] = df['visit_date'].dt.year
    df['month'] = df['visit_date'].dt.month
    df['weekday'] = df['visit_date'].dt.day_name()

Handling categorical data

Free-text categories are a pain. They lead to duplicates and inconsistencies:

    print(df['sex'].unique())

You might see: ['Male', 'male', 'M', 'F', 'Female', 'f', '']

First, clean and unify:

    df['sex'] = df['sex'].str.strip().str.lower()
    df['sex'] = df['sex'].replace({'male': 'M', 'female': 'F', 'm': 'M', 'f': 'F'})

Then convert to category:

    df['sex'] = df['sex'].astype('category')

This saves memory and tells pandas that it’s not just a string: it’s a label with a finite set of values.

Boolean conversions

    df['is_smoker'] = df['is_smoker'].map({'yes': True, 'no': False})

Or more safely:

    # Anything not in the list, including missing values, becomes False
    df['is_smoker'] = df['is_smoker'].str.lower().map(lambda x: x in ['yes', 'true', '1'])
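One caveat with the lambda version: every value it does not recognise, including a missing one, ends up as False, which quietly turns “we don’t know” into “no”. If that distinction matters, a possible alternative is sketched below. The helper name to_bool and the lists of accepted strings are assumptions for illustration; the result uses pandas’ nullable boolean dtype so that unknown values stay missing.

```python
import pandas as pd

raw = pd.Series(["Yes", "no", " TRUE", "0", "maybe", None])

TRUTHY = {"yes", "true", "1"}
FALSY = {"no", "false", "0"}

def to_bool(value):
    """Return True/False for recognised strings and pd.NA for everything else."""
    if pd.isna(value):
        return pd.NA
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return pd.NA  # unrecognised text stays missing instead of becoming False

# The nullable "boolean" dtype can hold True, False, and <NA> side by side.
is_smoker = raw.map(to_bool).astype("boolean")
print(is_smoker)
```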

Dealing with unit conversions

Another hidden source of trouble is mismatched units. Imagine weight in kilograms and pounds mixed in the same column.

    # Assuming weights over 250 are probably in pounds
    mask = df['weight'] > 250
    df.loc[mask, 'weight'] = df.loc[mask, 'weight'] * 0.453592

Add a flag to track which rows were changed:

    df['weight_converted'] = mask

Always document when and why you did this. Unit mismatches can break even well-trained models.

Case study: fixing a lab dataset

You receive a dataset with the following issues:

• Dates stored as text in European format
• Blood pressure as strings: “120/80”
• BMI as a mix of floats and “n/a”

Step-by-step:

    # Convert dates
    df['visit_date'] = pd.to_datetime(df['visit_date'], dayfirst=True, errors='coerce')

    # Split blood pressure
    df[['sbp', 'dbp']] = df['blood_pressure'].str.split('/', expand=True)
    df['sbp'] = pd.to_numeric(df['sbp'], errors='coerce')
    df['dbp'] = pd.to_numeric(df['dbp'], errors='coerce')

    # Handle BMI
    df['bmi'] = pd.to_numeric(df['bmi'], errors='coerce')

In less than 10 lines, you’ve fixed three major format problems and prepared the dataset for modeling or exploration.

Summary

• Always check your data types as soon as you load the data.
• Convert text to numeric, datetime, category, or boolean as needed.
• Watch for leading zeros, inconsistent formats, and strange values.
• Use .astype(), pd.to_numeric(), and pd.to_datetime() to take control.
• Don’t forget unit mismatches: they can be subtle but serious.

When your data types are correct, everything else becomes easier. Clean types = clean analysis.

Standardizing and Cleaning Text Data

Messy text is one of the most underestimated data problems. Unlike numbers, text is inherently flexible and ambiguous. A single column might contain values like “Male”, “male”, “MALE”, “M”, or even just “m”, and they all mean the same thing. Unless we clean and standardize that text, our models and summaries won’t understand it.

Text data cleaning is about consistency. We’re not doing NLP here; we’re just getting labels and free-form strings into shape so we can group, filter, or model them effectively.

Why text cleaning matters

Let’s say you’re analyzing clinical trial data and want to group patients by diagnosis. If the column contains:

Common problems in text data

• Inconsistent capitalization
• Trailing or leading whitespace
• Typos and abbreviations
• Accent marks and special characters
• Mixing formats (e.g., “kg” vs “kilograms”)
• Empty strings or placeholder text (“n/a”, “–”)

Basic string cleaning in pandas

    df['diagnosis'] = df['diagnosis'].str.strip()
    df['diagnosis'] = df['diagnosis'].str.lower()
    df['diagnosis'] = df['diagnosis'].str.replace('-', '', regex=False)

Use .str methods to handle most common issues:

• str.strip(): remove whitespace
• str.lower(): standardize case
• str.replace(): fix formatting and symbols
• str.title(): useful for names
• str.contains(): find patterns

Replacing variants and typos

If you know the most common variants:

    mapping = {
        't2dm': 'type 2 diabetes',
        'type ii diabetes': 'type 2 diabetes',
        'diabetes type 2': 'type 2 diabetes',
    }
    df['diagnosis'] = df['diagnosis'].replace(mapping)

You can also use regex for pattern-based replacements:

    df['code'] = df['code'].str.replace(r'[^A-Z0-9]', '', regex=True)

This removes every character that is not an uppercase letter or a digit, which is often useful when cleaning IDs.

Dealing with missing or meaningless text

Sometimes cells contain strings like “unknown”, “n/a”, or just “-”. You can treat them as missing:

    df['notes'] = df['notes'].replace(['n/a', 'na', '-', 'unknown'], pd.NA)

Once standardized, you can handle them as proper missing values.
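The individual steps so far (strip, lower-case, replace placeholders) can be bundled into one reusable helper. The sketch below is one possible way to do it; the function name clean_text and the placeholder list are assumptions, and you would adapt both to your own data.

```python
import pandas as pd

# Assumed list of strings that carry no information.
PLACEHOLDERS = ["", "n/a", "na", "-", "unknown"]

def clean_text(series: pd.Series) -> pd.Series:
    """Trim, lower-case, collapse repeated whitespace, and null out placeholders."""
    cleaned = (
        series.astype("string")
        .str.strip()
        .str.lower()
        .str.replace(r"\s+", " ", regex=True)
    )
    # Replace placeholder strings with a proper missing value.
    return cleaned.mask(cleaned.isin(PLACEHOLDERS), pd.NA)

notes = pd.Series(["  Type 2   Diabetes ", "N/A", "-", None, "hypertension"])
result = clean_text(notes)
print(result)
```

Because the placeholders are compared after lower-casing, “N/A” and “n/a” are treated the same way.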

Categorical cleanup example

You receive a column of patient-reported smoking status:

    def clean_smoking(value):
        val = str(value).strip().lower()
        if val in ['yes', 'y', '1', 'true']:
            return 'Yes'
        elif val in ['no', 'n', '0', 'false']:
            return 'No'
        else:
            return 'Unknown'

    df['smoker'] = df['smoker'].apply(clean_smoking)
    df['smoker'] = df['smoker'].astype('category')

Now you can group or analyze smoking status confidently.

Text may include invisible characters or different encodings (especially in multilingual data). Normalize accents using unicodedata:

    import unicodedata

    def remove_accents(text):
        if isinstance(text, str):
            # Decompose accented characters, then drop the combining marks
            return ''.join(
                c for c in unicodedata.normalize('NFKD', text)
                if not unicodedata.combining(c)
            )
        return text  # leave non-string values (e.g., NaN) untouched

Summary

• Text fields are full of inconsistencies: don’t trust them blindly.
• Use .str methods and mappings to clean and unify values.
• Replace placeholder text like “n/a” with proper nulls.
• Normalize accents and symbols if needed.
• Always convert cleaned strings to categorical when possible.

This kind of cleaning isn’t flashy, but it’s the difference between chaos and clarity.

Detecting and Handling Duplicates

Duplicate records are like cockroaches: if you see one, there are probably more. They sneak in through system exports, user errors, or merging datasets. Sometimes duplicates are exact copies. Other times they differ by a timestamp, a typo, or an extra space.

Not all duplicates are bad. You might expect repeated lab results, or multiple visits per patient. But when duplicates are unintended, they can skew statistics, inflate sample sizes, or confuse downstream processing.

What is a duplicate?

In pandas, a duplicate row is one where all column values match another row. But you can define duplicates based on a subset of columns if needed.
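The two definitions can give quite different answers, as this small sketch on invented data shows. Passing keep=False to duplicated() marks every member of a duplicate group (not just the repeats), which is handy when you want to look at the rows side by side before deciding anything.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 2, 3, 3],
    "visit_date": ["2024-01-05", "2024-01-05", "2024-01-07", "2024-02-01", "2024-02-01"],
    "glucose": [95, 95, 110, 102, 140],
})

# Exact duplicates: every column matches another row.
exact = df[df.duplicated(keep=False)]

# Key-based duplicates: same patient and date, other columns may differ.
by_key = df[df.duplicated(subset=["patient_id", "visit_date"], keep=False)]

print(len(exact), "rows in exact duplicate groups")
print(len(by_key), "rows sharing patient_id and visit_date")
```

Here patient 3 has two rows for the same date with different glucose values: not an exact duplicate, but certainly something to investigate.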

Visual inspection

Start by sorting:

    df.sort_values(by=['patient_id', 'visit_date']).head(10)

Sometimes duplicates are not identical but nearly so:

• Same patient ID and date, but different lab values
• Same name and birthdate, but different phone number

This is where your domain knowledge comes in.

Handling duplicates safely

1. Drop exact duplicates

    df = df.drop_duplicates()

You can also keep the last occurrence instead of the first:

    df = df.drop_duplicates(keep='last')

2. Drop by key columns

    df = df.drop_duplicates(subset=['patient_id', 'visit_date'])

Be careful: this may drop rows that are valid but repeated.

3. Aggregate near-duplicates

Sometimes it’s better to collapse rows:

    df = df.groupby(['patient_id', 'visit_date']).agg({
        'glucose': 'mean',
        'bmi': 'first'
    }).reset_index()

You choose which fields to average, count, or take the first value from.

4. Use fuzzy matching (advanced)

When names or text fields are close but not identical, you can use fuzzywuzzy or recordlinkage. Example:

    from fuzzywuzzy import fuzz

    fuzz.ratio("John Smith", "Jon Smith")  # similarity score from 0 to 100; high for near-identical strings

This helps detect records that look like duplicates even if they’re not exact matches.
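If you would rather not install an extra package, the same idea can be tried with Python’s standard library. The sketch below swaps in difflib.SequenceMatcher, which is a different similarity measure from the one in fuzzywuzzy, and compares every pair of names. The names and the 0.9 threshold are assumptions for illustration, and comparing all pairs only scales to small lists.

```python
from difflib import SequenceMatcher
from itertools import combinations

names = ["John Smith", "Jon Smith", "Maria Garcia", "Mary Jones"]

def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity between 0.0 (nothing shared) and 1.0 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.9  # assumed cut-off; tune it on your own data

# Compare every pair once and keep the ones that look suspiciously alike.
candidates = [
    (a, b, round(similarity(a, b), 3))
    for a, b in combinations(names, 2)
    if similarity(a, b) >= THRESHOLD
]

for a, b, score in candidates:
    print(f"Possible duplicate: {a!r} vs {b!r} (similarity {score})")
```

Treat the output as a list of candidates for human review, not as a list of rows to delete.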

Logging duplicate removal

Always log how many rows were removed and why:

    before = len(df)
    df = df.drop_duplicates()
    after = len(df)
    print(f"Removed {before - after} duplicate rows")

If working under ALCOA+, document the criteria used and keep an original backup.

Case study: clinical visit log

You receive an export of visits from a hospital system. Some patients have 2 or 3 identical entries for the same date. After talking to the team, you learn it’s a system glitch that repeated the same record.

    duplicates = df[df.duplicated(subset=['patient_id', 'visit_date'])]
    print("Duplicates found:", len(duplicates))

    # Drop safely
    df = df.drop_duplicates(subset=['patient_id', 'visit_date'])

You document the issue, clean the dataset, and save a clean copy:

    df.to_csv("cleaned/step2_duplicates_removed.csv", index=False)

Summary

• Duplicates can distort analysis if left unchecked.
• Use duplicated() and drop_duplicates() to detect and clean.
• Decide if duplicates are true errors or expected repetitions.
• Consider grouping and aggregating near-duplicates.
• Always log what you removed and why.

Handling duplicates is about being precise, careful, and curious. They’re not just technical artifacts: they’re clues to how your data was created.

Outlier Detection and Correction

Outliers are values that look suspiciously far from the rest. Maybe they’re errors. Maybe they’re just rare events. Either way, they can have a big impact, especially on statistics like mean and standard deviation, or on models sensitive to scale.

But here’s the tricky part: not all outliers are wrong. In medicine, a BMI of 60 is unusual, but it could be correct. In finance, a sudden spike in sales might reflect a real event. So before deleting or correcting anything, we need to understand why an outlier exists.

What is an outlier?

There’s no single definition, but here are a few signs:

• A value far outside the normal range
• A point more than 1.5 × the interquartile range (IQR) below the first quartile or above the third quartile
• A value more than 3 standard deviations from the mean
• A point that breaks a business or clinical rule (e.g., heart rate > 300)

Visualizing outliers

Boxplots

    import seaborn as sns

    sns.boxplot(x=df['bmi'])

Boxplots are great for seeing the spread of the data and identifying points outside the whiskers.

Histograms

    df['glucose'].hist(bins=50)

Helpful for spotting spikes, long tails, or gaps.

Scatter plots

Use when outliers depend on context:

    sns.scatterplot(x='age', y='cholesterol', data=df)

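Z-scores

For numerical columns:

    from scipy.stats import zscore

    # nan_policy='omit' ignores missing values when computing the mean and standard deviation
    # and keeps the output the same length as df, so the mask lines up with the rows
    z = zscore(df['bmi'], nan_policy='omit')
    outliers = df.loc[(z < -3) | (z > 3)]

IQR method

    Q1 = df['bmi'].quantile(0.25)
    Q3 = df['bmi'].quantile(0.75)
    IQR = Q3 - Q1

    outliers = df[(df['bmi'] < Q1 - 1.5 * IQR) | (df['bmi'] > Q3 + 1.5 * IQR)]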
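Since the same IQR rule comes up for column after column, it can help to wrap it in a small function. The sketch below is one way to do that; the name iqr_outlier_mask and the sample values are invented, and k=1.5 is the conventional multiplier rather than a universal truth.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged.
    return (values < lower) | (values > upper)

bmi = pd.Series([21.0, 23.5, 24.0, 25.5, 27.0, 29.0, 31.0, 62.0, None])
mask = iqr_outlier_mask(bmi)
print(bmi[mask])
```

For these values Q1 is 23.875 and Q3 is 29.5, so the fences sit at 15.4375 and 37.9375, and only the BMI of 62 is flagged. Whether that value is an error or a real patient is still a question for someone who knows the data.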

What to do with outliers

1. Investigate

• Is it a data entry error?
• Does it break known constraints?
• Could it be a valid extreme case?

2. Correct if clearly wrong

    # Example: age of 250 is likely a typo
    df.loc[df['age'] > 120, 'age'] = pd.NA

Or set to null to review later:

    mask = df['glucose'] > 1000
    df.loc[mask, 'glucose'] = pd.NA

3. Cap or clip

Limit values to a reasonable range:

    df['bmi'] = df['bmi'].clip(lower=10, upper=60)

This preserves the row but limits the influence of the extreme value.

4. Flag them

Mark outliers for later review:

    Q1 = df['bmi'].quantile(0.25)
    Q3 = df['bmi'].quantile(0.75)
    IQR = Q3 - Q1

    df['bmi_outlier'] = (df['bmi'] < Q1 - 1.5 * IQR) | (df['bmi'] > Q3 + 1.5 * IQR)

5. Remove them (last resort)

Only when justified:

    df = df[~df['bmi_outlier']]

Don’t delete rows just because they make your plot prettier.

Real-world example: metabolic data

You’re analyzing fasting glucose levels. Most values are between 70 and 120 mg/dL. But a few values are over 1000, more than ten times the typical range.

Step-by-step:

1. Plot a histogram.
2. Use IQR to define outliers.
3. Set outliers to NaN.
4. Add an outlier flag.
5. Document the change.

    Q1 = df['glucose'].quantile(0.25)
    Q3 = df['glucose'].quantile(0.75)
    IQR = Q3 - Q1

    mask = (df['glucose'] < Q1 - 1.5 * IQR) | (df['glucose'] > Q3 + 1.5 * IQR)
    df['glucose_outlier'] = mask
    df.loc[mask, 'glucose'] = pd.NA

Summary

• Outliers can be real or errors: investigate before acting.
• Use boxplots, histograms, z-scores, or IQR to detect them.
• Consider capping, flagging, or removing, but document all steps.
• Never assume that a strange value is wrong without context.

Outliers are warning lights. They tell you where to look more closely and, sometimes, where the real story is hiding.

Resolving Inconsistencies Across Columns

Some problems aren’t obvious by looking at individual columns: they live in the relationships between columns. Maybe a patient has a discharge date that comes before their admission. Or a BMI that doesn’t match their height and weight. Or someone marked as deceased but with follow-up lab results.

These kinds of cross-field inconsistencies are common in real datasets. Fixing them is a bit like detective work: you need to compare, calculate, and ask “does this make sense?”

Common types of column inconsistencies

• Date logic errors: end date before start date, birthdate in the future
• Redundant fields mismatch: BMI doesn’t match height and weight
• Flag contradictions: pregnant = True and sex = male
• Linked variables drift: weight increased by 20 kg but height stayed the same

    df['admit_before_discharge'] = df['admit_date'] <= df['discharge_date']
    invalid_dates = df[~df['admit_before_discharge']]

Flag invalid rows and review:

    print(invalid_dates[['patient_id', 'admit_date', 'discharge_date']])

Set them to NaT if needed:

    df.loc[~df['admit_before_discharge'], 'discharge_date'] = pd.NaT

Recalculating from raw data

If you have height and weight, you can recompute BMI:

    df['height_m'] = df['height_cm'] / 100
    df['bmi_calc'] = df['weight_kg'] / df['height_m'] ** 2

Compare with recorded BMI:

    bmi_diff = (df['bmi'] - df['bmi_calc']).abs()
    df['bmi_mismatch'] = bmi_diff > 1.5

Now you can decide whether to trust the calculated or recorded value, or keep both.

Contradictory flags

Example: someone marked pregnant = True and sex = male.

    mask = (df['pregnant'] == True) & (df['sex'] == 'M')
    df.loc[mask, ['pregnant', 'sex']]

If it’s a data entry mistake, you may want to:

• Set pregnant = False
• Set it to NaN
• Flag the row for review

Always log how many contradictions were found:

    print("Contradictions found:", mask.sum())

Category mismatch

In some cases, values don’t match expected category pairs:

• A “follow-up” record without an initial visit
• A “discharged” flag without an admission
• A “completed” task with a null timestamp

You can detect these with conditional checks:

    mask = (df['visit_type'] == 'follow-up') & df['initial_visit_id'].isna()
    df['visit_issue'] = mask

Value correlation

If two variables should be tightly correlated, check them:

    sns.scatterplot(x='total_cholesterol', y='ldl', data=df)

If LDL is higher than total cholesterol for multiple cases, something’s wrong.

You can also calculate correlation:

    corr = df[['total_cholesterol', 'ldl']].corr()
    print(corr)

Case study: verifying timestamps

You’re working with ICU data. Each row has start_time, end_time, and duration_minutes.

    # Recalculate duration
    df['duration_calc'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 60

    # Compare with recorded
    df['duration_diff'] = (df['duration_calc'] - df['duration_minutes']).abs()
    df['duration_mismatch'] = df['duration_diff'] > 5

Now you’ve spotted where duration was miscalculated, possibly from rounding errors or manual edits.

Summary

• Some errors only appear when comparing multiple columns.
• Use logic and simple math to detect contradictions.
• Recalculate values from raw inputs when possible.
• Flag inconsistencies, correct when obvious, and always document.

These aren’t just bugs: they’re clues. Data that doesn’t make sense across columns tells you a story. Sometimes that story is about biology. Sometimes it’s about a typo.

Validating and Enforcing Constraints

Data validation is about making sure the values in your dataset make sense, not just in isolation, but according to defined rules. These rules might come from your domain knowledge, regulatory standards, or logical expectations. Think of it as teaching your dataset some basic manners.

If your analysis is a house, validation is the plumbing check before you move in.

Why validation matters

Anyone can make a mistake. A user might enter a temperature in Fahrenheit instead of Celsius. A timestamp might get truncated. A value might be typed as “999” instead of “99”. These small errors can cascade into bigger problems if left unchecked.

By enforcing constraints early, you catch issues before they cause confusion downstream.

Types of validations

Range checks

    mask = (df['age'] >= 0) & (df['age'] <= 120)
    df['age_invalid'] = ~mask

Allowed values (categorical checks)

    valid_values = ['M', 'F']
    df['sex_valid'] = df['sex'].isin(valid_values)

Null constraints

Make sure required fields aren’t missing:

    df['admit_date_valid'] = df['admit_date'].notna()

Cross-field dependencies

    # Pregnancy flag only allowed if sex is female
    mask = (df['sex'] == 'F') | (df['pregnant'] != True)
    df['pregnancy_constraint_ok'] = mask

Pattern checks with regex

    # Email format
    pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'  # simple illustrative pattern, not a full email validator
    df['email_valid'] = df['email'].str.match(pattern)

Validating with pandera

pandera is a Python library to define schemas and validate pandas DataFrames. Example:

    import pandera as pa
    from pandera import Column, DataFrameSchema

    schema = DataFrameSchema({
        "age": Column(pa.Int, checks=pa.Check.in_range(0, 120)),
        "sex": Column(pa.String, checks=pa.Check.isin(["M", "F"])),
        "bmi": Column(pa.Float, nullable=True),
    })

    validated_df = schema.validate(df)

If something’s wrong, it raises a clear error.

Using pydantic for row-level validation

If your data comes row by row (e.g., from an API), pydantic is a great option.

    # pydantic v1-style API; in pydantic v2, validator is deprecated in favour of field_validator
    from pydantic import BaseModel, Field, validator

    class Patient(BaseModel):
        age: int = Field(..., ge=0, le=120)
        sex: str

        @validator("sex")
        def check_sex(cls, v):
            if v not in ["M", "F"]:
                raise ValueError("Invalid sex")
            return v

Referential integrity

If you’re working with multiple tables (e.g., patients and visits), you need to ensure foreign keys match.

    valid_ids = patients['patient_id']
    df['id_in_patients'] = df['patient_id'].isin(valid_ids)

If you find rows that reference a non-existent patient, that’s a red flag.
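Another way to surface those orphaned rows, sketched below on two invented tables, is a left merge with indicator=True. The extra _merge column says where each row found a match, and validate="many_to_one" makes pandas raise an error if patient_id is not unique in the patients table.

```python
import pandas as pd

patients = pd.DataFrame({"patient_id": [1, 2, 3]})
visits = pd.DataFrame({
    "visit_id": [10, 11, 12, 13],
    "patient_id": [1, 2, 4, 4],
})

# Left merge keeps every visit; _merge records whether a patient was found.
checked = visits.merge(
    patients,
    on="patient_id",
    how="left",
    indicator=True,
    validate="many_to_one",  # fail loudly if patient_id repeats in patients
)

orphans = checked[checked["_merge"] == "left_only"]
print(orphans[["visit_id", "patient_id"]])
```

The isin() check is shorter; the merge version pays off when you also want to pull patient columns into the visits table in the same step.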

Constraint enforcement with assertions

You can add assertions at the top of notebooks or scripts:

    assert df['age'].between(0, 120).all(), "Age out of range!"

This helps ensure things don’t break silently.

Case study: validating a clinical dataset

You receive a dataset with demographics, vital signs, and lab results. You want to ensure:

• Age is 0–120
• Sex is M or F
• All rows have a patient ID
• Glucose values are realistic

Step-by-step:

    assert df['age'].between(0, 120).all()
    assert df['sex'].isin(['M', 'F']).all()
    assert df['patient_id'].notna().all()
    assert df['glucose'].between(30, 800).all()

Alternatively, you could wrap these into a pandera schema.
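A chain of assertions stops at the first failure, so you only learn about one problem at a time. A third option, sketched here in plain pandas, is to collect the number of violations per rule in a single report. The function name validation_report, the rule names, and the sample data are all assumptions for illustration.

```python
import pandas as pd

def validation_report(df: pd.DataFrame, rules: dict) -> pd.Series:
    """Count the rows that violate each rule.

    `rules` maps a rule name to a function that takes the DataFrame and
    returns a boolean Series that is True where the row passes.
    """
    return pd.Series(
        {name: int((~rule(df)).sum()) for name, rule in rules.items()},
        name="violations",
    )

rules = {
    "age in 0-120": lambda d: d["age"].between(0, 120),
    "sex is M or F": lambda d: d["sex"].isin(["M", "F"]),
    "patient_id present": lambda d: d["patient_id"].notna(),
    "glucose in 30-800": lambda d: d["glucose"].between(30, 800),
}

df = pd.DataFrame({
    "patient_id": [1, 2, None, 4],
    "age": [34, 150, 61, 47],
    "sex": ["M", "F", "x", "F"],
    "glucose": [92.0, 1200.0, 88.0, None],
})

report = validation_report(df, rules)
print(report)
```

Note that between() returns False for missing values, so the missing glucose value counts as a violation here. If missing values are acceptable for a column, the rule needs to say so explicitly.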

Logging and documenting rules

Whenever you apply validations, document them. You can:

• Keep a constraints.md file describing each rule
• Store validation functions in a module (e.g. validate.py)
• Add logging to your pipeline

    import logging

    logging.info("Validated sex field: %d invalid entries", (~df['sex'].isin(['M', 'F'])).sum())

Summary

• Validation is the practice of enforcing logic and sanity checks on your data.
• Use Python logic, regex, or tools like pandera and pydantic.
• Catch issues early: don’t wait for your model to fail.
• Document all rules applied.

Clean data isn’t just about missing values or duplicates; it’s also about values that make sense. Validation is your last line of defense before analysis begins.

Logging, Versioning, and Audit Trails

You’ve cleaned your data. It looks great. But how will someone else know what you changed, and why? And more importantly: how will you remember what you did three weeks from now?

That’s where logging, versioning, and audit trails come in. These aren’t just “nice to have” in regulated environments: they’re essential for reproducibility, trust, and scientific integrity.

Why traceability matters

In clinical research, data must comply with ALCOA+ principles:

• Attributable: Who made the change?
• Legible: Can we read and understand it?
• Contemporaneous: Was it recorded at the time?
• Original: Do we have the raw version?
• Accurate: Is it correct?

The “+” adds: Complete, Consistent, Enduring, and Available.

These principles don’t just apply to the data: they apply to your code, decisions, and process.

Keeping raw and cleaned data separate

Never overwrite the original file. Always save cleaned versions with clear names:

    df.to_csv("cleaned/step3_validated.csv", index=False)

Include a README or changelog describing what changed:

    2025-05-08 | Removed outliers and fixed BMI mismatches. Stored in step3_validated.csv

You can even log the number of rows affected:

    removed = initial_len - len(df)
    print(f"Removed {removed} rows with invalid age")
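To back up the “never overwrite the original” rule with evidence, one option is to record a checksum of the raw file and confirm it is unchanged once cleaning is done. The sketch below uses only the standard library; the helper name sha256_of_file is an assumption, and a temporary file stands in for the real raw data.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo on a temporary file standing in for the raw dataset.
with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.csv"
    raw_path.write_text("patient_id,age\n1,34\n", encoding="utf-8")

    checksum_before = sha256_of_file(raw_path)
    # ... cleaning steps that should only read the raw file go here ...
    checksum_after = sha256_of_file(raw_path)

if checksum_before != checksum_after:
    raise RuntimeError("Raw file was modified during cleaning!")
print("Raw file unchanged:", checksum_before)
```

Writing the checksum into the changelog next to each entry gives a reviewer a simple way to confirm which raw file a cleaned dataset came from.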

Logging with the logging module

Instead of just printing, log messages properly:

    import logging

    logging.basicConfig(filename='logs/cleaning.log', level=logging.INFO)
    logging.info("Step 2: Replaced missing BMI values with median")

Log unexpected values too:

    bad_sex = df[~df['sex'].isin(['M', 'F'])]
    logging.warning(f"Found {len(bad_sex)} invalid sex entries")
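Writing a log line by hand after every step gets repetitive, and it is easy to forget one. A possible shortcut, sketched below, is a small decorator that logs the row count before and after each cleaning function. The names log_step and drop_invalid_age are invented for illustration, and the example logs to the console rather than to a file.

```python
import logging
from functools import wraps

import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("cleaning")

def log_step(func):
    """Log how many rows a cleaning step receives and how many it returns."""
    @wraps(func)
    def wrapper(df: pd.DataFrame, *args, **kwargs) -> pd.DataFrame:
        rows_before = len(df)
        result = func(df, *args, **kwargs)
        logger.info(
            "%s: %d -> %d rows (%d removed)",
            func.__name__, rows_before, len(result), rows_before - len(result),
        )
        return result
    return wrapper

@log_step
def drop_invalid_age(df: pd.DataFrame) -> pd.DataFrame:
    # Keep only rows with a plausible age.
    return df[df["age"].between(0, 120)]

df = pd.DataFrame({"age": [34, 250, 61, -1]})
cleaned = drop_invalid_age(df)
```

Every function decorated this way leaves a timestamped line in the log, which covers the “contemporaneous” part of ALCOA+ without extra effort.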

Versioning data with DVC

DVC is like Git for data. You can version .csv or .parquet files and track them with your code.

Initialize a DVC project and track the cleaned file:

dvc init
dvc add cleaned/step3_validated.csv
git add cleaned/step3_validated.csv.dvc cleaned/.gitignore
git commit -m "Track validated dataset with DVC"

This allows rollback, comparison, and reproducibility.
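The idea behind this kind of versioning is that a file is identified by a hash of its contents, so any change to the data is detectable. The sketch below shows that idea by hand. It is not DVC's implementation: the file_sha256 helper and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Toy file standing in for cleaned/step3_validated.csv.
data_file = Path(tempfile.mkdtemp()) / "step3_validated.csv"
data_file.write_text("patient_id,age\n1,34\n", encoding="utf-8")
before = file_sha256(data_file)

# Change a single value: the hash changes too.
data_file.write_text("patient_id,age\n1,35\n", encoding="utf-8")
after = file_sha256(data_file)

print(before != after)
```

If you are not using DVC, writing a hash like this into your changelog next to each saved file is a lightweight way to prove which version of the data an analysis used.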

Recording code versions

Use Git to track your notebooks and scripts. Commit often:

git commit -am "Cleaned text fields and recoded categories"

Tag important stages:

git tag -a v1.0 -m "Final cleaned dataset before modeling"

Jupyter notebooks and cell history

Jupyter keeps execution counts, but it’s easy to rerun cells in a different order and lose traceability. Use these tips:

• Run notebooks top to bottom

• Export to HTML or PDF after each milestone

• Use nbconvert or tools like papermill or jupytext

Reproducible scripts

Move cleaning steps into scripts:

python clean_data.py

This ensures anyone can re-run the pipeline, not just click through a notebook.

Include command-line logging:

print("Step 1 complete: dropped duplicates")

Or structured logging:

import logging

# Without this, the default WARNING level hides INFO messages.
logging.basicConfig(level=logging.INFO)
logging.info("Step 3 complete: imputed missing glucose")

Case study: versioned cleaning pipeline

You create a script cleaning_pipeline.py that:

• Loads raw data from data/raw.csv

• Cleans types, missing values, and outliers

• Saves to cleaned/step_final.csv

• Logs all actions

• Is tracked by Git and DVC

Now, anyone on your team (or a reviewer) can:

• Re-run the entire process

• See exactly what was changed

• Compare results before and after cleaning
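A script along these lines could be organized as in the sketch below. The clean and run function names, the age column, the 0 to 120 range, and the toy data are assumptions for illustration. A real pipeline would include more steps and would read data/raw.csv from the project folder rather than a temporary one.

```python
import logging
import tempfile
from pathlib import Path

import pandas as pd

logger = logging.getLogger("cleaning_pipeline")


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning steps in a fixed order and log each one."""
    n = len(df)
    df = df.drop_duplicates()
    logger.info("Dropped %d duplicate rows", n - len(df))

    # Non-numeric ages become NaN and are removed by the range check below.
    df = df.assign(age=pd.to_numeric(df["age"], errors="coerce"))
    n = len(df)
    df = df[df["age"].between(0, 120)]       # assumed plausibility range
    logger.info("Dropped %d rows with missing or implausible age", n - len(df))
    return df


def run(raw_path: Path, out_path: Path) -> pd.DataFrame:
    """Load the raw file, clean it, and save the result to a separate file."""
    df = pd.read_csv(raw_path)
    logger.info("Loaded %d rows from %s", len(df), raw_path)
    cleaned = clean(df)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(out_path, index=False)
    logger.info("Saved %d rows to %s", len(cleaned), out_path)
    return cleaned


# Demo in a throwaway folder that mirrors the project layout.
project = Path(tempfile.mkdtemp())
(project / "data").mkdir()
(project / "data" / "raw.csv").write_text(
    "patient_id,age\n1,34\n1,34\n2,abc\n3,150\n4,60\n", encoding="utf-8"
)
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
result = run(project / "data" / "raw.csv", project / "cleaned" / "step_final.csv")
```

Keeping clean free of file access makes it easy to test on small DataFrames, while run handles paths and saving.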

Summary

• Always keep raw and cleaned data separate

• Use logging to document every change

• Version data files with DVC or commit intermediate steps manually

• Track your code with Git and tag important milestones

• Package your cleaning steps into scripts for full reproducibility

Clean data is great. But clean processes are even better. That’s what makes your work solid, trusted, and ready for the real world.

Case Studies in Data Cleaning

Let’s wrap up with some real-world examples. These case studies combine many of the techniques we’ve explored so far, and show how they’re applied to messy, realistic data. Each one starts from a raw dataset and walks through the steps to make it usable.

Case 1: Cleaning Clinical EHR Data

You’re given an extract of electronic health records (EHR) with the following columns:

• patient_id

• sex

• birth_date

• visit_date

• height_cm, weight_kg, bmi

• glucose, blood_pressure, notes

Problems spotted:

• sex values: “M”, “F”, “male”, “FEMALE”, “unknown”

• birth_date stored as text, some in DD/MM/YYYY, some in MM-DD-YYYY

• bmi doesn’t match height and weight in many cases

• Notes contain typos and irregular whitespace

Cleaning steps:

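# sex_map (lowercased raw value -> 'M'/'F') and try_parse (date parser)
# are not shown here and are assumed to be defined earlier.
df['sex'] = df['sex'].str.strip().str.lower().replace(sex_map)
df['sex'] = df['sex'].where(df['sex'].isin(['M', 'F']), pd.NA)

df['birth_date'] = df['birth_date'].apply(try_parse)

# Recompute BMI from height and weight and flag large disagreements
height_m = df['height_cm'] / 100
df['bmi_calc'] = df['weight_kg'] / (height_m ** 2)
df['bmi_diff'] = (df['bmi'] - df['bmi_calc']).abs()
df['bmi_flag'] = df['bmi_diff'] > 1.5

# Collapse runs of whitespace into a single space
df['notes'] = df['notes'].str.strip().str.lower().str.replace(r'\s+', ' ', regex=True)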

Results:

• Cleaned categorical fields

• Consistent date format

• Mismatched BMI flagged for review

• Notes simplified for NLP or grouping
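The cleaning steps rely on two helpers, sex_map and try_parse, whose definitions are not shown. The sketch below gives one possible version of each. Both are hypothetical: the mapping keys, the list of date formats, and the rule that the separator decides the day/month order are assumptions, and a date such as 03/04/1980 remains ambiguous without knowing where the data came from.

```python
from datetime import datetime

import pandas as pd

# Hypothetical mapping: keys are the lowercased raw values.
sex_map = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Assumed formats, tried in order: "/" means day first, "-" means month first.
DATE_FORMATS = ("%d/%m/%Y", "%m-%d-%Y")


def try_parse(value):
    """Return a Timestamp for the first matching format, or NaT if none match."""
    if pd.isna(value):
        return pd.NaT
    for fmt in DATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(str(value).strip(), fmt))
        except ValueError:
            continue
    return pd.NaT


# Toy data covering the problems listed above.
df = pd.DataFrame({
    "sex": [" M", "female", "unknown", "FEMALE"],
    "birth_date": ["25/12/1980", "12-25-1980", "1980.12.25", None],
})
df["sex"] = df["sex"].str.strip().str.lower().replace(sex_map)
df["sex"] = df["sex"].where(df["sex"].isin(["M", "F"]), pd.NA)
df["birth_date"] = df["birth_date"].apply(try_parse)
```

Values that match nothing become missing rather than guessed, so they can be counted and reviewed later.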

Case 2: Public Health Indicators (WHO)

You’re working with a dataset of country-level health metrics. It includes:

• country, year

• life_expectancy, population, gdp_per_capita

• urban_pct, mortality_rate, vaccination_coverage

Problems spotted:

• Missing gdp_per_capita in some countries

• Duplicate rows for some (country, year) pairs

• urban_pct has values over 100

• vaccination_coverage includes “n/a”

Cleaning steps:

# Drop duplicates
df = df.drop_duplicates(subset=['country', 'year'])

# Convert to numeric
df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'], errors='coerce')
df['vaccination_coverage'] = pd.to_numeric(df['vaccination_coverage'], errors='coerce')

# Cap invalid values
df['urban_pct'] = df['urban_pct'].clip(upper=100)

# Impute GDP using country median
df['gdp_per_capita'] = df.groupby('country')['gdp_per_capita'].transform(
    lambda x: x.fillna(x.median())
)

Results:

• Cleaned dataset ready for time series analysis

• No impossible values (e.g., 120% urban population)

• Reproducible script with clear assumptions
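To keep the assumptions visible, it helps to count what each step changes before applying it. The sketch below repeats the capping and imputation steps on made-up data. The country names, values, and counter variable names are assumptions for illustration.

```python
import pandas as pd

# Toy data: one duplicated (country, year) pair, one value over 100,
# and a country ("B") with no usable GDP at all.
df = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B"],
    "year": [2000, 2001, 2001, 2000, 2001],
    "gdp_per_capita": ["1000", None, None, "n/a", None],
    "urban_pct": [55.0, 120.0, 120.0, 40.0, 41.0],
})

df = df.drop_duplicates(subset=["country", "year"])   # keeps the first row of each pair
df["gdp_per_capita"] = pd.to_numeric(df["gdp_per_capita"], errors="coerce")

n_capped = int((df["urban_pct"] > 100).sum())         # count before changing anything
df["urban_pct"] = df["urban_pct"].clip(upper=100)

n_missing_before = int(df["gdp_per_capita"].isna().sum())
df["gdp_per_capita"] = df.groupby("country")["gdp_per_capita"].transform(
    lambda x: x.fillna(x.median())
)
n_still_missing = int(df["gdp_per_capita"].isna().sum())

print(f"Capped {n_capped} urban_pct values; "
      f"imputed {n_missing_before - n_still_missing} GDP values; "
      f"{n_still_missing} still missing")
```

Note that a country with no GDP values at all has no median, so its rows stay missing after imputation. Counting them makes that gap explicit instead of silent.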

Case 3: LC-MS Metabolomics Data

You’re analyzing untargeted LC-MS metabolomics data with compound intensities across multiple samples:

• sample_id, compound_name, retention_time, area, flag

Issues found:

• Some compounds have duplicate peaks

• retention_time varies slightly between replicates

• Some flags are “bb”, “MM”, or missing

• Intensities vary by 10x between replicates

Outcome:

• Aggregated dataset with clean peaks per sample

• Median retention time per compound

• Flags documented and discarded safely
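The cleaning steps for this case are not shown, so the sketch below is only one possible route to the outcome listed above. The toy values and the rule for duplicate peaks (keep the largest area per sample and compound) are assumptions. The 10x spread between replicates is not addressed, since that would need a normalization choice that is not specified here.

```python
import pandas as pd

# Toy peak table (values are made up).
peaks = pd.DataFrame({
    "sample_id": ["S1", "S1", "S1", "S2", "S2"],
    "compound_name": ["glucose", "glucose", "lactate", "glucose", "lactate"],
    "retention_time": [5.02, 5.05, 2.10, 5.00, 2.14],
    "area": [1200.0, 300.0, 800.0, 1100.0, 750.0],
    "flag": ["bb", "MM", None, "bb", "bb"],
})

# Document the flags before discarding the column.
flag_counts = peaks["flag"].fillna("missing").value_counts().to_dict()

# Duplicate peaks: keep the row with the largest area per (sample, compound).
idx = peaks.groupby(["sample_id", "compound_name"])["area"].idxmax()
clean = peaks.loc[idx].drop(columns="flag").reset_index(drop=True)

# Use one retention time per compound: the median across samples.
clean["retention_time"] = (
    clean.groupby("compound_name")["retention_time"].transform("median")
)
```

Saving flag_counts to the log or changelog keeps a record of what the discarded column contained.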

Summary

These case studies show that real data is messy in many ways: text, dates, numbers, duplicates, logic, and structure. Cleaning it well takes patience, creativity, and rigor. But once you do, you unlock insights you can trust, and models that work.

You’re not just fixing a spreadsheet. You’re preparing evidence.

References and Resources

A well-cleaned dataset is a silent achievement: it often goes unnoticed, but it makes everything else possible. Below are key references, tools, and datasets mentioned throughout the book, especially in the case studies.

Key References

• Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137–173. https://doi.org/10.1016/j.stamet.2005.08.005

• EMA (2010). Reflection paper on expectations for electronic source data and data transcribed to electronic data collection tools in clinical trials. European Medicines Agency. https://www.ema.europa.eu/en/documents/scientificguideline/reflection-paper-expectations-electronic-source-data-datatranscribed-electronic-data-collection_en.pdf

• FDA (2018). Data Integrity and Compliance With Drug CGMP: Questions and Answers. https://www.fda.gov/media/119267/download

• Pandera documentation: https://pandera.readthedocs.io

• Pydantic documentation: https://docs.pydantic.dev/

• DVC (Data Version Control): https://dvc.org

• Missingno: https://github.com/ResidentMario/missingno

Datasets Used in Case Studies

Clinical EHR Dataset (Synthetic Example)

To replicate the clinical cleaning steps, you can use a synthetic but realistic dataset like:

• Synthea open EHR data: fully synthetic patient records, downloadable in CSV format.

• Sample dataset used in SeerPy (cleaned EHR structure): https://github.com/SEERData/SeerPy (exploration purposes)

WHO Health Indicators Dataset

• WHO Global Health Observatory Data Repository

• Specific indicators like life expectancy, mortality, and immunization: https://www.who.int/data/gho/indicator-metadata-registry

Metabolomics LC-MS Dataset

• Example LC-MS data (MTBLS files):

– MetaboLights MTBLS135: plasma metabolome data

– MTBLS404: human serum NMR and MS data

These provide compound tables with retention time, area, and sample annotations.

Tool                 Purpose                               Link

pandas               Data manipulation                     https://pandas.pydata.org

seaborn/matplotlib   Visualization                         https://seaborn.pydata.org

scikit-learn         Imputation, scaling, model prep       https://scikit-learn.org

pandera              DataFrame schema validation           https://pandera.readthedocs.io

pydantic             Row-level validation and models       https://docs.pydantic.dev

DVC                  Data versioning and reproducibility   https://dvc.org

missingno            Visualizing missing data              https://github.com/ResidentMario/missingno

If you want to go further, consider browsing the datasets on:

• Kaggle Datasets

• UCI Machine Learning Repository

• OpenML

And remember: behind every trusted insight is a dataset that someone took the time to clean properly. Now, that someone is you.
