Essential Guide to Clean Data

Clean your data, clarify your insights: practical techniques for reliable analysis.
A hands-on guide to cleaning and validating data with Python for reliable analysis.
Before we even think about training a model, generating a plot, or calculating a correlation, there’s one thing that every data scientist, analyst, or researcher needs to do: clean the data.
Cleaning data might not be the most glamorous part of data science, but it is undoubtedly one of the most important. It’s the foundation on which everything else is built. No matter how advanced your algorithms are or how elegant your visualizations look, if your data is full of errors, inconsistencies, and missing values, your results will be misleading at best and completely wrong at worst.
“Clean data” is a bit of a vague term, but broadly speaking, it refers to data that:
• Has consistent formatting and structure
• Uses the correct data types (e.g., dates are stored as dates, numbers as numbers)
• Has no missing or malformed values, or at least has had its missing data handled appropriately
• Doesn’t contain obvious errors, duplicates, or outliers (unless justified)
• Matches across sources if needed (e.g., foreign keys between tables)
• Is well-documented and reproducible
Clean data isn’t perfect: it’s realistic, usable, and trustworthy. In short, clean data is fit for purpose.
Unfortunately, most real-world datasets are messy. Very messy. Columns might be misnamed, data might be missing without explanation, units might be inconsistent, and errors can sneak in during data entry, export, or transformation. In clinical datasets, it’s common to see blood pressure measured in mmHg mixed with values that look suspiciously like they came from a different scale, or are simply typos.
This messiness isn’t just annoying; it’s dangerous. As noted by Karr et al. (2006), poor data quality can invalidate scientific results, introduce bias, and waste enormous amounts of time and resources. Especially in healthcare, misleading conclusions from dirty data can have real-world consequences.
Skipping the cleaning process, or doing it half-heartedly, has cascading effects:
• Garbage in, garbage out: Bad input leads to bad output. Your model may appear to perform well, but it’s learning from flawed assumptions.
• Reproducibility issues: Without clear cleaning steps, it becomes hard to reproduce results.
• Loss of trust: Colleagues and stakeholders lose confidence in your analysis if inconsistencies pop up.
• Compliance risks: In regulated environments like clinical trials, dirty data violates standards such as GCP (Good Clinical Practice) and the ALCOA+ principles.
ALCOA+ stands for:
• Attributable: It should be clear who recorded the data.
• Legible: The data should be readable and permanent.
• Contemporaneous: Recorded at the time of the event.
• Original: The data must be the source or a certified copy.
• Accurate: Free from error.
The “+” includes: Complete, Consistent, Enduring, and Available. Data cleaning isn’t just about technical correctness; it’s about meeting these expectations too.
This book is hands-on. We’re going to walk through real problems and clean them using Python, mostly with pandas, but also with other useful tools. Each chapter tackles a typical problem:
• What to do with missing data?
• How do I handle weird date formats?
• Are outliers always wrong?
• How do I track what changes I made to the data?
We’ll keep things practical, reproducible, and honest. You’ll get code examples, visualizations, and plenty of tips based on real-world experience.
• Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137–173. https://doi.org/10.1016/j.stamet.2005.08.005
• EMA (2010). Reflection paper on expectations for electronic source data and data transcribed to electronic data collection tools in clinical trials. European Medicines Agency. https://www.ema.europa.eu
• FDA (2018). Data Integrity and Compliance With Drug CGMP: Questions and Answers. https://www.fda.gov/media/119267/download
Ibon Martínez-Arranz, page 5
Once you get your hands on a dataset, the first instinct is often to run a model or generate a plot. But before doing anything fancy, it’s crucial to slow down and look at the data you’ve got. Not the idea of the data. The actual contents.
In Python, we commonly use pandas.read_csv() to load tabular data. But things can get tricky fast. Maybe the file uses semicolons instead of commas. Maybe there’s an encoding problem. Maybe the headers are actually in row 3. Loading data is often the first sign that something isn’t quite right.
import pandas as pd

# Default load
df = pd.read_csv("data.csv")

# Custom separator and encoding
df = pd.read_csv("data.csv", sep=';', encoding='latin1')
Some useful arguments in read_csv():
• sep, delimiter: For files using ; or \t
• encoding: Try 'utf-8' or 'latin1' (an alias of 'ISO-8859-1') if you see encoding errors
• skiprows: If the header isn’t in the first line
• na_values: To specify what strings should be treated as missing
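As a minimal sketch of how these arguments combine, the snippet below reads a small in-memory file that uses semicolons, has two junk lines before the header, and codes missing values as "n/a" and "-". The file contents and column names are invented for illustration. The dtype argument is not in the list above; it is used here as one way to keep leading zeros in IDs.

```python
import io
import pandas as pd

# Hypothetical messy export: two junk lines, semicolon separator,
# custom missing-value markers, and IDs with leading zeros.
raw = io.StringIO(
    "Exported by system X\n"
    "Generated 2025-01-01\n"
    "patient_id;age;weight\n"
    "00123;54;n/a\n"
    "00456;-;81.5\n"
)

df = pd.read_csv(
    raw,
    sep=";",                    # semicolon-separated file
    skiprows=2,                 # header is on the third line
    na_values=["n/a", "-"],     # extra strings to treat as missing
    dtype={"patient_id": str},  # keep IDs as text so 00123 stays 00123
)
print(df)
print(df.dtypes)
```

With a real file you would pass the path instead of the StringIO object; the arguments stay the same.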
Once loaded, start simple:
df.head()
df.tail()
df.sample(5)
Then go deeper:
df.info()
df.describe()
df.columns
df.dtypes
These commands tell you:
• How many rows and columns there are
• If any columns are missing values
• What data types are being used
• If numerical columns have suspicious max/min values
At this stage, you often spot things like:
• IDs read as float instead of string (e.g., 00123 becomes 123.0)
• Columns with mixed types (e.g., numbers and text)
• Dates stored as object instead of datetime
• Missing values coded as “n/a”, “NA”, or “-”
These are all signals that the dataset needs attention before any analysis can begin.
Always save the raw version and keep your cleaning steps separate. This is good scientific practice, but it also aligns with GCP and ALCOA+ principles: traceability, reproducibility, and transparency.
df.to_csv("cleaned/step0_raw_copy.csv", index=False)
This way, you can always go back to the original if something goes wrong later.
• Always inspect the data before doing anything else.
• Be careful with loading parameters; files can be messy.
• Print and scan the first few rows, check data types and summaries.
• Save a copy of the raw data before you begin cleaning.
We haven’t modified anything yet, but now we know what we’re dealing with. Let’s get our hands dirty in the next chapter.
Missing values are everywhere. In clinical data, you might have patients who skipped visits. In survey data, some respondents might refuse to answer certain questions. In IoT devices, a sensor might temporarily go offline. Whatever the source, missing data is a reality, and handling it well is a critical skill.
The way we handle missing data can have a big impact on the conclusions we draw. Drop too much, and you may lose signal or bias your sample. Impute the wrong way, and you risk creating false confidence. There’s no universal solution, but there are strategies that work well in different situations.
Let’s say you’re building a model to predict blood glucose levels. If half your rows are missing BMI values, dropping those rows could eliminate important patterns. On the other hand, filling them all with the average might mask real variability. Every decision here influences your final model.
Statisticians typically distinguish three types:
• MCAR (Missing Completely At Random): There’s no pattern in the missingness. This is ideal (and rare).
• MAR (Missing At Random): The missingness depends on observed variables.
• MNAR (Missing Not At Random): The missingness depends on unobserved variables, often the missing value itself.
For example:
• MCAR: A lab machine randomly failed to record values.
• MAR: Older patients are less likely to answer a digital survey.
• MNAR: Patients with very high blood pressure skipped follow-up visits.
Understanding this classification helps you decide what methods are safe to use.
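One rough way to probe the classification is to compare an observed variable between rows where a value is missing and rows where it is present. The sketch below uses invented data and hypothetical column names. A clear difference suggests the data is not MCAR, but observed data alone cannot tell MAR apart from MNAR.

```python
import numpy as np
import pandas as pd

# Hypothetical data: BMI is missing more often for older patients.
df = pd.DataFrame({
    "age": [25, 31, 38, 44, 52, 61, 67, 72, 78, 83],
    "bmi": [22.1, 24.5, 27.0, 23.3, 26.4, np.nan, 29.8, np.nan, np.nan, np.nan],
})

# Indicator: True where BMI is missing.
df["bmi_missing"] = df["bmi"].isna()

# Compare an observed variable (age) between the two groups.
age_by_missing = df.groupby("bmi_missing")["age"].mean()
print(age_by_missing)
```

Here the mean age is much higher in the group with missing BMI, so dropping those rows would shift the sample toward younger patients.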
df.isna().sum()
df[df['column_name'].isna()]
To get a quick overview:
missing_percent = (df.isna().mean() * 100).round(2)
print(missing_percent.sort_values(ascending=False))
You can also visualize missingness with the missingno library:
import missingno as msno
msno.matrix(df)
msno.heatmap(df)
This shows which rows and columns have gaps and whether patterns exist. Heatmaps are great for spotting if two variables tend to be missing together.
df_clean = df.dropna()
This works if only a few rows are affected and the loss of data won’t bias the outcome. But if 30% of your dataset disappears, maybe it’s time to pause.
You can also drop only rows missing specific columns:
df = df.dropna(subset=['age', 'sex'])
Or drop columns with too many missing values:
df = df.dropna(axis=1, thresh=int(len(df) * 0.5))  # keep columns with at least 50% non-null values
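# Assign the result back; fillna(..., inplace=True) on a selected column may not update df in recent pandas versions
df['age'] = df['age'].fillna(df['age'].median())
df['sex'] = df['sex'].fillna('Unknown')
Be cautious with mean imputation: it can distort distributions, especially with skewed data. The median is often a safer bet.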
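To see why the median is often safer, here is a small sketch with an invented, right-skewed variable. A single extreme value pulls the mean far from the bulk of the data, so every gap filled with the mean gets an untypical value.

```python
import pandas as pd

# Hypothetical right-skewed values with two gaps (e.g., length of stay in days).
stay = pd.Series([2, 3, 3, 4, 4, 5, 60, None, None], dtype="float64")

mean_filled = stay.fillna(stay.mean())
median_filled = stay.fillna(stay.median())

print("mean used for filling:", round(stay.mean(), 2))
print("median used for filling:", stay.median())
```

In this toy example the mean is above 11 days although six of the seven observed stays are 5 days or less, while the median stays at 4.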
You can use KNN imputation, regression models, or multiple imputation packages. One example with sklearn:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
df[['age', 'bmi']] = imputer.fit_transform(df[['age', 'bmi']])
Or with IterativeImputer (multivariate regression):
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (this import enables IterativeImputer)
from sklearn.impute import IterativeImputer

imp = IterativeImputer(random_state=42)
df_imputed = imp.fit_transform(df.select_dtypes(include='number'))  # returns a NumPy array by default
These techniques try to “guess” missing values using other variables, which is powerful, but make sure to cross-validate your models to avoid overfitting to imputed data.
Another trick is to add an indicator column:
df['age_missing'] = df['age'].isna()
This way, your model can learn from the fact that data was missing, which might itself be informative. For example, patients who skipped lab tests might differ systematically from those who didn’t.
Real-world example: BMI in a clinical dataset
Suppose you’re analyzing data from a metabolic clinic and notice 15% of the patients are missing BMI. You:
1. Create a histogram of BMI to check its distribution.
2. Decide to fill missing values with the median.
3. Add a column bmi_missing = True/False.
4. Log this step and save an intermediate file.
bmi_median = df['bmi'].median()
df['bmi_missing'] = df['bmi'].isna()  # record the gaps before filling them
df['bmi'] = df['bmi'].fillna(bmi_median)
df.to_csv("cleaned/step1_bmi_imputed.csv", index=False)
Now you’ve preserved the information, filled the blanks, and left a trail others can follow.
Don’t just fill in blanks: write down what you did and why. This step matters for ALCOA+ and for reproducibility. Also, store your imputations in separate scripts or notebook cells so they can be reviewed. If someone else picks up your project, they should be able to trace every change.
• Missing data is normal, but must be handled thoughtfully.
• Understand the type of missingness before choosing a method.
• Use pandas or missingno to detect and explore missing values.
• Decide whether to drop, impute, or flag, and document your choices.
• Be cautious with automated imputations: always think before you fill.
Handling missing values well is a small step that can make a huge difference later. Your future self, and anyone reviewing your analysis, will thank you for being careful here.
Getting data into the right format isn’t glamorous, but it’s essential. If your dates are stored as strings, your categories are treated as free text, and your numbers come in as objects, you’re going to hit a wall. Models won’t train. Visualizations will fail. Merges will break.
Correcting data types is one of those silent but powerful steps that makes everything else work.
Every column in a dataset has a type: integer, float, string, datetime, boolean, categorical... but when data is loaded from CSVs, Excel files, or databases, pandas has to guess what type each column is. And it often guesses wrong.
For example:
• object type might actually be a date, a number, or a categorical variable.
• IDs like 00123 might be turned into 123.0, losing leading zeros.
• True/False values might show up as strings: “Yes”, “no”, “TRUE”, “0”.
The earlier you detect and fix these issues, the smoother everything else becomes.
df.dtypes
To get more detail:
df.info()
You’ll see which columns are object, float64, int64, bool, datetime64, and so on.
Sometimes numbers are stored as strings:
df['weight'] = pd.to_numeric(df['weight'], errors='coerce')
The errors='coerce' option turns non-numeric strings (like “n/a”) into NaN, which you can deal with later.
If decimal commas are mixed with decimal points:
df['height'] = df['height'].str.replace(',', '.', regex=False).astype(float)
Dates are one of the most common troublemakers. They often come in weird formats.
df['visit_date'] = pd.to_datetime(df['visit_date'], errors='coerce')
You can specify the format if needed:
df['visit_date'] = pd.to_datetime(df['visit_date'], format='%d/%m/%Y')
Check for NaT (missing datetimes) and make sure all values were converted correctly.
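One way to run that check is to compare the parsed column with the raw one and list the values that had content but ended up as NaT. This is a minimal sketch with invented values; the column name follows the examples above.

```python
import pandas as pd

# Hypothetical column mixing valid dates, an impossible date, and free text.
df = pd.DataFrame({"visit_date": ["03/02/2024", "31/02/2024", "unknown", "15/11/2023"]})

parsed = pd.to_datetime(df["visit_date"], format="%d/%m/%Y", errors="coerce")

# Rows that had a value originally but became NaT after parsing.
failed = df.loc[parsed.isna() & df["visit_date"].notna(), "visit_date"]
print(failed.tolist())
```

Reviewing the failed values before overwriting the column helps you decide whether they are typos, placeholders, or a second date format.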
To extract parts of a date:
df['year'] = df['visit_date'].dt.year
df['month'] = df['visit_date'].dt.month
df['weekday'] = df['visit_date'].dt.day_name()
Free-text categories are a pain. They lead to duplicates and inconsistencies:
print(df['sex'].unique())
You might see: ['Male', 'male', 'M', 'F', 'Female', 'f', '']
First, clean and unify:
df['sex'] = df['sex'].str.strip().str.lower()
df['sex'] = df['sex'].replace({'male': 'M', 'female': 'F', 'm': 'M', 'f': 'F'})
Then convert to category:
df['sex'] = df['sex'].astype('category')
This saves memory and tells pandas that it’s not just a string; it’s a label with a finite set of values.
df['is_smoker'] = df['is_smoker'].map({'yes': True, 'no': False})
Or more safely:
df['is_smoker'] = df['is_smoker'].str.lower().map(lambda x: x in ['yes', 'true', '1'])
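Note that the lambda version above maps every unrecognized or missing value to False. If you would rather keep those as missing, one possible variant is sketched below. The sets of accepted spellings are assumptions to adapt to your data.

```python
import pandas as pd

# Hypothetical raw values, including an unexpected answer and a missing one.
raw = pd.Series(["Yes", "no", "TRUE", "0", "maybe", None])

truthy = {"yes", "true", "1"}
falsy = {"no", "false", "0"}

normalized = raw.str.strip().str.lower()

# Map recognized spellings; anything else (including missing) stays missing.
mapping = {**{v: True for v in truthy}, **{v: False for v in falsy}}
is_smoker = normalized.map(mapping).astype("boolean")
print(is_smoker)
```

The nullable "boolean" dtype keeps True, False, and missing apart, so an answer like “maybe” can be reviewed instead of silently counted as a non-smoker.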
Another hidden source of trouble is mismatched units. Imagine weight in kilograms and pounds mixed in the same column.
# Assuming weights over 250 are probably in pounds
mask = df['weight'] > 250
df.loc[mask, 'weight'] = df.loc[mask, 'weight'] * 0.453592
Add a flag to track which rows were changed:
df['weight_converted'] = mask
Always document when and why you did this. Unit mismatches can break even well-trained models.
You receive a dataset with the following issues:
• Dates stored as text in European format
• Blood pressure as strings: “120/80”
• BMI as a mix of floats and “n/a”
Step-by-step:
# Convert dates
df['visit_date'] = pd.to_datetime(df['visit_date'], dayfirst=True, errors='coerce')

# Split blood pressure
df[['sbp', 'dbp']] = df['blood_pressure'].str.split('/', expand=True)
df['sbp'] = pd.to_numeric(df['sbp'], errors='coerce')
df['dbp'] = pd.to_numeric(df['dbp'], errors='coerce')

# Handle BMI
df['bmi'] = pd.to_numeric(df['bmi'], errors='coerce')
In less than 10 lines, you’ve fixed three major format problems and prepared the dataset for modeling or exploration.
• Always check your data types as soon as you load the data.
• Convert text to numeric, datetime, category, or boolean as needed.
• Watch for leading zeros, inconsistent formats, and strange values.
• Use .astype(), pd.to_numeric(), and pd.to_datetime() to take control.
• Don’t forget unit mismatches; they can be subtle but serious.
When your data types are correct, everything else becomes easier. Clean types = clean analysis.
Messy text is one of the most underestimated data problems. Unlike numbers, text is inherently flexible and ambiguous. A single column might contain values like “Male”, “male”, “MALE”, “M”, or even just “m”, and they all mean the same thing. Unless we clean and standardize that text, our models and summaries won’t understand it.
Text data cleaning is about consistency. We’re not doing NLP here; we’re just getting labels and free-form strings into shape so we can group, filter, or model them effectively.
Let’s say you’re analyzing clinical trial data and want to group patients by diagnosis. If the column contains any of the following, the same diagnosis ends up split across several groups:
• Inconsistent capitalization
• Trailing or leading whitespace
• Typos and abbreviations
• Accent marks and special characters
• Mixed formats (e.g., “kg” vs “kilograms”)
• Empty strings or placeholder text (“n/a”, “–”)
df['diagnosis'] = df['diagnosis'].str.strip()
df['diagnosis'] = df['diagnosis'].str.lower()
df['diagnosis'] = df['diagnosis'].str.replace('-', ' ')  # treat hyphens as spaces
Use .str methods to handle most common issues:
• str.strip(): remove whitespace
• str.lower(): standardize case
• str.replace(): fix formatting and symbols
• str.title(): useful for names
• str.contains(): find patterns
If you know the most common variants:
mapping = {
    't2dm': 'type 2 diabetes',
    'type ii diabetes': 'type 2 diabetes',
    'diabetes type 2': 'type 2 diabetes'
}
df['diagnosis'] = df['diagnosis'].replace(mapping)
You can also use regex for pattern-based replacements:
df['code'] = df['code'].str.replace(r'[^A-Za-z0-9]', '', regex=True)
This removes non-alphanumeric characters, which is often useful when cleaning IDs. (A character class of [^A-Z0-9] would also strip lowercase letters, so use it only if codes are already uppercase.)
Sometimes cells contain strings like “unknown”, “n/a”, or just “-”. You can treat them as missing:
df['notes'] = df['notes'].replace(['n/a', 'na', '-', 'unknown'], pd.NA)
Once standardized, you can handle them as proper missing values.
You receive a column of patient-reported smoking status:
def clean_smoking(value):
    val = str(value).strip().lower()
    if val in ['yes', 'y', '1', 'true']:
        return 'Yes'
    elif val in ['no', 'n', '0', 'false']:
        return 'No'
    else:
        return 'Unknown'

df['smoker'] = df['smoker'].apply(clean_smoking)
df['smoker'] = df['smoker'].astype('category')
Now you can group or analyze smoking status confidently.
Text may include invisible characters or different encodings (especially in multilingual data). Normalize accents using unicodedata:
import unicodedata

def remove_accents(text):
    # Decompose characters (NFKD), then drop the combining accent marks
    if isinstance(text, str):
        return ''.join(
            c for c in unicodedata.normalize('NFKD', text)
            if not unicodedata.combining(c)
        )
    return text  # leave non-string values (e.g., NaN) unchanged
• Text fields are full of inconsistencies; don’t trust them blindly.
• Use .str methods and mappings to clean and unify values.
• Replace placeholder text like “n/a” with proper nulls.
• Normalize accents and symbols if needed.
• Convert cleaned strings to categorical when they represent a finite set of labels.
This kind of cleaning isn’t flashy, but it’s the difference between chaos and clarity.
Duplicate records are like cockroaches: if you see one, there are probably more. They sneak in through system exports, user errors, or merging datasets. Sometimes duplicates are exact copies. Other times they differ by a timestamp, a typo, or an extra space.
Not all duplicates are bad. You might expect repeated lab results, or multiple visits per patient. But when duplicates are unintended, they can skew statistics, inflate sample sizes, or confuse downstream processing.
In pandas, a duplicate row is one where all column values match another row. But you can define duplicates based on a subset of columns if needed.
Start by sorting:
df.sort_values(by=['patient_id', 'visit_date']).head(10)
Sometimes duplicates are not identical but nearly so:
• Same patient ID and date, but different lab values
• Same name and birth date, but different phone number
This is where your domain knowledge comes in.
Handling duplicates safely
1. Drop exact duplicates
df = df.drop_duplicates()
You can also keep the last occurrence instead of the first:
df = df.drop_duplicates(keep='last')
2. Drop by key columns
df = df.drop_duplicates(subset=['patient_id', 'visit_date'])
Be careful: this may drop rows that are valid but repeated.
Sometimes it’s better to collapse rows:
df = df.groupby(['patient_id', 'visit_date']).agg({
    'glucose': 'mean',
    'bmi': 'first'
}).reset_index()
You choose which fields to average, count, or take the first value from.
When names or text fields are close but not identical, you can use fuzzywuzzy (now maintained under the name thefuzz) or recordlinkage. Example:
from fuzzywuzzy import fuzz
fuzz.ratio("John Smith", "Jon Smith")  # similarity score from 0 to 100; about 95 here
This helps detect records that look like duplicates even if they’re not exact matches.
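If you prefer not to install an extra package, the standard library offers a comparable measure. The sketch below uses difflib.SequenceMatcher rather than fuzzywuzzy, so scores are on a 0 to 1 scale. The names and the 0.9 threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0-1 similarity ratio after basic normalization."""
    a_norm = " ".join(a.lower().split())
    b_norm = " ".join(b.lower().split())
    return SequenceMatcher(None, a_norm, b_norm).ratio()

names = ["John Smith", "Jon Smith", "Maria Lopez"]
THRESHOLD = 0.9  # assumed cut-off; tune it for your data

# Compare every pair once and keep those at or above the threshold.
candidates = [
    (a, b, round(similarity(a, b), 2))
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if similarity(a, b) >= THRESHOLD
]
print(candidates)
```

Pairwise comparison grows quadratically with the number of records, so on larger tables it is usually applied within blocks (for example, records sharing a birth date) rather than across the whole dataset.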
Always log how many rows were removed and why:
before = len(df)
df = df.drop_duplicates()
after = len(df)
print(f"Removed {before - after} duplicate rows")
If working under ALCOA+, document the criteria used and keep an original backup.
You receive an export of visits from a hospital system. Some patients have 2 or 3 identical entries for the same date. After talking to the team, you learn it’s a system glitch that repeated the same record.
duplicates = df[df.duplicated(subset=['patient_id', 'visit_date'])]
print("Duplicates found:", len(duplicates))

# Drop safely
df = df.drop_duplicates(subset=['patient_id', 'visit_date'])
You document the issue, clean the dataset, and save a clean copy:
df.to_csv("cleaned/step2_duplicates_removed.csv", index=False)
• Duplicates can distort analysis if left unchecked.
• Use duplicated() and drop_duplicates() to detect and clean.
• Decide if duplicates are true errors or expected repetitions.
• Consider grouping and aggregating near-duplicates.
• Always log what you removed and why.
Handling duplicates is about being precise, careful, and curious. They’re not just technical artifacts; they’re clues to how your data was created.
Outliers are values that look suspiciously far from the rest. Maybe they’re errors. Maybe they’re just rare events. Either way, they can have a big impact, especially on statistics like mean and standard deviation, or models sensitive to scale.
But here’s the tricky part: not all outliers are wrong. In medicine, a BMI of 60 is unusual, but it could be correct. In finance, a sudden spike in sales might reflect a real event. So before deleting or correcting anything, we need to understand why an outlier exists.
There’s no single definition, but here are a few signs:
• A value far outside the normal range
• A point more than 1.5 × IQR below the first quartile or above the third quartile (IQR = interquartile range)
• A value more than 3 standard deviations from the mean
• A point that breaks a business or clinical rule (e.g., heart rate > 300)
Visualizing outliers
import seaborn as sns
sns.boxplot(x=df['bmi'])
Boxplots are great for seeing the spread of the data and identifying points outside the whiskers.
df['glucose'].hist(bins=50)
Histograms are helpful for spotting spikes, long tails, or gaps.
Use scatter plots when outliers depend on context:
sns.scatterplot(x='age', y='cholesterol', data=df)
Z-scores
For numerical columns:
from scipy.stats import zscore
bmi = df['bmi'].dropna()
z = pd.Series(zscore(bmi), index=bmi.index)  # keep row labels so the scores align with df
outliers = df.loc[z[z.abs() > 3].index]
IQR method
Q1 = df['bmi'].quantile(0.25)
Q3 = df['bmi'].quantile(0.75)
IQR = Q3 - Q1

outliers = df[(df['bmi'] < Q1 - 1.5 * IQR) | (df['bmi'] > Q3 + 1.5 * IQR)]
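Since the same IQR computation tends to be repeated for several columns, it can be wrapped in a small helper. The function name iqr_bounds and the sample values below are hypothetical; the factor 1.5 is the conventional default and can be changed.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical BMI values with one extreme entry and one gap.
bmi = pd.Series([21.0, 23.5, 24.0, 25.5, 27.0, 29.0, 31.0, 95.0, None])

lower, upper = iqr_bounds(bmi)
is_outlier = (bmi < lower) | (bmi > upper)  # missing values compare as False
print(lower, upper)
print(bmi[is_outlier])
```

Because missing values never satisfy either comparison, they are not flagged as outliers and can be handled separately.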
1. Investigate
• Is it a data entry error?
• Does it break known constraints?
• Could it be a valid extreme case?
2. Correct if clearly wrong
# Example: age of 250 is likely a typo
df.loc[df['age'] > 120, 'age'] = pd.NA
Or set to null to review later:
mask = df['glucose'] > 1000
df.loc[mask, 'glucose'] = pd.NA
3. Cap or clip
Limit values to a reasonable range:
df['bmi'] = df['bmi'].clip(lower=10, upper=60)
This preserves the row but limits the influence of the extreme value.
4. Flag
Mark outliers for later review:
Q1 = df['bmi'].quantile(0.25)
Q3 = df['bmi'].quantile(0.75)
IQR = Q3 - Q1

df['bmi_outlier'] = (df['bmi'] < Q1 - 1.5 * IQR) | (df['bmi'] > Q3 + 1.5 * IQR)
5. Remove them (last resort)
Only when justified:
df = df[~df['bmi_outlier']]
Don’t delete rows just because they make your plot prettier.
You’re analyzing fasting glucose levels. Most values are between 70 and 120 mg/dL. But a few values are over 1000, which is nearly ten times the upper end of the typical range.
Step-by-step:
1. Plot a histogram.
2. Use IQR to define outliers.
3. Set outliers to NaN.
4. Add an outlier flag.
5. Document the change.
Q1 = df['glucose'].quantile(0.25)
Q3 = df['glucose'].quantile(0.75)
IQR = Q3 - Q1

mask = (df['glucose'] < Q1 - 1.5 * IQR) | (df['glucose'] > Q3 + 1.5 * IQR)
df['glucose_outlier'] = mask
df.loc[mask, 'glucose'] = pd.NA
Summary
• Outliers can be real or errors; investigate before acting.
• Use boxplots, histograms, z-scores, or IQR to detect them.
• Consider capping, flagging, or removing, but document all steps.
• Never assume that a strange value is wrong without context.
Outliers are warning lights. They tell you where to look more closely and, sometimes, where the real story is hiding.
Some problems aren’t obvious by looking at individual columns; they live in the relationships between columns. Maybe a patient has a discharge date that comes before their admission. Or a BMI that doesn’t match their height and weight. Or someone marked as deceased but with follow-up lab results.
These kinds of cross-field inconsistencies are common in real datasets. Fixing them is a bit like detective work: you need to compare, calculate, and ask “does this make sense?”
• Date logic errors: end date before start date, birth date in the future
• Redundant fields mismatch: BMI doesn’t match height and weight
• Flag contradictions: pregnant = True and sex = male
• Linked variables drift: weight increased by 20 kg but height stayed the same
# Note: rows where either date is missing compare as False and are flagged too
df['admit_before_discharge'] = df['admit_date'] <= df['discharge_date']
invalid_dates = df[~df['admit_before_discharge']]
Flag invalid rows and review:
print(invalid_dates[['patient_id', 'admit_date', 'discharge_date']])
Set them to NaT if needed:
df.loc[~df['admit_before_discharge'], 'discharge_date'] = pd.NaT
If you have height and weight, you can recompute BMI:
df['height_m'] = df['height_cm'] / 100
df['bmi_calc'] = df['weight_kg'] / df['height_m'] ** 2
Compare with recorded BMI:
bmi_diff = (df['bmi'] - df['bmi_calc']).abs()
df['bmi_mismatch'] = bmi_diff > 1.5
Now you can decide whether to trust the calculated or recorded value, or keep both.
Contradictory flags
Example: someone marked pregnant = True and sex = male.
mask = (df['pregnant'] == True) & (df['sex'] == 'M')
df.loc[mask, ['pregnant', 'sex']]
If it’s a data entry mistake, you may want to:
• Set pregnant = False
• Set it to NaN
• Flag the row for review
Always log how many contradictions were found:
print("Contradictions found:", mask.sum())
In some cases, values don’t match expected category pairs:
• A “follow-up” record without an initial visit
• A “discharged” flag without an admission
• A “completed” task with a null timestamp
You can detect these with conditional checks:
mask = (df['visit_type'] == 'follow-up') & df['initial_visit_id'].isna()
df['visit_issue'] = mask
If two variables should be tightly correlated, check them:
sns.scatterplot(x='total_cholesterol', y='ldl', data=df)
If LDL is higher than total cholesterol for multiple cases, something’s wrong.
You can also calculate correlation:
corr = df[['total_cholesterol', 'ldl']].corr()
print(corr)
You’re working with ICU data. Each row has start_time, end_time, and duration_minutes.
# Recalculate duration
df['duration_calc'] = (df['end_time'] - df['start_time']).dt.total_seconds() / 60

# Compare with recorded
df['duration_diff'] = (df['duration_calc'] - df['duration_minutes']).abs()
df['duration_mismatch'] = df['duration_diff'] > 5
Now you’ve spotted where duration was miscalculated, possibly from rounding errors or manual edits.
• Some errors only appear when comparing multiple columns.
• Use logic and simple math to detect contradictions.
• Recalculate values from raw inputs when possible.
• Flag inconsistencies, correct when obvious, and always document.
These aren’t just bugs; they’re clues. Data that doesn’t make sense across columns tells you a story. Sometimes that story is about biology. Sometimes it’s about a typo.
Data validation is about making sure the values in your dataset make sense, not just in isolation, but according to defined rules. These rules might come from your domain knowledge, regulatory standards, or logical expectations. Think of it as teaching your dataset some basic manners.
If your analysis is a house, validation is the plumbing check before you move in.
Anyone can make a mistake. A user might enter a temperature in Fahrenheit instead of Celsius. A timestamp might get truncated. A value might be typed as “999” instead of “99”. These small errors can cascade into bigger problems if left unchecked.
By enforcing constraints early, you catch issues before they cause confusion downstream.
mask = (df['age'] >= 0) & (df['age'] <= 120)
df['age_invalid'] = ~mask
Allowed values (categorical checks)
valid_values = ['M', 'F']
df['sex_valid'] = df['sex'].isin(valid_values)
Null constraints
Make sure required fields aren’t missing:
df['admit_date_valid'] = df['admit_date'].notna()
Cross-field dependencies
# Pregnancy flag only allowed if sex is female
mask = (df['sex'] == 'F') | (df['pregnant'] != True)
df['pregnancy_constraint_ok'] = mask
Pattern checks with regex
# Email format
pattern = r'^[^@\s]+@[^@\s]+\.[^@\s]+$'  # illustrative pattern (an assumption), not a full email validator
df['email_valid'] = df['email'].str.match(pattern)
pandera is a Python library to define schemas and validate pandas DataFrames. Example:
import pandera as pa
from pandera import Column, DataFrameSchema

schema = DataFrameSchema({
    "age": Column(pa.Int, checks=pa.Check.in_range(0, 120)),
    "sex": Column(pa.String, checks=pa.Check.isin(["M", "F"])),
    "bmi": Column(pa.Float, nullable=True),
})

validated_df = schema.validate(df)
If something’s wrong, it raises a clear error.
If your data comes row by row (e.g., from an API), pydantic is a great option.
# pydantic v1 style; in pydantic v2, field_validator replaces validator
from pydantic import BaseModel, Field, validator

class Patient(BaseModel):
    age: int = Field(..., ge=0, le=120)
    sex: str

    @validator("sex")
    def check_sex(cls, v):
        if v not in ["M", "F"]:
            raise ValueError("Invalid sex")
        return v
If you’re working with multiple tables (e.g., patients and visits), you need to ensure foreign keys match.
valid_ids = patients['patient_id']
df['id_in_patients'] = df['patient_id'].isin(valid_ids)
If you find rows that reference a non-existent patient, that’s a red flag.
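An alternative to isin() is a left merge with indicator=True, which also tells you which side each row came from. The tables below are invented for illustration, and the sketch assumes patient_id is unique in the patients table (otherwise the merge would multiply rows).

```python
import pandas as pd

# Hypothetical tables: one visit references a patient that does not exist.
patients = pd.DataFrame({"patient_id": ["P1", "P2", "P3"]})
visits = pd.DataFrame({
    "visit_id": [1, 2, 3, 4],
    "patient_id": ["P1", "P2", "P9", "P1"],
})

# Left-join visits to patients; the _merge column records where each row was found.
checked = visits.merge(patients, on="patient_id", how="left", indicator=True)
orphans = checked.loc[checked["_merge"] == "left_only", ["visit_id", "patient_id"]]
print(orphans)
```

The orphans table can be saved alongside the cleaned data so the unmatched keys are documented rather than silently dropped.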
Constraint enforcement with assertions
You can add assertions at the top of notebooks or scripts:
assert df['age'].between(0, 120).all(), "Age out of range!"
This helps ensure things don’t break silently.
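An assertion stops at the first failure. When you want an overview of all problems before deciding what to do, one option is to evaluate every rule and count the failing rows. The rule names and sample data in this sketch are illustrative.

```python
import pandas as pd

# Hypothetical data with one problem per rule.
df = pd.DataFrame({
    "patient_id": ["P1", "P2", None, "P4"],
    "age": [34, 150, 61, 47],
    "sex": ["M", "F", "F", "x"],
})

# Each rule is a boolean Series: True means the row passes.
rules = {
    "age between 0 and 120": df["age"].between(0, 120),
    "sex is M or F": df["sex"].isin(["M", "F"]),
    "patient_id present": df["patient_id"].notna(),
}

# Count failures per rule instead of stopping at the first one.
report = pd.Series({name: int((~passed).sum()) for name, passed in rules.items()})
print(report)
```

The same dictionary of rules can feed both a report like this one and a final assertion that every count is zero.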
You receive a dataset with demographics, vital signs, and lab results. You want to ensure:
• Age is 0–120
• Sex is M or F
• All rows have a patient ID
• Glucose values are realistic
Step-by-step:
assert df['age'].between(0, 120).all()
assert df['sex'].isin(['M', 'F']).all()
assert df['patient_id'].notna().all()
assert df['glucose'].between(30, 800).all()
Alternatively, you could wrap these into a pandera schema.
Whenever you apply validations, document them. You can:
• Keep a constraints.md file describing each rule
• Store validation functions in a module (e.g. validate.py)
• Add logging to your pipeline
import logging
logging.info("Validated sex field: %d invalid entries", (~df['sex'].isin(['M', 'F'])).sum())
• Validation is the practice of enforcing logic and sanity checks on your data.
• Use Python logic, regex, or tools like pandera and pydantic.
• Catch issues early; don’t wait for your model to fail.
• Document all rules applied.
Clean data isn’t just about missing values or duplicates; it’s also about values that make sense. Validation is your last line of defense before analysis begins.
You’ve cleaned your data. It looks great. But how will someone else know what you changed, and why? And more importantly: how will you remember what you did three weeks from now?
That’s where logging, versioning, and audit trails come in. These aren’t just “nice to have” in regulated environments; they’re essential for reproducibility, trust, and scientific integrity.
In clinical research, data must comply with ALCOA+ principles:
• Attributable: Who made the change?
• Legible: Can we read and understand it?
• Contemporaneous: Was it recorded at the time?
• Original: Do we have the raw version?
• Accurate: Is it correct?
The “+” adds: Complete, Consistent, Enduring, and Available.
These principles don’t just apply to the data; they apply to your code, decisions, and process.
Never overwrite the original file. Always save cleaned versions with clear names:
df.to_csv("cleaned/step3_validated.csv", index=False)
Include a README or changelog describing what changed:
2025-05-08 | Removed outliers and fixed BMI mismatches. Stored in step3_validated.csv
You can even log the number of rows affected:
removed = initial_len - len(df)
print(f"Removed {removed} rows with invalid age")
Instead of just printing, log messages properly:
import logging
logging.basicConfig(filename='logs/cleaning.log', level=logging.INFO)
logging.info("Step 2: Replaced missing BMI values with median")
Log unexpected values too:
bad_sex = df[~df['sex'].isin(['M', 'F'])]
logging.warning(f"Found {len(bad_sex)} invalid sex entries")
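To make row counts part of every log entry, the logging call can be wrapped in a small helper. The function log_step below is a hypothetical name, not part of any library. This sketch logs to the console so that it runs without a logs/ directory; pass filename= to basicConfig as above to write to a file instead.

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("cleaning")

def log_step(name: str, before: pd.DataFrame, after: pd.DataFrame) -> int:
    """Log how many rows a cleaning step removed and return that number."""
    removed = len(before) - len(after)
    logger.info("%s: %d -> %d rows (%d removed)", name, len(before), len(after), removed)
    return removed

# Hypothetical usage with an exact-duplicate removal step.
df = pd.DataFrame({"patient_id": ["P1", "P1", "P2"], "glucose": [90, 90, 110]})
deduped = df.drop_duplicates()
removed = log_step("drop_duplicates", df, deduped)
```

The timestamp in the format string records when each step ran, which supports the “Contemporaneous” principle.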
DVC is like Git for data. You can version .csv or .parquet files and track them with your code.
Initialize a DVC project:
dvc init
dvc add cleaned/step3_validated.csv
git add cleaned/step3_validated.csv.dvc cleaned/.gitignore
git commit -m "Track validated dataset with DVC"
This allows rollback, comparison, and reproducibility.
Use Git to track your notebooks and scripts. Commit often:
git commit -am "Cleaned text fields and recoded categories"
Tag important stages:
git tag -a v1.0 -m "Final cleaned dataset before modeling"
Jupyter keeps execution counts, but it’s easy to rerun cells in a different order and lose traceability. Use these tips:
• Run notebooks top to bottom
• Export to HTML or PDF after each milestone
• Use nbconvert or tools like papermill or jupytext
Move cleaning steps into scripts:
python clean_data.py
This ensures anyone can re-run the pipeline, not just click through a notebook.
Include command-line logging:
print("Step 1 complete: dropped duplicates")
Or structured logging:
import logging
logging.info("Step 3 complete: imputed missing glucose")
You create a script cleaning_pipeline.py that:
• Loads raw data from data/raw.csv
• Cleans types, missing values, and outliers
• Saves to cleaned/step_final.csv
• Logs all actions
• Is tracked by Git and DVC
Now, anyone on your team (or a reviewer) can:
• Re-run the entire process
• See exactly what was changed
• Compare results before and after cleaning
• Always keep raw and cleaned data separate
• Use logging to document every change
• Version data files with DVC or commit intermediate steps manually
• Track your code with Git and tag important milestones
• Package your cleaning steps into scripts for full reproducibility
Clean data is great. But clean processes are even better. That’s what makes your work solid, trusted, and ready for the real world.
Let’s wrap up with some real-world examples. These case studies combine many of the techniques we’ve explored so far, and show how they’re applied to messy, realistic data. Each one starts from a raw dataset and walks through the steps to make it usable.
You’re given an extract of electronic health records (EHR) with the following columns:
• patient_id
• sex
• birth_date
• visit_date
• height_cm, weight_kg, bmi
• glucose, blood_pressure, notes
Problems spotted:
• sex values: “M”, “F”, “male”, “FEMALE”, “unknown”
• birth_date stored as text, some in DD/MM/YYYY, some in MM-DD-YYYY
• bmi doesn’t match height and weight in many cases
• Notes contain typos and irregular whitespace
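The cleaning code for this case study relies on two helpers, sex_map and try_parse, whose definitions are not shown. The versions below are assumptions reconstructed from the problems listed above: a mapping from lower-cased labels to M/F, and a parser that tries the two date formats in turn. The sample rows are invented.

```python
from datetime import datetime
import pandas as pd

# Assumed mapping from lower-cased raw labels to the codes used later.
sex_map = {"m": "M", "male": "M", "f": "F", "female": "F"}

def try_parse(value):
    """Try DD/MM/YYYY, then MM-DD-YYYY; return NaT if neither fits."""
    if not isinstance(value, str):
        return pd.NaT
    for fmt in ("%d/%m/%Y", "%m-%d-%Y"):
        try:
            return pd.Timestamp(datetime.strptime(value.strip(), fmt))
        except ValueError:
            continue
    return pd.NaT

# Hypothetical sample rows.
df = pd.DataFrame({
    "sex": [" male", "FEMALE", "unknown"],
    "birth_date": ["25/12/1980", "12-25-1980", "not a date"],
})
df["sex"] = df["sex"].str.strip().str.lower().replace(sex_map)
df["sex"] = df["sex"].where(df["sex"].isin(["M", "F"]), pd.NA)
df["birth_date"] = df["birth_date"].apply(try_parse)
print(df)
```

The two formats can be told apart here only because they use different separators. If both used the same separator, day and month could not be distinguished reliably and the source system would need to be consulted.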
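Cleaning steps:
# Assumes pandas is imported as pd, and that sex_map (lower-case variants -> 'M'/'F')
# and try_parse (one string -> Timestamp or NaT) are defined beforehand

# Standardize sex labels; anything outside M/F becomes missing
df['sex'] = df['sex'].str.strip().str.lower().replace(sex_map)
df['sex'] = df['sex'].where(df['sex'].isin(['M', 'F']), pd.NA)

# Parse mixed-format birth dates
df['birth_date'] = df['birth_date'].apply(try_parse)

# Recompute BMI and flag rows that differ from the recorded value by more than 1.5
height_m = df['height_cm'] / 100
df['bmi_calc'] = df['weight_kg'] / (height_m ** 2)
df['bmi_diff'] = (df['bmi'] - df['bmi_calc']).abs()
df['bmi_flag'] = df['bmi_diff'] > 1.5

# Normalize notes: trim, lower-case, collapse whitespace runs to a single space
df['notes'] = df['notes'].str.strip().str.lower().str.replace(r'\s+', ' ', regex=True)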
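The snippet above uses `sex_map` and `try_parse` without showing their definitions. The following is a minimal sketch of what they might look like, not the author's own code. The mapping keys and the two date formats are assumptions inferred from the problems listed earlier, and the three-row DataFrame is made up for the demonstration.

```python
from datetime import datetime

import pandas as pd

# Hypothetical mapping from lower-cased raw values to canonical codes
sex_map = {"m": "M", "male": "M", "f": "F", "female": "F"}

# Assumed formats, taken from the problems listed for this dataset
DATE_FORMATS = ("%d/%m/%Y", "%m-%d-%Y")


def try_parse(value):
    """Return a Timestamp for the first format that matches, else NaT."""
    if not isinstance(value, str):
        return pd.NaT
    for fmt in DATE_FORMATS:
        try:
            return pd.Timestamp(datetime.strptime(value.strip(), fmt))
        except ValueError:
            continue
    return pd.NaT


# Tiny made-up sample to exercise both helpers
df = pd.DataFrame({
    "sex": [" M", "female", "unknown"],
    "birth_date": ["31/01/1980", "02-28-1975", "not a date"],
})
df["sex"] = df["sex"].str.strip().str.lower().replace(sex_map)
df["sex"] = df["sex"].where(df["sex"].isin(["M", "F"]), pd.NA)
df["birth_date"] = df["birth_date"].apply(try_parse)
```

This approach works only because the two formats use different separators. A value such as 03/04/1990 is always read as day first, so it is worth confirming with the data provider that slash-separated dates really are DD/MM/YYYY.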
Results:
• Cleaned categorical fields
• Consistent date format
• Mismatched BMI flagged for review
• Notes simplified for NLP or grouping
You're working with a dataset of country-level health metrics. It includes:
• country, year
• life_expectancy, population, gdp_per_capita
• urban_pct, mortality_rate, vaccination_coverage
Problems spotted:
• Missing gdp_per_capita in some countries
• Duplicate rows for some (country, year) pairs
• urban_pct has values over 100
• vaccination_coverage includes "n/a"
Cleaning steps:
# Drop duplicates
df = df.drop_duplicates(subset=['country', 'year'])

# Convert to numeric
df['gdp_per_capita'] = pd.to_numeric(df['gdp_per_capita'], errors='coerce')
df['vaccination_coverage'] = pd.to_numeric(df['vaccination_coverage'], errors='coerce')

# Cap invalid values
df['urban_pct'] = df['urban_pct'].clip(upper=100)

# Impute GDP using country median
df['gdp_per_capita'] = df.groupby('country')['gdp_per_capita'].transform(
    lambda x: x.fillna(x.median())
)
Results:
• Cleaned dataset ready for time series analysis
• No impossible values (e.g., 120% urban population)
• Reproducible script with clear assumptions
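These results can be verified rather than assumed. The sketch below shows one possible set of post-cleaning checks. The function name `check_country_metrics` is hypothetical, and the 0 to 100 range for `urban_pct` is an assumption based on it being a percentage.

```python
import pandas as pd


def check_country_metrics(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if df.duplicated(subset=["country", "year"]).any():
        problems.append("duplicate (country, year) rows")
    if not df["urban_pct"].dropna().between(0, 100).all():
        problems.append("urban_pct outside 0-100")
    if not pd.api.types.is_numeric_dtype(df["vaccination_coverage"]):
        problems.append("vaccination_coverage is not numeric")
    if df["gdp_per_capita"].isna().any():
        problems.append("gdp_per_capita still has missing values")
    return problems
```

The last check matters because a country-level median cannot fill anything for a country with no GDP values at all, so some gaps may survive the imputation step.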
You're analyzing untargeted LC-MS metabolomics data with compound intensities across multiple samples:
• sample_id, compound_name, retention_time, area, flag
Issues found:
• Some compounds have duplicate peaks
• retention_time varies slightly between replicates
• Some flags are "bb", "MM", or missing
• Intensities vary by 10x between replicates
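No cleaning code is shown for this case study here, so the following is only a sketch of how these issues could be approached with pandas. It is not the author's method. The toy values are made up. Keeping the largest peak per sample and compound, treating all samples as replicates, and the 10x threshold are all assumptions chosen for illustration.

```python
import pandas as pd

# Made-up peak table using the column names listed above
peaks = pd.DataFrame({
    "sample_id": ["s1", "s1", "s2", "s1", "s2"],
    "compound_name": ["glucose", "glucose", "glucose", "lactate", "lactate"],
    "retention_time": [5.02, 5.05, 4.98, 2.10, 2.14],
    "area": [1000.0, 200.0, 1100.0, 50.0, 600.0],
    "flag": ["bb", "MM", None, "BB", "bb"],
})

# Normalize flags: lower-case, with missing values made explicit
peaks["flag"] = peaks["flag"].str.lower().fillna("missing")

# Duplicate peaks: keep the largest area per (sample, compound), an assumption
largest = peaks.sort_values("area", ascending=False).drop_duplicates(
    subset=["sample_id", "compound_name"]
)

# Median retention time per compound, computed on the deduplicated peaks
largest = largest.assign(
    rt_median=largest.groupby("compound_name")["retention_time"].transform("median")
)

# Flag compounds whose area differs by 10x or more across samples
ratio = largest.groupby("compound_name")["area"].agg(lambda a: a.max() / a.min())
largest["high_variation"] = largest["compound_name"].map(ratio) >= 10

clean_peaks = largest.sort_values(["compound_name", "sample_id"]).reset_index(drop=True)
```

Flagging high-variation compounds instead of dropping them keeps the decision visible for review. Whether to sum duplicate peaks or keep the largest depends on the integration software, so that choice should be documented.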
Results:
• Aggregated dataset with clean peaks per sample
• Median retention time per compound
• Flags documented and discarded safely
Summary
These case studies show that real data is messy in many ways: text, dates, numbers, duplicates, logic, and structure. Cleaning it well takes patience, creativity, and rigor. But once you do, you unlock insights you can trust, and models that work. You're not just fixing a spreadsheet. You're preparing evidence.
A well-cleaned dataset is a silent achievement: it often goes unnoticed, but it makes everything else possible. Below are key references, tools, and datasets mentioned throughout the book, especially in the case studies.
• Karr, A. F., Sanil, A. P., & Banks, D. L. (2006). Data quality: A statistical perspective. Statistical Methodology, 3(2), 137–173. https://doi.org/10.1016/j.stamet.2005.08.005
• EMA (2010). Reflection paper on expectations for electronic source data and data transcribed to electronic data collection tools in clinical trials. European Medicines Agency. https://www.ema.europa.eu/en/documents/scientific-guideline/reflection-paper-expectations-electronic-source-data-data-transcribed-electronic-data-collection_en.pdf
• FDA (2018). Data Integrity and Compliance With Drug CGMP: Questions and Answers. https://www.fda.gov/media/119267/download
• Pandera documentation: https://pandera.readthedocs.io
• Pydantic documentation: https://docs.pydantic.dev/
• DVC (Data Version Control): https://dvc.org
• Missingno: https://github.com/ResidentMario/missingno
To replicate the clinical cleaning steps, you can use a synthetic but realistic dataset like:
• Synthea open EHR data: fully synthetic patient records, downloadable in CSV format.
• Sample dataset used in SeerPy (cleaned EHR structure): https://github.com/SEERData/SeerPy (exploration purposes)
• WHO Global Health Observatory Data Repository
• Specific indicators like life expectancy, mortality, and immunization: https://www.who.int/data/gho/indicator-metadata-registry
• Example LC-MS data (MTBLS files):
– MetaboLights MTBLS135: Plasma metabolome data.
– MTBLS404: Human serum NMR and MS data
These provide compound tables with retention time, area, and sample annotations.
Tool                  Purpose                              Link
pandas                Data manipulation                    https://pandas.pydata.org
seaborn / matplotlib  Visualization                        https://seaborn.pydata.org
scikit-learn          Imputation, scaling, model prep      https://scikit-learn.org
pandera               DataFrame schema validation          https://pandera.readthedocs.io
pydantic              Row-level validation and models      https://docs.pydantic.dev
DVC                   Data versioning and reproducibility  https://dvc.org
missingno             Visualizing missing data             https://github.com/ResidentMario/missingno
If you want to go further, consider browsing the datasets on:
• Kaggle Datasets
• UCI Machine Learning Repository
• OpenML
And remember: behind every trusted insight is a dataset that someone took the time to clean properly. Now, that someone is you.