An AI unsupervised clustering of airports - a tool to find suitable humanitarian cooperation for disaster preparedness
MARIA BROWARSKA, Delft University of Technology, Netherlands
KARLA SALDAÑA OCHOA, University of Florida, School of Architecture, College of Design, Construction and Planning, USA

Fig. 1. Satellite images of airports in the database.
In recent years, natural disasters have increased in frequency, causing significant damage to communities and infrastructure worldwide. When a natural disaster strikes, airports in the affected region have to adapt quickly from serving regular passengers to becoming a humanitarian hub handling a massive increase in passengers and cargo. Several countries are particularly vulnerable and prone to such a devastating event. Although existing initiatives aim to raise awareness and improve airport preparedness, authorities are often isolated in their resilience efforts as they tend to act individually, and their response is often bound by local experience. Consequently, this research aims to broaden the field of view from a local to a global one by compiling a database of 971 airports worldwide with corresponding socio-technical characteristics in various data modalities. In addition, through a data science approach, the different data modalities were transformed into numerical feature vectors, so that future studies can find correlations between airports and identify similar airports from which different approaches to disaster preparedness and response can be learned.
Additional Key Words and Phrases: airports database, disaster preparedness, AI-based clustering
ACM Reference Format:
Maria Browarska and Karla Saldaña Ochoa. 2021. An AI unsupervised clustering of airports - a tool to find suitable humanitarian cooperation for disaster preparedness. In ACM, New York, NY, USA, 18 pages. https://doi.org/XXXXXXX.XXXXXXX
Data-driven Humanitarian Mapping, KDD 2022, Browarska and Saldaña Ochoa
1 INTRODUCTION
When a natural disaster strikes, the nearest airport becomes the critical link for delivering and organizing relief aid while trying to stay efficient in evacuating citizens and receiving emergency personnel [5]. However, the existing infrastructure often cannot handle the sudden spike in the volume of incoming goods [6]. When airports become nonoperational, the only way to receive valuable aid is via road, rail, and water, which is often much less efficient and more time-consuming [18].
Even though disasters and humanitarian aid are not new challenges, there is still much room for improvement. Airports are set in an environment of technical and operational challenges, laws and regulations, and international and regional cooperation of stakeholders from various fields working to improve humanitarian logistics. To characterize an airport, we need to consider various features that describe its complexity: a) geospatial and airport-specific data: surrounding area, reachability, number of runways, taxiways; b) demographic data: urban indexes and population around the airport; and c) geographic and urban data: seaport data and built environment information. All of the aforementioned characteristics influence airports’ preparedness for a potential disaster, and collecting them can help experts better address the problem in a broader scope.
Thus, this research explores how data science could help establish a base for forming collaborations between airports that might face similar challenges in disaster preparedness efforts. The goal is to build a comprehensive database describing airports from the perspective of their disaster preparedness that will help future researchers find similarities between them, based on their intrinsic socio-technical features, so that perhaps an airport in Indonesia could be matched with its sibling airport in the Caribbean. The research involved several programming operations, starting with collecting data, through data processing, up to experimenting with the Self Organising Maps (SOM) algorithm in order to find airports that share features relevant to their disaster preparedness. The database can be found in the following repository.
https://gitlab.com/maria.browarska/OSM-SOM
The proposed database of airports and their numerical features is the first step in a process that will conclude with creating group-specific policy advice for similar airports. With this article, we want to describe the steps from collection, normalization, and pre-processing of the data to transforming the multimodal data into a numerical feature vector that can be used for grouping similar airports through Unsupervised Machine Learning algorithms, which cluster airports based on similar numerical features. This offers a relevant scenario for applying ML in a way that benefits society at large.
2 KNOWLEDGE GAP AND RESEARCH GOAL
In order to define key concepts, narrow down the scope of the research and precisely define the knowledge gap, a literature review was conducted, followed by 5 semi-structured interviews with industry experts.
2.1 Literature review
Most of the reviewed articles focused on a case study as the research approach, often looking at individual airports and assessing historical events. Researchers analysed the behaviour of airports in specific disastrous events, mainly focusing on organisational processes and stakeholders’ cooperation [17, 18, 25]. While all the considered features, without a doubt, influence logistical operations, they are also unique for each airport. Hence, it is challenging to draw general conclusions that could apply to other airports since their organisational structure may differ, due to international and regional regulations, resources and needs.
Table 1. Socio-technical features
Structural and capacity features: Runways and their characteristics; Aircraft parking and its characteristics; Terminals and their characteristics; Storage facilities, both open-air and covered warehouses
Accessibility features: Airport connection; Geographical surroundings; Alternative airports and seaports
Organisational features: How much staff is available; How well the staff is trained; Who owns the airport; What is the airport’s main purpose (civil/military); Whether the airport was part of any preparedness programs
Risk related features: Risk of occurrence of a natural disaster; Regional capacity for handling disasters
Some of the authors pointed out the importance of the geographical location of an airport, structural features as well as reachability [4, 24, 26]. Pandey et al. [15] showed that utilising geospatial data is beneficial for airport humanitarian response planning and that airport authorities are interested in tools that can help to plan logistical procedures.
While some of the authors suggested that cooperation between airports that struggle with similar challenges would have a positive outcome [10, 17], none of them explored the possible backbone of such cooperation. That finding, combined with the idea of structural features of airports having an impact on their humanitarian logistical procedures, led to defining the knowledge gap.
The specific methods applied in this research have been used in humanitarian aid-related research before, but on a local or national scale, as shown by Saldaña Ochoa & Comes, and Chen et al. [3, 13]. The global approach is a challenge due to the limited availability of reliable data, but if successful, it paves the way for more detailed research on a global scale, which could benefit less developed countries that often do not have resources for local advanced research and preparedness strategies.
Until now, practitioners in the field, such as Get Airports Ready for Disaster (GARD), have used straightforward methods for assessing the vulnerability of airports and had to prepare different strategies for each client. GARD’s capacity is minimal, and this research could lead to new ways for authorities to prepare, thanks to establishing collaborations directly with other airports facing similar challenges.
2.2 Research goal
The goal of this research is to (1) better understand the challenges that airports face when a natural disaster strikes and their preparedness activities. This understanding shall then be (2) translated into a list of socio-technical features influencing the level of preparedness and airport capabilities in facing a disaster. Identifying the key features is relevant for (3) building a database containing valuable humanitarian aid-related information about several airports worldwide, composed solely from publicly available sources. The focus on publicly available data is conditioned by the large number of airports being analyzed, which makes it impossible to conduct surveys and obtain information directly within the resources and time frame of this research.
3 METHODOLOGY
In order to find specific qualities and features that influence airports’ preparedness for a disaster, a thorough understanding of activities and the environment in which they take place is needed. This information was derived from a desk study accompanied by semi-structured interviews (table 3 in the Appendix lists organizations contacted for interviewing) with experts on airports’ disaster preparedness and performance, summarized in table 1. The next step was to translate identified challenges influencing the performance of an airport in a post-disaster scenario into socio-technical features to achieve a good starting point for the data mining process.

Fig. 2. 971 airports chosen to be analyzed, placed on a world map
The data mining process was composed of two main iterative phases. First, the identified socio-technical features of airports had to be translated into measurable data points - numerical, categorical, or descriptive. The second phase was retrieving data from publicly available sources, as described in more detail in the diagram in figure 6. When building a database from publicly available sources, it is crucial to have a strong understanding of what we want to describe, to allow for flexibility and easy replacement or adjustment of originally planned measures.
To start building the database, we chose vulnerable countries and airports using the INFORM Risk Index as the qualification criterion. First, a list of all airports located within these countries was exported. Next, the airports.csv file from OurAirports was used to select only airports currently operating, i.e., those that have scheduled services. An additional criterion was the airport type: heliports, seaplane bases, and closed airports were excluded, while small, medium, and large airports were chosen (the size of an airport was defined based on the number of scheduled flights as described by OurAirports’ data). These operations resulted in a list of 971 airports, with their names, coordinates, International Air Transport Association (IATA) codes, and International Civil Aviation Organization (ICAO) codes. This list formed the base for all mass queries applied via APIs to collect data for each airport. Figure 2 presents the 971 airports on a world map.
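The selection step described above can be sketched in Python as follows. This is a minimal illustration rather than the script used in this research: it assumes the column names type, iso_country, scheduled_service and iata_code as found in the public OurAirports file, and the sample rows and the set of vulnerable countries are hypothetical stand-ins.

```python
import csv
import io

# Tiny stand-in for OurAirports' airports.csv (illustrative rows only; the
# real file has more columns and far more records).
SAMPLE_CSV = """ident,type,name,iso_country,scheduled_service,iata_code
AAA1,large_airport,Example International Airport,ID,yes,CGK
AAA2,heliport,Example Heliport,ID,no,
AAA3,medium_airport,Example Regional Airport,HT,yes,PAP
AAA4,closed,Example Closed Airfield,HT,no,
AAA5,large_airport,Example Hub Airport,NL,yes,AMS
"""

# Hypothetical set of ISO country codes selected through a risk index.
VULNERABLE_COUNTRIES = {"ID", "HT"}
ALLOWED_TYPES = {"small_airport", "medium_airport", "large_airport"}


def select_airports(csv_text, countries, allowed_types=ALLOWED_TYPES):
    """Keep operating airports of an allowed type in the chosen countries."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if row["iso_country"] in countries
        and row["scheduled_service"] == "yes"
        and row["type"] in allowed_types
    ]


selected = select_airports(SAMPLE_CSV, VULNERABLE_COUNTRIES)
selected_codes = [row["iata_code"] for row in selected]
```

Heliports, closed airfields and airports outside the chosen countries drop out, leaving only the records that would feed the later API queries.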
4 BUILDING THE DATABASE
Data used in this research came from a multiplicity of sources in various data modalities and formats. In order to translate socio-technical features into comparable sets of numerical features, a number of conditions need to be taken into account, such as availability of data, methods of measuring and quantifying specific characteristics, their correlations, and level of importance. In order to keep track of changes and make the database easy to navigate, an SQLite database was built with the use of DB Browser software. The OSM queries and the GeoDB-cities API were connected to the database through Python queries, as seen in the attached GitLab repository. To add records and features to the database, outputs from various sources were converted into the .csv format. Results of OpenStreetMap (OSM) and API queries were automatically written into the database directly. A detailed description of each data source and steps taken in the process of extracting data can be found in B.1 and B.2.
Table 2. Description of the database.
Source | Feature | Type of data | Data handling | Relevance
OurAirports | iata | text | no additional handling needed | airport identification and location
OurAirports | airport_name | text | no additional handling needed | airport identification and location
OurAirports | latitude_deg | numerical | no additional handling needed | airport identification and location
OurAirports | longitude_deg | numerical | no additional handling needed | airport identification and location
OurAirports | country | text | no additional handling needed | airport identification and location
OurAirports | elevation_ft | numerical | empty fields imputed with mean value | airport identification and location
OurAirports | lighted | categorical | empty fields imputed with ‘0’ | runway description for assessing airport’s capacity and accessibility
OurAirports | max_length_ft | numerical | empty fields imputed with mean value | runway description for assessing airport’s capacity and accessibility
OurAirports | width_ft | numerical | empty fields imputed with mean value | runway description for assessing airport’s capacity and accessibility
OurAirports | airport_type | categorical | text values converted into categorical values ‘0’, ‘1’ | general assessment of the airport traffic size
OSM | seaport_count | numerical | manual verification | identifying potential alternative seaports within 100 km radius
OSM | airport_count | numerical | manual verification | identifying potential alternative airports within 100 km radius
OSM | build_count | numerical | manual verification | describing the surrounding within 5 km
OSM | industrial_count | numerical | manual verification | assessing airport’s cargo handling preparedness
OSM | tourism_count | numerical | manual verification |
OSM | terminal_count | numerical | manual verification | assessing airport’s capacity
OSM | runways_count | numerical | manual verification | assessing airport’s capacity
GeoDB | name_city_n | text | obtaining data about three closest cities | assessing the distance between the airport and potential casualties
GeoDB | dist_city_n | numerical | obtaining data about three closest cities | assessing the distance between the airport and potential casualties
GeoDB | population_city_n | numerical | obtaining data about three closest cities | assessing the number of potential casualties in the area
Global Airports | aptclass | categorical | text values converted into categorical values ‘0’, ‘1’ | assessing airport’s capacity: international/domestic
Global Airports | apttype | categorical | text values converted into categorical values ‘0’, ‘1’ | assessing airport’s capacity: Airport/Airstrip/Airfield
Global Airports | authority | categorical | text values converted into categorical values ‘0’, ‘1’ | assessing airport’s organisational structure: civil/military
Global Airports | humuse | categorical | text values converted into categorical values ‘0’, ‘1’ | assessing airport’s humanitarian operation preparedness
INFORM Index | natural_dis_risk | numerical | empty fields imputed with mean value | assessing regional disaster risk
INFORM Index | informrisk | numerical | empty fields imputed with mean value | assessing regional disaster preparedness
Logistics Performance Index | lpi_customs | numerical | | assessing regional logistical capacity and preparedness
Logistics Performance Index | lpi_infrastructure | numerical | | assessing regional logistical capacity and preparedness
GARD | gard | categorical | text values converted into categorical values ‘0’, ‘1’ | assessing airport’s humanitarian operation preparedness
Self calculated | airport_area | numerical | calculated based on OSM data | assessing airport’s capacity
Self calculated | population_around | numerical | calculated based on GeoDB data | assessing the number of potential casualties in the area
Self calculated | iso_country | text | no additional handling needed | identification purposes
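Writing query results into the SQLite database, as described in this section, could look roughly like the sketch below. It uses Python's standard sqlite3 module; the table layout, the helper names and the sample values are hypothetical and only borrow a few feature names from table 2, so the schema in the GitLab repository may differ.

```python
import sqlite3

# Columns that query results may be written to (subset of table 2 features).
FEATURE_COLUMNS = ("seaport_count", "airport_count", "terminal_count")


def create_db(conn):
    """Create a minimal airports table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS airports ("
        "iata TEXT PRIMARY KEY, airport_name TEXT, "
        "seaport_count INTEGER, airport_count INTEGER, terminal_count INTEGER)"
    )


def add_airport(conn, iata, name):
    """Insert an airport record unless it already exists."""
    conn.execute(
        "INSERT OR IGNORE INTO airports (iata, airport_name) VALUES (?, ?)",
        (iata, name),
    )


def write_feature(conn, iata, column, value):
    """Store one feature value; the column name is checked against a whitelist."""
    if column not in FEATURE_COLUMNS:
        raise ValueError(f"unknown feature column: {column}")
    conn.execute(f"UPDATE airports SET {column} = ? WHERE iata = ?", (value, iata))


conn = sqlite3.connect(":memory:")  # a file path would be used in practice
create_db(conn)
add_airport(conn, "CGK", "Example International Airport")
write_feature(conn, "CGK", "terminal_count", 3)
conn.commit()
row = conn.execute(
    "SELECT iata, terminal_count, seaport_count FROM airports"
).fetchone()
```

Features that have not been collected yet stay NULL, which keeps the empty fields visible for the imputation step.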
4.1 The database
As the plan is to compare airports based on numerical features, each data modality was turned into a form suitable for mathematical processing. Depending on the modality of data, various preprocessing methods were applied, based on several scientific sources [7, 9, 20, 21]; they can be seen in Appendix C. The final list of all airports and corresponding features was built in DB Browser and made available through the GitLab repository, both as a .csv file and an SQLite database. Features selected for each airport, together with the corresponding source, preprocessing methods, and a description of their relevance for assessing disaster preparedness, are presented in table 2.
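Two of the handling rules listed in table 2, imputing empty numerical fields with the mean value and converting text values into ‘0’/‘1’, can be illustrated with a short Python sketch. The function names and sample values are hypothetical, and missing fields are assumed to be represented as None.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def impute_zero(values):
    """Replace missing entries (None) with 0, as done for the lighted flag."""
    return [0 if v is None else v for v in values]


def encode_binary(values, positive):
    """Map a two-valued text feature to 1 (equals `positive`) or 0."""
    return [1 if v == positive else 0 for v in values]


elevation_ft = impute_mean([10.0, None, 30.0])
lighted = impute_zero([1, None, 0])
authority = encode_binary(["civil", "military", "civil"], positive="military")
```

Mean imputation keeps the column average unchanged, at the cost of making airports with missing data look more typical than they may be.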
5 UNSUPERVISED MACHINE LEARNING
This section describes the process of applying an unsupervised machine learning algorithm, Self Organising Maps (SOM), on the data set built in previous steps. Appendix B shows an initial trial with other clustering algorithms and explains why we selected SOM to proceed with the experiment.
5.1 Self Organising Maps
In order to cluster airports based on their distinctive features relevant for disaster preparedness, an unsupervised machine learning algorithm was applied with the use of the SOMPY Python library [22]. The whole process was thoroughly documented in the attached GitLab repository.
5.1.1 Training. The data set consisting of 971 records with airports and their features was split into two smaller sets: the training set with randomly chosen 70% of all records, and the testing set with the remaining 30%, resulting in the training set with 650 data points and the test set with the remaining data points.
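A random 70/30 split of this kind can be sketched with the Python standard library. The function name, the fixed seed and the toy record list are assumptions made for illustration; the exact split procedure used in the repository is not restated here.

```python
import random


def split_records(records, train_fraction=0.7, seed=0):
    """Shuffle the records and cut them into a training and a testing set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = [f"airport_{i}" for i in range(10)]  # toy stand-in for the records
train_set, test_set = split_records(records)
```

Fixing the seed makes the split repeatable, so later training runs are compared on the same records.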
After the pre-processing was finished, all records from the training set were transformed into input vectors that can be processed by the SOM. For the first attempt, each vector was a series of 36 numerical values, describing all the chosen features for each airport. Within the SOMPY API, each vector was normalised before the training of the SOM.
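The normalisation itself was handled inside the library. As an illustration of what such a step does, the sketch below applies a per-feature variance normalisation (zero mean, unit variance); that this is the scheme applied here is an assumption, and the function name and sample vectors are hypothetical.

```python
import math


def normalise_columns(vectors):
    """Scale each feature (column) to zero mean and unit variance."""
    columns = list(zip(*vectors))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    # A constant column has zero spread; map it to 0.0 instead of dividing by 0.
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in vectors
    ]


vectors = [[1.0, 10.0], [3.0, 10.0], [5.0, 10.0]]
normalised = normalise_columns(vectors)
```

Without such scaling, a feature measured in thousands of feet would outweigh a count of terminals in every distance computation.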
The training phase was repeated 100 times for various, randomly chosen sizes of the final SOM map, in order to find the best performing one, based on the calculated topographic and quantisation error of each training run. The smaller these values, the better the performance of the feature map [1]. Once the best performing map was chosen, a visualisation of each feature on a map was performed, as shown in figure 3.
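To make the two error measures concrete, the following self-contained sketch trains a very small SOM in pure Python and scores it. It is not SOMPY and not the code used in this research: the training schedule, the rectangular grid with 8-cell adjacency, and the use of the sum of both errors to rank map sizes are all assumptions made for illustration.

```python
import math
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_best_units(codebook, vector):
    """Indices of the best and second-best matching units for a vector."""
    ranked = sorted(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vector))
    return ranked[0], ranked[1]


def train_som(data, rows, cols, epochs=20, seed=0):
    """Online SOM training with linearly decaying rate and neighbourhood."""
    rng = random.Random(seed)
    codebook = [list(rng.choice(data)) for _ in range(rows * cols)]
    coords = [(i // cols, i % cols) for i in range(rows * cols)]
    steps = epochs * len(data)
    for t in range(steps):
        frac = t / steps
        rate = 0.5 * (1.0 - frac)
        sigma = max(max(rows, cols) / 2.0 * (1.0 - frac), 0.5)
        vector = rng.choice(data)
        b_row, b_col = coords[two_best_units(codebook, vector)[0]]
        for unit, (r, c) in zip(codebook, coords):
            h = math.exp(-((r - b_row) ** 2 + (c - b_col) ** 2) / (2 * sigma ** 2))
            for k in range(len(unit)):
                unit[k] += rate * h * (vector[k] - unit[k])
    return codebook, coords


def quantisation_error(codebook, data):
    """Mean distance between each vector and its best matching unit."""
    return sum(
        math.sqrt(sq_dist(codebook[two_best_units(codebook, v)[0]], v)) for v in data
    ) / len(data)


def topographic_error(codebook, coords, data):
    """Share of vectors whose two best units are not adjacent on the grid."""
    misses = 0
    for v in data:
        first, second = two_best_units(codebook, v)
        (r1, c1), (r2, c2) = coords[first], coords[second]
        if max(abs(r1 - r2), abs(c1 - c2)) > 1:
            misses += 1
    return misses / len(data)


def search_map_size(data, trials=5, seed=0):
    """Train maps of random sizes and keep the one with the lowest summed error."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        rows, cols = rng.randint(2, 5), rng.randint(2, 5)
        codebook, coords = train_som(data, rows, cols, seed=rng.randrange(10 ** 6))
        score = quantisation_error(codebook, data) + topographic_error(codebook, coords, data)
        if best is None or score < best[0]:
            best = (score, rows, cols, codebook, coords)
    return best


data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
score, rows, cols, codebook, coords = search_map_size(data, trials=3)
```

The quantisation error rewards maps whose cells sit close to the data, while the topographic error penalises maps that tear neighbouring data apart, so the two are best read together.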
5.1.2 Analysing results. Initially, the clustering was achieved only on a small number of airports. By comparing the result of SOM to the individual representation in figure 3, we discovered that there were multiple dominating features that led the process of clustering. The dominant features were the categorical ones, a problem known as the curse of dimensionality [23]. Simply put, there were too many 0/1 dominating features that influenced the whole clustering process.
5.1.3 Adjusting input vectors. Given the result of the first attempt at applying the SOM algorithm on the whole data set, a number of attempts at adjusting the input vectors were performed.

Fig. 3. Visualisations of each feature in the first attempt at SOM. Each airport is placed in one of the cells on the map - the brighter the cell colour, the higher the value of the feature (e.g. more terminals) or the value is equal to 1 for categorical features (e.g. lighted airstrip - yes). From this figure, we can derive that there are a number of features that may become dominating, due to their distribution - concentration of the bright cells in a small area. This may lead the clustering algorithm to focus on these strong features, which does not necessarily reflect their importance in real life. We can also observe some correlations - large airports have more terminals and runways, which tend to be longer and wider than at medium and small airports. While it is a very straightforward conclusion, it can serve as a verification tool.
First, the most dominating features were excluded from the data set, based on the individual representation of each feature in figure 3. The categorical feature of airport type was changed from the binary representation to a translation of small, medium, large into numerical values: 1, 2, 3. While this should not be done for features describing non-continuous categories, the airport type does in fact sort the airports from the ones with the smallest traffic to the largest; therefore it is acceptable to translate it into continuous values.
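The translation of the airport type into ordered numerical values amounts to a simple lookup, sketched below. The dictionary and function names are hypothetical; the labels follow the small, medium, large wording used above.

```python
# Ordered categories: a larger value means more scheduled traffic.
AIRPORT_TYPE_ORDER = {"small": 1, "medium": 2, "large": 3}


def encode_airport_type(labels):
    """Translate ordered airport-type labels into the values 1, 2, 3."""
    return [AIRPORT_TYPE_ORDER[label] for label in labels]


encoded = encode_airport_type(["small", "large", "medium"])
```

An unknown label raises a KeyError, which is preferable to silently assigning it a position in the ordering.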
5.1.4 Adjusted input vectors - results. Again, the remaining features went through all the steps of pre-processing, were transformed into input vectors and normalised. The result of running the SOM algorithm on input vectors reduced to 20 features is represented in figure 4.
When analysing the SOM cell by cell, we can observe that cell 15, which consists of 14 records, represents airports that have no seaports in their vicinity, have 2-4 alternative airports, have 1-2 terminals, are all of the medium traffic type and have a natural disaster risk between 4.0 and 6.7. For the rest of the features, no dominant value exists; there is a broad representation of each feature.
Another cell, number 46, which consists of 11 records, includes airports with a large number of alternative airports - between 6 and 27, 0-1 terminals, small traffic and a higher natural disaster risk than the previous group - between 5.8 and 7.7.
5.1.5 Adjusted input vectors - clustering. The next step would be to cluster individual cells into groups. An example result is shown in figure 5. Applying the K-Means algorithm led to defining four key groups of airports.
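Grouping SOM cells with K-Means can be sketched as below. This is a plain-Python illustration under stated assumptions, not the implementation used in this research: the cell vectors are made-up two-dimensional stand-ins for the trained codebook, centroids are initialised from randomly sampled cells, and k = 4 mirrors the four groups mentioned above.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[j])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return assign(points, centroids), centroids


# Hypothetical SOM cell vectors (one per cell of a trained map).
cell_vectors = [
    [0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0],
    [0.0, 5.0], [0.0, 5.1], [5.0, 0.0], [5.1, 0.0],
]
labels, centroids = kmeans(cell_vectors, k=4)
```

Because the result depends on the random initialisation, several seeds would normally be compared before the groups are interpreted.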

Fig. 4. Results of the adjusted SOM. a) shows a color changing spectrum that visualizes the consistency in clustering, b) shows the satellite images of airports that were closest to each best matching unit, c) shows an overlap of the color spectrum and the satellite images.
5.1.6 Verification. In order to verify the way the SOM operates, a vector with verification data was created and added to the input data. This vector consisted of feature values nearly identical to one of the records in cell 46. The algorithm was run again and the result was positive - the verification vector was added to the cell with other similar ones, showing that the SOM operates correctly.
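The idea behind this check can be shown with a small sketch: a record and a nearly identical copy should share the same best matching unit. Unlike the procedure described above, the sketch does not retrain the map; it assumes a fixed, made-up codebook and hypothetical feature values.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(codebook, vector):
    """Index of the SOM cell whose weight vector is closest to the input."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], vector))


codebook = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]]  # hypothetical trained cells
record = [0.9, 1.1]                               # an existing airport record
probe = [value + 0.01 for value in record]        # near-identical verification vector

record_cell = best_matching_unit(codebook, record)
probe_cell = best_matching_unit(codebook, probe)
same_cell = record_cell == probe_cell
```

A probe that lands in a different cell than its template would point to unstable training or to a record lying close to a cell boundary.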
5.2 Using the SOM map in practice
Regardless of the current performance of the clustering algorithm, or rather, the level of preparedness of the input data - since those two are strongly dependent - we can discuss how the proposed approach could be used in practice.
The attached repository allows for investigating the output map in detail. With a result like the one shown in figure ??, it can be derived which airports were put together in a cell, meaning which ones were chosen as the similar ones. This is the starting point for determining in what areas these airports could cooperate with one another. Sometimes, the similarity will result from a specific dominating feature, with others fairly different; therefore it is important to analyse the result before stating which airports are similar, rather than only looking at their cell membership. On this cell-level analysis we can also find very small groups of 2-3 airports that are grouped together, which could constitute an opportunity for a stronger cooperation. A higher level of similarity can be derived from additional clustering of cells, as presented in figure 5. Here bigger groups of airports are formed - while still different from one another, there will be a bigger diversity within members of each group. This can form a base for another type of cooperation, with more members who might not be identical, but still have some strong similarities. Here again it is important to analyse what the common features are that mainly influenced the grouping.

Fig. 5. Result of clustering a SOM. Here, on top of the original SOM clustering, an additional K-means clustering is performed. Cells from 5.1.4 were put into larger groups in order to find 4 main types of airports. On the right we can observe how similar the satellite images of airports grouped in specific cells are, as well as how distinctively different each cluster is.
An example described in section 5.1.4, with airports grouped in one cell showing strong similarity in the low number of alternative seaports and airports, and medium traffic type, could be used to form a cooperation focusing on ways of preparing an airport with these specific conditions. Even though the airports themselves can be in distant parts of the world, their preparedness strategies can be similar, given their dominant features. Of course, these are only a couple of areas in which these airports can be seen as similar, and it is important to note the possible organisational and cultural differences. While the airport authority feature aims at describing the possible organisational scheme, there still might be more factors at play.
To sum up, the SOM map can be used as a tool to quickly group and find the dominating features of a big group of airports. The better the data describing the grouped institutions, the more accurate the result will be. It is easy to visualise and interpret, and it can be used by humanitarian aid and aviation experts without advanced programming or mathematical skills. A task of grouping such a big number of records while taking into account more than 20 factors would be impossible to perform by hand; the approach therefore combines a sophisticated unsupervised machine learning algorithm with an easy to interpret output.
6 LIMITATIONS
The quality of the data sources used in the research can sometimes be contested, as the level of detail available for various airports and their surroundings was not always equal, which may lead to inaccurate results. This is also a problem with official sources widely used by the humanitarian community, such as the Logistics Capacity Assessment. Interviewees mentioned the importance of access to dynamic data that describes the state of each airport and its surroundings at a precise moment in time, after a disaster strikes, because the static information gathered in earlier assessments can be inaccurate the moment a disaster strikes. However, interviewees involved in preparedness programs rather than immediate response operations underlined the importance of building comprehensive data sets with static information to better assess what can be done ahead of a tragic event.
Another challenging factor is the accuracy of assumptions made, especially for assessing airport connectivity. As shown by historical disasters, the inability to distribute humanitarian relief from the airport to the population in need can undermine the airport’s operations and preparedness. A more sophisticated and accurate way of quantifying the level of connectivity could be used in future research.
7 DISCUSSION AND CONCLUSION
The database built in this research is a valuable resource for future clustering analysis or future research related to airports’ preparedness for humanitarian disasters. It can be further analyzed in more detailed research, updated accordingly, and used to assess airports’ vulnerability and preparedness. From the scientific perspective, this research shows that there are now ways of analyzing complex, specific challenges with a global overview based on numerous publicly available data sets. It also shows that scientists need to be very careful when using sources that are not strictly scientific and that building a specific, tailored database is a lengthy, challenging process. Nevertheless, it can be achieved not only by IT professionals but also by multidisciplinary researchers.
This research provided a valuable framework for approaching complex socio-technical environments of airports and their disaster preparedness, through building a database with relevant features, based on interviews and literature review, using only publicly available data, followed by a comprehensive data selection, collection and pre-processing. The challenges and problems encountered along the way, both solved and unsolved, can form a valuable tool for other professionals and scientists willing to conduct similar research, not only related to the domain of aviation and disaster preparedness.
An additional finding is that we identified the need for a common, reliable database with all relevant information about airports in vulnerable locations. The one designed during this research could form a base for one built with official data sources that are otherwise unavailable to the public. With that, however, comes the challenge of security: since detailed information about airports can be viewed as sensitive data, access to such a database should be regulated.
7.1 Future research
The ideas for future research can be divided into three sections: (1) related to the data mining and the process of building the database, (2) data pre-processing and applying an unsupervised clustering algorithm and (3) using the results in various ways in order to improve airports’ disaster preparedness.
Building a database solely from publicly available sources has some drawbacks, as discussed in section 6, such as limited trustworthiness and inability to retrieve the exact types of information that are needed in order to describe specific features. In the future, it is worth considering building a similar database with direct involvement of the airports that are being described, with the use of surveys and possible involvement of international humanitarian and aviation related organisations such as ACI or OCHA. This would allow for retrieving more specific, up-to-date information. Moreover, if regularly updated and maintained, it could become a useful resource for airports that themselves would like to know more about capabilities of alternative ports in the region, not only for research purposes, but for operations once a disaster strikes and help from neighbouring ports is needed. Other scientists could also use such a database for various additional analyses, saving time on gathering the data and focusing on what can be derived from it.
However, the database that was built in this research is itself a valuable resource for performing other research related to airports’ preparedness for humanitarian disasters. With additional iterations of the data pre-processing, there is room for gathering insightful knowledge on similarities between airports that would form a solid base for establishing cooperations. In order to achieve that, future research should focus on identifying the dominating features and adjusting the algorithm accordingly. This could require more sophisticated methods of data pre-processing and automating the process of analysing results, in order to quickly pick up combinations of features that cannot offer trustworthy results.
Building policy advice based on the database could be achieved by identifying airports that are especially vulnerable, due to their intrinsic features and capabilities. This process would have to be accompanied by a thorough analysis of historical events that took place at similar airports, and the lessons learned could be used for improving preparedness of those that might face similar challenges in the future, leading to achieving the full potential of this research.
REFERENCES
[1] Le Anh Tu. 2020. Improving Feature Map Quality of SOM Based on Adjusting the Neighborhood Function. Sustainability in Urban Planning and Design (2020). https://doi.org/10.5772/intechopen.89233
[2] Jean-François Arvis, Lauri Ojala, Christina Wiederer, Ben Shepherd, Anasuya Raj, Karlygash Dairabayeva, and Tuomas Kiiski. 2018. Connecting to Compete 2018. Technical Report. The World Bank. https://doi.org/10.1596/29971
[3] Ning Chen, Lu Chen, Yingchao Ma, and An Chen. 2019. Regional disaster risk assessment of china based on self-organizing map: Clustering, visualization and ranking. International Journal of Disaster Risk Reduction 33, October 2018 (2019), 196–206. https://doi.org/10.1016/j.ijdrr.2018.10.005
[4] Sunkyung Choi and Shinya Hanaoka. 2017. Diagramming development for a base camp and staging area in a humanitarian logistics base airport. Journal of Humanitarian Logistics and Supply Chain Management 7 (06 2017), 00–00. https://doi.org/10.1108/JHLSCM-12-2016-0044
[5] Deutsche Post DHL Group. 2019. GoHelp Program - Disaster Preparedness and Response. Technical Report. https://www.dpdhl.com/en/responsibility/society-and-engagement/disaster-management.html
[6] Deutsche Post DHL Group. 2021. Disaster Preparedness - Get Airports Ready for Disaster. https://www.dpdhl.com/en/sustainability/social-impactprograms/disaster-management/disaster-preparedness.html
[7] Purva Huilgol. 2020. Feature Transformation and Scaling Techniques to Boost Your Model Performance. https://www.analyticsvidhya.com/blog/2020/07/types-of-feature-transformation-and-scaling/
[8] Humanitarian Data Exchange. 2019. Global airports - Humanitarian Data Exchange. https://data.humdata.org/dataset/global-airports
[9] Gota Kikugawa, Yuta Nishimura, Koji Shimoyama, Taku Ohara, Tomonaga Okabe, and Fumio S Ohuchi. 2019. Data analysis of multi-dimensional thermophysical properties of liquid substances based on clustering approach of machine learning. Chemical Physics Letters 728 (2019), 109–114. https://doi.org/10.1016/j.cplett.2019.04.075
[10] Jakub Kraus, Vladimír Plos, and Peter Vittek. 2014. The New Approach to Airport Emergency Plans. International Journal of Aerospace and Mechanical Engineering 8, 8 (2014), 2406–2409. https://publications.waset.org/vol/92
[11] MIT. 2021. Python Wrapper to access the Overpass API. https://github.com/DinoTools/python-overpy
[12] M. Mogley. 2017. GeoDB Cities API Documentation. https://rapidapi.com/wirefreethought/api/geodb-cities
[13] Karla Saldana Ochoa and Tina Comes. 2021. A Machine learning approach for rapid disaster response based on multi-modal data. The case of housing shelter needs. arXiv:2108.00887 [cs.LG]
[14] OurAirports. 2007. About OurAirports. https://ourairports.com/about.html#overview
[15] B. H. Pandey, Carlos Ventura, P. RioFrio, J. Pummell, and S. Dowling. 2014. Development of response plan of airport for mega earthquakes in Nepal. NCEE 2014 - 10th U.S. National Conference on Earthquake Engineering: Frontiers of Earthquake Engineering (01 2014). https://doi.org/10.4231/D3TH8BN7T
[16] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[17] Abdussamet Polater. 2018. Managing airports in non-aviation related disasters: A systematic literature review. International Journal of Disaster Risk Reduction 31 (2018), 367–380. https://doi.org/10.1016/j.ijdrr.2018.05.026
[18] Abdüssamet Polater. 2020. Airports’ role as logistics centers in humanitarian supply chains: A surge capacity management perspective. Journal of Air Transport Management 83 (2020), 101765. https://doi.org/10.1016/j.jairtraman.2020.101765
[19] QGIS Development Team. 2009. QGIS Geographic Information System. Open Source Geospatial Foundation. http://qgis.org
[20] Jimin Qian, Nam Phuong Nguyen, Yutaka Oya, Gota Kikugawa, Tomonaga Okabe, Yue Huang, and Fumio S Ohuchi. 2019. Introducing self-organized maps (SOM) as a visualization tool for materials research and education. Results in Materials 4 (2019), 100020. https://doi.org/10.1016/j.rinma.2019.100020
[21] H Ritter and T Kohonen. 1989. Self-organizing semantic maps. Biological Cybernetics 61, 4 (1989), 241–254. https://doi.org/10.1007/BF00203171
[22] Sevamoo. 2018. sevamoo/SOMPY. https://github.com/sevamoo/SOMPY
[23] G. V. Trunk. 1979. A Problem of Dimensionality: A Simple Example. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1, 3 (1979), 306–307. https://doi.org/10.1109/TPAMI.1979.4766926
[24] Michael Veatch and Jarrod Goentzel. 2018. Feeding the bottleneck: airport congestion during relief operations. Journal of Humanitarian Logistics and Supply Chain Management 8, 4 (jan 2018), 430–446. https://doi.org/10.1108/JHLSCM-01-2018-0006
[25] Bartel Walle and Julie Dugdale. 2012. Information management and humanitarian relief coordination: findings from the Haiti earthquake response. Int. J. of Business Continuity and Risk Management 3 (01 2012), 278–305. https://doi.org/10.1504/IJBCRM.2012.051866
[26] Martijn Warnier, Vincent Alkema, T. Comes, and Bartel Walle. 2020. Humanitarian access, interrupted: dynamic near real-time network analytics and mapping for reaching communities in disaster-affected countries. OR Spectrum 42 (09 2020). https://doi.org/10.1007/s00291-020-00582-0
[27] Sanford Weisberg. 2001. Yeo-Johnson Power Transformations. Department of Applied Statistics, University of Minnesota 2 (2001), 1–4. http://stat.umn.edu/arc/yjpower.pdf
A PROCESS FLOW

Fig. 6. Process flow of data mining.
Table 3. Affiliation of interviewees
Interviewee        Organisation
Chris Weeks        GARD
Virginie Bohl      OCHA, IMPACCT Working Group
Thomas Romig       ACI
B CLUSTERING COMPARISON

Fig. 7. Clustering comparison. a) Clustering using K-Means and an algorithm for dimensionality reduction. b) Clustering using DBSCAN and an algorithm for dimensionality reduction. c) Clustering with a spectral clustering algorithm and an algorithm for dimensionality reduction. After trying these methods, we decided to work with Self-Organizing Maps (SOM). We chose SOM because the visualization of its results allows a user-friendly interaction and its visual output helps in understanding the clustering.
B.1 Data sources
B.1.1 OSM. In order to extract data from OSM, Overpass turbo was used, a web-based data mining tool designed to run OSM API queries and present the results on a map. Since data needed to be extracted for over 900 airports, multiple scripts were written with the use of the OverPy API, published under the MIT license [11]. Detailed documentation of the scripts and queries can be found in the attached GitLab repository.
B.1.2 OurAirports. OurAirports is a free and public service that maintains data about airports around the world. Similarly to OSM, it is run by volunteers, with members creating records individually, but at the same time much of the information comes from official governmental institutions such as the U.S. Federal Aviation Administration [14]. In addition to exploring an online interactive map-based tool, users can also download daily updated files with data records of all airports that are part of the service. For this research, the dataset of all airports and runways was used.
B.1.3 Global airports. The most comprehensive publicly available dataset aimed at providing information on disaster logistics is called Global airports and was published by the Humanitarian Data service [8]. Officially coordinated by the World Food Programme and based on openly available data from sources such as OSM and OurAirports, it also contains inputs from partners through the Logistics Cluster and Logistics Capacity Assessments [8]. Even though the dataset is updated, according to a WFP representative interviewed, for many places the data has not been checked since the original upload in 2013. Furthermore, the dataset contains fairly basic information on airports. Data points presented in the table are not available for every airport in the set.
B.1.4 The Logistics Performance Index. The Logistics Performance Index (LPI) provides information on how easy or difficult it is to transport goods in the analysed countries. The World Bank, together with various logistics-related partner organisations, conducts the survey every two years [2]. While the index is aimed at assessing logistical capacity in the context of trade and merchandise, some of its indicators are relevant for humanitarian logistics, such as the ones chosen for this research: the assessment of customs procedures and the assessment of the general quality of trade- and transport-related infrastructure.
B.1.5 The INFORM Risk Index. Led by the European Commission, INFORM is a global, open-source risk index for humanitarian disasters and crises that describes three dimensions: hazard exposure, vulnerability and lack of coping capacities. In addition to being the qualification criterion for the final airport database, parts of the INFORM Risk Index were also used to characterize airports.
B.2 Extracting data
B.2.1 Airport surroundings. Two strategies in OSM were tested in order to assess the surroundings of each airport. First, the "landuse" tag was explored: all the nodes containing information on land use within a 5 km radius of each airport were extracted. However, this led to inconsistent results. Visual validation of multiple query outputs led to the conclusion that building-related nodes are highly overrepresented compared to fields or other unused spaces. Therefore, for many airports, the result only showed a number of buildings within that radius, and no information describing the empty fields that were the true dominant surrounding.
The second strategy, which led to more representative results, was based purely on the number of nodes with the tag "building". The assumption was that if the buildings are well tagged in OSM, simply the number of those nodes within the radius would describe how densely built the surroundings of the airport are. The lower the number of buildings around, the more useful space there is for organising humanitarian aid. A visual validation of multiple records was conducted, with a special focus on the outliers: airports with a very low or very high number of buildings around. The surroundings of some remote airports were underrepresented, resulting in 0 buildings reported. While this was not true, the actual number of buildings was very small and the result was still useful.
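The building-count strategy can be illustrated with a minimal sketch that assembles an Overpass QL query for nodes tagged "building" within a given radius. This is not the authors' script (those are documented in their repository); the function name, the timeout value and the example coordinates are assumptions made for illustration.

```python
def building_query(lat: float, lon: float, radius_m: int = 5000) -> str:
    """Build an Overpass QL query for nodes tagged "building" around a point.

    The 5 km default mirrors the radius mentioned in the text; the
    60-second timeout is an arbitrary illustrative choice.
    """
    return (
        "[out:json][timeout:60];\n"
        f'node["building"](around:{radius_m},{lat},{lon});\n'
        "out;"
    )


# Illustrative coordinates only (not taken from the airport database).
query = building_query(27.6966, 85.3591)
print(query)

# Hypothetical use with OverPy (requires network access, so it is commented out):
# import overpy
# result = overpy.Overpass().query(query)
# building_count = len(result.nodes)
```

Counting the returned nodes then gives the density indicator described above; a count of 0 for a remote airport would need the same manual check the text mentions.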
B.2.2 Alternative airports. To find an alternative airport, we focused on the surroundings within a 100 km radius. Unlike when choosing airports for the main database, for alternative ones there was no exclusion of those that are smaller or do not have an IATA code. The assumption was that any kind of airport in close vicinity to the main one might work as a supporting space, if not for landing the same size of airplanes, then perhaps for storage and other humanitarian operations. Since airports are well tagged in OSM, the validation of results was positive: no overlooked airports were found. However, depending on the quality and density of roads, an airport within a 100 km radius might in fact be many hours away, which would not make it a useful alternative. In future research it is worth finding a more accurate qualifying feature than the radius.
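The radius criterion amounts to a great-circle distance filter. The following sketch, under the assumption that candidate airports are available as records with coordinates (the field names and sample points are hypothetical), selects every other airport within 100 km of the main one using the haversine formula.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def alternatives_within(main, candidates, radius_km=100.0):
    """Return candidates (other than the main airport) within radius_km."""
    return [
        c for c in candidates
        if c["id"] != main["id"]
        and haversine_km(main["lat"], main["lon"], c["lat"], c["lon"]) <= radius_km
    ]


# Hypothetical records for illustration.
main = {"id": "MAIN", "lat": 0.0, "lon": 0.0}
candidates = [
    main,
    {"id": "NEAR", "lat": 0.0, "lon": 0.5},  # roughly 56 km away
    {"id": "FAR", "lat": 0.0, "lon": 2.0},   # roughly 222 km away
]
nearby = alternatives_within(main, candidates)
print([c["id"] for c in nearby])
```

As the text notes, straight-line distance ignores road quality, so a travel-time estimate would be a more accurate criterion where such data exists.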
B.2.3 Alternative seaports. Similarly to alternative airports, alternative seaports were inspected within a radius of 100 km. The vast majority of results showed 0 seaports, which was validated thoroughly and proved to be true. Validation was also conducted for high seaport counts: for some records, the counted result was higher than the actual number of ports because of multiple tags within the same seaport. It did, however, indicate the size of the seaport, as the additional nodes often indicated more seaport terminals or storage facilities. Given the small number of records that indicated seaports at all, all results higher than 0 were validated and manually corrected if needed.
B.2.4 Tourism vs. industry. In order to assess how well an airport is equipped to handle a sudden influx of cargo, and not only a growth in passenger turnaround, it was decided to assess the surroundings of the airport. Based on insights from the interview with Chris Weeks of GARD, it was determined that airports situated in mainly touristic destinations are less likely to have a good capacity for handling cargo. Therefore, for each airport the number of nodes tagged as "industrial" and "tourism amenities" was calculated. In order to account for over- or underrepresentation of certain regions, a ratio of tourism- and industry-related facilities was calculated, based on the assumption that if a region is under- or overrepresented in OSM, this will be the case for both types of amenities.
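A small sketch shows why a ratio is robust to uneven mapping coverage. The direction of the ratio (tourism over industry) and the handling of a zero industrial count are assumptions; the text does not specify either.

```python
def tourism_industry_ratio(tourism_count, industrial_count):
    """Ratio of tourism-tagged to industrial-tagged nodes around an airport.

    Returns None when no industrial nodes were found, so that the record
    can be reviewed manually (an assumption, not the authors' rule).
    """
    if industrial_count == 0:
        return None
    return tourism_count / industrial_count


# A well-mapped region and a region mapped ten times more sparsely
# yield the same ratio if both tag types are affected equally.
well_mapped = tourism_industry_ratio(40, 10)
sparsely_mapped = tourism_industry_ratio(4, 1)
print(well_mapped, sparsely_mapped)
```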
B.2.5 Runways. The number of runways was calculated for each airport by counting the number of nodes/ways/relations with a "runway" tag. All outliers were manually validated: those that resulted in 0 runways were corrected, since a functioning airport cannot have 0 runways. The same was done for all records that showed more than two runways, since it is not very common for airports to have multiple runways, especially in remote places, which is where most of the airports in the database are.
B.2.6 Cities and distances. In order to assess how distant an airport is from the population it might be serving when a disaster strikes, the three closest cities for each record were found, together with the direct distance (not by road) and population of each city. For this purpose, the GeoDB Cities API was used [12]. Based on the coordinates of each airport, the three closest cities within 100 km that contained population information were chosen. Validation was performed for a number of randomly chosen records and outliers, which were manually corrected if needed. The API works with GeoNames and WikiData, which, similarly to OSM, are considered trustworthy sources thanks to the user community input and validation scheme.
B.2.7 Population. Data gathered to describe surrounding cities was used to calculate the general population around each airport, as the sum of the population of the three closest cities found by the GeoDB Cities API.
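The selection of the three closest cities and the population sum can be sketched as follows. The record layout (name, distance in km, population that may be missing) is a hypothetical stand-in for whatever the GeoDB Cities API returned; no API call is made here.

```python
def closest_cities(cities, radius_km=100.0, k=3):
    """Pick the k nearest cities within radius_km that have population data."""
    eligible = [
        c for c in cities
        if c["population"] is not None and c["distance_km"] <= radius_km
    ]
    return sorted(eligible, key=lambda c: c["distance_km"])[:k]


def surrounding_population(cities, radius_km=100.0, k=3):
    """Sum the population of the selected cities."""
    return sum(c["population"] for c in closest_cities(cities, radius_km, k))


# Hypothetical cities around one airport.
cities = [
    {"name": "A", "distance_km": 12.0, "population": 50_000},
    {"name": "B", "distance_km": 30.0, "population": None},       # no population info
    {"name": "C", "distance_km": 45.0, "population": 20_000},
    {"name": "D", "distance_km": 80.0, "population": 10_000},
    {"name": "E", "distance_km": 95.0, "population": 5_000},
    {"name": "F", "distance_km": 120.0, "population": 1_000_000},  # outside the radius
]
selected = closest_cities(cities)
total = surrounding_population(cities)
print([c["name"] for c in selected], total)
```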
B.2.8 Airport area. In order to assess the storage capacity as well as the area available for setting up a humanitarian hub, the area of each airport was calculated. In OSM, each airport is indicated not only by a single node, but also by a relation that indicates its borders. This geodata was exported and analysed with the QGIS software [19]. Using its built-in features, the area of each airport was calculated. Validation was conducted on a random sample of results and the method proved to be effective.
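The areas in this research were computed with QGIS. For readers without QGIS, the idea can be approximated in a few lines: project the boundary onto a local plane and apply the shoelace formula. The sketch below is such an approximation, not the QGIS computation; it assumes a simple (non-self-intersecting) boundary that is small enough for a local equirectangular projection to be reasonable.

```python
from math import cos, radians

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius


def polygon_area_m2(boundary):
    """Approximate area in square metres of a small polygon.

    boundary: list of (lat, lon) pairs in degrees, ordered along the outline.
    """
    lat0 = sum(lat for lat, _ in boundary) / len(boundary)
    lon0 = sum(lon for _, lon in boundary) / len(boundary)
    scale = cos(radians(lat0))  # shrink east-west distances with latitude
    pts = [
        (EARTH_RADIUS_M * radians(lon - lon0) * scale,
         EARTH_RADIUS_M * radians(lat - lat0))
        for lat, lon in boundary
    ]
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        twice_area += x1 * y2 - x2 * y1  # shoelace term
    return abs(twice_area) / 2.0


# Illustrative square of 0.01 degrees per side near the equator.
square = [(0.0, 0.0), (0.0, 0.01), (0.01, 0.01), (0.01, 0.0)]
area = polygon_area_m2(square)
print(round(area))
```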
C DATA PRE-PROCESSING
In order for airports to be comparable by the unsupervised machine learning algorithms, the features that describe them need to be turned into a form suitable for mathematical processing.
In this section, the pre-processing of text, categorical and numerical features is described.
C.1 Empty fields
Because various data sources were used, there were a number of empty fields for some features. Depending on the feature, these empty fields were filled either with zeroes or with the mean value of all existing records. Missing fields in the features describing whether the runway is lighted and whether a GARD training had been conducted before were filled with zeroes, as it was decided that if there is no information available, it is safer to assume the negative outcome. Missing values for the elevation, the length of the runway, the width of the runway and the INFORM and LPI indicators were replaced with the mean values.
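The two filling rules can be sketched as below. The field names and sample records are hypothetical, and only a subset of the features named in the text is shown.

```python
from statistics import mean

# Hypothetical field names; the text names the features but not their column labels.
ZERO_FILL = ("runway_lighted", "gard_training")   # missing -> assume negative (0)
MEAN_FILL = ("elevation", "runway_length")        # missing -> mean of known values


def fill_missing(records):
    """Return copies of the records with missing (None) values filled."""
    means = {
        field: mean(r[field] for r in records if r[field] is not None)
        for field in MEAN_FILL
    }
    filled = []
    for record in records:
        row = dict(record)
        for field in ZERO_FILL:
            if row[field] is None:
                row[field] = 0
        for field in MEAN_FILL:
            if row[field] is None:
                row[field] = means[field]
        filled.append(row)
    return filled


records = [
    {"runway_lighted": 1, "gard_training": None, "elevation": 100, "runway_length": 2000},
    {"runway_lighted": None, "gard_training": 1, "elevation": None, "runway_length": 3000},
    {"runway_lighted": 0, "gard_training": 0, "elevation": 300, "runway_length": None},
]
filled = fill_missing(records)
print(filled)
```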
C.2 Categorical data
A number of features in the final dataset describe each airport as a member of a certain category. For example, the airport type feature categorises airports into small airport, medium airport and large airport. While this is a clear and understandable distinction for a human eye, mathematical algorithms require a numerical expression [16]. As proposed in the original publication on Self-Organising Maps [21], the categorical feature with three values was transformed into three binary features, with one equal to 1 and all others equal to 0 for each airport. An example result can be seen in Table 4. To achieve that for each categorical feature, the LabelBinarizer class from scikit-learn [16] was used.
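The encoding can be illustrated without any dependency. The sketch below is a plain-Python equivalent of what LabelBinarizer produces for a feature with three or more categories (classes sorted alphabetically, one binary column per class); for a feature with only two categories, LabelBinarizer returns a single 0/1 column instead. The category labels are written in an assumed form.

```python
def one_hot(values):
    """Encode categorical values as binary columns, one column per category."""
    classes = sorted(set(values))
    rows = [[1 if value == c else 0 for c in classes] for value in values]
    return classes, rows


airport_types = ["small_airport", "large_airport", "medium_airport", "small_airport"]
classes, encoded = one_hot(airport_types)
print(classes)
for airport_type, row in zip(airport_types, encoded):
    print(airport_type, row)
```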
C.3 Numerical data
It is common for many machine learning algorithms to require standardised data inputs in order to perform well [16]. This is also the case with the unsupervised learning algorithm used in this research, the SOM. There are various mathematical transformations that can help to achieve normally distributed data, and it is important to choose the one that best fits the type of data. Again, the scikit-learn documentation, supported by various scientific sources [7, 9, 20] and experiments, was used to choose the right approach.
The Yeo-Johnson transform [27] was used to change the distribution of numerical data, since it is one of the few transformations that can be applied to negative and zero values, which the dataset contained. The effect of the transformation can be seen in Figures 8 and 9. While it was not possible to successfully transform all features, especially the ones consisting of 0/1 values, for most features the improvement is visible.
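For reference, the Yeo-Johnson transform itself is a short piecewise formula. The sketch below applies it with a fixed, hand-picked lambda to made-up values; in practice the parameter is estimated per feature (scikit-learn's PowerTransformer, for instance, does so by maximum likelihood and standardises the output by default), and the text does not state which routine or parameters were used.

```python
from math import log


def yeo_johnson(y, lmbda):
    """Yeo-Johnson power transform of a single value for a given lambda."""
    if y >= 0:
        if lmbda == 0:
            return log(y + 1.0)
        return ((y + 1.0) ** lmbda - 1.0) / lmbda
    if lmbda == 2:
        return -log(-y + 1.0)
    return -(((-y + 1.0) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


# Made-up, right-skewed values including zero and a negative number.
values = [-2.0, 0.0, 1.0, 10.0, 1000.0]
transformed = [yeo_johnson(v, 0.0) for v in values]
print(transformed)
```

The ordering of the values is preserved while the long right tail is compressed, which is the behaviour described in the captions of Figures 8 and 9.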
Table 4. An example of encoding categorical features

Fig. 8. An example of data distribution before the Yeo-Johnson transform. Most of the data points are concentrated around the lower values. Applying SOM directly to non-normally distributed data could lead to specific features being overrepresented; therefore the transformation is needed.

Fig. 9. An example of data distribution after the Yeo-Johnson transform. The range of values has changed; however, the relations between specific values are kept and the distribution is now closer to normal.