Fundamentals of Data Science


Ibon Martínez-Arranz

Contents

Data Science Workflow Management
  Introduction
    What is Data Science Workflow Management?
    Why is Data Science Workflow Management Important?
  Fundamentals of Data Science
    What is Data Science?
    Data Science Process
    Programming Languages for Data Science
      R
      Python
      SQL
    Data Science Tools and Technologies

Data Science Workflow Management

Introduction

In recent years, the amount of data generated by businesses, organizations, and individuals has increased exponentially. With the rise of the Internet, mobile devices, and social media, we are now generating more data than ever before. This data can be incredibly valuable, providing insights that can inform decision-making, improve processes, and drive innovation. However, the sheer volume and complexity of this data also present significant challenges.

In the past few years, there has been a significant surge in the volume of data produced by companies, institutions, and individuals. The proliferation of the Internet, mobile devices, and social media has led to a situation where we are currently generating more data than at any other time in history. Image generated with DALL-E.

Data science has emerged as a discipline that helps us make sense of this data. It involves using statistical and computational techniques to extract insights from data and communicate them in a way that is actionable and relevant. With the increasing availability of powerful computers and software tools, data science has become an essential part of many industries, from finance and healthcare to marketing and manufacturing.

However, data science is not just about applying algorithms and models to data. It also involves a complex and often iterative process of data acquisition, cleaning, exploration, modeling, and implementation. This process is commonly known as the data science workflow.

Managing the data science workflow can be a challenging task. It requires coordinating the efforts of multiple team members, integrating various tools and technologies, and ensuring that the workflow is well-documented, reproducible, and scalable. This is where data science workflow management comes in.

Data science workflow management is especially important in the era of big data. As we continue to collect and analyze ever-larger amounts of data, it becomes increasingly important to have robust mathematical and statistical knowledge to analyze it effectively. Furthermore, as the importance of data-driven decision making continues to grow, it is critical that data scientists and other professionals involved in the data science workflow have the tools and techniques needed to manage this process effectively.

To achieve these goals, data science workflow management relies on a combination of best practices, tools, and technologies. Some popular tools for data science workflow management include Jupyter Notebooks, GitHub, Docker, and various project management tools.

What is Data Science Workflow Management?

Data science workflow management is the practice of organizing and coordinating the various tasks and activities involved in the data science workflow. It encompasses everything from data collection and cleaning to analysis, modeling, and implementation. Effective data science workflow management requires a deep understanding of the data science process, as well as the tools and technologies used to support it.

At its core, data science workflow management is about making the data science workflow more efficient, effective, and reproducible. This can involve creating standardized processes and protocols for data collection, cleaning, and analysis; implementing quality control measures to ensure data accuracy and consistency; and utilizing tools and technologies that make it easier to collaborate and communicate with other team members.

One of the key challenges of data science workflow management is ensuring that the workflow is well-documented and reproducible. This involves keeping detailed records of all the steps taken in the data science process, from the data sources used to the models and algorithms applied. By doing so, it becomes easier to reproduce the results of the analysis and verify the accuracy of the findings.
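As a rough illustration of the record-keeping described above, the following Python sketch stores the data source, a checksum of the data, the steps applied, and the model settings in a single JSON-serializable record. The function names (`fingerprint`, `make_run_record`), the file name `iris.csv`, and the parameter names are hypothetical and chosen only for this example; a real project might instead rely on a dedicated experiment-tracking tool.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a SHA-256 digest of the rows so the exact input data can be identified later."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def make_run_record(data_source, rows, steps, model_name, params):
    """Collect the details needed to repeat an analysis in one JSON-serializable dict."""
    return {
        "data_source": data_source,
        "data_sha256": fingerprint(rows),
        "steps": list(steps),
        "model": model_name,
        "params": dict(params),
    }


rows = [[5.1, 3.5], [4.9, 3.0], [4.7, 3.2]]
record = make_run_record(
    data_source="iris.csv",  # hypothetical file name
    rows=rows,
    steps=["load", "drop_missing", "fit"],
    model_name="linear_regression",
    params={"fit_intercept": True},  # hypothetical setting
)
print(json.dumps(record, indent=2))
```

Saving such a record next to each result makes it possible to check later whether two analyses used the same data and settings.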


Another important aspect of data science workflow management is ensuring that the workflow is scalable. As the amount of data being analyzed grows, it becomes increasingly important to have a workflow that can handle large volumes of data without sacrificing performance. This may involve using distributed computing frameworks like Apache Hadoop or Apache Spark, or utilizing cloud-based data processing services like Amazon Web Services (AWS) or Google Cloud Platform (GCP).
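The idea of handling data that does not fit comfortably in memory can be shown on a single machine. The sketch below is not Hadoop or Spark; it only illustrates, under that simplifying assumption, the underlying principle of processing data in chunks and combining partial results. The helper names `chunks` and `streaming_mean` are made up for this example.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def streaming_mean(values, chunk_size=1000):
    """Compute the mean chunk by chunk, keeping only a running total and a count."""
    total = 0.0
    count = 0
    for chunk in chunks(values, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no values to average")
    return total / count


# A generator stands in for a data source too large to load at once.
readings = (x % 10 for x in range(1_000_000))
print(streaming_mean(readings, chunk_size=10_000))
```

Because each chunk contributes only a partial sum and a count, the same pattern extends naturally to chunks processed on different machines, which is broadly how distributed frameworks approach aggregation.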

Effective data science workflow management also requires a strong understanding of the various tools and technologies used to support the data science process. This may include programming languages like Python and R, statistical software packages like SAS and SPSS, and data visualization tools like Tableau and Power BI. In addition, data science workflow management may involve using project management tools like JIRA or Asana to coordinate the efforts of multiple team members.

Overall, data science workflow management is an essential aspect of modern data science. By implementing best practices and utilizing the right tools and technologies, data scientists and other professionals involved in the data science process can ensure that their workflows are efficient, effective, and scalable. This, in turn, can lead to more accurate and actionable insights that drive innovation and improve decision-making across a wide range of industries and domains.

Why is Data Science Workflow Management Important?

Effective data science workflow management is critical to the success of any data science project. By organizing and coordinating the various tasks and activities involved in the data science process, data science workflow management helps ensure that projects are completed on time, within budget, and with high levels of accuracy and reproducibility.

One of the key benefits of data science workflow management is that it promotes a more structured, methodological approach to data science. By breaking down the data science process into discrete steps and tasks, data science workflow management makes it easier to manage complex projects and identify potential bottlenecks or areas where improvements can be made. This, in turn, can help ensure that data science projects are completed more efficiently and with greater levels of accuracy.
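Breaking a workflow into discrete steps also makes bottlenecks measurable. The following Python sketch runs a list of named steps in order and records how long each one takes; the `run_pipeline` helper and the two toy steps are hypothetical and exist only for this illustration.

```python
import time


def run_pipeline(data, steps):
    """Run each (name, function) step in order and record how long it took."""
    timings = {}
    for name, func in steps:
        start = time.perf_counter()
        data = func(data)
        timings[name] = time.perf_counter() - start
    return data, timings


def clean(values):
    # Drop missing entries (represented here as None).
    return [v for v in values if v is not None]


def scale(values):
    # Rescale the values to the 0-1 range.
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


raw = [5.1, None, 4.9, 4.7, None, 5.4]
result, timings = run_pipeline(raw, [("clean", clean), ("scale", scale)])
slowest = max(timings, key=timings.get)
print(result)
print(f"slowest step: {slowest}")
```

Keeping the steps as separate named functions means a slow or faulty stage can be replaced or tested in isolation without touching the rest of the workflow.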

Another important benefit of data science workflow management is that it can help ensure that the results of data science projects are more reproducible. By keeping detailed records of all the steps taken in the data science process, data science workflow management makes it easier to replicate the results of analyses and verify their accuracy. This is particularly important in fields where accuracy and reproducibility are essential, such as scientific research and financial modeling.

In addition to these benefits, effective data science workflow management can also lead to more effective collaboration and communication among team members. By utilizing project management tools and other software designed for data science workflow management, team members can work together more efficiently and effectively, sharing data, insights, and feedback in real-time. This can help ensure that projects stay on track and that everyone involved is working toward the same goals.

There are a number of software tools available for data science workflow management, including popular platforms like Jupyter Notebooks, Apache Airflow, and Apache NiFi. Each of these platforms offers a unique set of features and capabilities designed to support different aspects of the data science workflow, from data cleaning and preparation to model training and deployment. By leveraging these tools, data scientists and other professionals involved in the data science process can work more efficiently and effectively, improving the quality and accuracy of their work.

Overall, data science workflow management is an essential aspect of modern data science. By promoting a more structured, methodological approach to data science and leveraging the right tools and technologies, data scientists and other professionals involved in the data science process can ensure that their projects are completed on time, within budget, and with high levels of accuracy and reproducibility.


Fundamentals of Data Science

Data science is an interdisciplinary field that combines techniques from statistics, mathematics, and computer science to extract knowledge and insights from data. The rise of big data and the increasing complexity of modern systems have made data science an essential tool for decision-making across a wide range of industries, from finance and healthcare to transportation and retail.

Data science is a multidisciplinary area that blends methods from statistics, mathematics, and computer science to derive wisdom and gain understanding from data. The emergence of big data and the growing intricacy of contemporary systems have transformed data science into a crucial instrument for informed decision-making in various sectors, including finance, healthcare, transportation, and retail. Image generated with DALL-E.

The field of data science has a rich history, with roots in statistics and data analysis dating back to the 19th century. However, it was not until the 21st century that data science truly came into its own, as advancements in computing power and the development of sophisticated algorithms made it possible to analyze larger and more complex datasets than ever before.

This chapter will provide an overview of the fundamentals of data science, including the key concepts, tools, and techniques used by data scientists to extract insights from data. We will cover topics such as data visualization, statistical inference, machine learning, and deep learning, as well as best practices for data management and analysis.

What is Data Science?

Data science is a multidisciplinary field that uses techniques from mathematics, statistics, and computer science to extract insights and knowledge from data. It involves a variety of skills and tools, including data collection and storage, data cleaning and preprocessing, exploratory data analysis, statistical inference, machine learning, and data visualization.

The goal of data science is to provide a deeper understanding of complex phenomena, identify patterns and relationships, and make predictions or decisions based on data-driven insights. This is done by leveraging data from various sources, including sensors, social media, scientific experiments, and business transactions, among others.

Data science has become increasingly important in recent years due to the exponential growth of data and the need for businesses and organizations to extract value from it. The rise of big data, cloud computing, and artificial intelligence has opened up new opportunities and challenges for data scientists, who must navigate complex and rapidly evolving landscapes of technologies, tools, and methodologies.

To be successful in data science, one needs a strong foundation in mathematics and statistics, as well as programming skills and domain-specific knowledge. Data scientists must also be able to communicate effectively and work collaboratively with teams of experts from different backgrounds.

Overall, data science has the potential to revolutionize the way we understand and interact with the world around us, from improving healthcare and education to driving innovation and economic growth.

Data Science Process

The data science process is a systematic approach for solving complex problems and extracting insights from data. It involves a series of steps, from defining the problem to communicating the results, and requires a combination of technical and non-technical skills.


The data science process typically begins with understanding the problem and defining the research question or hypothesis. Once the question is defined, the data scientist must gather and clean the relevant data, which can involve working with large and messy datasets. The data is then explored and visualized, which can help to identify patterns, outliers, and relationships between variables.

Once the data is understood, the data scientist can begin to build models and perform statistical analysis. This often involves using machine learning techniques to train predictive models or perform clustering analysis. The models are then evaluated and tested to ensure they are accurate and robust.
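The model-building and evaluation steps above can be sketched in a few lines of plain Python. This example fits a straight line by ordinary least squares on a training split and measures the mean absolute error on held-out points. The data values are made up for illustration, and the helper names are hypothetical; in practice a library such as scikit-learn would normally handle both steps.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_absolute_error(xs, ys, slope, intercept):
    """Average absolute gap between predictions and observed values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)


# Made-up (x, y) pairs, for illustration only.
data = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
train, test = data[:4], data[4:]  # hold out the last two points for evaluation

slope, intercept = fit_line(*zip(*train))
error = mean_absolute_error(*zip(*test), slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} test MAE={error:.2f}")
```

Evaluating on points the model never saw during fitting is what gives the error figure its meaning as a check of robustness rather than of memorization.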

Finally, the results are communicated to stakeholders, which can involve creating visualizations, dashboards, or reports that are accessible and understandable to a non-technical audience. This is an important step, as the ultimate goal of data science is to drive action and decision-making based on data-driven insights.

The data science process is often iterative, as new insights or questions may arise during the analysis that require revisiting previous steps. The process also requires a combination of technical and non-technical skills, including programming, statistics, and domain-specific knowledge, as well as communication and collaboration skills.

To support the data science process, there are a variety of software tools and platforms available, including programming languages such as Python and R, machine learning libraries such as scikit-learn and TensorFlow, and data visualization tools such as Tableau and D3.js. There are also specific data science platforms and environments, such as Jupyter Notebook and Apache Spark, that provide a comprehensive set of tools for data scientists.

Overall, the data science process is a powerful approach for solving complex problems and driving decision-making based on data-driven insights. It requires a combination of technical and non-technical skills, and relies on a variety of software tools and platforms to support the process.

Programming Languages for Data Science

Data Science is an interdisciplinary field that combines statistical and computational methodologies to extract insights and knowledge from data. Programming is an essential part of this process, as it allows us to manipulate and analyze data using software tools specifically designed for data science tasks. There are several programming languages that are widely used in data science, each with its strengths and weaknesses.

R is a language that was specifically designed for statistical computing and graphics. It has an extensive library of statistical and graphical functions that make it a popular choice for data exploration and analysis. Python, on the other hand, is a general-purpose programming language that has become increasingly popular in data science due to its versatility and powerful libraries such as NumPy, Pandas, and Scikit-learn. SQL is a language used to manage and manipulate relational databases, making it an essential tool for working with large datasets.

In addition to these popular languages, other languages are also used in data science, such as SAS, MATLAB, and Julia. Each language has its own unique features and applications, and the choice of language will depend on the specific requirements of the project.

In this chapter, we will provide an overview of the most commonly used programming languages in data science and discuss their strengths and weaknesses. We will also explore how to choose the right language for a given project and discuss best practices for programming in data science.

R

R is a programming language specifically designed for statistical computing and graphics. It is an open-source language that is widely used in data science for tasks such as data cleaning, visualization, and statistical modeling. R has a vast library of packages that provide tools for data manipulation, machine learning, and visualization.

One of the key strengths of R is its flexibility and versatility. It allows users to easily import and manipulate data from a wide range of sources and provides a wide range of statistical techniques for data analysis. R also has an active and supportive community that provides regular updates and new packages for users.

Some popular applications of R include data exploration and visualization, statistical modeling, and machine learning. R is also commonly used in academic research and has been used in many published papers across a variety of fields.

Python

Python is a popular general-purpose programming language that has become increasingly popular in data science due to its versatility and powerful libraries such as NumPy, Pandas, and Scikit-learn. Python's simplicity and readability make it an excellent choice for data analysis and machine learning tasks.

One of the key strengths of Python is its extensive library of packages. The NumPy package, for example, provides powerful tools for mathematical operations, while Pandas is a package designed for data manipulation and analysis. Scikit-learn is a machine learning package that provides tools for classification, regression, clustering, and more.


Python is also an excellent language for data visualization, with packages such as Matplotlib, Seaborn, and Plotly providing tools for creating a wide range of visualizations.

Python's popularity in the data science community has led to the development of many tools and frameworks specifically designed for data analysis and machine learning. Some popular tools include Jupyter Notebook, Anaconda, and TensorFlow.

SQL

Structured Query Language (SQL) is a specialized language designed for managing and manipulating relational databases. SQL is widely used in data science for managing and extracting information from databases.

SQL allows users to retrieve and manipulate data stored in a relational database. Users can create tables, insert data, update data, and delete data. SQL also provides powerful tools for querying and aggregating data.

One of the key strengths of SQL is its ability to handle large amounts of data efficiently. SQL is a declarative language, which means that users can specify what they want to retrieve or manipulate, and the database management system (DBMS) handles the implementation details. This makes SQL an excellent choice for working with large datasets.

There are several popular implementations of SQL, including MySQL, Oracle, Microsoft SQL Server, and PostgreSQL. Each implementation has its own specific syntax and features, but the core concepts of SQL are the same across all implementations.

In data science, SQL is often used in combination with other tools and languages, such as Python or R, to extract and manipulate data from databases.

How to Use

In this section, we will explore the usage of SQL commands with two tables: iris and species. The iris table contains information about flower measurements, while the species table provides details about different species of flowers. SQL (Structured Query Language) is a powerful tool for managing and manipulating relational databases.

iris table

| sepal_length | sepal_width | petal_length | petal_width | species |
|--------------|-------------|--------------|-------------|---------|
| 5.1          | 3.5         | 1.4          | 0.2         | Setosa  |
| 4.9          | 3.0         | 1.4          | 0.2         | Setosa  |
| 4.7          | 3.2         | 1.3          | 0.2         | Setosa  |
| 4.6          | 3.1         | 1.5          | 0.2         | Setosa  |
| 5.0          | 3.6         | 1.4          | 0.2         | Setosa  |
| 5.4          | 3.9         | 1.7          | 0.4         | Setosa  |
| 4.6          | 3.4         | 1.4          | 0.3         | Setosa  |
| 5.0          | 3.4         | 1.5          | 0.2         | Setosa  |
| 4.4          | 2.9         | 1.4          | 0.2         | Setosa  |
| 4.9          | 3.1         | 1.5          | 0.1         | Setosa  |

species table

| 1   | Setosa  | Flower | Red |
| ... | ...     | ...    | ... |
| 8   | Amanita | Fungus | Red |

Using the iris and species tables as examples, we can perform various SQL operations to extract meaningful insights from the data. Some of the commonly used SQL commands with these tables include:

Data Retrieval:

SQL (Structured Query Language) is essential for accessing and retrieving data stored in relational databases. The primary command used for data retrieval is SELECT, which allows users to specify exactly what data they want to see. This command can be combined with other clauses like WHERE for filtering, ORDER BY for sorting, and JOIN for merging data from multiple tables. Mastery of these commands enables users to efficiently query large databases, extracting only the relevant information needed for analysis or reporting.

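| SQL Command | Purpose                           | Example                                                        |
|-------------|-----------------------------------|----------------------------------------------------------------|
| SELECT      | Retrieve data from a table        | SELECT * FROM iris                                             |
| WHERE       | Filter rows based on a condition  | SELECT * FROM iris WHERE sepal_length > 5.0                    |
| ORDER BY    | Sort the result set               | SELECT * FROM iris ORDER BY sepal_width DESC                   |
| LIMIT       | Limit the number of rows returned | SELECT * FROM iris LIMIT 10                                    |
| JOIN        | Combine rows from multiple tables | SELECT * FROM iris JOIN species ON iris.species = species.name |

Table 1: Common SQL commands for data retrieval.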
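The retrieval commands in Table 1 can be tried out from Python using the standard-library `sqlite3` module and an in-memory database. This is only a sketch: it assumes the SQLite dialect, loads just the first few iris rows, and invents the species column names `id`, `category`, and `color` (only `name` appears in the JOIN example).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE iris (
    sepal_length REAL, sepal_width REAL,
    petal_length REAL, petal_width REAL, species TEXT
);
-- Only `name` is confirmed by the JOIN example; the other column names are assumed.
CREATE TABLE species (id INTEGER, name TEXT, category TEXT, color TEXT);
""")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?, ?, ?)",
    [
        (5.1, 3.5, 1.4, 0.2, "Setosa"),
        (4.9, 3.0, 1.4, 0.2, "Setosa"),
        (4.7, 3.2, 1.3, 0.2, "Setosa"),
        (4.6, 3.1, 1.5, 0.2, "Setosa"),
        (5.0, 3.6, 1.4, 0.2, "Setosa"),
        (5.4, 3.9, 1.7, 0.4, "Setosa"),
    ],
)
conn.execute("INSERT INTO species VALUES (1, 'Setosa', 'Flower', 'Red')")

# WHERE + ORDER BY + LIMIT: the two widest sepals among rows longer than 4.8.
top = conn.execute(
    "SELECT sepal_length, sepal_width FROM iris "
    "WHERE sepal_length > ? ORDER BY sepal_width DESC LIMIT 2",
    (4.8,),
).fetchall()

# JOIN: attach species details to each measurement.
joined = conn.execute(
    "SELECT iris.sepal_length, species.category FROM iris "
    "JOIN species ON iris.species = species.name"
).fetchall()

print(top)
print(len(joined))
conn.close()
```

Passing values through `?` placeholders rather than formatting them into the SQL string lets the database driver handle quoting, which is the usual way to avoid SQL injection.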

Data Manipulation:

Data manipulation is a critical aspect of database management, allowing users to modify existing data, add new data, or delete unwanted data. The key SQL commands for data manipulation are INSERT INTO for adding new records, UPDATE for modifying existing records, and DELETE FROM for removing records. These commands are powerful tools for maintaining and updating the content within a database, ensuring that the data remains current and accurate.

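| SQL Command | Purpose                            | Example                                                          |
|-------------|------------------------------------|------------------------------------------------------------------|
| INSERT INTO | Insert new records into a table    | INSERT INTO iris (sepal_length, sepal_width) VALUES (6.3, 2.8)   |
| UPDATE      | Update existing records in a table | UPDATE iris SET petal_length = 1.5 WHERE species = 'Setosa'      |
| DELETE FROM | Delete records from a table        | DELETE FROM iris WHERE species = 'Versicolor'                    |

Table 2: Common SQL commands for modifying and managing data.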
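The statements in Table 2 can be exercised the same way. The sketch below again assumes SQLite via Python's standard-library `sqlite3` module, and it adds one Versicolor row that is not part of the sample table so that the DELETE statement has something to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE iris (sepal_length REAL, sepal_width REAL, "
    "petal_length REAL, petal_width REAL, species TEXT)"
)
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?, ?, ?)",
    [
        (5.1, 3.5, 1.4, 0.2, "Setosa"),
        (4.9, 3.0, 1.4, 0.2, "Setosa"),
        (7.0, 3.2, 4.7, 1.4, "Versicolor"),  # extra row so DELETE has a match
    ],
)

# INSERT INTO with a column list: unlisted columns are stored as NULL.
conn.execute("INSERT INTO iris (sepal_length, sepal_width) VALUES (?, ?)", (6.3, 2.8))

# UPDATE and DELETE report how many rows they touched via cursor.rowcount.
updated = conn.execute(
    "UPDATE iris SET petal_length = ? WHERE species = ?", (1.5, "Setosa")
).rowcount
deleted = conn.execute("DELETE FROM iris WHERE species = ?", ("Versicolor",)).rowcount
conn.commit()

remaining = conn.execute("SELECT COUNT(*) FROM iris").fetchone()[0]
print(updated, deleted, remaining)
conn.close()
```

Note that the partially inserted row has a NULL species, so the `WHERE species = 'Setosa'` condition does not match it; checking the affected-row count is a cheap safeguard against an UPDATE or DELETE that touches more or fewer rows than intended.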

Data Aggregation:

SQL provides robust functionality for aggregating data, which is essential for statistical analysis and generating meaningful insights from large datasets. Commands like GROUP BY enable grouping of data based on one or more columns, while SUM, AVG, COUNT, and other aggregation functions allow for the calculation of sums, averages, and counts. The HAVING clause can be used in conjunction with GROUP BY to filter groups based on specific conditions. These aggregation capabilities are crucial for summarizing data, facilitating complex analyses, and supporting decision-making processes.


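| SQL Command | Purpose                             | Example                                                                   |
|-------------|-------------------------------------|---------------------------------------------------------------------------|
| GROUP BY    | Group rows by a column(s)           | SELECT species, COUNT(*) FROM iris GROUP BY species                       |
| HAVING      | Filter groups based on a condition  | SELECT species, COUNT(*) FROM iris GROUP BY species HAVING COUNT(*) > 5   |
| SUM         | Calculate the sum of a column       | SELECT species, SUM(petal_length) FROM iris GROUP BY species              |
| AVG         | Calculate the average of a column   | SELECT species, AVG(sepal_width) FROM iris GROUP BY species               |

Table 3: Common SQL commands for data aggregation and analysis.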
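A short Python sketch of the aggregation commands in Table 3 follows, once more assuming SQLite through the standard-library `sqlite3` module. The table is reduced to three columns, and a single Versicolor row (not in the sample data) is added so that grouping produces more than one group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE iris (sepal_width REAL, petal_length REAL, species TEXT)")
conn.executemany(
    "INSERT INTO iris VALUES (?, ?, ?)",
    [
        (3.5, 1.4, "Setosa"),
        (3.0, 1.4, "Setosa"),
        (3.2, 1.3, "Setosa"),
        (3.2, 4.7, "Versicolor"),  # illustrative non-Setosa row
    ],
)

# GROUP BY with COUNT and AVG: one summary row per species.
summary = conn.execute(
    "SELECT species, COUNT(*), AVG(petal_length) FROM iris "
    "GROUP BY species ORDER BY species"
).fetchall()

# HAVING filters groups after aggregation (WHERE filters rows before it).
large_groups = conn.execute(
    "SELECT species, COUNT(*) FROM iris GROUP BY species HAVING COUNT(*) > 2"
).fetchall()

for species, n, avg_petal in summary:
    print(f"{species}: n={n}, mean petal_length={avg_petal:.2f}")
print(large_groups)
conn.close()
```

Doing the aggregation inside the database means only the small summary travels back to Python, which matters when the underlying table is large.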

Data Science Tools and Technologies

Data science is a rapidly evolving field, and as such, there are a vast number of tools and technologies available to data scientists to help them effectively analyze and draw insights from data. These tools range from programming languages and libraries to data visualization platforms, data storage technologies, and cloud-based computing resources.

In recent years, two programming languages have emerged as the leading tools for data science: Python and R. Both languages have robust ecosystems of libraries and tools that make it easy for data scientists to work with and manipulate data. Python is known for its versatility and ease of use, while R has a more specialized focus on statistical analysis and visualization.

Data visualization is an essential component of data science, and there are several powerful tools available to help data scientists create meaningful and informative visualizations. Some popular visualization tools include Tableau, Power BI, and matplotlib, a plotting library for Python.

Another critical aspect of data science is data storage and management. Traditional databases are not always the best fit for storing large amounts of data used in data science, and as such, newer technologies like Hadoop and Apache Spark have emerged as popular options for storing and processing big data. Cloud-based storage platforms like Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure are also increasingly popular for their scalability, flexibility, and cost-effectiveness.

In addition to these core tools, there are a wide variety of other technologies and platforms that data scientists use in their work, including machine learning libraries like TensorFlow and scikit-learn, data processing tools like Apache Kafka and Apache Beam, and natural language processing tools like spaCy and NLTK.


Given the vast number of tools and technologies available, it's important for data scientists to carefully evaluate their options and choose the tools that are best suited for their particular use case. This requires a deep understanding of the strengths and weaknesses of each tool, as well as a willingness to experiment and try out new technologies as they emerge.

