Data Science in Theory and Practice: Techniques for Big Data Analytics and Complex Data Sets Maria C. Mariani

Data Science in Theory and Practice

Techniques for Big Data Analytics and Complex Data Sets

Maria Cristina Mariani

University of Texas, El Paso
El Paso, United States

Osei Kofi Tweneboah

Ramapo College of New Jersey
Mahwah, United States

Maria Pia Beccar-Varela

University of Texas, El Paso
El Paso, United States

This first edition first published 2022
© 2022 John Wiley and Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions

The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar-Varela to be identified as the authors of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data applied for

ISBN: 9781119674689

Cover Design: Wiley

Cover Image: © nobeastsofierce/Shutterstock

Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India

10 9 8 7 6 5 4 3 2 1

Contents

List of Figures

List of Tables

Preface

1 Background of Data Science

1.1 Introduction

1.2 Origin of Data Science

1.3 Who is a Data Scientist?

1.4 Big Data

1.4.1 Characteristics of Big Data

1.4.2 Big Data Architectures

2 Matrix Algebra and Random Vectors

2.1 Introduction

2.2 Some Basics of Matrix Algebra

2.2.1 Vectors

2.2.2 Matrices

2.3 Random Variables and Distribution Functions

2.3.1 The Dirichlet Distribution

2.3.2 Multinomial Distribution

2.3.3 Multivariate Normal Distribution

2.4 Problems

3 Multivariate Analysis

3.1 Introduction

3.2 Multivariate Analysis: Overview

3.3 Mean Vectors

3.4 Variance–Covariance Matrices

3.5 Correlation Matrices

3.6 Linear Combinations of Variables

3.6.1 Linear Combinations of Sample Means

3.6.2 Linear Combinations of Sample Variance and Covariance

3.6.3 Linear Combinations of Sample Correlation

3.7 Problems

4 Time Series Forecasting

4.1 Introduction

4.2 Terminologies

4.3 Components of Time Series

4.3.1 Seasonal

4.3.2 Trend

4.3.3 Cyclical

4.3.4 Random

4.4 Transformations to Achieve Stationarity

4.5 Elimination of Seasonality via Differencing

4.6 Additive and Multiplicative Models

4.7 Measuring Accuracy of Different Time Series Techniques

4.7.1 Mean Absolute Deviation

4.7.2 Mean Absolute Percent Error

4.7.3 Mean Square Error

4.7.4 Root Mean Square Error

4.8 Averaging and Exponential Smoothing Forecasting Methods

4.8.1 Averaging Methods

4.8.1.1 Simple Moving Averages

4.8.1.2 Weighted Moving Averages

4.8.2 Exponential Smoothing Methods

4.8.2.1 Simple Exponential Smoothing

4.8.2.2 Adjusted Exponential Smoothing

4.9 Problems

5 Introduction to R

5.1 Introduction

5.2 Basic Data Types

5.2.1 Numeric Data Type

5.2.2 Integer Data Type

5.2.3 Character

5.2.4 Complex Data Types

5.2.5 Logical Data Types

5.3 Simple Manipulations – Numbers and Vectors

5.3.1 Vectors and Assignment

5.3.2 Vector Arithmetic

5.3.3 Vector Index

5.3.4 Logical Vectors

5.3.5 Missing Values

5.3.6 Index Vectors

5.3.6.1 Indexing with Logicals

5.3.6.2 A Vector of Positive Integral Quantities

5.3.6.3 A Vector of Negative Integral Quantities

5.3.6.4 Named Indexing

5.3.7 Other Types of Objects

5.3.7.1 Matrices

5.3.7.2 List

5.3.7.3 Factor

5.3.7.4 Data Frames

5.3.8 Data Import

5.3.8.1 Excel File

5.3.8.2 CSV File

5.3.8.3 Table File

5.3.8.4 Minitab File

5.3.8.5 SPSS File

5.4 Problems

6 Introduction to Python

6.1 Introduction

6.2 Basic Data Types

6.2.1 Number Data Type

6.2.1.1 Integer

6.2.1.2 Floating-Point Numbers

6.2.1.3 Complex Numbers

6.2.2 Strings

6.2.3 Lists

6.2.4 Tuples

6.2.5 Dictionaries

6.3 Number Type Conversion

6.4 Python Conditions

6.4.1 If Statements

6.4.2 The Else and Elif Clauses

6.4.3 The While Loop

6.4.3.1 The Break Statement

6.4.3.2 The Continue Statement

6.4.4 For Loops

6.4.4.1 Nested Loops

6.5 Python File Handling: Open, Read, and Close

6.6 Python Functions

6.6.1 Calling a Function in Python

6.6.2 Scope and Lifetime of Variables

6.7 Problems

7 Algorithms

7.1 Introduction

7.2 Algorithm – Definition

7.3 How to Write an Algorithm

7.3.1 Algorithm Analysis

7.3.2 Algorithm Complexity

7.3.3 Space Complexity

7.3.4 Time Complexity

7.4 Asymptotic Analysis of an Algorithm

7.4.1 Asymptotic Notations

7.4.1.1 Big O Notation

7.4.1.2 The Omega Notation, Ω

7.4.1.3 The Θ Notation

7.5 Examples of Algorithms

7.6 Flowchart

7.7 Problems

8 Data Preprocessing and Data Validations

8.1 Introduction

8.2 Definition – Data Preprocessing

8.3 Data Cleaning

8.3.1 Handling Missing Data

8.3.2 Types of Missing Data

8.3.2.1 Missing Completely at Random

8.3.2.2 Missing at Random

8.3.2.3 Missing Not at Random

8.3.3 Techniques for Handling the Missing Data

8.3.3.1 Listwise Deletion

8.3.3.2 Pairwise Deletion

8.3.3.3 Mean Substitution

8.3.3.4 Regression Imputation

8.3.3.5 Multiple Imputation

8.3.4 Identifying Outliers and Noisy Data

8.3.4.1 Binning

8.3.4.2 Box and Whisker Plot

8.4 Data Transformations

8.4.1 Min–Max Normalization

8.4.2 Z-score Normalization

8.5 Data Reduction

8.6 Data Validations

8.6.1 Methods for Data Validation

8.6.1.1 Simple Statistical Criterion

8.6.1.2 Fourier Series Modeling and SSC

8.6.1.3 Principal Component Analysis and SSC

8.7 Problems

9 Data Visualizations

9.1 Introduction

9.2 Definition – Data Visualization

9.2.1 Scientific Visualization

9.2.2 Information Visualization

9.2.3 Visual Analytics

9.3 Data Visualization Techniques

9.3.1 Time Series Data

9.3.2 Statistical Distributions

9.3.2.1 Stem-and-Leaf Plots

9.3.2.2 Q–Q Plots

9.4 Data Visualization Tools

9.4.1 Tableau

9.4.2 Infogram

9.4.3 Google Charts

9.5 Problems

10 Binomial and Trinomial Trees

10.1 Introduction

10.2 The Binomial Tree Method

10.2.1 One Step Binomial Tree

10.2.2 Using the Tree to Price a European Option

10.2.3 Using the Tree to Price an American Option

10.2.4 Using the Tree to Price Any Path Dependent Option

10.3 Binomial Discrete Model

10.3.1 One-Step Method

10.3.2 Multi-step Method

10.3.2.1 Example: European Call Option

10.4 Trinomial Tree Method

10.4.1 What is the Meaning of Little o and Big O?

10.5 Problems

11 Principal Component Analysis

11.1 Introduction

11.2 Background of Principal Component Analysis

11.3 Motivation

11.3.1 Correlation and Redundancy

11.3.2 Visualization

11.4 The Mathematics of PCA

11.4.1 The Eigenvalues and Eigenvectors

11.5 How PCA Works

11.5.1 Algorithm

11.6 Application

11.7 Problems

12 Discriminant and Cluster Analysis

12.1 Introduction

12.2 Distance

12.3 Discriminant Analysis

12.3.1 Kullback–Leibler Divergence

12.3.2 Chernoff Distance

12.3.3 Application – Seismic Time Series

12.3.4 Application – Financial Time Series

12.4 Cluster Analysis

12.4.1 Partitioning Algorithms

12.4.2 k-Means Algorithm

12.4.3 k-Medoids Algorithm

12.4.4 Application – Seismic Time Series

12.4.5 Application – Financial Time Series

12.5 Problems

13 Multidimensional Scaling

13.1 Introduction

13.2 Motivation

13.3 Number of Dimensions and Goodness of Fit

13.4 Proximity Measures

13.5 Metric Multidimensional Scaling

13.5.1 The Classical Solution

13.6 Nonmetric Multidimensional Scaling

13.6.1 Shepard–Kruskal Algorithm

13.7 Problems

14 Classification and Tree-Based Methods

14.1 Introduction

14.2 An Overview of Classification

14.2.1 The Classification Problem

14.2.2 Logistic Regression Model

14.2.2.1 l1 Regularization

14.2.2.2 l2 Regularization

14.3 Linear Discriminant Analysis

14.3.1 Optimal Classification and Estimation of Gaussian Distribution

14.4 Tree-Based Methods

14.4.1 One Single Decision Tree

14.4.2 Random Forest

14.5 Applications

14.6 Problems

15 Association Rules

15.1 Introduction

15.2 Market Basket Analysis

15.3 Terminologies

15.3.1 Itemset and Support Count

15.3.2 Frequent Itemset

15.3.3 Closed Frequent Itemset

15.3.4 Maximal Frequent Itemset

15.3.5 Association Rule

15.3.6 Rule Evaluation Metrics

15.4 The Apriori Algorithm

15.4.1 An Example of the Apriori Algorithm

15.5 Applications

15.5.1 Confidence

15.5.2 Lift

15.5.3 Conviction

15.6 Problems

16 Support Vector Machines

16.1 Introduction

16.2 The Maximal Margin Classifier

16.3 Classification Using a Separating Hyperplane

16.4 Kernel Functions

16.5 Applications

16.6 Problems

17 Neural Networks

17.1 Introduction

17.2 Perceptrons

17.3 Feed Forward Neural Network

17.4 Recurrent Neural Networks

17.5 Long Short-Term Memory

17.5.1 Residual Connections

17.5.2 Loss Functions

17.5.3 Stochastic Gradient Descent

17.5.4 Regularization – Ensemble Learning

17.6 Application

17.6.1 Emergent and Developed Market

17.6.2 The Lehman Brothers Collapse

17.6.3 Methodology

17.6.4 Analyses of Data

17.6.4.1 Results of the Emergent Market Index

17.6.4.2 Results of the Developed Market Index

17.7 Significance of Study

17.8 Problems

18 Fourier Analysis

18.1 Introduction

18.2 Definition

18.3 Discrete Fourier Transform

18.4 The Fast Fourier Transform (FFT) Method

18.5 Dynamic Fourier Analysis

18.5.1 Tapering

18.5.2 Daniell Kernel Estimation

18.6 Applications of the Fourier Transform

18.6.1 Modeling Power Spectrum of Financial Returns Using Fourier Transforms

18.6.2 Image Compression

18.7 Problems

19 Wavelets Analysis

19.1 Introduction

19.1.1 Wavelets Transform

19.2 Discrete Wavelets Transforms

19.2.1 Haar Wavelets

19.2.1.1 Haar Functions

19.2.1.2 Haar Transform Matrix

19.2.2 Daubechies Wavelets

19.3 Applications of the Wavelets Transform

19.3.1 Discriminating Between Mining Explosions and Cluster of Earthquakes

19.3.1.1 Background of Data

19.3.1.2 Results

19.3.2 Finance

19.3.3 Damage Detection in Frame Structures

19.3.4 Image Compression

19.3.5 Seismic Signals

19.4 Problems

20 Stochastic Analysis

20.1 Introduction

20.2 Necessary Definitions from Probability Theory

20.3 Stochastic Processes

20.3.1 The Index Set

20.3.2 The State Space

20.3.3 Stationary and Independent Components

20.3.4 Stationary and Independent Increments

20.3.5 Filtration and Standard Filtration

20.4 Examples of Stochastic Processes

20.4.1 Markov Chains

20.4.1.1 Examples of Markov Processes

20.4.1.2 The Chapman–Kolmogorov Equation

20.4.1.3 Classification of States

20.4.1.4 Limiting Probabilities

20.4.1.5 Branching Processes

20.4.1.6 Time Homogeneous Chains

20.4.2 Martingales

20.4.3 Simple Random Walk

20.4.4 The Brownian Motion (Wiener Process)

20.5 Measurable Functions and Expectations

20.5.1 Radon–Nikodym Theorem and Conditional Expectation

20.6 Problems

21 Fractal Analysis – Lévy, Hurst, DFA, DEA

21.1 Introduction and Definitions

21.2 Lévy Processes

21.2.1 Examples of Lévy Processes

21.2.1.1 The Poisson Process (Jumps)

21.2.1.2 The Compound Poisson Process

21.2.1.3 Inverse Gaussian (IG) Process

21.2.1.4 The Gamma Process

21.2.2 Exponential Lévy Models

21.2.3 Subordination of Lévy Processes

21.2.4 Stable Distributions

21.3 Lévy Flight Models

21.4 Rescaled Range Analysis (Hurst Analysis)

21.5 Detrended Fluctuation Analysis (DFA)

21.6 Diffusion Entropy Analysis (DEA)

21.6.1 Estimation Procedure

21.6.1.1 The Shannon Entropy

21.6.2 The H–α Relationship for the Truncated Lévy Flight

21.7 Application – Characterization of Volcanic Time Series

21.7.1 Background of Volcanic Data

21.7.2 Results

21.8 Problems

22 Stochastic Differential Equations

22.1 Introduction

22.2 Stochastic Differential Equations

22.2.1 Solution Methods of SDEs

22.3 Examples

22.3.1 Modeling Asset Prices

22.3.2 Modeling Magnitude of Earthquake Series

22.4 Multidimensional Stochastic Differential Equations

22.4.1 The Multidimensional Ornstein–Uhlenbeck Processes

22.4.2 Solution of the Ornstein–Uhlenbeck Process

22.5 Simulation of Stochastic Differential Equations

22.5.1 Euler–Maruyama Scheme for Approximating Stochastic Differential Equations

22.5.2 Euler–Milstein Scheme for Approximating Stochastic Differential Equations

22.6 Problems

23 Ethics: With Great Power Comes Great Responsibility

23.1 Introduction

23.2 Data Science Ethical Principles

23.2.1 Enhance Value in Society

23.2.2 Avoiding Harm

23.2.3 Professional Competence

23.2.4 Increasing Trustworthiness

23.2.5 Maintaining Accountability and Oversight

23.3 Data Science Code of Professional Conduct

23.4 Application

23.4.1 Project Planning

23.4.2 Data Preprocessing

23.4.3 Data Management

23.4.4 Analysis and Development

23.5 Problems

Bibliography

Index

List of Figures

Figure 4.1 Time series data of phase arrival times of an earthquake.

Figure 4.2 Time series data of financial returns corresponding to Bank of America (BAC) stock index.

Figure 4.3 Seasonal trend component.

Figure 4.4 Linear trend component. The horizontal axis is time t, and the vertical axis is the time series Y_t. (a) Linear increasing trend. (b) Linear decreasing trend.

Figure 4.5 Nonlinear trend component. The horizontal axis is time t and the vertical axis is the time series Y_t. (a) Nonlinear increasing trend. (b) Nonlinear decreasing trend.

Figure 4.6 Cyclical component (imposed on the underlying trend). The horizontal axis is time t and the vertical axis is the time series Y_t.

Figure 7.1 The big O notation.

Figure 7.2 The Ω notation.

Figure 7.3 The Θ notation.

Figure 7.4 Symbols used in flowchart.

Figure 7.5 Flowchart to add two numbers entered by user.

Figure 7.6 Flowchart to find all roots of a quadratic equation ax² + bx + c = 0.

Figure 7.7 Flowchart.

Figure 8.1 The box plot.

Figure 8.2 Box plot example.

Figure 9.1 Scatter plot of temperature versus ice cream sales.

Figure 9.2 Heatmap of handwritten digit data.

Figure 9.3 Map of earthquake magnitudes recorded in Chile.

Figure 9.4 Spatial distribution of earthquake magnitudes (Mariani et al. 2016).

Figure 9.5 Number of text messages sent.

Figure 9.6 Normal Q–Q plot.

Figure 9.7 Risk of loan default. Source: Tableau Viz Gallery.

Figure 9.8 Top five publishing markets. Source: Modified from International Publishers Association – Annual Report.

Figure 9.9 High yield defaulted issuer and volume trends. Source: Based on Fitch High Yield Default Index, Bloomberg.

Figure 9.10 Statistics page for popular movies and cinema locations. Source: Google Charts.

Figure 10.1 One-step binomial tree for the return process.

Figure 11.1 Height versus weight.

Figure 11.2 Visualizing low-dimensional data.

Figure 11.3 2D data set.

Figure 11.4 First PCA axis.

Figure 11.5 Second PCA axis.

Figure 11.6 New axis.

Figure 11.7 Scatter plot of Royal Dutch Shell stock versus Exxon Mobil stock.

Figure 12.1 Classification (by quadrant) of earthquakes and explosions using the Chernoff and Kullback–Leibler differences.

Figure 12.2 Classification (by quadrant) of Lehman Brothers collapse and Flash crash event using the Chernoff and Kullback–Leibler differences.

Figure 12.3 Clustering results for the earthquake and explosion series based on symmetric divergence using PAM algorithm.

Figure 12.4 Clustering results for the Lehman Brothers collapse, Flash crash event, Citigroup (2009), and IAG (2011) stock data based on symmetric divergence using the PAM algorithm.

Figure 13.1 Scatter plot of data in Table 13.1.

Figure 16.1 The xy-plane and several other horizontal planes.

Figure 16.2 The xy-plane and several parallel planes.

Figure 16.3 The plane x + y + z = 1.

Figure 16.4 Two class problem when data is linearly separable.

Figure 16.5 Two class problem when data is not linearly separable.

Figure 16.6 ROC curve for linear SVM.

Figure 16.7 ROC curve for nonlinear SVM.

Figure 17.1 Single hidden layer feed-forward neural networks.

Figure 17.2 Simple recurrent neural network.

Figure 17.3 Long short-term memory unit.

Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LSTM.

Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LSTM.

Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LSTM.

Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LSTM.

Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LSTM.

Figure 18.1 3D power spectra of the daily returns from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.

Figure 18.2 3D power spectra of the returns (generated per minute) from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase.

Figure 19.1 Time-frequency image of explosion 1 recorded by ANMO (Table 19.2).

Figure 19.2 Time-frequency image of earthquake 1 recorded by ANMO (Table 19.2).

Figure 19.3 Three-dimensional graphic information of explosion 1 recorded by ANMO (Table 19.2).

Figure 19.4 Three-dimensional graphic information of earthquake 1 recorded by ANMO (Table 19.2).

Figure 19.5 Time-frequency image of explosion 2 recorded by TUC (Table 19.3).

Figure 19.6 Time-frequency image of earthquake 2 recorded by TUC (Table 19.3).

Figure 19.7 Three-dimensional graphic information of explosion 2 recorded by TUC (Table 19.3).

Figure 19.8 Three-dimensional graphic information of earthquake 2 recorded by TUC (Table 19.3).

Figure 21.1 R/S for volcanic eruptions 1 and 2.

Figure 21.2 DFA for volcanic eruptions 1 and 2.

Figure 21.3 DEA for volcanic eruptions 1 and 2.

List of Tables

Table 2.1 Examples of random vectors.

Table 3.1 Ramus Bone Length at Four Ages for 20 Boys.

Table 4.1 Time series data of the volume of sales over a six-hour period.

Table 4.2 Simple moving average forecasts.

Table 4.3 Time series data used in Example 4.6.

Table 4.4 Weighted moving average forecasts.

Table 4.5 Trend projection of weighted moving average forecasts.

Table 4.6 Exponential smoothing forecasts of volume of sales.

Table 4.7 Exponential smoothing forecasts from Example 4.9.

Table 4.8 Adjusted exponential smoothing forecasts.

Table 6.1 Numbers.

Table 6.2 Files mode in Python.

Table 7.1 Common asymptotic notations.

Table 9.1 Temperature versus ice cream sales.

Table 12.1 Events information.

Table 12.2 Discriminant scores for earthquakes and explosions groups.

Table 12.3 Discriminant scores for Lehman Brothers collapse and Flash crash event.

Table 12.4 Discriminant scores for Citigroup in 2009 and IAG stock in 2011.

Table 13.1 Data matrix.

Table 13.2 Distance matrix.

Table 13.3 Stress and goodness of fit.

Table 13.4 Data matrix.

Table 14.1 Models' performances on the test dataset with 23 variables using AUC and mean square error (MSE) values for the five models.

Table 14.2 Top 10 variables selected by the Random forest algorithm.

Table 14.3 Performance for the four models using the top 10 features from model Random forest on the test dataset.

Table 15.1 Market basket transaction data.

Table 15.2 A binary 0/1 representation of market basket transaction data.

Table 15.3 Grocery transactional data.

Table 15.4 Transaction data.

Table 16.1 Models' performances on the test dataset.

Table 18.1 Percentage of power for Discover data.

Table 18.2 Percentage of power for JPM data.

Table 18.3 Percentage of power for Microsoft data.

Table 18.4 Percentage of power for Walmart data.

Table 19.1 Determining p and q for N = 16.

Table 19.2 Percentage of total power (energy) for Albuquerque, New Mexico (ANMO) seismic station.

Table 19.3 Percentage of total power (energy) for Tucson, Arizona (TUC) seismic station.

Table 21.1 Moments of the Poisson distribution with intensity λ.

Table 21.2 Moments of the Γ(a, b) distribution.

Table 21.3 Scaling exponents of Volcanic Data time series.

Preface

This textbook is dedicated to practitioners and to graduate and advanced undergraduate students who have an interest in Data Science, Business Analytics, and Statistical and Mathematical Modeling in different disciplines such as Finance, Geophysics, and Engineering. This book is designed to serve as a textbook for several courses in the aforementioned areas and as a reference guide for practitioners in the industry.

The book has a strong theoretical background and several applications to specific practical problems. It contains numerous techniques applicable to modern data science and other disciplines. In today's world, many fields are confronted with increasingly large amounts of complex data. Financial, healthcare, and geophysical data sampled with high frequency are no exception. These staggering amounts of data pose special challenges to the world of finance and other disciplines such as healthcare and geophysics, as traditional models and information technology tools can be poorly suited to grapple with their size and complexity. Probabilistic modeling, mathematical modeling, and statistical data analysis attempt to discover order from apparent disorder; this textbook may serve as a guide to various new systematic approaches on how to implement these quantitative activities with complex data sets.

The textbook is split into five distinct parts. In the first part of this book, Foundations of Data Science, we will discuss some fundamental mathematical and statistical concepts which form the basis for the study of data science. In the second part of the book, Data Science in Practice, we will present a brief introduction to R and Python programming and how to write algorithms. In addition, various techniques for data preprocessing, validations, and visualizations will be discussed. In the third part, Data Mining and Machine Learning Techniques for Complex Data Sets, and the fourth part of the book, Advanced Models for Big Data Analytics and Complex Data Sets, we will provide exhaustive techniques for analyzing and predicting different types of complex data sets.

We conclude this book with a discussion of ethics in data science: With great power comes great responsibility.

The authors express their deepest gratitude to Wiley for making the publication a reality.

El Paso, TX and Mahwah, NJ, USA
September 2021

Maria Cristina Mariani
Osei Kofi Tweneboah
Maria Pia Beccar-Varela

1 Background of Data Science

1.1 Introduction

Data science is one of the most promising and high-demand career paths for skilled professionals in the 21st century. Currently, successful data professionals understand that they must advance past the traditional skills of analyzing large amounts of data, statistical learning, and programming. In order to explore and discover useful information for their companies or organizations, data scientists must have a good grasp of the full spectrum of the data science life cycle and have a level of flexibility and understanding to maximize returns at each phase of the process.

Data science is a "concept to unify statistics, mathematics, computer science, data analysis, machine learning and their related methods" in order to find trends and to understand and analyze actual phenomena with data. Due to the Coronavirus disease (COVID-19), many colleges, institutions, and large organizations asked their nonessential employees to work virtually. The virtual meetings have provided colleges and companies with plenty of data. Some aspects of the data suggest that virtual fatigue is on the rise. Virtual fatigue is defined as the burnout associated with the overdependence on virtual platforms for communication. Data science provides tools to explore and reveal the best and worst aspects of virtual work.

In the past decade, data scientists have become necessary assets and are present in almost all institutions and organizations. These professionals are data-driven individuals with high-level technical skills who are capable of building complex quantitative algorithms to organize and synthesize large amounts of information used to answer questions and drive strategy in their organization. This is coupled with the experience in communication and leadership needed to deliver tangible results to various stakeholders across an organization or business.

Data scientists need to be curious and result-oriented, with good domain-specific knowledge and communication skills that allow them to explain very technical results to their nontechnical counterparts. They possess a strong quantitative background in statistics and mathematics as well as programming knowledge with a focus on data warehousing, mining, and modeling to build and analyze algorithms. In fact, data scientists are a group of analytical data experts who have the technical skills to solve complex problems and the curiosity to explore how problems need to be solved.

1.2 Origin of Data Science

Data scientists are part mathematicians, part statisticians, and part computer scientists. And because they span both the business and information technology (IT) worlds, they are in high demand and well paid. Data scientists were not very popular some decades ago; however, their sudden popularity reflects how businesses now think about "big data." Big data is defined as a field that treats ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by traditional data-processing application software. That bulky mass of unstructured information can no longer be ignored and forgotten. It is a virtual gold mine that helps boost revenue as long as there is someone who explores and discovers business insights that no one thought to look for before. Many data scientists began their careers as statisticians, business analysts, or data analysts. However, as big data began to grow and evolve, those roles evolved as well. Data is no longer just an add-on for IT to handle. It is vital information that requires analysis, creative curiosity, and the ability to translate high-tech ideas into innovative ways to make a profit and to help practitioners make informed decisions.

1.3 Who is a Data Scientist?

The term "data scientist" was invented as recently as 2008, when companies realized the need for data professionals who are skilled in organizing and analyzing massive amounts of data. Data scientists are quantitative and analytical data experts who utilize their skills in both technology and social science to find trends and manage the data around them. With the growth of big data integration in business, they have moved to the forefront of the data revolution. They are part mathematicians, statisticians, computer programmers, and analysts who are equipped with a diverse and wide-ranging skill set, balancing knowledge in several computer programming languages with advanced experience in statistical learning and data visualization.

There is no definitive job description for a data scientist role. However, we outline here some of the things they do:

● Collecting and recording large amounts of unruly data and transforming it into a more usable format.

● Solving business-related problems using data-driven techniques.

● Working with a variety of programming languages and statistical software, including SAS, Minitab, R, and Python.

● Having a strong background in mathematics and statistics, including statistical tests and distributions.

● Staying on top of quantitative and analytical techniques such as machine learning, deep learning, and text analytics.

● Communicating and collaborating with both IT and business.

● Looking for order and patterns in data, as well as spotting trends that enable businesses to make informed decisions.

Some of the useful tools that every data scientist or practitioner needs are outlined below:

● Data preparation: The process of cleaning and transforming raw data into suitable formats prior to processing and analysis.

● Data visualization: The presentation of data in a pictorial or graphical format so it can be easily analyzed.

● Statistical learning or machine learning: A branch of artificial intelligence based on mathematical algorithms and automation. Artificial intelligence (AI) refers to the process of building smart machines capable of performing tasks that typically require human intelligence. They are designed to make decisions, often using real-time data. Real-time data is information that is passed along to the end user as soon as it is gathered.

● Deep learning: An area of statistical learning research that uses data to model complex abstractions.

● Pattern recognition: Technology that recognizes patterns in data (often used interchangeably with machine learning).

● Text analytics: The process of examining unstructured data and drawing meaning out of written communication.

We will discuss all of the above tools in detail in this book. There are several scientific and programming skills that every data scientist should have. They must be able to utilize key technical tools and skills, including R, Python, SAS, SQL, Tableau, and several others. Because technology is ever growing, data scientists must always learn new and emerging techniques to stay on top of their game. We will discuss R and Python programming in Chapters 5 and 6.
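As a small illustration of the data preparation step listed among the tools above, the following Python sketch cleans a handful of raw records. The record layout, the function name `prepare`, and the sample values are assumptions made for this example only; they do not come from the book, and real pipelines typically rely on dedicated libraries.

```python
def prepare(records):
    """Clean raw (name, value) string pairs into (str, float) tuples.

    Hypothetical helper for illustration: trims whitespace, lowercases
    names, and drops rows with missing or non-numeric values.
    """
    cleaned = []
    for name, value in records:
        name = name.strip().lower()
        value = value.strip()
        if not name or not value:
            continue  # drop rows with a missing field
        try:
            cleaned.append((name, float(value)))
        except ValueError:
            continue  # drop rows whose value is not numeric
    return cleaned


# Made-up raw input with inconsistent casing, padding, and gaps.
raw = [(" Alice ", "3.5"), ("BOB", "n/a"), ("", "2.0"), ("carol", " 7 ")]
print(prepare(raw))
```

Dropping incomplete rows is only one possible policy; Chapter 8 of the book's outline lists alternatives such as mean substitution and imputation.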

1.4 Big Data

Big data is a term applied to ways to analyze, systematically extract information from, or otherwise deal with data sets that are too large or complex to be dealt with by classical data-processing tools. In particular, it refers to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process the data with low latency. Sources of big data include sensors, the stock market, devices, video/audio, networks, log files, transactional applications, the web, and social media, and much of it is generated in real time and at a very large scale.

In recent times, the term "big data" (both stored and real-time) tends to refer to the use of user behavior analytics (UBA), predictive analytics, or certain other advanced data analytics methods that extract value from data. UBA solutions look at patterns of human behavior and then apply algorithms and statistical analysis to detect meaningful anomalies from those patterns, that is, anomalies that indicate potential threats. Examples include the detection of hackers, insider threats, targeted attacks, financial fraud, and several others.
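The idea of detecting anomalies in behavior patterns can be sketched with a simple z-score rule. This is only one of many possible statistical criteria and is not presented in the text as the method UBA products use; the function name `flag_anomalies`, the threshold of 2.0, and the login counts are all hypothetical.

```python
from statistics import mean, stdev


def flag_anomalies(counts, threshold=2.0):
    """Return the indices whose z-score magnitude exceeds the threshold.

    Assumption for illustration: an observation is "anomalous" when it
    lies more than `threshold` sample standard deviations from the mean.
    """
    mu = mean(counts)
    sigma = stdev(counts)
    if sigma == 0:
        return []  # no variation, so nothing stands out
    return [i for i, c in enumerate(counts) if abs(c - mu) / sigma > threshold]


# Made-up daily login counts for one user; index 6 is deliberately unusual.
daily_logins = [12, 15, 11, 14, 13, 12, 95, 14, 13, 12]
print(flag_anomalies(daily_logins))  # prints [6]
```

A large outlier inflates the standard deviation itself, so robust alternatives (for example, rules based on the median) are often preferred in practice.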

Predictive analytics deals with the process of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. Generally, predictive analytics does not tell you what will happen in the future. However, it forecasts what might happen in the future with some degree of certainty. Predictive analytics goes hand in hand with big data: businesses and organizations collect large amounts of real-time customer data, and predictive analytics uses this historical data, combined with customer insight, to forecast future events. Predictive analytics helps organizations to use big data to move from a historical view to a forward-looking perspective of the customer. In this book, we will discuss several methods for analyzing big data.
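To make "forecasting what might happen with some degree of certainty" concrete, here is a minimal sketch that fits a straight-line trend to historical values by ordinary least squares and reports a point forecast together with the residual spread as a rough indication of uncertainty. The linear-trend model, the function names, and the monthly figures are assumptions chosen for illustration; the text does not prescribe this model.

```python
def fit_line(ys):
    """Ordinary least squares fit of y = a + b*t for t = 0, 1, ..., n-1."""
    n = len(ys)
    ts = range(n)
    t_bar = sum(ts) / n
    y_bar = sum(ys) / n
    sxx = sum((t - t_bar) ** 2 for t in ts)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(ts, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return intercept, slope


def forecast_next(ys):
    """Return (point forecast for t = n, residual standard error).

    Needs at least three observations, since the residual standard
    error uses n - 2 degrees of freedom.
    """
    a, b = fit_line(ys)
    n = len(ys)
    residuals = [y - (a + b * t) for t, y in enumerate(ys)]
    spread = (sum(r * r for r in residuals) / (n - 2)) ** 0.5
    return a + b * n, spread


# Made-up monthly purchase counts for a hypothetical customer segment.
history = [100, 104, 111, 115, 121, 124]
point, spread = forecast_next(history)
print(f"next month: about {point:.1f}, typical error around {spread:.1f}")
```

The spread is a statement about how well the line fitted the past, not a guarantee about the future, which matches the caution in the paragraph above.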

1.4.1 Characteristics of Big Data

Big data has one or more of the following characteristics: high volume, high velocity, high variety, and high veracity. That is, the data sets are characterized by huge amounts (volume) of frequently updated data (velocity) in various types, such as numeric, textual, audio, images, and videos (variety), with high quality (veracity). We briefly discuss each below.

Volume: Volume describes the quantity of generated and stored data. The size of the data determines the value and potential insight, and whether it can be considered big data or not.

Velocity: Velocity describes the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development. Big data is often available both stored and in real time. Compared to small data, big data is produced more continually (the interval could be a nanosecond, a second, a minute, hours, etc.). Two types of velocity related to big data are the frequency of generation and the frequency of handling, recording, and reporting.

Variety: Variety describes the type and formats of the data. This helps people who analyze it to effectively use the resulting insight. Big data draws from different formats and completes missing pieces through data fusion. Data fusion is a term used to describe the technique of integrating multiple data sources to produce more consistent,
