Data Science in Theory and Practice: Techniques for Big Data Analytics and Complex Data Sets Maria C. Mariani

Data Science in Theory and Practice
Techniques for Big Data Analytics and Complex Data Sets
Maria Cristina Mariani
University of Texas, El Paso
El Paso, United States
Osei Kofi Tweneboah
Ramapo College of New Jersey
Mahwah, United States
Maria Pia Beccar-Varela
University of Texas, El Paso
El Paso, United States
This first edition first published 2022
© 2022 John Wiley and Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions
The right of Maria Cristina Mariani, Osei Kofi Tweneboah, and Maria Pia Beccar-Varela to be identified as the authors of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
In view of ongoing research, equipment modifications, changes in governmental regulations, and the constant flow of information relating to the use of experimental reagents, equipment, and devices, the reader is urged to review and evaluate the information provided in the package insert or instructions for each chemical, piece of equipment, reagent, or device for, among other things, any changes in the instructions or indication of usage and for added warnings and precautions. While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging-in-Publication Data applied for
ISBN: 9781119674689
Cover Design: Wiley
Cover Image: © nobeastsofierce/Shutterstock
Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India
10 9 8 7 6 5 4 3 2 1
Contents
List of Figures xvii
List of Tables xxi
Preface xxiii
1 Background of Data Science 1
1.1 Introduction 1
1.2 Origin of Data Science 2
1.3 Who is a Data Scientist? 2
1.4 Big Data 3
1.4.1 Characteristics of Big Data 4
1.4.2 Big Data Architectures 5
2 Matrix Algebra and Random Vectors 7
2.1 Introduction 7
2.2 Some Basics of Matrix Algebra 7
2.2.1 Vectors 7
2.2.2 Matrices 8
2.3 Random Variables and Distribution Functions 12
2.3.1 The Dirichlet Distribution 15
2.3.2 Multinomial Distribution 17
2.3.3 Multivariate Normal Distribution 18
2.4 Problems 19
3 Multivariate Analysis 21
3.1 Introduction 21
3.2 Multivariate Analysis: Overview 21
3.3 Mean Vectors 22
3.4 Variance–Covariance Matrices 24
3.5 Correlation Matrices 26
3.6 Linear Combinations of Variables 28
3.6.1 Linear Combinations of Sample Means 29
3.6.2 Linear Combinations of Sample Variance and Covariance 29
3.6.3 Linear Combinations of Sample Correlation 30
3.7 Problems 31
4 Time Series Forecasting 35
4.1 Introduction 35
4.2 Terminologies 36
4.3 Components of Time Series 39
4.3.1 Seasonal 39
4.3.2 Trend 40
4.3.3 Cyclical 41
4.3.4 Random 42
4.4 Transformations to Achieve Stationarity 42
4.5 Elimination of Seasonality via Differencing 44
4.6 Additive and Multiplicative Models 44
4.7 Measuring Accuracy of Different Time Series Techniques 45
4.7.1 Mean Absolute Deviation 46
4.7.2 Mean Absolute Percent Error 46
4.7.3 Mean Square Error 47
4.7.4 Root Mean Square Error 48
4.8 Averaging and Exponential Smoothing Forecasting Methods 48
4.8.1 Averaging Methods 49
4.8.1.1 Simple Moving Averages 49
4.8.1.2 Weighted Moving Averages 51
4.8.2 Exponential Smoothing Methods 54
4.8.2.1 Simple Exponential Smoothing 54
4.8.2.2 Adjusted Exponential Smoothing 55
4.9 Problems 57
5 Introduction to R 61
5.1 Introduction 61
5.2 Basic Data Types 62
5.2.1 Numeric Data Type 62
5.2.2 Integer Data Type 62
5.2.3 Character 63
5.2.4 Complex Data Types 63
5.2.5 Logical Data Types 64
5.3 Simple Manipulations – Numbers and Vectors 64
5.3.1 Vectors and Assignment 64
5.3.2 Vector Arithmetic 65
5.3.3 Vector Index 66
5.3.4 Logical Vectors 67
5.3.5 Missing Values 68
5.3.6 Index Vectors 69
5.3.6.1 Indexing with Logicals 69
5.3.6.2 A Vector of Positive Integral Quantities 69
5.3.6.3 A Vector of Negative Integral Quantities 69
5.3.6.4 Named Indexing 69
5.3.7 Other Types of Objects 70
5.3.7.1 Matrices 70
5.3.7.2 List 72
5.3.7.3 Factor 73
5.3.7.4 Data Frames 75
5.3.8 Data Import 76
5.3.8.1 Excel File 76
5.3.8.2 CSV File 76
5.3.8.3 Table File 77
5.3.8.4 Minitab File 77
5.3.8.5 SPSS File 77
5.4 Problems 78
6 Introduction to Python 81
6.1 Introduction 81
6.2 Basic Data Types 82
6.2.1 Number Data Type 82
6.2.1.1 Integer 82
6.2.1.2 Floating-Point Numbers 83
6.2.1.3 Complex Numbers 84
6.2.2 Strings 84
6.2.3 Lists 85
6.2.4 Tuples 86
6.2.5 Dictionaries 86
6.3 Number Type Conversion 87
6.4 Python Conditions 87
6.4.1 If Statements 88
6.4.2 The Else and Elif Clauses 89
6.4.3 The While Loop 90
6.4.3.1 The Break Statement 91
6.4.3.2 The Continue Statement 91
6.4.4 For Loops 91
6.4.4.1 Nested Loops 92
6.5 Python File Handling: Open, Read, and Close 93
6.6 Python Functions 93
6.6.1 Calling a Function in Python 94
6.6.2 Scope and Lifetime of Variables 94
6.7 Problems 95
7 Algorithms 97
7.1 Introduction 97
7.2 Algorithm – Definition 97
7.3 How to Write an Algorithm 98
7.3.1 Algorithm Analysis 99
7.3.2 Algorithm Complexity 99
7.3.3 Space Complexity 100
7.3.4 Time Complexity 100
7.4 Asymptotic Analysis of an Algorithm 101
7.4.1 Asymptotic Notations 102
7.4.1.1 Big O Notation 102
7.4.1.2 The Omega Notation, Ω 102
7.4.1.3 The Θ Notation 102
7.5 Examples of Algorithms 104
7.6 Flowchart 104
7.7 Problems 105
8 Data Preprocessing and Data Validations 109
8.1 Introduction 109
8.2 Definition – Data Preprocessing 109
8.3 Data Cleaning 110
8.3.1 Handling Missing Data 110
8.3.2 Types of Missing Data 110
8.3.2.1 Missing Completely at Random 110
8.3.2.2 Missing at Random 110
8.3.2.3 Missing Not at Random 111
8.3.3 Techniques for Handling the Missing Data 111
8.3.3.1 Listwise Deletion 111
8.3.3.2 Pairwise Deletion 111
8.3.3.3 Mean Substitution 112
8.3.3.4 Regression Imputation 112
8.3.3.5 Multiple Imputation 112
8.3.4 Identifying Outliers and Noisy Data 113
8.3.4.1 Binning 113
8.3.4.2 Box and Whisker plot 113
8.4 Data Transformations 115
8.4.1 Min–Max Normalization 115
8.4.2 Z-score Normalization 115
8.5 Data Reduction 116
8.6 Data Validations 117
8.6.1 Methods for Data Validation 117
8.6.1.1 Simple Statistical Criterion 117
8.6.1.2 Fourier Series Modeling and SSC 118
8.6.1.3 Principal Component Analysis and SSC 118
8.7 Problems 119
9 Data Visualizations 121
9.1 Introduction 121
9.2 Definition – Data Visualization 121
9.2.1 Scientific Visualization 123
9.2.2 Information Visualization 123
9.2.3 Visual Analytics 124
9.3 Data Visualization Techniques 126
9.3.1 Time Series Data 126
9.3.2 Statistical Distributions 127
9.3.2.1 Stem-and-Leaf Plots 127
9.3.2.2 Q–Q Plots 127
9.4 Data Visualization Tools 129
9.4.1 Tableau 129
9.4.2 Infogram 130
9.4.3 Google Charts 132
9.5 Problems 133
10 Binomial and Trinomial Trees 135
10.1 Introduction 135
10.2 The Binomial Tree Method 135
10.2.1 One Step Binomial Tree 136
10.2.2 Using the Tree to Price a European Option 139
10.2.3 Using the Tree to Price an American Option 140
10.2.4 Using the Tree to Price Any Path Dependent Option 141
10.3 Binomial Discrete Model 141
10.3.1 One-Step Method 141
10.3.2 Multi-step Method 145
10.3.2.1 Example: European Call Option 146
10.4 Trinomial Tree Method 147
10.4.1 What is the Meaning of Little o and Big O? 148
10.5 Problems 148
11 Principal Component Analysis 151
11.1 Introduction 151
11.2 Background of Principal Component Analysis 151
11.3 Motivation 152
11.3.1 Correlation and Redundancy 152
11.3.2 Visualization 153
11.4 The Mathematics of PCA 153
11.4.1 The Eigenvalues and Eigenvectors 156
11.5 How PCA Works 159
11.5.1 Algorithm 160
11.6 Application 161
11.7 Problems 162
12 Discriminant and Cluster Analysis 165
12.1 Introduction 165
12.2 Distance 165
12.3 Discriminant Analysis 166
12.3.1 Kullback–Leibler Divergence 167
12.3.2 Chernoff Distance 167
12.3.3 Application – Seismic Time Series 169
12.3.4 Application – Financial Time Series 171
12.4 Cluster Analysis 173
12.4.1 Partitioning Algorithms 174
12.4.2 k-Means Algorithm 174
12.4.3 k-Medoids Algorithm 175
12.4.4 Application – Seismic Time Series 176
12.4.5 Application – Financial Time Series 176
12.5 Problems 177
13 Multidimensional Scaling 179
13.1 Introduction 179
13.2 Motivation 180
13.3 Number of Dimensions and Goodness of Fit 182
13.4 Proximity Measures 183
13.5 Metric Multidimensional Scaling 183
13.5.1 The Classical Solution 184
13.6 Nonmetric Multidimensional Scaling 186
13.6.1 Shepard–Kruskal Algorithm 186
13.7 Problems 187
14 Classification and Tree-Based Methods 191
14.1 Introduction 191
14.2 An Overview of Classification 191
14.2.1 The Classification Problem 192
14.2.2 Logistic Regression Model 192
14.2.2.1 l1 Regularization 193
14.2.2.2 l2 Regularization 194
14.3 Linear Discriminant Analysis 194
14.3.1 Optimal Classification and Estimation of Gaussian Distribution 195
14.4 Tree-Based Methods 197
14.4.1 One Single Decision Tree 197
14.4.2 Random Forest 198
14.5 Applications 200
14.6 Problems 202
15 Association Rules 205
15.1 Introduction 205
15.2 Market Basket Analysis 205
15.3 Terminologies 207
15.3.1 Itemset and Support Count 207
15.3.2 Frequent Itemset 207
15.3.3 Closed Frequent Itemset 207
15.3.4 Maximal Frequent Itemset 208
15.3.5 Association Rule 208
15.3.6 Rule Evaluation Metrics 208
15.4 The Apriori Algorithm 210
15.4.1 An example of the Apriori Algorithm 211
15.5 Applications 213
15.5.1 Confidence 214
15.5.2 Lift 215
15.5.3 Conviction 215
15.6 Problems 216
16 Support Vector Machines 219
16.1 Introduction 219
16.2 The Maximal Margin Classifier 219
16.3 Classification Using a Separating Hyperplane 223
16.4 Kernel Functions 225
16.5 Applications 225
16.6 Problems 227
17 Neural Networks 231
17.1 Introduction 231
17.2 Perceptrons 231
17.3 Feed Forward Neural Network 231
17.4 Recurrent Neural Networks 233
17.5 Long Short-Term Memory 234
17.5.1 Residual Connections 235
17.5.2 Loss Functions 236
17.5.3 Stochastic Gradient Descent 236
17.5.4 Regularization – Ensemble Learning 237
17.6 Application 237
17.6.1 Emergent and Developed Market 237
17.6.2 The Lehman Brothers Collapse 237
17.6.3 Methodology 238
17.6.4 Analyses of Data 238
17.6.4.1 Results of the Emergent Market Index 238
17.6.4.2 Results of the Developed Market Index 238
17.7 Significance of Study 239
17.8 Problems 240
18 Fourier Analysis 245
18.1 Introduction 245
18.2 Definition 245
18.3 Discrete Fourier Transform 246
18.4 The Fast Fourier Transform (FFT) Method 247
18.5 Dynamic Fourier Analysis 250
18.5.1 Tapering 251
18.5.2 Daniell Kernel Estimation 252
18.6 Applications of the Fourier Transform 253
18.6.1 Modeling Power Spectrum of Financial Returns Using Fourier Transforms 253
18.6.2 Image Compression 259
18.7 Problems 259
19 Wavelets Analysis 261
19.1 Introduction 261
19.1.1 Wavelets Transform 262
19.2 Discrete Wavelets Transforms 264
19.2.1 Haar Wavelets 265
19.2.1.1 Haar Functions 265
19.2.1.2 Haar Transform Matrix 266
19.2.2 Daubechies Wavelets 267
19.3 Applications of the Wavelets Transform 269
19.3.1 Discriminating Between Mining Explosions and Cluster of Earthquakes 269
19.3.1.1 Background of Data 269
19.3.1.2 Results 269
19.3.2 Finance 271
19.3.3 Damage Detection in Frame Structures 275
19.3.4 Image Compression 275
19.3.5 Seismic Signals 275
19.4 Problems 276
20 Stochastic Analysis 279
20.1 Introduction 279
20.2 Necessary Definitions from Probability Theory 279
20.3 Stochastic Processes 280
20.3.1 The Index Set 281
20.3.2 The State Space 281
20.3.3 Stationary and Independent Components 281
20.3.4 Stationary and Independent Increments 282
20.3.5 Filtration and Standard Filtration 283
20.4 Examples of Stochastic Processes 284
20.4.1 Markov Chains 285
20.4.1.1 Examples of Markov Processes 286
20.4.1.2 The Chapman–Kolmogorov Equation 287
20.4.1.3 Classification of States 289
20.4.1.4 Limiting Probabilities 290
20.4.1.5 Branching Processes 291
20.4.1.6 Time Homogeneous Chains 293
20.4.2 Martingales 294
20.4.3 Simple Random Walk 294
20.4.4 The Brownian Motion (Wiener Process) 294
20.5 Measurable Functions and Expectations 295
20.5.1 Radon–Nikodym Theorem and Conditional Expectation 296
20.6 Problems 299
21 Fractal Analysis – Lévy, Hurst, DFA, DEA 301
21.1 Introduction and Definitions 301
21.2 Lévy Processes 301
21.2.1 Examples of Lévy Processes 304
21.2.1.1 The Poisson Process (Jumps) 305
21.2.1.2 The Compound Poisson Process 305
21.2.1.3 Inverse Gaussian (IG) Process 306
21.2.1.4 The Gamma Process 307
21.2.2 Exponential Lévy Models 307
21.2.3 Subordination of Lévy Processes 308
21.2.4 Stable Distributions 309
21.3 Lévy Flight Models 311
21.4 Rescaled Range Analysis (Hurst Analysis) 312
21.5 Detrended Fluctuation Analysis (DFA) 315
21.6 Diffusion Entropy Analysis (DEA) 316
21.6.1 Estimation Procedure 317
21.6.1.1 The Shannon Entropy 317
21.6.2 The H–α Relationship for the Truncated Lévy Flight 319
21.7 Application – Characterization of Volcanic Time Series 321
21.7.1 Background of Volcanic Data 321
21.7.2 Results 321
21.8 Problems 323
22 Stochastic Differential Equations 325
22.1 Introduction 325
22.2 Stochastic Differential Equations 325
22.2.1 Solution Methods of SDEs 326
22.3 Examples 335
22.3.1 Modeling Asset Prices 335
22.3.2 Modeling Magnitude of Earthquake Series 336
22.4 Multidimensional Stochastic Differential Equations 337
22.4.1 The multidimensional Ornstein–Uhlenbeck Processes 337
22.4.2 Solution of the Ornstein–Uhlenbeck Process 338
22.5 Simulation of Stochastic Differential Equations 340
22.5.1 Euler–Maruyama Scheme for Approximating Stochastic Differential Equations 340
22.5.2 Euler–Milstein Scheme for Approximating Stochastic Differential Equations 341
22.6 Problems 343
23 Ethics: With Great Power Comes Great Responsibility 345
23.1 Introduction 345
23.2 Data Science Ethical Principles 346
23.2.1 Enhance Value in Society 346
23.2.2 Avoiding Harm 346
23.2.3 Professional Competence 347
23.2.4 Increasing Trustworthiness 348
23.2.5 Maintaining Accountability and Oversight 348
23.3 Data Science Code of Professional Conduct 348
23.4 Application 350
23.4.1 Project Planning 350
23.4.2 Data Preprocessing 350
23.4.3 Data Management 350
23.4.4 Analysis and Development 351
23.5 Problems 351
Bibliography 353
Index 359
List of Figures
Figure 4.1 Time series data of phase arrival times of an earthquake. 36
Figure 4.2 Time series data of financial returns corresponding to Bank of America (BAC) stock index. 37
Figure 4.3 Seasonal trend component. 40
Figure 4.4 Linear trend component. The horizontal axis is time t, and the vertical axis is the time series Y_t. (a) Linear increasing trend. (b) Linear decreasing trend. 41
Figure 4.5 Nonlinear trend component. The horizontal axis is time t and the vertical axis is the time series Y_t. (a) Nonlinear increasing trend. (b) Nonlinear decreasing trend. 41
Figure 4.6 Cyclical component (imposed on the underlying trend). The horizontal axis is time t and the vertical axis is the time series Y_t. 42
Figure 7.1 The big O notation. 102
Figure 7.2 The Ω notation. 103
Figure 7.3 The Θ notation. 103
Figure 7.4 Symbols used in flowchart. 105
Figure 7.5 Flowchart to add two numbers entered by user. 106
Figure 7.6 Flowchart to find all roots of a quadratic equation ax² + bx + c = 0. 107
Figure 7.7 Flowchart. 108
Figure 8.1 The box plot. 113
Figure 8.2 Box plot example. 114
Figure 9.1 Scatter plot of temperature versus ice cream sales. 122
Figure 9.2 Heatmap of handwritten digit data. 124
Figure 9.3 Map of earthquake magnitudes recorded in Chile. 125
Figure 9.4 Spatial distribution of earthquake magnitudes (Mariani et al. 2016). 126
Figure 9.5 Number of text messages sent. 128
Figure 9.6 Normal Q–Q plot. 128
Figure 9.7 Risk of loan default. Source: Tableau Viz Gallery. 130
Figure 9.8 Top five publishing markets. Source: Modified from International Publishers Association – Annual Report. 131
Figure 9.9 High yield defaulted issuer and volume trends. Source: Based on Fitch High Yield Default Index, Bloomberg. 131
Figure 9.10 Statistics page for popular movies and cinema locations. Source: Google Charts. 132
Figure 10.1 One-step binomial tree for the return process. 137
Figure 11.1 Height versus weight. 153
Figure 11.2 Visualizing low-dimensional data. 154
Figure 11.3 2D data set. 157
Figure 11.4 First PCA axis. 157
Figure 11.5 Second PCA axis. 157
Figure 11.6 New axis. 158
Figure 11.7 Scatter plot of Royal Dutch Shell stock versus Exxon Mobil stock. 161
Figure 12.1 Classification (by quadrant) of earthquakes and explosions using the Chernoff and Kullback–Leibler differences. 171
Figure 12.2 Classification (by quadrant) of Lehman Brothers collapse and Flash crash event using the Chernoff and Kullback–Leibler differences. 173
Figure 12.3 Clustering results for the earthquake and explosion series based on symmetric divergence using PAM algorithm. 176
Figure 12.4 Clustering results for the Lehman Brothers collapse, Flash crash event, Citigroup (2009), and IAG (2011) stock data based on symmetric divergence using the PAM algorithm. 177
Figure 13.1 Scatter plot of data in Table 13.1. 180
Figure 16.1 The xy-plane and several other horizontal planes. 220
Figure 16.2 The xy-plane and several parallel planes. 221
Figure 16.3 The plane x + y + z = 1. 221
Figure 16.4 Two class problem when data is linearly separable. 224
Figure 16.5 Two class problem when data is not linearly separable. 224
Figure 16.6 ROC curve for linear SVM. 226
Figure 16.7 ROC curve for nonlinear SVM. 227
Figure 17.1 Single hidden layer feed-forward neural networks. 232
Figure 17.2 Simple recurrent neural network. 234
Figure 17.3 Long short-term memory unit. 235
Figure 17.4 Philippines (PSI). (a) Basic RNN. (b) LSTM. 239
Figure 17.5 Thailand (SETI). (a) Basic RNN. (b) LSTM. 240
Figure 17.6 United States (NASDAQ). (a) Basic RNN. (b) LSTM. 241
Figure 17.7 JPMorgan Chase & Co. (JPM). (a) Basic RNN. (b) LSTM. 242
Figure 17.8 Walmart (WMT). (a) Basic RNN. (b) LSTM. 243
Figure 18.1 3D power spectra of the daily returns from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase. 255
Figure 18.2 3D power spectra of the returns (generated per minute) from the four analyzed stock companies. (a) Discover. (b) Microsoft. (c) Walmart. (d) JPM Chase. 257
Figure 19.1 Time-frequency image of explosion 1 recorded by ANMO (Table 19.2). 270
Figure 19.2 Time-frequency image of earthquake 1 recorded by ANMO (Table 19.2). 270
Figure 19.3 Three-dimensional graphic information of explosion 1 recorded by ANMO (Table 19.2). 272
Figure 19.4 Three-dimensional graphic information of earthquake 1 recorded by ANMO (Table 19.2). 272
Figure 19.5 Time-frequency image of explosion 2 recorded by TUC (Table 19.3). 273
Figure 19.6 Time-frequency image of earthquake 2 recorded by TUC (Table 19.3). 273
Figure 19.7 Three-dimensional graphic information of explosion 2 recorded by TUC (Table 19.3). 274
Figure 19.8 Three-dimensional graphic information of earthquake 2 recorded by TUC (Table 19.3). 274
Figure 21.1 R/S for volcanic eruptions 1 and 2. 322
Figure 21.2 DFA for volcanic eruptions 1 and 2. 323
Figure 21.3 DEA for volcanic eruptions 1 and 2. 323
List of Tables
Table 2.1 Examples of random vectors. 13
Table 3.1 Ramus Bone Length at Four Ages for 20 Boys. 33
Table 4.1 Time series data of the volume of sales over a six-hour period. 50
Table 4.2 Simple moving average forecasts. 50
Table 4.3 Time series data used in Example 4.6. 52
Table 4.4 Weighted moving average forecasts. 52
Table 4.5 Trend projection of weighted moving average forecasts. 53
Table 4.6 Exponential smoothing forecasts of volume of sales. 55
Table 4.7 Exponential smoothing forecasts from Example 4.9. 56
Table 4.8 Adjusted exponential smoothing forecasts. 57
Table 6.1 Numbers. 83
Table 6.2 Files mode in Python. 93
Table 7.1 Common asymptotic notations. 103
Table 9.1 Temperature versus ice cream sales. 122
Table 12.1 Events information. 170
Table 12.2 Discriminant scores for earthquakes and explosions groups. 170
Table 12.3 Discriminant scores for Lehman Brothers collapse and Flash crash event. 172
Table 12.4 Discriminant scores for Citigroup in 2009 and IAG stock in 2011. 172
Table 13.1 Data matrix. 180
Table 13.2 Distance matrix. 181
Table 13.3 Stress and goodness of fit. 182
Table 13.4 Data matrix. 188
Table 14.1 Models' performances on the test dataset with 23 variables using AUC and mean square error (MSE) values for the five models. 201
Table 14.2 Top 10 variables selected by the Random forest algorithm. 201
Table 14.3 Performance for the four models using the top 10 features from model Random forest on the test dataset. 201
Table 15.1 Market basket transaction data. 206
Table 15.2 A binary 0/1 representation of market basket transaction data. 206
Table 15.3 Grocery transactional data. 211
Table 15.4 Transaction data. 216
Table 16.1 Models' performances on the test dataset. 226
Table 18.1 Percentage of power for Discover data. 254
Table 18.2 Percentage of power for JPM data. 254
Table 18.3 Percentage of power for Microsoft data. 254
Table 18.4 Percentage of power for Walmart data. 254
Table 19.1 Determining p and q for N = 16. 266
Table 19.2 Percentage of total power (energy) for Albuquerque, New Mexico (ANMO) seismic station. 271
Table 19.3 Percentage of total power (energy) for Tucson, Arizona (TUC) seismic station. 271
Table 21.1 Moments of the Poisson distribution with intensity λ. 306
Table 21.2 Moments of the Γ(a, b) distribution. 307
Table 21.3 Scaling exponents of Volcanic Data time series. 322
Preface
This textbook is dedicated to practitioners and to graduate and advanced undergraduate students who are interested in Data Science, Business Analytics, and Statistical and Mathematical Modeling in different disciplines such as Finance, Geophysics, and Engineering. This book is designed to serve as a textbook for several courses in the aforementioned areas and as a reference guide for practitioners in the industry.
The book has a strong theoretical background and several applications to specific practical problems. It contains numerous techniques applicable to modern data science and other disciplines. In today's world, many fields are confronted with increasingly large amounts of complex data. Financial, healthcare, and geophysical data sampled with high frequency are no exception. These staggering amounts of data pose special challenges to the world of finance and other disciplines such as healthcare and geophysics, as traditional models and information technology tools can be poorly suited to grapple with their size and complexity. Probabilistic modeling, mathematical modeling, and statistical data analysis attempt to discover order from apparent disorder; this textbook may serve as a guide to various new systematic approaches on how to implement these quantitative activities with complex data sets.
The textbook is split into five distinct parts. In the first part of this book, Foundations of Data Science, we will discuss some fundamental mathematical and statistical concepts which form the basis for the study of data science. In the second part of the book, Data Science in Practice, we will present a brief introduction to R and Python programming and how to write algorithms. In addition, various techniques for data preprocessing, validations, and visualizations will be discussed. In the third part, Data Mining and Machine Learning Techniques for Complex Data Sets, and the fourth part of the book, Advanced Models for Big Data Analytics and Complex Data Sets, we will provide exhaustive techniques for analyzing and predicting different types of complex data sets.
We conclude this book with a discussion of ethics in data science: With great power comes great responsibility.
The authors express their deepest gratitude to Wiley for making the publication a reality.
El Paso, TX and Mahwah, NJ, USA
September 2021
Maria Cristina Mariani
Osei Kofi Tweneboah
Maria Pia Beccar-Varela