
THE BIG R-BOOK

FROM DATA SCIENCE TO LEARNING MACHINES AND BIG DATA

This edition first published 2021 © 2021 John Wiley & Sons, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.

The right of Philippe J.S. De Brouwer to be identified as the author of this work has been asserted in accordance with law.

Registered Office

John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA

Editorial Office

111 River Street, Hoboken, NJ 07030, USA

For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.

Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that appears in standard print versions of this book may not be available in other formats.

Limit of Liability/Disclaimer of Warranty

While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Library of Congress Cataloging-in-Publication Data

Names: De Brouwer, Philippe J.S., author.

Title: The big R-book : from data science to learning machines and big data / Philippe J.S. De Brouwer.

Description: Hoboken, NJ, USA : Wiley, 2020. | Includes bibliographical references and index.

Identifiers: LCCN 2019057557 (print) | LCCN 2019057558 (ebook) | ISBN 9781119632726 (hardback) | ISBN 9781119632764 (adobe pdf) | ISBN 9781119632771 (epub)

Subjects: LCSH: R (Computer program language)

Classification: LCC QA76.73.R3 .D43 2020 (print) | LCC QA76.73.R3 (ebook) | DDC 005.13/3–dc23

LC record available at https://lccn.loc.gov/2019057557

LC ebook record available at https://lccn.loc.gov/2019057558

Cover Design: Wiley

Cover Images: Information Tide series and Particle Geometry series © agsandrew/Shutterstock, Abstract geometric landscape © gremlin/Getty Images, 3D illustration Rendering © MR.Cole_Photographer/Getty Images

Set in 9.5/12.5pt STIXTwoText by SPi Global, Chennai, India

Printed in the United States of America

10 9 8 7 6 5 4 3 2 1

To Joanna, Amelia and Maximilian

Short Overview

Foreword

About the Author

Acknowledgements

Preface

About the Companion Site

I Introduction

1 The Big Picture with Kondratiev and Kardashev

2 The Scientific Method and Data

3 Conventions

II Starting with R and Elements of Statistics

4 The Basics of R

5 Lexical Scoping and Environments

6 The Implementation of OO

7 Tidy R with the Tidyverse

8 Elements of Descriptive Statistics

9 Visualisation Methods

10 Time Series Analysis

11 Further Reading

32 R Markdown

33 knitr and LaTeX

34 An Automated Development Cycle

35 Writing and Communication Skills

36 Interactive Apps

VIII Bigger and Faster R

37 Parallel Computing

38 R and Big Data

39 Parallelism for Big Data

40 The Need for Speed

IX Appendices

A Create your own R package

B Levels of Measurement

C Trademark Notices

D Code Not Shown in the Body of the Book

E Answers to Selected Questions

Bibliography

Nomenclature

Index

Contents

Foreword

About the Author

Acknowledgements

Preface

About the Companion Site

I Introduction

1 The Big Picture with Kondratiev and Kardashev

2 The Scientific Method and Data

3 Conventions

II Starting with R and Elements of Statistics

4 The Basics of R

4.1 Getting Started with R

4.2 Variables

4.3 Data Types

4.3.1 The Elementary Types

4.3.2 Vectors

4.3.2.1 Creating Vectors

4.3.3 Accessing Data from a Vector

4.3.3.1 Vector Arithmetic

4.3.3.2 Vector Recycling

4.3.3.3 Reordering and Sorting

4.3.4 Matrices

4.3.4.1 Creating Matrices

4.3.4.2 Naming Rows and Columns

4.3.4.3 Access Subsets of a Matrix

4.6.1 Built-in Functions

4.6.2 Help with Functions

4.6.3 User-defined Functions

5 Lexical Scoping and Environments

5.1 Environments in R

5.2 Lexical Scoping in R

6 The Implementation of OO

6.1 Base Types

6.2 S3 Objects

6.2.1 Creating S3 Objects

6.2.2 Creating Generic Methods

6.2.3 Method Dispatch

6.2.4 Group Generic Functions

6.3 S4 Objects

6.3.1 Creating S4 Objects

6.3.2 Using S4 Objects

6.3.3 Validation of Input

6.3.4 Constructor functions

6.3.5 The .Data slot

6.3.6 Recognising Objects, Generic Functions, and Methods

6.3.7 Creating S4 Generics

6.3.8 Method Dispatch

6.4 The Reference Class, refclass, RC or R5 Model

6.4.1 Creating RC Objects

6.4.2 Important Methods and Attributes

6.5 Conclusions about the OO Implementation

7 Tidy R with the Tidyverse

7.1 The Philosophy of the Tidyverse

7.2 Packages in the Tidyverse

7.2.1 The Core Tidyverse

7.2.2 The Non-core Tidyverse

7.3 Working with the Tidyverse

7.3.1 Tibbles

7.3.2 Piping with R

7.3.3 Attention Points When Using the Pipe

7.3.4 Advanced Piping

7.3.4.1 The Dollar Pipe

7.3.4.2 The T-Pipe

7.3.4.3 The Assignment Pipe

7.3.5 Conclusion

8 Elements of Descriptive Statistics

8.1 Measures of Central Tendency

8.1.1 Mean

8.1.1.1 The Arithmetic Mean

8.1.1.2 Generalised Means

8.1.2 The Median

8.1.3 The Mode

8.2 Measures of Variation or Spread

8.3 Measures of Covariation

14.2.2 Creating the Database

14.2.3 Creating the Tables and Relations

14.3 Adding Data to the Database

14.4 Querying the Database

14.4.1 The Basic Select Query

14.4.2 More Complex Queries

14.5 Modifying the Database Structure

14.6 Selected Features of SQL

14.6.1 Changing Data

14.6.2 Functions in SQL

15 Connecting R to an SQL Database

IV Data Wrangling

16 Anonymous Data

17 Data Wrangling in the tidyverse

17.1 Importing the Data

17.1.1 Importing from an SQL RDBMS

17.1.2 Importing Flat Files in the Tidyverse

17.1.2.1 CSV Files

17.1.2.2 Making Sense of Fixed-width Files

17.2 Tidy Data

17.3 Tidying Up Data with tidyr

17.3.1 Splitting Tables

17.3.2 Convert Headers to Data

17.3.3 Spreading One Column Over Many

17.3.4 Split One Columns into Many

17.3.5 Merge Multiple Columns Into One

17.3.6 Wrong Data

17.4 SQL-like Functionality via dplyr

17.4.1 Selecting Columns

17.4.2 Filtering Rows

17.4.3 Joining

17.4.4 Mutating Data

17.4.5 Set Operations

17.5 String Manipulation in the tidyverse

17.5.1 Basic String Manipulation

17.5.2 Pattern Matching with Regular Expressions

17.5.2.1 The Syntax of Regular Expressions

17.5.2.2 Functions Using Regex

17.6 Dates with lubridate

17.6.1 ISO 8601 Format

17.6.2 Time-zones

17.6.3 Extract Date and Time Components

17.6.4 Calculating with Date-times

17.6.4.1 Durations

17.6.4.2 Periods

17.6.4.3 Intervals

17.6.4.4 Rounding

17.7 Factors with Forcats

18 Dealing with Missing Data

19 Data Binning

23 Learning Machines

23.1 Decision Tree

23.1.1 Essential Background

23.1.1.1 The Linear Additive Decision Tree

23.1.1.2 The CART Method

23.1.1.3 Tree Pruning

23.1.1.4 Classification Trees

23.1.1.5 Binary Classification Trees

23.1.2 Important Considerations

23.1.2.1 Broadening the Scope

23.1.2.2 Selected Issues

23.1.3 Growing Trees with the Package rpart

23.1.3.1 Getting Started with the Function rpart()

23.1.3.2 Example of a Classification Tree with rpart

23.1.3.3 Visualising a Decision Tree with rpart.plot

23.1.3.4 Example of a Regression Tree with rpart

23.1.4 Evaluating the Performance of a Decision Tree

23.1.4.1 The Performance of the Regression Tree

23.1.4.2 The Performance of the Classification Tree

23.2 Random Forest

23.3 Artificial Neural Networks (ANNs)

23.3.1 The Basics of ANNs in R

23.3.2 Neural Networks in R

23.3.3 The Work-flow to for Fitting a NN

23.3.4 Cross Validate the NN

23.4 Support Vector Machine

23.4.1 Fitting a SVM in R

23.4.2 Optimizing the SVM

23.5 Unsupervised Learning and Clustering

23.5.1 k-Means Clustering

23.5.1.1 k-Means Clustering in R

23.5.1.2 PCA before Clustering

23.5.1.3 On the Relation Between PCA and k-Means

23.5.2 Visualizing Clusters in Three Dimensions

23.5.3 Fuzzy Clustering

23.5.4 Hierarchical Clustering

23.5.5 Other Clustering Methods

24 Towards a Tidy Modelling Cycle with modelr

24.1 Adding Predictions

24.2 Adding Residuals

24.3 Bootstrapping Data

24.4 Other Functions of modelr

25 Model Validation

25.1 Model Quality Measures

25.2 Predictions and Residuals

25.3 Bootstrapping

25.3.1 Bootstrapping in Base R

25.3.2 Bootstrapping in the tidyverse with modelr

25.4 Cross-Validation

25.4.1 Elementary Cross Validation

25.4.2 Monte Carlo Cross Validation

25.4.3 k-Fold Cross Validation

25.4.4 Comparing Cross Validation Methods

25.5 Validation in a Broader Perspective

26.1 Financial Analysis with quantmod

26.1.1 The Basics of quantmod

26.1.2 Types of Data Available in quantmod

26.1.3 Plotting with quantmod

26.1.4 The quantmod Data Structure

26.1.4.1 Sub-setting by Time and Date

26.1.6.1 Financial Models in quantmod

26.1.6.2 A Simple Model with quantmod

26.1.6.3 Testing the Model Robustness

27 Multi Criteria Decision Analysis (MCDA)

27.1 What and Why

27.2 General Work-flow

27.3 Identify the Issue at Hand: Steps 1 and 2

27.4 Step 3: the Decision Matrix

27.4.1 Construct a Decision Matrix

27.4.2 Normalize the Decision Matrix

27.5 Step 4: Delete Inefficient and Unacceptable Alternatives

27.5.1 Unacceptable Alternatives

27.5.2 Dominance – Inefficient Alternatives

27.6 Plotting Preference Relationships

27.7 Step 5: MCDA Methods

27.7.1 Examples of Non-compensatory Methods

27.7.1.1 The MaxMin Method

27.7.1.2 The MaxMax Method

27.7.2 The Weighted Sum Method (WSM)

27.7.3 Weighted Product Method (WPM)

27.7.4 ELECTRE

27.7.4.1 ELECTRE I

27.7.4.2 ELECTRE II

27.7.4.3 Conclusions ELECTRE

27.7.5 PROMethEE

27.7.5.1 PROMethEE I

27.7.5.2 PROMethEE II

27.7.6 PCA (Gaia)

27.7.7 Outranking Methods

27.7.8 Goal Programming

27.8 Summary MCDA

VI Introduction to Companies

28 Financial Accounting (FA)

28.1 The Statements of Accounts

28.1.1 Income Statement

28.1.2 Net Income: The P&L statement

28.1.3 Balance Sheet

28.2 The Value Chain

28.3 Further, Terminology

28.4 Selected Financial Ratios

29 Management Accounting

29.1 Introduction

29.1.1 Definition of Management Accounting (MA)

29.1.2 Management Information Systems (MIS)

29.2 Selected Methods in MA

29.2.1 Cost Accounting

29.2.2 Selected Cost Types

29.3 Selected Use Cases of MA

29.3.1 Balanced Scorecard

29.3.2 Key Performance Indicators (KPIs)

29.3.2.1 Lagging Indicators

29.3.2.2 Leading Indicators

29.3.2.3 Selected Useful KPIs

30 Asset Valuation Basics

30.1 Time Value of Money

30.1.1 Interest Basics

30.1.2 Specific Interest Rate Concepts

30.1.3 Discounting

30.2 Cash

30.3 Bonds

30.3.1 Features of a Bond

30.3.2 Valuation of Bonds

30.3.3 Duration

30.3.3.1 Macaulay Duration

30.3.3.2 Modified Duration

30.4 The Capital Asset Pricing Model (CAPM)

30.4.1 The CAPM Framework

30.4.2 The CAPM and Risk

30.4.3 Limitations and Shortcomings of the CAPM

30.5 Equities

30.5.1 Definition

30.5.2 Short History

30.5.3 Valuation of Equities

30.5.4 Absolute Value Models

30.5.4.1 Dividend Discount Model (DDM)

30.5.4.2 Free Cash Flow (FCF)

30.5.4.3 Discounted Cash Flow Model

30.5.4.4 Discounted Abnormal Operating Earnings Model

30.5.4.5 Net Asset Value Method or Cost Method

30.5.4.6 Excess Earnings Method

30.5.5 Relative Value Models

30.5.5.1 The Concept of Relative Value Models

30.5.5.2 The Price Earnings Ratio (PE)

30.5.5.3 Pitfalls when using PE Analysis

30.5.5.4 Other Company Value Ratios

30.5.6 Selection of Valuation Methods

30.5.7 Pitfalls in Company Valuation

30.5.7.1 Forecasting Performance

30.5.7.2 Results and Sensitivity

30.6 Forwards and Futures

30.7 Options

30.7.1 Definitions

30.7.2 Commercial Aspects

30.7.3 Short History

30.7.4 Valuation of Options at Maturity

30.7.4.1 A Long Call at Maturity

30.7.4.2 A Short Call at Maturity

30.7.4.3 Long and Short Put

30.7.4.4 The Put-Call Parity

30.7.5 The Black and Scholes Model

30.7.5.1 Pricing of Options Before Maturity

30.7.5.2 Apply the Black and Scholes Formula

30.7.5.3 The Limits of the Black and Scholes Model

30.7.6 The Binomial Model

30.7.6.1 Risk Neutral Method

30.7.6.2 The Equivalent Portfolio Binomial Model

30.7.6.3 Summary Binomial Model

30.7.7 Dependencies of the Option Price

30.7.7.1 Dependencies in a Long Call Option

30.7.7.2 Dependencies in a Long Put Option

30.7.7.3 Summary of Findings

30.7.8 The Greeks

30.7.9 Delta Hedging

30.7.10 Linear Option Strategies

30.7.10.1 Plotting a Portfolio of Options

30.7.10.2 Single Option Strategies

30.7.10.3 Composite Option Strategies

30.7.11 Integrated Option Strategies

30.7.11.1 The Covered Call

30.7.11.2 The Married Put

30.7.11.3 The Collar

30.7.12 Exotic Options

30.7.13 Capital Protected Structures

VII Reporting

31 A Grammar of Graphics with ggplot2

31.1 The Basics of ggplot2

31.2 Over-plotting

31.3 Case Study for ggplot2

32 R Markdown

33 knitr and LaTeX

34 An Automated Development Cycle

35 Writing and Communication Skills

36 Interactive Apps

36.1 Shiny

36.2 Browser Born Data Visualization

36.2.1 HTML-widgets

36.2.2 Interactive Maps with leaflet

36.2.3 Interactive Data Visualisation with ggvis

36.2.3.1 Getting Started in R with ggvis

36.2.3.2 Combining the Power of ggvis and Shiny

36.2.4 googleVis

36.3 Dashboards

36.3.1 The Business Case: a Diversity Dashboard

36.3.2 A Dashboard with flexdashboard

36.3.2.1 A Static Dashboard

36.3.2.2 Interactive Dashboards with flexdashboard

36.3.3 A Dashboard with shinydashboard

VIII Bigger and Faster R

37 Parallel Computing

37.1 Combine foreach and doParallel

37.2 Distribute Calculations over LAN with Snow

37.3 Using the GPU

37.3.1 Getting Started with gpuR

37.3.2 On the Importance of Memory use

37.3.3 Conclusions for GPU Programming

38 R and Big Data

38.1 Use a Powerful Server

38.1.1 Use R on a Server

38.1.2 Let the Database Server do the Heavy Lifting

38.2 Using more Memory than we have RAM

39 Parallelism for Big Data

39.2.3.1 A User Defined Function on Spark

40.2.2 Use Vectorisation where Appropriate

40.2.3 Pre-allocating Memory

40.2.4 Use the Fastest Function

40.2.5 Use the Fastest Package

40.2.6 Be Mindful about Details

40.2.7 Compile Functions

40.2.8 Use C or C++ Code in R

40.2.9 Using a C++ Source File in R

40.2.10 Call Compiled C++ Functions in R

40.3 Profiling Code

40.3.1 The Package profr

A Create your own R Package

A.1 Creating the Package in the R Console

A.2 Update the Package Description

A.3 Documenting the Functions xs

A.4 Loading the Package

A.5 Further Steps

B Levels of Measurement

B.1 Nominal Scale

B.2 Ordinal Scale

B.3 Interval Scale

B.4 Ratio Scale

C Trademark Notices

C.1 General Trademark Notices

C.2 R-Related Notices

C.2.1 Crediting Developers of R Packages

C.2.2 The R-packages used in this Book

D Code Not Shown in the Body of the Book

E Answers to Selected Questions

Bibliography

Nomenclature

Index

Foreword

This book brings together skills and knowledge that can help to boost your career. It is an excellent tool for people working as database managers, data scientists, quants, modellers, statisticians, analysts and more, who are knowledgeable about certain topics, but want to widen their horizon and understand what the others in this list do. A wider understanding means that we can do our job better and eventually open doors to new or enhanced careers.

The student who graduated from a science, technology, engineering or mathematics or similar program will find that this book helps to make a successful step from the academic world into any private or governmental company.

This book uses the popular (and free) software R as leitmotif to build up essential programming proficiency, understand databases, collect data, wrangle data, build models and select models from a suite of possibilities such as linear regression, logistic regression, neural networks, decision trees, multi criteria decision models, etc. and ultimately evaluate a model and report on it.

We will go the extra mile by explaining some essentials of accounting in order to build up to pricing of assets such as bonds, equities and options. This helps to deepen the understanding of how a company functions, is useful to be more result oriented in a private company, helps for one's own investments, and provides a good example of the theories mentioned before. We also spend time on the presentation of results and we use R to generate slides, text documents and even interactive websites! Finally, we explore big data and provide handy tips on speeding up code.

I hope that this book helps you to learn faster than me, and build a great and interesting career. Enjoy reading!

About the Author

Dr. Philippe J.S. De Brouwer leads expert teams in the service centre of HSBC in Krakow, is Honorary Consul for Belgium in Krakow, and is also guest professor at the University of Warsaw, Jagiellonian University, and AGH University of Science and Technology. He teaches both at executive MBA programs and mathematics faculties.

He studied theoretical physics, and later acquired his second Master degree while working. Finishing this Master, he solved the “fallacy of large numbers puzzle” that was formulated by P.A. Samuelson 38 years before and remained unsolved since then. In his Ph.D., he successfully challenged the assumptions of the Nobel Prize-winning “Modern Portfolio Theory” of H. Markowitz, by creating “Maslowian Portfolio Theory.”

His career brought him into insurance, banking, investment management, and back to banking, while his specialization shifted from IT, data science to people management.

For Fortis (now BNP), he created one of the first capital guaranteed funds and got promoted to director in 2000. In 2002, he joined KBC, where he merged four companies into one and subsequently became CEO of the merged entity in 2005. Under his direction, the company climbed from number 11 to number 5 on the market, while the number of competitors increased by 50%. In the aftermath of the 2008 crisis, he helped create a new asset manager for KBC in Ireland that soon accommodated the management of ca. 1000 investment funds and had about €120 billion under management. In 2012, he widened his scope by joining the risk management of the bank and specialized in statistics and numerical methods. Later, Philippe worked for the Royal Bank of Scotland (RBS) in London and specialized in Big Data, analytics and people management. In 2016, he joined HSBC and is passionate about building up a Centre of Excellence in risk management in the service centre in Krakow. One of his teams, the independent model review team, validates the most important models used in the banking group worldwide.

Married and father of two, he invests his private time in the future of education by volunteering as board member of the International School of Krakow. This way, he contributes modestly to the cosmopolitan ambitions of Krakow. He gives back to society by assuming the responsibility of Honorary Consul for Belgium in Krakow, and mainly helps travellers in need.

In his free time, he teaches at the mathematics departments of AGH University of Science and Technology and Jagiellonian University in Krakow and at the executive MBA programs of the Krakow Business School of the University of Economics in Krakow and the Warsaw University. He teaches subjects like finance, behavioural economics, decision making, Big Data, bank management, structured finance, corporate banking, financial markets, financial instruments, team-building, and leadership. What stands out is his data and analytics course: with this course he manages to provide similar content with passion for undergraduate mathematics students and experienced professionals of an MBA program. This variety of experience and teaching experience in both business and mathematics is what lays the foundations of this book: the passion to bridge the gap between theory and practice.

Acknowledgements

Writingabookthatissoeclecticandholdssomanyinformationwouldnothavebeenpossible withouttremendoussupportfromsomanypeople:mentors,family,colleagues,andex-colleagues atworkoratuniversities.ThisbookisinthefirstplaceacondensationofafewdecadesofinterestingworkinassetmanagementandbankingandmixesthingsthatIhavelearnedinC-level jobsandmoretechnicalassignments.

IthankthecolleaguesofthefacultiesofappliedmathematicsattheAGHUniversityofScience andTechnology,thefacultyofmathematicsoftheJagiellonianUniversityofKrakow,andthe colleaguesofHSBCforthemanystimulatingdiscussionsandsharedinsightsinmathematical modellingandmachinelearning.

To the MBA program of the Cracovian Business School, the University of Warsaw, and to the many leaders that marked my journey, I am indebted for the business insight, stakeholder management and commercial wit that make this book complete.

A special thanks goes to Piotr Kowalczyk, FRM and Dr. Grzegorz Goryl, PRM, for reading large chunks of this book and providing detailed suggestions. I am also grateful for the general remarks and suggestions from Dr. Jerzy Dzieza, faculty of applied mathematics at the AGH University of Science and Technology of Krakow and the fruitful discussions with Dr. Tadeusz Czernik, from the University of Economics of Katowice and also Senior Manager at HSBC, Independent Model Review, Krakow.

This book would not be what it is now without the many years of experience, the stimulating discussions with so many friends, and in particular my wife, Joanna De Brouwer, who encouraged me to move from London in order to work for HSBC in Krakow, Poland. Somehow, I feel that I should thank the city council and all the people for the wonderful and dynamic environment that attracts so many new service centres and that makes the ones that had already selected Krakow grow their successful investments. This dynamic environment has certainly been an important stimulating factor in writing this book.

However, nothing would have been possible without the devotion and support of my family: my wife Joanna and both children, Amelia and Maximilian, were wonderful and are a constant source of inspiration and support.

Finally, I would like to thank the thousands of people who contribute to free and open source software, people that spend thousands of hours to create and improve software that others can use for free. I profoundly believe that these selfless acts make this world a better and more inclusive place, because they make computers, software, and studying more accessible for the less fortunate.

A special honorary mention should go to the people that have built Linux, LaTeX, R, and the ecosystems around each of them, as well as the companies that contribute to those projects, such as Microsoft, which has embraced R, and RStudio, which enhances R and never fails to share the fruits of its efforts with the larger community.

Preface

The author has written this book based on his experience that spans roughly three decades in insurance, banking, and asset management. During his career, the author worked in IT, structured and managed highly technical investment portfolios (at some point oversaw €24 billion in a thousand investment funds), fulfilled many C-level roles (e.g. was CEO of KBC TFI SA [an asset manager in Poland], was CIO and COO for Eperon SA [a fund manager in Ireland] and sat on boards of investment funds, and was involved in big-data projects in London), and did quantitative analysis in risk departments of banks. This gave the author a unique and in-depth view of many areas ranging from analytics, big data, databases, business requirements, financial modelling, etc.

In this book, the author presents a structured overview of his knowledge and experience for anyone who works with data and invites the reader to understand the bigger picture and discover new aspects. This book also demystifies the hype around machine learning and AI by helping the reader to understand the models and program them in R without spending too much time on the theory.

This book aims to be a starting point for quants, data scientists, modellers, etc. It aims to be the book that bridges different disciplines so that a specialist in one domain can grab this book, understand how his/her discipline fits in the bigger picture, and get enough material to understand the person who is specialized in a related discipline. Therefore, it could be the ideal book that helps you to make a career move to another discipline so that in a few years you are that person who understands the whole data-chain. In short, the author wants to give you a short-cut to the knowledge that he spent 30 years accumulating.

Another important point is that this book is written by and for practitioners: people that work with data, programming and mathematics for a living in a corporate environment. So, this book would be most interesting for anyone interested in data science, machine learning, statistical learning and mathematical modelling and whoever wants to convey technical matters in a clear and concise way to non-specialists.

This also means that this book is not necessarily the best book in any of the disciplines that it spans. In every specialisation there are already good contenders.

• More formal introductions to statistics are for example in: Cyganowski, Kloeden, and Ombach (2001) and Andersen et al. (1987). There are also many books about specific stochastic processes and their applications in financial markets: see e.g. Wolfgang and Baschnagel (1999), Malliaris and Brock (1982), and Mikosch (1998). While knowledge of stochastic processes and their role in asset pricing is important, this covers only a very narrow spot of applications and theory. This book is more general, more gentle on theoretical foundations and focusses more on the use of data to answer real-life problems in an everyday business environment.

• A comprehensive introduction to statistics or econometrics can be found in Peracchi (2001) or Greene (1997). A general and comprehensive introduction in statistics is also in Neter, Wasserman, and Whitmore (1988).

• This is not simply a book about programming and/or any related techniques. If you just want to learn programming in R, then Grolemund (2014) will get you started faster. Our Part II will also get you started in programming, though it assumes a certain familiarity with programming and mainly zooms in on aspects that will be important in the rest of the book.

• This book is not a comprehensive book about financial modelling. Other books do a better job in listing all types of possible models. No book does a better job here than Bernard Marr’s publication: Marr (2016): “Key Business Analytics, the 60+ business analysis tools every manager needs to know.” That book lists all the terms that some managers might use and what they mean, without any of the mathematics or any of the programming behind them. I warmly recommend keeping that book next to ours. Whenever someone comes up with a term like “customer churn analytics” for example, you can use Bernard’s book to find out what it actually means and then turn to ours to “get your hands dirty” and actually do it.

• If you are only interested in statistical learning and modelling, you will find the following books more focused: Hastie, Tibshirani, and Friedman (2009) or also James, Witten, Hastie, and Tibshirani (2013), who also use R.

• A more in-depth introduction to AI can be found in Russell and Norvig (2016).

• Data science is more elaborately treated in Baesens (2014) and the recent book by Wickham and Grolemund (2016) that provides an excellent introduction to R and data science in general. This last book is a great add-on to this book as it focusses more on the data aspects (but less on the statistical learning part). We also focus more on the practical aspects and real data problems in a corporate environment.

A book that comes close to ours in purpose is the book that my friend professor Bart Baesens has compiled, “Analytics in a Big Data World, the Essential guide to data science and its applications”: Baesens (2014). If the mathematics, programming, and R itself scare you in this book, then Bart’s book is for you. Bart’s book covers different methods, but above all, for the reader, it is sufficient to be able to use a spreadsheet to do some basic calculations. Therefore, it will not help you to tackle big data nor to program a neural network yourself, but you will understand very well what it means and how things work.

Another book that might work well if the maths in this one is prohibitive to you is Provost and Fawcett (2013). It will give you some insight into what statistical learning is and how it works, but will not prepare you to use it on real data.

Summarizing, I suggest you buy, next to this book, also Marr (2016) and Baesens (2014). This will provide you with a complete chain from business and buzzwords (Bernard’s book) over understanding what modelling is and what practical issues one will encounter (Bart’s book) to implementing this in a corporate setting and solving the practical problems of a data scientist and modeller on sizeable data (this book).

In a nutshell, this book does it all, is gentle on theoretical foundations and aims to be a one-stop shop to show the big picture, learn all those things and actually apply them. It aims to serve as a basis when later picking up more advanced books in certain narrow areas. This book will take you on a journey of working with data in a real company, and hence, it will also discuss practical problems such as people filling in forms or extracting data from a SQL database.

It should be readable for any person that finished (or is finishing) university level education in a quantitative field such as physics, civil engineering, mathematics, econometrics, etc. It should also be readable by the senior manager with a technical background, who tries to understand what his army of quants, data scientists, and developers are up to, while having fun learning R. After reading this book you will be able to talk to all, challenge their work, and do most analyses yourself or be part of a bigger entity and specialize in one of the steps of modelling or data-manipulation.

In some way, this book can also be seen as a celebration of FOSS (Free and Open Source Software). We proudly mention that for this book no commercial software was used at all. The operating system is Linux, the window manager Fluxbox (sometimes LXDE or KDE), Kile and vi helped the editing process, Okular displayed the PDF-file, even the database servers and Hadoop/Spark are FOSS ... and of course R and LaTeX provided the icing on the cake. FOSS makes this world a more inclusive place as it makes technology more attainable in poorer places in this world.

Hence, we extend a warm thanks to all people that spend so much time contributing to free software.

About the Companion Site

This book is accompanied by a companion website:

www.wiley.com/go/DeBrouwer/TheBigR-Book

The website includes materials for students and instructors: The Student companion site will contain the R-code, and the Instructor companion site will contain PDF slides based on the book’s content.

PART I Introduction

The Big Picture with Kondratiev and Kardashev

You have certainly heard the words: “data is the new oil,” and you probably wondered “are we indeed on the verge of a new era of innovation and wealth creation or ... is this just hype and will it blow over soon enough?”

Since our ancestors left the trees about 6 million years ago, we roamed the African steppes and we evolved a more upright position and limbs better suited for walking than climbing. However, for about 4 million years physiological changes did not include a larger brain. It is only in the last million years that we gradually evolved a more potent frontal lobe capable of abstract and logical thinking.

The first good evidence of abstract thinking is the Makapansgat pebble, a jasperite cobble – roughly 260 g and 5 by 8 cm – that by geological wear and tear shows a few holes and lines that vaguely resemble (to us) a human face. About 2.5 million years ago one of our australopithecine ancestors not only realized this resemblance but also deemed it interesting enough to pick up the pebble, keep it, and finally leave it in a cave miles from the river where it was found.

This development of abstract thinking that goes beyond vague resemblance was a major milestone. As history unfolded, it became clear that this was only the first of many steps that would lead us to the era of data and knowledge that we live in today. Many more steps towards more complex and abstract thinking, gene mutations and innovation would be needed.

Soon we developed language. With language we were able to transform learning from an individual level to a collective level. Now, experiences could be passed on to the next generation or peers much more efficiently; it became possible to prepare someone for something that he or she had not yet encountered and to accumulate more knowledge with every generation.

More than ever before this abstract thinking and accumulation of collective experiences led to a “knowledge advantage” and smartness became an attractive trait in a mate. This allowed our brain to develop further, and great innovations such as the wheel, writing, bronze, agriculture, iron, and specialisation of labour soon started to transform not only our societal coherence but also the world around us.

Without those innovations, we would not be where we are now. While it is debatable whether these inventions can be classified as the fruit of scientific work, it is equally hard to deny that some kind of scientific approach was necessary. For example, realizing the patterns in the movements of the sun, we could predict seasons and weather changes to come and this allowed us to put the grains in the ground at the right moment. This was based on observations and experience.

The Big R-Book: From Data Science to Learning Machines and Big Data, First Edition. Philippe J.S. De Brouwer. © 2021 John Wiley & Sons, Inc. Published 2021 by John Wiley & Sons, Inc. Companion Website: www.wiley.com/go/DeBrouwer/TheBigR-Book

Science and progress flourished, but the fall of the Western Roman Empire made Europe sink into the dark medieval period where thinking was dominated by religious fear and superstition, and hence scientific progress came to a grinding halt, and in its wake improvements in medical care, food production and technology.

The Arab world continued the legacy of Aristotle (384–322 BCE, Greece) and Alhazen (Ibn al-Haytham, 965–1039, Iraq), who by many is considered as the father of the modern scientific method.1 It was this modern scientific method that became a catalyst for scientific and technological development.

A class of people that accumulated wealth through smart choices emerged. This was made possible by private enterprise and an efficient way of sharing risks and investments. In 1602, the East Indies Company became the first common stock company and in 1601 the Amsterdam Stock Exchange created a platform where innovative, exploratory and trade ideas could find the necessary capital to flourish.

In 1775, James Watt’s improvement of the steam engine made it possible to leverage the progress made around the joint stock company and the stock exchange. This combination powered the rise of a new societal organization, capitalism, and fueled the first industrial wave based on automation (mainly in the textile industry).

While this first industrial wave brought much misery and social injustice, as a species we were preparing for the next stage. It created wealth as never before on a scale never seen before. From England, industrialization spread fast over Europe and the young state in North America. It all ended in “the Panic of 1873,” that brought the “Long Depression” to Europe and the United States of America. This depression was so deep that it would indirectly give rise to the invention of a new economic order: communism.

The same steam engine, however, had another trick up its sleeve: after industrialisation it was able to provide mass transport by the railway. This fuelled a new wave of wealth creation and lasted till the 1900s, where that wave of wealth creation ended in the “Panic of 1901” and the “Panic of 1907” – the first stock market crashes to start in the United States of America. The internal combustion engine, electricity and magnetism became the cornerstones of a new wave of exponential growth based on innovation. The “Wall Street Crash of 1929” ended this wave and started the “Great Depression.”

It was about 1935 when Kondratiev noticed these long term waves of exponential growth and devastating market crashes and published his findings in Kondratieff and Stolper (1935) – republished in Kondratieff (1979). The work became prophetic as the automobile industry and chemistry fuelled a new wave of development that gave us individual mobility and lasted till the 1973–1974 stock market crash.

The scenario repeated itself like clockwork when it was the turn of the electronic computer and information technology (IT) to fuel exponential growth till the crashes of 2002–2008.

Now, momentum is gathering pace with a few strong contenders to pull a new wave of economic development and wealth creation, a new phase of exponential growth. These contenders include in our opinion:

• quantum computing (if we manage to get it to work, that is),

• nanotechnology and development in medical treatments,

• machine learning, artificial intelligence and big data.

1 With “scientific method” we refer to the empirical method of acquiring knowledge based on observations, scepticism and scrutiny from peers, and reproducibility of results. The idea is to formulate a hypothesis by logical induction from observations, and then to allow peers to review it and to publish the results, so that others can falsify or confirm them.
